Recently, ISO compliance teams at many ISO-certified organizations that use Amazon Web Services and are located in eastern North America had reason to be on high alert.
On February 28, 2017, a four-hour outage hit one of Amazon Web Services’ (AWS) largest cloud regions, US-EAST-1 in North America. Because many enterprises rely on AWS, this outage, which lasted many times longer than the expected annual downtime for the S3 cloud storage system where the issue occurred, is highly concerning.
The outage, caused by high error rates affecting the Amazon Simple Storage Service (Amazon S3), began at 12:35 pm ET, and service was fully restored by 4:49 pm ET, according to AWS. AWS describes Amazon S3 as ‘object storage with a simple web service interface to store and retrieve any amount of data from anywhere on the web’.
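This service is marketed as being ‘designed to deliver 99.999999999% durability’. That figure describes protection against data loss rather than availability, but an outage of this length still raises the question of how much weight such reliability claims should carry.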
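To put ‘many times longer than the expected annual downtime’ into numbers, the short sketch below compares the outage window reported above with the downtime a given availability target would allow in a year. The 99.99 percent availability figure is an assumption used for illustration only and is not taken from this article; substitute whatever target applies to your own service agreement.

```python
from datetime import datetime

# Outage window as reported in the article (both times Eastern).
start = datetime(2017, 2, 28, 12, 35)
end = datetime(2017, 2, 28, 16, 49)
outage_minutes = (end - start).total_seconds() / 60

# ASSUMPTION: a 99.99% availability target. This figure is not stated
# in the article and is used here only for illustration.
assumed_availability = 0.9999


def expected_annual_downtime_minutes(availability, days_per_year=365):
    """Minutes per year a service can be down at the given availability."""
    return (1 - availability) * days_per_year * 24 * 60


budget = expected_annual_downtime_minutes(assumed_availability)
ratio = outage_minutes / budget

print(f"Outage length: {outage_minutes:.0f} minutes")
print(f"Annual downtime allowed at {assumed_availability:.2%}: {budget:.1f} minutes")
print(f"Outage as a multiple of the annual allowance: {ratio:.1f}x")
```

Under that assumed target, the 254-minute outage works out to roughly 4.8 times a full year's downtime allowance of about 52.6 minutes. A different target changes the multiple but not the method.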
Business Continuity Implications and Actions
One concern about using this AWS service is how successfully it meets ISO compliance guidelines.
Another is the level of impact, and whether such outages will prompt business continuity professionals to review their business continuity plans.
To address these questions and capture the lessons that can be learned from such an outage, Continuity Central surveyed its readers.
The survey asked whether the respondent’s organization used any AWS services. The responses show just how widespread AWS’s reach is: 35.5 percent of respondents’ organizations use AWS and 11 percent don’t yet but plan to do so in the future.
Twenty percent of respondents said that their organization had been affected by the February 28th outage. These respondents were invited to briefly describe the impacts; some of the substantive responses received are as follows:
- We had about 25 separate user-facing systems that were unavailable, affecting users in the 100s of thousands. When the outage first occurred, we first tried to switch to an alternate region for those systems that are in multiple regions (not all of them are), but could not because we could not use the AWS load balancing service, which was also impacted by the outage. In the end we just had to wait for Amazon to resolve the problem and test it. The only effective business continuity actions we were able to take were around communications.
- We were minimally impacted and the outage window was well within our tolerable allowances therefore only situation status/monitoring and communication was undertaken.
- Not directly but indirectly. For example, in the course of business we noticed links in others’ websites were not working, creating a delay in our own business process.
- We could not upload attachments to Hubspot. We could not login to GoToMeeting.
- This affected a vendor that uses Amazon Web Services. During the outage we were unable to monitor a large number of devices in the field that we are responsible for. Fortunately, there was no adverse impact beyond that.
Business continuity plan reviews
The survey asked ‘Will your organization be reviewing its business continuity plans in the light of the AWS outage?’ Interestingly, 40 percent of respondents said that their organization would be conducting a review of business continuity plans following the outage. 36 percent said that no review would take place; and 24 percent replied that they did not know whether a review would take place.
The final question in the survey asked respondents ‘What lessons do you think the business continuity profession can learn from the AWS incident?’ 62 percent of respondents took the time to answer the question, and some of the substantive responses are listed below (published verbatim except for spelling corrections):
- If you buy something that is supposed to be almost 100% reliable it needs to be tested in the buyer’s environment to make sure that is true – clearly not in this case.
- Cloud is not as resilient as it’s cracked up to be. It is fallible just like other solutions.
- This stresses the importance of planning properly. Strategic objectives changes overtime. Therefore, you always have to ask questions with regards to the relevance of your business continuity plans. Are they aligned with organizational objectives? The fact this downtime took this long to sort out, it means there was a “belief” that this will never happen.
- Another reminder that business continuity planning must be an ongoing process…identify and mitigate new or previously unidentified risks (internal and external) and test plans to ensure that we are prepared to respond and recover.
- Obviously even cloud technology has its flaws. Assure there are strong SLA penalties in contracts with cloud vendors and well-documented manual processes to continue to ‘process’ work until systems come back online.
- Assumptions about reliance on the uptime of third and fourth and fifth party providers need to be assessed.
- I’ve spent most of a four-decade career working in high-demand, high-availability computing environments where ‘failure is not an option’ were words to live by. The hard-learned reality is that ‘failure is *always* an option,’ regardless of the time/money/energy invested in building so-called bullet-proof systems. There is not, has never been, and likely will never be any such thing. The lesson for business continuity planners? Simple: failures will happen. Prepare for that, and don’t be deflected from the task by those who wear rose-colored glasses. At the end of the day, we must be the ‘dinosaurs’ who understand that man-made systems will suffer from man-made flaws.
- Claims about reliability numbers are void. The downtime clearly proves that AWS’s claim was false. The whole BC architecture has to be analyzed instead of believing marketing numbers about availability.
If this topic is high on your BC team’s checklist of things to do, please pass this information along to the team members in your organization. And, as always, if you have thoughts, concerns or comments to add to this posting, please do so at your earliest convenience.
By: Ben J. Carnevale, Managing Editor