What the AWS Outage Teaches Us About the Shared Responsibility Model
It was essentially a fat finger on the command line. Everyone in IT has done it at some point in their career. I once deleted the contents of an entire disk with an accidental recursive flag, which coincidentally happens to be when I moved from engineering to sales. And whilst automation and scripting have removed many of these human errors, they do still occur. But that is not what surprises me about the AWS S3 outage. What has caught my attention is the lack of understanding of public cloud services.
I’ve seen hundreds of comments like "This is what happens when you give the keys to the kingdom to one company", and even read an article that pointed out that whilst other retailers experienced outages, Amazon.com did not, subtly implying that AWS may be willing to manipulate their competitors’ infrastructure (patently untrue). Most articles are looking to blame the big bad public cloud providers and the risk they pose to the internet. I’ve not read any articles about how, without AWS (and the other public cloud platforms), most of the nearly-free apps that we use daily would not exist, or would exist only as enterprise apps that companies over-pay for. I’ve not read analytical comparisons between typical enterprise data center failures and AWS failures that quantify the perceived risk. And I’ve not read many articles pointing out that plenty of companies that understand AWS were not impacted in the least.
Why were some companies impacted and others not? It’s down to something Cloudreach calls "Intelligent Cloud Adoption". Moving to the public cloud does not absolve you of all responsibilities. Application management, security, disaster recovery, governance, financial optimisation: these are all things that must still be considered, implemented and managed. Sure, AWS is designed to prevent most failures, and you can watch James Hamilton provide details in this video. But if you are running business-critical systems in the public cloud, you must ensure that those systems will survive failure. Netflix does it (and here’s how), and so do most of our enterprise customers. Not for every application, but for the ones that your business depends on.
Every cloud services provider has a shared responsibility model, and AWS has even published an operational checklist that highlights how to build a resilient and highly available architecture. If you have moved workloads to the cloud (and by now who hasn’t?) then you have a duty to ensure that your infrastructure supports your business requirements. And if you are new to the public cloud, or struggling with the very real skills gap between on-prem operations and cloud operations, there is an entire ecosystem of AWS and Azure partners to help, including Cloudreach, which was recently ranked as a leader in the Gartner MQ for Cloud Hyperscale MSPs. We are ready when you are.