Chaos Day in the Met Office Cloud
At Cloudreach we are hugely passionate about building systems in the cloud. Our engineers spend their time designing and implementing some frankly pretty cool systems, and we also help teach and empower Cloud Operations teams with the knowledge and experience that we’ve built up from many years at the forefront of tech. Usually we’re all about creating great things really quickly, whether at humungous scale for the enterprise environment, or strongly compliant systems to handle tough regulatory needs.
Recently the Met Office – who we’ve worked alongside for a number of years on their cloud journey – asked us to do something a little different from usual. They asked us to help destroy some critical parts of their application infrastructure (albeit very carefully), to see how well their CloudOps teams had learnt to investigate and fix problems, and what they still had to improve. The ‘Cloud Chaos Day’ was born.
Everyone I spoke to about this, both within the Met Office and at Cloudreach, loved the idea. I was swamped with ideas from colleagues, some of them incredibly sneaky, with cunning ways to break things and to create hard-to-spot problems. Since it generated such interesting discussions and energy, I figured it was a great subject to blog about, sharing our experiences and tips for doing one yourself. So read on as this post discusses the Hows, the Whys, and What we learned from a day of Chaos in the Cloud.
“The adoption of public cloud technology for operational service delivery is a big step for any organisation, introducing new practices that in some cases are quite different from those used for traditional on-premise delivery. The ‘cloud chaos day’ has enabled us to test our operating procedures in a safe environment, giving the Met Office the confidence that we are suitably prepared for the launch of our new services”
Richard Bevan, Head of Operational Technology at Met Office
First up, Why?
The Chaos Day was born out of one question:
“How can we see how ready our Cloud Operations team is to handle our application in production, before it’s being used by customers?”
And the solution we came up with was… to break it. The idea was that, based on what we learned from doing this, we could identify the gaps in knowledge, training or documentation that the team would need to fill in order to resolve problems that might happen in future.
You might have heard of Netflix’s Chaos Monkey. In case you haven’t, here’s the short explanation. Netflix built an automated tool which randomly breaks small and large parts of their cloud environments, so that they can ensure systems are fully redundant and designed to handle outages and other problems. Because the team doesn’t know what caused something to stop working, the tool presents truly unexpected challenges to identify and resolve. It’s a great tool, but for new Cloud Operations teams who are still getting used to the cloud, that can be too much to handle right away.
So as a simpler starting point, we chose to manually select areas of applications to break, and to have the Cloud Operations team be informed of the problems and debug them in conjunction with the application development teams.
The overall goals were to identify problem areas and to develop a plan for what areas to improve. Whether the problem areas were in cloud systems knowledge, product experience, product documentation or even correct access, we wanted to know what was hard, what was easy, and so on.
How we chose to approach it
As I mentioned above, we decided to start out with a simple approach and break items manually. In the longer term we’d love to automate this, but as we are all about Agile Development at Cloudreach, we wanted to take an iterative approach: start off simple and improve over time. First, we spoke to many different stakeholders, including technical and application colleagues from the Met Office as well as Cloudreach Engineers and Architects, to determine what we could break and to build the list of potential items.
As time goes by, this will grow into a large list of things we can break manually, then develop into manual triggers of automated problems, then random selection of problems, and so on until the process is fully automated!
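As a rough illustration of that progression, the sketch below registers a couple of breakages as plain Python functions, lets a person trigger one by name, and then lets the code pick one at random. The breakage names and their bodies are hypothetical placeholders rather than what we used on the day; real ones would call your cloud provider’s API instead of returning a description.

```python
import random
from typing import Callable, Dict, Optional

# Hypothetical catalogue: name -> function that applies the breakage.
# The bodies only return a description; real ones would make API calls.
BREAKAGES: Dict[str, Callable[[], str]] = {}


def breakage(name: str):
    """Register a function in the catalogue under the given name."""
    def register(func: Callable[[], str]) -> Callable[[], str]:
        BREAKAGES[name] = func
        return func
    return register


@breakage("close-web-port")
def close_web_port() -> str:
    return "removed the HTTPS rule from the web security group"


@breakage("stop-worker")
def stop_worker() -> str:
    return "stopped one worker instance"


def trigger(name: str) -> str:
    """Manual trigger: a person chooses the breakage, the code applies it."""
    return BREAKAGES[name]()


def trigger_random(rng: Optional[random.Random] = None) -> str:
    """Random selection: the code chooses the breakage as well."""
    if rng is None:
        rng = random.Random()
    name = rng.choice(sorted(BREAKAGES))
    return f"{name}: {trigger(name)}"
```

Passing in a seeded `random.Random` makes a run repeatable, which is handy when you want to rehearse the same sequence before the day.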
Now that we had defined our approach, we worked closely with the Met Office’s application teams to identify which parts of our target applications could be broken without production impact. If you are doing this yourself, a tool you should strongly consider is AWS CloudFormation, which allows you to quickly and easily deploy a complete set of ‘stacks’ that make up your application infrastructure.
Cloudreach has also developed and released the Sceptre tool for managing CloudFormation environments, which makes this even easier! Using such tools and a bit of scripting, we can create fully working copies of applications in different AWS accounts, so that testing can be done safe in the knowledge that we won’t adversely affect production services.
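To give a flavour of the “bit of scripting” involved, here is a minimal sketch that creates a copy of a stack from an existing template. It is not the script we used and it does not use Sceptre. It assumes you pass in a boto3 CloudFormation client that was created with credentials for the test account, and the `-chaos` suffix and function names are made up for illustration.

```python
from typing import Any, Dict


def build_stack_request(stack_name: str, template_body: str,
                        parameters: Dict[str, str]) -> Dict[str, Any]:
    """Build keyword arguments for CloudFormation's create_stack call."""
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in sorted(parameters.items())
        ],
    }


def deploy_copy(cfn_client: Any, app_name: str, template_body: str,
                parameters: Dict[str, str]) -> str:
    """Create a separate '<app>-chaos' stack from the same template.

    cfn_client is assumed to be a boto3 CloudFormation client for the
    test account, never the production one.
    """
    request = build_stack_request(f"{app_name}-chaos", template_body, parameters)
    response = cfn_client.create_stack(**request)
    return response["StackId"]
```

In real use the client might come from something like `boto3.Session(profile_name="chaos-test").client("cloudformation")`, where the profile name is again hypothetical. Keeping the client as an argument makes it harder to point the script at production by accident.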
Armed with our list of possible items, we then made a plan for the day, and estimated what we’d break, how long it might take to fix, what the user report would look like, and more. I did some last minute research into each item that we were proposing to break in order to make sure I could get in and do the proposed break, and then we were ready to go.
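If you prefer to keep that plan somewhere more structured than a notebook, one possible shape is sketched below. The field names are my own guess at what such a record might hold (what to break, the user report, the estimated fix time, a rollback and whether it has been rehearsed); they are not a format we actually used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlannedBreakage:
    name: str                 # what we intend to break
    user_report: str          # the symptoms as a user would describe them
    estimated_fix_minutes: int
    rollback: str             # how to undo it if something goes wrong
    rehearsed: bool = False   # has the facilitator tried it beforehand?


def ready_to_run(plan: List[PlannedBreakage]) -> List[PlannedBreakage]:
    """Keep only the breakages that have been rehearsed and have a rollback."""
    return [b for b in plan if b.rehearsed and b.rollback.strip()]


def total_minutes(plan: List[PlannedBreakage]) -> int:
    """Sum the estimated fix times, to compare against the time available."""
    return sum(b.estimated_fix_minutes for b in plan)
```

Comparing `total_minutes(ready_to_run(plan))` with the length of the session gives a quick check on whether you have enough rehearsed items to fill the day.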
“This marked a milestone for the Met Office Cloud Operations team, a practical validation of the development journey undertaken so far. The Cloud Operations team consists of senior engineers from a number of traditional on premises operational support teams. Joining in partnership to support agile development teams to exploit the best of public cloud, the hard work and open relationships between Software Development, Cloud Operations & CloudReach was demonstrated practically in the Chaos day.”
Jon Sams, Cloud Operations Lead at Met Office
How Chaos went on the day
On the day itself we sat in a meeting room together and I presented each breakage as a set of symptoms, one at a time. As the team worked through each problem, we noted down the things we’d discovered and learned about each system or cloud service, any documentation or knowledge we didn’t have, and so on.
As the first few problems were carefully unravelled, identified and fixed by the team, it was clear that we were learning a lot and everyone was enjoying the experience! Our networking specialist was attacking problems from one direction, our support specialist the other, and they were managing to meet nicely in the middle and eventually track down the root cause. Thankfully my fears of making things too easy turned out to be incorrect, and the balance worked out just right.
By the time things wrapped up in the afternoon, we’d come out with some pretty good outputs: a giant bundle of notes of lessons learned, lots of empty snack packets, and a Cloud Operations team pleased with their problem-solving skills!
Tips for running your own Chaos Day
- Know your team. If your team is mainly networking specialists, it’s going to be easy for them to find networking problems. If you’ve got a mix, do a range of things so everyone gets a chance to share their knowledge.
- Be careful! In the cloud you can create copies of environments to test these things with. So spin one up. Don’t risk your production data if you don’t need to.
- Make backups that you can quickly restore from. You might break something you didn’t intend to. Make sure you have a rollback and restore plan for every ‘breakage’ you make, so that you can fix any unintended consequences quickly!
- Start simple. In real breakages or accidental changes, simple stuff happens as well. As you see how the team responds, you can increase the difficulty, break multiple things at once, etc.
- Don’t be tempted to be too clever too early. Remember, the goal is to find areas for improvement, not to defeat your Cloud Operations team!
- Timebox the breakages – typically a limit of about 30-45 minutes per breakage will help keep people engaged without losing focus.
- Audit tools such as AWS CloudTrail can be your undoing with a clever team looking for changes. You can avoid this somewhat by using different users, or by having something such as a Lambda function or a cron job on an instance trigger the changes. Ultimately, however, you’ll probably have to restrict your teams from jumping straight to CloudTrail, or it will get pretty boring fast!
- Try to present your problems to the Cloud Operations team as users would – an email with screenshots, error messages etc.
- Test your breakages out beforehand – no need to keep people waiting for you to break things.
- Plan to break more things than you do – just in case some don’t work or are solved really easily!
- Bring snacks! It helps keep things relaxed and energy levels up. Our preference is for doughnuts.
- Also, take regular breaks! After every 2-3 problems is ideal.
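To make a couple of those tips concrete, here is a minimal sketch of a breakage that records its own rollback and can be triggered from a Lambda function rather than by a named user. Removing a security group rule is just a hypothetical example, not necessarily something we broke on the day, and the event keys (`group_id`, `permission`) are invented for illustration. The helpers take the EC2 client as an argument, which is assumed to be a boto3 EC2 client for the test account.

```python
from typing import Any, Dict


def apply_breakage(ec2_client: Any, group_id: str,
                   permission: Dict[str, Any]) -> Dict[str, Any]:
    """Remove one ingress rule and return what is needed to restore it."""
    ec2_client.revoke_security_group_ingress(
        GroupId=group_id, IpPermissions=[permission]
    )
    return {"GroupId": group_id, "IpPermissions": [permission]}


def roll_back(ec2_client: Any, undo: Dict[str, Any]) -> None:
    """Put the removed ingress rule back."""
    ec2_client.authorize_security_group_ingress(**undo)


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Entry point if this is deployed as an AWS Lambda function.

    The event keys used here ("group_id", "permission") are hypothetical.
    """
    import boto3  # imported lazily so the helpers can be tested without it

    return apply_breakage(
        boto3.client("ec2"), event["group_id"], event["permission"]
    )
```

Keep the dictionary returned by `apply_breakage` somewhere safe: passing it to `roll_back` is your restore plan for that breakage. A permission here would be a dictionary such as `{"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443, "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}`.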
I hope that’s helped to inspire you a little. Enjoy the chaos!
For more posts by James Wells, click here