Learning about failure at AWS re:Invent

Daniel Smythe 1st December 2017

This week I have had the pleasure of attending AWS re:Invent in crazy Las Vegas. As it’s been my first time at re:Invent, and in Vegas for that matter, there’s been a lot to take in. However, one common theme I noticed across the various talks is how openly and frequently failure is discussed.

Failure is good – fail fast, fail often!

Now I know what you may be thinking…

“How can one of the biggest tech events of the year be about failure?”

“Is he still drunk when writing this?”

Let me explain. As I have previously stated in articles about agility and technology, innovation is all about failing and learning from that failure. This is what all successful businesses have done, and I truly believe it is what re:Invent this year is all about. I’ll try to illustrate my point through a few examples below.

Example 1: Snapchat and its user stories

Snapchat ran a session on how they moved their inbox stories to DynamoDB. They had initially moved to Dynamo for 3 reasons: cost, performance and improved latency. The failure came after their test, when they achieved NONE of these improvements. They didn’t see this as a problem, though, and it actually helped them in the end.

They investigated and realised they needed to refactor their application to get those benefits, and when they did, they achieved large cost savings – happy days! But the key thing this failure gave them was that, through the power of Dynamo, they got access to more data, investigated their application even further, and found inefficient legacy behaviour in their code, specifically around the retirement and deletion of stories, which has since been refactored.

Innovation came off the back of this failure, as Snapchat was able to move away from their old database engine and use Dynamo to scale to meet their peak traffic over the New Year’s Eve period, where people tend to go crazy snapping their friends and family.

It’s amazing what failure can achieve.

Watch this talk in full here.

Example 2: AWS ASAP Governance framework

Amazon led a breakout session on how Amazon.com itself uses AWS Management Tools. It looked at how Amazon.com went from one shared account to 10,000 AWS accounts, which in itself was a story of repeated failure, and of learning and evolving from that failure.

For Amazon.com, this process led to the creation of ASAP, their in-house governance application, which helped them solve some of the challenges of managing 10,000 accounts across their business. I will definitely refer to this talk in the future, as their “guardrails over gates” approach was a breath of fresh air. If enterprises truly want to innovate, they need to support their teams and allow them to fail – putting up guardrails to keep them from making serious mistakes, instead of gates that stop them and ultimately slow down innovation. If Amazon.com can do it at this scale, why can’t other big enterprises?

Example 3: Amazon Prime Day

There was a statement in the session ‘How DynamoDB powered Amazon Prime Day 2017’ which really resonated with me and encapsulates the theme of this blog. The statement was,

“fixing to one compute resource is designing for failure”

The speaker gave a few examples, but basically the tl;dr is that no matter how big they built their relational database and server, Amazon.com would break it. It’s just not possible to scale a single machine to meet demand when you have 600 million different items for sale with 7000 attributes each, and database traffic peaking at 12.9 million requests per second on the day. You have to design in scalability, and fail, to be successful.
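To make that quote a bit more concrete, here is a minimal Python sketch of the general idea of spreading keys across several partitions instead of one box. This is purely my own illustration, not how DynamoDB or Amazon.com actually do it: the function names, the item IDs and the workload size are all made up for the example.

```python
import hashlib
from collections import Counter


def partition_for(key: str, partitions: int) -> int:
    """Map a key to a partition with a stable hash (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def max_partition_load(keys, partitions: int) -> int:
    """Return how many requests land on the busiest partition."""
    load = Counter(partition_for(k, partitions) for k in keys)
    return max(load.values())


# Hypothetical workload: one request per made-up item ID.
requests = [f"item-{i}" for i in range(100_000)]

# With one partition, a single resource takes every request;
# adding partitions shrinks the load on the busiest one.
for n in (1, 4, 16):
    print(n, max_partition_load(requests, n))
```

With a single partition the busiest (and only) resource has to absorb all 100,000 requests; as partitions are added, the worst-case load per resource drops, which is the whole point of not fixing yourself to one compute resource.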

Watch this talk in full here.

These are only 3 quick examples, but they highlight how big tech companies such as Amazon.com and Snapchat have learnt from failure and are in fact encouraging other companies to go through this process.

Fail often, learn from it, innovate, and go make it happen.

Have a great day, people, and fail well.

Thanks

Daniel