A practical example of why we should dispose of development environments when not in use and what can happen when an environment is left active over the weekend.
Development stack left running over the weekend… uh oh!
This story begins innocently with some personal development and a simple Fargate stack. Nothing too taxing, right? A basic VPC network and a few accessories to power my project.
There were a few extra data sources plugged in, but nothing anyone would expect to cost much (think empty S3 buckets, a tiny RDS instance and ElastiCache).
Now, the details of this project are currently top secret, but needless to say there were quite a few containers. The running model included 16 containers, which appeared healthy and working properly. The stack was left running over the weekend.
Then the lab spend exploded
Now, everyone enjoys their weekend, and no one wants an email from HQ about their lab’s spend and its inevitable termination, the kind that causes panic, confusion and that “What exactly is going on here?” feeling.
Once back at work, I dug into what had made the spend rise so rapidly, expecting to see something obvious like an instance misconfigured to be more powerful than needed, or some overused storage… nope! AWS Config! (Closely followed by NAT Gateway for those who were wondering, but that is explainable given the stack’s purpose.)
What happened? Database goes offline, health checks fail…
With the AWS account rightly terminated to stop the spend running away, it was difficult to find out what had caused this. From the billing data it was evident that AWS Config had caused the account’s overspend and the inevitable shutdown.
As a best practice, and to ensure some good guardrails inside the Cloudreach Labs, the AWS Config service is used to loosely protect the accounts (we trust our developers). At $0.003 per evaluation, this normally has little cost implication for any environment. Despite this story, I still highly recommend the service – the actual root cause of the issue was administration, or a lack thereof.
Looking at the last of the available logs, I noticed that a database had been taken offline by another cost-saving mechanism, which coincided with the time the heavy spending pattern began.
As the database went offline, the aggressive health checks that were set began failing and containers started to rebuild, thinking they were all unhealthy, each one needing validation against the AWS Config service as it transitioned. That still doesn’t sound like much, but let’s break out our handy Python CLI for some calculations, assuming a container takes ~5 mins to fail its health check and be replaced, with ~32 (mostly more..) of them across 16 services:
32 × 12 = ~384 containers per hour
384 × 24 = ~9216 containers per day
9216 × 0.003 = ~$27.65 per day
… which is quite a lot of money to spend on not developing your personal pet project!
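For anyone who wants to rerun the sums with their own numbers, here is a minimal Python sketch of the same back-of-the-envelope estimate. It uses the figures from the story and assumes, as the calculation above does, that each container replacement triggers exactly one billable evaluation. The function and variable names are mine, purely for illustration.

```python
# Estimates from the story, not billing data.
CONTAINERS = 32               # ~32 containers cycling across 16 services
MINUTES_PER_REPLACEMENT = 5   # ~5 minutes to fail a health check and be replaced
PRICE_PER_EVALUATION = 0.003  # USD, the per-evaluation price quoted earlier


def daily_config_cost(containers, minutes_per_replacement, price):
    """Return (replacements per hour, replacements per day, cost per day in USD).

    Assumes one billable evaluation per container replacement.
    """
    replacements_per_hour = containers * (60 / minutes_per_replacement)
    replacements_per_day = replacements_per_hour * 24
    return replacements_per_hour, replacements_per_day, replacements_per_day * price


per_hour, per_day, cost = daily_config_cost(
    CONTAINERS, MINUTES_PER_REPLACEMENT, PRICE_PER_EVALUATION
)
print(f"{per_hour:.0f} per hour, {per_day:.0f} per day, ${cost:.2f} per day")
```

Shortening the replacement interval or raising the container count scales the daily figure linearly, which is why "mostly more" than 32 containers matters.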
Remediation – destroy the environment
Unless a development environment is going to be actively used and – most importantly – monitored to ensure pet projects do not fall over during the weekend, I would highly recommend destroying it for the weekend. (This does not include official product development environments that may be running tests.)
There are plenty of tools that can capture a cloud environment as code (Terraform and CloudFormation, to name a couple) so that it can be recreated later, so there is really no excuse for leaving personal development environments active over the weekend.
Another benefit is a reduced attack surface: many past attacks were made possible by pivoting through the generally less protected development environment – which is another reason not to be in any hurry to turn off AWS Config.
Don’t leave your development environments unattended
Remember: keep AWS Config, but shut down your personal development environments with the pet projects and save those pennies to play with far more fun things! How about a hard flex by saving up for Ground Station? Please contact the Cloudreach team for more information.
Paul Hardy is a Principal Systems Developer at Cloudreach with a passion for Offensive Security. Having worked with Cloudreach for almost a decade he has