Monitoring events in your AWS Environment
Today on the Cloudreach blog I’d like to talk about a topic that is in massive demand at many of our enterprise-level clients – monitoring events in your AWS environment! Thanks to products such as AWS Organizations, it’s now easy to have hundreds of AWS accounts, and each of them logs every action using CloudTrail. There can be millions, perhaps even billions, of actions going on in your cloud estate every month! Some of these you need to keep an eye on, some you need to react to and investigate, and others you can’t be sure about either way. So if you have to get control of a large cloudy environment, monitoring these actions is a key step!
Recently I was tasked with solving this challenge for a large public sector enterprise. Rather than spending a lot of effort implementing the ‘old school’ heavyweight options, either an ELK stack or one of the large commercial solutions out there, I wanted to see if this could be done in a more lightweight and serverless way. It turns out that it could, and I decided to share my journey, which ended with a CloudFormation template that deploys all the required resources from scratch! So, here’s some guidance on doing this yourself.
Before we begin
Before you get started you’ll need to make sure that centralised logging is in place. AWS have some guidance on their Answers page, but basically all you need is an S3 bucket that receives logs from CloudTrail. This can be set up either with CloudFormation, using the new StackSets functionality to deploy to multiple accounts, or by configuring it manually in each account.
A great feature of S3, aside from its low cost, is that S3 Events can be used to react to PUTs – that is, new objects being stored in the bucket – and we can then trigger an AWS Lambda function from this event. For more complex use cases, you could also publish the event to an SNS topic, or push it into an SQS queue for processing.
There are a lot of potential options for deploying and managing serverless code in the cloud, such as Zappa, Serverless and many others. In this case I decided to stick with the time-honoured method of a simple zipped function uploaded to an S3 bucket. For larger projects I’d absolutely recommend a framework, but for this one I was really focused on seeing if I could keep everything (mostly) inside a single CloudFormation template for ease of maintenance and updates.
Now that we have a Lambda function being triggered by our log file delivery, we can respond to the event being sent to us. There’s a good reference sample event that you can refer to in the AWS documentation, though I’d also recommend looking at the real events that are passed into your Lambda code, just to make sure you capture the events you are looking for.
As well as the Lambda function, I also created an SNS topic, the necessary IAM policies and roles, and two DynamoDB tables. The tables are used to look up which events are on our list to respond to, and to translate AWS account ID numbers into more readable account aliases. The SNS topic is for notifying operations teams, and later on could also be used to raise tickets directly in JIRA or ServiceNow for investigation of suspicious actions.
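For illustration only, here is a rough Boto3 sketch of wiring a bucket’s PUT events to a Lambda function. It is not the CloudFormation used in this project; the function names, the `.json.gz` suffix filter and the ARN in the usage note are assumptions made for the example.

```python
def lambda_notification(function_arn, suffix=".json.gz"):
    """Build a notification configuration that invokes a Lambda for new objects."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                # Fire on every kind of object creation (PUT, POST, multipart upload)
                "Events": ["s3:ObjectCreated:*"],
                # Only react to keys ending in the given suffix (assumed here)
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": suffix}]}
                },
            }
        ]
    }


def attach_notification(s3_client, bucket, function_arn):
    """Apply the configuration; s3_client is expected to be boto3.client("s3")."""
    # Note: this call replaces any notification configuration already on the bucket.
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=lambda_notification(function_arn),
    )
```

In practice you would call something like `attach_notification(boto3.client("s3"), "example-central-logs", function_arn)`, and the Lambda function also needs a resource-based permission that allows S3 to invoke it.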
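As a minimal sketch of that first step, the snippet below pulls the bucket name and object key out of an S3 notification event. The helper name and the sample bucket and key are made up for illustration; the field layout follows the S3 sample event, and keys are URL-decoded because S3 encodes them in the notification.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs for every S3 record in a notification event."""
    objects = []
    for record in event.get("Records", []):
        s3_info = record.get("s3")
        if s3_info is None:
            continue  # not an S3 record, nothing to fetch
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded, so decode before calling S3 with them
        key = unquote_plus(s3_info["object"]["key"])
        objects.append((bucket, key))
    return objects


# Hypothetical event, trimmed to the fields used above
sample_event = {
    "Records": [
        {
            "eventSource": "aws:s3",
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-central-logs"},
                "object": {"key": "AWSLogs/111111111111/CloudTrail/example.json.gz"},
            },
        }
    ]
}

for bucket, key in extract_s3_objects(sample_event):
    print(bucket, key)
```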
Content for the DynamoDB tables is very easy to load using BatchWriteItem and JSON files of the data you’d like to import. Either the AWS CLI or any of the SDKs can be used to do this import.
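A rough sketch of such an import with Boto3 might look like the following. The helper names and the table name in the usage note are hypothetical, and the items are assumed to already be in DynamoDB’s typed JSON form (for example `{"EventName": {"S": "DeleteTrail"}}`). BatchWriteItem accepts at most 25 items per call, so the list is sent in chunks.

```python
def to_put_requests(items):
    """Wrap DynamoDB-typed items in the PutRequest envelope BatchWriteItem expects."""
    return [{"PutRequest": {"Item": item}} for item in items]


def chunked(seq, size=25):
    """Yield slices of at most `size` entries (the BatchWriteItem limit is 25)."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def load_items(dynamodb_client, table_name, items):
    """Write items in batches; dynamodb_client is expected to be boto3.client("dynamodb")."""
    for batch in chunked(to_put_requests(items)):
        pending = {table_name: batch}
        while pending:
            response = dynamodb_client.batch_write_item(RequestItems=pending)
            # Anything DynamoDB could not process is handed back to be retried.
            # A real loader would add a backoff delay here.
            pending = response.get("UnprocessedItems") or {}
```

Usage would be along the lines of `load_items(boto3.client("dynamodb"), "WatchedEvents", json.load(open("events.json")))`, where the table and file names are placeholders.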
Here’s a diagram of what all this looks like
Hooking it all together
Now that all of our components are in place, it’s time to hook them all together, which is done in our Lambda function. My own choice for the Lambda function was Python and the Boto3 library, which makes it very easy to work with AWS services programmatically, though there is a wide range of SDKs available if you’d rather work in Node.js, Java, or C#.
The Lambda function has to now do a few things, such as:
- Download the gzipped log file from S3
- Open and load the JSON of the CloudTrail event into memory for parsing. There’s a useful reference guide by AWS to the JSON structure of the event.
- Look up each event in the DynamoDB table, and determine if it’s something we need to react on
- Build a message containing info such as the action, the user, role or service that triggered it, when it happened and in which account
- Publish the message to the SNS topic, so that it can be delivered to the operations team or the ticket system for action.
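The steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the function from the project: all function names are invented, the two DynamoDB lookups are stood in for by a plain set and dict that could be loaded from the tables, and `publish` is any callable, for example a small wrapper around the SNS `publish` call.

```python
import gzip
import json


def parse_cloudtrail_log(gz_bytes):
    """Decompress a CloudTrail log file and return its list of event records."""
    return json.loads(gzip.decompress(gz_bytes)).get("Records", [])


def describe_actor(user_identity):
    """Best-effort readable name for the user, role or service behind an event."""
    return (
        user_identity.get("arn")
        or user_identity.get("userName")
        or user_identity.get("invokedBy")
        or user_identity.get("type", "unknown")
    )


def build_message(record, account_aliases):
    """Summarise what happened, who did it, when, and in which account."""
    account_id = record.get("recipientAccountId", "unknown")
    account = account_aliases.get(account_id, account_id)  # fall back to the raw ID
    actor = describe_actor(record.get("userIdentity", {}))
    return "{} by {} at {} in account {}".format(
        record.get("eventName"), actor, record.get("eventTime"), account
    )


def process_log(gz_bytes, watched_events, account_aliases, publish):
    """Publish one message per watched event; return how many were sent."""
    sent = 0
    for record in parse_cloudtrail_log(gz_bytes):
        if record.get("eventName") not in watched_events:
            continue  # not on our list, ignore it
        publish(build_message(record, account_aliases))
        sent += 1
    return sent
```

Inside a Lambda handler, `gz_bytes` would typically come from `s3.get_object(Bucket=..., Key=...)["Body"].read()`, and `publish` could be `lambda msg: sns.publish(TopicArn=topic_arn, Message=msg)`. Passing these in as arguments keeps the parsing and filtering logic easy to test without touching AWS.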
Once all these are in place, using CloudWatch Logs and metrics, it’s easy to keep an eye on the executions, make sure the DynamoDB tables have enough provisioned capacity and so on.
Getting more detail on the source of an event
You might also wish to get exact detail on which user, service, Lambda function and so on performed an action, which can be tricky if you are using AssumeRole, federated login and the like. Thankfully there is a great AWS blog post on how to cross-reference events. In order to be able to refer to past events, you can either use a tool such as Amazon Athena to search the CloudTrail logs in the S3 bucket, or have your Lambda function log actions such as AssumeRole and federated logins to a DynamoDB table, which can then be used to determine the exact originator of the event. Care should be taken with both methods – in Athena, a good mapping of table to file structure and well-written queries are important for cost control, and in DynamoDB, the TTL feature can be used to expire login data in order to keep table sizes under control.
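As a hedged sketch of the DynamoDB option, the helper below turns an AssumeRole record into an item carrying an expiry timestamp. The attribute names, the 30-day retention and the function name are assumptions for illustration. The TTL attribute (here `ExpiresAt`) has to be a Number holding epoch seconds, and TTL must be enabled on the table for that attribute.

```python
import time


def login_item(record, retention_days=30, now=None):
    """Build a DynamoDB item for an AssumeRole event that expires after retention_days."""
    now = int(time.time()) if now is None else now
    # AssumeRole records carry the temporary credentials' access key ID, which
    # later events made with those credentials can be matched against.
    credentials = (record.get("responseElements") or {}).get("credentials") or {}
    return {
        "AccessKeyId": {"S": credentials.get("accessKeyId", "unknown")},
        "AssumedBy": {"S": record.get("userIdentity", {}).get("arn", "unknown")},
        "EventTime": {"S": record.get("eventTime", "unknown")},
        # DynamoDB removes the item some time after this epoch timestamp passes
        "ExpiresAt": {"N": str(now + retention_days * 86400)},
    }
```

The resulting dict could be written with `dynamodb.put_item(TableName=..., Item=login_item(record))`. Later events can then be traced back by looking up the access key ID found in their `userIdentity`.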
The End Result
So there we have it: we’ve put a system in place that alerts on the actions we want to know about. It’s entirely serverless, so maintenance is very easy. Really all there is to do is keep track of capacity and any script errors, and add or remove actions in our DynamoDB table. It’s also very low cost. In fact, for 1 million operations a month, our costs work out at around $0.23 a month for Lambda, around $3 more for DynamoDB, and the same again for SNS.
So under $10 a month to keep an eye on actions over a few hundred accounts, not too bad a result!
For more posts by James Wells, click here