In the last few years, we have assisted many customers in their journey to the cloud, building landing zones and executing server migrations.
Minimising downtime while preserving data integrity is a key requirement for every customer, and one we commit to meeting.
Network security is also mandatory. Locking down security groups so that they allow only the required connectivity is a must to prevent data exfiltration, reduce the blast radius of cyberattacks, and protect your applications.
When the migration strategy is rehosting (a.k.a. Lift & Shift), most system and functional test failures after the cutover are caused by missing Security Group or NACL rules, which are difficult to capture during the discovery sessions.
Detecting these gaps during the post-cutover triaging phase, and fixing them in a timely manner, is the key to the migration’s success.
In the next paragraphs, I’ll go through the services that AWS provides to gain visibility into network connectivity logs in a VPC.
VPC Flow Logs
Released in 2015, this VPC feature accomplishes a very simple task: it logs the network traffic metadata for each ENI (Elastic Network Interface) in your VPC.
Over the years, the number of metadata fields that can be logged has increased (full list here).
We are interested in the source/destination IPs, the protocol, and the action (reject vs accept).
The latest versions of the service include very useful information such as the Region, AWS Service, VPC, flow direction, and traffic path.
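To make the fields concrete, here is a minimal Python sketch that splits one record in the default (version 2) Flow Logs format and keeps only the fields used for triaging. The sample record is made up (account ID, ENI ID and addresses are hypothetical), and the helper names are mine, not part of any AWS SDK. Note that the raw format uses lowercase names such as `srcaddr`, while CloudWatch Logs Insights exposes them as `srcAddr`, `dstAddr` and so on.

```python
from typing import NamedTuple

# Field order of the default (version 2) VPC Flow Logs record format.
DEFAULT_FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)


class FlowSummary(NamedTuple):
    """The subset of fields we care about when triaging connectivity."""
    interface_id: str
    srcaddr: str
    dstaddr: str
    dstport: str   # kept as text: records without data use "-" here
    protocol: str
    action: str


def parse_default_record(line: str) -> FlowSummary:
    """Split one space-separated default-format record and keep the triage fields."""
    values = line.split()
    if len(values) != len(DEFAULT_FIELDS):
        raise ValueError(f"expected {len(DEFAULT_FIELDS)} fields, got {len(values)}")
    record = dict(zip(DEFAULT_FIELDS, values))
    return FlowSummary(*(record[name] for name in FlowSummary._fields))


# Hypothetical sample record, for illustration only.
sample = (
    "2 123456789010 eni-1235b8ca123456789 10.0.0.7 10.0.0.8 "
    "49761 1433 6 1 40 1418530010 1418530070 REJECT OK"
)
summary = parse_default_record(sample)
print(summary.srcaddr, summary.dstaddr, summary.dstport, summary.protocol, summary.action)
```

Custom log formats can reorder or add fields, so this parser only applies to the default layout.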
VPC Flow Logs can be pushed to Amazon S3 or CloudWatch Logs, depending on the consumption model you want to adopt.
Logging is not real-time. Data is usually delivered about 5 minutes after the event when CloudWatch Logs is the target, and about 10 minutes after it when S3 is the target.
For our triaging, we use CloudWatch Logs as the target, because it lets us query the networking logs with CloudWatch Logs Insights.
You can (and should, for security purposes) enable VPC Flow Logs from your VPC settings.
For each ENI, a Log Stream is created by VPC Flow Logs. This is how it looks in the AWS Console:
VPC Flow Logs limitations
Given the nature of VPC Flow Logs (one log stream for each ENI), it can be difficult to triage network issues, as you would have to:
- Get the ENI ID for the EC2 instance / AWS service
- Search for the ENI ID in the VPC Flow Logs Log Group
- Filter by action and target IP using full text search
This approach, although valid, might not be as immediate as you would like it to be. Correlating logs from different Log Streams is challenging (flowlogs-reader is a powerful tool for that), because most of the time you don’t know which server is not working as expected. The typical scenario to triage is “The application X is not working” and not “server A can’t connect to server B”.
This is why you need a tool to quickly query multiple log streams at once, without knowing which ENI IDs are in scope.
CloudWatch Logs Insights
AWS released CloudWatch Logs Insights at re:Invent 2018.
Quoting the documentation:
CloudWatch Logs Insights enables you to interactively search and analyze your log data in Amazon CloudWatch Logs. You can perform queries to help you more efficiently and effectively respond to operational issues. If an issue occurs, you can use CloudWatch Logs Insights to identify potential causes and validate deployed fixes.
CloudWatch Logs Insights includes a purpose-built query language with a few simple but powerful commands. CloudWatch Logs Insights provides sample queries, command descriptions, query autocompletion, and log field discovery to help you get started. Sample queries are included for several types of AWS service logs.
CloudWatch Logs Insights automatically discovers fields in logs from AWS services such as Amazon Route 53, AWS Lambda, AWS CloudTrail, and Amazon VPC, and any application or custom log that emits log events as JSON.
That is literally what it does. The powerful aspect of the tool is that you can define a query that spans up to 20 Log Groups and all the Log Streams underneath them.
Previously, you had to use Amazon Athena to perform similar data analytics. The final outcome is similar, but setting up and working with Amazon Athena isn’t as straightforward as CloudWatch Logs Insights, and it adds delivery delay because it reads the data from S3.
This is how CloudWatch Logs Insights looks in the AWS Console:
You can select multiple Log Groups to search through, then the timeframe, and finally the query.
We are later going to define a specific query for VPC Flow Logs.
The pricing model is based on the quantity of data analysed. It tends to be reasonably cheap for querying data produced in the last few minutes in CloudWatch Logs (this is the scope we’re interested in), given that it only costs $0.005 per GB of data scanned after the 5 GB included in the Free Tier. The amount of data stored depends on the number of active servers and the VPC Flow Logs format.
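As a rough worked example of that pricing, the sketch below estimates the cost of a query from the amount of data it scans. It assumes the two figures quoted above ($0.005 per GB and a 5 GB Free Tier allowance); actual prices vary by Region and over time, and the function name is hypothetical.

```python
PRICE_PER_GB_USD = 0.005  # price per GB scanned, as quoted in the text (assumption)
FREE_TIER_GB = 5.0        # Free Tier allowance, as quoted in the text (assumption)


def estimate_query_cost(scanned_gb: float, free_tier_left_gb: float = FREE_TIER_GB) -> float:
    """Return the estimated cost in USD of scanning `scanned_gb` of log data."""
    billable_gb = max(0.0, scanned_gb - free_tier_left_gb)
    return billable_gb * PRICE_PER_GB_USD


# A query scanning 3 GB stays within the allowance; one scanning 25 GB pays for 20 GB.
for scanned in (3.0, 25.0):
    print(f"{scanned:.0f} GB scanned: ${estimate_query_cost(scanned):.2f}")
```

Narrowing the timeframe or the Log Groups selected reduces `scanned_gb`, which is the only variable that drives the cost here.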
Putting it all together
Once you have enabled VPC Flow Logs targeting CloudWatch Logs, you will see one log stream for each ENI. Remember that each service deployed in a VPC uses one or more ENIs (e.g. Lambda, FSx, RDS, etc.).
Let’s now say that we have just cut over a three-tier application composed of 3 servers (Web, App and DB).
Let’s now assume that for some unknown reason the application is not working as expected, for example, it returns a timeout or an HTTP 503 error. When migrating legacy or poorly documented applications, it can be difficult to identify the root cause of a malfunction.
To detect (or exclude) network problems caused by a missing security group or NACL rule, open CloudWatch Logs Insights and use the following query:
# Servers in triaging scope: Web Server 10.0.0.6, App Server 10.0.0.7, DB Server 10.0.0.8
# Keep only rejected connections between internal networks (regex on both
# addresses) that involve one of the servers in scope. We are interested in
# both inbound and outbound connections (srcAddr + dstAddr).
filter (
    action = "REJECT" and
    dstAddr like /^(10\.|192\.168\.)/ and
    srcAddr like /^(10\.|192\.168\.)/ and
    (
        srcAddr = "10.0.0.6" or dstAddr = "10.0.0.6" or
        srcAddr = "10.0.0.7" or dstAddr = "10.0.0.7" or
        srcAddr = "10.0.0.8" or dstAddr = "10.0.0.8"
    )
)
# Avoid duplicate results (count them instead). We are interested in
# source/destination IPs, destination port and protocol.
| stats count(*) as records by srcAddr, dstAddr, dstPort, protocol
| sort records desc
# Only show the first 5 entries
| limit 5
Set a 1-hour timeframe (or longer if you need a fuller picture). My advice is to keep the amount of data queried small, for better performance at a lower cost.
The query results will look similar to the following:
In this example:
- The App server is trying to connect to the SQL database using a dynamic port
- A client in the on-premises network is trying to connect to the Web server using HTTP
- The Web server is trying to connect to SQL port 1434/UDP (typical for SQL named instances)
- The App server is trying to connect to a file share in another VPC in AWS (445/TCP corresponds to the SMB protocol)
- The App server is trying to connect to the Web server using a custom port
Note that the protocol number corresponds to the IANA Assigned Internet Protocol Numbers (e.g. 6 = TCP, 17 = UDP).
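When reading the results, a small lookup can translate the protocol column into names. The table below is a deliberately incomplete subset of the IANA registry (only the three protocols you are most likely to see), and the helper name is hypothetical.

```python
# Small, deliberately incomplete subset of the IANA protocol numbers registry.
PROTOCOL_NAMES = {1: "ICMP", 6: "TCP", 17: "UDP"}


def protocol_name(number) -> str:
    """Return a readable name for a flow log protocol value, or the raw value if unknown."""
    try:
        return PROTOCOL_NAMES[int(number)]
    except (KeyError, ValueError, TypeError):
        # Unknown number, or a non-numeric placeholder such as "-".
        return str(number)


for value in ("6", "17", "47"):
    print(value, "->", protocol_name(value))
```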
It is important to highlight that a connection reported as “accepted” in VPC Flow Logs doesn’t necessarily mean that the end-to-end connectivity is working.
For example, a connection blocked by an on-premises firewall will still produce a log record with the action “ACCEPT”, because AWS itself is allowing the connection. A good practice to detect these blocked flows is to isolate connections with no return flow.
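One way to isolate connections with no return flow is sketched below in Python: for each accepted flow, look for the mirrored flow (source and destination swapped) and report the ones that have none. The `Flow` type, the function name and the sample addresses are all hypothetical, and the records are assumed to have been exported from the logs already.

```python
from typing import Iterable, List, NamedTuple


class Flow(NamedTuple):
    srcaddr: str
    srcport: int
    dstaddr: str
    dstport: int
    protocol: int


def flows_without_return(accepted: Iterable[Flow]) -> List[Flow]:
    """Return accepted flows for which no mirrored (reply) flow was logged."""
    flows = set(accepted)
    return sorted(
        flow
        for flow in flows
        if Flow(flow.dstaddr, flow.dstport, flow.srcaddr, flow.srcport, flow.protocol)
        not in flows
    )


# Hypothetical ACCEPT records: the first two are a request and its reply,
# the third is an outbound request that never received an answer.
accepted = [
    Flow("172.16.1.20", 50112, "10.0.0.6", 443, 6),
    Flow("10.0.0.6", 443, "172.16.1.20", 50112, 6),
    Flow("10.0.0.7", 51000, "172.16.5.10", 1521, 6),
]
for flow in flows_without_return(accepted):
    print(flow)
```

Treat the output as a list of candidates rather than confirmed failures: one-way UDP traffic is legitimate, and a reply can land just outside the timeframe you exported.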
With this approach, fixing and testing these gaps takes minutes instead of hours.
It minimises the application downtime caused by the migration, so teams can focus on performance and functional testing and release the application to end-users sooner.
In large migration projects, automation is key to minimising the effort engineers spend building and running the CloudWatch Logs Insights queries.
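As a starting point for that automation, the following Python sketch generates the rejected-connections query from a list of server IPs, with the same structure as the query shown earlier. The function name is hypothetical, and the internal-network regex is copied from that query, so adjust it if your private ranges differ.

```python
import ipaddress
from typing import Sequence

# Same internal-network regex as the query in the text (10.x and 192.168.x only).
INTERNAL = r"/^(10\.|192\.168\.)/"


def build_reject_query(server_ips: Sequence[str], limit: int = 5) -> str:
    """Build a Logs Insights query listing rejected flows for the given servers."""
    if not server_ips:
        raise ValueError("at least one server IP is required")
    clauses = []
    for ip in server_ips:
        ipaddress.ip_address(ip)  # raises ValueError on anything that is not an IP
        clauses.append(f'srcAddr = "{ip}" or dstAddr = "{ip}"')
    scope = " or ".join(clauses)
    return "\n".join([
        "filter (",
        '    action = "REJECT" and',
        f"    dstAddr like {INTERNAL} and",
        f"    srcAddr like {INTERNAL} and",
        f"    ({scope})",
        ")",
        "| stats count(*) as records by srcAddr, dstAddr, dstPort, protocol",
        "| sort records desc",
        f"| limit {int(limit)}",
    ])


query = build_reject_query(["10.0.0.6", "10.0.0.7", "10.0.0.8"])
print(query)
```

The resulting string can be pasted into the console, or passed as the `queryString` of the CloudWatch Logs `StartQuery` API (for example through an AWS SDK) together with the Log Groups and timeframe for the application being cut over.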