How AWS CloudTrail Can Help Resolve DNS Gotchas in the Cloud

Kieran Doonan 3rd February 2016

What’s the problem? Many services in cloud environments use DNS to reference resources. Services like RDS and ELB in AWS may have their underlying hosts change from time to time, which can cause unintentional outages that are hard to detect.


The main problem is that some applications like to resolve DNS only once. If, the first time an application resolves
example.eu-west-1.elb.amazonaws.com, it stores the result in memory, it can save a little time and a few resources the next time the same DNS name is used. DNS is designed to be cached, but only for a set time. Every DNS entry comes with a “Time to Live” (TTL) which specifies how long the entry is supposed to remain valid, and it usually ranges from about five minutes to 24 hours.
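To make the difference concrete, here’s a minimal sketch in Python of a TTL-aware resolver cache (the resolver is injectable, so the class can be exercised without touching the network; the default asks the OS resolver). A resolve-once application effectively behaves as if the TTL were infinite, which is exactly what breaks when an ELB’s addresses change underneath it.

```python
import socket
import time

class TtlDnsCache:
    """Cache resolved addresses, but only for a fixed TTL in seconds."""

    def __init__(self, ttl=60, resolver=None):
        self.ttl = ttl
        # The resolver is injectable so the cache can be tested without the
        # network; the default asks the OS resolver via getaddrinfo.
        self.resolver = resolver or (
            lambda host: sorted({ai[4][0] for ai in socket.getaddrinfo(host, 80)})
        )
        self._cache = {}  # host -> (expires_at, addresses)

    def resolve(self, host):
        now = time.monotonic()
        entry = self._cache.get(host)
        if entry and entry[0] > now:
            return entry[1]              # still fresh: reuse cached addresses
        addrs = self.resolver(host)      # missing or expired: re-resolve
        self._cache[host] = (now + self.ttl, addrs)
        return addrs
```

Real DNS answers carry their own TTL per record; a fixed TTL like this is a simplification, but it already avoids the resolve-once trap.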


Here’s an example

This one has caused us issues in the past: Nginx. If you’re not familiar with it, Nginx is a web server that can also act as a reverse proxy. A common setup is an ELB in front of Nginx instances that cache responses from your other web servers. In this configuration, you’d typically have a proxy_pass directive in the Nginx config which points to the web tier ELB, as shown here:

http {
    server {
        server_name www.example.com;

        location / {
            proxy_pass http://webtier.eu-west-1.elb.amazonaws.com;
            …
        }
    }
}

This will likely work for a while. Then one day your entire cache layer goes down at once and only comes back up when you restart Nginx on each instance: the IP addresses of the ELB changed, but Nginx never re-resolved the ELB’s name into the new addresses.
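One way to confirm this is the cause is to compare the address a stuck Nginx worker is connecting to against what the name resolves to right now. A small sketch (the hostname in the comment is just the example from the config above):

```python
import socket

def current_ips(hostname, port=80):
    """Ask the OS resolver for the addresses the name resolves to right now."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

# e.g. current_ips("webtier.eu-west-1.elb.amazonaws.com"): if this set no
# longer contains the address a long-running worker is connecting to,
# that worker is holding a stale DNS answer.
```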


Is there a solution?


To get around this with the community version of Nginx, you have to force it to re-evaluate DNS with a configuration like the one below (replacing the resolver value with a relevant DNS server):

http {
    resolver X.X.X.X;

    server {
        server_name www.example.com;
        set $web "webtier.eu-west-1.elb.amazonaws.com";

        location / {
            proxy_pass http://$web;
            …
        }
    }
}

Referencing the web ELB through a variable, with a resolver defined, makes Nginx cache the DNS entry only for as long as the TTL. This aligns much better with how DNS is supposed to work (and also means there’s no real performance impact).

 

AWS CloudTrail for Detection

Most other applications with this problem have an option to work around it, but the problem itself can be difficult to detect. You could run through every part of your application and test whether it shows up (e.g. by referencing a DNS entry, updating it, and then checking whether the change is eventually reflected in the application), but if you’re using AWS, there’s a good way to detect when it has happened.

Using the scenario described previously as an example (an Nginx cache in front of a web tier): when the IP addresses for the web ELB change, none of the Nginx instances will be able to act as a reverse proxy. It’s worth noting that the IP addresses I’m referring to belong to the ELB itself, not to the instances behind it. They only change when the ELB needs to scale or the underlying hosts serving it need to be refreshed.

Fortunately, there’s a good way to see when this happens. The AWS CloudTrail service logs every API action taken on your AWS account, including not just your own actions but also those of some internal AWS systems. The screenshot below shows the ELB service deleting one of its unused network interfaces.

[Screenshot: CloudTrail event showing the ELB service deleting an unused network interface]

If your application has problems every time something like this happens, then DNS being cached for too long is likely your problem. Note the source IP address (elasticloadbalancing.amazonaws.com) and username (root): these show that the API call came from the ELB service itself, not from someone making the change manually.
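If you’d rather spot these events programmatically than in the console, CloudTrail’s LookupEvents API can be filtered down to the ELB-initiated deletions. A sketch in Python: the field names (CloudTrailEvent, eventName, sourceIPAddress) match the CloudTrail record format, while the actual fetch requires boto3 and configured AWS credentials, so it’s shown only as a comment.

```python
import json

def is_elb_service_event(event):
    """True when a CloudTrail event was made by the ELB service itself:
    service-initiated calls carry the service principal as the source."""
    return event.get("sourceIPAddress") == "elasticloadbalancing.amazonaws.com"

def elb_interface_deletions(raw_events):
    """Filter LookupEvents results (each carries a JSON CloudTrailEvent blob)
    down to ELB-initiated DeleteNetworkInterface calls."""
    for raw in raw_events:
        detail = json.loads(raw["CloudTrailEvent"])
        if detail.get("eventName") == "DeleteNetworkInterface" and is_elb_service_event(detail):
            yield detail

# Fetching the raw events needs boto3 plus AWS credentials, roughly:
#   events = boto3.client("cloudtrail").lookup_events(
#       LookupAttributes=[{"AttributeKey": "EventName",
#                          "AttributeValue": "DeleteNetworkInterface"}])["Events"]
#   for e in elb_interface_deletions(events):
#       print(e["eventTime"])
```

Correlating the timestamps of these events with your outages is a quick way to confirm (or rule out) over-long DNS caching as the cause.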

 

Liked this? Check out our Slide Deck on the same topic!