What I wish I knew before I started doing lift and shift migrations to AWS
In a previous blog post I discussed how to integrate Terraform with rehosting migrations.
In this technical article (mainly sysadmin-oriented) I’ll discuss what is needed to ensure a smooth migration with as few surprises as possible on migration day, and what should always be included in a lift and shift migration runbook.
I will cover the following phases:
- Cutover preparation: Steps that should be performed before the cutover event
- Post-Cutover: Steps that should be performed after the cutover event, including some troubleshooting tips and tools
As usual, in the context of a large-scale migration, all of the steps mentioned in this article should be automated to reduce manual effort and errors.
However, it can be challenging to automate the whole process given the differences between commands in different operating systems and versions.
Phase 1: Cutover Preparation (on-premise server)
Let’s start with the on-premise server tasks. Assuming you have already installed the AWS MGN agent, these are the other things to consider doing prior to the migration:
1. Install the SSM Agent
This component is crucial to streamline access through SSM Session Manager and run SSM commands post-migration.
Most of the sysadmin automation in AWS is based on SSM. With AWS MGN you can automatically install the SSM Agent as part of the post-launch settings.
If you can’t use that feature (e.g. if you are still using CloudEndure), it’s highly recommended to install the SSM Agent on-premise and leave its Windows service stopped but set to start automatically at boot.
Instructions on how to install SSM Agent can be found here.
2. AWS PowerShell (Windows only)
On Windows machines, PowerShell is the natural way to use the AWS command-line tooling.
However, most of the time the AWS CLI is not used from the server itself.
There is one exception where the AWS PowerShell module is mandatory: when you need to enable VSS-consistent snapshots on Windows servers, which is the recommended way to back up SQL Server, file servers and domain controllers via EBS snapshots.
3. Check the free space
To execute a cutover or test cutover of a Windows Server, MGN will require 2 GB of free space in the C drive of the server.
While the server is replicating, make sure you always have enough space in the C drive to avoid last minute cleanup before the migration.
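The free-space check is easy to script so it can run on a schedule while replication is in progress. Here is a minimal sketch, assuming Python is available on the source server; the helper name and the constant are hypothetical, and the 2 GB threshold is the figure quoted above.

```python
import shutil

# Threshold quoted in this article for MGN cutovers of Windows servers.
REQUIRED_FREE_BYTES = 2 * 1024**3


def has_enough_free_space(path, required_bytes=REQUIRED_FREE_BYTES):
    """Return True if the volume holding `path` has at least `required_bytes` free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free >= required_bytes
```

On the source server you would call it as has_enough_free_space("C:\\") and raise an alert when it returns False.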
4. Reduce the DNS TTL
If the DNS TTL is longer than 20 minutes, it’s recommended to reduce it to 15 minutes before the migration, so that users and testers don’t have to flush their DNS cache after the cutover, and to restore the previous value afterwards.
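One timing detail is worth spelling out: resolvers that cached the record before the change keep it for up to the old TTL, so the reduction has to be made at least one old-TTL period before the cutover. A small sketch of that arithmetic (the function name is hypothetical):

```python
from datetime import datetime, timedelta


def latest_ttl_change(cutover, current_ttl_seconds):
    """Latest moment at which the TTL can be lowered so that every cache
    holding the record with the old TTL has expired it before the cutover."""
    return cutover - timedelta(seconds=current_ttl_seconds)
```

For example, with a 24 hour TTL the change has to be in place at least a day before the cutover window, otherwise some clients will still resolve the old IP address.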
5. Local Admin user
A local administrator account should be available before beginning the cutover process.
In case access is needed to either source or target machines, the local administrator account will allow logging in without relying on Active Directory.
This can be accomplished by using a GPO that creates a local administrator user for Cloudreach. This GPO would be assigned to each server that will be migrated and removed after the migration.
6. Root CA Certificates
AWS uses HTTPS for all API calls to its service endpoints (e.g. ssm.eu-west-1.amazonaws.com). The root CA certificates on the server need to be up to date so that the various *.amazonaws.com domains can be trusted. This usually happens periodically when the server is patched.
Sometimes, servers are not fully patched or are missing the required root CA certificates.
To overcome this problem (as we will need to use the endpoints) it’s highly recommended to install the root CA certificates from this AWS website.
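A quick way to find out whether a server already trusts the AWS endpoints is to attempt a TLS handshake from it. The sketch below assumes Python is installed on the server and that its default SSL context validates against the operating system trust store (this is the case on Windows, and typically on Linux). The list of endpoint prefixes is my assumption of what the SSM Agent needs, and both function names are hypothetical.

```python
import socket
import ssl


def ssm_endpoint_hosts(region):
    """Regional endpoints assumed to be needed by the SSM Agent."""
    services = ("ssm", "ssmmessages", "ec2messages")
    return [f"{service}.{region}.amazonaws.com" for service in services]


def is_certificate_trusted(host, port=443, timeout=5.0):
    """Return True if the certificate chain presented by `host` validates locally.

    Connectivity errors (timeouts, refused connections) are not caught here:
    they point to a network problem rather than a missing root CA.
    """
    context = ssl.create_default_context()  # verifies the chain and the hostname
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host):
                return True
    except ssl.SSLCertVerificationError:
        return False
```

Running is_certificate_trusted over ssm_endpoint_hosts("eu-west-1") before the cutover shows which servers still need the root CA certificates installed.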
7. Before the cutover (Windows only)
- Update machine password – run “netdom reset machinename” so that the machine password is refreshed and will not expire for another 30 days. This will ensure that the server will be able to authenticate successfully to the Active Directory domain after the migration and avoid domain trust issues.
- Shutdown phase – When it’s time to shut down the source server, use the command shutdown -s to prevent the source server from installing Windows Updates. The source server needs to shut down as quickly as possible to avoid delays while the target is being created.
Phase 2: Post-Cutover
Once we have hit the cutover button in AWS MGN (or CloudEndure), the server will be created in the designated AWS Account.
Depending on the migration strategy, you may want to take an AMI and transfer the snapshots to another AWS Account, and/or integrate the EC2 instance with your IaC tool like Terraform.
At a certain point, you are ready to perform the basic sanity check tests on the target instance, such as:
- Instance Status check shows as 2/2
- SSH/RDP Access using local or domain users
- The Microsoft DNS has been updated dynamically with the new IP (if dynamic DNS is being used)
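The first of these checks lends itself to automation. The sketch below interprets the response of the EC2 DescribeInstanceStatus API (for example, the dictionary returned by boto3’s describe_instance_status) and reports whether both the system and the instance status checks have passed. The function name is hypothetical.

```python
def passed_both_status_checks(response, instance_id):
    """Return True if `instance_id` reports 'ok' for both status checks.

    `response` is expected to have the shape of a DescribeInstanceStatus
    result: {"InstanceStatuses": [{"InstanceId": ..., "SystemStatus":
    {"Status": ...}, "InstanceStatus": {"Status": ...}}, ...]}.
    """
    for status in response.get("InstanceStatuses", []):
        if status.get("InstanceId") != instance_id:
            continue
        system_ok = status.get("SystemStatus", {}).get("Status") == "ok"
        instance_ok = status.get("InstanceStatus", {}).get("Status") == "ok"
        return system_ok and instance_ok
    return False  # instance not present in the response
```

A check that is still in progress reports a status such as "initializing", so the function returns False until both checks have completed successfully.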
Let’s now have a look at what you should consider doing after the migration:
1. Enable termination protection
To avoid surprises on migration day (and after!) it’s recommended to enable termination protection. Rehosted servers are usually “pets” so you want to treat them as such.
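As a sketch of how this could be scripted for a whole migration wave, the function below takes an EC2 client (for example one created with boto3.client("ec2"), which is an assumption about your tooling) and turns on termination protection through modify_instance_attribute. The function name is hypothetical.

```python
def enable_termination_protection(ec2_client, instance_ids):
    """Enable API termination protection on every instance in `instance_ids`.

    `ec2_client` is assumed to expose modify_instance_attribute with the
    same keyword arguments as the boto3 EC2 client.
    """
    for instance_id in instance_ids:
        ec2_client.modify_instance_attribute(
            InstanceId=instance_id,
            DisableApiTermination={"Value": True},
        )
```

Passing the client in as an argument keeps the function easy to test with a stub and lets you reuse one client per target account and region.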
2. CloudWatch Agent installation and configuration
If CloudWatch is the tool of choice for logging and/or monitoring, you can install the agent now.
Before the migration you should create a configuration file and store it in SSM Parameter Store so it can be retrieved by all the instances migrated to AWS.
- Use the SSM document AWS-ConfigureAWSPackage with the option AmazonCloudWatchAgent to install the agent
- Use the SSM document AmazonCloudWatch-ManageAgent, specifying the SSM Parameter Store parameter in which the monitoring and logging configuration is defined
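The two steps above can be sketched as a pair of SSM SendCommand requests. The function below only builds the request arguments, so that they can be passed to a client such as boto3’s ssm.send_command(**request). The document parameter names (action, name, mode, optionalConfigurationSource, optionalConfigurationLocation, optionalRestart) are written as I understand the two documents and should be verified against the document versions in your account; the function name is hypothetical.

```python
def cloudwatch_agent_requests(instance_ids, config_parameter_name):
    """Build SendCommand arguments to install and then configure the agent."""
    targets = list(instance_ids)
    install = {
        "DocumentName": "AWS-ConfigureAWSPackage",
        "InstanceIds": targets,
        "Parameters": {
            "action": ["Install"],
            "name": ["AmazonCloudWatchAgent"],
        },
    }
    configure = {
        "DocumentName": "AmazonCloudWatch-ManageAgent",
        "InstanceIds": targets,
        "Parameters": {
            "action": ["configure"],
            "mode": ["ec2"],
            # Read the agent configuration from SSM Parameter Store.
            "optionalConfigurationSource": ["ssm"],
            "optionalConfigurationLocation": [config_parameter_name],
            "optionalRestart": ["yes"],
        },
    }
    return [install, configure]
```

SendCommand is asynchronous, so the configure request should only be sent once the install command has completed successfully on the instance.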
3. AWS Drivers update
It is important that all the drivers are up to date to maximize the performance of our migrated instances.
As usual, you can use SSM and the document AWS-ConfigureAWSPackage with the following packages:
- NVMe Storage Drivers – AWSNVMe
- PV Networking Drivers – AWSPVDriver
- ENA Networking Drivers – AwsEnaNetworkDriver
- AWS VSS Components – AwsVssComponents
Note that not all of the packages are compatible with Linux Instances.
4. AWS MGN Agent removal
This step is performed automatically by AWS MGN after the migration but sometimes (for example, when a reboot is triggered manually during the uninstallation process) it may fail. Instructions for uninstalling the agent can be found here.
5. Pre-warm the EBS volumes
AWS MGN uses EBS snapshots to create servers in AWS.
The first time an EBS block is accessed, AWS retrieves it from S3, which adds significant latency and drives up the disk queue length. Until every block has been read once, the EBS volumes attached to migrated servers will underperform, and their performance will not match the IOPS and throughput values provisioned for them.
To avoid this, we need to “pre-warm” the EBS volumes.
For Windows, you can use fio, while for Linux you can use dd. Everything is very well-documented in the AWS docs. The amount of time taken to pre-warm drives depends on the volume size and the IOPS/throughput.
In the graph above you can see the Queue Length metric for an EBS volume.
In this example fio has run twice (you can see the two trapezoids corresponding to each run):
- The first time, when the EBS volume wasn’t pre-warmed, the run took 90 minutes to complete and the Queue Length was steady at around 30.
- The second time, the elapsed time dropped to 60 minutes and the Queue Length to around 3.
Database applications are quite sensitive to Queue Length (which is correlated to the Latency), hence it’s extremely important to pre-warm all the I/O intensive drives and avoid executing performance tests before the pre-warming has completed.
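Conceptually, pre-warming is nothing more than reading every block of the volume once, which is what a command like dd if=<device> of=/dev/null does. The sketch below shows the same idea in Python for illustration only; dd and fio remain the tools to use in practice, and the function name is hypothetical. Reading a raw device requires administrative privileges.

```python
def read_every_block(path, block_size=1024 * 1024):
    """Sequentially read `path` (a device or a file) and return the bytes read.

    The data is discarded: the only goal is to touch every block once so
    that it is pulled from the snapshot before production traffic needs it.
    """
    total = 0
    with open(path, "rb", buffering=0) as device:
        while True:
            chunk = device.read(block_size)
            if not chunk:  # end of device or file
                break
            total += len(chunk)
    return total
```

A single sequential reader like this is slow on large volumes, which is one reason a tool able to keep several reads in flight is usually the better choice.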
6. Set AWS NTP
Unless you have other strict requirements, it’s highly recommended to use the AWS provided NTP servers. Instructions for Linux are documented here while for Windows servers you can have a look here.
7. Test VSS Backup
If your Windows server is configured to be backed up using VSS, it’s a good idea to test the backup now, because it depends on several components that need to be verified: the Instance Profile permissions, the AWS PowerShell module and the AWS VSS components.
8. Prevent domain trust issues
After the cutover we always recommend turning off the source server and disconnecting its Network Interface Cards to prevent it from accidentally coming back up and interfering with its replica in AWS.
For Windows servers, to prevent the old source instance from affecting domain trust on the destination instance, run the command netdom reset [machinename].
9. Connectivity Troubleshooting
This is not a “step” or a “task”, just guidance on troubleshooting connectivity issues. I’ve published a blog article on how to triage Security Group and NACL issues using CloudWatch Logs Insights by querying VPC flow logs.
Unfortunately, Security Groups and NACLs are not the only “connectivity blockers”. Newer deployments use AWS Network Firewall, which also produces logs that can be scanned for DROPs.
Almost all organizations use one or more perimeter firewalls on-premise.
Migration teams do not usually have access to them, hence we use more “traditional” detection methods.
For example, netstat is a solid CLI tool available on Windows and on most Linux distributions, which allows you to detect connections blocked by firewalls (including Security Groups).
By simply using the command “netstat -ano | grep SYN_SENT” (Linux) or “netstat -ano | findstr SYN_SENT” (Windows, PowerShell) you can easily see all of the connections that are not working as expected from a TCP perspective (SYN_SENT means that the client is trying to connect to a server but hasn’t received the server’s response).
The output of this analysis is usually something like: “The application xyz is not working because it is probably not able to connect to the server yzx using port y” which will help Networking and Firewall Administrators to open up the required connectivity.
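When many servers are migrated in the same wave, it helps to turn the netstat output into a list of blocked destinations that can be handed to the firewall team. The sketch below parses the text produced by netstat -ano on either Windows or Linux and returns the remote host and port of every connection stuck in SYN_SENT. The function name is hypothetical, and the parsing assumes the numeric address:port column layout that the -n flag produces.

```python
def blocked_destinations(netstat_output):
    """Return (host, port) pairs for connections in the SYN_SENT state.

    On both Windows and Linux the foreign address is the column that
    immediately precedes the state column, so it is located relative
    to the SYN_SENT token.
    """
    destinations = []
    for line in netstat_output.splitlines():
        fields = line.split()
        if "SYN_SENT" not in fields:
            continue
        state_index = fields.index("SYN_SENT")
        if state_index == 0:
            continue  # malformed line, no address column before the state
        remote = fields[state_index - 1]
        host, _, port = remote.rpartition(":")  # split on the last colon
        if host and port.isdigit():
            destinations.append((host, int(port)))
    return destinations
```

Deduplicating and sorting the result per application server gives a compact list of the host and port combinations that still need to be opened.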
10. On-premise agents removal
We want the target server to be as clean as possible, without legacy infrastructure software installed.
These are usually the agents that need to be removed after the migration:
- VMware agent
- Hyper-V agent
- Physical Server Management tools (e.g. HPE Integrated Lights-Out, BladeLogic, etc.)
- Backup agent (if switching to AWS Backup/EBS snapshots)
- Monitoring agent (if switching to other monitoring platforms like Datadog or CloudWatch)
In this article I’ve shown you some aspects that should be taken into account for every rehosting migration. Similar considerations can be applied to other cloud providers and migration tools.
I hope you can make use of these tips during your journey to the cloud.