Where did my MGN source server go?
Solving why AWS Application Migration Service source servers sometimes *disappear*
In one of my current consulting projects, I'm helping a company move 15 Ubuntu VMware servers from a traditional virtual private cloud (VPC) provider to AWS EC2 using the AWS Application Migration Service (MGN). To replicate a server with MGN, you install a lightweight replication agent on each server. The agent connects to the MGN service and then begins replicating the server's volumes at the block level.
The first six server migrations were uneventful. Then I added a new server (we'll call it "Server7") and noticed that one of the first six servers had disappeared (we'll call this one "Server6"). "That's odd," I thought. I tried reinstalling the replication agent on Server6. Voilà! It showed up in the list of source servers again, but then... Server7 disappeared. The same thing happened with two more servers I tried replicating. I was playing a tedious game of "Whac-A-Mole."
It became clear that AWS MGN had somehow decided these two servers were the same server, though the reason wasn't obvious at first. Each server had a different hostname, different private IP address(es), different everything on the surface.
I also noticed that whichever server was active as a source server in the MGN source server list would never finish replicating and reach a "Healthy" status. Instead, it would stall out and continually perform time-consuming rescans. This went on for a few days.
I opened a support ticket with AWS but was not making much progress on the root issue, so I decided to dig into the replication agent's install directory, `/var/lib/aws-replication-agent`. I SSHed into Server6 and found an `agent.config` file containing JSON that represents the current configuration. I noted the `sourceServerId` value as well as the `installationIdentifierValue` value. These map to the "AWS ID" and "VMware virtual machine identifier" fields in the MGN Console UI, respectively.
// agent.config - other fields removed for brevity
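To compare these identifiers across several servers without opening each file by hand, something like the following could work. It is only a minimal sketch: it assumes `agent.config` is a single JSON object with `sourceServerId` and `installationIdentifierValue` as top-level keys (the real file may nest them differently), and the function name and sample values are made up for illustration.

```python
import json

# Field names as they appear in agent.config. Treating them as top-level
# keys of the JSON object is an assumption.
ID_FIELDS = ("sourceServerId", "installationIdentifierValue")


def read_agent_ids(config_text):
    """Return the two identifier fields from agent.config JSON text.

    A field that is absent comes back as None instead of raising.
    """
    config = json.loads(config_text)
    return {field: config.get(field) for field in ID_FIELDS}


if __name__ == "__main__":
    # Placeholder content, not real values from any server.
    sample = (
        '{"sourceServerId": "s-EXAMPLE", '
        '"installationIdentifierValue": "EXAMPLE-UUID", '
        '"someOtherField": 1}'
    )
    print(read_agent_ids(sample))
```

On a real server you would pass in the contents of `/var/lib/aws-replication-agent/agent.config`. Two servers that return the same pair of values are the ones MGN will treat as one.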
I then SSHed into Server7 and compared these values in its `agent.config`. They matched Server6's. For source servers running on VMware, AWS uses the VMware UUID of the VM as the unique identifier. I confirmed the UUID of both servers by running `sudo dmidecode | less` and looking for the VMware section. Both servers had the same UUID:
Serial Number: VMware-56 4d 2a d7 f1 fa 9a 0c-f8 66 78 e2 46 fa 53 3c
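With 15 servers to migrate, it may be worth checking all of them for shared identifiers up front instead of finding collisions one at a time. The sketch below assumes you have already collected each server's "Serial Number" string from `dmidecode` by hand. The helper names and the Server5 value are hypothetical; only the Server6/Server7 serial comes from the output above.

```python
import re
from collections import defaultdict


def serial_to_hex(serial):
    """Reduce a VMware serial string to its 32 hex digits, lowercase."""
    body = serial.strip()
    # Strip the prefix first: "VMware" itself contains the hex letters a and e.
    if body.lower().startswith("vmware-"):
        body = body[len("vmware-"):]
    digits = re.sub(r"[^0-9a-fA-F]", "", body).lower()
    if len(digits) != 32:
        raise ValueError(f"expected 32 hex digits, got {len(digits)}: {serial!r}")
    return digits


def find_duplicates(serials_by_host):
    """Map each shared serial (as hex) to the sorted hosts that report it."""
    hosts_by_serial = defaultdict(list)
    for host, serial in serials_by_host.items():
        hosts_by_serial[serial_to_hex(serial)].append(host)
    return {s: sorted(h) for s, h in hosts_by_serial.items() if len(h) > 1}


if __name__ == "__main__":
    collected = {
        # Made-up serial for a server that does not collide.
        "Server5": "VMware-42 00 00 00 00 00 00 01-00 00 00 00 00 00 00 02",
        # The serial both affected servers reported.
        "Server6": "VMware-56 4d 2a d7 f1 fa 9a 0c-f8 66 78 e2 46 fa 53 3c",
        "Server7": "VMware-56 4d 2a d7 f1 fa 9a 0c-f8 66 78 e2 46 fa 53 3c",
    }
    print(find_duplicates(collected))
    # {'564d2ad7f1fa9a0cf86678e246fa533c': ['Server6', 'Server7']}
```

Any host that shows up in the result needs a new UUID before its replication agent is installed.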
Apparently, some of the server VMs were never assigned a new UUID when they were originally created. This had not caused any obvious problems yet, but it was a showstopper when using MGN.
The solution was to update the server UUIDs to a new, unique value. You may find the following articles helpful:
https://kb.vmware.com/s/article/1002403 (specific to Windows VMs)
I'll update this article with detailed steps once we work through the process of changing the UUIDs through our current VPC provider.
A quick explanation of the rescans I was seeing: the two source servers were both sending replication data through their replication agents to a single source server record in MGN. AWS saw bytes coming from two different agents for the same server volume. As you'd expect, this does not work well. 🤣
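The "Whac-A-Mole" behavior is easy to reproduce in a toy model. To be clear, this is not how MGN is implemented. It is a hypothetical stand-in that assumes only what I observed: records are keyed on the VM UUID, so whichever agent registers last owns the record. The class and method names are invented for the illustration.

```python
class ToySourceServerList:
    """Toy stand-in for a service that keys server records on the VM UUID."""

    def __init__(self):
        # uuid -> hostname of the agent that registered most recently
        self._records = {}

    def register(self, hostname, vm_uuid):
        # A second host with the same UUID overwrites the existing record
        # instead of creating a new one.
        self._records[vm_uuid] = hostname

    def visible_hosts(self):
        return sorted(self._records.values())


if __name__ == "__main__":
    shared_uuid = "564d2ad7f1fa9a0cf86678e246fa533c"
    servers = ToySourceServerList()

    servers.register("Server6", shared_uuid)
    print(servers.visible_hosts())  # ['Server6']

    servers.register("Server7", shared_uuid)
    print(servers.visible_hosts())  # ['Server7'] (Server6 has "disappeared")

    servers.register("Server6", shared_uuid)  # reinstalling the agent
    print(servers.visible_hosts())  # ['Server6'] (now Server7 is gone)
```

Give each host its own UUID and both stay in the list, which is why fixing the UUIDs is the real solution.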