Case Study: A near zero downtime MySQL database migration from on-premise to AWS

Today at lunch I was asked how database migrations from on-premise to AWS work. Given that I recently performed a migration of a production ~1.5 terabyte database with sub-60-second downtime, I figure I have something to say from practical experience.

First, a little background about the client’s project to set the stage.

  • This client intends to transition about 80% of their infrastructure from on-premise hardware to AWS.
  • The migration as a whole is risk-averse and must have a fall-back plan at every step to minimize impact to customers should a target environment not behave exactly as expected.
  • Each service will be migrated one at a time. This translates into cross-datacenter communication during the transition, making everything very latency-sensitive, as there are many small requests running across the wire.

Given the reality of a one-at-a-time service cut-over plan, the client needed latency between the existing datacenter in Las Vegas and the target AWS region to be as low as possible. Normally, I’d suggest that the region be chosen based on distance to customers. But in this case, the more important factor was cross-provider latency minimization.

The next step was to size things up. For this we calculated the current MySQL master server’s capacity in terms of total megahertz, roughly as follows: core count * speed per core in MHz = total megahertz for the current environment, plus 25% for safety margin. Disk IO was calculated by observing IOPS and total throughput. We needed to support high IOPS, and Provisioned IOPS EBS volumes were absurdly expensive in comparison to direct-attached SSDs.
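As a concrete sketch of that arithmetic (the core count and clock speed below are hypothetical stand-ins, not the client’s real numbers):

```shell
# Hypothetical current master: 32 cores at 2,600 MHz each.
cores=32
mhz_per_core=2600

total_mhz=$((cores * mhz_per_core))        # capacity of the current box
target_mhz=$((total_mhz + total_mhz / 4))  # add the 25% safety margin

echo "current: ${total_mhz} MHz, sizing target: ${target_mhz} MHz"
```

With those stand-in numbers, you’d shop for an instance type offering at least the target figure in aggregate CPU.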

Next, because of the specific query pattern of the client’s application, we knew in advance that disk IO was a huge factor, and based on some preliminary testing with AWS EBS volumes, the latency was too high for the application to perform well. This ruled out EBS volumes specifically, and by consequence AWS RDS, for this project. There are also triggers in the database that are not compatible with RDS as of this writing.

The new target master was sized up to be an i3.8xlarge (we later sized this down to an i3.4xlarge, then subsequently to an i2.2xlarge after slicing customer segments up). This provided gobs of disk IO, a roughly parallel core count, and a comparable memory allocation.

Now, for some known risks: an i3-type instance has directly attached SSDs. If the instance has an underlying hardware failure, or someone clicks “Stop” on the instance, then poof, bye bye data. As such, we added two additional replicas: a comparably sized slave1 in the other availability zone, and one additional slave for MySQL backups, EBS volume snapshotting, and S3 synchronization of the binlog every 5 minutes. Thus our failover plan in the event of an AZ outage is to point DNS for the master at slave1 and stop replication there, making slave1 the new master, then spin up a new instance from the EBS snapshot and catch its binary logs up. A very manual process, but we’ve tested our recovery process a handful of times, so it’s functional enough.
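A rough sketch of the promotion step in that runbook (the hostname is a hypothetical placeholder); the DNS flip and the snapshot-restore steps remain manual:

```shell
# Promote slave1 only after SHOW SLAVE STATUS reports it fully caught up.
mysql -h slave1.aws.example.com -e "
  STOP SLAVE;
  RESET SLAVE ALL;           -- discard the old replication config
  SET GLOBAL read_only = 0;  -- accept application writes as the new master
"
# Then: point the master DNS record at slave1, launch a replacement
# replica from the latest EBS snapshot, and replay the S3-synced binlogs.
```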

After some painfully long rsync processes we had our new target AWS-based ‘master’ up and running. Binary logs were configured to be retained for over two weeks on both the new master and the original master in the Las Vegas DC. Catching up the slave took well over 24 hours; it was not a quick process and used gobs of bandwidth.

On the new AWS master we attached EBS volumes to the i3.8xlarge big-boy, stopped MySQL, and used rsync to copy files onto persistent storage. I had originally planned on using dd; however, I ran into some weird issues when going from ephemeral storage over to the EBS volume. Due to the size of the volume I chose to favor rsync for its ability to resume should errors occur. That, and it’s pretty straightforward to md5sum the files that were copied. Cross-AZ replication to the next slave was slow, so we used an EBS snapshot after performing the initial synchronization.

When it came time to restore an EBS volume from a snapshot, that process was really slow. There was a handy trick that AWS support offered for pre-warming EBS volumes more efficiently:

seq 0 $(($(cat /sys/block/<device name>/size) / (1 << 10))) | xargs -n1 -P32 -I {} sudo dd if=/dev/<device name> of=/dev/null skip={}k count=1 bs=512

Next came time to configure the /etc/mysql/my.cnf with a few differences:

read_only = 1
server-id = 3  # This has to be unique across all servers that are slave to the original master

Regarding server-id: it has to be unique across instances. Here’s how our server-ids looked:

  • original master: 1
  • AWS new master: 2
  • AWS slave 1: 3
  • AWS slave 2: 4

AWS slave 2 is our backup system. A mysqldump is taken once per day; it takes a long time, so we run it at the end of the day. EBS snapshots are taken twice daily and take about 15 minutes each.
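That backup schedule might be expressed as a crontab along these lines; the script names and exact times are hypothetical placeholders (the real schedule: nightly mysqldump, twice-daily EBS snapshots, binlog sync to S3 every 5 minutes):

```
0 23 * * *    /usr/local/bin/mysql-nightly-dump.sh    # nightly mysqldump
0 5,17 * * *  /usr/local/bin/take-ebs-snapshot.sh     # twice-daily snapshots
*/5 * * * *   /usr/local/bin/sync-binlogs-to-s3.sh    # binlog sync to S3
```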

Replication looked like this:

server-id=2 (the new AWS master) was a slave to server-id=1 (the original Vegas system); server-id=3 and server-id=4 (the two slaves in AWS) were slaves to server-id=2.
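Pointing a slave at its master looks roughly like this; the hostname, user, and binlog coordinates below are hypothetical placeholders, not the real values:

```shell
# Run on server-id=2 (the new AWS master). Host, user, and binlog
# coordinates are placeholders for illustration only.
mysql -e "CHANGE MASTER TO
            MASTER_HOST='db-master.vegas.example.com',
            MASTER_USER='repl',
            MASTER_PASSWORD='...',
            MASTER_LOG_FILE='mysql-bin.000123',
            MASTER_LOG_POS=4;
          START SLAVE;"
# server-id=3 and server-id=4 run the same statement, pointed at the
# new AWS master instead.
```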

In this environment we also had ProxySQL in between the application servers and the databases for connection pooling, caching, and some read/write splitting. Not everyone does, so I’m going to skip the steps there. Needless to say, we had a separate deployment mechanism to ‘go live’ with traffic shifting between DCs, and this was part of the fail-back plan as well.

In preparation for our go-live we reduced the TTL on the DNS records for the database masters and slaves.

At go-live we blocked port 3306 traffic from the application servers to the original master and then changed the DNS records. This ensured that no additional writes would be applied to the old master; writes would instead simply fail during DNS record propagation.
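The block-then-flip step might look roughly like this on the original master (the app-tier subnet below is a hypothetical placeholder):

```shell
# On the original Vegas master: reject new MySQL connections from the
# application tier (10.1.0.0/16 is a hypothetical app-server subnet).
iptables -I INPUT -p tcp --dport 3306 -s 10.1.0.0/16 -j REJECT --reject-with tcp-reset
# With writes now failing fast, update the low-TTL DNS records to point
# the database hostname at the new AWS master.
```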

Once traffic was cut over to the new AWS environment we watched performance on the new system and, as expected, we had grossly oversized the platform. Subsequent resizings followed a similar process; however, for these we used the EBS snapshots as the baseline, caught the new master up (running as a slave of the existing master), and then updated DNS during non-peak traffic times.

Jack Peterson