2019-05-29 10:33 UTC

Fun MySQL fact of the day: replication delay

Now that we have a basic understanding of how MySQL's replication mechanism works, let's poke it with a stick and see what happens. That is, of course, fun.

Let's first start running some commands, like STOP SLAVE SQL_THREAD. Based on what we learned yesterday, we know a couple things:

The IO thread will continue to run and will continue to download binary logs from the replica's master. Disk usage will start growing as now relay logs aren't being applied/deleted.
With the SQL thread stopped, no further changes to the data will be made. The replica will be stuck in time.

Now, let's START SLAVE SQL_THREAD and then STOP SLAVE IO_THREAD. Based on what we learned on Monday, we know a couple things:

No binary logs will be downloaded from the replica's master.
While the SQL thread is still running, without any new relay logs, no further changes to the data will be made. The replica will be stuck in time.

You'll probably agree: we likely don't want to be stopping either replication thread too often on a replica that is currently serving requests. Intentionally stopping either thread will eventually have impact on our services. Unfortunately, our environment often has other priorities. Let's consider an example wherein a replica's connection to its master is degraded. Let's say it's subject to packet loss. In this case, we already know that the SQL thread, which reads off the local disk and writes to the local disk, will be unaffected. However, this is not true for the IO thread: it will be unable to download a master's binary logs quickly or efficiently, creating replication delay on the replica causing it to be stuck in time. Similarly, if we consider a situation wherein the replica database's IO subsystem or CPU become too heavily saturated, the SQL thread will begin slowing down. When the SQL thread takes longer to execute each change, we will, again, start to observe some amount of replication delay on the replica causing it to be stuck in time.

Thankfully, we do have a couple ways to measure (and hopefully monitor!) this type of replication lag, and we'll look at those as we finish up the week.