B.

MySQL 5.6 GTIDs: Evaluation and Online Migration

At Booking.com, we use MySQL to operate our web site where tens of million of pages are viewed and over 625,000 room nights are reserved every day. This is a lot to ask for our database infrastructure. To be able to sustain such load, we split, shard and replicate our data extensively. Unfortunately, a conventional master-slave setup has a problem at our scale and that problem is specifically around failing over master databases and repointing slaves. Luckily, there are some advancements in the MySQL space that can help us achieve the flexibility we need. And like any new technology there is always a drawback.

MySQL 5.6 and MariaDB 10.0 implement a feature called Global Transaction IDs (GTIDs) to provide a facility for easily repointing a slave to a new master. In previous versions, doing this was fairly error prone and not always possible without lots of manual work. This feature could save us quite a bit of headache except MySQL 5.6 GTIDs lack a reasonable deployment strategy (bug#69059). MariaDB 10.0 does not have this limitation, but since a large part of our infrastructure is MySQL we need to be able to deploy these changes with a minimal amount trouble. Oracle is working on allowing setting GTID_MODE=ON online (worklog#7083) but we could use this feature a lot sooner than in MySQL 5.7.

Before going further, let's briefly describe replication with GTIDs in MySQL 5.6. When GTIDs are disabled (the default configuration) all transactions are ANONYMOUS meaning there isn't a global identifier, and instead, each transaction is identified by file and position only. When GTIDs are enabled, each transaction is assigned a globally unique identifier stored in the binary logs of the master. When the slave downloads the binary logs and executes the transactions, it remembers their GTIDs. If then we need to repoint the slave, the GTIDs stored on the slave are sent to the new master which uses them to determine what transactions need to be sent downstream.

When a new replication tree is built using MySQL 5.6, it can be initialized with GTIDs enabled. However, with an existing replication tree (with GTIDs disabled), enabling GTIDs requires a step where all databases are stopped: a GTID enabled database was not designed to replicate from/to a GTID disabled database. To enforce that, a check is done at slave start up validating that the master GTID mode is compatible with the slave mode. This is a serious hurdle.

We have lots of very large replication trees. We cannot stop all databases in a tree simultaneously without a severe negative impact to our customers. Plus, before we can enable GTIDs everywhere, we need to evaluate the feature and train ourselves on its usage. And it requires adjusting our automation and monitoring systems to be GTID aware.

That isn't to say we aren't interested in the feature, but there had to be a better way to try it out and also deploy it without killing our business. We believe we have found that solution.

While testing GTIDs in our lab environment, we were able to replicate ANONYMOUS transactions to a GTID enabled slave. This can lead to unexpected results because each slave would generate different GTIDs for the same transaction. We reported this as a bug (#71527). However, after opening this bug, we realized that this behavior can be used to our advantage to evaluate and migrate to GTIDs. Basically, we want to introduce an intermediate master to generate GTIDs in a replication tree. We submitted a feature request specifically for this mode of operation (new GTID mode: ANONYMOUS_IN-GTID_OUT (#71543)). After some deeper investigation, we realized that this mode can be implemented with very small patch to the MySQL source. All we needed to do was disable the GTID mode compatibility check on the intermediate master.

From the root of the MySQL source tree, in the file sql/rpl_slave.cc, there is a function called get_master_version_and_clock(). In this function, the following code is responsible for checking the slave and master having compatible GTID modes:

if (mi->master_gtid_mode > gtid_mode + 1 ||
    gtid_mode > mi->master_gtid_mode + 1)
{
  mi->report(ERROR_LEVEL, ER_SLAVE_FATAL_ERROR,
             "The slave IO thread stops because the master has "
             "@@GLOBAL.GTID_MODE %s and this server has "
             "@@GLOBAL.GTID_MODE %s",
             gtid_mode_names[mi->master_gtid_mode],
             gtid_mode_names[gtid_mode]);
  DBUG_RETURN(1);
}

Commenting this block of code results in a mysqld with the new ANONYMOUS_IN-GTID_OUT mode when setting its GTID mode to ON.

Let's see how this new mode can be used. We will work on the replication tree below where A is the master and B to F are slaves. For simplicity, we suppose that binary logs and enforce_gtid_consistency are already enabled on all databases.

-----
| A |
-----
  |
  +------+------+------+------+
  |      |      |      |      |
  V      V      V      V      V
-----  -----  -----  -----  -----
| B |  | C |  | D |  | E |  | F |
-----  -----  -----  -----  -----

The first step to evaluate/migrate to GTIDs is to restart B in the new GTID mode ANONYMOUS_IN-GTID_OUT. For us, that means deploying our patched mysqld and enable GTIDs on B. Then, C can be migrated under B using the following commands:

#!/bin/bash

master=b.example.com
slave=c.example.com

# Stop the slave replication and note its position relative to its master.
s_sss=$(ssh $slave mysql <<< "STOP SLAVE; SHOW SLAVE STATUS\G")
s_RMLF=$(awk '$1 == "Relay_Master_Log_File:"{print $2}' <<< "$s_sss")
s_EMLP=$(awk '$1 == "Exec_Master_Log_Pos:"{print $2}' <<< "$s_sss")

# Wait for the new master to be at the same level as the slave (or ahead),
#   stop its replication, note its position, and start back the replication.
m_sss=$(ssh $master mysql <<< "SELECT MASTER_POS_WAIT('$s_RMLF', $s_EMLP);
                               STOP SLAVE; SHOW SLAVE STATUS\G START SLAVE;")
m_RMLF=$(awk '$1 == "Relay_Master_Log_File:"{print $2}' <<< "$m_sss")
m_EMLP=$(awk '$1 == "Exec_Master_Log_Pos:"{print $2}' <<< "$m_sss")
m_EGS=$(awk '$1 == "Executed_Gtid_Set:"{print $2}' <<< "$m_sss")

# Restart the slave replication until the stopped position of the new
#   master and wait that this position is reached.
ssh $slave mysql <<< "START SLAVE UNTIL MASTER_LOG_FILE = '$m_RMLF',
                                        MASTER_LOG_POS  = $m_EMLP;
                      SELECT MASTER_POS_WAIT('$m_RMLF', $m_EMLP);"

# Reset the slave replication, restart the database with GTID_MODE=ON,
#   initialize GTID_PURGED, change its master and start back replication.
ssh $slave mysql <<< "STOP SLAVE; RESET SLAVE ALL;"
ssh -n $slave sed -i -e "'/^GTID_MODE/s/OFF/ON/'" /etc/my.cnf
ssh -n $slave /etc/init.d/mysql restart
ssh $slave mysql <<< "SET GLOBAL GTID_PURGED='$m_EGS';
                      CHANGE MASTER TO MASTER_HOST          = '$master',
                                       MASTER_USER          = 'repl',
                                       MASTER_PASSWORD      = 'pass',
                                       MASTER_AUTO_POSITION = 1;
                      START SLAVE;"

After repeating the commands above with D, we have the new replication tree below where GTIDs are partially enabled:

  • WRITES on A are propagated to B, E and F,
  • B generates GTIDs for those transactions,
  • the generated GTIDs are propagated to C and D.
    -----
    | A |
    -----
      |
      +--------------------+------+
      |                    |      |
      V                    V      V
    -----                -----  -----
    | B |                | E |  | F |
    -----                -----  -----
      |
      +------+
      |      |
      V      V
    -----  -----
    | C |  | D |
    -----  -----

After evaluating GTIDs on B, C and D (and learning about how to operate them in production, including handling of unexpected problems), enabling GTIDs everywhere is straightforward:

  • Move E and F under B (see the commands above),
  • Point the WRITES to B,
  • Move A under B.

The resulting replication tree is the following:

-----
| B |
-----
  |
  +------+------+------+------+
  |      |      |      |      |
  V      V      V      V      V
-----  -----  -----  -----  -----
| A |  | C |  | D |  | E |  | F |
-----  -----  -----  -----  -----

B can stay in the mode ANONYMOUS_IN-GTID_OUT as it is not a slave anymore. However, if we want to have A as the master and/or all nodes in pure GTID mode, we can:

  • Move C through F under A,
  • Point the WRITES to A,
  • Restore original binaries on B,
  • Move B under A.

And that is how we've been evaluating MySQL 5.6 GTIDs. In some of our replication trees we've deployed a patched mysqld that has ANONYMOUS_IN-GTID_OUT mode enabled. And under those intermediate masters, we've deployed unpatched mysqld with GTIDs enabled.

However, we are not quite ready to completely switch our mission critical databases and replication trees over to GTIDs. We need to continue evaluating to be sure we have not missed something important. Nonetheless, we are very happy to have unlocked a path to online migration.

If you want to know more about the cool things we do with MySQL at Booking.com and if you are at Percona Live at the beginning of April, we are giving a talk on Data materialisation for high-performance Web front-ends. You can also come to meet us at booth #318. We are also looking for smart people to work with us on MySQL and help us solve deeper problems similar to implementing GTIDs, see our careers page for more details.

comments powered by Disqus