
Abstracting Binlog Servers and MySQL Master Promotion without Reconfiguring all Slaves

In a MySQL replication deployment, the master is a single point of failure. To recover after the failure of this critical component, a common solution is to promote a slave to be the new master. However, when doing so using classic methods, the slaves need to be reconfigured. This is a tedious operation in which many things can go wrong. We found a simpler way to achieve master promotion using Binlog Servers. Read on for more details.

When a master fails in a MySQL replication deployment, the classic way to promote a slave to be the new master is the following:

  1. Find the most up-to-date slave.
  2. If the most up-to-date slave is not a good candidate master, level a suitable candidate with the most up-to-date slave [1].
  3. Repoint the remaining slaves to the new master.

The procedure above needs to contact all slaves in step #1, and to reconfigure all slaves in step #3. This becomes increasingly complex in Booking.com environments where we have very wide, and still growing, replication topologies; it is not uncommon to have more than fifty (and sometimes more than a hundred) slaves replicating from the same master. Many things can go wrong when tens of slaves need to be contacted and reconfigured:

  • some slaves might be down for maintenance or for taking a backup,
  • some slaves could be temporarily unreachable for other reasons,
  • and a few slaves could be processing a big backlog of relay logs (including delayed slaves), which will make them hard/unsuitable to reconfigure.

A way to reduce the complexity of master promotion is presented below, but to get there, we must first give some context about Binlog Servers and abstract them into a service.

Reminders about Binlog Servers

In a previous post, I described how to take advantage of Binlog Servers to perform master promotion without GTIDs and without log-slave-updates, although that procedure still required reconfiguring all slaves. To do this, the slaves must replicate through a Binlog Server. This gives us the following deployment with a single Binlog Server:

+---+
| A |
+---+
  |
 / \
/ X \
-----
  |
  +----------+----------+----------+----------+----------+
  |          |          |          |          |          |
+---+      +---+      +---+      +---+      +---+      +---+
| B |      | C |      | D |      | E |      | F |      | G |
+---+      +---+      +---+      +---+      +---+      +---+

or with redundant Binlog Servers:

+---+
| A |
+---+
  |
  +--------------------------------+
  |                                |
 / \                              / \
/ X \                            / Y \
-----                            -----
  |                                |
  +----------+----------+          +----------+----------+
  |          |          |          |          |          |
+---+      +---+      +---+      +---+      +---+      +---+
| B |      | C |      | D |      | E |      | F |      | G |
+---+      +---+      +---+      +---+      +---+      +---+

or across more than one site, with redundant Binlog Servers on each:

  +---+
  | A |
  +---+
    |
    +-----------+------------------------+
    |           |                        |
   / \         / \                      / \         / \
  / X \       / Y \                    / Z \------>/ W \
  -----       -----                    -----       -----
    |           |                        |           |
  +-+-----------+-+                    +-+-----------+-+
  |               |                    |               |
+---+           +---+                +---+           +---+
| S1|    ...    | Sn|                | T1|    ...    | Tm|
+---+           +---+                +---+           +---+

These schemas are becoming increasingly complex - let's simplify them by abstracting the Binlog Servers.

Binlog Server Abstraction

By hiding the Binlog Servers in an abstracted layer, which I call the Distributed Binlog Serving Server (DBSS), a deployment on three sites becomes the following:

   +---+
   | M |
   +---+
     |
+----+----------------------------------------------------------+
|                                                               |
+----+---------+-----------+---------+-----------+---------+----+
     |         |           |         |           |         |
   +---+     +---+       +---+     +---+       +---+     +---+
   | S1| ... | Sn|       | T1| ... | Tm|       | U1| ... | Uo|
   +---+     +---+       +---+     +---+       +---+     +---+

Of course, the DBSS is built with many Binlog Servers. One way to build the layer above, which minimizes the number of slaves served by the master, is shown below. Other ways to build this layer can be imagined [2], but let's stick to this one for now.

+----|----------------------------------------------------------+
|    +---------------------+---------------------+              |
|    |                     |                     |              |
|   / \                   / \                   / \             |
|  / X1\----->/ \        / X2\----->/ \        / X3\----->/ \   |
|  -----     / Y1\       -----     / Y2\       -----     / Y3\  |
|    |       -----         |       -----         |       -----  |
+----|---------|-----------|---------|-----------|---------|----+

In the deployment above, using one DNS A record per site resolving to both Xi and Yi, if a Binlog Server fails, its slaves will reconnect to the other one. If the Yi Binlog Server fails, nothing more needs to be done. If the Xi Binlog Server fails, the corresponding Yi must be repointed to the master. This repointing is easy because, by design, the binary logs on a Binlog Server are identical to those of its master. Only the destination server must be changed; the binary log filename and position stay the same.
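To illustrate why this repointing is easy, here is a minimal Python sketch that builds the statement for repointing a Yi. It assumes that the Binlog Server accepts MySQL's CHANGE MASTER TO syntax, which depends on the version you run. The function name, the host name and the coordinates are made up for the example.

```python
def repoint_statement(new_host: str, new_port: int, log_file: str, log_pos: int) -> str:
    """Build the statement that repoints a Binlog Server to another source.

    The binary log filename and position are passed through unchanged:
    only the destination (host and port) differs from the previous setup.
    """
    for value in (new_host, log_file):
        # Hypothetical safety check: refuse values that would break the quoting.
        if "'" in value or "\\" in value:
            raise ValueError(f"unsafe value: {value!r}")
    return (
        f"CHANGE MASTER TO MASTER_HOST='{new_host}', MASTER_PORT={int(new_port)}, "
        f"MASTER_LOG_FILE='{log_file}', MASTER_LOG_POS={int(log_pos)}"
    )


# Y1 was replicating from X1 and stopped at binlog.000163, position 4242.
# After X1 fails, Y1 resumes from the master at the very same coordinates.
print(repoint_statement("master.example", 3306, "binlog.000163", 4242))
```

The point of the sketch is that no coordinate translation is needed: the coordinates at which Yi stopped are valid as they are on the master.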

When the Master Fails...

Equipped with the above implementation of the DBSS, when the master fails, we end up with the state below; each site might be at a different position in the binary log stream of the failed master.

+---------------------------------------------------------------+
|                                                               |
|   / \                   / \                   / \             |
|  / X1\----->/ \        / X2\----->/ \        / X3\----->/ \   |
|  -----     / Y1\       -----     / Y2\       -----     / Y3\  |
|    |       -----         |       -----         |       -----  |
+----|---------|-----------|---------|-----------|---------|----+

The first step of master promotion is to level the Binlog Servers in the DBSS. To do so, the most up-to-date Binlog Server must be found and all other Binlog Servers must be chained to it. In the deployment above, only three servers must be contacted, which is much easier than tens of slaves. If the most up-to-date Binlog Server is X2, levelling the Binlog Servers results in the temporary replication architecture below.

+---------------------------------------------------------------+
|                                                               |
|   / \ <-----------------/ \-----------------> / \             |
|  / X1\----->/ \        / X2\----->/ \        / X3\----->/ \   |
|  -----     / Y1\       -----     / Y2\       -----     / Y3\  |
|    |       -----         |       -----         |       -----  |
+----|---------|-----------|---------|-----------|---------|----+
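The levelling decision can be sketched in a few lines of Python. This is only an illustration under assumptions: the positions dictionary is hypothetical input that you would collect from each Xi yourself, and the function names are mine. Coordinates are compared by the numeric suffix of the binary log filename first, then by the offset in that file.

```python
import re
from typing import Dict, Tuple

Position = Tuple[str, int]  # (binary log filename, byte offset)


def sort_key(position: Position) -> Tuple[int, int]:
    """Order positions by numeric file suffix first, then by offset."""
    log_file, offset = position
    match = re.fullmatch(r".+\.(\d+)", log_file)
    if match is None:
        raise ValueError(f"unexpected binary log name: {log_file!r}")
    return int(match.group(1)), offset


def levelling_plan(positions: Dict[str, Position]) -> Tuple[str, Dict[str, Position]]:
    """Return the most up-to-date server and the servers to chain to it.

    Each follower is mapped to its own current position: since all Binlog
    Servers hold the same byte stream, it resumes from where it stopped.
    """
    source = max(positions, key=lambda name: sort_key(positions[name]))
    best = sort_key(positions[source])
    followers = {
        name: position
        for name, position in positions.items()
        if name != source and sort_key(position) < best
    }
    return source, followers


source, followers = levelling_plan({
    "X1": ("binlog.000163", 1000),
    "X2": ("binlog.000163", 5000),
    "X3": ("binlog.000162", 9000),
})
print(source, followers)
```

With the sample input above, X2 is selected as the source and both X1 and X3 are chained to it, each from its own coordinates.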

Levelling should happen very quickly (if it does not, one of the Binlog Servers is lagging, which should not happen). After that, the slaves will quickly follow. Once a slave is up to date, master promotion can be performed. (Strictly speaking, this does not require levelling: a slave of X2 or Y2 could have been promoted before levelling.) In the diagram below, a slave from the third site on the right has been chosen to be the new master, but any slave on any of the three sites could have been used.

+------------------------------------------------|--------------+
|    +---------------------+---------------------+              |
|    |                     |                     |              |
|   / \                   / \                   / \             |
|  / X1\----->/ \        / X2\----->/ \        / X3\----->/ \   |
|  -----     / Y1\       -----     / Y2\       -----     / Y3\  |
|    |       -----         |       -----         |       -----  |
+----|---------|-----------|---------|-----------|---------|----+

Note that the other slaves have not been touched: they are still connected to their Binlog Server. This means that this solution works well even if one of the slaves is unavailable during master promotion. This solution also works very well with delayed or lagging slaves, as those slaves are simply not good candidates for becoming the new master. For some time, the lagging slaves will process the binary logs of the old master that are still stored on the Binlog Servers.

The Trick for not Reconfiguring every Slave

Promoting a slave to be the new master in a DBSS deployment requires working some magic on a slave to make its binary log position (SHOW MASTER STATUS) match what is expected by the Binlog Servers. Let's take an example: if the last binary log stored on the levelled Binlog Servers is binlog.000163, we could repoint the Binlog Servers to a new master if the SHOW MASTER STATUS of this new master is at the beginning of binary log filename binlog.000164.

When doing that promotion, from the point of view of the Binlog Servers, their master is simply restarted with a different server_id and server_uuid. From the point of view of the slaves, they are processing the binary logs of the old master (up to and including binlog.000163) followed by the binary logs of the new master (starting at binlog.000164).
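The filename arithmetic in this example is simple enough to write down. The helper below is a small sketch (the function name is mine, not part of any tool) that computes the filename the new master must be at, given the last binary log stored on the levelled Binlog Servers.

```python
def next_binlog_name(log_file: str) -> str:
    """Return the binary log filename that follows log_file.

    The numeric suffix is incremented and padded to at least its
    original width, e.g. binlog.000163 -> binlog.000164.
    """
    base, _, suffix = log_file.rpartition(".")
    if not base or not suffix.isdigit():
        raise ValueError(f"unexpected binary log name: {log_file!r}")
    return f"{base}.{int(suffix) + 1:0{len(suffix)}d}"


print(next_binlog_name("binlog.000163"))
```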

So, the trick is to have our candidate master at the right binary log position. This can be made possible by:

  1. configuring all nodes with binary logging enabled,
  2. with the same log-bin value on all of them (binlog in the example above),
  3. and without enabling log-slave-updates.

Configuration #3 above allows us to assume that the master will consume binary log filenames much faster than the slaves: with log-slave-updates disabled, a slave does not write the replicated changes to its own binary logs, so those logs grow very slowly. This way, the slaves will always be behind the master in their binary log filenames [3]. As such, bringing a slave to the right binary log filename is as simple as running FLUSH BINARY LOGS in a loop until the slave is at the correct position. To prevent this loop from taking too much time, we can run a cron job on our slaves that makes sure they are not too far behind their master (at most ten binary logs behind, for example).
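The core of such a cron job could look like the sketch below. It is an assumption about how one might write it, not the script we run: the function name and the default of ten are illustrative, and fetching the two filenames (from SHOW MASTER STATUS on the master and on the slave) is left out.

```python
def catch_up_flushes(master_file: str, slave_file: str, max_gap: int = 10) -> int:
    """Return how many FLUSH BINARY LOGS the slave should run.

    After that many flushes, the slave is at most max_gap binary log
    files behind its master. It never moves the slave ahead of the master.
    """
    def index(log_file: str) -> int:
        suffix = log_file.rpartition(".")[2]
        if not suffix.isdigit():
            raise ValueError(f"unexpected binary log name: {log_file!r}")
        return int(suffix)

    gap = index(master_file) - index(slave_file)
    if gap < 0:
        # The promotion trick relies on slaves being behind their master.
        raise ValueError("slave is ahead of its master in binary log filenames")
    return max(0, gap - max_gap)


# The master is at binlog.000163 and the slave at binlog.000140 (23 files behind):
print(catch_up_flushes("binlog.000163", "binlog.000140"))  # → 13
```

Keeping a gap instead of closing it completely leaves room for the master to fail at any time without a slave having overtaken it.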

Summary of Master Promotion

In the following replication deployment, with log-bin=binlog and with log-slave-updates disabled:

   +---+
   | M |
   +---+
     |
+----+----------------------------------------------------------+
|                                                               |
+----+---------+-----------+---------+-----------+---------+----+
     |         |           |         |           |         |
   +---+     +---+       +---+     +---+       +---+     +---+
   | S1| ... | Sn|       | T1| ... | Tm|       | U1| ... | Uo|
   +---+     +---+       +---+     +---+       +---+     +---+

If M fails, we first level the Binlog Servers in the DBSS.

Once this is done, and taking T1 as our candidate master, we need to perform the following on it:

  1. FLUSH BINARY LOGS until the binary log filename follows the last one from the levelled DBSS,
  2. PURGE BINARY LOGS TO <latest binary log file>,
  3. RESET SLAVE ALL.

Step #2 above drops all binary logs on the new master that could conflict with those of the previous master. The binary logs of the old master are stored on the DBSS, and we must avoid having similarly named but different binary logs on the new master.
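The three steps on the candidate master can be sketched as the following Python function. The two callables are hypothetical: execute would run a statement on T1 and current_binlog_file would return the filename reported by SHOW MASTER STATUS on T1. How they connect to the server is up to you. The max_flushes guard is my addition, to avoid looping forever if the candidate is already past the target filename.

```python
from typing import Callable


def promote_candidate(
    execute: Callable[[str], None],
    current_binlog_file: Callable[[], str],
    target_file: str,
    max_flushes: int = 100,
) -> None:
    """Bring the candidate master to target_file and detach it from replication.

    target_file is the filename that follows the last binary log stored on
    the levelled DBSS (binlog.000164 in the earlier example).
    """
    flushes = 0
    # Step 1: rotate binary logs until the candidate reaches the target filename.
    while current_binlog_file() != target_file:
        if flushes >= max_flushes:
            raise RuntimeError(f"{target_file} not reached after {flushes} flushes")
        execute("FLUSH BINARY LOGS")
        flushes += 1
    # Step 2: drop older local binary logs that could conflict with the old master's.
    execute(f"PURGE BINARY LOGS TO '{target_file}'")
    # Step 3: forget the old replication configuration.
    execute("RESET SLAVE ALL")
```

If the candidate was kept within ten binary logs of the master, the loop in step 1 runs only a handful of times.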

We now have this:

   +\-/+
   | X |
   +/-\+

+---------------------------------------------------------------+
|                                                               |
+----+---------+---------------------+-----------+---------+----+
     |         |                     |           |         |
   +---+     +---+       +---+     +---+       +---+     +---+
   | S1| ... | Sn|       | T1| ... | Tm|       | U1| ... | Uo|
   +---+     +---+       +---+     +---+       +---+     +---+

where we repoint the DBSS to T1 to get the following:

   +\-/+                 +---+
   | X |                 | T1|
   +/-\+                 +---+
                           |
+--------------------------+------------------------------------+
|                                                               |
+----+---------+-----------+---------+-----------+---------+----+
     |         |           |         |           |         |
   +---+     +---+       +---+     +---+       +---+     +---+
   | S1| ... | Sn|       | T2| ... | Tm|       | U1| ... | Uo|
   +---+     +---+       +---+     +---+       +---+     +---+

and we have achieved master promotion without reconfiguring all slaves.

A Cleaner Way

The trick above works well, but performing FLUSH BINARY LOGS in a loop is not the cleanest of solutions. It would be much better if there were a way to set the binary log to the desired filename in a single operation. With this idea in mind, we created the following two feature requests:

MariaDB 10.1.6 already implements a RESET MASTER TO syntax. Let's hope that Oracle will provide something similar in MySQL 5.7.

What about the Software?

This idea and procedure are all well and good, but they are not very useful if you cannot use them yourself. The currently available version of the Binlog Server, the MaxScale Binlog Router plugin, does not yet implement all the configuration hooks needed to make this procedure easy. Booking.com is currently working with MariaDB to implement the missing hooks in a new version of MaxScale. We are in the last testing phase of a Binlog Router plugin that supports the following:

  • STOP SLAVE, START SLAVE, SHOW MASTER STATUS, SHOW SLAVE STATUS, CHANGE MASTER TO: these new commands allow easier configuration of the Binlog Server.
  • The CHANGE MASTER TO command not only makes it easy to chain Binlog Servers, but also allows bootstrapping a Binlog Server without editing the configuration file. Moreover, this command allows repointing MaxScale to a new master at binary log filename N+1, which is what makes master promotion possible.
  • Transaction safety: when the master fails, the Binlog Server could have downloaded a partial transaction. If we replace the master with a slave, this transaction should not be sent to slaves. So this feature of the next version of MaxScale will make sure such partial transactions are not sent downstream.
  • DBSS identity: the initial design of the Binlog Server was intended to impersonate the master, and did not consider swapping the master at the top of the hierarchy. In a DBSS deployment, swapping the master should not be made visible to slaves, so the Binlog Servers should present the slaves with a server_id and server_uuid that differ from those of the master. The next version of the MaxScale Binlog Router supports that virtual master feature.
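To make the transaction safety point more concrete, here is a simplified Python model of the idea. It is not how MaxScale implements the feature: the event kinds ("BEGIN", "ROWS", "COMMIT", "DDL") are a made-up reduction of the real binary log event types, and the function name is mine. The sketch computes the position up to which events can safely be sent downstream, leaving out a transaction that was only partially downloaded.

```python
from typing import List, Tuple

Event = Tuple[str, int]  # (simplified event kind, end position in the file)


def last_safe_position(events: List[Event], start_pos: int = 4) -> int:
    """Return the end position of the last event not inside an open transaction.

    start_pos defaults to 4, the size of the magic number that starts
    every binary log file.
    """
    safe = start_pos
    in_transaction = False
    for kind, end_pos in events:
        if kind == "BEGIN":
            in_transaction = True
        elif kind == "COMMIT":
            in_transaction = False
            safe = end_pos
        elif not in_transaction:
            # Events outside a transaction (a DDL here) are complete on their own.
            safe = end_pos
    return safe


downloaded = [
    ("DDL", 300),
    ("BEGIN", 380), ("ROWS", 520), ("COMMIT", 551),
    ("BEGIN", 630), ("ROWS", 790),  # the master failed before the COMMIT arrived
]
print(last_safe_position(downloaded))  # → 551
```

In this model, everything after position 551 belongs to a transaction that never committed from the point of view of the Binlog Server, so it must not reach the slaves once the master has been replaced.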

This next version of the MaxScale Binlog Router will be generally available once we are done with the testing. Stay tuned on the MariaDB web site for the announcement and the failover procedure. In the meantime, you can still experiment with master promotion without reconfiguring all slaves by using the current version of MaxScale and following this proof of concept procedure.

If you are interested in this topic and would like to learn more, I am giving a talk about Binlog Servers at Percona Live Amsterdam. Feel free to grab me after the talk, catch me at the Booking.com booth (#205) or share a drink with me at the Community Dinner, to exchange thoughts on this subject. (You can also post a comment below.)

I will also be giving a talk about Binlog Servers at Oracle Open World in San Francisco at the end of October.

One last thing: if you want to know more about other cool things we do at Booking.com, I suggest you come to our other talks at Percona Live Amsterdam in September:

[1] Slave levelling can be done with MHA, with MySQL 5.6 or MariaDB 10.0 GTIDs, or with Pseudo-GTIDs when using earlier versions of MySQL and MariaDB.

[2] If we were not concerned about WAN bandwidth, all Binlog Servers could be directly connected to the master. Another solution could be to connect all master-local Binlog Servers directly to the master and to use the chained strategy for remote Binlog Servers. (This hybrid deployment could be well-suited to a semi-sync deployment, but I am diverging from the subject of this post.)

[3] The same can be achieved when using log-slave-updates, by using smaller max_binlog_size on the master than on all the slaves.
