B.

Evaluating MySQL Parallel Replication Part 4: More Benchmarks in Production

Parallel replication is a highly-expected feature of MySQL available in MariaDB 10.0 and in MySQL 5.7. We already presented benchmark results with MariaDB 10.0 in the previous post of the series; in this post, we will look at a new type of replication introduced in MariaDB 10.1: optimistic parallel replication.

This post has an annex: Under the Hood. Benchmarking is a complex art and reporting results accurately is even harder. If all the details were put in a single article, it would make a very long post. The links to the annex should satisfy readers eager for more details.

Optimistic parallel replication is built on top of the previous implementation of parallel replication. To fully understand optimistic parallel replication in MariaDB 10.1, we must dive back in the implementation details of parallel replication in MariaDB 10.0.

Key Design Decision: In-Order Commit

In Part 1 of the series, we explained how parallel replication works in MariaDB 10.0 and in early version of MySQL 5.7. In short, both MySQL 5.7 and MariaDB 10.0 identify parallelism on the master and send this information to the slaves via the binary logs [1]. However, MariaDB and MySQL differ in the way transactions are committed on slaves.

In MariaDB, transactions run in parallel on slaves are committed in the same order as they appear in the binary logs of the master [2]. So if a slave runs T1 and T2 in parallel and those transactions appear in this order in the binary logs of the master, the thread running T2 will wait for T1 to complete before committing T2 (even if T2 is ready to commit before T1).

By default, a MySQL slave running transactions in parallel is simply committing transactions as they complete without enforcing any ordering. In the above example, T2 will commit before T1. This could (and will most of the time) generate different transactions ordering in the binary logs of the slaves [3] [4].

When committing in-order, a transaction that needs to commit first (T1) can be blocked by another transaction that will commit later (T2). It is surprising that such dependencies are generated by transactions committing at the same time on the master, but this can still happen. This is a deadlock situation: T1 is blocked by T2 and T2 is waiting for T1.

The storage engine will not detect this deadlock because both dependencies are not under its control. The first dependency is known to the storage engine: T2 is holding a resource needed by T1. However, the second dependency is in the server, more precisely in the parallel replication code: T2 cannot commit before T1. To resolve this situation, the in-order parallel replication applier must detect that T2 is blocking T1 and it must kill T2. Once killed, T2 will release its resources, allowing T1 to proceed. T2 will then be retried.

This transaction killing and retrying does not happen very often in MariaDB 10.0, but its implementation is essential to avoid blocking replication. The occurrence of those retries can be monitored with the slave_retried_transactions global status. Below is a graph from such monitoring where we can see that three retries were needed for a four-minute interval. This is especially small considering that ~2.5K transactions were run per seconds on this slave (three retries for ~600,000 transactions).

E2 Conservative SBHD Partial Transaction Retries Graph # 0: Retried Transactions on a MariaDB 10.0 Slave

Once this deadlock detection and resolution is implemented, the following becomes very straightforward:

Run all transactions in parallel, with in-order commit making sure that the data is consistent, and with deadlock detection avoiding replication to block.

This is called optimistic parallel replication and is implemented in MariaDB 10.1. This type of parallel replication does not rely on the master to identify parallelism. The slave tries to run as many transactions in parallel as possible to the limit of the slave_parallel_threads parameter while enforcing in-order commit. If no conflict happens, maximum speed is achieved. If a conflict (deadlock) is detected, a transaction that was optimistically run will be rolled-back to unblock in-order commit (this transaction will be retried later).

As long as there are not too many conflicts, or as long as the cost of rollbacks does not outweigh the benefit of running more transactions in parallel, optimistic parallel replication should give good results. What is yet unknown is how those results will compare to the conservative parallel replication (conservative is the MariaDB 10.1 name for the parallel replication in MariaDB 10.0). To answer that question, tests need to be done. The results are presented below.

Of course, the full implementation details are more complex (more details can be found in the annex about rollback and retries, DML vs DDL and non-transactional storage engines and the different optimistic parallel replication modes). However, this introduction is enough to understand the rest of this post, so let us go directly to benchmarking.

The Test

The test is the same as in the previous post: catching up with 24 hours of transactions. The four test environments are also very similar, their description can be found in the annex. In the graphs below, we compare speedups for non-optimistic executions (slave_parallel_mode=conservative) with speedups for optimistic execution (slave_parallel_mode=aggressive). We had to choose aggressive over optimistic because the latter will not give good results replicating via an intermediate master (more details can be found in the annex).

In the graphs, the Y axis shows the speedups with a value of one being the reference time without using parallel replication (slave_parallel_mode=none). The X axis shows the number of threads used: notice the logarithmic scale. The curve stopping at 40 on the X axis is showing the non-optimistic results. The data used to plot those graphs can be found in the annex.

Some reminders from the previous post:

  • Parallelism identification (slave group commit) was done with a maximum group size of 35, so increasing the number of threads past 40 for non-optimistic tests is not interesting.
  • The binary logging configuration had a noticeable impact on performance. The catch-up times are generally longer when enabling log-slave-updates, and disabling binary logging is not such a big win. The results below are obtained with binary logging enabled but with log-slave-updates disabled, referred to as SB in Part 3.
  • The workload of the four environments are different: E2 is a CPU-bound workload, E1 is also mostly CPU-bound but with some cache misses, E3 is a mixed CPU and IO workload, and E4 is an IO-bound workload.
  • As E4 is an IO-bound workload (mostly cache misses in the InnoDB buffer pool), reducing durability - referred to as ND in Part 3 - does not get a noticeable improvement (similar behavior has been observed for E3). For that reason, the results presented below only include high durability for E3 and E4 (referred to as HD).

E1 SB-HD Speedups Graph # 1a: E1 SB-HD - Conservative vs Aggressive Speedups
(CPU-bound workload with some cache misses)

E1 SB-ND Speedups Graph # 1b: E1 SB-ND - Conservative vs Aggressive Speedups
(CPU-bound workload with some cache misses)

E2 SB-HD Speedups Graph # 2a: E2 SB-HD - Conservative vs Aggressive Speedups
(CPU-bound workload)

E2 SB-ND Speedups Graph # 2b: E2 SB-ND - Conservative vs Aggressive Speedups
(CPU-bound workload)

E3 SB-HD Speedups Graph # 3a: E3 SB-HD - Conservative vs Aggressive Speedups
(Mixed CPU and IO workload)

E4 SB-HD Speedups Graph # 4a: E4 SB-HD - Conservative vs Aggressive Speedups
(IO-bound workload)

Discussion

The first surprise comes from observing speedups with very high number of threads. Even though the servers used for the tests only have 12 hyper-threaded cores for a total of 24 threads, speedups are still increasing past 80 threads for all configurations and up to 2560 threads for E3. Obviously, we cannot use more processing power when raising slave_parallel_threads from 40 to 80, so those growing speedups cannot be explained simply by using more CPU.

We think that those speedups are caused by replication prefetching. By increasing the number of threads, transactions that need data that is not in cache will trigger a read from disk earlier (prefetching). Triggering that read earlier is a big win and does not consume much CPU because the thread will go in IOWait state. Even if that transaction causes a conflict and is rolled-back, the data will be in cache for the retry. Also in this case, the cost of the rollback is insignificant compared to the gain of having the data in cache for the retry, so extra conflicts are not a problem. This concept is not new; it is known as replication prefetching and has already been discussed before by Baron Schwartz, Domas Mituzas and Yoshinori Matsunobu.

For all environments, aggressive parallel replication can produce much better speedups than conservative parallel replication. It is also very nice to see more significant speedups with low durability on E1 and E2 (conservative parallel replication is not giving good speedups there). It is absolutely obvious that aggressive parallel replication is a great improvement over conservative parallel replication.

I suspect (and I hope) that there are still bottlenecks to be removed. Optimistic parallel replication is a great improvement over conservative parallel replication, but we are still far from the speedup of 6 to 10 that we are looking for (maybe those expectations are unrealistic...).

Something that might slow down our tests are DDL (Data Definition Language: [CREATE | ALTER | TRUNCATE | DROP | ...] TABLE, ...) because DDL instructions are blocking the replication pipeline. As explained in detail in the annex, before being able to run a DDL, all previously started transactions need to commit. Moreover, before starting any transactions after a DDL, the DDL must complete. Below are some extracts from SHOW GLOBAL STATUS during the test:

  • E1: 197,499 Com_create_table
  • E1: 1148 Com_truncate
  • E1: 489 Com_drop_table
  • E1: 484 Com_rename_table

So in the quickest run of E1 (7060 seconds), we were doing in average 27 CREATE TABLE per second [5]. That obviously cannot be good for parallel replication. To ease identifying such problems, I opened MDEV-10664 - Add statuses about optimistic parallel replication stalls.

Conclusion

As stated in Part 3 (but is still worth repeating): it is possible to test MariaDB parallel replication even if the master is an old version of MySQL/MariaDB. In our case, our masters were running MySQL 5.6, but the same could be applied to other versions.

Overall, the optimistic parallel replication shows very promising results: almost all speedups are better than conservative parallel replication.

The biggest surprise was to see speedup increasing past 80 threads. We could have thought that more threads than processing units would slow things down, but it is not the case. This is probably caused by threads being most of the time in a waiting state: either waiting for a previous transaction to commit, or waiting for an IO. For a pure CPU workload, we can expect contention and this is probably what explains the thrashing for E2.

Another surprise is that the best speedup (3.78) is achieved for an IO bound workload (E4). In this case, the biggest win of parallel replication seems to be getting more read IOPS from the disks subsystem by scheduling many IOs in parallel. In this environment (and probably in others), optimistic parallel replication with a high number of threads is acting as a replication prefetcher.

Up to now, all our tests were done using magnetic disks. It is unclear how both conservative and optimistic parallel replication would behave with solid state disks (SSDs). More tests are needed to understand how parallel replication will behave with SSDs.

If you are interested in this topic and would like to learn more, I will be giving a talk about MySQL and MariaDB Parallel Replication at Percona Live Amsterdam in October. All talks by Booking.com are:

We will also be hosting the Percona Live Community Dinner on October 4th. You can also meet us there if you want to know more about the cool things we do at Booking.com.

[1] MySQL 5.7 and MariaDB 10.0 have slightly different implementation of parallelism identification on the master. MariaDB 10.0 uses the binary log group commit optimization as accurately described in Part 1 (the group commit id is shown in MariaDB mysqlbinlog output as cid). From version 5.7.6, MySQL is tagging each transaction with two logical timestamps (last_committed and sequence_number in MySQL mysqlbinlog output).

[2] MariaDB also has an out-of-order parallel replication functionality based on its GTID implementation. This type of parallel replication might not commit transactions in the same order as they appear in the binary logs of the master. To take advantage of out-of-order parallel replication, hints must be given by the application to advertise what can be run in parallel. This is not the type of parallel replication we are discussing in this post (we are focusing on the in-order type).

[3] By allowing out-of-order transaction commit on slaves, MySQL can alleviate a problem previously discussed in Part 3 where too small a value for slave_parallel_threads could give suboptimal speedups.

[4] The slave-preserve-commit-order option allows enabling in-order commit in MySQL but this option needs log-slave-updates. I opened Bug#75396 to have this restriction removed as needing log-slave-updates to enable slave-preserve-commit-order looks like a step backward:

  • The goal of parallel replication is to get faster replication
  • According to most of our tests, log-slave-updates slows down replication
Moreover, keeping a copy of the master transactions on slaves (log-slave-updates) is not needed when Binlog Servers are already keeping an exact copy of the binary logs of the master. So needing log-slave-updates to enable slave-preserve-commit-order is a restriction we could do without.

[5] This number of CREATE TABLE is very high. It looks like the pattern CREATE TABLE IF NOT EXISTS is very common for E1. Checking for table existence and avoiding the CREATE TABLE when it already exists might be a big win for parallel replication in E1. And for the people that do not believe 27 CREATE TABLE per second is possible (I also had to check it by myself):

$ for i in $(seq 4978 5003); do
  mysqlbinlog $(printf "binlog.%06d" $i) |
    grep -i "CREATE TABLE IF NOT EXISTS"; done | wc -l
211753

comments powered by Disqus