Parallel replication is a highly anticipated feature of MySQL, available in MariaDB 10.0 and in MySQL 5.7. We already presented benchmark results with MariaDB 10.0 in the previous post of the series; in this post, we will look at a new type of replication introduced in MariaDB 10.1: optimistic parallel replication.
This post has an annex: Under the Hood. Benchmarking is a complex art and reporting results accurately is even harder. If all the details were put in a single article, it would make a very long post. The links to the annex should satisfy readers eager for more details.
Optimistic parallel replication is built on top of the previous implementation of parallel replication. To fully understand optimistic parallel replication in MariaDB 10.1, we must dive back in the implementation details of parallel replication in MariaDB 10.0.
Key Design Decision: In-Order Commit
In Part 1 of the series, we explained how parallel replication works in MariaDB 10.0 and in early versions of MySQL 5.7. In short, both MySQL 5.7 and MariaDB 10.0 identify parallelism on the master and send this information to the slaves via the binary logs. However, MariaDB and MySQL differ in the way transactions are committed on slaves.
In MariaDB, transactions run in parallel on slaves are committed in the same order as they appear in the binary logs of the master. So if a slave runs T1 and T2 in parallel and those transactions appear in this order in the binary logs of the master, the thread running T2 will wait for T1 to complete before committing (even if T2 is ready to commit before T1).
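The in-order commit rule can be sketched as a toy two-thread model (a simplification for illustration, not the server's actual applier code): T2 finishes executing first but still commits only after T1.

```python
import threading
import time

commit_order = []                 # order in which transactions commit
committed = set()
cv = threading.Condition()

def apply_transaction(txn, predecessor, work_seconds):
    time.sleep(work_seconds)      # simulate executing the transaction
    with cv:
        # In-order commit: wait until the predecessor has committed.
        while predecessor is not None and predecessor not in committed:
            cv.wait()
        commit_order.append(txn)
        committed.add(txn)
        cv.notify_all()

# T2 executes faster than T1, but must commit after it.
t1 = threading.Thread(target=apply_transaction, args=("T1", None, 0.05))
t2 = threading.Thread(target=apply_transaction, args=("T2", "T1", 0.01))
t2.start(); t1.start()
t1.join(); t2.join()
print(commit_order)               # -> ['T1', 'T2'], same order as the master
```

Whatever the execution timing, the condition variable forces the commit order to match the master's binlog order.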
By default, a MySQL slave running transactions in parallel simply commits transactions as they complete, without enforcing any ordering. In the above example, T2 could commit before T1. This could (and most of the time will) generate a different transaction ordering in the binary logs of the slaves.
When committing in-order, a transaction that needs
to commit first (
T1) can be blocked by another transaction that
will commit later (
T2). It would be surprising for such dependencies to be generated by transactions committing at the same time on the master, but this can still happen.
This is a deadlock situation: T1 is blocked by T2, and T2 is waiting for T1. The storage engine will not detect this deadlock because one of the two dependencies is outside its control. The first dependency is known to the storage engine: T2 is holding a resource needed by T1. However, the second dependency is in the server, more precisely in the parallel replication code: T2 cannot commit before T1.
To resolve this situation, the in-order parallel replication applier must detect that T2 is blocking T1, and it must kill T2 to release its resources, allowing T1 to proceed. T2 will then be retried.
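The two kinds of wait edges and the cycle between them can be sketched as a toy detector (illustration only; MariaDB resolves this inside the server, not with code like this):

```python
# Toy deadlock detector over two kinds of edges:
#  - engine edges:      "X waits for a row lock held by Y"
#  - replication edges: "X cannot commit before Y" (in-order commit)

def find_deadlock(lock_waits, commit_order_waits):
    """Return a blocking (waiter, holder) pair if a cycle exists, else None."""
    edges = {}
    for waiter, holder in lock_waits + commit_order_waits:
        edges.setdefault(waiter, set()).add(holder)

    def reachable(src, dst, seen=None):
        seen = seen or set()
        for nxt in edges.get(src, ()):
            if nxt == dst or (nxt not in seen and reachable(nxt, dst, seen | {src})):
                return True
        return False

    for waiter, holder in lock_waits:
        # Cycle: waiter -> holder -> ... -> waiter
        if reachable(holder, waiter):
            return (waiter, holder)
    return None

# T1 waits on a row lock held by T2 (engine edge);
# T2 waits for T1 to commit first (replication edge).
cycle = find_deadlock(lock_waits=[("T1", "T2")],
                      commit_order_waits=[("T2", "T1")])
print(cycle)   # -> ('T1', 'T2'): T2 must be killed and retried
```

With only the engine edge and no commit-order edge, `find_deadlock` returns `None`: neither layer alone sees a cycle, which is exactly why the replication applier must do this detection itself.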
This transaction killing and retrying does not happen very often in MariaDB 10.0, but its implementation is essential to avoid blocking replication. The occurrence of those retries can be monitored with the Slave_retried_transactions global status. Below is a graph from such monitoring where we can see that three retries were needed in a four-minute interval. This is especially small considering that ~2.5K transactions were run per second on this slave (three retries for ~600,000 transactions).
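Those numbers can be sanity-checked with quick arithmetic (values taken from the monitoring described above):

```python
tps = 2500               # ~2.5K transactions per second on this slave
window_seconds = 4 * 60  # four-minute monitoring interval
retries = 3

transactions = tps * window_seconds      # transactions in the interval
retry_rate = retries / transactions
print(transactions, f"{retry_rate:.8f}")  # -> 600000 0.00000500
```

Three retries out of roughly 600,000 transactions is a retry rate of 0.0005%: effectively negligible.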
Once this deadlock detection and resolution is implemented, the following becomes very straightforward:
Run all transactions in parallel, with in-order commit making sure that the data is consistent, and with deadlock detection preventing replication from blocking.
This is called optimistic parallel replication and is implemented in MariaDB 10.1. This type of parallel replication does not rely on the master to identify parallelism. The slave tries to run as many transactions in parallel as possible, up to the limit of the slave_parallel_threads parameter, while enforcing in-order commit. If no conflict happens, maximum speed is achieved. If a conflict (deadlock) is detected, a transaction that was optimistically run will be rolled back to unblock in-order commit (this transaction will be retried later).
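A toy model of this policy (illustration only; MariaDB's applier is far more sophisticated): run batches of transactions optimistically, roll back any transaction conflicting with an earlier one in the batch, and retry it in the next round.

```python
def optimistic_apply(txns, nthreads):
    """txns: list of (txn_id, write_set) pairs.
    Run batches of up to `nthreads` "in parallel"; within a batch, a
    transaction conflicting with an earlier one is rolled back and
    retried in the next batch. Returns (rounds, retries)."""
    pending = list(txns)
    rounds = retries = 0
    while pending:
        batch, pending = pending[:nthreads], pending[nthreads:]
        rounds += 1
        seen = set()
        survivors = []
        for txn_id, writes in batch:
            if writes & seen:                        # conflict: roll back
                retries += 1
                survivors.append((txn_id, writes))   # retry later
            else:
                seen |= writes                       # commits in this round
        pending = survivors + pending
    return rounds, retries

txns = [("T1", {"a"}), ("T2", {"b"}), ("T3", {"a"}), ("T4", {"c"})]
print(optimistic_apply(txns, nthreads=4))  # -> (2, 1): fast, one retry
print(optimistic_apply(txns, nthreads=1))  # -> (4, 0): serial, no retries
```

The model illustrates the trade-off stated above: more threads means fewer rounds (faster catch-up) at the cost of occasional rollbacks.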
As long as there are not too many conflicts, or as long as the cost of rollbacks does not outweigh the benefit of running more transactions in parallel, optimistic parallel replication should give good results. What is yet unknown is how those results will compare to the conservative parallel replication (conservative is the MariaDB 10.1 name for the parallel replication in MariaDB 10.0). To answer that question, tests need to be done. The results are presented below.
Of course, the full implementation details are more complex (more details can be found in the annex about rollback and retries, DML vs DDL and non-transactional storage engines and the different optimistic parallel replication modes). However, this introduction is enough to understand the rest of this post, so let us go directly to benchmarking.
The test is the same as in the previous post:
catching up with 24 hours of transactions.
The four test environments are also very similar; their description can be found in the annex.
In the graphs below, we compare speedups for non-optimistic executions with speedups for optimistic executions (slave_parallel_mode=aggressive). We had to choose aggressive over optimistic because the latter will not give good results replicating via an intermediate master (more details can be found in the annex).
In the graphs, the Y axis shows the speedups, with a value of one being the reference time without using parallel replication (slave_parallel_mode=none). The X axis shows the number of threads used: notice the logarithmic scale. The curve stopping at 40 on the X axis shows the non-optimistic results. The data used to plot those graphs can be found in the annex.
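As a reminder of how the Y axis is computed, speedup is simply the ratio of catch-up times (the numbers below are hypothetical, not the benchmark data):

```python
def speedup(baseline_seconds, parallel_seconds):
    """Speedup relative to the slave_parallel_mode=none reference run."""
    return baseline_seconds / parallel_seconds

# Hypothetical catch-up times for illustration only:
baseline = 36000          # catch-up time without parallel replication
with_20_threads = 18000   # catch-up time with 20 applier threads
print(speedup(baseline, with_20_threads))  # -> 2.0
```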
- Parallelism identification (slave group commit) was done with a maximum group size of 35, so increasing the number of threads past 40 for non-optimistic tests is not interesting.
- The binary logging configuration had a noticeable
impact on performance.
The catch-up times are generally longer when enabling
log-slave-updates, and disabling binary logging is not such a big win. The results below are obtained with binary logging enabled but with log-slave-updates disabled, referred to as SB in Part 3.
- The workloads of the four environments are different: E2 is a CPU-bound workload, E1 is also mostly CPU-bound but with some cache misses, E3 is a mixed CPU and IO workload, and E4 is an IO-bound workload.
- As E4 is an IO-bound workload (mostly cache misses in the InnoDB buffer pool), reducing durability - referred to as ND in Part 3 - does not bring a noticeable improvement (similar behavior has been observed for E3). For that reason, the results presented below only include high durability for E4 (referred to as HD).
Graph # 1a: E1 SB-HD - Conservative vs Aggressive Speedups
(CPU-bound workload with some cache misses)
Graph # 1b: E1 SB-ND - Conservative vs Aggressive Speedups
(CPU-bound workload with some cache misses)
Graph # 2a: E2 SB-HD - Conservative vs Aggressive Speedups
Graph # 2b: E2 SB-ND - Conservative vs Aggressive Speedups
Graph # 3a: E3 SB-HD - Conservative vs Aggressive Speedups
(Mixed CPU and IO workload)
Graph # 4a: E4 SB-HD - Conservative vs Aggressive Speedups
The first surprise comes from observing speedups with a very high number of threads.
Even though the servers used for the tests only have
12 hyper-threaded cores for a total
of 24 threads, speedups are still increasing past 80 threads for all
configurations and up to 2560 threads for
E3. Obviously, we cannot use
more processing power when increasing slave_parallel_threads from 40 to 80, so those growing speedups cannot be explained simply by using more CPU.
We think that those speedups are caused by replication prefetching. By increasing the number of threads, transactions that need data that is not in cache will trigger a read from disk earlier (prefetching). Triggering that read earlier is a big win and does not consume much CPU because the thread will go into an IO-wait state. Even if that transaction causes a conflict and is rolled back, the data will be in cache for the retry. In this case, the cost of the rollback is insignificant compared to the gain of having the data in cache for the retry, so extra conflicts are not a problem. This concept is not new; it is known as replication prefetching and has already been discussed before by Baron Schwartz, Domas Mituzas and Yoshinori Matsunobu.
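The intuition can be made concrete with a back-of-the-envelope model (all parameters are invented for illustration: one disk read per transaction, and CPU work that serializes on the in-order commit path while up to `nthreads` disk reads overlap):

```python
import math

def catchup_time(n_txns, io_ms, cpu_ms, nthreads):
    """Toy cost model, not a simulator of MariaDB: each transaction needs
    one disk read (io_ms) plus some CPU (cpu_ms). CPU work serializes on
    the in-order commit path, but up to `nthreads` reads overlap."""
    io_rounds = math.ceil(n_txns / nthreads)  # overlapped waves of IO
    return io_rounds * io_ms + n_txns * cpu_ms

# IO-bound workload: 1000 txns, 10 ms per read, 1 ms of CPU each.
print(catchup_time(1000, 10, 1, nthreads=1))    # -> 11000 ms
print(catchup_time(1000, 10, 1, nthreads=100))  # -> 1100 ms
```

In this model the extra threads buy almost nothing in CPU terms; the whole gain comes from overlapping IO waits, which is exactly the prefetching effect described above.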
For all environments, aggressive parallel replication can produce much better speedups than conservative parallel replication. It is also very nice to see more significant speedups with low durability on E1 and E2 (conservative parallel replication is not giving good speedups there). It is absolutely obvious that aggressive parallel replication is a great improvement over conservative parallel replication.
I suspect (and I hope) that there are still bottlenecks to be removed. Optimistic parallel replication is a great improvement over conservative parallel replication, but we are still far from the speedup of 6 to 10 that we are looking for (maybe those expectations are unrealistic...).
Something that might slow down our tests is DDL (Data Definition Language: [CREATE | ALTER | TRUNCATE | DROP | ...] TABLE, ...) because DDL statements block the replication pipeline. As explained in detail in the annex, before being able to run a DDL, all previously started transactions need to commit. Moreover, before starting any transaction after a DDL, the DDL must complete.
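The barrier effect can be sketched as a toy scheduler (an illustration of the rule above, not the actual applier logic):

```python
def schedule(events, nthreads):
    """events: list of (kind, name) with kind "DML" or "DDL".
    DML can run up to `nthreads` at a time, but a DDL drains the
    pipeline: it waits for all prior transactions and runs alone.
    Returns the resulting batches."""
    batches, current = [], []
    for kind, name in events:
        if kind == "DDL":
            if current:                  # flush pending DML first
                batches.append(current)
                current = []
            batches.append([name])       # the DDL runs by itself
        else:
            current.append(name)
            if len(current) == nthreads:
                batches.append(current)
                current = []
    if current:
        batches.append(current)
    return batches

events = [("DML", "T1"), ("DML", "T2"), ("DDL", "CREATE TABLE t"),
          ("DML", "T3"), ("DML", "T4")]
print(schedule(events, nthreads=4))
# -> [['T1', 'T2'], ['CREATE TABLE t'], ['T3', 'T4']]
```

Every DDL splits what could have been one parallel batch into three sequential steps, which is why frequent DDL hurts parallel replication.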
Below are some extracts from
SHOW GLOBAL STATUS during the test:
- E1: 197,499 Com_create_table
- E1: 1148 Com_truncate
- E1: 489 Com_drop_table
- E1: 484 Com_rename_table
So in the quickest run of E1 (7060 seconds), we were doing on average ~28 CREATE TABLE per second. That obviously cannot be good for parallel
replication. To ease identifying such problems, I opened
MDEV-10664 - Add statuses about optimistic parallel replication stalls.
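The average rate follows directly from the counters above (a quick arithmetic check):

```python
com_create_table = 197_499  # Com_create_table from SHOW GLOBAL STATUS (E1)
run_seconds = 7_060         # duration of the quickest E1 run

rate = com_create_table / run_seconds
print(round(rate))          # -> 28 CREATE TABLE per second on average
```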
As stated in Part 3 (but is still worth repeating): it is possible to test MariaDB parallel replication even if the master is an old version of MySQL/MariaDB. In our case, our masters were running MySQL 5.6, but the same could be applied to other versions.
Overall, optimistic parallel replication shows very promising results: almost all speedups are better than with conservative parallel replication.
The biggest surprise was to see speedup increasing past 80 threads. We could have thought that more threads than processing units would slow things down, but it is not the case. This is probably caused by threads being most of the time in a waiting state: either waiting for a previous transaction to commit, or waiting for an IO. For a pure CPU workload, we can expect contention and this is probably what explains the thrashing for E2.
Another surprise is that the best speedup (3.78) is achieved for an IO-bound workload (E4). In this case, the biggest win of parallel replication seems to be getting more read IOPS from the disk subsystem by scheduling many IOs in parallel. In this environment (and probably in others), optimistic parallel replication with a high number of threads is acting as a replication prefetcher.
Up to now, all our tests were done using magnetic disks. It is unclear how both conservative and optimistic parallel replication would behave with solid state disks (SSDs). More tests are needed to understand how parallel replication will behave with SSDs.
If you are interested in this topic and would like to learn more, I will be giving a talk about MySQL and MariaDB Parallel Replication at Percona Live Amsterdam in October. All talks by Booking.com are:
- Data Diversity at Booking.com
- Datastore axes: Choosing the scalability direction you need
- MySQL/MariaDB Parallel Replication: inventory, use-cases and limitations
- MySQL Time Machine by replicating into HBase
- Using multisource replication in MySQL 5.7 for resharding
- Splitting a Database Without Down-time
We will also be hosting the Percona Live Community Dinner on October 4th. You can also meet us there if you want to know more about the cool things we do at Booking.com.
MySQL 5.7 and MariaDB 10.0 have slightly different implementations of parallelism identification on the master. MariaDB 10.0 uses the binary log group commit optimization described in Part 1 of the series (the group commit id is shown in MariaDB mysqlbinlog output as cid). From version 5.7.6, MySQL tags each transaction with two logical timestamps (last_committed and sequence_number in MySQL mysqlbinlog output).
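A simplified sketch of how a slave can use those timestamps to pick transactions to run in parallel (this captures the logical-clock idea only, not MySQL's actual scheduler code, and the metadata values are hypothetical):

```python
def can_start(txn, committed_seqnos):
    """Simplified MySQL 5.7 scheduling rule: a transaction may start once
    every transaction with sequence_number <= its last_committed has
    committed."""
    return all(s in committed_seqnos
               for s in range(1, txn["last_committed"] + 1))

# Hypothetical binlog metadata for three transactions:
t2 = {"sequence_number": 2, "last_committed": 1}
t3 = {"sequence_number": 3, "last_committed": 1}
t4 = {"sequence_number": 4, "last_committed": 3}

done = {1}                  # only T1 (seqno 1) has committed so far
print(can_start(t2, done))  # -> True: T2 may run now
print(can_start(t3, done))  # -> True: T3 may run in parallel with T2
print(can_start(t4, done))  # -> False: must wait for seqnos 2 and 3
```

T2 and T3 share the same last_committed, so the slave can run them concurrently; T4 depends on everything up to sequence number 3 and must wait.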
 MariaDB also has an out-of-order parallel replication functionality based on its GTID implementation. This type of parallel replication might not commit transactions in the same order as they appear in the binary logs of the master. To take advantage of out-of-order parallel replication, hints must be given by the application to advertise what can be run in parallel. This is not the type of parallel replication we are discussing in this post (we are focusing on the in-order type).
By allowing out-of-order transaction commit on slaves, MySQL can alleviate a problem previously discussed in an earlier post of the series, where too small a value for slave_parallel_workers can give suboptimal speedups.
The slave-preserve-commit-order option allows enabling in-order commit in MySQL, but this option needs log-slave-updates. I opened Bug#75396 to have this restriction removed, as needing log-slave-updates to enable slave-preserve-commit-order looks like a step backward:
- The goal of parallel replication is to get faster replication
- According to most of our tests, log-slave-updates slows down replication
- Binary logging (and log-slave-updates) is not needed on slaves when Binlog Servers are already keeping an exact copy of the binary logs of the master. So needing log-slave-updates to enable slave-preserve-commit-order is a restriction we could do without.
This number of CREATE TABLE is very high. It looks like the pattern CREATE TABLE IF NOT EXISTS is very common in E1. Checking for table existence and avoiding the CREATE TABLE when the table already exists might be a big win for parallel replication in E1. And for the people who do not believe that such a rate of CREATE TABLE per second is possible (I also had to check it by myself):
$ for i in $(seq 4978 5003); do mysqlbinlog $(printf "binlog.%06d" $i) | grep -i "CREATE TABLE IF NOT EXISTS"; done | wc -l
211753