
Evaluating MySQL Parallel Replication Part 3, Annex: Under the Hood

This is the annex to Evaluating MySQL Parallel Replication Part 3: Benchmarks in Production.

There is no introduction or conclusion to this post, only sections that are linked to from the main post: reading this post without its context will probably be very hard. You should start with the main post and come back here for more details.

Environments

We are using the same four environments as in the two previous posts (Part 1 and Part 2). Each environment is composed of four servers:

-----     -----     -----     -----
| A | --> | B | --> | C | --> | D |
-----     -----     -----     -----
  • A is a true production master running MySQL 5.6,
  • B is an intermediate master running MariaDB 10.0.16 without parallel replication enabled (slave_parallel_threads = 0),
  • C is an intermediate master running MariaDB 10.0.16 with slave group commit enabled (slave_parallel_threads > 1, binlog_commit_wait_count > 1 and binlog_commit_wait_usec > 1),
  • D is the slave where parallel replication tests are run with different parameters.

The slave group commit parameters on C are the following:

Parameter                  E1       E2       E3       E4
slave_parallel_threads     40       40       40       40
binlog_commit_wait_count   35       35       35       35
binlog_commit_wait_usec    100,000  100,000  50,000   150,000
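
For reference, a minimal sketch of how these settings could be applied on C with MariaDB 10.0 (the values shown are the E1 ones; slave_parallel_threads can only be changed while the replication threads are stopped):

  STOP SLAVE;
  SET GLOBAL slave_parallel_threads = 40;
  SET GLOBAL binlog_commit_wait_count = 35;     -- wait for up to 35 transactions ...
  SET GLOBAL binlog_commit_wait_usec = 100000;  -- ... or at most 100 ms before group committing
  START SLAVE;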

To monitor parallelism identification (group committing) on C, the binlog_commits and binlog_group_commits global statuses are gathered regularly.
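
The sampling itself is straightforward; a sketch is shown below (the average group commit size over an interval is the ratio of the deltas of the two counters):

  -- Group commit efficiency on C: the average group size over a sampling interval is
  -- delta(BINLOG_COMMITS) / delta(BINLOG_GROUP_COMMITS).
  SELECT VARIABLE_NAME, VARIABLE_VALUE
    FROM information_schema.GLOBAL_STATUS
   WHERE VARIABLE_NAME IN ('BINLOG_COMMITS', 'BINLOG_GROUP_COMMITS');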

The D slave was stopped on a Thursday between 17:20 and 17:30 and a backup of the database was taken. Every test starts by recreating the filesystem that hosts the database and restoring the backup on this empty filesystem.

When stopping the database (for backup), the buffer pool was dumped using innodb_buffer_pool_dump_at_shutdown. When starting MariaDB (after backup restoration), innodb_buffer_pool_load_at_startup was used to load the buffer pool. No test was started before the buffer pool was fully loaded.
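
A sketch of the corresponding settings and checks (innodb_buffer_pool_load_at_startup is read-only, so it is set in the server configuration; only its effect is checked here):

  -- Dump the buffer pool content at shutdown (dynamic in MySQL 5.6 / MariaDB 10.0).
  SET GLOBAL innodb_buffer_pool_dump_at_shutdown = ON;
  -- After restart, wait for the load to finish before starting a test:
  SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_load_status';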

Every test then started by running the D slave until Friday at 06:00 (a little more than 12 hours). This allowed further warming up of InnoDB, including giving work to purge and write threads.
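
One way to stop the slave at a reproducible point is to run it up to a chosen binary log position of C (the file name and position below are placeholders, not the actual Friday 06:00 coordinates):

  -- Apply events up to the given coordinates of the C intermediate master, then stop.
  START SLAVE UNTIL MASTER_LOG_FILE = 'mysql-bin.001234', MASTER_LOG_POS = 567890;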

Then, the D slave ran 24 hours of transactions where parallelism was identified using slave group commit on the C intermediate master. The number of commits and the group commit sizes on C for that period are shown in the graphs below for the four test environments.

[Graphs # 0: Commits and Group Commit Sizes in the four Environments (one set of graphs per environment, E1 to E4)]

The D servers have the following properties:

Property           Values from E1 to E4 (a value shared by several environments is listed once)
CPU                2x Intel E5-2640 | 2x Intel X5675
RAM                96 GB | 192 GB | 96 GB
Disk Controller    HP P220i | HP P410i
Controller Cache   512 MB (25% Read / 75% Write)
Logical Disk       838.3 GB (RAID 1) | 1.6 TB (RAID 1+0)
Physical Disks     2x SAS 900.1 GB 10K RPM | 6+1x SAS 600 GB 10K RPM
Database Size      382 GB | 234 GB | 586 GB | 491 GB
Buffer Pool Size   71 GB | 162 GB | 76 GB
Binlog Format      ROW | STATEMENT | ROW

To monitor the commit rate on D, the COM_COMMIT global status is gathered regularly.

To prevent D from filling up its disk, relay_log_space_limit is set to 4 GB (some of the D servers do not have enough disk space to store all the binary logs of C).
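
For completeness, a sketch of the corresponding checks on D (illustrative only):

  -- Commit rate on D: sampled from the Com_commit counter (rate = delta / interval).
  SHOW GLOBAL STATUS LIKE 'Com_commit';
  -- relay_log_space_limit is not dynamic: it is set in the server configuration
  -- (4 GB here) and can only be inspected at runtime.
  SHOW GLOBAL VARIABLES LIKE 'relay_log_space_limit';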

Group Commit: Slave vs. Master

It is briefly mentioned in the main post that slave group commit identifies less parallelism than a true master would. More details are given here.

Let's consider this execution timeline which shows a master where T5 is conflicting with one of the four previous transactions:

   ------Time----->
T1:      B--C
T2:   B-----C
T3:   B-----C
T4:    B----C
T5:  B-- . . --C
T6:       B----C

Using slave group commit on the binary logs produced by the timeline above, parallelism can only be identified in two groups, as shown below.

   ---------------Time---------------->
T1:  B-- . . . . . . . . C
T2:     B----- . . . . . C
T3:           B----- . . C
T4:                 B----C
T5:                       B---- . . C
T6:                            B----C

But if we run the master with delayed commit, we could get the execution timeline below, in which T6 overtakes T5 and joins the first commit group.

   ------Time----->
T1:      B-- . C
T2:   B----- . C
T3:   B----- . C
T4:    B---- . C
T5:  B-- . . . .--C
T6:       B----C

So on a master, a transaction that is blocked in its execution does not prevent other, non-conflicting transactions from joining its commit group, but such a transaction does block parallelism identification in slave group commit. This is why slave group commit is not as good at maximizing group sizes as parallel execution on a true master.

Results

As outlined in the main post, our tests are run in the following binary log configurations:

  • Intermediate Master (IM): binary logs and log-slave-updates enabled.
  • Slave with Binary Logs (SB): binary logs enabled and log-slave-updates disabled.
  • Standard Slave (SS): binary logs and log-slave-updates disabled.

And in the following durability configurations:

  • High Durability (HD): sync_binlog = 1 and innodb_flush_log_at_trx_commit = 1.
  • No Durability (ND): sync_binlog = 0 and innodb_flush_log_at_trx_commit = 2 (also known as relaxed durability); a sketch of how both sets of settings are handled follows this list.
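
The durability settings are dynamic, so switching between HD and ND between test runs could look like the sketch below (the binary log settings, log-bin and log-slave-updates, are not dynamic and have to be changed in the server configuration instead):

  -- High Durability (HD)
  SET GLOBAL sync_binlog = 1;
  SET GLOBAL innodb_flush_log_at_trx_commit = 1;

  -- No Durability (ND, relaxed durability)
  SET GLOBAL sync_binlog = 0;
  SET GLOBAL innodb_flush_log_at_trx_commit = 2;

  -- The binary log configuration (IM / SB / SS) is only checked at runtime:
  SHOW GLOBAL VARIABLES WHERE Variable_name IN ('log_bin', 'log_slave_updates');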

For each of those configurations (6 in total: IM-HD, IM-ND, SB-HD, SB-ND, SS-HD, and SS-ND), the tests are run with different values of slave_parallel_threads (SPT). The full results are presented in the four tables below. Times are in the format hours:minutes.seconds and correspond to processing 24 hours of transactions; next to each time, the speedup over the single-threaded (SPT=0) run is given first and, when applicable, the speedup over the previous SPT value is given second.

Normally, the tests are run at least twice and the shorter time is kept. This is relaxed for E1, E3, and E4 in the SS-HD and SS-ND configurations, where only one run is performed: the results from E2 show that SS-HD and SS-ND behave very much like SB-HD and SB-ND respectively, and the limited results for E1, E3, and E4 are consistent with this observation. For the same reason, some SPT values are skipped for E1, E3, and E4 in SS-HD and SS-ND (the skipped values are shown as N/A in the tables below).
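
Between runs, only the number of applier threads on D changes; in MariaDB 10.0 this requires the replication threads to be stopped, roughly as follows:

  STOP SLAVE;
  SET GLOBAL slave_parallel_threads = 20;  -- example value; the tests use 0, 5, 10, 20, 40, ...
  START SLAVE;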

Table # 1: E1 Execution Times and Speedups
SPT | IM-HD                | IM-ND                | SB-HD                | SB-ND                | SS-HD          | SS-ND
  0 | 10:36.02             | 5:33.17              | 9:11.11              | 5:22.29              | 9:11.41        | 5:23.03
  5 | 7:23.35 (1.43)       | 5:22.43 (1.03)       | 6:47.38 (1.35)       | 5:15.03 (1.02)       | N/A            | N/A
 10 | 6:36.31 (1.60, 1.12) | 5:18.34 (1.05, 1.01) | 6:16.49 (1.46, 1.08) | 5:10.02 (1.04, 1.02) | 6:16.54 (1.46) | 5:09.01 (1.05)
 20 | 6:14.40 (1.70, 1.06) | 5:15.23 (1.06, 1.01) | 6:00.22 (1.53, 1.05) | 5:07.16 (1.05, 1.01) | N/A            | N/A
 40 | 6:07.04 (1.73, 1.02) | 5:14.12 (1.06, 1.00) | 5:53.42 (1.56, 1.02) | 5:05.36 (1.06, 1.01) | 5:53.07 (1.56) | 5:06.10 (1.06)

Table # 2: E2 Execution Times and Speedups
SPT | IM-HD                | IM-ND                | SB-HD                | SB-ND                | SS-HD                | SS-ND
  0 | 3:01.00              | 1:16.24              | 2:50.58              | 1:11.30              | 2:51.22              | 1:11.28
  5 | 1:55.12 (1.57)       | 1:14.07 (1.03)       | 1:40.32 (1.70)       | 1:11.19 (1.00)       | 1:39.47 (1.72)       | 1:10.41 (1.01)
 10 | 1:41.58 (1.78, 1.13) | 1:10.40 (1.08, 1.05) | 1:28.48 (1.93, 1.13) | 1:09.40 (1.03, 1.02) | 1:28.49 (1.93, 1.12) | 1:08.47 (1.04, 1.03)
 15 | 1:36.12 (1.88, 1.06) | 1:10.20 (1.09, 1.00) | 1:25.55 (1.99, 1.03) | 1:10.01 (1.02, 1.00) | 1:25.39 (2.00, 1.04) | 1:09.10 (1.03, 0.99)
 20 | 1:34.15 (1.92, 1.02) | 1:10.07 (1.09, 1.00) | 1:24.52 (2.01, 1.01) | 1:09.03 (1.04, 1.01) | 1:24.19 (2.03, 1.02) | 1:08.00 (1.05, 1.02)
 25 | 1:32.34 (1.96, 1.02) | 1:10.14 (1.09, 1.00) | 1:23.02 (2.06, 1.02) | 1:08.53 (1.04, 1.00) | 1:23.35 (2.05, 1.01) | 1:09.00 (1.04, 0.99)
 30 | 1:31.27 (1.98, 1.01) | 1:10.06 (1.09, 1.00) | 1:23.10 (2.06, 1.00) | 1:10.03 (1.02, 0.98) | 1:21.56 (2.09, 1.02) | 1:08.25 (1.04, 1.01)
 40 | 1:30.15 (2.01, 1.01) | 1:09.35 (1.10, 1.01) | 1:22.12 (2.08, 1.01) | 1:08.32 (1.04, 1.02) | 1:22.13 (2.08, 1.00) | 1:08.25 (1.04, 1.00)
 80 | 1:29.49 (2.02, 1.00) | 1:09.53 (1.09, 1.00) | 1:21.16 (2.10, 1.01) | 1:08.39 (1.04, 1.00) | 1:21.52 (2.09, 1.00) | 1:08.50 (1.04, 0.99)

Table # 3: E3 Execution Times and Speedups
SPT | IM-HD                | IM-ND                | SB-HD                | SB-ND                | SS-HD          | SS-ND
  0 | 10:07.42             | 9:19.50              | 9:58.58              | 9:06.20              | 9:58.06        | 9:05.08
  5 | 8:45.11 (1.16)       | 8:31.04 (1.10)       | 8:41.47 (1.15)       | 8:24.40 (1.08)       | N/A            | N/A
 10 | 8:24.25 (1.20, 1.04) | 8:20.14 (1.12, 1.02) | 8:23.07 (1.19, 1.04) | 8:16.23 (1.10, 1.02) | 8:22.49 (1.19) | 8:15.59 (1.10)
 20 | 8:12.57 (1.23, 1.02) | 8:11.05 (1.14, 1.02) | 8:06.12 (1.23, 1.03) | 8:05.13 (1.13, 1.02) | N/A            | N/A
 40 | 8:06.17 (1.25, 1.01) | 8:11.05 (1.14, 1.02) | 8:07.26 (1.23, 1.00) | 8:04.31 (1.13, 1.00) | 8:09.47 (1.22) | 8:04.30 (1.13)

Table # 4: E4 Execution Times and Speedups
SPT | IM-HD                | IM-ND                | SB-HD                | SB-ND                | SS-HD          | SS-ND
  0 | 7:40.57              | 7:31.37              | 7:43.06              | 7:26.24              | 7:41.10        | 7:23.34
  5 | 6:32.18 (1.17)       | 6:32.42 (1.15)       | 6:36.06 (1.17)       | 6:29.52 (1.15)       | N/A            | N/A
 10 | 6:15.19 (1.23, 1.05) | 6:15.16 (1.20, 1.06) | 6:17.59 (1.23, 1.05) | 6:13.48 (1.19, 1.04) | 6:15.36 (1.23) | 6:11.27 (1.19)
 20 | 6:05.46 (1.26, 1.03) | 6:06.57 (1.23, 1.02) | 6:05.40 (1.27, 1.03) | 6:05.48 (1.22, 1.02) | N/A            | N/A
 40 | 6:02.08 (1.27, 1.01) | 6:00.28 (1.25, 1.02) | 5:59.30 (1.29, 1.02) | 5:59.57 (1.24, 1.02) | 6:02.21 (1.27) | 5:55.19 (1.25)

Graphs during Tests

If you spot something we might have missed in the graphs below, please post a comment. Those graphs include the number of commits per second, CPU stats and Read IOPS for all environments, for the Slave with Binary Logs configuration (log-slave-updates disabled), in both durability settings (high and no/relaxed).

[Graphs # 1a: E1 Stats - Slave with Binary Logs - High Durability (Commits, CPU, Read IOPS)]

[Graphs # 1b: E1 Stats - Slave with Binary Logs - Relaxed Durability (Commits, CPU, Read IOPS)]

[Graphs # 2a: E2 Stats - Slave with Binary Logs - High Durability (Commits, CPU, Read IOPS)]

[Graphs # 2b: E2 Stats - Slave with Binary Logs - Relaxed Durability (Commits, CPU, Read IOPS)]

[Graphs # 3a: E3 Stats - Slave with Binary Logs - High Durability (Commits, CPU, Read IOPS)]

[Graphs # 3b: E3 Stats - Slave with Binary Logs - Relaxed Durability (Commits, CPU, Read IOPS)]

[Graphs # 4a: E4 Stats - Slave with Binary Logs - High Durability (Commits, CPU, Read IOPS)]

[Graphs # 4b: E4 Stats - Slave with Binary Logs - Relaxed Durability (Commits, CPU, Read IOPS)]

Workloads

It is mentioned briefly in the main post that the four test environments have different workloads:

  • E2 is a CPU-bound workload (the dataset fits in RAM).
  • E1 is also mostly CPU-bound but with some cache misses in the InnoDB buffer pool, needing a page fetch from disk before doing a write.
  • E3 is a mixed CPU and IO workload (more cache misses in the InnoDB buffer pool but still with enough cache hit to get a good commit throughput).
  • E4 is an IO-bound workload (mostly cache misses).

We can confirm that E2 does not do much IO by looking at the IOWait and Read IOPS in the graphs below (the commit throughput is also good).

[Graphs # 5: E2 is CPU-bound (good commit throughput, low IOWait and few Read IOPS)]

E1 shows a little more IOWait and Read IOPS, but still not much, as can be seen in the graphs below. Its commit throughput is lower than E2's but still good (the transactions are bigger).

[Graphs # 6: E1 is also mostly CPU-bound (good commit throughput, low IOWait and few Read IOPS)]

E4 is clearly an IO-bound workload: commit throughput is low, IOWait is high, and Read IOPS are high as shown in the graphs below.

[Graphs # 7: E4 is IO-bound (low commit throughput, high IOWait and high Read IOPS)]

E3 has a much higher commit throughput than E4, but with lots of Read IOPS, as shown in the graphs below. This puts the workload in a mixed CPU- and IO-bound category (confirmed by an IOWait that is high, but still lower than E4's).

[Graphs # 8: E3 is mixed CPU and IO-bound (respectable commit throughput, high IOWait and high Read IOPS)]

Note: the Write IOPS are not as important because they are buffered in the RAID controller write cache. The Read IOPS are important because a write operation to the database that needs an InnoDB page fetch from disk is bounded by the disk seek time.

Additional Discussions

Another Problem with Long-Running Transactions

In the main post, we discussed the impact of long-running transactions on the parallel replication pipeline (idle threads while the long transaction is running). When the number of slave threads is not smaller than the group commit size, the time taken to execute the group is bound by the time taken to run the longest transaction, as shown below.

   --------Time-------->
T1:  B-----------C
T2:           B--C
T3:           B--C
T4:           B--C
T5:              B--C
T6:              B--C

The transactions above on the master are executed as below on the slave with SPT=4.

   --------Time-------->
T1:  B-----------C
T2:  B-- . . . . C
T3:  B-- . . . . C
T4:  B-- . . . . C
T5:               B--C
T6:               B--C
              1
     12345678901234567

But if the number of slave threads is smaller than the group commit size, we get a longer run, as shown below (SPT=2):

   ---------Time-------->
T1:  B-----------C
T2:  B-- . . . . C
T3:               B--C
T4:               B--C
T5:                   B--C
T6:                   B--C
              1         2
     123456789012345678901

So we lost 4 units of time in the global run by decreasing SPT. What is interesting is that there is enough time for executing both T3 and T4 while T1 executes (and while the thread running T2 is idle), so maybe it is possible to avoid this situation.

Note that MySQL 5.7 solves this problem by allowing out-of-order commit. With slave-preserve-commit-order set to 0 (the default), MySQL will commit T2 without waiting for T1 to commit. This will allow T3 to start executing while T1 is still running. However, this changes the behavior of the slave: T2 (and eventually T3 and T4) becomes visible before T1, which is not possible in single-threaded replication. Some people (including the author of this post) might be nervous about this and would prefer to set slave-preserve-commit-order to 1.
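
For comparison, a sketch of the corresponding MySQL 5.7 settings (variable names as in the 5.7 documentation; slave_preserve_commit_order additionally requires the binary logs and log-slave-updates to be enabled):

  -- MySQL 5.7: parallel appliers based on binary log group commit (logical clock),
  -- with or without preserving the commit order of the master.
  STOP SLAVE SQL_THREAD;
  SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
  SET GLOBAL slave_parallel_workers = 40;
  SET GLOBAL slave_preserve_commit_order = 1;  -- 0 (the default) allows out-of-order commit
  START SLAVE SQL_THREAD;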

A future solution for managing long-running transactions and group sizes larger than SPT could be to delegate committing transactions: if the thread running T2 delegates its commit to another thread, it can start running T3 even if T2 is not yet committed. This could be an interesting optimization to implement in MariaDB or in MySQL.

(Note that the problem described above - group size larger than SPT - cannot explain the modest speedup observed in our tests as we identify parallelism with binlog_commit_wait_count = 35 and we run tests with SPT up to 40.)

Relaxed Durability Burns CPU for E1 and E2

As presented in the main post, E1 and E2 have respective speedups of ~1.06 and ~1.04 for SB-ND. However, as shown in the graphs below, CPU consumption increases noticeably as SPT increases, without significantly improving the commit throughput.

[Graphs # 9: E1 Stats - Slave with Binary Logs - Relaxed Durability (Commits, CPU)]

[Graphs # 10: E2 Stats - Slave with Binary Logs - Relaxed Durability (Commits, CPU)]

Either we hit a synchronization bottleneck in the parallel applier or there is room for optimization: there is probably interesting work to be done here.
