
Troubleshooting: A journey into the unknown

Troubleshooting is a journey. It’s a long, unpredictable trek: you know the start and end points, but you have zero knowledge of the actual path you need to take to get there – to the root cause of the problem. In your backpack you have knowledge, past experience, and various troubleshooting techniques. For a systems engineer, the enjoyable task of finding the root cause of a problem often feels exactly like such a journey into the unknown.

This particular journey relates to an issue we had on servers in our Distributed Load Balancing (DLB) service. The issue itself had nothing to do with load balancing, but the applicable knowledge gained was invaluable. The journey was worthwhile, as always.

The start

Here is the start point of our journey: a sudden drop of incoming traffic to a single member of a DLB group. Other members instantly picked up the traffic in equal shares, thanks to the Equal-Cost Multi-Path (ECMP) routing we have deployed.

Drop in Requests

The rate of incoming traffic dropped to zero within a single sampling interval (we pull and store metrics every 10 seconds) and recovered after ~50 seconds. In our systems there are five possible reasons for this kind of traffic drop:

  1. Switches on the north and south sides stop selecting the server as the next hop for incoming traffic due to a configuration change
  2. The Bird Internet Routing Daemon stops running on the DLB server
  3. anycast-healthchecker withdraws routes for all services because they fail their health checks
  4. The BFD protocol on the switch side detects that the server isn’t sending hello messages and stops routing traffic to it
  5. A network cable or network card glitch

We examined the anycast-healthchecker log and found that all services were responding successfully to their health checks. We then looked at the bird log and found the following:

08:35:42.981081+01:00 lb-101 bird: **BGP1: Received: Other configuration change**
08:35:42.981362+01:00 lb-101 bird: BGP1: BGP session closed
08:35:42.981474+01:00 lb-101 bird: BGP1: State changed to stop
08:35:42.991510+01:00 lb-101 bird: BGP1: Down
08:35:42.991775+01:00 lb-101 bird: bfd1: Session to 10.248.16.254 removed
08:35:42.991883+01:00 lb-101 bird: BGP1: State changed to down
08:35:42.991985+01:00 lb-101 bird: BGP1: Starting
08:35:42.992090+01:00 lb-101 bird: BGP1: State changed to start
08:35:42.992191+01:00 lb-101 bird: bfd1: Session to 10.248.16.254 added
08:35:42.992299+01:00 lb-101 bird: BGP1: Started
08:35:42.992399+01:00 lb-101 bird: BGP1: Connect delayed by 5 seconds
08:35:42.992502+01:00 lb-101 bird: BGP2: **Received: Other configuration change**
.......

All DLB servers are dual-homed and establish BGP peering with the switches on the north and south sides. According to RFC 4486, the messages in bold indicate that the Bird daemon received a BGP message to reset the peering due to a configuration change on the switch side.

We looked at the Bird code and the switch logs, and we found that the switch asked to reset the BGP peering after three consecutive missing BFD hello messages. These messages are exchanged over UDP at an interval of 400 milliseconds, with a tolerance of no more than three missed packets, after which the session is declared down.

The DLB server hadn’t sent BFD hello messages for 1.2 seconds! The most interesting part of the above log is that the failure happened concurrently on both BGP peerings, which are established over two different network cards to two different physical switches.

This made us believe that something on the host caused the loss of three consecutive BFD messages; it’s very unlikely to have simultaneous hardware issues on two different network cards, cables, or switches.
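The arithmetic behind the 1.2-second figure can be sketched in shell; the birdc commands in the comments show how the session state could be inspected on a live server (the protocol names BGP1 and bfd1 match the log above):

```shell
# BFD failure detection time = transmit interval x detect multiplier.
bfd_detect_ms() {
    echo $(( $1 * $2 ))
}

bfd_detect_ms 400 3   # prints 1200 (ms), i.e. the 1.2 seconds above

# On a live DLB server the session state could be inspected with the
# BIRD client, e.g.:
#   birdc show protocols all BGP1
#   birdc show bfd sessions
```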

Several occurrences of the issue

The exact same issue was happening on multiple servers at random times throughout the day, and in every occurrence we saw the same lines in the bird log. So we knew the end of our journey; we just needed to find what made the system fail to send three consecutive UDP packets at 400-millisecond intervals. We store logs in Elasticsearch, so we created a Kibana dashboard to visualize those errors and started investigating each occurrence.

Our servers are directly connected to the Internet, so we looked for possible short-duration attacks at the TCP layer. UDP traffic is not allowed, so we excluded the possibility of a UDP-based attack on ports 80/443. We didn’t notice any sudden increase in incoming TCP, ICMP, or HTTP traffic before the occurrences of the issue.

We also looked at the haproxy log for possible SSL attacks, but we didn’t notice any unusual traffic pattern. So we knew that there was nothing external to the system that could explain the problem.

The first interesting find

The next stage of our journey was haproxy itself. We use collectd for collecting system statistics and haproxystats for haproxy statistics. Both tools help us gather a lot of performance metrics about haproxy and the system. Furthermore, haproxy emits log messages which contain very useful information that helps figure out what is going on in the server.

haproxy exposes CPU usage per process (we run 10 processes): we noticed spikes to 100% utilization around the same time Bird received the messages to reset the BGP peering. In the following graph we can see that every haproxy process hit 100% CPU utilization for at least one data point.

haproxy CPU usage

The sudden increases in CPU usage weren’t always followed by BGP peering resets; in some cases, Bird reported BFD issues before the CPU spikes. Nevertheless, we continued to investigate the CPU spikes, as they were very suspicious.

The CPU utilization of a process is the sum of user-level and system-level CPU usage. Thus, we needed to know whether haproxy was spending all this CPU power performing its own tasks (SSL computation, business logic processing, etc.) or asking the system to do things like dispatching data to various TCP sockets or handling incoming/outgoing connections. The two graphs below suggest that haproxy was spending CPU cycles at the system level.

haproxy CPU user level

haproxy CPU system level
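The user/system split shown in the graphs can also be observed per process with pidstat from the sysstat package. A minimal sketch; note that with a 12-hour-clock timestamp ("12:41:21 PM") the timestamp occupies two awk fields, so %system lands in column 6:

```shell
# Print the %system column for haproxy samples read on stdin.
extract_system_cpu() {
    awk '/hapee-lb/ {print $6}'
}

# Live usage, one sample per second (not run here):
#   pidstat -u -p "$PID" 1 | extract_system_cpu

# Against a captured sample line:
echo '12:41:22 PM   498   28892   12.00   45.00   0.00   57.00   4  hapee-lb' \
    | extract_system_cpu   # prints 45.00
```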

This gave us a really good starting point for a more in-depth analysis of what was causing those CPU spikes. We reviewed the haproxy configuration several times and found nothing suspicious. The haproxy software hadn’t been upgraded recently, so we excluded a software regression as a possible cause of this behaviour.

We contacted HAProxy Technologies, Inc. for assistance. They asked us to collect more information about sessions and TCP connections, as there was a known bug that could cause a high number of TCP connections in the CLOSE-WAIT state – but, according to them, that specific bug couldn’t cause CPU spikes.

We also looked at the memory utilization of haproxy, and there wasn’t anything suspicious there either. But in all the occurrences we saw a sudden increase in free memory: the system freed ~600MB of memory around the same time as the CPU spikes.

Free memory

It wasn’t clear to us whether those two observations (the CPU spikes and the sudden increase in free memory) were the cause or a symptom of our issue. Moreover, the sudden increase in free memory could be related to a garbage collector being invoked by some other service. So, more digging was required to clear the fog on our path.

(s)Tracing the unknown factor

We run many daemons and cron jobs on our servers. In some occurrences of our problem we saw a puppet run happening at the same time. We decided to look at what was executed on every puppet run.

We set up some scripts that ran pidstat against puppet and a few other daemons. Since the issue was happening at random times across a lot of servers, we had to pick a few servers to run those scripts on and wait for the problem to appear.

After a few days of waiting, we had several traces to analyze. Puppet really loves CPU and can easily lock a CPU for 4-5 seconds, but it wasn’t causing our problem. Other daemons were hungry for memory and CPU resources, but not at a level that could explain the sudden increase in free memory.

The HAProxy support team suggested deploying a script that would run strace and dump sessions whenever haproxy CPU usage at the system level went beyond 30%. The script below was deployed on a single server and manually invoked for all haproxy processes.

#!/bin/bash
#
# hapee_tracing.sh: when the system-level CPU usage of one haproxy
# process exceeds 30%, dump its sessions and strace it for 5 seconds.
KILL_FILE="/tmp/kill_hapee_tracing"   # touch this file to stop the script
BASE_DIR="/var/log/pidstats/"
SOCKET_DIR="/run/lb_engine/"
PROCESS_NUMBER="$1"                   # haproxy process number (1-10)
PIDS=( $(cat /run/hapee-lb.pid) )
PID_INDEX=$((PROCESS_NUMBER - 1))
PID=${PIDS[${PID_INDEX}]}


mkdir -p "${BASE_DIR}"


while true ; do
        if [ -f "${KILL_FILE}" ] ; then
            exit 0
        fi
        # Sample CPU usage once per second for 60 seconds; column 6 of
        # the pidstat output is %system. stdbuf disables buffering so
        # each sample reaches the loop immediately.
        timeout 60 pidstat -u -p "${PID}" 1 | stdbuf -i0 -o0 -e0 egrep 'hapee-lb' | stdbuf -i0 -o0 -e0 awk '{print $6}' | while read line
        do
            if [ -f "${KILL_FILE}" ] ; then
                exit 0
            fi
            system_cpu=$(echo "${line}" | awk -F. '{print $1}')   # integer part
            if [ "${system_cpu}" -gt 30 ] ; then
                # Dump sessions in the background and strace the process.
                echo 'show sess all' | socat "${SOCKET_DIR}process-${PROCESS_NUMBER}.sock" stdio > "${BASE_DIR}sessions_$(date +%F:%H:%M:%S)_${PROCESS_NUMBER}_${PID}" &
                timeout 5 strace -ttvs200 -p "${PID}" -o "${BASE_DIR}strace_$(date +%F:%H:%M:%S)_${PROCESS_NUMBER}_${PID}"
            fi
        done
        # Re-read the PID in case haproxy was reloaded in the meantime.
        PIDS=( $(cat /run/hapee-lb.pid) )
        PID=${PIDS[${PID_INDEX}]}
done

We also deployed a script to dump the number of TCP connections using the ss tool, and sar was adjusted to capture CPU, memory, and network statistics. All of those tools gathered information every second: since it takes only 1.2 seconds for a BFD session to be declared down, we had to gather information at that small an interval.
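The per-second connection snapshot can be sketched as follows (a minimal illustration, not our exact script; it assumes the default `ss -ant` output with a one-line header):

```shell
# Summarise the State column of `ss -ant` output, skipping the header.
count_tcp_states() {
    awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn
}

# Live usage, one snapshot per second (not run here):
#   while true; do date; ss -ant | count_tcp_states; sleep 1; done

# Against a minimal captured sample:
printf 'State\nESTAB\nESTAB\nCLOSE-WAIT\n' | count_tcp_states
```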

While we were waiting for the problem to reappear on the target machine, we decided to move the Bird daemon to a CPU core not used by haproxy. The 10 haproxy processes are pinned to the last 10 CPUs of the system, so we pinned the Bird daemon to CPU 1 and assigned it a nice level of -17. We did that to make sure it had enough resources to process BFD messages even while haproxy was spinning at 100% CPU utilization. We also changed the CPU priority of the puppet agent so it would utilise fewer CPU resources.
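The pinning described above can be expressed with taskset and renice. A sketch, assuming bird's PID is available via pidof; the commands are shown as comments since they require root and a running bird:

```shell
# Pin bird to CPU core 1 and raise its scheduling priority (root required):
#   BIRD_PID=$(pidof bird)
#   taskset -pc 1 "$BIRD_PID"     # -c takes a CPU list
#   renice -n -17 -p "$BIRD_PID"  # the -17 nice level mentioned above

# taskset alternatively accepts a hex affinity mask, where CPU n is bit n:
cpu_mask() { printf '0x%x\n' $(( 1 << $1 )); }
cpu_mask 1   # prints 0x2, the mask selecting CPU 1
```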

Light at the end of the tunnel

The issue appeared on the target server and our tracing tools collected 150MB of data (close to 10 million lines to read!). We analysed the pidstat, sar, ss, and strace outputs and made the following observations together with Willy Tarreau, the author of HAProxy and a Linux kernel developer (note that the high CPU utilization started at 12:41:21 and lasted until 12:41:30):

No change in incoming requests per second prior to the problem:

12:38:00 PM  active/s passive/s    iseg/s    oseg/s
12:41:18 PM   4403.00    836.00  48416.00  66313.00
12:41:19 PM   4115.00    819.00  48401.00  67910.00
12:41:20 PM   1417.00    786.00  43005.00  57608.00
12:41:21 PM   4225.00    824.00  35247.00  49883.00
12:41:22 PM   1198.00    814.00  21580.00  25604.00
12:41:23 PM   3446.00    768.00  24229.00  33893.00
12:41:24 PM   4269.00    773.00  30462.00  46604.00
12:41:25 PM   2259.00    821.00  24347.00  33772.00
12:41:26 PM    994.06    880.20  13207.92  15813.86
12:41:27 PM   4878.00    802.00  32787.00  50708.00
12:41:28 PM   2988.00    816.00  36008.00  53809.00
12:41:29 PM   3865.00    883.00  34822.00  53514.00

haproxy stopped for ~500 milliseconds in the middle of some operations, indicating that it was interrupted:

12:41:21.913124 read(932, "\25\3\3\0\32", 5) = 5
12:41:21.913140 read(932, "\0\0\0\0\0\0\0\1\244=\241\234Jw\316\370\330\246\276\220N\225\315\2333w", 26) = 26
12:41:21.913171 sendto(55, "<a href=\"&#47;general.sv.html?label=gen173nr-1FCAEoggJCAlhYSDNiBW5vcmVmaAKIAQGYAS64
12:41:21.913199 sendto(55, "; form.append( input ).append( check ); } }); } </script> \n<script src=\"https://r-e
12:41:22.413476 recvfrom(1143, "T[16], r.MB, r.MN, null), T[17], r.ME(T[18], r.MB, r.MN, null), T[12], r.ME(T[19
12:41:22.413512 sendto(55, "T[16], r.MB, r.MN, null), T[17], r.ME(T[18], r.MB, r.MN, null), T[12], r.ME(T[19], r
12:41:22.413539 connect(2665, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("10.198.156.38")}, 16)

High rate of page free / cache free activity, system was freeing 600MB of RAM per second for a period of ~6 seconds:

12:38:00 PM   frmpg/s   bufpg/s   campg/s
12:41:16 PM  -5423.00      0.00   -754.00
12:41:17 PM  -1784.00      1.00   1504.00
12:41:18 PM  -1868.00      1.00    337.00
12:41:19 PM  -1110.00      2.00    416.00
12:41:20 PM  16308.00    -27.00 -10383.00
12:41:21 PM  77274.00    -56.00 -71772.00
12:41:22 PM 154106.00   -147.00 -121659.00
12:41:23 PM 121624.00   -253.00 -93271.00
12:41:24 PM 109223.00   -238.00 -84747.00
12:41:25 PM 140841.00   -384.00 -116015.00
12:41:26 PM 142842.57   -573.27 -121333.66
12:41:27 PM  83102.00   -263.00 -59726.00
12:41:28 PM 118361.00  -1185.00 -80489.00
12:41:29 PM 168908.00   -558.00 -103072.00

System had 92% of the memory already allocated prior to the issue and started freeing a second later:

12:38:00 PM kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
12:41:13 PM   7417424  91456184     92.50    238480  60751344   9223012      9.23  34595324  28501704     72000
12:41:14 PM   7409968  91463640     92.51    238488  60752448   9223012      9.23  34595720  28501756     72956
12:41:15 PM   7382684  91490924     92.53    238488  60762264   9236604      9.24  34620324  28501852     82740
12:41:16 PM   7360992  91512616     92.56    238488  60759248   9255500      9.26  34638704  28501984     79724
12:41:17 PM   7353856  91519752     92.56    238492  60765264   9255500      9.26  34640664  28502168     85748
12:41:18 PM   7346384  91527224     92.57    238496  60766612   9252824      9.26  34642028  28502248     72148
12:41:19 PM   7341944  91531664     92.57    238504  60768276   9254356      9.26  34644128  28502328     73528
12:41:20 PM   7407176  91466432     92.51    238396  60726744   9252828      9.26  34593160  28509368     79052
12:41:21 PM   7716272  91157336     92.20    238172  60439656   9258516      9.27  34253456  28564120     80248
12:41:22 PM   8332696  90540912     91.57    237584  59953020   9269996      9.28  33694136  28646036     81072
12:41:23 PM   8819192  90054416     91.08    236572  59579936   9286824      9.29  33269068  28698896     84388
12:41:24 PM   9256084  89617524     90.64    235620  59240948   9290336      9.30  32890520  28744820     94256
12:41:25 PM   9819448  89054160     90.07    234084  58776888  10491136     10.50  32403472  28796428    100308
12:41:26 PM  10396532  88477076     89.49    231768  58286700   9364872      9.37  31895556  28838756    103444
12:41:27 PM  10728940  88144668     89.15    230716  58047796   9387052      9.39  31647348  28850604    121412
12:41:28 PM  11202384  87671224     88.67    225976  57725840   9358996      9.37  31286796  28871824    127252
12:41:29 PM  11878016  86995592     87.99    223744  57313552   9360844      9.37  30955468  28800072    138596
12:41:30 PM  12599780  86273828     87.26    221836  56961620   9350056      9.36  30688424  28713648    153488
12:41:31 PM  12638924  86234684     87.22    221840  56983328   9345268      9.35  30706596  28714976    175972
12:41:32 PM  12606460  86267148     87.25    221840  57016040   9350112      9.36  30737556  28721456    208684
12:41:33 PM  12591284  86282324     87.27    221840  57034704   9345904      9.35  30751396  28724348    217184

CPU saturation at system level while the system was freeing memory:

12:38:00 PM    CPU         %user       %nice   %system   %iowait        %steal         %idle
12:41:20 PM      0         11.22        0.00      8.16      0.00          0.00         80.61
12:41:20 PM      1         14.29        0.00      6.12      0.00          0.00         79.59
12:41:20 PM      2         18.00        0.00     11.00      0.00          0.00         71.00
12:41:20 PM      3         16.00        0.00     12.00      0.00          0.00         72.00
12:41:20 PM      4         30.61        0.00     15.31      0.00          0.00         54.08
12:41:20 PM      5         11.34        0.00      3.09      0.00          0.00         85.57
12:41:20 PM      6         28.57        0.00      9.18      0.00          0.00         62.24
12:41:20 PM      7         16.16        0.00     12.12      0.00          0.00         71.72
12:41:20 PM      8         20.83        0.00      6.25      0.00          0.00         72.92
12:41:20 PM      9         19.39        0.00      3.06      0.00          0.00         77.55
12:41:20 PM     10         14.29        0.00     12.24      0.00          0.00         73.47
12:41:20 PM     11         16.16        0.00      4.04      0.00          0.00         79.80
12:41:21 PM    all         13.29        0.00     41.88      0.00          0.00         44.83
12:41:21 PM      0          6.06        0.00     17.17      0.00          0.00         76.77
12:41:21 PM      1         14.14        0.00     22.22      0.00          0.00         63.64
12:41:21 PM      2         14.29        0.00     45.92      0.00          0.00         39.80
12:41:21 PM      3         18.18        0.00     46.46      0.00          0.00         35.35
12:41:21 PM      4         10.00        0.00     51.00      0.00          0.00         39.00
12:41:21 PM      5         14.14        0.00     46.46      0.00          0.00         39.39
12:41:21 PM      6         20.00        0.00     41.00      0.00          0.00         39.00
12:41:21 PM      7         15.31        0.00     38.78      0.00          0.00         45.92
12:41:21 PM      8         14.14        0.00     45.45      0.00          0.00         40.40
12:41:21 PM      9         11.00        0.00     47.00      0.00          0.00         42.00
12:41:21 PM     10         10.00        0.00     52.00      0.00          0.00         38.00
12:41:21 PM     11         11.11        0.00     50.51      0.00          0.00         38.38
12:41:22 PM    all          9.58        0.00     84.18      0.00          0.00          6.24
12:41:22 PM     0           4.08        0.00     70.41      1.02          0.00         24.49
12:41:22 PM     1           2.02        0.00     62.63      0.00          0.00         35.35
12:41:22 PM     2          16.00        0.00     77.00      0.00          0.00          7.00
12:41:22 PM     3          14.00        0.00     86.00      0.00          0.00          0.00
12:41:22 PM     4           7.00        0.00     93.00      0.00          0.00          0.00
12:41:22 PM     5           3.00        0.00     97.00      0.00          0.00          0.00
12:41:22 PM     6          12.00        0.00     85.00      0.00          0.00          3.00
12:41:22 PM     7          15.00        0.00     83.00      0.00          0.00          2.00
12:41:22 PM     8          13.86        0.00     84.16      0.00          0.00          1.98
12:41:22 PM     9           9.09        0.00     90.91      0.00          0.00          0.00
12:41:22 PM     10         10.00        0.00     90.00      0.00          0.00          0.00
12:41:22 PM     11          9.00        0.00     91.00      0.00          0.00          0.00
12:41:23 PM    all         17.73        0.00     75.75      0.00          0.00          6.52
12:41:23 PM     0          24.00        0.00     67.00      0.00          0.00          9.00
12:41:23 PM     1           5.05        0.00     55.56      0.00          0.00         39.39
12:41:23 PM     2          14.14        0.00     80.81      0.00          0.00          5.05
12:41:23 PM     3          26.73        0.00     73.27      0.00          0.00          0.00
12:41:23 PM     4          14.00        0.00     86.00      0.00          0.00          0.00
12:41:23 PM     5          24.00        0.00     75.00      0.00          0.00          1.00
12:41:23 PM     6          16.00        0.00     76.00      0.00          0.00          8.00
12:41:23 PM     7          13.27        0.00     79.59      0.00          0.00          7.14
12:41:23 PM     8          18.00        0.00     75.00      0.00          0.00          7.00
12:41:23 PM     9          19.61        0.00     79.41      0.00          0.00          0.98
12:41:23 PM     10         16.00        0.00     83.00      0.00          0.00          1.00
12:41:23 PM     11         21.00        0.00     78.00      0.00          0.00          1.00
12:41:24 PM    all         16.99        0.00     70.14      0.08          0.00         12.78
12:41:24 PM      0         11.34        0.00     49.48      0.00          0.00         39.18
12:41:24 PM      1         13.13        0.00     45.45      0.00          0.00         41.41
12:41:24 PM      2         19.00        0.00     66.00      0.00          0.00         15.00
12:41:24 PM      3         20.41        0.00     71.43      0.00          0.00          8.16
12:41:24 PM      4         19.00        0.00     79.00      0.00          0.00          2.00
12:41:24 PM      5         17.17        0.00     79.80      0.00          0.00          3.03
12:41:24 PM      6         21.21        0.00     67.68      0.00          0.00         11.11
12:41:24 PM      7         20.20        0.00     67.68      0.00          0.00         12.12
12:41:24 PM      8         19.39        0.00     63.27      0.00          0.00         17.35
12:41:24 PM      9          7.22        0.00     90.72      0.00          0.00          2.06
12:41:24 PM     10         14.14        0.00     83.84      0.00          0.00          2.02
12:41:24 PM     11         20.00        0.00     79.00      0.00          0.00          1.00
12:41:26 PM      9         21.78        0.00     78.22      0.00          0.00          0.00

Some low-rate activity for page swapping:

12:39:40 PM  pswpin/s pswpout/s
12:39:43 PM  0.00          0.00
12:39:44 PM  0.00         78.22
12:39:45 PM  0.00         75.00
12:39:46 PM  0.00          0.00
(...)
12:41:20 PM  0.00          9.00
12:41:21 PM  0.00         43.00
12:41:22 PM  0.00         84.00
12:41:23 PM  0.00         70.00
12:41:24 PM  0.00         53.00
12:41:25 PM  0.00         74.00

haproxy was writing data, which is odd: it shouldn’t be doing any filesystem I/O, considering that when the service starts it closes all the file descriptors for files that could cause I/O operations to the filesystem:

12:41:21 PM   UID           PID   kB_rd/s   kB_wr/s kB_ccwr/s  Command
12:41:22 PM   498         28849       0.00     0.00      0.00  hapee-lb
12:41:22 PM   498         28892       0.00  5964.00      0.00  hapee-lb
12:41:22 PM   498         28894       0.00     0.00      0.00  hapee-lb
12:41:22 PM   498         28895       0.00     0.00      0.00  hapee-lb
12:41:22 PM   498         28896       0.00     0.00      0.00  hapee-lb
12:41:22 PM   498         28897       0.00  6276.00      0.00  hapee-lb
12:41:22 PM   498         28899       0.00    20.00      0.00  hapee-lb
12:41:22 PM   498         28901       0.00     0.00      0.00  hapee-lb
12:41:22 PM   498         28902       0.00     0.00      0.00  hapee-lb
12:41:22 PM   498         28904       0.00     0.00      0.00  hapee-lb
12:41:22 PM   498         28905       0.00     0.00      0.00  hapee-lb

All the haproxy processes started to do some minor page faults. They’d touched a free memory area for the first time since that area was last reclaimed:

12:41:20 PM   UID           PID  minflt/s  majflt/s     VSZ    RSS   %MEM  Command
12:41:21 PM     0          5206      0.00      0.00   46340   1824   0.00  hapee-lb-system
12:41:21 PM   498         28849      0.00      0.00  177540  85296   0.09  hapee-lb
12:41:21 PM   498         28892    102.00      0.00  213204 121164   0.12  hapee-lb
12:41:21 PM   498         28894    179.00      0.00  216592 124324   0.13  hapee-lb
12:41:21 PM   498         28895    116.00      0.00  213360 122676   0.12  hapee-lb
12:41:21 PM   498         28896    153.00      0.00  211840 122544   0.12  hapee-lb
12:41:21 PM   498         28897    106.00      0.00  210236 121816   0.12  hapee-lb
12:41:21 PM   498         28899     55.00      0.00  210196 118000   0.12  hapee-lb
12:41:21 PM   498         28901    140.00      0.00  212192 120288   0.12  hapee-lb
12:41:21 PM   498         28902    125.00      0.00  214616 123212   0.12  hapee-lb
12:41:21 PM   498         28904     81.00      0.00  215988 117196   0.12  hapee-lb
12:41:21 PM   498         28905    110.00      0.00  211984 112692   0.11  hapee-lb

Memory usage of the haproxy processes remained stable and didn’t change one second later, showing that haproxy was just touching memory that had been aggressively reclaimed by the system:

12:41:21 PM   UID           PID  minflt/s  majflt/s     VSZ    RSS   %MEM  Command
12:41:22 PM     0          5206      0.00      0.00   46340   1824   0.00  hapee-lb-system
12:41:22 PM   498         28849      0.00      0.00  177540  85296   0.09  hapee-lb
12:41:22 PM   498         28892    284.00      0.00  213356 121660   0.12  hapee-lb
12:41:22 PM   498         28894    231.00      0.00  217144 124900   0.13  hapee-lb
12:41:22 PM   498         28895    231.00      0.00  213936 122992   0.12  hapee-lb
12:41:22 PM   498         28896      8.00      0.00  211840 122544   0.12  hapee-lb
12:41:22 PM   498         28897    223.00      0.00  210380 122132   0.12  hapee-lb
12:41:22 PM   498         28899    311.00      0.00  212492 118752   0.12  hapee-lb
12:41:22 PM   498         28901    223.00      0.00  212460 120640   0.12  hapee-lb
12:41:22 PM   498         28902    214.00      0.00  214616 123516   0.12  hapee-lb
12:41:22 PM   498         28904    219.00      0.00  215988 117404   0.12  hapee-lb
12:41:22 PM   498         28905      2.00      0.00  211984 112692   0.11  hapee-lb

Willy Tarreau also inspected session information as they were dumped from haproxy memory and didn’t find anything unusual. He finished his investigation with the following:

  1. virtual machines using memory ballooning to steal memory from the processes and assign it to other VMs. But from what I remember you don't run on VMs (which tends to be confirmed by the fact that %steal is always 0)
  2. batched log rotation and uploading. I used to see a case where logs were uploaded via an HTTP POST using curl which would read the entire file in memory before starting to send, that would completely flush the cache and force the machine to swap, resulting in random pauses between syscalls like above, and even packet losses due to shortage of TCP buffers.

Given the huge amount of cache thrashing we're seeing (600 MB/s), I tend to think we could be witnessing something like this. The fact that haproxy magically pauses between syscalls like this can be explained by the fact that it touches unmapped memory areas and that these ones take time to be allocated or worse, swapped in. And given that we're doing this from userspace without any syscall but consecutive to a page fault instead, it's accounted as user CPU time.

I also imagined that one process could occasionally be issuing an fsync() (after a log rotation, for example), paralyzing everything by forcing huge amounts of dirty blocks to the disks; that didn't seem to be the case, and there wasn’t ever any %iowait in the sar reports, implying that we weren’t facing a situation where a parasitic load was bogging us down in parallel.

Another point fueling the theory of memory shortage is sar's output (again), showing that memory was almost exhausted (92% including the cache) and that it started getting better at the exact second the incident happened.

To sum up: a memory shortage led to sudden, high-rate freeing of memory, which locked all CPUs for ~8 seconds. We now knew that the high CPU usage from haproxy was a symptom and not the cause.

Finding the memory eater(s)

What triggered our system to free memory at such a high rate (600MB/s), and why was it so painful for the system? Why did the kernel use so much memory (~92%) for caches while active memory always stayed below ~8GB? There were many questions to answer, which brought us back to tracing mode.

Willy suggested issuing echo 1 > /proc/sys/vm/drop_caches upon log rotation, which we did on all servers. We also issued echo 2 > /proc/sys/vm/drop_caches once in two DLB groups. Both actions calmed our issue down, but only for a short while.

Of the many processes running on our servers, we picked the 5 with the highest resident memory (RSZ) and started monitoring them very closely with pidstat. We also started monitoring memory activity and noticed a very high number of dentry objects in the cache:

Active / Total Objects (% used)        : 388963608 / 415847504 (93.5%)
Active / Total Slabs (% used)          : 19781213 / 19781213 (100.0%)
Active / Total Caches (% used)         : 69 / 101 (68.3%)
Active / Total Size (% used)           : 73098890.88K / 78163097.48K (93.5%)
Minimum / Average / Maximum Object : 0.01K / 0.19K / 15.88K


OBJS      ACTIVE     USE  OBJ SIZE  SLABS     OBJ/SLAB CACHE SIZE  NAME
414578178 387795876   0%  0.19K     19741818  21       78967272K   dentry
244998    244998    100%  0.10K     6282      39       25128K      buffer_head
160344    158020     98%  0.64K     6681      24       106896K     proc_inode_cache
158781    149401     94%  0.19K     7561      21       30244K      kmalloc-192
119744    94951      79%  0.06K     1871      64       7484K       kmalloc-64
59616     48444      81%  0.25K     1863      32       14904K      kmalloc-256

atop was also reporting ~100% of the SLAB memory as reclaimable:

MEM | tot        94.3G |  free        9.2G | cache   5.4G | dirty  18.1M  | buff   15.1M | slab   75.7G |  slrec  75.5G | shmem   4.1G | shrss   0.0M  | shswp   0.0M |

The output of the tracing tools we had put in production didn’t provide many useful indicators of which process(es) could be causing such high memory consumption for caches. The haproxy log (which is rotated every hour) had ~3.5GB of data, and dropping page caches upon log rotation had excluded rsyslogd from the investigation as well.

We started reading documentation about memory management and realized that our system might not be tuned correctly, considering that our servers have 96GB of total memory but only ~8GB of active memory, with the following memory settings in place:

  • vm.vfs_cache_pressure set at 100
  • vm.dirty_background_ratio set at 3
  • vm.dirty_ratio set at 10

So the system had a lot of free memory to use for caches – which it did – but it wasn't aggressively reclaiming memory from those caches, even when dentry objects occupied 80GB. That left the system with only around 800MB of free memory in some cases.

We changed vm.vfs_cache_pressure to 200 and freed the reclaimable slab objects (which include dentries and inodes) by issuing echo 2 > /proc/sys/vm/drop_caches. After 2 days we saw more free memory available (~7GB), and we then increased vm.vfs_cache_pressure to 1000. That made the system reclaim memory more aggressively – and the issue was almost entirely resolved.
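For reference, this is roughly how such tuning is applied (a sketch; the values are the ones described above, and all commands require root):

```shell
# Runtime changes:
#   sysctl -w vm.vfs_cache_pressure=1000   # reclaim dentries/inodes more aggressively
#   echo 2 > /proc/sys/vm/drop_caches      # one-off: free reclaimable slab objects

# Persistent across reboots, in /etc/sysctl.conf:
#   vm.vfs_cache_pressure = 1000
```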

We continued our investigation in the area of dentry caches and found this bug report for the curl tool. The report states that when curl makes an HTTPS request, it issues many access system calls for files that don’t exist, have random names, and are unique per invocation:

me at node1 in ~
strace -fc -e trace=access curl 'https://foobar.booking.com/' > /dev/null
Process 2390 attached
% time         seconds  usecs/call         calls        errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00        0.003180               0          6647          6643 access
------ ----------- ----------- --------- --------- ----------------
100.00        0.003180                      6647          6643 total


me at node1 in ~
strace -f -e trace=access curl 'https://foobar.booking.com/' 2>&1 |head -10
(...)
access("/etc/pki/nssdb", W_OK)              = -1 EACCES (Permission denied)
access("/home/me/.pki/nssdb/.3219811409_dOeSnotExist_.db", F_OK) = -1 ENOENT (No such file or directory)
access("/home/me/.pki/nssdb/.3219811410_dOeSnotExist_.db", F_OK) = -1 ENOENT (No such file or directory)
access("/home/me/.pki/nssdb/.3219811411_dOeSnotExist_.db", F_OK) = -1 ENOENT (No such file or directory)
access("/home/me/.pki/nssdb/.3219811412_dOeSnotExist_.db", F_OK) = -1 ENOENT (No such file or directory)
access("/home/me/.pki/nssdb/.3219811413_dOeSnotExist_.db", F_OK) = -1 ENOENT (No such file or directory)

We knew that we use curl in the check_cmd of each service check in the anycast-healthchecker daemon, and that each check runs every 10 seconds for ~10 services. So we fired up a one-liner to plot the number of dentry objects in the cache per second:

while true; do
    echo "$graphite_name_space.$(hostname | sed -e 's/\./_/g').dentry $(sudo slabtop -o | egrep 'dentry' | awk '{print $1}') $(date '+%s')" | nc 127.0.0.1 3002
    sleep 0.9
done

In the following graph we can see that the number of dentry objects was increasing at a high and constant rate:

dentries

Bingo! We had found the tool that was polluting the dentry cache. Finally, we could see our destination; time to prepare the cake.

The fix was very easy – just setting the environment variable NSS_SDB_USE_CACHE to YES was enough:

me at node1 in ~
NSS_SDB_USE_CACHE=YES strace -fc -e trace=access curl 'https://foobar.booking.com/' > /dev/null
Process 14247 attached
% time         seconds  usecs/call         calls        errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00        0.000009               0            32            30 access
------ ----------- ----------- --------- --------- ----------------
100.00        0.000009                        32            30 total

We adjusted the check_cmd of each service check in anycast-healthchecker that utilized curl, and we also modified a cron job which ran curl many times against an HTTPS site. In the following graph, we can clearly see that the pollution stopped, as the number of dentry objects in the cache was no longer increasing:

fix curl
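For completeness, the shape of the fix: prefix curl invocations with the variable via env(1). The URL and flags below are placeholders, not our actual check_cmd:

```shell
# Hypothetical health check with the fix applied:
#   check_cmd = env NSS_SDB_USE_CACHE=YES curl --fail --silent --max-time 2 https://127.0.0.1/health

# env(1) sets the variable only for the child process it runs:
env NSS_SDB_USE_CACHE=YES sh -c 'echo "$NSS_SDB_USE_CACHE"'   # prints YES

# For cron jobs, the variable can instead be set once at the top of the
# crontab (supported by Vixie cron):
#   NSS_SDB_USE_CACHE=YES
```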

Conclusions

  • Your sampling interval for statistics may hide problems. In our case we collect metrics every 10 seconds, and we were (eventually) able to see the issue clearly. If you collect metrics every minute on systems that receive ~50K requests per second, you won’t be able to see problems that last less than a minute. In other words, you fly blind. Choose the metrics you collect and pick the intervals very wisely.
  • Abnormal system behaviour must be investigated and its root cause found. This secures the stability of your system.
  • Reshuffling TCP connections when a single member disappears from and reappears in the ECMP group didn’t impact our traffic as badly as we initially thought it would.

I would like to thank Marcin Deranek, Carlo Rengo, Willy Tarreau and Ralf Ertzinger for their support in this journey.
