To run a website like Booking.com you need a lot of servers. Not just webservers, but also mail servers, database servers, name servers, proxy servers and a lot more. How do we make sure they are all operational and doing what we like them to do? We monitor them.
The first decision we must make is what to monitor about our server health and where. Running out of disk space is an issue we want to watch out for on every machine, but we don't really care about Replication Delay on anything but database servers in a replication chain. So each server has a set of core checks, on top of which we add checks based on the role of the machine.
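To make the layering concrete, here is a minimal sketch of core checks plus role-based checks. Only disk space and replication delay come from the text above. Every other check name, the role names, and the `checks_for` helper are made up for illustration and are not our actual configuration.

```python
# Hypothetical check and role names, for illustration only.
CORE_CHECKS = {"disk_space", "load", "ntp_drift"}

ROLE_CHECKS = {
    "database_replica": {"replication_delay"},
    "mailserver": {"mail_queue_length"},
    "webserver": {"http_response"},
}


def checks_for(roles):
    """Return the core checks plus the checks for every role the machine has."""
    checks = set(CORE_CHECKS)
    for role in roles:
        # Unknown roles simply add nothing on top of the core set.
        checks |= ROLE_CHECKS.get(role, set())
    return sorted(checks)


print(checks_for(["webserver"]))
print(checks_for(["database_replica", "mailserver"]))
```

A machine with no roles still gets the core set, which matches the idea that disk space matters everywhere.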
After we've decided what to check for, we obviously want to inform our staff of any failures. We use multiple methods for this: email, XMPP (Jabber), and SMS text messages to the two lucky sysadmins who carry the phone 24/7. To make this easier on our staff, we've deployed a "follow the sun" model. That means our team in Singapore kicks the day off with monitoring. When they are done, the European team takes over. When the (West Coast) US team walks into their office, they take over, and they hand it back to Singapore when they walk out.
This leaves us with only weekends and public holidays. Since we're an Amsterdam-based company, most of our staff work in European offices, so we distribute the weekend shifts amongst these sysadmins.
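The hand-over logic could be sketched roughly like this. The shift boundaries are assumptions (the exact hand-over hours are not stated here), weekends are judged in UTC for simplicity, and public holidays are left out entirely.

```python
from datetime import datetime, timezone

# Hypothetical hand-over hours in UTC; the real boundaries are not given in the text.
SHIFTS = [
    (0, 8, "Singapore"),
    (8, 16, "Europe"),
    (16, 24, "US West Coast"),
]


def team_on_duty(now):
    """Return the team watching the monitoring at `now` (a timezone-aware datetime)."""
    now = now.astimezone(timezone.utc)
    # Monday is 0, so 5 and 6 are Saturday and Sunday.
    if now.weekday() >= 5:
        return "Europe (weekend rota)"
    for start, end, team in SHIFTS:
        if start <= now.hour < end:
            return team
    raise ValueError("shift table does not cover hour %d" % now.hour)


print(team_on_duty(datetime.now(timezone.utc)))
```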
We use Nagios for our monitoring. But since our system is rather big and complex (did I mention we monitor roughly 5000 physical servers?), we can't have all monitoring done by one machine. That would be silly, for what would happen if that machine went down? "Quis custodiet ipsos custodes?" We always run our monitoring servers in pairs so that if one goes down, the other can alert us to that event.
But two would only harden our setup a little. It wouldn't be powerful enough to monitor all our machines, due to the scale of operations. Besides the scaling factor, we would run into a lot of firewall issues. So we have a set of monitoring servers in all of our network segments. That includes a set outside of our own network.
As mentioned, we rely on Nagios. But we also rely on Puppet, which distributes the checks to our machines, and on a thing we like to call ServerDB. Some people might call this a CMDB (we don't).
In ServerDB, we store information about all our physical assets, and this is where we assign roles to them. A role simply defines what the server is supposed to do. Is it routing emails? Is it serving XML traffic? Is it sending out faxes? Every server has at least one role (multiple roles are possible).
ServerDB is also the place where we define in what state the machine is supposed to be. Is it in production? Or is it currently being set up? Is it in maintenance mode? Based on this state, we decide whether we want to wake sysadmins up when something goes bad.
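As a rough sketch of that decision: the state names below are modelled on the questions above, but the exact values ServerDB stores, the severity names, and the `should_page` helper are assumptions made for this example.

```python
# Hypothetical state names; only machines in these states may wake someone up.
PAGING_STATES = {"production"}


def should_page(state, severity):
    """Return True when a failed check should page the on-call sysadmins."""
    if state not in PAGING_STATES:
        # Machines being set up or in maintenance are expected to misbehave.
        return False
    return severity == "critical"


for state in ("production", "setup", "maintenance"):
    print(state, should_page(state, "critical"))
```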
Now that we have groups of servers (based on roles), we can add checks to them. Since we use plain old Nagios, the scripts can be in any language (we try to stick to Perl and Python) and can check pretty much anything you can dream of, as long as it has a threshold that can be detected.
All that is left now is to instruct the monitoring servers to fetch this information and use it. A few cron jobs query our calendar (who is supposed to be "on call"?) and query ServerDB (what should we check and where?).
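For readers who have never written one, a check is just a program that prints one status line and reports its result through the exit code: by Nagios plugin convention 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. Below is a minimal disk space sketch in Python. The thresholds and function names are assumptions for this example, not one of our production checks.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to a Nagios exit code (thresholds are examples)."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, "DISK UNKNOWN - %s" % exc
    if usage.total == 0:
        return UNKNOWN, "DISK UNKNOWN - %s reports zero size" % path
    used_percent = 100.0 * usage.used / usage.total
    code = classify(used_percent, warn, crit)
    return code, "DISK %s - %.1f%% used on %s" % (LABELS[code], used_percent, path)
```

To use this as an actual plugin, the script would print the status line and then call `sys.exit()` with the returned code, so that Nagios can read the result.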
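What such a cron job produces could look roughly like the sketch below: it turns a list of servers and their checks into Nagios `define service` blocks. The shape of the ServerDB export, the `check_<name>` command naming, the contact group and the `generic-service` template are all assumptions for this example.

```python
def render_service(host_name, check_name, contact_group):
    """Render one Nagios service object definition as text."""
    lines = [
        "define service {",
        # Assumes a 'generic-service' template supplies the remaining directives.
        "    use                  generic-service",
        "    host_name            %s" % host_name,
        "    service_description  %s" % check_name,
        # Hypothetical naming scheme: one command per check, prefixed with check_.
        "    check_command        check_%s" % check_name,
        "    contact_groups       %s" % contact_group,
        "}",
    ]
    return "\n".join(lines)


def render_config(servers, on_call_group):
    """Render service definitions for a list of {'name': ..., 'checks': [...]} dicts."""
    blocks = []
    for server in servers:
        for check in server["checks"]:
            blocks.append(render_service(server["name"], check, on_call_group))
    return "\n\n".join(blocks) + "\n"


# Made-up data standing in for a ServerDB query and a calendar lookup.
servers = [{"name": "db-01", "checks": ["disk_space", "replication_delay"]}]
print(render_config(servers, "oncall-this-week"))
```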
Guys! We're almost running out of space!! ZOMG@!
When a machine misbehaves badly or simply has a component failure, the system administrators are informed by SMS messages. But we don't want that much attention for every warning we get. A warning might tell us that a disk is filling up, but there's still enough space left for now. These warnings (as well as the critical errors) are usually best monitored through the Nagios web interface. But as I mentioned, we have quite a few monitoring hosts. To give us a centralised view of our entire setup, we use Multisite, which also allows us to create custom dashboards for different teams/views.
The future of monitoring
As with most of our sub-systems, scaling the business up always makes it fun and interesting. With monitoring this is no different. In the past we've used and tested other tools and Nagios forks, but so far we haven't been able to find anything that works for us. For one, we would like to use our data a little smarter. We already throw a huge amount of data at our Graphite servers, so why check again with Nagios?
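One direction this could take, purely as a sketch and not something we run: evaluate thresholds against data Graphite already has. The code below takes the JSON that Graphite's render API returns with `format=json` (a list of series, each with a `target` and `[value, timestamp]` datapoints) and classifies the latest value of each series. The metric name and thresholds are invented for the example.

```python
import json


def latest_value(series):
    """Return the most recent non-null value in one Graphite series, or None."""
    for value, _timestamp in reversed(series["datapoints"]):
        if value is not None:
            return value
    return None


def evaluate(render_json, warn, crit):
    """Map each target in a render response to OK, WARNING, CRITICAL or UNKNOWN."""
    results = {}
    for series in json.loads(render_json):
        value = latest_value(series)
        if value is None:
            status = "UNKNOWN"  # no data at all is worth knowing about too
        elif value >= crit:
            status = "CRITICAL"
        elif value >= warn:
            status = "WARNING"
        else:
            status = "OK"
        results[series["target"]] = status
    return results


# Made-up response body for a hypothetical disk usage metric.
sample = '[{"target": "servers.db-01.disk.used_percent", "datapoints": [[71.0, 1700000000], [86.5, 1700000060], [null, 1700000120]]}]'
print(evaluate(sample, warn=80, crit=90))
```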
Another problem we have to tackle is to make failing over to the secondary monitoring server (remember we operate them in pairs?) automatic.
And by the time we're all done with that, the next issue will probably present itself. We just need to make sure we read the warnings before they become critical.