[February 8, 2021]
Today we experienced a brief outage, followed by a few hours of degraded performance, caused by one of our upstream providers performing unannounced maintenance on one of our critical database nodes.
Incident timeline
(all times are UTC)
- 11:01 – our main database node suddenly went offline
- 11:01 – 11:10 – we started debugging possible causes, only to eventually find the “maintenance” notice on our upstream provider’s status page. We had not been emailed or otherwise notified beforehand about this maintenance, which, per its description, would last 3 hours.
- 11:19 – by this time we had decided to switch everything over to the main database slave node, and started by bringing the main website back online, along with the Server Monitoring data collection gateway and the White Label reports.
- 11:25 – by this time the Uptime Monitoring and Server Monitoring services are fully functional again, running on the main database slave node.
- 11:26 – we switch over the Blacklist Monitoring queue as well and power it up, but in doing so we start observing degraded performance on the main database slave node, so we temporarily power the Blacklist Monitoring service back off.
- 12:33 – the main database server comes back online and we begin the database integrity checks.
- 12:48 – the integrity checks return no errors, and the main database server is ready to catch up on the data it missed from the main database slave node. We begin the replication; the main database node is roughly 4,800 seconds (about 80 minutes) behind and has a lot of catching up to do, but the process runs smoothly, with the node catching up quite fast (a sketch of how this catch-up can be tracked follows the timeline).
- 12:49 – we retry starting the Blacklist Monitoring queue, but it causes degraded performance again, so we keep it turned off for now, in order for the Uptime Monitoring service to keep working properly.
- 14:12 – the main database node has caught up with its replication from the main database slave node, and is now ready to take over. We briefly power down services in order to move everything back to the main database node.
- 14:33 – by this time everything had been moved back to the main database node, and all of the services (Uptime Monitoring, Server Monitoring, Blacklist Monitoring) were fully functional again. Blacklist Monitoring had accumulated a backlog of over 100k monitors, all of which was processed within the next 2 hours.
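For context on the catch-up described between 12:48 and 14:12, here is a minimal sketch of how replication lag can be tracked while a node catches up. It assumes MySQL-style replication and the third-party PyMySQL driver; the actual database engine isn't named in this post, and the hostname and credentials below are placeholders.

```python
import time

import pymysql.cursors

# Minimal sketch: poll the node that is catching up until it reports zero lag.
# MySQL-style replication and PyMySQL are assumptions; the hostname and
# credentials are placeholders, not real values.
conn = pymysql.connect(
    host="db-node.internal",  # hypothetical: the node that is catching up
    user="monitor",
    password="***",
    cursorclass=pymysql.cursors.DictCursor,
)

while True:
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        status = cur.fetchone()
    lag = status["Seconds_Behind_Master"] if status else None
    print(f"replication lag: {lag} seconds")
    # During the catch-up above this would start around ~4800 seconds
    # and trend toward 0 before services are switched back at 14:12.
    if lag == 0:
        break
    time.sleep(30)
```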
Service impact
- Uptime Monitoring and Server Monitoring both suffered an initial 18-minute unexpected outage, plus another ~3-minute planned outage when we switched the systems back to the main database node.
- The Blacklist Monitoring system suffered about 3 hours of downtime in total, being turned on and off throughout the incident.
- The platform suffered some delay in sending out webhook notifications at the beginning and end of the incident, due to the switches between database nodes.
We decided to “sacrifice” the Blacklist Monitoring service during this incident in order to keep the other services running smoothly. Unlike Uptime Monitoring and Server Monitoring, which are critical for our users and must run every minute, the Blacklist Monitoring system can tolerate some downtime, as long as its backlog is processed afterwards, which it was.
Corrective actions
- We’ll look into why the main database slave suffered degraded performance when taking over as the main database node, and we’ll perform any optimizations or upgrades needed to bring it up to the required performance standard.
- We’ll optimize the procedure for moving all of the services from one database node to another, as we feel this process could go faster, minimizing the downtime in any similar future scenario (a rough switchover sketch follows this list).
- We’ve contacted our upstream provider to find out why we weren’t notified ahead of this maintenance, which would have allowed us to better prepare for the downtime.
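To make the second corrective action more concrete, here is a rough sketch of the switchover order of operations described in the timeline: stop the dependent services, wait for the target database node to fully catch up, repoint the services, then start them again. It relies on the same MySQL-style replication and PyMySQL assumptions as the earlier sketch, and the systemd unit names, hostname, and credentials are hypothetical placeholders rather than the actual setup.

```python
import subprocess
import time

import pymysql.cursors

# Hypothetical placeholders, not the real service units or hosts.
SERVICES = ["uptime-monitoring", "server-monitoring", "blacklist-monitoring"]
TARGET_DB_HOST = "db-master.internal"


def replication_lag(conn) -> float:
    """Replication lag in seconds on the node catching up (MySQL-style, as assumed earlier)."""
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")
        row = cur.fetchone()
    lag = row["Seconds_Behind_Master"] if row else None
    return float("inf") if lag is None else float(lag)


def switchover() -> None:
    conn = pymysql.connect(
        host=TARGET_DB_HOST,
        user="monitor",
        password="***",
        cursorclass=pymysql.cursors.DictCursor,
    )

    # 1. Stop the dependent services so no writes land mid-switch.
    for svc in SERVICES:
        subprocess.run(["systemctl", "stop", svc], check=True)

    # 2. Wait until the target node has fully caught up.
    while replication_lag(conn) > 0:
        time.sleep(5)

    # 3. Repoint the services at the target node (the mechanism depends on the
    #    actual setup: config management, DNS, a proxy, etc.).

    # 4. Start the services again; the heaviest consumer (Blacklist Monitoring)
    #    can be brought back last, mirroring what was done during the incident.
    for svc in SERVICES:
        subprocess.run(["systemctl", "start", svc], check=True)


if __name__ == "__main__":
    switchover()
```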