At 16:07 UTC (8:07am PST) our on-call engineer was alerted to a sequence of server errors and failed health checks in our EU (Ireland) region. Initial triage showed that more than 1% of production clusters in the region were affected, and the incident was escalated to second-level responders for assistance.
Of the three nodes running the cluster group, one was reported as non-responsive, and the other two were failing to re-elect a master node. Our first course of action was to manually initiate a node failover to replace the non-responsive server. A replacement node provisioned itself automatically while we investigated why the remaining nodes could not re-elect a master.
Elasticsearch is designed to be run in a distributed environment, and it relies on consensus algorithms to coordinate decisions across multiple nodes. As part of this design, production Elasticsearch clusters are commonly built with three nodes, so that a quorum can be reached and distributed decisions can still be made with two out of the three nodes online. This allows a distributed system to operate normally during many kinds of planned maintenance or unexpected failures.
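Concretely, the quorum for master election is a simple majority of the master-eligible nodes, i.e. floor(n/2) + 1. A minimal sketch of the arithmetic (the `quorum` function is illustrative, not part of the Elasticsearch API):

```python
def quorum(node_count: int) -> int:
    """Minimum number of master-eligible nodes needed to elect a master:
    a simple majority, floor(n/2) + 1."""
    return node_count // 2 + 1

# A three-node cluster needs two nodes for quorum, so it can
# keep electing a master after losing any single node.
print(quorum(3))  # → 2
```

This is why three-node clusters are the common production baseline: they are the smallest configuration that tolerates one node failure without losing the ability to reach consensus.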
In an event like this, a single node failure should not have been enough to cause a cluster outage. However, further investigation uncovered an edge-case bug in our configuration generation.
During earlier routine maintenance, an incorrect number of nodes had been reported, causing the cluster to be configured as a four-node cluster. Quorum for a four-node cluster is three nodes. With only two of the expected minimum of three nodes present, Elasticsearch safeguards kicked in and locked down the cluster.
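The failure mode follows directly from the quorum arithmetic. A sketch of the mismatch (again, `quorum` is illustrative, not an Elasticsearch API):

```python
def quorum(node_count: int) -> int:
    # Simple majority of master-eligible nodes: floor(n/2) + 1.
    return node_count // 2 + 1

live_nodes = 2        # the two healthy nodes after the failure
actual_size = 3       # the cluster's real size
configured_size = 4   # the size reported by the buggy configuration

# With the correct size, the two live nodes meet quorum (2 >= 2)
# and could have elected a master on their own:
assert live_nodes >= quorum(actual_size)

# With the misconfigured size, quorum is 3, the two live nodes
# fall short (2 < 3), and the cluster locks down:
assert live_nodes < quorum(configured_size)
```

The safeguard itself behaved correctly: refusing to elect a master below quorum is what prevents split-brain scenarios. The fault was entirely in the node count fed to the configuration.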
Once this bug was identified, it was a simple matter to correct the settings and perform a rolling restart of the affected nodes.
The cluster was restored to normal operation as of about 16:29 UTC (8:29am PST), for a total downtime of about 22 minutes. Approximately 7.5% of production clusters in our EU (Ireland) region were affected.
We strive to provide the best possible uptime, and we apologize to those affected by this incident.