At 16:07 UTC (8:07am PST) our on-call engineer was alerted to a sequence of server errors and failed health checks in our EU (Ireland) region. Initial triage showed that more than 1% of production clusters in the region were affected, and the incident was escalated to second-level responders for assistance.
Of the three nodes running the cluster group, one was reported as non-responsive, and the other two were failing to re-elect a master node. Our first course of action was to manually initiate a node failover to replace the non-responsive server. A replacement node provisioned itself automatically while we investigated why the remaining nodes could not re-elect a master.
Elasticsearch is designed to be run in a distributed environment, and it relies on consensus algorithms to coordinate decisions across multiple nodes. As part of this design, production Elasticsearch clusters are commonly built with three nodes, so that a quorum can be reached and distributed decisions can still be made with two out of the three nodes online. This allows a distributed system to operate normally during many kinds of planned maintenance or unexpected failures.
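Concretely, the quorum for master election is a simple majority of the master-eligible nodes, i.e. floor(n/2) + 1. A minimal sketch of the arithmetic (the `quorum` function is illustrative, not part of the Elasticsearch API):

```python
def quorum(node_count: int) -> int:
    """Minimum number of master-eligible nodes needed to elect a master:
    a simple majority, floor(n/2) + 1."""
    return node_count // 2 + 1

# A three-node cluster needs two nodes for quorum, so it can
# keep electing a master after losing any single node.
print(quorum(3))  # → 2
```

This is why three-node clusters are the common production baseline: they are the smallest configuration that tolerates one node failure without losing the ability to reach consensus.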
In an event like this, a single node failure should not have been enough to cause a cluster outage. However, further investigation uncovered an edge-case bug in our configuration generation.
During earlier routine maintenance, an incorrect number of nodes had been reported, causing the cluster to be configured as a four-node cluster. Quorum for a four-node cluster is three nodes. With only two of the expected minimum of three nodes present, Elasticsearch safeguards kicked in and locked down the cluster.
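The failure mode follows directly from the quorum arithmetic. A sketch of the mismatch (again, `quorum` is illustrative, not an Elasticsearch API):

```python
def quorum(node_count: int) -> int:
    # Simple majority of master-eligible nodes: floor(n/2) + 1.
    return node_count // 2 + 1

live_nodes = 2        # the two healthy nodes after the failure
actual_size = 3       # the cluster's real size
configured_size = 4   # the size reported by the buggy configuration

# With the correct size, the two live nodes meet quorum (2 >= 2)
# and could have elected a master on their own:
assert live_nodes >= quorum(actual_size)

# With the misconfigured size, quorum is 3, the two live nodes
# fall short (2 < 3), and the cluster locks down:
assert live_nodes < quorum(configured_size)
```

The safeguard itself behaved correctly: refusing to elect a master below quorum is what prevents split-brain scenarios. The fault was entirely in the node count fed to the configuration.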
Once this bug was identified, it was a simple matter to correct the settings and perform a rolling restart of the affected nodes.
The cluster was restored to normal operation as of about 16:29 UTC (8:29am PST), for a total downtime of about 22 minutes. Approximately 7.5% of production clusters in our EU (Ireland) region were affected.
We strive to provide the best possible uptime, and we apologize to those affected by this incident.