At roughly 0700 UTC, performance on one of the Elasticsearch 6.5.4 nodes in the EU-West-1 (Ireland) region began to degrade. This initially presented as a flapping condition, where load and connectivity would degrade enough to trigger a pager alert to the on-call engineer, but then would recover enough to automatically cancel the page.
At roughly 0745 UTC, the node degraded to the point where the on-call engineer was alerted to respond to a critical issue. The engineer discovered that internode communication within the server group had failed, leaving the nodes unable to elect a master.
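For reference, this kind of failure is visible through Elasticsearch's standard cluster APIs: when no master can be elected, cluster-level requests fail rather than report a healthy state. The sketch below is purely illustrative of that check, using Python's standard library and a placeholder endpoint rather than any actual Bonsai URL.

```python
# Minimal sketch: checking cluster health and the elected master on an
# Elasticsearch 6.x node. The endpoint below is a placeholder, not a real URL.
import json
import urllib.request

NODE_URL = "https://example-cluster.example.com:9243"  # placeholder endpoint

def get_json(path):
    """Fetch a JSON response from the Elasticsearch REST API."""
    with urllib.request.urlopen(NODE_URL + path) as resp:
        return json.loads(resp.read())

# With a healthy cluster this reports status and node count; while master
# election is failing, the request itself typically returns HTTP 503 with a
# master_not_discovered_exception instead.
health = get_json("/_cluster/health")
print("status:", health["status"])
print("nodes:", health["number_of_nodes"])

# /_cat/master names the currently elected master node, if there is one.
master = get_json("/_cat/master?format=json")
print("master:", master[0]["node"])
```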
The engineer responded by performing a complete, simultaneous reboot of all nodes in the server group, which took all clusters offline for a minute or two. When all nodes returned to service, they rejoined the cluster, and performance stabilized by 0752 UTC.
Affected users would have seen sporadic HTTP 503 errors during the flapping period, as well as persistent 503 errors during the full restart.
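Because the 503s during the flapping period were transient, clients that retry failed requests with a short backoff would have absorbed much of the disruption. The following is a minimal, illustrative sketch of that pattern; the function name, endpoint, and timings are placeholders rather than a recommendation of specific values.

```python
# Minimal sketch of client-side retry with exponential backoff on HTTP 503.
import time
import urllib.error
import urllib.request

def request_with_retries(url, attempts=5, base_delay=0.5):
    """Issue a request, retrying on HTTP 503 with exponential backoff."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Non-503 errors and the final attempt propagate to the caller.
            if err.code != 503 or attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Example usage (placeholder URL):
# body = request_with_retries("https://example-cluster.example.com:9243/my-index/_search?q=*")
```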
We do not anticipate further service interruptions at this time. We have also taken the long-term step of retiring the affected server group, and have migrated all users to fresh Elasticsearch 6.5.4 nodes in a new server group.
Users with additional questions or concerns should inquire at support@bonsai.io.