At roughly 0700 UTC, performance on one of the Elasticsearch 6.5.4 nodes in the EU-West-1 (Ireland) region began to degrade. This initially presented as a flapping condition, where load and connectivity would degrade enough to trigger a pager alert to the on-call engineer, but then would recover enough to automatically cancel the page.
At roughly 0745 UTC, the node degraded to the point where the on-call engineer was alerted to respond to a critical issue. The engineer discovered that internode communication within the server group had failed, leaving the nodes unable to elect a master.
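For reference, this kind of failure is visible through Elasticsearch's standard cluster APIs: when no master can be elected, cluster-level requests fail rather than report a healthy state. The sketch below is purely illustrative of that check, using Python's standard library and a placeholder endpoint rather than any actual Bonsai URL.

```python
# Minimal sketch: checking cluster health and the elected master on an
# Elasticsearch 6.x node. The endpoint below is a placeholder, not a real URL.
import json
import urllib.request

NODE_URL = "https://example-cluster.example.com:9243"  # placeholder endpoint

def get_json(path):
    """Fetch a JSON response from the Elasticsearch REST API."""
    with urllib.request.urlopen(NODE_URL + path) as resp:
        return json.loads(resp.read())

# With a healthy cluster this reports status and node count; while master
# election is failing, the request itself typically returns HTTP 503 with a
# master_not_discovered_exception instead.
health = get_json("/_cluster/health")
print("status:", health["status"])
print("nodes:", health["number_of_nodes"])

# /_cat/master names the currently elected master node, if there is one.
master = get_json("/_cat/master?format=json")
print("master:", master[0]["node"])
```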
The engineer responded by performing a complete, simultaneous reboot of all nodes in the server group, which took all clusters offline for a minute or two. When all nodes returned to service, they rejoined the cluster, and performance stabilized by 0752 UTC.
Affected users would have seen sporadic HTTP 503 errors during the flapping period, as well as persistent 503 errors during the full restart.
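Because the 503s during the flapping period were transient, clients that retry failed requests with a short backoff would have absorbed much of the disruption. The following is a minimal, illustrative sketch of that pattern; the function name, endpoint, and timings are placeholders rather than a recommendation of specific values.

```python
# Minimal sketch of client-side retry with exponential backoff on HTTP 503.
import time
import urllib.error
import urllib.request

def request_with_retries(url, attempts=5, base_delay=0.5):
    """Issue a request, retrying on HTTP 503 with exponential backoff."""
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            # Non-503 errors and the final attempt propagate to the caller.
            if err.code != 503 or attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Example usage (placeholder URL):
# body = request_with_retries("https://example-cluster.example.com:9243/my-index/_search?q=*")
```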
We do not anticipate further service interruptions at this time. We have also taken the long-term step of retiring the affected server group, and have migrated all users to fresh Elasticsearch 6.5.4 nodes in a new server group.
Users with additional questions or concerns should inquire at support@bonsai.io.