Elevated error rates and red cluster reports in US-East
Incident Report for Bonsai
Postmortem

A node in one of the multitenant Elasticsearch 2.4 server groups failed and was automatically replaced by the Auto Scaling group (ASG). However, the replacement node did not provision correctly, leading to a degraded state for about 20% of the users on this server group.

Our team was automatically paged to respond. The replacement node was fixed, and impacted clusters were allowed to recover. Some clusters without replication were restored from a recent snapshot.

At its peak, the incident impacted just under 1% of all traffic on the server group, although it affected some users disproportionately, particularly those without replication.

Posted Aug 05, 2018 - 14:38 UTC

Resolved
All clusters have recovered and are operating normally.
Posted Aug 05, 2018 - 14:31 UTC
Monitoring
Impacted clusters are recovering now.
Posted Aug 05, 2018 - 14:00 UTC
Identified
We have identified the issue and are working on a fix.
Posted Aug 05, 2018 - 13:45 UTC
Investigating
We have been automatically paged in response to some errors and unhealthy clusters in the US-East region, and are investigating now.
Posted Aug 05, 2018 - 13:35 UTC
This incident affected: Region Health: Bonsai Virginia.