Elevated error rates and red cluster reports in US-East
Incident Report for Bonsai
Postmortem

A node in one of the multitenant Elasticsearch 2.4 server groups failed and was replaced by the Auto Scaling group (ASG). However, the replacement node did not provision correctly, leaving about 20% of the users on this server group in a degraded state.

Our team was automatically paged to respond. The replacement node was fixed, and impacted clusters were allowed to recover. Some clusters without replication were restored from a recent snapshot.

At its peak, the incident impacted just under 1% of all traffic on the server group, although it affected some users disproportionately, particularly those without replication.

Posted 4 months ago. Aug 05, 2018 - 14:38 UTC

Resolved
All clusters have recovered and are operating normally.
Posted 4 months ago. Aug 05, 2018 - 14:31 UTC
Monitoring
Impacted clusters are recovering now.
Posted 4 months ago. Aug 05, 2018 - 14:00 UTC
Identified
We have identified the issue and are working on a fix.
Posted 4 months ago. Aug 05, 2018 - 13:45 UTC
Investigating
We have been automatically paged in response to some errors and unhealthy clusters in the US-East region, and are investigating now.
Posted 4 months ago. Aug 05, 2018 - 13:35 UTC
This incident affected: Region Health: Bonsai Virginia.