Elevated error rates and red cluster reports in US-East

Incident Report for Bonsai

Postmortem

A node in one of the multitenant Elasticsearch 2.4 server groups failed and was replaced by the Auto Scaling group (ASG). However, the replacement node did not provision correctly, which left about 20% of the users on this server group in a degraded state.

Our team was automatically paged to respond. The replacement node was fixed, and impacted clusters were allowed to recover. Some clusters without replication were restored from a recent snapshot.

At its peak, the incident impacted just under 1% of all traffic on the server group, although it affected some users disproportionately, particularly those without replication.

Posted Aug 05, 2018 - 14:38 UTC

Resolved

All clusters have recovered and are operating normally.

Posted Aug 05, 2018 - 14:31 UTC

Monitoring

Impacted clusters are recovering now.

Posted Aug 05, 2018 - 14:00 UTC

Identified

We have identified the issue and are working on a fix.

Posted Aug 05, 2018 - 13:45 UTC

Investigating

We have been automatically paged in response to errors and unhealthy clusters in the US-East region, and are investigating now.

Posted Aug 05, 2018 - 13:35 UTC

This incident affected: Region Health: Bonsai Virginia.