A node in one of the multitenant Elasticsearch 2.4 server groups failed and was replaced by the ASG. However, the replacement node did not provision correctly, leading to a degraded state for about 20% of the users on this server group.
Our team was automatically paged to respond. The replacement node was fixed, and impacted clusters were allowed to recover. Some clusters without replication were restored from a recent snapshot.
At it’s peak, the incident impacted just under 1% of all traffic on the server group, although this affected some users disproportionately, particularly those without replication.