Concurrency Issues in US East

Incident Report for Bonsai

Postmortem

At approximately 21:40 UTC, our operations team was alerted to a regression in the US-East region. The regression was impacting around 4% of the Elasticsearch 2.4.2 clusters in the region.

On investigation, it was determined that a bug in Elasticsearch 2.4 had exhausted an internal thread queue, resulting in requests being routed to other nodes. This led to a cascading failure as the internal queues in the remaining nodes filled up, causing subsequent connection requests to be rejected across the server group. For end users, this manifested as a spike in HTTP 429 responses from their cluster.

A rolling restart was initiated to purge the queues and reset the cluster state. For most users, this resolved the problem immediately, although some users with some large shards remained in a "yellow" cluster state for a period of time.

Posted Jul 10, 2018 - 23:03 UTC

Resolved

This incident has been resolved.

Posted Jul 10, 2018 - 22:47 UTC

This incident affected: Region Health: Bonsai Virginia.