At approximately 21:40 UTC, our operations team was alerted to a regression in the US-East region. The regression was impacting around 4% of the Elasticsearch 2.4.2 clusters in the region.
On investigation, it was determined that a bug in Elasticsearch 2.4 had exhausted an internal thread queue, resulting in requests being routed to other nodes. This led to a cascading failure as the internal queues in the remaining nodes filled up, causing subsequent connection requests to be rejected across the server group. For end users, this manifested as a spike in HTTP 429 responses from their cluster.
A rolling restart was initiated to purge the queues and reset the cluster state. For most users, this resolved the problem immediately, although some users with some large shards remained in a "yellow" cluster state for a period of time.