At 10:20am CDT (03:20 UTC) one of our Elasticsearch 6.5.4 server groups in Virginia suffered a cascading failure due to degraded performance on two of three nodes. Clusters utilizing these nodes quickly became unresponsive to requests, leading to HTTP 503 and 429 responses.
We were alerted at 10:31am (03:21 UTC) and began an investigation. Within a few minutes, the issue was escalated internally for incident response. The Bonsai team organized into incident response roles, and began working to troubleshoot and resolve the problem while communicating with impacted customers.
Our investigation showed two of three Elasticsearch nodes with high garbage collection activity, causing them to be unresponsive to requests. We placed free tier clusters into maintenance mode to help shed load while performing a restart of the affected nodes. After recovering all shards, performance returned to normal at 11:04am CDT (04:04 UTC).
During subsequent monitoring of the systems, we observed a similar performance regression begin at 11:17am CDT (04:17 UTC). We performed a similar intervention, placing free tier clusters into maintenance mode, and initiating another rolling restart of the cluster.
While intervening to stabilize this second occurrence, we provisioned additional capacity in order to rebalance clusters within the region and mitigate the potential impact of ‘noisy neighbor’ effects. Impacted production clusters were migrated to newer hardware at 11:50am CDT (04:50 UTC).
We are considering this incident resolved, however we are still reviewing the impact and considering potential root causes. Long GC pauses are rare on Bonsai, and simultaneous impacts to nodes within the same group is unprecedented. The reoccurrence of the issue within a few minutes is also highly suspect, and is a major point of focus.
Customers with questions, comments, or concerns should send an email to firstname.lastname@example.org.