Increased 503s on some clusters in Virginia running version 6.5.4

Incident Report for Bonsai

Postmortem

At 10:20am CDT (15:20 UTC), one of our Elasticsearch 6.5.4 server groups in Virginia suffered a cascading failure due to degraded performance on two of its three nodes. Clusters using these nodes quickly became unresponsive to requests, leading to HTTP 503 and 429 responses.

We were alerted at 10:31am CDT (15:21 UTC) and began an investigation. Within a few minutes, the issue was escalated internally for incident response. The Bonsai team organized into incident response roles and began working to troubleshoot and resolve the problem while communicating with impacted customers.

Our investigation showed two of the three Elasticsearch nodes with high garbage collection activity, causing them to be unresponsive to requests. We placed free-tier clusters into maintenance mode to help shed load while performing a restart of the affected nodes. After all shards had recovered, performance returned to normal at 11:04am CDT (16:04 UTC).
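For readers who want to watch for similar symptoms on their own Elasticsearch nodes, the following is a minimal sketch of one way to estimate garbage collection overhead. It is not Bonsai's internal tooling. It assumes two parsed JSON responses from the Elasticsearch nodes stats API (`GET _nodes/stats/jvm`) taken a known interval apart, and it compares the cumulative `collection_time_in_millis` counters. The function names and the 50% threshold are illustrative assumptions.

```python
from typing import Dict, List


def gc_overhead(before: dict, after: dict, interval_ms: float,
                collector: str = "old") -> Dict[str, float]:
    """Fraction of the sampling interval each node spent in the given collector.

    `before` and `after` are parsed nodes-stats responses (hypothetical inputs
    here); `interval_ms` is the wall-clock time between the two samples.
    """
    overhead: Dict[str, float] = {}
    for node_id, node_after in after["nodes"].items():
        node_before = before["nodes"].get(node_id)
        if node_before is None:
            continue  # node joined between samples; no baseline to compare
        t0 = node_before["jvm"]["gc"]["collectors"][collector]["collection_time_in_millis"]
        t1 = node_after["jvm"]["gc"]["collectors"][collector]["collection_time_in_millis"]
        overhead[node_after.get("name", node_id)] = (t1 - t0) / interval_ms
    return overhead


def nodes_over_threshold(overhead: Dict[str, float],
                         threshold: float = 0.5) -> List[str]:
    """Names of nodes whose GC overhead meets or exceeds the (assumed) threshold."""
    return sorted(name for name, frac in overhead.items() if frac >= threshold)
```

A node that spends most of a sampling interval in old-generation collection has little time left to serve requests, which is consistent with the unresponsiveness described above. The right threshold depends on the workload and heap configuration.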

During subsequent monitoring of the systems, we observed a similar performance regression begin at 11:17am CDT (16:17 UTC). We performed a similar intervention, placing free-tier clusters into maintenance mode and initiating another rolling restart of the cluster.

While intervening to stabilize this second occurrence, we provisioned additional capacity in order to rebalance clusters within the region and mitigate the potential impact of ‘noisy neighbor’ effects. Impacted production clusters were migrated to newer hardware at 11:50am CDT (16:50 UTC).

We are considering this incident resolved; however, we are still reviewing the impact and considering potential root causes. Long GC pauses are rare on Bonsai, and simultaneous impact to multiple nodes within the same group is unprecedented. The recurrence of the issue within a few minutes is also highly suspect, and is a major point of focus.

Customers with questions, comments, or concerns should send an email to [email protected].

Posted Jun 18, 2019 - 19:43 UTC

Resolved

This incident has been resolved, and starter clusters have been re-enabled.

Posted Jun 18, 2019 - 17:03 UTC

Update

We have provisioned additional capacity. Production clusters will have a brief read-only period as we migrate them to the new hardware.

Starter clusters remain under maintenance.

Posted Jun 18, 2019 - 16:45 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jun 18, 2019 - 16:33 UTC

Investigating

We're seeing more signs of performance regression and our team is currently investigating. All starter clusters have been placed in maintenance mode.

Posted Jun 18, 2019 - 16:21 UTC

Monitoring

The cluster group is stabilized, and starter clusters that had been placed in maintenance mode have been re-enabled.

Posted Jun 18, 2019 - 16:04 UTC

Update

We are continuing to work on a fix for this issue.

Posted Jun 18, 2019 - 16:03 UTC

Identified

The issue was identified shortly after increased errors were reported by our systems, and our team is working on a fix.

Posted Jun 18, 2019 - 16:00 UTC

This incident affected: Region Health: Bonsai Virginia.