Increased 503s on some clusters in Virginia running version 6.5.4
Incident Report for Bonsai
Postmortem

At 10:20am CDT (15:20 UTC) one of our Elasticsearch 6.5.4 server groups in Virginia suffered a cascading failure due to degraded performance on two of its three nodes. Clusters utilizing these nodes quickly became unresponsive to requests, leading to HTTP 503 and 429 responses.

We were alerted at 10:31am CDT (15:31 UTC) and began an investigation. Within a few minutes, the issue was escalated internally for incident response. The Bonsai team organized into incident response roles and began working to troubleshoot and resolve the problem while communicating with impacted customers.

Our investigation showed two of the three Elasticsearch nodes with high garbage collection activity, causing them to be unresponsive to requests. We placed free tier clusters into maintenance mode to help shed load while performing a restart of the affected nodes. After recovering all shards, performance returned to normal at 11:04am CDT (16:04 UTC).
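The report does not say how GC pressure was identified; one common approach is to poll Elasticsearch's `_nodes/stats/jvm` API and flag nodes whose cumulative old-generation collection time is abnormally high. The sketch below works on a payload shaped like that API's response (node IDs, names, and the threshold here are illustrative assumptions):

```python
def flag_gc_pressure(stats, max_old_gc_millis=10_000):
    """Return the names of nodes whose cumulative old-gen GC time
    exceeds a threshold, given a _nodes/stats/jvm-shaped dict."""
    flagged = []
    for node in stats["nodes"].values():
        # Old-generation collector stats live under jvm.gc.collectors.old
        old = node["jvm"]["gc"]["collectors"]["old"]
        if old["collection_time_in_millis"] > max_old_gc_millis:
            flagged.append(node["name"])
    return flagged

# Hypothetical response excerpt: node-1 has spent 48s in old-gen GC,
# node-2 is healthy.
sample = {
    "nodes": {
        "a1": {"name": "node-1", "jvm": {"gc": {"collectors": {
            "old": {"collection_count": 12, "collection_time_in_millis": 48_000}}}}},
        "b2": {"name": "node-2", "jvm": {"gc": {"collectors": {
            "old": {"collection_count": 3, "collection_time_in_millis": 900}}}}},
    }
}
print(flag_gc_pressure(sample))  # ['node-1']
```

In practice the raw counters are cumulative since node start, so a monitoring system would compare deltas between polls rather than absolute values.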

During subsequent monitoring of the systems, we observed a similar performance regression begin at 11:17am CDT (16:17 UTC). We performed a similar intervention, placing free tier clusters into maintenance mode and initiating another rolling restart of the cluster.
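The report doesn't detail the restart procedure, but the standard Elasticsearch rolling restart disables shard allocation before each node restart and re-enables it afterwards, so shards are not needlessly relocated while a node is down. A sketch against an assumed local cluster endpoint:

```shell
# Disable shard allocation before restarting a node
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.enable": "none"}}'

# ... restart the Elasticsearch process on the node ...

# Re-enable allocation once the node rejoins, then wait for green
curl -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.enable": "all"}}'

curl "localhost:9200/_cluster/health?wait_for_status=green"
```

This is repeated node by node; "recovering all shards" in the timeline above corresponds to waiting for the cluster to return to green after each step.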

While intervening to stabilize this second occurrence, we provisioned additional capacity to rebalance clusters within the region and mitigate the potential impact of ‘noisy neighbor’ effects. Impacted production clusters were migrated to newer hardware at 11:50am CDT (16:50 UTC).

We are considering this incident resolved; however, we are still reviewing the impact and considering potential root causes. Long GC pauses are rare on Bonsai, and simultaneous impact to nodes within the same group is unprecedented. The recurrence of the issue within a few minutes is also highly suspicious, and is a major point of focus.

Customers with questions, comments, or concerns should send an email to support@bonsai.io.

Posted Jun 18, 2019 - 19:43 UTC

Resolved
This incident has been resolved, and starter clusters have been re-enabled.
Posted Jun 18, 2019 - 17:03 UTC
Update
We have provisioned additional capacity. Production clusters will have a brief read-only period as we migrate them to the new hardware.

Starter clusters remain under maintenance.
Posted Jun 18, 2019 - 16:45 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jun 18, 2019 - 16:33 UTC
Investigating
We're seeing more signs of performance regression and our team is currently investigating. All starter clusters have been placed in maintenance mode.
Posted Jun 18, 2019 - 16:21 UTC
Monitoring
The cluster group has stabilized, and starter clusters that had been placed in maintenance mode have been re-enabled.
Posted Jun 18, 2019 - 16:04 UTC
Update
We are continuing to work on a fix for this issue.
Posted Jun 18, 2019 - 16:03 UTC
Identified
The issue was identified shortly after increased errors were reported by our systems, and our team is working on a fix.
Posted Jun 18, 2019 - 16:00 UTC
This incident affected: Region Health: Bonsai Virginia.