Nodes unavailable in Ireland

Incident Report for Bonsai

Postmortem

At 10:20pm CDT on 2019-09-18 (03:20 UTC on 2019-09-19), Bonsai operators were notified that some customers' clusters were in a red state. On inspection, the operators found that AWS had rotated a server and that clusters containing unreplicated indices had gone red as a result. Once we restored the indices from the latest snapshot, the cluster state changed to green and an internal all-clear was sounded.

At 4:00am CDT (09:00 UTC) on 2019-09-19, an operator responded to user reports that some clusters in Ireland were still inaccessible and returning an increased rate of 503s. We determined that, for a subset of clusters in the region, our proxy layer still held outdated IP addresses after the earlier EC2 instance replacement. We refreshed the proxy with the latest IP addresses and restored the affected clusters to normal operation.

During our follow-up review, we determined that the affected clusters were a subset of EU region clusters created between 3 and 8 months ago. We also found that a process designed to synchronize routing metadata was not operating correctly. We have since deployed an updated process that checks and synchronizes routing information as needed, to prevent a similar incident from recurring.
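
For readers unfamiliar with this kind of check, the following is a minimal sketch of a routing reconciliation pass, not Bonsai's actual process. It assumes routing state can be modeled as a mapping from cluster name to a set of node IP addresses; the names `find_stale_routes`, `synchronize`, `proxy_routes` and `live_nodes` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set

# Hypothetical model: cluster name -> set of node IP addresses.
Routes = Dict[str, Set[str]]


def find_stale_routes(proxy_routes: Routes, live_nodes: Routes) -> Routes:
    """Return the live address set for every cluster whose proxy entry differs."""
    stale: Routes = {}
    for cluster, live in live_nodes.items():
        # A missing proxy entry is treated as an empty address set.
        if proxy_routes.get(cluster, set()) != live:
            stale[cluster] = set(live)
    return stale


def synchronize(proxy_routes: Routes, live_nodes: Routes) -> List[str]:
    """Overwrite stale proxy entries in place and return the clusters updated."""
    updates = find_stale_routes(proxy_routes, live_nodes)
    for cluster, addresses in updates.items():
        proxy_routes[cluster] = addresses
    return sorted(updates)


if __name__ == "__main__":
    # Assumed example data: cluster "b" had its instance replaced.
    proxy = {"a": {"10.0.0.1"}, "b": {"10.0.0.2"}}
    live = {"a": {"10.0.0.1"}, "b": {"10.0.0.9"}}
    print(synchronize(proxy, live))
```

Running a pass like this on a schedule, rather than only when an instance is replaced, means a missed update is corrected on the next pass. A second pass over already synchronized data makes no changes.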

Posted Sep 19, 2019 - 21:47 UTC

Resolved

This incident has been resolved.

Posted Sep 19, 2019 - 17:05 UTC

Monitoring

All impacted clusters have been updated and service is back to normal. The Bonsai team will be monitoring the situation.

Posted Sep 19, 2019 - 09:27 UTC

Identified

The Bonsai team has identified the issue and is rolling out an update to address it.

Posted Sep 19, 2019 - 09:00 UTC

Investigating

We are currently seeing an elevated rate of 503 errors for some Ireland clusters.

Posted Sep 19, 2019 - 05:00 UTC

This incident affected: Region Health: Bonsai Ireland.