At 10:20pm CDT (03:20 UTC) on 2019-09-19 Bonsai operators were notified that a portion of our customers were experiencing a red state on their clusters. Upon inspection the operators discovered that AWS had rotated a server and clusters that contained unreplicated indices had failed to a red state. When we restored the indices from the latest snapshot, the cluster state changed to green and an internal all clear was sounded.
At 4:00am CDT (09:00 UTC) on 2019-09-19 an operator responded to reports by users that some clusters in Ireland were still not accessible and reporting an increase in 503s. It was determined that for a subset of clusters in the region, our proxy layer contained outdated IP addresses following the previous EC2 instance replacement. We refreshed the proxy with the latest IP addresses and restored affected clusters to normal operation.
During our followup review, we determined that the affected clusters were a subset of EU region clusters created between 8 months and 3 months ago. Furthermore, a process designed to synchronize routing metadata was not operating correctly. We have since deployed an updated process to check and synchronize routing information as needed to prevent a similar incident from happening again in the future.