Investigating reports of errors for some production clusters in Virginia

Incident Report for Bonsai

Postmortem

On December 1, 2016, starting around 14:00 UTC, our on-call ops team was paged due to failing health checks on several production clusters on our shared tier in Virginia. After triaging the incident, we identified several indices reporting 'Red' health, indicating corrupt indices, and we escalated to our Level-3 Ops on-call. We made the decision that in this case restoring from backups would be the quickest resolution, and immediately proceeded to restore affected indices from the latest hourly backup. By 14:45 UTC, affected indices were restored from backups and normal operation restored. We sincerely apologize to the users and staff of those affected for the service interruption.

The data corruption in this case was caused by a unusual cluster configuration left over from a prior maintenance operation. And we had a gap in our automation tooling and maintenance playbooks that resulted in some active data remaining on old servers that were scheduled for deprovisioning. When the old servers were deprovisioned as part of regular off-hours maintenance, it resulted in data corruption for indices with data on those servers. We've reviewed and updated our procedures to prevent this from happening in the future.

Posted Dec 01, 2016 - 15:32 UTC

Resolved

The incident is resolved, and we will be following up later with a post-mortem.

Posted Dec 01, 2016 - 15:12 UTC

Monitoring

Affected indices should be recovered, and we are monitoring the results.

Posted Dec 01, 2016 - 14:48 UTC

Identified

We've identified the issue, and for affected indices we are currently restoring from backups.

Posted Dec 01, 2016 - 14:20 UTC

Investigating

We are currently investigating this issue.

Posted Dec 01, 2016 - 14:05 UTC