On December 1, 2016, starting around 14:00 UTC, our on-call ops team was paged due to failing health checks on several production clusters on our shared tier in Virginia. After triaging the incident, we identified several indices reporting 'Red' health, indicating corrupt indices, and we escalated to our Level-3 Ops on-call. We decided that, in this case, restoring from backups would be the quickest resolution, and immediately proceeded to restore affected indices from the latest hourly backup. By 14:45 UTC, affected indices were restored from backups and normal operation was restored. We sincerely apologize to the users and staff of those affected for the service interruption.
The data corruption in this case was caused by an unusual cluster configuration left over from a prior maintenance operation. A gap in our automation tooling and maintenance playbooks resulted in some active data remaining on old servers that were scheduled for deprovisioning. When the old servers were deprovisioned as part of regular off-hours maintenance, indices with data on those servers were corrupted. We've reviewed and updated our procedures to prevent this from happening in the future.