Some clusters in US-East-1 reporting readonly errors

Incident Report for Bonsai

Postmortem

Earlier today, an Elasticsearch 6.2 node in a Virginia-based multitenant server group was terminated and replaced by our autoscaling group. This resulted in some shards being shuffled around to the remaining nodes. While waiting for the replacement node to come online, the remaining nodes quickly filled up with data and reached the “flood stage” watermark.

When the flood stage watermark is reached, Elasticsearch automatically sets all indices to readonly as a safety measure; this halts further growth in disk usage and protects against data corruption.
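To illustrate the threshold behavior described above, here is a minimal sketch that classifies a node's disk usage against the disk watermarks. It assumes the stock Elasticsearch 6.x defaults (85% low, 90% high, 95% flood stage); a given cluster may override them, and the function name `watermark_state` is hypothetical.

```python
# Assumed Elasticsearch 6.x default disk watermarks, as percent of disk used.
# Real clusters can override these via cluster settings.
LOW_WATERMARK = 85.0
HIGH_WATERMARK = 90.0
FLOOD_STAGE_WATERMARK = 95.0


def watermark_state(used_percent: float) -> str:
    """Return the highest watermark that the given disk usage exceeds."""
    if used_percent > FLOOD_STAGE_WATERMARK:
        # Indices with shards on this node get a read-only block.
        return "flood_stage"
    if used_percent > HIGH_WATERMARK:
        # Elasticsearch tries to relocate shards away from this node.
        return "high"
    if used_percent > LOW_WATERMARK:
        # No new shards are allocated to this node.
        return "low"
    return "ok"
```

With these assumed defaults, a node whose disk is 96% full is past the flood stage, while a node at 92% is only past the high watermark.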

We were paged about several issues in rapid succession and responded quickly. During the response we uncovered two underlying issues. First, our cleanup scripts were not running properly, which left disk utilization higher than intended. Second, the indices were not automatically taken out of readonly mode after the new node had joined the group and the shards had rebalanced.
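For reference, the readonly state applied at the flood stage is an index-level setting, `index.blocks.read_only_allow_delete`, and in Elasticsearch 6.x it has to be reset explicitly once disk space is available again. The sketch below builds that request using only the Python standard library. The helper name `build_clear_readonly_request` and the example URL are hypothetical, and this is not necessarily how the block was cleared during this incident.

```python
import json
import urllib.request


def build_clear_readonly_request(
    base_url: str, index_pattern: str = "_all"
) -> urllib.request.Request:
    """Build a PUT request that resets the flood-stage read-only block."""
    # Setting the value to null restores the default (no block).
    body = json.dumps({"index.blocks.read_only_allow_delete": None})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index_pattern}/_settings",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical endpoint; building the request does not contact the cluster.
request = build_clear_readonly_request("http://localhost:9200")
```

Passing the request to `urllib.request.urlopen` would send it. Doing so only makes sense after disk usage has dropped back below the flood stage watermark, since otherwise Elasticsearch will reapply the block.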

We’re working on several long-term fixes to prevent a recurrence of this issue. If you have any questions or concerns, please feel free to reach out to us at [email protected].

Posted Mar 25, 2019 - 15:09 UTC

Resolved

This incident has been resolved.

Posted Mar 25, 2019 - 14:59 UTC

Identified

We have been paged in response to an incident that has placed some clusters in a readonly state. The errors will be resolved shortly.

Posted Mar 25, 2019 - 13:58 UTC

This incident affected: Region Health: Bonsai Virginia.