Elevated 503 errors
Incident Report for Bonsai
Postmortem

In preparation for upcoming system upgrades, over the last six weeks we've been migrating user clusters onto new underlying infrastructure with more capacity and features that will improve overall system performance.

Today we had a mishap: an automated upgrade completed successfully, but a setting on our core routing layer was misconfigured. This led to a temporary loss of connectivity for about 1.1% of Starter and Production clusters in our Bonsai Virginia Region. Dedicated and Enterprise customers were not affected.

Our on-call team was paged immediately, and within a few minutes escalated the incident to P0 after confirming that the elevated levels of 503 errors corresponded to a recent maintenance event. Following escalation, we fixed the misconfigured routing setting within approximately 15 minutes.

No cluster data was lost, but according to our internal logs, the elevated 503 errors may have affected some customers for up to 30 minutes.

In reviewing the chain of events leading up to the incident, we found that the error was caused by a custom routing setting made outside of our automation systems. That setting was also outside the scope of our validation tests, so it was not caught by our pre-check tests or post-op validation checks.

As part of our review, we found that upgrades that would have prevented this were already in our backlog. Following this incident, we will move these updates into our next sprint to help prevent a recurrence. As part of these updates, we will enhance our pre- and post-automation validation checks, and further lock down our production systems to prevent manual changes from happening outside of our normal operations toolchain.
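For illustration only, here is a minimal sketch of the kind of post-op drift check we have in mind: it compares the live routing configuration against the desired state recorded by the automation toolchain and fails loudly if any setting differs. The file paths and structure are hypothetical and do not reflect our actual configuration.

```python
#!/usr/bin/env python3
"""Post-op validation sketch: compare the live routing configuration against
the desired state tracked by the automation toolchain, and fail on any drift
(for example, a manual change made outside the toolchain)."""

import json
import sys

# Hypothetical paths, for illustration: a file written by the automation
# toolchain and a dump of the live routing layer's settings.
DESIRED_STATE_PATH = "/etc/automation/routing_desired.json"
LIVE_STATE_PATH = "/var/run/routing_live.json"


def load(path):
    """Load a JSON settings file into a dict."""
    with open(path) as fh:
        return json.load(fh)


def diff_settings(desired, live):
    """Return (key, desired_value, live_value) tuples for settings that drifted."""
    drift = []
    for key, want in desired.items():
        have = live.get(key)
        if have != want:
            drift.append((key, want, have))
    return drift


def main():
    desired = load(DESIRED_STATE_PATH)
    live = load(LIVE_STATE_PATH)
    drift = diff_settings(desired, live)
    if drift:
        for key, want, have in drift:
            print(f"DRIFT: {key}: expected {want!r}, found {have!r}")
        # A non-zero exit fails the post-op check so the maintenance run
        # is flagged before customers notice.
        sys.exit(1)
    print("Routing configuration matches desired state.")


if __name__ == "__main__":
    main()
```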

Posted Aug 21, 2015 - 23:46 UTC

Resolved
We've identified and resolved the issue. Post-mortem to follow.
Posted Aug 21, 2015 - 22:07 UTC
Investigating
We are currently investigating reports of 503 errors and unavailable clusters.
Posted Aug 21, 2015 - 22:02 UTC