Service-wide outage
Incident Report for Bonsai
Postmortem

On Wednesday, June 19th, One More Cloud's services Websolr.com and Bonsai.io suffered a major extended outage. This outage was the result of an attack on our systems using a compromised API key.

We are so sorry for the severe impact to our customers and their businesses, and we regret the circumstances which made it possible.

Timeline

On Wednesday morning, around 5:30am PDT, our monitoring systems paged our on-call engineer with an alert of a major outage. It was quickly determined that both Websolr and Bonsai were unresponsive in all availability zones in all regions, an unprecedented scope of outage, prompting an immediate escalation to alert and wake up our entire team.

Continued investigation showed a deliberate API-initiated mass-termination of all instances in One More Cloud's AWS account. Because none of our automated systems have the ability to perform any destructive actions, and based on the early morning timing of the attack, we were able to rule out an accident or bug almost immediately.

Our conclusion was that our account was compromised and our systems under attack, and we proceeded accordingly.

Our first priority was to identify and contain the source of the attack, and re-establish complete control over our account. We quickly identified and revoked all administrative credentials with sufficient ability to terminate instances, notably including an old pre-IAM full-access API key. We also revoked and rotated a handful of limited-scope credentials for good measure.
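
For illustration only, a sweep like this can be scripted against the IAM API. The sketch below uses boto3 and is not our actual tooling; note also that the old root key itself predates IAM and has to be revoked from the account's security credentials page rather than through these calls. The sketch simply deactivates every active IAM access key so that fresh replacements can be issued:

    # Illustrative sketch: deactivate every active IAM access key so that fresh
    # replacements can be issued. This does not cover the account's root key,
    # which predates IAM and must be revoked separately; not our actual tooling.
    import boto3

    iam = boto3.client("iam")

    for page in iam.get_paginator("list_users").paginate():
        for user in page["Users"]:
            name = user["UserName"]
            for key in iam.list_access_keys(UserName=name)["AccessKeyMetadata"]:
                if key["Status"] == "Active":
                    iam.update_access_key(UserName=name,
                                          AccessKeyId=key["AccessKeyId"],
                                          Status="Inactive")
                    print("Deactivated %s for %s" % (key["AccessKeyId"], name))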

While we rotated credentials, we also reached out to AWS support for help identifying the source of the attacks. They were extremely helpful, and were able to positively confirm that the key used was our old pre-IAM root API key, the very first key created with our account back in 2006.

Identifying this key helped us to confirm that our account was re-secured, as well as identify vectors for the key to have been compromised. This key had been mis-labeled as one with more limited permissions, and consequently had been committed into several private repositories for very old projects. Our leading hypothesis is that this was the main vector for the key's leak.

Given that this key was not important for normal system operation, and because we have long since adopted a policy of not committing any kind of credentials into source, we were able to revoke it without any negative impact on our operations. Out of an abundance of caution, we also rotated all of the rest of our keys, as well as restricted all account access to the bare minimum of essential personnel.

Once confident that the attack was contained and our account was re-secured, we proceeded to restore service to our clusters. Our first priority was to bring the service back online and make it possible for customers to reindex their data from a primary datastore. Following that, we would begin restoring individual customers from backups while planning and executing a larger mass restore.

By this time, Bonsai had already self-repaired its clusters. It runs on the latest generation of our operational architecture and was back online (without data) within about thirty minutes of the initial mass termination. A very small amount of manual intervention was required to repopulate some systems from our primary database, hosted on another provider.

Websolr runs on older systems and requires more manual work to provision its clusters. By 15:55 UTC (T+3.5h), new index creation was brought partially online. There were some problems at this point, as the surge of pent-up index creation requests produced a "thundering herd" effect. We spent approximately the next hour rebalancing systems and bringing the load under control.
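
For readers unfamiliar with the term, a "thundering herd" occurs when a large number of clients retry or re-provision at the same moment. One common mitigation, sketched below purely as an illustration and not as a description of our provisioning systems, is to retry with exponential backoff plus random jitter so the requests spread out over time. The create_index callable here is a hypothetical stand-in:

    # Illustrative sketch of exponential backoff with "full jitter", a standard
    # way to spread out simultaneous retries. create_index is a hypothetical
    # stand-in for whatever provisioning call is being retried.
    import random
    import time


    def create_with_backoff(create_index, max_attempts=6, base_delay=1.0, cap=60.0):
        for attempt in range(max_attempts):
            try:
                return create_index()
            except Exception:
                if attempt == max_attempts - 1:
                    raise
                # Sleep a random amount up to an exponentially growing ceiling.
                time.sleep(random.uniform(0, min(cap, base_delay * (2 ** attempt))))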

Our next phase of the recovery was to begin restoring data from nightly backups. We began with individual restorations, because that is the use-case our backup recovery tooling has been optimized for. The lack of well-rehearsed tooling and procedures for a mass recovery proved to be a major bottleneck in our efforts.

Over the next few hours, we were able to develop the necessary tooling for Bonsai to mass-restore data from backups, bringing Bonsai to full recovery by June 20th at 02:55 UTC (T+14h).
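
To give a rough sense of what that tooling looks like, the sketch below fans per-index restores out across a small worker pool instead of running them one at a time. The restore_index function and the index names are hypothetical placeholders, not our actual recovery code:

    # Illustrative sketch: run many single-index restores concurrently instead of
    # one at a time. restore_index and the index names are hypothetical.
    from concurrent.futures import ThreadPoolExecutor, as_completed


    def mass_restore(index_names, restore_index, workers=16):
        failures = []
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(restore_index, name): name for name in index_names}
            for future in as_completed(futures):
                name = futures[future]
                try:
                    future.result()
                except Exception as err:
                    # Collect failures so they can be retried or handled by hand.
                    failures.append((name, err))
        return failures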

Retooling and recovery efforts for Websolr continued into the next day. Restorations from backup data were completed by June 21 at 04:50 UTC (approx. T+40h). Through the next few days, we continued to closely monitor and audit the performance of the system. We performed occasional system maintenance to redistribute load, and to help individual customers whose indices suffered data corruption or configuration regressions during the restoration process.

Followup investigation

One of our first actions was to hire a top-tier security firm to conduct a breach analysis and security audit of our systems. Given the nature of the attack, it was important for us to have an expert third party analyze our response protocol and perform a thorough audit of our systems and policies.

This audit helped us confirm that we had contained the attack, and produced recommendations to help us continue improving our security practices.

Our followup investigation has also turned up clear evidence regarding the nature of the attack. The attack itself was very focused in its scope, and we've had no extortion demands or other communication. Based on our analysis of the evidence, we feel confident that customer data was not accessed.

Areas for improvement

Throughout this outage and our subsequent investigation, we identified a number of areas where we can improve our systems and our practices. We want to thoroughly harden our systems to prevent this kind of attack from ever happening again, as well as improve our overall incident response.

API key management

The primary contributing factor to this attack and outage was that we allowed a very old full-access API key to be leaked.

This key was an anomaly for us: eight years old, mislabeled and improperly used internally, committed to source, and non-critical in our day-to-day operations. It not only predates AWS IAM functionality, it predates both Websolr and Bonsai, and most of our current security practices. It should have been revoked years ago.

These days, our standard operating procedure is to create role-based keys, with limited lifespans when possible, whose permissions are limited to the bare minimum needed to fulfill their purpose. Furthermore, it has been our policy for a number of years not to commit any credentials into source.
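
As a concrete example of what "bare minimum" means in practice, the sketch below creates a hypothetical IAM policy granting read-only access to a single backup bucket and nothing else. The bucket and policy names are placeholders, and this is an illustration rather than one of our actual policies:

    # Illustrative sketch: create a least-privilege IAM policy scoped to read-only
    # access on a single, hypothetical backup bucket. All names are placeholders.
    import json

    import boto3

    iam = boto3.client("iam")

    policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::example-backup-bucket",
                    "arn:aws:s3:::example-backup-bucket/*",
                ],
            }
        ],
    }

    iam.create_policy(
        PolicyName="example-backup-read-only",
        PolicyDocument=json.dumps(policy_document),
    )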

Finally, it is an industry standard to rotate long-lived access keys every 12–18 months. In this case, we have since revoked and rotated all of our keys and credentials, and in the future we will be sure to include a similar full audit and rotation of our keys during semi-annual system maintenance.
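
An audit like that can be as simple as the sketch below, which uses the IAM credential report to flag access keys that have outlived a twelve-month rotation window. It is an illustration, not our actual maintenance script:

    # Illustrative sketch: use the IAM credential report to flag access keys that
    # have not been rotated in over a year. Not our actual maintenance script.
    import csv
    import io
    import time
    from datetime import datetime, timedelta, timezone

    import boto3

    iam = boto3.client("iam")

    # Ask IAM to (re)generate the credential report and wait for it to finish.
    while iam.generate_credential_report()["State"] != "COMPLETE":
        time.sleep(2)

    report = iam.get_credential_report()["Content"].decode("utf-8")
    threshold = timedelta(days=365)
    now = datetime.now(timezone.utc)

    for row in csv.DictReader(io.StringIO(report)):
        rotated = row["access_key_1_last_rotated"]
        if rotated in ("N/A", "not_supported"):
            continue
        last_rotated = datetime.fromisoformat(rotated.replace("Z", "+00:00"))
        if now - last_rotated > threshold:
            print("Rotate access key 1 for %s (last rotated %s)" % (row["user"], rotated))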

Standardized security policies

While our analysis of the data is still ongoing, the likeliest scenario is that our old key was leaked from an insecure system of one of the collaborators with access to our private GitHub repositories. While our employees all practice high standards for security on their workstations, our enforcement of similar standards with contractors has been less consistent.

The nature of our business requires special trust, which demands special attention to security. We will develop written policies before bringing in any new collaborators in the future.

Better access control logging

Because our account was compromised, we had to wait to re-establish secure access to it before proceeding with recovery.

We were able to do this thanks to help from AWS support. However, this took much longer than we would have liked, and with access to more logs we could have self-diagnosed the situation and responded much more quickly and authoritatively.

Fortunately, AWS provides this kind of API logging in the form of CloudTrail, which we have since enabled across all of our regions and accounts. AWS was also able to furnish us with historical logs out of CloudTrail. We found these logs, in correlation with logs from other systems, to be immensely useful in our post-incident security analysis and pursuit of attribution.

Because of the sensitive nature of these kinds of logs, we intend in the near future to move them into an isolated off-site account for archival and analysis.
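
For illustration, enabling an all-region trail that delivers to a dedicated S3 bucket looks roughly like the sketch below. The trail and bucket names are hypothetical, and delivering into a bucket owned by a separate, isolated account additionally requires an appropriate bucket policy on that account's side:

    # Illustrative sketch: create a multi-region CloudTrail trail that delivers to
    # a dedicated S3 bucket, then start logging. Names are hypothetical, and the
    # destination bucket needs a bucket policy that allows CloudTrail delivery.
    import boto3

    cloudtrail = boto3.client("cloudtrail")

    cloudtrail.create_trail(
        Name="example-account-audit-trail",
        S3BucketName="example-isolated-audit-logs",
        IsMultiRegionTrail=True,           # capture API activity in every region
        IncludeGlobalServiceEvents=True,   # include IAM and other global services
    )
    cloudtrail.start_logging(Name="example-account-audit-trail")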

More effective communication

One of the facets of our response that we most regret was the scattered and poor communication with customers. In particular, many customers expressed a preference for receiving proactive email notifications for an outage of this scope, a capability which we have traditionally left underdeveloped.

Our best options were to post updates to Twitter and the Bonsai status page, and otherwise answer emails and tickets individually. With nearly a 1,000x increase over our normal support traffic, this resulted in an unsatisfactory experience for our customers as well as a critical bottleneck for our overall response efforts.

In the future we'll be developing better ways to quickly send emails for important system notices, as well as working to consolidate and improve our support and status reporting tools.

Faster service recovery

Bonsai has benefited from a lot of work toward the end of 2013 in improving its resiliency and ability to recover from failure. Its clusters were able to automatically rebuild within roughly 30 minutes of the initial mass-termination, with minimal manual intervention to fully restore service.

While restoring Bonsai's data from backups took another few hours, the tooling improvements we made are reusable and will greatly shorten our time to full recovery in any future outage of this scope.

Websolr, by comparison, still relies on a fair amount of manual provisioning. While it can gracefully tolerate the loss of an entire availability zone, the current level of manual coordination was a substantial bottleneck in recovering from a region-wide failure.

This kind of resiliency improvement for Websolr is something we have invested a lot of time and effort into this year, following the improvements we have already made for Bonsai. We plan to aggressively move forward with consolidating Websolr onto our latest generation of more resilient architecture.

Backup data resiliency

Regrettably, not all indices could be successfully restored from backups. We saw approximately 5% of indices suffer some form of corruption. Historically this kind of corruption has been rare during normal operations, though not unheard of.

We are working on improvements to our backup systems to help validate the integrity of data as backups are created. Lucene itself has also seen work specifically aimed at improving its resiliency, which should be available in upcoming releases. (For example, LUCENE-2446, LUCENE-5580, and LUCENE-5602, which introduce internal checksumming for better data integrity validation.)
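
One straightforward form of that validation, sketched below purely as an illustration (the manifest format is hypothetical, not our backup tooling's), is to record a checksum for every file as a backup is written and verify the checksums again before restoring:

    # Illustrative sketch: record SHA-256 checksums as a backup is written and
    # verify them before restoring. The manifest format here is hypothetical.
    import hashlib
    import json
    import os


    def sha256_of(path, chunk_size=1 << 20):
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()


    def write_manifest(backup_dir):
        # Called as the backup is created: checksum every file in the backup.
        manifest = {name: sha256_of(os.path.join(backup_dir, name))
                    for name in os.listdir(backup_dir) if name != "manifest.json"}
        with open(os.path.join(backup_dir, "manifest.json"), "w") as handle:
            json.dump(manifest, handle, indent=2)


    def verify_manifest(backup_dir):
        # Called before a restore: return the names of any files whose contents
        # no longer match the checksum recorded at backup time.
        with open(os.path.join(backup_dir, "manifest.json")) as handle:
            manifest = json.load(handle)
        return [name for name, expected in manifest.items()
                if sha256_of(os.path.join(backup_dir, name)) != expected]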

Moving forward

Over the last two weeks we've taken the time to thoroughly audit and review all of our systems and processes: every credential, every account, every role, every third-party service. We've had the help of a fresh outside perspective from a third-party security firm to ensure that nothing was overlooked.

We've purged legacy systems, rotated credentials, improved our recovery tools, and even consolidated a large number of customers onto newer versions and better operational tooling.

This was pretty close to the worst-case scenario for us; the kind of disaster plan you think about and plan for but hope to never have to execute. But we also subscribe to the mantra, 'Hope is not a valid strategy.' And so, moving forward, we'll keep working to turn this experience into a better, more resilient company.

Posted Jul 03, 2014 - 23:34 UTC

Resolved
Websolr recoveries from backups are complete. We have performed preliminary index health audits and will run another series of audits throughout the weekend to check for anomalies. We expect some small amount (1–2%) of anomalies and index corruption among indices restored from backups due to the somewhat imperfect process of restoration.

If your index is still offline or missing data, please let us know your URL at support@websolr.com so we can work on it individually.
Posted Jun 21, 2014 - 04:50 UTC
Update
Websolr: We have made progress on developing our backup recovery tooling, and the rate of recovery is improving dramatically. We will update again when we have a more precise ETA.
Posted Jun 20, 2014 - 18:28 UTC
Update
Websolr recap: recovery efforts are still underway, but proceeding slowly.

Websolr generally relies heavily on Solr replication to tolerate the loss of up to a single availability zone's worth of servers. The structure of our backups simply is not scaling well to the simultaneous termination of all of our instances. We're continuing to give it our best effort, and we're working on improving our tooling to perform a more efficient mass restore, but for now we unfortunately must proceed one index at a time.

It is currently not possible for us to give a specific ETA for the recovery of any one index. For any index that can be reindexed in under 12 hours (or has under 1MM docs), the best option is still to create and reindex into a replacement index.
Posted Jun 20, 2014 - 15:22 UTC
Update
Bonsai recap: recovery efforts have been completed. All available backups have been restored as of yesterday evening. Not all indices were perfectly recovered, and any that are still empty will need to be reindexed. Any further errors should be reported to us at support@bonsai.io with the cluster URL and information about the error.
Posted Jun 20, 2014 - 15:15 UTC
Update
Websolr recoveries from backups are on hold while the team catches a few hours of sleep. We will continue first thing in the morning (US time).
Posted Jun 20, 2014 - 06:00 UTC
Monitoring
Recoveries from backups are still underway for Websolr indices.
Posted Jun 20, 2014 - 04:00 UTC
Update
Bonsai recoveries from backups have been completed for all clusters that had not yet reindexed. Some backups could not be restored successfully, and any cluster that is still empty will need to reindex its data.
Posted Jun 20, 2014 - 02:55 UTC
Update
Recoveries from backups are under way for both Websolr and Bonsai, on a per-customer basis.
Posted Jun 20, 2014 - 00:31 UTC
Update
We are planning our backup restoration; however, we still have no firm ETA on full recovery. This will be a long, slow process, and anyone who can reindex should do so.
Posted Jun 19, 2014 - 17:52 UTC
Update
With help from AWS support, we have positively identified the compromised API access key. To be clear, this key was revoked as one of the first actions of our initial response. Over the next few days we'll keep working to find the source of its leak and the precise origination of the termination commands.
Posted Jun 19, 2014 - 17:13 UTC
Update
Websolr: new index creation is back online.
Posted Jun 19, 2014 - 16:38 UTC
Update
Websolr new index creation is currently experiencing errors. If you're seeing an error message after creating a new index, we will repair it shortly.
Posted Jun 19, 2014 - 16:27 UTC
Update
Bonsai clusters have been confirmed to successfully reindex. Any cluster that has been reindexed will be excluded from the later restoration from backups.
Posted Jun 19, 2014 - 16:27 UTC
Update
Bonsai clusters are online and available to reindex. No ETA for recovery from backups; not less than two hours. Websolr indices can now be recreated in US West, US East, and EU West regions. No ETA for recovery from backups; not less than 2h.
Posted Jun 19, 2014 - 15:55 UTC
Update
All Bonsai clusters are back online and ready for reindexing, with plans to restore from backups through the course of the day. Websolr servers are still offline pending reprovisioning, which will begin shortly.
Posted Jun 19, 2014 - 15:14 UTC
Update
Old credentials have been revoked, and all-new API credentials have been created. We are proceeding with service recovery.
Posted Jun 19, 2014 - 15:13 UTC
Update
We are still in the process of rotating all account credentials before recovery starts in earnest, and are working with AWS support to ensure a thorough response.
Posted Jun 19, 2014 - 14:26 UTC
Identified
Today, 2014-06-19, at 12:24:54 UTC, within a two-minute span, all of our AWS instances for Websolr and Bonsai were terminated. We are treating our AWS account as compromised and responding accordingly. Service recovery efforts are also underway; updates and more precise ETAs to follow.
Posted Jun 19, 2014 - 13:50 UTC
Investigating
We are currently investigating this issue.
Posted Jun 19, 2014 - 12:56 UTC