On Wednesday, June 19th, One More Cloud's services Websolr.com and Bonsai.io suffered a major extended outage. This outage was the result of an attack on our systems using a compromised API key.
We are so sorry for the severe impact to our customers and their businesses, and we regret the circumstances which made it possible.
On Wednesday morning, around 5:30am PDT, our monitoring systems paged our on-call engineer with an alert of a major outage. It was quickly determined that both Websolr and Bonsai were unresponsive in all availability zones in all regions, an unprecedented scope of outage, prompting an immediate escalation to alert and wake up our entire team.
Continued investigation showed a deliberate API-initiated mass-termination of all instances in One More Cloud's AWS account. Because none of our automated systems have the ability to perform any destructive actions, and based on the early morning timing of the attack, we were able to rule out an accident or bug almost immediately.
Our conclusion was that our account was compromised and our systems under attack, and we proceeded accordingly.
Our first priority was to identify and contain the source of the attack, and re-establish complete control over our account. We quickly identified and revoked all administrative credentials with sufficient ability to terminate instances, notably including an old pre-IAM full-access API key. We also revoked and rotated a handful of limited-scope credentials for good measure.
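For illustration only, here is a minimal sketch of what that kind of credential sweep can look like against the IAM API, using boto3 (which postdates this incident); it simply deactivates every IAM user's access keys pending review. This is not our actual tooling, and note that the old root key is not attached to an IAM user, so it would not appear in a sweep like this and has to be revoked separately.

```python
# Sketch only: deactivate every IAM user's access keys pending review.
# boto3 is used purely for illustration.
import boto3

iam = boto3.client("iam")

def deactivate_all_user_keys():
    for page in iam.get_paginator("list_users").paginate():
        for user in page["Users"]:
            listing = iam.list_access_keys(UserName=user["UserName"])
            for key in listing["AccessKeyMetadata"]:
                iam.update_access_key(
                    UserName=user["UserName"],
                    AccessKeyId=key["AccessKeyId"],
                    Status="Inactive",  # disable now, delete after review
                )
                print("Deactivated %s for %s" % (key["AccessKeyId"], user["UserName"]))

if __name__ == "__main__":
    deactivate_all_user_keys()
```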
While we rotated credentials, we also reached out to AWS support for help identifying the source of the attacks. They were extremely helpful, and were able to positively confirm that the key used was our old pre-IAM root API key, the very first key created with our account back in 2006.
Identifying this key helped us to confirm that our account was re-secured, as well as identify vectors for the key to have been compromised. This key had been mis-labeled as one with more limited permissions, and consequently had been committed into several private repositories for very old projects. Our leading hypothesis is that this was the main vector for the key's leak.
Given that this key was not important for normal system operation, and because we have long since adopted a policy of not committing any kind of credentials into source, we were able to revoke it without any negative impact on our operations. Out of an abundance of caution, we also rotated all of the rest of our keys, as well as restricted all account access to the bare minimum of essential personnel.
Once confident that the attack was contained and our account was re-secured, we proceeded to restore service to our clusters. Our first priority was to bring the service back online and make it possible for customers to reindex their data from a primary datastore. Following that, we would begin restoring individual customers from backups while planning and executing a larger mass restore.
By this time, Bonsai had already self-repaired its clusters. It is running on the latest generation of our operational architecture, and was back online (without data) within about thirty minutes of the initial mass termination. A very small amount of manual intervention was required to repopulate some systems from our primary database, hosted on another provider.
Websolr runs on older systems and requires more manual work to provision its clusters. By 15:55 UTC (T+3.5h), new index creation was partially back online, though it remained unreliable as a "thundering herd" of pent-up index creation requests arrived all at once. We spent approximately the next hour rebalancing systems and bringing the load under control.
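For readers unfamiliar with the effect: when a provisioning pipeline comes back up, every request that queued during the outage arrives at once. One generic way to keep such a surge from overwhelming the system is to funnel requests through a small, bounded worker pool. The sketch below is purely illustrative and is not our provisioning code.

```python
# Illustrative only: smooth a post-outage surge of index-creation requests
# by provisioning through a bounded worker pool instead of all at once.
import queue
import threading
import time

MAX_CONCURRENT = 4
pending = queue.Queue()

def create_index(name):
    time.sleep(1)  # stand-in for the real provisioning work
    print("provisioned", name)

def worker():
    while True:
        name = pending.get()
        try:
            create_index(name)
        finally:
            pending.task_done()

for _ in range(MAX_CONCURRENT):
    threading.Thread(target=worker, daemon=True).start()

# After an outage, requests arrive in a burst...
for i in range(50):
    pending.put("index-%d" % i)

pending.join()  # ...but are worked off at a controlled rate.
```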
Our next phase of the recovery was to begin restoring data from nightly backups. We began with individual restorations, because that is the use-case our backup recovery tooling has been optimized for. The lack of well-rehearsed tooling and procedures for a mass recovery proved to be a major bottleneck in our efforts.
Over the next few hours, we were able to develop the necessary tooling for Bonsai to mass-restore data from backups, bringing Bonsai to full recovery by June 20th at 02:55 UTC (T+14h).
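Our restore tooling is internal, but to give a sense of the shape of the problem, here is a hedged sketch of a mass restore, assuming backups live in an Elasticsearch snapshot repository; the host, repository, and snapshot names are hypothetical.

```python
# Hypothetical sketch: restore every index found in a nightly snapshot.
# Assumes an S3-backed Elasticsearch snapshot repository; names are made up.
import requests

ES = "http://localhost:9200"
REPO = "nightly"
SNAPSHOT = "snapshot-latest"

def indices_in_snapshot():
    resp = requests.get("%s/_snapshot/%s/%s" % (ES, REPO, SNAPSHOT))
    resp.raise_for_status()
    return resp.json()["snapshots"][0]["indices"]

def restore(index):
    # wait_for_completion keeps restores sequential, which is gentler on
    # a cluster that is also serving traffic.
    resp = requests.post(
        "%s/_snapshot/%s/%s/_restore?wait_for_completion=true" % (ES, REPO, SNAPSHOT),
        json={"indices": index, "include_global_state": False},
    )
    resp.raise_for_status()

for index in indices_in_snapshot():
    restore(index)
    print("restored", index)
```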
Retooling and recovery efforts for Websolr continued into the next day. Restorations from backup data were completed by June 21 at 04:50 UTC (approx. T+40h). Through the next few days, we continued to closely monitor and audit the performance of the system. We performed occasional system maintenance to redistribute load, and to help individuals whose indices suffered data corruption or configuration regressions during the restoration process.
One of our first actions was to hire a top-tier security firm to conduct a breach analysis and security audit of our systems. Given the nature of the attack, it was important for us to have an expert third party analyze our response protocol and perform a thorough audit of our systems and policies.
This audit helped us confirm that we had contained the attack, and it produced recommendations for continuing improvements to our security practices.
Our follow-up investigation has also turned up clear evidence regarding the nature of the attack. The attack itself was very focused in its scope, and we've had no extortion demands or other communication. Based on our analysis of the evidence, we feel confident that customer data was not accessed.
Throughout this outage, and our subsequent investigation, we identified a number of areas where we can improve our systems and our practices. We want to thoroughly harden our systems to prevent this kind of attack from ever happening again, as well as improve our overall incident response.
The primary contributing factor to this attack and outage was that we allowed a very old full-access API key to be leaked.
This key was an anomaly for us: eight years old, mislabeled and improperly used internally, committed to source, and non-critical in our day-to-day operations. It not only predates AWS IAM functionality, it predates both Websolr and Bonsai, and most of our current security practices. It should have been revoked years ago.
These days, our standard operating procedure is to create role-based keys, with limited lifespans when possible, whose permissions are limited to the bare minimum needed to fulfill their purpose. Furthermore, it has been our policy for a number of years not to commit any credentials into source.
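As a concrete, hypothetical example of what "bare minimum" means in practice, the sketch below provisions a credential that can do exactly one thing: write backup objects into one bucket. All of the names are illustrative.

```python
# Hypothetical example of a narrowly scoped, role-based credential:
# this key can write backups to one bucket and nothing else.
import json
import boto3

iam = boto3.client("iam")

POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject"],
        "Resource": "arn:aws:s3:::example-backups/*",
    }],
}

iam.create_user(UserName="backup-writer")
iam.put_user_policy(
    UserName="backup-writer",
    PolicyName="write-backups-only",
    PolicyDocument=json.dumps(POLICY),
)
key = iam.create_access_key(UserName="backup-writer")["AccessKey"]
print("Scoped key created:", key["AccessKeyId"])
```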
Finally, it is an industry standard to rotate long-lived access keys every 12–18 months. In this case, we have since revoked and rotated all of our keys and credentials, and in the future we will be sure to include a similar full audit and rotation of our keys during semi-annual system maintenance.
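A rotation audit of this kind is easy to automate. Here is a small, illustrative sketch that flags any IAM access key older than a year; the one-year threshold is simply an example within the 12–18 month guidance above.

```python
# Illustrative sketch: flag long-lived access keys that are due for rotation.
from datetime import datetime, timedelta, timezone
import boto3

iam = boto3.client("iam")
MAX_AGE = timedelta(days=365)  # example threshold within the 12-18 month guidance

def stale_keys():
    for page in iam.get_paginator("list_users").paginate():
        for user in page["Users"]:
            keys = iam.list_access_keys(UserName=user["UserName"])
            for key in keys["AccessKeyMetadata"]:
                age = datetime.now(timezone.utc) - key["CreateDate"]
                if age > MAX_AGE:
                    yield user["UserName"], key["AccessKeyId"], age.days

for user, key_id, days in stale_keys():
    print("%s / %s is %d days old and should be rotated" % (user, key_id, days))
```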
While our analysis of the data is still ongoing, the likeliest scenario is that our old key was leaked from an insecure system of one of the collaborators with access to our private GitHub repositories. While our employees all practice high standards for security on their workstations, our enforcement of similar standards with contractors has been less consistent.
The nature of our business requires special trust, which demands special attention to security. We will develop written policies before bringing in any new collaborators in the future.
The account compromise meant that we first had to re-establish secure control of our account before proceeding with recovery.
We were able to do this thanks to help from AWS support. However, this took much longer than we would have liked, and with access to more logs we could have self-diagnosed the situation and responded much more quickly and authoritatively.
Fortunately, AWS provides these kinds of API logs in the form of CloudTrail, which we have since enabled across all of our regions and accounts. AWS was also able to furnish us with historical logs out of CloudTrail. We found the CloudTrail logs, in correlation with logs from other systems, to be immensely useful in our post-incident security analysis and pursuit of attribution.
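Turning CloudTrail on is a small amount of code. The sketch below uses today's boto3 API with a hypothetical trail and bucket name, and assumes the destination bucket already carries a policy allowing CloudTrail to write to it.

```python
# Sketch: enable a CloudTrail trail that records API activity in all regions.
# The trail and bucket names are hypothetical; the bucket must already grant
# CloudTrail permission to write to it (and can live in a separate account).
import boto3

cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

cloudtrail.create_trail(
    Name="account-audit-trail",
    S3BucketName="example-cloudtrail-archive",
    IsMultiRegionTrail=True,
)
cloudtrail.start_logging(Name="account-audit-trail")
```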
Because of the sensitive nature of these kinds of logs, we intend in the near future to move them into an isolated off-site account for archival and analysis.
One of the facets of our response that we most regret was the scattered and poor communication with customers. In particular, many customers expressed a preference for receiving proactive email notifications for an outage of this scope, a capability which we have traditionally left underdeveloped.
Our best options were to post updates to Twitter, the Bonsai status page, and otherwise answer emails and tickets individually. With nearly a 1,000x increase over our normal support traffic, this resulted in an unsatisfactory experience for our customers as well as a critical bottleneck for our overall response efforts.
In the future we'll be developing better ways to quickly send emails for important system notices, as well as working to consolidate and improve our support and status reporting tools.
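One possible shape for that, sketched with Amazon SES purely as an illustration: the addresses and sender domain below are hypothetical, and a real system would batch, queue, and track deliveries.

```python
# Hypothetical sketch: sending a proactive status notice to affected customers.
import boto3

ses = boto3.client("ses", region_name="us-east-1")

def send_status_notice(recipients, subject, body):
    for address in recipients:  # a real sender would batch and rate-limit
        ses.send_email(
            Source="status@example.com",
            Destination={"ToAddresses": [address]},
            Message={
                "Subject": {"Data": subject},
                "Body": {"Text": {"Data": body}},
            },
        )

send_status_notice(
    ["customer@example.com"],
    "[Status] Service disruption in progress",
    "We are investigating a service disruption and will post updates to our status page.",
)
```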
Bonsai has benefited from a lot of work toward the end of 2013 in improving its resiliency and ability to recover from failure. Its clusters were able to automatically rebuild within roughly 30 minutes of the initial mass-termination, with minimal manual intervention to fully restore service.
While restoring Bonsai's data from backups took another few hours, the tooling improvements are reusable and will greatly shorten our time to full recovery in any future outage of this scope.
Websolr, by comparison, still relies on a fair amount of manual provisioning. While it can gracefully tolerate the loss of an entire availability zone, the current level of manual coordination was a substantial bottleneck in recovering from a region-wide failure.
This kind of resiliency improvement for Websolr is something we have invested a lot of time and effort into this year, following the improvements we have already made for Bonsai. We plan to aggressively move forward with consolidating Websolr onto our latest generation of more resilient architecture.
Regrettably, not all indices were able to be successfully restored from backups. We saw approximately 5% of indices suffer some form of corruption. Historically this kind of corruption is a rare phenomenon during normal operations, but not unheard of.
We are working on improvements to our backup systems to help validate the integrity of data as backups are created. Lucene itself has also seen work specifically aimed at improving its resiliency, which should be available in upcoming releases. (For example, LUCENE-2446, 5580, and 5602, which introduce internal checksumming for better data integrity validation.)
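As a simple illustration of validating integrity at backup time (our actual backup pipeline is more involved, and the file layout here is hypothetical), one approach is to write a checksum manifest alongside each backup and verify it before restoring:

```python
# Illustrative sketch: write a SHA-256 manifest next to a backup directory,
# then verify it before a restore. Paths and file layout are hypothetical.
import hashlib
import json
import os

def sha256_of(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(backup_dir):
    manifest = {
        name: sha256_of(os.path.join(backup_dir, name))
        for name in sorted(os.listdir(backup_dir))
        if name != "MANIFEST.json" and os.path.isfile(os.path.join(backup_dir, name))
    }
    with open(os.path.join(backup_dir, "MANIFEST.json"), "w") as f:
        json.dump(manifest, f, indent=2)

def corrupted_files(backup_dir):
    with open(os.path.join(backup_dir, "MANIFEST.json")) as f:
        manifest = json.load(f)
    return [name for name, digest in manifest.items()
            if sha256_of(os.path.join(backup_dir, name)) != digest]
```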
Over the last two weeks we've taken the time to thoroughly audit and review all of our systems and processes: every credential, every account, every role, every third-party service. We've had the help of a fresh outside perspective from a third-party security firm to ensure that nothing was overlooked.
We've purged legacy systems, rotated credentials, improved our recovery tools, and even consolidated a large number of customers onto newer versions and better operational tooling.
This was pretty close to the worst-case scenario for us; the kind of disaster plan you think about and plan for but hope to never have to execute. But we also subscribe to the mantra, 'Hope is not a valid strategy.' And so, moving forward, we'll keep working to turn this experience into a better, more resilient company.