FathomDB

May 19 / 3:18pm

FathomDB: Focusing on the future

Summary

  • FathomDB customers on EC2 were affected by the April AWS outage
  • We will give usage credits to affected customers
  • We are deprecating MySQL database service on EC2, and will no longer accept new MySQL customers
  • We have focused all our resources on delivering our next-generation scalable relational database, which is fault-tolerant by design

Our customers on AWS were affected by the April outage

Early in the morning of April 21st, a significant outage occurred on EC2 / Amazon Web Services (post-mortem here).  FathomDB customers on EC2 were offline during the incident.  All functionality was restored within 48 hours, and no data was lost.  FathomDB customers on Rackspace were unaffected.

FathomDB aims to provide a fully reliable database-as-a-service.  Outages of any length affecting even one customer are not acceptable to us.  We'd like to apologize unreservedly to those customers who were affected.  

Had we known recovery was going to take 48 hours, we would have initiated our disaster recovery processes; instead we trusted Amazon's repeated assurances that recovery was only a few hours away.  Restoring from a backup is a difficult decision for us, because full disaster recovery involves losing recent transactions (less than 5 minutes' worth on FathomDB), and it is impossible for us as a platform to know how any missing data would impact our customers.  We should nonetheless have immediately brought backups online and let customers make the decision.

We will give usage credits to affected customers

Amazon's SLA only promises a credit of approximately 3 days' usage, and this outage might not even be considered downtime under the terms of their SLA.  However, Amazon has agreed to give customers 10 days' usage credit.  While this is commendable, we believe that level of penalty does not compel providers to take uptime seriously.

FathomDB does not have a formal SLA, because our view is that an SLA would serve only to limit our liability, and would not offer you any guarantees.  Every SLA turns uptime into a financial trade-off for the provider, but we believe that we should keep customers online irrespective of our costs.

FathomDB will be giving all affected customers credit for three months' usage.  For the 2-day interruption, we will be giving close to 100 days' usage credit.  We will be contacting affected customers individually.

We know that financial compensation isn't enough, but by giving such a large credit we hope to demonstrate our commitment to high levels of reliability.

We are deprecating MySQL database service on AWS, and will no longer accept new MySQL customers

A database is a uniquely demanding system to run, because it is the store of state for an application.   We learned in the outage that we have built our MySQL product on promises made by the underlying clouds; we are no longer confident we can rely on those promises.  Many questions still remain unanswered even after AWS's outage analysis.

We can't operate a reliable MySQL service when we're building our house on sand; MySQL simply wasn't designed to run on unreliable distributed systems.  We are therefore deprecating our MySQL service: we won't be offering it going forwards.

We will continue to support our existing customers and will help them determine their best long-term solution.  Some customers will want to run MySQL databases on the cloud for a while yet.  We are working with the OpenStack project to ensure that our MySQL database approach will always be an option.  The openness of the OpenStack model allows us to verify the promises for ourselves, and to contribute enhancements where we need additional guarantees.

We have focused all our resources on delivering our next-generation scalable relational database, which is fault-tolerant by design

FathomDB has up until now offered a MySQL database running on the cloud.  This means taking database technology designed for the mainframe era (i.e. one big reliable machine) and running it on the cloud (i.e. lots of small machines that are not individually reliable).  We've built an amazing operations system that makes this impedance mismatch work.  The AWS outage showed us that we would need massive additional engineering effort to tolerate all sorts of scenarios we had thought realistically impossible.  While we've been vocal in our criticism of AWS, this is actually a fundamental issue and not just about any particular cloud: today's databases aren't designed for today's architecture.

Instead, we are devoting our engineering resources to FathomDB's next-generation database.  We're re-examining those mainframe-era design decisions, and building a database designed from the ground up to work on multiple unreliable machines, or even multiple unreliable clouds.  Instead of working to find a treatment for every imaginable complication, we're going to cure the disease.

The new database is progressing very well.  Paul Graham has often advised us to “do what's best for the customer”.  In this case, we believe that means giving them a scalable, relational, fault-tolerant database, rather than encouraging them to stay on a technology that we'll soon be making obsolete.

If you're excited by the prospect of building the database of the future, rather than fixing up the databases of the past, join us: jobs@fathomdb.com