Eric Day

Thoughts, code, and other oddments.
Dark | Light

< || >

Eventually Consistent Relational Database?

October 12th, 2009

This weekend I attended Drupal Camp PDX and listened to a session titled “Drupal in the Cloud”. The presenter, Josh Koenig from Chapter Three, gave a great introduction of what moving to “the cloud” really means, especially in the context of a typical web application like Drupal. The problem, which is of course no fault of Josh’s, is that the best high availability database practices are harder to deploy because you’re working within a different set of constraints in the cloud. Sure, you can setup MySQL replication, but without the ability to insert a hardware load balancer or better control over floating IPs, reliable single-master solutions are difficult at best.

I spoke with Josh for a bit after and discussed how Drizzle is doing things to help and what it would take to have a Drizzle back end for Drupal (turns out it should not be too difficult). We then got onto the topic what some of the newer non-relational databases would look like for Drupal, and the short answer is it would be extremely difficult. Drupal, in both the core and many of the modules, depend on a relational model for the underlying data. This is not unique to Drupal. People, and the software they write, have thought “relational” for decades when it comes down to data. Sure, the various NoSQL projects are becoming more popular, but the masses are still thinking in terms of joining tables.

Silver Bullet

So, what would be the silver bullet? A relational database that did not depend on a single master. Not just dual-master setups with offset auto-increment, I’m talking about removing the entire concept of master-slave for replication. This is obviously nothing new in the industry, but it’s never been easy to accomplish. Just do some reading on distributed locking algorithms and you’ll get the idea. The main problem with distributed locking is that they don’t scale.

But, what about an eventually consistent replication model for a relational database? So far eventually consistent databases have not been relational (document based like CouchDB or simple key/value pairs) and relational databases have always focused on atomic consistency or some close relaxed relative (various levels of serialization). As a thought experiment, I’m going to attempt to describe what this may look under the hood at a high level.

Eventually consistent?

Not familiar with this term? Take a look at Werner Vogels’ article on the topic. The main idea behind EC is that you sacrifice the ability for all nodes to see exact same thing at any given time (consistency), but in return you can tolerate network partitions and you have availability. This directly relates to the CAP theorem which states you only get two of: Consistency, Availability, and tolerance to network Partitions. So, we are throwing out “C” so we can get rid of those nasty distributed locking algorithms, but in return we take on “EC”.

MyEventuallyConsistentSQL

Let’s start off with a traditional relational database and start modifying it until we have something that looks like an ECRDBMS (ok, maybe this acronym is a bit wordy).

  • Throw out transactions, serializability, and MVCC – Stay with me. These are not required for a relational data model, and they don’t make much sense when your throw out atomic consistency. At any point an event could come in from another database node that would overwrite whatever you are protecting in your transaction, so what’s the point of protecting yourself? I realize this eliminates a certain class of applications that depend on strict consistency, but this is the cost of moving to EC. We still have a large set of applications that will function just fine even with these dirty reads/writes. I don’t really care if my Drupal site or WordPress blog has dirty transactions (or none at all). Sure, something might get slightly off in some of those edge cases, but no one is really going to complain, and they are easily resolved when recognized.

  • Remove or fix auto-increment types – I think the best way to handle auto-increment types is to just remove them and require a time-based unique ID (ie, UUID) in your primary key. Sure, they might take a little more space, but space is cheap and it’s a small price to pay for scalability. The other option is to setup some type of pool reservation system for the nodes to pull from, but then we get back into the business of distributed locking or single points of failure. It’s easier to just remove them and use globally unique IDs. These play a role in EC replication as well, so we might as well share them here.

  • Apply EC to replication events – This requires making our events deterministic, regardless of ordering. The easiest way is to tag each row with a time-based unique ID, and the most recent ID always wins. This is a bit rough, but it works. You can refine this in a couple of ways. First, we can change our granularity of events. We probably don’t want to go lager (ie, page-based replication), but we may want to go smaller (field based). You may have a hybrid as well, mixing row and field. The other way to refine EC replication is to allow delta events. Rather than pushing only absolute values, create options on certain columns that allow for increment, decrement, append, and so on. For example, an INTEGER column defined as a inc/dec would always have replication events that looked like “x+=3″ instead of “x=42″. You may even have the option to create user-defined delta algorithms, but I won’t go into that here. Oh, and for those thinking append would not work, you can order your append operations by the time-based unique ID (which means keeping a history of the field). Look at what Google Wave is doing with their text field conflict resolution algorithms.

What are we missing? What else would break down if we toss out atomic consistency and make the above changes? One thing I left out is DDL operations. Those would require some more thought, but I’m pretty sure we could figure out a way to handle conflicting events, possibly with configuration parameters to control the decisions made in conflict resolution algorithms. For example, if an UPDATE event gets applied after a ALTER TABLE that removed a column referenced in the UPDATE, you could just ignore that value and apply the other updates (if any). Chances are you didn’t want that column if it was removed at about the same time. This model has the major benefit of not having to worry about which node is the master or keeping an ordered replication log, they all operate independently and toss deterministic events which can be applied in any order.

Summary

This would-be-ECRDBMS looks a bit different on the inside, but from the outside it will look pretty familiar. From the normal web application perspective we are still creating tables, inserting data, joining data, and doing all the things we depend on from a relational database. This many not be a great idea, but I think it would be possible if you are willing to accept some of the behaviors that come along with it. So what do you think? How can it be improved? Would you use it for your application?

Posted in Drizzle, Main, MySQL

9 Responses to "Eventually Consistent Relational Database?"

  1. Padraig says:

    Hi Eric!

    Thanks for the interesting article. One other related thought that I’ve had is why not use an eventually consistent data store as the backend for a storage engine in Drizzle. This way, we don’t have to worry about implementing replication – its already done for us. For example, one of the NoSQL stores like Project Voldemort is pretty much just a clone of Amazon’s Dynamo. So why not just build a storage engine in Drizzle on top of voldemort? Then you would have a SQL interface to an eventually consistent data store i.e. we shoe-horned the relational model on to this key/value store.

    Of course, this does bring up a lot of issues such as how to store and query relational data in these key/value stores efficiently. And how do we do joins efficiently on top of a store such as this. Another issue is whether its even a good idea to build a storage engine on top of a store such as this! But it gets you the relational model on top of an eventually consistent store pretty easily.

    Some research has been done recently in this area, for example:

    * Building a Database on S3 – http://is.gd/4goRW
    * Building a Database in the Cloud – http://www.dbis.ethz.ch/research/publications/dbs3.pdf

    For a class I’m taking this semester on distributed systems, I’m creating a storage engine for Drizzle on Amazon’s S3 storage service so I’ve been thinking about these things a fair bit myself lately. One thing that’s a big issue in that case is the latency involved in making a request to S3 so I’m starting to think a fair bit about caching strategies for the engine.

    Its definitely an interesting topic you’ve brought up (well, I find it pretty interesting). Sometimes I wonder how many people actually need to migrate to the eventually consistent data model though. That doesn’t affect me though so I’m happy to work on it :)

  2. Eric Day says:

    Hi Padraig!

    Thanks for the thoughts and links. I’ve seen a few folks mention building storage engines on S3, but that’s a bit different from what I was thinking. The PDF is an interesting read nonetheless. :) As you mention, the latency still kills you. One of the thoughts from my post is to make it look and feel like a normal database (it’s still durable, fsync() on writes, …). We just throw out the “ACI” in ACID. There still a few things going on in there for consistency to make sure the most recent event according to the IDs (which may or may not match reality) is the one that is final, and that delta updates are processed correctly. We could still use InnoDB, Tokyo Cabinet, and other storage engines for the back end storage on disk, but more work would be done at a higher level to remove all locking and transactions, and of course rethink replication. Not much time to prototype this, and I don’t think we’ll convince the other Drizzle folks to change things this drastically. Need…more…time… :)

  3. Lukas says:

    “Remove or fix auto-increment types”, this would also be a huge help when trying to deal with keeping production and all developer machines in sync with all the latest changes and additions done via the admintool, a problem that needs a resolution! I just dont understand why it hasnt been made the focus of any of the recent major releases of Drupal. :(

  4. kicking out the hard consistency out is discussed over many years.
    hard consistancy is considered to be a must have for OLTP (transaction processing) databases.
    offcource, it is the conservative way of thinking.

    probably you may be entering it to an area where open source databases not much entered (OLAP).

    you have very interesting solutions …like using timestamps across…i like it.

    Since i am working as Oracle DBA for many years, i can see how much effort it puts in and How much it scarifies in terms of performance and scalablity just to achive very hard consistancy in RAC (Cluster) (globally unique number generation for SCN and shared control file across the cluster nodes are just examples)

    I believe its worth to drop the hard consistancy in most cases. and something in between OLTP and OLAP is worth.
    as you pointed out its relevance is increasing as the Cloud become more and more popular

  5. Josh Koenig says:

    Hey Eric, thanks for the shout-out! Some initial thoughts from the Drupal side:

    1) I’m definitely excited by the prospects of Drizzle, will keep playing with it to see if I can get it working. Conceptually I love it, and I believe my favorite CMS has the necessary abstraction (esp in the upcoming version 7 release) to handle it.

    2) From the end-developer point of view, having something I can just swap in for my existing relational data store is obviously ideal. Letting the web devs keep their warm fuzzy blanket will obviously provoke the fewest cries of pain and dismay.

    3) That said, looking far ahead to Drupal 8, I think there’s a good case to me made that having core project do more to structure how the average developer interacts with data is a Good Thing. People will probably always have direct access to underlying datastore via PDO or whatever, but there would be a good set of “best practice” wrapping functions for most developers for DRUDing data objects and configuration settings.

    Anyway, there are people in our community who have been tracking these issues for a while now:

    http://fourkitchens.com/sites/default/files/Drupalcon%20DC%20-%20The%20Next%2010%20Years.pdf

    I’ll see if I can get some of them to read/respond as well.

  6. Steve Yen says:

    Hi Eric,
    Great article! Wondering if you’re thinking mostly of the fully replicated data, where each node (eventually) has a complete copy of all data? CouchDb takes this design point.

    A lot of these fancy just add more node data stores also do partitioning or sharding. In these cases, relational joins get rather difficult.

  7. Eric Day says:

    Hi Josh – Thanks for the feedback. Curious to see what the other Drupal DB folks think. :)

    Hi Steve – The simple case is to have every node keep a full replica and let the user handle the sharding. It probably wouldn’t be too hard to create a “replication router” that could use the table/column info to decide which remote nodes to send the events to. I understand with distributed joins, not suggesting trying to perform those, although you would be able to do a bit more with caching of join info if you are less worried about atomic consistency. The main idea behind this post is that you’re relational, so you still have the constraints that come with it. You have more robust replication system though with EC, so I can have my 3 VMs running my WordPress blog, they run independently (no single master), and they all sync events with one another. If you have a load spike or if one goes down? Start a new VM up, pull DB snapshot (possibly dumped into S3 or something), and sync the events (much like bringing up a new slave server). Better yet, your “backup” could just be taking an AMI snapshot so your DB is there ready to go, and when they come up, they can start serving (possibly stale) data immediately and start syncing new events from other node.

  8. “newer always wins” gets interesting with flaky time-keeping (which is very common on busy database servers :)

  9. Steven Roussey says:

    Now *that* would be useful indeed! So when can I try it out?? ;)

Leave a Reply


< || >
Blog
Wiki
About
Resume
RSS
Comments

E-Mail
Launchpad
LinkedIn
Twitter
identi.ca
Facebook

OpenStack
Scale Stack
Gearman
NW Veg
Veg Food & Fit

Linux On Laptops