Eric Day

Thoughts, code, and other oddments.
Dark | Light

Blog
Wiki
About
Resume
RSS
Comments

E-Mail
Launchpad
LinkedIn
Twitter
identi.ca
Facebook

Gearman
Drizzle
NW Veg
Veg Food & Fit

O'Reilly MySQL Conference & Expo 2010 Linux On Laptops

Drizzling from the Rackspace Cloud

March 8th, 2010

Since I left Sun back in January, folks have been asking what was next. I’m happy to say that I’m going to continue hacking on open source projects like Drizzle and Gearman, but now at the Rackspace Cloud. Not only will I be there, but I get to continue working closely with a few of the amazing Drizzle hackers who have also joined, including Monty Taylor, Jay Pipes, Stewart Smith, and Lee Bieber.

Why Rackspace Cloud? Late last year I was considering what I wanted to do next with the Oracle acquisition looming near, and this was one of the options that presented itself. Rackspace had been a supporter of Drizzle from early on by offering virtual machines to develop and test on, and when talking to some folks more closely, something really hit home. Rackspace provides first-class service and “fanatical” support – they are not a software company. One might ask why an open source software developer would be interested in a company that doesn’t create software or vice-versa, and the answer is that Rackspace wants to find ways to offer the best possible service now and into the future. What better way than to help develop the next generation of service software and get a jump start into integrating this into their architecture? Both the open source community and Rackspace win.

Another thing I learned while talking with Rackspace is that one of their core principles is transparency. This applies to both customer and employees, and anyone within an open source community can appreciate this. The more I learned about the company and the folks within it, the more impressed I was at the lack of internal barriers or “need-to-know” information. One of Drizzle’s core goals is also transparency, from discussing design decisions on public mailing lists and IRC, to having the entire project management infrastructure hosted out in the open at Launchpad.

What does this mean for the Drizzle project? It means continued support for a number of core developers, more infrastructure for development, and most importantly in my eyes, more context. One of the Drizzle tag-lines is “A Lightweight SQL Database for Cloud and Web,” so what better place to develop a database designed for the cloud than on one of the fastest growing cloud platforms. We’ll get a detailed look at the demands, get feedback from cloud customers, and have the perfect test bed for offering new services. We’ll also be able to work closely with a top-notch group of DBAs, developers, and sysadmins in one of the most demanding service architectures out there. This invaluable context will help the Drizzle developers make more informed decisions moving forward, which also means better software for the community.

Personally, this also means getting back to my hosting roots. Before Sun, I worked at Concentric for almost 10 years in a clustered hosting environment. I’m very familiar with many of the multi-tenant scalability concerns Rackspace has, and I’m excited to be working in this type of environment again. We’ve already been working closely with the MySQL DBAs at Rackspace to learn what the biggest pain points are for a multi-tenant architecture, and we’ll be taking steps to address these as it will help anyone wanting to run Drizzle in a cloud-like environment. Drizzle’s modular architecture has already proved useful, as some of these concerns are easily answered with “oh, we have a plugin point for that.”

I’m excited, this is going to be a fun ride.

Posted in Drizzle, Gearman, Main, MySQL | 11 Comments


C++, or Something Like It

March 5th, 2010

I’ve developed primarily in C most of my career, and recently decided to give C++ a shot as my “primary language” due to hacking on Drizzle and MySQL. The past few months I’ve read and experimented with most features C++ provides over C, including reading Scott Meyer’s excellent “Effective” series books (highly recommended). Along the way I’ve been developing a project I’ve wanted to write for a while, and I’m finding some features to be problematic. I thought I’d share these issues so others can be aware of them and perhaps I can learn better workarounds.

The project I’ve been working on uses dynamic shared object loading at runtime (using dlopen() and friends), is threaded, and has about every strict compiler warning on you can find and being treated as errors (thanks to Monty Taylor’s pandora-build project). I’m also testing on various architectures and compilers, including Linux, OpenSolaris, and OSX. I also have been trying my best to avoid any dependencies on large C++ libraries like Boost and just stick to the standard language and STL. With these requirements in mind, here are the issues I’ve run into:

Can’t Reliably Use Exceptions

My first pass relied on exceptions, but this proved problematic on some architectures as soon as custom exceptions were being throw across module boundaries. This comes down to ABI issues for some shared object formats generated by some compiler versions. While you can make it work in some environments, it’s not going to be portable. This means I’ve had to catch exceptions closer to where they are throw, requiring a lot more try/catch blocks, and not being able to take full advantage of automatic stack cleanup. This also means resorting back to the C way or handling exceptions: returning and checking return codes while generating error strings. To be completely exception safe, this means not using std::string for error returns since they can throw exceptions while building useful error messages. Not using exceptions has had a viral effect throughout the rest of the design of the code, making it look more like C. I was a bit disappointed by this, as not having to check every function’s return code was keeping the code very clean. :)

Limited Use of the STL and std::string

I was excited to take advantage of the STL, as writing things like doubly-linked lists and hash tables for every C struct was getting a bit old (I did have a set of macros I used, but they were not the most popular in some circles because of certain C-preprocessor features). When I learned more about the internals of the STL, and how it relies heavily on copying objects, my heart sunk a little. It completely makes sense in the design, it’s just not as efficient as it could be (especially coming from a place where I would optimize to reduce pointer copies in C). No worries, I just created private copy constructors/assignment operators and only used pointers to objects. This came with it’s own set of issues with pointer management and avoiding leaks if the ‘new’ operator were to fail. Once working out the memory management issues, there were still exceptions to watch out for, including figuring out all the methods that may throw (due to an internal allocation usually). This is especially annoying when doing simple std::string operations like assignment or concatenation, and having to always catch around those. With other annoyances like the reference-to-reference issues and std::unary_function having a non-virtual destuctor, I’ve ended up using a watered down set of STL algorithms and resorted to a mix of non-STL containers and custom algorithms for some things. The lack of thread safety concerns in STL containers and differences in implementations have also lead me to not use STL containers for thread communication (using a mutex for every access is not efficient).

Conclusion

For the sake of consistency, I’ve wondered if it’s worth incorporating STL components? Is it better to have a mix or none at all? This would leave only inheritance, polymorphism, member protections, namespaces, and automatic object destruction the only C++ features being used. These are still very good reasons to use C++, but I’ve found the transition to not be as productive as I had hoped. I am very curious to hear other folks thoughts on their experience with any of the issues above.

Posted in Drizzle, Main, MySQL | 4 Comments


MySQL Conf & Drizzle Dev Day

February 17th, 2010

I’m glad to announce that we’ll be having a Drizzle developer day again this year on the Friday after the MySQL Conference! Be sure to sign up and add any topic ideas you may have so we know what folks are interested in. Space is limited!

While at the MySQL Conference, I’ll be speaking with Monty Taylor on “Using Drizzle.” This will take a non-developer approach to the project, so everyday DBAs and web developers should find this interesting. I’ll also be teaming up with Giuseppe Maxia to talk about Gearman in three sessions. These include:

We’re also going to have a combo Drizzle/Gearman booth in the expo hall, so be sure to stop by and chat. See you there!

Posted in Drizzle, Gearman, MySQL | No Comments


Linux Conf AU 2010

February 3rd, 2010

I was really excited when I had my Gearman talk accepted to Linux Conf AU 2010 because I had never been out that far in the Pacific (only Hawaii). Of course it wasn’t in Australia this year, and instead in Wellington, New Zealand. My wife came too, and we also made a vacation out of the down times we had around the conference. It turned out Brian couldn’t make it this year so Monty, Stewart, and I gave the Drizzle talk. It was great to see some familiar faces, including Mark Atwood, Giuseppe Maxia, Josh Berkus, and Selena Deckelmann. Josh actually ended up being on the same flight out, so we got to catch up while going through New Zealand customs at 5am after a 13 hour flight. :)

New Zealand is an amazing place. We flew in and out of Auckland and took the train to Wellington. The train ride mostly consisted of grazing sheep once out of the metro areas, did you know there are more sheep than people in NZ? Beyond the sheep, there were great views along the way, especially in the middle near the larger mountains and volcanoes. We stopped for a day to hike the Tongariro Alpine Crossing. It was sunny when we started, but it it was raining with 40mph winds at the top, so we didn’t get to see as much as we hoped. There were still beautiful views on each side though.

The conference was very well run, thanks to anyone who had a hand in it! The speakers dinner was at this great museum nearby on the waterfront and included live Maori singing and dancing. The vegan options were tasty, and I got to meet a few interesting folks there (like the folks from Dreamwidth, a LiveJournal-like blogging service). Some notable sessions during the conference were “The World’s Worst Inventions” by Paul Fenwick, “Anti-features” Keynote by Benjamin Mako Hill, “The Hydras GCC Static Analysis Plugins” by Taras Glek, and “Simplicity Through Optimization” by Paul McKenney. There were many other great sessions, and some I wish I could have attended.

I’m certainly going to try to go again next year, which if you didn’t hear will be in Brisbane!

Posted in Drizzle, Gearman, Main | No Comments


Moving On

January 11th, 2010

Friday was my last day at Sun Microsystems, and today is the first day at my new job (location coming soon). I’ve had a great time at Sun, and thank them for all the opportunities given to me there. I’ll be doing mostly the same work at the new gig, working on projects like Drizzle, but with a slightly different focus. For the most part my day-to-day won’t change much.

Right now I’m focusing on libdrizzle again and am implementing the prepared statement API, cleaning up the MySQL protocol support a little, and also implementing the new Drizzle client/server protocol. I’ll continue to work on Gearman as well, especially where it is relevant to Drizzle. I also need to start blogging again with specific topics in the projects I’m working on, I’ve been fairly quiet lately.

I’ll be in New Zealand next week at Linux Conf AU (yes, it’s not in AU this year). I have a talk on Gearman, and it looks like I’ll also be helping out with the Drizzle talk. It will be really nice to escape the Portland, OR winter for a bit. :)

Posted in Drizzle, Gearman, Main, MySQL | 3 Comments


Pluggable Database Client Tool

November 23rd, 2009

A few weeks ago I wrote about a student group who will be working with the Drizzle community to build a new database client tool. While the tool will be the primary replacement for the Drizzle client tool, we hope it will be generic (using the Python DB API) so it will work with others like MySQL and PostgreSQL. We’ve had a number of great discussions, including a session at OpenSQL camp last weekend. I wanted to toss out a few ideas of how such a tool could be structured to allow for maximum extensibility.

One possibility is to borrow from typical Unix shells and DSP processing systems where you have a number of modules with I/O interfaces and data exchange formats between each module. Each module provides a specific signature so you know what other modules it can plug into. Here is a simple example:

Simple

drizzle> SELECT TABLE_SCHEMA,TABLE_NAME FROM INFORMATION_SCHEMA.TABLES LIMIT 3;
+--------------------+---------------------------------------+
| TABLE_SCHEMA       | TABLE_NAME                            |
+--------------------+---------------------------------------+
| information_schema | CHARACTER_SETS                        |
| information_schema | COLLATIONS                            |
| information_schema | COLLATION_CHARACTER_SET_APPLICABILITY |
+--------------------+---------------------------------------+
3 rows in set (0 sec)

The client tool would start with a single active module, the Command Line reader that provides an interactive prompt. When a command is read, it creates a pipeline including a Query Executer module then a Console Output module (these are all the sane “default” modules to use). Once the pipeline is created, it then provides a Command message to the Query Executer, which talks to the appropriate Drizzle database server, and then starts providing column headers and rows in a Data Set message format to the Console output plugin. Initially this may seem a bit over engineered for such a simple use case, but let’s explore the flexibility this provides.

File Reader/Writer

drizzle> grep foo < commands > results

In this example, the client tool is started with the input module being a file reader. This module will read commands from the file (or stdin if things are piped into stdin) and feed each command to the Query Executer module. Since we specified an optional “grep foo” on the command line, this will initialize the ‘grep’ module (it may be more efficient to put this filtering into the WHERE clause, but ignore that for now). The result set will therefore be piped into the grep module, and then a possibly modified result set is pipelined out of that and into the File Writer module which will write the data set to ‘results’. Now imagine being able to write your own plugins, similar to the grep plugin, that can easily plug right into the pipeline just by using the common name. Each plugin can have any number of options and arguments, allowing you to create flexible modules to do client-side data set processing.

CSV Processor

drizzle> csv input.csv | INSERT INTO x VALUES ($1, $3, $5);

In this case, the command ‘csv’ causes a module to be initialized that will read the file input.csv file. The fields are separated out and formatted into a Data Set exchange message that is then fed into a query generator. The argument to the query generator is the INSERT command from the command line, and a number of INSERT commands would then be generated from the CSV input. These commands would then be piped into the query executer module and eventually results would be displayed on the console. Any other modules could also be plugged in to work on either the input or output data sets (filtering, rewriting, …).

Multi-Server/Result Aggregator

drizzle> \servers 10.0.0.1,10.0.0.2,10.0.0.3
drizzle> SELECT TABLE_SCHEMA,TABLE_NAME FROM INFORMATION_SCHEMA.TABLES LIMIT 3; | merge -c
+-------+--------------------+---------------------------------------+
| COUNT | TABLE_SCHEMA       | TABLE_NAME                            |
+-------+--------------------+---------------------------------------+
| 3     | information_schema | CHARACTER_SETS                        |
| 3     | information_schema | COLLATIONS                            |
| 3     | information_schema | COLLATION_CHARACTER_SET_APPLICABILITY |
+-------+--------------------+---------------------------------------+
3 rows in set (0 sec)

This example takes a list of servers, executes a command on all of them, aggregates the results, and displays the final result to the console. This specific command may not be all that useful, but similar aggregate data sets could be very useful when you manage a number of servers or have sharded data sets across multiple servers. In the command above, the ‘merge -c’ command merges all the resulting data sets into a single data set, and also inserts a ‘count’ column showing how many of each row appeared during the merge (think of the uniq -c unix command line tool).

If you took this concept a step further, you could read result sets from one command and pipe them into a another command generator that could talk to one or more servers. This allows the tool to become a generic “data set router” for various purposes. Rather than using a command line interface, you could replace the Command Line and Console Output plugins with a simple GUI input/out. The entire system could also be used internally within your applications to provide a richer set of functionality that the normal simple DB API provides. Building such a system up from a series of equal modules (ie, there are no first class modules) would allow great flexibility for users and developers to customize to their environment.

Posted in Drizzle, Main, MySQL | 1 Comment


New Database Command Line Client

October 29th, 2009

A few weeks ago I proposed a project to students at Portland State University for their senior capstone class, and this weekend I found out it was chosen by a group! The project will be a rewrite of the command line tool (the Drizzle tool is currently based on the ‘mysql’ tool), plus a lot of new features. We’re really excited to be working with them, and they seem equally excited about the project too. I hope DBAs, developers, and other folks in the Drizzle/MySQL/MariaDB communities will work with them to help define what features should be part of this new command line client. Some new features we have in mind are background queries, piping and redirection of queries (like a normal shell), and plugin support. It will also support at least the MySQL/MariaDB protocol too since it will be built on libdrizzle, but possibly more if we end up using a common DB API (we’re pondering Python). If you have any ideas or feature requests, feel free to leave a comment. The student group will be sending plans to the Drizzle mailing list soon for feedback, as well as attending OpenSQL Camp and leading a session on what folks would like to see in a client tool. Join me in welcoming Clark, Ken, Max, Victoria, David, and Andreas!

Posted in Drizzle, Main, MySQL | 3 Comments


OpenSQL Camp, SQL vs NoSQL

October 26th, 2009

The upcoming OpenSQL Camp is almost full! We have space for 130 people to register, and as of this writing only 10 spots are free. If you want to attend, sign up before it’s too late!

We’re still looking for a few sponsors if anyone is interested in helping cover food and t-shirt costs.

I’m organizing the closing keynote panel, “SQL vs NoSQL”, which will include core community members and committers from a number of open source databases. Selena has offered to take the PostgreSQL position if we don’t find another worthy contender. So far, it will include:

  • Brian Aker – Drizzle
  • Eric Evans – Cassandra
  • Joydeep Sen Sarma – Hive/Hadoop
  • Mike Dirolf – MongoDB
  • Mike Miller – CouchDB
  • Monty Widenius – MariaDB, MySQL
  • Ronald Bradford – MySQL, Drizzle, and others
  • ? – PostgreSQL

I’ll be sure to report who is the last one standing so we know which project to follow the closest. :)

Posted in Drizzle, Main, MySQL | 1 Comment


Eventually Consistent Relational Database?

October 12th, 2009

This weekend I attended Drupal Camp PDX and listened to a session titled “Drupal in the Cloud”. The presenter, Josh Koenig from Chapter Three, gave a great introduction of what moving to “the cloud” really means, especially in the context of a typical web application like Drupal. The problem, which is of course no fault of Josh’s, is that the best high availability database practices are harder to deploy because you’re working within a different set of constraints in the cloud. Sure, you can setup MySQL replication, but without the ability to insert a hardware load balancer or better control over floating IPs, reliable single-master solutions are difficult at best.

I spoke with Josh for a bit after and discussed how Drizzle is doing things to help and what it would take to have a Drizzle back end for Drupal (turns out it should not be too difficult). We then got onto the topic what some of the newer non-relational databases would look like for Drupal, and the short answer is it would be extremely difficult. Drupal, in both the core and many of the modules, depend on a relational model for the underlying data. This is not unique to Drupal. People, and the software they write, have thought “relational” for decades when it comes down to data. Sure, the various NoSQL projects are becoming more popular, but the masses are still thinking in terms of joining tables.

Silver Bullet

So, what would be the silver bullet? A relational database that did not depend on a single master. Not just dual-master setups with offset auto-increment, I’m talking about removing the entire concept of master-slave for replication. This is obviously nothing new in the industry, but it’s never been easy to accomplish. Just do some reading on distributed locking algorithms and you’ll get the idea. The main problem with distributed locking is that they don’t scale.

But, what about an eventually consistent replication model for a relational database? So far eventually consistent databases have not been relational (document based like CouchDB or simple key/value pairs) and relational databases have always focused on atomic consistency or some close relaxed relative (various levels of serialization). As a thought experiment, I’m going to attempt to describe what this may look under the hood at a high level.

Eventually consistent?

Not familiar with this term? Take a look at Werner Vogels’ article on the topic. The main idea behind EC is that you sacrifice the ability for all nodes to see exact same thing at any given time (consistency), but in return you can tolerate network partitions and you have availability. This directly relates to the CAP theorem which states you only get two of: Consistency, Availability, and tolerance to network Partitions. So, we are throwing out “C” so we can get rid of those nasty distributed locking algorithms, but in return we take on “EC”.

MyEventuallyConsistentSQL

Let’s start off with a traditional relational database and start modifying it until we have something that looks like an ECRDBMS (ok, maybe this acronym is a bit wordy).

  • Throw out transactions, serializability, and MVCC – Stay with me. These are not required for a relational data model, and they don’t make much sense when your throw out atomic consistency. At any point an event could come in from another database node that would overwrite whatever you are protecting in your transaction, so what’s the point of protecting yourself? I realize this eliminates a certain class of applications that depend on strict consistency, but this is the cost of moving to EC. We still have a large set of applications that will function just fine even with these dirty reads/writes. I don’t really care if my Drupal site or Wordpress blog has dirty transactions (or none at all). Sure, something might get slightly off in some of those edge cases, but no one is really going to complain, and they are easily resolved when recognized.

  • Remove or fix auto-increment types – I think the best way to handle auto-increment types is to just remove them and require a time-based unique ID (ie, UUID) in your primary key. Sure, they might take a little more space, but space is cheap and it’s a small price to pay for scalability. The other option is to setup some type of pool reservation system for the nodes to pull from, but then we get back into the business of distributed locking or single points of failure. It’s easier to just remove them and use globally unique IDs. These play a role in EC replication as well, so we might as well share them here.

  • Apply EC to replication events – This requires making our events deterministic, regardless of ordering. The easiest way is to tag each row with a time-based unique ID, and the most recent ID always wins. This is a bit rough, but it works. You can refine this in a couple of ways. First, we can change our granularity of events. We probably don’t want to go lager (ie, page-based replication), but we may want to go smaller (field based). You may have a hybrid as well, mixing row and field. The other way to refine EC replication is to allow delta events. Rather than pushing only absolute values, create options on certain columns that allow for increment, decrement, append, and so on. For example, an INTEGER column defined as a inc/dec would always have replication events that looked like “x+=3″ instead of “x=42″. You may even have the option to create user-defined delta algorithms, but I won’t go into that here. Oh, and for those thinking append would not work, you can order your append operations by the time-based unique ID (which means keeping a history of the field). Look at what Google Wave is doing with their text field conflict resolution algorithms.

What are we missing? What else would break down if we toss out atomic consistency and make the above changes? One thing I left out is DDL operations. Those would require some more thought, but I’m pretty sure we could figure out a way to handle conflicting events, possibly with configuration parameters to control the decisions made in conflict resolution algorithms. For example, if an UPDATE event gets applied after a ALTER TABLE that removed a column referenced in the UPDATE, you could just ignore that value and apply the other updates (if any). Chances are you didn’t want that column if it was removed at about the same time. This model has the major benefit of not having to worry about which node is the master or keeping an ordered replication log, they all operate independently and toss deterministic events which can be applied in any order.

Summary

This would-be-ECRDBMS looks a bit different on the inside, but from the outside it will look pretty familiar. From the normal web application perspective we are still creating tables, inserting data, joining data, and doing all the things we depend on from a relational database. This many not be a great idea, but I think it would be possible if you are willing to accept some of the behaviors that come along with it. So what do you think? How can it be improved? Would you use it for your application?

Posted in Drizzle, Main, MySQL | 9 Comments


Non-blocking State Machines

October 7th, 2009

If you’ve ever done any non-blocking programming (usually for socket I/O), you’ve probably had to come up with a non-trivial state machine to handle all the places where everything can pause. Say you’re reading an application level packet from a socket, and half way through the read() system call it screams EAGAIN. You need to stop, save any state, and exit out of whatever chain of functions got you there so the calling application can regain control. I’m going to explain a few techniques I’ve come up with over the years, each with their strengths and weaknesses, and I hope this will spur some conversation of what other folks have done. While I’m fairly happy with how I handle these state machines now, but I’m always looking for a more succinct way of handling things. Please share your thoughts!

Switch Statements

The obvious way to handle non-blocking I/O is with one or more switch statements. Say we need to check the status of something by sending a request over a TCP connection, possibly connecting to the remote host first, and then reading the response. Here is a bit of pseudo-code that demonstrates how this could work (ignoring some error cases, efficient buffer handling, and non-blocking connect cases):

int check_status(struct connection *con)
{
  switch (con->state)
  {
  case CONNECTION_STATE_NONE:
    getaddrinfo(...);
    con->fd = socket(...);
    /* Fall through to next state. */

  case CONNECTION_STATE_CONNECT:
    ret = connect(con->fd, ...);
    if (ret == -1 && errno == EAGAIN)
    {
      con->state = CONNECTION_STATE_CONNECT;
      return WAIT_FOR_WRITE;
    }
    /* Fall through to next state. */

  case CONNECTION_STATE_REQUEST:
    ret = write(con->fd, ...);
    if (ret == -1 && errno == EAGAIN)
    {
      con->state = CONNECTION_STATE_REQUEST;
      return WAIT_FOR_WRITE;
    }
    /* Fall through to next state. */

  case CONNECTION_STATE_RESPONSE_HEADER:
    ret = read(con->fd, ...);
    if (ret == -1 && errno == EAGAIN)
    {
      con->state = CONNECTION_STATE_RESPONSE_HEADER;
      return WAIT_FOR_READ;
    }
    /* Save header. */
    /* Fall through to next state. */

  case CONNECTION_STATE_RESPONSE:
    ret = read(con->fd, ...);
    if (ret == -1 && errno == EAGAIN)
    {
      con->state = CONNECTION_STATE_RESPONSE;
      return WAIT_FOR_READ;
    }
    /* Save response. */

    /* Set this here so we skip the connect state next time around. */
    con->state = CONNECTION_STATE_REQUEST;
    break;
  }
}

The first thing you may cringe at is the fall-through cases in switch statements. The alternative is to set a new state at the end of each case, break, and then reevaluate the switch again with that new state (wrapping the above switch in a while loop). I skipped that version since those are some extra ops that are just not necessary. The above machine may be a bit clunky, but it works for simple cases. But what about when you have more complex states that have loops, non-sequential state execution, or nested switch statements? The above has the potential to grow into an unwieldy mess of code. For example, say if we need to read multiple responses back in the last state above, this could be expanded to:

int check_status(struct connection *con)
{
  switch (con->state)
  {
...
    /* Fall through to next state. */

  case CONNECTION_STATE_RESPONSE:
    while (1)
    {
      if (con->need_header)
      {
        ret = read(con->fd, ...);
        if (ret == -1 && errno == EAGAIN)
        {
          con->state = CONNECTION_STATE_RESPONSE;
          return WAIT_FOR_READ;
        }
        /* Save header. */
        con->need_header = false;
      }

      ret = read(con->fd, ...);
      if (ret == -1 && errno == EAGAIN)
      {
        con->state = CONNECTION_STATE_RESPONSE;
        return WAIT_FOR_READ;
      }
      /* Save response. */
      if (last_response)
        break;
      con->need_header = true;
    }

    /* Set this here so we skip the connect state next time around. */
    con->state = CONNECTION_STATE_REQUEST;
    break;
  }
}

As you can see, another state variable has been added as a boolean (con->need_header). What if responses were not made up of simple header and body? What if there are more nested levels? We can add more switch statements and start breaking this up some into nested functions to make it more readable, but the complexity is still there. For non-trivial non-blocking state machines, this approach is not scalable.

Nested switch/while Statements

Early on in my C years I stumbled upon Duff’s Device. At first I was confused, is that even valid C? Oh, it compiles! Then I was offended. Eventually it clicked and I appreciated the cleverness of the code. Nesting while/for/if with switch statements. I went off to re-write my non-blocking state machines with this new trick:

int check_status(struct connection *con)
{
  switch (con->state)
  {
...
    /* Fall through to next state. */

    while (1)
    {
  case CONNECTION_STATE_RESPONSE_HEADER:
      ret = read(con->fd, ...);
      if (ret == -1 && errno == EAGAIN)
      {
        con->state = CONNECTION_STATE_RESPONSE_HEADER;
        return WAIT_FOR_READ;
      }
      /* Save header. */
      /* Fall through to next state. */

  case CONNECTION_STATE_RESPONSE:
      ret = read(con->fd, ...);
      if (ret == -1 && errno == EAGAIN)
      {
        con->state = CONNECTION_STATE_RESPONSE;
        return WAIT_FOR_READ;
      }
      /* Save response. */
      if (last_response)
        break;
    }

    /* Set this here so we skip the connect state next time around. */
    con->state = CONNECTION_STATE_REQUEST;
    break;
  }
}

Yup, that’s correct. Shove that while loop right in there. Think of it this way: write your state machine as you would if it were blocking, nesting as deep as you need with for/if/while statements. Next, put a switch around the entire thing, and toss a case statement in wherever something could hit a non-blocking condition (regardless of scope or nesting level). Some folks have commented this feels a lot like using gotos, but I disagree. With switch, you have structure, and compiler warnings for when things are missing (like a case). Sure, it may not be the most elegant solution, but you avoid the nested switch statements and multiple state variables. I still use this today for some things (like inside of the Gearman C server and library), but only when they are fairly simple state machines.

Function Pointer Stack

Last year I started writing a non-blocking C library for MySQL. When I head about Drizzle, I decided to focus my effort there (while keeping the MySQL compatibility), and renamed it to libdrizzle. Today it supports the Drizzle protocol and the most common parts of the MySQL protocol. The protocols for these projects are a bit more involved, so when I began writing the library, I went through a few iterations of state machines and didn’t find anything I was happy with. After some brainstorming I came up with an alternative design, I usually refer to it as a “function pointer stack” or “callback stack”. Please let me know if you have seen something like this and point me to the proper name. :)

This works by creating a traditional stack (LIFO structure) of function pointers. When a state needs to be executed, push it on, when a state is complete, it can pop itself off. It’s similar to a program execution stack, but maintained in user space and state is kept so you know where things left off. Still not getting it? Lets look at the code. First, one quick note about function pointer typedefs:

typedef int (state_fn)(struct connection *con);

These are not required of course, but it makes things a bit more legible. This is saying ’state_fn’ is now a type that points to a function with the given signature. It’s a lot easier that having to write the function signature out every time you have a variable of this type.

Now, the code:

typedef int (state_fn)(struct connection *con);

struct connection
{
  ...
  state_fn *state_stack[STACK_SIZE];
  int state_current;
};

/* These functions operation on the function pointer stack. */
static inline bool state_none(struct connection *con)
{
  return con->state_current == 0;
}

static inline void state_push(struct connection *con, state_fn *function)
{
  assert(con->state_current < STACK_SIZE);
  con->state_stack[con->state_current]= function;
  con->state_current++;
}

static inline void state_pop(struct connection *con)
{
  con->state_current--;
}

int state_run(struct connection *con)
{
  int ret;

  while (!state_none(con))
  {
    ret= con->state_stack[con->state_current - 1](con);
    if (ret)
      return ret;
  }

  return 0;
}

/* These are the states that can be pushed onto the stack. */
int start_state(struct connection *con)
{
  getaddrinfo(...);
  con->fd = socket(...);
  state_pop(con);
  state_push(con, connect_state);
  return 0;
}

int connect_state(struct connection *con)
{
  ret = connect(con->fd, ...);
  if (ret == -1 && errno == EAGAIN)
    return WAIT_FOR_WRITE;
  state_pop(con);
  return 0;
}

int request_state(struct connection *con)
{
  if (not connected)
  {
    state_push(con, start_state);
    return 0;
  }

  ret = write(con->fd, ...);
  if (ret == -1 && errno == EAGAIN)
    return WAIT_FOR_WRITE;
  state_pop(con);
  state_push(con, response_header_state);
  return 0;
}

int response_header_state(struct connection *con)
{
  ret = read(con->fd, ...);
  if (ret == -1 && errno == EAGAIN)
    return WAIT_FOR_READ;
  /* Save header. */
  state_pop(con);
  state_push(con, response_state);
  return 0;
}

int response_state(struct connection *con)
{
  ret = read(con->fd, ...);
  if (ret == -1 && errno == EAGAIN)
    return WAIT_FOR_READ;
  /* Save response. */
  state_pop(con);
  if (have_more_responses)
    state_push(con, response_header_state);
  return 0;
}

/* Here is a function you would make public in the API to start the state machine. */
int check_status(struct connection *con)
{
  /* If we are coming back into this after a blocking event,
     make sure we don't push a new state on again. */
  if (state_none(con))
    state_push(con, request_state);

  return state_run(con);
}

As you can see, we still start in the check_status function, but push a state if there is no state and then go into our run loop. You can follow along the various functions (sort of like a choose your own adventure book) but eventually you should end up with an empty stack. When this happens, the state_run() function returns 0 and the call is complete.

This may be a bit overkill for such a simple state machine, but as your state execution flow becomes non-sequential (random jumps, recursion, …) the power and flexibility of this design becomes apparent. And what? No switch statements? As far as performance is concerned, you may have more function calls, but you are eliminating jumps (those nested if/switches). For example, if your state is five levels deep and you need to keep pausing and returning to that point, you hit all those switch statements every time. With the above approach? You jump directly into the function you left off in. I’m not sure which one is faster in general (really depends on application), but the cost of switches vs function calls will be insignificant compared to what normal applications are actually doing (like system calls for I/O).

I have working C and C++ examples of what a complete state machine looks like. There is also some micro-benchmarking numbers in there comparing C vs C++ (you take a hit in C++ due to inheritance, but that cost is fairly insignificant).

Thoughts?
Another choice I didn’t bother to mention is to have one thread per connection and let it block, but that doesn’t scale. What methods have you used to solve this? Do the nested switch/whiles offend you? Are the function pointer stacks elegant or spaghetti code?

Posted in Drizzle, Gearman, Main, MySQL | 10 Comments


< Older Entries |