Eric Day

Thoughts, code, and other oddments.
Dark | Light

< Older Entries || Newer Entries >

Archive for the "MySQL" Category


Drizzle Protocol Changes

March 17th, 2010

On an entirely unrelated note to the MySQL protocol discussions happening yesterday, the MySQL protocol is now the default protocol in Drizzle as of Monday’s tarball (3/15). Drizzle supports a limited version of the MySQL protocol, only supporting the subset of commands Drizzle cares about (no server-side prepared statements, replication, or deprecated commands due to SQL query equivalents). Not all MySQL clients have been fully tested with it, but our entire test suite is using it now with the libdrizzle MySQL implementation. The latest release of libdrizzle also includes defaulting to the MySQL protocol and port for Drizzle connections.

There has been some debate about this change, even amongst some of the core developers. The current Drizzle protocol is a slight modification of the MySQL protocol. It has been running on the IANA assigned Drizzle port (4427) by default since the beginning. We are developing a new Drizzle protocol which will have a number of new features, as well as being more extensible than the old protocol. It may be some time before this protocol is stable and tested well enough to make it the default. Thinking into the future, there are a couple upgrade paths to consider when it is ready:

The first option is to declare a certain release the “new protocol release” and clients talking to it would need to be upgraded at the same time. This would switch the line-level protocol on port 4427 from the old Drizzle protocol to the new. Clients would be required to upgrade because those using the old protocol would break when trying to connect to the new server-side protocol module (there are no clever, efficient hacks here because different sides send the first byte for each protocol). This option doesn’t involve any dependencies on the MySQL protocol module or APIs at all, but does make a forced upgrade situation in the future for clients tools and APIs.

The second option is to make the MySQL protocol the default protocol now, and when the new Drizzle protocol module is ready and available, tell folks they can start using it. Those people using libdrizzle or it’s derived APIs will have an easy transition because a libdrizzle release can change the default as well. This allows us to put the Drizzle protocol module (port 4427) in an experimental mode that changes between releases as we develop it. This introduces a required dependency to the MySQL protocol module for the time being, and possible confusion when we make the switch back.

Both approaches have pros and cons, and for better or for worse, it has been decided to take the latter approach. If you start running the drizzle tarball being released today, the client tools will speak MySQL to port 3306 by default, and the latest libdrizzle release defaults to this as well. If you are running Drizzle on the same machine as MySQL/MariaDB and you get a port conflict on 3306 when starting, be sure to start drizzled with –mysql-protocol-port=X to bind to a different port (you’ll of course need to use the same port in the client utils/APIs when connecting).

Posted in Drizzle, Main, MySQL | 8 Comments


Open Source Bridge 2010

March 16th, 2010

I’m Attending Open Source Bridge – June 1–4, 2010 – Portland, OR

A couple months ago Selena Deckelmann asked if I wanted to co-chair the Open Source Bridge Conference this year, and I was thrilled to say yes! This conference is all volunteer run by some of the most dedicated volunteers I have ever seen, I’m excited to be working with such a fantastic group of people. The conference is also backed by the 501(c)3 non-profit Technocation which is primarily run by Sheeri Cabral who is well known in the MySQL community.

The conference is June 1-4 in Portland, OR, and will be held at the Portland Art Museum. The call for proposals is open until the end-of-day on March 25th, so please submit your ideas now! Early registration is also open until April 1st, so now would also be a great time to sign up. We are still looking for sponsors, so if your organization or company is interested, please get in touch!

Emphasizing what our about page describes, this conference is slightly different in some regards to others in that we’re focused on open source citizenship. We want folks to openly share ideas in a variety of ways and to get things done when they attend. We also are trying to keep the cost as low as possible so it is accessible to a larger audience.

We’re also partnering with O’Reilly’s OSCON which is in Portland during late July because both sides feel the two conference complement each other, we are in no way trying to compete. Portland is bursting with so much open source, one conference could not contain it! You should come check it out. :)

Posted in Drizzle, Main, MySQL | 1 Comment


Drizzling from the Rackspace Cloud

March 8th, 2010

Since I left Sun back in January, folks have been asking what was next. I’m happy to say that I’m going to continue hacking on open source projects like Drizzle and Gearman, but now at the Rackspace Cloud. Not only will I be there, but I get to continue working closely with a few of the amazing Drizzle hackers who have also joined, including Monty Taylor, Jay Pipes, Stewart Smith, and Lee Bieber.

Why Rackspace Cloud? Late last year I was considering what I wanted to do next with the Oracle acquisition looming near, and this was one of the options that presented itself. Rackspace had been a supporter of Drizzle from early on by offering virtual machines to develop and test on, and when talking to some folks more closely, something really hit home. Rackspace provides first-class service and “fanatical” support – they are not a software company. One might ask why an open source software developer would be interested in a company that doesn’t create software or vice-versa, and the answer is that Rackspace wants to find ways to offer the best possible service now and into the future. What better way than to help develop the next generation of service software and get a jump start into integrating this into their architecture? Both the open source community and Rackspace win.

Another thing I learned while talking with Rackspace is that one of their core principles is transparency. This applies to both customer and employees, and anyone within an open source community can appreciate this. The more I learned about the company and the folks within it, the more impressed I was at the lack of internal barriers or “need-to-know” information. One of Drizzle’s core goals is also transparency, from discussing design decisions on public mailing lists and IRC, to having the entire project management infrastructure hosted out in the open at Launchpad.

What does this mean for the Drizzle project? It means continued support for a number of core developers, more infrastructure for development, and most importantly in my eyes, more context. One of the Drizzle tag-lines is “A Lightweight SQL Database for Cloud and Web,” so what better place to develop a database designed for the cloud than on one of the fastest growing cloud platforms. We’ll get a detailed look at the demands, get feedback from cloud customers, and have the perfect test bed for offering new services. We’ll also be able to work closely with a top-notch group of DBAs, developers, and sysadmins in one of the most demanding service architectures out there. This invaluable context will help the Drizzle developers make more informed decisions moving forward, which also means better software for the community.

Personally, this also means getting back to my hosting roots. Before Sun, I worked at Concentric for almost 10 years in a clustered hosting environment. I’m very familiar with many of the multi-tenant scalability concerns Rackspace has, and I’m excited to be working in this type of environment again. We’ve already been working closely with the MySQL DBAs at Rackspace to learn what the biggest pain points are for a multi-tenant architecture, and we’ll be taking steps to address these as it will help anyone wanting to run Drizzle in a cloud-like environment. Drizzle’s modular architecture has already proved useful, as some of these concerns are easily answered with “oh, we have a plugin point for that.”

I’m excited, this is going to be a fun ride.

Posted in Drizzle, Gearman, Main, MySQL | 16 Comments


C++, or Something Like It

March 5th, 2010

I’ve developed primarily in C most of my career, and recently decided to give C++ a shot as my “primary language” due to hacking on Drizzle and MySQL. The past few months I’ve read and experimented with most features C++ provides over C, including reading Scott Meyer’s excellent “Effective” series books (highly recommended). Along the way I’ve been developing a project I’ve wanted to write for a while, and I’m finding some features to be problematic. I thought I’d share these issues so others can be aware of them and perhaps I can learn better workarounds.

The project I’ve been working on uses dynamic shared object loading at runtime (using dlopen() and friends), is threaded, and has about every strict compiler warning on you can find and being treated as errors (thanks to Monty Taylor’s pandora-build project). I’m also testing on various architectures and compilers, including Linux, OpenSolaris, and OSX. I also have been trying my best to avoid any dependencies on large C++ libraries like Boost and just stick to the standard language and STL. With these requirements in mind, here are the issues I’ve run into:

Can’t Reliably Use Exceptions

My first pass relied on exceptions, but this proved problematic on some architectures as soon as custom exceptions were being throw across module boundaries. This comes down to ABI issues for some shared object formats generated by some compiler versions. While you can make it work in some environments, it’s not going to be portable. This means I’ve had to catch exceptions closer to where they are throw, requiring a lot more try/catch blocks, and not being able to take full advantage of automatic stack cleanup. This also means resorting back to the C way or handling exceptions: returning and checking return codes while generating error strings. To be completely exception safe, this means not using std::string for error returns since they can throw exceptions while building useful error messages. Not using exceptions has had a viral effect throughout the rest of the design of the code, making it look more like C. I was a bit disappointed by this, as not having to check every function’s return code was keeping the code very clean. :)

Limited Use of the STL and std::string

I was excited to take advantage of the STL, as writing things like doubly-linked lists and hash tables for every C struct was getting a bit old (I did have a set of macros I used, but they were not the most popular in some circles because of certain C-preprocessor features). When I learned more about the internals of the STL, and how it relies heavily on copying objects, my heart sunk a little. It completely makes sense in the design, it’s just not as efficient as it could be (especially coming from a place where I would optimize to reduce pointer copies in C). No worries, I just created private copy constructors/assignment operators and only used pointers to objects. This came with it’s own set of issues with pointer management and avoiding leaks if the ‘new’ operator were to fail. Once working out the memory management issues, there were still exceptions to watch out for, including figuring out all the methods that may throw (due to an internal allocation usually). This is especially annoying when doing simple std::string operations like assignment or concatenation, and having to always catch around those. With other annoyances like the reference-to-reference issues and std::unary_function having a non-virtual destuctor, I’ve ended up using a watered down set of STL algorithms and resorted to a mix of non-STL containers and custom algorithms for some things. The lack of thread safety concerns in STL containers and differences in implementations have also lead me to not use STL containers for thread communication (using a mutex for every access is not efficient).

Conclusion

For the sake of consistency, I’ve wondered if it’s worth incorporating STL components? Is it better to have a mix or none at all? This would leave only inheritance, polymorphism, member protections, namespaces, and automatic object destruction the only C++ features being used. These are still very good reasons to use C++, but I’ve found the transition to not be as productive as I had hoped. I am very curious to hear other folks thoughts on their experience with any of the issues above.

Posted in Drizzle, Main, MySQL | 4 Comments


MySQL Conf & Drizzle Dev Day

February 17th, 2010

I’m glad to announce that we’ll be having a Drizzle developer day again this year on the Friday after the MySQL Conference! Be sure to sign up and add any topic ideas you may have so we know what folks are interested in. Space is limited!

While at the MySQL Conference, I’ll be speaking with Monty Taylor on “Using Drizzle.” This will take a non-developer approach to the project, so everyday DBAs and web developers should find this interesting. I’ll also be teaming up with Giuseppe Maxia to talk about Gearman in three sessions. These include:

We’re also going to have a combo Drizzle/Gearman booth in the expo hall, so be sure to stop by and chat. See you there!

Posted in Drizzle, Gearman, MySQL | No Comments


Moving On

January 11th, 2010

Friday was my last day at Sun Microsystems, and today is the first day at my new job (location coming soon). I’ve had a great time at Sun, and thank them for all the opportunities given to me there. I’ll be doing mostly the same work at the new gig, working on projects like Drizzle, but with a slightly different focus. For the most part my day-to-day won’t change much.

Right now I’m focusing on libdrizzle again and am implementing the prepared statement API, cleaning up the MySQL protocol support a little, and also implementing the new Drizzle client/server protocol. I’ll continue to work on Gearman as well, especially where it is relevant to Drizzle. I also need to start blogging again with specific topics in the projects I’m working on, I’ve been fairly quiet lately.

I’ll be in New Zealand next week at Linux Conf AU (yes, it’s not in AU this year). I have a talk on Gearman, and it looks like I’ll also be helping out with the Drizzle talk. It will be really nice to escape the Portland, OR winter for a bit. :)

Posted in Drizzle, Gearman, Main, MySQL | 4 Comments


Pluggable Database Client Tool

November 23rd, 2009

A few weeks ago I wrote about a student group who will be working with the Drizzle community to build a new database client tool. While the tool will be the primary replacement for the Drizzle client tool, we hope it will be generic (using the Python DB API) so it will work with others like MySQL and PostgreSQL. We’ve had a number of great discussions, including a session at OpenSQL camp last weekend. I wanted to toss out a few ideas of how such a tool could be structured to allow for maximum extensibility.

One possibility is to borrow from typical Unix shells and DSP processing systems where you have a number of modules with I/O interfaces and data exchange formats between each module. Each module provides a specific signature so you know what other modules it can plug into. Here is a simple example:

Simple

drizzle> SELECT TABLE_SCHEMA,TABLE_NAME FROM INFORMATION_SCHEMA.TABLES LIMIT 3;
+--------------------+---------------------------------------+
| TABLE_SCHEMA       | TABLE_NAME                            |
+--------------------+---------------------------------------+
| information_schema | CHARACTER_SETS                        |
| information_schema | COLLATIONS                            |
| information_schema | COLLATION_CHARACTER_SET_APPLICABILITY |
+--------------------+---------------------------------------+
3 rows in set (0 sec)

The client tool would start with a single active module, the Command Line reader that provides an interactive prompt. When a command is read, it creates a pipeline including a Query Executer module then a Console Output module (these are all the sane “default” modules to use). Once the pipeline is created, it then provides a Command message to the Query Executer, which talks to the appropriate Drizzle database server, and then starts providing column headers and rows in a Data Set message format to the Console output plugin. Initially this may seem a bit over engineered for such a simple use case, but let’s explore the flexibility this provides.

File Reader/Writer

drizzle> grep foo < commands > results

In this example, the client tool is started with the input module being a file reader. This module will read commands from the file (or stdin if things are piped into stdin) and feed each command to the Query Executer module. Since we specified an optional “grep foo” on the command line, this will initialize the ‘grep’ module (it may be more efficient to put this filtering into the WHERE clause, but ignore that for now). The result set will therefore be piped into the grep module, and then a possibly modified result set is pipelined out of that and into the File Writer module which will write the data set to ‘results’. Now imagine being able to write your own plugins, similar to the grep plugin, that can easily plug right into the pipeline just by using the common name. Each plugin can have any number of options and arguments, allowing you to create flexible modules to do client-side data set processing.

CSV Processor

drizzle> csv input.csv | INSERT INTO x VALUES ($1, $3, $5);

In this case, the command ‘csv’ causes a module to be initialized that will read the file input.csv file. The fields are separated out and formatted into a Data Set exchange message that is then fed into a query generator. The argument to the query generator is the INSERT command from the command line, and a number of INSERT commands would then be generated from the CSV input. These commands would then be piped into the query executer module and eventually results would be displayed on the console. Any other modules could also be plugged in to work on either the input or output data sets (filtering, rewriting, …).

Multi-Server/Result Aggregator

drizzle> \servers 10.0.0.1,10.0.0.2,10.0.0.3
drizzle> SELECT TABLE_SCHEMA,TABLE_NAME FROM INFORMATION_SCHEMA.TABLES LIMIT 3; | merge -c
+-------+--------------------+---------------------------------------+
| COUNT | TABLE_SCHEMA       | TABLE_NAME                            |
+-------+--------------------+---------------------------------------+
| 3     | information_schema | CHARACTER_SETS                        |
| 3     | information_schema | COLLATIONS                            |
| 3     | information_schema | COLLATION_CHARACTER_SET_APPLICABILITY |
+-------+--------------------+---------------------------------------+
3 rows in set (0 sec)

This example takes a list of servers, executes a command on all of them, aggregates the results, and displays the final result to the console. This specific command may not be all that useful, but similar aggregate data sets could be very useful when you manage a number of servers or have sharded data sets across multiple servers. In the command above, the ‘merge -c’ command merges all the resulting data sets into a single data set, and also inserts a ‘count’ column showing how many of each row appeared during the merge (think of the uniq -c unix command line tool).

If you took this concept a step further, you could read result sets from one command and pipe them into a another command generator that could talk to one or more servers. This allows the tool to become a generic “data set router” for various purposes. Rather than using a command line interface, you could replace the Command Line and Console Output plugins with a simple GUI input/out. The entire system could also be used internally within your applications to provide a richer set of functionality that the normal simple DB API provides. Building such a system up from a series of equal modules (ie, there are no first class modules) would allow great flexibility for users and developers to customize to their environment.

Posted in Drizzle, Main, MySQL | 1 Comment


New Database Command Line Client

October 29th, 2009

A few weeks ago I proposed a project to students at Portland State University for their senior capstone class, and this weekend I found out it was chosen by a group! The project will be a rewrite of the command line tool (the Drizzle tool is currently based on the ‘mysql’ tool), plus a lot of new features. We’re really excited to be working with them, and they seem equally excited about the project too. I hope DBAs, developers, and other folks in the Drizzle/MySQL/MariaDB communities will work with them to help define what features should be part of this new command line client. Some new features we have in mind are background queries, piping and redirection of queries (like a normal shell), and plugin support. It will also support at least the MySQL/MariaDB protocol too since it will be built on libdrizzle, but possibly more if we end up using a common DB API (we’re pondering Python). If you have any ideas or feature requests, feel free to leave a comment. The student group will be sending plans to the Drizzle mailing list soon for feedback, as well as attending OpenSQL Camp and leading a session on what folks would like to see in a client tool. Join me in welcoming Clark, Ken, Max, Victoria, David, and Andreas!

Posted in Drizzle, Main, MySQL | 5 Comments


OpenSQL Camp, SQL vs NoSQL

October 26th, 2009

The upcoming OpenSQL Camp is almost full! We have space for 130 people to register, and as of this writing only 10 spots are free. If you want to attend, sign up before it’s too late!

We’re still looking for a few sponsors if anyone is interested in helping cover food and t-shirt costs.

I’m organizing the closing keynote panel, “SQL vs NoSQL”, which will include core community members and committers from a number of open source databases. Selena has offered to take the PostgreSQL position if we don’t find another worthy contender. So far, it will include:

  • Brian Aker – Drizzle
  • Eric Evans – Cassandra
  • Joydeep Sen Sarma – Hive/Hadoop
  • Mike Dirolf – MongoDB
  • Mike Miller – CouchDB
  • Monty Widenius – MariaDB, MySQL
  • Ronald Bradford – MySQL, Drizzle, and others
  • ? – PostgreSQL

I’ll be sure to report who is the last one standing so we know which project to follow the closest. :)

Posted in Drizzle, Main, MySQL | 1 Comment


Eventually Consistent Relational Database?

October 12th, 2009

This weekend I attended Drupal Camp PDX and listened to a session titled “Drupal in the Cloud”. The presenter, Josh Koenig from Chapter Three, gave a great introduction of what moving to “the cloud” really means, especially in the context of a typical web application like Drupal. The problem, which is of course no fault of Josh’s, is that the best high availability database practices are harder to deploy because you’re working within a different set of constraints in the cloud. Sure, you can setup MySQL replication, but without the ability to insert a hardware load balancer or better control over floating IPs, reliable single-master solutions are difficult at best.

I spoke with Josh for a bit after and discussed how Drizzle is doing things to help and what it would take to have a Drizzle back end for Drupal (turns out it should not be too difficult). We then got onto the topic what some of the newer non-relational databases would look like for Drupal, and the short answer is it would be extremely difficult. Drupal, in both the core and many of the modules, depend on a relational model for the underlying data. This is not unique to Drupal. People, and the software they write, have thought “relational” for decades when it comes down to data. Sure, the various NoSQL projects are becoming more popular, but the masses are still thinking in terms of joining tables.

Silver Bullet

So, what would be the silver bullet? A relational database that did not depend on a single master. Not just dual-master setups with offset auto-increment, I’m talking about removing the entire concept of master-slave for replication. This is obviously nothing new in the industry, but it’s never been easy to accomplish. Just do some reading on distributed locking algorithms and you’ll get the idea. The main problem with distributed locking is that they don’t scale.

But, what about an eventually consistent replication model for a relational database? So far eventually consistent databases have not been relational (document based like CouchDB or simple key/value pairs) and relational databases have always focused on atomic consistency or some close relaxed relative (various levels of serialization). As a thought experiment, I’m going to attempt to describe what this may look under the hood at a high level.

Eventually consistent?

Not familiar with this term? Take a look at Werner Vogels’ article on the topic. The main idea behind EC is that you sacrifice the ability for all nodes to see exact same thing at any given time (consistency), but in return you can tolerate network partitions and you have availability. This directly relates to the CAP theorem which states you only get two of: Consistency, Availability, and tolerance to network Partitions. So, we are throwing out “C” so we can get rid of those nasty distributed locking algorithms, but in return we take on “EC”.

MyEventuallyConsistentSQL

Let’s start off with a traditional relational database and start modifying it until we have something that looks like an ECRDBMS (ok, maybe this acronym is a bit wordy).

  • Throw out transactions, serializability, and MVCC – Stay with me. These are not required for a relational data model, and they don’t make much sense when your throw out atomic consistency. At any point an event could come in from another database node that would overwrite whatever you are protecting in your transaction, so what’s the point of protecting yourself? I realize this eliminates a certain class of applications that depend on strict consistency, but this is the cost of moving to EC. We still have a large set of applications that will function just fine even with these dirty reads/writes. I don’t really care if my Drupal site or WordPress blog has dirty transactions (or none at all). Sure, something might get slightly off in some of those edge cases, but no one is really going to complain, and they are easily resolved when recognized.

  • Remove or fix auto-increment types – I think the best way to handle auto-increment types is to just remove them and require a time-based unique ID (ie, UUID) in your primary key. Sure, they might take a little more space, but space is cheap and it’s a small price to pay for scalability. The other option is to setup some type of pool reservation system for the nodes to pull from, but then we get back into the business of distributed locking or single points of failure. It’s easier to just remove them and use globally unique IDs. These play a role in EC replication as well, so we might as well share them here.

  • Apply EC to replication events – This requires making our events deterministic, regardless of ordering. The easiest way is to tag each row with a time-based unique ID, and the most recent ID always wins. This is a bit rough, but it works. You can refine this in a couple of ways. First, we can change our granularity of events. We probably don’t want to go lager (ie, page-based replication), but we may want to go smaller (field based). You may have a hybrid as well, mixing row and field. The other way to refine EC replication is to allow delta events. Rather than pushing only absolute values, create options on certain columns that allow for increment, decrement, append, and so on. For example, an INTEGER column defined as a inc/dec would always have replication events that looked like “x+=3″ instead of “x=42″. You may even have the option to create user-defined delta algorithms, but I won’t go into that here. Oh, and for those thinking append would not work, you can order your append operations by the time-based unique ID (which means keeping a history of the field). Look at what Google Wave is doing with their text field conflict resolution algorithms.

What are we missing? What else would break down if we toss out atomic consistency and make the above changes? One thing I left out is DDL operations. Those would require some more thought, but I’m pretty sure we could figure out a way to handle conflicting events, possibly with configuration parameters to control the decisions made in conflict resolution algorithms. For example, if an UPDATE event gets applied after a ALTER TABLE that removed a column referenced in the UPDATE, you could just ignore that value and apply the other updates (if any). Chances are you didn’t want that column if it was removed at about the same time. This model has the major benefit of not having to worry about which node is the master or keeping an ordered replication log, they all operate independently and toss deterministic events which can be applied in any order.

Summary

This would-be-ECRDBMS looks a bit different on the inside, but from the outside it will look pretty familiar. From the normal web application perspective we are still creating tables, inserting data, joining data, and doing all the things we depend on from a relational database. This many not be a great idea, but I think it would be possible if you are willing to accept some of the behaviors that come along with it. So what do you think? How can it be improved? Would you use it for your application?

Posted in Drizzle, Main, MySQL | 9 Comments


< Older Entries || Newer Entries >
Blog
Wiki
About
Resume
RSS
Comments

E-Mail
Launchpad
LinkedIn
Twitter
identi.ca
Facebook

OpenStack
Scale Stack
Gearman
NW Veg
Veg Food & Fit

Linux On Laptops