Everything I know about databases is wrong. Also, right.

June 24th, 2010, 12:48 pm PDT by Greg

I have been teaching CMPT 470 for six years now; my 13th offering is underway as I write this. Anybody doing that is going to pick up a thing or two about web systems.

I was there for the rise of the MVC frameworks and greeted them with open arms. I watched Web 2.0 proclaim “screw it, everything is JavaScript now” and listened with suspicion, but interest. I am currently watching HTML5/CSS3 develop with excitement but wondering why nobody is asking whether IE will support any of it before the sun burns out.

There’s another thing on the horizon that is causing me great confusion: NoSQL.

The NoSQL idea is basically that relational databases (MySQL, Oracle, MSSQL, etc.) are not the best solution to every problem, and that there is a lot more to the data-storage landscape. I can get behind that.

But then, the NoSQL aficionados keep talking. “Relational databases are slow” they say. “You should never JOIN.” “Relational databases can’t scale.” These things sound suspicious. Relational databases have a long history of being very good at their job: these are big assertions that should be accompanied by equally-big evidence.

So, I’m going to try to talk some of this through. Let’s start with the non-relational database types. (I’ll stick to the ones getting a lot of NoSQL-related attention.)

Key-value stores
(e.g. Cassandra, Memcachedb) A key-value store sounds simple enough: it’s a collection of keys (that you look up with) and each key has an associated value (which is the data you want). For Memcachedb, that’s exactly what you get: keys (strings) and values (strings/binary blobs that you interpret to your whim).

Cassandra adds another layer of indirection: each “value” can itself be a dictionary of key-value pairs. So, the “value” associated with the key “ggbaker” might be {"fname":"Greg", "mi":"G", "lname":"Baker"}. Each of those sub-key-values is called a “column”. So, the record “ggbaker” has a column with name “fname” and value “Greg” (with a timestamp). Each record can have whatever set of columns are appropriate.
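As a rough mental model (plain Python dicts, not any real client API; the data here is just my example record from above), Memcachedb is a flat mapping, while Cassandra nests one more level and stamps each column:

```python
import time

# Memcachedb-style: a flat mapping of string keys to opaque values.
# The store doesn't interpret the value; that's the application's job.
memcache_db = {}
memcache_db["ggbaker"] = b'{"fname": "Greg"}'

# Cassandra-style: each key maps to a record of named "columns",
# and every column value carries a timestamp.
cassandra_db = {}
cassandra_db["ggbaker"] = {
    "fname": ("Greg",  time.time()),
    "mi":    ("G",     time.time()),
    "lname": ("Baker", time.time()),
}

# Reading a column gets you both the value and its timestamp.
value, stamp = cassandra_db["ggbaker"]["fname"]
```

Obviously the real systems add distribution, persistence, and conflict resolution on top; this is only the shape of the data.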

Document stores
(e.g. CouchDB, MongoDB) The idea here is that each “row” of your data is basically a collection of key-value pairs. For example, one record might be {"fname":"Greg", "mi":"G", "lname":"Baker"}. Some other records might be missing the middle initial, or have a phone number added: there is no fixed schema, just rows storing properties. I choose to think of this as a “collection of JSON objects that you can query” (but of course the internal data format is probably not JSON).
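To make that concrete, here is a toy “document store” in Python: just a list of dicts plus an ad-hoc query function (my own invention for illustration, nothing like the real CouchDB/MongoDB query machinery):

```python
# A toy document store: records are free-form dicts with no fixed schema.
people = [
    {"fname": "Greg", "mi": "G", "lname": "Baker"},
    {"fname": "Ada", "lname": "Lovelace", "phone": "555-0100"},  # no "mi"
]

def find(collection, **criteria):
    """Return documents whose fields match all of the given criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

matches = find(people, lname="Baker")
```

The point is that the second record is missing "mi" and has an extra "phone", and nothing complains: rows are just bags of properties.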

Mongo has a useful SQL to Mongo chart that summarizes things nicely.

Tabular
(e.g. BigTable, HBase) The big difference here seems to be that the tabular databases use a fixed schema. So, I have to declare ahead of time that I will have a “people” table and entries in there can have columns “fname”, “lname”, and “mi”. Not every column has to be filled for each row, but there’s a fixed set.

There are typically many of these “tables”, each with their own schema.
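In the same toy-Python spirit (again, made-up code, not any real API), the tabular model amounts to a collection of tables, each carrying its declared column set and rejecting anything outside it:

```python
# A toy tabular store: each table declares its column set up front.
tables = {"people": {"schema": {"fname", "lname", "mi"}, "rows": []}}

def insert(table_name, row):
    """Add a row, allowing missing columns but not undeclared ones."""
    table = tables[table_name]
    unknown = set(row) - table["schema"]
    if unknown:
        raise ValueError("columns not in schema: %s" % sorted(unknown))
    table["rows"].append(row)

insert("people", {"fname": "Greg", "lname": "Baker"})  # "mi" omitted: fine
```

So compared to the document stores, the flexibility is only in which declared columns a row fills in, not in what columns exist.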

Summary: There’s a lot of similarity here. Things aren’t as different as I thought. In fact, the big common thread is certainly less-structured data (compared to the relational style of foreign keys and rigid data definition). Of course, I haven’t gotten into how you can actually query this data, but that’s a whole other thing.

Let’s see if I can summarize this (with Haskell-ish type notation, since that’s fresh in my head).

type Key  = String
type Data = String
memcacheDB :: Map Key Data

type CassandraRecord = Map Key (Data, Timestamp)
cassandraDB :: Map Key CassandraRecord

type JSON = Map Key (String | Number | … | JSON)
mongoDB, couchDB :: [JSON]

type Schema = [Key]
type BigTable = (Schema, [Map Key Data])  -- only keys from the Schema are allowed in each map
bigTableDB :: Map Key BigTable            -- the key here is the table name

The documentation for these projects is generally somewhere between poor and non-existent: there are a lot of claims of speed and efficiency and how they are totally faster than MySQL. What’s in short supply are examples/descriptions of how to actually get things done. (For example, somewhere in my searching, I saw the phrase “for examples of usage, see the unit tests.”)

That’s a good start. Hopefully I can get back to this and say something else useful on the topic.

6 Responses to “Everything I know about databases is wrong. Also, right.”

  1. Curtis Lassam » Blog Archive » NoSQL vs YesSQL Says:

    […] is currently involved in the process of trying to wrap his head around NoSQL. Technologies such as Cassandra, Memcached, CouchDB, MongoDB, BigTable, SmallTable, MediumTable, […]

  2. Nico Says:

    It feels to me that they’re all conceptually the same (a bunch of records), with each one taking a different position in the compromise of storage/memory footprint vs. flexibility/query efficiency.

  3. Thusjanthan Kubendranathan Says:

    I seem to be hearing a lot of that as well: how non-relational databases are much faster and more efficient than relational databases. I wonder the same things: How? Why? Examples? It does suggest that they are faster, seeing as how Google uses a large number of clustered BigTable instances as their backend “DB”, I believe.

    Thusjanthan

  4. Lee Iverson Says:

    A very interesting variation on this theme is Redis (http://code.google.com/p/redis). This is a key-value store, but each value has a type, with a set of atomic ops that can be performed on it (e.g. stack, queue, list, hash, etc.)

  5. Curt Sampson Says:

    The proposition that certain implementations of DBMSes, in certain specific conditions, are faster than others is trivially true. The reasoning behind all of this is bunk, for the most part, and comes from a complete misunderstanding of relational theory.

    Relational theory says nothing about the physical storage model, which is the key when one is considering access speed under any particular circumstances.

    I too believe that SQL is a rather nasty little language, but that has not much to do with the implementations of DBMSes using it, and a lot to do with that it’s hard to express relational queries in SQL. Were the term “NoSQL” talking about this, rather than simply being a term adopted for certain implementations of storage engines that don’t use SQL, I’d quite like the term.

    I think that many of these users are correct when they say that the relational model doesn’t suit certain particular things that they are trying to do, if you look at it in the sense that they are trying to use a hierarchical or network database model (and writing massive OORM systems to convert between the two models). However, I think that they are not correct in the sense that the relational model isn’t useful in a larger sense for what they want to do, in that the relational model could quite happily support, and support better, their applications if only they would use it and they had an appropriate implementation of it available. I.e., the specific problem that they are solving is that they don’t want to use the relational model (rather than a more general data processing problem), and it’s pretty clear that using an RDBMS when you don’t want one is a bad idea.

    I’ve had the germs of a detailed article or blog post on this topic brewing around for a while; I really need to spend a couple of days at some point writing it.

  6. Lew Payne Says:

    I’ve been hearing such claims for many decades now, and such things are hardly new. Berkeley DB, for example, has been around (in several incarnations) for longer than most can remember.

    In almost all areas of life (not just science), I’ve learned to quantify extraordinary claims early on. An extraordinary claim requires extraordinary proof. But even prior to that, it requires quantification. So, therefore, I begin by dissecting the claim itself… “faster than MySQL at what particular operations, using which particular data sets, arranged in what particular order, with what particular query?”

    While I wholeheartedly agree that relational databases have been abused by so-called computer programmers (by being grafted into service in areas where a plain flat-file methodology would serve better), I also agree there’s a time and place for each. In my mind, it is up to the contender (i.e., the so-called “new” schemes) to provide extraordinary proof. Until such is done, I see the style and format of advertising their “breakthrough” – filled with extraordinary claims – as nothing more than hollow platitudes.