I have been teaching CMPT 470 for six years now, with my 13th offering going on right now. Anybody doing that is going to pick up a thing or two about web systems.
I was there for the rise of the MVC frameworks and greeted them with open arms. I watched Web 2.0 proclaim “screw it, everything is JavaScript now” and listened with suspicion, but interest. I am currently watching HTML5/CSS3 develop with excitement but wondering why nobody is asking whether IE will support any of it before the sun burns out.
There’s another thing on the horizon that is causing me great confusion: NoSQL.
The NoSQL idea is basically that relational databases (MySQL, Oracle, MSSQL, etc.) are not the best solution to every problem, and that there is a lot more to the data-storage landscape. I can get behind that.
But then, the NoSQL aficionados keep talking. “Relational databases are slow,” they say. “You should never JOIN.” “Relational databases can’t scale.” These things sound suspicious. Relational databases have a long history of being very good at their job: these are big assertions that should be accompanied by equally big evidence.
So, I’m going to try to talk some of this through. Let’s start with the non-relational database types. (I’ll stick to the ones getting a lot of NoSQL-related attention.)
- Key-value stores
- (e.g. Cassandra, MemcacheDB) A key-value store sounds simple enough: it’s a collection of keys (that you look up with) and each key has an associated value (which is the data you want). For MemcacheDB, that’s exactly what you get: keys (strings) and values (strings/binary blobs that you interpret at your whim).
Cassandra adds another layer of indirection: each “value” can itself be a dictionary of key-value pairs. So, the “value” associated with the key “ggbaker” might be {"fname":"Greg", "mi":"G", "lname":"Baker"}. Each of those sub-key-values is called a “column”. So, the record “ggbaker” has a column with name “fname” and value “Greg” (with a timestamp). Each record can have whatever set of columns is appropriate.
- Document stores
- (e.g. CouchDB, MongoDB) The idea here is that each “row” of your data is basically a collection of key-value pairs. For example, one record might be {"fname":"Greg", "mi":"G", "lname":"Baker"}. Some other records might be missing the middle initial, or have a phone number added: there is no fixed schema, just rows storing properties. I choose to think of this as a “collection of JSON objects that you can query” (but of course the internal data format is probably not JSON). Mongo has a useful SQL to Mongo chart that summarizes things nicely.
- Tabular
- (e.g. BigTable, HBase) The big difference here seems to be that the tabular databases use a fixed schema. So, I have to declare ahead of time that I will have a “people” table and entries in there can have columns “fname”, “lname”, and “mi”. Not every column has to be filled for each row, but there’s a fixed set.
There are typically many of these “tables”, each with their own schema.
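To make the first two models concrete, here is a rough sketch in plain Python, using ordinary dictionaries and lists rather than any real database client. The names `kv_store`, `column_store`, `put_column`, `documents`, and `find` are made up for illustration, and the records other than “ggbaker” are invented sample data. It only mimics the shape of the data as described above, not how any of these systems actually store or query it.

```python
import time

# Key-value store (MemcacheDB-style): string keys map to opaque values.
kv_store = {}
kv_store["ggbaker"] = b"Greg G Baker"  # the store does not interpret this blob

# Column-style record (as in the Cassandra description above): each key maps
# to a dictionary of columns; each column holds a (value, timestamp) pair.
column_store = {}

def put_column(store, row_key, column, value, timestamp=None):
    """Set one column on a record, creating the record if it is missing."""
    if timestamp is None:
        timestamp = time.time()
    store.setdefault(row_key, {})[column] = (value, timestamp)

put_column(column_store, "ggbaker", "fname", "Greg")
put_column(column_store, "ggbaker", "mi", "G")
put_column(column_store, "ggbaker", "lname", "Baker")

# Document store: a collection of schema-free records that can be queried.
documents = [
    {"fname": "Greg", "mi": "G", "lname": "Baker"},
    {"fname": "Pat", "lname": "Example"},                       # no middle initial
    {"fname": "Sam", "lname": "Sample", "phone": "555-0100"},   # extra field
]

def find(docs, **criteria):
    """Return the documents whose fields equal every given criterion."""
    return [d for d in docs
            if all(k in d and d[k] == v for k, v in criteria.items())]

bakers = find(documents, lname="Baker")
```

A call like `find(documents, lname="Baker")` plays the role of a simple `SELECT ... WHERE lname = 'Baker'`; documents that lack the field just don't match.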
Summary: There’s a lot of similarity here. Things aren’t as different as I thought. In fact, the big common thread is certainly less-structured data (compared to the relational style of foreign keys and rigid data definition). Of course, I haven’t gotten into how you can actually query this data, but that’s a whole other thing.
Let’s see if I can summarize this (with Haskell-ish type notation, since that’s fresh in my head).
data Key, Data = String
memcacheDB :: Map Key Data

data CassandraRecord = Map Key (Data, Timestamp)
cassandraDB :: Map Key CassandraRecord

data JSON = Map Key (String | Number | … | JSON)
mongoDB, couchDB :: [JSON]

data Schema = [Key]
data BigTable = (Schema, [Map Key Data]) -- where only keys from Schema are allowed in the map
bigTableDB :: Map Key BigTable -- key here is table name
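The one constraint the type notation can’t express is the comment on BigTable: only keys from the schema are allowed in a row. Here is a minimal Python sketch of that rule, following my description of the tabular model above. The helpers `make_table` and `insert_row` and the variable `tabular_db` are hypothetical names, not any real BigTable or HBase API.

```python
def make_table(schema):
    """Create an empty table with a fixed set of allowed column names."""
    return {"schema": frozenset(schema), "rows": []}

def insert_row(table, row):
    """Append a row; columns may be omitted, but unknown columns are rejected."""
    unknown = set(row) - table["schema"]
    if unknown:
        raise KeyError("columns not in schema: " + ", ".join(sorted(unknown)))
    table["rows"].append(dict(row))

# The database maps table names to tables, each with its own schema.
tabular_db = {"people": make_table(["fname", "mi", "lname"])}

# "mi" is left unfilled, which the fixed schema still permits.
insert_row(tabular_db["people"], {"fname": "Greg", "lname": "Baker"})
```

Trying to insert a row with a “phone” column into this “people” table raises a `KeyError`, which is exactly the difference from the document stores, where the extra field would simply be stored.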
The documentation for these projects is generally somewhere between poor and non-existent: there are a lot of claims of speed and efficiency and how they are totally faster than MySQL. What’s in short supply are examples/descriptions of how to actually get things done. (For example, somewhere in my searching, I saw the phrase “for examples of usage, see the unit tests.”)
That’s a good start. Hopefully I can get back to this and say something else useful on the topic.