{"id":1047,"date":"2010-06-24T12:48:44","date_gmt":"2010-06-24T19:48:44","guid":{"rendered":"http:\/\/gregbaker.ca\/blog\/?p=1047"},"modified":"2010-06-24T13:16:02","modified_gmt":"2010-06-24T20:16:02","slug":"everything-i-know-about-databases-is-wrong-also-right","status":"publish","type":"post","link":"http:\/\/gregbaker.ca\/blog\/2010\/06\/24\/everything-i-know-about-databases-is-wrong-also-right\/","title":{"rendered":"Everything I know about databases is wrong.  Also, right."},"content":{"rendered":"<p>I have been teaching <a href=\"http:\/\/www.cs.sfu.ca\/CC\/470\/ggbaker\/\">CMPT 470<\/a> for six years now, with my 13th offering going on right now.  Anybody doing that is going to pick up a thing or two about web systems.<\/p>\n<p>I was there for the rise of the MVC frameworks and greeted them with open arms.  I watched Web 2.0 proclaim &#8220;screw it, everything is JavaScript now&#8221; and listed with suspicion, but interest.  I am currently watching HTML5\/CSS3 develop with excitement but wondering why nobody is asking whether IE will support any of it before the sun burns out.<\/p>\n<p>There&#8217;s another thing on the horizon that is causing me great confusion: <a href=\"http:\/\/en.wikipedia.org\/wiki\/NoSQL\">NoSQL<\/a>.<\/p>\n<p>The NoSQL idea is basically that relational databases (MySQL, Oracle, MSSQL, etc.) are not the best solution to every problem, and that there is a lot more to the data-storage landscape.  I can get behind that.<\/p>\n<p>But then, the NoSQL aficionados keep talking.  &#8220;Relational databases are slow&#8221; they say.  &#8220;You should never JOIN.&#8221;  &#8220;Relational databases can&#8217;t scale.&#8221;  These things sound suspicious.  Relational databases have a long history of being very good at their job: these are big assertions that should be accompanied by equally-big evidence.<\/p>\n<p>So, I&#8217;m going to try to talk some of this through.  Let&#8217;s start with the non-relational database types.  (I&#8217;ll stick to the ones getting a lot of NoSQL-related attention.)<\/p>\n<dl>\n<dt>Key-value stores<\/dt>\n<dd>(e.g. <a href=\"http:\/\/cassandra.apache.org\/\">Cassandra<\/a>, <a href=\"http:\/\/memcachedb.org\/\">Memcachedb<\/a>) A key-value store sounds simple enough: it&#8217;s a collection of keys (that you lookup with) and each key has an associated value (which is the data you want).  For Memcachedb, that&#8217;s exactly what you get: keys (strings) and values (strings\/binary blobs that you interpret to your whim).<\/p>\n<p>Cassandra add another layer of indirection: each &#8220;value&#8221; can itself be a dictionary of key-value pairs.  So, the &#8220;value&#8221; associated with the key &#8220;ggbaker&#8221; might be <code>{\"fname\":\"Greg\", \"mi\":\"G\", \"lname\":\"Baker\"}<\/code>.  Each of those sub-key-values is called a &#8220;column&#8221;.  So, the record &#8220;ggbaker&#8221; has a column with name &#8220;fname&#8221; and value &#8220;Greg&#8221; (with a timestamp).  Each record can have whatever set of columns are appropriate.<\/dd>\n<dt>Document stores<\/dt>\n<dd>(e.g. <a href=\"http:\/\/couchdb.apache.org\/\">CouchDB<\/a>, <a href=\"http:\/\/www.mongodb.org\/\">MongoDB<\/a>) The idea here is that each &#8220;row&#8221; of your data is basically a collection of key-value pairs.  For example, one record might be <code>{\"fname\":\"Greg\", \"mi\":\"G\", \"lname\":\"Baker\"}<\/code>.  Some other records might be missing the middle initial, or have a phone number added: there is no fixed schema, just rows storing properties.  I choose to think of this as a &#8220;collection of JSON objects that you can query&#8221; (but of course the internal data format is probably not JSON).<\/p>\n<p>Mongo has a useful <a href=\"http:\/\/www.mongodb.org\/display\/DOCS\/SQL+to+Mongo+Mapping+Chart\">SQL to Mongo<\/a> chart that summarizes things nicely.<\/dd>\n<dt>Tabular<\/dt>\n<dd>(e.g. <a href=\"http:\/\/labs.google.com\/papers\/bigtable.html\">BigTable<\/a>, <a href=\"http:\/\/hbase.apache.org\/\">Hbase<\/a>)  The big difference here seems to be that the tabular databases use a <em>fixed schema<\/em>.  So, I have to declare ahead of time that I will have a &#8220;people&#8221; table and entries in there can have columns &#8220;fname&#8221;, &#8220;lname&#8221;, and &#8220;mi&#8221;.  Not every column has to be filled for each row, but there&#8217;s a fixed set.<\/p>\n<p>There are typically many of these &#8220;tables&#8221;, each with their own schema.  <\/dd>\n<\/dl>\n<p>Summary:  There&#8217;s a lot of similarity here.  Things aren&#8217;t as different as I thought.  In fact, the big common thread is certainly less-structured data (compared to the relational style of foreign keys and rigid data definition).  Of course, I haven&#8217;t gotten into how you can actually <em>query<\/em> this data, but that&#8217;s a whole other thing.<\/p>\n<p>Let&#8217;s see if I can summarize this (with Haskell-ish type notation, since that&#8217;s fresh in my head).<\/p>\n<blockquote>\n<pre><code>data Key,Data = String\r\nmemcacheDB :: Map Key Data\r\ndata CassandraRecord = Map Key (Data, Timestamp)\r\ncassandraDB :: Map Key CassandraRecord<\/code>\r\n\r\ndata JSON = Map Key (String | Number | &hellip; | JSON)\r\nmongoDB,couchDB :: [JSON]\r\n\r\ndata Schema = [Key]\r\ndata BigTable = (Schema, [Map Key Data]) -- where only keys from Schema are allowed in the map\r\nbigTableDB :: Map Key BigTable -- key here is table name<\/pre>\n<\/blockquote>\n<p>The documentation for these projects is generally somewhere between poor and non-existent: there are a lot of claims of speed and efficiency and how they are <em>totally<\/em> faster than MySQL.  What&#8217;s in short supply are examples\/descriptions of how to actually get things done.  (For example, somewhere in my searching, I saw the phrase &#8220;for examples of usage, see the unit tests.&#8221;)<\/p>\n<p>That&#8217;s a good start. Hopefully I can get back to this and say something else useful on the topic.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I have been teaching CMPT 470 for six years now, with my 13th offering going on right now. Anybody doing that is going to pick up a thing or two about web systems. I was there for the rise of the MVC frameworks and greeted them with open arms. I watched Web 2.0 proclaim &#8220;screw [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[11,10],"tags":[],"class_list":["post-1047","post","type-post","status-publish","format-standard","hentry","category-teaching","category-tech"],"_links":{"self":[{"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/posts\/1047","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/comments?post=1047"}],"version-history":[{"count":27,"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/posts\/1047\/revisions"}],"predecessor-version":[{"id":1074,"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/posts\/1047\/revisions\/1074"}],"wp:attachment":[{"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/media?parent=1047"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/categories?post=1047"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/tags?post=1047"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}