Everything I know about databases is wrong. Also, right.

June 24th, 2010, 12:48 pm PDT by Greg

I have been teaching CMPT 470 for six years now, with my 13th offering going on right now. Anybody doing that is going to pick up a thing or two about web systems.

I was there for the rise of the MVC frameworks and greeted them with open arms. I watched Web 2.0 proclaim “screw it, everything is JavaScript now” and listened with suspicion, but interest. I am currently watching HTML5/CSS3 develop with excitement but wondering why nobody is asking whether IE will support any of it before the sun burns out.

There’s another thing on the horizon that is causing me great confusion: NoSQL.

The NoSQL idea is basically that relational databases (MySQL, Oracle, MSSQL, etc.) are not the best solution to every problem, and that there is a lot more to the data-storage landscape. I can get behind that.

But then, the NoSQL aficionados keep talking. “Relational databases are slow,” they say. “You should never JOIN.” “Relational databases can’t scale.” These things sound suspicious. Relational databases have a long history of being very good at their job: these are big assertions that should be accompanied by equally big evidence.

So, I’m going to try to talk some of this through. Let’s start with the non-relational database types. (I’ll stick to the ones getting a lot of NoSQL-related attention.)

Key-value stores
(e.g. Cassandra, Memcachedb) A key-value store sounds simple enough: it’s a collection of keys (that you look up with) and each key has an associated value (which is the data you want). For Memcachedb, that’s exactly what you get: keys (strings) and values (strings/binary blobs that you interpret at your whim).
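In Python, talking to a Memcachedb server looks roughly like this (a minimal sketch, assuming the python-memcache client and a local server on the standard memcached port):

import memcache   # the python-memcache client; Memcachedb speaks the memcached protocol

mc = memcache.Client(['127.0.0.1:11211'])   # assumes a local server on the usual port

mc.set('ggbaker', '{"fname": "Greg", "mi": "G", "lname": "Baker"}')   # the value is an opaque string
record = mc.get('ggbaker')   # it comes back exactly as stored; interpreting it is up to you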

Cassandra adds another layer of indirection: each “value” can itself be a dictionary of key-value pairs. So, the “value” associated with the key “ggbaker” might be {"fname":"Greg", "mi":"G", "lname":"Baker"}. Each of those sub-key-values is called a “column”. So, the record “ggbaker” has a column with name “fname” and value “Greg” (with a timestamp). Each record can have whatever set of columns is appropriate.

Document stores
(e.g. CouchDB, MongoDB) The idea here is that each “row” of your data is basically a collection of key-value pairs. For example, one record might be {"fname":"Greg", "mi":"G", "lname":"Baker"}. Some other records might be missing the middle initial, or have a phone number added: there is no fixed schema, just rows storing properties. I choose to think of this as a “collection of JSON objects that you can query” (but of course the internal data format is probably not JSON).

Mongo has a useful SQL to Mongo chart that summarizes things nicely.
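For example, the basic operations with pymongo look something like this (a rough sketch: the database and collection names are made up, and I’m assuming a MongoDB server running locally):

from pymongo import MongoClient

people = MongoClient()['test']['people']   # a collection of schema-less documents

people.insert_one({'fname': 'Greg', 'mi': 'G', 'lname': 'Baker'})
people.insert_one({'fname': 'Jane', 'lname': 'Doe', 'phone': '555-1234'})   # no 'mi': no fixed schema

for person in people.find({'lname': 'Baker'}):   # query by property, much like a WHERE clause
    print(person)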

Tabular
(e.g. BigTable, HBase) The big difference here seems to be that the tabular databases use a fixed schema. So, I have to declare ahead of time that I will have a “people” table and entries in there can have columns “fname”, “lname”, and “mi”. Not every column has to be filled for each row, but there’s a fixed set.

There are typically many of these “tables”, each with its own schema.
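The flavour of it, in a purely illustrative Python toy (not any real product’s API):

class Table:
    """A toy "tabular" store: a fixed, declared schema, but rows may leave columns empty."""
    def __init__(self, schema):
        self.schema = set(schema)   # the fixed set of allowed column names
        self.rows = []

    def insert(self, row):
        unknown = set(row) - self.schema
        if unknown:
            raise ValueError('columns not in the schema: %s' % ', '.join(unknown))
        self.rows.append(row)   # any subset of the schema is fine

people = Table(['fname', 'lname', 'mi'])
people.insert({'fname': 'Greg', 'lname': 'Baker'})      # fine: 'mi' just isn't filled
people.insert({'fname': 'Greg', 'phone': '555-1234'})   # raises: 'phone' was never declared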

Summary: There’s a lot of similarity here. Things aren’t as different as I thought. In fact, the big common thread is certainly less-structured data (compared to the relational style of foreign keys and rigid data definition). Of course, I haven’t gotten into how you can actually query this data, but that’s a whole other thing.

Let’s see if I can summarize this (with Haskell-ish type notation, since that’s fresh in my head).

type Key = String
type Data = String
type Timestamp = Integer   -- any reasonable timestamp type will do

memcacheDB :: Map Key Data

type CassandraRecord = Map Key (Data, Timestamp)
cassandraDB :: Map Key CassandraRecord

data JSON = JString String | JNumber Double | JObject (Map Key JSON)   -- | …
mongoDB, couchDB :: [JSON]

type Schema = [Key]
type BigTable = (Schema, [Map Key Data])   -- only keys from the Schema are allowed in each map
bigTableDB :: Map Key BigTable             -- the key here is the table name

The documentation for these projects is generally somewhere between poor and non-existent: there are a lot of claims of speed and efficiency and how they are totally faster than MySQL. What’s in short supply are examples/descriptions of how to actually get things done. (For example, somewhere in my searching, I saw the phrase “for examples of usage, see the unit tests.”)

That’s a good start. Hopefully I can get back to this and say something else useful on the topic.

Computer Woes

June 10th, 2010, 10:32 am PDT by Greg

My computer at home has been locking up occasionally for the last few weeks. This has been happening since my upgrade to Ubuntu 10.04/Lucid, but I suspect this is a coincidence. (1) The lockups are hard: even the SysRq magic doesn’t do anything, so I deduce that the problem is in the kernel or below. (2) I haven’t seen any reports of the new Linux kernels being flaky. (3) I tried an upgrade from the i386 to the amd64 (32-bit to 64-bit) system, which I had been meaning to do anyway: no change, even with a significantly different kernel.

Thus, I am of the opinion that I have a hardware problem.

As a computer scientist, I don’t enjoy hardware problems, so I’m thinking about buying my way out of them. (Also, my current system is mostly 3 years old, so it’s not a crazy time to upgrade.) My current thinking:

For about $700, that would leave me with the same case and power supply (an Antec Sonata II, 450W), my video card (nVidia 7600GT, but I don’t game so who cares), my Hauppauge PCI TV tuner, and my recently-upgraded hard disks.

So, the questions for the crowd: Does my “it’s hardware” assessment sound right? Is it likely that the processor/mobo/RAM swap will fix my problems? Any other suggestions for hardware purchases?

The History of HTML

March 19th, 2010, 8:46 am PDT by Greg

After a simple query from a colleague about the differences between HTML versions, I wrote this. I thought I might as well post it. Everything was from-memory, so there may be some minor errors.

HTML 1 never existed (it was the informal “standard” that the first documentation implied).

HTML 2 was a really minimal initial description of the language. The language was simple because the initial goals were simple. The browser makers made many de facto extensions to this by implementing random stuff.

HTML 3 was an abandoned attempt to standardize everything and the kitchen sink. HTML 3.2 was a really ugly standard that was basically “here’s what browsers accept today.”

Which brings us to modern history…

HTML 4 was an attempt to clean up the language: get rid of the visual stuff and make HTML a semantic markup language again. It included the transitional version (with most of the old ugly stuff) and strict version (as things should be).

HTML 4.01 was a minor change: fixes for errors and typos that had been missed.

XHTML 1.0 is HTML 4.01 but with XML syntax: closing empty tags with the slash, everything lowercase, attribute values quoted, etc.

XHTML 1.1 contains some minor changes, but was abandoned in a practical sense because nobody saw any point to the change. XHTML 2.0 was another very ambitious change (non-backwards compatible changes to the language) that was abandoned.

HTML 5 is in the process of being standardized now. If you ask me, there are two camps driving it. One thinks “the web is about more than just simple web pages now: applications and interactivity rule the day” and the other thinks “closing our tags is too hard; I don’t understand what a doctype is: make it easier. Dur.”

As a result, there are some things I like and some things I don’t. It is showing signs of something that will actually be completed and used (unlike HTML 3 and XHTML 2).

Most people don’t know that the HTML 5 standard includes an XHTML version as well. It will be perfectly legal to write HTML 5 with the XML syntax and call it “XHTML 5”.

Addendum: The moral of the story is that I have no intention of teaching HTML 5 anywhere until the standards process is done. For 165 I also need real browser support: no JS/DOM hack to get IE to work, and some defaults in the system stylesheet to let the thing display reasonably without any CSS applied. Even then I will probably teach XHTML 5 because I think it promotes the right habits.

My Minimal Setup

January 6th, 2010, 9:37 pm PST by Greg

I just got a new netbook: an Asus Eee 1005HA.

As my old tablet got slowly older, I realized that I don’t really have heavy laptop demands: most of my use is a text editor and “hey look at this web page” in lectures. Even when away from the lecture hall, I tend to work primarily in a text editor (for LaTeX, HTML, Python, etc.), Thunderbird, and Firefox. I’m not exactly putting a big strain on the system, and can trade off power for small and light.

As always, there’s a big difference between the average stock setup and what I need to get some work done. Bridging this gap is a hassle, so I’m going to finally record what I need so I can look it up next time.

The new Eee is dual-booting Windows 7 and Ubuntu (Karmic netbook remix). Yay to Asus for shipping with a second “data” partition on the drive that was dead-easy to put Ubuntu on.

I’m open to must-have software suggestions that I missed. I’ll probably add more below as I find stuff I missed.

In Windows

In Ubuntu

  • rsync (As far as I’m concerned it’s negligent to have an operating system install without rsync.)
  • subversion
  • sshfs
  • ntp
  • thunderbird (and thunderbird-gnome-support)
  • If I’m going to be downloading pictures from a camera: mmv, jhead, exif, gphoto2, python-pyexiv2, gpsbabel
  • ddclient (with a config file like this)

In Firefox

Mythical Programming Beasts

November 7th, 2009, 1:55 am PST by Greg

In the time I have been programming, and mostly doing web programming recently, I have learned a few things. Notably, I have learned that there are a few things people assume are simple to deal with, but aren’t: these “simple” things that people think they’re working with don’t really exist. Here are three examples:

“Text”

I don’t always agree with everything Joel Spolsky says, but he’s right in his rant about Unicode:

There Ain’t No Such Thing As Plain Text.

When dealing with input and output, you never have the luxury of just having “text”. What you really have is a byte stream using a specific character encoding. If you don’t know what encoding you’re dealing with, you’ve got nothing. Every input stream has to be decoded; every output stream has to be encoded.
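In Python terms, that means something like this (UTF-8 is only an example, and the file names are made up; the point is that some encoding is always chosen, explicitly or not):

# input: what arrives is bytes, and must be decoded with a known encoding
with open('input.txt', 'rb') as f:
    raw = f.read()               # bytes, not "text"
text = raw.decode('utf-8')       # only now do we have a string

# output: strings leave as bytes, and must be encoded
with open('output.txt', 'wb') as f:
    f.write(text.encode('utf-8'))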

Even once you have encodings sorted out, there are still a lot of questions about what a “string” is in your program. Consider the distinction Django makes between strings and safestrings that allows the auto-escaping to work: some strings contain HTML code, and some contain text that the user should see as-is. You can’t “output a string” without knowing how (or if) it has to be processed/escaped/cleaned first.
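Django’s own utilities make the distinction concrete; a small sketch using django.utils.html.escape and django.utils.safestring.mark_safe:

from django.utils.html import escape
from django.utils.safestring import mark_safe

user_text = '<script>alert("hi")</script>'   # text the user should see as-is
escape(user_text)    # '&lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;' -- safe to put in a page

trusted = mark_safe('<em>already HTML</em>')   # a string we promise is already proper HTML
# template auto-escaping leaves a SafeString alone, but escapes ordinary strings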

It’s never just “plain text”.

“Time”

It’s very easy in most languages to store date and time values. Unfortunately, there’s not really any such thing as a “time” either.

As I sit here, it is about midnight (0:00) PST. It’s 8:00 in London and 16:00 in Beijing. A time is no good to anybody without a time zone to tell you how it fits into the world. This comes into much sharper focus with web applications where users are probably going to be in different time zones.

But it’s not even as easy as storing a time + timezone: one week (7 days × 24 hours/day) ago, it was 1:00 PDT, not 0:00. You can’t just add n days to a time and get the same time n days later. Time zones can change, even for a particular user, even if they don’t change their location. (And if not for knowing the time zone, I would have absolutely no way to notice these gotchas.)

Suppose I was using a calendaring application and I enter a meeting at “13:00” on a particular date.

How does the program represent that? The first instinct would probably be to store “<date> 13:00 PST” (using the entered date/time and my current time zone) but that’s not right if there’s a time change before that date. I have seen calendar error announcements like “all meetings after the time change will be off by an hour” because of this mistake. Should it really be stored as “<date> 13:00 PDT”, depending on the date? What if the North American daylight savings rules change again before this meeting?

I don’t even want to think about two users in different time zones trying to schedule a meeting, but it should definitely be possible.

The only real thing to do is store “<date> 13:00 America/Vancouver” and hope some timezone library is smart enough to save us later. That means we need a date library with a lot of smarts, like pytz for Python.
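With pytz, the “one week ago” gotcha from above looks something like this (the zone and the dates are just for illustration):

from datetime import datetime, timedelta
import pytz

vancouver = pytz.timezone('America/Vancouver')
midnight = vancouver.localize(datetime(2009, 11, 7, 0, 0))    # 0:00 PST

# subtract exactly 7 * 24 hours and let pytz sort out the offset
week_ago = vancouver.normalize(midnight - timedelta(days=7))
print(week_ago)   # 2009-10-31 01:00:00-07:00 -- 1:00 PDT, not 0:00, because of the DST change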

It also means that you have to at least be very careful with any built-in date/time library (and possibly data type) your language comes with. It might mean you have to bypass them entirely.

“Appearance of a web page”

[I know it’s not really “programming”, but just move on, okay?]

This one shouldn’t be a surprise to anybody who knows anything about the web, but web pages simply don’t have a single unique appearance. The way a page looks depends on the browser, window size, available fonts, font size settings, and who knows how many other factors.

If you’re making web pages, you simply have to understand and live with this limitation. As I have said many times in lectures: if you don’t like it, don’t make web pages.

Also, what the page looks like to you has relatively little relation to the way Google or other bots “see” it, but that’s another rant.

My latest project: web lint

October 15th, 2009, 11:30 pm PDT by Greg

I have alluded to this in a status update, but I think it’s time to look more widely for feedback…

A while ago, I started thinking about all of the annoying things my CMPT 165 students do in their HTML, and then started thinking about ways to get them to stop. I started working on an automated checker to give them as much personalized feedback as possible without me actually having to talk to them.

They already use an HTML validator which checks documents against the HTML/XHTML syntax, but it’s amazing what kind of things actually pass the validator. In the list: resizing images with width/height on <img />; saving their source as UTF-16 (no idea how they do it); putting spaces in their URLs; using class names like “red” instead of “important”; not specifying the natural language/character encoding of the document; etc.
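None of these checks needs to be clever. A hypothetical sketch of the presentational-class-name check (not the tool’s actual code; BeautifulSoup and the list of “bad” names are stand-ins):

from bs4 import BeautifulSoup   # any HTML parser would do

PRESENTATIONAL = {'red', 'blue', 'bold', 'big', 'small', 'left', 'right'}   # a made-up list

def check_class_names(html):
    """Warn about class names that describe appearance rather than meaning."""
    warnings = []
    for tag in BeautifulSoup(html, 'html.parser').find_all(class_=True):
        for cls in tag.get('class', []):
            if cls.lower() in PRESENTATIONAL:
                warnings.append('class "%s" describes presentation, not meaning' % cls)
    return warnings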

As the list became longer, the project turned into sort of a general HTML lint: the thing you go to after your code is valid to check for other common problems, annoyances, and omissions. The more I look at it, the more I think it’s a useful tool for CMPT 165 students as well as a good way to make others think a little more about the code they are producing.

I’m now at the point of wanting some feedback. There are still some missing strings and help text, but hopefully you get the idea. I don’t want to guarantee that this link will exist forever, but have a look at my web lint.

As with any “lint”, the goal here probably isn’t for authors to get zero warnings, but just to think about why they are ignoring the warnings that remain. (No, I don’t need you to tell me that some of my pages produce some warnings.)

At this point, I’m most interested in:

  • Links to input that causes an exception (500 Internal Server Error) or other truly broken behaviour.
  • Feedback on the warnings presented and their “level”. I have deliberately hidden levels 4 and 5 in the default display: I’m aware that the tool is pretty anal-retentive.
  • Are there things you can think of (that could be automatically checkable) that should get a warning but don’t? I have a few more on my list, but the core is in there.
  • I don’t think the URL validation (for <a>, <link>, <img>) is perfect: I still need to go back to the RFC and check the details. Any cases you notice that don’t pass but should would be appreciated.
  • Any spelling/grammar errors?
  • I’m trying not to duplicate functionality of the HTML validators: they already do their job well. But, notice the links to “other checkers” on the right. Didn’t know about all of them, did you? Any others I should include?

My intention is to GPL the code and CC license the text, but let’s take one step at a time.

Wikipedia Anti-Hate

August 26th, 2009, 10:31 pm PDT by Greg

There has been a bunch of bad noise about Wikipedia on the tubes recently and it’s annoying me.

First, there was a study about Wikipedia growth slowing. Basically, the rate of new article creation has slowed and one-off editors are more likely to have their edits reverted.

Secondly, Wikipedia is adding a new level of editorial oversight for biographies of living people. This amounts to turning on flagged revisions for those articles: basically, non-logged-in users only see “flagged” edits that have been approved by a “trusted editor” (i.e. judged not worth reverting).

Both of these caused a lot of consternation: Wikipedia is over the hill, Wikipedia is becoming elitist, etc. I made the mistake of reading Slashdot comments on the second issue and regretted it.

Seriously? Can you look closely at the English Wikipedia and come to the conclusion that it’s dying?

Try clicking “random article” in Wikipedia a few times. Can you really say that the number of new articles shouldn’t be slowing down? Many of the articles are pretty dicey on the notability criteria. There is simply a finite number of “notable” topics that need to be written about: I’d say that English Wikipedia is closing in on that number. There will always be gaps, but they’re getting hard to find.

I have done a moderate amount of Wikipedia editing: about 200 edits across Wikimedia sites. In looking at the history of pages, I’ve never seen an edit that has been unjustly reverted. (Although I do tend to stay away from controversial pages.) Most of the reversions I have seen are of the quality “my high school principal is teh gay”. Again, I’m sure there are problems and edit wars, but they are definitely not the majority.

As for “flagged revisions”, I think it’s a great solution to the vandalism problem. Logged-in users and editors will always see the most recent revisions; only anonymous viewers will see the “flagged” versions. The criterion for flagging seems to be “not worth reverting”, so that’s pretty minimal. I’d feel better if there were a better definition of the “trusted editors” who can flag a revision, but assuming there is a sufficient set of people doing the flagging, it should work well.

So why the hate? My theory is that all of these people have written long articles about their totally awesome band, but had the page deleted for not being notable. Or maybe their high school principal really is teh gay, and they feel they are being censored.

Custom classes in Docbook to HTML conversion

July 16th, 2009, 8:55 am PDT by Greg

Maybe I should have a tag for “boring technical notes that I’m writing so others can Google them later”.

Anyway… if you’re converting a Docbook document to HTML, and want customized classes on elements (so you can hit them with CSS), first create a custom XSL style for the document (and use with xmlto -m).

Then suppose you have <code language="html"> in the Docbook and want the corresponding element in the resulting HTML to have the classes html and xml (so your CSS has something to hold on to). Add this:

<xsl:template match="code[@language = 'html']" mode="class.value">
html xml
</xsl:template>

The match can be any XSL matching pattern. The contents can also be a <xsl:value-of> if you want to do something more advanced.

Maybe it’s because I’m an XSL newb, but I haven’t seen this explained nicely anywhere else.

CMPT 383, or “Why I Hate Ted”

July 7th, 2009, 1:56 pm PDT by Greg

As many of you know, one of the goals for my study leave has been to prepare to teach CMPT 383, Comparative Programming Languages. The calendar says this course is:

Various concepts and principles underlying the design and use of modern programming languages are considered in the context of procedural, object-oriented, functional and logic programming languages. Topics include data and control structuring constructs, facilities for modularity and data abstraction, polymorphism, syntax, and formal semantics.

I took a similar course in my undergrad, and I think it was really useful in helping me see the broader picture of what programming is.

I have been thinking about the course off-and-on for more than a year. I had been forming a pretty solid picture of what the course I teach would look like and things were going well, despite never having devoted any specific time to it or really writing anything down.

Then I talked to Ted. Ted has taught the course before, and has thought a lot about it. His thoughts on the course differed from mine. In particular, he opined that “logic programming is dead, so why teach it?” (Okay, maybe that’s not a direct quote, but that’s what I heard.) So that leaves functional programming as the only new paradigm worth talking about.

He also convinced me that covering too many languages in the single course puts students into a situation of too many trees, not enough forest. (That is, they get lost in syntax and don’t appreciate the core differences between languages.)

Basically Ted did the most annoying thing in the world: he disagreed with me and he was right.

There is, though, a lot of stuff I hadn’t considered before that might be worth talking about:

  • Type systems: static/dynamic, strong/weak, built-in data types, OO (or not), type inference, etc.
  • Execution/compilation environment: native code generation, JIT compilers, virtual machines, language translation (e.g. Alice → Java → execution), etc.

So, what the hell do I do with all of that? Any ideas how to put all of that together into a coherent course that students can actually enjoy?

Wikipedia-based Machine Translation

July 1st, 2009, 12:07 am PDT by Greg

I have been pondering this for a while and thought I might as well throw it in a blog entry…

Wikipedia is, of course, a massive collection of mostly-correct information. The information there isn’t fundamentally designed to be machine readable (unlike the semantic web stuff), but there are some structures that allow data to be extracted automatically. My favourite example is the {{coord}} template allowing the Wikipedia layer in Google Earth.

The part of Wikipedia pages that recently caught my eye is the “other languages” section on the left of every page. I’d be willing to bet that these interwiki links form the largest translation database that exists anywhere.

Take the “Lithography” entry as a moderately obscure example. On the left of that page, we can read off that the German word for lithography is “Lithografie”, the Urdu word is “سنگی طباعت”, and 34 others. Sure, some of the words might literally be “lithograph” or “photolithography”, but that’s not the worst thing ever. All of this can be mechanically discovered by parsing the wikitext.
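The interwiki links sit right in the wikitext as things like [[de:Lithografie]], so extracting them takes about one regular expression (a rough sketch; the real link syntax has more corner cases than this):

import re

# interwiki language links look like [[de:Lithografie]] or [[fr:Lithographie]]
INTERWIKI = re.compile(r'\[\[([a-z][a-z-]+):([^\]|]+)\]\]')

def other_language_titles(wikitext):
    """Map language code -> article title for a page's interwiki links."""
    return dict(INTERWIKI.findall(wikitext))

other_language_titles('... [[de:Lithografie]] [[fr:Lithographie]] ...')
# {'de': 'Lithografie', 'fr': 'Lithographie'}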

Should it not be possible to do some good machine translation starting with that huge dictionary? Admittedly, I know approximately nothing about machine translation. I know there are still gobs of problems when it comes to grammar and ambiguity, but a good dictionary of word and phrase translations has to count for something. The “disambiguation” pages could probably help with the ambiguity problem too.

I’d guess that even this would produce a readable translation: (1) chunk source text into the longest possible page titles (e.g. look at “spontaneous combustion”, not “spontaneous” and “combustion” separately), (2) apply whatever grammar-translation rules you have lying around, (3) literally translate each chunk with the Wikipedia “other language” article titles, and (4) if there’s no “other language” title, fall back to any other algorithm.
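Steps (1), (3), and (4) are the mechanical part; a greedy sketch of them might look like this (the translations dictionary would be built from the interwiki titles, and everything here is illustrative):

# translations: lowercased English page title -> title of the "other language" article
translations = {'spontaneous combustion': 'Selbstentzündung', 'combustion': 'Verbrennung'}
MAX_CHUNK = 5   # longest page title we bother trying, in words

def translate_chunks(text, fallback=lambda word: word):
    words = text.lower().split()
    out, i = [], 0
    while i < len(words):
        # (1) greedily take the longest chunk that is a known page title...
        for size in range(min(MAX_CHUNK, len(words) - i), 0, -1):
            chunk = ' '.join(words[i:i + size])
            if chunk in translations:
                out.append(translations[chunk])   # (3) ...and translate it whole
                i += size
                break
        else:
            out.append(fallback(words[i]))        # (4) fall back to any other algorithm
            i += 1
    return ' '.join(out)

translate_chunks('spontaneous combustion')   # 'Selbstentzündung', not a word-by-word translation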

I can’t believe this is a new idea, but a half-hearted search in Google and CiteSeer turned up nothing. Now it’s off my chest. Anybody who knows anything about machine translation should feel free to tell me why I’m wrong.
