My Minimal Setup

January 6th, 2010, 9:37 pm UTC by Greg

I just got a new netbook: an Asus Eee 1005HA.

As my old tablet got slowly older, I realized that I don’t really have heavy laptop demands: most of my use is a text editor and “hey look at this web page” in lectures. Even when away from the lecture hall, I tend to work primarily in a text editor (for LaTeX, HTML, Python, etc.), Thunderbird, and Firefox. I’m not exactly putting a big strain on the system, and can trade off power for small and light.

As always, there’s a big difference between the average stock setup and what I need to get some work done. Bridging this gap is a hassle, so I’m going to finally record what I need so I can look it up next time.

The new Eee is dual-booting Windows 7 and Ubuntu (Karmic netbook remix). Yay to Asus for shipping with a second “data” partition on the drive that was dead-easy to put Ubuntu on.

I’m open to suggestions for must-have software that I missed. I’ll probably add more below as I find other gaps.

In Windows

In Ubuntu

  • rsync (As far as I’m concerned it’s negligent to have an operating system install without rsync.)
  • subversion
  • sshfs
  • ntp
  • thunderbird (and thunderbird-gnome-support)
  • If I’m going to be downloading pictures from a camera: mmv, jhead, exif, gphoto2, python-pyexiv2, gpsbabel
  • ddclient (with a config file like this)

In Firefox

Mythical Programming Beasts

November 7th, 2009, 1:55 am UTC by Greg

In the time I have been programming, and mostly doing web programming recently, I have learned a few things. Notably, I have learned that there are a few things that people think are simple to deal with, but aren’t. These “simple” things that people think they’re doing when programming don’t really exist. Here are three examples:

“Text”

I don’t always agree with everything Joel Spolsky says, but he’s right in his rant about Unicode:

There Ain’t No Such Thing As Plain Text.

When dealing with input and output, you never have the luxury of just having “text”. What you really have is a byte stream using a specific character encoding. If you don’t know what encoding you’re dealing with, you’ve got nothing. Every input stream has to be decoded; every output stream has to be encoded.
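To make that concrete, here’s a minimal Python sketch (the sample word is just an illustration, not anything from a real system). The same five bytes turn into different “text” depending on which encoding you assume when decoding:

```python
# The same bytes, decoded under two different assumptions.
raw = "café".encode("utf-8")        # output: text -> bytes, encoding chosen explicitly

as_utf8 = raw.decode("utf-8")       # input: bytes -> text, correct encoding
as_latin1 = raw.decode("latin-1")   # input: bytes -> text, wrong guess (mojibake)

print(len(raw), as_utf8, as_latin1)
```

Neither decode call raises an error, which is exactly the problem: the wrong guess silently produces garbage.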

Even once you have encodings sorted out, there’s a lot of question about what a “string” is in your program. Consider the distinction Django makes between strings and safestrings that allows the auto escaping to work: some strings contain HTML code, and some contain text that the user should see as-is. You can’t “output a string” without knowing how (or if) it has to be processed/escaped/cleaned first.
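A rough sketch of the idea in plain Python follows. This is not Django’s implementation or API: `SafeHTML` and `render` are hypothetical names, and the only real library call is the standard library’s `html.escape`.

```python
import html


class SafeHTML(str):
    """Hypothetical marker type: this string is already HTML and must not be escaped."""


def render(value):
    """Return text ready to drop into an HTML page."""
    if isinstance(value, SafeHTML):
        return str(value)           # trusted markup: output as-is
    return html.escape(value)       # ordinary text: escape <, >, & and quotes


user_text = "<b>bold</b> & more"
print(render(user_text))            # escaped, shown to the reader literally
print(render(SafeHTML(user_text)))  # passed through as markup
```

The point is only that the type of the string carries the “has this been escaped yet?” information, so the output step doesn’t have to guess.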

It’s never just “plain text”.

“Time”

It’s very easy in most languages to store date and time values. Unfortunately, there’s not really any such thing as a “time” either.

As I sit here, it is about midnight (0:00) PST. It’s 8:00 in London and 16:00 in Beijing. A time is no good to anybody without a time zone to tell you how it fits into the world. This comes into much sharper focus with web applications where users are probably going to be in different time zones.

But it’s not even as easy as storing a time + timezone: one week (7 days × 24 hours/day) ago, it was 1:00 PDT, not 0:00 PST. You can’t just add n × 24 hours to a time and get the same time of day n days later. Time zones can change, even for a particular user, even if they don’t change their location. (And if not for knowing the time zone, I would have absolutely no way to notice these gotchas.)
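The arithmetic can be checked with nothing but fixed UTC offsets. In this sketch the date (2009-11-07) is my assumption for “as I sit here”; the only thing that matters is that the 2009 switch from PDT back to PST falls inside the week.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets only: no zone rules, so the switch has to be applied by hand.
PST = timezone(timedelta(hours=-8), "PST")
PDT = timezone(timedelta(hours=-7), "PDT")

now = datetime(2009, 11, 7, 0, 0, tzinfo=PST)   # assumed "midnight PST"
week_ago = now - timedelta(hours=7 * 24)        # exactly 168 hours earlier

# A week earlier the local clocks were still on PDT.
local_week_ago = week_ago.astimezone(PDT)
print(local_week_ago.strftime("%Y-%m-%d %H:%M %Z"))
```

The instant is the same either way; it’s the wall-clock label that moves by an hour.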

Suppose I was using a calendaring application and I enter a meeting at “13:00” on a particular date.

How does the program represent that? The first instinct would probably be to store “<date> 13:00 PST” (using the entered date/time and my current time zone), but that’s not right if there’s a time change before that date. I have seen calendar error announcements along the lines of “all meetings after the time change will be off by an hour” because of this mistake. Should it really be stored as “<date> 13:00 PDT” depending on the date? What if the North American daylight saving rules change again before this meeting?

I don’t even want to think about two users in different time zones trying to schedule a meeting, but it should definitely be possible.

The only real thing to do is store “<date> 13:00 America/Vancouver” and hope some timezone library is smart enough to save us later. That means we need a date library with a lot of smarts, like pytz for Python.

It also means that you have to at least be very careful with any built-in date/time library (and possibly data type) your language comes with. It might mean you have to bypass them entirely.
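Here’s a sketch of the “store the wall-clock time plus the zone name” approach. It uses `zoneinfo` from the Python standard library (3.9 and later) as a stand-in for pytz, and it assumes the system has a time zone database installed. The `resolve` helper and the two dates are made up for illustration.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # standard library since Python 3.9


def resolve(stored_local, zone_name):
    """Turn a stored wall-clock time like '2009-12-01 13:00' into an aware datetime.

    The UTC offset is looked up from the zone rules at the moment of use,
    not baked in when the meeting was entered.
    """
    naive = datetime.strptime(stored_local, "%Y-%m-%d %H:%M")
    return naive.replace(tzinfo=ZoneInfo(zone_name))


winter = resolve("2009-12-01 13:00", "America/Vancouver")
summer = resolve("2009-07-01 13:00", "America/Vancouver")

# Same wall-clock hour, different offsets from UTC.
print(winter.hour, winter.utcoffset() == timedelta(hours=-8))
print(summer.hour, summer.utcoffset() == timedelta(hours=-7))
```

If the rules change before the meeting, updating the time zone database is enough; the stored value doesn’t have to be touched.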

“Appearance of a web page”

[I know it's not really "programming", but just move on, okay?]

This one shouldn’t be a surprise to anybody who knows anything about the web, but web pages simply don’t have a single unique appearance. The way a page looks depends on the browser, window size, available fonts, font size settings, and who knows how many other factors.

If you’re making web pages, you simply have to understand and live with this limitation. As I have said many times in lectures: if you don’t like it, don’t make web pages.

Also, what the page looks like to you has relatively little relation to the way Google or other bots “see” it, but that’s another rant.

My latest project: web lint

October 15th, 2009, 11:30 pm UTC by Greg

I have alluded to this in a status update, but I think it’s time to look more widely for feedback…

A while ago, I started thinking about all of the annoying things my CMPT 165 students do in their HTML, and then started thinking about ways to get them to stop. I started working on an automated checker to give them as much personalized feedback as possible without me actually having to talk to them.

They already use an HTML validator which checks documents against the HTML/XHTML syntax, but it’s amazing what kind of things actually pass the validator. In the list: resizing images with width/height on <img />; saving their source as UTF-16 (no idea how they do it); putting spaces in their URLs; using class names like “red” instead of “important”; not specifying the natural language/character encoding of the document; etc.

As the list became longer, the thing became sort of a general HTML lint: the thing you go to after your code is valid to check for other common problems, annoyances, and omissions. The more I look at it, the more I think it’s a useful tool for CMPT 165 students as well as a good way to make others think a little more about the code they are producing.
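To give a flavour of what “checks beyond validity” means, here’s a small sketch using Python’s standard `html.parser`. It is not the code behind my web lint: the class name, the warning wording, and the choice of two checks (spaces in URLs, no language on `<html>`) are only illustrative.

```python
from html.parser import HTMLParser


class LintSketch(HTMLParser):
    """Collects warnings for markup that can pass a validator but is still a bad idea."""

    def __init__(self):
        super().__init__()
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # No natural language specified for the document.
        if tag == "html" and "lang" not in attrs:
            self.warnings.append("<html> has no lang attribute")
        # Spaces in URLs on links and images.
        for name in ("href", "src"):
            url = attrs.get(name)
            if url and " " in url:
                self.warnings.append(f"<{tag}> {name} contains a space: {url!r}")


def lint(html_text):
    """Run the sketch checks over a document and return the list of warnings."""
    checker = LintSketch()
    checker.feed(html_text)
    return checker.warnings


for warning in lint('<html><body><a href="my page.html">x</a></body></html>'):
    print(warning)
```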

I’m now at the point of wanting some feedback. There are still some missing strings and help text, but hopefully you get the idea. I don’t want to guarantee that this link will exist forever, but have a look at my web lint.

As with any “lint”, the goal here probably isn’t for authors to get zero warnings, but just to think about why they are ignoring the warnings that remain. (No, I don’t need you to tell me that some of my pages produce some warnings.)

At this point, I’m most interested in:

  • Links to input that causes an exception (500 Internal Server Error) or other truly broken behaviour.
  • Feedback on the warnings presented and their “level”. I have deliberately hidden levels 4 and 5 in the default display: I’m aware that the tool is pretty anal-retentive.
  • Are there things you can think of (that could be automatically-checkable) that should get a warning but don’t? I have a few more on my list, but the core is in there.
  • I don’t think the URL validation (for <a>, <link>, <img>) is perfect: I still need to go back to the RFC and check the details. Any cases you notice that don’t pass but should would be appreciated.
  • Any spelling/grammar errors?
  • I’m trying not to duplicate functionality of the HTML validators: they already do their job well. But, notice the links to “other checkers” on the right. Didn’t know about all of them, did you? Any others I should include?

My intention is to GPL the code and CC license the text, but let’s take one step at a time.

Wikipedia Anti-Hate

August 26th, 2009, 10:31 pm UTC by Greg

There has been a bunch of bad noise about Wikipedia on the tubes recently and it’s annoying me.

First, there was a study about Wikipedia growth slowing. Basically, the rate of new article creation has slowed and one-off editors are more likely to have their edits reverted.

Secondly, Wikipedia is adding a new level of editorial oversight for biographies of living people. This amounts to turning on flagged revisions for those articles: basically, non-logged-in users only see “flagged” edits that have been approved by a “trusted editor” (i.e. not worth reverting).

Both of these caused a lot of consternation: Wikipedia is over the hill, Wikipedia is becoming elitist, etc. I made the mistake of reading Slashdot comments on the second issue and regretted it.

Seriously? Can you look closely at the English Wikipedia and come to the conclusion that it’s dying?

Try clicking “random article” in Wikipedia a few times. Can you really say that the number of new articles shouldn’t be slowing down? Many of the articles are pretty dicey on the notability criteria. There is simply a finite number of “notable” topics that need to be written about: I’d say that English Wikipedia is closing in on that number. There will always be gaps, but they’re getting hard to find.

I have done a moderate amount of Wikipedia editing: about 200 edits across Wikimedia sites. In looking at the history of pages, I’ve never seen an edit that has been unjustly reverted. (Although I do tend to stay away from controversial pages.) Most of the reversions I have seen are of the quality “my high school principal is teh gay”. Again, I’m sure there are problems and edit wars, but they are definitely not the majority.

As for “flagged revisions”, I think it’s a great solution to the vandalism problem. Logged-in users and editors will always see the most recent revisions; only anonymous viewers will see the “flagged” versions. The criterion for flagging seems to be “not worth reverting”, so that’s pretty minimal. I’d feel better if there was a better definition of the “trusted editor” who can flag a revision, but assuming there is a sufficient set of people doing the flagging, it should work well.

So why the hate? My theory is that all of these people have written long articles about their totally awesome band, but had the page deleted for not being notable. Or maybe their high school principal really is teh gay, and they feel they are being censored.

Custom classes in Docbook to HTML conversion

July 16th, 2009, 8:55 am UTC by Greg

Maybe I should have a tag for “boring technical notes that I’m writing so others can Google them later”.

Anyway… if you’re converting a Docbook document to HTML, and want customized classes on elements (so you can hit them with CSS), first create a custom XSL style for the document (and use with xmlto -m).

Then suppose you have <code language="html"> in the Docbook and want that to have classes html and xml to hold on to in the resulting HTML. Add this:

<xsl:template match="code[@language = 'html']" mode="class.value">
html xml
</xsl:template>

The match can be any XSL matching pattern. The contents can also be an <xsl:value-of> if you want to do something more advanced.

Maybe it’s because I’m an XSL newb, but I haven’t seen this explained nicely anywhere else.

CMPT 383, or “Why I Hate Ted”

July 7th, 2009, 1:56 pm UTC by Greg

As many of you know, one of the goals for my study leave has been to prepare to teach CMPT 383, Comparative Programming Languages. The calendar says this course is:

Various concepts and principles underlying the design and use of modern programming languages are considered in the context of procedural, object-oriented, functional and logic programming languages. Topics include data and control structuring constructs, facilities for modularity and data abstraction, polymorphism, syntax, and formal semantics.

I took a similar course in my undergrad, and I think it was really useful in helping me see the broader picture of what programming is.

I have been thinking about the course off-and-on for more than a year. I had been forming a pretty solid picture of what the course I teach would look like and things were going well, despite never having devoted any specific time to it or really writing anything down.

Then I talked to Ted. Ted has taught the course before, and has thought a lot about it. His thoughts on the course differed from mine. In particular, he opined that “logic programming is dead, so why teach it?” (Okay, maybe that’s not a direct quote, but that’s what I heard.) So that leaves functional programming as the only new paradigm worth talking about.

He also convinced me that covering too many languages in the single course puts students into a situation of too many trees, not enough forest. (That is, they get lost in syntax and don’t appreciate the core differences between languages.)

Basically Ted did the most annoying thing in the world: he disagreed with me and he was right.

But there is a lot of stuff that I hadn’t considered before that might be worth talking about:

  • Type systems: static/dynamic, strong/weak, built-in data types, OO (or not), type inference, etc.
  • Execution/compilation environment: native code generation, JIT compilers, virtual machines, language translation (e.g. Alice → Java → execution), etc.

So, what the hell do I do with all of that? Any ideas how to put all of that together into a coherent course that students can actually enjoy?

Wikipedia-based Machine Translation

July 1st, 2009, 12:07 am UTC by Greg

I have been pondering this for a while and thought I might as well throw it in a blog entry…

Wikipedia is, of course, a massive collection of mostly-correct information. The information there isn’t fundamentally designed to be machine readable (unlike the semantic web stuff), but there are some structures that allow data to be extracted automatically. My favourite example is the {{coord}} template allowing the Wikipedia layer in Google Earth.

The part of Wikipedia pages that recently caught my eye is the “other languages” section on the left of every page. I’d be willing to bet that these interwiki links form the largest translation database that exists anywhere.

Take the “Lithography” entry as a moderately-obscure example. On the left of that page, we can read off that the German word for lithography is “Lithografie”, the Urdu word is “سنگی طباعت”, and there are 34 others. Sure, some of the words might literally be “lithograph” or “photolithography”, but that’s not the worst thing ever. All of this can be mechanically discovered by parsing the wikitext.

Should it not be possible to do some good machine translation starting with that huge dictionary? Admittedly, I know approximately nothing about machine translation. I know there are still gobs of problems when it comes to grammar and ambiguity, but a good dictionary of word and phrase translations has to count for something. The “disambiguation” pages could probably help with the ambiguity problem too.

I’d guess that even this would produce a readable translation: (1) chunk source text into the longest possible page titles (e.g. look at “spontaneous combustion”, not “spontaneous” and “combustion” separately), (2) apply whatever grammar-translation rules you have lying around, (3) literally translate each chunk with the Wikipedia “other language” article titles, and (4) if there’s no “other language” title, fall back to any other algorithm.
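Steps (1), (3) and (4) can be sketched in a few lines of Python. Everything here is a toy: the `TITLES` dictionary stands in for the interlanguage links, the bracketed German entry is a placeholder rather than a real article title, step (2) is skipped entirely, and the fallback for unknown chunks is simply to leave them untranslated.

```python
# Toy stand-in for "English page title -> German page title".
TITLES = {
    "lithography": "Lithografie",                               # from the example above
    "spontaneous combustion": "[de: spontaneous combustion]",   # placeholder, not real data
}


def chunk_longest(words, titles, max_len=5):
    """Step 1: greedily split words into the longest runs that are page titles."""
    chunks = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            # A single word is always accepted so the loop makes progress.
            if n == 1 or candidate in titles:
                chunks.append(candidate)
                i += n
                break
    return chunks


def translate(text, titles):
    """Steps 3 and 4: translate each chunk by title, else leave it unchanged."""
    chunks = chunk_longest(text.lower().split(), titles)
    return " ".join(titles.get(chunk, chunk) for chunk in chunks)


print(chunk_longest("spontaneous combustion and lithography".split(), TITLES))
print(translate("Spontaneous combustion and lithography", TITLES))
```

Greedy longest-match is the simplest way to do the chunking; it’s an assumption on my part that it would be good enough, and a real system would presumably need something smarter for overlapping titles.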

I can’t believe this is a new idea, but a half-hearted search in Google and CiteSeer turned up nothing. Now it’s off my chest. Anybody who knows anything about machine translation should feel free to tell me why I’m wrong.

imapsync migration

May 21st, 2009, 5:29 pm UTC by Greg

A quick technical note… I just want to get this down so it’s findable later.

I was trying to move my mail from FASnet to the new Zimbra server. The Zimbra wiki suggests imapsync to migrate mail from one IMAP server to another.

When I tried that, imapsync insisted on spidering every folder on the source server to see what’s there. Since FASnet is set up in its own particular fashion, that meant it would look through every file in my home directory. That’s a lot of load on the IMAP server, and it caused the system admins to ask me nicely to stop.

To prevent the spidering, I replaced this line in the imapsync script:

my @all_source_folders = sort $from->folders();

with this:

my @all_source_folders = ();

As long as you specify a --folderrec (or --folder or --subscribed) and no --include, it will still work just fine. My command line was:

./imapsync --host1 imap.css.sfu.ca --user1 ggbaker --host2 imapserver.sfu.ca --user2 ggbaker --noauthmd5 --folderrec mail --exclude old --prefix1 mail --prefix2 fasnet 2>&1 | tee log

Everything seemed to work, and it looks like I have my email moved over. The line in the imapsync summary “Total bytes skipped: 36444” worries me a little. I wonder what those bytes were.

Geotagging

May 15th, 2009, 1:28 pm UTC by Greg

Astute observers (or people I told about it) will have noticed that in our gallery of China pictures, clicking the links on the left “View Album on a Map” and “View in Google Earth” do cool stuff.

I promised I would write about how I did that, so I will. There are obviously other ways to geotag photos, but this is what I did.

The Gear

Before the trip, I had the realization that I didn’t have to get some kind of GPS receiver that connected to my camera: any GPS data was enough, as long as it was timestamped. The photos I take are timestamped, so if the GPS data is too, I can connect the two and figure out where pictures were taken (assuming the GPS and camera are close to each other).

So, before the trip, I picked up an eTrex Venture HC. It claimed to be high-sensitivity, so that sounded good. Also, it was on the cheaper side of GPS receivers and it records the track data I needed.

I had the GPS on a lot over the trip, and I think I ran through 4 pairs of AA batteries the whole time. That’s quite reasonable to my mind.

The cameras were my Rebel XT and Kat’s SD800. Basically, it can be any digital camera with its clock set reasonably accurately.

On the Trip

When we were about to arrive somewhere, all I had to do was flip on the GPS and throw it back in my camera bag. Then, take some pictures. When leaving, turn the GPS off.

The GPS records its position (“tracks”) until I turn it off. (It’s important with a Garmin GPS to not “save” the track: that throws away critical time info.) Just turn it off when done.

Processing

The data can be pulled off the GPS to a GPX file with its own software or GPSBabel. Photos come off the camera in the usual fashion.

I couldn’t find anything I liked to get the GPS data and the camera time stamps together, so I did what I always do in these situations: I started writing Python. The job was basically to read the GPS data, read the timestamps from JPEG files, interpolate the GPS data, and write the position data back to the JPEG.

I threw what I have on SourceForge as “Geotag Merge”. I haven’t “released” it yet, so you have to grab the Subversion repository if you want to play. Sooner or later, I’ll find some best-practices for packaging Python-based applications and I’ll do a beta release.

I added the Gallery2 GPS module to my gallery to make it all work.

An Example?

Okay, look at this picture which, according to its time stamp, was taken at 2009-04-19 14:20:08 (Beijing time) = 2009-04-19 06:20:08 UTC. Looking in the GPX file extracted from the GPS, I see these entries:

<trkpt lat="40.360190626" lon="116.013850048">
  <ele>829.274414</ele>
  <time>2009-04-19T06:19:48Z</time>
</trkpt>
<trkpt lat="40.360201774" lon="116.013833955">
  <ele>829.274414</ele>
  <time>2009-04-19T06:20:11Z</time>
</trkpt>

Like all good XML, this is minimally-human-readable: two observations separated by 23 seconds, with latitude, longitude, time, and elevation.

So, I deduce that the picture was taken in between these locations (and in fact very close to the second). A quick linear interpolation, and we decide that the picture was taken at 40°21′36.72″N 116°0′49.81″E (and 829.27 metres above sea level). This is then written back as part of the image’s EXIF tag, and it can be picked up by any geotagging-aware photo viewer.
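The calculation for this one photo can be sketched with only the Python standard library. This is not the Geotag Merge code: the function names are made up, the two track points are wrapped in a bare `<trkseg>` so the fragment parses on its own (a real GPX file has a namespace and more structure), and `to_dms` only handles positive coordinates.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

GPX_FRAGMENT = """<trkseg>
<trkpt lat="40.360190626" lon="116.013850048">
  <ele>829.274414</ele>
  <time>2009-04-19T06:19:48Z</time>
</trkpt>
<trkpt lat="40.360201774" lon="116.013833955">
  <ele>829.274414</ele>
  <time>2009-04-19T06:20:11Z</time>
</trkpt>
</trkseg>"""


def parse_points(xml_text):
    """Return a list of (time, lat, lon, ele) tuples, with times in UTC."""
    points = []
    for pt in ET.fromstring(xml_text).iter("trkpt"):
        when = datetime.strptime(pt.findtext("time"), "%Y-%m-%dT%H:%M:%SZ")
        when = when.replace(tzinfo=timezone.utc)
        points.append((when, float(pt.get("lat")), float(pt.get("lon")),
                       float(pt.findtext("ele"))))
    return points


def interpolate(before, after, when):
    """Linearly interpolate (lat, lon, ele) between two track points."""
    frac = (when - before[0]) / (after[0] - before[0])   # fraction of the interval elapsed
    return tuple(b + frac * (a - b) for b, a in zip(before[1:], after[1:]))


def to_dms(degrees):
    """Convert positive decimal degrees to (degrees, minutes, seconds)."""
    d = int(degrees)
    minutes_full = (degrees - d) * 60
    m = int(minutes_full)
    return d, m, round((minutes_full - m) * 60, 2)


# Camera timestamp from the example: 14:20:08 Beijing time (UTC+8).
photo_time = datetime(2009, 4, 19, 14, 20, 8, tzinfo=timezone(timedelta(hours=8)))
before, after = parse_points(GPX_FRAGMENT)
lat, lon, ele = interpolate(before, after, photo_time)
print(to_dms(lat), to_dms(lon), round(ele, 2))
```

The photo falls 20 seconds into the 23-second gap, so the result sits about 87% of the way from the first point to the second.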

Everything that’s wrong with Java

March 22nd, 2009, 4:52 pm UTC by Greg

I’m in the process of learning the Java Spring web framework (motto: there’s nothing another XML configuration file can’t fix). This has turned out to be a bit of an exercise in frustration: I have always had trouble dealing with Java tech because of their jargon-filled docs. Actually, it’s not even the jargon per se, it’s that the jargon is all Java-specific.

An example: the term “servlet container”. A “servlet container” is a web server that can run a servlet. That’s all. There’s no need for a new term: just say “web server that can run a servlet” or even “servlet implementation” and you’ve removed a whole layer of jargon that people have to learn.

As I was exploring Hibernate (which can integrate with Spring) today, I went to the Hibernate home page and realized I had another example of why I hate the Java ecosystem. Their front page contains this description of what Hibernate is:

Hibernate is a powerful, high performance object/relational persistence and query service. Hibernate lets you develop persistent classes following object-oriented idiom – including association, inheritance, polymorphism, composition, and collections.

Well… I suppose that’s pretty informative if you’re willing to parse through the overly-dense sentence structure and already know how the Java world uses all those terms. And, the page contains this diagram:

[Image: hibernate_stacks]

Riiiiight. That totally clears things up. Perfect for first-time visitors.

Now, compare a similar (but admittedly less-powerful) Python technology: the home page for SQLObject. They have this description:

SQLObject is a popular Object Relational Manager for providing an object interface to your database, with tables as classes, rows as instances, and columns as attributes.

I’d be hard-pressed to come up with a more clear and concise description of ORM than that. It’s followed by a dozen-line code example of how to work with SQLObject in Python which more-or-less demonstrates exactly what the tool does, how it does it, and what it can be used for.

Basically, the message I get from the Hibernate front page: “boy, this sure looks enterprisey”. From SQLObject: “oh, I see what this tool is for”.

Just to be a little constructive, let me take a shot at rewriting the Hibernate intro:

Hibernate is a powerful Object-Relational Mapper for Java: it lets you save object instances as rows in a relational database, and retrieve them later. Hibernate supports most object-oriented programming techniques, including association, inheritance, polymorphism, composition, and collections.

Okay, that’s off my chest. Bring on the Java fanboys…
