Need a CMPT 470 instructor

August 25th, 2011, 3:20 pm UTC by Greg

As many of you know, I’m the most frequent instructor for CMPT 470, Web-based information systems, at SFU. I love teaching the course, but it’s just not possible to do it every semester. In particular, it’s not possible for me to teach it in the spring (Jan-Apr 2012).

So, it will likely be posted as a sessional (contract) instructor position. We have had sessionals do the course in the past, but I reckon I can make things a little more interesting: there’s a good web development community in Vancouver, and there a lot of people who would do a good job with the course. There is also an increasingly-large group of CMPT 470 alumni who have been out there in the world for a few years getting some experience: some of them would be good at this too.

I just have to find somebody and get them to actually apply. So, I’m putting the call out: anybody interested or know anybody who’d be good?

The course (as I approach it) is a survey of web development topics: markup and style, HTTP, server-side programming, client-side programming, architecture/speed/backend stuff, and whatever else I feel like talking about that semester. The big piece of work for the students is a group-based project which makes up a big chunk of their final mark.

Officially the appointment requires a masters degree, but a case can be made for somebody with industrial (and even better, teaching) experience. The course is scheduled in the evenings (Mondays 5:30-8:30) on the Burnaby campus, so shouldn’t interfere too directly with a day job. Pay is around $8500 plus benefits.

Of course, anybody teaching the course is welcome to my lecture notes, assignments, web materials, and anything else I have that would be useful. There’s no official posting yet, but I figure it’s a good time for people to start thinking about it. I’m happy to talk to anybody about the course.

Edit: I should point out that I’m not the one making the hiring decisions. I’m just an interested third party.

Version Control conundrum

June 4th, 2011, 12:27 am UTC by Greg

As most of you know, the School’s new course management system is my baby. It keeps track of many things, but what I care about right now is (1) who is in a course, and (2) what groups have been formed for assignments/projects/whatever.

Given those things, I have had this idea: It would make perfectly good sense for each of those things (every student in a course; every group in a course) to have a version control repository automatically created for them. The instructor and TAs would also have access, but wouldn’t have to set anything up. Students could use the repositories even in courses where the instructor doesn’t know what technology is.

I have used Subversion repositories for the project groups in CMPT 470 for years. The benefits from my point of view:

  1. Groups can collaborate in that way that version control systems allow.
  2. Students can work on code (even individually) in multiple locations and with versions kept.
  3. All of their code is safely backed-up on a server that we kind of trust.
  4. I can review what members of the group contributed what code.
  5. It’s a nice and easy way to submit code: just give me the SVN URL.

When contemplating technologies to implement my scheme, I went first to GIT (or possibly some other distributed version control system, since they’re all the rage). GIT also has a pile of nice management tools like gitolite that make creating thousands of repositories surprisingly easy.

But while experimenting, I realized that GIT inherently trusted the user-provided information about who they are. If I claim to be “Barack Obama <president @whitehouse.gov>” in my commits, then GIT lets me push those commits just fine, no matter who I have authenticated as at the central server. So, I pretty much lose benefit (4) in the worst cases (which are the cases I’m usually concerned with), which is pretty much a deal-breaker by itself.

The “distributed” nature of any DVCS gets me this problem one way or another—anybody could push the whole group’s work since they could be working for weeks without touching the central server. And having made that realization, I have to admit that (3) also disappears: they don’t have to push to the server very often, so a crash on their end could lose a lot of work.

Finally, knowing students the way I do, (5) is gone too. I’d give a lot to not have this conversation five times a semester: “I got a zero.” “You didn’t submit any code.” “Yes, I committed it.” “You committed it, but did you push it to the server?” “Yes, I pushed it.” “You typed the command ‘git push’?” “No, I use ‘git commit’. That puts the code on the server.” “No it doesn’t. You didn’t put any code on the server where I can get it.” “Yes I did… I committed it.”

Also, it’s my understanding that it’s not possible to give a URL to a subtree of a GIT repository: the only URL is to the project itself. That makes submitting with GIT much harder.

So, I’m left with this: distributed version control is at least as good for developers, but it’s very bad for instructors.

According to Wikipedia’s comparison of revision control software, the only open source, “actively-developed”, “client-server” VCS is Subversion. So it looks like I’m back to the totally-uncool and old-fashioned SVN?

Does anybody want to refute any of that?

What programming language should I learn?

May 20th, 2011, 11:56 am UTC by Greg

I recently had a former CMPT 165 student email me and ask essentially if Python was the best language to learn [first] from a practical/employment standpoint. This was my response, that I think was good and would like to expand on here:

Certainly the traditional view of the world is “C/C++/Java for big projects or where speed matters; higher-level languages like Python/Perl/VB for smaller projects or automation.” Certainly many of my colleagues continue to see the world in this way.

The programming language world has changed in some subtle ways in the last few years and I don’t think that attitude is really valid anymore. If I was starting a big project today (like writing a word processor or something), I would probably start with Python (or something similar): it’s easier to write and get things done and it’s possible to bridge to code in Java or C if you need to.

If I had to honestly summarize the world today, I’d say “C++/Java/C# for big companies who want to make a ‘safe’ choice of programming language; Python/Ruby/JavaScript/Scala/etc on smaller projects where the developers make the choice and want to get things done and enjoy their lives.”

A few footnotes on that: (1) the result is there are probably more Java/C# jobs in the world than other languages; (2) the Python/Ruby/Javascript jobs tend to be in smaller companies and are probably more fun; (3) after you learn to program, learning a new language isn’t nearly as big a deal as learning your first–most of the concepts are always the same.

By “the programming language world has changed in some subtle ways”, I mean mostly:

  1. Languages we always though of as “slow” have been made shockingly fast by just-in-time compilers like V8 and PyPy.
  2. Mixing languages in a project (e.g. calling C from Python, or using one language’s standard library from another) seems, to me at least, to be an easier and more mainstream thing to do if you need to.
  3. Frameworks/libraries are used much more heavily. If you spend 90% of your time calling some GUI library, the speed of your code doesn’t matter much: the speed of the GUI library is what matters. (And, who’s to say the library is written in the same language you’re writing? See 2.)
  4. C isn’t the “fast” language anymore. That’s probably more controversial, but basically, C is really good at single-threaded performance, but multithreading and heterogeneous processor environments are a real pain. Today’s reality is that new processors aren’t improving single-treaded speed by very much. Those who want their computation to happen really, really fast seem to be increasingly reaching for computation-specialized tools like Go, OpenCL, or Hadoop. It turns out that the explicitness of C starts to become a burden if you have to smack the mutexes around by hand.

My assertion that “C++/Java/C# for big companies who want to make a ‘safe’ choice” is really just a gut feeling. I understand and even agree with the desire for static typing in a huge project, but I honestly don’t think that’s why companies choose Java or C#. Companies choose these language because they are enterprisey: they are the kind of language that checks all of the CIO’s boxes and have comforting professional certifications that the HR department can look for.

Also, big companies probably think Oracle’s ownership of Java is a good thing. They haven’t reached the conclusion that I (and I suspect many others have): Oracle will slowly strangle the life out of Java until it truly becomes the new Cobol.

So where does that leave us?

You might as well look for a language that’s (1) fun to write, and (2) easy to actually get shit done with. For me that’s Python, but I can certainly accept Ruby, Lua, Scheme, and friends. I could accept PHP and VB (if I had enough drinks in me) or even C# and Java (if you had an explanation grounded in the language design/features and didn’t contain the words “enterprise” or “corporate”).

Wackiest spam ever

May 11th, 2011, 11:19 pm UTC by Greg

So I got this random email reporting a broken link on my CMPT 470 web site. It’s a little unusual to get an email like that from someone who was apparently not a student, but not totally crazy.

From: angela.hill88@gmail.com
Subject: Found a broken link on your page

Hey Greg,

I found a broken link on
[web page on my course site] and since I
was researching computer science and needed the page, I found an updated
article online. 

The broken link to "Howto for Python" is
[the once-working link] and I found an article on the front
page of [some not-totally-related web site] if you wanted to fix it.
Click the tab "Beginner Python Tutorials" to get to the article.

Figure I'd send you an email because others may need that link for the same
reasons!

Thanks!

Angela Hill
angela.hill88@gmail.com

I googled the sender, and found this essentially-identical broken link report with the same “correct” URL. There are a few other examples a Google away. It’s freakin’ link spam!

If anybody really wants to find the target page (that I’m not going to link to prevent bumping their pagerank for any reason), it’s “onlinecomputersciencedegree” with a “www” and a “com”. The site itself is entirely content-free: all external links to other pages.

Somebody’s plan must be:

  1. Crawl tech link pages.
  2. Link-checking all of their links. (The examples I have are actually broken links.)
  3. Finding the creator’s email and first name on the page. (or accessible somewhere else nearby?)
  4. Emailing that address with the spam link come-on.
  5. Hoping they blindly link to your site without noticing that it’s entirely worthless.
  6. Profit?

How is this possibly a thing? Am I missing something?

P ≠ NP

August 7th, 2010, 8:21 pm UTC by Greg

An email I was recently forwarded (a couple of steps removed) from Vinay Deolalikar from HP Labs:

Dear Fellow Researchers,

I am pleased to announce a proof that P is not equal to NP, which is attached in 10pt and 12pt fonts.

The proof required the piecing together of principles from multiple areas within mathematics. The major effort in constructing this proof was uncovering a chain of conceptual links between various fields and viewing them through a common lens. Second to this were the technical hurdles faced at each stage in the proof.

This work builds upon fundamental contributions many esteemed researchers have made to their fields. In the presentation of this paper, it was my intention to provide the reader with an understanding of the global framework for this proof. Technical and computational details within chapters were minimized as much as possible.

This work was pursued independently of my duties as a HP Labs researcher, and without the knowledge of others. I made several unsuccessful attempts these past two years trying other combinations of ideas before I began this work.

Comments and suggestions for improvements to the paper are highly welcomed.

The paper is about 100 pages, and looks serious (but being a decade away from last thinking about complexity, I am unable to give any more useful evaluation than that). I’ll refrain from posting the paper itself.

Deciding P ≠ NP is a Millennium Prize Problem and I don’t think I’d get much argument to say it is the biggest open problem in computing science.

Update: I see someone else Deolalikar has uploaded the paper. I should point out that in the email thread I got, Stephen Cook said “This appears to be a relatively serious claim to have solved P vs NP.”

Update: Huh, slashdotted. I think “broke” the story is a little strong, but anyway… any media wanting comment on this story, I’d suggest my colleagues David Mitchell (whose work was cited by Deolalikar in this paper), Valentine Kabanets, or Pavol Hell (who also do research in this area).

Update 08/09: Richard Lipton is posting excellent commentary in his blog.

Everything I know about databases is wrong. Also, right.

June 24th, 2010, 12:48 pm UTC by Greg

I have been teaching CMPT 470 for six years now, with my 13th offering going on right now. Anybody doing that is going to pick up a thing or two about web systems.

I was there for the rise of the MVC frameworks and greeted them with open arms. I watched Web 2.0 proclaim “screw it, everything is JavaScript now” and listed with suspicion, but interest. I am currently watching HTML5/CSS3 develop with excitement but wondering why nobody is asking whether IE will support any of it before the sun burns out.

There’s another thing on the horizon that is causing me great confusion: NoSQL.

The NoSQL idea is basically that relational databases (MySQL, Oracle, MSSQL, etc.) are not the best solution to every problem, and that there is a lot more to the data-storage landscape. I can get behind that.

But then, the NoSQL aficionados keep talking. “Relational databases are slow” they say. “You should never JOIN.” “Relational databases can’t scale.” These things sound suspicious. Relational databases have a long history of being very good at their job: these are big assertions that should be accompanied by equally-big evidence.

So, I’m going to try to talk some of this through. Let’s start with the non-relational database types. (I’ll stick to the ones getting a lot of NoSQL-related attention.)

Key-value stores
(e.g. Cassandra, Memcachedb) A key-value store sounds simple enough: it’s a collection of keys (that you lookup with) and each key has an associated value (which is the data you want). For Memcachedb, that’s exactly what you get: keys (strings) and values (strings/binary blobs that you interpret to your whim).

Cassandra add another layer of indirection: each “value” can itself be a dictionary of key-value pairs. So, the “value” associated with the key “ggbaker” might be {"fname":"Greg", "mi":"G", "lname":"Baker"}. Each of those sub-key-values is called a “column”. So, the record “ggbaker” has a column with name “fname” and value “Greg” (with a timestamp). Each record can have whatever set of columns are appropriate.

Document stores
(e.g. CouchDB, MongoDB) The idea here is that each “row” of your data is basically a collection of key-value pairs. For example, one record might be {"fname":"Greg", "mi":"G", "lname":"Baker"}. Some other records might be missing the middle initial, or have a phone number added: there is no fixed schema, just rows storing properties. I choose to think of this as a “collection of JSON objects that you can query” (but of course the internal data format is probably not JSON).

Mongo has a useful SQL to Mongo chart that summarizes things nicely.

Tabular
(e.g. BigTable, Hbase) The big difference here seems to be that the tabular databases use a fixed schema. So, I have to declare ahead of time that I will have a “people” table and entries in there can have columns “fname”, “lname”, and “mi”. Not every column has to be filled for each row, but there’s a fixed set.

There are typically many of these “tables”, each with their own schema.

Summary: There’s a lot of similarity here. Things aren’t as different as I thought. In fact, the big common thread is certainly less-structured data (compared to the relational style of foreign keys and rigid data definition). Of course, I haven’t gotten into how you can actually query this data, but that’s a whole other thing.

Let’s see if I can summarize this (with Haskell-ish type notation, since that’s fresh in my head).

data Key,Data = String
memcacheDB :: Map Key Data
data CassandraRecord = Map Key (Data, Timestamp)
cassandraDB :: Map Key CassandraRecord

data JSON = Map Key (String | Number | … | JSON)
mongoDB,couchDB :: [JSON]

data Schema = [Key]
data BigTable = (Schema, [Map Key Data]) -- where only keys from Schema are allowed in the map
bigTableDB :: Map Key BigTable -- key here is table name

The documentation for these projects is generally somewhere between poor and non-existent: there are a lot of claims of speed and efficiency and how they are totally faster than MySQL. What’s in short supply are examples/descriptions of how to actually get things done. (For example, somewhere in my searching, I saw the phrase “for examples of usage, see the unit tests.”)

That’s a good start. Hopefully I can get back to this and say something else useful on the topic.

Computer Woes

June 10th, 2010, 10:32 am UTC by Greg

My computer at home has been locking up occasionally for the last few weeks. This has been happening since my upgrade to Ubuntu 10.04/Lucid, but I suspect this is a coincidence. (1) The lockups are hard: even the SysRq magic doesn’t do anything, so I deduce that the problem is in the kernel or below. (2) I haven’t seen any reports of the new Linux kernels being flaky. (3) I tried an upgrade from the i386 to amd46 (32-bit to 64-bit) system which I had been meaning to do anyway: no change even with a significantly different kernel.

Thus, I am of the opinion that I have a hardware problem.

As a computer scientist, I don’t enjoy hardware problems, so I’m thinking about buying my way out of them. (Also, my current system is mostly 3 years old, so it’s not a crazy time to upgrade.) My current thinking:

For about $700, that would leave me with the same case and power supply (an Antec Sonata II, 450W), my video card (nVidia 7600GT, but don’t game so who cares), my Hauppauge PCI TV tuner, and my recently-upgraded hard disks.

So, the questions for the crowd: Does my “it’s hardware” assessment sound right? Is it likely that the processor/mobo/RAM swap will fix my problems? Any other suggestions for hardware purchases?

The History of HTML

March 19th, 2010, 8:46 am UTC by Greg

After a simple query from a colleague about the differences between HTML versions, I wrote this. I thought I might as well post it. Everything was from-memory, so there may be some minor errors.

HTML 1 never existed (it was the informal “standard” that the first documentation implied).

HTML 2 was a really minimal initial description of the language. The language was simple because the initial goals were simple. The browser makers made many de facto extensions to this by implementing random stuff.

HTML 3 was an abandoned attempt to standardize everything and the kitchen sink. HTML 3.2 was a really ugly standard that was basically “here’s what browsers accept today.”

Which brings us to modern history…

HTML 4 was an attempt to clean up the language: get rid of the visual stuff and make HTML a semantic markup language again. It included the transitional version (with most of the old ugly stuff) and strict version (as things should be).

HTML 4.01 was a minor change: missed errors and typos.

XHTML 1.0 is HTML 4.01 but with XML syntax: closing empty tags with the slash, everything lowercase, attribute values quotes, etc.

XHTML 1.1 contains some minor changes, but was abandoned in a practical sense because nobody saw any point to the change. XHTML 2.0 was another very ambitious change (non-backwards compatible changes to the language) that was abandoned.

HTML 5 is in-progress of being standardized now. If you ask me, there are two camps driving it. One who thinks “the web is more about more than just simple web pages now: applications and interactivity rule the day” and another who thinks “closing our tags is too hard; I don’t understand what a doctype is: make it easier. Dur.”

As a result, there are some things I like and some things I don’t. I is showing signs of something that will actually be completed and used (unlike HTML 3 and XHTML 2).

Most people don’t know that the HTML 5 standard includes an XHTML version as well. It will be perfectly legal to write HTML 5 with the XML syntax and call it “XHTML 5″.

Addendum: The moral of the story is that I have no intention of teaching HTML 5 anywhere until the standards process is done. For 165 I also need real browser support: no JS/DOM hack to get IE to work, and some defaults in the system stylesheet to let the thing display reasonably without any CSS applied. Even then I will probably teach XHTML 5 because I think it promotes the right habits.

My Minimal Setup

January 6th, 2010, 9:37 pm UTC by Greg

I just got a new netbook: an Asus Eee 1005HA.

As my old tablet got slowly older, I realized that I don’t really have heavy laptop demands: most of my use is a text editor and “hey look at this web page” in lectures. Even when away from the lecture hall, I tend to work primarily in a text editor (for LaTeX, HTML, Python, etc.), Thunderbird, and Firefox. I’m not exactly putting a big strain on the system, and can trade off power for small and light.

As always, there’s a big difference between the average stock setup and what I need to get some work done. Bridging this gap is a hassle, so I’m going to finally record what I need so I can look it up next time.

The new Eee is dual-booting Windows 7 and Ubuntu (Karmic netbook remix). Yay to Asus for shipping with a second “data” partition on the drive that was dead-easy to put Ubuntu on.

I’m open to must-have software suggestions that I missed. I’ll probably add more below as I find stuff I missed.

In Windows

In Ubuntu

  • rsync (As far as I’m concerned it’s negligent to have an operating system install without rsync.)
  • subversion
  • sshfs
  • ntp
  • thunderbird (and thunderbird-gnome-support)
  • If I’m going to be downloading pictures from a camera: mmv, jhead, exif, gphoto2, python-pyexiv2, gpsbabel
  • ddclient (with a config file like this)

In Firefox

Mythical Programming Beasts

November 7th, 2009, 1:55 am UTC by Greg

In the time I have been programming, and mostly doing web programming recently, I have learned a few things. Notably, I have learned that there are a few things that people think are simple to deal with, but aren’t. These “simple” things that people think they’re doing when programming don’t really exist. Here are three examples:

“Text”

I don’t always agree with everything Joel Spolsky says, but he’s right in his rant about Unicode:

There Ain’t No Such Thing As Plain Text.

When dealing with input and output, you never have the luxury of just having “text”. What you really have is a byte stream using a specific character encoding. If you don’t know what encoding you’re dealing with, you’ve got nothing. Every input stream has to be decoded; every output stream has to be encoded.

Even once you have encodings sorted out, there’s a lot of question about what a “string” is in your program. Consider the distinction Django makes between strings and safestrings that allows the auto escaping to work: some strings contain HTML code, and some contain text that the user should see as-is. You can’t “output a string” without knowing how (or if) it has to be processed/escaped/cleaned first.

It’s never just “plain text”.

“Time”

It’s very easy in most languages to store date and time values. Unfortunately, there’s not really any such thing as a “time” either.

As I sit here, it is about midnight (0:00) PST. It’s 8:00 in London and 16:00 in Beijing. A time is no good to anybody without a time zone to tell you how it fits into the world. This comes into much sharper focus with web applications where users are probably going to be in different time zones.

But it’s not even as easy as storing a time + timezone: one week (7 days × 24 hours/day) ago, it was 1:00 PDT, not 12:00 PST. You can’t just add n days to a time and get the same time n days later. Time zones can change, even for a particular user, even if they don’t change their location. (And if not for knowing the time zone, I would have absolutely no way to notice these gotchas.)

Suppose I was using a calendaring application and I enter a meeting at “13:00″ on a particular date.

How does the program represent that? The first instinct would probably be to store “<date> 13:00 PST” (using the entered date/time and my current time zone) but that’s not right if there’s a time change before that date. I have seen calendar error announcements “all meetings after the time change will be off by an hour” because of this mistake. Should it really be stored as “<date> 13:00 PDT” depending on the date? What if the North American daylight savings rules change again before this meeting?

I don’t even want to think about two users in different time zones trying to schedule a meeting, but it should definitely be possible.

The only real thing to do is store “<date> 13:00 America/Vancouver” and hope some timezone library is smart enough to save us later. That means we need a date library with a lot of smarts, like pytz for Python.

It also means that you have to at least be very careful with any built-in date/time library (and possibly data type) your language comes with. It might mean you have to bypass them entirely.

“Appearance of a web page”

[I know it's not really "programming", but just move on, okay?]

This one shouldn’t be a surprise to anybody who knows anything about the web, but web pages simply don’t have a single unique appearance. The way a page looks depends on the browser, window size, available fonts, font size settings, and who knows how many other factors.

If you’re making web pages, you simply have to understand and live with this limitation. As I have said many times in lectures: if you don’t like it, don’t make web pages.

Also, what the page looks like to you has relatively little relation to the way Google or other bots “see” it, but that’s another rant.

« Previous Entries