Wikipedia-based Machine Translation

July 1st, 2009, 12:07 am PDT by Greg

I have been pondering this for a while and thought I might as well throw it in a blog entry…

Wikipedia is, of course, a massive collection of mostly-correct information. The information there isn’t fundamentally designed to be machine readable (unlike the semantic web stuff), but there are some structures that allow data to be extracted automatically. My favourite example is the {{coord}} template allowing the Wikipedia layer in Google Earth.
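To make the {{coord}} example concrete, here is a minimal sketch of that kind of mechanical extraction. It only handles the simple decimal `{{coord|lat|lon|...}}` form; real {{coord}} usage has many variants (degrees/minutes/seconds, named parameters) that this ignores.

```python
import re

# Pull latitude/longitude pairs out of simple decimal {{coord}} templates.
# This deliberately handles only the "{{coord|lat|lon|...}}" form.
COORD_RE = re.compile(r"\{\{coord\|(-?\d+(?:\.\d+)?)\|(-?\d+(?:\.\d+)?)[|}]")

def extract_coords(wikitext):
    """Return a list of (lat, lon) floats found in the wikitext."""
    return [(float(lat), float(lon)) for lat, lon in COORD_RE.findall(wikitext)]

sample = "{{coord|49.2827|-123.1207|display=title}}"
print(extract_coords(sample))  # [(49.2827, -123.1207)]
```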

The part of Wikipedia pages that recently caught my eye is the “other languages” section on the left of every page. I’d be willing to bet that these interwiki links form the largest translation database that exists anywhere.

Take the entry for “Lithography” as a moderately-obscure example. On the left of that page, we can read off that the German word for lithography is “Lithografie”, the Urdu word is “سنگی طباعت”, and 34 others. Sure, some of the words might literally be “lithograph” or “photolithography”, but that’s not the worst thing ever. All of this can be mechanically discovered by parsing the wikitext.
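That parsing step is about as easy as it sounds. A minimal sketch, assuming the current convention of interwiki links sitting in the wikitext itself as `[[languagecode:Title]]` (the language codes and titles below are just made-up sample input):

```python
import re

# Match interwiki links of the form [[xx:Title]] or [[xx-yy:Title]],
# where the language code is lowercase (2-3 letters, optional variant).
INTERWIKI_RE = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z]+)?):([^\]|]+)\]\]")

def interwiki_titles(wikitext):
    """Map language code -> article title for each interwiki link found."""
    return {lang: title for lang, title in INTERWIKI_RE.findall(wikitext)}

sample = "...article text...\n[[de:Lithografie]]\n[[fr:Lithographie]]\n"
print(interwiki_titles(sample))  # {'de': 'Lithografie', 'fr': 'Lithographie'}
```

Namespace prefixes like `Category:` or `File:` don’t match the lowercase short-code pattern, so they fall through harmlessly.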

Should it not be possible to do some good machine translation starting with that huge dictionary? Admittedly, I know approximately nothing about machine translation. I know there are still gobs of problems when it comes to grammar and ambiguity, but a good dictionary of word and phrase translations has to count for something. The “disambiguation” pages could probably help with the ambiguity problem too.

I’d guess that even this would produce a readable translation: (1) chunk source text into the longest possible page titles (e.g. look at “spontaneous combustion”, not “spontaneous” and “combustion” separately), (2) apply whatever grammar-translation rules you have lying around, (3) literally translate each chunk with the Wikipedia “other language” article titles, and (4) if there’s no “other language” title, fall back to any other algorithm.
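Steps (1), (3), and (4) above can be sketched in a few lines: greedily chunk the source text into the longest phrases found in a title dictionary, translate each chunk, and fall back when nothing matches. The dictionary entries here are invented examples, the fallback just passes the word through untranslated, and the grammar step (2) is skipped entirely.

```python
# Hypothetical (source phrase) -> target-title dictionary, as if scraped
# from the "other languages" links of English and German Wikipedia.
titles = {
    ("spontaneous", "combustion"): "Selbstentzündung",
    ("lithography",): "Lithografie",
}
MAX_CHUNK = max(len(key) for key in titles)

def translate(text):
    words = text.lower().split()
    out, i = [], 0
    while i < len(words):
        # (1) try the longest possible chunk first
        for size in range(min(MAX_CHUNK, len(words) - i), 0, -1):
            chunk = tuple(words[i:i + size])
            if chunk in titles:
                out.append(titles[chunk])   # (3) translate via article title
                i += size
                break
        else:
            out.append(words[i])            # (4) fallback: leave untranslated
            i += 1
    return " ".join(out)

print(translate("spontaneous combustion and lithography"))
# Selbstentzündung and Lithografie
```

Note that the greedy match correctly takes “spontaneous combustion” as one unit rather than translating the two words separately.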

I can’t believe this is a new idea, but a half-hearted search in Google and CiteSeer turned up nothing. Now it’s off my chest. Anybody who knows anything about machine translation should feel free to tell me why I’m wrong.

2 Responses to “Wikipedia-based Machine Translation”

  1. Yang Says:

    IIRC, this is partially what Google Translate does. Not on Wikipedia, of course, since that’s unreliable, but on, say, the Canadian House of Commons proceedings or other similarly large multilingual corpora.

    For lexicographic translations, something like Wiktionary is potentially more appropriate. It’s certainly not as interesting when compared to translation of actual passages.

  2. Greg Says:

    It occurs to me that the pronunciation guides that accompany most articles could be used for text-to-speech as well.

    And how many people know the Japanese for “Edmonton Eskimos” (http://en.wikipedia.org/wiki/Edmonton_Eskimos) or 30 translations of “Smurf” (http://en.wikipedia.org/wiki/Smurf)? You won’t find those in your average dictionary.