{"id":765,"date":"2009-07-01T00:07:48","date_gmt":"2009-07-01T07:07:48","guid":{"rendered":"http:\/\/gregbaker.ca\/blog\/?p=765"},"modified":"2009-07-01T00:08:22","modified_gmt":"2009-07-01T07:08:22","slug":"wikipedia-based-machine-translation","status":"publish","type":"post","link":"http:\/\/gregbaker.ca\/blog\/2009\/07\/01\/wikipedia-based-machine-translation\/","title":{"rendered":"Wikipedia-based Machine Translation"},"content":{"rendered":"<p>I have been pondering this for a while and thought I might as well throw it in a blog entry&hellip;<\/p>\n<p>Wikipedia is, of course, a massive collection of mostly-correct information.  The information there isn&#8217;t fundamentally designed to be machine readable (unlike the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Semantic_Web\">semantic web<\/a> stuff), but there are some structures that allow data to be extracted automatically.  My favourite example is the {{<a href=\"http:\/\/en.wikipedia.org\/wiki\/Template:Coord\">coord<\/a>}} template, which makes the Wikipedia layer in Google Earth possible.<\/p>\n<p>The part of Wikipedia pages that recently caught my eye is the &ldquo;other languages&rdquo; section on the left of every page.  I&#8217;d be willing to bet that these interwiki links form the largest translation database that exists anywhere.<\/p>\n<p>Take the &ldquo;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Lithography\">Lithography<\/a>&rdquo; entry as a moderately-obscure example.  On the left of that page, we can read off that the German word for lithography is &ldquo;<i lang=\"de\">Lithografie<\/i>&rdquo;, the Urdu word is &ldquo;<i lang=\"ur\">\u0633\u0646\u06af\u06cc \u0637\u0628\u0627\u0639\u062a<\/i>&rdquo;, and there are 34 others.  Sure, some of the words might literally be &ldquo;lithograph&#8221; or &ldquo;photolithography&rdquo;, but that&#8217;s not the worst thing ever.  
All of this can be mechanically discovered by parsing the wikitext.<\/p>\n<p>Should it not be possible to do some good <a href=\"http:\/\/en.wikipedia.org\/wiki\/Machine_translation\">machine translation<\/a> starting with that huge dictionary?  Admittedly, I know approximately nothing about machine translation.  I know there are still gobs of problems when it comes to grammar and ambiguity, but a good dictionary of word and phrase translations has to count for something.  The &ldquo;disambiguation&rdquo; pages could probably help with the ambiguity problem too.<\/p>\n<p>I&#8217;d guess that even this would produce a readable translation: (1) chunk source text into the longest possible page titles (e.g. look at &ldquo;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Spontaneous_combustion\">spontaneous combustion<\/a>&rdquo;, not &ldquo;spontaneous&rdquo; and &ldquo;combustion&#8221; separately), (2) apply whatever grammar-translation rules you have lying around, (3) literally translate each chunk with the Wikipedia &ldquo;other language&rdquo; article titles, and (4) if there&#8217;s no &ldquo;other language&rdquo; title, fall back to any other algorithm.<\/p>\n<p>I can&#8217;t believe this is a new idea, but a half-hearted search in Google and <a href=\"http:\/\/citeseer.ist.psu.edu\/\">CiteSeer<\/a> turned up nothing.  Now it&#8217;s off my chest.  Anybody who knows anything about machine translation should feel free to tell me why I&#8217;m wrong.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I have been pondering this for a while and thought I might as well throw it in a blog entry&hellip; Wikipedia is, of course, a massive collection of mostly-correct information. 
The information there isn&#8217;t fundamentally designed to be machine readable (unlike the semantic web stuff), but there are some structures that allow data to be [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[10],"tags":[],"class_list":["post-765","post","type-post","status-publish","format-standard","hentry","category-tech"],"_links":{"self":[{"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/posts\/765","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/comments?post=765"}],"version-history":[{"count":8,"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/posts\/765\/revisions"}],"predecessor-version":[{"id":773,"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/posts\/765\/revisions\/773"}],"wp:attachment":[{"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/media?parent=765"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/categories?post=765"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/gregbaker.ca\/blog\/wp-json\/wp\/v2\/tags?post=765"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}