Automated translation systems, such as Alta Vista's Babelfish, have relied on a set of human-defined rules that attempt to encapsulate the underlying grammar and vocabulary used to construct a language. Although Google has been using that approach to power much of its translation service, it's not really in keeping with the company's philosophy of using some clever code and a massive data set. So it should be no surprise that the company has started developing its own statistical machine translation service. According to some Google-watchers, Google's homegrown translation process is now being used for all languages available through the service.
I took the new service for a spin. Five years of Spanish in high school and college, as well as countless years of exposure to the language through ads on the subway and watching the World Cup on Univision, have left me borderline-literate in the language. I chose a web page that was inspired by my contributions to Urs Technica: a description of the native bear population of the Iberian peninsula. The page contains a mix of some basic descriptive language, along with more detailed discussions of ursine biology. I ran the same page through Babelfish at the same time for comparison.
Overall, it was difficult to discern a difference in quality between the two. Each service had some difficulty with Spanish's sentence structure, which places adjectives after the nouns they modify. For example, instead of "Discover Bear Country," Google suggested that a link was inviting people to "Discover the Country Bears." Maybe Disney paid for that one.
Both also ran into a number of words they didn't know what to do with; for example, Spanish has a specific word for "bear den" (osera) that neither service recognized and so left untranslated. Neither correctly figured out the proper context for the use of "celo." This is a term that didn't come up during my years of Spanish, but it apparently can be used to describe the annual period of female fertility. Both services went literal when faced with "celo," with Babelfish choosing "fervor" and Google picking "zeal" as its translation. This caused Google to suggest that female bears "can be mounted by several different males over the same zeal."
There were also what might be termed Spanish 101-level errors. The verb "molesta" is generally used to mean "bother" or "harass." Yet Google made a novice-level mistake and did a literal translation to "molest." Neither service demonstrated a human's ability to recognize when it was producing gibberish. Google, for example, described a group of bears gathering around a rich food source as "They can also occur by coincidence, rallies temporary copies in a few places with abundant food."
There was one case where Google's statistical method seemed to lead it astray. Both services went Spanish 101 on the term "crudo," which was used to describe the harshest or roughest part of winter, when bears hibernate. Google apparently applied undue statistical weight to the word "crude." In one case, this trashed the entire sentence that contained "crudo"—a photo of a cold winter scene was captioned: "The period of winter as crude bears spend winter." In a second instance, the more typical context of "crudo" was applied, with hilarious results: "The life of a bear begins as crude oil during winter."
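As a rough illustration of how that kind of misfire could happen, here is a toy sketch in Python. It is not Google's code, and nothing here reflects how the service is actually built: the counts are invented, the "neighbouring word" context is a simplification, and the function names are hypothetical. It only shows that picking the rendering seen most often overall favors "crude," while conditioning on a nearby word can recover "harsh."

```python
from collections import Counter

# Invented toy counts (an assumption, not real data): how often each English
# rendering of "crudo" appeared next to a given Spanish word in a
# hypothetical parallel corpus.
TOY_COUNTS = {
    "petróleo": Counter({"crude": 95, "harsh": 1}),
    "invierno": Counter({"harsh": 6, "crude": 4}),
}


def context_free_choice(counts):
    """Pick the rendering seen most often overall, ignoring context."""
    total = Counter()
    for per_context in counts.values():
        total.update(per_context)
    return total.most_common(1)[0][0]


def context_aware_choice(counts, neighbour):
    """Pick the most frequent rendering for this neighbouring word,
    falling back to the overall winner when the neighbour is unseen."""
    per_context = counts.get(neighbour)
    if not per_context:
        return context_free_choice(counts)
    return per_context.most_common(1)[0][0]


if __name__ == "__main__":
    print(context_free_choice(TOY_COUNTS))
    print(context_aware_choice(TOY_COUNTS, "invierno"))
```

With these made-up numbers, the oil-related examples swamp everything else, so the context-free choice lands on "crude" even in a sentence about winter.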
To test a language that is more distant from English, I located a press release in both Japanese and English: the one announcing the 2002 Nobel Prize in Physics, which went to researchers running parallel experiments in the US and Japan. The release in Japanese was available only as a PDF, so I copied and pasted the text into the translation box. The results, which seem to have preserved the line breaks from the PDF, were practically poetic:
I do so without interaction, thus detected is extremely
Difficult for. For example, the trillions of pieces of New
Torino is our second body to penetrate, but I
We are absolutely not aware. Raymond Davis Jr.
Coal giant tank is placed 600 tons of liquid meets applicable
The construction of a completely new detection equipment. He was 30 years…
That bears a slight resemblance to Japanese Zen poetry, which is supposed to startle its readers out of their normal perception of reality, allowing them to reach a Buddhist enlightenment.
This may sound like I'm being excessively harsh regarding Google's new translation method, so I'll reemphasize that it appears to produce translations that are roughly equal in quality to those provided by other services. Where it really shines, however, is its interface. On a translated web page, you can hover the mouse over any translated sentence, and the untranslated version will appear. This is a tremendous aid for those who have a partial command of the language, as the immediate comparison between the texts can help eliminate any confusion caused by mistranslation.
This same feature may ultimately help Google move beyond the quality of other services. Each of these popups comes with a link that offers you the opportunity to suggest a better translation. If people are willing to spend the time suggesting fixes for mistranslations (and vandalism doesn't become a problem), Google may ultimately have a data set that allows its service to provide an exceptional degree of accuracy.