Another source of translators is translation Web sites. AltaVista is one such example. Go to AltaVista and click the "Translate" link (see Figure 9.6).

Figure 9.6. AltaVista's Translation Facility

From here you can type a string into the "Translate a block of text:" box, select a language pair using the combo box, and click the Translate button; the translated text is shown on the next page. The job of an HTML translator is to automate this process and collect the result. The translated text is embedded somewhere in the returned HTML page and must be extracted. This process is often referred to as screen scraping, and it is not an ideal solution. The biggest problem is that when the page's HTML changes, the algorithm for extracting the string usually breaks, so the translation fails. It is therefore a fragile solution; the more explicit method employed by Web services is preferable. In the code for this book, you will find HTML translators for the following:
See Appendix B for a complete list of online translators.

All the HTML translators inherit from the HtmlTranslator class, which uses the .NET Framework 2.0 WebBrowser control (in the All Windows Forms section of the toolbox). If you are using Visual Studio 2003, see the section "Visual Studio 2003 WebBrowser Control" for an equivalent control. Both the Visual Studio 2005 and Visual Studio 2003 projects are included in the book's source code. The WebBrowser control is part of a form called WebBrowserForm, which is never shown. The WebBrowser control simply provides a way to post information to a page and to get the resulting HTML. The form has one method, which waits for the page to finish loading before the result is read:

    public void WaitForBrowser()
    {
        // Keep pumping the message loop until the page has finished loading
        while (WebBrowser.ReadyState != WebBrowserReadyState.Complete)
        {
            Application.DoEvents();
        }
    }

The HtmlTranslator class makes life easy for its subclasses. With the majority of the work performed in the HtmlTranslator class, the subclasses need to specify only the following:
This is the HtmlTranslator class:

    public abstract class HtmlTranslator: Translator
    {
        private string url;
        private WebBrowserForm webBrowserForm;
        private WebBrowser webBrowser;

        public HtmlTranslator(string name, string url): base(name)
        {
            this.url = url;
        }

        public HtmlTranslator(string name, string url,
            string[,] languagePairs): base(name, languagePairs)
        {
            this.url = url;
        }

        public string Url
        {
            get {return url;}
            set {url = value;}
        }

        // Creates the hidden form and its WebBrowser control on first use
        protected virtual void InitializeWebBrowser()
        {
            if (webBrowser == null)
            {
                webBrowserForm = new WebBrowserForm();
                webBrowser = webBrowserForm.WebBrowser;
            }
        }

        // Subclasses build the form data that is posted to the URL
        public abstract string GetPostData(
            string inputLanguage, string outputLanguage, string text);

        protected string Encode(string text)
        {
            return HttpUtility.UrlEncode(text);
        }

        // Subclasses override this to pick the translation out of the page text
        protected virtual string GetTranslation(string inputLanguage,
            string outputLanguage, string innerText)
        {
            return innerText;
        }
    }

As you would expect, the action happens in the Translate method:

    public override string Translate(
        string inputLanguage, string outputLanguage, string text)
    {
        if (!IsSupported(inputLanguage, outputLanguage))
            throw new LanguageCombinationNotSupportedException(
                "Language combination is not supported",
                inputLanguage, outputLanguage);

        InitializeWebBrowser();

        string innerText = GetInnerText(GetPostData(
            inputLanguage, outputLanguage, text));

        return GetTranslation(inputLanguage, outputLanguage, innerText);
    }

Translate initializes the Web browser and calls the subclass's GetPostData to get the data to post to the URL.
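The division of labor between the base class and a subclass can be seen in isolation with a small sketch. It is not the book's code: SketchTranslator and EchoTranslator are hypothetical names, the page fetch is replaced by a delegate so that the sketch runs without a WebBrowser control or a network connection, and Uri.EscapeDataString stands in for HttpUtility.UrlEncode to avoid a reference to System.Web (it encodes a space as %20 rather than +).

```csharp
using System;

// Hypothetical stand-in for HtmlTranslator. The real class posts to a URL
// through a WebBrowser control; here a delegate plays that role.
public abstract class SketchTranslator
{
    private readonly Func<string, string> fetchInnerText;

    protected SketchTranslator(Func<string, string> fetchInnerText)
    {
        this.fetchInnerText = fetchInnerText;
    }

    // Fixed sequence of steps: build the post data, fetch the page text,
    // then let the subclass extract the translation from it.
    public string Translate(
        string inputLanguage, string outputLanguage, string text)
    {
        string postData = GetPostData(inputLanguage, outputLanguage, text);
        string innerText = fetchInnerText(postData);
        return GetTranslation(inputLanguage, outputLanguage, innerText);
    }

    public abstract string GetPostData(
        string inputLanguage, string outputLanguage, string text);

    // Default behavior: the whole page text is the translation
    protected virtual string GetTranslation(string inputLanguage,
        string outputLanguage, string innerText)
    {
        return innerText;
    }
}

// Hypothetical subclass: it supplies only the post data format.
public sealed class EchoTranslator : SketchTranslator
{
    // The fake "page" simply echoes back whatever was posted
    public EchoTranslator()
        : base(postData => "page text for [" + postData + "]")
    {
    }

    public override string GetPostData(
        string inputLanguage, string outputLanguage, string text)
    {
        return "lp=" + inputLanguage + "_" + outputLanguage +
            "&text=" + Uri.EscapeDataString(text);
    }
}
```

Calling new EchoTranslator().Translate("en", "fr", "hello world") runs the three steps in order; because EchoTranslator does not override GetTranslation, the fake page text is returned unchanged.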
GetInnerText navigates to the URL, posts the data, waits for the browser to complete the display of the page, and then returns just the text part of the page's HTML:

    protected virtual string GetInnerText(string postData)
    {
        // HTTP header lines end with carriage return + line feed
        string headers =
            "Content-Type: application/x-www-form-urlencoded\r\n";

        byte[] bytePostData =
            System.Text.Encoding.ASCII.GetBytes(postData);

        webBrowser.Navigate(
            new Uri(url), String.Empty, bytePostData, headers);

        webBrowserForm.WaitForBrowser();

        return webBrowser.Document.Body.InnerText;
    }

The subclass's GetTranslation method is passed the language pair and the Web page's text, and is responsible for extracting the translated text from the page.

Visual Studio 2003 WebBrowser Control

The .NET Framework 1.1 does not have a WebBrowser control, but an equivalent ActiveX control can be used instead. This control is not installed in the Visual Studio toolbox by default. To install it, right-click the toolbox, select Add/Remove Items..., select the COM Components tab, click the Browse... button, enter shdocvw.dll from your system32 folder, and click Open. Figure 9.7 shows the result. Click OK.

Figure 9.7. Adding the Microsoft Web Browser Control to the Visual Studio 2003 Toolbox

The ActiveX Web Browser wrapper control is similar to the .NET Framework 2.0 WebBrowser control, but you should be aware of the differences listed in Table 9.1.
The AltaVistaTranslator Class

The AltaVistaTranslator class uses AltaVista's translation Web page to perform translations:

    public class AltaVistaTranslator: HtmlTranslator
    {
        public AltaVistaTranslator(): base("AltaVista Translator",
            @"http://babelfish.altavista.com/tr",
            new string[,]
            {
                {"en", "zh-CHS"}, {"en", "zh-CHT"}, {"en", "nl"}, {"en", "fr"},
                {"en", "de"}, {"en", "el"}, {"en", "it"}, {"en", "ja"},
                {"en", "ko"}, {"en", "pt"}, {"en", "ru"}, {"en", "es"},
                {"zh-CHS", "en"}, {"zh-CHT", "en"}, {"nl", "en"}, {"nl", "fr"},
                {"fr", "nl"}, {"fr", "en"}, {"fr", "de"}, {"fr", "el"},
                {"fr", "it"}, {"fr", "pt"}, {"fr", "es"}, {"de", "en"},
                {"de", "fr"}, {"el", "en"}, {"el", "fr"}, {"it", "en"},
                {"it", "fr"}, {"ja", "en"}, {"ko", "en"}, {"pt", "en"},
                {"pt", "fr"}, {"ru", "en"}, {"es", "en"}, {"es", "fr"}
            })
        {
        }

        protected virtual string DotNetLanguageCodeToLanguageCode(
            string language)
        {
            // check for a couple of adjustments to the language codes
            if (language == "zh-CHS")
                // Chinese (Simplified)
                return "zh";
            else if (language == "zh-CHT")
                // Chinese (Traditional)
                return "zt";
            else
                return language;
        }

        public override string GetPostData(string inputLanguage,
            string outputLanguage, string text)
        {
            string languagePair =
                DotNetLanguageCodeToLanguageCode(inputLanguage) + "_" +
                DotNetLanguageCodeToLanguageCode(outputLanguage);

            return "doit=done&intl=1&tt=urltext&trtext=" +
                Encode(text) + "&lp=" + languagePair;
        }

        protected override string GetTranslation(string inputLanguage,
            string outputLanguage, string innerText)
        {
            // Skip past the marker text that precedes the translation
            int index = innerText.IndexOf("Babel Fish Translation");
            if (index == -1)
                return String.Empty;

            innerText = innerText.Substring(index + 2);
            index = innerText.IndexOf(":");
            if (index == -1)
                return String.Empty;

            innerText = innerText.Substring(index + 3);

            // The translation ends just before this marker
            index = innerText.IndexOf("Translate again");
            if (index == -1)
                return String.Empty;

            return innerText.Substring(0, index - 2).TrimEnd(
                new char[] {' ', (char) 10, (char) 13});
        }
    }

The GetPostData method builds a post data string containing the language pair and the text to translate. The GetTranslation method looks for textual markers that are known to appear immediately before and immediately after the translated text, and returns the text in between. This is the most fragile part of the process: if the textual content of the resulting Web page changes, this code will need to be rewritten.
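The marker-based extraction can be isolated into a small helper, which makes the dependence on the page's wording easy to see. The following is a minimal sketch, not part of the book's source code: the MarkerExtractor class and its ExtractBetween method are hypothetical names, and the sample page text and markers in the usage note are invented for illustration.

```csharp
using System;

// Hypothetical helper, not from the book's source code.
public static class MarkerExtractor
{
    // Returns the trimmed text found between the first occurrence of
    // startMarker and the next occurrence of endMarker, or an empty
    // string when either marker is missing.
    public static string ExtractBetween(
        string pageText, string startMarker, string endMarker)
    {
        if (String.IsNullOrEmpty(pageText))
            return String.Empty;

        int start = pageText.IndexOf(startMarker, StringComparison.Ordinal);
        if (start == -1)
            return String.Empty;

        // Move past the start marker itself
        start += startMarker.Length;

        int end = pageText.IndexOf(endMarker, start, StringComparison.Ordinal);
        if (end == -1)
            return String.Empty;

        return pageText.Substring(start, end - start).Trim();
    }
}
```

With an invented page text such as "Result: Bonjour le monde\r\nTranslate again", calling MarkerExtractor.ExtractBetween(pageText, "Result:", "Translate again") returns "Bonjour le monde". If the site rewords either marker, the helper returns an empty string, which is the same failure mode as the GetTranslation method above.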