HTML Translators


Another source of translators is translation Web sites. AltaVista is one such example. Go to AltaVista and click the "Translate" link (see Figure 9.6).

Figure 9.6. AltaVista's Translation Facility


From here you can type a string into the "Translate a block of text:" box, select a language pair from the combo box, and click the Translate button; the translated text is shown on the next page. The job of an HTML translator is to automate this process and collect the result. The translated text is embedded somewhere in the returned HTML page and must be extracted. This technique is often called screen scraping, and it is not an ideal solution. The biggest problem is that whenever the site changes its HTML, the extraction algorithm usually breaks and the translation fails. It is therefore a fragile solution; the more explicit contract offered by Web services is preferable.
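To make the fragility concrete, here is a minimal sketch of marker-based extraction. The ScrapeDemo class, the ExtractBetween helper, and the sample page text in the usage note are hypothetical, invented for illustration; they are not part of the book's code.

```csharp
using System;

public static class ScrapeDemo
{
    // Returns the text found between startMarker and endMarker, or null
    // when either marker is missing (for example, after a page redesign).
    public static string ExtractBetween(
        string pageText, string startMarker, string endMarker)
    {
        int start = pageText.IndexOf(startMarker, StringComparison.Ordinal);
        if (start == -1)
            return null;
        start += startMarker.Length;

        int end = pageText.IndexOf(endMarker, start, StringComparison.Ordinal);
        if (end == -1)
            return null;

        return pageText.Substring(start, end - start).Trim();
    }
}
```

Given the page text "Translation: Bonjour Translate again" and the markers "Translation:" and "Translate again", the helper returns "Bonjour". If the site relabels the result as "Result:", the same call returns null, even though the translation is still on the page.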

In the code for this book, you will find HTML translators for the following:

  • AltaVista

  • Free Translation

  • Online Translator

  • Socrates

See Appendix B for a complete list of online translators.

All the HTML translators inherit from the HtmlTranslator class, which uses the .NET Framework 2.0 WebBrowser control (in the All Windows Forms section of the toolbox). If you are using Visual Studio 2003, see the section "Visual Studio 2003 WebBrowser Control" for an equivalent control. Both the Visual Studio 2005 and Visual Studio 2003 projects are included in the book's source code.

The WebBrowser control is part of a form called WebBrowserForm, which is never shown. The WebBrowser control simply represents a way to post information to a page and to get the resulting HTML. The form has one method to wait for the completion of the page before getting the result:

    public void WaitForBrowser()
    {
        while (WebBrowser.ReadyState != WebBrowserReadyState.Complete)
        {
            Application.DoEvents();
        }
    }
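This loop never exits if the page never reaches the Complete state. One possible variation, sketched here as an assumption rather than as the book's approach, is to bound the wait with a timeout. The WaitHelper class and its WaitUntil method are hypothetical names; the condition and the message pump are passed in as delegates so that the sketch does not depend on Windows Forms.

```csharp
using System;
using System.Diagnostics;

public static class WaitHelper
{
    // Polls isComplete until it returns true or the timeout elapses.
    // pump is invoked between polls; in a Windows Forms host it would
    // typically call Application.DoEvents() so the browser can progress.
    // Returns true if the condition was met, false on timeout.
    public static bool WaitUntil(
        Func<bool> isComplete, Action pump, TimeSpan timeout)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        while (!isComplete())
        {
            if (stopwatch.Elapsed >= timeout)
                return false;
            pump();
        }
        return true;
    }
}
```

In the form, the condition would be the ReadyState comparison shown above, and a false return value would let the caller report a failed translation instead of hanging.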


The HtmlTranslator class makes life easy for its subclasses. With the majority of the work performed in the HtmlTranslator class, the subclasses need to specify only the following:

  • The translator's name

  • The URL for the Web site

  • The language pairs supported

  • A method to format the data posted to the URL

  • A method to decode the result from the Web page

This is the HtmlTranslator class:

    public abstract class HtmlTranslator: Translator
    {
        private string url;
        private WebBrowserForm webBrowserForm;
        private WebBrowser webBrowser;

        public HtmlTranslator(string name, string url): base(name)
        {
            this.url = url;
        }

        public HtmlTranslator(string name, string url,
            string[,] languagePairs): base(name, languagePairs)
        {
            this.url = url;
        }

        public string Url
        {
            get {return url;}
            set {url = value;}
        }

        protected virtual void InitializeWebBrowser()
        {
            if (webBrowser == null)
            {
                webBrowserForm = new WebBrowserForm();
                webBrowser = webBrowserForm.WebBrowser;
            }
        }

        public abstract string GetPostData(
            string inputLanguage, string outputLanguage, string text);

        protected string Encode(string text)
        {
            return HttpUtility.UrlEncode(text);
        }

        protected virtual string GetTranslation(string inputLanguage,
            string outputLanguage, string innerText)
        {
            return innerText;
        }
    }


As you would expect, the action happens in the Translate method:

    public override string Translate(
        string inputLanguage, string outputLanguage, string text)
    {
        if (! IsSupported(inputLanguage, outputLanguage))
            throw new LanguageCombinationNotSupportedException(
                "Language combination is not supported",
                inputLanguage, outputLanguage);

        InitializeWebBrowser();

        string innerText = GetInnerText(GetPostData(
            inputLanguage, outputLanguage, text));

        return GetTranslation(inputLanguage, outputLanguage, innerText);
    }
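The IsSupported check belongs to the Translator base class, which is not shown in this section. As a rough sketch of how such a lookup over a two-column string[,] array might work, consider the following. The LanguagePairs class and its Contains method are hypothetical, and the case-insensitive comparison is an assumption; the book's implementation may differ.

```csharp
using System;

public static class LanguagePairs
{
    // Returns true if (inputLanguage, outputLanguage) appears as a row
    // in the two-column pairs array. The comparison ignores case.
    public static bool Contains(
        string[,] pairs, string inputLanguage, string outputLanguage)
    {
        for (int row = 0; row < pairs.GetLength(0); row++)
        {
            if (string.Equals(pairs[row, 0], inputLanguage,
                    StringComparison.OrdinalIgnoreCase) &&
                string.Equals(pairs[row, 1], outputLanguage,
                    StringComparison.OrdinalIgnoreCase))
            {
                return true;
            }
        }
        return false;
    }
}
```

Note that a pair is directional: an array containing {"en", "de"} does not imply support for German to English unless {"de", "en"} is listed as well.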


Translate initializes the Web browser and calls the subclass's GetPostData to get the data to post to the URL. GetInnerText navigates to the URL, posts the data, waits for the browser to finish displaying the page, and then returns just the text content of the page's HTML body:

    protected virtual string GetInnerText(string postData)
    {
        // Header lines end with CR LF (13, 10), in that order.
        string headers =
            "Content-Type: application/x-www-form-urlencoded" +
            (char) 13 + (char) 10;

        // URL-encoded post data contains only ASCII characters.
        byte[] bytePostData =
            System.Text.Encoding.ASCII.GetBytes(postData);

        webBrowser.Navigate(
            new Uri(url), String.Empty, bytePostData, headers);

        webBrowserForm.WaitForBrowser();

        return webBrowser.Document.Body.InnerText;
    }
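The two inputs that GetInnerText prepares for Navigate can be looked at in isolation. The sketch below uses a hypothetical FormPost class, which is not part of the book's code, to show the header line with its CR LF terminator and the conversion of already URL-encoded post data to bytes.

```csharp
using System;
using System.Text;

public static class FormPost
{
    // HTTP header lines are terminated by CR LF, in that order.
    public const string ContentTypeHeader =
        "Content-Type: application/x-www-form-urlencoded\r\n";

    // URL-encoded form data consists only of ASCII characters, so the
    // ASCII encoding converts it to bytes without loss.
    public static byte[] ToBody(string urlEncodedPostData)
    {
        return Encoding.ASCII.GetBytes(urlEncodedPostData);
    }
}
```

Because every character of URL-encoded text is ASCII, the resulting byte array has exactly one byte per character of the input string.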


The subclass's GetTranslation method is passed the language pair and the Web page's text, and is responsible for extracting the translated text from the page.

Visual Studio 2003 WebBrowser Control

The .NET Framework 1.1 does not have a WebBrowser control, but an equivalent ActiveX control can be used instead. This control is not installed in the Visual Studio toolbox by default. To install it, right-click the toolbox, select Add/Remove Items..., select the COM Components tab, click the Browse... button, enter shdocvw.dll from your system32 folder, and click Open. Figure 9.7 shows the result. Click OK.

Figure 9.7. Adding the Microsoft Web Browser Control to the Visual Studio 2003 Toolbox


The ActiveX Web Browser wrapper control is similar to the .NET Framework 2.0 WebBrowser control, but you should be aware of the differences listed in Table 9.1.

Table 9.1. Relevant Differences Between the .NET Framework 2.0 WebBrowser Control and the ActiveX Web Browser Control Used in .NET Framework 1.1

.NET Framework 2.0                                     .NET Framework 1.1
-----------------------------------------------------  -----------------------------------------------------
WebBrowserReadyState.Complete                          SHDocVw.tagREADYSTATE.READYSTATE_COMPLETE
System.Windows.Forms.WebBrowser                        AxSHDocVw.AxWebBrowser
WebBrowser.Document.Body.InnerText                     WebBrowser.Document.body.innerText
WebBrowser.Navigate accepts strongly typed parameters  WebBrowser.Navigate accepts objects typically passed by reference, as well as a Flags parameter (which should be 0)


The AltaVistaTranslator Class

The AltaVistaTranslator class uses AltaVista's translation Web page to perform translations:

    public class AltaVistaTranslator: HtmlTranslator
    {
        public AltaVistaTranslator(): base("AltaVista Translator",
            @"http://babelfish.altavista.com/tr", new string[,]
            {
                {"en", "zh-CHS"}, {"en", "zh-CHT"}, {"en", "nl"},
                {"en", "fr"}, {"en", "de"}, {"en", "el"},
                {"en", "it"}, {"en", "ja"}, {"en", "ko"},
                {"en", "pt"}, {"en", "ru"}, {"en", "es"},
                {"zh-CHS", "en"},
                {"zh-CHT", "en"},
                {"nl", "en"}, {"nl", "fr"},
                {"fr", "nl"}, {"fr", "en"}, {"fr", "de"}, {"fr", "el"},
                {"fr", "it"}, {"fr", "pt"}, {"fr", "es"},
                {"de", "en"}, {"de", "fr"},
                {"el", "en"}, {"el", "fr"},
                {"it", "en"}, {"it", "fr"},
                {"ja", "en"},
                {"ko", "en"},
                {"pt", "en"}, {"pt", "fr"},
                {"ru", "en"},
                {"es", "en"}, {"es", "fr"}
            })
        {
        }

        protected virtual string DotNetLanguageCodeToLanguageCode(
            string language)
        {
            // check for a couple of adjustments to the language codes
            if (language == "zh-CHS")
                // Chinese (Simplified)
                return "zh";
            else if (language == "zh-CHT")
                // Chinese (Traditional)
                return "zt";
            else
                return language;
        }

        public override string GetPostData(string inputLanguage,
            string outputLanguage, string text)
        {
            string languagePair =
                DotNetLanguageCodeToLanguageCode(inputLanguage) + "_" +
                DotNetLanguageCodeToLanguageCode(outputLanguage);

            return "doit=done&intl=1&tt=urltext&trtext=" +
                Encode(text) + "&lp=" + languagePair;
        }

        protected override string GetTranslation(string inputLanguage,
            string outputLanguage, string innerText)
        {
            int index = innerText.IndexOf("Babel Fish Translation");
            if (index == -1)
                return String.Empty;

            innerText = innerText.Substring(index + 2);
            index = innerText.IndexOf(":");
            if (index == -1)
                return String.Empty;

            innerText = innerText.Substring(index + 3);
            index = innerText.IndexOf("Translate again");
            if (index == -1)
                return String.Empty;

            return innerText.Substring(0, index - 2).TrimEnd(
                new char[] {' ', (char) 10, (char) 13});
        }
    }


The GetPostData method builds a post data string containing the language pair and the text to translate. The GetTranslation method looks for textual markers that are known to appear immediately before and immediately after the translated text, and returns the text in between. This is the most fragile part of the process: if the textual content of the resulting Web page changes, this code will need to be rewritten.
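As a worked example of the post data format, the following standalone sketch mirrors the logic of GetPostData and DotNetLanguageCodeToLanguageCode from the listing. The AltaVistaPostData class and its method names are hypothetical, and System.Net.WebUtility.UrlEncode is used here as a stand-in for the HttpUtility.UrlEncode call behind Encode, so the escaping of unusual characters may differ in detail.

```csharp
using System;
using System.Net;

public static class AltaVistaPostData
{
    // Maps .NET culture names to the codes used in the form's lp field,
    // following DotNetLanguageCodeToLanguageCode in the listing.
    public static string ToSiteCode(string language)
    {
        if (language == "zh-CHS") return "zh";   // Chinese (Simplified)
        if (language == "zh-CHT") return "zt";   // Chinese (Traditional)
        return language;
    }

    // Builds the form-encoded body: fixed fields, the URL-encoded text,
    // and the language pair in "from_to" form.
    public static string Build(
        string inputLanguage, string outputLanguage, string text)
    {
        string languagePair =
            ToSiteCode(inputLanguage) + "_" + ToSiteCode(outputLanguage);

        // Assumption: WebUtility.UrlEncode replaces HttpUtility.UrlEncode.
        return "doit=done&intl=1&tt=urltext&trtext=" +
            WebUtility.UrlEncode(text) + "&lp=" + languagePair;
    }
}
```

For example, Build("en", "zh-CHS", "Hello world") produces "doit=done&intl=1&tt=urltext&trtext=Hello+world&lp=en_zh". Encoding the text matters because a literal "&" or "=" in the input would otherwise be read as a field separator.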




.NET Internationalization: The Developers Guide to Building Global Windows and Web Applications
ISBN: 0321341384
Year: 2006
Pages: 213
