Encodings in Web Pages | Developing International Software

Glossary

MLang: A Component Object Model (COM) component that provides a variety of services. These services include detecting the character encoding used by Web pages and e-mails, converting text from one encoding to another as part of an import or export operation, and displaying characters that are not included within the font specified for parts of a Web page.

All the reasons for creating Unicode Win32 applications, such as the need for universality and standardization, also apply to Web content. Creating and managing multilingual Web content and empowering users to see Web pages in different languages are the driving forces behind using a universal encoding for your Web content. There are several possibilities when it comes to encoding Web content, although some methods have disadvantages that you should be aware of.

Options for Web Encoding

Generally speaking, there are four different ways of setting the character set or the encoding of a Web page. (For more detailed information on how to set these encodings, see "Setting and Manipulating Encodings" later in this chapter.)

Windows code pages or ISO character encodings: With this approach, you can select from the list of supported code pages to create your Web content. The downside of this approach is that you are limited to languages that are included in the selected character set, making true multilingual Web content impossible. This limits you to a single-script Web page.
Number entities: Number entities (numeric character references) can be used to represent a few symbols that fall outside the currently selected code page or encoding. Let's say, for example, you have decided to create a Web page using the previous approach with the Latin ISO charset 8859-1. Now you also want to display some Greek characters in a mathematical equation; Greek characters, however, are not part of the Latin code page. Take, for instance, the Greek character Φ, which has the Unicode code point U+03A6. By writing the decimal value of this code point (934) preceded by &# and followed by a semicolon, you can include the character as follows:

 <span> This is my text with a Greek Phi: &#934; </span>

and the output would be:

This is my text with a Greek Phi: Φ

Unfortunately, this approach is impractical for composing large amounts of text and makes editing your Web content very hard.
UTF-16: Unlike Win32 applications, where UTF-16 is by far the best approach, for Web content UTF-16 can be used safely only on Windows NT networks that have full Unicode support. Therefore, this is not a suggested encoding for Internet sites where the capabilities of the client Web browser as well as the network's Unicode support are not known.
UTF-8: This Unicode encoding is the best and safest approach for multilingual Web pages. It allows you to encode the whole repertoire of Unicode characters. Also, all versions of Internet Explorer 4 and later as well as Netscape 4 and later support this encoding, which is not restricted to network or wire capabilities. The UTF-8 encoding allows you to create multilingual Web content without having to change the encoding based on the target language. (See Figure 3-12.)

Figure 3.12 - Example of a multilingual Web page encoded in UTF-8.

(You can find the HTML file [Multilingual.html], corresponding to Figure 3-12, in the Samples subdirectory on the companion CD.)
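
The trade-offs among the options above can be sketched in a few lines of Python. Python is used here only for illustration and is not part of the original chapter; the sample strings and variable names are assumptions made for this sketch.

```python
import html

# A Latin-1 (ISO 8859-1) page cannot represent Greek text directly.
greek_phi = "Φ"  # GREEK CAPITAL LETTER PHI, U+03A6
try:
    greek_phi.encode("iso-8859-1")
    latin1_can_encode_phi = True
except UnicodeEncodeError:
    latin1_can_encode_phi = False

# Numeric character reference fallback: characters outside the target
# encoding are written as &#<decimal code point>;
as_reference = greek_phi.encode("iso-8859-1", errors="xmlcharrefreplace")

# A parser reverses the reference when it reads the page.
decoded_reference = html.unescape("&#934;")

# UTF-8 carries text from several scripts in a single byte stream.
multilingual = "English, Русский, 日本語, العربية"
utf8_bytes = multilingual.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(latin1_can_encode_phi)            # False
print(as_reference)                     # b'&#934;'
print(decoded_reference == greek_phi)   # True
print(round_trip == multilingual)       # True
```

The single-code-page approach fails on the first character outside its repertoire, the numeric reference works but obscures the text, and UTF-8 round-trips all four scripts unchanged.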

Setting and Manipulating Encodings

Since Web content is currently based on Windows or other encoding schemes, you'll need to know how to set and manipulate encodings. The following describes how to do this for HTML pages, Active Server Pages (ASP), and XML pages.

HTML pages: Internet Explorer uses the character set specified for a document to determine how to translate the bytes in the document into characters on the screen or on paper. By default, Internet Explorer uses the character set specified in the HTTP content type returned by the server to determine this translation. If this parameter is not given, Internet Explorer uses the character set specified by the meta element in the document, taking into account the user's preferences if no meta element is specified. To apply a character set to an entire document, you must insert the meta element before the body element. For clarity, it should appear as the first element after the head, so that all browsers can translate the meta element before the document is parsed. The meta element applies to the document containing it. This means, for example, that a compound document (a document consisting of two or more documents in a set of frames) can use different character sets in different frames. Here is how it works:

 <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=<value>">

You substitute <value> with any supported character-set friendly name (for example, UTF-8) or any code-page name (for example, windows-1251). (For more information, see Appendix J, "Encoding Web Documents.")
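
The lookup order described above (HTTP content type first, then the meta element, then a fallback) can be sketched as follows. This is a simplified illustration in Python, not Internet Explorer's actual algorithm: the helper name `pick_charset`, the regular expression, the 1024-byte scan window, and the `windows-1252` fallback are all assumptions made for the example.

```python
import re
from email.message import Message

# Hypothetical pattern: finds a charset value inside a meta element.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.IGNORECASE
)

def pick_charset(http_content_type, body, fallback="windows-1252"):
    # 1. Charset parameter of the HTTP Content-Type header, if present.
    if http_content_type:
        msg = Message()
        msg["Content-Type"] = http_content_type
        charset = msg.get_content_charset()
        if charset:
            return charset
    # 2. Charset named by a meta element near the start of the document.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. Otherwise fall back (a stand-in for the user's preference).
    return fallback

page = (b'<html><head><META HTTP-EQUIV="Content-Type" '
        b'CONTENT="text/html; charset=windows-1251"></head></html>')

print(pick_charset("text/html; charset=UTF-8", page))  # header wins
print(pick_charset("text/html", page))                 # meta element is used
print(pick_charset(None, b"<html></html>"))            # fallback
```

Because the meta element is itself read from the byte stream before the encoding is known, placing it early in the head keeps it inside the portion of the document that a scanner like this one examines.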

ASP pages: Internally, ASP and the language engines it calls (such as Microsoft Visual Basic Scripting Edition (VBScript), JScript, and so forth) all communicate in Unicode strings. However, Web pages currently consist of content that can be in Windows or other character-encoding schemes besides Unicode. Therefore, when form or query-string values come in from the browser in an HTTP request, they must be converted from the character set used by the browser into Unicode for processing by the ASP script. Similarly, when output is sent back to the browser, any strings returned by scripts must be converted from Unicode back to the code page used by the client. In ASP these internal conversions are done using the default code page of the Web server. This works well if the users and the server are all using the same language or script (more precisely, if they use the same code page). However, if you have a Japanese client connecting to an English server, the code page translations just mentioned won't work because ASP will try to treat Japanese characters as English ones.
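
The mismatch just described is easy to reproduce outside ASP. The following Python sketch assumes a client that sends form data in Shift-JIS and a server whose default code page is 1252; both choices, and the variable names, are illustrative assumptions rather than details from the chapter.

```python
# The form value as the user typed it, and the bytes a Japanese
# browser might send for it (Shift-JIS assumed).
typed = "日本語"
sent_bytes = typed.encode("shift_jis")

# The server converts inbound bytes to Unicode using its own default
# code page (1252 assumed), so each byte is read as a Western character.
misread = sent_bytes.decode("cp1252", errors="replace")

# Converting with the code page the client actually used recovers the text.
correct = sent_bytes.decode("shift_jis")

print(misread == typed)   # False
print(correct == typed)   # True
```

No error is raised in the failing case: the bytes are valid in both code pages, so the script simply receives the wrong characters.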

The solution is to set the code page that ASP uses to perform these inbound and outbound string translations. Two mechanisms exist to set the code page:

Per page, at design time: <%@CODEPAGE=<charset> %>. For example:

 <% @ LANGUAGE=VBScript CODEPAGE=1252 %>

In script code, at run time: The Session.CodePage property sets the code page to use for the current session's string translations. In Microsoft Internet Information Services (IIS) 5.1 and later, the Response.CodePage property defines the code page of the response sent to the client. Once explicitly set, the Response code page overrides the Session code page, which in turn overrides the @CODEPAGE setting. For example:

 <% @ LANGUAGE=VBScript CODEPAGE=65001 %>
 <%
 Response.Write (Session.CodePage)
 Session.CodePage = 1252
 %>

How are these code-page settings applied? First, any static content (HTML) in the .asp file is not affected at all; it is returned exactly as written. Any static strings in the script code (and in fact the script code itself) will be converted based on the CODEPAGE setting in the .asp file. Think of CODEPAGE as the way an author (or better yet, the authoring tool, which should be able to place this in the .asp file automatically) tells ASP the code page in which the .asp file was written.

Any dynamic content, such as Response.Write(x) calls where x is a variable, is converted using the value of Response.CodePage, which defaults to the CODEPAGE setting but can be overridden. You'll need this override, since the code page used to write the script might differ from the code page you use to send output to a particular client. For example, the author may have written the ASP page in a tool that generates text encoded in JIS, but the end user's browser might use UTF-8. With this code-page control feature, ASP now enables correct handling of code-page conversion.
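
The override order and the two conversions can be modeled in a short Python sketch. It is only an analogy for the behavior described above, not ASP code: the function names, the `CODECS` table, and the use of code page 932 to stand in for a Japanese authoring tool are assumptions made for this example.

```python
# Hypothetical mapping from Windows code-page numbers to Python codec names.
CODECS = {932: "cp932", 1252: "cp1252", 65001: "utf-8"}

def effective_code_page(page_directive, session=None, response=None):
    """Response overrides Session, which overrides @CODEPAGE."""
    if response is not None:
        return response
    if session is not None:
        return session
    return page_directive

def render(script_bytes, page_directive, session=None, response=None):
    # Strings in the script are read using the @CODEPAGE value ...
    text = script_bytes.decode(CODECS[page_directive])
    # ... and dynamic output is written using the effective code page.
    out_cp = effective_code_page(page_directive, session, response)
    return text.encode(CODECS[out_cp])

# File saved in a Japanese code page, output sent as UTF-8.
authored = "こんにちは".encode("cp932")
sent = render(authored, page_directive=932, response=65001)

print(effective_code_page(932))                    # 932
print(effective_code_page(932, session=1252))      # 1252
print(effective_code_page(932, 1252, 65001))       # 65001
print(sent.decode("utf-8") == "こんにちは")         # True
```

Keeping the input and output code pages as separate settings is what lets the same script serve clients that expect different encodings.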

In server-side script, you can achieve the same browser behavior that the meta tag (described earlier) produces by setting the Response.Charset property. Setting this property tells the browser how to interpret the encoding of the incoming stream. Generally this value should match the session's code page.

For example, for an ASP page that did not include the Response.Charset property, the content-type header would be

 content-type:text/html

If the same .asp file included

 <% Response.Charset= "ISO-LATIN-7" %>

the content-type header would be

 content-type:text/html; charset=ISO-LATIN-7
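
Since the charset label and the code page are set independently, it can be useful to check that they agree. The following Python sketch shows one way to do that using Python's codec registry as a stand-in; the function names and the "cp" plus number naming rule are assumptions of this example, and ASP itself performs no such check as far as this chapter states.

```python
import codecs

def codec_for_code_page(code_page):
    # Assumption for this sketch: Python names Windows code pages
    # "cp<number>", with 65001 (UTF-8) handled explicitly.
    return "utf-8" if code_page == 65001 else f"cp{code_page}"

def charset_matches(charset, code_page):
    """True if the charset label and the code page name the same codec."""
    try:
        label = codecs.lookup(charset).name
    except LookupError:
        return False  # label unknown to Python's codec registry
    return label == codecs.lookup(codec_for_code_page(code_page)).name

def content_type(charset=None):
    # Builds the header in the same form as the examples above.
    base = "content-type:text/html"
    return f"{base}; charset={charset}" if charset else base

print(content_type())                           # content-type:text/html
print(content_type("UTF-8"))                    # content-type:text/html; charset=UTF-8
print(charset_matches("UTF-8", 65001))          # True
print(charset_matches("windows-1252", 1252))    # True
print(charset_matches("UTF-8", 1252))           # False
```

A mismatch such as the last case means the bytes are written in one encoding while the browser is told to read them in another.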

XML pages: All XML processors are required to understand two transformations of the Unicode character encoding: UTF-8 (the default encoding) and UTF-16. The Microsoft XML Parser (MSXML) supports more encodings, but all text in XML documents is treated internally as the Unicode UTF-16 character encoding.

The encoding declaration identifies which encoding is used to represent the characters in the document. Although XML parsers can determine automatically if a document uses the UTF-8 or UTF-16 Unicode encoding, this declaration should be used in documents that use other encodings.

For example, the following is the encoding declaration for a document that uses the ISO 8859-1 encoding (Latin 1):

 <?xml version="1.0" encoding="ISO-8859-1"?>
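
The effect of the encoding declaration can be observed with any conforming parser. The sketch below uses the parser in Python's standard library rather than MSXML, so it illustrates the XML rules described above and not MSXML's specific behavior; the sample text and variable names are assumptions.

```python
import xml.etree.ElementTree as ET

text = "café"

# Declared ISO 8859-1: the declaration tells the parser to read
# byte 0xE9 as "é".
latin1_doc = ('<?xml version="1.0" encoding="ISO-8859-1"?><p>'
              + text + "</p>").encode("iso-8859-1")

# No declaration: the document is read as UTF-8, the default.
utf8_doc = ("<p>" + text + "</p>").encode("utf-8")

# UTF-16 with a byte-order mark, which the parser recognizes.
utf16_doc = ('<?xml version="1.0" encoding="UTF-16"?><p>'
             + text + "</p>").encode("utf-16")

parsed = [ET.fromstring(doc).text for doc in (latin1_doc, utf8_doc, utf16_doc)]

# The same Latin-1 bytes without a declaration are not valid UTF-8,
# so the parser rejects the document.
try:
    ET.fromstring(("<p>" + text + "</p>").encode("iso-8859-1"))
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(parsed == [text, text, text])  # True
print(undeclared_ok)                 # False
```

All three declared or default-encoded documents yield the same Unicode string, which mirrors the point that text is handled internally as Unicode regardless of the encoding on disk.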

Just as in Win32 programming, in which there are instances where you would like to convert from one encoding to another, you can use some of the MLang functions to perform this operation in your Web content. (For more information about MLang, see Chapter 17, "MLang.")

User Override

By default, Internet Explorer uses the encoding in which a particular Web page was created or the encoding specified by the meta tag. However, a user can override the default encoding of a Web page for his or her own viewing purposes. This feature is especially helpful when the Web author does not specify an encoding or specifies the wrong one. A user can simply right-click in a Web page and select a given encoding from the list. (See Figure 3-13 below.)

Figure 3.13 - User selection of encoding for individual viewing purposes.

Internet Explorer Language Support

By default, your Internet Explorer browser has the capability of supporting all languages that your Windows version supports. However, because Internet Explorer 5 and later are based on a single, worldwide binary, support for all scripts and languages is available on all versions of Internet Explorer 5 and later. This is true even when the browser is run on a Windows version that does not support all the character sets, such as when displaying Russian on a Korean version of Windows Me. A user can decide to add support for additional languages during the custom setup of Internet Explorer. Also, with its Language Encoding Auto-Select feature, Internet Explorer can usually determine the appropriate language encoding used to create a Web page.

Suppose, however, that your version of Windows does not support the detected encoding. For example, if you are running Russian Windows 98 and viewing an Arabic Web page using code page 1256, you would be prompted to download language-support components for Arabic script. The components package includes code-page information defined in files labeled c_xxx.nls, where xxx stands for the code-page number. Also included are fonts for the given script.

While Unicode offers significant benefits in terms of addressing the difficulties inherent in multilingual content on the Web, it also has an important place within the .NET Framework. Strategies for working with and manipulating various encodings in the .NET Framework follow, along with an assortment of code samples.