All the reasons for creating Unicode Win32 applications, such as the need for universality and standardization, also apply to Web content. Creating and managing multilingual Web content and enabling users to view Web pages in different languages are the driving forces behind using a universal encoding for your Web content. There are several ways to encode Web content, although some methods have disadvantages that you should be aware of.
Generally speaking, there are four different ways of setting the character set or the encoding of a Web page. (For more detailed information on how to set these encodings, see "Setting and Manipulating Encodings" later in this chapter.)
<span> This is my text with a Greek Phi: &#934; </span>
and the output would be:
This is my text with a Greek Phi: Φ
Unfortunately, this approach makes it impossible to compose large amounts of text and makes editing your Web content very hard.
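The sample above appears to rely on a numeric character reference (&#934; for the Greek capital Phi); that reading is an assumption, since the markup may have been decoded when the page was captured. As a rough Python sketch of why the approach does not scale, the hypothetical helper below replaces every non-ASCII character with its decimal reference, and `html.unescape` shows what a browser would display.

```python
import html


def to_ascii_with_references(text: str) -> str:
    """Replace each non-ASCII character with a decimal reference such as &#934;."""
    # "xmlcharrefreplace" is a standard Python codec error handler.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


source = "This is my text with a Greek Phi: Φ"
escaped = to_ascii_with_references(source)

print(escaped)                 # the pure-ASCII markup an author would have to maintain
print(html.unescape(escaped))  # the text a browser would render
```

One short Greek word already expands to several opaque references, which is why editing large amounts of text this way quickly becomes impractical.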
Figure 3-12. Example of a multilingual Web page encoded in UTF-8.
(You can find the HTML file [Multilingual.html], corresponding to Figure 3-12, in the Samples subdirectory on the companion CD.)
Since Web content is currently based on Windows or other encoding schemes, you'll need to know how to set and manipulate encodings. The following describes how to do this for HTML pages, Active Server Pages (ASP), and XML pages.
HTML pages: Internet Explorer uses the character set specified for a document to determine how to translate the bytes in the document into characters on the screen or on paper. By default, Internet Explorer uses the character set specified in the HTTP content type returned by the server to determine this translation. If this parameter is not given, Internet Explorer uses the character set specified by the meta element in the document, and falls back on the user's preferences if no meta element is specified. To apply a character set to an entire document, you must insert the meta element before the body element. For clarity, it should appear as the first element after the opening head tag, so that all browsers can read the meta element before the rest of the document is parsed. The meta element applies only to the document containing it. This means, for example, that a compound document (a document consisting of two or more documents in a set of frames) can use different character sets in different frames. Here is how it works:
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=<value>">
You replace <value> with the friendly name of any supported character set (for example, UTF-8) or with any code-page name (for example, windows-1251). (For more information, see Appendix J, "Encoding Web Documents.")
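The precedence described above (HTTP content type first, then the meta element, then the user's preference) can be sketched in Python. This is only an illustration of the lookup order, not Internet Explorer's actual algorithm; the function name `resolve_charset`, the regular expressions, the 4096-byte scan window, and the `user_default` value are all assumptions made for the example.

```python
import re

# Matches charset=<value> in an HTTP Content-Type header value.
_HEADER_CHARSET = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)
# Matches charset=<value> inside a <meta ...> element in the raw page bytes.
_META_CHARSET = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)


def resolve_charset(http_content_type, html_bytes, user_default="windows-1252"):
    """Return the charset name chosen by the header > meta > user-default order."""
    if http_content_type:
        match = _HEADER_CHARSET.search(http_content_type)
        if match:
            return match.group(1)
    # Only scan the start of the document, where the meta element should appear.
    match = _META_CHARSET.search(html_bytes[:4096])
    if match:
        return match.group(1).decode("ascii")
    return user_default


page = (b'<html><head><META HTTP-EQUIV="Content-Type" '
        b'CONTENT="text/html; charset=windows-1251"></head><body></body></html>')

print(resolve_charset("text/html; charset=UTF-8", page))  # header wins
print(resolve_charset("text/html", page))                 # falls back to the meta element
print(resolve_charset(None, b"<html><body></body></html>"))  # falls back to the user default
```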
ASP pages: Internally, ASP and the language engines it calls, such as Microsoft Visual Basic Scripting Edition (VBScript) and JScript, all communicate in Unicode strings. However, Web pages currently consist of content that can be in Windows or other character-encoding schemes besides Unicode. Therefore, when form or query-string values come in from the browser in an HTTP request, they must be converted from the character set used by the browser into Unicode for processing by the ASP script. Similarly, when output is sent back to the browser, any strings returned by scripts must be converted from Unicode back to the code page used by the client. In ASP these internal conversions are done using the default code page of the Web server. This works well if the users and the server all use the same language or script (more precisely, the same code page). However, if a Japanese client connects to an English server, the code-page translations just mentioned won't work, because ASP will try to treat Japanese characters as English ones.
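The failure mode described above can be reproduced outside ASP. The following Python sketch assumes, purely for illustration, that the Japanese client sends Shift_JIS bytes and that the English server's default code page is Windows-1252; it is not ASP code, only a demonstration of what happens when bytes are decoded with the wrong code page.

```python
# Text as typed by a Japanese user (illustrative sample).
original = "日本語"

# Bytes as they might arrive in the HTTP request (assumed Shift_JIS client).
wire_bytes = original.encode("shift_jis")

# The server decodes with its own default code page (assumed Windows-1252).
# errors="replace" keeps the demo running if a byte is undefined in cp1252.
misread = wire_bytes.decode("cp1252", errors="replace")

# Decoding with the code page the client actually used recovers the text.
recovered = wire_bytes.decode("shift_jis")

print(repr(misread))
print(repr(recovered))
```

The misread string contains Western European punctuation and accented letters instead of the Japanese text, which is the corruption that setting the correct code page avoids.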
The solution is to set the code page that ASP uses to perform these inbound and outbound string translations. Two mechanisms exist to set the code page:
<% @ LANGUAGE=VBScript CODEPAGE=1252 %>
<% @ LANGUAGE=VBScript CODEPAGE=65001 %>
<%
Response.Write (Session.CodePage)
Session.CodePage = 1252
%>
How are these code-page settings applied? First, any static content (HTML) in the .asp file is not affected at all; it is returned exactly as written. Any static strings in the script code (and in fact the script code itself) will be converted based on the CODEPAGE setting in the .asp file. Think of CODEPAGE as the way an author (or better yet, the authoring tool, which should be able to place this in the .asp file automatically) tells ASP the code page in which the .asp file was written.
Any dynamic content, such as Response.Write(x) calls where x is a variable, is converted using the value of Response.CodePage, which defaults to the CODEPAGE setting but can be overridden. You'll need this override, since the code page used to write the script might differ from the code page you use to send output to a particular client. For example, the author may have written the ASP page in a tool that generates text encoded in JIS, but the end user's browser might use UTF-8. With this code-page control feature, ASP now enables correct handling of code-page conversion.
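The two-step conversion described here (author's code page to Unicode, then Unicode to the client's code page) can be sketched in Python. The helper name `transcode` is hypothetical, and using Python's `iso2022_jp` codec to stand in for "JIS" is an assumption; ASP performs the equivalent steps internally based on CODEPAGE and Response.CodePage.

```python
def transcode(data: bytes, source_codepage: str, target_codepage: str) -> bytes:
    """Decode bytes from the authoring code page, then encode for the client."""
    unicode_text = data.decode(source_codepage)   # inbound: code page -> Unicode
    return unicode_text.encode(target_codepage)   # outbound: Unicode -> code page


# A string as the authoring tool might have saved it (assumed JIS / ISO-2022-JP).
script_text = "こんにちは"
authored_bytes = script_text.encode("iso2022_jp")

# The same string as it would be sent to a browser that expects UTF-8.
response_bytes = transcode(authored_bytes, "iso2022_jp", "utf-8")

print(authored_bytes != response_bytes)  # the byte sequences differ
print(response_bytes.decode("utf-8"))    # but the text is preserved
```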
In server-side script, you can achieve the same effect as the meta tag (described earlier) by setting the Response.Charset property. Setting this property tells the browser how to interpret the encoding of the incoming stream. In general, this value should match the session's code page.
For example, for an ASP page that did not include the Response.Charset property, the content-type header would be
content-type:text/html
If the same .asp file included
<% Response.Charset= "ISO-LATIN-7" %>
the content-type header would be
content-type:text/html; charset=ISO-LATIN-7
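The effect on the header can be mimicked with a small Python helper. The function `content_type_header` is hypothetical and simply reproduces the two header forms shown above; it does not call or emulate any ASP API.

```python
def content_type_header(charset=None, media_type="text/html"):
    """Build the content-type header, appending charset only when one is set."""
    header = f"content-type:{media_type}"
    if charset:
        header += f"; charset={charset}"
    return header


print(content_type_header())               # no charset set
print(content_type_header("ISO-LATIN-7"))  # charset appended after the media type
```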
XML pages: All XML processors are required to understand two transformations of the Unicode character encoding: UTF-8 (the default encoding) and UTF-16. The Microsoft XML Parser (MSXML) supports more encodings, but all text in XML documents is treated internally as the Unicode UTF-16 character encoding.
The encoding declaration identifies which encoding is used to represent the characters in the document. Although XML parsers can determine automatically whether a document uses the UTF-8 or UTF-16 Unicode encoding, the declaration should be included in documents that use any other encoding.
For example, the following is the encoding declaration for a document that uses the ISO 8859-1 encoding (Latin 1):
<?xml version="1.0" encoding="ISO-8859-1"?>
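To see the declaration at work, the following Python sketch parses the same content stored two ways: as ISO 8859-1 bytes with the declaration above, and as UTF-8 bytes with no declaration. It uses Python's standard `xml.etree.ElementTree` parser rather than MSXML, and the `<word>` element is made up for the example.

```python
import xml.etree.ElementTree as ET

text = "café"

# ISO 8859-1 bytes: the declaration tells the parser how to decode them.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><word>' + text + "</word>"
).encode("iso-8859-1")

# UTF-8 bytes: no declaration is needed because UTF-8 is the default.
utf8_doc = ("<word>" + text + "</word>").encode("utf-8")

from_latin1 = ET.fromstring(latin1_doc).text
from_utf8 = ET.fromstring(utf8_doc).text

print(latin1_doc != utf8_doc)    # the stored bytes differ
print(from_latin1 == from_utf8)  # the parsed Unicode text is the same
```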
Just as in Win32 programming, in which there are instances where you would like to convert from one encoding to another, you can use some of the MLang functions to perform this operation in your Web content. (For more information about MLang, see Chapter 17, "MLang.")
By default, Internet Explorer uses the encoding in which a particular Web page was created or the encoding specified by the meta tag. However, a user can override the default encoding of a Web page for his or her own viewing purposes. This feature is especially helpful when the Web author does not specify an encoding or specifies the wrong one. A user can simply right-click in a Web page and select an encoding from the list. (See Figure 3-13 below.)
Figure 3-13. User selection of encoding for individual viewing purposes.
By default, Internet Explorer can support every language that your version of Windows supports. However, because Internet Explorer 5 and later are built on a single, worldwide binary, support for all scripts and languages is available in every version of Internet Explorer 5 and later. This is true even when the browser runs on a Windows version that does not support all the character sets, such as when displaying Russian on a Korean version of Windows Me. A user can choose to add support for additional languages during the custom setup of Internet Explorer. Also, with its Language Encoding Auto-Select feature, Internet Explorer can usually determine the language encoding used to create a Web page.
Suppose, however, that your version of Windows does not support the detected encoding. For example, if you are running Russian Windows 98 and viewing an Arabic Web page using code page 1256, you would be prompted to download language-support components for Arabic script. The components package includes code-page information defined in files labeled c_xxx.nls, where xxx stands for the code-page number. Also included are fonts for the given script.
While Unicode offers significant benefits in terms of addressing the difficulties inherent in multilingual content on the Web, it also has an important place within the .NET Framework. Strategies for working with and manipulating various encodings in the .NET Framework follow, along with an assortment of code samples.