International Features

XML is widely available, and its international features make it even more compelling for creating world-ready applications. These features include the encoding declaration and character entities. Encoding information can also be carried in HTTP headers.

Encoding Declaration

XML is a text format and specifies character encodings with mechanisms similar to those of HTML. In XML, an optional encoding attribute on the XML declaration defines the character encoding. For example, the following encoding declaration indicates that the encoding is International Organization for Standardization (ISO) 8859-1:

<?xml version='1.0' encoding='ISO-8859-1'?>
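As a rough illustration of what the declaration changes, the sketch below parses the same bytes with and without it, using Python's standard xml.etree.ElementTree module. The sample documents and variable names are assumptions made for this example; they are not taken from MSXML or the .NET classes discussed in this chapter.

```python
import xml.etree.ElementTree as ET

# "å" (U+00E5) is the single byte 0xE5 in ISO-8859-1.
declared = b"<?xml version='1.0' encoding='ISO-8859-1'?><test>\xe5</test>"
root = ET.fromstring(declared)
print(repr(root.text))

# The same bytes without a declaration are read as UTF-8,
# where a lone 0xE5 byte is not a valid sequence.
undeclared = b"<test>\xe5</test>"
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError as err:
    undeclared_ok = False
    print("parse failed:", err)
```

With the declaration present, the parser should decode the byte as "å"; without it, the parser is expected to reject the document.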

When no encoding is declared, a specific algorithm determines the default encoding. If the file starts with a 4-byte Unicode byte-order mark [0xFF 0xFE 0x00 0x00] or [0x00 0x00 0xFE 0xFF], the document is considered to be in UTF-32 encoding. If the file starts with a 2-byte Unicode byte-order mark [0xFF 0xFE] or [0xFE 0xFF], the document is considered to be in UTF-16 encoding. Otherwise, the character encoding of the document defaults to UTF-8. This default can be overridden only by an XML declaration or a Content-Type HTTP header. Table 26-2 shows different encodings of the same XML data, as well as the HTTP headers that are used for each encoding.

Table 26-2 Encodings and HTTP headers used for XML data.

| Character Set or Encoding | HTTP Header | XML Document |
|---|---|---|
| ISO 8859-1 | Content-Type: text/xml; charset=ISO-8859-1 | <test> </test> |
| (not given; UTF-8 is the default) | Content-Type: text/xml | <test> </test> |
| UTF-8 with UTF-8 byte-order mark | Content-Type: text/xml | <test> </test> |
| ISO 8859-1 | Content-Type: text/xml | <?xml version="1.0" encoding="ISO-8859-1"?> <test> </test> |
| UTF-8 (using character entities) | Content-Type: text/xml | |
| UTF-16 (Unicode with byte-order mark) | Content-Type: text/xml | (hex dump follows) |

Hex dump of the UTF-16 document:

ff fe 3c 00 74 00 65 00 73 00 74 00 3e 00 e5 00    ..<.t.e.s.t.>...
3c 00 2f 00 74 00 65 00 73 00 74 00 3e 00 0d 00    <./.t.e.s.t.>...
0a 00

See Appendix J, "Encoding Web Documents," for more XML character set or encoding labels.

Character Entities

Like HTML, XML allows individual characters in a document to be encoded by specifying their exact Unicode character value. These character entities are parsed independently of the document's character set, so their Unicode values can be determined unambiguously. The values can be specified in decimal format (&#229;) or in hexadecimal format (&#xE5;); both of these examples denote the same character, U+00E5.
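To make the independence from the character set concrete, the following sketch parses the decimal and hexadecimal forms in two documents with different encodings, again using Python's xml.etree.ElementTree. The sample documents are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Decimal form in a document with no declared encoding (UTF-8 by default).
decimal_doc = b"<test>&#229;</test>"

# Hexadecimal form in a document declared as ISO-8859-1.
hex_doc = b"<?xml version='1.0' encoding='ISO-8859-1'?><test>&#xE5;</test>"

decimal_text = ET.fromstring(decimal_doc).text
hex_text = ET.fromstring(hex_doc).text

# Both references name code point U+00E5, whatever the document encoding.
print(decimal_text == hex_text, hex(ord(decimal_text)))
```

Because both documents contain only ASCII bytes, they survive transcoding between most character sets without changing meaning.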

HTTP Headers

HTTP headers can contain character-encoding information in the Content-Type header, which MSXML and the .NET System.Xml.XmlTextReader can use. For example, the following HTTP response specifies the character encoding in the charset parameter of its Content-Type header.

HTTP/1.1 200 OK
Content-Length: 15327
Content-Type: text/html; charset=ISO-8859-1
Server: Microsoft-IIS/5.0
Content-Location:
Date: Wed, 08 Dec 1999 00:55:26 GMT
Last-Modified: Mon, 06 Dec 1999 22:56:30 GMT

<test>This is some XML encoded in the ISO 8859-1 character set</test>

Even though the XML in the message has no XML encoding declaration, the HTTP header indicates that the XML content is in the ISO 8859-1 character set.
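A client that receives such a response has to read the charset parameter itself before decoding the body. The sketch below shows one way to do that with Python's standard library. The helper name charset_from_content_type, the UTF-8 fallback, and the sample body are assumptions for illustration; this is not how MSXML or XmlTextReader is implemented.

```python
import xml.etree.ElementTree as ET
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Return the charset parameter of a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


# Body bytes with no XML encoding declaration; 0xE5 is "å" in ISO-8859-1.
body = b"<test>\xe5</test>"

charset = charset_from_content_type("text/xml; charset=ISO-8859-1")
root = ET.fromstring(body.decode(charset))
print(charset, repr(root.text))
```

Decoding the bytes to text before parsing sidesteps the parser's own UTF-8 default, which would otherwise reject the lone 0xE5 byte.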

Microsoft Corporation - Developing International Software
ISBN: 0735615837
Year: 2003