International Best Practices | Developing International Software

The following are best practices for working with XML in a global environment. Because character-encoding issues are common when you are first learning XML and working with MSXML, you will see pointers and examples of what to watch for as you use MSXML.

Use UTF-8

The default encoding for all XML documents (or messages) is UTF-8, which is a variable-length encoding of Unicode (ISO 10646). If possible, stick with UTF-8, especially if you want to maximize cross-platform reach.

Make Sure Your XML Data Is Locale- and Culture-Neutral

Remember that using XML entails making sure your data is cleanly separated from your user interfaces (UIs). Therefore, follow the W3C XML Schema standard and pay attention to how you store your data in XML. For example, if you store your dates in ISO 8601 format, the date May 17, 2001 would be represented as 2001-05-17. The application that uses the XML must decide how to present the data to the user based on the user's regional settings. (For more information on adjusting data according to the user's regional or language settings, see Chapter 4, "Locale and Cultural Awareness.")
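The separation between stored and displayed dates can be sketched in a few lines of Python. This is only an illustration: the DISPLAY_PATTERNS table and the present_date helper are hypothetical, and a real application would take the display format from the user's regional settings rather than from a hard-coded table.

```python
from datetime import date

# Hypothetical display patterns keyed by a locale tag; a real application
# would obtain these from the user's regional settings instead.
DISPLAY_PATTERNS = {
    "en-US": "%m/%d/%Y",
    "de-DE": "%d.%m.%Y",
    "ja-JP": "%Y/%m/%d",
}

def present_date(iso_text: str, locale_tag: str) -> str:
    """Parse a locale-neutral ISO 8601 date and format it for display."""
    stored = date.fromisoformat(iso_text)  # the value as stored in the XML
    return stored.strftime(DISPLAY_PATTERNS[locale_tag])

print(present_date("2001-05-17", "en-US"))  # 05/17/2001
print(present_date("2001-05-17", "de-DE"))  # 17.05.2001
```

The XML itself only ever contains 2001-05-17; the choice of presentation is made at the last moment, in the application.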

Take Advantage of MSXML Tips and Tricks

If you are transmitting data using XML, it is recommended that you use the default UTF-8 encoding. However, the following sample shows how to transmit data reliably, between client and server, if you have to use a different encoding. Suppose you have the following XML file on your server:

<?xml version='1.0' encoding='windows-1252'?>
<test>Copyright © 2001</test>

Note

This code represents the simplest way to encode the copyright symbol because the symbol has a single-byte representation in the Windows 1252 character set.

If the XML file were encoded in UTF-8, the raw file would look like this:

<?xml version='1.0' encoding='utf-8'?>
<test>Copyright Â© 2001</test>

Note

Notepad in Windows 2000 or Windows XP supports UTF-8 and will decode the "Â©" sequence, so you won't see the raw format. If you type the file from a Microsoft MS-DOS command window, you will see this raw UTF-8 encoding.
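The difference between the two files can be checked at the byte level. The following Python sketch is not part of the MSXML sample; it encodes the same text both ways and then views the UTF-8 bytes as if they were Windows 1252, which is what a tool that does not understand UTF-8 would show.

```python
text = "Copyright © 2001"

cp1252_bytes = text.encode("cp1252")  # © is the single byte 0xA9
utf8_bytes = text.encode("utf-8")     # © becomes the two bytes 0xC2 0xA9

# Reading the UTF-8 bytes as Windows 1252 reproduces the raw view.
raw_view = utf8_bytes.decode("cp1252")

print(cp1252_bytes)  # b'Copyright \xa9 2001'
print(utf8_bytes)    # b'Copyright \xc2\xa9 2001'
print(raw_view)      # Copyright Â© 2001
```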

In the UTF-8 version of the file, the "Â©" sequence is the two-byte UTF-8 encoding (0xC2 0xA9) of the Unicode character 0xA9. To send the XML document from an ASP page, you would do the following:

<%@LANGUAGE=JScript%>
<%
   Response.ContentType = "text/xml";
   Response.Clear();
   var dom = Server.CreateObject("MSXML2.DOMDocument");
   dom.load(Server.MapPath("test.xml"));
   dom.save(Response);
%>

Note

This code uses the MSXML 3 ProgID "MSXML2.DOMDocument". The MSXML 2 version is called "Microsoft.XMLDOM".

To receive the XML document on the client using the XMLHTTP component in Microsoft JScript, you would do the following:

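// Fetch the document synchronously and let MSXML parse the response.
var http = new ActiveXObject("MSXML2.XMLHTTP");
http.open("GET", "http://localhost/test.asp", false);
http.send();
// responseXML is the parsed DOM; saving it preserves the document's encoding.
var doc = http.responseXML;
doc.save("test.xml");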

To execute this script, run the following from the command line:

cscript /nologo test.js

This will create a file in your current directory called "test.xml," which is correctly encoded in the Windows 1252 character set. Be careful when using WScript.echo with XML content. For example, if you add the following to the JScript code:

WScript.echo (doc.xml);

You will see the following incorrect output:

<?xml version="1.0"?>
<test>c</test>

This is because WScript.echo is losing information in the translation of Unicode (doc.xml) to the console character set. Also, notice that the xml property on the DOM document object removed the encoding attribute. This is by design. The xml property is returning a correctly encoded Unicode version of your XML document that can then be successfully parsed by LoadXML.

The two main methods of loading XML documents using the MSXML DOM are the LoadXML method and the Load method. The LoadXML method always takes a Unicode BSTR, which is encoded in UCS-2 or UTF-16 only. If you pass anything other than a valid Unicode BSTR to LoadXML, it will fail to load. The Load method takes a VARIANT and supports the values shown in Table 26-3.

Table 26-3 Values supported by the Load method.

Value	Description
URL	If the VARIANT is a BSTR, it is interpreted as a URI, and the XML content is fetched from that URI.
VT_ARRAY | VT_UI1	The VARIANT can also be a SAFEARRAY containing the raw encoded bytes.
IUnknown	If the VARIANT is an IUnknown interface, the DOM document calls QueryInterface for IStream, IPersistStream, and IPersistStreamInit.

The Load method implements the following algorithm for determining the character encoding or character set of the XML document:

If the Content-Type HTTP header defines a character set, this character set overrides anything in the XML document itself. This obviously doesn't apply to SAFEARRAY and IStream mechanisms because there is no HTTP header.
If there is a 2-byte Unicode byte-order mark, the Load method assumes the encoding is UTF-16. It can handle both big-endian and little-endian.
If there is a 4-byte Unicode byte-order mark (0xFF 0xFE 0xFF 0xFE), the Load method assumes the encoding is UTF-32. It can handle both big-endian and little-endian.
Otherwise, the Load method assumes the encoding is UTF-8, unless it finds an XML declaration with an encoding attribute that specifies some other character set (such as ISO 8859-1, Windows 1252, Shift-JIS, and so on).
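The order of these checks can be sketched as a small Python function. This is a hedged illustration, not MSXML's implementation: guess_encoding is a hypothetical helper, and it tests for the byte-order marks defined by the Unicode standard (taken from Python's codecs module), which is an assumption about what a parser would look for rather than a statement about MSXML's internals. The 4-byte marks are tested before the 2-byte marks because the standard little-endian UTF-32 mark begins with the same two bytes as the little-endian UTF-16 mark.

```python
import codecs
import re
from typing import Optional

# Standard Unicode byte-order marks (assumption: not necessarily the exact
# bytes MSXML tests for). 4-byte marks are listed first on purpose.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches the encoding attribute of an XML declaration at the start of the data.
DECLARED = re.compile(rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z0-9._-]+)["']""")

def guess_encoding(body: bytes, http_charset: Optional[str] = None) -> str:
    """Return the encoding name the steps above would pick (hypothetical helper)."""
    if http_charset:                  # the HTTP header overrides everything
        return http_charset.lower()
    for bom, name in BOMS:            # then the byte-order marks
        if body.startswith(bom):
            return name
    match = DECLARED.match(body)      # then the XML declaration
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"                    # otherwise the default

sample = b"<?xml version='1.0' encoding='windows-1252'?><test/>"
print(guess_encoding(sample))                # windows-1252
print(guess_encoding(sample, "ISO-8859-1"))  # iso-8859-1
print(guess_encoding(b"<test/>"))            # utf-8
```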

There are two errors you will see returned from the MSXML DOM that indicate encoding problems. The first usually indicates that a character in the document does not match the encoding of the XML document:

An invalid character was found in text content.

The ParseError object will tell you exactly where on a particular line this rogue character occurs so that you can fix the problem. The second error indicates that you started with a Unicode byte-order mark (or that you called the LoadXML method) and that an encoding attribute specified something other than a 2-byte encoding (such as UTF-8 or Windows 1250):

Switch from current encoding to specified encoding not supported.

Another possibility is that you could have called the Load method and started off with a single-byte encoding (no byte-order mark), but the method might have found an encoding attribute that specified a 2- or 4-byte encoding (such as UTF-16 or UTF-32).
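The first kind of failure, a byte that is not valid in the declared encoding, is easy to reproduce with any conforming parser. The sketch below uses Python's standard xml.etree.ElementTree module rather than MSXML, so the error text and object model differ, but the cause is the same: the same bytes load when the declaration matches them and fail when it does not.

```python
import xml.etree.ElementTree as ET

# The byte 0xA9 is "©" in Windows 1252 but is not valid UTF-8 on its own.
body = b"<test>Copyright \xa9 2001</test>"

as_cp1252 = b"<?xml version='1.0' encoding='windows-1252'?>" + body
as_utf8 = b"<?xml version='1.0' encoding='utf-8'?>" + body

# The declaration matches the bytes, so the document loads.
print(ET.fromstring(as_cp1252).text)

# The declaration contradicts the bytes, so the parser rejects the document.
try:
    ET.fromstring(as_utf8)
except ET.ParseError as err:
    line, column = err.position  # where the parser stopped
    print("parse failed at line", line, "column", column)
```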

Note

The LoadXML method is more forgiving in MSXML 3. It allows any XML declaration to be passed in, but still assumes the real encoding is Unicode (since a BSTR is always Unicode by definition).

Finally, the IXMLHTTPRequest interface provides several ways of accessing downloaded data, shown in Table 26-4.

Table 26-4 Properties of the IXMLHTTPRequest interface, used to access downloaded data.

Properties	Description
ResponseXML	Represents the response entity body as parsed by the MSXML DOM parser, using the same rules as the Load method.
ResponseText	Represents the response entity body as a string.
ResponseBody	Represents the response entity body as an array of unsigned bytes.
ResponseStream	Represents the response entity body as an IStream interface.

Note

A common problem people experience involves the ResponseText property. This property assumes the message is in UTF-8 encoding. Consequently, if you send a message that is in another encoding, the text returned by the property will not be decoded correctly.
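The effect of that assumption can be illustrated in Python. This is an analogy, not MSXML code: the bytes stand in for the response entity body, and the two decode calls contrast a fixed UTF-8 assumption with decoding the raw bytes using the encoding the sender actually used.

```python
# What the server actually sent: Windows 1252 bytes.
body = "Copyright © 2001".encode("cp1252")

# Decoding under a fixed UTF-8 assumption cannot recover the symbol;
# the invalid byte is replaced with U+FFFD.
assumed_utf8 = body.decode("utf-8", errors="replace")

# Decoding the raw bytes with the real encoding works.
actual = body.decode("cp1252")

print(assumed_utf8 == "Copyright \ufffd 2001")  # True
print(actual)                                   # Copyright © 2001
```

When the encoding is not UTF-8, working from the raw bytes or from the parsed document avoids the problem.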

Take Advantage of System.Xml Tips and Tricks

While Chapter 19, "The Microsoft .NET Framework," provides an overview of the .NET Framework, this section offers more detail on the System.Xml classes within the framework. It looks specifically at the same scenarios just discussed for MSXML.

The System.Xml classes are built upon lower-level classes that handle networking (such as System.Net) and encoding/decoding text (such as System.Text). The main difference between System.Xml and MSXML is that, instead of using the XMLHTTP control for HTTP client code to send and receive XML data over the Internet, you use the classes defined in the System.Net namespace, which are far more powerful and flexible. Similar to MSXML, the System.Xml classes allow you to load XML documents from in-memory strings encoded in UTF-16, or from encoded streams where the decoding rules are the same. In the .NET Framework you can also load XML documents from a text reader, which is a stream that has already been decoded into Unicode. The following is the managed Microsoft ASP.NET (once known as "ASPX") equivalent of the earlier ASP server-side code fragment used in the MSXML scenario:

<%@LANGUAGE=C#%>
<%@Import Namespace="System.Xml"%>
<%
   Response.ContentType = "text/xml";
   Response.Clear();
   XmlDocument doc = new XmlDocument();
   doc.Load(Server.MapPath("test.xml"));
   doc.Save(Response.Output);
%>

This will send a UTF-8-encoded response to the client. The following is the simplest managed client-side code for downloading and decoding this XML data correctly:

using System.Xml;
using System.Net;

class Test {
   static void Main() {
      XmlDocument doc = new XmlDocument();
      doc.Load("http://localhost/temp/test.aspx");
      doc.Save("saved.xml");
   }
}

If you need more control over the HTTP client-side code, use the System.Net.WebRequest class. For example, the following code sets the Credentials property of the request so that it can download data from a Web site that requires authenticated access.

using System.Xml;
using System.Net;

class Test {
   static void Main() {
      WebRequest wr =
          WebRequest.Create("http://localhost/temp/test.aspx");
      wr.Credentials = CredentialCache.DefaultCredentials;
      XmlDocument doc = new XmlDocument();
      doc.Load(wr.GetResponse().GetResponseStream());
      doc.Save("saved.xml");
   }
}

This will retrieve the UTF-8-encoded response and generate a UTF-8-encoded local file named "saved.xml."

There is a tricky issue in this particular ASP.NET scenario. ASP.NET picks up the default encoding for the HTTP response (UTF-8) from the Web.Config file. Therefore, do not set the Response.Charset property to something other than UTF-8. If you do, the response will be invalid: the XML body will be encoded as UTF-8, but the HTTP Content-Type header will say the charset is something else. MSXML and Internet Explorer will then decode the body according to the charset in the HTTP header rather than as UTF-8, and the XML message will be garbled.

If you must respond with something other than the UTF-8 encoding, change the Web.Config file as follows:

<configuration>
   <system.web>
      <globalization
          responseEncoding="iso-8859-1" />
   </system.web>
</configuration>

This code will now cause the ASP.NET page to respond with the XML body correctly encoded in the ISO 8859-1 encoding. If you don't want to or cannot change the Web.Config file for some reason, you can use the following code instead:

<%@LANGUAGE=C#%>
<%@Import Namespace="System.Xml"%>
<%
   Response.ContentType = "text/xml";
   Response.Charset = "iso-8859-1";
   Response.Clear();
   XmlDocument doc = new XmlDocument();
   doc.Load(Server.MapPath("test.xml"));
   // Encode the body with the same charset that the HTTP header announces.
   XmlTextWriter xw = new XmlTextWriter(
                         Response.OutputStream,
                         Encoding.GetEncoding(Response.Charset));
   xw.Formatting = Formatting.Indented;
   doc.Save(xw);
   xw.Close();
%>

This code makes sure the XML body is encoded according to the Charset property. The client-side code (both the MSXML and System.Xml versions) will decode the XML body correctly, and the local file called "saved.xml" will preserve the ISO 8859-1 encoding.
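The underlying idea, driving both the HTTP header and the serialized bytes from a single charset value so the two cannot disagree, is not specific to ASP.NET. The following Python sketch shows it with the standard xml.etree.ElementTree module; build_response is a hypothetical helper, and the xml_declaration argument assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

def build_response(root: ET.Element, charset: str):
    """Return (content_type, body) that agree on one charset (hypothetical helper)."""
    # The same charset value names the encoding in the XML declaration,
    # encodes the bytes, and is announced in the Content-Type header.
    body = ET.tostring(root, encoding=charset, xml_declaration=True)
    return "text/xml; charset=" + charset, body

root = ET.Element("test")
root.text = "Copyright © 2001"

content_type, body = build_response(root, "iso-8859-1")
print(content_type)                # text/xml; charset=iso-8859-1
print(ET.fromstring(body).text)    # Copyright © 2001
```

Because the body carries a matching XML declaration, a client that parses the raw bytes recovers the text even without looking at the header.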