The following contains best practices when working with XML in the global environment. Since it is common to encounter character-encoding issues when first learning XML and dealing with MSXML, you will see pointers and examples of what to watch for as you use MSXML.
The default encoding for all XML documents (or messages) is UTF-8, a variable-width encoding of Unicode (ISO/IEC 10646). If possible, stick with UTF-8, especially if you want to maximize cross-platform reach.
Remember that using XML entails making sure your data is cleanly separated from your user interfaces (UIs). Therefore, follow the W3C XML Schema standard and pay attention to how you store your data in XML. For example, if you store your dates in ISO 8601 format, the date May 17, 2001 would be represented as 2001-05-17. The application that uses the XML must decide how to present the data to the user based on the user's regional settings. (For more information on adjusting data according to the user's regional or language settings, see Chapter 4, "Locale and Cultural Awareness.")
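As a minimal sketch of this separation, assuming a JavaScript runtime with the standard Intl API (the helper name presentDate and the locale tags are illustrative and not part of the original text), the stored value stays in ISO 8601 form and only the display string changes with the user's locale:

```javascript
// Hypothetical helper: format a stored ISO 8601 date for a given locale.
// The value stored in the XML never changes; only its presentation does.
function presentDate(isoDate, locale) {
  // A date-only ISO 8601 string is parsed as midnight UTC.
  const date = new Date(isoDate);
  // Format in UTC as well, so the calendar day cannot shift with the local time zone.
  return new Intl.DateTimeFormat(locale, { timeZone: "UTC" }).format(date);
}

const stored = "2001-05-17"; // the value as it would appear in the XML document

console.log(presentDate(stored, "en-US")); // 5/17/2001
console.log(presentDate(stored, "de-DE")); // day-first form if German locale data is installed
```

The same stored string can be handed to any number of locales without touching the XML itself.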
If you are transmitting data using XML, it is recommended that you use the default UTF-8 encoding. However, the following sample shows how to transmit data reliably, between client and server, if you have to use a different encoding. Suppose you have the following XML file on your server:
<?xml version='1.0' encoding='windows-1252'?>
<test>Copyright © 2001</test>
If the XML file were encoded in UTF-8, the raw file would look like this:
<?xml version='1.0' encoding='utf-8'?>
<test>Copyright Â© 2001</test>
In the previous code, the "Â©" sequence is the two-byte UTF-8 encoding (0xC2 0xA9) of the Unicode character 0xA9, the copyright sign, as it appears when the raw bytes are viewed as single-byte text. To send this XML document from an ASP page, you would do the following:
<%@LANGUAGE=JScript%>
<%
    Response.ContentType = "text/xml";
    Response.Clear();
    // Load the file from disk and write it to the HTTP response.
    var dom = Server.CreateObject("MSXML2.DOMDocument");
    dom.load(Server.MapPath("test.xml"));
    dom.save(Response);
%>
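The byte-level difference between the two versions of the file shown earlier can be checked with a few lines of script. This is a sketch for Node.js rather than for ASP or Windows Script Host, and it assumes the character in question is the copyright sign (U+00A9). Node's "latin1" codec stands in for Windows-1252 here, since the two agree for the bytes involved:

```javascript
// U+00A9 (copyright sign), the character assumed in this example.
const copyright = "\u00A9";

// Single-byte encoding: one byte, 0xA9.
// "latin1" (ISO 8859-1) matches Windows-1252 for this code point.
const singleByte = Buffer.from(copyright, "latin1");

// UTF-8: two bytes, 0xC2 0xA9.
const utf8 = Buffer.from(copyright, "utf8");

console.log(singleByte.toString("hex")); // a9
console.log(utf8.toString("hex"));       // c2a9

// Reading the UTF-8 bytes as single-byte text yields two characters, not one.
const misread = utf8.toString("latin1");
console.log(misread.length);             // 2
console.log(misread === "\u00C2\u00A9"); // true
```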
To receive the XML document on the client using the XMLHTTP component in Microsoft JScript, you would do the following:
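// test.js: fetch the XML over HTTP and save the parsed document locally.
var http = new ActiveXObject("MSXML2.XMLHTTP");
http.open("GET", "http://localhost/test.asp", false); // false = synchronous request
http.send();
var doc = http.responseXML;
doc.save("test.xml");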
To execute this script, run the following from the command line:
cscript /nologo test.js
This will create a file in your current directory called "test.xml," which is correctly encoded in the Windows-1252 character set. Be careful when using WScript.echo with XML content. For example, if you add the following to the JScript code:
WScript.echo (doc.xml);
You will see the following incorrect output:
<?xml version="1.0"?>
<test>c</test>
This is because WScript.echo is losing information in the translation of Unicode (doc.xml) to the console character set. Also, notice that the xml property on the DOM document object removed the encoding attribute. This is by design. The xml property is returning a correctly encoded Unicode version of your XML document that can then be successfully parsed by LoadXML.
The two main methods of loading XML documents using the MSXML DOM are the LoadXML method and the Load method. The LoadXML method always takes a Unicode BSTR that is encoded in UCS-2 or UTF-16 only. If you pass anything other than a valid Unicode BSTR to LoadXML, it will fail to load. The Load method takes a VARIANT and supports the values shown in Table 26-3.
Table 26-3 Values supported by the Load method.
Value | Description |
URL | If the VARIANT is a BSTR, it is interpreted as a URI, and the XML content is fetched from that URI. |
VT_ARRAY | VT_UI1 | The VARIANT can also be a SAFEARRAY containing the raw encoded bytes. |
IUnknown | If the VARIANT is an IUnknown interface, the DOM document calls QueryInterface for IStream, IPersistStream, and IPersistStreamInit. |
The Load method implements the following algorithm for determining the character encoding or character set of the XML document:
There are two errors you will see returned from the MSXML DOM that indicate encoding problems. The first usually indicates that a character in the document does not match the encoding of the XML document:
An invalid character was found in text content.
The ParseError object will tell you exactly where on a particular line this rogue character occurs so that you can fix the problem. The second error indicates that you started with a Unicode byte-order mark (or that you called the LoadXML method) and that an encoding attribute specified something other than a 2-byte encoding (such as UTF-8 or Windows 1250):
Switch from current encoding to specified encoding not supported.
Another possibility is that you could have called the Load method and started off with a single-byte encoding (no byte-order mark), but the method might have found an encoding attribute that specified a 2- or 4-byte encoding (such as UTF-16 or UTF-32).
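Both variants of the second error come from a disagreement between how the byte stream starts and what the encoding attribute claims. The following Node.js sketch illustrates that kind of check under simplifying assumptions. It is not MSXML's implementation; the function name checkDeclaredEncoding, the 200-byte look-ahead, and the error messages are all hypothetical:

```javascript
// Hypothetical, simplified check. This is NOT MSXML's implementation; it only
// illustrates how a byte-order mark and an encoding attribute can disagree.
function checkDeclaredEncoding(bytes) {
  const buf = Buffer.from(bytes);

  // Step 1: look for a byte-order mark (UTF-16 in either byte order, or UTF-8).
  let bom = null;
  let skip = 0;
  if ((buf[0] === 0xff && buf[1] === 0xfe) || (buf[0] === 0xfe && buf[1] === 0xff)) {
    bom = "utf-16";
    skip = 2;
  } else if (buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf) {
    bom = "utf-8";
    skip = 3;
  }

  // Step 2: read the XML declaration. It is pure ASCII, so dropping NUL bytes
  // lets the same code read it from UTF-16 input and from single-byte input.
  const prefix = Array.from(buf.subarray(skip, skip + 200)).filter((b) => b !== 0);
  const head = Buffer.from(prefix).toString("latin1");
  const match = /^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z0-9._-]+)["']/.exec(head);
  const declared = match ? match[1].toLowerCase() : null;

  // Step 3: reject the two mismatches described in the text.
  if (bom === "utf-16" && declared && !/^(utf-16|ucs-2)/.test(declared)) {
    throw new Error("UTF-16 byte-order mark, but declaration says " + declared);
  }
  if (bom !== "utf-16" && declared && /^(utf-16|ucs-2|utf-32|ucs-4)/.test(declared)) {
    throw new Error("No UTF-16 byte-order mark, but declaration says " + declared);
  }

  // Otherwise trust the declaration, then the byte-order mark, then the UTF-8 default.
  return declared || bom || "utf-8";
}

const ok = Buffer.from("<?xml version='1.0' encoding='windows-1252'?><test/>", "latin1");
console.log(checkDeclaredEncoding(ok)); // windows-1252
```

A document that starts with a UTF-16 byte-order mark but declares encoding='utf-8' makes this function throw, which mirrors the first case above. A single-byte document that declares encoding='utf-16' mirrors the second.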
Finally, the IXMLHttpRequest interface provides several ways of accessing downloaded data, shown in Table 26-4.
Table 26-4 Properties of the IXMLHttpRequest interface, used to access downloaded data.
Properties | Description |
ResponseXML | Represents the response entity body as parsed by the MSXML DOM parser, using the same rules as the Load method. |
ResponseText | Represents the response entity body as a string. |
ResponseBody | Represents the response entity body as an array of unsigned bytes. |
ResponseStream | Represents the response entity body as an IStream interface. |
While Chapter 19, "The Microsoft .NET Framework," provides an overview of the .NET Framework, this section offers more detail on the System.Xml classes within the framework. It looks specifically at the same scenarios just discussed for MSXML.
The System.Xml classes are built upon lower-level classes that handle networking (such as System.Net) and encoding/decoding text (such as System.Text). The main difference between System.Xml and MSXML is that, instead of using the XMLHTTP control for HTTP client code to send and receive XML data over the Internet, you use the classes defined in the System.Net namespace, which are far more powerful and flexible. Similar to MSXML, the System.Xml classes allow you to load XML documents from in-memory strings encoded in UTF-16, or from encoded streams where the decoding rules are the same. In the .NET Framework you can also load XML documents from a text reader, which is a stream that has already been decoded into Unicode. The following is the managed Microsoft ASP.NET (once known as "ASP+") equivalent of the earlier ASP server-side code fragment used in the MSXML scenario:
<%@LANGUAGE=C#%>
<%@Import Namespace="System.Xml"%>
<%
    Response.ContentType = "text/xml";
    Response.Clear();
    XmlDocument doc = new XmlDocument();
    doc.Load(Server.MapPath("test.xml"));
    // Response.Output is a TextWriter, so its encoding determines the bytes sent.
    doc.Save(Response.Output);
%>
This will send a UTF-8-encoded response to the client. The following is the simplest managed client-side code for downloading and decoding this XML data correctly:
using System.Xml;
using System.Net;

class Test
{
    static void Main()
    {
        // Load handles the HTTP download and the decoding of the response.
        XmlDocument doc = new XmlDocument();
        doc.Load("http://localhost/temp/test.aspx");
        doc.Save("saved.xml");
    }
}
If you need more control over the HTTP client-side code, use the System.Net.WebRequest class. For example, the following code uses the CredentialCache class to supply the current user's default credentials when downloading data from a Web site that requires authenticated access.
using System.Xml;
using System.Net;

class Test
{
    static void Main()
    {
        WebRequest wr = WebRequest.Create("http://localhost/temp/test.aspx");
        // Send the current user's credentials with the request.
        wr.Credentials = CredentialCache.DefaultCredentials;
        XmlDocument doc = new XmlDocument();
        doc.Load(wr.GetResponse().GetResponseStream());
        doc.Save("saved.xml");
    }
}
This will retrieve the UTF-8-encoded response and generate a UTF-8-encoded local file named "saved.xml."
There is a tricky issue in this particular ASP.NET scenario. ASP.NET picks up the "default" encoding (UTF-8) for the HTTP response from the Web.Config file. Therefore, do not set the Response.Charset property to something other than UTF-8. If you do, you will get an invalid response because the XML body will be encoded as UTF-8, but the HTTP Content-Type header will say the charset is something else. As a result, MSXML and Internet Explorer will attempt to decode the body according to the charset in the HTTP header rather than as UTF-8, resulting in a garbled XML message.
If you must respond with something other than the UTF-8 encoding, change the Web.Config file as follows:
<configuration>
    <system.web>
        <globalization responseEncoding="iso-8859-1" />
    </system.web>
</configuration>
This code will now cause the ASP.NET page to respond with the XML body correctly encoded in the ISO 8859-1 encoding. If you don't want to or cannot change the Web.Config file for some reason, you can use the following code instead:
<%@LANGUAGE=C#%>
<%@Import Namespace="System.Xml"%>
<%
    Response.ContentType = "text/xml";
    Response.Charset = "iso-8859-1";
    Response.Clear();
    XmlDocument doc = new XmlDocument();
    doc.Load(Server.MapPath("test.xml"));
    // Write through an XmlTextWriter whose encoding matches the declared charset.
    XmlTextWriter xw = new XmlTextWriter(
        Response.OutputStream, Encoding.GetEncoding(Response.Charset));
    xw.Formatting = Formatting.Indented;
    doc.Save(xw);
    xw.Close();
%>
This code makes sure the XML body is encoded according to the Charset property. The client-side code (both the MSXML and System.Xml versions) will decode the XML body correctly, and the local file called "saved.xml" will preserve the ISO 8859-1 encoding.