5.1 Character-Set Metadata | XML in a Nutshell, Third Edition

Some environments keep track of which encodings particular documents are written in. For instance, web servers that transmit XML documents precede them with an HTTP header that looks something like this:

 HTTP/1.1 200 OK
 Date: Sun, 28 Oct 2001 11:05:42 GMT
 Server: Apache/1.3.19 (Unix) mod_jk mod_perl/1.25 mod_fastcgi/2.2.10
 Connection: close
 Transfer-Encoding: chunked
 Content-Type: text/xml; charset=iso-8859-1

The Content-Type field of the HTTP header provides the MIME media type of the document. This may, as shown here, specify which character set the document is written in. An XML parser reading this document from a web server should use this information to determine the document's character encoding.
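
As a rough illustration of how a client might pull the media type and the charset parameter out of a Content-Type value, here is a minimal Python sketch. It leans on the standard library's email.message.Message class to do the header parsing; the function name split_content_type is hypothetical and is not part of any XML parser's API.

```python
from email.message import Message


def split_content_type(header_value):
    """Return (media_type, charset) for a Content-Type header value.

    charset is None when the header carries no charset parameter.
    split_content_type is a hypothetical helper name used for illustration.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() lowercases the type/subtype pair;
    # get_content_charset() lowercases the charset or returns None.
    return msg.get_content_type(), msg.get_content_charset()


if __name__ == "__main__":
    print(split_content_type("text/xml; charset=iso-8859-1"))
    print(split_content_type("application/xml"))
```

For the header shown above, this yields the media type text/xml and the charset iso-8859-1. A header without a charset parameter yields None for the charset.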

Many web servers omit the charset parameter from the MIME media type. In this case, if the MIME media type is text/xml, then the document is assumed to be in the US-ASCII encoding. If the MIME media type is application/xml, then the parser attempts to guess the character set by reading the first few bytes of the document.
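
The decision rules just described can be sketched in Python as follows. This is a simplified illustration, not how any particular parser is implemented. The names choose_encoding and guess_from_bytes are hypothetical, and the byte-level guess here checks only for a byte-order mark and an ASCII-compatible XML declaration, falling back to UTF-8. Treating every media type other than text/xml like application/xml is also an assumption made for brevity.

```python
import codecs
import re

# Byte-order marks, with UTF-32 listed before UTF-16 because the UTF-32
# little-endian mark begins with the same two bytes as the UTF-16 one.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Matches an encoding pseudo-attribute in an XML declaration that is
# written in an ASCII-compatible encoding (a simplification).
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def guess_from_bytes(prefix):
    """Guess an encoding from the first bytes of a document (hypothetical helper)."""
    for bom, name in _BOMS:
        if prefix.startswith(bom):
            return name
    match = _DECL.match(prefix)
    if match:
        return match.group(1).decode("ascii").lower()
    # Assumption: with no byte-order mark and no declaration, treat as UTF-8.
    return "utf-8"


def choose_encoding(media_type, charset, prefix):
    """Apply the rules above: explicit charset, then text/xml default, then guess."""
    if charset:
        return charset.lower()
    if media_type == "text/xml":
        return "us-ascii"
    # Assumption: any other media type is handled like application/xml.
    return guess_from_bytes(prefix)
```

Called with text/xml and no charset, choose_encoding returns us-ascii regardless of what the document's own XML declaration says, which is exactly the behavior that makes text/xml a poor choice for most documents.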

Since ASCII is almost never an appropriate character set for an XML document, application/xml is much preferred over text/xml. Unfortunately, most web servers, including Apache 2.0.36 and earlier, are configured to use text/xml by default. If you're running such a version, you should probably upgrade before serving XML files. ^[1]

^[1] You could fix Apache's MIME types instead of upgrading, but you really should upgrade. All versions of Apache that are old enough to have the wrong MIME type for XML also have a number of security holes that have since been plugged.

We've focused on MIME types in HTTP headers because that's the most common place where character-set metadata is applied to XML documents. However, MIME types are also used in some filesystems (e.g., the BeOS), in email, and in other environments. Other systems may provide other forms of character-set metadata. If such metadata is available for a document, whatever form it takes, the parser should use it. In practice, though, this is an area where not all parsers and programs are as conformant as they should be.