The encoding Declaration | Effective XML: 50 Specific Ways to Improve Your XML

The `encoding` Declaration

The encoding attribute specifies which character set and encoding the document is written in. Sometimes this identifies an encoding of the Unicode character set such as UTF-8 and UTF-16; other times it identifies a different character set such as ISO-8859-1 or US-ASCII, which for XML's purposes serves mainly as an encoding of a subset of the full Unicode character set.
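As a minimal sketch of what the declaration does, the following Python snippet serializes the same one-element document in two encodings and parses each with the standard library's `xml.etree.ElementTree`. The element name `city` and its content are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample text; "\u00e1" is the letter a with an acute accent.
text = "M\u00e1laga"

# The same document written in two encodings, each declared explicitly.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><city>%s</city>' % text
).encode("iso-8859-1")
utf8_doc = (
    '<?xml version="1.0" encoding="UTF-8"?><city>%s</city>' % text
).encode("utf-8")

# The bytes on disk differ, because the accented letter is one byte in
# ISO-8859-1 and two bytes in UTF-8.
bytes_differ = latin1_doc != utf8_doc

# The parser reads each declaration and recovers the same Unicode text.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text
```

Either way the application sees the same Unicode characters; the declaration only tells the parser how to turn bytes back into them.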

If no encoding declaration or other metadata is present, the default encoding is UTF-8. UTF-16 can also be used without a declaration if the document begins with a byte order mark. However, even when the document is written in UTF-8 or UTF-16, an encoding declaration helps people reading the document recognize the encoding, so it's useful to specify it explicitly.
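The defaulting rules can be sketched roughly as follows. This is a simplified illustration, not the full detection procedure a conforming parser performs: the function name `guess_encoding` is hypothetical, external metadata and UTF-32 are ignored, and the declaration is only found when the document uses an ASCII-compatible encoding.

```python
import codecs
import re

# Matches the encoding name in an XML declaration at the start of the data.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def guess_encoding(data: bytes) -> str:
    """Guess a document's encoding from its first bytes (simplified)."""
    # A byte order mark identifies UTF-8 or UTF-16 without any declaration.
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16"
    # Otherwise honor an explicit encoding declaration if there is one.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    # No byte order mark and no declaration: fall back to UTF-8.
    return "UTF-8"
```

For example, `guess_encoding(b"<doc/>")` falls through to the UTF-8 default, while a document produced by `"<doc/>".encode("utf-16")` is recognized by its byte order mark.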

Try to stick to well-known standard character sets and encodings such as ISO-8859-1, UTF-8, and UTF-16 if possible, and always use the standard names for them. Table 1-1 lists the names defined by the XML 1.0 specification; all parsers that support these character sets should recognize these names. For character encodings not defined in XML 1.0, choose a name registered with the IANA. You can find a complete list at http://www.iana.org/assignments/character-sets/. Avoid nonstandard names. In particular, watch out for Java names like 8859_1 and UTF16. Relatively few parsers not written in Java recognize these, and even some Java parsers don't recognize them by default. By contrast, all parsers, including those written in Java, should recognize the IANA standard equivalents such as ISO-8859-1 and UTF-16.
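One way to guard against Java-style names leaking into a declaration is to normalize them before writing the document. The sketch below is an assumption-laden illustration: the lookup table covers only the two names mentioned above, and the helper names `to_iana_name` and `xml_declaration` are hypothetical.

```python
# Hypothetical lookup table: Java-style names mapped to their IANA
# equivalents. Only the two names discussed in the text are listed.
JAVA_TO_IANA = {
    "8859_1": "ISO-8859-1",
    "UTF16": "UTF-16",
}

def to_iana_name(name: str) -> str:
    """Return the IANA name for a known Java-style name, else the input."""
    return JAVA_TO_IANA.get(name.upper(), name)

def xml_declaration(encoding: str) -> str:
    """Build an XML declaration that uses the normalized encoding name."""
    return '<?xml version="1.0" encoding="%s"?>' % to_iana_name(encoding)
```

With this in place, `xml_declaration("8859_1")` produces a declaration naming ISO-8859-1, and names that are already standard pass through unchanged.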

For similar reasons, avoid declaring and using vendor-dependent character sets such as Cp1252 (U.S. Windows) or MacRoman. These are not as interoperable as the standard character sets across the heterogeneous set of platforms that XML supports.
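If you inherit a document in a vendor-dependent character set, one option is to transcode it to UTF-8 and update the declaration to match. The following is a rough sketch under stated assumptions: the function name `transcode_to_utf8` is hypothetical, the source encoding must be known in advance, and the regular expression only handles a conventional declaration at the start of the document.

```python
import re

# Captures everything up to the opening quote of the encoding name (group 1),
# the quote character itself (group 2), and then the old encoding name.
_DECL_ENCODING = re.compile(
    r'(<\?xml[^>]*?encoding\s*=\s*)(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)

def transcode_to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode an XML document as UTF-8 and rewrite its declaration."""
    # Decode using the encoding the document is actually written in.
    text = data.decode(source_encoding)
    # Replace only the declared name, keeping the original quote style.
    text = _DECL_ENCODING.sub(r"\g<1>\g<2>UTF-8\g<2>", text, count=1)
    return text.encode("utf-8")
```

Rewriting the declaration matters as much as re-encoding the bytes: a UTF-8 document that still declares Cp1252 would be misread by any parser that trusts the declaration.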