Output Streams, Writers, and Encodings | Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

Most of the time you don't want to dump an XML document to System.out. Instead you want to write it to a file or onto a network socket. You might even store it in a string and pass it to some other process. What connects all of these possible targets in Java is the java.io.OutputStream class. Files, sockets, and even strings can all be treated as just another kind of stream.

XML documents are text, and that text is made up of Unicode characters. When the Unicode characters are actually written onto a stream, you need to pick a character encoding that specifies how each character is converted into bytes. This encoding can be one of the Unicode encodings such as UTF-8 or UTF-16, or it can be a local code page such as ISO-8859-1 or MacRoman. Characters that don't exist in the local code page can be escaped using numeric character references. The encoding declaration must then name the character set in use.

The normal way that Java converts characters into bytes in a specific encoding is by chaining an OutputStreamWriter to an OutputStream. As chars and strings are written onto the OutputStreamWriter, they are converted to bytes in the specified encoding, which are then written onto the underlying OutputStream.
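The escaping step mentioned above can be sketched as follows. This is a minimal illustration, not part of the book's examples: the class name NumericReferenceEscaper and the method escapeUnencodable() are hypothetical, and the code assumes Java 5 or later (for StringBuilder and codePointAt) with the java.nio.charset package. It only handles characters the encoding cannot represent; markup characters such as & and < still need their usual escaping.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class NumericReferenceEscaper {

    // Replaces every character that the named encoding cannot represent
    // with a hexadecimal numeric character reference such as &#x3C0;.
    // (Hypothetical helper; it does not escape &, <, or quotes.)
    public static String escapeUnencodable(String text, String encodingName) {
        CharsetEncoder encoder = Charset.forName(encodingName).newEncoder();
        StringBuilder result = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // Work in code points so that surrogate pairs stay together.
            int codePoint = text.codePointAt(i);
            int length = Character.charCount(codePoint);
            String piece = text.substring(i, i + length);
            if (encoder.canEncode(piece)) {
                result.append(piece);
            } else {
                result.append("&#x")
                      .append(Integer.toHexString(codePoint).toUpperCase())
                      .append(';');
            }
            i += length;
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) exists in Latin-1; U+03C0 (pi) does not.
        System.out.println(escapeUnencodable("caf\u00E9 \u03C0", "ISO-8859-1"));
    }
}
```

The escaped string can then be written onto an OutputStreamWriter that uses the same encoding without losing any characters.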

Let's suppose you want to dump the Fibonacci numbers into a file called fibonacci.xml in the current working directory. First you would open a FileOutputStream to that file, like this:

 OutputStream fout = new FileOutputStream("fibonacci.xml");

If performance is at all a concern, you would immediately chain this FileOutputStream to a BufferedOutputStream like this:

 OutputStream bout = new BufferedOutputStream(fout);

Then you would chain the BufferedOutputStream to an OutputStreamWriter. You'd pass the Java name of the encoding you want as the second argument to the OutputStreamWriter() constructor. For example, the following line chooses the ISO-8859-1 (Latin-1) encoding, although it uses Java's name for this encoding, "8859_1":

 OutputStreamWriter out = new OutputStreamWriter(bout, "8859_1");

Finally, you would write the output onto that OutputStreamWriter, making sure to include the right encoding declaration using the XML name for the encoding. Example 3.9 demonstrates.

Example 3.9 A Java Program That Writes an XML File

 import java.math.BigInteger;
 import java.io.*;

 public class FibonacciFile {

   public static void main(String[] args) {

     BigInteger low  = BigInteger.ONE;
     BigInteger high = BigInteger.ONE;

     try {
       OutputStream fout = new FileOutputStream("fibonacci.xml");
       OutputStream bout = new BufferedOutputStream(fout);
       OutputStreamWriter out
        = new OutputStreamWriter(bout, "8859_1");

       out.write("<?xml version=\"1.0\" ");
       out.write("encoding=\"ISO-8859-1\"?>\r\n");
       out.write("<Fibonacci_Numbers>\r\n");
       for (int i = 1; i <= 10; i++) {
         out.write("  <fibonacci index=\"" + i + "\">");
         out.write(low.toString());
         out.write("</fibonacci>\r\n");
         BigInteger temp = high;
         high = high.add(low);
         low = temp;
       }
       out.write("</Fibonacci_Numbers>\r\n");

       out.flush();  // Don't forget to flush!
       out.close();
     }
     catch (UnsupportedEncodingException e) {
       System.out.println(
        "This VM does not support the Latin-1 character set."
       );
     }
     catch (IOException e) {
       System.out.println(e.getMessage());
     }

   }

 }

One change from the System.out version is that the line breaks have to be encoded explicitly. Although they aren't really necessary for this XML document, the examples are prettier if the XML doesn't consist of one long line of text. I recommend using a carriage-return/linefeed pair (\r\n) as your line break. This is the native format for DOS and Windows, and most Unix and Macintosh text editors can handle it. More important, it is the standard line ending for network protocols such as HTTP and SMTP. XML parsers do normalize all line breaks to a single linefeed on input, so the proper choice of line break for an XML document is not nearly as fraught as for some other types of files. Nonetheless, picking carriage-return/linefeed does help when processing or transmitting XML documents with non-XML-aware tools.
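Because the writing logic depends only on the stream it is handed, the same code can target a file, a socket, or an in-memory buffer. The following sketch reworks Example 3.9 under a few assumptions: the class name FibonacciStreamSketch and both method names are hypothetical, and UncheckedIOException requires Java 8 or later. It writes onto any OutputStream and, as one use, captures the document in a string through a ByteArrayOutputStream.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.math.BigInteger;

public class FibonacciStreamSketch {

    // Writes the first `count` Fibonacci numbers as Latin-1 encoded XML
    // onto the given stream. The stream is flushed but left open, so
    // closing it remains the caller's job.
    public static void writeFibonacci(OutputStream target, int count)
            throws IOException {
        Writer out = new OutputStreamWriter(
            new BufferedOutputStream(target), "8859_1");
        BigInteger low  = BigInteger.ONE;
        BigInteger high = BigInteger.ONE;

        out.write("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\r\n");
        out.write("<Fibonacci_Numbers>\r\n");
        for (int i = 1; i <= count; i++) {
            out.write("  <fibonacci index=\"" + i + "\">");
            out.write(low.toString());
            out.write("</fibonacci>\r\n");
            BigInteger temp = high;
            high = high.add(low);
            low = temp;
        }
        out.write("</Fibonacci_Numbers>\r\n");
        out.flush();
    }

    // Treats a string as "just another kind of stream" by collecting
    // the bytes in memory and decoding them with the same encoding.
    public static String fibonacciAsString(int count) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            writeFibonacci(bytes, count);
            return bytes.toString("8859_1");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // The same method also works with System.out or a FileOutputStream.
        writeFibonacci(System.out, 10);
    }
}
```

Passing a FileOutputStream for "fibonacci.xml" to writeFibonacci() should reproduce the file that Example 3.9 writes.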

Although most XML parsers written in Java support exactly those encodings that are available in Java, they don't use the same names. Where XML encoding names tend to use hyphens, Java names tend to use underscores or to drop the hyphens completely. The reason is that earlier Java virtual machines used reflection to locate the classes that convert between different encodings, so Java's encoding names needed to be legal Java class names. Table 3.1 lists the Java and XML equivalents of the standard character sets and encodings. Later versions of Java, especially Java 1.4, often allow multiple names for the same encoding. Here I've picked the names that are supported across the broadest range of virtual machines.
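On Java 1.4 and later, the java.nio.charset.Charset class can resolve one of these aliases to a canonical name. The sketch below assumes that the canonical name reported by the virtual machine matches the name XML expects. That holds for common encodings such as ISO-8859-1 and UTF-8, but it is not guaranteed for every character set, so treat Table 3.1 as the reference. The class name EncodingNames and the method canonicalNameFor() are hypothetical.

```java
import java.nio.charset.Charset;
import java.util.Set;

public class EncodingNames {

    // Looks up a Java encoding name or alias and returns the canonical
    // name the virtual machine reports for it. Throws an unchecked
    // exception if the name is illegal or the encoding is unsupported.
    public static String canonicalNameFor(String javaName) {
        return Charset.forName(javaName).name();
    }

    // Returns every alias the virtual machine accepts for the encoding.
    public static Set<String> aliasesFor(String javaName) {
        return Charset.forName(javaName).aliases();
    }

    public static void main(String[] args) {
        String[] javaNames = {"8859_1", "UTF8"};
        for (String javaName : javaNames) {
            System.out.println(javaName + " -> " + canonicalNameFor(javaName)
                + " " + aliasesFor(javaName));
        }
    }
}
```

The set of aliases varies between virtual machines, so the printed list is informational only.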

In Table 3.1, I have deliberately omitted XML legal encodings that are not yet supported by Java, such as ISO-8859-10, ISO-8859-11, ISO-8859-14, and ISO-8859-16. It's not hard to add them in Java 1.4; but because they're not available by default, you're better off picking UTF-8 or one of the other Unicode encodings.

The exact list of encodings Java supports varies from virtual machine to virtual machine and from version to version. Java 1.4 is a major leap forward in support for many character sets, as well as for different aliases for character set names. However, since the standard Unicode and ISO encodings enable you to handle most environments today, there's little reason to use other encodings in XML documents.
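Since the supported list varies, a program can ask the virtual machine at run time before committing to an encoding. The following is a minimal sketch assuming Java 1.4 or later. The class name EncodingChooser and the method chooseEncoding() are hypothetical, and falling back to UTF-8 is one reasonable policy rather than the only one.

```java
import java.nio.charset.Charset;

public class EncodingChooser {

    // Returns the preferred encoding name if this virtual machine
    // supports it, and "UTF-8" otherwise. UTF-8 is required to be
    // available on every Java platform.
    public static String chooseEncoding(String preferred) {
        try {
            if (Charset.isSupported(preferred)) {
                return preferred;
            }
        } catch (IllegalArgumentException e) {
            // The name was null or not a syntactically legal charset
            // name; fall through to the default.
        }
        return "UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(chooseEncoding("ISO-8859-16"));
    }
}
```

Whatever name comes back must also drive the encoding declaration, using the XML name rather than the Java name when the two differ.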

Table 3.1. Standard Character Sets and Encodings

XML Name	Java Name	Java Version First Supported In	Scripts and Languages
ISO-8859-1	8859_1	1.1	Latin-1: ASCII plus the accented characters needed for most Western European languages, including Albanian, Basque, Breton, Catalan, Cornish, Danish, Dutch, English, Estonian, Faeroese, Finnish, French, Frisian, Galician, German, Greenlandic, Icelandic, Irish, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Sorbian, Spanish, and Swedish as well as many non-European languages written in the Latin alphabet, such as Malaysian and Swahili
ISO-8859-2	8859_2	1.1	Latin-2: ASCII plus the accented characters needed for most Central European languages, including Albanian, Croatian, Czech, Finnish, German, Hungarian, Latin, Polish, Romanian, Slovak, Slovenian, and Sorbian
ISO-8859-3	8859_3	1.1	Latin-3: ASCII plus the accented characters needed for most Southern European languages, including English, Esperanto, Finnish, French, German, Italian, Latin, Maltese, Portuguese, and Turkish
ISO-8859-4	8859_4	1.1	Latin-4: ASCII plus the accented characters needed for most Northern European languages, including Danish, English, Estonian, Finnish, German, Greenlandic, Latin, Latvian, Lithuanian, Norwegian, Sami, Slovenian, and Swedish
ISO-8859-5	8859_5	1.1	ASCII plus Cyrillic
ISO-8859-6	8859_6	1.1	ASCII plus Arabic
ISO-8859-7	8859_7	1.1	ASCII plus Greek
ISO-8859-8	8859_8	1.1	ASCII plus Hebrew
ISO-8859-9	8859_9	1.1	Latin-5: same as Latin-1, except that the Turkish letters İ, ı, Ş, ş, Ğ, and ğ take the place of the Icelandic letters Ý, ý, Þ, þ, Ð, and ð
ISO-8859-13	ISO8859_13	1.3	Latin-7: ASCII plus the accented characters needed for most Baltic languages, including Estonian, Latvian, Lithuanian, and Slovenian, as well as English, Danish, Finnish, German, Norwegian, and Swedish
ISO-8859-15	ISO8859_15_FDIS	1.2	Latin-9: same as Latin-1 but with the Euro symbol € instead of the international currency symbol ¤. It also replaces the infrequently used symbol characters ¦, ¨, ´, ¸, ¼, ½, and ¾ with the infrequently used French and Finnish letters Š, š, Ž, ž, Œ, œ, and Ÿ.
UTF-8	UTF8	1.1	The default encoding of XML documents; each Unicode character is represented in one to four bytes.
UTF-16	UnicodeBig or UnicodeLittle	1.2	An encoding of Unicode in which characters in the Basic Multilingual Plane are encoded in two bytes, and all other characters are encoded as two two-byte surrogates
ISO-10646-UCS-2	N/A	N/A	A straightforward encoding in which each Unicode character is represented as a two-byte integer; cannot represent characters outside the Basic Multilingual Plane
ISO-10646-UCS-4	N/A	N/A	A straightforward encoding in which each Unicode character is represented as a four-byte integer
ISO-2022-JP	JIS	1.1	Japanese
Shift_JIS	SJIS	1.1	Japanese
EUC-JP	EUCJIS	1.1	Japanese
US-ASCII	ASCII	1.2	English
GBK	GBK	1.1	Simplified Chinese
Big5	Big5	1.1	Traditional Chinese
ISO-2022-CN	ISO2022CN	1.1	Traditional Chinese
ISO-2022-KR	ISO2022KR	1.1	Korean

Some parsers, including Xerces-J, have an option to recognize the Java names, and all of the other encodings Java supports. This can be useful when you're reading XML documents that other people and systems send to you. However, you should not generate documents that use these encodings. The standard encodings in Table 3.1 should be sufficient for any document you need to create, and they are a lot more cross-platform compatible than platform-specific code pages such as Cp1252 and MacRoman. In a few cases you might prefer to use standard non-Unicode, non-ISO national character sets such as KS C 5601 for Korean or KOI8-R for Cyrillic. These are okay too, but still a little less well recognized around the world than the standard encodings listed in Table 3.1. The general principle is to be conservative in what you generate, but liberal in what you accept. In other words, try to stick to the most standard encodings when writing documents, but accept any encoding you recognize when reading documents.

Other character sets that you should not use in XML but which are available in Java include UTF-16BE and UTF-16LE. These are big-endian and little-endian encodings of Unicode without an explicit byte order mark. XML documents in UTF-16 must have an explicit byte order mark; it may not be omitted. UTF-8 documents may have a byte order mark, but for maximum compatibility with other software they generally should not.
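The difference is easy to check by encoding a short string in memory and inspecting the first two bytes. The sketch below assumes Java 8 or later (for UncheckedIOException) and relies on the documented behavior of Java's "UTF-16" encoder, which writes a big-endian byte order mark. The class name ByteOrderMarkCheck and its methods are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class ByteOrderMarkCheck {

    // Encodes the text through an OutputStreamWriter, exactly as an
    // XML document would be written, and returns the resulting bytes.
    public static byte[] encode(String text, String encoding) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            Writer out = new OutputStreamWriter(bytes, encoding);
            out.write(text);
            out.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // True if the data begins with FE FF (big-endian) or FF FE
    // (little-endian), the two forms of the UTF-16 byte order mark.
    public static boolean startsWithUtf16Bom(byte[] data) {
        if (data.length < 2) {
            return false;
        }
        int first  = data[0] & 0xFF;
        int second = data[1] & 0xFF;
        return (first == 0xFE && second == 0xFF)
            || (first == 0xFF && second == 0xFE);
    }

    public static void main(String[] args) {
        String[] encodings = {"UTF-16", "UTF-16BE", "UTF-16LE"};
        for (String encoding : encodings) {
            boolean marked = startsWithUtf16Bom(encode("<a/>", encoding));
            System.out.println(encoding + " byte order mark: " + marked);
        }
    }
}
```

A check like this can guard an output routine against accidentally producing an unmarked UTF-16 document.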

Note

Output streams, output stream writers, files, Unicode, character sets, character encodings, and many other aspects of input and output in Java are covered in much more detail in another of my books: Java I/O. O'Reilly & Associates, 1999. ISBN 1-56592-485-1.