The Complete Stream Zoo

Unlike C, which gets by just fine with a single type FILE*, Java has a whole zoo of more than 60 (!) different stream types (see Figures 12-1 and 12-2). Library designers claim that there is a good reason to give users a wide choice of stream types: it is supposed to reduce programming errors. For example, in C, some people think it is a common mistake to send output to a file that was open only for reading. (Well, it is not actually that common.) Naturally, if you do this, the output is ignored at run time. In Java and C++, the compiler catches that kind of mistake because an InputStream (Java) or istream (C++) has no methods for output.

Figure 12-1. Input and output stream hierarchy

Figure 12-2. Reader and writer hierarchy

(We would argue that in C++, and even more so in Java, the main tool that the stream interface designers have against programming errors is intimidation. The sheer complexity of the stream libraries keeps programmers on their toes.)

C++ NOTE

ANSI C++ gives you more stream types than you want, such as istream, ostream, iostream, ifstream, ofstream, fstream, wistream, wifstream, istrstream, and so on (18 classes in all). But Java really goes overboard with streams and gives you separate classes for selecting buffering, lookahead, random access, text formatting, and binary data.

Let us divide the animals in the stream class zoo by how they are used. Four abstract classes are at the base of the zoo: InputStream, OutputStream, Reader, and Writer. You do not make objects of these types, but other methods can return them. For example, as you saw in Chapter 10, the URL class has the method openStream that returns an InputStream. You then use this InputStream object to read from the URL. As we said, the InputStream and OutputStream classes let you read and write only individual bytes and arrays of bytes; they have no methods to read and write strings and numbers. You need more capable child classes for this. For example, DataInputStream and DataOutputStream let you read and write all the basic Java types.

For Unicode text, on the other hand, as we said, you use classes that descend from Reader and Writer. The basic methods of the Reader and Writer classes are similar to the ones for InputStream and OutputStream.

 abstract int read()
 abstract void write(int b)

They work just as the comparable methods do in the InputStream and OutputStream classes except, of course, the read method returns either a Unicode code unit (as an integer between 0 and 65535) or -1 when you have reached the end of the file.
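As a small illustration of this contract, the following sketch counts the code units that a Reader delivers before read returns -1. The class name ReadLoopDemo and the helper countCodeUnits are made up for this example, a StringReader stands in for a file-backed reader, and UncheckedIOException assumes Java 8 or later.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReadLoopDemo {
    // Reads one code unit at a time until read() signals end of input with -1.
    static int countCodeUnits(Reader in) {
        try {
            int count = 0;
            int ch;
            while ((ch = in.read()) != -1) {
                // here ch is a UTF-16 code unit between 0 and 65535
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countCodeUnits(new StringReader("hello"))); // prints 5
    }
}
```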

Finally, there are streams that do useful stuff, for example, the ZipInputStream and ZipOutputStream that let you read and write files in the familiar ZIP compression format.

Moreover, JDK 5.0 introduces four new interfaces: Closeable, Flushable, Readable, and Appendable (see Figure 12-3). The first two interfaces are very simple, with methods

 void close() throws IOException

and

 void flush()

respectively. The classes InputStream, OutputStream, Reader, and Writer all implement the Closeable interface. OutputStream and Writer implement the Flushable interface.

Figure 12-3. The `Closeable`, `Flushable`, `Readable`, and `Appendable` interfaces

The Readable interface has a single method

 int read(CharBuffer cb)

The CharBuffer class has methods for sequential and random read/write access. It represents an in-memory buffer or a memory-mapped file (see page 696).

The Appendable interface has two methods, for appending single characters and character sequences:

 Appendable append(char c)
 Appendable append(CharSequence s)

The CharSequence type is yet another interface, describing minimal properties of a sequence of char values. It is implemented by String, CharBuffer, and StringBuilder/StringBuffer (see page 656).

Of the stream zoo classes, only Writer implements Appendable.
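To see why an interface this small is useful, consider a method that formats into any Appendable. The sketch below is only an illustration: AppendableDemo and appendPair are hypothetical names, and the targets are a StringBuilder (outside the stream zoo) and a StringWriter (a Writer). Because each append call returns the target, the calls can be chained.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;

class AppendableDemo {
    // Appends "name=value;" to any Appendable, whatever its concrete class.
    static void appendPair(Appendable out, CharSequence name, CharSequence value) {
        try {
            out.append(name).append('=').append(value).append(';');
        } catch (IOException e) {
            // Appendable.append is declared to throw IOException
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        appendPair(sb, "id", "42");

        StringWriter sw = new StringWriter();
        appendPair(sw, "id", new StringBuilder("42")); // any CharSequence will do

        System.out.println(sb); // prints id=42;
        System.out.println(sw); // prints id=42;
    }
}
```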

 java.io.Closeable 5.0

void close()
closes this Closeable. This method may throw an IOException.

 java.io.Flushable 5.0

void flush()
flushes this Flushable.

 java.lang.Readable 5.0

int read(CharBuffer cb)
attempts to read as many char values into cb as it can hold. Returns the number of values read, or -1 if no further values are available from this Readable.

 java.lang.Appendable 5.0

Appendable append(char c)
appends the code unit c to this Appendable; returns this.
Appendable append(CharSequence cs)
appends all code units in cs to this Appendable; returns this.

 java.lang.CharSequence 1.4

char charAt(int index)
returns the code unit at the given index.
int length()
returns the number of code units in this sequence.
CharSequence subSequence(int startIndex, int endIndex)
returns a CharSequence consisting of the code units stored at index startIndex to endIndex - 1.
String toString()
returns a string consisting of the code units of this sequence.

Layering Stream Filters

FileInputStream and FileOutputStream give you input and output streams attached to a disk file. You give the file name or full path name of the file in the constructor. For example,

 FileInputStream fin = new FileInputStream("employee.dat");

looks in the current directory for a file named "employee.dat".

CAUTION

Because the backslash character is the escape character in Java strings, be sure to use \\ for Windows-style path names ("C:\\Windows\\win.ini"). In Windows, you can also use a single forward slash ("C:/Windows/win.ini") because most Windows file handling system calls will interpret forward slashes as file separators. However, this is not recommended: the behavior of the Windows system functions is subject to change, and on other operating systems, the file separator may be different. Instead, for portable programs, use the platform's file separator character, which is stored in the constant string File.separator.

You can also use a File object (see page 685 for more on file objects):

 File f = new File("employee.dat");
 FileInputStream fin = new FileInputStream(f);

Like the abstract InputStream and OutputStream classes, these classes support only reading and writing on the byte level. That is, we can only read bytes and byte arrays from the object fin.

 byte b = (byte) fin.read();

TIP

Because all the classes in java.io interpret relative path names as starting with the user's current working directory, you may want to know this directory. You can get at this information by a call to System.getProperty("user.dir").

As you will see in the next section, if we just had a DataInputStream, then we could read numeric types:

 DataInputStream din = . . .;
 double s = din.readDouble();

But just as the FileInputStream has no methods to read numeric types, the DataInputStream has no method to get data from a file.

Java uses a clever mechanism to separate two kinds of responsibilities. Some streams (such as the FileInputStream and the input stream returned by the openStream method of the URL class) can retrieve bytes from files and other more exotic locations. Other streams (such as the DataInputStream and the PrintWriter) can assemble bytes into more useful data types. The Java programmer has to combine the two into what are often called filtered streams by feeding an existing stream to the constructor of another stream. For example, to be able to read numbers from a file, first create a FileInputStream and then pass it to the constructor of a DataInputStream.

 FileInputStream fin = new FileInputStream("employee.dat");
 DataInputStream din = new DataInputStream(fin);
 double s = din.readDouble();

It is important to keep in mind that the data input stream that we created with the above code does not correspond to a new disk file. The newly created stream still accesses the data from the file attached to the file input stream, but the point is that it now has a more capable interface.

If you look at Figure 12-1 again, you can see the classes FilterInputStream and FilterOutputStream. You combine their subclasses into a new filtered stream to construct the streams you want. For example, by default, streams are not buffered. That is, every call to read contacts the operating system to ask it to dole out yet another byte. If you want buffering and the data input methods for a file named employee.dat in the current directory, you need to use the following rather monstrous sequence of constructors:

 DataInputStream din = new DataInputStream(
    new BufferedInputStream(
       new FileInputStream("employee.dat")));

Notice that we put the DataInputStream last in the chain of constructors because we want to use the DataInputStream methods, and we want them to use the buffered read method. Regardless of the ugliness of the above code, it is necessary: you must be prepared to continue layering stream constructors until you have access to the functionality you want.
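The following sketch puts the layering to work in both directions: it writes a double through a data stream layered on a buffered stream layered on a file stream, and then reads it back through the mirror-image chain. It is an illustration only. The class LayeredStreamDemo and the method roundTrip are hypothetical, a temporary file replaces employee.dat so that the example leaves nothing behind, and UncheckedIOException assumes Java 8 or later.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class LayeredStreamDemo {
    // Writes one double through a buffered data stream, then reads it back the same way.
    static double roundTrip(double value) {
        try {
            File f = File.createTempFile("employee", ".dat");
            f.deleteOnExit();

            DataOutputStream dout = new DataOutputStream(
                new BufferedOutputStream(
                    new FileOutputStream(f)));
            try {
                dout.writeDouble(value);
            } finally {
                dout.close(); // flushes the buffer and closes the underlying file stream
            }

            DataInputStream din = new DataInputStream(
                new BufferedInputStream(
                    new FileInputStream(f)));
            try {
                return din.readDouble();
            } finally {
                din.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(75000.0)); // prints 75000.0
    }
}
```

Closing the outermost stream is enough here, because each filter stream passes the close request on to the stream it wraps.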

Sometimes you'll need to keep track of the intermediate streams when chaining them together. For example, when reading input, you often need to peek at the next byte to see if it is the value that you expect. Java provides the PushbackInputStream for this purpose.

 PushbackInputStream pbin = new PushbackInputStream(
    new BufferedInputStream(
       new FileInputStream("employee.dat")));

Now you can speculatively read the next byte

 int b = pbin.read();

and throw it back if it isn't what you wanted.

 if (b != '<') pbin.unread(b);

But reading and unreading are the only methods that apply to the pushback input stream. If you want to look ahead and also read numbers, then you need both a pushback input stream and a data input stream reference.

 DataInputStream din = new DataInputStream(
    pbin = new PushbackInputStream(
       new BufferedInputStream(
          new FileInputStream("employee.dat"))));
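Here is a minimal sketch of how the two references cooperate. It assumes a made-up input format in which a 4-byte integer may be preceded by a '<' marker byte. The names PushbackDemo and readOptionallyTaggedInt are hypothetical, and a ByteArrayInputStream replaces the file so that the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

class PushbackDemo {
    // Skips a leading '<' marker if there is one, then reads a 4-byte int.
    static int readOptionallyTaggedInt(byte[] data) {
        try {
            PushbackInputStream pbin = new PushbackInputStream(
                new ByteArrayInputStream(data));
            DataInputStream din = new DataInputStream(pbin);

            int b = pbin.read();                      // speculative read through the inner reference
            if (b != -1 && b != '<') pbin.unread(b);  // not the marker: push the byte back
            return din.readInt();                     // the outer stream sees the pushed-back byte
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] plain = {0, 0, 4, (byte) 0xD2};
        byte[] tagged = {'<', 0, 0, 4, (byte) 0xD2};
        System.out.println(readOptionallyTaggedInt(plain));  // prints 1234
        System.out.println(readOptionallyTaggedInt(tagged)); // prints 1234
    }
}
```

This works because DataInputStream does no buffering of its own, so bytes read or pushed back through pbin are seen consistently by din.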

Of course, in the stream libraries of other programming languages, niceties such as buffering and lookahead are automatically taken care of, so it is a bit of a hassle in Java that one has to resort to layering stream filters in these cases. But the ability to mix and match filter classes to construct truly useful sequences of streams does give you an immense amount of flexibility. For example, you can read numbers from a compressed ZIP file by using the following sequence of streams (see Figure 12-4).

 ZipInputStream zin = new ZipInputStream(new FileInputStream("employee.zip"));
 DataInputStream din = new DataInputStream(zin);

Figure 12-4. A sequence of filtered streams

(See the section on ZIP file streams starting on page 643 for more on Java's ability to handle ZIP files.)
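The sketch below runs this combination end to end in memory: it writes a double into a ZIP entry through a DataOutputStream and reads it back through a DataInputStream layered on a ZipInputStream. The class name ZipDataDemo and the entry name salary.dat are invented for the example. Note one detail that the short snippet above leaves out: you must call getNextEntry to position the ZIP stream on an entry before you read from it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipDataDemo {
    // Stores one double in a ZIP entry, then reads it back through layered streams.
    static double roundTrip(double value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ZipOutputStream zout = new ZipOutputStream(bytes);
            zout.putNextEntry(new ZipEntry("salary.dat")); // hypothetical entry name
            DataOutputStream dout = new DataOutputStream(zout);
            dout.writeDouble(value);
            dout.flush();
            zout.closeEntry();
            zout.close();

            ZipInputStream zin = new ZipInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
            zin.getNextEntry();                 // position the stream on the first entry
            DataInputStream din = new DataInputStream(zin);
            return din.readDouble();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(1234.5)); // prints 1234.5
    }
}
```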

All in all, apart from the rather monstrous constructors that are needed to layer streams, the ability to mix and match streams is a very useful feature of Java!

 java.io.FileInputStream 1.0

FileInputStream(String name)
creates a new file input stream, using the file whose path name is specified by the name string.
FileInputStream(File f)
creates a new file input stream, using the information encapsulated in the File object. (The File class is described at the end of this chapter.)

 java.io.FileOutputStream 1.0

FileOutputStream(String name)
creates a new file output stream specified by the name string. Path names that are not absolute are resolved relative to the current working directory. Caution: This method automatically deletes any existing file with the same name.
FileOutputStream(String name, boolean append)
creates a new file output stream specified by the name string. Path names that are not absolute are resolved relative to the current working directory. If the append parameter is true, then data are added at the end of the file. An existing file with the same name will not be deleted.
FileOutputStream(File f)
creates a new file output stream using the information encapsulated in the File object. (The File class is described at the end of this chapter.) Caution: This method automatically deletes any existing file with the same name as the name of f.

 java.io.BufferedInputStream 1.0

BufferedInputStream(InputStream in)
creates a new buffered stream with a default buffer size. A buffered input stream reads bytes from a stream without causing a device access every time. When the buffer is empty, a new block of data is read into the buffer.
BufferedInputStream(InputStream in, int n)
creates a new buffered stream with a user-defined buffer size.

 java.io.BufferedOutputStream 1.0

BufferedOutputStream(OutputStream out)
creates a new buffered stream with a default buffer size. A buffered output stream collects bytes to be written without causing a device access every time. When the buffer fills up or when the stream is flushed, the data are written.
BufferedOutputStream(OutputStream out, int n)
creates a new buffered stream with a user-defined buffer size.

 java.io.PushbackInputStream 1.0

PushbackInputStream(InputStream in)
constructs a stream with one-byte lookahead.
PushbackInputStream(InputStream in, int size)
constructs a stream with a pushback buffer of specified size.
void unread(int b)
pushes back a byte, which is retrieved again by the next call to read. You can push back only one byte at a time.

Parameters:	`b`	The byte to be read again

Data Streams

You often need to write the result of a computation or read one back. The data streams support methods for reading back all the basic Java types. To write a number, character, Boolean value, or string, use one of the following methods of the DataOutput interface:

`writeChars`	`writeFloat`
`writeByte`	`writeDouble`
`writeInt`	`writeChar`
`writeShort`	`writeBoolean`
`writeLong`	`writeUTF`

For example, writeInt always writes an integer as a 4-byte binary quantity regardless of the number of digits, and writeDouble always writes a double as an 8-byte binary quantity. The resulting output is not humanly readable, but the space needed will be the same for each value of a given type and reading it back in will be faster. (See the section on the PrintWriter class later in this chapter for how to output numbers as human-readable text.)

NOTE

There are two different methods of storing integers and floating-point numbers in memory, depending on the platform you are using. Suppose, for example, you are working with a 4-byte int, say the decimal number 1234, or 4D2 in hexadecimal (1234 = 4 x 256 + 13 x 16 + 2). This can be stored in such a way that the first of the 4 bytes in memory holds the most significant byte (MSB) of the value: 00 00 04 D2. This is the so-called big-endian method. Or we can start with the least significant byte (LSB) first: D2 04 00 00. This is called, naturally enough, the little-endian method. For example, the SPARC uses big-endian; the Pentium, little-endian. This can lead to problems. When a C or C++ file is saved, the data are saved exactly as the processor stores them. That makes it challenging to move even the simplest data files from one platform to another. In Java, all values are written in the big-endian fashion, regardless of the processor. That makes Java data files platform independent.
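You can check the byte order for yourself. The following sketch writes an int with writeInt into an in-memory stream and prints the resulting bytes in hexadecimal. The class BigEndianDemo and the helper hexOfInt are names chosen for this illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class BigEndianDemo {
    // Returns the bytes produced by writeInt as space-separated hex pairs.
    static String hexOfInt(int value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(value);
            out.flush();

            StringBuilder sb = new StringBuilder();
            for (byte b : bytes.toByteArray()) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(String.format("%02X", b & 0xFF)); // mask to treat the byte as unsigned
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hexOfInt(1234)); // prints 00 00 04 D2
    }
}
```

The most significant byte comes first, whatever the byte order of the processor that runs the program.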

The writeUTF method writes string data by using a modified version of 8-bit Unicode Transformation Format. Instead of simply using the standard UTF-8 encoding (which is shown in Table 12-1), character strings are first represented in UTF-16 (see Table 12-2) and then the result is encoded using the UTF-8 rules. The modified encoding is different for characters with code higher than 0xFFFF. It is used for backwards compatibility with virtual machines that were built when Unicode had not yet grown beyond 16 bits.

Table 12-1. UTF-8 Encoding
Character Range	Encoding
0...7F	0a₆a₅a₄a₃a₂a₁a₀
80...7FF	110a₁₀a₉a₈a₇a₆ 10a₅a₄a₃a₂a₁a₀
800...FFFF	1110a₁₅a₁₄a₁₃a₁₂ 10a₁₁a₁₀a₉a₈a₇a₆ 10a₅a₄a₃a₂a₁a₀
10000...10FFFF	11110a₂₀a₁₉a₁₈ 10a₁₇a₁₆a₁₅a₁₄a₁₃a₁₂ 10a₁₁a₁₀a₉a₈a₇a₆ 10a₅a₄a₃a₂a₁a₀

Table 12-2. UTF-16 Encoding
Character Range	Encoding
0...FFFF	a₁₅a₁₄a₁₃a₁₂a₁₁a₁₀a₉a₈ a₇a₆a₅a₄a₃a₂a₁a₀
10000...10FFFF	110110b₁₉b₁₈ b₁₇b₁₆a₁₅a₁₄a₁₃a₁₂a₁₁a₁₀ 110111a₉a₈ a₇a₆a₅a₄a₃a₂a₁a₀ where b₁₉b₁₈b₁₇b₁₆ = a₂₀a₁₉a₁₈a₁₇a₁₆ - 1

Because nobody else uses this modification of UTF-8, you should only use the writeUTF method to write strings that are intended for a Java virtual machine; for example, if you write a program that generates bytecodes. Use the writeChars method for other purposes.
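A quick experiment makes the difference visible. The sketch below compares the number of bytes that writeUTF produces with the length of the standard UTF-8 encoding of the same string. ModifiedUtf8Demo and its two helper methods are hypothetical names. The test character U+1D11E lies above 0xFFFF, so a Java string holds it as the surrogate pair \uD834 \uDD1E. Keep in mind that writeUTF also writes a 2-byte length prefix before the encoded characters.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class ModifiedUtf8Demo {
    // Bytes produced by writeUTF, including its 2-byte length prefix.
    static int writeUtfLength(String s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeUTF(s);
            out.flush();
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Bytes in the standard UTF-8 encoding of the same string.
    static int standardUtf8Length(String s) {
        return s.getBytes(Charset.forName("UTF-8")).length;
    }

    public static void main(String[] args) {
        String clef = "\uD834\uDD1E"; // one character above 0xFFFF, two UTF-16 code units
        System.out.println(standardUtf8Length(clef)); // prints 4
        System.out.println(writeUtfLength(clef));     // prints 8 (2-byte prefix + 3 bytes per surrogate)
    }
}
```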

NOTE

See RFC 2279 (http://ietf.org/rfc/rfc2279.txt) and RFC 2781 (http://ietf.org/rfc/rfc2781.txt) for definitions of UTF-8 and UTF-16.

To read the data back in, use the following methods:

`readInt`	`readDouble`
`readShort`	`readChar`
`readLong`	`readBoolean`
`readFloat`	`readUTF`

NOTE

The binary data format is compact and platform independent. Except for the UTF strings, it is also suited to random access. The major drawback is that binary files are not readable by humans.

 java.io.DataInput 1.0

boolean readBoolean()
reads in a Boolean value.
byte readByte()
reads an 8-bit byte.
char readChar()
reads a 16-bit Unicode character.
double readDouble()
reads a 64-bit double.
float readFloat()
reads a 32-bit float.
void readFully(byte[] b)
reads bytes into the array b , blocking until all bytes are read.
Parameters:	`b`	The buffer into which the data is read
void readFully(byte[] b, int off, int len)
reads bytes into the array b, blocking until all bytes are read.
Parameters:	`b`	The buffer into which the data is read
	`off`	The start offset of the data
	`len`	The maximum number of bytes to read
int readInt()
reads a 32-bit integer.
String readLine()
reads in a line that has been terminated by a \n, \r, \r\n, or EOF. Returns a string containing all bytes in the line converted to Unicode characters.
long readLong()
reads a 64-bit long integer.
short readShort()
reads a 16-bit short integer.
String readUTF()
reads a string of characters in "modified UTF-8" format.
int skipBytes(int n)
skips n bytes, blocking until all bytes are skipped.
Parameters:	`n`	The number of bytes to be skipped

 java.io.DataOutput 1.0

void writeBoolean(boolean b)
writes a Boolean value.
void writeByte(int b)
writes an 8-bit byte.
void writeChar(int c)
writes a 16-bit Unicode character.
void writeChars(String s)
writes all characters in the string.
void writeDouble(double d)
writes a 64-bit double.
void writeFloat(float f)
writes a 32-bit float.
void writeInt(int i)
writes a 32-bit integer.
void writeLong(long l)
writes a 64-bit long integer.
void writeShort(int s)
writes a 16-bit short integer.
void writeUTF(String s)
writes a string of characters in "modified UTF-8" format.

Random-Access File Streams

The RandomAccessFile stream class lets you find or write data anywhere in a file. It implements both the DataInput and DataOutput interfaces. Disk files are random access, but streams of data from a network are not. You open a random-access file either for reading only or for both reading and writing. You specify the option by using the string "r" (for read access) or "rw" (for read/write access) as the second argument in the constructor.

 RandomAccessFile in = new RandomAccessFile("employee.dat", "r");
 RandomAccessFile inOut = new RandomAccessFile("employee.dat", "rw");

When you open an existing file as a RandomAccessFile, it does not get deleted.

A random-access file also has a file pointer. The file pointer always indicates the position of the next byte that will be read or written. The seek method sets the file pointer to an arbitrary byte position within the file. The argument to seek is a long integer between zero and the length of the file in bytes.

The getFilePointer method returns the current position of the file pointer.

To read from a random-access file, you use the same methods such as readInt and readChar as for DataInputStream objects. That is no accident. These methods are actually defined in the DataInput interface that both DataInputStream and RandomAccessFile implement.

Similarly, to write a random-access file, you use the same writeInt and writeChar methods as in the DataOutputStream class. These methods are defined in the DataOutput interface that is common to both classes.

The advantage of having the RandomAccessFile class implement both DataInput and DataOutput is that this lets you use or write methods whose argument types are the DataInput and DataOutput interfaces.

 class Employee {
    . . .
    read(DataInput in) { . . . }
    write(DataOutput out) { . . . }
 }

Note that the read method can handle either a DataInputStream or a RandomAccessFile object because both of these classes implement the DataInput interface. The same is true for the write method.
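To make this concrete, here is a sketch under a few assumptions: each record consists of a 4-byte int ID followed by an 8-byte double salary, so that every record occupies exactly 12 bytes. The names RandomAccessDemo, writeRecord, readSalary, and salaryOfRecord are invented for the example, and a temporary file is used in place of employee.dat. The helper methods take DataOutput and DataInput parameters, so they would work equally well with data streams.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

class RandomAccessDemo {
    static final int RECORD_SIZE = 12; // 4-byte int id + 8-byte double salary

    static void writeRecord(DataOutput out, int id, double salary) throws IOException {
        out.writeInt(id);
        out.writeDouble(salary);
    }

    static double readSalary(DataInput in) throws IOException {
        in.readInt();            // read past the id
        return in.readDouble();
    }

    // Writes three records, then jumps straight to record n (counting from 0).
    static double salaryOfRecord(int n) {
        try {
            File f = File.createTempFile("employee", ".dat");
            f.deleteOnExit();
            RandomAccessFile raf = new RandomAccessFile(f, "rw");
            try {
                writeRecord(raf, 1, 50000.0);
                writeRecord(raf, 2, 75000.0);
                writeRecord(raf, 3, 40000.0);
                raf.seek((long) n * RECORD_SIZE); // byte offset of record n
                return readSalary(raf);
            } finally {
                raf.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(salaryOfRecord(1)); // prints 75000.0
    }
}
```

Fixed-size records are what make the seek computation possible, which is why this sketch avoids variable-length strings.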

 java.io.RandomAccessFile 1.0

RandomAccessFile(String file, String mode)

RandomAccessFile(File file, String mode)

Parameters:	`file`	The file to be opened
	`mode`	"r" for read-only mode, "rw" for read/write mode, "rws" for read/write mode with synchronous disk writes of data and metadata for every update, and "rwd" for read/write mode with synchronous disk writes of data only.

long getFilePointer()
returns the current location of the file pointer.
void seek(long pos)
sets the file pointer to pos bytes from the beginning of the file.
long length()
returns the length of the file in bytes.

Text Streams

In the last section, we discussed binary input and output. While binary I/O is fast and efficient, it is not easily readable by humans. In this section, we will focus on text I/O. For example, if the integer 1234 is saved in binary, it is written as the sequence of bytes 00 00 04 D2 (in hexadecimal notation). In text format, it is saved as the string "1234".

Unfortunately, doing this in Java requires a bit of work, because, as you know, Java uses Unicode characters. That is, the character encoding for the string "1234" really is 00 31 00 32 00 33 00 34 (in hex). However, at the present time most environments in which your Java programs will run use their own character encoding. This may be a single-byte, double-byte, or variable-byte scheme. For example, if you use Windows, the string would be written in ASCII, as 31 32 33 34, without the extra zero bytes. If the Unicode encoding were written into a text file, then it would be quite unlikely that the resulting file would be humanly readable with the tools of the host environment. To overcome this problem, Java has a set of stream filters that bridges the gap between Unicode-encoded strings and the character encoding used by the local operating system. All of these classes descend from the abstract Reader and Writer classes, and the names are reminiscent of the ones used for binary data. For example, the InputStreamReader class turns an input stream that contains bytes in a particular character encoding into a reader that emits Unicode characters. Similarly, the OutputStreamWriter class turns a stream of Unicode characters into a stream of bytes in a particular character encoding.

For example, here is how you make an input reader that reads keystrokes from the console and automatically converts them to Unicode.

 InputStreamReader in = new InputStreamReader(System.in);

This input stream reader assumes the normal character encoding used by the host system. For example, under Windows, it uses the ISO 8859-1 encoding (also known as ISO Latin-1 or, among Windows programmers, as "ANSI code"). You can choose a different encoding by specifying it in the constructor for the InputStreamReader. This takes the form

 InputStreamReader(InputStream, String)

where the string describes the encoding scheme that you want to use. For example,

 InputStreamReader in = new InputStreamReader(
    new FileInputStream("kremlin.dat"), "ISO8859_5");

The next section has more information on character sets.
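The effect of the encoding argument is easy to observe on the output side. The sketch below sends the string "1234" through an OutputStreamWriter with a given encoding and shows the bytes that arrive in the underlying stream. EncodingDemo and hexOfText are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;

class EncodingDemo {
    // Encodes text through an OutputStreamWriter and returns the bytes as hex pairs.
    static String hexOfText(String text, String encoding) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            OutputStreamWriter out = new OutputStreamWriter(bytes, encoding);
            out.write(text);
            out.close(); // flushes the encoder into the byte stream

            StringBuilder sb = new StringBuilder();
            for (byte b : bytes.toByteArray()) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(String.format("%02X", b & 0xFF));
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hexOfText("1234", "US-ASCII")); // prints 31 32 33 34
        System.out.println(hexOfText("1234", "UTF-16BE")); // prints 00 31 00 32 00 33 00 34
    }
}
```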

Because it is so common to want to attach a reader or writer to a file, a pair of convenience classes, FileReader and FileWriter, is provided for this purpose. For example, the writer definition

 FileWriter out = new FileWriter("output.txt");

is equivalent to

 OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream("output.txt"));

Character Sets

In the past, international character sets have been handled rather unsystematically throughout the Java library. The java.nio package introduced in JDK 1.4 unifies character set conversion with the introduction of the Charset class. (Note that the s is lower case.)

A character set maps between sequences of two-byte Unicode code units and byte sequences used in a local character encoding. One of the most popular character encodings is ISO-8859-1, a single-byte encoding of the first 256 Unicode characters. Gaining in importance is ISO-8859-15, which replaces some of the less useful characters of ISO-8859-1 with accented letters used in French and Finnish, and, more important, replaces the "international currency" character ¤ with the Euro symbol (€) in code point 0xA4. Other examples for character encodings are the variable-byte encodings commonly used for Japanese and Chinese.

The Charset class uses the character set names standardized in the IANA Character Set Registry (http://www.iana.org/assignments/character-sets). These names differ slightly from those used in previous versions. For example, the "official" name of ISO-8859-1 is now "ISO-8859-1" and no longer "ISO8859_1", which was the preferred name up to JDK 1.3. For compatibility with other naming conventions, each character set can have a number of aliases. For example, ISO-8859-1 has aliases

 ISO8859-1 ISO_8859_1 ISO8859_1 ISO_8859-1 ISO_8859-1:1987 8859_1 latin1 l1 csISOLatin1 iso-ir-100 cp819 IBM819 IBM-819 819

Character set names are case insensitive.

You obtain a Charset by calling the static forName method with either the official name or one of its aliases:

 Charset cset = Charset.forName("ISO-8859-1");

The aliases method returns a Set object of the aliases. A Set is a collection that we discuss in Volume 2; here is the code to iterate through the set elements:

 Set<String> aliases = cset.aliases();
 for (String alias : aliases)
    System.out.println(alias);
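As a quick check of the alias mechanism, the following sketch looks up a character set by an alias and by a lowercase spelling and reports the official name in each case. CharsetAliasDemo and canonicalName are names made up for this example. It relies on the aliases listed above being present in your JDK.

```java
import java.nio.charset.Charset;

class CharsetAliasDemo {
    // Resolves an official name or alias to the character set's official name.
    static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("latin1"));     // prints ISO-8859-1
        System.out.println(canonicalName("iso-8859-1")); // prints ISO-8859-1
    }
}
```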

NOTE

An excellent reference for the "ISO 8859 alphabet soup" is http://czyborra.com/charsets/iso8859.html.

International versions of Java support many more encodings. There is even a mechanism for adding additional character set providers; see the JDK documentation for details. To find out which character sets are available in a particular implementation, call the static availableCharsets method. It returns a SortedMap, another collection class. Use this code to find out the names of all available character sets:

 Map<String, Charset> charsets = Charset.availableCharsets();
 for (String name : charsets.keySet())
    System.out.println(name);

Table 12-3 lists the character encodings that every Java implementation is required to have. Table 12-4 lists the encoding schemes that the JDK installs by default. The character sets in Tables 12-5 and 12-6 are installed only on operating systems that use non-European languages. The encoding schemes in Table 12-6 are supplied for compatibility with previous versions of the JDK.

Table 12-3. Required Character Encodings
Charset Standard Name	Legacy Name	Description
`US-ASCII`	`ASCII`	American Standard Code for Information Interchange
`ISO-8859-1`	`ISO8859_1`	ISO 8859-1, Latin alphabet No. 1
`UTF-8`	`UTF8`	Eight-bit Unicode Transformation Format
`UTF-16`	`UTF-16`	Sixteen-bit Unicode Transformation Format, byte order specified by an optional initial byte-order mark
`UTF-16BE`	`UnicodeBigUnmarked`	Sixteen-bit Unicode Transformation Format, big-endian byte order
`UTF-16LE`	`UnicodeLittleUnmarked`	Sixteen-bit Unicode Transformation Format, little-endian byte order

Table 12-4. Basic Character Encodings
Charset Standard Name	Legacy Name	Description
`ISO-8859-2`	`ISO8859_2`	ISO 8859-2, Latin alphabet No. 2
`ISO-8859-4`	`ISO8859_4`	ISO 8859-4, Latin alphabet No. 4
`ISO-8859-5`	`ISO8859_5`	ISO 8859-5, Latin/Cyrillic alphabet
`ISO-8859-7`	`ISO8859_7`	ISO 8859-7, Latin/Greek alphabet
`ISO-8859-9`	`ISO8859_9`	ISO 8859-9, Latin alphabet No. 5
`ISO-8859-13`	`ISO8859_13`	ISO 8859-13, Latin alphabet No. 7
`ISO-8859-15`	`ISO8859_15`	ISO 8859-15, Latin alphabet No. 9
`windows-1250`	`Cp1250`	Windows Eastern European
`windows-1251`	`Cp1251`	Windows Cyrillic
`windows-1252`	`Cp1252`	Windows Latin-1
`windows-1253`	`Cp1253`	Windows Greek
`windows-1254`	`Cp1254`	Windows Turkish
`windows-1257`	`Cp1257`	Windows Baltic

Table 12-5. Extended Character Encodings
Charset Standard Name	Legacy Name	Description
`Big5`	`Big5`	Big5, Traditional Chinese
`Big5-HKSCS`	`Big5_HKSCS`	Big5 with Hong Kong extensions, Traditional Chinese
`EUC-JP`	`EUC_JP`	JIS X 0201, 0208, 0212, EUC encoding, Japanese
`EUC-KR`	`EUC_KR`	KS C 5601, EUC encoding, Korean
`GB18030`	`GB18030`	Simplified Chinese, PRC Standard
`GBK`	`GBK`	GBK, Simplified Chinese
`ISCII91`	`ISCII91`	ISCII91 encoding of Indic scripts
`ISO-2022-JP`	`ISO2022JP`	JIS X 0201, 0208 in ISO 2022 form, Japanese
`ISO-2022-KR`	`ISO2022KR`	ISO 2022 KR, Korean
`ISO-8859-3`	`ISO8859_3`	ISO 8859-3, Latin alphabet No. 3
`ISO-8859-6`	`ISO8859_6`	ISO 8859-6, Latin/Arabic alphabet
`ISO-8859-8`	`ISO8859_8`	ISO 8859-8, Latin/Hebrew alphabet
`Shift_JIS`	`SJIS`	Shift-JIS, Japanese
`TIS-620`	`TIS620`	TIS620, Thai
`windows-1255`	`Cp1255`	Windows Hebrew
`windows-1256`	`Cp1256`	Windows Arabic
`windows-1258`	`Cp1258`	Windows Vietnamese
`windows-31j`	`MS932`	Windows Japanese
`x-EUC-CN`	`EUC_CN`	GB2312, EUC encoding, Simplified Chinese
`x-EUC-JP-LINUX`	`EUC_JP_LINUX`	JIS X 0201, 0208, EUC encoding, Japanese
`x-EUC-TW`	`EUC_TW`	CNS11643 (Plane 1-3), EUC encoding, Traditional Chinese
`x-MS950-HKSCS`	`MS950_HKSCS`	Windows Traditional Chinese with Hong Kong extensions
`x-mswin-936`	`MS936`	Windows Simplified Chinese
`x-windows-949`	`MS949`	Windows Korean
`x-windows-950`	`MS950`	Windows Traditional Chinese

Table 12-6. Legacy Character Encodings
Legacy Name	Description
`Cp037`	USA, Canada (Bilingual, French), Netherlands, Portugal, Brazil, Australia
`Cp273`	IBM Austria, Germany
`Cp277`	IBM Denmark, Norway
`Cp278`	IBM Finland, Sweden
`Cp280`	IBM Italy
`Cp284`	IBM Catalan/Spain, Spanish Latin America
`Cp285`	IBM United Kingdom, Ireland
`Cp297`	IBM France
`Cp420`	IBM Arabic
`Cp424`	IBM Hebrew
`Cp437`	MS-DOS United States, Australia, New Zealand, South Africa
`Cp500`	EBCDIC 500V1
`Cp737`	PC Greek
`Cp775`	PC Baltic
`Cp838`	IBM Thailand extended SBCS
`Cp850`	MS-DOS Latin-1
`Cp852`	MS-DOS Latin-2
`Cp855`	IBM Cyrillic
`Cp856`	IBM Hebrew
`Cp857`	IBM Turkish
`Cp858`	Variant of Cp850 with Euro character
`Cp860`	MS-DOS Portuguese
`Cp861`	MS-DOS Icelandic
`Cp862`	PC Hebrew
`Cp863`	MS-DOS Canadian French
`Cp864`	PC Arabic
`Cp865`	MS-DOS Nordic
`Cp866`	MS-DOS Russian
`Cp868`	MS-DOS Pakistan
`Cp869`	IBM Modern Greek
`Cp870`	IBM Multilingual Latin-2
`Cp871`	IBM Iceland
`Cp874`	IBM Thai
`Cp875`	IBM Greek
`Cp918`	IBM Pakistan (Urdu)
`Cp921`	IBM Latvia, Lithuania (AIX, DOS)
`Cp922`	IBM Estonia (AIX, DOS)
`Cp930`	Japanese Katakana-Kanji mixed with 4370 UDC, superset of 5026
`Cp933`	Korean Mixed with 1880 UDC, superset of 5029
`Cp935`	Simplified Chinese Host mixed with 1880 UDC, superset of 5031
`Cp937`	Traditional Chinese Host mixed with 6204 UDC, superset of 5033
`Cp939`	Japanese Latin Kanji mixed with 4370 UDC, superset of 5035
`Cp942`	IBM OS/2 Japanese, superset of `Cp932`
`Cp942C`	Variant of `Cp942`
`Cp943`	IBM OS/2 Japanese, superset of `Cp932` and `Shift-JIS`
`Cp943C`	Variant of `Cp943`
`Cp948`	OS/2 Chinese (Taiwan) superset of 938
`Cp949`	PC Korean
`Cp949C`	Variant of `Cp949`
`Cp950`	PC Chinese (Hong Kong, Taiwan)
`Cp964`	AIX Chinese (Taiwan)
`Cp970`	AIX Korean
`Cp1006`	IBM AIX Pakistan (Urdu)
`Cp1025`	IBM Multilingual Cyrillic: Bulgaria, Bosnia, Herzegovina, Macedonia (FYR)
`Cp1026`	IBM Latin-5, Turkey
`Cp1046`	IBM Arabic - Windows
`Cp1097`	IBM Iran (Farsi)/Persian
`Cp1098`	IBM Iran (Farsi)/Persian (PC)
`Cp1112`	IBM Latvia, Lithuania
`Cp1122`	IBM Estonia
`Cp1123`	IBM Ukraine
`Cp1124`	IBM AIX Ukraine
`Cp1140`	Variant of `Cp037` with Euro character
`Cp1141`	Variant of `Cp273` with Euro character
`Cp1142`	Variant of `Cp277` with Euro character
`Cp1143`	Variant of `Cp278` with Euro character
`Cp1144`	Variant of `Cp280` with Euro character
`Cp1145`	Variant of `Cp284` with Euro character
`Cp1146`	Variant of `Cp285` with Euro character
`Cp1147`	Variant of `Cp297` with Euro character
`Cp1148`	Variant of `Cp500` with Euro character
`Cp1149`	Variant of `Cp871` with Euro character
`Cp1381`	IBM OS/2, DOS People's Republic of China (PRC)
`Cp1383`	IBM AIX People's Republic of China (PRC)
`Cp33722`	IBM-eucJP - Japanese (superset of 5050)
`ISO2022CN`	ISO 2022 CN, Chinese (conversion to Unicode only)
`ISO2022CN_CNS`	CNS 11643 in ISO 2022 CN form, Traditional Chinese (conversion from Unicode only)
`ISO2022CN_GB`	GB 2312 in ISO 2022 CN form, Simplified Chinese (conversion from Unicode only)
`JIS0201`	JIS X 0201, Japanese
`JIS0208`	JIS X 0208, Japanese
`JIS0212`	JIS X 0212, Japanese
`JISAutoDetect`	Detects and converts from Shift-JIS, EUC-JP, ISO 2022 JP (conversion to Unicode only)
`Johab`	Johab, Korean
`MS874`	Windows Thai
`MacArabic`	Macintosh Arabic
`MacCentralEurope`	Macintosh Latin-2
`MacCroatian`	Macintosh Croatian
`MacCyrillic`	Macintosh Cyrillic
`MacDingbat`	Macintosh Dingbat
`MacGreek`	Macintosh Greek
`MacHebrew`	Macintosh Hebrew
`MacIceland`	Macintosh Iceland
`MacRoman`	Macintosh Roman
`MacRomania`	Macintosh Romania
`MacSymbol`	Macintosh Symbol
`MacThai`	Macintosh Thai
`MacTurkish`	Macintosh Turkish
`MacUkraine`	Macintosh Ukraine

Local encoding schemes cannot represent all Unicode characters. If a character cannot be represented, it is transformed to a ?.
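The replacement behavior can be seen in a minimal sketch. It assumes the ISO-8859-1 encoding and the euro sign as the unrepresentable character; the class and method names are hypothetical, and `String.getBytes(Charset)` is used here only because it is the shortest way to trigger the substitution.

```java
import java.nio.charset.StandardCharsets;

class UnmappableDemo {
    // Encodes text as ISO-8859-1; characters outside that encoding
    // are replaced by the encoder's default replacement byte, '?'.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The euro sign (U+20AC) has no representation in ISO-8859-1.
        byte[] bytes = encode("5 \u20AC");
        for (byte b : bytes) {
            System.out.print((char) b);
        }
        System.out.println(); // prints "5 ?"
    }
}
```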

Once you have a character set, you can use it to convert between Unicode strings and encoded byte sequences. Here is how you encode a Unicode string.

 String str = . . .;
 ByteBuffer buffer = cset.encode(str);
 byte[] bytes = buffer.array(); // backing array; may be longer than buffer.limit()

Conversely, to decode a byte sequence, you need a byte buffer. Use the static wrap method of the ByteBuffer class to turn a byte array into a byte buffer. The result of the decode method is a CharBuffer. Call its toString method to get a string.

 byte[] bytes = . . .;
 ByteBuffer bbuf = ByteBuffer.wrap(bytes, offset, length);
 CharBuffer cbuf = cset.decode(bbuf);
 String str = cbuf.toString();
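The two fragments can be combined into a round trip. The following is a sketch only: the class name, the helper methods, and the choice of UTF-8 are assumptions made for illustration. The encode helper copies just the bytes between the buffer's position and limit, because the array returned by array() can contain unused space at the end.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

class CharsetRoundTrip {
    static byte[] encode(Charset cset, String str) {
        ByteBuffer buffer = cset.encode(str);
        // Copy only the encoded bytes, not the whole backing array.
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    static String decode(Charset cset, byte[] bytes, int offset, int length) {
        ByteBuffer bbuf = ByteBuffer.wrap(bytes, offset, length);
        CharBuffer cbuf = cset.decode(bbuf);
        return cbuf.toString();
    }

    public static void main(String[] args) {
        Charset cset = Charset.forName("UTF-8");
        String original = "Jos\u00E9"; // 'é' needs two bytes in UTF-8
        byte[] bytes = encode(cset, original);
        String back = decode(cset, bytes, 0, bytes.length);
        System.out.println(bytes.length + " bytes, round trip ok: " + back.equals(original));
    }
}
```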

 java.nio.charset.Charset 1.4

static SortedMap availableCharsets()
gets all available character sets for this virtual machine. Returns a map whose keys are character set names and whose values are character sets.
static Charset forName(String name)
gets a character set for the given name.
Set aliases()
returns the set of alias names for this character set.
ByteBuffer encode(String str)
encodes the given string into a sequence of bytes.
CharBuffer decode(ByteBuffer buffer)
decodes the given byte sequence. Unrecognized inputs are converted to the Unicode "replacement character" ('\uFFFD').
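As a brief illustration of these methods, the sketch below looks up a character set through an alias and lists what the virtual machine offers. The class and method names are hypothetical; the set of available character sets, and therefore the printed count, varies by platform.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

class CharsetLookup {
    // Resolves any accepted name or alias to the canonical charset name.
    static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        SortedMap<String, Charset> all = Charset.availableCharsets();
        System.out.println(all.size() + " character sets available");

        // "UTF8" is an alias; the canonical name is "UTF-8".
        System.out.println(canonicalName("UTF8"));
        System.out.println(Charset.forName("UTF-8").aliases());
    }
}
```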

 java.nio.ByteBuffer 1.4

byte[] array()
returns the array of bytes that this buffer manages.
static ByteBuffer wrap(byte[] bytes)
static ByteBuffer wrap(byte[] bytes, int offset, int length)
return a byte buffer that manages the given array of bytes or the given range.

 java.nio.CharBuffer 1.4

char[] array()
returns the array of code units that this buffer manages.
char charAt(int index)
returns the code unit at the given index.
String toString()
returns a string consisting of the code units that this buffer manages.

How to Write Text Output

For text output, you want to use a PrintWriter. A print writer can print strings and numbers in text format. Just as a DataOutputStream has useful output methods but no destination, a PrintWriter must be combined with a destination writer.

 PrintWriter out = new PrintWriter(new FileWriter("employee.txt"));

You can also combine a print writer with a destination (output) stream.

 PrintWriter out = new PrintWriter(new FileOutputStream("employee.txt"));

The PrintWriter(OutputStream) constructor automatically adds an OutputStreamWriter to convert Unicode characters to bytes in the stream.

To write to a print writer, you use the same print and println methods that you used with System.out. You can use these methods to print numbers (int, short, long, float, double), characters, Boolean values, strings, and objects.

NOTE

Java veterans may wonder whatever happened to the PrintStream class and to System.out. In Java 1.0, the PrintStream class simply truncated all Unicode characters to ASCII characters by dropping the top byte. Conversely, the readLine method of the DataInputStream turned ASCII to Unicode by setting the top byte to 0. Clearly, that was not a clean or portable approach, and it was fixed with the introduction of readers and writers in Java 1.1. For compatibility with existing code, System.in, System.out, and System.err are still streams, not readers and writers. But now the PrintStream class internally converts Unicode characters to the default host encoding in the same way as the PrintWriter does. Objects of type PrintStream act exactly like print writers when you use the print and println methods, but unlike print writers, they allow you to send raw bytes to them with the write(int) and write(byte[]) methods.

For example, consider this code:

 String name = "Harry Hacker";
 double salary = 75000;
 out.print(name);
 out.print(' ');
 out.println(salary);

This writes the characters

 Harry Hacker 75000.0

to the stream out. The characters are then converted to bytes and end up in the file employee.txt.
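A complete version of this fragment might look as follows. This is a sketch under a few assumptions: the class and method names are hypothetical, the destination is passed in as a Writer so that the same code serves a file or any other writer, and the demonstration writes to a temporary file rather than employee.txt.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;

class WriteEmployee {
    // Prints one employee record as text to the given destination writer.
    static void write(Writer destination, String name, double salary) {
        PrintWriter out = new PrintWriter(destination);
        out.print(name);
        out.print(' ');
        out.println(salary); // a double is printed with its fractional part
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("employee", ".txt");
        try (FileWriter destination = new FileWriter(file.toFile())) {
            write(destination, "Harry Hacker", 75000);
        }
        // Show what ended up in the file: Harry Hacker 75000.0
        System.out.print(new String(Files.readAllBytes(file)));
        Files.delete(file);
    }
}
```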

The println method automatically adds the correct end-of-line string for the target system ("\r\n" on Windows, "\n" on UNIX and Mac OS X, "\r" on classic Mac OS) to the line. This is the string obtained by the call System.getProperty("line.separator").

If the writer is set to autoflush mode, then all characters in the buffer are sent to their destination whenever println is called. (Print writers are always buffered.) By default, autoflushing is not enabled. You can enable or disable autoflushing by using the PrintWriter(Writer, boolean) constructor and passing the appropriate Boolean as the second argument.

 PrintWriter out = new PrintWriter(new FileWriter("employee.txt"), true); // autoflush
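The effect of the autoflush flag can be observed with an in-memory destination. The sketch below assumes a BufferedWriter placed between the print writer and a StringWriter, so that unflushed output is held back visibly; the class and method names are hypothetical.

```java
import java.io.BufferedWriter;
import java.io.PrintWriter;
import java.io.StringWriter;

class AutoflushDemo {
    // Returns the text that has reached the destination right after one println call.
    static String arrivedAfterPrintln(boolean autoFlush) {
        StringWriter destination = new StringWriter();
        PrintWriter out = new PrintWriter(new BufferedWriter(destination), autoFlush);
        out.println("Harry Hacker");
        String arrived = destination.toString(); // inspect before closing
        out.close();
        return arrived;
    }

    public static void main(String[] args) {
        // Without autoflush the line is still sitting in the buffer.
        System.out.println("autoflush off: " + arrivedAfterPrintln(false).length() + " chars arrived");
        // With autoflush println pushes the line through immediately.
        System.out.println("autoflush on:  " + arrivedAfterPrintln(true).length() + " chars arrived");
    }
}
```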

The print methods don't throw exceptions. You can call the checkError method to see if something went wrong with the stream.
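To illustrate, the following sketch prints to a destination that always fails. BrokenWriter is a hypothetical class invented for this demonstration; it is not part of the Java library. The print call returns normally, and only checkError reveals the problem.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;

class CheckErrorDemo {
    // Hypothetical destination that fails on every write.
    static class BrokenWriter extends Writer {
        @Override
        public void write(char[] cbuf, int off, int len) throws IOException {
            throw new IOException("simulated write failure");
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }
    }

    // Prints one line and reports whether the print writer recorded an error.
    static boolean printFails(Writer destination) {
        PrintWriter out = new PrintWriter(destination);
        out.println("Harry Hacker"); // no exception surfaces here
        return out.checkError();     // flushes, then reports the error state
    }

    public static void main(String[] args) {
        System.out.println("working destination: " + printFails(new StringWriter()));
        System.out.println("broken destination:  " + printFails(new BrokenWriter()));
    }
}
```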

NOTE

You cannot write raw bytes to a PrintWriter. Print writers are designed for text output only.

 java.io.PrintWriter 1.1

PrintWriter(Writer out)
creates a new PrintWriter, without automatic line flushing.
Parameters:
out
A character-output writer
PrintWriter(Writer out, boolean autoFlush)
creates a new PrintWriter.
Parameters:
out
A character-output writer

autoFlush
If true, the println methods will flush the output buffer
PrintWriter(OutputStream out)
creates a new PrintWriter, without automatic line flushing, from an existing OutputStream by automatically creating the necessary intermediate OutputStreamWriter.
Parameters:
out
An output stream
PrintWriter(OutputStream out, boolean autoFlush)
creates a new PrintWriter from an existing OutputStream but allows you to determine whether the writer autoflushes or not.
Parameters:
out
An output stream

autoFlush
If true, the println methods will flush the output buffer
void print(Object obj)
prints an object by printing the string resulting from toString.
Parameters:
obj
The object to be printed
void print(String s)
prints a Unicode string.
void println(String s)
prints a string followed by a line terminator. Flushes the stream if the stream is in autoflush mode.
void print(char[] s)
prints an array of Unicode characters.
void print(char c)
prints a Unicode character.
void print(int i)
prints an integer in text format.
void print(long l)
prints a long integer in text format.
void print(float f)
prints a floating-point number in text format.
void print(double d)
prints a double-precision floating-point number in text format.
void print(boolean b)
prints a Boolean value in text format.
boolean checkError()
returns true if a formatting or output error occurred. Once the stream has encountered an error, it is tainted and all calls to checkError return true.

How to Read Text Input

As you know:

To write data in binary format, you use a DataOutputStream.
To write in text format, you use a PrintWriter.

Therefore, you might expect that there is an analog to the DataInputStream that lets you read data in text format. The closest analog is the Scanner class that we have used extensively. However, before JDK 5.0, the only game in town for processing text input was the BufferedReader class. It has a method, readLine, that lets you read a line of text. You need to combine a buffered reader with an input source.

 BufferedReader in = new BufferedReader(new FileReader("employee.txt"));

The readLine method returns null when no more input is available. A typical input loop, therefore, looks like this:

String line;
while ((line = in.readLine()) != null)
{
   // do something with line
}
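Put together with the file reader from above, the loop might be used as in this sketch. The class and method names are hypothetical, and the demonstration creates a temporary file with two made-up lines instead of relying on an existing employee.txt.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReadEmployeeFile {
    // Reads every line of the named text file into a list.
    static List<String> readLines(String fileName) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
            String line;
            while ((line = in.readLine()) != null) { // null signals end of input
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("employee", ".txt");
        Files.write(file, Arrays.asList("Harry Hacker 75000.0", "Carl Cracker 50000.0"));
        for (String line : readLines(file.toString())) {
            System.out.println(line);
        }
        Files.delete(file);
    }
}
```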

The FileReader class already converts bytes to Unicode characters. For other input sources, you need to use the InputStreamReader. Unlike the PrintWriter, the InputStreamReader has no automatic convenience method to bridge the gap between bytes and Unicode characters.

 BufferedReader in2 = new BufferedReader(new InputStreamReader(System.in));
 BufferedReader in3 = new BufferedReader(new InputStreamReader(url.openStream()));
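The same layering works for any input stream. In the sketch below, a ByteArrayInputStream stands in for System.in or a URL stream, and the encoding is named explicitly as UTF-8 instead of relying on the platform default; both choices, like the class and method names, are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class StreamToReader {
    // Wraps a byte stream so that it can be read as lines of UTF-8 text.
    static String firstLine(InputStream stream) throws IOException {
        BufferedReader in = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8));
        return in.readLine();
    }

    public static void main(String[] args) throws IOException {
        // "Grüße" contains two characters that take two bytes each in UTF-8.
        byte[] bytes = "Gr\u00FC\u00DFe\nsecond line\n".getBytes(StandardCharsets.UTF_8);
        String line = firstLine(new ByteArrayInputStream(bytes));
        System.out.println(line.length() + " characters read from " + bytes.length + " bytes");
    }
}
```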

To read numbers from text input, you need to read a string first and then convert it.

 String s = in.readLine();
 double x = Double.parseDouble(s);

That works if there is a single number on each line. Otherwise, you must work harder and break up the input string, for example, by using the StringTokenizer utility class. We see an example of this later in this chapter.
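As a rough sketch of the harder case, a line holding several numbers can be split with a StringTokenizer before each piece is converted. The whitespace delimiter, the sample line, and the class and method names are assumptions for this illustration only.

```java
import java.util.StringTokenizer;

class ParseNumbers {
    // Splits a line on whitespace and converts each token to a double.
    static double[] parseLine(String line) {
        StringTokenizer tokenizer = new StringTokenizer(line); // default delimiters: whitespace
        double[] values = new double[tokenizer.countTokens()];
        for (int i = 0; i < values.length; i++) {
            values[i] = Double.parseDouble(tokenizer.nextToken());
        }
        return values;
    }

    public static void main(String[] args) {
        double sum = 0;
        for (double value : parseLine("75000 3.5 1000")) {
            sum += value;
        }
        System.out.println(sum); // prints 76003.5
    }
}
```

Double.parseDouble throws a NumberFormatException for a token that is not a number, so real input usually needs a try block around the conversion.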

TIP

Java has StringReader and StringWriter classes that allow you to treat a string as if it were a data stream. This can be quite convenient if you want to use the same code to parse both strings and data from a stream.
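For instance, a method that accepts any Reader can be fed from a string during testing and from a file in production. The sketch below assumes one number per line; the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;

class StringStreams {
    // Adds up the numbers found one per line in any character source:
    // a FileReader, an InputStreamReader, or a StringReader.
    static double sum(Reader source) throws IOException {
        BufferedReader in = new BufferedReader(source);
        double total = 0;
        String line;
        while ((line = in.readLine()) != null) {
            total += Double.parseDouble(line.trim());
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Collect print writer output in a string instead of a file.
        StringWriter text = new StringWriter();
        PrintWriter out = new PrintWriter(text);
        out.println(75000.0);
        out.println(50000.0);
        out.flush();

        // Parse the collected text with the same code that would parse a file.
        System.out.println(sum(new StringReader(text.toString()))); // prints 125000.0
    }
}
```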
