The Complete Stream Zoo

   


Unlike C, which gets by just fine with a single type FILE*, Java has a whole zoo of more than 60 (!) different stream types (see Figures 12-1 and 12-2). Library designers claim that there is a good reason to give users a wide choice of stream types: it is supposed to reduce programming errors. For example, in C, some people think it is a common mistake to send output to a file that was open only for reading. (Well, it is not actually that common.) Naturally, if you do this, the output is ignored at run time. In Java and C++, the compiler catches that kind of mistake because an InputStream (Java) or istream (C++) has no methods for output.

Figure 12-1. Input and output stream hierarchy


Figure 12-2. Reader and writer hierarchy


(We would argue that in C++, and even more so in Java, the main tool that the stream interface designers have against programming errors is intimidation. The sheer complexity of the stream libraries keeps programmers on their toes.)

C++ NOTE

ANSI C++ gives you more stream types than you want, such as istream, ostream, iostream, ifstream, ofstream, fstream, wistream, wifstream, istrstream, and so on (18 classes in all). But Java really goes overboard with streams and gives you separate classes for selecting buffering, lookahead, random access, text formatting, and binary data.


Let us divide the animals in the stream class zoo by how they are used. Four abstract classes are at the base of the zoo: InputStream, OutputStream, Reader, and Writer. You do not make objects of these types, but other methods can return them. For example, as you saw in Chapter 10, the URL class has the method openStream that returns an InputStream. You then use this InputStream object to read from the URL. As we said, the InputStream and OutputStream classes let you read and write only individual bytes and arrays of bytes; they have no methods to read and write strings and numbers. You need more capable child classes for this. For example, DataInputStream and DataOutputStream let you read and write all the basic Java types.

For Unicode text, on the other hand, as we said, you use classes that descend from Reader and Writer. The basic methods of the Reader and Writer classes are similar to the ones for InputStream and OutputStream.

 abstract int read()
 abstract void write(int b)

They work just as the comparable methods do in the InputStream and OutputStream classes except, of course, the read method returns either a Unicode code unit (as an integer between 0 and 65535) or -1 when you have reached the end of the file.
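The end-of-file convention is easiest to see in a read loop. The following is a minimal sketch, not code from the library: the class name ReaderLoopDemo and the helper countCodeUnits are made up for illustration, and a StringReader stands in for a real text file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReaderLoopDemo {
    /** Counts the code units a reader delivers by calling read() until it returns -1. */
    static int countCodeUnits(Reader in) {
        try {
            int count = 0;
            int c;
            // read() yields a code unit in the range 0..65535, or -1 at the end of input
            while ((c = in.read()) != -1) {
                count++;
            }
            return count;
        } catch (IOException e) {
            // wrapped so that callers of this sketch need no throws clause
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countCodeUnits(new StringReader("zoo")));
    }
}
```

Note that the loop variable is an int, not a char. A char cannot hold -1, so storing the result in a char would make the end-of-file test impossible.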

Finally, there are streams that do useful stuff, for example, the ZipInputStream and ZipOutputStream that let you read and write files in the familiar ZIP compression format.

Moreover, JDK 5.0 introduces four new interfaces: Closeable, Flushable, Readable, and Appendable (see Figure 12-3). The first two interfaces are very simple, with methods

 void close() throws IOException 

Figure 12-3. The Closeable, Flushable, Readable, and Appendable interfaces


and

 void flush() 

respectively. The classes InputStream, OutputStream, Reader, and Writer all implement the Closeable interface. OutputStream and Writer implement the Flushable interface.

The Readable interface has a single method

 int read(CharBuffer cb) 

The CharBuffer class has methods for sequential and random read/write access. It represents an in-memory buffer or a memory-mapped file (see page 696).

The Appendable interface has two methods, for appending single characters and character sequences:

 Appendable append(char c)
 Appendable append(CharSequence s)

The CharSequence type is yet another interface, describing minimal properties of a sequence of char values. It is implemented by String, CharBuffer, and StringBuilder/StringBuffer (see page 656).

Of the stream zoo classes, only Writer implements Appendable.
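One practical use of Appendable is to write a method once and let it target either an in-memory buffer or a Writer. The sketch below assumes a hypothetical helper named joinInto (it is not part of the library) and uses StringBuilder and StringWriter as the two targets.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;

class AppendableDemo {
    /** Appends the items, separated by commas, to any Appendable and returns it. */
    static <A extends Appendable> A joinInto(A out, CharSequence... items) {
        try {
            for (int i = 0; i < items.length; i++) {
                if (i > 0) {
                    out.append(',');      // append(char c)
                }
                out.append(items[i]);     // append(CharSequence s)
            }
            return out;
        } catch (IOException e) {
            // Appendable.append may throw IOException; wrapped for this sketch
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The same method fills a StringBuilder and a Writer
        System.out.println(joinInto(new StringBuilder(), "Reader", "Writer"));
        System.out.println(joinInto(new StringWriter(), "Closeable", "Flushable"));
    }
}
```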


 java.io.Closeable 5.0 

  • void close()

    closes this Closeable. This method may throw an IOException.


 java.io.Flushable 5.0 

  • void flush()

    flushes this Flushable.


 java.lang.Readable 5.0 

  • int read(CharBuffer cb)

    attempts to read as many char values into cb as it can hold. Returns the number of values read, or -1 if no further values are available from this Readable.


 java.lang.Appendable 5.0 

  • Appendable append(char c)

    appends the code unit c to this Appendable; returns this.

  • Appendable append(CharSequence cs)

    appends all code units in cs to this Appendable; returns this.


 java.lang.CharSequence 1.4 

  • char charAt(int index)

    returns the code unit at the given index.

  • int length()

    returns the number of code units in this sequence.

  • CharSequence subSequence(int startIndex, int endIndex)

    returns a CharSequence consisting of the code units stored at index startIndex to endIndex - 1.

  • String toString()

    returns a string consisting of the code units of this sequence.

Layering Stream Filters

FileInputStream and FileOutputStream give you input and output streams attached to a disk file. You give the file name or full path name of the file in the constructor. For example,

 FileInputStream fin = new FileInputStream("employee.dat"); 

looks in the current directory for a file named "employee.dat".

CAUTION

Because the backslash character is the escape character in Java strings, be sure to use \\ for Windows-style path names ("C:\\Windows\\win.ini"). In Windows, you can also use a single forward slash ("C:/Windows/win.ini") because most Windows file handling system calls will interpret forward slashes as file separators. However, this is not recommended: the behavior of the Windows system functions is subject to change, and on other operating systems, the file separator may yet be different. Instead, for portable programs, you should use the correct file separator character. It is stored in the constant string File.separator.


You can also use a File object (see page 685 for more on file objects):

 File f = new File("employee.dat");
 FileInputStream fin = new FileInputStream(f);

Like the abstract InputStream and OutputStream classes, these classes support only reading and writing on the byte level. That is, we can only read bytes and byte arrays from the object fin.

 byte b = (byte) fin.read(); 

TIP

Because all the classes in java.io interpret relative path names as starting with the user's current working directory, you may want to know this directory. You can get at this information by a call to System.getProperty("user.dir").


As you will see in the next section, if we just had a DataInputStream, then we could read numeric types:

 DataInputStream din = . . .;
 double s = din.readDouble();

But just as the FileInputStream has no methods to read numeric types, the DataInputStream has no method to get data from a file.

Java uses a clever mechanism to separate two kinds of responsibilities. Some streams (such as the FileInputStream and the input stream returned by the openStream method of the URL class) can retrieve bytes from files and other more exotic locations. Other streams (such as the DataInputStream and the PrintWriter) can assemble bytes into more useful data types. The Java programmer has to combine the two into what are often called filtered streams by feeding an existing stream to the constructor of another stream. For example, to be able to read numbers from a file, first create a FileInputStream and then pass it to the constructor of a DataInputStream.

 FileInputStream fin = new FileInputStream("employee.dat");
 DataInputStream din = new DataInputStream(fin);
 double s = din.readDouble();

It is important to keep in mind that the data input stream that we created with the above code does not correspond to a new disk file. The newly created stream still accesses the data from the file attached to the file input stream, but the point is that it now has a more capable interface.

If you look at Figure 12-1 again, you can see the classes FilterInputStream and FilterOutputStream. You combine their subclasses into a new filtered stream to construct the streams you want. For example, by default, streams are not buffered. That is, every call to read contacts the operating system to ask it to dole out yet another byte. If you want buffering and the data input methods for a file named employee.dat in the current directory, you need to use the following rather monstrous sequence of constructors:

 DataInputStream din = new DataInputStream(
    new BufferedInputStream(
       new FileInputStream("employee.dat")));

Notice that we put the DataInputStream last in the chain of constructors because we want to use the DataInputStream methods, and we want them to use the buffered read method. Regardless of the ugliness of the above code, it is necessary: you must be prepared to continue layering stream constructors until you have access to the functionality you want.
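To see the layering work end to end without touching the disk, here is a small sketch that writes doubles through a buffered data output stream and reads them back through a buffered data input stream. The class and method names are invented for this example, byte array streams stand in for the file streams, and the try-with-resources statement assumes Java 7 or later.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class LayeringDemo {
    /** Writes the values through DataOutputStream -> BufferedOutputStream -> byte array. */
    static byte[] writeDoubles(double... values) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(sink))) {
            for (double v : values) {
                out.writeDouble(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // closing the outermost stream flushed the buffer into sink
        return sink.toByteArray();
    }

    /** Reads the values back through DataInputStream -> BufferedInputStream -> byte array. */
    static double[] readDoubles(byte[] data) {
        double[] result = new double[data.length / 8];   // each double occupies 8 bytes
        try (DataInputStream din = new DataInputStream(
                new BufferedInputStream(new ByteArrayInputStream(data)))) {
            for (int i = 0; i < result.length; i++) {
                result[i] = din.readDouble();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] back = readDoubles(writeDoubles(1.5, -2.25));
        System.out.println(back[0] + " " + back[1]);
    }
}
```

Replacing the byte array streams with FileOutputStream and FileInputStream gives the file-based chain shown above; nothing else in the code needs to change.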

Sometimes you'll need to keep track of the intermediate streams when chaining them together. For example, when reading input, you often need to peek at the next byte to see if it is the value that you expect. Java provides the PushbackInputStream for this purpose.

 PushbackInputStream pbin = new PushbackInputStream(
    new BufferedInputStream(
       new FileInputStream("employee.dat")));

Now you can speculatively read the next byte

 int b = pbin.read(); 

and throw it back if it isn't what you wanted.

 if (b != '<') pbin.unread(b); 

But reading and unreading are the only methods that apply to the pushback input stream. If you want to look ahead and also read numbers, then you need both a pushback input stream and a data input stream reference.

 DataInputStream din = new DataInputStream(
    pbin = new PushbackInputStream(
       new BufferedInputStream(
          new FileInputStream("employee.dat"))));
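The following sketch shows why both references are needed. It assumes a made-up data layout, an optional '<' marker byte followed by a 4-byte integer, and reads from a byte array instead of employee.dat. The names PushbackDataDemo and readIntAfterOptionalMarker are illustrative only.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

class PushbackDataDemo {
    /** Skips an optional leading '<' (hypothetical marker), then reads an int. */
    static int readIntAfterOptionalMarker(byte[] data) {
        // keep a reference to the intermediate stream so we can call unread on it
        PushbackInputStream pbin = new PushbackInputStream(
                new BufferedInputStream(new ByteArrayInputStream(data)));
        try (DataInputStream din = new DataInputStream(pbin)) {
            int b = pbin.read();               // speculative read of one byte
            if (b != '<' && b != -1) {
                pbin.unread(b);                // not the marker: push it back
            }
            return din.readInt();              // sees the pushed-back byte first
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withMarker = {'<', 0, 0, 4, (byte) 0xD2};
        byte[] withoutMarker = {0, 0, 4, (byte) 0xD2};
        System.out.println(readIntAfterOptionalMarker(withMarker));
        System.out.println(readIntAfterOptionalMarker(withoutMarker));
    }
}
```

This works because DataInputStream does no buffering of its own, so a byte pushed back on pbin is the next byte that din reads.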

Of course, in the stream libraries of other programming languages, niceties such as buffering and lookahead are automatically taken care of, so it is a bit of a hassle in Java that one has to resort to layering stream filters in these cases. But the ability to mix and match filter classes to construct truly useful sequences of streams does give you an immense amount of flexibility. For example, you can read numbers from a compressed ZIP file by using the following sequence of streams (see Figure 12-4).

 ZipInputStream zin = new ZipInputStream(new FileInputStream("employee.zip"));
 DataInputStream din = new DataInputStream(zin);

Figure 12-4. A sequence of filtered streams


(See the section on ZIP file streams starting on page 643 for more on Java's ability to handle ZIP files.)
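As a rough illustration of the ZIP chain, the sketch below builds a tiny archive in memory and reads a double back out of it. The class name, method names, and the entry name "employee.dat" are assumptions made for the example. Note one detail that the two-line snippet above leaves out: a ZipInputStream must be positioned on an entry with getNextEntry before any data can be read.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipDataDemo {
    /** Creates an in-memory ZIP archive with one entry holding the given doubles. */
    static byte[] zipDoubles(String entryName, double... values) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(sink)) {
            zout.putNextEntry(new ZipEntry(entryName));
            DataOutputStream dout = new DataOutputStream(zout);
            for (double v : values) {
                dout.writeDouble(v);
            }
            dout.flush();
            zout.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Reads the first double of the first entry in the archive. */
    static double readFirstDouble(byte[] zipped) {
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipped))) {
            if (zin.getNextEntry() == null) {
                throw new IllegalArgumentException("archive has no entries");
            }
            DataInputStream din = new DataInputStream(zin);   // decompressed bytes -> numbers
            return din.readDouble();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readFirstDouble(zipDoubles("employee.dat", 3.5, 1.0)));
    }
}
```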

All in all, apart from the rather monstrous constructors that are needed to layer streams, the ability to mix and match streams is a very useful feature of Java!


 java.io.FileInputStream 1.0 

  • FileInputStream(String name)

    creates a new file input stream, using the file whose path name is specified by the name string.

  • FileInputStream(File f)

    creates a new file input stream, using the information encapsulated in the File object. (The File class is described at the end of this chapter.)


 java.io.FileOutputStream 1.0 

  • FileOutputStream(String name)

    creates a new file output stream specified by the name string. Path names that are not absolute are resolved relative to the current working directory. Caution: This method automatically deletes any existing file with the same name.

  • FileOutputStream(String name, boolean append)

    creates a new file output stream specified by the name string. Path names that are not absolute are resolved relative to the current working directory. If the append parameter is true, then data are added at the end of the file. An existing file with the same name will not be deleted.

  • FileOutputStream(File f)

    creates a new file output stream using the information encapsulated in the File object. (The File class is described at the end of this chapter.) Caution: This method automatically deletes any existing file with the same name as the name of f.


 java.io.BufferedInputStream 1.0 

  • BufferedInputStream(InputStream in)

    creates a new buffered stream with a default buffer size. A buffered input stream reads characters from a stream without causing a device access every time. When the buffer is empty, a new block of data is read into the buffer.

  • BufferedInputStream(InputStream in, int n)

    creates a new buffered stream with a user-defined buffer size.


 java.io.BufferedOutputStream 1.0 

  • BufferedOutputStream(OutputStream out)

    creates a new buffered stream with a default buffer size. A buffered output stream collects characters to be written without causing a device access every time. When the buffer fills up or when the stream is flushed, the data are written.

  • BufferedOutputStream(OutputStream out, int n)

    creates a new buffered stream with a user-defined buffer size.


 java.io.PushbackInputStream 1.0 

  • PushbackInputStream(InputStream in)

    constructs a stream with one-byte lookahead.

  • PushbackInputStream(InputStream in, int size)

    constructs a stream with a pushback buffer of specified size.

  • void unread(int b)

    pushes back a byte, which is retrieved again by the next call to read. You can push back only one byte at a time.

    Parameters:
      b    The byte to be read again


Data Streams

You often need to write the result of a computation or read one back. The data streams support methods for reading back all the basic Java types. To write a number, character, Boolean value, or string, use one of the following methods of the DataOutput interface:

 writeChars
 writeByte
 writeInt
 writeShort
 writeLong
 writeFloat
 writeDouble
 writeChar
 writeBoolean
 writeUTF

For example, writeInt always writes an integer as a 4-byte binary quantity regardless of the number of digits, and writeDouble always writes a double as an 8-byte binary quantity. The resulting output is not humanly readable, but the space needed will be the same for each value of a given type and reading it back in will be faster. (See the section on the PrintWriter class later in this chapter for how to output numbers as human-readable text.)

NOTE

There are two different methods of storing integers and floating-point numbers in memory, depending on the platform you are using. Suppose, for example, you are working with a 4-byte int, say the decimal number 1234, or 4D2 in hexadecimal (1234 = 4 x 256 + 13 x 16 + 2). This can be stored in such a way that the first of the 4 bytes in memory holds the most significant byte (MSB) of the value: 00 00 04 D2. This is the so-called big-endian method. Or we can start with the least significant byte (LSB) first: D2 04 00 00. This is called, naturally enough, the little-endian method. For example, the SPARC uses big-endian; the Pentium, little-endian. This can lead to problems. When a C or C++ file is saved, the data are saved exactly as the processor stores them. That makes it challenging to move even the simplest data files from one platform to another. In Java, all values are written in the big-endian fashion, regardless of the processor. That makes Java data files platform independent.
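You can check the byte order for yourself with a few lines of code. This sketch (the class name BigEndianDemo and the helper hexOfInt are invented here) writes an int with writeInt into a byte array and prints the bytes in hexadecimal.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class BigEndianDemo {
    /** Returns the bytes that writeInt produces for value, as space-separated hex. */
    static String hexOfInt(int value) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeInt(value);              // always 4 bytes, most significant first
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : sink.toByteArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));   // mask to print as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hexOfInt(1234));   // prints 00 00 04 D2
    }
}
```

The output is the same on every processor, which is exactly the platform independence described in the note.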


The writeUTF method writes string data by using a modified version of 8-bit Unicode Transformation Format. Instead of simply using the standard UTF-8 encoding (which is shown in Table 12-1), character strings are first represented in UTF-16 (see Table 12-2) and then the result is encoded using the UTF-8 rules. The modified encoding is different for characters with code higher than 0xFFFF. It is used for backwards compatibility with virtual machines that were built when Unicode had not yet grown beyond 16 bits.

Table 12-1. UTF-8 Encoding

Character Range     Encoding
0...7F              0 a6 a5 a4 a3 a2 a1 a0
80...7FF            110 a10 a9 a8 a7 a6    10 a5 a4 a3 a2 a1 a0
800...FFFF          1110 a15 a14 a13 a12    10 a11 a10 a9 a8 a7 a6    10 a5 a4 a3 a2 a1 a0
10000...10FFFF      11110 a20 a19 a18    10 a17 a16 a15 a14 a13 a12    10 a11 a10 a9 a8 a7 a6    10 a5 a4 a3 a2 a1 a0


Table 12-2. UTF-16 Encoding

Character Range     Encoding
0...FFFF            a15 a14 a13 a12 a11 a10 a9 a8    a7 a6 a5 a4 a3 a2 a1 a0
10000...10FFFF      110110 b19 b18    b17 b16 a15 a14 a13 a12 a11 a10    110111 a9 a8    a7 a6 a5 a4 a3 a2 a1 a0

where b19 b18 b17 b16 = a20 a19 a18 a17 a16 - 1


Because nobody else uses this modification of UTF-8, you should only use the writeUTF method to write strings that are intended for a Java virtual machine; for example, if you write a program that generates bytecodes. Use the writeChars method for other purposes.
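The difference between the two encodings shows up as soon as a string contains a character beyond 0xFFFF. The sketch below compares the number of bytes that writeUTF emits (not counting the 2-byte length prefix that writeUTF always writes first) with the length of the standard UTF-8 encoding. The class and method names are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ModifiedUtf8Demo {
    /** Number of payload bytes writeUTF produces, excluding its 2-byte length prefix. */
    static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.size() - 2;
    }

    /** Number of bytes in the standard UTF-8 encoding of s. */
    static int standardUtf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "A";
        String supplementary = "\uD83D\uDE00";   // U+1F600, stored as two UTF-16 code units
        System.out.println(modifiedUtf8Length(ascii) + " vs " + standardUtf8Length(ascii));
        System.out.println(modifiedUtf8Length(supplementary) + " vs "
                + standardUtf8Length(supplementary));
    }
}
```

For the supplementary character, writeUTF encodes each of the two surrogate code units as a 3-byte sequence (6 bytes), whereas standard UTF-8 uses a single 4-byte sequence. A standard UTF-8 decoder will not accept the 6-byte form.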

NOTE

See RFC 2279 (http://ietf.org/rfc/rfc2279.txt) and RFC 2781 (http://ietf.org/rfc/rfc2781.txt) for definitions of UTF-8 and UTF-16.


To read the data back in, use the following methods:

 readInt
 readDouble
 readShort
 readChar
 readLong
 readBoolean
 readFloat
 readUTF


NOTE

The binary data format is compact and platform independent. Except for the UTF strings, it is also suited to random access. The major drawback is that binary files are not readable by humans.



 java.io.DataInput 1.0 

  • boolean readBoolean()

    reads in a Boolean value.

  • byte readByte()

    reads an 8-bit byte.

  • char readChar()

    reads a 16-bit Unicode character.

  • double readDouble()

    reads a 64-bit double.

  • float readFloat()

    reads a 32-bit float.

  • void readFully(byte[] b)

    reads bytes into the array b, blocking until all bytes are read.

    Parameters:
      b      The buffer into which the data is read

  • void readFully(byte[] b, int off, int len)

    reads bytes into the array b, blocking until all bytes are read.

    Parameters:
      b      The buffer into which the data is read
      off    The start offset of the data
      len    The maximum number of bytes to read


  • int readInt()

    reads a 32-bit integer.

  • String readLine()

    reads in a line that has been terminated by a \n, \r, \r\n, or EOF. Returns a string containing all bytes in the line converted to Unicode characters.

  • long readLong()

    reads a 64-bit long integer.

  • short readShort()

    reads a 16-bit short integer.

  • String readUTF()

    reads a string of characters in "modified UTF-8" format.

  • int skipBytes(int n)

    skips n bytes, blocking until all bytes are skipped.

    Parameters:
      n    The number of bytes to be skipped



 java.io.DataOutput 1.0 

  • void writeBoolean(boolean b)

    writes a Boolean value.

  • void writeByte(int b)

    writes an 8-bit byte.

  • void writeChar(int c)

    writes a 16-bit Unicode character.

  • void writeChars(String s)

    writes all characters in the string.

  • void writeDouble(double d)

    writes a 64-bit double.

  • void writeFloat(float f)

    writes a 32-bit float.

  • void writeInt(int i)

    writes a 32-bit integer.

  • void writeLong(long l)

    writes a 64-bit long integer.

  • void writeShort(int s)

    writes a 16-bit short integer.

  • void writeUTF(String s)

    writes a string of characters in "modified UTF-8" format.

Random-Access File Streams

The RandomAccessFile stream class lets you find or write data anywhere in a file. It implements both the DataInput and DataOutput interfaces. Disk files are random access, but streams of data from a network are not. You open a random-access file either for reading only or for both reading and writing. You specify the option by using the string "r" (for read access) or "rw" (for read/write access) as the second argument in the constructor.

 RandomAccessFile in = new RandomAccessFile("employee.dat", "r");
 RandomAccessFile inOut = new RandomAccessFile("employee.dat", "rw");

When you open an existing file as a RandomAccessFile, it does not get deleted.

A random-access file also has a file pointer setting that comes with it. The file pointer always indicates the position of the next byte that will be read or written. The seek method sets the file pointer to an arbitrary byte position within the file. The argument to seek is a long integer between zero and the length of the file in bytes.

The getFilePointer method returns the current position of the file pointer.

To read from a random-access file, you use the same methods such as readInt and readChar as for DataInputStream objects. That is no accident. These methods are actually defined in the DataInput interface that both DataInputStream and RandomAccessFile implement.

Similarly, to write a random-access file, you use the same writeInt and writeChar methods as in the DataOutputStream class. These methods are defined in the DataOutput interface that is common to both classes.

The advantage of having the RandomAccessFile class implement both DataInput and DataOutput is that this lets you use or write methods whose argument types are the DataInput and DataOutput interfaces.

 class Employee
 {
    . . .
    void read(DataInput in) { . . . }
    void write(DataOutput out) { . . . }
 }

Note that the read method can handle either a DataInputStream or a RandomAccessFile object because both of these classes implement the DataInput interface. The same is true for the write method.
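Here is one way this might look in practice. The sketch assumes a made-up fixed-size record layout (a 4-byte id followed by an 8-byte salary), writes a few records to a temporary file, and then uses seek to jump directly to one of them. All names in the example are hypothetical; only the RandomAccessFile, DataInput, and DataOutput calls come from the library.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

class RandomAccessDemo {
    static final int RECORD_SIZE = 4 + 8;   // assumed layout: int id + double salary

    /** Works with any DataOutput: a DataOutputStream or a RandomAccessFile. */
    static void writeRecord(DataOutput out, int id, double salary) throws IOException {
        out.writeInt(id);
        out.writeDouble(salary);
    }

    /** Works with any DataInput: a DataInputStream or a RandomAccessFile. */
    static double readSalary(DataInput in) throws IOException {
        in.readInt();                        // skip over the id field
        return in.readDouble();
    }

    /** Stores the salaries as records in a temporary file and reads back record n. */
    static double salaryOfRecord(int n, double... salaries) {
        try {
            File f = File.createTempFile("employee", ".dat");
            try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
                for (int i = 0; i < salaries.length; i++) {
                    writeRecord(raf, i, salaries[i]);
                }
                raf.seek((long) n * RECORD_SIZE);   // jump straight to record n
                return readSalary(raf);
            } finally {
                f.delete();                          // the file is closed by now
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(salaryOfRecord(1, 50000.0, 75000.0, 40000.0));
    }
}
```

Fixed-size records are what make the seek arithmetic possible: record n always starts at byte n * RECORD_SIZE.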


 java.io.RandomAccessFile 1.0 

  • RandomAccessFile(String file, String mode)

  • RandomAccessFile(File file, String mode)

    Parameters:
      file    The file to be opened
      mode    "r" for read-only mode, "rw" for read/write mode, "rws" for read/write mode with synchronous disk writes of data and metadata for every update, and "rwd" for read/write mode with synchronous disk writes of data only.


  • long getFilePointer()

    returns the current location of the file pointer.

  • void seek(long pos)

    sets the file pointer to pos bytes from the beginning of the file.

  • long length()

    returns the length of the file in bytes.

Text Streams

In the last section, we discussed binary input and output. While binary I/O is fast and efficient, it is not easily readable by humans. In this section, we will focus on text I/O. For example, if the integer 1234 is saved in binary, it is written as the sequence of bytes 00 00 04 D2 (in hexadecimal notation). In text format, it is saved as the string "1234".

Unfortunately, doing this in Java requires a bit of work, because, as you know, Java uses Unicode characters. That is, the character encoding for the string "1234" really is 00 31 00 32 00 33 00 34 (in hex). However, at the present time most environments in which your Java programs will run use their own character encoding. This may be a single-byte, double-byte, or variable-byte scheme. For example, if you use Windows, the string would be written in ASCII, as 31 32 33 34, without the extra zero bytes. If the Unicode encoding were written into a text file, then it would be quite unlikely that the resulting file would be humanly readable with the tools of the host environment. To overcome this problem, Java has a set of stream filters that bridges the gap between Unicode-encoded strings and the character encoding used by the local operating system. All of these classes descend from the abstract Reader and Writer classes, and the names are reminiscent of the ones used for binary data. For example, the InputStreamReader class turns an input stream that contains bytes in a particular character encoding into a reader that emits Unicode characters. Similarly, the OutputStreamWriter class turns a stream of Unicode characters into a stream of bytes in a particular character encoding.

For example, here is how you make an input reader that reads keystrokes from the console and automatically converts them to Unicode.

 InputStreamReader in = new InputStreamReader(System.in); 

This input stream reader assumes the normal character encoding used by the host system. For example, under Windows, it uses the ISO 8859-1 encoding (also known as ISO Latin-1 or, among Windows programmers, as "ANSI code"). You can choose a different encoding by specifying it in the constructor for the InputStreamReader. This takes the form

 InputStreamReader(InputStream, String) 

where the string describes the encoding scheme that you want to use. For example,

 InputStreamReader in = new InputStreamReader(
    new FileInputStream("kremlin.dat"), "ISO8859_5");

The next section has more information on character sets.
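To make the role of the encoding concrete, the following sketch pushes the string "1234" through an OutputStreamWriter with two different encodings and prints the resulting bytes, then decodes them again with an InputStreamReader. The class and method names are invented for the example, and byte array streams stand in for files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

class TextEncodingDemo {
    /** Encodes text into bytes using the named character encoding. */
    static byte[] encode(String text, String encoding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, encoding)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    /** Decodes bytes in the named character encoding back into a string. */
    static String decode(byte[] data, String encoding) {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(data), encoding)) {
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Formats bytes as space-separated two-digit hex values. */
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(encode("1234", "US-ASCII")));    // 31 32 33 34
        System.out.println(hex(encode("1234", "UTF-16BE")));    // 00 31 00 32 00 33 00 34
        System.out.println(decode(encode("1234", "UTF-16BE"), "UTF-16BE"));
    }
}
```

The same four characters produce four bytes in one encoding and eight in the other, which is why a reader must be told (or must correctly assume) the encoding that the writer used.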

Because it is so common to want to attach a reader or writer to a file, a pair of convenience classes, FileReader and FileWriter, is provided for this purpose. For example, the writer definition

 FileWriter out = new FileWriter("output.txt"); 

is equivalent to

 OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream("output.txt"));

Character Sets

In the past, international character sets have been handled rather unsystematically throughout the Java library. The java.nio package introduced in JDK 1.4 unifies character set conversion with the introduction of the Charset class. (Note that the s is lower case.)

A character set maps between sequences of two-byte Unicode code units and byte sequences used in a local character encoding. One of the most popular character encodings is ISO-8859-1, a single-byte encoding of the first 256 Unicode characters. Gaining in importance is ISO-8859-15, which replaces some of the less useful characters of ISO-8859-1 with accented letters used in French and Finnish, and, more important, replaces the "international currency" character ¤ with the Euro symbol (€) in code point 0xA4. Other examples for character encodings are the variable-byte encodings commonly used for Japanese and Chinese.

The Charset class uses the character set names standardized in the IANA Character Set Registry (http://www.iana.org/assignments/character-sets). These names differ slightly from those used in previous versions. For example, the "official" name of ISO-8859-1 is now "ISO-8859-1" and no longer "ISO8859_1", which was the preferred name up to JDK 1.3. For compatibility with other naming conventions, each character set can have a number of aliases. For example, ISO-8859-1 has aliases

 ISO8859-1 ISO_8859_1 ISO8859_1 ISO_8859-1 ISO_8859-1:1987 8859_1 latin1 l1 csISOLatin1 iso-ir-100 cp819 IBM819 IBM-819 819 

Character set names are case insensitive.

You obtain a Charset by calling the static forName method with either the official name or one of its aliases:

 Charset cset = Charset.forName("ISO-8859-1"); 

The aliases method returns a Set object of the aliases. A Set is a collection that we discuss in Volume 2; here is the code to iterate through the set elements:

 Set<String> aliases = cset.aliases();
 for (String alias : aliases)
    System.out.println(alias);
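As a quick check that aliases and case-insensitive lookup behave as described, consider this sketch. The class CharsetAliasDemo and its two helpers are hypothetical; the calls to forName, name, and aliases are the real Charset API.

```java
import java.nio.charset.Charset;

class CharsetAliasDemo {
    /** Resolves an official name or alias to the charset's canonical name. */
    static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    /** Tests whether alias is among the aliases of the named charset, ignoring case. */
    static boolean hasAlias(String charsetName, String alias) {
        for (String a : Charset.forName(charsetName).aliases()) {
            if (a.equalsIgnoreCase(alias)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("latin1"));       // alias
        System.out.println(canonicalName("iso8859_1"));    // legacy name, lower case
        System.out.println(hasAlias("ISO-8859-1", "latin1"));
    }
}
```

Both lookups in main should resolve to the same official name, ISO-8859-1. Keep in mind that forName throws an unchecked exception if the name is not recognized.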

NOTE

An excellent reference for the "ISO 8859 alphabet soup" is http://czyborra.com/charsets/iso8859.html.


International versions of Java support many more encodings. There is even a mechanism for adding additional character set providers; see the JDK documentation for details. To find out which character sets are available in a particular implementation, call the static availableCharsets method. It returns a SortedMap, another collection class. Use this code to find out the names of all available character sets:

 Map<String, Charset> charsets = Charset.availableCharsets();
 for (String name : charsets.keySet())
    System.out.println(name);
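The map returned by availableCharsets can also be used to verify that a particular implementation supplies the encodings you depend on. The sketch below checks for the six standard names that Table 12-3 lists as required; the class name and the missingRequired helper are assumptions made for this example.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;

class RequiredCharsetsDemo {
    // the standard names listed as required in Table 12-3
    static final String[] REQUIRED = {
        "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE"
    };

    /** Returns the required charset names that this implementation does not provide. */
    static List<String> missingRequired() {
        SortedMap<String, Charset> available = Charset.availableCharsets();
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!available.containsKey(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println("Available: " + Charset.availableCharsets().size());
        System.out.println("Missing required: " + missingRequired());
    }
}
```

On a conforming implementation the list of missing names should be empty. For a single name, the static method Charset.isSupported gives the same answer without building the whole map.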

Table 12-3 lists the character encodings that every Java implementation is required to have. Table 12-4 lists the encoding schemes that the JDK installs by default. The character sets in Tables 12-5 and 12-6 are installed only on operating systems that use non-European languages. The encoding schemes in Table 12-6 are supplied for compatibility with previous versions of the JDK.

Table 12-3. Required Character Encodings

Charset Standard Name    Legacy Name              Description
US-ASCII                 ASCII                    American Standard Code for Information Interchange
ISO-8859-1               ISO8859_1                ISO 8859-1, Latin alphabet No. 1
UTF-8                    UTF8                     Eight-bit Unicode Transformation Format
UTF-16                   UTF-16                   Sixteen-bit Unicode Transformation Format, byte order specified by an optional initial byte-order mark
UTF-16BE                 UnicodeBigUnmarked       Sixteen-bit Unicode Transformation Format, big-endian byte order
UTF-16LE                 UnicodeLittleUnmarked    Sixteen-bit Unicode Transformation Format, little-endian byte order


Table 12-4. Basic Character Encodings

Charset Standard Name    Legacy Name    Description
ISO-8859-2               ISO8859_2      ISO 8859-2, Latin alphabet No. 2
ISO-8859-4               ISO8859_4      ISO 8859-4, Latin alphabet No. 4
ISO-8859-5               ISO8859_5      ISO 8859-5, Latin/Cyrillic alphabet
ISO-8859-7               ISO8859_7      ISO 8859-7, Latin/Greek alphabet
ISO-8859-9               ISO8859_9      ISO 8859-9, Latin alphabet No. 5
ISO-8859-13              ISO8859_13     ISO 8859-13, Latin alphabet No. 7
ISO-8859-15              ISO8859_15     ISO 8859-15, Latin alphabet No. 9
windows-1250             Cp1250         Windows Eastern European
windows-1251             Cp1251         Windows Cyrillic
windows-1252             Cp1252         Windows Latin-1
windows-1253             Cp1253         Windows Greek
windows-1254             Cp1254         Windows Turkish
windows-1257             Cp1257         Windows Baltic


Table 12-5. Extended Character Encodings

Charset Standard Name    Legacy Name     Description
Big5                     Big5            Big5, Traditional Chinese
Big5-HKSCS               Big5_HKSCS      Big5 with Hong Kong extensions, Traditional Chinese
EUC-JP                   EUC_JP          JIS X 0201, 0208, 0212, EUC encoding, Japanese
EUC-KR                   EUC_KR          KS C 5601, EUC encoding, Korean
GB18030                  GB18030         Simplified Chinese, PRC Standard
GBK                      GBK             GBK, Simplified Chinese
ISCII91                  ISCII91         ISCII91 encoding of Indic scripts
ISO-2022-JP              ISO2022JP       JIS X 0201, 0208 in ISO 2022 form, Japanese
ISO-2022-KR              ISO2022KR       ISO 2022 KR, Korean
ISO-8859-3               ISO8859_3       ISO 8859-3, Latin alphabet No. 3
ISO-8859-6               ISO8859_6       ISO 8859-6, Latin/Arabic alphabet
ISO-8859-8               ISO8859_8       ISO 8859-8, Latin/Hebrew alphabet
Shift_JIS                SJIS            Shift-JIS, Japanese
TIS-620                  TIS620          TIS620, Thai
windows-1255             Cp1255          Windows Hebrew
windows-1256             Cp1256          Windows Arabic
windows-1258             Cp1258          Windows Vietnamese
windows-31j              MS932           Windows Japanese
x-EUC-CN                 EUC_CN          GB2312, EUC encoding, Simplified Chinese
x-EUC-JP-LINUX           EUC_JP_LINUX    JIS X 0201, 0208, EUC encoding, Japanese
x-EUC-TW                 EUC_TW          CNS11643 (Plane 1-3), EUC encoding, Traditional Chinese
x-MS950-HKSCS            MS950_HKSCS     Windows Traditional Chinese with Hong Kong extensions
x-mswin-936              MS936           Windows Simplified Chinese
x-windows-949            MS949           Windows Korean
x-windows-950            MS950           Windows Traditional Chinese


Table 12-6. Legacy Character Encodings

Legacy Name         Description
Cp037               USA, Canada (Bilingual, French), Netherlands, Portugal, Brazil, Australia
Cp273               IBM Austria, Germany
Cp277               IBM Denmark, Norway
Cp278               IBM Finland, Sweden
Cp280               IBM Italy
Cp284               IBM Catalan/Spain, Spanish Latin America
Cp285               IBM United Kingdom, Ireland
Cp297               IBM France
Cp420               IBM Arabic
Cp424               IBM Hebrew
Cp437               MS-DOS United States, Australia, New Zealand, South Africa
Cp500               EBCDIC 500V1
Cp737               PC Greek
Cp775               PC Baltic
Cp838               IBM Thailand extended SBCS
Cp850               MS-DOS Latin-1
Cp852               MS-DOS Latin-2
Cp855               IBM Cyrillic
Cp856               IBM Hebrew
Cp857               IBM Turkish
Cp858               Variant of Cp850 with Euro character
Cp860               MS-DOS Portuguese
Cp861               MS-DOS Icelandic
Cp862               PC Hebrew
Cp863               MS-DOS Canadian French
Cp864               PC Arabic
Cp865               MS-DOS Nordic
Cp866               MS-DOS Russian
Cp868               MS-DOS Pakistan
Cp869               IBM Modern Greek
Cp870               IBM Multilingual Latin-2
Cp871               IBM Iceland
Cp874               IBM Thai
Cp875               IBM Greek
Cp918               IBM Pakistan (Urdu)
Cp921               IBM Latvia, Lithuania (AIX, DOS)
Cp922               IBM Estonia (AIX, DOS)
Cp930               Japanese Katakana-Kanji mixed with 4370 UDC, superset of 5026
Cp933               Korean Mixed with 1880 UDC, superset of 5029
Cp935               Simplified Chinese Host mixed with 1880 UDC, superset of 5031
Cp937               Traditional Chinese Host mixed with 6204 UDC, superset of 5033
Cp939               Japanese Latin Kanji mixed with 4370 UDC, superset of 5035
Cp942               IBM OS/2 Japanese, superset of Cp932
Cp942C              Variant of Cp942
Cp943               IBM OS/2 Japanese, superset of Cp932 and Shift-JIS
Cp943C              Variant of Cp943
Cp948               OS/2 Chinese (Taiwan) superset of 938
Cp949               PC Korean
Cp949C              Variant of Cp949
Cp950               PC Chinese (Hong Kong, Taiwan)
Cp964               AIX Chinese (Taiwan)
Cp970               AIX Korean
Cp1006              IBM AIX Pakistan (Urdu)
Cp1025              IBM Multilingual Cyrillic: Bulgaria, Bosnia, Herzegovina, Macedonia (FYR)
Cp1026              IBM Latin-5, Turkey
Cp1046              IBM Arabic - Windows
Cp1097              IBM Iran (Farsi)/Persian
Cp1098              IBM Iran (Farsi)/Persian (PC)
Cp1112              IBM Latvia, Lithuania
Cp1122              IBM Estonia
Cp1123              IBM Ukraine
Cp1124              IBM AIX Ukraine
Cp1140              Variant of Cp037 with Euro character
Cp1141              Variant of Cp273 with Euro character
Cp1142              Variant of Cp277 with Euro character
Cp1143              Variant of Cp278 with Euro character
Cp1144              Variant of Cp280 with Euro character
Cp1145              Variant of Cp284 with Euro character
Cp1146              Variant of Cp285 with Euro character
Cp1147              Variant of Cp297 with Euro character
Cp1148              Variant of Cp500 with Euro character
Cp1149              Variant of Cp871 with Euro character
Cp1381              IBM OS/2, DOS People's Republic of China (PRC)
Cp1383              IBM AIX People's Republic of China (PRC)
Cp33722             IBM-eucJP - Japanese (superset of 5050)
ISO2022CN           ISO 2022 CN, Chinese (conversion to Unicode only)
ISO2022CN_CNS       CNS 11643 in ISO 2022 CN form, Traditional Chinese (conversion from Unicode only)
ISO2022CN_GB        GB 2312 in ISO 2022 CN form, Simplified Chinese (conversion from Unicode only)
JIS0201             JIS X 0201, Japanese
JIS0208             JIS X 0208, Japanese
JIS0212             JIS X 0212, Japanese
JISAutoDetect       Detects and converts from Shift-JIS, EUC-JP, ISO 2022 JP (conversion to Unicode only)
Johab               Johab, Korean
MS874               Windows Thai
MacArabic           Macintosh Arabic
MacCentralEurope    Macintosh Latin-2
MacCroatian         Macintosh Croatian
MacCyrillic         Macintosh Cyrillic
MacDingbat          Macintosh Dingbat
MacGreek            Macintosh Greek
MacHebrew           Macintosh Hebrew
MacIceland          Macintosh Iceland
MacRoman            Macintosh Roman
MacRomania          Macintosh Romania
MacSymbol           Macintosh Symbol
MacThai             Macintosh Thai
MacTurkish          Macintosh Turkish
MacUkraine          Macintosh Ukraine


Local encoding schemes cannot represent all Unicode characters. If a character cannot be represented, it is transformed to a ?.

Once you have a character set, you can use it to convert between Unicode strings and encoded byte sequences. Here is how you encode a Unicode string.

 String str = . . .;
 ByteBuffer buffer = cset.encode(str);
 byte[] bytes = buffer.array();

Conversely, to decode a byte sequence, you need a byte buffer. Use the static wrap method of the ByteBuffer class to turn a byte array into a byte buffer. The result of the decode method is a CharBuffer. Call its toString method to get a string.

 byte[] bytes = . . .;
 ByteBuffer bbuf = ByteBuffer.wrap(bytes, offset, length);
 CharBuffer cbuf = cset.decode(bbuf);
 String str = cbuf.toString();
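The two conversions can be combined into a round trip. The following is a minimal sketch, not a definitive implementation; the class name CharsetRoundTrip and its helper methods are our own, hypothetical names. One detail deserves care: the array that backs the buffer returned by encode may be longer than the encoded data, so the sketch copies only the bytes between the buffer's position and its limit instead of calling array().

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

class CharsetRoundTrip
{
   /** Encodes str and returns exactly the encoded bytes. */
   static byte[] encode(Charset cset, String str)
   {
      ByteBuffer buffer = cset.encode(str);
      // The backing array may have unused space at the end,
      // so copy only the bytes between position and limit.
      byte[] bytes = new byte[buffer.remaining()];
      buffer.get(bytes);
      return bytes;
   }

   /** Decodes all bytes of the array into a string. */
   static String decode(Charset cset, byte[] bytes)
   {
      ByteBuffer bbuf = ByteBuffer.wrap(bytes, 0, bytes.length);
      CharBuffer cbuf = cset.decode(bbuf);
      return cbuf.toString();
   }

   public static void main(String[] args)
   {
      Charset cset = Charset.forName("UTF-8");
      String original = "Gr\u00FC\u00DFe"; // five characters, two of them non-ASCII
      byte[] bytes = encode(cset, original);
      System.out.println(bytes.length); // 7: each non-ASCII character takes two bytes
      System.out.println(decode(cset, bytes).equals(original)); // true
   }
}
```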


 java.nio.charset.Charset 1.4 

  • static SortedMap availableCharsets()

    gets all available character sets for this virtual machine. Returns a map whose keys are character set names and whose values are character sets.

  • static Charset forName(String name)

    gets a character set for the given name.

  • Set aliases()

    returns the set of alias names for this character set.

  • ByteBuffer encode(String str)

    encodes the given string into a sequence of bytes.

  • CharBuffer decode(ByteBuffer buffer)

    decodes the given byte sequence. Unrecognized inputs are converted to the Unicode "replacement character" ('\uFFFD').


 java.nio.ByteBuffer 1.4 

  • byte[] array()

    returns the array of bytes that this buffer manages.

  • static ByteBuffer wrap(byte[] bytes)

  • static ByteBuffer wrap(byte[] bytes, int offset, int length)

    return a byte buffer that manages the given array of bytes or the given range.


 java.nio.CharBuffer 1.4 

  • char[] array()

    returns the array of code units that this buffer manages.

  • char charAt(int index)

    returns the code unit at the given index.

  • String toString()

    returns a string consisting of the code units that this buffer manages.

How to Write Text Output

For text output, you want to use a PrintWriter. A print writer can print strings and numbers in text format. Just as a DataOutputStream has useful output methods but no destination, a PrintWriter must be combined with a destination writer.

 PrintWriter out = new PrintWriter(new FileWriter("employee.txt")); 

You can also combine a print writer with a destination (output) stream.

 PrintWriter out = new PrintWriter(new FileOutputStream("employee.txt")); 

The PrintWriter(OutputStream) constructor automatically adds an OutputStreamWriter to convert Unicode characters to bytes in the stream.
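The automatically added OutputStreamWriter uses the default host encoding. If you need a specific encoding, you can supply the OutputStreamWriter yourself. Here is a minimal sketch under a few assumptions: the class name ExplicitEncodingDemo and the method writeUtf8 are hypothetical, and a ByteArrayOutputStream stands in for the FileOutputStream so that the result can be inspected in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;

class ExplicitEncodingDemo
{
   /** Prints text to out as UTF-8, regardless of the default host encoding. */
   static void writeUtf8(OutputStream out, String text)
   {
      // The OutputStreamWriter converts characters to bytes in the named encoding.
      PrintWriter writer = new PrintWriter(
         new OutputStreamWriter(out, Charset.forName("UTF-8")));
      writer.print(text);
      writer.flush(); // push the buffered characters through to the stream
   }

   public static void main(String[] args)
   {
      // Stand-in for new FileOutputStream("employee.txt")
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      writeUtf8(bytes, "Caf\u00E9");
      System.out.println(bytes.size()); // 5: the accented letter takes two bytes in UTF-8
   }
}
```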

To write to a print writer, you use the same print and println methods that you used with System.out. You can use these methods to print numbers (int, short, long, float, double), characters, Boolean values, strings, and objects.

NOTE

Java veterans may wonder whatever happened to the PrintStream class and to System.out. In Java 1.0, the PrintStream class simply truncated all Unicode characters to ASCII characters by dropping the top byte. Conversely, the readLine method of the DataInputStream turned ASCII to Unicode by setting the top byte to 0. Clearly, that was not a clean or portable approach, and it was fixed with the introduction of readers and writers in Java 1.1. For compatibility with existing code, System.in, System.out, and System.err are still streams, not readers and writers. But now the PrintStream class internally converts Unicode characters to the default host encoding in the same way as the PrintWriter does. Objects of type PrintStream act exactly like print writers when you use the print and println methods, but unlike print writers, they allow you to send raw bytes to them with the write(int) and write(byte[]) methods.


For example, consider this code:

 String name = "Harry Hacker";
 double salary = 75000;
 out.print(name);
 out.print(' ');
 out.println(salary);

This writes the characters

 Harry Hacker 75000.0 

to the stream out. The characters are then converted to bytes and end up in the file employee.txt.

The println method automatically adds the correct end-of-line character for the target system ("\r\n" on Windows, "\n" on UNIX, "\r" on Macs) to the line. This is the string obtained by the call System.getProperty("line.separator").
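To see exactly which characters the print calls produce, you can send them to an in-memory destination. The sketch below assumes a StringWriter as the destination; the class name PrintDemo and the method printEmployee are hypothetical. Note that a double is printed in the format of Double.toString, so the salary shows up with a trailing .0, followed by the line separator of the host system.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

class PrintDemo
{
   /** Returns exactly the characters that the three print calls produce. */
   static String printEmployee(String name, double salary)
   {
      StringWriter target = new StringWriter(); // in-memory destination
      PrintWriter out = new PrintWriter(target);
      out.print(name);
      out.print(' ');
      out.println(salary); // formatted like Double.toString(salary)
      out.flush();
      return target.toString();
   }

   public static void main(String[] args)
   {
      String result = printEmployee("Harry Hacker", 75000);
      String eol = System.getProperty("line.separator");
      System.out.println(result.equals("Harry Hacker 75000.0" + eol)); // true
   }
}
```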

If the writer is set to autoflush mode, then all characters in the buffer are sent to their destination whenever println is called. (Print writers are always buffered.) By default, autoflushing is not enabled. You can enable or disable autoflushing by using the PrintWriter(Writer, boolean) constructor and passing the appropriate Boolean as the second argument.

 PrintWriter out = new PrintWriter(new FileWriter("employee.txt"), true); // autoflush 
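The effect of the autoflush flag can be observed by checking how much output has reached the destination right after a println call. This is a sketch only; the class name AutoflushDemo and the method bytesAfterPrintln are hypothetical, and a ByteArrayOutputStream serves as a destination whose size can be queried.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintWriter;

class AutoflushDemo
{
   /** Returns how many bytes reached the destination right after one println call. */
   static int bytesAfterPrintln(boolean autoFlush)
   {
      ByteArrayOutputStream destination = new ByteArrayOutputStream();
      PrintWriter out = new PrintWriter(destination, autoFlush);
      out.println("Harry Hacker");
      // No explicit flush() here: we look at what has arrived so far.
      return destination.size();
   }

   public static void main(String[] args)
   {
      System.out.println(bytesAfterPrintln(false));    // 0: the text is still in the buffer
      System.out.println(bytesAfterPrintln(true) > 0); // true: println flushed the buffer
   }
}
```

Without autoflush, the output appears only after you call flush or close.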

The print methods don't throw exceptions. You can call the checkError method to see if something went wrong with the stream.
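The following sketch shows checkError at work. It assumes a deliberately broken destination, the hypothetical class FailingOutputStream, whose write method always throws an IOException. The println call still returns normally, and the failure becomes visible only through checkError.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.PrintWriter;

class CheckErrorDemo
{
   /** An output stream that fails on every write; it simulates a broken destination. */
   static class FailingOutputStream extends OutputStream
   {
      @Override
      public void write(int b) throws IOException
      {
         throw new IOException("simulated write failure");
      }
   }

   /** Prints one line to the destination and reports whether an error occurred. */
   static boolean hasError(OutputStream destination)
   {
      PrintWriter out = new PrintWriter(destination);
      out.println("Harry Hacker"); // does not throw, even if the destination is broken
      return out.checkError();     // flushes, then reports whether anything went wrong
   }

   public static void main(String[] args)
   {
      System.out.println(hasError(new FailingOutputStream()));   // true
      System.out.println(hasError(new ByteArrayOutputStream())); // false
   }
}
```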

NOTE

You cannot write raw bytes to a PrintWriter. Print writers are designed for text output only.



 java.io.PrintWriter 1.1 

  • PrintWriter(Writer out)

    creates a new PrintWriter, without automatic line flushing.

    Parameters:

    out

    A character-output writer


  • PrintWriter(Writer out, boolean autoFlush)

    creates a new PrintWriter.

    Parameters:

    out

    A character-output writer

     

    autoFlush

    If true, the println methods will flush the output buffer


  • PrintWriter(OutputStream out)

    creates a new PrintWriter, without automatic line flushing, from an existing OutputStream by automatically creating the necessary intermediate OutputStreamWriter.

    Parameters:

    out

    An output stream


  • PrintWriter(OutputStream out, boolean autoFlush)

    creates a new PrintWriter from an existing OutputStream but allows you to determine whether the writer autoflushes or not.

    Parameters:

    out

    An output stream

     

    autoFlush

    If true, the println methods will flush the output buffer


  • void print(Object obj)

    prints an object by printing the string resulting from toString.

    Parameters:

    obj

    The object to be printed


  • void print(String s)

    prints a Unicode string.

  • void println(String s)

    prints a string followed by a line terminator. Flushes the stream if the stream is in autoflush mode.

  • void print(char[] s)

    prints an array of Unicode characters.

  • void print(char c)

    prints a Unicode character.

  • void print(int i)

    prints an integer in text format.

  • void print(long l)

    prints a long integer in text format.

  • void print(float f)

    prints a floating-point number in text format.

  • void print(double d)

    prints a double-precision floating-point number in text format.

  • void print(boolean b)

    prints a Boolean value in text format.

  • boolean checkError()

    returns true if a formatting or output error occurred. Once the stream has encountered an error, it is tainted and all calls to checkError return true.

How to Read Text Input

As you know:

  • To write data in binary format, you use a DataOutputStream.

  • To write in text format, you use a PrintWriter.

Therefore, you might expect that there is an analog to the DataInputStream that lets you read data in text format. The closest analog is the Scanner class that we have used extensively. However, before JDK 5.0, the only game in town for processing text input was the BufferedReader class; it has a method, readLine, that lets you read a line of text. You need to combine a buffered reader with an input source.

 BufferedReader in = new BufferedReader(new FileReader("employee.txt")); 

The readLine method returns null when no more input is available. A typical input loop, therefore, looks like this:


String line;
while ((line = in.readLine()) != null)
{
   // do something with line
}
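Put together, the input loop might be used as in the following sketch. The class name ReadLinesDemo and the method readLines are hypothetical, and the program writes a small temporary file first so that it has something to read.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

class ReadLinesDemo
{
   /** Reads all lines of the file with the readLine loop. */
   static List<String> readLines(File file) throws IOException
   {
      List<String> lines = new ArrayList<String>();
      BufferedReader in = new BufferedReader(new FileReader(file));
      try
      {
         String line;
         while ((line = in.readLine()) != null)
         {
            lines.add(line); // the line terminator is not part of line
         }
      }
      finally
      {
         in.close();
      }
      return lines;
   }

   public static void main(String[] args) throws IOException
   {
      File file = File.createTempFile("employee", ".txt");
      file.deleteOnExit();
      PrintWriter out = new PrintWriter(new FileWriter(file));
      out.println("Harry Hacker");
      out.println("Carl Cracker");
      out.close();
      System.out.println(readLines(file)); // [Harry Hacker, Carl Cracker]
   }
}
```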

The FileReader class already converts bytes to Unicode characters. For other input sources, you need to use the InputStreamReader. Unlike the PrintWriter, the InputStreamReader has no automatic convenience method to bridge the gap between bytes and Unicode characters.

 BufferedReader in2 = new BufferedReader(new InputStreamReader(System.in));
 BufferedReader in3 = new BufferedReader(new InputStreamReader(url.openStream()));
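An InputStreamReader that is constructed from a stream alone assumes the default host encoding. When you know how the bytes were encoded, you can pass the encoding name as a second constructor argument. The sketch below illustrates the difference; the class name InputEncodingDemo and the method firstLine are hypothetical, and a ByteArrayInputStream stands in for a real input source.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

class InputEncodingDemo
{
   /** Reads the first line from a byte stream, decoding the bytes with the named encoding. */
   static String firstLine(InputStream bytes, String encoding) throws IOException
   {
      BufferedReader in = new BufferedReader(new InputStreamReader(bytes, encoding));
      return in.readLine();
   }

   public static void main(String[] args) throws IOException
   {
      // The bytes 0xC3 0xA9 are the UTF-8 encoding of the single character U+00E9.
      byte[] data = { 'C', 'a', 'f', (byte) 0xC3, (byte) 0xA9 };
      // Decoded as UTF-8, the five bytes yield four characters.
      System.out.println(firstLine(new ByteArrayInputStream(data), "UTF-8").length());      // 4
      // Decoded as ISO-8859-1, every byte becomes one character.
      System.out.println(firstLine(new ByteArrayInputStream(data), "ISO-8859-1").length()); // 5
   }
}
```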

To read numbers from text input, you need to read a string first and then convert it.

 String s = in.readLine();
 double x = Double.parseDouble(s);

That works if there is a single number on each line. Otherwise, you must work harder and break up the input string, for example, by using the StringTokenizer utility class. We see an example of this later in this chapter.
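As a small illustration of breaking up a line, the following sketch adds up several numbers on one line. It assumes that the numbers are separated by whitespace; the class name TokenizeDemo and the method sumLine are hypothetical.

```java
import java.util.StringTokenizer;

class TokenizeDemo
{
   /** Adds up all numbers on one line; the numbers are separated by whitespace. */
   static double sumLine(String line)
   {
      double sum = 0;
      StringTokenizer tokens = new StringTokenizer(line); // default delimiters: whitespace
      while (tokens.hasMoreTokens())
      {
         sum += Double.parseDouble(tokens.nextToken());
      }
      return sum;
   }

   public static void main(String[] args)
   {
      System.out.println(sumLine("75000 50000.5  40000")); // 165000.5
   }
}
```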

TIP

Java has StringReader and StringWriter classes that allow you to treat a string as if it were a data stream. This can be quite convenient if you want to use the same code to parse both strings and data from a stream.
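Here is a minimal sketch of that idea. The method sum accepts any Reader, so the same code can parse a string or a file; the class name StringStreamDemo and the method name are hypothetical, and the input is assumed to hold one number per line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;

class StringStreamDemo
{
   /** Sums the numbers in a text source that holds one number per line. */
   static double sum(Reader source) throws IOException
   {
      BufferedReader in = new BufferedReader(source);
      double total = 0;
      String line;
      while ((line = in.readLine()) != null)
      {
         total += Double.parseDouble(line);
      }
      return total;
   }

   public static void main(String[] args) throws IOException
   {
      // Collect printed output in a string instead of a file.
      StringWriter text = new StringWriter();
      PrintWriter out = new PrintWriter(text);
      out.println(75000);
      out.println(50000.5);
      out.flush();

      // Feed the string back in; a FileReader would work with the same method.
      System.out.println(sum(new StringReader(text.toString()))); // 125000.5
   }
}
```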



       



    Core Java 2 Volume I - Fundamentals
    Core Java(TM) 2, Volume I--Fundamentals (7th Edition) (Core Series) (Core Series)
    ISBN: 0131482025
    EAN: 2147483647
    Year: 2003
    Pages: 132
