The integers to which Unicode maps characters can be encoded in a variety of ways. The simplest approach is to write each integer as a normal big-endian 4-byte int. This encoding scheme is called UCS-4. However, it's rather inefficient because the vast majority of characters seen in practice have code points less than 65,535, and in English text most are less than 127.
In practice, most Unicode text is encoded in either UTF-16 or UTF-8. UTF-16 uses two bytes for characters with code points less than or equal to 65,535 and four bytes for characters with code points greater than 65,535. It comes in both big-endian and little-endian formats. The endianness is normally indicated by an initial byte order mark. That is, the first character in the file is the zero-width nonbreaking space, code point 65,279. In big-endian UTF-16, this is the two bytes 0xFEFF (in hexadecimal). In little-endian UTF-16, this is the reverse, 0xFFFE.
UTF-16 encodes characters with code points from 0 to 65,535 (the Basic Multilingual Plane, or BMP for short) as 2-byte unsigned ints. Characters from beyond the BMP are encoded as surrogate pairs made up of four bytes: first a high surrogate, then a low surrogate. The Java char data type is really a big-endian UTF-16 code point, not a Unicode character, though the difference is significant only for characters from outside the BMP.
To see how this works, consider a character from outside the BMP in a typical UCS-4 (4-byte) big-endian representation. This is composed of four bytes of eight bits each. I will label the bits as x0 through x31:
In reality, the high-order byte is always 0. The first three bits of the second byte are also always 0, so these don't need to be encoded. Only bits x0 through x20 need to be encoded. These are encoded in four bytes, like this:
Here, w1w2w3w4 is the 4-byte number formed by subtracting 1 from the 5-bit number x20x19x18x17x16. There are simple, efficient algorithms for breaking up non-BMP characters into these surrogate pairs and recomposing them. Most of the time you'll let the Reader and Writer classes do this for you automatically. The main thing you need to remember is that a Java char is really a UTF-16 code point, and while 99% of the time this is the same as one Unicode character, there are cases where it takes two chars to make a single character.
Streams in Memory
The File System
Working with Files
File Dialogs and Choosers
Character Sets and Unicode
Readers and Writers
Formatted I/O with java.text
The Java Communications API
The J2ME Generic Connection Framework