The prime problem with Unicode in most programming languages is that a Unicode character is not equivalent to the native char type. For instance, on many systems, a C char is one signed byte. This allows 128 characters, which covers the ASCII range (barely) but fails as soon as you need a ¼ or any other character outside ASCII. On other systems, the C char type is an unsigned byte that provides 256 characters. This works adequately for most Latin-alphabet languages, though different languages have to use different character sets, and not all languages can be processed simultaneously. It fails completely when faced with a language such as Japanese that has more than 256 characters. This has led to the development of wide character types such as wchar_t. However, the effective size of wchar_t varies from one byte to four bytes, depending on platform and compiler, and may use the platform default character set rather than Unicode.
In Java a char is two unsigned bytes and is based on Unicode. That's room for 65,536 different characters, which should be enough except that some languages (notably Chinese) have more than 70,000 characters. (The exact number is debated, but everyone agrees it's a lot.) Thus even two-byte chars like those found in Java don't adequately handle Unicode, which covers Chinese and a lot more. To some extent, this was hidden up through Unicode 3.0 because Unicode hadn't actually defined any characters with code points beyond 65,535; but that began to change in Unicode 3.1, and the process seems likely to accelerate in the future.
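The mismatch is easy to demonstrate in Java itself. This sketch (the class name is arbitrary) tries to squeeze U+1D11E, MUSICAL SYMBOL G CLEF, into a single char:

```java
public class CharTooSmall {
    public static void main(String[] args) {
        int gClef = 0x1D11E;      // U+1D11E MUSICAL SYMBOL G CLEF
        char c = (char) gClef;    // the cast silently discards the high bits
        System.out.println((int) c == gClef); // false: one char can't hold it
    }
}
```

The cast compiles without complaint, but the value that comes out is not the character that went in.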
One common misconception about Unicode is that each Unicode character occupies exactly two bytes. In fact, Unicode has space for over one million characters, which clearly can't be represented in two bytes. This misconception arose because, until Unicode 3.1, all Unicode characters were assigned to code points below 65,536. However, Unicode 3.1 assigned several blocks of characters, including musical and mathematical symbols, to Plane 1, at code points above 65,535. More will be assigned there in the future. This means that more than two bytes, and more than one Java char, are sometimes necessary to hold a single Unicode character. In fact, a Java char does not encode a Unicode character. Rather, it represents a UTF-16 code unit. Characters with code points less than or equal to 65,535 are represented with one Java char. Characters with code points greater than 65,535 are represented with two Java chars forming a surrogate pair. However, the two chars that make up the surrogate pair encode just one Unicode character.
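The Character and String classes expose this mapping directly. A brief sketch (JDK 5 or later), again using U+1D11E MUSICAL SYMBOL G CLEF:

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // Character.toChars expands one code point into its UTF-16 code units
        char[] units = Character.toChars(0x1D11E);
        System.out.println(units.length);                        // 2: a surrogate pair
        System.out.println(Character.isHighSurrogate(units[0])); // true (0xD834)
        System.out.println(Character.isLowSurrogate(units[1]));  // true (0xDD1E)

        // A String built from the pair is two chars long but one character
        String s = new String(units);
        System.out.println(s.length());                      // 2
        System.out.println(s.codePointCount(0, s.length())); // 1
        System.out.println(s.codePointAt(0) == 0x1D11E);     // true
    }
}
```

Note that String.length() counts chars (code units), not characters; codePointCount() is the method that counts actual Unicode characters.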
Normally, this distinction is not a big deal when processing XML. The parser passes the application strings or char arrays that contain all the necessary chars. On rare occasions, a parser that splits XML text across multiple method calls may pass half of one character (one half of a surrogate pair) at the end of one string and the other half at the beginning of the next, but this is unlikely to occur and generally doesn't cause major problems even when it does.
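If you do want to guard against this, the usual defensive idiom in SAX, sketched below, is to accumulate character data in a buffer and only examine it at element boundaries, by which point any split pair has been reassembled. The class name and the texts field are hypothetical, not part of the SAX API:

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.helpers.DefaultHandler;

// A sketch: buffer character data until an element ends, so a surrogate
// pair split across two characters() calls is whole before anything reads it.
public class TextBuffering extends DefaultHandler {

    public final List<String> texts = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // may end mid-pair; that's fine here
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        texts.add(buffer.toString());     // the element's text is now complete
        buffer.setLength(0);
    }
}
```

SAX explicitly permits a parser to report contiguous character data in as many characters() calls as it likes, so this buffering is good practice quite apart from surrogate pairs.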
The proper handling of Unicode is slightly (but only slightly) more troublesome when writing XML. Here you cannot simply assume that the characters and the strings are as you need them to be. You need to use classes that convert the native char type, whatever it is, into proper UTF-8 or UTF-16. Alternatively, you can use another encoding, as long as the XML document carries an encoding declaration identifying which encoding you are using. In Java, the OutputStreamWriter class is up to this task. Python makes this fairly easy, and Perl 5.6 and later generate UTF-8 by default. Standard C and C++ don't have anything like this, but most platform-dependent APIs, such as the Unicode stream I/O functions in the Microsoft runtime library on Windows (fwprintf, fputwc, fputws, and so on) or glibc 2.2 and later on Linux, support Unicode output, typically in UTF-8 and/or UTF-16.
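In Java, the wiring might look like this minimal sketch; the filename and document content are made up for illustration:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class UTF8Output {
    public static void main(String[] args) throws IOException {
        OutputStream fout = new FileOutputStream("greeting.xml");
        // OutputStreamWriter converts Java's internal UTF-16 chars
        // into UTF-8 bytes as it writes
        Writer out = new OutputStreamWriter(fout, "UTF-8");
        out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        // U+1D11E, a surrogate pair in the String, becomes four UTF-8 bytes
        out.write("<clef>\uD834\uDD1E</clef>\n");
        out.flush();
        out.close();
    }
}
```

The key point is that the encoding named in the OutputStreamWriter constructor and the encoding named in the XML declaration must agree; the writer does the conversion, and the declaration tells parsers what to expect.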