Normalization Forms | Effective XML: 50 Specific Ways to Improve Your XML

For reasons of compatibility with legacy character sets, as well as out-and-out mistakes, a number of characters have more than one representation in Unicode. For example, the u-umlaut character can be represented as either the single character ü (U+00FC) or as a u followed by a combining diaeresis (U+0308). XML 1.0 ^[1] treats these two forms as distinct. For example, München (M&#xFC;nchen) is not the same as München (Mu&#x308;nchen). You can see that this might be a bit of a problem.
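The distinction is easy to see in any environment that compares strings code point by code point. The following is a minimal sketch in Python (chosen here for illustration only; the variable names are hypothetical) that builds both spellings with explicit escapes so the difference survives copy and paste:

```python
# Two spellings of the same word that render identically on screen.
precomposed = "M\u00fcnchen"   # ü as the single code point U+00FC
decomposed = "Mu\u0308nchen"   # u followed by U+0308 COMBINING DIAERESIS

# A plain comparison sees different code point sequences.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 7 8
```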

^[1] This is one of the few changes that may be made in XML 1.1. However, exactly how or when characters will be normalized has not yet been finalized.

While such differences are not significant to XML parsing, they may very well be significant to applications that build on top of XML. You should normalize all text that comes into your program before acting on it. Unicode defines four separate normalization algorithms, suitable for different needs. Probably the most generally useful is Normalization Form C (NFC). This tends to produce text that is best displayed by existing software. However, for sorting, searching, indexing, and so forth, Normalization Form KC (NFKC) is usually more appropriate. It's similar to NFC except that it's a little more aggressive in unifying characters. In particular, stylistic variants such as the ﬁ ligature (U+FB01) would be replaced by the two letters f and i, whereas NFC would not unify them. Both NFC and NFKC unify canonically equivalent sequences and characters such as ü and a u followed by a combining diaeresis.
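The difference between the two forms can be sketched with Python's standard `unicodedata` module. This is an illustrative assumption, not one of the libraries this item recommends, and the variable names are hypothetical:

```python
import unicodedata

precomposed = "M\u00fcnchen"   # single code point U+00FC
decomposed = "Mu\u0308nchen"   # u + U+0308 COMBINING DIAERESIS

# NFC maps both spellings to the same (precomposed) sequence.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)  # True

# The fi ligature (U+FB01) is only a compatibility variant of "fi":
# NFC leaves it alone, NFKC replaces it with the two letters.
ligature = "\ufb01le"
print(unicodedata.normalize("NFC", ligature) == "file")   # False
print(unicodedata.normalize("NFKC", ligature) == "file")  # True
```

Which form to apply depends on the task: NFC when the text will be displayed or round-tripped, NFKC when it feeds a search index or a sort key.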

Actually implementing the various normalization algorithms is relatively tricky, although it mostly involves table lookups. It is a task best left to the experts. Fortunately, high-quality open source and public domain code is available that can do the job for you.

The Unicode Consortium has published sample algorithms in Java at http://www.unicode.org/unicode/reports/tr15/Normalizer.html.
IBM's International Components for Unicode (ICU, http://oss.software.ibm.com/icu/) is a high-quality class library for performing normalization and many other tasks. The complete library is a little large for some tastes, but many developers rebuild it with only the parts their own programs need.
Perl has the Unicode::Normalize module, which you can download from CPAN.

A Google search will turn up numerous other options. Normalization is still something of an esoteric subject. Few developers realize how much they need this, so it hasn't made its way into the standard libraries in major programming languages just yet. Indeed it may not be necessary in pure ASCII environments. However, as soon as you move beyond the ASCII character set and the English language, normalization of strings becomes very important.
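Some languages have since added a normalizer to their standard libraries; Python's `unicodedata` module is one. As a rough sketch of normalizing text on its way into a program, the example below parses a small document and normalizes the text content before comparing it. The `<city>` element and the `normalized_text` helper are hypothetical names introduced only for illustration:

```python
import unicodedata
import xml.etree.ElementTree as ET


def normalized_text(element, form="NFC"):
    """Return the element's full text content in the given normalization form."""
    return unicodedata.normalize(form, "".join(element.itertext()))


# The city name is written with u followed by a combining diaeresis.
# The XML parser passes the two characters through unchanged.
doc = ET.fromstring("<city>Mu&#x308;nchen</city>")

print(doc.text == "M\u00fcnchen")              # False: raw text is decomposed
print(normalized_text(doc) == "M\u00fcnchen")  # True: NFC composes it
```

Normalizing once at the boundary, rather than at every comparison, keeps the rest of the program free to use ordinary string equality.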