Sorting | Effective XML: 50 Specific Ways to Improve Your XML

Many CS101 textbooks demonstrate sorting on strings by using code point order. Unfortunately this does not work in the real world, even in ASCII, much less in Unicode. Most obviously, real sorts (such as the one found in the index in the back of this book) sort capital letters identically to their lowercase equivalents. Lichenstein should appear after language, not before it as it does when ordered by code points. Less obviously, punctuation marks generally appear before all letters, whether they're # (ASCII code point 35), [ (ASCII code point 91), or ~ (ASCII code point 126). And of course sorting is language dependent. While converting all characters to upper case and lexically ordering the resulting strings may give passable results in English, it fails completely in languages like French, where é and è are intermixed with e even though in almost all character sets the code points for é and è come well after z.
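The problem is easy to reproduce. The following is a minimal sketch, not taken from any particular textbook; the class name, method name, and sample words are invented for illustration. It sorts a handful of strings with Java's natural String ordering, which compares UTF-16 code units and so matches code point order for the characters used here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; the name and sample data are illustrative only.
final class CodePointSortDemo {

    // Returns a sorted copy using String.compareTo, which compares
    // UTF-16 code units (equivalent to code point order for these samples).
    static List<String> sortByCodePoint(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "language", "Lichenstein", "~tilde", "#hash", "zebra",
                "\u00e9t\u00e9"); // "été", written with escapes to avoid encoding issues

        // Resulting order: #hash, Lichenstein, language, zebra, ~tilde, été
        // 'L' (76) sorts before 'l' (108); '~' (126) and 'é' (233) sort after 'z' (122).
        System.out.println(sortByCodePoint(words));
    }
}
```

Every complaint from the paragraph above shows up in this one list: the capitalized word precedes its lowercase neighbor, one punctuation mark lands before the letters and another after them, and the accented French word falls after zebra.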

Text comparison and sorting have to be done in a locale-sensitive manner. You need to know which language you're sorting, use an appropriate collation table, and normalize the data before you sort. In Java, the java.text.Collator class performs locale-sensitive string comparison. IBM's aforementioned International Components for Unicode provide more powerful and configurable options. How fancy you want to get depends on your needs and the language or languages of the documents you're processing. The main thing to remember is that any time you're using code point order, you're doing it wrong. Code point order is never adequate for sorting something that will be shown to a person.
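As a rough illustration of the java.text.Collator approach, the sketch below sorts a list with a collator obtained for a given locale. The class name, method name, and sample words are assumptions made for this example. Enabling canonical decomposition is one possible way to have the collator account for normalization differences; whether you need it depends on your data.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; the name and sample data are illustrative only.
final class CollatorSortDemo {

    // Returns a copy of the words sorted by the collation rules of the locale.
    static List<String> sortForLocale(List<String> words, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        // Tertiary strength: case and accents matter only as tie-breakers.
        collator.setStrength(Collator.TERTIARY);
        // Ask the collator to decompose canonically equivalent characters
        // while comparing (an assumption about what the data needs).
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

        List<String> copy = new ArrayList<>(words);
        copy.sort(collator); // Collator implements Comparator<Object>
        return copy;
    }

    public static void main(String[] args) {
        // "language" now precedes "Lichenstein" because 'a' < 'i' once case
        // is no longer the primary distinction.
        System.out.println(sortForLocale(
                Arrays.asList("Lichenstein", "language"), Locale.ENGLISH));

        // "\u00e9t\u00e9" is "été"; it sorts among the e words, before "zebra".
        System.out.println(sortForLocale(
                Arrays.asList("zebra", "\u00e9t\u00e9", "elan"), Locale.FRENCH));
    }
}
```

The locale is passed in explicitly rather than taken from Locale.getDefault(), since the language of the document being processed is not necessarily the language of the machine doing the sorting.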