Collation


Sorting strings in alphabetical order is easy when the strings are made up of only English ASCII characters. You just compare the strings with the compareTo method of the String class. The value of

 a.compareTo(b) 

is a negative number if a is lexicographically less than b, 0 if they are identical, and positive otherwise.

Unfortunately, unless all your words are in uppercase English ASCII characters, this method is useless. The problem is that the compareTo method in the Java programming language uses the Unicode values of the characters to determine the ordering. For example, lowercase characters have a higher Unicode value than do uppercase characters, and accented characters have even higher values. This leads to absurd results; for example, the following five strings are ordered according to the compareTo method:

 America Zulu ant zebra Ångström
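The following minimal sketch reproduces this ordering by sorting the five sample words with the natural String ordering. The class name CompareToDemo and the helper method sortByCodeUnits are illustrative names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CompareToDemo
{
   // Hypothetical helper: sorts a copy of the words with String.compareTo
   public static List<String> sortByCodeUnits(List<String> words)
   {
      List<String> result = new ArrayList<String>(words);
      Collections.sort(result); // natural ordering, that is, compareTo
      return result;
   }

   public static void main(String[] args)
   {
      List<String> words = Arrays.asList(
         "ant", "zebra", "America", "\u00C5ngstr\u00F6m", "Zulu");
      // 'A' (65) < 'Z' (90) < 'a' (97) < 'z' (122) < '\u00C5' (197)
      System.out.println(sortByCodeUnits(words));
   }
}
```

The comment in main lists the character values that decide the comparison: every uppercase ASCII letter precedes every lowercase one, and the accented capital letter comes last.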

For dictionary ordering, you want to consider upper case and lower case to be equivalent. To an English speaker, the sample list of words would be ordered as

 America Ångström ant zebra Zulu

However, that order would not be acceptable to a Swedish user. In Swedish, the letter Å is different from the letter A, and it is collated after the letter Z! That is, a Swedish user would want the words to be sorted as

 America ant zebra Zulu Ångström

Fortunately, once you are aware of the problem, collation is quite easy. As always, you start by obtaining a Locale object. Then, you call the getInstance factory method to obtain a Collator object. Finally, you use the compare method of the collator, not the compareTo method of the String class, whenever you want to sort strings.

 Locale loc = . . .;
 Collator coll = Collator.getInstance(loc);
 if (coll.compare(a, b) < 0) // a comes before b
    . . .;

Most important, the Collator class implements the Comparator interface. Therefore, you can pass a collator object to the Collections.sort method to sort a list of strings:

 Collections.sort(strings, coll); 
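As a sketch of how the pieces fit together, the following example sorts the sample words with the collators for two locales. The class name LocaleSortDemo and the helper method sortFor are hypothetical names introduced for illustration; the Swedish locale is obtained with Locale.forLanguageTag, which is only one of several ways to construct it.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class LocaleSortDemo
{
   // Hypothetical helper: sorts a copy of the words with the collator of the given locale
   public static List<String> sortFor(Locale loc, List<String> words)
   {
      Collator coll = Collator.getInstance(loc);
      List<String> result = new ArrayList<String>(words);
      Collections.sort(result, coll);
      return result;
   }

   public static void main(String[] args)
   {
      List<String> words = Arrays.asList(
         "Zulu", "ant", "\u00C5ngstr\u00F6m", "zebra", "America");
      System.out.println(sortFor(Locale.ENGLISH, words));
      // The Swedish result depends on the collation data shipped with the runtime
      System.out.println(sortFor(Locale.forLanguageTag("sv-SE"), words));
   }
}
```

With the English collator, the words come out in the dictionary order described earlier. For Swedish, the discussion above leads you to expect the word starting with the accented capital at the end of the list, provided that the runtime includes Swedish collation data.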

You can set a collator's strength to control how selective it should be. Character differences are classified as primary, secondary, and tertiary. For example, in English, the difference between "A" and "Z" is considered primary, the difference between "A" and "Å" is secondary, and between "A" and "a" is tertiary.

By setting the strength of the collator to Collator.PRIMARY, you tell it to pay attention only to primary differences. By setting the strength to Collator.SECONDARY, you instruct the collator to take secondary differences into account. That is, two strings will be more likely to be considered different when the strength is set to "secondary." Table 10-4 shows how a sample set of strings is sorted with the three collation strengths. Note that the strength indicates only whether two strings are considered identical.

Table 10-4. Collations with Different Strengths (English Locale)

Primary               Secondary    Tertiary
Angstrom = Ångström   Angstrom     Angstrom
Ant = ant             Ångström     Ångström
                      Ant = ant    Ant
                                   ant
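The effect of the strength setting can be checked with a short sketch like the following. The class name StrengthDemo and the helper method sameAt are hypothetical; the example sets canonical decomposition explicitly as an assumption, so that the handling of accented characters does not depend on the collator's default mode.

```java
import java.text.Collator;
import java.util.Locale;

public class StrengthDemo
{
   // Hypothetical helper: reports whether a and b are considered identical at the given strength
   public static boolean sameAt(int strength, String a, String b)
   {
      Collator coll = Collator.getInstance(Locale.ENGLISH);
      coll.setStrength(strength);
      // Decompose accented characters explicitly rather than relying on the default mode
      coll.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
      return coll.compare(a, b) == 0;
   }

   public static void main(String[] args)
   {
      String plain = "Angstrom";
      String accented = "\u00C5ngstr\u00F6m";
      System.out.println(sameAt(Collator.PRIMARY, plain, accented));   // true
      System.out.println(sameAt(Collator.SECONDARY, plain, accented)); // false
      System.out.println(sameAt(Collator.SECONDARY, "Ant", "ant"));    // true
      System.out.println(sameAt(Collator.TERTIARY, "Ant", "ant"));     // false
   }
}
```

Each line of output corresponds to one "=" entry, or its absence, in the table of collation strengths.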


Finally, there is one technical setting, the decomposition mode. "Canonical decomposition" is appropriate for most uses. If you choose "no decomposition," then accented characters are not decomposed into their base form + accent. This option is faster, but it gives correct results only when the input does not contain accented characters. (The API documentation for Collator.NO_DECOMPOSITION describes that mode as the default setting, so call setDecomposition explicitly if you depend on a particular mode.)

"Full decomposition" additionally analyzes Unicode variants, that is, Unicode characters that ought to be considered identical. For example, Japanese displays have two versions of English, Katakana and Hiragana characters, called half-width and full-width. The half-width characters have normal character spacing, whereas the full-width characters are spaced in the same grid as the ideographs. (One could argue that this is a presentation issue and it should not have resulted in different Unicode characters, but we don't make the rules.) With full decomposition, half-width and full-width variants of the same letter are recognized as identical.
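To experiment with the decomposition mode outside the sample program, a sketch along the following lines may help. It compares the ordinary letter A with its full-width variant under each mode. The class name DecompositionDemo and the helper method compareWith are hypothetical names, and the printed values are deliberately not listed here because they depend on the runtime.

```java
import java.text.Collator;
import java.util.Locale;

public class DecompositionDemo
{
   // Hypothetical helper: compares a and b with the given decomposition mode
   public static int compareWith(int mode, String a, String b)
   {
      Collator coll = Collator.getInstance(Locale.ENGLISH);
      coll.setDecomposition(mode);
      return coll.compare(a, b);
   }

   public static void main(String[] args)
   {
      String halfWidth = "A";
      String fullWidth = "\uFF21"; // FULLWIDTH LATIN CAPITAL LETTER A
      int[] modes = { Collator.NO_DECOMPOSITION,
         Collator.CANONICAL_DECOMPOSITION, Collator.FULL_DECOMPOSITION };
      // Prints the numeric mode constant followed by the comparison result
      for (int mode : modes)
         System.out.println(mode + ": " + compareWith(mode, halfWidth, fullWidth));
   }
}
```

Going by the description above, a result of 0 would be expected only for the full decomposition mode.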

It is wasteful to have the collator decompose a string many times. If one string is compared many times against other strings, then you can save the decomposition in a collation key object. The getCollationKey method returns a CollationKey object that you can use for further, faster comparisons. Here is an example:

 String a = . . .;
 CollationKey aKey = coll.getCollationKey(a);
 if (aKey.compareTo(coll.getCollationKey(b)) == 0) // fast comparison
    . . .
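Collation keys pay off most when a whole list is sorted, because each string is then compared many times. The following sketch computes every key once and sorts the keys instead of the strings. The class name CollationKeyDemo and the helper method sortWithKeys are hypothetical names used for illustration.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CollationKeyDemo
{
   // Hypothetical helper: sorts words by computing each collation key only once
   public static List<String> sortWithKeys(Collator coll, List<String> words)
   {
      CollationKey[] keys = new CollationKey[words.size()];
      for (int i = 0; i < keys.length; i++)
         keys[i] = coll.getCollationKey(words.get(i));
      Arrays.sort(keys); // CollationKey implements Comparable<CollationKey>
      List<String> result = new ArrayList<String>();
      for (CollationKey key : keys)
         result.add(key.getSourceString()); // recover the original string
      return result;
   }

   public static void main(String[] args)
   {
      Collator coll = Collator.getInstance(Locale.ENGLISH);
      List<String> words = Arrays.asList(
         "Zulu", "ant", "\u00C5ngstr\u00F6m", "zebra", "America");
      System.out.println(sortWithKeys(coll, words));
   }
}
```

Keys are only comparable when they come from the same collator, so the sketch takes the collator as a parameter and uses it for every key.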

The program in Example 10-4 lets you experiment with collation order. Type a word into the text field and click the Add button to add it to the list of words. Each time you add another word, or change the locale, strength, or decomposition mode, the list of words is sorted again. An = sign indicates words that are considered identical (see Figure 10-3).

Figure 10-3. The CollationTest program


In this program, we supply an improved combo box for selecting the locale (see Example 10-5). The combo box solves two problems:

  • The locale strings are sorted, according to the collation order of the current locale. We install a custom model that holds the sorted strings and replace it when the component's locale is changed. (Note how the locale names change when you change the locale.)

  • The combo box holds Locale objects, not strings. We install a custom renderer that displays each locale by calling its getDisplayName method with the component's locale.
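The sorting idea behind the combo box can be tried without any user interface. The sketch below sorts a few locales by their display names in a given display locale, using that locale's collator. The class name LocaleNameSortDemo and the helper method sortedByDisplayName are hypothetical and not part of the example programs.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Locale;

public class LocaleNameSortDemo
{
   // Hypothetical helper: returns a copy of locales sorted by display name in displayLocale
   public static Locale[] sortedByDisplayName(Locale[] locales, final Locale displayLocale)
   {
      final Collator collator = Collator.getInstance(displayLocale);
      Locale[] result = locales.clone();
      Arrays.sort(result, new
         Comparator<Locale>()
         {
            public int compare(Locale a, Locale b)
            {
               return collator.compare(
                  a.getDisplayName(displayLocale), b.getDisplayName(displayLocale));
            }
         });
      return result;
   }

   public static void main(String[] args)
   {
      Locale[] sample = { Locale.US, Locale.GERMANY, Locale.FRANCE, Locale.JAPAN };
      for (Locale display : new Locale[] { Locale.ENGLISH, Locale.GERMAN })
      {
         // The names, and therefore their order, change with the display locale
         for (Locale l : sortedByDisplayName(sample, display))
            System.out.println(l.getDisplayName(display));
         System.out.println();
      }
   }
}
```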

Example 10-4. CollationTest.java

import java.awt.*;
import java.awt.event.*;
import java.text.*;
import java.util.*;
import java.util.List;
import javax.swing.*;

/**
   This program demonstrates collating strings under
   various locales.
*/
public class CollationTest
{
   public static void main(String[] args)
   {
      JFrame frame = new CollationFrame();
      frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
      frame.setVisible(true);
   }
}

/**
   This frame contains combo boxes to pick a locale, collation
   strength and decomposition rules, a text field and button
   to add new strings, and a text area to list the collated
   strings.
*/
class CollationFrame extends JFrame
{
   public CollationFrame()
   {
      setTitle("CollationTest");

      setLayout(new GridBagLayout());
      add(new JLabel("Locale"), new GBC(0, 0).setAnchor(GBC.EAST));
      add(new JLabel("Strength"), new GBC(0, 1).setAnchor(GBC.EAST));
      add(new JLabel("Decomposition"), new GBC(0, 2).setAnchor(GBC.EAST));
      add(addButton, new GBC(0, 3).setAnchor(GBC.EAST));
      add(localeCombo, new GBC(1, 0).setAnchor(GBC.WEST));
      add(strengthCombo, new GBC(1, 1).setAnchor(GBC.WEST));
      add(decompositionCombo, new GBC(1, 2).setAnchor(GBC.WEST));
      add(newWord, new GBC(1, 3).setFill(GBC.HORIZONTAL));
      add(new JScrollPane(sortedWords), new GBC(1, 4).setFill(GBC.BOTH));

      strings.add("America");
      strings.add("ant");
      strings.add("Zulu");
      strings.add("zebra");
      strings.add("\u00C5ngstr\u00F6m");
      strings.add("Angstrom");
      strings.add("Ant");
      updateDisplay();

      addButton.addActionListener(new
         ActionListener()
         {
            public void actionPerformed(ActionEvent event)
            {
               strings.add(newWord.getText());
               updateDisplay();
            }
         });

      ActionListener listener = new
         ActionListener()
         {
            public void actionPerformed(ActionEvent event) { updateDisplay(); }
         };

      localeCombo.addActionListener(listener);
      strengthCombo.addActionListener(listener);
      decompositionCombo.addActionListener(listener);
      pack();
   }

   /**
      Updates the display and collates the strings according
      to the user settings.
   */
   public void updateDisplay()
   {
      Locale currentLocale = (Locale) localeCombo.getSelectedItem();
      localeCombo.setLocale(currentLocale);

      currentCollator = Collator.getInstance(currentLocale);
      currentCollator.setStrength(strengthCombo.getValue());
      currentCollator.setDecomposition(decompositionCombo.getValue());

      Collections.sort(strings, currentCollator);

      sortedWords.setText("");
      for (int i = 0; i < strings.size(); i++)
      {
         String s = strings.get(i);
         if (i > 0 && currentCollator.compare(s, strings.get(i - 1)) == 0)
            sortedWords.append("= ");
         sortedWords.append(s + "\n");
      }
      pack();
   }

   private Locale[] locales;
   private List<String> strings = new ArrayList<String>();
   private Collator currentCollator;
   private JComboBox localeCombo = new LocaleCombo(Collator.getAvailableLocales());

   private EnumCombo strengthCombo = new EnumCombo(Collator.class,
      new String[] { "Primary", "Secondary", "Tertiary" });
   private EnumCombo decompositionCombo = new EnumCombo(Collator.class,
      new String[] { "Canonical Decomposition", "Full Decomposition", "No Decomposition" });
   private JTextField newWord = new JTextField(20);
   private JTextArea sortedWords = new JTextArea(10, 20);
   private JButton addButton = new JButton("Add");
}

Example 10-5. LocaleCombo.java

import java.awt.*;
import java.text.*;
import java.util.*;
import javax.swing.*;
import javax.swing.event.*;

/**
   This combo box lets a user pick a locale. The locales are displayed in the locale of
   the combo box, and sorted according to the collator of the display locale.
*/
public class LocaleCombo extends JComboBox
{
   /**
      Constructs a locale combo that displays an immutable collection of locales.
      @param locales the locales to display in this combo box
   */
   public LocaleCombo(Locale[] locales)
   {
      this.locales = (Locale[]) locales.clone();
      sort();
      setSelectedItem(getLocale());
   }

   public void setLocale(Locale newValue)
   {
      super.setLocale(newValue);
      sort();
   }

   private void sort()
   {
      Object selected = getSelectedItem();
      final Locale loc = getLocale();
      final Collator collator = Collator.getInstance(loc);
      final Comparator<Locale> comp = new
         Comparator<Locale>()
         {
            public int compare(Locale a, Locale b)
            {
               return collator.compare(a.getDisplayName(loc), b.getDisplayName(loc));
            }
         };
      Arrays.sort(locales, comp);
      setModel(new
         ComboBoxModel()
         {
            public Object getElementAt(int i) { return locales[i]; }
            public int getSize() { return locales.length; }
            public void addListDataListener(ListDataListener l) {}
            public void removeListDataListener(ListDataListener l) {}
            public Object getSelectedItem() { return selected >= 0 ? locales[selected] : null; }
            public void setSelectedItem(Object anItem)
            {
               if (anItem == null) selected = -1;
               else selected = Arrays.binarySearch(locales, (Locale) anItem, comp);
            }

            private int selected;
         });
      setSelectedItem(selected);
   }

   public ListCellRenderer getRenderer()
   {
      if (renderer == null)
      {
         final ListCellRenderer originalRenderer = super.getRenderer();
         if (originalRenderer == null) return null;
         renderer = new
            ListCellRenderer()
            {
               public Component getListCellRendererComponent(JList list,
                  Object value, int index, boolean isSelected, boolean cellHasFocus)
               {
                  String renderedValue = ((Locale) value).getDisplayName(getLocale());
                  return originalRenderer.getListCellRendererComponent(
                     list, renderedValue, index, isSelected, cellHasFocus);
               }
            };
      }
      return renderer;
   }

   public void setRenderer(ListCellRenderer newValue)
   {
      renderer = null;
      super.setRenderer(newValue);
   }

   private Locale[] locales;
   private ListCellRenderer renderer;
}


 java.text.Collator 1.1 

  • static Locale[] getAvailableLocales()

    returns an array of Locale objects for which Collator objects are available.

  • static Collator getInstance()

  • static Collator getInstance(Locale l)

    return a collator for the default locale or the given locale.

  • int compare(String a, String b)

    returns a negative value if a comes before b, 0 if they are considered identical, and a positive value otherwise.

  • boolean equals(String a, String b)

    returns true if they are considered identical, false otherwise.

  • void setStrength(int strength)

  • int getStrength()

    set or get the strength of the collator. Stronger collators tell more words apart. Strength values are Collator.PRIMARY, Collator.SECONDARY, and Collator.TERTIARY.

  • void setDecomposition(int decomp)

  • int getDecomposition()

    set or get the decomposition mode of the collator. The more a collator decomposes a string, the more variant forms of a character it recognizes as identical. Decomposition values are Collator.NO_DECOMPOSITION, Collator.CANONICAL_DECOMPOSITION, and Collator.FULL_DECOMPOSITION.

  • CollationKey getCollationKey(String a)

    returns a collation key that contains a decomposition of the characters in a form that can be quickly compared against another collation key.


 java.text.CollationKey 1.1 

  • int compareTo(CollationKey b)

    returns a negative value if this key comes before b, 0 if they are considered identical, and a positive value otherwise.



    Core Java™ 2 Volume II - Advanced Features