Putting Data in Order | Java Oracle Database Development

One of the most basic yet essential operations we can perform on data is to put it in order. Imagine how useless a phone book would be if the names appeared in random order. One of the most valuable features of a relational database is the flexibility it allows us in sorting our data.

To sort the data returned from a SELECT statement, we add an ORDER BY clause at the end of the other options we've seen so far. This is the basic format:

 SELECT select_list FROM table_list [WHERE condition] [GROUP BY clause] ORDER BY order_list

The order_list is similar to the select list: It is a list of column names or expressions containing column names that are used in decreasing order of priority in determining the sort order.

For example, suppose that we have the following table of names:

 FIRST_NAME  MIDDLE_INITIAL  LAST_NAME  PHONE John        B               Smith      555-2121 John        A               Smith      555-1234 James       R               Smith      555-1212 Maria       T               Garcia     555-4321

A simple select will return the rows in arbitrary order (as a practical matter, this is most likely in the order in which they were entered). But there is no guarantee that repeating a query will return the results in the same order. (The order can change if data is exported then imported, for example; for this reason, it is better to consider the data unordered.) If we sort using only last name, all the Smiths will fall together, but the first names will be in an arbitrary order.

 SQL> SELECT * FROM NAMES ORDER BY LAST_NAME; FIRST_NAME   M LAST_NAME            PHONE_NUMBER ------------ - -------------------- ------------- Maria        T Garcia               555-4321 John         B Smith                555-2121 John         A Smith                555-1234 James        R Smith                555-1212

What we need, of course, is to use FIRST_NAME and MIDDLE_INITIAL in addition to LAST_NAME in our sort order to make sure that not only does James R. Smith come before the John Smiths, but also that the John Smiths are sorted correctly, according to their middle initials:

 SQL> SELECT * FROM NAMES ORDER BY   2  LAST_NAME, FIRST_NAME, MIDDLE_INITIAL; FIRST_NAME   M LAST_NAME            PHONE_NUMBER ------------ - -------------------- ------------- Maria        T Garcia               555-4321 James        R Smith                555-1212 John         A Smith                555-1234 John         B Smith                555-2121

The default sort order, as you see, is ascending. We can change this for each order list item independently by adding DESC, for descending. (We can also explicitly specify ascending by adding ASC, but because it is the default, this isn't necessary.) In the following example, we sort the EMP sample table in descending order by salary but in ascending order by name:

 SQL> SELECT ENAME,SAL FROM EMP   2  ORDER BY SAL DESC, ENAME; ENAME             SAL ---------- ---------- KING             5000 FORD             3000 SCOTT            3000 JONES            2975 BLAKE            2850 CLARK            2450 ALLEN            1600 TURNER           1500 MILLER           1300 MARTIN           1250 WARD             1250 ENAME             SAL ---------- ---------- ADAMS            1100 JAMES             950 SMITH             800 14 rows selected.

This ensures that within groups of people who have the same salary, such as FORD and SCOTT, names are listed in alphabetical order.

In addition to allowing ASC and DESC options, the order list also differs from a select list in another way. In the order list, instead of using column names or expressions, we can use numbers to refer to the columns in the select list by position. We can rewrite the SELECT statement in the previous expression, for example, using 1 in the order list to refer to ENAME in the select list, and 2 to refer to SAL:

 SELECT ENAME, SAL FROM EMP ORDER BY 2 DESC, 1;

So far, we've avoided making an important distinction. There are two basic ways that we can sort strings. We can sort by numeric, or binary, value or we can sort according to linguistic rules. In the examples we've seen so far, we've sorted strings using the binary value of the characters in the string. In Oracle, this is the default for English. This generally works adequately as long as we are consistent about the case of the strings in our table mostly because we don't have to worry about messy things such as accent marks and expanding characters in English.

Linguistic Sorting

Human languages are complex, naturally arising systems that sometimes don't seem to develop in rational ways. Sorting, or collation, for most languages, is a complex problem. It can be especially vexing because it seems that sorting should be easy exactly the type of simple-minded task for which a computer is designed. Fortunately, much of this complex functionality has been implemented in Oracle, and it is easy to enable.

The primary issue that arises for English, something to which we've already alluded, is the issue of case sensitivity. In most common applications, we don't expect a sort to be case-sensitive. We want all the A's, big and small, together, as well as all the B's together, etc. In a dictionary, we expect to find Erasmus between erase and erbium.

Suppose we have the following table of unsorted data:

 INITIALS  LAST_NAME  BORN ee        cummings   1894 WH        Auden      1907 TS        Eliot      1888

If we sort these names, we would want to find ee cummings after WH Auden but before TS Eliot. Sometimes, cumming's name is spelled with traditional capitalization to solve the problem. But if we respect his lowercase usage and our default sort is a binary sort, his name falls at the end. All names beginning with A Z come before those beginning with a z:

 SQL> SELECT INITIALS, LAST_NAME, BORN   2  FROM POETS   3  ORDER BY LAST_NAME, INITIALS; INITIALS LAST_NAME          BORN -------- ------------ ---------- WH       Auden              1907 TS       Eliot              1888 ee       cummings           1894

One way to solve this in English is to convert all strings in the ORDER BY clause to uppercase (or lowercase):

 SQL> SELECT INITIALS, LAST_NAME, BORN   2  FROM POETS   3  ORDER BY UPPER(LAST_NAME), UPPER(INITIALS); INITIALS LAST_NAME          BORN -------- ------------ ---------- WH       Auden              1907 ee       cummings           1894 TS       Eliot              1888

We need to be careful about using functions such as UPPER() in ORDER BY clauses because they can be very inefficient unless we create and enable function-based indexes.

A more general solution is to use Oracle's NLS (National Language Support) features. Oracle has built-in collation support for many languages that addresses not only case, but also other issues, such as characters with diacritical marks, expanding characters (single characters that are treated as two characters), and contracting characters (two characters that are treated as a single character). This collation support is controlled by a system parameter, NLS_SORT, which is set to value called a linguistic definition by Oracle. The basic linguistic definitions are generally named the same as the language name, for example, FRENCH or SWEDISH. Extended support for special cases is also available for many languages; these linguistic definitions begin with the prefix X, for example, XFRENCH or XSPANISH.

System parameters are used to configure the database and the client session, somewhat like environment variables in some operating systems. Many database features can be modified by changing a system parameter. When the database is first started, it reads system parameters from a parameter file, [ORACLE_HOME]/database/init[DB_NAME].ora, or from files included by this file using an IFILE parameter. System parameters not set in this file are derived from other system parameters or from the operating system environment, or else they default to some reasonable or arbitrary value. Finally, some system parameters can also be changed dynamically by a client using the ALTER SESSION command.

The NLS_SORT parameter, in particular, is derived from the NLS_LANGUAGE parameter, which, in turn, is derived from the operating system NLS_LANG variable; if the NLS_LANG variable is not set, NLS_LANGUAGE defaults to AMERICAN.

You can find out the current session's NLS parameters by querying the data dictionary as follows:

 SELECT * FROM NLS_SESSION_PARAMETERS;

In addition to supporting correct linguistic sorting for a single language at a time as in Oracle 8i, Oracle 9i introduced support for sorting multiple languages. The general multilingual sort, based on the International Standards Organizations standard for multilingual sorting, ISO 14651, is GENERIC_M. There are other versions of this linguistic definition that provide additional specialized sorting for individual languages. For example, FRENCH_M is the ISO14651 multilingual sort but handles French diacritics in the proper order, from right to left. See the Oracle9i Globalization Support Guide or the Oracle 8i National Language Support Guide for a complete list of linguistic definitions.

Consider the following unordered table with multilingual names. Each name is included twice, once capitalized and once not, to demonstrate issues with case:

 SQL> SELECT NAME FROM ML_NAMES; NAME ---------- Zamboni zamboni Ångström ångström Amenhotep amenhotep Álvarez álvarez chytilova Chytilova capek Capek Havel havel 14 rows selected.

Assuming that our database is set up for English, the default sort is BINARY. All uppercase letters will sort before all lowercase letters, because A Z are encoded as decimal 65 90, and a z are encoded as 97 122. (These are in the ASCII range, 0 127, which does not generally differ from one character set to another. Letters with diacritical marks, non-Roman letters, and other symbols are generally encoded above this range.) In ISO-8859-1 (Latin 1) and Windows 1252, which are commonly used character sets in Western Europe, uppercase letters with various diacritical marks are in the range 192 220, and lowercase letters with diacritical marks are in the range 224 252. Using a binary sort order, despite the ORDER BY clause, this list appears almost as unordered as before:

 SQL> SELECT NAME FROM ML_NAMES ORDER BY NAME; NAME ---------- Amenhotep Capek Chytilova Havel Zamboni amenhotep capek chytilova havel zamboni Álvarez Ångström álvarez ångström 14 rows selected.

We can correct the problem with case by selecting a linguistic sort, such as WEST_EUROPEAN (or GENERIC_M in Oracle 9i) for our NLS_SORT parameter. We set the NLS_SORT parameter dynamically in the ALTER SESSION statement:

     SQL> ALTER SESSION SET NLS_SORT='WEST_EUROPEAN'; Session altered.

This change to the NLS_SORT parameter will remain in effect until the end of the session or until we change it with another ALTER SESSION command. The ORDER BY clause will now sort uppercase and lowercase letters together, including those with diacritical marks:

 SQL> SELECT NAME FROM ML_NAMES ORDER BY NAME; NAME ---------- Álvarez álvarez Amenhotep amenhotep Ångström ångström Capek capek Chytilova chytilova Havel havel Zamboni zamboni 14 rows selected.

This is clearly an improvement and it's actually the best we can do with this multilingual list without favoring one language over another.

Several of the languages represented by the names in this multilingual list have rules that this generic sort ignores. For example, in Swedish, the letter A with a circle above it, Å, is considered a separate letter that sorts after Z. Setting the sort order to SWEDISH yields this list:

 SQL> ALTER SESSION SET NLS_SORT='SWEDISH'; Session altered. SQL> SELECT NAME FROM ML_NAMES ORDER BY NAME; NAME ---------- Álvarez álvarez Amenhotep amenhotep Capek capek Chytilova chytilova Havel havel Zamboni zamboni Ångström ångström 14 rows selected.

In some languages, certain combinations of letters are treated as a single character when sorting this is called contraction. In Czech, for example, CH is sorted as a single letter that falls between H and I. If we set NLS_SORT to XCZECH to enable extended Czech collation, names beginning with CH, disregarding case, will fall after names beginning with H:

 SQL> ALTER SESSION SET NLS_SORT='XCZECH'; Session altered. SQL> SELECT NAME FROM ML_NAMES ORDER BY NAME; NAME ---------- álvarez Álvarez amenhotep Amenhotep capek Capek havel Havel chytilova Chytilova Ångström ångström zamboni Zamboni 14 rows selected.

In other languages, certain single characters are treated as though they were two characters this is called expansion. In German, for example, vowels with an umlaut are sorted as though they were the combination of the base vowel (the vowel without the umlaut) plus the letter e. For example, ä is sorted as though it were ae.

There are many other issues involved in collation, especially in non-Western European languages, which are beyond the scope of this book. Eastern European languages, bidirectional languages, and Asian languages all present unique challenges. If your requirements include support for these languages, you will also need to consider related issues, such as character set support and language tagging.

Java's internationalization classes also include support for linguistic sorts (see the Collator class, in particular), but sorting is much more costly to perform in a client application than it is using the database. Even if you are using other internationalization features in Java such as date and currency formats, which you may find easier and more flexible to use than Oracle's, you should still rely on Oracle for linguistically correct sorting.