# General Sort Considerations

### General Sort Considerations

Sorting has been a subject of research for decades, and it is not difficult to find a reasonable recipe nowadays. What we expect, at a minimum, is that each of the Big Eight act this way:

• If physical records are large and ORDER BY columns are small, DBMSs will start by extracting the essential information, loading it into memory, then performing a "tag sort" which compares only the extracted snippets. This procedure makes it unnecessary to reread a whole physical record every time a comparison is performed. The DBMS may have to do large dynamic-memory allocations during the process.

• DBMSs will use one of the better algorithms, such as tournament sort. These algorithms often have a counter-intuitive flaw: If all the rows are already in order or nearly in order, it won't help much because the speed depends very little on the order of the input data, and very much on other factors.

All right, so what does affect sort speed? Here's the answer, in order of importance:

• Number of rows selected

• Number of columns in the ORDER BY clause

• Length of the columns in the ORDER BY clause

Figures 3-1, 3-2, and 3-3 show the effect, on sort speed, of increasing each of these variables .

## Testing ORDER BY speed

To test a single factor X, it's necessary to keep all other factors equal while rerunning SQL statements with different values of X. We tried. First we created a table containing CHAR columns. We populated the table using a randomizer that generated equal amounts of the letters A through G, with the result that some columns contained duplicate values or at least began with the same letters . Filler columns were added as needed to ensure that total row length was the same in all tests. Then we ran these three tests:

Test #1, result shown in Figure 3-1

Number of rows: Varying from 1,000 to 10,000

Number of columns: Fixed at one

Column length: Fixed at CHAR(10)

SQL statements: Fixed at one:

` SELECT * FROM Table1 ORDER BY column1 `

Test #2, result shown in Figure 3-2

Number of rows: Fixed at 1,000

Number of columns: Varying from one to ten

Column length: Fixed at CHAR(10)

SQL statements: Varying from one to ten:

` SELECT * FROM Table1 ORDER BY column1 SELECT * FROM Table1 ORDER BY column1, column2 SELECT * FROM Table1 ORDER BY column1, column2, column3 `

etc.

Test #3, result shown in Figure 3-3

Number of rows: Fixed at 1,000

Number of columns: Fixed at one

Column length: Varying from CHAR(1) to CHAR(10)

SQL statements: Fixed at one:

` SELECT * FROM Table1 ORDER BY column1 `

The timing numbers shown in Figures 3-1, 3-2, and 3-3 are the result and are averages for the Big Eight. Tests were run on a single-processor machine using whatever memory the DBMS chose for its default configuration.

It's no surprise that an increase in the row count has a geometric effectif you increase the number of rows by a factor of ten, the job takes about twenty times as long to complete. We had hoped for a better result. That's achievable by adding more processors and more memory, because DBMSs can sort groups of tags in parallel threads and will thresh less if all tags are in memory.

Surprisingly, an increase in the column count has a more drastic effect than an increase in the column length. Remember, total number of bytes was the same in both cases, so the effect is more than illusion. The likely explanation is that our DBMSs are using multipass algorithms, comparing only one column at a time.

The bottom line: Take drastic action to reduce row countfor example, by selecting only parts of a table at a time. Take severe action to reduce column countfor example, by concatenating two columns instead of specifying them separately. Take moderate action to reduce column lengthfor example, by using the SUBSTRING function. It's also slightly helpful if values are presorted and unique on the first few characters , but Christmas comes only one day a year.

We make several more statements about what happens with a sort in the following sections.

#### Partial duplicates slow sorts

Suppose you have a table with a CHAR column. Populate the column with data in which the first five characters are duplicates. Now SELECT:

` SELECT column1 FROM Table1   ORDER BY column1 `

Delete all rows and repopulate , this time with random data. SELECT again:

` SELECT column1 FROM Table1   ORDER BY column1 GAIN: 5/8 `

Skipping the duplicated characters made the sort faster for five DBMSs.

#### Presorting speeds sorts

Go back to your table with a CHAR column. Populate the column with random data and SELECT:

` SELECT column1 FROM Table1   ORDER BY column1 `

Delete all rows and repopulate, this time with data that's in alphabetic order and SELECT:

` SELECT column1 FROM Table1   ORDER BY column1 GAIN: 4/8 `

Having the rows already in order made the sort faster for half of the Big Eight.

#### It's the defined length that matters

Remember that a variable-length column has a defined length and an actual length. For example, a VARCHAR(30) column may contain ABC , in which case its defined length is 30 characters and its actual length is three characters. For sorting, it's the 30-character definition that matters! We tested this by varying the defined length of a VARCHAR column, while keeping the actual data length the same. The result was that Scenario #1 is slower than Scenario #2 even when the contents are the same in both cases:

` Scenario #1: CREATE TABLE Table1 (    column1 VARCHAR(100)) SELECT column1 FROM Table1   ORDER BY column1 Scenario #2: CREATE TABLE Table2 (    column1 VARCHAR(10)) SELECT column1 FROM Table2    ORDER BY column1 GAIN: 6/8 `

Moral

Defining VARCHAR columns with "room to grow" degrades sorts. The gain might seem surprising, but it's consistent with the way that most sort algorithms work: They allot a fixed memory buffer in advance, and the size of the buffer depends on the maximum anticipated key size .

#### INTEGERs beat SMALLINTs

On Windows machines, where an INTEGER requires 32 bits and a SMALLINT requires 16 bits, the tendency may be to think that SMALLINT sorting is faster than sorting INTEGER columns. But comparisons of integers are usually faster because 32 bits is your computer's word size and DBMSs are able to take advantage of the fact. So Scenario #1 is slower than Scenario #2:

` Scenario #1: CREATE TABLE Table1 (    column1 SMALLINT) SELECT column1 FROM Table1   ORDER BY column1 Scenario #2: CREATE TABLE Table2 (    column1 INTEGER) SELECT column1 FROM Table2    ORDER BY column1 GAIN: 5/8 `

#### INTEGERs beat CHARs

We sorted a CHAR(4) column, then we sorted an INTEGER column, both with random data. The INTEGER sort was often faster, because a character string cannot be compared four bytes at a timebut this is done for an integer. Thus Scenario #1 is slower than Scenario #2:

` Scenario #1: CREATE TABLE Table1 (    column1 CHAR(4)) SELECT column1 FROM Table1   ORDER BY column1 Scenario #2: CREATE TABLE Table2 (    column1 INTEGER) SELECT column1 FROM Table2    ORDER BY column1 GAIN: 4/8 `

#### Sets beat multisets

Technically, a multiset differs from a set because it contains duplicates. Testing against two tables with the same number of rows, but with significant numbers of duplicates in the first table, we found, once again, that sorting is faster when there are no duplicates.

#### Conclusion

The fastest sort is an ascending sort of an integer with unique values, presorted.

#### The ORDER BY Clause

A simplified description of the SELECT ... ORDER BY syntax looks like this:

` SELECT <column list>                   /* the select list */   FROM <Table list>   ORDER BY <column expression> [ASC  DESC] [,...] `

This syntax has a new wrinkle. In SQL-92, you had to ORDER BY <column name > , and the column named had to be in the select list. In SQL:1999, you're allowed to ORDER BY <column expression> . For example, to sort numbers in backwards order, you've now got these two choices:

` With New SQL:1999 Wrinkle: SELECT numeric_column    FROM Table1    ORDER BY numeric_column * -1 Without New Wrinkle: SELECT numeric_column, numeric_column * -1 AS num    FROM Table1    ORDER BY num `

• In theory, an SQL:1999 DBMS silently adds numeric_column * -1 to the select list just before sorting and takes it out just after sorting.

• The result of ORDER BY numeric_column * -1 can differ slightly from the result of ORDER BY numeric_column DESC . The difference, as usual, is caused by the presence of NULLs.

Table 3-1 shows the SQL Standard requirements and the level of support the Big Eight have for ORDER BY.

Notes on Table 3-1:

• Max Columns column

Shows how many columns may be listed in the ORDER BY clause.

• For Sybase, our tests showed it was possible to sort up to 31 columns. This differs from Sybase's response to JDBC's getMaxColumnsIn Order By call, which returns 16 .

• Max Bytes column

Shows the maximum allowed length, in bytes, of an ORDER BY clause.

This column shows a significant number. If column1 is defined as CHAR(500) and Max Bytes shows a value of "500" then ORDER BY column1 , column2 is ineffective because column2 doesn't get into the tag.

##### Table 3-1. ANSI/DBMS ORDER BY Support
Max Columns Max Bytes NULLs Sort Sort LOBs ORDER BY Expression COLLATE Clause
ANSI SQL N/S N/S Low or High No Yes Yes
IBM 500 253 High No Yes No
Informix >=1000 >=1000 Low No No No
Ingres 300 >=1000 High No Yes No
InterBase 254 254 At End Yes No Yes
Microsoft >=1000 >=1000 Low No Yes Yes
MySQL >=1000 >=1000 Low Yes Yes No
Oracle 254 >=1000 High No Yes Yes
Sybase 31 >=1000 Low No Yes No
• NULLs Sort column

Shows where the DBMS places NULLs in a sorted list. This column is "Low" if NULLs are considered less than all other values, "High" if NULLs are considered greater than all other values, and "At End" if the DBMS places NULLs at the end of a sorted list both when you ORDER BY...ASC and when you ORDER BY...DESC .

• For InterBase, our tests showed that NULLs sort At End. This differs from InterBase's response to (a) JDBC's NullsAreSortedAtEnd call, which returns false and (b) JDBC's NullsAreSortedHigh call, which returns true .

• For MySQL and Sybase, our tests showed that NULLs sort Low. In the first case, this differs from MySQL's response to (a) JDBC's NullsAreSortedLow call, which returns false and (b) JDBC's NullsAreSortedAtStart call, which returns true . In the second case, this differs from Sybase's response to (a) JDBC's NullsAreSortedLow call, which returns false and (b) JDBC's NullsAreSortedHigh call, which returns true .

• For Oracle, our tests showed that NULLs sort High. This differs from Oracle's response to (a) JDBC's NullsAreSortedLow call, which returns true and (b) JDBC's NullsAreSortedHigh call, which returns false .

• Sort LOBs column

This column is "Yes" if the DBMS allows large object data (i.e., BLOB, CLOB, NCLOB, or the nonstandard SQL extensions TEXT, IMAGE, BINARY, GRAPHIC, etc.) to be sorted.

• ORDER BY Expression column

Shows whether the DBMS allows ORDER BY to contain columns or expressions that are not in the select list. This column is "Yes" if the DBMS supports SQL:1999-style statements like either of these two:

` SELECT column1 FROM Table1 ORDER BY column2 SELECT column1 FROM Table1 ORDER BY <function>(column1) `
• COLLATE Clause column

This column is "Yes" if the DBMS supports SQL Standard-style COLLATE clauses, or Oracle-style NLSSORT() function calls, or a CAST to a different character set with a different collation in ORDER BY, like this:

` SELECT column1, column2 FROM Table1   ORDER BY column1 COLLATE SQL_Latin1_General `

For many DBMSs, NULLs sort Lowthat is, NULLs are considered to be less than the smallest non-NULL value. For IBM, Ingres, and Oracle, NULLs sort Highthat is, NULLs are considered to be greater than the largest non-NULL value. For InterBase and PostgreSQL, NULLs sort At the Endwhether you're sorting in ascending or in descending order. That means the set of values {-1, +1, NULL} sorts four different ways depending on DBMS and depending on your sort order:

• {NULL, 1, +1}

/* result after ORDER BY column1 for many DBMSs, including Informix, Microsoft, MySQL, and Sybase */

• {1, +1, NULL}

/* result after ORDER BY column1 for IBM, Ingres, InterBase, Oracle, PostgreSQL */

• {+1, 1, NULL}

/* result after ORDER BY column1 DESC for Informix, InterBase, Microsoft, MySQL, PostgreSQL, and Sybase */

• {NULL, +1, 1}

/* result after ORDER BY column1 DESC for IBM, Ingres, Oracle */

The result is that ORDER BY column1 * -1 and ORDER BY column1 DESC return different results (with NULL in a different spot) unless you use InterBase or PostgreSQL.

The use of expressions in ORDER BY is not 100% portable, but some useful expressions can help speed or clarity:

• ORDER BY LOWER(column1)

Useful if case-insensitive sorting is unavailable.

• ORDER BY SUBSTRING(column1 FROM 1 FOR 6)

Rough sorts are faster because tags are small.

• ORDER BY CAST(column1 AS CHAR...)

Useful if column1 's data type is not sortable.

Such functions would be unacceptable if the DBMS called them multiple times. However, the DBMS evaluates the sort-key value only once per input row, at the time it forms the tags. To make sure of this, we tested ORDER BY with a user -defined function that simply counted how many times the function was called. Six of the Big Eight called the function only once per row (GAIN: 6/6).

Portability

Informix and InterBase don't allow expressions in ORDER BY. The gain shown is for only six DBMSs.

#### To Sort or Not to Sort

"Do not use ORDER BY if the query has a DISTINCT or GROUP BY on the same set of terms, because they have the side effect of ordering rows."

Kevin Kline et al., Transact-SQL Programming, O'Reilly & Associates

Sometimes there is a temptation to follow such advice and skip ORDER BY if it's certain that the rows are in order anyway. Our tests showed this about such assumptions:

• SELECT column1 FROM Table1

is returned in order by column1 if Table1 is clustered and column1 is the cluster key (see Chapter 9, "Indexes") or is otherwise preordered.

• SELECT column1 FROM Table1 WHERE column1 > -32768

is returned in order by column1 if column1 is indexed (in ASC order) and the DBMS makes use of the index.

• SELECT DISTINCT column1 FROM Table1

is returned in order by column1 if column1 is not unique.

If you add ORDER BY column1 in any of these three cases, the SELECT is always slower (AVERAGE GAIN: 5/8 without ORDER BY). This suggests that DBMSs will not remove unnecessary ORDER BY clauses automatically. However, our tests also showed that the effect is unpredictable with more complex statements that contain any of the following: joins, unions, multiple columns in the select list, long columns, columns indexed with a descending index, or columns requiring secondary sorts (we'll talk about secondary sorts later in this chapter). Finally, we failed to find that any such side effect was documented in any vendor manual. So we urge caution.

#### The Bottom Line: General Sorts

The three variables that affect sort speed are, in order of importance:

• The number of rows you select

• The number of columns you put in the ORDER BY clause

• The defined length of the columns you put in the ORDER BY clause

An increase in the row count has a geometric effect on sort speed. If you multiply the number of rows by ten, the job takes twenty times as long. Take drastic action to reduce the number of rows you sort.

Take severe action to reduce the number of sorted columns.

Take moderate action to reduce the length of sorted columns.

The fastest sort is an ascending sort of a presorted integer with unique values.

Partial duplicates slow sorts.

Presorting speeds sorts.

It's the defined length that matters.

Some DBMSs sort NULL high. Some DBMSs sort NULL low. Some DBMSs sort NULL at the end of a list. Since there's no standard way of sorting NULL; don't write code that depends on the DBMS putting all the NULLs in a specific place.

The use of expressions in ORDER BY is not 100% portable. But using expressions like ORDER BY LOWER(column1) , ORDER BY SUBSTRING(column1 FROM 1 FOR 6) , and ORDER BY CAST(column1 AS CHAR...) can help speed or clarity.

` SELECT column1 FROM Table1 `

returns a result set in order by column1 if Table1.column1 is clustered or otherwise preordered.

` SELECT column1 FROM Table1 WHERE column1 > -32768 `

returns a result set in order by column1 if column1 is indexed and the DBMS makes use of the index.

` SELECT DISTINCT column1 FROM Table1 `

returns a result set in order by column1 if column1 is not unique.

"Omit ORDER BY" is a popular tip, and it actually has an effectbut the effect is unreliable. This is fine if the ordering is for cosmetic purposes, or if the number of rows is tiny, but in other cases . Judge the tip's merits for yourself.

SQL Performance Tuning
ISBN: 0201791692
EAN: 2147483647
Year: 2005
Pages: 125

Similar book on Amazon