Set Functions and Summary Aggregates | SQL Performance Tuning

In the last section we looked briefly at how you can force the DBMS to use indexes to speed up MIN and MAX, and we mentioned that COUNT is an unusual function because it has a special optimization of its own. In this section, we'll look at this special optimization and the set functions in greater detail. We'll also discuss summary aggregation.

COUNT

DBMSs keep statistics that are associated with each table. One of these statistics is the table size , a figure that the optimizer uses for (among other things) determining the driver in a join. (The driver is the table that the DBMS examines first when evaluating an expression. For example, if a join is based on the expressions Table1.column1 = 5 AND Table1.column2 = Table2.column2 and the evaluation plan is to read Table1 , then lookup Table2 rows based on values in Table1 's rows, then Table1 is the driver.) If the statistics are 100% up to date, then the DBMS could retrieve this value, and the response to a statement like:

 SELECT COUNT(*) FROM Table1

would be instantaneous. Statistics become out of date as soon as a row is inserted, updated, or deleted, so this happy state of affairs is rare.

DBMSs also keep more permanent statistics, such as the number of pages, or the highest serial number for a table's serial-number column. For example, with MySQL you can get a statistic called LAST_INSERT_ID. Such numbers always reflect rows that have been deleted, so COUNT(*) can't use them. But they can serve as substitutes for COUNT(*) if you can accept a small amount of error. Remember that in the dBASE Dark Age, it was considered perfectly normal that a "row count" included the table's deleted rows.

The COUNT(DISTINCT column) function is an awful performer. If it's alone in the select list, you have a workaround. For example, this statement:

 SELECT COUNT(DISTINCT column1)   FROM Table1   WHERE column2 > 55

can be replaced by:

 SELECT DISTINCT column1   FROM Table1   WHERE column2 > 55 GAIN: 5/8

Once you have the result set for the alternative SELECT, get the result set size using the ODBC function SQLGetDiagField . If you can do this, it's several times quicker.

Portability

This tip is not portable. Many DBMSs will reject the SQLGetDiagField call.

The statistics-based COUNT(*) optimizations are useless if there is a WHERE clause. That is, this statement:

 SELECT COUNT(*) FROM Table1

normally goes many times faster than:

 SELECT COUNT(*) FROM Table1   WHERE column1 = 55

In cases where the search condition is necessary, you can sometimes speed things up again by counting the column onlythat is, change the query to:

 SELECT COUNT(column1) FROM Table1   WHERE column1 = 55 GAIN: 1/8

The reasoning here is that the DBMS will use a covering index if one is available.

SUM Trouble

Summing up a long list of FLOAT columns can result in a loss of precision under two circumstances: (a) where the exponent varies wildly, and (b) where much subtraction occurs because values are negative. It can also result in a perceived loss of precision when numbers that we think of as "exact" are handled as floats. (For example, the number .1 added ten times does not result in the number 1 when floating-point arithmetic is used, though we may subconsciously expect that result.)

COUNT Mumbo Jumbo

Some people swear that you can speed up COUNT(*) by replacing the function with a logical equivalent. We failed to find any evidence for this assertion. As far as we can tell, the following statements are exactly the same:

 SELECT COUNT(*) FROM Table1 SELECT COUNT(1) FROM Table1 SELECT SUM(1) FROM Table1

We tried out all three variants on the Big Eight and found no performance distinction among them.

If you consciously decided to use FLOATs, then you know this is part of the game. If you consciously decided to define your column as a DECIMAL but your DBMS silently stores that as a FLOAT (see Chapter 7, "Columns"), you might regard the precision loss as significant. You can solve this problem by casting the SUM to a data type with more precision, such as DECIMAL or DOUBLE PRECISION. If you do this, note that it's important to CAST the number within the expression, not within the SUM. That is, rather than using this statement:

 SELECT CAST(SUM(column1) AS DECIMAL(10))   FROM Table1

use this statement instead:

 SELECT SUM(CAST(column1 AS DECIMAL(10)))   FROM Table1 GAIN: 3/6

Portability

MySQL and Ingres don't support CAST. The gain shown is for only six DBMSs.

Precision will also be slightly better if you say SUM(x + y) instead of SUM(x) + SUM(y) because the total number of additions will be smaller. On the other hand, precision is better if you say SUM(x) SUM(y) instead of SUM(x y) because the total number of subtractions will be smaller.

You can also have a problem with SUMfor example, SUM(column1) if column1 is an INTEGER. This time it isn't a problem with precision (in fact, a precision loss never occurs if you use exact numeric data types). The problem goes like this:

 INSERT INTO Table1    VALUES (2000000000)   /* 2 billion */ INSERT INTO Table1    VALUES (2000000000)   /* 2 billion */ SELECT SUM(column1)   FROM Table1            /* 4 billion ... error? */

Four DBMSsIngres, InterBase, Microsoft, and Sybasereturn an "overflow" error in such a case. Once again, you can avoid the problem by using CAST:

 SELECT SUM(CAST(column1 AS BIGINT))   FROM Table1             /* if BIGINT is legal */ SELECT SUM(CAST(column1 AS DECIMAL(10)))   FROM Table1             /* if you prefer standard SQL */

Similar considerations apply if you want to pick your own precision for AVG functions, rather than depending on the arbitrary precision that the DBMS picks. Some DBMSs add a few positions (usually three or more) after the decimal pointso if, for example, column1 is defined as DECIMAL(5,3) the result of AVG(column1) is at least DECIMAL(8,6) . Other DBMSs (e.g., IBM and Informix) return a FLOAT result.

Multiple Aggregation Levels

Reports frequently include both detail and summary information. That's what the CUBE and ROLLUP keywords are for. You should use GROUP BY CUBE (...)

or GROUP BY ROLLUP (...) if you want detail/summary reportsif your DBMS supports this new SQL:1999 feature. If it doesn't, you can get multiple levels by UNIONing the detail rows with the summary rows, like this:

 SELECT column1 AS col1, SUM(column2)   FROM Table1   WHERE column1 IS NOT NULL   GROUP BY column1 UNION ALL SELECT 'Total' AS col1, SUM(column2)   FROM Table1 ORDER BY col1

Portability

MySQL doesn't support UNION. All other DBMSs can do this, but you should use CUBE/ROLLUP with IBM, Microsoft, and Oracle.

You can achieve the same effect with a subqueryfor example, SELECT ... FROM (SELECT...) but if your DBMS will support the table subqueries enhancement, then it will probably support CUBE and ROLLUP too. A similar trick is useful if you want to treat each occurrence of NULL as a distinct value.

Fast MAX

An ideal way to compute the MAX aggregate function is described in A.K. Dewdney's article "On the spaghetti computer and other analog gadgets for problem solving" ( Scientific American , June 1984, page 19). The article suggests that, for each numeric value in the set that you want to sort , you cut a strand of uncooked spaghetti to a length matching the value. Hold all the strands together loosely in your fist, and rap them on a horizontal surface. Instantly the longest strand will be apparent. Measure it.

Unfortunately, only the MAX is visiblenot the MIN, SUM, AVG, or COUNT of the set. This illustrates that if you design the system for a specific query, you can get fantastic results and make every other query slower.

Expressions

Here are three SQL statements. Feel free to show off by guessing which one is illegal according to SQL Standard rules.

 Statement #1 SELECT MAX(LOWER(column1)) FROM Table1   GROUP BY column1 Statement #2 SELECT LOWER(column1) FROM Table1   GROUP BY LOWER(column1)      /* Hint: this one! */ Statement #3 SELECT column1 FROM Table1   GROUP BY column1 COLLATE SQL_Latin1_General

Doubtless you correctly guessed that Statement #2 is illegal because GROUP BY columns can't be in expressionsexcept COLLATE expressions. Unfortunately, many DBMSs won't support COLLATE, so they have to support other kinds of expressions, such as LOWER. After all, it's frequently necessary to have some choice over how the DBMS decides what's a duplicate and what's not. Do remember, though, that GROUP BY is evaluated before the production of the select-list columns, so you can't do this:

 SELECT LOWER(column1) AS alias_name   FROM Table1   GROUP BY alias_name   ORDER BY alias_name

In this example, alias_name is illegal in the GROUP BY clause. It is, however, legal in the ORDER BY clause because ORDER BY is evaluated after the production of the select-list columns.

Expressions are also helpful with the MIN and MAX functions. For example, assume Table1 contains two rows with these values: {Sam , SAM} . This query:

 SELECT MIN(column1)   FROM Table1

can return either 'Sam' or 'SAM' , because both values are equal in a primary sort comparison. To ensure a consistent result, use this expression in the select list instead:

 SELECT MIN(UPPER(column1))   FROM Table1

The Bottom Line: Aggregates

Summing up a long list of FLOAT columns can result in a loss of precision. You can solve this problem by casting the SUM to a data type with more precision. If you do, make sure to CAST the number within the expression, not within the SUM.

Use CAST to avoid "overflow" problems with SUM and AVG.

Precision will be slightly better if you say SUM(x + y) instead of SUM(x) + SUM(y) .

Precision will be slightly better if you say SUM(x) SUM(y) instead of SUM(x y) .

Use CUBE or ROLLUP to produce detail/summary reports. If the DBMS doesn't support this feature, you can still get multiple levels by UNIONing the detail rows with the summary rows.

Expressions are helpful with the MIN and MAX functions, to ensure a consistent result set for character strings.