Optimal GROUP BY Clauses

We've already mentioned that GROUP BY performs better if you keep the number of grouping columns small. One way you can do so is to avoid grouping redundant columns , as in this example:

 SELECT secondary_key_column, primary_key_column, COUNT(*)   FROM Table1   GROUP BY secondary_key_column, primary_key_column

Because primary key columns are unique and may not contain NULL by definition, the mention of secondary_key_column in this example is redundant. The problem is that if you take secondary_key_column out of the GROUP BY clause, you'll get an error message. All DBMSs except MySQL and Sybase will tell you that you can't have secondary_key_column in the select list if it's not also in the GROUP BY list. This is how to write a query that's legal and that's faster:

 SELECT MIN(secondary_key_column), primary_key_column, COUNT(*)   FROM Table1   GROUP BY primary_key_column GAIN: 4/7

WARNING

Don't do this for Ingres; it shows a loss. The gain shown is for only seven DBMSs.

Here are two ways to speed up GROUP BY when you're joining tables.

Reduce before you expand

GROUP BY tends to reduce row counts, and JOIN tends to expand row counts. Because a DBMS must evaluate FROM and WHERE clauses before GROUP BY clauses, this tip is not easy to put into practice, but there is a way. You can make a join happen late by replacing it with a set operator . (The SQL Standard set operators are UNION, EXCEPT, and INTERSECT.) For example, replace Statement #1 with Statement #2:

 Statement #1: SELECT SUM(Table1.column2), SUM(Table2.column2)   FROM Table1 INNER JOIN Table2        ON Table1.column1 = Table2.column1   GROUP BY Table1.column1 Statement #2: SELECT column1, SUM(column2), 0   FROM Table1   GROUP BY column1 INTERSECT SELECT column1, 0, SUM(column2)   FROM Table2   GROUP BY column1 GAIN: 2/2

Portability

Informix, Ingres, InterBase, Microsoft, MySQL, and Sybase don't support INTERSECT. The gain shown is for only two DBMSs.

GROUP on the same table

When you're grouping joined tables, the GROUP BY column should be from the same table as the column(s) on which you're applying a set function. We're passing this advice along because some vendors think it's important enough to mention in their documentation.

Also to do with joins and GROUP BY, you can improve performance by avoiding joins altogether. Consider this SELECT:

 SELECT COUNT(*) FROM Table1, Table2    WHERE Table1.column1 = Table2.column1

If Table1.column1 is unique, you could replace the join with a subquery. Transform the SELECT to:

 SELECT COUNT(*) FROM Table2   WHERE Table2.column1 IN     (SELECT Table1.column1 FROM Table1) GAIN: 4/6

WARNING

Don't do this for Oracle; it shows a loss. The gain shown is for only six DBMSs (see Portability note).

Portability

MySQL doesn't support subqueries. The gain shown is for only six DBMSs.

HAVING

Most DBMSs do not merge WHERE and HAVING clauses. This means the following statements are logically the same but won't run at the same speed:

 Query with WHERE and HAVING: SELECT column1 FROM Table1   WHERE column2 = 5   GROUP BY column1     HAVING column1 > 6 Query with WHERE only: SELECT column1 FROM Table1   WHERE column2 = 5     AND column1 > 6   GROUP BY column1 GAIN: 3/8

The "Query with WHERE only" runs faster on three of the Big Eight. You should use this type of query except in the rare cases where you need to defer the filtering implied by column1 > 6 for example, if the comparison is hard to evaluate.

Alternatives to GROUP BY

If you're writing a query that doesn't involve set functions, you can use DISTINCT as an alternative to GROUP BY. DISTINCT has three advantages: It's simpler, it's legal to use in expressions, andwith some DBMSsit's faster. So instead of using Query #1, use the alternative Query #2:

 Query #1: SELECT column1    FROM Table1    GROUP BY column1 Query #2: SELECT DISTINCT column1    FROM Table1 GAIN: 4/8

The Bottom Line: Optimal GROUP BY Clauses

GROUP BY performs better if you keep the number of grouping columns small.

Avoid grouping redundant columns by using set functions.

When you're grouping joined tables, reduce before you expand. You can make a join happen late by replacing it with a set operator.

When you're grouping joined tables, the GROUP BY column should be from the same table as the column(s) on which you're applying a set function.

You can improve performance on some grouped joins by replacing the join with a subquery.

Most DBMSs do not merge WHERE and HAVING clauses. Write your queries with only a WHERE clause wherever possible.

Use DISTINCT instead of GROUP BY if your query doesn't involve set functions.