Optimal GROUP BY Clauses
We've already mentioned that GROUP BY performs better if you keep the number of grouping columns small. One way you can do so is to avoid grouping redundant columns, as in this example:
SELECT secondary_key_column,
       primary_key_column,
       COUNT(*)
  FROM Table1
  GROUP BY secondary_key_column,
           primary_key_column
Because primary key columns are unique and may not contain NULL by definition, the mention of secondary_key_column in this example is redundant. The problem is that if you take secondary_key_column out of the GROUP BY clause, you'll get an error message. All DBMSs except MySQL and Sybase will tell you that you can't have secondary_key_column in the select list if it's not also in the GROUP BY list. This is how to write a query that's legal and that's faster:
SELECT MIN(secondary_key_column),
       primary_key_column,
       COUNT(*)
  FROM Table1
  GROUP BY primary_key_column
GAIN: 4/7
Don't do this for Ingres; it shows a loss. The gain shown is for only seven DBMSs.
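To see that the rewrite changes only the form of the query and not its result, here is a minimal sketch using Python's built-in sqlite3 module. The table definition and sample rows are our own assumptions for illustration, and SQLite is used only because it is easy to run; the sketch checks equivalence and says nothing about speed on the DBMSs discussed here.

```python
import sqlite3

# Hypothetical sample table; the column names follow the example in the text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (
        primary_key_column   INTEGER PRIMARY KEY,
        secondary_key_column INTEGER NOT NULL
    );
    INSERT INTO Table1 VALUES (1, 10), (2, 10), (3, 30);
""")

# Original form: the secondary key is listed as a (redundant) grouping column.
redundant = conn.execute("""
    SELECT secondary_key_column, primary_key_column, COUNT(*)
      FROM Table1
      GROUP BY secondary_key_column, primary_key_column
      ORDER BY primary_key_column
""").fetchall()

# Rewritten form: group on the primary key only and wrap the other column
# in MIN(). Each group holds exactly one row, so MIN() returns that row's value.
rewritten = conn.execute("""
    SELECT MIN(secondary_key_column), primary_key_column, COUNT(*)
      FROM Table1
      GROUP BY primary_key_column
      ORDER BY primary_key_column
""").fetchall()

print(redundant == rewritten)
conn.close()
```

MIN is an arbitrary choice here; because every group contains a single row, MAX would return the same value.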
Here are two ways to speed up GROUP BY when you're joining tables.
Reduce before you expand
GROUP BY tends to reduce row counts, and JOIN tends to expand row counts. Because a DBMS must evaluate FROM and WHERE clauses before GROUP BY clauses, this tip is not easy to put into practice, but there is a way. You can make a join happen late by replacing it with a set operator. (The SQL Standard set operators are UNION, EXCEPT, and INTERSECT.) Keep in mind that a set operator compares whole rows, so check that the rewritten statement still returns the rows you need. For example, replace Statement #1 with Statement #2:
Statement #1:
SELECT SUM(Table1.column2),
       SUM(Table2.column2)
  FROM Table1 INNER JOIN Table2
    ON Table1.column1 = Table2.column1
  GROUP BY Table1.column1

Statement #2:
SELECT column1,
       SUM(column2),
       0
  FROM Table1
  GROUP BY column1
INTERSECT
SELECT column1,
       0,
       SUM(column2)
  FROM Table2
  GROUP BY column1
GAIN: 2/2
Portability: Informix, Ingres, InterBase, Microsoft, MySQL, and Sybase don't support INTERSECT. The gain shown is for only two DBMSs.
GROUP on the same table
When you're grouping joined tables, the GROUP BY column should be from the same table as the column(s) on which you're applying a set function. We're passing this advice along because some vendors think it's important enough to mention in their documentation.
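As an illustration of what this advice means in practice, the sketch below writes the same grouped join two ways: once with the GROUP BY column taken from the table that supplies the SUM argument, and once with it taken from the other table. The tables, rows, and variable names are hypothetical, and SQLite (through Python's sqlite3 module) is used only to confirm that the two forms return the same rows on an inner equijoin; it is not evidence about speed on any particular DBMS.

```python
import sqlite3

# Hypothetical sample data for a two-table inner join.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (column1 INTEGER, column2 INTEGER);
    CREATE TABLE Table2 (column1 INTEGER);
    INSERT INTO Table1 VALUES (1, 10), (1, 20), (2, 5);
    INSERT INTO Table2 VALUES (1), (2), (2), (3);
""")

# Recommended form: the grouping column and the SUM argument both
# come from Table1.
same_table = conn.execute("""
    SELECT Table1.column1, SUM(Table1.column2)
      FROM Table1 INNER JOIN Table2
        ON Table1.column1 = Table2.column1
      GROUP BY Table1.column1
      ORDER BY 1
""").fetchall()

# Form to avoid: the grouping column comes from Table2 while the
# SUM argument still comes from Table1.
other_table = conn.execute("""
    SELECT Table2.column1, SUM(Table1.column2)
      FROM Table1 INNER JOIN Table2
        ON Table1.column1 = Table2.column1
      GROUP BY Table2.column1
      ORDER BY 1
""").fetchall()

print(same_table == other_table)
conn.close()
```

Because the join condition forces Table1.column1 and Table2.column1 to be equal in every joined row, switching the grouping column to the same table as the set function's argument does not change the result.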
Still on the subject of joins, you can sometimes improve performance by avoiding the join altogether. Consider this SELECT:
SELECT COUNT(*)
  FROM Table1, Table2
  WHERE Table1.column1 = Table2.column1
If Table1.column1 is unique, you could replace the join with a subquery. Transform the SELECT to:
SELECT COUNT(*)
  FROM Table2
  WHERE Table2.column1 IN
    (SELECT Table1.column1 FROM Table1)
GAIN: 4/6
Don't do this for Oracle; it shows a loss. The gain shown is for only six DBMSs (see Portability note).
Portability: MySQL doesn't support subqueries. The gain shown is for only six DBMSs.
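The uniqueness condition matters, and a small experiment makes the reason visible. The sketch below runs the join and the subquery form against the same data twice: once with unique values in Table1.column1 and once with a duplicate. The helper function `counts`, the sample values, and the use of SQLite through Python's sqlite3 module are our own assumptions for illustration; the sketch compares results only, not speed.

```python
import sqlite3

JOIN_SQL = """
    SELECT COUNT(*)
      FROM Table1, Table2
      WHERE Table1.column1 = Table2.column1
"""

IN_SQL = """
    SELECT COUNT(*)
      FROM Table2
      WHERE Table2.column1 IN (SELECT Table1.column1 FROM Table1)
"""

def counts(table1_values, unique):
    """Return (join count, subquery count) for the given Table1 values.

    Hypothetical helper: builds a fresh in-memory database each time.
    """
    conn = sqlite3.connect(":memory:")
    constraint = "UNIQUE" if unique else ""
    conn.execute(f"CREATE TABLE Table1 (column1 INTEGER {constraint})")
    conn.execute("CREATE TABLE Table2 (column1 INTEGER)")
    conn.executemany("INSERT INTO Table1 VALUES (?)",
                     [(v,) for v in table1_values])
    conn.executemany("INSERT INTO Table2 VALUES (?)",
                     [(v,) for v in (1, 1, 2, 4)])
    join_count = conn.execute(JOIN_SQL).fetchone()[0]
    in_count = conn.execute(IN_SQL).fetchone()[0]
    conn.close()
    return join_count, in_count

# Unique Table1.column1: both forms count the same Table2 rows.
unique_counts = counts([1, 2, 3], unique=True)

# Duplicate value in Table1.column1: the join counts a Table2 row once per
# matching Table1 row, while IN counts each Table2 row at most once.
duplicate_counts = counts([1, 1, 2, 3], unique=False)

print(unique_counts, duplicate_counts)
```

With unique values the two counts agree; with the duplicate, the join count is larger than the subquery count, which is why the transformation is safe only when Table1.column1 is unique.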
Most DBMSs do not merge WHERE and HAVING clauses. This means the following statements are logically the same but won't run at the same speed:
Query with WHERE and HAVING:
SELECT column1
  FROM Table1
  WHERE column2 = 5
  GROUP BY column1
  HAVING column1 > 6

Query with WHERE only:
SELECT column1
  FROM Table1
  WHERE column2 = 5
    AND column1 > 6
  GROUP BY column1
GAIN: 3/8
The "Query with WHERE only" runs faster on three of the Big Eight. You should use this type of query except in the rare cases where you need to defer the filtering implied by column1 > 6; for example, if the comparison is hard to evaluate.
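To confirm that moving the condition from HAVING to WHERE leaves the result unchanged, here is a minimal sketch with hypothetical sample rows. It uses SQLite through Python's sqlite3 module purely as a convenient way to run both statements; it does not measure speed.

```python
import sqlite3

# Hypothetical sample rows: (column1, column2).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (column1 INTEGER, column2 INTEGER);
    INSERT INTO Table1 VALUES
        (5, 5), (7, 5), (7, 5), (8, 5), (9, 4), (10, 5);
""")

# Filter on column2 before grouping, then filter the groups on column1.
where_and_having = conn.execute("""
    SELECT column1
      FROM Table1
      WHERE column2 = 5
      GROUP BY column1
      HAVING column1 > 6
      ORDER BY column1
""").fetchall()

# Apply both filters before grouping.
where_only = conn.execute("""
    SELECT column1
      FROM Table1
      WHERE column2 = 5
        AND column1 > 6
      GROUP BY column1
      ORDER BY column1
""").fetchall()

print(where_and_having == where_only)
conn.close()
```

The move is possible here because the HAVING condition tests a grouping column. A condition on a set function, such as HAVING COUNT(*) > 1, depends on the finished groups and has to stay in the HAVING clause.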
Alternatives to GROUP BY
If you're writing a query that doesn't involve set functions, you can use DISTINCT as an alternative to GROUP BY. DISTINCT has three advantages: It's simpler, it's legal to use in expressions, and, with some DBMSs, it's faster. So instead of using Query #1, use the alternative Query #2:
Query #1:
SELECT column1
  FROM Table1
  GROUP BY column1

Query #2:
SELECT DISTINCT column1
  FROM Table1
GAIN: 4/8
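The following sketch checks that the two queries return the same rows, including the case where column1 contains NULLs. The sample values are hypothetical, and SQLite (through Python's sqlite3 module) stands in only as an easy way to run the statements, not as a measure of speed.

```python
import sqlite3

# Hypothetical sample values with duplicates and NULLs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Table1 (column1 INTEGER);
    INSERT INTO Table1 VALUES (3), (1), (3), (2), (1), (NULL), (NULL);
""")

# Query #1: one row per group.
with_group_by = conn.execute("""
    SELECT column1
      FROM Table1
      GROUP BY column1
      ORDER BY column1
""").fetchall()

# Query #2: one row per distinct value.
with_distinct = conn.execute("""
    SELECT DISTINCT column1
      FROM Table1
      ORDER BY column1
""").fetchall()

print(with_group_by == with_distinct)
conn.close()
```

Both forms treat the NULLs as a single value, so each returns four rows for this data.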
The Bottom Line: Optimal GROUP BY Clauses
GROUP BY performs better if you keep the number of grouping columns small.
Avoid grouping redundant columns by using set functions.
When you're grouping joined tables, reduce before you expand. You can make a join happen late by replacing it with a set operator.
When you're grouping joined tables, the GROUP BY column should be from the same table as the column(s) on which you're applying a set function.
You can improve performance on some grouped joins by replacing the join with a subquery.
Most DBMSs do not merge WHERE and HAVING clauses. Write your queries with only a WHERE clause wherever possible.
Use DISTINCT instead of GROUP BY if your query doesn't involve set functions.