RecipeA.1.Grouping | SQL Cookbook (Cookbooks (OReilly))

Recipe A.1. Grouping

Before moving on to window functions, it is crucial that you understand how grouping works in SQL. In my experience, the concept of grouping results in SQL has been a stumbling block for many. The problems stem from not fully understanding how the GROUP BY clause works and why certain queries return certain results when using GROUP BY.

Simply stated, grouping is a way to organize like rows together. When you use GROUP BY in a query, each row in the result set is a group and represents one or more rows with the same values in one or more columns that you specify. That's the gist of it.

If a group is simply a unique instance of a row that represents one or more rows with the same value for a particular column (or columns), then practical examples of groups from table EMP include all employees in department 10 (the common value for these employees that enable them to be in the same group is DEPTNO=10) or all clerks (the common value for these employees that enable them to be in the same group is JOB='CLERK'). Consider the following queries. The first shows all employees in department 10; the second query groups the employees in department 10 and returns the following information about the group: the number of rows (members) in the group, the highest salary, and the lowest salary:

 select deptno,ename from emp where deptno=10   DEPTNO ENAME   ------ ----------   10 CLARK   10 KING   10 MILLER  select deptno,    count(*) as cnt,    max(sal) as hi_sal,    min(sal) as lo_sal  from emp    where deptno=10    group by deptno DEPTNO   CNT   HI_SAL  LO_SAL ------ ---------- ---------- ---------- 10 3 5000 1300

If you were not able to group the employees in department 10 together, to get the information in the second query above you would have to manually inspect the rows for that department (trivial if there are only three rows, but what if there were three million rows?). So, why would anyone want to group? Reasons for doing so vary; perhaps you want to see how many different groups exist or how many members (rows) are in each group. As you can see from the simple example above, grouping allows you to get information about many rows in a table without having to inspect them one by one.

Definition of an SQL Group

In mathematics, a group is defined, for the most part, as (G, •, e), where G is a set, • is a binary operation in G, and e is a member of G. We will use this definition as the foundation for what a SQL group is. A SQL group will be defined as (G, e), where G is a result set of a single or self-contained query that uses GROUP BY, e is a member of G, and the following axioms are satisfied:

For each e in G, e is distinct and represents one or more instances of e.
For each e in G, the aggregate function COUNT returns a value > 0.

The result set is included in the definition of a SQL group to reinforce the fact that we are defining what groups are when working with queries only. Thus, it would be accurate to replace "e" in each axiom with the word "row" because the rows in the result set are technically the groups.

Because these properties are fundamental to what we consider a group, it is important that we prove they are true (and we will proceed to do so through the use of some example SQL queries).

Groups are non-empty

By its very definition, a group must have at least one member (or row). If we accept this as a truth, then it can be said that a group cannot be created from an empty table. To prove that proposition true, simply try to prove it is false. The following example creates an empty table, and then attempts to create groups via three different queries against that empty table:

 create table fruits (name varchar(10)) select name from fruits   group by name   (no rows selected)  select count(*) as cnt   from fruits group by name (no rows selected) select name, count(*) as cnt from fruits    group by name    (no rows selected)

As you can see from these queries, it is impossible to create what SQL considers a group from an empty table.

Groups are distinct

Now let's prove that the groups created via queries with a GROUP BY clause are distinct. The following example inserts five rows into table FRUITS, and then creates groups from those rows:

 insert into fruits values ('Oranges') insert into fruits values ('Oranges') insert into fruits values ('Oranges') insert into fruits values ('Apple') insert into fruits values ('Peach') select * from fruits NAME -------- Oranges Oranges Oranges Apple Peach select name from fruits   group by name  NAME  -------  Apple  Oranges  Peach  select name, count(*) as cnt from fruits    group by name    NAME CNT    ------- --------    Apple 1    Oranges 3    Peach 1

The first query shows that "Oranges" occurs three times in table FRUITS. However, the second and third queries (using GROUP BY) return only one instance of "Oranges." Taken together, these queries prove that the rows in the result set (e in G, from our definition) are distinct, and each value of NAME represents one or more instances of itself in table FRUITS.

Knowing that groups are distinct is important because it means, typically, you would not use the DISTINCT keyword in your SELECT list when using a GROUP BY in your queries.

I am in no way suggesting GROUP BY and DISTINCT are the same. They represent two completely different concepts. I am merely stating that the items listed in the GROUP BY clause will be distinct in the result set and that using DISTINCT as well as GROUP BY is redundant.

Frege's Axiom and Russell's Paradox

For those of you who are interested, Frege's axiom of abstraction, based on Cantor's solution for defining set membership for infinite or uncountable sets, states that, given a specific identifying property, there exists a set whose members are only those items having that property. The source of trouble, as put by Robert Stoll, "is the unrestrictd use of the principal of abstraction." Bertrand Russell asked Gottlob Frege to consider a set whose members are sets and have the defining property of not being members of themselves.

As Russell pointed out, the axiom of abstraction gives too much freedom because you are simply specifiying a condition or property to define set membership, thus a contradiction can be found. To better explain how a contradiction can be found, he devised the "Barber Puzzle." The Barber Puzzle states:

In a certain town there is a male barber who shaves all those men, and only those men, who do not shave themselves. If this is true, who, then, shaves the barber?

For a more concrete example, consider the set that can be described as:

For all members x in y that satisfy a specific condition (P)

The mathematical notation for this description is:

 {x e y | P(x)}

Because the above set considers only those x in y that satisfy a condition (P) you may find it more intuitive to describe the set as x is a member of y if and only if x satisfies a condition (P).

At this point let us define this condition P(x) as x is not a member of x:

 ( x e x )

The set is now defined as x is a member of y if and only if x is not a member of x:

 {x e y | ( x e x )}

Russell's paradox may not be clear to you yet, but ask yourself this: can the set above be a member of itself? Let's assume that x = y and look at the above set again. The following set can be defined as y is a member of y if and only if y is not a member of y:

 {y e y | ( y e y )}

Simply put, Russell's paradox leaves us in a position to have a set that is concurrently a member of itself and not a member of itself, which is a contradiction. Intuitive thinking would lead one to believe this isn't a problem at all; indeed, how can a set be a member of itself? The set of all books, after all, is not a book. So why does this paradox exist and how can it be an issue? It becomes an issue when you consider more abstract applications of set theory. For example, a "practical" application of Russell's paradox can be demonstrated by considering the set of all sets. If we allow such a concept to exist, then by its very definition, it must be a member of itself (it is, after all, the set of all sets). What then happens when you apply P(x) above to the set of all sets? Simply stated, Russell's paradox would state that the set of all sets is a member of itself if and only if it is not a member of itselfclearly a contradiction.

For those of you who are interested, Ernst Zermelo developed the axiom schema of separation (also referred to as the axiom schema of subsets or the axiom of specification) to elegantly sidestep Russell's paradox in axiomatic set theory.

COUNT is never zero

The queries and results in the preceding section also prove the final axiom that the aggregate function COUNT will never return zero when used in a query with GROUP BY on a nonempty table. It should not be surprising that you cannot return a count of zero for a group. We have already proved that a group cannot be created from an empty table, thus a group must have at least one row. If at least one row exists, then the count will always be at least 1.

Remember, we are talking about using COUNT with GROUP BY, not COUNT by itself. A query using COUNT without a GROUP BY on an empty table will of course return zero.

Paradoxes

"Hardly anything more unfortunate can befall a scientific writer than to have one of the foundations of his edifice shaken after the work is finished…. This was the position I was placed in by a letter of Mr. Bertrand Russell, just when the printing of this volume was nearing its completion."

The preceding quote is from Gottlob Frege in response to Bertrand Russell's discovery of a contradiction to Frege's axiom of abstraction in set theory.

Paradoxes many times provide scenarios that would seem to contradict established theories or ideas. In many cases these contradictions are localized and can be "worked around," or they are applicable to such small test cases that they can be safely ignored.

You may have guessed by now that the point to all this discussion of paradoxes is that there exists a paradox concerning our definition of an SQL group, and that paradox must be addressed. Although our focus right now is on groups, ultimately we are discussing SQL queries. In its GROUP BY clause, a query may have a wide range of values such as constants, expressions, or, most commonly, columns from a table. We pay a price for this flexibility, because NULL is a valid "value" in SQL. NULLs present problems because they are effectively ignored by aggregate functions. With that said, if a table consists of a single row and its value is NULL, what would the aggregate function COUNT return when used in a GROUP BY query? By our very definition, when using GROUP BY and the aggregate function COUNT, a value >= 1 must be returned. What happens, then, in the case of values ignored by functions such as COUNT, and what does this mean to our definition of a GROUP? Consider the following example, which reveals the NULL group paradox (using the function COALESCE when necessary for readability):

 select * from fruits  NAME  -------  Oranges  Oranges  Oranges  Apple  Peach  insert into fruits values (null)  insert into fruits values (null)  insert into fruits values (null)  insert into fruits values (null)  insert into fruits values (null)  select coalesce(name,'NULL') as name from fruits  NAME  --------  Oranges  Oranges  Oranges  Apple  Peach  NULL  NULL  NULL  NULL  NULL  select coalesce(name,'NULL') as name, count(name) as cnt from fruits   group by name   NAME   CNT   -------- ----------   Apple 1   NULL 0   Oranges 3   Peach 1

It would seem that the presence of NULL values in our table introduces a contradiction, or paradox, to our definition of a SQL group. Fortunately, this contradiction is not a real cause for concern, because the paradox has more to do with the implementation of aggregate functions than our definition. Consider the final query in the preceding set; a general problem statement for that query would be:

Count the number of times each name occurs in table FRUITS or count the number of members in each group.

Examining the INSERT statements above, it's clear that there are five rows with NULL values, which means there exists a NULL group with five members.

While NULL certainly has properties that differentiate it from other values, it is nevertheless a value, and can in fact be a group.

How, then, can we write the query to return a count of 5 instead of 0, thus returning the information we are looking for while conforming to our definition of a group? The example below shows a workaround to deal with the NULL group paradox:

 select coalesce(name,'NULL') as name,    count(*) as cnt    from fruits  group by name  NAME CNT  --------- --------  Apple  1  Oranges  3  Peach  1  NULL  5

The workaround is to use COUNT(*) rather than COUNT(NAME) to avoid the NULL group paradox. Aggregate functions will ignore NULL values if any exist in the column passed to them. Thus, to avoid a zero when using COUNT do not pass the column name; instead, pass in an asterisk (*). The * causes the COUNT function to count rows rather than the actual column values, so whether or not the actual values are NULL or not NULL is irrelevant.

One more paradox has to do with the axiom that each group in a result set (for each e in G) is distinct. Because of the nature of SQL result sets and tables, which are more accurately defined as multisets or "bags," not sets (because duplicate rows are allowed), it is possible to return a result set with duplicate groups. Consider the following queries:

 select coalesce(name,'NULL') as name,    count(*) as cnt    from fruits  group by name  union all select coalesce(name,'NULL') as name, count(*) as cnt    from fruits  group by name  NAME   CNT  ---------- ---------  Apple 1  Oranges 3  Peach 1  NULL 5  Apple 1  Oranges 3  Peach 1  NULL 5  select x.* from (  select coalesce(name,'NULL') as name, count(*) as cnt from fruits   group by name ) x, (select deptno from dept) y  NAME    CNT  ---------- ----------  Apple  1  Apple  1  Apple  1  Apple  1  Oranges  3  Oranges  3  Oranges  3  Oranges  3  Peach  1  Peach  1  Peach  1  Peach  1  NULL  5  NULL  5  NULL  5  NULL  5

As you can see in these queries, the groups are in fact repeated in the final results. Fortunately, this is not much to worry about because it represents only a partial paradox. The first property of a group states that for (G, e), G is a result set from a single or self-contained query that uses GROUP BY. Simply put, the result set from any GROUP BY query itself conforms to our definition of a group. It is only when you combine the result sets from two GROUP BY queries to create a multiset that groups may repeat. The first query in the preceding example uses UNION ALL, which is not a set operation but a multiset operation, and invokes GROUP BY twice, effectively executing two queries.

If you use UNION, which is a set operation, you will not see repeating groups.

The second query in the preceding set uses a Cartesian product, which only works if you materialize the group first and then perform the Cartesian. Thus the GROUP BY query when self-contained conforms to our definition. Neither of the two examples takes anything away from the definition of a SQL group. They are shown for completeness, and so that you can be aware that almost anything is possible in SQL.

Relationship Between SELECT and GROUP BY

With the concept of a group defined and proved, it is now time to move on to more practical matters concerning queries using GROUP BY. It is important to understand the relationship between the SELECT clause and the GROUP BY clause when grouping in SQL. It is important to keep in mind when using aggregate functions such as COUNT that any item in your SELECT list that is not used as an argument to an aggregate function must be part of your group. For example, if you write a SELECT clause such as:

 select deptno, count(*) as cnt from emp

then you must list DEPTNO in your GROUP BY clause:

 select deptno, count(*) as cnt from emp    group by deptno    DEPTNO CNT    ------- ---- 10   3 20   5 30    6

Constants, scalar values returned by user-defined functions, window functions, and non-correlated scalar subqueries are exceptions to this rule. Since the SELECT clause is evaluated after the GROUP BY clause, these constructs are allowed in the SELECT list and do not have to (and in some cases cannot) be specified in the GROUP BY clause. For example:

 select 'hello' as msg, 1 as num, deptno, (select count(*) from emp) as total, count(*) as cnt    from emp  group by deptno  MSG NUM DEPTNO TOTAL CNT  -----  --- ------ ----- ---  hello 1 10 14 3  hello 1 20 14 5  hello 1 30 14 6

Don't let this query confuse you. The items in the SELECT list not listed in the GROUP BY clause do not change the value of CNT for each DEPTNO, nor do the values for DEPTNO change. Based on the results of the preceding query, we can define the rule about matching items in the SELECT list and the GROUP BY clause when using aggregates a bit more precisely:

Items in a SELECT list that can potentially change the group or change the value returned by an aggregate function must be included in the GROUP BY clause.

The additional items in the preceding SELECT list did not change the value of CNT for any group (each DEPTNO), nor did they change the groups themselves.

Now it's fair to ask: exactly what items in a SELECT list can change a grouping or the value returned by an aggregate function? The answer is simple: other columns from the table(s) you are selecting from. Consider the prospect of adding the JOB column to the query we've been looking at:

 select deptno, job, count(*) as cnt from emp   group by deptno, job   DEPTNO JOB CNT   ------ ---------- ----   10 CLERK 1   10 MANAGER 1   10 PRESIDENT 1   20 CLERK 2   20 ANALYST 2   20 MANAGER 1   30 CLERK 1   30 MANAGER 1   30 SALESMAN 4

By listing another column, JOB, from table EMP, we are changing the group and changing the result set; thus we must now include JOB in the GROUP BY clause along with DEPTNO, otherwise the query will fail. The inclusion of JOB in the SELECT/GROUP BY clauses changes the query from "How many employees are in each department?" to "How many different types of employees are in each department?" Notice again that the groups are distinct; the values for DEPTNO and JOB individually are not distinct, but the combination of the two (which is what is in the GROUP BY and SELECT list, and thus is the group) are distinct (e.g., 10 and CLERK appear only once).

If you choose not to put items other than aggregate functions in the SELECT list, then you may list any valid column you wish, in the GROUP BY clause. Consider the following two queries, which highlight this fact:

 select count(*) from emp    group by deptno COUNT(*)    --------- 3 5 6 select count(*) from emp    group by deptno,job    COUNT(*)  ----------   1 1 1 2 2 1 1 1 4

Including items other than aggregate functions in the SELECT list is not mandatory, but often improves readability and usability of the results.

As a rule, when using GROUP BY and aggregate functions, any items in the SELECT list [from the table(s) in the FROM clause] not used as an argument to an aggregate function must be included in the GROUP BY clause. However, MySQL has a "feature" that allows you to deviate from this rule, allowing you to place items in your SELECT list [that are columns in the table(s) you are selecting from] that are not used as arguments to an aggregate function and that are not present in your GROUP BY clause. I use the term "feature" very loosely here as its use is a bug waiting to happen and I urge you to avoid it. As a matter of fact, if you use MySQL and care at all about the accuracy of your queries I suggest you urge them to remove this, ahem, "feature."