Recipe A.1. GroupingBefore moving on to window functions, it is crucial that you understand how grouping works in SQL. In my experience, the concept of grouping results in SQL has been a stumbling block for many. The problems stem from not fully understanding how the GROUP BY clause works and why certain queries return certain results when using GROUP BY. Simply stated, grouping is a way to organize like rows together. When you use GROUP BY in a query, each row in the result set is a group and represents one or more rows with the same values in one or more columns that you specify. That's the gist of it. If a group is simply a unique instance of a row that represents one or more rows with the same value for a particular column (or columns), then practical examples of groups from table EMP include all employees in department 10 (the common value for these employees that enable them to be in the same group is DEPTNO=10) or all clerks (the common value for these employees that enable them to be in the same group is JOB='CLERK'). Consider the following queries. The first shows all employees in department 10; the second query groups the employees in department 10 and returns the following information about the group: the number of rows (members) in the group, the highest salary, and the lowest salary: select deptno,ename from emp where deptno=10 DEPTNO ENAME ------ ---------- 10 CLARK 10 KING 10 MILLER select deptno, count(*) as cnt, max(sal) as hi_sal, min(sal) as lo_sal from emp where deptno=10 group by deptno DEPTNO CNT HI_SAL LO_SAL ------ ---------- ---------- ---------- 10 3 5000 1300 If you were not able to group the employees in department 10 together, to get the information in the second query above you would have to manually inspect the rows for that department (trivial if there are only three rows, but what if there were three million rows?). So, why would anyone want to group? Reasons for doing so vary; perhaps you want to see how many different groups exist or how many members (rows) are in each group. As you can see from the simple example above, grouping allows you to get information about many rows in a table without having to inspect them one by one. Definition of an SQL GroupIn mathematics, a group is defined, for the most part, as (G, •, e), where G is a set, • is a binary operation in G, and e is a member of G. We will use this definition as the foundation for what a SQL group is. A SQL group will be defined as (G, e), where G is a result set of a single or self-contained query that uses GROUP BY, e is a member of G, and the following axioms are satisfied:
Because these properties are fundamental to what we consider a group, it is important that we prove they are true (and we will proceed to do so through the use of some example SQL queries). Groups are non-emptyBy its very definition, a group must have at least one member (or row). If we accept this as a truth, then it can be said that a group cannot be created from an empty table. To prove that proposition true, simply try to prove it is false. The following example creates an empty table, and then attempts to create groups via three different queries against that empty table: create table fruits (name varchar(10)) select name from fruits group by name (no rows selected) select count(*) as cnt from fruits group by name (no rows selected) select name, count(*) as cnt from fruits group by name (no rows selected) As you can see from these queries, it is impossible to create what SQL considers a group from an empty table. Groups are distinctNow let's prove that the groups created via queries with a GROUP BY clause are distinct. The following example inserts five rows into table FRUITS, and then creates groups from those rows: insert into fruits values ('Oranges') insert into fruits values ('Oranges') insert into fruits values ('Oranges') insert into fruits values ('Apple') insert into fruits values ('Peach') select * from fruits NAME -------- Oranges Oranges Oranges Apple Peach select name from fruits group by name NAME ------- Apple Oranges Peach select name, count(*) as cnt from fruits group by name NAME CNT ------- -------- Apple 1 Oranges 3 Peach 1 The first query shows that "Oranges" occurs three times in table FRUITS. However, the second and third queries (using GROUP BY) return only one instance of "Oranges." Taken together, these queries prove that the rows in the result set (e in G, from our definition) are distinct, and each value of NAME represents one or more instances of itself in table FRUITS. Knowing that groups are distinct is important because it means, typically, you would not use the DISTINCT keyword in your SELECT list when using a GROUP BY in your queries.
COUNT is never zeroThe queries and results in the preceding section also prove the final axiom that the aggregate function COUNT will never return zero when used in a query with GROUP BY on a nonempty table. It should not be surprising that you cannot return a count of zero for a group. We have already proved that a group cannot be created from an empty table, thus a group must have at least one row. If at least one row exists, then the count will always be at least 1.
Paradoxes
The preceding quote is from Gottlob Frege in response to Bertrand Russell's discovery of a contradiction to Frege's axiom of abstraction in set theory. Paradoxes many times provide scenarios that would seem to contradict established theories or ideas. In many cases these contradictions are localized and can be "worked around," or they are applicable to such small test cases that they can be safely ignored. You may have guessed by now that the point to all this discussion of paradoxes is that there exists a paradox concerning our definition of an SQL group, and that paradox must be addressed. Although our focus right now is on groups, ultimately we are discussing SQL queries. In its GROUP BY clause, a query may have a wide range of values such as constants, expressions, or, most commonly, columns from a table. We pay a price for this flexibility, because NULL is a valid "value" in SQL. NULLs present problems because they are effectively ignored by aggregate functions. With that said, if a table consists of a single row and its value is NULL, what would the aggregate function COUNT return when used in a GROUP BY query? By our very definition, when using GROUP BY and the aggregate function COUNT, a value >= 1 must be returned. What happens, then, in the case of values ignored by functions such as COUNT, and what does this mean to our definition of a GROUP? Consider the following example, which reveals the NULL group paradox (using the function COALESCE when necessary for readability): select * from fruits NAME ------- Oranges Oranges Oranges Apple Peach insert into fruits values (null) insert into fruits values (null) insert into fruits values (null) insert into fruits values (null) insert into fruits values (null) select coalesce(name,'NULL') as name from fruits NAME -------- Oranges Oranges Oranges Apple Peach NULL NULL NULL NULL NULL select coalesce(name,'NULL') as name, count(name) as cnt from fruits group by name NAME CNT -------- ---------- Apple 1 NULL 0 Oranges 3 Peach 1 It would seem that the presence of NULL values in our table introduces a contradiction, or paradox, to our definition of a SQL group. Fortunately, this contradiction is not a real cause for concern, because the paradox has more to do with the implementation of aggregate functions than our definition. Consider the final query in the preceding set; a general problem statement for that query would be:
Examining the INSERT statements above, it's clear that there are five rows with NULL values, which means there exists a NULL group with five members.
How, then, can we write the query to return a count of 5 instead of 0, thus returning the information we are looking for while conforming to our definition of a group? The example below shows a workaround to deal with the NULL group paradox: select coalesce(name,'NULL') as name, count(*) as cnt from fruits group by name NAME CNT --------- -------- Apple 1 Oranges 3 Peach 1 NULL 5 The workaround is to use COUNT(*) rather than COUNT(NAME) to avoid the NULL group paradox. Aggregate functions will ignore NULL values if any exist in the column passed to them. Thus, to avoid a zero when using COUNT do not pass the column name; instead, pass in an asterisk (*). The * causes the COUNT function to count rows rather than the actual column values, so whether or not the actual values are NULL or not NULL is irrelevant. One more paradox has to do with the axiom that each group in a result set (for each e in G) is distinct. Because of the nature of SQL result sets and tables, which are more accurately defined as multisets or "bags," not sets (because duplicate rows are allowed), it is possible to return a result set with duplicate groups. Consider the following queries: select coalesce(name,'NULL') as name, count(*) as cnt from fruits group by name union all select coalesce(name,'NULL') as name, count(*) as cnt from fruits group by name NAME CNT ---------- --------- Apple 1 Oranges 3 Peach 1 NULL 5 Apple 1 Oranges 3 Peach 1 NULL 5 select x.* from ( select coalesce(name,'NULL') as name, count(*) as cnt from fruits group by name ) x, (select deptno from dept) y NAME CNT ---------- ---------- Apple 1 Apple 1 Apple 1 Apple 1 Oranges 3 Oranges 3 Oranges 3 Oranges 3 Peach 1 Peach 1 Peach 1 Peach 1 NULL 5 NULL 5 NULL 5 NULL 5 As you can see in these queries, the groups are in fact repeated in the final results. Fortunately, this is not much to worry about because it represents only a partial paradox. The first property of a group states that for (G, e), G is a result set from a single or self-contained query that uses GROUP BY. Simply put, the result set from any GROUP BY query itself conforms to our definition of a group. It is only when you combine the result sets from two GROUP BY queries to create a multiset that groups may repeat. The first query in the preceding example uses UNION ALL, which is not a set operation but a multiset operation, and invokes GROUP BY twice, effectively executing two queries.
The second query in the preceding set uses a Cartesian product, which only works if you materialize the group first and then perform the Cartesian. Thus the GROUP BY query when self-contained conforms to our definition. Neither of the two examples takes anything away from the definition of a SQL group. They are shown for completeness, and so that you can be aware that almost anything is possible in SQL. Relationship Between SELECT and GROUP BYWith the concept of a group defined and proved, it is now time to move on to more practical matters concerning queries using GROUP BY. It is important to understand the relationship between the SELECT clause and the GROUP BY clause when grouping in SQL. It is important to keep in mind when using aggregate functions such as COUNT that any item in your SELECT list that is not used as an argument to an aggregate function must be part of your group. For example, if you write a SELECT clause such as: select deptno, count(*) as cnt from emp then you must list DEPTNO in your GROUP BY clause: select deptno, count(*) as cnt from emp group by deptno DEPTNO CNT ------- ---- 10 3 20 5 30 6 Constants, scalar values returned by user-defined functions, window functions, and non-correlated scalar subqueries are exceptions to this rule. Since the SELECT clause is evaluated after the GROUP BY clause, these constructs are allowed in the SELECT list and do not have to (and in some cases cannot) be specified in the GROUP BY clause. For example: select 'hello' as msg, 1 as num, deptno, (select count(*) from emp) as total, count(*) as cnt from emp group by deptno MSG NUM DEPTNO TOTAL CNT ----- --- ------ ----- --- hello 1 10 14 3 hello 1 20 14 5 hello 1 30 14 6 Don't let this query confuse you. The items in the SELECT list not listed in the GROUP BY clause do not change the value of CNT for each DEPTNO, nor do the values for DEPTNO change. Based on the results of the preceding query, we can define the rule about matching items in the SELECT list and the GROUP BY clause when using aggregates a bit more precisely:
The additional items in the preceding SELECT list did not change the value of CNT for any group (each DEPTNO), nor did they change the groups themselves. Now it's fair to ask: exactly what items in a SELECT list can change a grouping or the value returned by an aggregate function? The answer is simple: other columns from the table(s) you are selecting from. Consider the prospect of adding the JOB column to the query we've been looking at: select deptno, job, count(*) as cnt from emp group by deptno, job DEPTNO JOB CNT ------ ---------- ---- 10 CLERK 1 10 MANAGER 1 10 PRESIDENT 1 20 CLERK 2 20 ANALYST 2 20 MANAGER 1 30 CLERK 1 30 MANAGER 1 30 SALESMAN 4 By listing another column, JOB, from table EMP, we are changing the group and changing the result set; thus we must now include JOB in the GROUP BY clause along with DEPTNO, otherwise the query will fail. The inclusion of JOB in the SELECT/GROUP BY clauses changes the query from "How many employees are in each department?" to "How many different types of employees are in each department?" Notice again that the groups are distinct; the values for DEPTNO and JOB individually are not distinct, but the combination of the two (which is what is in the GROUP BY and SELECT list, and thus is the group) are distinct (e.g., 10 and CLERK appear only once). If you choose not to put items other than aggregate functions in the SELECT list, then you may list any valid column you wish, in the GROUP BY clause. Consider the following two queries, which highlight this fact: select count(*) from emp group by deptno COUNT(*) --------- 3 5 6 select count(*) from emp group by deptno,job COUNT(*) ---------- 1 1 1 2 2 1 1 1 4 Including items other than aggregate functions in the SELECT list is not mandatory, but often improves readability and usability of the results.
|