RecipeA.2.Windowing | SQL Cookbook (Cookbooks (OReilly))

Recipe A.2. Windowing

Once you understand the concept of grouping and using aggregates in SQL, understanding window functions is easy. Window functions, like aggregate functions, perform an aggregation on a defined set (a group) of rows, but rather than returning one value per group, window functions can return multiple values for each group. The group of rows to perform the aggregation on is the window (hence the name "window functions"). DB2 actually calls such functions online analytic processing (OLAP) functions, and Oracle calls them analytic functions, but the ISO SQL standard calls them window functions, so that's the term I use in this book.

A Simple Example

Let's say that you wish to count the total number of employees across all departments. The traditional method for doing that is to issue a COUNT(*) query against the entire EMP table:

 select count(*) as cnt from emp   CNT -----      14

This is easy enough, but often you will find yourself wanting to access such aggregate data from rows that do not represent an aggregation, or that represent a different aggregation. Window functions make light work of such problems. For example, the following query shows how you can use a window function to access aggregate data (the total count of employees) from detail rows (one per employee):

 select ename,    deptno,    count(*) over( ) as cnt  from emp    order by 2    ENAME     DEPTNO CNT    ----------   ------ ------    CLARK 10    14    KING 10    14    MILLER 10    14    SMITH 20    14    ADAMS 20    14    FORD 20    14    SCOTT 20    14    JONES 20    14    ALLEN 30    14    BLAKE 30    14    MARTIN 30    14    JAMES 30    14    TURNER 30    14    WARD 30    14

The window function invocation in this example is COUNT(*) OVER( ). The presence of the OVER keyword indicates that the invocation of COUNT will be treated as a window function, not as an aggregate function. In general, the SQL standard allows for all aggregate functions to also be window functions, and the keyword OVER is how the language distinguishes between the two uses.

So, what did the window function COUNT(*) OVER ( ) do exactly? For every row being returned in the query, it returned the count of all the rows in the table. As the empty parentheses suggest, the OVER keyword accepts additional clauses to affect the range of rows that a given window function considers. Absent any such clauses, the window function looks at all rows in the result set, which is why you see the value 14 repeated in each row of output.

Hopefully you begin to see the great utility of window functions, which is that they allow you to work with multiple levels of aggregation in one row. As you continue through this appendix, you'll begin to see even more just how incredibly useful that ability can be.

Order of Evaluation

Before digging deeper into the OVER clause, it is important to note that window functions are performed as the last step in SQL processing prior to the ORDER BY clause. As an example of how window functions are processed last, let's take the query from the preceding section and use a WHERE clause to filter out employees from DEPTNO 20 and 30:

 select ename,    deptno,    count(*) over( ) as cnt    from emp  where deptno = 10  order by 2  ENAME DEPTNO   CNT  ---------- ------  ------  CLARK 10 3  KING 10 3  MILLER 10 3

The value for CNT for each row is no longer 14, it is now 3. In this example, it is the WHERE clause that restricts the result set to three rows, hence the window function will count only three rows (there are only three rows available to the window function by the time processing reaches the SELECT portion of the query). From this example you can see that window functions perform their computations after clauses such as WHERE and GROUP BY are evaluated.

Partitions

Use the PARTITION BY clause to define a partition or group of rows to perform an aggregation over. As we've seen already, if you use empty parentheses then the entire result set is the partition that a window function aggregation will be computed over. You can think of the PARTITION BY clause as a "moving GROUP BY" because unlike a traditional GROUP BY, a group created by PARTITION BY is not distinct in a result set. You can use PARTITION BY to compute an aggregation over a defined group of rows (resetting when a new group is encountered) and rather than having one group represent all instances of that value in the table, each value (each member in each group) is returned. Consider the following query:

 select ename,   deptno,   count(*) over(partition by deptno) as cnt from emp order by 2 ENAME DEPTNO CNT ---------- ------   ------ CLARK 10 3 KING 10 3 MILLER 10 3 SMITH 20 5 ADAMS 20 5 FORD 20 5 SCOTT 20 5 JONES 20 5 ALLEN 30 6 BLAKE 30 6 MARTIN 30 6 JAMES 30 6 TURNER 30 6 WARD 30      6

This query still returns 14 rows, but now the COUNT is performed for each department as a result of the PARTITION BY DEPTNO clause. Each employee in the same department (in the same partition) will have the same value for CNT, because the aggregation will not reset (recompute) until a new department is encountered. Also note that you are returning information about each group, along with the members of each group. You can think of the preceding query as a more efficient version of the following:

 select e.ename,    e.deptno,    (select count(*) from emp d   where e.deptno=d.deptno) as cnt from emp e   order by 2   ENAME  DEPTNO    CNT   ---------- ------ ------   CLARK 10 3   KING 10 3   MILLER 10 3   SMITH 20 5   ADAMS 20 5   FORD 20 5   SCOTT 20 5   JONES 20 5   ALLEN 30 6   BLAKE 30 6   MARTIN 30 6   JAMES 30 6   TURNER 30 6   WARD 30 6

Additionally, what's nice about the PARTITION BY clause is that it performs its computations independently of other window functions, partitioning by different columns in the same SELECT statement. Consider the following query, which returns each employee, her department, the number of employees in her respective department, her job, and the number of employees with the same job:

 select ename, deptno, count(*) over(partition by deptno) as dept_cnt, job, count(*) over(partition by job) as job_cnt from emp    order by 2    ENAME     DEPTNO   DEPT_CNT JOB JOB_CNT    ----------   ------   -------- --------- -------    MILLER 10 3 CLERK 4    CLARK 10 3 MANAGER 3    KING 10 3 PRESIDENT 1    SCOTT 20 5 ANALYST 2    FORD 20 5 ANALYST 2    SMITH 20 5 CLERK 4    JONES 20 5 MANAGER 3    ADAMS 20 5 CLERK 4    JAMES 30 6 CLERK 4    MARTIN 30 6 SALESMAN 4    TURNER 30 6 SALESMAN 4    WARD 30 6 SALESMAN 4    ALLEN 30 6 SALESMAN 4    BLAKE 30 6 MANAGER 3

In this result set, you can see that employees in the same department have the same value for DEPT_CNT, and that employees who have the same job position have the same value for JOB_CNT.

By now it should be clear that the PARTITION BY clause works like a GROUP BY clause, but it does so without being affected by the other items in the SELECT clause and without requiring you to write a GROUP BY clause.

Effect of NULLs

Like the GROUP BY clause, the PARTITION BY clause lumps all the NULLs into one group or partition. Thus, the effect from NULLs when using PARTITION BY is similar to that from using GROUP BY. The following query uses a window function to count the number of employees with each distinct commission (returning1 in place of NULL for readability):

 select coalesce(comm,-1) as comm, count(*)over(partition by comm) as cnt   from emp   COMM   CNT ------ ---------- 0 1   300 1   500 1  1400 1    -1    10    -1    10    -1    10    -1    10    -1    10    -1    10    -1          10    -1          10    -1    10    -1    10

Because COUNT(*) is used, the function counts rows. You can see that there are 10 employees having NULL commissions. Use COMM instead of *, however, and you get quite different results:

 select coalesce(comm,-1) as comm,            count(comm)over(partition by comm) as cnt       from emp     COMM CNT ---- ----------    0          1  300   1      500          1     1400          1       -1          0       -1          0       -1          0       -1          0       -1          0       -1          0       -1          0       -1          0       -1          0       -1          0

This query uses COUNT(COMM), which means that only the non-NULL values in the COMM column are counted. There is one employee with a commission of 0, one employee with a commission of 300, and so forth. But notice the counts for those with NULL commissions! Those counts are 0. Why? Because aggregate functions ignore NULL values, or more accurately, aggregate functions count only non-NULL values.

When using COUNT, consider whether you wish to include NULLs. Use COUNT(column) to avoid counting NULLs. Use COUNT(*) if you do wish to include NULLs (since you are no longer counting actual column values, you are counting rows).

When Order Matters

Sometimes the order in which rows are treated by a window function is material to the results that you wish to obtain from a query. For this reason, window function syntax includes an ORDER BY subclause that you can place within an OVER clause. Use the ORDER BY clause to specify how the rows are ordered with a partition (remember, "partition" in the absence of a PARTITION BY clause means the entire result set).

Some window functions require you to impose order on the partitions of rows being affected. Thus, for some window functions an ORDER BY clause is mandatory.

When you use an ORDER BY clause in the OVER clause of a window function you are specifying two things:

How the rows in the partition are ordered
What rows are included in the computation

Consider the following query, which sums and computes a running total of salaries for employees in DEPTNO 10:

 select deptno, ename, hiredate, sal, sum(sal)over(partition by deptno) as total1, sum(sal)over( ) as total2, sum(sal)over(order by hiredate) as running_total from emp    where deptno=10    DEPTNO  ENAME   HIREDATE    SAL TOTAL1 TOTAL2 RUNNING_TOTAL    ------  ------  ----------- ----- ------ ------ -------------    10  CLARK   09-JUN-1981  2450   8750   8750 2450    10  KING    17-NOV-1981  5000   8750   8750 7450    10  MILLER  23-JAN-1982  1300   8750   8750 8750

Just to keep you on your toes, I've included a sum with empty parentheses. Notice how TOTAL1 and TOTAL2 have the same values. Why? Once again, the order in which window functions are evaluated answers the question. The WHERE clause filters the result set such that only salaries from DEPTNO 10 are considered for summation. In this case there is only one partitionthe entire result set, which consists of only salaries from DEPTNO 10. Thus TOTAL1 and TOTAL2 are the same.

Looking at the values retuned by column SAL, you can easily see where the values for RUNNING_TOTAL come from. You can eyeball the values and add them yourself to compute the running total. But more importantly, why did including an ORDER BY in the OVER clause create a running total in the first place? The reason is, when you use ORDER BY in the OVER clause you are specify a default "moving" or "sliding" window within the partition even though you don't see it. The ORDER BY HIREDATE clause terminates summation at the HIREDATE in the current row.

The following query is the same as the previous one, but uses the RANGE BETWEEN clause (which you'll learn more about later) to explicitly specify the default behavior that results from ORDER BY HIREDATE:

 select deptno,    ename,    hiredate,    sal,    sum(sal)over(partition by deptno) as total1,    sum(sal)over( ) as total2,    sum(sal)over(order by hiredate range between unbounded preceding and current row) as running_total    from emp  where deptno=10  DEPTNO ENAME  HIREDATE  SAL TOTAL1 TOTAL2 RUNNING_TOTAL  ------ ------ ----------- ----- ------ ------ -------------  10 CLARK  09-JUN-1981 2450 8750   8750 2450  10 KING   17-NOV-1981 5000 8750   8750 7450  10 MILLER 23-JAN-1982  1300 8750     8750          8750

The RANGE BETWEEN clause that you see in this query is termed the framing clause by ANSI and I'll use that term here. Now, it should be easy to see why specifying an ORDER BY in the OVER clause created a running total; we've (by default) told the query to sum all rows starting from the current row and include all prior rows ("prior" as defined in the ORDER BY, in this case ordering the rows by HIREDATE).

The Framing Clause

Let's apply the framing clause from the preceding query to the result set, starting with the first employee hired, who is named CLARK.

Starting with CLARK's salary, 2450, and including all employees hired before CLARK, compute a sum. Since CLARK was the first employee hired in DEPTNO 10, the sum is simply CLARK's salary, 2450, which is the first value returned by RUNNING_TOTAL.
Let's move to the next employee based on HIREDATE, named KING, and apply the framing clause once again. Compute a sum on SAL starting with the current row, 5000 (KING's salary), and include all prior rows (all employees hired before KING). CLARK is the only one hired before KING so the sum is 5000 + 2450, which is 7450, the second value returned by RUNNING_TOTAL.
Moving on to MILLER, the last employee in the partition based on HIREDATE, let's one more time apply the framing clause. Compute a sum on SAL starting with the current row, 1300 (MILLER's salary), and include all prior rows (all employees hired before MILLER). CLARK and KING were both hired before MILLER, and thus their salaries are included in MILLER's RUNNING_TOTAL: 2450 + 5000 + 1300 is 8750, which is the value for RUNNING_TOTAL for MILLER.

As you can see, it is really the framing clause that produces the running total. The ORDER BY defines the order of evaluation and happens to also imply a default framing.

In general, the framing clause allows you to define different "sub-windows" of data to include in your computations. There are many ways to specify such sub-windows. Consider the following query:

 select deptno, ename, sal, sum(sal)over(order by hiredate  range between unbounded preceding    and current row) as run_total1, sum(sal)over(order by hiredate   rows between 1 preceding    and current row) as run_total2, sum(sal)over(order by hiredate  range between current row    and unbounded following) as run_total3, sum(sal)over(order by hiredate   rows between current row    and 1 following) as run_total4 from emp   where deptno=10 DEPTNO ENAME SAL RUN_TOTAL1 RUN_TOTAL2 RUN_TOTAL3 RUN_TOTAL4 ------ ------ ----- ---------- ---------- ---------- ---------- 10 CLARK 2450   2450  2450     8750 7450 10 KING 5000   7450       7450       6300        6300 10 MILLER   1300      8750       6300       1300        1300

Don't be intimidated here; this query is not as bad as it looks. You've already seen RUN_TOTAL1 and the effects of the framing clause "UNBOUNDED PRECEDING AND CURRENT ROW". Here's a quick description of what's happening in the other examples:

RUN_TOTAL2

Rather than the keyword RANGE, this framing clause specifies ROWS, which means the frame, or window, is going to be constructed by counting some number of rows. The 1 PRECEDING means that the frame will begin with the row immediately preceding the current row. The range continues through the CUR-RENT ROW. So what you get in RUN_TOTAL2 is the sum of the current employee's salary and that of the preceding employee, based on HIREDATE.

It so happens that RUN_TOTAL1 and RUN_TOTAL2 are the same for both CLARK and KING. Why? Think about which values are being summed for each of those employees, for each of the two window functions. Think carefully, and you'll get the answer.

RUN_TOTAL3

The window function for RUN_TOTAL3 works just the opposite of that for RUN_TOTAL1; rather than starting with the current row and including all prior rows in the summation, summation begins with the current row and includes all subsequent rows in the summation.

RUN_TOTAL4

Is inverse of RUN_TOTAL2; rather than starting from the current row and including one prior row in the summation, start with the current row and include one subsequent row in the summation.

If you can understand what's been explained thus far, you will have no problem with any of the recipes in this book. If you're not catching on, though, try practicing with your own examples and your own data. I personally find learning easier by actually coding new features rather than just reading about them.

A Framing Finale

As a final example of the effect of the framing clause on query output, consider the following query:

 select ename, sal, min(sal)over(order by sal) min1, max(sal)over(order by sal) max1, min(sal)over(order by sal  range between unbounded preceding    and unbounded following) min2, max(sal)over(order by sal  range between unbounded preceding    and unbounded following) max2, min(sal)over(order by sal  range between current row    and current row) min3, max(sal)over(order by sal  range between current row    and current row) max3, max(sal)over(order by sal   rows between 3 preceding    and 3 following) max4   from emp   ENAME SAL MIN1 MAX1 MIN2 MAX2 MIN3 MAX3 MAX4   ------  -----   ------  ------  ------  ------  ------  ------  ------   SMITH 800 800 800 800 5000 800 800 1250   JAMES 950 800 950 800 5000 950 950 1250   ADAMS 1100 800 1100 800 5000 1100 1100 1300   WARD 1250 800 1250 800 5000 1250 1250 1500   MARTIN 1250 800 1250 800 5000 1250 1250 1600   MILLER 1300 800 1300 800 5000 1300 1300 2450   TURNER 1500 800 1500 800 5000 1500 1500 2850   ALLEN 1600 800 1600 800 5000 1600 1600 2975   CLARK 2450 800 2450 800 5000 2450 2450 3000   BLAKE 2850 800 2850 800 5000 2850 2850 3000   JONES 2975 800 2975 800 5000 2975 2975 5000   SCOTT 3000 800 3000 800 5000 3000 3000 5000   FORD 3000 800 3000 800 5000 3000 3000 5000   KING 5000 800 5000 800 5000 5000 5000 5000

OK, let's break this query down:

MIN1

The window function generating this column does not specify a framing clause, so the default framing clause of UNBOUNDED PRECEDING AND CURRENT ROW kicks in. Why is MIN1 800 for all rows? It's because the lowest salary comes first (ORDER BY SAL), and it remains the lowest, or minimum, salary forever after.

MAX1

The values for MAX1 are much different from those for MIN1. Why? The answer (again) is the default framing clause UNBOUNDED PRECEDING AND CURRENT ROW. In conjunction with ORDER BY SAL, this framing clause ensures that the maximum salary will also correspond to that of the current row.

Consider the first row, for SMITH. When evaluating SMITH's salary and all prior salaries, MAX1 for SMITH is SMITH's salary, because there are no prior salaries. Moving on to the next row, JAMES, when comparing JAMES' salary to all prior salaries, in this case comparing to the salary of SMITH, JAMES' salary is the higher of the two, and thus it is the maximum. If you apply this logic to all rows, you will see that the value of MAX1 for each row is the current employee's salary.

MIN2 and MAX2

The framing clause given for these is UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, which is the same as specifying empty parentheses. Thus, all rows in the result set are considered when computing MIN and MAX. As you might expect, the MIN and MAX values for the entire result set are constant, and thus the value of these columns is constant as well.

MIN3 and MAX3

The framing clause for these is CURRENT ROW AND CURRENT ROW, which simply means use only the current employee's salary when looking for the MIN and MAX salary. Thus both MIN3 and MAX3 are the same as SAL for each row. That was easy, wasn't it?

MAX4

The framing clause defined for MAX4 is 3 PRECEDING AND 3 FOLLOWING, which means, for every row, consider the three rows prior and the three rows after the current row, as well as the current row itself. This particular invocation of MAX(SAL) will return from those rows the highest salary value.

If you look at the value of MAX4 for employee MARTIN you can see how the framing clause is applied. MARTIN's salary is 1250 and the three employee salaries prior to MARTIN's are WARD's (1250), ADAMS' (1100) and JAMES' (950). The three employee salaries after MARTIN's are MILLER's (1300), TURNER's (1500), and ALLEN's (1600). Out of all those salaries, including MARTIN's, the highest is ALLEN's, and thus the value of MAX4 for MARTIN is 1600.

Readability + Performance = Power

As you can see, window functions are extremely powerful as they allow you to write queries that contain both detailed and aggregate information. Using window functions allows you to write smaller, more efficient queries as compared to using multiple self join and/or scalar subqueries. Consider the following query, which easily answers all of the following questions: "What is the number of employees in each department? How many different types of employees are in each department (e.g., how many clerks are in department 10)? How many total employees are in table EMP?"

 select deptno, job, count(*) over (partition by deptno) as emp_cnt, count(job) over (partition by deptno,job) as job_cnt, count(*) over ( ) as total from emp DEPTNO JOB EMP_CNT JOB_CNT TOTAL ------ --------- ---------- ---------- ---------- 10 CLERK 3 1 14 10 MANAGER 3 1 14 10 PRESIDENT 3 1 14 20 ANALYST 5 2 14 20 ANALYST 5 2 14 20 CLERK 5 2 14 20 CLERK 5 2 14 20 MANAGER 5 1 14 30 CLERK 6 1 14 30 MANAGER 6 1 14 30 SALESMAN 6 4 14 30 SALESMAN 6 4 14 30 SALESMAN 6 4 14 30 SALESMAN 6 4 14

To return the same result set without using window functions would require a bit more work:

 select a.deptno, a.job, (select count(*) from emp b where b.deptno = a.deptno) as emp_cnt, (select count(*) from emp b where b.deptno = a.deptno and b.job = a.job) as job_cnt, (select count(*) from emp) as total  from emp a order by 1,2 DEPTNO JOB EMP_CNT JOB_CNT TOTAL ------ --------- ----------  ---------- ---------- 10 CLERK 3 1 14 10 MANAGER 3 1 14 10 PRESIDENT 3 1 14 20 ANALYST 5 2 14 20 ANALYST 5 2 14 20 CLERK 5 2 14 20 CLERK 5 2 14 20 MANAGER 5 1 14 30 CLERK 6 1 14 30 MANAGER 6 1 14 30 SALESMAN 6 4 14 30 SALESMAN 6 4 14 30 SALESMAN 6 4 14 30 SALESMAN 6 4 14

The non-window solution is obviously not difficult to write, yet it certainly is not as clean or efficient (you won't see performance differences with a 14-row table, but try these queries with, say, a 1,000- or 10,000-row table and then you'll see the benefit of using window functions over multiple self joins and scalar subqueries).

Providing a Base

Besides readability and performance, window functions are useful for providing a "base" for more complex "report style" queries. For example, consider the following "report style" query that uses window functions in an inline view and then aggregates the results in an outer query. Using window functions allows you to return detailed as well as aggregate data, which is useful for reports. The query below uses window functions to find counts using different partitions. Because the aggregation is applied to multiple rows, the inline view returns all rows from EMP, which the outer CASE expressions can use to transpose and create a formatted report:

 select deptno,    emp_cnt as dept_total,    total,    max(case when job = 'CLERK' then job_cnt else 0 end) as clerks,    max(case when job = 'MANAGER' then job_cnt else 0 end) as mgrs,    max(case when job = 'PRESIDENT' then job_cnt else 0 end) as prez,    max(case when job = 'ANALYST' then job_cnt else 0 end) as anals,    max(case when job = 'SALESMAN' then job_cnt else 0 end) as smen from (   select deptno, job, count(*) over (partition by deptno) as emp_cnt, count(job) over (partition by deptno,job) as job_cnt, count(*) over () as total from emp ) x   group by deptno, emp_cnt, total   DEPTNO DEPT_TOTAL TOTAL CLERKS MGRS PREZ ANALS SMEN   ------ ---------- ----- ------ ---- ---- ----- ---- 10 3 14 1 1 1   0 0 20 5 14 2 1 0   2 0 30 6 14 1 1 0   0 4

The query above returns each department, the total number of employees in each department, the total number of employees in table EMP, and a breakdown of the number of different job types in each department. All this is done in one query, without additional joins or temp tables!

As a final example of how easily multiple questions can be answered using window functions, consider the following query:

 select ename as name, sal, max(sal)over(partition by deptno) as hiDpt, min(sal)over(partition by deptno) as loDpt, max(sal)over(partition by job) as hiJob, min(sal)over(partition by job) as loJob, max(sal)over( ) as hi, min(sal)over( ) as lo, sum(sal)over(partition by deptno  order by sal,empno) as dptRT, sum(sal)over(partition by deptno) as dptSum, sum(sal)over( ) as ttl  from emp    order by deptno,dptRT    NAME SAL HIDPT LODPT HIJOB LOJOB HI LO DPTRT DPTSUM TTL    ------ ----- ----- ----- ----- ----- ----- ---- ------ ------ ------    MILLER 1300  5000  1300  1300   800  5000  800   1300   8750  29025    CLARK  2450  5000  1300  2975  2450  5000  800   3750   8750  29025    KING   5000  5000  1300  5000  5000  5000  800   8750   8750  29025    SMITH   800  3000   800  1300   800  5000  800    800  10875  29025    ADAMS  1100  3000   800  1300   800  5000  800   1900  10875  29025    JONES  2975  3000   800  2975  2450  5000  800   4875  10875  29025    SCOTT  3000  3000   800  3000  3000  5000  800   7875  10875  29025    FORD   3000  3000   800  3000  3000  5000  800  10875  10875  29025    JAMES   950  2850   950  1300   800  5000  800    950   9400  29025    WARD   1250  2850   950  1600  1250  5000  800   2200   9400  29025    MARTIN 1250  2850   950  1600  1250  5000  800   3450   9400  29025    TURNER 1500  2850   950 1600  1250  5000  800   4950   9400  29025    ALLEN  1600  2850   950  1600  1250  5000  800   6550   9400  29025    BLAKE  2850  2850   950  2975  2450  5000  800   9400   9400  29025

This query answers the following questions easily, efficiently, and readably (and without additional joins to EMP!). Simply match the employee and her salary with the different rows in the result set to determine:

who makes the highest salary of all employees (HI)
who makes the lowest salary of all employees (LO)
who makes the highest salary in her department (HIDPT)
who makes the lowest salary in her department (LODPT)
who makes the highest salary in her job (HIJOB)
who makes the lowest salary in her job (LOJOB)
what is the sum of all salaries (TTL)
what is the sum of salaries per department (DPTSUM)
what is the running total of all salaries per department (DPTRT)