Recipe A.2. WindowingOnce you understand the concept of grouping and using aggregates in SQL, understanding window functions is easy. Window functions, like aggregate functions, perform an aggregation on a defined set (a group) of rows, but rather than returning one value per group, window functions can return multiple values for each group. The group of rows to perform the aggregation on is the window (hence the name "window functions"). DB2 actually calls such functions online analytic processing (OLAP) functions, and Oracle calls them analytic functions, but the ISO SQL standard calls them window functions, so that's the term I use in this book. A Simple ExampleLet's say that you wish to count the total number of employees across all departments. The traditional method for doing that is to issue a COUNT(*) query against the entire EMP table: select count(*) as cnt from emp CNT ----- 14 This is easy enough, but often you will find yourself wanting to access such aggregate data from rows that do not represent an aggregation, or that represent a different aggregation. Window functions make light work of such problems. For example, the following query shows how you can use a window function to access aggregate data (the total count of employees) from detail rows (one per employee): select ename, deptno, count(*) over( ) as cnt from emp order by 2 ENAME DEPTNO CNT ---------- ------ ------ CLARK 10 14 KING 10 14 MILLER 10 14 SMITH 20 14 ADAMS 20 14 FORD 20 14 SCOTT 20 14 JONES 20 14 ALLEN 30 14 BLAKE 30 14 MARTIN 30 14 JAMES 30 14 TURNER 30 14 WARD 30 14 The window function invocation in this example is COUNT(*) OVER( ). The presence of the OVER keyword indicates that the invocation of COUNT will be treated as a window function, not as an aggregate function. In general, the SQL standard allows for all aggregate functions to also be window functions, and the keyword OVER is how the language distinguishes between the two uses. So, what did the window function COUNT(*) OVER ( ) do exactly? For every row being returned in the query, it returned the count of all the rows in the table. As the empty parentheses suggest, the OVER keyword accepts additional clauses to affect the range of rows that a given window function considers. Absent any such clauses, the window function looks at all rows in the result set, which is why you see the value 14 repeated in each row of output. Hopefully you begin to see the great utility of window functions, which is that they allow you to work with multiple levels of aggregation in one row. As you continue through this appendix, you'll begin to see even more just how incredibly useful that ability can be. Order of EvaluationBefore digging deeper into the OVER clause, it is important to note that window functions are performed as the last step in SQL processing prior to the ORDER BY clause. As an example of how window functions are processed last, let's take the query from the preceding section and use a WHERE clause to filter out employees from DEPTNO 20 and 30: select ename, deptno, count(*) over( ) as cnt from emp where deptno = 10 order by 2 ENAME DEPTNO CNT ---------- ------ ------ CLARK 10 3 KING 10 3 MILLER 10 3 The value for CNT for each row is no longer 14, it is now 3. In this example, it is the WHERE clause that restricts the result set to three rows, hence the window function will count only three rows (there are only three rows available to the window function by the time processing reaches the SELECT portion of the query). From this example you can see that window functions perform their computations after clauses such as WHERE and GROUP BY are evaluated. PartitionsUse the PARTITION BY clause to define a partition or group of rows to perform an aggregation over. As we've seen already, if you use empty parentheses then the entire result set is the partition that a window function aggregation will be computed over. You can think of the PARTITION BY clause as a "moving GROUP BY" because unlike a traditional GROUP BY, a group created by PARTITION BY is not distinct in a result set. You can use PARTITION BY to compute an aggregation over a defined group of rows (resetting when a new group is encountered) and rather than having one group represent all instances of that value in the table, each value (each member in each group) is returned. Consider the following query: select ename, deptno, count(*) over(partition by deptno) as cnt from emp order by 2 ENAME DEPTNO CNT ---------- ------ ------ CLARK 10 3 KING 10 3 MILLER 10 3 SMITH 20 5 ADAMS 20 5 FORD 20 5 SCOTT 20 5 JONES 20 5 ALLEN 30 6 BLAKE 30 6 MARTIN 30 6 JAMES 30 6 TURNER 30 6 WARD 30 6 This query still returns 14 rows, but now the COUNT is performed for each department as a result of the PARTITION BY DEPTNO clause. Each employee in the same department (in the same partition) will have the same value for CNT, because the aggregation will not reset (recompute) until a new department is encountered. Also note that you are returning information about each group, along with the members of each group. You can think of the preceding query as a more efficient version of the following: select e.ename, e.deptno, (select count(*) from emp d where e.deptno=d.deptno) as cnt from emp e order by 2 ENAME DEPTNO CNT ---------- ------ ------ CLARK 10 3 KING 10 3 MILLER 10 3 SMITH 20 5 ADAMS 20 5 FORD 20 5 SCOTT 20 5 JONES 20 5 ALLEN 30 6 BLAKE 30 6 MARTIN 30 6 JAMES 30 6 TURNER 30 6 WARD 30 6 Additionally, what's nice about the PARTITION BY clause is that it performs its computations independently of other window functions, partitioning by different columns in the same SELECT statement. Consider the following query, which returns each employee, her department, the number of employees in her respective department, her job, and the number of employees with the same job: select ename, deptno, count(*) over(partition by deptno) as dept_cnt, job, count(*) over(partition by job) as job_cnt from emp order by 2 ENAME DEPTNO DEPT_CNT JOB JOB_CNT ---------- ------ -------- --------- ------- MILLER 10 3 CLERK 4 CLARK 10 3 MANAGER 3 KING 10 3 PRESIDENT 1 SCOTT 20 5 ANALYST 2 FORD 20 5 ANALYST 2 SMITH 20 5 CLERK 4 JONES 20 5 MANAGER 3 ADAMS 20 5 CLERK 4 JAMES 30 6 CLERK 4 MARTIN 30 6 SALESMAN 4 TURNER 30 6 SALESMAN 4 WARD 30 6 SALESMAN 4 ALLEN 30 6 SALESMAN 4 BLAKE 30 6 MANAGER 3 In this result set, you can see that employees in the same department have the same value for DEPT_CNT, and that employees who have the same job position have the same value for JOB_CNT. By now it should be clear that the PARTITION BY clause works like a GROUP BY clause, but it does so without being affected by the other items in the SELECT clause and without requiring you to write a GROUP BY clause. Effect of NULLsLike the GROUP BY clause, the PARTITION BY clause lumps all the NULLs into one group or partition. Thus, the effect from NULLs when using PARTITION BY is similar to that from using GROUP BY. The following query uses a window function to count the number of employees with each distinct commission (returning1 in place of NULL for readability): select coalesce(comm,-1) as comm, count(*)over(partition by comm) as cnt from emp COMM CNT ------ ---------- 0 1 300 1 500 1 1400 1 -1 10 -1 10 -1 10 -1 10 -1 10 -1 10 -1 10 -1 10 -1 10 -1 10 Because COUNT(*) is used, the function counts rows. You can see that there are 10 employees having NULL commissions. Use COMM instead of *, however, and you get quite different results: select coalesce(comm,-1) as comm, count(comm)over(partition by comm) as cnt from emp COMM CNT ---- ---------- 0 1 300 1 500 1 1400 1 -1 0 -1 0 -1 0 -1 0 -1 0 -1 0 -1 0 -1 0 -1 0 -1 0 This query uses COUNT(COMM), which means that only the non-NULL values in the COMM column are counted. There is one employee with a commission of 0, one employee with a commission of 300, and so forth. But notice the counts for those with NULL commissions! Those counts are 0. Why? Because aggregate functions ignore NULL values, or more accurately, aggregate functions count only non-NULL values.
When Order MattersSometimes the order in which rows are treated by a window function is material to the results that you wish to obtain from a query. For this reason, window function syntax includes an ORDER BY subclause that you can place within an OVER clause. Use the ORDER BY clause to specify how the rows are ordered with a partition (remember, "partition" in the absence of a PARTITION BY clause means the entire result set).
When you use an ORDER BY clause in the OVER clause of a window function you are specifying two things:
Consider the following query, which sums and computes a running total of salaries for employees in DEPTNO 10: select deptno, ename, hiredate, sal, sum(sal)over(partition by deptno) as total1, sum(sal)over( ) as total2, sum(sal)over(order by hiredate) as running_total from emp where deptno=10 DEPTNO ENAME HIREDATE SAL TOTAL1 TOTAL2 RUNNING_TOTAL ------ ------ ----------- ----- ------ ------ ------------- 10 CLARK 09-JUN-1981 2450 8750 8750 2450 10 KING 17-NOV-1981 5000 8750 8750 7450 10 MILLER 23-JAN-1982 1300 8750 8750 8750
Looking at the values retuned by column SAL, you can easily see where the values for RUNNING_TOTAL come from. You can eyeball the values and add them yourself to compute the running total. But more importantly, why did including an ORDER BY in the OVER clause create a running total in the first place? The reason is, when you use ORDER BY in the OVER clause you are specify a default "moving" or "sliding" window within the partition even though you don't see it. The ORDER BY HIREDATE clause terminates summation at the HIREDATE in the current row. The following query is the same as the previous one, but uses the RANGE BETWEEN clause (which you'll learn more about later) to explicitly specify the default behavior that results from ORDER BY HIREDATE: select deptno, ename, hiredate, sal, sum(sal)over(partition by deptno) as total1, sum(sal)over( ) as total2, sum(sal)over(order by hiredate range between unbounded preceding and current row) as running_total from emp where deptno=10 DEPTNO ENAME HIREDATE SAL TOTAL1 TOTAL2 RUNNING_TOTAL ------ ------ ----------- ----- ------ ------ ------------- 10 CLARK 09-JUN-1981 2450 8750 8750 2450 10 KING 17-NOV-1981 5000 8750 8750 7450 10 MILLER 23-JAN-1982 1300 8750 8750 8750 The RANGE BETWEEN clause that you see in this query is termed the framing clause by ANSI and I'll use that term here. Now, it should be easy to see why specifying an ORDER BY in the OVER clause created a running total; we've (by default) told the query to sum all rows starting from the current row and include all prior rows ("prior" as defined in the ORDER BY, in this case ordering the rows by HIREDATE). The Framing ClauseLet's apply the framing clause from the preceding query to the result set, starting with the first employee hired, who is named CLARK.
As you can see, it is really the framing clause that produces the running total. The ORDER BY defines the order of evaluation and happens to also imply a default framing. In general, the framing clause allows you to define different "sub-windows" of data to include in your computations. There are many ways to specify such sub-windows. Consider the following query: select deptno, ename, sal, sum(sal)over(order by hiredate range between unbounded preceding and current row) as run_total1, sum(sal)over(order by hiredate rows between 1 preceding and current row) as run_total2, sum(sal)over(order by hiredate range between current row and unbounded following) as run_total3, sum(sal)over(order by hiredate rows between current row and 1 following) as run_total4 from emp where deptno=10 DEPTNO ENAME SAL RUN_TOTAL1 RUN_TOTAL2 RUN_TOTAL3 RUN_TOTAL4 ------ ------ ----- ---------- ---------- ---------- ---------- 10 CLARK 2450 2450 2450 8750 7450 10 KING 5000 7450 7450 6300 6300 10 MILLER 1300 8750 6300 1300 1300 Don't be intimidated here; this query is not as bad as it looks. You've already seen RUN_TOTAL1 and the effects of the framing clause "UNBOUNDED PRECEDING AND CURRENT ROW". Here's a quick description of what's happening in the other examples:
A Framing FinaleAs a final example of the effect of the framing clause on query output, consider the following query: select ename, sal, min(sal)over(order by sal) min1, max(sal)over(order by sal) max1, min(sal)over(order by sal range between unbounded preceding and unbounded following) min2, max(sal)over(order by sal range between unbounded preceding and unbounded following) max2, min(sal)over(order by sal range between current row and current row) min3, max(sal)over(order by sal range between current row and current row) max3, max(sal)over(order by sal rows between 3 preceding and 3 following) max4 from emp ENAME SAL MIN1 MAX1 MIN2 MAX2 MIN3 MAX3 MAX4 ------ ----- ------ ------ ------ ------ ------ ------ ------ SMITH 800 800 800 800 5000 800 800 1250 JAMES 950 800 950 800 5000 950 950 1250 ADAMS 1100 800 1100 800 5000 1100 1100 1300 WARD 1250 800 1250 800 5000 1250 1250 1500 MARTIN 1250 800 1250 800 5000 1250 1250 1600 MILLER 1300 800 1300 800 5000 1300 1300 2450 TURNER 1500 800 1500 800 5000 1500 1500 2850 ALLEN 1600 800 1600 800 5000 1600 1600 2975 CLARK 2450 800 2450 800 5000 2450 2450 3000 BLAKE 2850 800 2850 800 5000 2850 2850 3000 JONES 2975 800 2975 800 5000 2975 2975 5000 SCOTT 3000 800 3000 800 5000 3000 3000 5000 FORD 3000 800 3000 800 5000 3000 3000 5000 KING 5000 800 5000 800 5000 5000 5000 5000 OK, let's break this query down:
Readability + Performance = PowerAs you can see, window functions are extremely powerful as they allow you to write queries that contain both detailed and aggregate information. Using window functions allows you to write smaller, more efficient queries as compared to using multiple self join and/or scalar subqueries. Consider the following query, which easily answers all of the following questions: "What is the number of employees in each department? How many different types of employees are in each department (e.g., how many clerks are in department 10)? How many total employees are in table EMP?" select deptno, job, count(*) over (partition by deptno) as emp_cnt, count(job) over (partition by deptno,job) as job_cnt, count(*) over ( ) as total from emp DEPTNO JOB EMP_CNT JOB_CNT TOTAL ------ --------- ---------- ---------- ---------- 10 CLERK 3 1 14 10 MANAGER 3 1 14 10 PRESIDENT 3 1 14 20 ANALYST 5 2 14 20 ANALYST 5 2 14 20 CLERK 5 2 14 20 CLERK 5 2 14 20 MANAGER 5 1 14 30 CLERK 6 1 14 30 MANAGER 6 1 14 30 SALESMAN 6 4 14 30 SALESMAN 6 4 14 30 SALESMAN 6 4 14 30 SALESMAN 6 4 14 To return the same result set without using window functions would require a bit more work: select a.deptno, a.job, (select count(*) from emp b where b.deptno = a.deptno) as emp_cnt, (select count(*) from emp b where b.deptno = a.deptno and b.job = a.job) as job_cnt, (select count(*) from emp) as total from emp a order by 1,2 DEPTNO JOB EMP_CNT JOB_CNT TOTAL ------ --------- ---------- ---------- ---------- 10 CLERK 3 1 14 10 MANAGER 3 1 14 10 PRESIDENT 3 1 14 20 ANALYST 5 2 14 20 ANALYST 5 2 14 20 CLERK 5 2 14 20 CLERK 5 2 14 20 MANAGER 5 1 14 30 CLERK 6 1 14 30 MANAGER 6 1 14 30 SALESMAN 6 4 14 30 SALESMAN 6 4 14 30 SALESMAN 6 4 14 30 SALESMAN 6 4 14 The non-window solution is obviously not difficult to write, yet it certainly is not as clean or efficient (you won't see performance differences with a 14-row table, but try these queries with, say, a 1,000- or 10,000-row table and then you'll see the benefit of using window functions over multiple self joins and scalar subqueries). Providing a BaseBesides readability and performance, window functions are useful for providing a "base" for more complex "report style" queries. For example, consider the following "report style" query that uses window functions in an inline view and then aggregates the results in an outer query. Using window functions allows you to return detailed as well as aggregate data, which is useful for reports. The query below uses window functions to find counts using different partitions. Because the aggregation is applied to multiple rows, the inline view returns all rows from EMP, which the outer CASE expressions can use to transpose and create a formatted report: select deptno, emp_cnt as dept_total, total, max(case when job = 'CLERK' then job_cnt else 0 end) as clerks, max(case when job = 'MANAGER' then job_cnt else 0 end) as mgrs, max(case when job = 'PRESIDENT' then job_cnt else 0 end) as prez, max(case when job = 'ANALYST' then job_cnt else 0 end) as anals, max(case when job = 'SALESMAN' then job_cnt else 0 end) as smen from ( select deptno, job, count(*) over (partition by deptno) as emp_cnt, count(job) over (partition by deptno,job) as job_cnt, count(*) over () as total from emp ) x group by deptno, emp_cnt, total DEPTNO DEPT_TOTAL TOTAL CLERKS MGRS PREZ ANALS SMEN ------ ---------- ----- ------ ---- ---- ----- ---- 10 3 14 1 1 1 0 0 20 5 14 2 1 0 2 0 30 6 14 1 1 0 0 4 The query above returns each department, the total number of employees in each department, the total number of employees in table EMP, and a breakdown of the number of different job types in each department. All this is done in one query, without additional joins or temp tables! As a final example of how easily multiple questions can be answered using window functions, consider the following query: select ename as name, sal, max(sal)over(partition by deptno) as hiDpt, min(sal)over(partition by deptno) as loDpt, max(sal)over(partition by job) as hiJob, min(sal)over(partition by job) as loJob, max(sal)over( ) as hi, min(sal)over( ) as lo, sum(sal)over(partition by deptno order by sal,empno) as dptRT, sum(sal)over(partition by deptno) as dptSum, sum(sal)over( ) as ttl from emp order by deptno,dptRT NAME SAL HIDPT LODPT HIJOB LOJOB HI LO DPTRT DPTSUM TTL ------ ----- ----- ----- ----- ----- ----- ---- ------ ------ ------ MILLER 1300 5000 1300 1300 800 5000 800 1300 8750 29025 CLARK 2450 5000 1300 2975 2450 5000 800 3750 8750 29025 KING 5000 5000 1300 5000 5000 5000 800 8750 8750 29025 SMITH 800 3000 800 1300 800 5000 800 800 10875 29025 ADAMS 1100 3000 800 1300 800 5000 800 1900 10875 29025 JONES 2975 3000 800 2975 2450 5000 800 4875 10875 29025 SCOTT 3000 3000 800 3000 3000 5000 800 7875 10875 29025 FORD 3000 3000 800 3000 3000 5000 800 10875 10875 29025 JAMES 950 2850 950 1300 800 5000 800 950 9400 29025 WARD 1250 2850 950 1600 1250 5000 800 2200 9400 29025 MARTIN 1250 2850 950 1600 1250 5000 800 3450 9400 29025 TURNER 1500 2850 950 1600 1250 5000 800 4950 9400 29025 ALLEN 1600 2850 950 1600 1250 5000 800 6550 9400 29025 BLAKE 2850 2850 950 2975 2450 5000 800 9400 9400 29025 This query answers the following questions easily, efficiently, and readably (and without additional joins to EMP!). Simply match the employee and her salary with the different rows in the result set to determine:
|