Recipe12.19.Performing Aggregations over a Moving Range of Values


Recipe 12.19. Performing Aggregations over a Moving Range of Values

Problem

You want to compute a moving aggregation, such as a moving sum on the salaries in table EMP. You want to compute a sum for every 90 days, starting with the HIREDATE of the first employee. You want to see how spending has fluctuated for every 90-day period between the first and last employee hired. You want to return the following result set:

 HIREDATE        SAL SPENDING_PATTERN ----------- ------- ---------------- 17-DEC-1980     800              800 20-FEB-1981    1600             2400 22-FEB-1981    1250             3650 02-APR-1981    2975             5825 01-MAY-1981    2850             8675 09-JUN-1981    2450             8275 08-SEP-1981    1500             1500 28-SEP-1981    1250             2750 17-NOV-1981    5000             7750 03-DEC-1981     950            11700 03-DEC-1981    3000            11700 23-JAN-1982    1300            10250 09-DEC-1982    3000             3000 12-JAN-1983    1100             4100 

Solution

Being able to specify a moving window in the framing or windowing clause of window functions makes this problem very easy to solve, if your RDBMS supports such functions. The key is to order by HIREDATE in your window function and then specify a window of 90 days starting from the earliest employee hired. The sum will be computed using the salaries of employees hired up to 90 days prior to the current employee's HIREDATE (the current employee is included in the sum). If you do not have window functions available, you can use scalar subqueries, but the solution will be more complex.

DB2 and Oracle

For DB2 and Oracle, use the window function SUM OVER and order by HIREDATE. Specify a range of 90 days in the window or "framing" clause to allow the sum to be computed for each employee's salary and to include the salaries of all employees hired up to 90 days earlier. Because DB2 does not allow you to specify HIREDATE in the ORDER BY clause of a window function (line 3 below), you can order by DAYS(HIREDATE) instead:

  1 select hiredate,  2        sal,  3        sum(sal)over(order by days(hiredate)  4                        range between 90 preceding  5                          and current row) spending_pattern  6   from emp e 

The Oracle solution is more straightforward than DB2's, because Oracle allows window functions to order by datetime types:

  1 select hiredate,  2        sal,  3        sum(sal)over(order by hiredate  4                        range between 90 preceding  5                          and current row) spending_pattern  6   from emp e 

MySQL, PostgreSQL, and SQL Server

Use a scalar subquery to sum the salaries of all employees hired up to 90 days prior to the day each employee was hired:

  1 select e.hiredate,  2        e.sal,  3        (select sum(sal) from emp d  4          where d.hiredate between e.hiredate-90  5                              and e.hiredate) as spending_pattern  6   from emp e  7  order by 1 

Discussion

DB2 and Oracle

DB2 and Oracle share the same solution. The only difference, and it's minor between the two solutions, lies in how you specify HIREDATE in the ORDER BY clause of the window function. At the time of this book's writing, DB2 doesn't allow a DATE value in such an ORDER BY clause if you are using a numeric value to set the window's range. (For example, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW allows you to order by a date, but RANGE BETWEEN 90 PRECEDING AND CURRENT ROW does not.)

To understand what the solution query is doing, you simply need to understand what the window clause is doing. The window you are defining orders the salaries for all employees by HIREDATE. Then the function computes a sum. The sum is not computed for all salaries. Instead, the processing is as follows:

  1. The salary of the first employee hired is evaluated. Since no employees were hired before the first employee, the sum at this point is simply the first employee's salary.

  2. The salary of the next employee (by HIREDATE) is evaluated. This employee's salary is included in the moving sum along with any other employees who were hired up to 90 days prior.

The HIREDATE of the first employee is December 17, 1980, and the HIREDATE of the next hired employee is February 20, 1981. The second employee was hired less than 90 days after the first employee, and thus the moving sum for the second employee is 2400 (1600 + 800). If you are having trouble understanding where the values in SPENDING_PATTERN come from, examine the following query and result set:

  select distinct        dense_rank( )over(order by e.hiredate) window,        e.hiredate current_hiredate,        d.hiredate hiredate_within_90_days,        d.sal sals_used_for_sum   from emp e,        emp d where d.hiredate between e.hiredate-90 and e.hiredate WINDOW CURRENT_HIREDATE HIREDATE_WITHIN_90_DAYS SALS_USED_FOR_SUM ------ ---------------- ----------------------- -----------------      1 17-DEC-1980      17-DEC-1980                           800      2 20-FEB-1981      17-DEC-1980                           800      2 20-FEB-1981      20-FEB-1981                          1600      3 22-FEB-1981      17-DEC-1980                           800      3 22-FEB-1981      20-FEB-1981                          1600      3 22-FEB-1981      22-FEB-1981                          1250      4 02-APR-1981      20-FEB-1981                          1600      4 02-APR-1981      22-FEB-1981                          1250      4 02-APR-1981      02-APR-1981                          2975      5 01-MAY-1981      20-FEB-1981                          1600      5 01-MAY-1981      22-FEB-1981                          1250      5 01-MAY-1981      02-APR-1981                          2975      5 01-MAY-1981      01-MAY-1981                          2850      6 09-JUN-1981      02-APR-1981                          2975      6 09-JUN-1981      01-MAY-1981                          2850      6 09-JUN-1981      09-JUN-1981                          2450      7 08-SEP-1981      08-SEP-1981                          1500      8 28-SEP-1981      08-SEP-1981                          1500      8 28-SEP-1981      28-SEP-1981                          1250      9 17-NOV-1981      08-SEP-1981                          1500      9 17-NOV-1981      28-SEP-1981                          1250      9 17-NOV-1981      17-NOV-1981                          5000     10 03-DEC-1981      08-SEP-1981                          1500     10 03-DEC-1981      28-SEP-1981                          1250     10 03-DEC-1981      17-NOV-1981                          5000     10 03-DEC-1981      03-DEC-1981                           950     10 03-DEC-1981      03-DEC-1981                          3000     11 23-JAN-1982      17-NOV-1981                          5000     11 23-JAN-1982      03-DEC-1981                           950     11 23-JAN-1982      03-DEC-1981                          3000     11 23-JAN-1982      23-JAN-1982                          1300     12 09-DEC-1982      09-DEC-1982                          3000     13 12-JAN-1983      09-DEC-1982                          3000     13 12-JAN-1983      12-JAN-1983                          1100 

If you look at the WINDOW column, only those rows with the same WINDOW value will be considered for each sum. Take for example, WINDOW 3. The salaries used for the sum for that window are 800, 1600, and 1250, which total 3650. If you look at the final result set in the "Problem" section, you'll see the SPENDING_PATTERN for February 22, 1981 (WINDOW 3) is 3650. As proof, to verify that the above self join includes the correct salaries for the windows defined, simply sum the values in SALS_USED_FOR_SUM and group by CURRENT_DATE. The result should be the same as the result set shown in the "Problem" section (with the duplicate row for December 3, 1981, filtered out):

  select current_hiredate,        sum(sals_used_for_sum) spending_pattern   from ( select distinct        dense_rank()over(order by e.hiredate) window,        e.hiredate current_hiredate,        d.hiredate hiredate_within_90_days,        d.sal sals_used_for_sum   from emp e,        emp d   where d.hiredate between e.hiredate-90 and e.hiredate         ) x   group by current_hiredate CURRENT_HIREDATE SPENDING_PATTERN ---------------- ---------------- 17-DEC-1980                   800 20-FEB-1981                  2400 22-FEB-1981                  3650 02-APR-1981                  5825 01-MAY-1981                  8675 09-JUN-1981                  8275 08-SEP-1981                  1500 28-SEP-1981                  2750 17-NOV-1981                  7750 03-DEC-1981                 11700 23-JAN-1982                 10250 09-DEC-1982                  3000 12-JAN-1983                  4100 

MySQL, PostgreSQL, and SQL Server

The key to this solution is to use a scalar subquery (a self join will work as well) while using the aggregate function SUM to compute a sum for every 90 days based on HIREDATE. If you are having trouble seeing how this works, simply convert the solution to a self join and examine which rows are included in the computations. Consider the result set below, which returns the same result set as that in the solution:

  select e.hiredate,        e.sal,        sum(d.sal) as spending_pattern   from emp e, emp d  where d.hiredate        between e.hiredate-90 and e.hiredate  group by e.hiredate,e.sal  order by 1\ HIREDATE      SAL   SPENDING_PATTERN ----------- -----   ---------------- 17-DEC-1980   800                800 20-FEB-1981  1600               2400 22-FEB-1981  1250               3650 02-APR-1981  2975               5825 01-MAY-1981  2850               8675 09-JUN-1981  2450               8275 08-SEP-1981  1500               1500 28-SEP-1981  1250               2750 17-NOV-1981  5000               7750 03-DEC-1981   950              11700 03-DEC-1981  3000              11700 23-JAN-1982  1300              10250 09-DEC-1982  3000               3000 12-JAN-1983  1100               4100 

If it is still unclear, simply remove the aggregation and start with the Cartesian product. The first step is to generate a Cartesian product using table EMP so that each HIREDATE can be compared with all the other HIREDATEs. [Only a snippet of the result set is shown below because there are 196 rows (14x14) returned by a Cartesian of EMP.]

  select e.hiredate,        e.sal,        d.sal,        d.hiredate   from emp e, emp d HIREDATE      SAL      SAL HIREDATE ----------- -----    ----- ----------- 17-DEC-1980   800      800 17-DEC-1980 17-DEC-1980   800     1600 20-FEB-1981 17-DEC-1980   800     1250 22-FEB-1981 17-DEC-1980   800     2975 02-APR-1981 17-DEC-1980   800     1250 28-SEP-1981 17-DEC-1980   800     2850 01-MAY-1981 17-DEC-1980   800     2450 09-JUN-1981 17-DEC-1980   800     3000 09-DEC-1982 17-DEC-1980   800     5000 17-NOV-1981 17-DEC-1980   800     1500 08-SEP-1981 17-DEC-1980   800     1100 12-JAN-1983 17-DEC-1980   800      950 03-DEC-1981 17-DEC-1980   800     3000 03-DEC-1981 17-DEC-1980   800     1300 23-JAN-1982 20-FEB-1981  1600      800 17-DEC-1980 20-FEB-1981  1600     1600 20-FEB-1981 20-FEB-1981  1600     1250 22-FEB-1981 20-FEB-1981  1600     2975 02-APR-1981 20-FEB-1981  1600     1250 28-SEP-1981 20-FEB-1981  1600     2850 01-MAY-1981 20-FEB-1981  1600     2450 09-JUN-1981 20-FEB-1981  1600     3000 09-DEC-1982 20-FEB-1981  1600     5000 17-NOV-1981 20-FEB-1981  1600     1500 08-SEP-1981 20-FEB-1981  1600     1100 12-JAN-1983 20-FEB-1981  1600      950 03-DEC-1981 20-FEB-1981  1600     3000 03-DEC-1981 20-FEB-1981  1600     1300 23-JAN-1982 

If you examine the result set above, you'll notice that there is no HIREDATE 90 days earlier or equal to December 17, except for December 17. So, the sum for that row should be only 800. If you examine the next HIREDATE, February 20, you'll notice that there is one HIREDATE that falls within the 90-day window (within 90 days prior), and that is December 17. If you sum the SAL from December 17 with the SAL from February 20 (because we are looking for HIREDATEs equal to each HIREDATE or within 90 days earlier) you get 2400, which happens to be the final result for that HIREDATE.

Now that you know how it works, use a filter in the WHERE clause to return for each HIREDATE and HIREDATE that is equal to it or is no more than 90 days earlier:

  select e.hiredate,        e.sal,        d.sal sal_to_sum,        d.hiredate within_90_days   from emp e, emp d  where d.hiredate        between e.hiredate-90 and e.hiredate  order by 1 HIREDATE      SAL SAL_TO_SUM WITHIN_90_DAYS ----------- ----- ---------- -------------- 17-DEC-1980   800        800    17-DEC-1980 20-FEB-1981  1600        800    17-DEC-1980 20-FEB-1981  1600       1600    20-FEB-1981 22-FEB-1981  1250        800    17-DEC-1980 22-FEB-1981  1250       1600    20-FEB-1981 22-FEB-1981  1250       1250    22-FEB-1981 02-APR-1981  2975       1600    20-FEB-1981 02-APR-1981  2975       1250    22-FEB-1981 02-APR-1981  2975       2975    02-APR-1981 01-MAY-1981  2850       1600    20-FEB-1981 01-MAY-1981  2850       1250    22-FEB-1981 01-MAY-1981  2850       2975    02-APR-1981 01-MAY-1981  2850       2850    01-MAY-1981 09-JUN-1981  2450       2975    02-APR-1981 09-JUN-1981  2450       2850    01-MAY-1981 09-JUN-1981  2450       2450    09-JUN-1981 08-SEP-1981  1500       1500    08-SEP-1981 28-SEP-1981  1250       1500    08-SEP-1981 28-SEP-1981  1250       1250    28-SEP-1981 17-NOV-1981  5000       1500    08-SEP-1981 17-NOV-1981  5000       1250    28-SEP-1981 17-NOV-1981  5000       5000    17-NOV-1981 03-DEC-1981   950       1500    08-SEP-1981 03-DEC-1981   950       1250    28-SEP-1981 03-DEC-1981   950       5000    17-NOV-1981 03-DEC-1981   950        950    03-DEC-1981 03-DEC-1981   950       3000    03-DEC-1981 03-DEC-1981  3000       1500    08-SEP-1981 03-DEC-1981  3000       1250    28-SEP-1981 03-DEC-1981  3000       5000    17-NOV-1981 03-DEC-1981  3000        950    03-DEC-1981 03-DEC-1981  3000       3000    03-DEC-1981 23-JAN-1982  1300       5000    17-NOV-1981 23-JAN-1982  1300        950    03-DEC-1981 23-JAN-1982  1300       3000    03-DEC-1981 23-JAN-1982  1300       1300    23-JAN-1982 09-DEC-1982  3000       3000    09-DEC-1982 12-JAN-1983  1100       3000    09-DEC-1982 12-JAN-1983  1100       1100    12-JAN-1983 

Now that you know which SALs are to be included in the moving window of summation, simply use the aggregate function SUM to produce a more expressive result set:

 select e.hiredate,        e.sal,        sum(d.sal) as spending_pattern   from emp e, emp d  where d.hiredate        between e.hiredate-90 and e.hiredate  group by e.hiredate,e.sal  order by 1 

If you compare the result set for the query above and the result set for the query below (which is the original solution presented), you will see they are the same:

 select e.hiredate,        e.sal,        (select sum(sal) from emp d         where d.hiredate between e.hiredate-90                              and e.hiredate) as spending_pattern   from emp e  order by 1 HIREDATE      SAL SPENDING_PATTERN ----------- ----- ---------------- 17-DEC-1980  800               800 20-FEB-1981 1600              2400 22-FEB-1981 1250              3650 02-APR-1981 2975              5825 01-MAY-1981 2850              8675 09-JUN-1981 2450              8275 08-SEP-1981 1500              1500 28-SEP-1981 1250              2750 17-NOV-1981 5000              7750 03-DEC-1981  950             11700 03-DEC-1981 3000             11700 23-JAN-1982 1300             10250 09-DEC-1982 3000              3000 12-JAN-1983 1100              4100 




SQL Cookbook
SQL Cookbook (Cookbooks (OReilly))
ISBN: 0596009763
EAN: 2147483647
Year: 2005
Pages: 235

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net