Chapter 12

Overview

SQL is a very capable language and there are very few questions that it cannot answer. I find that I can come up with some convoluted SQL query to answer virtually any question you could ask of the data. However, the performance of some of these queries is not what it should be - nor is the query itself easy to write in the first place. Some of the things that are hard to do in straight SQL are actually very commonly requested operations, including:

  • Calculate a running total - Show the cumulative salary within a department row by row, with each row including a summation of the prior rows' salary.

  • Find percentages within a group - Show the percentage of the total salary paid to an individual in a certain department. Take their salary and divide it by the sum of the salary in the department.

  • Top-N queries - Find the top N highest-paid people or the top N sales by region.

  • Compute a moving average - Average the current row's value and the previous N rows values together.

  • Perform ranking queries - Show the relative rank of an individual's salary within their department.

Analytic functions, which have been available since Oracle 8.1.6, are designed to address these issues. They add extensions to the SQL language that not only make these operations easier to code, but also make them faster than could be achieved with the pure SQL approach. These extensions are currently under review by the ANSI SQL committee for inclusion in the SQL specification.

We'll start this chapter with an example that will give you a good idea of what analytic functions are about. From there we'll address the full syntax, describe the available functions and then run through some worked examples that tackle some of the above operations. As usual, we will finish by signposting some of the potential pitfalls of using these functions.

An Example

A quick example, which calculates a running total of salaries by department, and an explanation of what exactly is happening, will give you a good initial understanding of analytic functions:

tkyte@TKYTE816> break on deptno skip 1

tkyte@TKYTE816> select ename, deptno, sal,
  2      sum(sal) over
  3        (order by deptno, ename) running_total,
  4      sum(sal) over
  5        (partition by deptno
  6         order by ename) department_total,
  7      row_number() over
  8        (partition by deptno
  9         order by ename) seq
 10  from emp
 11  order by deptno, ename
 12  /

ENAME          DEPTNO        SAL RUNNING_TOTAL DEPARTMENT_TOTAL        SEQ
---------- ---------- ---------- ------------- ---------------- ----------
CLARK              10       2450          2450             2450          1
KING                        5000          7450             7450          2
MILLER                      1300          8750             8750          3

ADAMS              20       1100          9850             1100          1
FORD                        3000         12850             4100          2
JONES                       2975         15825             7075          3
SCOTT                       3000         18825            10075          4
SMITH                        800         19625            10875          5

ALLEN              30       1600         21225             1600          1
BLAKE                       2850         24075             4450          2
JAMES                        950         25025             5400          3
MARTIN                      1250         26275             6650          4
TURNER                      1500         27775             8150          5
WARD                        1250         29025             9400          6

14 rows selected.

In the above code, we were able to compute a RUNNING_TOTAL for the entire query. This was done using the entire ordered result set, via SUM(SAL) OVER (ORDER BY DEPTNO, ENAME). We were also able to compute a running total within each department, a total that would be reset at the beginning of the next department. The PARTITION BY DEPTNO in that SUM(SAL) caused this to happen - a partitioning clause was specified in the query in order to break the data up into groups. The ROW_NUMBER() function is used to sequentially number the rows returned in each group, according to our ordering criteria (a SEQ column was added in order to display this position). So, we see that SCOTT is the fourth row in department 20 when ordered by ENAME. This ROW_NUMBER() feature has many uses elsewhere, for example to transpose or pivot result sets (as we will discuss later).

This new functionality holds some exciting possibilities. It opens up a whole new way of looking at the data. It will remove a lot of procedural code and complex (or inefficient) queries that would have taken a long time to develop, to achieve the same result. Just to give you a flavor of how efficient these analytic functions can be compared to the old 'pure relational' ways, let's compare the performance of the above query with 1,000 rows instead of just 14. Both the new analytic functions and the 'old' relational methods are used here to test the performance of the query. The following two statements will set up a copy of the SCOTT.EMP table with just the ENAME, DEPTNO, and SAL columns, along with an index (the only one needed for this example). I am doing everything using DEPTNO and ENAME:

tkyte@TKYTE816> create table t
  2  as
  3  select object_name ename,
  4         mod(object_id,50) deptno,
  5         object_id sal
  6  from all_objects
  7  where rownum <= 1000
  8  /

Table created.

tkyte@TKYTE816> create index t_idx on t(deptno,ename);

Index created.

We execute our query on the new table, using AUTOTRACE in trace only mode to see how much work is done (you will need to have the PLUSTRACE role enabled):

tkyte@TKYTE816> set autotrace traceonly

tkyte@TKYTE816> select ename, deptno, sal,
  2    sum(sal) over
  3      (order by deptno, ename) running_total,
  4    sum(sal) over
  5      (partition by deptno
  6       order by ename) department_total,
  7    row_number() over
  8      (partition by deptno
  9       order by ename) seq
 10  from t emp
 11  order by deptno, ename
 12  /

1000 rows selected.

Elapsed: 00:00:00.61

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE
   1    0   WINDOW (BUFFER)
   2    1     TABLE ACCESS (BY INDEX ROWID) OF 'T'
   3    2       INDEX (FULL SCAN) OF 'T_IDX' (NON-UNIQUE)

Statistics
----------------------------------------------------------
          0  recursive calls
          2  db block gets
        292  consistent gets
         66  physical reads
          0  redo size
     106978  bytes sent via SQL*Net to client
       7750  bytes received via SQL*Net from client
         68  SQL*Net roundtrips to/from client
          0  sorts (memory)
          1  sorts (disk)
       1000  rows processed

This took 0.61 seconds and 294 logical I/Os. Now, repeat the same exact query using only 'standard' SQL functionality:

tkyte@TKYTE816> select ename, deptno, sal,
  2    (select sum(sal)
  3      from t e2
  4      where e2.deptno < emp.deptno
  5      or (e2.deptno = emp.deptno and e2.ename <= emp.ename ))
  6  running_total,
  7    (select sum(sal)
  8      from t e3
  9      where e3.deptno = emp.deptno
 10      and e3.ename <= emp.ename)
 11  department_total,
 12    (select count(ename)
 13      from t e3
 14      where e3.deptno = emp.deptno
 15      and e3.ename <= emp.ename)
 16  seq
 17  from t emp
 18  order by deptno, ename
 19  /

1000 rows selected.

Elapsed: 00:00:06.89

Execution Plan
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=CHOOSE
   1    0   TABLE ACCESS (BY INDEX ROWID) OF 'T'
   2    1     INDEX (FULL SCAN) OF 'T_IDX' (NON-UNIQUE)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
     665490  consistent gets
          0  physical reads
          0  redo size
     106978  bytes sent via SQL*Net to client
       7750  bytes received via SQL*Net from client
         68  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       1000  rows processed

tkyte@TKYTE816> set autotrace off

You get the same exact answer from both queries, but what a difference these functions make. The run time is more than ten times longer, and the number of logical I/Os increases by orders of magnitude. The analytic functions processed the result set using significantly fewer resources and an astoundingly reduced amount of time. Not only that, but once you understand the syntax of the analytic functions, you will find them easier to code with than the equivalent standard SQL - just compare the syntax of the above two queries to see what a difference they can make.

How Analytic Functions Work

The first part of this section will contain the dry details of the syntax and definitions of terms. After that, we'll dive right into examples. I'll demonstrate many of the 26 new functions (not all of them, as many of the examples would be repetitive). The analytic functions all utilize the same general syntax; many of them provide specialized functionality aimed at technical disciplines that the everyday developer never encounters. Once you get familiar with the syntax - how to partition, how to define windows of data, and so on - using these functions will become very natural.

The Syntax

The syntax of the analytic function is rather straightforward in appearance, but looks can be deceiving. It starts with:

FUNCTION_NAME(<argument>, <argument>, ...)
OVER (<Partition-Clause> <Order-by-Clause> <Windowing Clause>)

There are up to four parts to an analytic function; it can be invoked with arguments, a partition clause, an order by clause, and a windowing clause. In the example shown in the above introduction:

  4      sum(sal) over
  5        (partition by deptno
  6         order by ename) department_total,

In this case:

  • SUM is the function, and SAL is its argument.

  • PARTITION BY DEPTNO is the partition clause, breaking the result set up by department.

  • ORDER BY ENAME is the order by clause, sorting the rows in each partition by employee name.

  • No windowing clause was specified, so the default window of RANGE UNBOUNDED PRECEDING was used, giving us the running total within the department.

Now we will look at each of the four parts of the analytic function in more detail to understand what is valid for each.

The Function Clause

Oracle provides 26 analytic functions for us to use. They fall into five major classes of functionality.

There are various ranking functions that are useful for finding the answers to TOP-N type queries. We have already used one such function, ROW_NUMBER, when generating the SEQ column in the previous example. This ranked people in their department based on their ENAME. We could have easily ranked them by SALARY or any other attribute.

There are windowing functions, which are useful for computing various aggregates. We saw two examples of this in the introductory example where we computed the SUM(SAL) over different groups. We could use many other functions in place of SUM, such as COUNT, AVG, MIN, MAX, and so on.

There are various reporting functions. These are very similar to the windowing functions above. In fact, their names are the same: SUM, MIN, MAX, and so on. Whereas a windowing function works on a window of data, as the running total in the earlier example did, a reporting function works on all of the rows in a partition or group. For example, if, in our initial query, we had simply asked for:

sum(sal) over () total_salary,
sum(sal) over (partition by deptno) total_salary_for_department

then, we would have received total sums for the group, not running totals as before. The key differentiator between a windowing function and a reporting function is the absence of an ORDER BY clause in the OVER statement. In the absence of the ORDER BY, the function is applied to every row in the group. With an ORDER BY clause, it is applied to a window (more on this in the section describing this clause).
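
To make the distinction concrete, here is a minimal sketch of my own (not one of the book's transcripts) that runs both reporting functions against the EMP table; every row would carry the grand total and its department's total:

select ename, deptno, sal,
       -- no ORDER BY in the OVER clause: both are reporting functions
       sum(sal) over () total_salary,
       sum(sal) over (partition by deptno) total_salary_for_department
  from emp
 order by deptno, ename;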

There are also LAG and LEAD functions available. These functions allow you to look backwards or forwards in a result set to retrieve values. These are useful in order to avoid a self-join of data. For example, if you had a table that recorded patient visits by date and you wanted to compute the time between visits for each patient, the LAG function would come in handy. You could simply partition the data by patient and sort it by date. The LAG function would easily be able to return the data from the previous record for that patient. You could then just subtract the two dates. Prior to the introduction of analytic functions, this would have required a complex self-join of the patient data with itself in order to retrieve the data.
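
As a sketch of that patient-visit scenario (the PATIENT_VISITS table and its PATIENT_ID and VISIT_DATE columns are hypothetical, invented purely for illustration):

select patient_id, visit_date,
       -- LAG returns the prior visit date within this patient's partition;
       -- it is NULL for a patient's first visit, so the difference is NULL there
       visit_date
         - lag(visit_date) over (partition by patient_id
                                 order by visit_date) days_since_last_visit
  from patient_visits
 order by patient_id, visit_date;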

Lastly, there is a large set of statistical functions such as VAR_POP, VAR_SAMP, STDDEV_POP, a set of linear regression functions, and so on. These functions compute statistical values for any unordered partition.

At the close of this section on syntax, there is a table with each of the analytic functions listed and a brief explanation of the operation that they perform.

The Partition Clause

The PARTITION BY clause logically breaks a single result set into N groups, according to the criteria set by the partition expressions. The words 'partition' and 'group' are used synonymously here, and in the Oracle documentation. The analytic functions are applied to each group independently - they are 'reset' for each group. For example, above when we demonstrated a cumulative SAL function, we partitioned by DEPTNO. When the DEPTNO changed in the result set, we reset the cumulative SAL to ZERO and summation started anew.

If you omit a partitioning clause, then the entire result set is considered a single group. In the introductory example we used SUM(SAL) without a partitioning clause in order to obtain a running total for the entire result set.

It is interesting to note that each instance of an analytic function in a query may have an entirely different partitioning clause; in fact, the simple example we started this chapter with does just this. The column RUNNING_TOTAL did not supply a partitioning clause; hence, the entire result set was its target group. The column DEPARTMENT_TOTAL, on the other hand, partitioned the result set by departments, allowing us to compute a running total within a department.

The partition clause has a simple syntax and is very similar in syntax to the GROUP BY clause you normally see with SQL queries:

PARTITION BY expression <, expression> <, expression ...>

The Order By Clause

The ORDER BY clause specifies how the data is sorted within each group (partition). This will definitely affect the outcome of any analytic function. The analytic functions are computed differently in the presence of an ORDER BY clause (or lack thereof). As a very simple example, consider what happens when we use AVG() with and without an ORDER BY clause:

scott@TKYTE816> select ename, sal, avg(sal) over ()
  2  from emp
  3  /

ENAME            SAL AVG(SAL)OVER()
---------- --------- --------------
SMITH         800.00        2073.21
ALLEN        1600.00        2073.21
WARD         1250.00        2073.21
JONES        2975.00        2073.21
MARTIN       1250.00        2073.21
BLAKE        2850.00        2073.21
CLARK        2450.00        2073.21
SCOTT        3000.00        2073.21
KING         5000.00        2073.21
TURNER       1500.00        2073.21
ADAMS        1100.00        2073.21
JAMES         950.00        2073.21
FORD         3000.00        2073.21
MILLER       1300.00        2073.21

14 rows selected.

scott@TKYTE816> select ename, sal, avg(sal) over (ORDER BY ENAME)
  2  from emp
  3  order by ename
  4  /

ENAME            SAL AVG(SAL)OVER(ORDERBYENAME)
---------- --------- --------------------------
ADAMS        1100.00                    1100.00
ALLEN        1600.00                    1350.00
BLAKE        2850.00                    1850.00
CLARK        2450.00                    2000.00
FORD         3000.00                    2200.00
JAMES         950.00                    1991.67
JONES        2975.00                    2132.14
KING         5000.00                    2490.63
MARTIN       1250.00                    2352.78
MILLER       1300.00                    2247.50
SCOTT        3000.00                    2315.91
SMITH         800.00                    2189.58
TURNER       1500.00                    2136.54
WARD         1250.00                    2073.21

14 rows selected.

Without the ORDER BY clause, the average is computed over the entire group and the same value is given to every row (it is being used as a reporting function). When AVG() is used with the ORDER BY, the average for each row is the average of that row and all preceding rows (here it is used as a window function). For example, the average salary for ALLEN in the query with the ORDER BY clause is 1350 (the average of 1100 and 1600).

Note 

Jumping ahead, just a little, to the very next section on The Windowing Clause - it can be said that the existence of an ORDER BY in an analytic function will add a default window clause of RANGE UNBOUNDED PRECEDING. What this means is that the set of rows to be used in the computation is the current and all preceding rows in the current partition. Without the ORDER BY the default window is the entire partition.
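
In other words, the two expressions sketched below should compute exactly the same running total; the second merely spells out the window that the first receives by default:

sum(sal) over (order by deptno, ename) running_total,
sum(sal) over (order by deptno, ename
               range unbounded preceding) running_total_explicit,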

To get a real feel for how this works, it is instructive to use the same analytic function two times with a different ORDER BY each time. In the first example, a running total was computed for the entire EMP table using ORDER BY DEPTNO, ENAME. This caused the running total to be computed from the first row through the last row, where the row order was specified by the ORDER BY clause. If the order of the columns was reversed, or the columns being sorted were changed altogether, the results of our running total would be very different; the last row would have the same grand total, but all of the intermediate values would be different. For example:

ops$tkyte@DEV816> select ename, deptno,
  2    sum(sal) over (order by ename, deptno) sum_ename_deptno,
  3    sum(sal) over (order by deptno, ename) sum_deptno_ename
  4  from emp
  5  order by ename, deptno
  6  /

ENAME          DEPTNO SUM_ENAME_DEPTNO SUM_DEPTNO_ENAME
---------- ---------- ---------------- ----------------
ADAMS              20             1100             9850
ALLEN              30             2700            21225
BLAKE              30             5550            24075
CLARK              10             8000             2450
FORD               20            11000            12850
JAMES              30            11950            25025
JONES              20            14925            15825
KING               10            19925             7450
MARTIN             30            21175            26275
MILLER             10            22475             8750
SCOTT              20            25475            18825
SMITH              20            26275            19625
TURNER             30            27775            27775
WARD               30            29025            29025

14 rows selected.

Both of the SUM(SAL) columns are equally correct; one of them computes the SUM(SAL) by DEPTNO and then ENAME, whereas the other does it by ENAME and then DEPTNO. Since the result set is ordered by (ENAME, DEPTNO), the SUM(SAL) computed in that order looks more natural, but both arrive at the same result: the grand total is 29025.

The syntax of an ORDER BY clause with analytic functions is as follows:

ORDER BY expression <ASC|DESC> <NULLS FIRST|NULLS LAST> <, expression ...>

It is exactly the same as an ORDER BY clause for a query, but it only orders the rows within partitions and does not have to be the same as the ORDER BY for the query (or the ORDER BY of another analytic function, for that matter). The NULLS FIRST and NULLS LAST clauses are new syntax with Oracle 8.1.6. They allow us to specify whether NULLs should come first or last in a sort. When using DESC sorts, especially with analytic functions, this new functionality is crucial. We will see why in the Caveats section towards the end.
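
For example, EMP's COMM column is NULL for most employees, and a descending sort normally places those NULLs first, ahead of the real top values. A minimal sketch (anticipating the Caveats discussion) of keeping them out of the way:

select ename, comm,
       -- without NULLS LAST, the NULL commissions would rank first in a DESC sort
       rank() over (order by comm desc nulls last) comm_rank
  from emp;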

The Windowing Clause

This is where the syntax gets a little complicated in appearance. Although it's not that hard in reality, the terms are a little confusing at first. Terms such as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which is the default window with an ORDER BY clause, are not terms that you use everyday. The syntax for a window clause is fairly complex to list out. Rather than trying to redraw the 'wire diagram' that you can review in the Oracle8i SQL Reference Manual yourself, I will list all of the variants of the windowing clause and explain the set of data that each would use, within a group. First, though, let's look at what the windowing clause does for you.

The windowing clause gives us a way to define a sliding or anchored window of data, on which the analytic function will operate, within a group. This clause can be used to have the analytic function compute its value based on any arbitrary sliding or anchored window within a group. For example, the range clause RANGE UNBOUNDED PRECEDING means 'apply the analytic function to every row in the current group from the first row in the group to the current row'. The default window is an anchored window that simply starts at the first row of a group and continues to the current row. If a window is used such as:

SUM(sal) OVER
  (PARTITION BY deptno
   ORDER BY ename
   ROWS 2 PRECEDING) department_total2,

This would create a sliding window within a group and compute the sum of the current row's SAL column plus the previous 2 rows in that group. If we need a report that shows the sum of the current employee's salary with the preceding two salaries within a department, it would look like this:

scott@TKYTE816> break on deptno

scott@TKYTE816> select deptno, ename, sal,
  2    sum(sal) over
  3      (partition by deptno
  4       order by ename
  5       rows 2 preceding) sliding_total
  6  from emp
  7  order by deptno, ename
  8  /

    DEPTNO ENAME             SAL SLIDING_TOTAL
---------- ---------- ---------- -------------
        10 CLARK            2450          2450
           KING             5000          7450
           MILLER           1300          8750

        20 ADAMS            1100          1100
           FORD             3000          4100
           JONES            2975          7075
           SCOTT            3000          8975
           SMITH             800          6775

        30 ALLEN            1600          1600
           BLAKE            2850          4450
           JAMES             950          5400
           MARTIN           1250          5050
           TURNER           1500          3700
           WARD             1250          4000

14 rows selected.

The relevant portion of the query here was:

  2    sum(sal) over
  3      (partition by deptno
  4       order by ename
  5       rows 2 preceding) sliding_total

The partition clause causes the SUM(SAL) to be computed within each department, independent of the other groups (the SUM(SAL) is 'reset' as the department changes). The ORDER BY ENAME clause sorts the data within each department by ENAME; this allows the window clause, ROWS 2 PRECEDING, to access the 2 rows prior to the current row in a group in order to sum the salaries. For example, note that the SLIDING_TOTAL value for SMITH is 6775, which is the sum of 800, 3000, and 2975 - simply SMITH's salary plus the salaries from the preceding two rows in the window.

We can set up windows based on two criteria: RANGES of data values or ROWS offset from the current row. We've seen the RANGE clause a couple of times, with a RANGE UNBOUNDED PRECEDING, for example. That says to get all rows in our partition that came before us as specified by the ORDER BY clause. It should be noted that in order to use a window, you must use an ORDER BY clause. We will now look at the ROW and RANGE windows and then finish up by describing the various ways in which the windows may be specified.

Range Windows

Range windows collect rows together based on the value of the ORDER BY column relative to the current row. If I say 'range 5 preceding', for example, this will generate a sliding window that has the set of all preceding rows in the group whose ORDER BY value is within 5 units of the current row's value. These units may be either numeric or date offsets, and it is not valid to use RANGE with datatypes other than numbers and dates.
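
For instance, against a hypothetical table with a numeric column X (the names T2 and X are illustrative only), the following would count, for each row, the current row plus the preceding rows whose X value lies within 5 of the current row's X:

select x,
       -- counts rows with x between (current x - 5) and current x
       count(*) over (order by x
                      range 5 preceding) cnt_within_5
  from t2;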

If I have the EMP table with the date column HIREDATE and I specify:

count(*) over
  (order by hiredate asc
   range 100 preceding)

then this will find all of the preceding rows in the partition such that the HIREDATE is within 100 days of the current row's HIREDATE. In this case, since the data is sorted ASC (ascending, small to big), the values in the window would consist of all of the rows in the current group such that the HIREDATE was less than the current row's HIREDATE and within 100 days of it. If we used:

count(*) over
  (order by hiredate desc
   range 100 preceding)

instead, and sorted the partition DESC (descending, big to small), this would perform the same basic logic, but since the data in the group is sorted differently, it would find a different set of rows for the window. In this case, it would find all of the rows preceding the current row such that the HIREDATE was greater than the current row's HIREDATE and within 100 days of it. An example will help make this clearer. I will use a query that utilizes the FIRST_VALUE analytic function. This function returns the value of the expression using the FIRST row in a window, so we can easily see where the window begins:

scott@TKYTE816> select ename, sal, hiredate, hiredate-100 windowtop,
  2    first_value(ename)
  3    over (order by hiredate asc
  4          range 100 preceding) ename_prec,
  5    first_value(hiredate)
  6    over (order by hiredate asc
  7          range 100 preceding) hiredate_prec
  8  from emp
  9  order by hiredate asc
 10  /

ENAME             SAL HIREDATE  WINDOWTOP  ENAME_PREC HIREDATE_
---------- ---------- --------- ---------- ---------- ---------
SMITH             800 17-DEC-80 08-SEP-80  SMITH      17-DEC-80
ALLEN            1600 20-FEB-81 12-NOV-80  SMITH      17-DEC-80
WARD             1250 22-FEB-81 14-NOV-80  SMITH      17-DEC-80
JONES            2975 02-APR-81 23-DEC-80  ALLEN      20-FEB-81
BLAKE            2850 01-MAY-81 21-JAN-81  ALLEN      20-FEB-81
CLARK            2450 09-JUN-81 01-MAR-81  JONES      02-APR-81
TURNER           1500 08-SEP-81 31-MAY-81  CLARK      09-JUN-81
MARTIN           1250 28-SEP-81 20-JUN-81  TURNER     08-SEP-81
KING             5000 17-NOV-81 09-AUG-81  TURNER     08-SEP-81
FORD             3000 03-DEC-81 25-AUG-81  TURNER     08-SEP-81
JAMES             950 03-DEC-81 25-AUG-81  TURNER     08-SEP-81
MILLER           1300 23-JAN-82 15-OCT-81  KING       17-NOV-81
SCOTT            3000 09-DEC-82 31-AUG-82  SCOTT      09-DEC-82
ADAMS            1100 12-JAN-83 04-OCT-82  SCOTT      09-DEC-82

14 rows selected.

We ordered the single partition by HIREDATE ASC. We then used the analytic function FIRST_VALUE to find the first ENAME in our window and the first HIREDATE in our window. If we look at the row for CLARK, we can see that his HIREDATE was 09-JUN-81, and 100 days prior to that is 01-MAR-81. For convenience, this date is put into the column WINDOWTOP. The window for the analytic function was therefore every row in the sorted partition that preceded the CLARK record and whose HIREDATE was between 01-MAR-81 and 09-JUN-81. The first value of ENAME for that window is JONES, and this is the name that the analytic function returns in the column ENAME_PREC.

Looking at this from the HIREDATE DESC (descending) perspective, we see instead:

scott@TKYTE816> select ename, sal, hiredate, hiredate+100 windowtop,
  2    first_value(ename)
  3    over (order by hiredate desc
  4          range 100 preceding) ename_prec,
  5    first_value(hiredate)
  6    over (order by hiredate desc
  7          range 100 preceding) hiredate_prec
  8   from emp
  9   order by hiredate desc
 10  /

ENAME             SAL HIREDATE  WINDOWTOP ENAME_PREC HIREDATE_
---------- ---------- --------- --------- ---------- ---------
ADAMS            1100 12-JAN-83 22-APR-83 ADAMS      12-JAN-83
SCOTT            3000 09-DEC-82 19-MAR-83 ADAMS      12-JAN-83
MILLER           1300 23-JAN-82 03-MAY-82 MILLER     23-JAN-82
FORD             3000 03-DEC-81 13-MAR-82 MILLER     23-JAN-82
JAMES             950 03-DEC-81 13-MAR-82 MILLER     23-JAN-82
KING             5000 17-NOV-81 25-FEB-82 MILLER     23-JAN-82
MARTIN           1250 28-SEP-81 06-JAN-82 FORD       03-DEC-81
TURNER           1500 08-SEP-81 17-DEC-81 FORD       03-DEC-81
CLARK            2450 09-JUN-81 17-SEP-81 TURNER     08-SEP-81
BLAKE            2850 01-MAY-81 09-AUG-81 CLARK      09-JUN-81
JONES            2975 02-APR-81 11-JUL-81 CLARK      09-JUN-81
WARD             1250 22-FEB-81 02-JUN-81 BLAKE      01-MAY-81
ALLEN            1600 20-FEB-81 31-MAY-81 BLAKE      01-MAY-81
SMITH             800 17-DEC-80 27-MAR-81 WARD       22-FEB-81

14 rows selected.

If we look at CLARK again - the window selected is different since the data in the partition is sorted differently. CLARK's window for RANGE 100 PRECEDING now goes back to TURNER since the HIREDATE for TURNER is the last HIREDATE preceding CLARK's record which is within 100 days of it.

At times, it is a little confusing trying to figure out what the range will actually include. I find using FIRST_VALUE a handy method to help visualize the window and to verify that I have set up the parameters correctly. Now that we can clearly 'see' the windows for this example, we'll use them to compute something meaningful. We need to report each employee's salary, together with the average salary of everyone hired within the 100 days preceding them and the average salary of everyone hired within the 100 days following them. The query would look like this:

scott@TKYTE816> select ename, hiredate, sal,
  2    avg(sal)
  3    over (order by hiredate asc  range 100 preceding)
  4      avg_sal_100_days_before,
  5    avg(sal)
  6    over (order by hiredate desc  range 100 preceding)
  7      avg_sal_100_days_after
  8  from emp
  9  order by hiredate
 10  /

ENAME      HIREDATE       SAL AVG_SAL_100_DAYS_BEFORE AVG_SAL_100_DAYS_AFTER
---------- --------- -------- ----------------------- ----------------------
SMITH      17-DEC-80   800.00                  800.00                1216.67
ALLEN      20-FEB-81  1600.00                 1200.00                2168.75
WARD       22-FEB-81  1250.00                 1216.67                2358.33
JONES      02-APR-81  2975.00                 1941.67                2758.33
BLAKE      01-MAY-81  2850.00                 2168.75                2650.00
CLARK      09-JUN-81  2450.00                 2758.33                1975.00
TURNER     08-SEP-81  1500.00                 1975.00                2340.00
MARTIN     28-SEP-81  1250.00                 1375.00                2550.00
KING       17-NOV-81  5000.00                 2583.33                2562.50
JAMES      03-DEC-81   950.00                 2340.00                1750.00
FORD       03-DEC-81  3000.00                 2340.00                1750.00
MILLER     23-JAN-82  1300.00                 2562.50                1300.00
SCOTT      09-DEC-82  3000.00                 3000.00                2050.00
ADAMS      12-JAN-83  1100.00                 2050.00                1100.00

14 rows selected.

Here, if we look at CLARK again, since we understand his window within the group, we can see that the average salary of 2758.33 is equal to (2450+2850+2975)/3. This is the average of the salaries for CLARK and the rows preceding CLARK (those of JONES and BLAKE) when the data is sorted in ascending order. On the other hand, the average salary of 1975.00 is equal to (2450+1500)/2. These are the values of CLARK's salary and the rows preceding CLARK when the data is sorted in descending order. Using this query, we are able to compute the average salary of people hired within the 100 days before CLARK and the 100 days after CLARK simultaneously.

RANGE windows only work on NUMBERS and DATES since we cannot add or subtract N units from a VARCHAR2 in general. The other limitation they have is that there can only be one column in the ORDER BY - ranges are single dimensional in nature. We cannot do a range in N dimensional space.
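
So, of the two windows sketched below, Oracle rejects the first, while the second is perfectly legal; a ROWS window places no restriction on the number or datatype of the ORDER BY columns:

-- invalid: a RANGE window with an offset allows only one (numeric or date) ORDER BY column
count(*) over (order by deptno, ename range 100 preceding)

-- valid: a ROWS window may order by many columns, of any datatype
count(*) over (order by deptno, ename rows 100 preceding)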

Row Windows

Row windows are physical units: a number of physical rows to include in the window. Recasting the preceding example as a ROWS window:

count(*) over
  (order by x
   ROWS 5 preceding)

That window would consist of up to 6 rows: the current row and the five rows 'in front of' this row (where 'in front of' is defined by the ORDER BY clause). With ROWS windows, we do not have the limitations of the RANGE window - the data may be of any type and the ORDER BY may include many columns. Here is an example similar to the one we just did above:

scott@TKYTE816> select ename, sal, hiredate,
  2    first_value(ename)
  3    over (order by hiredate asc
  4          rows 5 preceding) ename_prec,
  5    first_value(hiredate)
  6    over (order by hiredate asc
  7          rows 5 preceding) hiredate_prec
  8  from emp
  9  order by hiredate asc
 10  /

ENAME           SAL HIREDATE  ENAME_PREC HIREDATE_
---------- -------- --------- ---------- ---------
SMITH        800.00 17-DEC-80 SMITH      17-DEC-80
ALLEN       1600.00 20-FEB-81 SMITH      17-DEC-80
WARD        1250.00 22-FEB-81 SMITH      17-DEC-80
JONES       2975.00 02-APR-81 SMITH      17-DEC-80
BLAKE       2850.00 01-MAY-81 SMITH      17-DEC-80
CLARK       2450.00 09-JUN-81 SMITH      17-DEC-80
TURNER      1500.00 08-SEP-81 ALLEN      20-FEB-81
MARTIN      1250.00 28-SEP-81 WARD       22-FEB-81
KING        5000.00 17-NOV-81 JONES      02-APR-81
JAMES        950.00 03-DEC-81 BLAKE      01-MAY-81
FORD        3000.00 03-DEC-81 CLARK      09-JUN-81
MILLER      1300.00 23-JAN-82 TURNER     08-SEP-81
SCOTT       3000.00 09-DEC-82 MARTIN     28-SEP-81
ADAMS       1100.00 12-JAN-83 KING       17-NOV-81

14 rows selected.

Looking at CLARK again, we see that the first value in the window ROWS 5 PRECEDING is SMITH; the first row in the window preceding CLARK going back 5 rows. In fact, SMITH is the first value in all of the preceding rows, for BLAKE, JONES, and so on. This is because SMITH is the first record in this group (SMITH is the first value for SMITH even). Sorting the group in a descending fashion reverses the windows:

scott@TKYTE816> select ename, sal, hiredate,
  2    first_value(ename)
  3    over (order by hiredate desc
  4          rows 5 preceding) ename_prec,
  5    first_value(hiredate)
  6    over (order by hiredate desc
  7          rows 5 preceding) hiredate_prec
  8  from emp
  9  order by hiredate desc
 10  /

ENAME           SAL HIREDATE  ENAME_PREC HIREDATE_
---------- -------- --------- ---------- ---------
ADAMS       1100.00 12-JAN-83 ADAMS      12-JAN-83
SCOTT       3000.00 09-DEC-82 ADAMS      12-JAN-83
MILLER      1300.00 23-JAN-82 ADAMS      12-JAN-83
JAMES        950.00 03-DEC-81 ADAMS      12-JAN-83
FORD        3000.00 03-DEC-81 ADAMS      12-JAN-83
KING        5000.00 17-NOV-81 ADAMS      12-JAN-83
MARTIN      1250.00 28-SEP-81 SCOTT      09-DEC-82
TURNER      1500.00 08-SEP-81 MILLER     23-JAN-82
CLARK       2450.00 09-JUN-81 JAMES      03-DEC-81
BLAKE       2850.00 01-MAY-81 FORD       03-DEC-81
JONES       2975.00 02-APR-81 KING       17-NOV-81
WARD        1250.00 22-FEB-81 MARTIN     28-SEP-81
ALLEN       1600.00 20-FEB-81 TURNER     08-SEP-81
SMITH        800.00 17-DEC-80 CLARK      09-JUN-81

14 rows selected.

So now JAMES is the first value in the set of 5 rows preceding CLARK in the group. Now, we can compute the average salary of a given record with the (up to) 5 employees hired before them or after them as follows:

scott@TKYTE816> select ename, hiredate, sal,
  2    avg(sal)
  3    over (order by hiredate asc rows 5 preceding) avg_5_before,
  4      count(*)
  5    over (order by hiredate asc rows 5 preceding) obs_before,
  6      avg(sal)
  7    over (order by hiredate desc rows 5 preceding) avg_5_after,
  8      count(*)
  9    over (order by hiredate desc rows 5 preceding) obs_after
 10  from emp
 11  order by hiredate
 12  /

ENAME      HIREDATE       SAL AVG_5_BEFORE OBS_BEFORE AVG_5_AFTER OBS_AFTER
---------- --------- -------- ------------ ---------- ----------- ---------
SMITH      17-DEC-80   800.00       800.00       1.00     1987.50      6.00
ALLEN      20-FEB-81  1600.00      1200.00       2.00     2104.17      6.00
WARD       22-FEB-81  1250.00      1216.67       3.00     2045.83      6.00
JONES      02-APR-81  2975.00      1656.25       4.00     2670.83      6.00
BLAKE      01-MAY-81  2850.00      1895.00       5.00     2675.00      6.00
CLARK      09-JUN-81  2450.00      1987.50       6.00     2358.33      6.00
TURNER     08-SEP-81  1500.00      2104.17       6.00     2166.67      6.00
MARTIN     28-SEP-81  1250.00      2045.83       6.00     2416.67      6.00
KING       17-NOV-81  5000.00      2670.83       6.00     2391.67      6.00
JAMES      03-DEC-81   950.00      2333.33       6.00     1587.50      4.00
FORD       03-DEC-81  3000.00      2358.33       6.00     1870.00      5.00
MILLER     23-JAN-82  1300.00      2166.67       6.00     1800.00      3.00
SCOTT      09-DEC-82  3000.00      2416.67       6.00     2050.00      2.00
ADAMS      12-JAN-83  1100.00      2391.67       6.00     1100.00      1.00

14 rows selected.

Notice here I selected out a COUNT(*) as well. This is useful just to demonstrate how many rows went into making up a given average. We can see clearly that for ALLEN's record, the average salary computation for people hired before him used only 2 records, whereas the computation for salaries of people hired after him used 6. At the point in the group where ALLEN's record is located, there was only 1 preceding record, so the analytic function simply used as many rows as there were in the computation.

Specifying Windows

Now that we understand the differences between a RANGE and a ROWS window, we can investigate the ways in which we can specify these ranges.

In its simplest form, the window is specified with one of three mutually exclusive statements:

  • UNBOUNDED PRECEDING - the window starts with the first row of the current partition.

  • CURRENT ROW - the window starts (or ends) with the current row.

  • Numeric Expression PRECEDING - the window starts at the row that is 'Numeric Expression' rows before the current row for ROWS, or at the row whose ORDER BY value is less than the current row's by 'Numeric Expression' for RANGE.

The CURRENT ROW range would probably never be used in the simple form, as it would restrict the analytic function to the single current row - something for which an analytic function is not needed. In the more complex form, the window is specified with a BETWEEN clause. There, we might use CURRENT ROW as either the starting point or the ending point of the window. The start and end points of the BETWEEN clause may be specified using any of the items in the list above, plus one additional one:

Numeric Expression FOLLOWING - the window ends (or starts) at the row that is 'Numeric Expression' rows after the current row for ROWS, and starts (or ends) at the row whose ORDER BY value is greater than the current row's by 'Numeric Expression' for RANGE.

Some examples of these windows would be:

scott@TKYTE816> select deptno, ename, hiredate,
  2    count(*) over (partition by deptno
  3                   order by hiredate nulls first
  4                   range 100 preceding) cnt_range,
  5    count(*) over (partition by deptno
  6                   order by hiredate nulls first
  7                   rows 2 preceding) cnt_rows
  8  from emp
  9  where deptno in (10, 20)
 10  order by deptno, hiredate
 11  /

    DEPTNO ENAME      HIREDATE   CNT_RANGE   CNT_ROWS
---------- ---------- --------- ---------- ----------
        10 CLARK      09-JUN-81          1          1
           KING       17-NOV-81          1          2
           MILLER     23-JAN-82          2          3

        20 SMITH      17-DEC-80          1          1
           JONES      02-APR-81          1          2
           FORD       03-DEC-81          1          3
           SCOTT      09-DEC-82          1          3
           ADAMS      12-JAN-83          2          3

8 rows selected.

As you can see, the RANGE 100 PRECEDING window counts only those rows in the current partition that precede the current row and whose HIREDATE is within 100 days of the current row's HIREDATE. In this case the count is always 1 or 2, indicating that in these departments it was rare to have an employee hired within 100 days of another employee; it only happened twice. The ROWS 2 PRECEDING window, however, varies from 1 to 3 rows depending on how far into the group we are. For the first row in the group, the count is 1 (there are no preceding rows). For the next row in the group, it is 2. Finally, for rows 3 and higher, the COUNT(*) remains constant, as we are counting only the current row and the preceding 2 rows.

Now we will take a look at using BETWEEN. All of the windows we have defined so far have ended at the current row and have looked 'backwards' in the resultset for more information. We can define a window such that the current row being processed is not the last row in the window, but is somewhere in the middle of the window. For example:

scott@TKYTE816> select ename, hiredate,
  2     first_value(ename) over
  3       (order by hiredate asc
  4        range between 100 preceding and 100 following),
  5     last_value(ename) over
  6       (order by hiredate asc
  7        range between 100 preceding and 100 following)
  8  from emp
  9  order by hiredate asc
 10  /

ENAME      HIREDATE  FIRST_VALU LAST_VALUE
---------- --------- ---------- ----------
SMITH      17-DEC-80 SMITH      WARD
ALLEN      20-FEB-81 SMITH      BLAKE
WARD       22-FEB-81 SMITH      BLAKE
JONES      02-APR-81 ALLEN      CLARK
BLAKE      01-MAY-81 ALLEN      CLARK
CLARK      09-JUN-81 JONES      TURNER
TURNER     08-SEP-81 CLARK      JAMES
MARTIN     28-SEP-81 TURNER     JAMES
KING       17-NOV-81 TURNER     MILLER
FORD       03-DEC-81 TURNER     MILLER
JAMES      03-DEC-81 TURNER     MILLER
MILLER     23-JAN-82 KING       MILLER
SCOTT      09-DEC-82 SCOTT      ADAMS
ADAMS      12-JAN-83 SCOTT      ADAMS

14 rows selected.

Using CLARK again, we can see that his window now extends back to JONES and forward to TURNER. Instead of a window consisting only of the people hired up to 100 days before, the window now consists of the people hired up to 100 days before and up to 100 days after the current record.
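
A centered window like this is exactly what a symmetric moving average calls for. As a sketch of my own (not one of the book's transcripts), the following would average each employee's salary with the salaries of everyone hired within 100 days on either side:

select ename, hiredate, sal,
       -- the window is centered on the current row's HIREDATE
       avg(sal) over (order by hiredate asc
                      range between 100 preceding
                                and 100 following) centered_avg_sal
  from emp
 order by hiredate asc;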

OK, now we have a good understanding of the syntax of the four components of the analytic function clause:

FUNCTION_NAME(<argument>, <argument>, ...)
OVER (<Partition-Clause> <Order-by-Clause> <Windowing Clause>)

We'll now take a brief look at all of the functions available.

The Functions

There are 26 analytic functions available to use with this feature. Some of them have the same name as existing aggregate functions, such as AVG and SUM. Others have new names and provide new functionality. What we will do in this section is simply list the available functions and give a short description of their purpose.


AVG (<distinct|all> expression )

Used to compute an average of an expression within a group and window. Distinct may be used to find the average of the values in a group after duplicates have been removed.

CORR (expression, expression)

Returns the coefficient of correlation of a pair of expressions that return numbers. It is shorthand for:

COVAR_POP(expr1, expr2) / (STDDEV_POP(expr1) * STDDEV_POP(expr2))

Statistically speaking, a correlation is the strength of an association between variables. An association between variables means that the value of one variable can be predicted, to some extent, by the value of the other. The correlation coefficient gives the strength of the association by returning a number between -1 (strong inverse correlation) and 1 (strong correlation). A value of 0 would indicate no correlation.

COUNT (<distinct> <*> <expression>)

This will count occurrences within a group. If you specify * or some non-null constant, count will count all rows. If you specify an expression, count returns the count of non-null evaluations of expression. You may use the DISTINCT modifier to count occurrences of rows in a group after duplicates have been removed.

COVAR_POP (expression, expression)

This returns the population covariance of a pair of expressions that return numbers.

COVAR_SAMP (expression, expression)

This returns the sample covariance of a pair of expressions that return numbers.

CUME_DIST

This computes the relative position of a row in a group. CUME_DIST will always return a number greater than 0 and less than or equal to 1. This number represents the 'position' of the row in the group of N rows. In a group of three rows, for example, the cumulative distribution values returned would be 1/3, 2/3, and 3/3.
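
As a sketch, the following would position each employee within their department's salary distribution; each value is the fraction of rows in the partition with a salary at or below the current row's:

select deptno, ename, sal,
       cume_dist() over (partition by deptno
                         order by sal) sal_cume_dist
  from emp;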

DENSE_RANK

This function computes the relative rank of each row returned from a query with respect to the other rows, based on the values of the expressions in the ORDER BY clause. The data within a group is sorted by the ORDER BY clause and then a numeric ranking is assigned to each row in turn starting with 1 and continuing on up. The rank is incremented every time the values of the ORDER BY expressions change. Rows with equal values receive the same rank (nulls are considered equal in this comparison). A dense rank returns a ranking number without any gaps. This is in comparison to RANK below.

FIRST_VALUE

This simply returns the first value from a group.

LAG (expression, <offset>, <default>)

LAG gives you access to other rows in a resultset without doing a self-join. It allows you to treat the cursor as if it were an array in effect. You can reference rows that come before the current row in a given group. This would allow you to select 'the previous rows' from a group along with the current row. See LEAD for how to get 'the next rows'.

Offset is a positive integer that defaults to 1 (the previous row). Default is the value to be returned if the index is out of range of the window (for the first row in a group, the default will be returned).

LAST_VALUE

This simply returns the last value from a group.

LEAD (expression, <offset>, <default>)

LEAD is the opposite of LAG. Whereas LAG gives you access to a row preceding yours in a group, LEAD gives you access to a row that comes after yours.

Offset is a positive integer that defaults to 1 (the next row). Default is the value to be returned if the index is out of range of the window (for the last row in a group, the default will be returned).

MAX(expression)

Finds the maximum value of expression within a window of a group.

MIN(expression)

Finds the minimum value of expression within a window of a group.

NTILE (expression)

Divides a group into 'value of expression' buckets.

For example, if expression = 4, then each row in the group would be assigned a number from 1 to 4, putting it into a percentile. If the group had 20 rows in it, the first 5 would be assigned 1, the next 5 would be assigned 2, and so on. In the event the cardinality of the group is not evenly divisible by the expression, the rows are distributed such that no percentile has more than 1 row more than any other percentile in that group, and the lowest percentiles are the ones that will have the 'extra' rows. For example, using expression = 4 again and the number of rows = 21, percentile = 1 will have 6 rows, percentile = 2 will have 5, and so on.
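
A quick sketch: this would split EMP into four salary quartiles, with bucket 1 holding the highest-paid quarter of the employees:

select ename, sal,
       -- bucket 1 = highest paid, bucket 4 = lowest paid, since we sort DESC
       ntile(4) over (order by sal desc) sal_quartile
  from emp;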

PERCENT_RANK

This is similar to the CUME_DIST (cumulative distribution) function. For a given row in a group, it calculates the rank of that row minus 1, divided by 1 less than the number of rows being evaluated in the group. This function will always return values from 0 to 1 inclusive.

RANK

This function computes the relative rank of each row returned from a query with respect to the other rows, based on the values of the expressions in the ORDER BY clause. The data within a group is sorted by the ORDER BY clause and then a numeric ranking is assigned to each row in turn starting with 1 and continuing on up. Rows with the same values of the ORDER BY expressions receive the same rank; however, if two rows do receive the same rank the rank numbers will subsequently 'skip'. If two rows are number 1, there will be no number 2 - rank will assign the value of 3 to the next row in the group. This is in contrast to DENSE_RANK, which does not skip values.

RATIO_TO_REPORT (expression)

This function computes the value of expression / (sum(expression)) over the group.

This gives you the percentage of the total the current row contributes to the sum(expression).
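
This answers the 'find percentages within a group' problem from the chapter overview directly. A sketch that reports each employee's share of their department's total payroll:

select deptno, ename, sal,
       -- RATIO_TO_REPORT returns a fraction; multiply by 100 for a percentage
       round(100 * ratio_to_report(sal)
                     over (partition by deptno), 2) pct_of_dept_sal
  from emp
 order by deptno, sal desc;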

REGR_xxxxxxx (expression, expression)

These linear regression functions fit an ordinary-least-squares regression line to a pair of expressions. There are 9 different regression functions available for use.

ROW_NUMBER

Returns the offset of a row in an ordered group. Can be used to sequentially number rows, ordered by certain criteria.

STDDEV (expression)

Computes the standard deviation of the current row with respect to the group.

STDDEV_POP (expression)

This function computes the population standard deviation and returns the square root of the population variance. Its return value is the same as the square root of the VAR_POP function.

STDDEV_SAMP (expression)

This function computes the cumulative sample standard deviation and returns the square root of the sample variance. This function returns the same value as the square root of the VAR_SAMP function would.

SUM(expression)

This function computes the cumulative sum of expression in a group.

VAR_POP (expression)

This function returns the population variance of a non-null set of numbers (nulls are ignored). VAR_POP function makes the following calculation for us:

(SUM(expr*expr) - SUM(expr)*SUM(expr) / COUNT(expr)) / COUNT(expr)

VAR_SAMP (expression)

This function returns the sample variance of a non-null set of numbers (nulls in the set are ignored). This function makes the following calculation for us:

(SUM(expr*expr) - SUM(expr)*SUM(expr) / COUNT(expr)) / (COUNT(expr) - 1)

VARIANCE (expression)

This function returns the variance of expression. Oracle will calculate the variance as follows:

0 if the number of rows in expression = 1

VAR_SAMP if the number of rows in expression > 1

Examples

Now we are ready for the fun part: what we can do with this functionality. I am going to demonstrate with a couple of examples that by no means exhaust what you can do with these new functions, but will give you a good working set of examples to get going with.

The TOP-N Query

A question I hear frequently is, "How can I get the TOP-N records by some set of fields?" Prior to having access to these analytic functions, questions of this nature were extremely difficult to answer. Now, they are very easy.

There are some problems with TOP-N queries however; mostly in the way people phrase them. It is something to be careful about when designing your reports. Consider this seemingly sensible request:

Note 

I would like the top three paid sales reps by department

The problem with this question is that it is ambiguous. It is ambiguous because of repeated values: there might be four people who all make the same astronomical salary, so what should we do then?

I can come up with at least three equally reasonable interpretations of that request - none of which might return three records! I could interpret the request as:

  • Give me the set of sales reps who make the top three salaries - that is, find the set of distinct salary amounts, sort them, take the three largest, and give me everyone who makes one of those salaries.

  • Give me up to three people who make the top salaries - if four people happen to make the top salary, return only three of them (which three being effectively random).

  • Sort the sales reps by salary from greatest to least. Give me the first three rows. If there are fewer than three people in a department, this will return fewer than three records.

After further questioning and clarification, most people want the first case; the rest will want the second or third case. Let's see how we can use this analytic functionality to answer any of the three, and compare it to how we might do it without analytic functions.

We will use the SCOTT.EMP table for these examples. The first question we will answer is 'Give me the set of sales people making the top 3 salaries in each department':

scott@TKYTE816> select *
  2    from (select deptno, ename, sal,
  3          dense_rank()
  4            over (partition by deptno
  5             order by sal desc)
  6          dr from emp)
  7    where dr <= 3
  8  order by deptno, sal desc
  9  /

    DEPTNO ENAME             SAL         DR
---------- ---------- ---------- ----------
        10 KING             5000          1
           CLARK            2450          2
           MILLER           1300          3

        20 SCOTT            3000          1
           FORD             3000          1
           JONES            2975          2
           ADAMS            1100          3

        30 BLAKE            2850          1
           ALLEN            1600          2
           TURNER           1500          3

10 rows selected.

Here the DENSE_RANK() function was used to get the top three salaries. We assigned the dense rank to the salary column, sorted in descending order. If you recall from above, a dense rank does not skip numbers and will assign the same number to those rows with the same value. Hence, once the result set is built in the inline view, we can simply select all of the rows with a dense rank of three or less; this gives us everyone who makes the top three salaries by department number. Just to show what would happen if we tried to use RANK and encountered these duplicate values:

scott@TKYTE816> select deptno, ename, sal,
  2         dense_rank()
  3           over (partition by deptno
  4                 order by sal desc) dr,
  5         rank()
  6           over (partition by deptno
  7                 order by sal desc) r
  8    from emp
  9   order by deptno, sal desc
 10  /

    DEPTNO ENAME             SAL         DR          R
---------- ---------- ---------- ---------- ----------
        10 KING             5000          1          1
           CLARK            2450          2          2
           MILLER           1300          3          3

        20 SCOTT            3000          1          1
           FORD             3000          1          1
           JONES            2975          2          3
           ADAMS            1100          3          4
           SMITH             800          4          5

        30 BLAKE            2850          1          1
           ALLEN            1600          2          2
           TURNER           1500          3          3
           WARD             1250          4          4
           MARTIN           1250          4          4
           JAMES             950          5          6

14 rows selected.

If we had used RANK, it would have left ADAMS out of the result set (because he is ranked at 4), but ADAMS is one of the people in department 20 making the top three salaries, so he belongs in the result. In this case, using RANK instead of DENSE_RANK would not have answered our specific query.

Lastly, we had to use an inline view and alias the dense rank column to DR. This is because we cannot use analytic functions in a WHERE or HAVING clause directly so we had to compute the view and then filter just the rows we wanted to keep. The use of an inline view with a predicate will be a common operation in many of our examples.

Now on to the question 'Give me up to three people who make the top salaries by department':

scott@TKYTE816> select *
  2  from (select deptno, ename, sal,
  3        count(*) over (partition by deptno
  4                       order by sal desc
  5                       range unbounded preceding)
  6        cnt from emp)
  7  where cnt <= 3
  8  order by deptno, sal desc
  9  /

    DEPTNO ENAME             SAL        CNT
---------- ---------- ---------- ----------
        10 KING             5000          1
           CLARK            2450          2
           MILLER           1300          3

        20 SCOTT            3000          2
           FORD             3000          2
           JONES            2975          3

        30 BLAKE            2850          1
           ALLEN            1600          2
           TURNER           1500          3

9 rows selected.

This one was a little tricky. What we are doing is counting all of the records that precede the current record in the window, after sorting by salary. The RANGE UNBOUNDED PRECEDING, which would be the default range here, makes a window that includes all of the records whose salary is larger than or equal to ours, since we sorted in descending (DESC) order. By counting everyone who makes what we make or more, we can retrieve only the rows such that the count (CNT) of people making the same or more is less than or equal to 3. Notice how, for department 20, SCOTT and FORD both have a count of 2; they are the top two salary earners in that department, hence they are both in the same window with regard to each other. It is interesting to note the subtle difference in this query:

scott@TKYTE816> select *   2  from (select deptno, ename, sal,   3        count(*) over (partition by deptno   4                       order by sal desc, ename   5                       range unbounded preceding)   6        cnt from emp)   7  where cnt <= 3   8  order by deptno, sal desc   9  /          DEPTNO ENAME             SAL        CNT ---------- ---------- ---------- ----------         10 KING             5000          1            CLARK            2450          2            MILLER           1300          3              20 FORD             3000          1            SCOTT            3000          2            JONES            2975          3              30 BLAKE            2850          1            ALLEN            1600          2            TURNER           1500          3      9 rows selected. 

Notice how adding ENAME to the ORDER BY clause affected the window. Previously, both FORD and SCOTT had a count of two because the window was built using only the salary column. A more specific window changes the outcome of the COUNT, which just points out that the window function is computed based on both the ORDER BY and the RANGE. When we sorted the partition by salary alone, FORD preceded SCOTT when SCOTT was the current row, and SCOTT preceded FORD when FORD was the current row. Only when we sorted by both the SAL and ENAME columns did the SCOTT and FORD records have a definite order with respect to each other.
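As an aside, a ROWS-based window would behave differently again. The following sketch (not run above) uses ROWS UNBOUNDED PRECEDING, which counts only the current row plus the rows physically before it in the window; SCOTT and FORD would receive counts of 1 and 2 in some indeterminate order, rather than both seeing 2:

select *
  from (select deptno, ename, sal,
               count(*) over (partition by deptno
                              order by sal desc
                              rows unbounded preceding) cnt
          from emp)
 where cnt <= 3
 order by deptno, sal desc
/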

To see that this approach, using the count, allows us to return three or fewer records, we can update the data so that more than three people make the top salaries:

scott@TKYTE816> update emp set sal = 99 where deptno = 30;      6 rows updated.      scott@TKYTE816> select *   2  from (select deptno, ename, sal,   3        count(*) over (partition by deptno   4                       order by sal desc   5                       range unbounded preceding)   6        cnt from emp)   7  where cnt <= 3   8  order by deptno, sal desc   9  /          DEPTNO ENAME             SAL        CNT ---------- ---------- ---------- ----------         10 KING             5000          1            CLARK            2450          2            MILLER           1300          3              20 SCOTT            3000          2            FORD             3000          2            JONES            2975          3      6 rows selected. 

Now department 30 no longer appears in the report, because all six people in that department make the same amount. The CNT field for each of them is 6, a number that is not less than or equal to 3.

Now for the last question: 'Sort the sales people by salary from greatest to least; give me the first three rows'. Using ROW_NUMBER(), this is easy to accomplish:

scott@TKYTE816> select *   2  from (select deptno, ename, sal,   3        row_number() over (partition by deptno   4                           order by sal desc)   5        rn from emp)   6  where rn <= 3   7  /          DEPTNO ENAME             SAL         RN ---------- ---------- ---------- ----------         10 KING             5000          1            CLARK            2450          2            MILLER           1300          3              20 SCOTT            3000          1            FORD             3000          2            JONES            2975          3              30 ALLEN              99          1            BLAKE              99          2            MARTIN             99          3      9 rows selected. 

This query works by sorting each partition in descending order, based on the salary column, and then assigning a sequential row number to each row of the partition as it is processed. A WHERE clause retains just the first three rows of each partition. Below, in the pivot example, we will use this same concept to turn rows into columns. It should be noted, however, that the rows we got back for DEPTNO=30 are somewhat random. If you recall, department 30 was updated so that all six employees had a salary of 99. You can control which three records come back via the ORDER BY to some degree. For instance, you could use ORDER BY SAL DESC, ENAME to get the three highest paid in order of employee name when they all earn the same amount, as sketched below.
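A sketch of that deterministic version (not run here) would be:

select *
  from (select deptno, ename, sal,
               row_number() over (partition by deptno
                                  order by sal desc, ename) rn
          from emp)
 where rn <= 3
 order by deptno, sal desc
/
-- Since everyone in DEPTNO = 30 now earns 99, ENAME breaks the tie and ALLEN,
-- BLAKE, and JAMES would come back, repeatably, as rows 1, 2 and 3.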

It is interesting to note that using the above method with ROW_NUMBER, you have the ability to get an arbitrary 'slice' of data from a group of rows. This may be useful in a stateless environment where you need to paginate through a set of data. For example, to display the EMP table sorted by ENAME a few rows at a time, you could use a query similar to the following, which fetches rows 5 through 10:

scott@TKYTE816> select ename, hiredate, sal   2  from (select ename, hiredate, sal,   3        row_number() over (order by ename)   4        rn from emp)   5  where rn between 5 and 10   6  order by rn   7  /      ENAME      HIREDATE         SAL ---------- --------- ---------- FORD       03-DEC-81       3000 JAMES      03-DEC-81        950 JONES      02-APR-81       2975 KING       17-NOV-81       5000 MARTIN     28-SEP-81       1250 MILLER     23-JAN-82       1300      6 rows selected. 
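In a real stateless application you would typically generalize the window endpoints with bind variables; a sketch, where :lo and :hi are assumed to be supplied by the caller:

select ename, hiredate, sal
  from (select ename, hiredate, sal,
               row_number() over (order by ename) rn
          from emp)
 where rn between :lo and :hi
 order by rn
/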

One last thing to demonstrate how truly powerful this functionality is: a comparison between queries that use the analytic functions and queries that do not. To test this, I created a table T, which is, for all intents and purposes, just a 'bigger' EMP table:

scott@TKYTE816> create table t   2  as   3  select object_name ename,   4         mod(object_id,50) deptno,   5         object_id sal   6  from all_objects   7  where rownum <= 1000   8  /      Table created.      scott@TKYTE816> create index t_idx on t(deptno,sal desc);      Index created.      scott@TKYTE816> analyze table t   2  compute statistics   3  for table   4  for all indexed columns   5  for all indexes   6  /      Table analyzed. 

We've placed an index on this table that will be useful in answering the types of questions we have been asking of it. We would now like to compare the syntax and performance of the queries with and without analytic functions. For this, I used SQL_TRACE, TIMED_STATISTICS and TKPROF to compare the performance of the queries. See the chapter on Performance Strategies and Tools for more details on using those tools and fully interpreting the following results:

scott@TKYTE816> select *   2  from (select deptno, ename, sal,   3        dense_rank() over (partition by deptno   4                           order by sal desc)   5        dr from t)   6  where dr <= 3   7  order by deptno, sal desc   8  /      call     count       cpu    elapsed   disk      query    current        rows ------- ------  -------- ---------- ------ ---------- ----------  ---------- Parse        1      0.00       0.00      0          0          0           0 Execute      2      0.00       0.00      0          0          0           0 Fetch       11      0.01       0.07      7         10         17         150 ------- ------  -------- ---------- ------ ---------- ----------  ---------- total       14      0.01       0.07      7         10         17         150      Misses in library cache during parse: 0 Optimizer goal: CHOOSE Parsing user id: 54      Rows     Row Source Operation -------  ---------------------------------------------------     150  VIEW     364   WINDOW SORT PUSHED RANK    1000    TABLE ACCESS FULL T      ************************************      scott@TKYTE816> select deptno, ename, sal   2  from t e1   3  where sal in (select sal   4                from (select distinct sal , deptno   5                      from t e3   6                      order by deptno, sal desc) e2   7                where e2.deptno = e1.deptno   8                and rownum <= 3)   9   order by deptno, sal desc  10  /      call     count       cpu    elapsed   disk      query    current        rows ------- ------  -------- ---------- ------ ---------- ----------  ---------- Parse        1      0.00       0.00      0          0          0           0 Execute      1      0.00       0.00      0          0          0           0 Fetch       11      0.80       0.80      0      10010      12012         150 ------- ------  -------- ---------- ------ ---------- ----------  ---------- total       13      0.80       0.80      0      10010      12012         150      Misses in library cache during parse: 0 Optimizer goal: CHOOSE Parsing user id: 54      Rows     Row Source Operation -------  ---------------------------------------------------     150  SORT ORDER BY     150   FILTER    1001    TABLE ACCESS FULL T    1000    FILTER    3700     COUNT STOPKEY    2850      VIEW    2850       SORT ORDER BY STOPKEY   20654        SORT UNIQUE   20654         TABLE ACCESS FULL T 

The queries above both answer the question of 'who makes the top three salaries'. The analytic query can answer this question with almost no work at all - 0.01 CPU seconds and 27 logical I/Os. The relational query, on the other hand, must do a lot more work - 0.80 CPU seconds and over 22,000 logical I/Os. Without the analytic function, we need to execute a subquery for each row in T to find the three biggest salaries for a given department. Not only is this query slower, it was also hard to write in the first place. There are some tricks I could use to attempt to improve performance, some hints perhaps, but the query would only become less understandable and more 'fragile' - fragile in the sense that any query that relies on hints is fragile: a hint is just a suggestion, and the optimizer is free to ignore it at any time in the future. The analytic query clearly wins here for both performance and readability.

Now for the second question - 'give me up to three people who make the top salaries'.

scott@TKYTE816> select *   2  from (select deptno, ename, sal,   3        count(*) over (partition by deptno   4                       order by sal desc   5                       range unbounded preceding)   6        cnt from t)   7  where cnt <= 3   8  order by deptno, sal desc   9  /      call     count       cpu    elapsed   disk      query    current        rows ------- ------  -------- ---------- ------ ---------- ----------  ---------- Parse        1      0.01       0.01      0          0          0           0 Execute      2      0.00       0.00      0          0          0           0 Fetch       11      0.02       0.12     15         10         17         150 ------- ------  -------- ---------- ------ ---------- ----------  ---------- total       14      0.03       0.13     15         10         17         150      Misses in library cache during parse: 0 Optimizer goal: CHOOSE Parsing user id: 54      Rows     Row Source Operation -------  ---------------------------------------------------     150  VIEW    1000   WINDOW SORT    1000    TABLE ACCESS FULL T      *************************************      scott@TKYTE816> select deptno, ename, sal   2  from t e1   3  where (select count(*)   4         from t e2   5         where e2.deptno = e1.deptno   6         and e2.sal >= e1.sal) <= 3   7  order by deptno, sal desc   8  /      call     count       cpu    elapsed   disk      query    current        rows ------- ------  -------- ---------- ------ ---------- ----------  ---------- Parse        1      0.01       0.01      0          0          0           0 Execute      1      0.00       0.00      0          0          0           0 Fetch       11      0.60       0.66      0       4010       4012         150 ------- ------  -------- ---------- ------ ---------- ----------  ---------- total       13      0.61       0.67      0       4010       4012         150      Misses in library cache during parse: 0 Optimizer goal: CHOOSE Parsing user id: 54      Rows     Row Source Operation -------  ---------------------------------------------------     150  SORT ORDER BY     150   FILTER    1001    TABLE ACCESS FULL T    2000    SORT AGGREGATE   10827     INDEX FAST FULL SCAN (object id 27867)      

Once again, the results are clear: 0.03 CPU seconds and 27 logical I/Os versus 0.61 CPU seconds and over 8,000 logical I/Os. The clear winner is again the analytic function. The reason, once again, is that without the analytic functions we need to execute a correlated subquery for each and every row in the base table. This query counts the number of records in the same department that have the same or greater salary; we keep only the records where that count is less than or equal to 3. The results from the two queries are the same; the run-time resources needed by them are radically different. In this particular case, I would say the query was no harder to code using either method, but performance-wise there is no comparison.

Lastly, we need to get the first three highest paid employees we happen to see in a department. The results are:

scott@TKYTE816> select *   2  from (select deptno, ename, sal,   3        row_number() over (partition by deptno   4                           order by sal desc)   5        rn from t)   6  where rn <= 3   7  order by deptno, sal desc   8  /      call     count       cpu    elapsed   disk      query    current        rows ------- ------  -------- ---------- ------ ---------- ----------  ---------- Parse        1      0.00       0.00      0          0          0           0 Execute      2      0.00       0.00      0          0          0           0 Fetch       11      0.00       0.12     14         10         17         150 ------- ------  -------- ---------- ------ ---------- ----------  ---------- total       14      0.00       0.12     14         10         17         150      Misses in library cache during parse: 0 Optimizer goal: CHOOSE Parsing user id: 54      Rows     Row Source Operation -------  ---------------------------------------------------     150  VIEW    1000   WINDOW SORT    1000    TABLE ACCESS FULL T      *************************************      scott@TKYTE816> select deptno, ename, sal   2  from t e1   3  where (select count(*)   4         from t e2   5         where e2.deptno = e1.deptno   6         and e2.sal >= e1.sal   7         and (e2.sal > e1.sal OR e2.rowid > e1.rowid)) < 3   8  order by deptno, sal desc   9  /      call     count       cpu    elapsed   disk      query    current        rows ------- ------  -------- ---------- ------ ---------- ----------  ---------- Parse        1      0.00       0.00      0          0          0           0 Execute      1      0.00       0.00      0          0          0           0 Fetch       11      0.88       0.88      0       4010       4012         150 ------- ------  -------- ---------- ------ ---------- ----------  ---------- total       13      0.88       0.88      0       4010       4012         150      Misses in library cache during parse: 0 Optimizer goal: CHOOSE Parsing user id: 54      Rows     Row Source Operation -------  ---------------------------------------------------     150  SORT ORDER BY     150   FILTER    1001    TABLE ACCESS FULL T    2000    SORT AGGREGATE    9827     INDEX FAST FULL SCAN (object id 27867)

Once again, performance-wise there is no comparison: the analytic function version simply outperforms the query that does not use analytic functions, many times over. In this particular case the analytic query is also far easier to code and understand. The correlated subquery we had to write is quite complex. We need to count the number of records in the department whose salary is greater than or equal to the current row's. Additionally, when another row's salary is not greater than ours (it is merely equal), we count it only if its ROWID (any unique column would do) is greater than the current record's. That ensures we do not double count tied rows, and so retrieve just the arbitrary three rows we asked for, no more.

In each case, it is clear that the analytic functions not only ease the job of writing complex queries, but they can also substantially improve run-time performance. They allow you to do things you would not otherwise have considered doing in SQL due to the associated cost.

Pivot Query

A pivot query is used when you want to take data such as:

C1     C2     C3
-----  -----  -----
a1     b1     x1
a1     b1     x2
a1     b1     x3

and display it in the following format:

C1     C2     C3(1)  C3(2)  C3(3)
-----  -----  -----  -----  -----
a1     b1     x1     x2     x3

This turns rows into columns. For example, we might take the jobs within a department and make them columns, so that the output would look like this:

    DEPTNO JOB_1     JOB_2     JOB_3
---------- --------- --------- ---------
        10 CLERK     MANAGER   PRESIDENT
        20 ANALYST   ANALYST   CLERK
        30 CLERK     MANAGER   SALESMAN

instead of this:

    DEPTNO JOB
---------- ---------
        10 CLERK
        10 MANAGER
        10 PRESIDENT
        20 ANALYST
        20 CLERK
        20 MANAGER
        30 CLERK
        30 MANAGER
        30 SALESMAN
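One way to produce output like the above is with the ROW_NUMBER and DECODE technique developed below. This is just a sketch; the exact columns you get back depend on the ORDER BY you choose:

select deptno,
       max(decode(rn, 1, job, null)) job_1,
       max(decode(rn, 2, job, null)) job_2,
       max(decode(rn, 3, job, null)) job_3
  from (select deptno, job,
               row_number() over (partition by deptno
                                  order by job) rn
          from emp)
 group by deptno
/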

I'm going to show two examples for pivots. The first will be another implementation of the preceding question. The second shows how to pivot any resultset in a generic fashion and gives you a template for doing so.

In the first case, let's say you wanted to show the top three salary earners in each department as columns. The query needs to return exactly one row per department, and the row would have four columns: the DEPTNO, the name of the highest paid employee in the department, the name of the next highest paid, and so on. Using this new functionality, this is almost easy (before these functions, it was virtually impossible):

ops$tkyte@DEV816> select deptno,   2             max(decode(seq,1,ename,null)) highest_paid,   3             max(decode(seq,2,ename,null)) second_highest,   4             max(decode(seq,3,ename,null)) third_highest   5    from (SELECT deptno, ename,   6                 row_number() OVER   7                     (PARTITION BY deptno   8                      ORDER BY sal desc NULLS LAST) seq   9          FROM emp)  10    where seq <= 3  11    group by deptno  12  /          DEPTNO HIGHEST_PA SECOND_HIG THIRD_HIGH ---------- ---------- ---------- ----------         10 KING       CLARK      MILLER         20 SCOTT      FORD       JONES         30 BLAKE      ALLEN      TURNER 

This simply created an inner resultset that assigned a sequence number to the employees in each department, in descending order of salary. The WHERE clause keeps only the rows with a sequence of 1, 2, or 3, and the DECODE in the outer query assigns each of those rows to the correct 'column'. The GROUP BY then gets rid of the redundant rows, leaving us with our collapsed result. It may be easier to understand what I mean by that if you see the resultset without the GROUP BY and MAX:

scott@TKYTE816> select deptno,   2    decode(seq,1,ename,null) highest_paid,   3    decode(seq,2,ename,null) second_highest,   4    decode(seq,3,ename,null) third_highest   5  from (select deptno, ename,   6        row_number() over   7          (partition by deptno   8           order by sal desc nulls last)   9        seq from emp)  10  where seq <= 3  11  /          DEPTNO HIGHEST_PA SECOND_HIG THIRD_HIGH ---------- ---------- ---------- ----------         10 KING         10            CLARK         10                       MILLER              20 SCOTT         20            FORD         20                       JONES              30 ALLEN         30            BLAKE         30                       MARTIN      9 rows selected.

The MAX aggregate function is applied over each DEPTNO group. In any given DEPTNO above, only one row will have a non-NULL value for HIGHEST_PAID; the remaining rows in that group will always be NULL. The MAX function picks out the non-NULL row and keeps it for us. Hence, the GROUP BY and MAX 'collapse' our resultset, removing the NULL values from it and giving us what we want.

If you have a table T with columns C1, C2 and you would like to get a result like:

C1     C2(1)   C2(2)   ...   C2(N)

where column C1 is to stay cross record (going down the page) and column C2 is to be pivoted in record (going across the page) - that is, the values of C2 are to become columns instead of rows - you will generate a query of the form:

select c1,
       max(decode(rn,1,c2,null)) c2_1,
       max(decode(rn,2,c2,null)) c2_2,
       ...
       max(decode(rn,N,c2,null)) c2_N
  from (select c1, c2,
               row_number() over (partition by c1
                                  order by <something>) rn
          from T
         <some predicate>)
 group by c1

In the above example, C1 was simply DEPTNO and C2 was ENAME. Since we ordered by SAL DESC, the first three columns we retrieved were the top three paid employees in that department (bearing in mind that if four people made the top three, we would of course lose one).

The second example is a more generic 'I want to pivot my resultset'. Here, instead of having a single column C1 to anchor on and a single column C2 to pivot, we'll look at the more general case where C1 is a set of columns, as is C2. As it turns out, this is very similar to the above. Suppose you want to report, by JOB and DEPTNO, the employees in that job and their salaries. The report needs to show the employees and their salaries across the page as columns, not down the page. Additionally, the employees need to appear from left to right in order of their salary. The steps would be:

scott@TKYTE816> select max(count(*)) from emp group by deptno, job;      MAX(COUNT(*)) -------------             4 

This tells us the number of columns; now we can generate the query:

scott@TKYTE816> select deptno, job,   2    max(decode(rn, 1, ename, null)) ename_1,   3    max(decode(rn, 1, sal, null)) sal_1,   4    max(decode(rn, 2, ename, null)) ename_2,   5    max(decode(rn, 2, sal, null)) sal_2,   6    max(decode(rn, 3, ename, null)) ename_3,   7    max(decode(rn, 3, sal, null)) sal_3,   8    max(decode(rn, 4, ename, null)) ename_4,   9    max(decode(rn, 4, sal, null)) sal_4  10  from (select deptno, job, ename, sal,  11        row_number() over (partition by deptno, job  12                           order by sal, ename)  13        rn from emp)  14  group by deptno, job  15  /       DEPTNO JOB       ENAME_1 SAL_1 ENAME_2   SAL_2 ENAME_3    SAL_3 ENAME_ SAL_4 ------ --------- ------  ----- --------- ----- ---------- ----- ------ -----     10 CLERK     MILLER   1300     10 MANAGER   CLARK    2450     10 PRESIDENT KING     5000          20 ANALYST   FORD     3000 SCOTT      3000     20 CLERK     SMITH     800 ADAMS      1100     20 MANAGER   JONES    2975          30 CLERK     JAMES      99     30 MANAGER   BLAKE      99     30 SALESMAN  ALLEN      99 MARTIN       99 TURNER        99 WARD      99      9 rows selected. 

(We updated the salaries of all the employees in department 30 to 99 earlier in the chapter, which is why those values appear here.) To pivot a resultset, we can generalize further. If you have a set of columns C1, C2, C3, ... CN and you want to keep columns C1 ... CX cross record and CX+1 ... CN in record, the syntax of the query would be:

select C1, C2, ... CX,
       max(decode(rn,1,C{X+1},null)) C{X+1}_1, ... max(decode(rn,1,CN,null)) CN_1,
       max(decode(rn,2,C{X+1},null)) C{X+1}_2, ... max(decode(rn,2,CN,null)) CN_2,
       ...
       max(decode(rn,N,C{X+1},null)) C{X+1}_N, ... max(decode(rn,N,CN,null)) CN_N
  from (select C1, C2, ... CN,
               row_number() over (partition by C1, C2, ... CX
                                  order by <something>) rn
          from T
         <some predicate>)
 group by C1, C2, ... CX

In the example above, we used C1 as DEPTNO, C2 as JOB, C3 as ENAME, and C4 as SAL.

One other thing we must know is the maximum number of rows per partition that we anticipate, as this dictates the number of columns we will generate. SQL needs to know the number of columns; there is no way around that fact, and without this knowledge we will not be able to pivot. This leads us into the next, more generic, example of pivoting. If we do not know the total number of columns until runtime, we'll have to use dynamic SQL to deal with the fact that the SELECT list is variable. We can use PL/SQL to demonstrate how to do this, resulting in a generic routine that can be reused whenever you need a pivot. This routine will have the following specification:

scott@TKYTE816> create or replace package my_pkg   2  as   3    type refcursor is ref cursor;   4    type array is table of varchar2(30);   5      procedure pivot(p_max_cols       in number   default NULL,   6                      p_max_cols_query in varchar2 default NULL,   7                      p_query          in varchar2,   8                      p_anchor         in array,   9                      p_pivot          in array,  10                      p_cursor in out refcursor);  12  end;      Package created. 

Here, you must input values for either P_MAX_COLS or P_MAX_COLS_QUERY. SQL needs to know the number of columns in a query and this parameter will allow us to build a query with the proper number of columns. The value you should send in here will be the output of a query similar to:

scott@TKYTE816> select max(count(*)) from emp group by deptno, job; 

This is the maximum number of rows per group - the count of the discrete values currently stored in rows that we will turn into columns. You can either use a query to obtain this number at run-time, or supply the number yourself if you already know it.

The P_QUERY parameter is simply the query that gathers your data together. Using the example shown earlier the query would be:

select deptno, job, ename, sal,
       row_number() over (partition by deptno, job
                          order by sal, ename) rn
  from emp

The next two inputs are arrays of column names. The P_ANCHOR tells us what columns will stay cross record (down the page) and P_PIVOT states the columns that will go in record (across the page). In our example from above, P_ANCHOR = ('DEPTNO', 'JOB') and P_PIVOT = ('ENAME','SAL'). Skipping over the implementation for a moment, the entire call put together might look like this:

scott@TKYTE816> variable x refcursor      scott@TKYTE816> set autoprint on      scott@TKYTE816> begin   2  my_pkg.pivot  3  (p_max_cols_query => 'select max(count(*)) from emp  4                        group by deptno,job',  5   p_query => 'select deptno, job, ename, sal,  6   row_number() over (partition by deptno, job  7                      order by sal, ename)  8   rn from emp a',  9 10     p_anchor => my_pkg.array('DEPTNO','JOB'), 11     p_pivot  => my_pkg.array('ENAME', 'SAL'), 12     p_cursor => :x); 13  end;      PL/SQL procedure successfully completed.      DEPTNO JOB       ENAME_ SAL_1 ENAME_2    SAL_2 ENAME_3    SAL_3 ENAME_ SAL_4 ------ --------- ------ ----- ---------- ----- ---------- ----- ------ -----     10 CLERK     MILLER  1300     10 MANAGER   CLARK   2450     10 PRESIDENT KING    5000     20 ANALYST   FORD    3000 SCOTT       3000     20 CLERK     SMITH    800 ADAMS       1100     20 MANAGER   JONES   2975     30 CLERK     JAMES     99     30 MANAGER   BLAKE     99     30 SALESMAN  ALLEN     99 MARTIN        99 TURNER        99 WARD      99      9 rows selected. 

As you can see, the procedure dynamically rewrote our query using the generalized template we developed. The implementation of the package body is straightforward:

scott@TKYTE816> create or replace package body my_pkg   2  as   3   4  procedure pivot(p_max_cols       in number   default null,   5                  p_max_cols_query in varchar2 default null,   6                  p_query          in varchar2,   7                  p_anchor         in array,   8                  p_pivot          in array,   9                  p_cursor in out refcursor)  10  as  11      l_max_cols number;  12      l_query    long;  13      l_cnames   array;  14  begin  15    -- figure out the number of columns we must support  16    -- we either KNOW this or we have a query that can tell us  17    if (p_max_cols is not null)  18    then  19          l_max_cols := p_max_cols;  20    elsif (p_max_cols_query is not null)  21    then  22          execute immediate p_max_cols_query into l_max_cols;  23    else  24          raise_application_error(-20001, 'Cannot figure out max cols');  25    end if;  26  27  28    -- Now, construct the query that can answer the question for us...  29    -- start with the C1, C2, ... CX columns:  30  31    l_query := 'select ';  32    for i in 1 .. p_anchor.count  33  34    loop  35      l_query := l_query || p_anchor(i) || ',';  36    end loop;  37  38    -- Now add in the C{x+1}... CN columns to be pivoted:  39    -- the format is "max(decode(rn,1,C{X+1},null)) cx+1_1"  40  41    for i in 1 .. l_max_cols  42    loop  43      for j in 1 .. p_pivot.count  44        loop  45          l_query := l_query ||  46                     'max(decode(rn,'||i||','||  47                     p_pivot(j)||',null)) ' ||  48                     p_pivot(j) || '_' || i || ',';  49        end loop;  50    end loop;  51  52    -- Now just add in the original query  53  54    l_query := rtrim(l_query,',') || ' from (' || p_query || ') group by ';  55  56    -- and then the group by columns...  57  58    for i in 1 .. p_anchor.count  59    loop  60        l_query := l_query || p_anchor(i) || ',';  61    end loop;  62    l_query := rtrim(l_query,',');  63  64    -- and return it  65  66    execute immediate 'alter session set cursor_sharing=force';  67    open p_cursor for l_query;  68    execute immediate 'alter session set cursor_sharing=exact';  69    end;  70  71  end;  72  /      Package body created. 

It does only a little string manipulation to rewrite the query and then opens a REF CURSOR on it dynamically. In the likely event that the query has a predicate with constants in it, we set CURSOR_SHARING to FORCE for the parse of this query, and then back to EXACT, to facilitate bind variables (see the section on tuning for more information). The result is a fully parsed query that is ready to be fetched from.

Accessing Rows Around Your Current Row

Frequently, people want to access data not only from the current row but from the rows 'in front of' or 'behind' it. For example, let's say you needed a report that shows, by department, all of the employees: their hire date, how many days before theirs the last hire was, and how many days after theirs the next hire was. Using straight SQL, this query would be nightmarish to write. It could be done, but it would be quite difficult, and once again its performance would definitely be questionable. The approach I typically took in the past was either to 'select a select' or to write a PL/SQL function that would take data from the current row and 'find' the previous and next rows' data. This worked, but it introduced large overhead into both the development of the query (I had to write more code) and its run-time execution.

Using analytic functions, this is easy and efficient to do. It would look like this:

scott@TKYTE816> select deptno, ename, hiredate,   2         lag( hiredate, 1, null ) over ( partition by deptno   3                                   order by hiredate, ename ) last_hire,   4         hiredate - lag( hiredate, 1, null )   5                        over ( partition by deptno   6                               order by hiredate, ename ) days_last,   7         lead( hiredate, 1, null )   8          over ( partition by deptno   9                 order by hiredate, ename ) next_hire,  10         lead( hiredate, 1, null )  11          over ( partition by deptno  12                 order by hiredate, ename ) - hiredate days_next  13  from emp  14  order by deptno, hiredate  15  /          DEPTNO ENAME      HIREDATE  LAST_HIRE  DAYS_LAST NEXT_HIRE  DAYS_NEXT ---------- ---------- --------- --------- ---------- --------- ----------         10 CLARK      09-JUN-81                      17-NOV-81        161            KING       17-NOV-81 09-JUN-81        161 23-JAN-82         67            MILLER     23-JAN-82 17-NOV-81         67              20 SMITH      17-DEC-80                      02-APR-81        106            JONES      02-APR-81 17-DEC-80        106 03-DEC-81        245            FORD       03-DEC-81 02-APR-81        245 09-DEC-82        371            SCOTT      09-DEC-82 03-DEC-81        371 12-JAN-83         34            ADAMS      12-JAN-83 09-DEC-82         34              30 ALLEN      20-FEB-81                      22-FEB-81          2            WARD       22-FEB-81 20-FEB-81          2 01-MAY-81         68            BLAKE      01-MAY-81 22-FEB-81         68 08-SEP-81        130            TURNER     08-SEP-81 01-MAY-81        130 28-SEP-81         20            MARTIN     28-SEP-81 08-SEP-81         20 03-DEC-81         66            JAMES      03-DEC-81 28-SEP-81         66      14 rows selected. 

The LEAD and LAG routines can be considered a way to 'index into your partitioned group'. Using these functions, you can access any individual row. Notice, for example, in the above printout that the record for KING includes the data from the prior row (LAST_HIRE) and from the next row (NEXT_HIRE). We can easily access the fields of records preceding or following the current record in an ordered partition.

Before we look in more detail at LAG and LEAD, I would like to compare this to the way you would do this without analytic functions. Once again, I'll create an appropriately indexed table in an attempt to answer the question as quickly as possible:

scott@TKYTE816> create table t   2  as   3  select object_name ename,   4         created hiredate,   5             mod(object_id,50) deptno   6  from all_objects 7  /      Table created.      scott@TKYTE816> alter table t modify deptno not null;      Table altered.      scott@TKYTE816> create index t_idx on t(deptno,hiredate,ename)   2  /      Index created.      scott@TKYTE816> analyze table t   2  compute statistics   3  for table   4  for all indexes   5  for all indexed columns   6  /      Table analyzed. 

I even added ENAME to the index in order to permit accessing only the index to answer the question and avoid the table access by ROWID. The query with the analytic function performed as follows:

scott@TKYTE816> select deptno, ename, hiredate,   2  lag(hiredate, 1, null) over (partition by deptno   3                               order by hiredate, ename) last_hire,   4      hiredate - lag(hiredate, 1, null)   5      over (partition by deptno   6            order by hiredate, ename) days_last,   7      lead(hiredate, 1, null)   8      over (partition by deptno   9            order by hiredate, ename) next_hire,  10      lead(hiredate, 1, null)  11      over (partition by deptno  12            order by hiredate, ename) - hiredate days_next  13  from t  14  order by deptno, hiredate  15  /      call     count       cpu    elapsed  disk      query    current        rows ------- ------  -------- ---------- ----- ---------- ----------  ---------- Parse        1      0.01       0.01     0          0          0           0 Execute      2      0.00       0.00     0          0          0           0 Fetch     1313      0.72       1.57   142        133          2       19675 ------- ------  -------- ---------- ----- ---------- ----------  ---------- total     1316      0.73       1.58   142        133          2       19675      Misses in library cache during parse: 0 Optimizer goal: FIRST_ROWS Parsing user id: 54      Rows     Row Source Operation -------  ---------------------------------------------------   19675  WINDOW BUFFER   19675   INDEX FULL SCAN (object id 27899)

As compared to the equivalent query without the analytic functions:

scott@TKYTE816> select deptno, ename, hiredate,   2  hiredate-(select max(hiredate)   3            from t e2   4            where e2.deptno = e1.deptno   5            and e2.hiredate < e1.hiredate ) last_hire,   6  hiredate-(select max(hiredate)   7            from t e2   8            where e2.deptno = e1.deptno   9            and e2.hiredate < e1.hiredate ) days_last,  10           (select min(hiredate)  11            from t e3  12            where e3.deptno = e1.deptno  13            and e3.hiredate > e1.hiredate) next_hire,  14           (select min(hiredate)  15            from t e3  16            where e3.deptno = e1.deptno  17            and e3.hiredate > e1.hiredate) - hiredate days_next  18    from t e1  19   order by deptno, hiredate  20  /      call     count       cpu    elapsed   disk      query    current        rows ------- ------  -------- ---------- ------ ---------- ----------  ---------- Parse        1      0.01       0.01      0          0          0           0 Execute      1      0.00       0.00      0          0          0           0 Fetch     1313      2.48       2.69      0     141851          0       19675 ------- ------  -------- ---------- ------ ---------- ----------  ---------- total     1315      2.49       2.70      0     141851          0       19675      Misses in library cache during parse: 0 Optimizer goal: FIRST_ROWS Parsing user id: 54      Rows     Row Source Operation -------  ---------------------------------------------------   19675  INDEX FULL SCAN (object id 27899) 

There is a significant difference between the performance of the two queries: 135 logical I/Os versus over 141,000, and 0.73 CPU seconds versus 2.49. The analytic function is once again the clear winner. You should consider the complexity of the queries as well. In my opinion, the LAG and LEAD version was not only easier to code, it is also obvious what the query is retrieving. The 'select of a select' is a nice trick, but it is harder to write in the first place, and on a quick read of the query it is not at all obvious what is being retrieved. It takes more 'thought' to reverse engineer the second query.

Now for some details on the LAG and LEAD functions. These functions take three arguments:

lag( Arg1, Arg2, Arg3 )

Arg1 is the expression (typically a column) whose value is to be retrieved; Arg2 is the offset - how many rows backwards (LAG) or forwards (LEAD) from the current row to look, defaulting to 1; Arg3 is the value to return when the offset steps outside the partition, defaulting to NULL.

So, in our example:

  4         hiredate - lag( hiredate, 1, null )   5                        over ( partition by deptno   6                           order by hiredate, ename ) days_last, 

We used LAG to find the record 'in front of' the current record by passing 1 as the second parameter (if there was no preceding record, NULL was returned by default). We partitioned the data by DEPTNO so that each department was processed independently of the others. We ordered the group by HIREDATE so that LAG( HIREDATE, 1, NULL ) would return the largest HIREDATE that was less than that of the current record.
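A quick sketch (not from the runs above) showing the other two arguments in use - an offset of 2 and a non-NULL default:

select deptno, ename, hiredate,
       lag( hiredate, 2 ) over ( partition by deptno
                                 order by hiredate, ename ) two_hires_back,
       lead( ename, 1, 'NO NEXT HIRE' )
             over ( partition by deptno
                    order by hiredate, ename ) next_hired
  from emp
 order by deptno, hiredate
/
-- TWO_HIRES_BACK would be NULL for the first two rows of each department (the
-- third argument defaults to NULL); NEXT_HIRED would show 'NO NEXT HIRE' for
-- each department's last hire instead of NULL.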

Caveats

With analytic functions, I have found very few caveats. They answer a whole new range of questions in a much more efficient way than was possible before their introduction. Once you master their syntax, the possibilities are limitless. It is very rare that I will say you can get something for nothing, but with analytic functions this seems to be the case. However, here are four things to be aware of.

PL/SQL and Analytic Functions

One area in which you might encounter errors is the use of analytic functions within PL/SQL. If I take a simple query and put it into a PL/SQL block, such as:

scott@TKYTE816> variable x refcursor      scott@TKYTE816> set autoprint on      scott@TKYTE816> begin   2  open :x for   3  select mgr, ename,   4         row_number() over (partition by mgr   5                            order by ename)   6         rn from emp;   7  end;   8  /             row_number() over (partition by mgr                               * ERROR at line 5: ORA-06550: line 5, column 31: PLS-00103: Encountered the symbol "(" when expecting one of the following: , from into bulk 

PL/SQL will reject it; the SQL parser used by PL/SQL does not understand this syntax yet. Whenever that happens (there are other constructs that will confuse PL/SQL as well), I use a dynamically opened ref cursor. An implementation of the above would look like this:

scott@TKYTE816> variable x refcursor      scott@TKYTE816> set autoprint on      scott@TKYTE816> begin   2  open :x for   3  'select mgr, ename,   4          row_number() over (partition by mgr   5                             order by ename)   6   rn from emp';   7  end;   8  /      PL/SQL procedure successfully completed.             MGR ENAME              RN ---------- ---------- ----------       7566 FORD                1       7566 SCOTT               2       7698 ALLEN               1       7698 JAMES               2       7698 MARTIN              3       7698 TURNER              4       7698 WARD                5       7782 MILLER              1       7788 ADAMS               1       7839 BLAKE               1       7839 CLARK               2       7839 JONES               3       7902 SMITH               1            KING                1      14 rows selected. 

What we have to do here is 'trick' the PL/SQL parser by not letting it see the construct it does not understand - the ROW_NUMBER() function in this case. A dynamically opened ref cursor is the way to accomplish this. It behaves just like a normal cursor once open: we fetch from it, close it, and so on. However, the PL/SQL engine does not attempt to parse the statement at either compile time or run-time, so the syntax succeeds.
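Once opened, such a cursor is processed like any other. A minimal sketch of doing the whole thing inside PL/SQL follows (the variable names here are illustrative; set serveroutput on to see the output):

declare
    type refcursor is ref cursor;
    l_cursor  refcursor;
    l_mgr     emp.mgr%type;
    l_ename   emp.ename%type;
    l_rn      number;
begin
    -- the analytic function is hidden from the PL/SQL parser inside the string
    open l_cursor for
      'select mgr, ename,
              row_number() over (partition by mgr
                                 order by ename) rn
         from emp';
    loop
        fetch l_cursor into l_mgr, l_ename, l_rn;
        exit when l_cursor%notfound;
        dbms_output.put_line( l_ename || ' ' || l_rn );
    end loop;
    close l_cursor;
end;
/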

Another solution is to create a permanent VIEW using the query with the analytic functions and accessing the view in the PL/SQL block. For example:

scott@TKYTE816> create or replace view   2  emp_view   3  as   4  select mgr, ename,   5         row_number() over (partition by mgr   6                            order by ename) rn   7    from emp   8  /      View created.      scott@TKYTE816> begin   2    open :x for   3    select mgr, ename, rn   4      from emp_view;   5  end;   6  /      PL/SQL procedure successfully completed.             MGR ENAME              RN ---------- ---------- ----------       7566 FORD                1       7566 SCOTT               2      

Analytic Functions in the Where Clause

It should be noted that analytic functions are the last set of operations performed in a query, except for the final ORDER BY clause. This means we cannot use an analytic function directly in a predicate - you cannot reference one in a WHERE or HAVING clause. Rather, we must use an inline view if we need to filter a resultset based on the outcome of an analytic function. Analytic functions can appear only in the select list or in the ORDER BY clause of a query.

We have seen many cases in this chapter where we've used this inline view capability - the Top-N Query section had quite a few. For example, to find the group of employees, by department, who made the top three salaries, we coded:

scott@TKYTE816> select *   2  from (select deptno, ename, sal,   3        dense_rank() over (partition by deptno   4                           order by sal desc) dr   5          from emp)   6  where dr <= 3   7  order by deptno, sal desc   8  / 

Since DENSE_RANK cannot be used in the WHERE clause directly, we must push it into an inline view and alias it (DR in the above) so that we can later use DR in a predicate to get just the rows we want. We'll find this to be a common operation with analytic functions.

NULLS and Sorting

NULLs can affect the outcome of the analytic functions, especially when you use a descending sort. By default, NULLs are greater than any other value. Consider this example:

scott@TKYTE816> select ename, comm from emp order by comm desc;      ENAME            COMM ---------- ---------- SMITH JONES CLARK BLAKE SCOTT KING JAMES MILLER FORD ADAMS MARTIN           1400 WARD              500 ALLEN             300 TURNER              0      14 rows selected. 

If we applied our Top-N logic to this, we might get:

scott@TKYTE816> select ename, comm, dr   2  from (select ename, comm,   3        dense_rank() over (order by comm desc)   4        dr from emp)   5  where dr <= 3   6  order by comm   8  /      ENAME            COMM         DR ---------- ---------- ---------- SMITH                          1 JONES                          1 CLARK                          1 BLAKE                          1 SCOTT                          1 KING                           1 JAMES                          1 MILLER                         1 FORD                           1 ADAMS                          1 MARTIN           1400          2 WARD              500          3      12 rows selected. 

While this may technically be 'correct', it probably is not what you really wanted. You would most likely either not want NULLs to be considered at all, or want NULLs to be interpreted as 'small' in this case. So, we can either remove NULLs from consideration, using WHERE COMM IS NOT NULL:

scott@TKYTE816> select ename, comm, dr   2  from (select ename, comm,   3        dense_rank() over (order by comm desc)   4        dr from emp   5        where comm is not null)   6  where dr <= 3   7  order by comm desc   8  /      ENAME            COMM         DR ---------- ---------- ---------- MARTIN           1400          1 WARD              500          2 ALLEN             300          3 

or we can use the NULLS LAST extension to the ORDER BY clause, which makes the extra predicate unnecessary:

scott@TKYTE816> select ename, comm, dr   2  from (select ename, comm,   3        dense_rank() over (order by comm desc nulls last)   4        dr from emp)   5  where dr <= 3   6  order by comm desc   7  /      ENAME            COMM         DR ---------- ---------- ---------- MARTIN           1400          1 WARD              500          2 ALLEN             300          3 

It should be noted that NULLS LAST works on a normal ORDER BY as well; its use is not limited to the analytic functions.
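For example, a plain query such as this sketch would simply push the NULL commissions to the bottom of the list:

select ename, comm
  from emp
 order by comm desc nulls last
/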

Performance

So far, everything we've seen about analytic functions makes them look like the silver bullet of performance tuning - use them and everything will go fast. That is not an accurate representation of any technology or feature I have ever come across. Anything can be abused to the detriment of performance, and analytic functions are no different in this regard.

The one thing to be wary of with these functions is the extreme ease with which they allow you to sort and sift a set of data in ways SQL never could before. Each analytic function call in a select list may have a different partition, a different window, and a different sort order. If these are not compatible with each other (not subsets of one another), you could be doing a massive amount of sorting and sifting. For example, early on we ran this query:

ops$tkyte@DEV816> select ename, deptno,   2         sum(sal) over ( order by ename, deptno) sum_ename_deptno,   3         sum(sal) over ( order by deptno, ename ) sum_deptno_ename   4    from emp   5   order by ename, deptno   6  / 

There are three ORDER BY clauses in this query and, as such, potentially three sorts involved. Two of the sorts might be done together, since they sort on the same columns, but the other sort will have to be done independently. This is not a bad thing, nor a cause for alarm, nor a reason to outlaw the use of this feature; it is simply something to take into consideration. You can just as easily write a query that consumes all available machine resources using this feature as you can write queries that elegantly and efficiently answer your questions.
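By contrast, when the analytic functions share a partition and sort order that also matches the final ORDER BY, one sort can, in principle, satisfy all of them. A sketch:

select ename, deptno,
       sum(sal) over ( order by deptno, ename ) running_sum,
       avg(sal) over ( order by deptno, ename ) running_avg
  from emp
 order by deptno, ename
/
-- Both window functions and the final ORDER BY use the same ordering, so the
-- data need only be sorted once.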

Summary

In this chapter, we thoroughly explored the syntax and implementation of analytic functions. We have seen how many common operations, such as running totals, pivoting resultsets, and accessing rows 'nearby' the current row, are now easily achieved. Analytic functions open up a whole new potential for queries.


