Recipe10.2.Finding Differences Between Rows in the Same Group or Partition

Recipe 10.2. Finding Differences Between Rows in the Same Group or Partition

Problem

You want to return the DEPTNO, ENAME, and SAL of each employee along with the difference in SAL between employees in the same department (i.e., having the same value for DEPTNO). The difference should be between each current employee and the employee hired immediately afterwards (you want to see if there is a correlation between seniority and salary on a "per department" basis). For each employee hired last in his department, return "N/A" for the difference. The result set should look like this:

 DEPTNO ENAME             SAL HIREDATE    DIFF ------ ---------- ---------- ----------- ----------     10 CLARK            2450 09-JUN-1981      -2550     10 KING             5000 17-NOV-1981       3700     10 MILLER           1300 23-JAN-1982        N/A     20 SMITH             800 17-DEC-1980      -2175     20 JONES            2975 02-APR-1981        -25     20 FORD             3000 03-DEC-1981          0     20 SCOTT            3000 09-DEC-1982       1900     20 ADAMS            1100 12-JAN-1983        N/A     30 ALLEN            1600 20-FEB-1981        350     30 WARD             1250 22-FEB-1981      -1600     30 BLAKE            2850 01-MAY-1981       1350     30 TURNER           1500 08-SEP-1981        250     30 MARTIN           1250 28-SEP-1981        300     30 JAMES             950 03-DEC-1981        N/A

Solution

The is another example of where the Oracle window functions LEAD OVER and LAG OVER come in handy. You can easily access next and prior rows without additional joins. For other RDBMSs, you can use scalar subqueries, though not as easily. This particular problem is not at all elegant when having to use scalar subqueries or self joins to solve it.

DB2, MySQL, PostgreSQL, and SQL Server

Use a scalar subquery to retrieve the HIREDATE of the employee hired immediately after each employee. Then use another scalar subquery to find the salary of said employee:

  1 select deptno, ename, hiredate, sal,  2        coalesce(cast(sal-next_sal as char(10)), 'N/A') as diff  3   from (  4 select e.deptno,  5        e.ename,  6        e.hiredate,  7        e.sal,  8        (select min(sal) from emp d  9          where d.deptno=e.deptno 10            and d.hiredate = 11                 (select min(hiredate) from emp d 12                   where e.deptno=d.deptno 13                     and d.hiredate > e.hiredate)) as next_sal 14   from emp e 15        ) x

Oracle

Use the window function LEAD OVER to access the "next" employee's salary relative to the current row:

 1 select deptno, ename, sal, hiredate, 2        lpad(nvl(to_char(sal-next_sal), 'N/A'), 10) diff 3   from ( 4 select deptno, ename, sal, hiredate, 5        lead(sal)over(partition by deptno 6                          order by hiredate) next_sal 7   from emp 8        )

Discussion

DB2, MySQL, PostgreSQL, and SQL Server

The first step is to use a scalar subquery to find the HIREDATE of the employee hired immediately after each employee in the same department. The solution uses MIN(HIREDATE) in the scalar subquery to ensure that only one value is returned even in the event of multiple people being hired on the same date:

  select e.deptno,        e.ename,        e.hiredate,        e.sal,        (select min(hiredate) from emp d           where e.deptno=d.deptno             and d.hiredate > e.hiredate) as next_hire    from emp e  order by 1 DEPTNO ENAME      HIREDATE           SAL NEXT_HIRE ------ ---------- ----------- ---------- -----------     10 CLARK      09-JUN-1981       2450 17-NOV-1981     10 KING       17-NOV-1981       5000 23-JAN-1982     10 MILLER     23-JAN-1982       1300     20 SMITH      17-DEC-1980        800 02-APR-1981     20 ADAMS      12-JAN-1983       1100     20 FORD       03-DEC-1981       3000 09-DEC-1982     20 SCOTT      09-DEC-1982       3000 12-JAN-1983     20 JONES      02-APR-1981       2975 03-DEC-1981     30 ALLEN      20-FEB-1981       1600 22-FEB-1981     30 BLAKE      01-MAY-1981       2850 08-SEP-1981     30 MARTIN     28-SEP-1981       1250 03-DEC-1981     30 JAMES      03-DEC-1981        950     30 TURNER     08-SEP-1981       1500 28-SEP-1981     30 WARD       22-FEB-1981       1250 01-MAY-1981

The next step is to use another scalar subquery to find the salary of the employee who was hired on the NEXT_HIRE date. Again, the solution uses MIN to ensure that just one value is always returned:

  select e.deptno,        e.ename,        e.hiredate,        e.sal,        (select min(sal) from emp d          where d.deptno=e.deptno            and d.hiredate =                 (select min(hiredate) from emp d                   where e.deptno=d.deptno                     and d.hiredate > e.hiredate)) as next_sal   from emp e order by 1 DEPTNO ENAME      HIREDATE           SAL   NEXT_SAL ------ ---------- ----------- ---------- ----------     10 CLARK      09-JUN-1981       2450       5000     10 KING       17-NOV-1981       5000       1300     10 MILLER     23-JAN-1982       1300     20 SMITH      17-DEC-1980        800       2975     20 ADAMS      12-JAN-1983       1100     20 FORD       03-DEC-1981       3000       3000     20 SCOTT      09-DEC-1982       3000       1100     20 JONES      02-APR-1981       2975       3000     30 ALLEN      20-FEB-1981       1600       1250     30 BLAKE      01-MAY-1981       2850       1500     30 MARTIN     28-SEP-1981       1250        950     30 JAMES      03-DEC-1981        950     30 TURNER     08-SEP-1981       1500       1250     30 WARD       22-FEB-1981       1250       2850

The final step is to find the difference between SAL and NEXT_SAL, and to use the function COALESCE to return "N/A" when applicable. Since the result of the subtraction is a number and can potentially be NULL, you must cast to a string for COALESCE to work:

  select deptno, ename, hiredate, sal,        coalesce(cast(sal-next_sal as char(10)), 'N/A') as diff   from ( select e.deptno,        e.ename,        e.hiredate,        e.sal,        (select min(sal) from emp d          where d.deptno=e.deptno            and d.hiredate =                 (select min(hiredate) from emp d                   where e.deptno=d.deptno                     and d.hiredate > e.hiredate)) as next_sal   from emp e        ) x order by 1 DEPTNO ENAME      HIREDATE           SAL DIFF ------ ---------- ----------- ---------- ---------     10 CLARK      09-JUN-1981       2450 -2550     10 KING       17-NOV-1981       5000 3700     10 MILLER     23-JAN-1982       1300 N/A     20 SMITH      17-DEC-1980        800 -2175     20 ADAMS      12-JAN-1983       1100 N/A     20 FORD       03-DEC-1981       3000 0     20 SCOTT      09-DEC-1982       3000 1900     20 JONES      02-APR-1981       2975 -25     30 ALLEN      20-FEB-1981       1600 350     30 BLAKE      01-MAY-1981       2850 1350     30 MARTIN     28-SEP-1981       1250 300     30 JAMES      03-DEC-1981        950 N/A     30 TURNER     08-SEP-1981       1500 250     30 WARD       22-FEB-1981       1250 -1600

The use of MIN(SAL) in this solution is an example of how, in some ways, you can unintentionally inject business logic into a query while making what appears to be a solely technical decision. If multiple salaries are available for a given date, should you take the least? the highest? the average? In my example, I choose to take the least. In real life, I might well punt that decision back to the business client who requested the report to begin with.

Oracle

The first step is to use the LEAD OVER window function to find the "next" salary for each employee within her department. The employees hired last in each department will have a NULL value for NEXT_SAL:

  select deptno,ename,sal,hiredate,        lead(sal)over(partition by deptno order by hiredate) next_sal   from emp DEPTNO ENAME             SAL HIREDATE      NEXT_SAL ------ ---------- ---------- ----------- ----------     10 CLARK            2450 09-JUN-1981       5000     10 KING             5000 17-NOV-1981       1300     10 MILLER           1300 23-JAN-1982     20 SMITH             800 17-DEC-1980       2975     20 JONES            2975 02-APR-1981       3000     20 FORD             3000 03-DEC-1981       3000     20 SCOTT            3000 09-DEC-1982       1100     20 ADAMS            1100 12-JAN-1983     30 ALLEN            1600 20-FEB-1981       1250     30 WARD             1250 22-FEB-1981       2850     30 BLAKE            2850 01-MAY-1981       1500     30 TURNER           1500 08-SEP-1981       1250     30 MARTIN           1250 28-SEP-1981        950     30 JAMES             950 03-DEC-1981

The next step is to take the difference between each employee's salary and the salary of the employee hired immediately after her in the same department:

  select deptno,ename,sal,hiredate, sal-next_sal diff   from ( select deptno,ename,sal,hiredate,        lead(sal)over(partition by deptno order by hiredate) next_sal   from emp        ) DEPTNO ENAME             SAL HIREDATE          DIFF ------ ---------- ---------- ----------- ----------     10 CLARK            2450 09-JUN-1981      -2550     10 KING             5000 17-NOV-1981       3700     10 MILLER           1300 23-JAN-1982     20 SMITH             800 17-DEC-1980      -2175     20 JONES            2975 02-APR-1981        -25     20 FORD             3000 03-DEC-1981          0     20 SCOTT            3000 09-DEC-1982       1900     20 ADAMS            1100 12-JAN-1983     30 ALLEN            1600 20-FEB-1981        350     30 WARD             1250 22-FEB-1981      -1600     30 BLAKE            2850 01-MAY-1981       1350     30 TURNER           1500 08-SEP-1981        250     30 MARTIN           1250 28-SEP-1981        300     30 JAMES             950 03-DEC-1981

The next step is to use the function NVL to return "N/A" when DIFF is NULL. To be able to return "N/A" you must cast the value of DIFF to a string, otherwise NVL will fail:

  select deptno,ename,sal,hiredate,        nvl(to_char(sal-next_sal),'N/A') diff   from ( select deptno,ename,sal,hiredate,        lead(sal)over(partition by deptno order by hiredate) next_sal   from emp        ) DEPTNO ENAME             SAL HIREDATE    DIFF ------ ---------- ---------- ----------- ---------------     10 CLARK            2450 09-JUN-1981 -2550     10 KING             5000 17-NOV-1981 3700     10 MILLER           1300 23-JAN-1982 N/A     20 SMITH             800 17-DEC-1980 -2175     20 JONES            2975 02-APR-1981 -25     20 FORD             3000 03-DEC-1981 0     20 SCOTT            3000 09-DEC-1982 1900     20 ADAMS            1100 12-JAN-1983 N/A     30 ALLEN            1600 20-FEB-1981 350     30 WARD             1250 22-FEB-1981 -1600     30 BLAKE            2850 01-MAY-1981 1350     30 TURNER           1500 08-SEP-1981 250     30 MARTIN           1250 28-SEP-1981 300     30 JAMES             950 03-DEC-1981 N/A

The last step is to use the function LPAD to format the values for DIFF. This is because, by default, numbers are right justified while strings are left justified. Using LPAD, you can right justify all the results in the column:

  select deptno,ename,sal,hiredate,        lpad(nvl(to_char(sal-next_sal),'N/A'),10) diff   from ( select deptno,ename,sal,hiredate,        lead(sal)over(partition by deptno order by hiredate) next_sal   from emp        ) DEPTNO ENAME             SAL HIREDATE    DIFF ------ ---------- ---------- ----------- ----------     10 CLARK            2450 09-JUN-1981      -2550     10 KING             5000 17-NOV-1981       3700     10 MILLER           1300 23-JAN-1982        N/A     20 SMITH             800 17-DEC-1980      -2175     20 JONES            2975 02-APR-1981        -25     20 FORD             3000 03-DEC-1981          0     20 SCOTT            3000 09-DEC-1982       1900     20 ADAMS            1100 12-JAN-1983        N/A     30 ALLEN            1600 20-FEB-1981        350     30 WARD             1250 22-FEB-1981      -1600     30 BLAKE            2850 01-MAY-1981       1350     30 TURNER           1500 08-SEP-1981        250     30 MARTIN           1250 28-SEP-1981        300     30 JAMES             950 03-DEC-1981        N/A

While the majority of the solutions provided in this book do not deal with "what if" scenarios (for the sake of readability and the author's sanity), the scenario involving duplicates when using Oracle's LEAD OVER function in this manner must be discussed. In the simple sample data in table EMP, no employees have duplicate HIREDATEs, yet this is a very likely situation. Normally, I would not discuss a "what if" situation such as duplicates (since there aren't any in table EMP), but the workaround involving LEAD (particularly to those of you with non-Oracle backgrounds) may not be immediately obvious. Consider the following query, which returns the difference in SAL between the employees in DEPTNO 10 (the difference is performed in the order in which they were hired):

  select deptno,ename,sal,hiredate,        lpad(nvl(to_char(sal-next_sal),'N/A'),10) diff   from ( select deptno,ename,sal,hiredate,        lead(sal)over(partition by deptno                          order by hiredate) next_sal   from emp  where deptno=10 and empno > 10        ) DEPTNO ENAME    SAL HIREDATE    DIFF ------ ------ ----- ----------- ----------     10 CLARK   2450 09-JUN-1981      -2550     10 KING    5000 17-NOV-1981       3700     10 MILLER  1300 23-JAN-1982        N/A

This solution is correct considering the data in table EMP but, if there were duplicate rows, the solution would fail. Consider the example below, showing four more employees hired on the same day as KING:

  insert into emp (empno,ename,deptno,sal,hiredate) values (1,'ant',10,1000,to_date('17-NOV-1981')) insert into emp (empno,ename,deptno,sal,hiredate) values (2,'joe',10,1500,to_date('17-NOV-1981')) insert into emp (empno,ename,deptno,sal,hiredate) values (3,'jim',10,1600,to_date('17-NOV-1981')) insert into emp (empno,ename,deptno,sal,hiredate) values (4,'jon',10,1700,to_date('17-NOV-1981')) select deptno,ename,sal,hiredate,        lpad(nvl(to_char(sal-next_sal),'N/A'),10) diff   from ( select deptno,ename,sal,hiredate,        lead(sal)over(partition by deptno                          order by hiredate) next_sal   from emp  where deptno=10        ) DEPTNO ENAME    SAL HIREDATE    DIFF ------ ------ ----- ----------- ----------     10 CLARK   2450 09-JUN-1981       1450     10 ant     1000 17-NOV-1981       -500     10 joe     1500 17-NOV-1981      -3500     10 KING    5000 17-NOV-1981       3400     10 jim     1600 17-NOV-1981       -100     10 jon     1700 17-NOV-1981        400     10 MILLER  1300 23-JAN-1982        N/A

You'll notice that with the exception of employee JON, all employees hired on the same date (November 17) evaluate their salary against another employee hired on the same date! This is incorrect. All employees hired on November 17 should have the difference of salary computed against MILLER's salary, not another employee hired on November 17. Take, for example, employee ANT. The value for DIFF for ANT is500 because ANT's SAL is compared with JOE's SAL and is 500 less than JOE's SAL, hence the value of500. The correct value for DIFF for employee ANT should be300 because ANT makes 300 less than MILLER, who is the next employee hired by HIREDATE. The reason the solution seems to not work is due to the default behavior of Oracle's LEAD OVER function. By default, LEAD OVER only looks ahead one row. So, for employee ANT, the next SAL based on HIREDATE is JOE's SAL, because LEAD OVER simply looks one row ahead and doesn't skip duplicates. Fortunately, Oracle planned for such a situation and allows you to pass an additional parameter to LEAD OVER to determine how far ahead it should look. In the example above, the solution is simply a matter of counting: find the distance from each employee hired on November 17 to January 23 (MILLER's HIREDATE). The solution below shows how to accomplish this:

  select deptno,ename,sal,hiredate,        lpad(nvl(to_char(sal-next_sal),'N/A'),10) diff   from ( select deptno,ename,sal,hiredate,        lead(sal,cnt-rn+1)over(partition by deptno                          order by hiredate) next_sal   from ( select deptno,ename,sal,hiredate,        count(*)over(partition by deptno,hiredate) cnt,        row_number( )over(partition by deptno,hiredate order by sal) rn   from emp  where deptno=10        )        ) DEPTNO ENAME     SAL HIREDATE    DIFF ------ ------  ----- ----------- ----------     10 CLARK    2450 09-JUN-1981       1450     10 ant      1000 17-NOV-1981       -300     10 joe      1500 17-NOV-1981        200     10 jim      1600 17-NOV-1981        300     10 jon      1700 17-NOV-1981        400     10 KING     5000 17-NOV-1981       3700     10 MILLER   1300 23-JAN-1982        N/A

Now the solution is correct. As you can see, all the employees hired on November 17 now have their salaries compared with MILLER's salary. Inspecting the results, employee ANT now has a value of300 for DIFF, which is what we were hoping for. If it isn't immediately obvious, the expression passed to LEAD OVER; CNT-RN+1 is simply the distance from each employee hired on November 17 to MILLER. Consider the inline view below, which shows the values for CNT and RN:

  select deptno,ename,sal,hiredate,        count(*)over(partition by deptno,hiredate) cnt,        row_number( )over(partition by deptno,hiredate order by sal) rn   from emp  where deptno=10 DEPTNO ENAME    SAL HIREDATE           CNT         RN ------ ------ ----- ----------- ---------- ----------     10 CLARK   2450 09-JUN-1981          1          1     10 ant     1000 17-NOV-1981          5          1     10 joe     1500 17-NOV-1981          5          2     10 jim     1600 17-NOV-1981          5          3     10 jon     1700 17-NOV-1981          5          4     10 KING    5000 17-NOV-1981          5          5     10 MILLER  1300 23-JAN-1982          1          1

The value for CNT represents, for each employee with a duplicate HIREDATE, how many duplicates there are in total for their HIREDATE. The value for RN represents a ranking for the employees in DEPTNO 10. The rank is partitioned by DEPTNO and HIREDATE so only employees with a HIREDATE that another employee has will have a value greater than one. The ranking is sorted by SAL (this is arbitrary; SAL is convenient, but we could have just as easily chosen EMPNO). Now that you know how many total duplicates there are and you have a ranking of each duplicate, the distance to MILLER is simply the total number of duplicates minus the current rank plus one (CNT-RN+1). The results of the distance calculation and its effect on LEAD OVER are shown below:

  select deptno,ename,sal,hiredate,        lead(sal)over(partition by deptno                          order by hiredate) incorrect,        cnt-rn+1 distance,        lead(sal,cnt-rn+1)over(partition by deptno                          order by hiredate) correct   from ( select deptno,ename,sal,hiredate,        count(*)over(partition by deptno,hiredate) cnt,        row_number( )over(partition by deptno,hiredate                             order by sal) rn   from emp  where deptno=10        ) DEPTNO ENAME    SAL HIREDATE    INCORRECT    DISTANCE    CORRECT ------ ------ ----- ----------- ---------- ---------- ----------     10 CLARK   2450 09-JUN-1981       1000          1       1000     10 ant     1000 17-NOV-1981       1500          5       1300     10 joe     1500 17-NOV-1981       1600          4       1300     10 jim     1600 17-NOV-1981       1700          3       1300     10 jon     1700 17-NOV-1981       5000          2       1300     10 KING    5000 17-NOV-1981       1300          1       1300     10 MILLER  1300 23-JAN-1982                     1

Now you can clearly see the effect that you have when you pass the correct distance to LEAD OVER. The rows for INCORRECT represent the values returned by LEAD OVER using a default distance of one. The rows for CORRECT represent the values returned by LEAD OVER using the proper distance for each employee with a duplicate HIREDATE to MILLER. At this point, all that is left is to find the difference between CORRECT and SAL for each row, which has already been shown.