Recipe 10.2. Finding Differences Between Rows in the Same Group or PartitionProblemYou want to return the DEPTNO, ENAME, and SAL of each employee along with the difference in SAL between employees in the same department (i.e., having the same value for DEPTNO). The difference should be between each current employee and the employee hired immediately afterwards (you want to see if there is a correlation between seniority and salary on a "per department" basis). For each employee hired last in his department, return "N/A" for the difference. The result set should look like this: DEPTNO ENAME SAL HIREDATE DIFF ------ ---------- ---------- ----------- ---------- 10 CLARK 2450 09-JUN-1981 -2550 10 KING 5000 17-NOV-1981 3700 10 MILLER 1300 23-JAN-1982 N/A 20 SMITH 800 17-DEC-1980 -2175 20 JONES 2975 02-APR-1981 -25 20 FORD 3000 03-DEC-1981 0 20 SCOTT 3000 09-DEC-1982 1900 20 ADAMS 1100 12-JAN-1983 N/A 30 ALLEN 1600 20-FEB-1981 350 30 WARD 1250 22-FEB-1981 -1600 30 BLAKE 2850 01-MAY-1981 1350 30 TURNER 1500 08-SEP-1981 250 30 MARTIN 1250 28-SEP-1981 300 30 JAMES 950 03-DEC-1981 N/A SolutionThe is another example of where the Oracle window functions LEAD OVER and LAG OVER come in handy. You can easily access next and prior rows without additional joins. For other RDBMSs, you can use scalar subqueries, though not as easily. This particular problem is not at all elegant when having to use scalar subqueries or self joins to solve it. DB2, MySQL, PostgreSQL, and SQL ServerUse a scalar subquery to retrieve the HIREDATE of the employee hired immediately after each employee. Then use another scalar subquery to find the salary of said employee: 1 select deptno, ename, hiredate, sal, 2 coalesce(cast(sal-next_sal as char(10)), 'N/A') as diff 3 from ( 4 select e.deptno, 5 e.ename, 6 e.hiredate, 7 e.sal, 8 (select min(sal) from emp d 9 where d.deptno=e.deptno 10 and d.hiredate = 11 (select min(hiredate) from emp d 12 where e.deptno=d.deptno 13 and d.hiredate > e.hiredate)) as next_sal 14 from emp e 15 ) x OracleUse the window function LEAD OVER to access the "next" employee's salary relative to the current row: 1 select deptno, ename, sal, hiredate, 2 lpad(nvl(to_char(sal-next_sal), 'N/A'), 10) diff 3 from ( 4 select deptno, ename, sal, hiredate, 5 lead(sal)over(partition by deptno 6 order by hiredate) next_sal 7 from emp 8 ) DiscussionDB2, MySQL, PostgreSQL, and SQL ServerThe first step is to use a scalar subquery to find the HIREDATE of the employee hired immediately after each employee in the same department. The solution uses MIN(HIREDATE) in the scalar subquery to ensure that only one value is returned even in the event of multiple people being hired on the same date: select e.deptno, e.ename, e.hiredate, e.sal, (select min(hiredate) from emp d where e.deptno=d.deptno and d.hiredate > e.hiredate) as next_hire from emp e order by 1 DEPTNO ENAME HIREDATE SAL NEXT_HIRE ------ ---------- ----------- ---------- ----------- 10 CLARK 09-JUN-1981 2450 17-NOV-1981 10 KING 17-NOV-1981 5000 23-JAN-1982 10 MILLER 23-JAN-1982 1300 20 SMITH 17-DEC-1980 800 02-APR-1981 20 ADAMS 12-JAN-1983 1100 20 FORD 03-DEC-1981 3000 09-DEC-1982 20 SCOTT 09-DEC-1982 3000 12-JAN-1983 20 JONES 02-APR-1981 2975 03-DEC-1981 30 ALLEN 20-FEB-1981 1600 22-FEB-1981 30 BLAKE 01-MAY-1981 2850 08-SEP-1981 30 MARTIN 28-SEP-1981 1250 03-DEC-1981 30 JAMES 03-DEC-1981 950 30 TURNER 08-SEP-1981 1500 28-SEP-1981 30 WARD 22-FEB-1981 1250 01-MAY-1981 The next step is to use another scalar subquery to find the salary of the employee who was hired on the NEXT_HIRE date. Again, the solution uses MIN to ensure that just one value is always returned: select e.deptno, e.ename, e.hiredate, e.sal, (select min(sal) from emp d where d.deptno=e.deptno and d.hiredate = (select min(hiredate) from emp d where e.deptno=d.deptno and d.hiredate > e.hiredate)) as next_sal from emp e order by 1 DEPTNO ENAME HIREDATE SAL NEXT_SAL ------ ---------- ----------- ---------- ---------- 10 CLARK 09-JUN-1981 2450 5000 10 KING 17-NOV-1981 5000 1300 10 MILLER 23-JAN-1982 1300 20 SMITH 17-DEC-1980 800 2975 20 ADAMS 12-JAN-1983 1100 20 FORD 03-DEC-1981 3000 3000 20 SCOTT 09-DEC-1982 3000 1100 20 JONES 02-APR-1981 2975 3000 30 ALLEN 20-FEB-1981 1600 1250 30 BLAKE 01-MAY-1981 2850 1500 30 MARTIN 28-SEP-1981 1250 950 30 JAMES 03-DEC-1981 950 30 TURNER 08-SEP-1981 1500 1250 30 WARD 22-FEB-1981 1250 2850 The final step is to find the difference between SAL and NEXT_SAL, and to use the function COALESCE to return "N/A" when applicable. Since the result of the subtraction is a number and can potentially be NULL, you must cast to a string for COALESCE to work: select deptno, ename, hiredate, sal, coalesce(cast(sal-next_sal as char(10)), 'N/A') as diff from ( select e.deptno, e.ename, e.hiredate, e.sal, (select min(sal) from emp d where d.deptno=e.deptno and d.hiredate = (select min(hiredate) from emp d where e.deptno=d.deptno and d.hiredate > e.hiredate)) as next_sal from emp e ) x order by 1 DEPTNO ENAME HIREDATE SAL DIFF ------ ---------- ----------- ---------- --------- 10 CLARK 09-JUN-1981 2450 -2550 10 KING 17-NOV-1981 5000 3700 10 MILLER 23-JAN-1982 1300 N/A 20 SMITH 17-DEC-1980 800 -2175 20 ADAMS 12-JAN-1983 1100 N/A 20 FORD 03-DEC-1981 3000 0 20 SCOTT 09-DEC-1982 3000 1900 20 JONES 02-APR-1981 2975 -25 30 ALLEN 20-FEB-1981 1600 350 30 BLAKE 01-MAY-1981 2850 1350 30 MARTIN 28-SEP-1981 1250 300 30 JAMES 03-DEC-1981 950 N/A 30 TURNER 08-SEP-1981 1500 250 30 WARD 22-FEB-1981 1250 -1600
OracleThe first step is to use the LEAD OVER window function to find the "next" salary for each employee within her department. The employees hired last in each department will have a NULL value for NEXT_SAL: select deptno,ename,sal,hiredate, lead(sal)over(partition by deptno order by hiredate) next_sal from emp DEPTNO ENAME SAL HIREDATE NEXT_SAL ------ ---------- ---------- ----------- ---------- 10 CLARK 2450 09-JUN-1981 5000 10 KING 5000 17-NOV-1981 1300 10 MILLER 1300 23-JAN-1982 20 SMITH 800 17-DEC-1980 2975 20 JONES 2975 02-APR-1981 3000 20 FORD 3000 03-DEC-1981 3000 20 SCOTT 3000 09-DEC-1982 1100 20 ADAMS 1100 12-JAN-1983 30 ALLEN 1600 20-FEB-1981 1250 30 WARD 1250 22-FEB-1981 2850 30 BLAKE 2850 01-MAY-1981 1500 30 TURNER 1500 08-SEP-1981 1250 30 MARTIN 1250 28-SEP-1981 950 30 JAMES 950 03-DEC-1981 The next step is to take the difference between each employee's salary and the salary of the employee hired immediately after her in the same department: select deptno,ename,sal,hiredate, sal-next_sal diff from ( select deptno,ename,sal,hiredate, lead(sal)over(partition by deptno order by hiredate) next_sal from emp ) DEPTNO ENAME SAL HIREDATE DIFF ------ ---------- ---------- ----------- ---------- 10 CLARK 2450 09-JUN-1981 -2550 10 KING 5000 17-NOV-1981 3700 10 MILLER 1300 23-JAN-1982 20 SMITH 800 17-DEC-1980 -2175 20 JONES 2975 02-APR-1981 -25 20 FORD 3000 03-DEC-1981 0 20 SCOTT 3000 09-DEC-1982 1900 20 ADAMS 1100 12-JAN-1983 30 ALLEN 1600 20-FEB-1981 350 30 WARD 1250 22-FEB-1981 -1600 30 BLAKE 2850 01-MAY-1981 1350 30 TURNER 1500 08-SEP-1981 250 30 MARTIN 1250 28-SEP-1981 300 30 JAMES 950 03-DEC-1981 The next step is to use the function NVL to return "N/A" when DIFF is NULL. To be able to return "N/A" you must cast the value of DIFF to a string, otherwise NVL will fail: select deptno,ename,sal,hiredate, nvl(to_char(sal-next_sal),'N/A') diff from ( select deptno,ename,sal,hiredate, lead(sal)over(partition by deptno order by hiredate) next_sal from emp ) DEPTNO ENAME SAL HIREDATE DIFF ------ ---------- ---------- ----------- --------------- 10 CLARK 2450 09-JUN-1981 -2550 10 KING 5000 17-NOV-1981 3700 10 MILLER 1300 23-JAN-1982 N/A 20 SMITH 800 17-DEC-1980 -2175 20 JONES 2975 02-APR-1981 -25 20 FORD 3000 03-DEC-1981 0 20 SCOTT 3000 09-DEC-1982 1900 20 ADAMS 1100 12-JAN-1983 N/A 30 ALLEN 1600 20-FEB-1981 350 30 WARD 1250 22-FEB-1981 -1600 30 BLAKE 2850 01-MAY-1981 1350 30 TURNER 1500 08-SEP-1981 250 30 MARTIN 1250 28-SEP-1981 300 30 JAMES 950 03-DEC-1981 N/A The last step is to use the function LPAD to format the values for DIFF. This is because, by default, numbers are right justified while strings are left justified. Using LPAD, you can right justify all the results in the column: select deptno,ename,sal,hiredate, lpad(nvl(to_char(sal-next_sal),'N/A'),10) diff from ( select deptno,ename,sal,hiredate, lead(sal)over(partition by deptno order by hiredate) next_sal from emp ) DEPTNO ENAME SAL HIREDATE DIFF ------ ---------- ---------- ----------- ---------- 10 CLARK 2450 09-JUN-1981 -2550 10 KING 5000 17-NOV-1981 3700 10 MILLER 1300 23-JAN-1982 N/A 20 SMITH 800 17-DEC-1980 -2175 20 JONES 2975 02-APR-1981 -25 20 FORD 3000 03-DEC-1981 0 20 SCOTT 3000 09-DEC-1982 1900 20 ADAMS 1100 12-JAN-1983 N/A 30 ALLEN 1600 20-FEB-1981 350 30 WARD 1250 22-FEB-1981 -1600 30 BLAKE 2850 01-MAY-1981 1350 30 TURNER 1500 08-SEP-1981 250 30 MARTIN 1250 28-SEP-1981 300 30 JAMES 950 03-DEC-1981 N/A While the majority of the solutions provided in this book do not deal with "what if" scenarios (for the sake of readability and the author's sanity), the scenario involving duplicates when using Oracle's LEAD OVER function in this manner must be discussed. In the simple sample data in table EMP, no employees have duplicate HIREDATEs, yet this is a very likely situation. Normally, I would not discuss a "what if" situation such as duplicates (since there aren't any in table EMP), but the workaround involving LEAD (particularly to those of you with non-Oracle backgrounds) may not be immediately obvious. Consider the following query, which returns the difference in SAL between the employees in DEPTNO 10 (the difference is performed in the order in which they were hired): select deptno,ename,sal,hiredate, lpad(nvl(to_char(sal-next_sal),'N/A'),10) diff from ( select deptno,ename,sal,hiredate, lead(sal)over(partition by deptno order by hiredate) next_sal from emp where deptno=10 and empno > 10 ) DEPTNO ENAME SAL HIREDATE DIFF ------ ------ ----- ----------- ---------- 10 CLARK 2450 09-JUN-1981 -2550 10 KING 5000 17-NOV-1981 3700 10 MILLER 1300 23-JAN-1982 N/A This solution is correct considering the data in table EMP but, if there were duplicate rows, the solution would fail. Consider the example below, showing four more employees hired on the same day as KING: insert into emp (empno,ename,deptno,sal,hiredate) values (1,'ant',10,1000,to_date('17-NOV-1981')) insert into emp (empno,ename,deptno,sal,hiredate) values (2,'joe',10,1500,to_date('17-NOV-1981')) insert into emp (empno,ename,deptno,sal,hiredate) values (3,'jim',10,1600,to_date('17-NOV-1981')) insert into emp (empno,ename,deptno,sal,hiredate) values (4,'jon',10,1700,to_date('17-NOV-1981')) select deptno,ename,sal,hiredate, lpad(nvl(to_char(sal-next_sal),'N/A'),10) diff from ( select deptno,ename,sal,hiredate, lead(sal)over(partition by deptno order by hiredate) next_sal from emp where deptno=10 ) DEPTNO ENAME SAL HIREDATE DIFF ------ ------ ----- ----------- ---------- 10 CLARK 2450 09-JUN-1981 1450 10 ant 1000 17-NOV-1981 -500 10 joe 1500 17-NOV-1981 -3500 10 KING 5000 17-NOV-1981 3400 10 jim 1600 17-NOV-1981 -100 10 jon 1700 17-NOV-1981 400 10 MILLER 1300 23-JAN-1982 N/A You'll notice that with the exception of employee JON, all employees hired on the same date (November 17) evaluate their salary against another employee hired on the same date! This is incorrect. All employees hired on November 17 should have the difference of salary computed against MILLER's salary, not another employee hired on November 17. Take, for example, employee ANT. The value for DIFF for ANT is500 because ANT's SAL is compared with JOE's SAL and is 500 less than JOE's SAL, hence the value of500. The correct value for DIFF for employee ANT should be300 because ANT makes 300 less than MILLER, who is the next employee hired by HIREDATE. The reason the solution seems to not work is due to the default behavior of Oracle's LEAD OVER function. By default, LEAD OVER only looks ahead one row. So, for employee ANT, the next SAL based on HIREDATE is JOE's SAL, because LEAD OVER simply looks one row ahead and doesn't skip duplicates. Fortunately, Oracle planned for such a situation and allows you to pass an additional parameter to LEAD OVER to determine how far ahead it should look. In the example above, the solution is simply a matter of counting: find the distance from each employee hired on November 17 to January 23 (MILLER's HIREDATE). The solution below shows how to accomplish this: select deptno,ename,sal,hiredate, lpad(nvl(to_char(sal-next_sal),'N/A'),10) diff from ( select deptno,ename,sal,hiredate, lead(sal,cnt-rn+1)over(partition by deptno order by hiredate) next_sal from ( select deptno,ename,sal,hiredate, count(*)over(partition by deptno,hiredate) cnt, row_number( )over(partition by deptno,hiredate order by sal) rn from emp where deptno=10 ) ) DEPTNO ENAME SAL HIREDATE DIFF ------ ------ ----- ----------- ---------- 10 CLARK 2450 09-JUN-1981 1450 10 ant 1000 17-NOV-1981 -300 10 joe 1500 17-NOV-1981 200 10 jim 1600 17-NOV-1981 300 10 jon 1700 17-NOV-1981 400 10 KING 5000 17-NOV-1981 3700 10 MILLER 1300 23-JAN-1982 N/A Now the solution is correct. As you can see, all the employees hired on November 17 now have their salaries compared with MILLER's salary. Inspecting the results, employee ANT now has a value of300 for DIFF, which is what we were hoping for. If it isn't immediately obvious, the expression passed to LEAD OVER; CNT-RN+1 is simply the distance from each employee hired on November 17 to MILLER. Consider the inline view below, which shows the values for CNT and RN: select deptno,ename,sal,hiredate, count(*)over(partition by deptno,hiredate) cnt, row_number( )over(partition by deptno,hiredate order by sal) rn from emp where deptno=10 DEPTNO ENAME SAL HIREDATE CNT RN ------ ------ ----- ----------- ---------- ---------- 10 CLARK 2450 09-JUN-1981 1 1 10 ant 1000 17-NOV-1981 5 1 10 joe 1500 17-NOV-1981 5 2 10 jim 1600 17-NOV-1981 5 3 10 jon 1700 17-NOV-1981 5 4 10 KING 5000 17-NOV-1981 5 5 10 MILLER 1300 23-JAN-1982 1 1 The value for CNT represents, for each employee with a duplicate HIREDATE, how many duplicates there are in total for their HIREDATE. The value for RN represents a ranking for the employees in DEPTNO 10. The rank is partitioned by DEPTNO and HIREDATE so only employees with a HIREDATE that another employee has will have a value greater than one. The ranking is sorted by SAL (this is arbitrary; SAL is convenient, but we could have just as easily chosen EMPNO). Now that you know how many total duplicates there are and you have a ranking of each duplicate, the distance to MILLER is simply the total number of duplicates minus the current rank plus one (CNT-RN+1). The results of the distance calculation and its effect on LEAD OVER are shown below: select deptno,ename,sal,hiredate, lead(sal)over(partition by deptno order by hiredate) incorrect, cnt-rn+1 distance, lead(sal,cnt-rn+1)over(partition by deptno order by hiredate) correct from ( select deptno,ename,sal,hiredate, count(*)over(partition by deptno,hiredate) cnt, row_number( )over(partition by deptno,hiredate order by sal) rn from emp where deptno=10 ) DEPTNO ENAME SAL HIREDATE INCORRECT DISTANCE CORRECT ------ ------ ----- ----------- ---------- ---------- ---------- 10 CLARK 2450 09-JUN-1981 1000 1 1000 10 ant 1000 17-NOV-1981 1500 5 1300 10 joe 1500 17-NOV-1981 1600 4 1300 10 jim 1600 17-NOV-1981 1700 3 1300 10 jon 1700 17-NOV-1981 5000 2 1300 10 KING 5000 17-NOV-1981 1300 1 1300 10 MILLER 1300 23-JAN-1982 1 Now you can clearly see the effect that you have when you pass the correct distance to LEAD OVER. The rows for INCORRECT represent the values returned by LEAD OVER using a default distance of one. The rows for CORRECT represent the values returned by LEAD OVER using the proper distance for each employee with a duplicate HIREDATE to MILLER. At this point, all that is left is to find the difference between CORRECT and SAL for each row, which has already been shown. |