Recipe10.1.Locating a Range of Consecutive Values | SQL Cookbook (Cookbooks (OReilly))

Recipe 10.1. Locating a Range of Consecutive Values

Problem

You want to determine which rows represent a range of consecutive projects. Consider the following result set from view V, which contains data about a project and its start and end dates:

 select *   from V PROJ_ID PROJ_START  PROJ_END ------- ----------- -----------       1 01-JAN-2005 02-JAN-2005       2 02-JAN-2005 03-JAN-2005       3 03-JAN-2005 04-JAN-2005       4 04-JAN-2005 05-JAN-2005       5 06-JAN-2005 07-JAN-2005       6 16-JAN-2005 17-JAN-2005       7 17-JAN-2005 18-JAN-2005       8 18-JAN-2005 19-JAN-2005       9 19-JAN-2005 20-JAN-2005      10 21-JAN-2005 22-JAN-2005      11 26-JAN-2005 27-JAN-2005      12 27-JAN-2005 28-JAN-2005      13 28-JAN-2005 29-JAN-2005      14 29-JAN-2005 30-JAN-2005

Excluding the first row, each row's PROJ_START should equal the PROJ_END of the row before it ("before" is defined as PROJ_ID1 for the current row). Examining the first five rows from view V, PROJ_IDs 1 through 3 are part of the same "group" as each PROJ_END equals the PROJ_START of the row after it. Because you want to find the range of dates for consecutive projects, you would like to return all rows where the current PROJ_END equals the next row's PROJ_START. If the first five rows comprised the entire result set, you would like to return only the first three rows. The final result set (using all 14 rows from view V) should be:

 PROJ_ID PROJ_START  PROJ_END ------- ----------- -----------      1  01-JAN-2005 02-JAN-2005      2  02-JAN-2005 03-JAN-2005      3  03-JAN-2005 04-JAN-2005      6  16-JAN-2005 17-JAN-2005      7  17-JAN-2005 18-JAN-2005      8  18-JAN-2005 19-JAN-2005     11  26-JAN-2005 27-JAN-2005     12  27-JAN-2005 28-JAN-2005     13  28-JAN-2005 29-JAN-2005

The rows with PROJ_IDs 4,5,9,10, and 14 are excluded from this result set because the PROJ_END of each of these rows does not match the PROJ_START of the row following it.

Solution

DB2, MySQL, PostgreSQL, and SQL Server

Use a self join to find the rows with consecutive values:

 1 select v1.proj_id, 2        v1.proj_start, 3        v1.proj_end 4   from V v1, V v2 5 where v1.proj_end = v2.proj_start

Oracle

The preceding solution will also work for Oracle. Alternatively, here is another solution that takes advantage of the window function LEAD OVER to look at the "next" row's BEGIN_DATE, thus avoiding the need to self join:

 1 select proj_id, proj_start, proj_end 2   from ( 3 select proj_id, proj_start, proj_end, 4        lead(proj_start)over(order by proj_id) next_proj_start 5   from V 6        ) 7 where next_proj_start = proj_end

Discussion

DB2, MySQL, PostgreSQL, and SQL Server

By self joining the view to itself, each row can be compared to every other row returned. Consider a partial result set for IDs 1 and 4:

  select v1.proj_id as v1_id,        v1.proj_end as v1_end,        v2.proj_start as v2_begin,        v2.proj_id as v2_id  from v v1,  v v2 where v1.proj_id in ( 1, 4 ) V1_ID V1_END      V2_BEGIN         V2_ID ----- ----------- ----------- ----------     1 02-JAN-2005 01-JAN-2005          1     1 02-JAN-2005 02-JAN-2005          2     1 02-JAN-2005 03-JAN-2005          3     1 02-JAN-2005 04-JAN-2005          4     1 02-JAN-2005 06-JAN-2005          5     1 02-JAN-2005 16-JAN-2005          6     1 02-JAN-2005 17-JAN-2005          7     1 02-JAN-2005 18-JAN-2005          8     1 02-JAN-2005 19-JAN-2005          9     1 02-JAN-2005 21-JAN-2005         10     1 02-JAN-2005 26-JAN-2005         11     1 02-JAN-2005 27-JAN-2005         12     1 02-JAN-2005 28-JAN-2005         13     1 02-JAN-2005 29-JAN-2005         14     4 05-JAN-2005 01-JAN-2005          1     4 05-JAN-2005 02-JAN-2005          2     4 05-JAN-2005 03-JAN-2005          3     4 05-JAN-2005 04-JAN-2005          4     4 05-JAN-2005 06-JAN-2005          5     4 05-JAN-2005 16-JAN-2005          6     4 05-JAN-2005 17-JAN-2005          7     4 05-JAN-2005 18-JAN-2005          8     4 05-JAN-2005 19-JAN-2005          9     4 05-JAN-2005 21-JAN-2005         10     4 05-JAN-2005 26-JAN-2005         11     4 05-JAN-2005 27-JAN-2005         12     4 05-JAN-2005 28-JAN-2005         13     4 05-JAN-2005 29-JAN-2005         14

Examining this result set, you can see why PROJ_ID 1 is included in the final result set and PROJ_ID 4 is not: there is no corresponding V2_BEGIN value for the V1_ END value returned for V1_ID 4.

Depending on how you view the data, PROJ_ID 4 can just as easily be considered contiguous. Consider the following result set:

  select *   from V where proj_id <= 5 PROJ_ID PROJ_START  PROJ_END ------- ----------  -----------       1 01-JAN-2005 02-JAN-2005       2 02-JAN-2005 03-JAN-2005       3 03-JAN-2005 04-JAN-2005       4 04-JAN-2005 05-JAN-2005       5 06-JAN-2005 07-JAN-2005

If "contiguous" is defined as a project that starts the same day another project ends, then PROJ_ID 4 should be included in the result set. PROJ_ID 4 was originally eliminated because of the forward comparison (comparing its PROJ_END with the next PROJ_START), but if you do a backwards comparison (PROJ_START with the prior PROJ_END), then PROJ_ID 4 will be included in the result set.

Modifying the solution to include PROJ_ID 4 is trivial: simply add an additional predicate to ensure that both PROJ_START and PROJ_END are checked for being contiguous, not just PROJ_END. The modification shown in the following query produces a result set that includes PROJ_ID 4 (DISTINCT is necessary because some rows satisfy both predicate conditions):

  select distinct        v1.proj_id,        v1.proj_start,        v1.proj_end  from V v1,  V v2 where v1.proj_end = v2.proj_start    or v1.proj_start = v2.proj_end PROJ_ID PROJ_START  PROJ_END ------- ----------- -----------       1 01-JAN-2005 02-JAN-2005       2 02-JAN-2005 03-JAN-2005       3 03-JAN-2005 04-JAN-2005       4 04-JAN-2005 05-JAN-2005

Oracle

While the self-join solution certainly works, the window function LEAD OVER is perfect for this type of problem. The function LEAD OVER allows you to examine other rows without performing a self join (though the function must impose order on the result set to do so). Consider the results of the inline view (lines 35) for IDs 1 and 4:

  select *   from ( select proj_id, proj_start, proj_end,        lead(proj_start)over(order by proj_id) next_proj_start   from v        )  where proj_id in ( 1, 4 ) PROJ_ID PROJ_START  PROJ_END    NEXT_PROJ_START ------- ----------- ----------- ---------------       1 01-JAN-2005 02-JAN-2005 02-JAN-2005       4 04-JAN-2005 05-JAN-2005 06-JAN-2005

Examining the above snippet of code and its result set, it is particularly easy to see why PROJ_ID 4 is excluded from the final result set of the complete solution. It's excluded because its PROJ_END date of 05-JAN-2005 does not match the "next" project's start date of 06-JAN-2005.

The function LEAD OVER is extremely handy when it comes to problems such as this one, particularly when examining partial results. When working with window functions, keep in mind that they are evaluated after the FROM and WHERE clauses, so the LEAD OVER function in the preceding query must be embedded within an inline view. Otherwise the LEAD OVER function is applied to the result set after the WHERE clause has filtered out all rows except for PROJ_ID's 1 and 4.

Now, depending on how you view the data, you may very well want to include PROJ_ID 4 in the final result set. Consider the first five rows from view V:

  select *   from V  where proj_id <= 5 PROJ_ID PROJ_START  PROJ_END ------- ----------- -----------       1 01-JAN-2005 02-JAN-2005       2 02-JAN-2005 03-JAN-2005       3 03-JAN-2005 04-JAN-2005       4 04-JAN-2005 05-JAN-2005       5 06-JAN-2005 07-JAN-2005

If your requirement is such that PROJ_ID 4 is in fact contiguous (because PROJ_ START for PROJ_ID 4 matches PROJ_END for PROJ_ID 3), and that only PROJ_ ID 5 should be discarded, the proposed solution for this recipe is incorrect (!), or at the very least, incomplete:

  select proj_id, proj_start, proj_end   from ( select proj_id, proj_start, proj_end,         lead(proj_start)over(order by proj_id) next_start   from V where proj_id <= 5       ) where proj_end = next_start PROJ_ID PROJ_START  PROJ_END ------- ----------- -----------       1 01-JAN-2005 02-JAN-2005       2 02-JAN-2005 03-JAN-2005       3 03-JAN-2005 04-JAN-2005

If you believe PROJ_ID 4 should be included, simply add LAG OVER to the query and use an additional filter in the WHERE clause:

  select proj_id, proj_start, proj_end   from ( select proj_id, proj_start, proj_end,         lead(proj_start)over(order by proj_id) next_start,        lag(proj_end)over(order by proj_id) last_end   from V where proj_id <= 5       ) where proj_end = next_start    or proj_start = last_end PROJ_ID PROJ_START  PROJ_END ------- ----------- -----------       1 01-JAN-2005 02-JAN-2005       2 02-JAN-2005 03-JAN-2005       3 03-JAN-2005 04-JAN-2005       4 04-JAN-2005 05-JAN-2005

Now PROJ_ID 4 is included in the final result set, and only the evil PROJ_ID 5 is excluded. Please consider your exact requirements when applying these recipes to your code.