Recipe10.3.Locating the Beginning and End of a Range of Consecutive Values

Recipe 10.3. Locating the Beginning and End of a Range of Consecutive Values

Problem

This recipe is an extension of the prior recipe , and it uses the same view V from the prior recipe. Now that you've located the ranges of consecutive values, you want to find just their start and end points. Unlike the prior recipe, if a row is not part of a set of consecutive values, you still want to return it. Why? Because such a row represents both the beginning and end of its range. Using the data from view V:

  select *   from V PROJ_ID PROJ_START  PROJ_END ------- ----------- -----------       1 01-JAN-2005 02-JAN-2005       2 02-JAN-2005 03-JAN-2005       3 03-JAN-2005 04-JAN-2005       4 04-JAN-2005 05-JAN-2005       5 06-JAN-2005 07-JAN-2005       6 16-JAN-2005 17-JAN-2005       7 17-JAN-2005 18-JAN-2005       8 18-JAN-2005 19-JAN-2005       9 19-JAN-2005 20-JAN-2005      10 21-JAN-2005 22-JAN-2005      11 26-JAN-2005 27-JAN-2005      12 27-JAN-2005 28-JAN-2005      13 28-JAN-2005 29-JAN-2005      14 29-JAN-2005 30-JAN-2005

you want the final result set to be:

 PROJ_GRP PROJ_START  PROJ_END -------- ----------- -----------        1 01-JAN-2005 05-JAN-2005        2 06-JAN-2005 07-JAN-2005        3 16-JAN-2005 20-JAN-2005        4 21-JAN-2005 22-JAN-2005        5 26-JAN-2005 30-JAN-2005

Solution

This problem is a bit more involved than its predecessor. First, you must identify what the ranges are. A range of rows is defined by the values for PROJ_START and PROJ_END. For a row to be considered "consecutive" or part of a group, its PROJ_ START value must equal the PROJ_END value of the row before it. In the case where a row's PROJ_START value does not equal the prior row's PROJ_END value and its PROJ_END value does not equal the next row's PROJ_START value, this is an instance of a single row group. Once you have identify the ranges, you need to be able to group the rows in these ranges together (into groups) and return only their start and end points.

Examine the first row of the desired result set. The PROJ_START is the PROJ_ START for PROJ_ID 1 from view V and the PROJ_END is the PROJ_END for PROJ_ID 4 from view V. Despite the fact that PROJ_ID 4 does not have a consecutive value following it, it is the last of a range of consecutive values, and thus it is included in the first group.

DB2, MySQL, PostgreSQL, and SQL Server

The solution for these platforms will use use view V2 to help improve readability. View V2 is defined as follows:

 create view v2 as select a.*,        case          when (             select b.proj_id               from V b             where a.proj_start = b.proj_end              )              is not null then 0 else 1           end as flag from V a

The result set from view V2 is:

  select *   from V2 PROJ_ID PROJ_START  PROJ_END          FLAG ------- ----------- ----------- ----------       1 01-JAN-2005 02-JAN-2005          1       2 02-JAN-2005 03-JAN-2005          0       3 03-JAN-2005 04-JAN-2005          0       4 04-JAN-2005 05-JAN-2005          0       5 06-JAN-2005 07-JAN-2005          1       6 16-JAN-2005 17-JAN-2005          1       7 17-JAN-2005 18-JAN-2005          0       8 18-JAN-2005 19-JAN-2005          0       9 19-JAN-2005 20-JAN-2005          0      10 21-JAN-2005 22-JAN-2005          1      11 26-JAN-2005 27-JAN-2005          1      12 27-JAN-2005 28-JAN-2005          0      13 28-JAN-2005 29-JAN-2005          0      14 29-JAN-2005 30-JAN-2005          0

Using V2, the solution is as follows. First, find the rows that are part of a set of consecutive values. Group those rows together. Then use the MIN and MAX functions to find their start and end points:

  1 select proj_grp,  2        min(proj_start) as proj_start,  3        max(proj_end) as proj_end  4   from (  5 select a.proj_id,a.proj_start,a.proj_end,  6        (select sum(b.flag)  7           from V2 b  8          where b.proj_id <= a.proj_id) as proj_grp  9   from V2 a 10       ) x 11  group by proj_grp

Oracle

While the solution for the other vendors will work for Oracle, there's no need to introduce additional views when you can take advantage of Oracle's LAG OVER window function. Use LAG OVER to determine whether or not each prior row's PROJ_END equals the current row's PROJ_START to help place the rows into groups. Once they are grouped, use the aggregate functions MIN and MAX to find their start and end points:

  1 select proj_grp, min(proj_start), max(proj_end)  2   from (  3 select proj_id,proj_start,proj_end,  4        sum(flag)over(order by proj_id) proj_grp  5   from (  6 select proj_id,proj_start,proj_end,  7        case when  8             lag(proj_end)over(order by proj_id) = proj_start  9             then 0 else 1 10        end flag 11   from V 12        ) 13        ) 14  group by proj_grp

Discussion

DB2, MySQL, PostgreSQL, and SQL Server

Using view V2 makes this problem relatively easy to solve. View V2 uses a scalar subquery in a CASE expression to determine whether or not a particular row is part of a set of consecutive values. The CASE expression, aliased FLAG, returns a 0 if the current row is part of a consecutive set or a 1 if it is not (membership in a consecutive set is determined by whether or not there is a record with a PROJ_END value that matches the current row's PROJ_START value). The next step is to examine inline view X (lines 59). Inline view X returns all rows from view V2 along with a running total on FLAG; this running total is what creates our groups and can be seen below:

  select a.proj_id,a.proj_start,a.proj_end,        (select sum(b.flag)           from v2 b          where b.proj_id <= a.proj_id) as proj_grp   from v2 a PROJ_ID PROJ_START  PROJ_END      PROJ_GRP ------- ----------- ----------- ----------       1 01-JAN-2005 02-JAN-2005          1       2 02-JAN-2005 03-JAN-2005          1       3 03-JAN-2005 04-JAN-2005          1       4 04-JAN-2005 05-JAN-2005          1       5 06-JAN-2005 07-JAN-2005          2       6 16-JAN-2005 17-JAN-2005          3       7 17-JAN-2005 18-JAN-2005          3       8 18-JAN-2005 19-JAN-2005          3       9 19-JAN-2005 20-JAN-2005          3      10 21-JAN-2005 22-JAN-2005          4      11 26-JAN-2005 27-JAN-2005          5      12 27-JAN-2005 28-JAN-2005          5      13 28-JAN-2005 29-JAN-2005          5      14 29-JAN-2005 30-JAN-2005          5

Now that the ranges have been grouped, find the start and end point for each by simply using the aggregate functions MIN and MAX on PROJ_START and PROJ_END respectively, and group by the values created by the running total.

Oracle

The window function LAG OVER is extremely useful in this situation. You can examine each prior row's PROJ_END value without a self join, without a scalar sub-query, and without a view. The results of the LAG OVER function without the CASE expression are as follows:

  select proj_id,proj_start,proj_end,       lag(proj_end)over(order by proj_id) prior_proj_end   from V PROJ_ID PROJ_START  PROJ_END    PRIOR_PROJ_END ------- ----------- ----------- --------------       1 01-JAN-2005 02-JAN-2005       2 02-JAN-2005 03-JAN-2005 02-JAN-2005       3 03-JAN-2005 04-JAN-2005 03-JAN-2005       4 04-JAN-2005 05-JAN-2005 04-JAN-2005       5 06-JAN-2005 07-JAN-2005 05-JAN-2005       6 16-JAN-2005 17-JAN-2005 07-JAN-2005       7 17-JAN-2005 18-JAN-2005 17-JAN-2005       8 18-JAN-2005 19-JAN-2005 18-JAN-2005       9 19-JAN-2005 20-JAN-2005 19-JAN-2005      10 21-JAN-2005 22-JAN-2005 20-JAN-2005      11 26-JAN-2005 27-JAN-2005 22-JAN-2005      12 27-JAN-2005 28-JAN-2005 27-JAN-2005      13 28-JAN-2005 29-JAN-2005 28-JAN-2005      14 29-JAN-2005 30-JAN-2005 29-JAN-2005

The CASE expression in the complete solution simply compares the value returned by LAG OVER to the current row's PROJ_START value; if they are the same, return 0, else return 1. The next step is to create a running total on the 0's and 1's returned by the CASE expression to put each row into a group. The results of the running total can be seen below:

  select proj_id,proj_start,proj_end,        sum(flag)over(order by proj_id) proj_grp   from ( select proj_id,proj_start,proj_end,        case when             lag(proj_end)over(order by proj_id) = proj_start             then 0 else 1        end flag   from V        ) PROJ_ID PROJ_START  PROJ_END      PROJ_GRP ------- ----------- ----------- ----------       1 01-JAN-2005 02-JAN-2005          1       2 02-JAN-2005 03-JAN-2005          1       3 03-JAN-2005 04-JAN-2005          1       4 04-JAN-2005 05-JAN-2005          1       5 06-JAN-2005 07-JAN-2005          2       6 16-JAN-2005 17-JAN-2005          3       7 17-JAN-2005 18-JAN-2005          3       8 18-JAN-2005 19-JAN-2005          3       9 19-JAN-2005 20-JAN-2005          3      10 21-JAN-2005 22-JAN-2005          4      11 26-JAN-2005 27-JAN-2005          5      12 27-JAN-2005 28-JAN-2005          5      13 28-JAN-2005 29-JAN-2005          5      14 29-JAN-2005 30-JAN-2005          5

Now that each row has been placed into a group, simply use the aggregate functions MIN and MAX on PROJ_START and PROJ_END respectively, and group by the values created in the PROJ_GRP running total column.