6.6. Self-Joins on One Table

In a correctly designed relational database (third normal form or above), all non-key columns are about the key, the whole key, and nothing but the key, to use an excellent and frequently quoted formula.[*] Each row is both logically consistent and distinct from all other rows in the same table. It is this design characteristic that enables join relationships to be established within the same table. You can therefore select in the same query different (not necessarily disjoint) sets of rows from the same table and join them as if those rows came from several different tables. In this section, I'll discuss the simple self-join; I leave the more complex case of nested hierarchies for Chapter 7.
Self-joins (tables joined to themselves) are much more common than hierarchies. In some cases, it is simply because the data is seen in an identical way, but from two different angles; for instance, we can imagine that a query listing air flights would refer to the airports table twice, once to find the name of the departure airport, and once to find the name of the arrival airport. For example:

```sql
select f.flight_number,
       a.airport_name departure_airport,
       b.airport_name arrival_airport
from flights f,
     airports a,
     airports b
where f.dep_iata_code = a.iata_code
  and f.arr_iata_code = b.iata_code
```

In such a case, the usual rules apply: what matters is to ensure that highly efficient index access takes place. But what if the criteria are such that efficient access is not possible? The last thing we want is to make a first pass on the table, then a second one to pick up rows that were discarded during the first pass. In that case, we should make a single pass, collect all the rows of interest, and then use a construct such as the case statement to display rows from the two sets separately; I show examples of this "single-pass" approach in Chapter 11.

There are subtle cases that only superficially look like the airport case. Imagine that we store in some table cumulative values taken at regular intervals[*] and we want to display by how much the counter increased between two successive snapshots. In such a case, we have a relationship between two different rows in the same table, but instead of a strong relationship coming from another table (such as the flights table that links the two instances of airports together), we have a weak, internal one: two rows are related not because their keys are associated in another table, but because the timestamp of one row happens to be the timestamp that immediately follows the timestamp of another row.
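To make the airport example concrete, here is a minimal, runnable sketch of the double join against airports, executed through Python's stdlib sqlite3 module. The table layouts and the sample data (airport codes, names, and the flight number) are invented for illustration only:

```python
import sqlite3

# In-memory database with hypothetical sample data.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
create table airports (iata_code text primary key, airport_name text);
create table flights (flight_number text, dep_iata_code text, arr_iata_code text);
insert into airports values ('CDG', 'Paris Charles de Gaulle'),
                            ('LHR', 'London Heathrow');
insert into flights values ('AF1680', 'CDG', 'LHR');
""")

# The airports table appears twice, under two aliases, playing two roles:
# departure airport (alias a) and arrival airport (alias b).
rows = cur.execute("""
    select f.flight_number,
           a.airport_name departure_airport,
           b.airport_name arrival_airport
    from flights f, airports a, airports b
    where f.dep_iata_code = a.iata_code
      and f.arr_iata_code = b.iata_code
""").fetchall()
print(rows)  # [('AF1680', 'Paris Charles de Gaulle', 'London Heathrow')]
```

The point to notice is that nothing special distinguishes a self-join from an ordinary join: the two aliases behave exactly as if they referred to two distinct tables.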
For instance, if we assume that snapshots are taken every five minutes, with a timestamp expressed in seconds elapsed since a reference date, we might issue the following query:

```sql
select a.timestamp,
       a.statistic_id,
       (b.counter - a.counter)/5 hits_per_minute
from hit_counter a,
     hit_counter b
where b.timestamp = a.timestamp + 300
  and b.statistic_id = a.statistic_id
order by a.timestamp, a.statistic_id
```

There is a significant flaw in this script: if the second snapshot was not taken exactly five minutes after the first one, down to the second, we may be unable to join the two rows. We may therefore choose to express the join condition as a range condition. For example:

```sql
select a.timestamp,
       a.statistic_id,
       (b.counter - a.counter) * 60 / (b.timestamp - a.timestamp) hits_per_minute
from hit_counter a,
     hit_counter b
where b.timestamp between a.timestamp + 200
                      and a.timestamp + 400
  and b.statistic_id = a.statistic_id
order by a.timestamp, a.statistic_id
```

One side effect of this approach is the risk of bigger data gaps than necessary when, for one reason or another (such as a change in the sampling frequency), two successive records are no longer collected between 200 and 400 seconds of each other. We may play it even safer and use an OLAP function that operates on windows of rows. It is difficult to imagine anything less relational in nature, but such a function can come in handy as the final shine on a query, and it can even make a noticeable difference in performance. Basically, OLAP functions allow the consideration of different subsets of the final result set, through the use of the partition clause. Sorts, sums, and other similar functions can be applied separately to these individual result subsets. We can use the row_number() OLAP function to create one subset by statistic_id, and then assign to each different statistic successive integer numbers that increase as timestamps do.
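The range-condition version of the join can be checked end to end with a small script; SQLite, driven from Python's sqlite3 module, is enough. The hit_counter data below is invented: three snapshots of one statistic, deliberately taken 297 and 304 seconds apart rather than exactly 300, which is precisely the case that defeats the equality join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
create table hit_counter (timestamp integer, statistic_id integer, counter integer);
-- Snapshots roughly five minutes apart, but not to the second.
insert into hit_counter values (1000, 1, 500),
                               (1297, 1, 800),
                               (1601, 1, 1100);
""")

# The range condition pairs each snapshot with the one taken
# between 200 and 400 seconds later, tolerating timing jitter.
rows = cur.execute("""
    select a.timestamp, a.statistic_id,
           (b.counter - a.counter) * 60 / (b.timestamp - a.timestamp) hits_per_minute
    from hit_counter a, hit_counter b
    where b.timestamp between a.timestamp + 200 and a.timestamp + 400
      and b.statistic_id = a.statistic_id
    order by a.timestamp, a.statistic_id
""").fetchall()
print(rows)  # [(1000, 1, 60), (1297, 1, 59)] -- SQLite integer division truncates
```

With the strict b.timestamp = a.timestamp + 300 condition, the same data would return no rows at all, which illustrates the flaw pointed out above.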
When these numbers are generated by the OLAP function, we can join on both statistic_id and two sequential numbers, as in the following example:

```sql
select a.timestamp,
       a.statistic_id,
       (b.counter - a.counter) * 60 / (b.timestamp - a.timestamp)
from (select timestamp,
             statistic_id,
             counter,
             row_number() over (partition by statistic_id
                                order by timestamp) rn
      from hit_counter) a,
     (select timestamp,
             statistic_id,
             counter,
             row_number() over (partition by statistic_id
                                order by timestamp) rn
      from hit_counter) b
where b.rn = a.rn + 1
  and a.statistic_id = b.statistic_id
order by a.timestamp, a.statistic_id
```

We may even do better (about 25% faster than the previous query) if our DBMS implements, as Oracle does, a lag(column_name, n) OLAP function that returns the nth previous value for column_name, on the basis of the specified partitioning and ordering:

```sql
select timestamp,
       statistic_id,
       (counter - prev_counter) * 60 / (timestamp - prev_timestamp)
from (select timestamp,
             statistic_id,
             counter,
             lag(counter, 1) over (partition by statistic_id
                                   order by timestamp) prev_counter,
             lag(timestamp, 1) over (partition by statistic_id
                                     order by timestamp) prev_timestamp
      from hit_counter) a
order by a.timestamp, a.statistic_id
```

In many cases we don't have such symmetry in our data, as the flight example shows. Typically, a query looking for all the data associated with the smallest, or the largest, or the oldest, or the most recent value of a specific column first needs to find the actual smallest, largest, oldest, or most recent value in the column used for filtering (this is the first pass, which compares rows), and then search the table again in a second pass, using as a search criterion the value identified in the first pass. The two passes can be merged (at least superficially) into one through the use of OLAP functions that operate on sliding windows.
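The lag() approach is not Oracle-specific: SQLite has supported window functions, including lag(), since version 3.25, so the same idea can be sketched with Python's sqlite3 module (assuming the bundled SQLite is recent enough). The sample data is again invented, and one detail differs from the query above: a filter on prev_counter discards the first snapshot of each statistic, which would otherwise come back with NULL deltas because it has no predecessor:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
create table hit_counter (timestamp integer, statistic_id integer, counter integer);
insert into hit_counter values (1000, 1, 500),
                               (1300, 1, 800),
                               (1600, 1, 1100);
""")

# lag() fetches the previous row's counter and timestamp within each
# statistic_id partition, so no self-join is needed at all.
rows = cur.execute("""
    select timestamp, statistic_id,
           (counter - prev_counter) * 60 / (timestamp - prev_timestamp) hits_per_minute
    from (select timestamp, statistic_id, counter,
                 lag(counter, 1) over (partition by statistic_id
                                       order by timestamp) prev_counter,
                 lag(timestamp, 1) over (partition by statistic_id
                                         order by timestamp) prev_timestamp
          from hit_counter) a
    where prev_counter is not null
    order by timestamp, statistic_id
""").fetchall()
print(rows)  # [(1300, 1, 60), (1600, 1, 60)]
```

A single scan of hit_counter replaces the join of the table with itself, which is where the performance gain mentioned above comes from.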
Queries applied to data values associated with timestamps or dates are a special case of sufficient importance to deserve further discussion later in this chapter, under the situation "Simple or Range Searching on Dates."

When multiple selection criteria are applied to different rows in the same table, functions that operate on sliding windows may be of assistance.