6.6. Self-Joins on One Table

In a correctly designed relational database (third normal form or above), all non-key columns are about the key, the whole key, and nothing but the key, to use an excellent and frequently quoted formula.[*] Each row is both logically consistent and distinct from all other rows in the same table. It is this design characteristic that enables join relationships to be established within the same table. You can therefore select in the same query different (not necessarily disjoint) sets of rows from the same table and join them as if those rows came from several different tables. In this section, I'll discuss the simple self-join; I leave the more complex case of nested hierarchies for Chapter 7.
Self-joins (tables joined to themselves) are much more common than hierarchies. In some cases, it is simply because the data is seen in an identical way, but from two different angles; for instance, we can imagine that a query listing air flights would refer to the airports table twice, once to find the name of the departure airport, and once to find the name of the arrival airport. For example:

```sql
select f.flight_number,
       a.airport_name departure_airport,
       b.airport_name arrival_airport
from flights f,
     airports a,
     airports b
where f.dep_iata_code = a.iata_code
  and f.arr_iata_code = b.iata_code
```

In such a case, the usual rules apply: what matters is to ensure that highly efficient index access takes place. But what if the criteria are such that efficient access is not possible? The last thing we want is to make a first pass on the table, then a second one to pick up rows that were discarded during the first pass. In that case, we should make a single pass, collect all the rows of interest, and then use a construct such as the case statement to display rows from the two sets separately; I show examples of this "single-pass" approach in Chapter 11.

There are subtle cases that only superficially look like the airport case. Imagine that we store in some table cumulative values taken at regular intervals[*] and we want to display by how much the counter increased between two successive snapshots. In such a case, we have a relationship between two different rows in the same table, but instead of a strong relationship coming from another table (such as the flights table that links the two instances of airports together), we have a weak, internal one: two rows are related not because their keys are associated in another table, but because the timestamp of one row happens to be the timestamp that immediately follows the timestamp of another row.
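To make the airport example concrete, here is a minimal, runnable sketch of the double join against airports, executed through Python's stdlib sqlite3 module. The table layouts and the sample data (airport codes, names, and the flight number) are invented for illustration only:

```python
import sqlite3

# In-memory database with hypothetical sample data.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
create table airports (iata_code text primary key, airport_name text);
create table flights (flight_number text, dep_iata_code text, arr_iata_code text);
insert into airports values ('CDG', 'Paris Charles de Gaulle'),
                            ('LHR', 'London Heathrow');
insert into flights values ('AF1680', 'CDG', 'LHR');
""")

# The airports table appears twice, under two aliases, playing two roles:
# departure airport (alias a) and arrival airport (alias b).
rows = cur.execute("""
    select f.flight_number,
           a.airport_name departure_airport,
           b.airport_name arrival_airport
    from flights f, airports a, airports b
    where f.dep_iata_code = a.iata_code
      and f.arr_iata_code = b.iata_code
""").fetchall()
print(rows)  # [('AF1680', 'Paris Charles de Gaulle', 'London Heathrow')]
```

The point to notice is that nothing special distinguishes a self-join from an ordinary join: the two aliases behave exactly as if they referred to two distinct tables.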
For instance, if we assume that snapshots are taken every five minutes, with a timestamp expressed in seconds elapsed since a reference date, we might issue the following query:

```sql
select a.timestamp,
       a.statistic_id,
       (b.counter - a.counter)/5 hits_per_minute
from hit_counter a,
     hit_counter b
where b.timestamp = a.timestamp + 300
  and b.statistic_id = a.statistic_id
order by a.timestamp, a.statistic_id
```

There is a significant flaw in this script: if the second snapshot was not taken exactly five minutes after the first one, down to the second, we may be unable to join the two rows. We may therefore choose to express the join condition as a range condition. For example:

```sql
select a.timestamp,
       a.statistic_id,
       (b.counter - a.counter) * 60 / (b.timestamp - a.timestamp) hits_per_minute
from hit_counter a,
     hit_counter b
where b.timestamp between a.timestamp + 200
                      and a.timestamp + 400
  and b.statistic_id = a.statistic_id
order by a.timestamp, a.statistic_id
```

One side effect of this approach is the risk of bigger data gaps than necessary when, for one reason or another (such as a change in the sampling frequency), two successive records are no longer collected between 200 and 400 seconds of each other. We may play it even safer and use an OLAP function that operates on windows of rows. It is difficult to imagine anything less relational in nature, but such a function can come in handy as the final shine on a query, and it can even make a noticeable difference in performance. Basically, OLAP functions allow the consideration of different subsets of the final result set, through the use of the partition clause. Sorts, sums, and other similar functions can be applied separately to these individual result subsets. We can use the row_number() OLAP function to create one subset by statistic_id, and then assign to each different statistic successive integer numbers that increase as timestamps do.
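The range-condition version of the join can be checked end to end with a small script; SQLite, driven from Python's sqlite3 module, is enough. The hit_counter data below is invented: three snapshots of one statistic, deliberately taken 297 and 304 seconds apart rather than exactly 300, which is precisely the case that defeats the equality join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
create table hit_counter (timestamp integer, statistic_id integer, counter integer);
-- Snapshots roughly five minutes apart, but not to the second.
insert into hit_counter values (1000, 1, 500),
                               (1297, 1, 800),
                               (1601, 1, 1100);
""")

# The range condition pairs each snapshot with the one taken
# between 200 and 400 seconds later, tolerating timing jitter.
rows = cur.execute("""
    select a.timestamp, a.statistic_id,
           (b.counter - a.counter) * 60 / (b.timestamp - a.timestamp) hits_per_minute
    from hit_counter a, hit_counter b
    where b.timestamp between a.timestamp + 200 and a.timestamp + 400
      and b.statistic_id = a.statistic_id
    order by a.timestamp, a.statistic_id
""").fetchall()
print(rows)  # [(1000, 1, 60), (1297, 1, 59)] -- SQLite integer division truncates
```

With the strict b.timestamp = a.timestamp + 300 condition, the same data would return no rows at all, which illustrates the flaw pointed out above.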
When these numbers are generated by the OLAP function, we can join on both statistic_id and two sequential numbers, as in the following example:

```sql
select a.timestamp,
       a.statistic_id,
       (b.counter - a.counter) * 60 / (b.timestamp - a.timestamp)
from (select timestamp,
             statistic_id,
             counter,
             row_number() over (partition by statistic_id
                                order by timestamp) rn
      from hit_counter) a,
     (select timestamp,
             statistic_id,
             counter,
             row_number() over (partition by statistic_id
                                order by timestamp) rn
      from hit_counter) b
where b.rn = a.rn + 1
  and a.statistic_id = b.statistic_id
order by a.timestamp, a.statistic_id
```

We may even do better (about 25% faster than the previous query) if our DBMS implements, as Oracle does, a lag(column_name, n) OLAP function that returns the nth previous value for column_name, on the basis of the specified partitioning and ordering:

```sql
select timestamp,
       statistic_id,
       (counter - prev_counter) * 60 / (timestamp - prev_timestamp)
from (select timestamp,
             statistic_id,
             counter,
             lag(counter, 1) over (partition by statistic_id
                                   order by timestamp) prev_counter,
             lag(timestamp, 1) over (partition by statistic_id
                                     order by timestamp) prev_timestamp
      from hit_counter) a
order by a.timestamp, a.statistic_id
```

In many cases we don't have such symmetry in our data, as the flight example shows. Typically, a query looking for all the data associated with the smallest, or the largest, or the oldest, or the most recent value of a specific column first needs to find the actual smallest, largest, oldest, or most recent value in the column used for filtering (this is the first pass, which compares rows), and then search the table again in a second pass, using as a search criterion the value identified in the first pass. The two passes can be merged (at least superficially) into one through the use of OLAP functions that operate on sliding windows.
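The lag() approach is not Oracle-specific: SQLite has supported window functions, including lag(), since version 3.25, so the same idea can be sketched with Python's sqlite3 module (assuming the bundled SQLite is recent enough). The sample data is again invented, and one detail differs from the query above: a filter on prev_counter discards the first snapshot of each statistic, which would otherwise come back with NULL deltas because it has no predecessor:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
create table hit_counter (timestamp integer, statistic_id integer, counter integer);
insert into hit_counter values (1000, 1, 500),
                               (1300, 1, 800),
                               (1600, 1, 1100);
""")

# lag() fetches the previous row's counter and timestamp within each
# statistic_id partition, so no self-join is needed at all.
rows = cur.execute("""
    select timestamp, statistic_id,
           (counter - prev_counter) * 60 / (timestamp - prev_timestamp) hits_per_minute
    from (select timestamp, statistic_id, counter,
                 lag(counter, 1) over (partition by statistic_id
                                       order by timestamp) prev_counter,
                 lag(timestamp, 1) over (partition by statistic_id
                                         order by timestamp) prev_timestamp
          from hit_counter) a
    where prev_counter is not null
    order by timestamp, statistic_id
""").fetchall()
print(rows)  # [(1300, 1, 60), (1600, 1, 60)]
```

A single scan of hit_counter replaces the join of the table with itself, which is where the performance gain mentioned above comes from.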
Queries applied to data values associated with timestamps or dates are a special case of sufficient importance to deserve further discussion later in this chapter, under the situation "Simple or Range Searching on Dates."

When multiple selection criteria are applied to different rows in the same table, functions that operate on sliding windows may be of assistance.