Recipe 12.8. Using a Join to Fill or Identify Holes in a List

Problem

You want to produce a summary for each of several categories, but some of them are not represented in the data to be summarized. Consequently, the summary has missing categories.

Solution

Create a reference table that lists each category and produce the summary based on a LEFT JOIN between the list and the table containing your data. Then every category in the reference table will appear in the result, even those not present in the data to be summarized.

Discussion

When you run a summary query, normally it produces entries only for the values that are actually present in the data. Let's say you want to produce a time-of-day summary for the rows in the mail table. That table looks like this:

mysql> SELECT * FROM mail; +---------------------+---------+---------+---------+---------+---------+ | t                   | srcuser | srchost | dstuser | dsthost | size    | +---------------------+---------+---------+---------+---------+---------+ | 2006-05-11 10:15:08 | barb    | saturn  | tricia  | mars    |   58274 | | 2006-05-12 12:48:13 | tricia  | mars    | gene    | venus   |  194925 | | 2006-05-12 15:02:49 | phil    | mars    | phil    | saturn  |    1048 | | 2006-05-13 13:59:18 | barb    | saturn  | tricia  | venus   |     271 | | 2006-05-14 09:31:37 | gene    | venus   | barb    | mars    |    2291 | | 2006-05-14 11:52:17 | phil    | mars    | tricia  | saturn  |    5781 | ...

To determine how many messages were sent for each hour of the day, use the following statement:

mysql> SELECT HOUR(t) AS hour, COUNT(HOUR(t)) AS count     -> FROM mail GROUP BY hour; +------+-------+ | hour | count | +------+-------+ |    7 |     1 | |    8 |     1 | |    9 |     2 | |   10 |     2 | |   11 |     1 | |   12 |     2 | |   13 |     1 | |   14 |     1 | |   15 |     1 | |   17 |     2 | |   22 |     1 | |   23 |     1 | +------+-------+

Here, the summary category is hour of the day. However, the summary is "incomplete" in the sense that it includes entries only for those hours of the day represented in the mail table. To produce a summary that includes all hours of the day, even those during which no messages were sent, create a reference table that lists each category (that is, each hour):

mysql> CREATE TABLE ref (h INT); mysql> INSERT INTO ref (h)     -> VALUES(0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),     -> (12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23);

Then join the reference table to the mail table using a LEFT JOIN:

mysql> SELECT ref.h AS hour, COUNT(mail.t) AS count     -> FROM ref LEFT JOIN mail ON ref.h = HOUR(mail.t)     -> GROUP BY hour; +------+-------+ | hour | count | +------+-------+ |    0 |     0 | |    1 |     0 | |    2 |     0 | |    3 |     0 | |    4 |     0 | |    5 |     0 | |    6 |     0 | |    7 |     1 | |    8 |     1 | |    9 |     2 | |   10 |     2 | |   11 |     1 | |   12 |     2 | |   13 |     1 | |   14 |     1 | |   15 |     1 | |   16 |     0 | |   17 |     2 | |   18 |     0 | |   19 |     0 | |   20 |     0 | |   21 |     0 | |   22 |     1 | |   23 |     1 | +------+-------+

Now the summary includes an entry for every hour of the day because the LEFT JOIN forces the output to include a row for every row in the reference table, regardless of the contents of the mail table.

The example just shown uses the reference table with a LEFT JOIN to fill in holes in the category list. It's also possible to use the reference table to detect holes in the datasetthat is, to determine which categories are not present in the data to be summarized. The following statement shows those hours of the day during which no messages were sent by looking for reference rows for which no mail table rows have a matching category value:

mysql> SELECT ref.h AS hour     -> FROM ref LEFT JOIN mail ON ref.h = HOUR(mail.t)     -> WHERE mail.t IS NULL; +------+ | hour | +------+ |    0 | |    1 | |    2 | |    3 | |    4 | |    5 | |    6 | |   16 | |   18 | |   19 | |   20 | |   21 | +------+

Reference tables that contain a list of categories are quite useful for summary statements, but creating such tables manually is mind-numbing and error-prone. You might find it preferable to write a script that uses the endpoints of the range of category values to generate the reference table for you. In essence, this type of script acts as an iterator that generates a row for each value in the range. The following Perl script, make_date_list.pl, shows an example of this approach. It creates a reference table containing a row for every date in a particular date range. It also indexes the table so that it will be fast in large joins.

#!/usr/bin/perl # make_date_list.pl - create a table with an entry for every date in # a given date range.  The table can be used in a LEFT JOIN when # producing a summary, to make sure that every date appears in the # summary, regardless of whether the data to be summarized actually # contains any values for a given day. # Usage: make_date_list.pl db_name tbl_name col_name min_date max_date use strict; use warnings; use DBI; # ... process command-line options (not shown) ... # Check number of arguments @ARGV == 5 or die "$usage\n"; my ($db_name, $tbl_name, $col_name, $min_date, $max_date) = @ARGV; # ... connect to database (not shown) ... # Determine the number of days spanned by the date range. my $days = $dbh->selectrow_array (qq{ SELECT DATEDIFF(?,?) + 1 },                                   undef, $max_date, $min_date); print "Minimum date: $min_date\n"; print "Maximum date: $max_date\n"; print "Number of days spanned by range: $days\n"; die "Date range is too small\n" if $days < 1; # Drop table if it exists, and then recreate it $dbh->do ("DROP TABLE IF EXISTS $db_name.$tbl_name"); $dbh->do (qq{         CREATE TABLE $db_name.$tbl_name         ($col_name DATE NOT NULL, PRIMARY KEY ($col_name))   }); # Populate table with each date in the date range my $sth = $dbh->prepare (qq{       INSERT INTO $db_name.$tbl_name ($col_name) VALUES(? + INTERVAL ? DAY)   }); foreach my $i (0 .. $days-1) {   $sth->execute ($min_date, $i); }

Reference tables generated by make_date_list.pl can be used for per-date summaries or to find dates not represented in the table. Suppose that you want to summarize the driver_log table to determine how many drivers were on the road each day. The table has these rows:

mysql> SELECT * FROM driver_log ORDER BY rec_id; +--------+-------+------------+-------+ | rec_id | name  | trav_date  | miles | +--------+-------+------------+-------+ |      1 | Ben   | 2006-08-30 |   152 | |      2 | Suzi  | 2006-08-29 |   391 | |      3 | Henry | 2006-08-29 |   300 | |      4 | Henry | 2006-08-27 |    96 | |      5 | Ben   | 2006-08-29 |   131 | |      6 | Henry | 2006-08-26 |   115 | |      7 | Suzi  | 2006-09-02 |   502 | |      8 | Henry | 2006-09-01 |   197 | |      9 | Ben   | 2006-09-02 |    79 | |     10 | Henry | 2006-08-30 |   203 | +--------+-------+------------+-------+

A simple summary looks like this:

mysql> SELECT trav_date, COUNT(trav_date) AS drivers     -> FROM driver_log GROUP BY trav_date; +------------+---------+ | trav_date  | drivers | +------------+---------+ | 2006-08-26 |       1 | | 2006-08-27 |       1 | | 2006-08-29 |       3 | | 2006-08-30 |       2 | | 2006-09-01 |       1 | | 2006-09-02 |       2 | +------------+---------+

However, that summary does not show dates when no drivers were active. To generate a complete summary that includes the missing dates, use make_date_list.pl. From the simple summary just shown, we can tell the minimum and maximum dates, so generate a reference table named ref with a date column d that spans those dates:

% make_date_list.pl cookbook ref d 2006-08-26 2006-09-02 Minimum date: 2006-08-26 Maximum date: 2006-09-02 Number of days spanned by range: 8

After creating the reference table, use it in the following statement to generate the complete summary:

mysql> SELECT ref.d, COUNT(driver_log.trav_date) AS drivers     -> FROM ref LEFT JOIN driver_log ON ref.d = driver_log.trav_date     -> GROUP BY d; +------------+---------+ | d          | drivers | +------------+---------+ | 2006-08-26 |       1 | | 2006-08-27 |       1 | | 2006-08-28 |       0 | | 2006-08-29 |       3 | | 2006-08-30 |       2 | | 2006-08-31 |       0 | | 2006-09-01 |       1 | | 2006-09-02 |       2 | +------------+---------+

This second summary includes additional rows that show those dates when no drivers were active. To list only those no-driver dates, use this statement:

mysql> SELECT ref.d     -> FROM ref LEFT JOIN driver_log ON ref.d = driver_log.trav_date     -> WHERE driver_log.trav_date IS NULL     -> ORDER BY d; +------------+ | d          | +------------+ | 2006-08-28 | | 2006-08-31 | +------------+