Using a Join to Fill in Holes in a List

12.10.1 Problem

You want to produce a summary for each of several categories, but some of the categories are not represented in the data to be summarized. Consequently, the summary has missing categories.

12.10.2 Solution

Create a reference table that lists each category and produce the summary based on a LEFT JOIN between the list and the table containing your data. Then every category in the reference table will appear in the result, even "empty" ones.

12.10.3 Discussion

When you run a summary query, normally it produces entries only for the values that are actually present in the data. Let's say you want to produce a time-of-day summary for the records in the mail table, which looks like this:

mysql> SELECT * FROM mail;
+---------------------+---------+---------+---------+---------+---------+
| t | srcuser | srchost | dstuser | dsthost | size |
+---------------------+---------+---------+---------+---------+---------+
| 2001-05-11 10:15:08 | barb | saturn | tricia | mars | 58274 |
| 2001-05-12 12:48:13 | tricia | mars | gene | venus | 194925 |
| 2001-05-12 15:02:49 | phil | mars | phil | saturn | 1048 |
| 2001-05-13 13:59:18 | barb | saturn | tricia | venus | 271 |
| 2001-05-14 09:31:37 | gene | venus | barb | mars | 2291 |
| 2001-05-14 11:52:17 | phil | mars | tricia | saturn | 5781 |
...

To determine how many messages were sent for each hour of the day, use the following query:

mysql> SELECT HOUR(t) AS hour, COUNT(HOUR(t)) AS count
 -> FROM mail GROUP BY hour;
+------+-------+
| hour | count |
+------+-------+
| 7 | 1 |
| 8 | 1 |
| 9 | 2 |
| 10 | 2 |
| 11 | 1 |
| 12 | 2 |
| 13 | 1 |
| 14 | 1 |
| 15 | 1 |
| 17 | 2 |
| 22 | 1 |
| 23 | 1 |
+------+-------+

However, this summary is incomplete in the sense that it includes entries only for those hours of the day represented in the mail table. To produce a summary that includes all hours of the day, even those during which no messages were sent, create a reference table that lists each hour:

mysql> CREATE TABLE ref (h INT);
mysql> INSERT INTO ref (h)
 -> VALUES(0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),
 -> (12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23);

Then join the reference table to the mail table using a LEFT JOIN:

mysql> SELECT ref.h AS hour, COUNT(HOUR(mail.t)) AS count
 -> FROM ref LEFT JOIN mail ON ref.h = HOUR(mail.t)
 -> GROUP BY hour;
+------+-------+
| hour | count |
+------+-------+
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
| 7 | 1 |
| 8 | 1 |
| 9 | 2 |
| 10 | 2 |
| 11 | 1 |
| 12 | 2 |
| 13 | 1 |
| 14 | 1 |
| 15 | 1 |
| 16 | 0 |
| 17 | 2 |
| 18 | 0 |
| 19 | 0 |
| 20 | 0 |
| 21 | 0 |
| 22 | 1 |
| 23 | 1 |
+------+-------+

Now the summary includes an entry for every hour of the day. The LEFT JOIN forces the output to include a row for every record in the reference table, regardless of the contents of the mail table.

The example just shown uses the reference table with a LEFT JOIN to fill in holes in the category list. By rewriting the query slightly, you can also use the reference table to find holes in the datasetthat is, to determine which categories are not present in the data to be summarized. The following query shows those hours of the day during which no messages were sent by using a HAVING clause that selects only summary rows with a zero count:

mysql> SELECT ref.h AS hour, COUNT(HOUR(mail.t)) AS count
 -> FROM ref LEFT JOIN mail ON ref.h = HOUR(mail.t)
 -> GROUP BY hour
 -> HAVING count = 0;
+------+-------+
| hour | count |
+------+-------+
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
| 5 | 0 |
| 6 | 0 |
| 16 | 0 |
| 18 | 0 |
| 19 | 0 |
| 20 | 0 |
| 21 | 0 |
+------+-------+

In this case, it's possible to write a simpler query, based on the fact that each hour value appears in the reference table only once. This means that no GROUP BY is necessary; just look for reference rows that don't match any mail table rows:

mysql> SELECT ref.h AS hour
 -> FROM ref LEFT JOIN mail ON ref.h = HOUR(mail.t)
 -> WHERE mail.t IS NULL;
+------+
| hour |
+------+
| 0 |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 16 |
| 18 |
| 19 |
| 20 |
| 21 |
+------+

This query also has the advantage of not producing a count column (which is extraneous anyway, because the counts are always zero).

Reference tables that contain a list of categories are quite useful for summary queries, but creating such tables manually can be a mind-numbing and error-prone exercise. If a category list has a lot of entries, you might find it preferable to write a script that uses the endpoints of the range of category values to generate the reference table for you. In essence, this type of script acts as an iterator that generates a record for each value in the range. The following Perl script, make_date_list.pl, shows an example of this approach. It creates a reference table containing a row for every date in a particular date range:

#! /usr/bin/perl -w
# make_date_list.pl - create a table with an entry for every date in a
# given date range. The table can be used in a LEFT JOIN with a data table
# when producing a summary, to make sure that every date appears in the
# summary, whether or not the data table actually contains any values for
# a given day.

# Usage: make_date_list.pl tbl_name col_name min_date max_date

# This script assumes that you're using the cookbook database.

use strict;
use lib qw(/usr/local/apache/lib/perl);
use Cookbook;

# Check number of arguments, perform minimal tests for ISO-format dates

@ARGV == 4
 or die "Usage: make_date_list.pl tbl_name col_name min_date max_date
";
my ($tbl_name, $col_name, $min_date, $max_date) = (@ARGV);
$min_date =~ /^d+Dd+Dd+$/
 or die "Minimum date $min_date is not in ISO format
";
$max_date =~ /^d+Dd+Dd+$/
 or die "Maximum date $max_date is not in ISO format
";

my $dbh = Cookbook::connect ( );

# Determine the number of days spanned by the date range.

my $days = $dbh->selectrow_array (qq{ SELECT TO_DAYS(?) - TO_DAYS(?) + 1 },
 undef, $max_date, $min_date);

print "Minimum date: $min_date
";
print "Maximum date: $max_date
";
print "Number of days spanned by range: $days
";
die "Date range is too small
" if $days < 1;

# Drop table if it exists, then recreate it

$dbh->do ("DROP TABLE IF EXISTS $tbl_name");
$dbh->do (qq{
 CREATE TABLE $tbl_name
 ($col_name DATE NOT NULL, PRIMARY KEY ($col_name))
 });

# Populate table with each date in the date range

my $sth = $dbh->prepare (qq{
 INSERT INTO $tbl_name ($col_name) VALUES(DATE_ADD(?,INTERVAL ? DAY))
 });
for (my $i = 0; $i < $days; $i++)
{
 $sth->execute ($min_date, $i);
}

$dbh->disconnect ( );
exit (0);

Tables generated by make_date_list.pl can be used for per-day summaries, or to find days not represented in the table. A date-based reference table can be used for calendar-day summaries, too. For example, you could use it to summarize the baseball1.com master table to find out how many ballplayers in the table were born each day of the year, or to find days of the year for which there are no birthdays. When creating a calendar day reference table, be sure to use a leap year so that the table contains an entry for February 29. The year 2004 is one such year, so a suitable reference table can be created like this:

% make_date_list.pl ref d 2004-01-01 2004-12-31

The master table stores birth dates in three columns named birthday, birthmonth, birthyear. After creating the reference table, use the following query to summarize birthdays in the master table for each calendar day:

SELECT
MONTH(ref.d) AS month, DAYOFMONTH(ref.d) AS day,
COUNT(master.lahmanid) AS count
FROM ref LEFT JOIN master
 ON MONTH(ref.d) = master.birthmonth
 AND DAYOFMONTH(ref.d) = master.birthday
GROUP BY month, day;

To see if there are any days on which no birthdays occur, use this query instead:

SELECT MONTH(ref.d) AS month, DAYOFMONTH(ref.d) AS day
FROM ref LEFT JOIN master
 ON MONTH(ref.d) = master.birthmonth
 AND DAYOFMONTH(ref.d) = master.birthday
WHERE master.birthmonth IS NULL and master.birthday IS NULL;

Using the mysql Client Program

Writing MySQL-Based Programs

Record Selection Techniques

Working with Strings

Working with Dates and Times

Sorting Query Results

Generating Summaries

Modifying Tables with ALTER TABLE

Obtaining and Using Metadata

Importing and Exporting Data

Generating and Using Sequences

Using Multiple Tables

Statistical Techniques

Handling Duplicates

Performing Transactions

Introduction to MySQL on the Web

Incorporating Query Resultsinto Web Pages

Processing Web Input with MySQL

Using MySQL-Based Web Session Management

Appendix A. Obtaining MySQL Software

Appendix B. JSP and Tomcat Primer

Appendix C. References