Recipe 12.13. Identifying and Removing Mismatched or Unattached Rows

Problem

You have two datasets that are related, but possibly imperfectly so. You want to determine whether there are records in either dataset that are "unattached" (not matched by any record in the other dataset), and perhaps remove them if so. This might occur, for example, when you receive data from an external source and must check it to verify its integrity.

Solution

Use a LEFT JOIN to identify unmatched values in each table. If there are any and you want to get rid of them, use a multiple-table DELETE statement. It's also possible to identify or remove nonmatching rows by using NOT IN subqueries.

Discussion

Inner joins are useful for identifying relationships, and outer joins are useful for identifying the lack of relationship. This property of outer joins is valuable when you have datasets that are supposed to be related but for which the relationship might be imperfect.

Mismatches between datasets can occur if you receive two datafiles from an external source that are supposed to be related but for which the integrity of the relationship actually is imperfect. It can also occur as an anticipated consequence of a deliberate action. Suppose that an online discussion board uses a parent table that lists discussion topics and a child table that rows the articles posted for each topic. If you purge the child table of old article rows, that may result in any given topic row in the parent table no longer having any children. If so, the lack of recent postings for the topic indicates that it is probably dead and that the parent row in the topic table can be deleted, too. In such a situation, you delete a set of child rows with the explicit recognition that the operation may strand parent rows and cause them to become eligible for being deleted as well.

However you arrive at the point where related tables have unmatched rows, you can analyze and modify them using SQL statements. Specifically, restoring their relationship is a matter of identifying the unattached rows and then deleting them:

To identify unattached rows, use a LEFT JOIN, because this is a "find unmatched rows" problem. (See Section 12.2 for information about LEFT JOIN.)
To delete rows that are unmatched, use a multiple-table DELETE statement that specifies which rows to remove using a similar LEFT JOIN.

The presence of unmatched data is useful to know about because you can alert whoever gave you the data. This may be a signal of a flaw in the data collection method that must be corrected. For example, with sales data, a missing region might mean that some regional manager didn't report in and that the omission was overlooked.

The following example shows how to identify and remove mismatched rows using two datasets that describe sales regions and volume of sales per region. One dataset contains the ID and location of each sales region:

mysql> SELECT * FROM sales_region ORDER BY region_id; +-----------+------------------------+ | region_id | name                   | +-----------+------------------------+ |         1 | London, United Kingdom | |         2 | Madrid, Spain          | |         3 | Berlin, Germany        | |         4 | Athens, Greece         | +-----------+------------------------+

The other dataset contains sales volume figures. Each row contains the amount of sales for a given quarter of a year and indicates the sales region to which the row applies:

mysql> SELECT * FROM sales_volume ORDER BY region_id, year, quarter; +-----------+------+---------+--------+ | region_id | year | quarter | volume | +-----------+------+---------+--------+ |         1 | 2006 |       1 | 100400 | |         1 | 2006 |       2 | 120000 | |         3 | 2006 |       1 | 280000 | |         3 | 2006 |       2 | 250000 | |         5 | 2006 |       1 |  18000 | |         5 | 2006 |       2 |  32000 | +-----------+------+---------+--------+

A little visual inspection reveals that neither table is fully matched by the other. Sales regions 2 and 4 are not represented in the sales volume table, and the sales volume table contains rows for region 5, which is not in the sales region table. But we don't want to check the tables by inspection. We want to find unmatched rows by using SQL statements that do the work for us.

Mismatch identification is a matter of using outer joins. For example, to find sales regions for which there are no sales volume rows, use the following LEFT JOIN:

mysql> SELECT sales_region.region_id AS 'unmatched region row IDs'     -> FROM sales_region LEFT JOIN sales_volume     ->   ON sales_region.region_id = sales_volume.region_id     -> WHERE sales_volume.region_id IS NULL; +--------------------------+ | unmatched region row IDs | +--------------------------+ |                        2 | |                        4 | +--------------------------+

Conversely, to find sales volume rows that are not associated with any known region, reverse the roles of the two tables:

mysql> SELECT sales_volume.region_id AS 'unmatched volume row IDs'     -> FROM sales_volume LEFT JOIN sales_region     ->   ON sales_volume.region_id = sales_region.region_id     -> WHERE sales_region.region_id IS NULL; +--------------------------+ | unmatched volume row IDs | +--------------------------+ |                        5 | |                        5 | +--------------------------+

In this case, an ID appears more than once in the list if there are multiple volume rows for a missing region. To see each unmatched ID only once, use SELECT DISTINCT:

mysql> SELECT DISTINCT sales_volume.region_id AS 'unmatched volume row IDs'     -> FROM sales_volume LEFT JOIN sales_region     ->   ON sales_volume.region_id = sales_region.region_id     -> WHERE sales_region.region_id IS NULL +--------------------------+ | unmatched volume row IDs | +--------------------------+ |                        5 | +--------------------------+

To get rid of unmatched rows, you can use their IDs in a multiple-table DELETE statement. To construct the proper multiple-table DELETE statement for removing unmatched rows from a table, just take the SELECT statement that you use to identify those rows, and replace the stuff leading up to the FROM keyword with DELETE tbl_name. For example, the SELECT that identifies childless parents looks like this:

SELECT sales_region.region_id AS 'unmatched region row IDs' FROM sales_region LEFT JOIN sales_volume   ON sales_region.region_id = sales_volume.region_id WHERE sales_volume.region_id IS NULL;

The corresponding DELETE looks like this:

DELETE sales_region FROM sales_region LEFT JOIN sales_volume   ON sales_region.region_id = sales_volume.region_id WHERE sales_volume.region_id IS NULL;

Conversely, the statement to identify parentless children is as follows:

SELECT sales_volume.region_id AS 'unmatched volume row IDs' FROM sales_volume LEFT JOIN sales_region   ON sales_volume.region_id = sales_region.region_id WHERE sales_region.region_id IS NULL;

And the corresponding DELETE statement removes them:

DELETE sales_volume FROM sales_volume LEFT JOIN sales_region   ON sales_volume.region_id = sales_region.region_id WHERE sales_region.region_id IS NULL;

You can also identify or delete mismatched rows by using NOT IN subqueries. The statements to display or remove sales_region rows that match no sales_volume rows look like this:

SELECT region_id AS 'unmatched region row IDs' FROM sales_region WHERE region_id NOT IN (SELECT region_id FROM sales_volume); DELETE FROM sales_region WHERE region_id NOT IN (SELECT region_id FROM sales_volume);

The statements to identify or delete mismatched sales_volume rows are similar but have the roles of the tables reversed:

SELECT region_id AS 'unmatched volume row IDs' FROM sales_volume WHERE region_id NOT IN (SELECT region_id FROM sales_region); DELETE FROM sales_volume WHERE region_id NOT IN (SELECT region_id FROM sales_region);

Using Foreign Keys to Enforce Referential Integrity

One feature a database system offers to help you maintain consistency between tables is the ability to define foreign key relationships. This means you can specify explicitly in the table definition that a primary key in a parent table (such as the region_id column of the sales_region table) is a parent to a key in another table (the region_id column in the sales_volume table). By defining the ID column in the child table as a foreign key to the ID column in the parent, the database system can enforce certain constraints against illegal operations. For example, it can prevent you from creating a child row with an ID that is not present in the parent or from deleting parent rows without also deleting the corresponding child rows first. A foreign key implementation may also offer cascaded delete and update: if you delete or update a parent row, the database engine cascades the effect of the delete or update to any child tables and automatically deletes or updates the child rows for you. The InnoDB storage engine in MySQL supports foreign keys and cascaded deletes and updates.