Hack 53. Clean Inconsistent Records

Sometimes you have to import tables with redundant data that is "nearly" right. There are some tricks you can use to make the process easier.

You may come across a system where a one-to-many relationship has been implemented incorrectly. There is no easy way to recover from this. Someone is going to have to sift through the data by hand to fix any anomalies.

Look at the table of library books shown in Table 7-12.

Table 7-12. book, a poorly normalized table with inconsistent data
barCode	isbn	title	author
BC001	0140430423	Hard Times	Charles Dickens
BC002	0586089195	Bluebeard	Turk Vonnegut
BC003	0586089195	Bluebeard	Kurt Vonnegut
BC004	0586089195	Bluebeard	Kurt Vonnegut

The library has three copies of Bluebeard and the book title and author have been typed in three times. This has led to a problem because the author's name, Kurt, has been misspelled as "Turk" for BC002. Once you have the data properly normalized that sort of inconsistency will be impossible to create.

To highlight the inconsistencies you can run a GROUP BY query that shows two examples of the different spellings and a count of the number of inconsistencies:

mysql> SELECT isbn,MIN(author) AS Example1
 -> ,MAX(author) AS Example2
 -> ,COUNT(DISTINCT author) AS NumberOfVariations
 -> FROM book
 -> GROUP BY isbn
 -> HAVING COUNT(DISTINCT author)>1;
+------------+---------------+---------------+--------------------+
| isbn | Example1 | Example2 | NumberOfVariations |
+------------+---------------+---------------+--------------------+
| 0586089195 | Kurt Konnegut | Turt Konnegut | 2 |
+------------+---------------+---------------+--------------------+

Armed with this list you can set about cleaning the data. Someone has to go into the database and fix the spellings or delete the incorrect records. If there are three or more different misspellings this query will show only the first and last examples, but it will tell you how many problems need to be fixed.

That first query is good for getting an idea of the scale of the problem and is fine if there are only a handful of errors. If your list is long or the number of variations is high you will need a more detailed report before you set to work fixing the data.

A more detailed report shows every different spelling of the title, together with the "popularity" of that spelling. So the title of the book whose ISBN is 014027944X has been spelled "Armadillo" four times, and "Armadilo" and "Armidillo" once each:

mysql> SELECT isbn, title
 -> ,MIN(barCode) FirstBarCode
 -> ,NULLIF(MAX(barCode),MIN(barCode)) LastbarCode
 -> ,COUNT(barCode) NumBarCodes
 -> FROM book
 -> WHERE isbn IN (SELECT isbn FROM book
 -> GROUP BY isbn
 -> HAVING COUNT(distinct TITLE)>1)
 -> GROUP BY isbn,title
 -> ORDER BY 1 ASC,5 DESC,2 ASC;
+------------+------------------+--------------+-------------+-------------+
| isbn | title | FirstBarCode | LastbarCode | NumBarCodes |
+------------+------------------+--------------+-------------+-------------+
| 014027944X | Armadillo | BC006 | BC010 | 4 |
| 014027944X | Armadilo | BC009 | NULL | 1 |
| 014027944X | Armidillo | BC005 | NULL | 1 |
| 0571225535 | Lake Wobegon Days| BC011 | BC018 | 5 |
| 0571225535 | Lake Wobegon | BC015 | BC017 | 2 |
+------------+------------------+--------------+-------------+-------------+

The use of MAX and MIN to show just two values strikes a compromise. You could list every single title that has contributed to an inconsistency, but most of these titles will be right. In this listing the right spelling is listed only once, and that is appropriate because you want to highlight the wrong spellings.

7.5.1. Normalize the Data

Once all of the problems have been fixed, you can normalize the data. You can prepare a publication table that will hold details of each title:

CREATE TABLE publication
(isbn CHAR(10) NOT NULL PRIMARY KEY
,title VARCHAR(50)
,author VARCHAR(50)
);

You can populate the new table from the existing database:

INSERT INTO publication(isbn, title, author)
 SELECT isbn,title,author
 FROM book
 GROUP BY isbn,title,author

If you missed some inconsistencies in titles or authors during the data cleaning stage, this query will fail with an error because it will attempt to insert two records with the same ISBN value. That is a useful check on the process.

You can now enforce referential integrity between the book table and the publication table:

ALTER TABLE book ADD FOREIGN KEY (isbn)
 REFERENCES publication(isbn)

You can remove the now redundant columns from the book table:

ALTER TABLE book DROP COLUMN title
ALTER TABLE book DROP COLUMN author

You can create a view that looks just like the original, denormalized table, but without the errors. A simple JOIN will do it:

CREATE VIEW orginalBook AS
 SELECT barcode, book.isbn, title, author
 FROM book JOIN publication ON (book.isbn=publication.isbn)

SQL Fundamentals

Joins, Unions, and Views

Text Handling

Date Handling

Number Crunching

Online Applications

Organizing Data

Storing Small Amounts of Data

Locking and Performance

Reporting

Users and Administration

Wider Access

Index