Recipe 14.4. Eliminating Duplicates from a Table


Problem

You want to remove duplicate rows from a table so that it contains only unique rows.

Solution

Select the unique rows from the table into a second table that you use to replace the original one. Or add a unique index to the table using ALTER TABLE, which will remove duplicates as it builds the index. Or use DELETE ... LIMIT n to remove all but one instance of a specific set of duplicate rows.

Discussion

Section 14.1 discusses how to prevent duplicates from being added to a table by creating it with a unique index. However, if you forget to include a unique index when you create a table, you may discover later that it contains duplicates and that it's necessary to apply some sort of duplicate-removal technique. The catalog_list table used earlier is an example of this, because it contains several instances in which the same person is listed multiple times:

mysql> SELECT * FROM catalog_list ORDER BY last_name, first_name; +-----------+-------------+--------------------------+ | last_name | first_name  | street                   | +-----------+-------------+--------------------------+ | Baxter    | Wallace     | 57 3rd Ave.              | | BAXTER    | WALLACE     | 57 3rd Ave.              | | Baxter    | Wallace     | 57 3rd Ave., Apt 102     | | Brown     | Bartholomew | 432 River Run            | | Isaacson  | Jim         | 515 Fordam St., Apt. 917 | | McTavish  | Taylor      | 432 River Run            | | Pinter    | Marlene     | 9 Sunset Trail           | | Pinter    | Marlene     | 9 Sunset Trail           | +-----------+-------------+--------------------------+ 

The table contains redundant entries and it would be a good idea to remove them, to eliminate duplicate mailings and reduce postage costs. To do this, you have several options:

  • Select the table's unique rows into another table, and then use that table to replace the original one. The result is to remove the table's duplicates. This works when "duplicate" means "the entire row is the same as another."

  • Add a unique index to the table using ALTER TABLE. This operation turns duplicate rows into unique rows, where "duplicate" means "the index values are the same."

  • You can remove duplicates for a specific set of duplicate rows by using DELETE ... LIMIT n to remove all but one row.

This recipe discusses each of these duplicate-removal method. When you consider which of them to choose under various circumstances, the applicability of a given method to a specific problem is often determined by several factors:

  • Does the method require the table to have a unique index?

  • If the columns in which duplicate values occur may contain NULL, will the method remove duplicate NULL values?

  • Does the method prevent duplicates from occurring in the future?

Removing duplicates using table replacement

If a row is considered to duplicate another only if the entire row is the same, one way to eliminate duplicates from a table is to select its unique rows into a new table that has the same structure, and then replace the original table with the new one. To perform table replacement, use the following procedure:

  1. Create a new table that has the same structure as the original one. CREATE TABLE ... LIKE is useful for this (see Section 4.1):

    mysql> CREATE TABLE tmp LIKE catalog_list;                      

  2. Use INSERT INTO ... SELECT DISTINCT to select the unique rows from the original table into the new one:

    mysql> INSERT INTO tmp SELECT DISTINCT * FROM catalog_list;                      

    Select rows from the tmp table to verify that the new table contains no duplicates:

    mysql> SELECT * FROM tmp ORDER BY last_name, first_name; +-----------+-------------+--------------------------+ | last_name | first_name  | street                   | +-----------+-------------+--------------------------+ | Baxter    | Wallace     | 57 3rd Ave.              | | Baxter    | Wallace     | 57 3rd Ave., Apt 102     | | Brown     | Bartholomew | 432 River Run            | | Isaacson  | Jim         | 515 Fordam St., Apt. 917 | | McTavish  | Taylor      | 432 River Run            | | Pinter    | Marlene     | 9 Sunset Trail           | +-----------+-------------+--------------------------+ 

  3. After creating the new tmp table that contains unique rows, use it to replace the original catalog_list table:

    mysql> DROP TABLE catalog_list; mysql> RENAME TABLE tmp TO catalog_list;                      

The effective result of this procedure is that catalog_list no longer contains duplicates.

This table-replacement method works in the absence of an index (although it might be slow for large tables). For tables that contain duplicate NULL values, it removes those duplicates. It does not prevent the occurrence of duplicates in the future.

This method requires rows to be completely identical for rows to be considered duplicates. Thus, it treats as distinct those rows for Wallace Baxter that have slightly different street values.

If duplicates are defined only with respect to a subset of the columns in the table, create a new table that has a unique index for those columns, select rows into it using INSERT IGNORE, and then replace the original table with the new one:

mysql> CREATE TABLE tmp LIKE catalog_list; mysql> ALTER TABLE tmp ADD PRIMARY KEY (last_name, first_name); mysql> INSERT IGNORE INTO tmp SELECT * FROM catalog_list; mysql> SELECT * FROM tmp ORDER BY last_name, first_name; +-----------+-------------+--------------------------+ | last_name | first_name  | street                   | +-----------+-------------+--------------------------+ | Baxter    | Wallace     | 57 3rd Ave.              | | Brown     | Bartholomew | 432 River Run            | | Isaacson  | Jim         | 515 Fordam St., Apt. 917 | | McTavish  | Taylor      | 432 River Run            | | Pinter    | Marlene     | 9 Sunset Trail           | +-----------+-------------+--------------------------+ mysql> DROP TABLE catalog_list; mysql> RENAME TABLE tmp TO catalog_list;                

The unique index prevents rows with duplicate key values from being inserted into tmp, and IGNORE tells MySQL not to stop with an error if a duplicate is found. One shortcoming of this method is that if the indexed columns can contain NULL values, you must use a UNIQUE index rather than a PRIMARY KEY, in which case the index will not remove duplicate NULL keys. (UNIQUE indexes allow multiple NULL values.) This method does prevent occurrence of duplicates in the future.

Removing duplicates by adding an index

To remove duplicates from a table "in place," add a unique index to the table with ALTER TABLE, using the IGNORE keyword to tell it to discard rows with duplicate key values during the index construction process. The original catalog_list table looks like this without an index:

mysql> SELECT * FROM catalog_list ORDER BY last_name, first_name; +-----------+-------------+--------------------------+ | last_name | first_name  | street                   | +-----------+-------------+--------------------------+ | Baxter    | Wallace     | 57 3rd Ave.              | | BAXTER    | WALLACE     | 57 3rd Ave.              | | Baxter    | Wallace     | 57 3rd Ave., Apt 102     | | Brown     | Bartholomew | 432 River Run            | | Isaacson  | Jim         | 515 Fordam St., Apt. 917 | | McTavish  | Taylor      | 432 River Run            | | Pinter    | Marlene     | 9 Sunset Trail           | | Pinter    | Marlene     | 9 Sunset Trail           | +-----------+-------------+--------------------------+ 

Add a unique index, and then check what effect doing so has on the table contents:

mysql> ALTER IGNORE TABLE catalog_list     -> ADD PRIMARY KEY (last_name, first_name); mysql> SELECT * FROM catalog_list ORDER BY last_name, first_name; +-----------+-------------+--------------------------+ | last_name | first_name  | street                   | +-----------+-------------+--------------------------+ | Baxter    | Wallace     | 57 3rd Ave.              | | Brown     | Bartholomew | 432 River Run            | | Isaacson  | Jim         | 515 Fordam St., Apt. 917 | | McTavish  | Taylor      | 432 River Run            | | Pinter    | Marlene     | 9 Sunset Trail           | +-----------+-------------+--------------------------+ 

If the indexed columns can contain NULL, you must use a UNIQUE index rather than a PRIMARY KEY. In that case, the index will not remove duplicate NULL key values. In addition to removing existing duplicates, the method prevents the occurrence of duplicates in the future.

Removing duplicates of a particular row

You can use LIMIT to restrict the effect of a DELETE statement to a subset of the rows that it otherwise would delete. This makes the statement applicable to removing duplicate rows. Suppose that you have a table t with the following contents:

+-------+ | color | +-------+ | blue  | | green | | blue  | | blue  | | red   | | green | | red   | +-------+ 

The table lists blue three times, and green and red twice each. To remove the extra instances of each color, do this:

mysql> DELETE FROM t WHERE color = 'blue' LIMIT 2; mysql> DELETE FROM t WHERE color = 'green' LIMIT 1; mysql> DELETE FROM t WHERE color = 'red' LIMIT 1; mysql> SELECT * FROM t; +-------+ | color | +-------+ | blue  | | green | | red   | +-------+ 

This technique works in the absence of a unique index, and it eliminates duplicate NULL values. It's handy if you want to remove duplicates only for a specific set of rows within a table. However, if there are many different sets of duplicates that you want to remove, this is not a procedure you'd want to carry out by hand. The process can be automated by using the techniques discussed earlier in Section 14.3 for determining which values are duplicated. There, we wrote a make_dup_count_query⁠(⁠ ⁠ ⁠) function to generate the statement needed to count the number of duplicate values in a given set of columns in a table. The result of that statement can be used to generate a set of DELETE ... LIMIT n statements that remove duplicate rows and leave only unique rows. The dups directory of the recipes distribution contains code that shows how to generate these statements.

In general, using DELETE ... LIMIT n is likely to be slower than removing duplicates by using a second table or by adding a unique index. Those methods keep the data on the server side and let the server do all the work. DELETE ... LIMIT n involves a lot of client-server interaction because it uses a SELECT statement to retrieve information about duplicates, followed by several DELETE statements to remove instances of duplicated rows. Also, this technique does not prevent duplicates from occurring in the future.




MySQL Cookbook
MySQL Cookbook
ISBN: 059652708X
EAN: 2147483647
Year: 2004
Pages: 375
Authors: Paul DuBois

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net