Introduction | Handling Duplicates

Tables or result sets sometimes contain duplicate records. In some cases this is acceptable. For example, if you conduct a web poll that records dates and client IP numbers along with the votes, duplicate records may be allowable, because it's possible for large numbers of votes to appear to originate from the same IP number for an Internet service that routes traffic from its customers through a single proxy host. In other cases, duplicates will be unacceptable, and you'll want to take steps to avoid them. Operations related to handling of duplicate records include the following:

Counting the number of duplicates to determine whether they occur and to what extent.
Identifying duplicated values (or the records containing them) so you can see what they are and where they occur.
Eliminating duplicates to ensure that each record is unique. This may involve removing rows from a table to leave only unique records. Or it may involve selecting a result set in such a way that no duplicates appear in the output. (For example, to display a list of the states in which you have customers, you probably wouldn't want a long list of state names from all customer records. A list showing each state name only once suffices and is easier to understand.)
Preventing duplicates from being created within a table in the first place. If each record in a table is intended to represent a single entity (such as a person, an item in a catalog, or a specific observation in an experiment), the occurrence of duplicates presents significant difficulties in using it that way. Duplicates make it impossible to refer to some records in the table unambiguously, so it's best to make sure duplicates never occur.

Several tools are at your disposal for dealing with duplicate records. These can be chosen according to the objective you're trying to achieve:

Creating a table to include a unique index will prevent duplicates from being added to the table. MySQL will use the index to enforce the requirement that each record in the table contains a unique key in the indexed column or columns.
In conjunction with a unique index, the INSERT IGNORE and REPLACE statements allow you to handle insertion of duplicate records gracefully without generating errors. For bulk-loading operations, the same options are available in the form of the IGNORE or REPLACE modifiers for the LOAD DATA statement.
If you need to determine whether or not a table contains duplicates, GROUP BY categorizes rows into groups, and COUNT( ) shows how many rows are in each group. These are described in Chapter 7 in the context of producing summaries, but they're useful for duplicate counting and identification as well. After all, a counting summary is essentially an operation that groups values into categories to determine how frequently each occurs.
SELECT DISTINCT is useful for removing duplicate rows from a result set to leave only unique records. Adding a unique index to a table can remove duplicates that are present in the table. If you determine that there are n identical records in a table, you can use DELETE ... LIMIT to eliminate n-1 instances from that specific set of rows.

This chapter describes how each of these techniques applies to duplicate identification and removal, but before proceeding further, I should define what "duplicate" means here. When people say "duplicate record," they may mean different things. For purposes of this chapter, one record is a duplicate of another if both rows contain the same values in columns that are supposed to distinguish them. Consider the following table:

mysql> SELECT * FROM person;
+------+-----------+------------+---------------+------+
| id | last_name | first_name | address | age |
+------+-----------+------------+---------------+------+
| 1 | Smith | Jim | 428 Mill Road | 36 |
| 2 | Smith | Joan | 428 Mill Road | 36 |
| 3 | Smith | Junior | 428 Mill Road | 12 |
+------+-----------+------------+---------------+------+

None of these records are duplicates if you compare rows using all the columns, because then the records contain the id and first_name columns, each of which happen to contain only unique values. However, if you look only at the last_name or address columns, all the records contain duplicated values. Lying between these extremes, a result set consisting of the age column contains a mix of unique and duplicated values.

Scripts related to the examples shown in this chapter are located in the dups directory of the recipes distribution. For scripts that create the tables used here, look in the tables directory.

Using the mysql Client Program

Writing MySQL-Based Programs

Record Selection Techniques

Working with Strings

Working with Dates and Times

Sorting Query Results

Generating Summaries

Modifying Tables with ALTER TABLE

Obtaining and Using Metadata

Importing and Exporting Data

Generating and Using Sequences

Using Multiple Tables

Statistical Techniques

Handling Duplicates

Performing Transactions

Introduction to MySQL on the Web

Incorporating Query Resultsinto Web Pages

Processing Web Input with MySQL

Using MySQL-Based Web Session Management

Appendix A. Obtaining MySQL Software

Appendix B. JSP and Tomcat Primer

Appendix C. References