Section 14.0. Introduction

14.0. Introduction

Tables or result sets sometimes contain duplicate rows. In some cases this is acceptable. For example, if you conduct a web poll that records date and client IP number along with the votes, duplicate rows may be allowable, because it's possible for large numbers of votes to appear to originate from the same IP number for an Internet service that routes traffic from its customers through a single proxy host. In other cases, duplicates will be unacceptable, and you'll want to take steps to avoid them. Operations related to handling of duplicate rows include the following:

Preventing duplicates from being created within a table in the first place. If each row in a table is intended to represent a single entity (such as a person, an item in a catalog, or a specific observation in an experiment), the occurrence of duplicates presents significant difficulties in using it that way. Duplicates make it impossible to refer to every row in the table unambiguously, so it's best to make sure duplicates never occur.
Counting the number of duplicates to determine if they are present and to what extent.
Identifying duplicated values (or the rows containing them) so you can see what they are and where they occur.
Eliminating duplicates to ensure that each row is unique. This may involve removing rows from a table to leave only unique rows. Or it may involve selecting a result set in such a way that no duplicates appear in the output. For example, to display a list of the states in which you have customers, you probably wouldn't want a long list of state names from all customer records. A list showing each state name only once suffices and is easier to understand.

Several tools are at your disposal for dealing with duplicate rows. Choose them according to the objective that you're trying to achieve:

Creating a table to include a primary key or unique index prevents duplicates from being added to the table. MySQL uses the index as a constraint to enforce the requirement that each row in the table contains a unique key in the indexed column or columns.
In conjunction with a unique index, the INSERT IGNORE and REPLACE statements enable you to handle insertion of duplicate rows gracefully without generating errors. For bulk-loading operations, the same options are available in the form of the IGNORE or REPLACE modifiers for the LOAD DATA statement.
If you need to determine whether a table contains duplicates, use GROUP BY to categorize rows into groups, and COUNT⁠(⁠ ⁠ ⁠) to see how many rows are in each group. These techniques are described in Chapter 8, in the context of producing summaries, but they're useful for duplicate counting and identification as well. A counting summary is essentially an operation that groups values into categories to determine how frequently each one occurs.
SELECT DISTINCT is useful for removing duplicate rows from a result set (see Section 3.8 for more information). For an existing table that already contains duplicates, adding a unique index can remove them. If you determine that there are n identical rows in a table, you can use DELETE ... LIMIT to eliminate n1 instances from that specific set of rows.

Scripts related to the examples shown in this chapter are located in the dups directory of the recipes distribution. For scripts that create the tables used here, look in the tables directory.