Recipe 14.2. Dealing with Duplicates When Loading Rows into a Table

Problem

You've created a table with a unique index to prevent duplicate values in the indexed column or columns. But this results in an error if you attempt to insert a duplicate row, and you want to avoid having to deal with such errors.

Solution

One approach is to just ignore the error. Another is to use an INSERT IGNORE, REPLACE, or INSERT ... ON DUPLICATE KEY UPDATE statement, each of which modifies MySQL's duplicate-handling behavior. For bulk-loading operations, LOAD DATA has modifiers that enable you to specify how to handle duplicates.

Discussion

By default, MySQL generates an error when you insert a row that duplicates an existing unique key value. Suppose that the person table has the following structure, with a unique index on the last_name and first_name columns:

CREATE TABLE person (   last_name   CHAR(20) NOT NULL,   first_name  CHAR(20) NOT NULL,   address     CHAR(40),   PRIMARY KEY (last_name, first_name) );

An attempt to insert a row with duplicate values in the indexed columns results in an error:

mysql> INSERT INTO person (last_name, first_name)     -> VALUES('X1','Y1'); Query OK, 1 row affected (0.00 sec) mysql> INSERT INTO person (last_name, first_name)     -> VALUES('X1','Y1'); ERROR 1062 (23000): Duplicate entry 'X1-Y1' for key 1

If you're issuing the statements from the mysql program interactively, you can simply say, "Okay, that didn't work," ignore the error, and continue. But if you write a program to insert the rows, an error may terminate the program. One way to avoid this is to modify the program's error-handling behavior to trap the error and then ignore it. See Section 2.2 for information about error-handling techniques.

If you want to prevent the error from occurring in the first place, you might consider using a two-query method to solve the duplicate-row problem:

Issue a SELECT to see whether the row is already present.
Issue an INSERT if the row is not present.

But that doesn't really work: another client might insert the same row after the SELECT and before the INSERT, in which case the error would still occur for your INSERT. To make sure that doesn't happen, you could use a transaction or lock the tables, but then you've gone from two statements to four. MySQL provides three single-query solutions to the problem of handling duplicate rows. Choose from among them according to the duplicate-handling behavior you want to affect:

Use INSERT IGNORE rather than INSERT if you want to keep the original row when a duplicate occurs. If the row doesn't duplicate an existing row, MySQL inserts it as usual. If the row is a duplicate, the IGNORE keyword tells MySQL to discard it silently without generating an error:
```
mysql> INSERT IGNORE INTO person (last_name, first_name)     -> VALUES('X2','Y2'); Query OK, 1 row affected (0.00 sec) mysql> INSERT IGNORE INTO person (last_name, first_name)     -> VALUES('X2','Y2'); Query OK, 0 rows affected (0.00 sec) 
```
The row count value indicates whether the row was inserted or ignored. From within a program, you can obtain this value by checking the rows-affected function provided by your API (see Recipes Section 2.4 and Section 9.1).
Use REPLACE rather than INSERT if you want to replace the original row with the new one when a duplicate occurs. If the row is new, it's inserted just as with INSERT. If it's a duplicate, the new row replaces the old one:
```
mysql> REPLACE INTO person (last_name, first_name)     -> VALUES('X3','Y3'); Query OK, 1 row affected (0.00 sec) mysql> REPLACE INTO person (last_name, first_name)     -> VALUES('X3','Y3'); Query OK, 2 rows affected (0.00 sec) 
```
The rows-affected value in the second case is 2 because the original row is deleted and the new row is inserted in its place.
Use INSERT ... ON DUPLICATE KEY UPDATE if you want to modify columns of an existing row when a duplicate occurs. If the row is new, it's inserted. If it's a duplicate, the ON DUPLICATE KEY UPDATE clause indicates how to modify the existing row in the table. In other words, this statement can initialize or update a row as necessary. The rows-affected count indicates what happened: 1 for an insert, 2 for an update.

INSERT IGNORE is more efficient than REPLACE because it doesn't actually insert duplicates. Thus, it's most applicable when you just want to make sure a copy of a given row is present in a table. REPLACE, on the other hand, is often more appropriate for tables in which other nonkey columns need to be replaced. INSERT ... ON DUPLICATE KEY UPDATE is appropriate when you must insert a record if it doesn't exist, but just update some of its columns if the new record is a duplicate in the indexed columns.

Suppose that you're maintaining a table named passtbl for a web application that contains email addresses and password hash values, and that is keyed by email address:

CREATE TABLE passtbl (   email    VARCHAR(60) NOT NULL,   password VARBINARY(60) NOT NULL,   PRIMARY KEY (email) );

How do you create new rows for new users, but change passwords of existing rows for existing users? A typical algorithm for handling row maintenance might look like this:

Issue a SELECT to see whether a row already exists with a given email value.
If no such row exists, add a new one with INSERT.
If the row does exist, update it with UPDATE.

These steps must be performed within a transaction or with the tables locked to prevent other users from changing the tables while you're using them. In MySQL, you can use REPLACE to simplify both cases to the same single-statement operation:

REPLACE INTO passtbl (email,password) VALUES(address,hash_value);

If no row with the given email address exists, MySQL creates a new one. If a row does exist, MySQL replaces it; in effect, this updates the password column of the row associated with the address.

INSERT IGNORE and REPLACE are useful when you know exactly what values should be stored in the table when you attempt to insert a row. That's not always the case. For example, you might want to insert a row if it doesn't exist, but update only certain parts of it otherwise. This commonly occurs when you use a table for counting. Suppose that you're recording votes for candidates in polls, using the following table:

CREATE TABLE poll_vote (   poll_id      INT UNSIGNED NOT NULL AUTO_INCREMENT,   candidate_id INT UNSIGNED,   vote_count   INT UNSIGNED,   PRIMARY KEY (poll_id, candidate_id) );

The primary key is the combination of poll and candidate number. The table should be used like this:

The first time a vote is received for a given poll candidate, a new row should be inserted with a vote count of 1.
For all subsequent votes for that poll candidate, the vote count of the existing record should be incremented.

Neither INSERT IGNORE nor REPLACE are appropriate here because for all votes except the first, you don't know what the vote count should be. INSERT ... ON DUPLICATE KEY UPDATE works better here. The following example shows how it works, beginning with an empty table:

mysql> SELECT * FROM poll_vote; Empty set (0.01 sec) mysql> INSERT INTO poll_vote (poll_id,candidate_id,vote_count) VALUES(14,2,1)     -> ON DUPLICATE KEY UPDATE vote_count = vote_count + 1; Query OK, 1 row affected (0.01 sec) mysql> SELECT * FROM poll_vote; +---------+--------------+------------+ | poll_id | candidate_id | vote_count | +---------+--------------+------------+ | 14      | 2            | 1          | +---------+--------------+------------+ 1 row in set (0.01 sec) mysql> INSERT INTO poll_vote (poll_id,candidate_id,vote_count) VALUES(14,2,1)     -> ON DUPLICATE KEY UPDATE vote_count = vote_count + 1; Query OK, 2 rows affected (0.00 sec) mysql> SELECT * FROM poll_vote; +---------+--------------+------------+ | poll_id | candidate_id | vote_count | +---------+--------------+------------+ | 14      | 2            | 2          | +---------+--------------+------------+ 1 row in set (0.00 sec)

For the first INSERT, no row for the candidate exists, so the row is inserted. For the second INSERT, the row already exists, so MySQL just updates the vote count. With INSERT ... ON DUPLICATE KEY UPDATE, there is no need to check whether the row exists because MySQL does that for you. The row count indicates what action the INSERT statement performs: 1 for a new row and 2 for an update to an existing row.

The techniques just described have the benefit of eliminating overhead that might otherwise be required for a transaction. But this benefit comes at the price of portability because they all involve MySQL-specific syntax. If portability is a high priority, you might prefer to stick with a transactional approach.

Recipe 14.2. Dealing with Duplicates When Loading Rows into a Table

Problem

Solution

Discussion

See Also