Recipe4.16.Deleting Duplicate Records | SQL Cookbook (Cookbooks (OReilly))

Recipe 4.16. Deleting Duplicate Records

Problem

You want to delete duplicate records from a table. Consider the following table:

  create table dupes (id integer, name varchar(10)) insert into dupes values (1, 'NAPOLEON') insert into dupes values (2, 'DYNAMITE') insert into dupes values (3, 'DYNAMITE') insert into dupes values (4, 'SHE SELLS') insert into dupes values (5, 'SEA SHELLS') insert into dupes values (6, 'SEA SHELLS') insert into dupes values (7, 'SEA SHELLS')  select * from dupes order by 1         ID NAME ---------- ----------          1 NAPOLEON          2 DYNAMITE          3 DYNAMITE          4 SHE SELLS          5 SEA SHELLS          6 SEA SHELLS          7 SEA SHELLS

For each group of duplicate names, such as "SEA SHELLS", you wish to arbitrarily retain one ID and delete the rest. In the case of "SEA SHELLS" you don't care whether you delete 5 and 6, or 5 and 7, or 6 and 7, but in the end you want just one record for "SEA SHELLS".

Solution

Use a subquery with an aggregate function such as MIN to arbitrarily choose the ID to retain (in this case only the NAME with the smallest value for ID is not deleted):

 1  delete from dupes 2   where id not in ( select min(id) 3                        from dupes 4                       group by name )

Discussion

The first thing to do when deleting duplicates is to define exactly what it means for two rows to be considered "duplicates" of each other. For my example in this recipe, the definition of "duplicate" is that two records contain the same value in their NAME column. Having that definition in place, you can look to some other column to discriminate among each set of duplicates, to identify those records to retain. It's best if this discriminating column (or columns) is a primary key. I used the ID column, which is a good choice because no two records have the same ID.

The key to the solution is that you group by the values that are duplicated (by NAME in this case), and then use an aggregate function to pick off just one key value to retain. The subquery in the "Solution" example will return the smallest ID for each NAME, which represents the row you will not delete:

  select min(id)   from dupes  group by name     MIN(ID) -----------           2           1           5           4

The DELETE then deletes any ID in the table that is not returned by the subquery (in this case IDs 3, 6, and 7). If you are having trouble seeing how this works, run the subquery first and include the NAME in the SELECT list:

  select name, min(id)   from dupes  group by name NAME          MIN(ID) ---------- ---------- DYNAMITE            2 NAPOLEON            1 SEA SHELLS          5 SHE SELLS           4

The rows returned by the subquery represent those to be retained. The NOT IN predicate in the DELETE statement causes all other rows to be deleted.