Section 7.14. Working with Duplicates

7.14 Working with Duplicates

7.14.1 Problem

You've just imported a new set of rows into your buffer tables. However, you forgot to delete the previous import. Now you have a table with some duplicate rows. You need a set of tools to eliminate these duplicates and to prevent them from occurring in the future.

Let's say that you uploaded some data from your sales agent to the Bookstore table; however, due to a mistake in the code, the upload doubled some rows:

 BookId      Name                     Quantity    Price   Type                  ----------- ------------------------ ----------- ------- ----------  1           Software Engineering     5           15.00   Manual               2           Modern Operating Systems 7           20.00   Reference            3           Learn SQL                15          18.00   Textbook             4           Learn Advanced SQL       1           8.00    Textbook             5           JavaScript Tutorial      5           10.00   Textbook             6           Modern Operating Systems 7           20.00   Reference            7           Learn SQL                15          18.00   Textbook

7.14.2 Solution

We'll demonstrate a technique for dealing with duplicates on the Bookstore table. Please note that the BookId column is always unique and that it is used for identifying otherwise duplicated rows. This will be useful when only one representative of a duplicated set needs to be singled out.

The queries in this recipe show different ways to count duplicate rows, report on duplicate rows, extract nonduplicated rows, and eliminate duplicate rows.

7.14.2.1 Reduction

The term reduction refers to the elimination of duplicate rows. The following query returns all nonduplicate rows from the Bookstore table:

 SELECT DISTINCT * FROM Bookstore

Because the DISTINCT keyword was used, any two rows containing the same set of values will be combined into one row in the query's result set. Because this query includes the primary-key column BookId, it's of no practical use in our particular case; some books still appear twice.

7.14.2.2 Selective reduction

Selective reduction is reduction that is performed only on a subset of the columns in a table. The following query retrieves each unique combination of name, price, and color from the Stock table:

 SELECT DISTINCT Name, Type, Price FROM Bookstore

The table stores five different books:

 Name                                     Type                 Price         ---------------------------------------- -------------------- ---------  JavaScript Tutorial                      Textbook             10.00 Learn Advanced SQL                       Textbook             8.00 Learn SQL                                Textbook             18.00 Modern Operating Systems                 Reference            20.00 Software Engineering                     Manual               15.00

This time, because we did not include the always unique primary key, each book is listed only once.

7.14.2.3 Selecting duplicates

The following query shows how you can find duplicate name, price, and color combinations:

 SELECT Name, Type, Price  FROM Bookstore GROUP BY Name, Type, Price HAVING count(*) > 1

The HAVING clause in this query ensures that if a given name, price, and color combination is only represented once, it will not be listed in the results from this query. Any such combinations that occur more than once will be listed:

 Name                                     Type                 Price         ---------------------------------------- -------------------- ---------  Learn SQL                                Textbook             18.00 Modern Operating Systems                 Reference            20.00

7.14.2.4 Counting occurrences

The following query produces a report showing the number of occurrences of each combination of values in a given set of columns. The results are sorted in order of highest occurrence.

 SELECT Name, Type, Price, count(*) Count  FROM Bookstore GROUP BY Name, Type, Price  ORDER BY count(*) DESC

The query lists all combinations of name, price, and type. It also lists the number of times each combination occurs. Combinations with the greatest number of occurrences will be listed first:

 Name                      Type                 Price        Count        ------------------------- -------------------- ------------ -----------  Learn SQL                 Textbook             18.00        2 Modern Operating Systems  Reference            20.00        2 JavaScript Tutorial       Textbook             10.00        1 Learn Advanced SQL        Textbook             8.00         1 Software Engineering      Manual               15.00        1

To see the same information, but only for cases where the occurrence count is greater than 1, use the following variation on the query:

 SELECT Name, Type, Price, count(*) Count  FROM Bookstore GROUP BY Name, Type, Price  HAVING count(*) > 1 ORDER BY count(*) DESC

This second version of the query uses a HAVING clause to restrict the results to only those combinations of name, price, and type that occur multiple times:

 Name                     Type                 Price        Count        ------------------------ -------------------- ------------ -----------  Learn SQL                Textbook             18.00        2 Modern Operating Systems Reference            20.00        2

7.14.2.5 Selecting by number of occurrences

The following query selects duplicates over a subset of columns where the occurrence count is at least 3:

 SELECT Name, Type, Price, count(*) Count  FROM Bookstore GROUP BY Name, Type, Price HAVING count(*) >= 3

You can replace the constant 3 in this query with any other occurrence count threshold of interest.

7.14.2.6 Selecting nonduplicates

The following query selects nonduplicated rows over a subset of columns, ignoring the ones that are duplicated:

 SELECT Name, Type, Price FROM Bookstore GROUP BY Name, Type, Price HAVING count(*)= 1

In this example, rows representing id, name, and qty combinations that occur more than once will not be returned in the result set:

 Name                     Type                 Price         ------------------------ -------------------- ------------  JavaScript Tutorial      Textbook             10.00 Learn Advanced SQL       Textbook             8.00 Software Engineering     Manual               15.00

7.14.2.7 Selecting duplicates with an odd occurrence count

If you're interested in duplicates where the occurrence count is an odd number other than 1, use a query such as this:

 SELECT Name, Type, Price, count(*) Count  FROM Bookstore GROUP BY Name, Type, Price HAVING count(*) % 2 = 1 AND count(*)>1

Since 1 is considered an odd number, but does not imply duplication, the WHERE clause specifically excludes rows with an occurrence count of 1.

7.14.2.8 Selecting duplicates with an even occurrence count

A similar query can be used to retrieve duplicates where the occurrence count is an even number:

 SELECT Name, Type, Price, count(*) Count  FROM Bookstore GROUP BY Name, Type, Price HAVING count(*) % 2 = 0

Note that there is no need to specifically exclude rows with an occurrence count of 1, because 1 is not an even number.

7.14.2.9 Deleting duplicate rows

You may find yourself faced with a situation where you want to arbitrarily delete duplicate data from a table. Data like this is often the result of running an import procedure twice. As long as you have at least one unique value, such as a primary key value, to identify each row, you can arbitrarily delete duplicates using a single DELETE statement. The statement in the following example deletes duplicate rows, arbitrarily retaining one record for each fruit and color combination:

 DELETE FROM Warehouse WHERE BookId NOT IN (    SELECT MAX(BookId)    FROM Bookstore    GROUP BY Name, Type    HAVING COUNT(*) > 1)

The subquery in this example identifies the highest ID number value for each name and color combination. This identifies the row that we are arbitrarily going to keep. All other rows with that name and color combination are deleted. The key here is that the ID is unique for all rows in a name and type combination.

7.14.2.10 Preventing duplicates

To ensure that no duplicates can be inserted into a table, create a unique index using the UNIQUE INDEX clause, or create a UNIQUE constraint while creating a table. For example, the following statement creates an index named BookstoreInd that does not allow any two rows in the Bookstore table to have the same combination of values for the Name, Type, and Price columns:

 CREATE UNIQUE INDEX BookstoreInd on Bookstore    (Name, Type, Price)

7.14.3 Discussion

SQL has built-in support for enforcing an occurrence count of 1 on a table. This support can be in the form of a UNIQUE constraint or a unique index on the table. To enforce an occurrence count other than 1, you'll need to use triggers.

When using the SELECT DISTINCT clause in a SELECT statement, you should be very careful about performance. At first glance, SELECT DISTINCT may seem like a good tool to use; in reality, such a statement can be very expensive to execute. Using the DISTINCT keyword in a SELECT statement causes a sorting operation to occur when such query is run. The only way that SQL Server can identify distinct values is to first sort all the results enough to group duplicates together. Then only one value, the distinct value, is returned from each set of duplicates. Because of the performance impact of the sort, you should avoid the use of DISTINCT unless it is absolutely necessary. The following query shows a typical example of the DISTINCT keyword being used unnecessarily. The query uses a subquery within an EXISTS predicate to check for rows in table B that correspond to those in table A:

 SELECT * FROM A  WHERE EXISTS(    SELECT DISTINCT *     FROM B     WHERE B.id=A.id)

This code returns the correct results, but the use of DISTINCT in the subquery hurts performance. This is because DISTINCT forces a sort to occur for each execution of the subquery. Since the subquery is a correlated subquery ” executed for each row ” the result is a lot of sort activity.

The purpose of the EXISTS predicate is to test for the existence of at least one row by executing a subquery. Since only one row is needed to cause a TRUE result to be returned, there's no point in using DISTINCT. It doesn't matter if more than one row satisfies the subquery: one is enough . By omitting DISTINCT, you allow the optimizer to stop when that one row is found.