7.14 Working with Duplicates7.14.1 ProblemYou've just imported a new set of rows into your buffer tables. However, you forgot to delete the previous import. Now you have a table with some duplicate rows. You need a set of tools to eliminate these duplicates and to prevent them from occurring in the future. Let's say that you uploaded some data from your sales agent to the Bookstore table; however, due to a mistake in the code, the upload doubled some rows: BookId Name Quantity Price Type ----------- ------------------------ ----------- ------- ---------- 1 Software Engineering 5 15.00 Manual 2 Modern Operating Systems 7 20.00 Reference 3 Learn SQL 15 18.00 Textbook 4 Learn Advanced SQL 1 8.00 Textbook 5 JavaScript Tutorial 5 10.00 Textbook 6 Modern Operating Systems 7 20.00 Reference 7 Learn SQL 15 18.00 Textbook 7.14.2 SolutionWe'll demonstrate a technique for dealing with duplicates on the Bookstore table. Please note that the BookId column is always unique and that it is used for identifying otherwise duplicated rows. This will be useful when only one representative of a duplicated set needs to be singled out. The queries in this recipe show different ways to count duplicate rows, report on duplicate rows, extract nonduplicated rows, and eliminate duplicate rows. 7.14.2.1 ReductionThe term reduction refers to the elimination of duplicate rows. The following query returns all nonduplicate rows from the Bookstore table: SELECT DISTINCT * FROM Bookstore Because the DISTINCT keyword was used, any two rows containing the same set of values will be combined into one row in the query's result set. Because this query includes the primary-key column BookId, it's of no practical use in our particular case; some books still appear twice. 7.14.2.2 Selective reductionSelective reduction is reduction that is performed only on a subset of the columns in a table. The following query retrieves each unique combination of name, price, and color from the Stock table: SELECT DISTINCT Name, Type, Price FROM Bookstore The table stores five different books: Name Type Price ---------------------------------------- -------------------- --------- JavaScript Tutorial Textbook 10.00 Learn Advanced SQL Textbook 8.00 Learn SQL Textbook 18.00 Modern Operating Systems Reference 20.00 Software Engineering Manual 15.00 This time, because we did not include the always unique primary key, each book is listed only once. 7.14.2.3 Selecting duplicatesThe following query shows how you can find duplicate name, price, and color combinations: SELECT Name, Type, Price FROM Bookstore GROUP BY Name, Type, Price HAVING count(*) > 1 The HAVING clause in this query ensures that if a given name, price, and color combination is only represented once, it will not be listed in the results from this query. Any such combinations that occur more than once will be listed: Name Type Price ---------------------------------------- -------------------- --------- Learn SQL Textbook 18.00 Modern Operating Systems Reference 20.00 7.14.2.4 Counting occurrencesThe following query produces a report showing the number of occurrences of each combination of values in a given set of columns. The results are sorted in order of highest occurrence. SELECT Name, Type, Price, count(*) Count FROM Bookstore GROUP BY Name, Type, Price ORDER BY count(*) DESC The query lists all combinations of name, price, and type. It also lists the number of times each combination occurs. Combinations with the greatest number of occurrences will be listed first: Name Type Price Count ------------------------- -------------------- ------------ ----------- Learn SQL Textbook 18.00 2 Modern Operating Systems Reference 20.00 2 JavaScript Tutorial Textbook 10.00 1 Learn Advanced SQL Textbook 8.00 1 Software Engineering Manual 15.00 1 To see the same information, but only for cases where the occurrence count is greater than 1, use the following variation on the query: SELECT Name, Type, Price, count(*) Count FROM Bookstore GROUP BY Name, Type, Price HAVING count(*) > 1 ORDER BY count(*) DESC This second version of the query uses a HAVING clause to restrict the results to only those combinations of name, price, and type that occur multiple times: Name Type Price Count ------------------------ -------------------- ------------ ----------- Learn SQL Textbook 18.00 2 Modern Operating Systems Reference 20.00 2 7.14.2.5 Selecting by number of occurrencesThe following query selects duplicates over a subset of columns where the occurrence count is at least 3: SELECT Name, Type, Price, count(*) Count FROM Bookstore GROUP BY Name, Type, Price HAVING count(*) >= 3 You can replace the constant 3 in this query with any other occurrence count threshold of interest. 7.14.2.6 Selecting nonduplicatesThe following query selects nonduplicated rows over a subset of columns, ignoring the ones that are duplicated: SELECT Name, Type, Price FROM Bookstore GROUP BY Name, Type, Price HAVING count(*)= 1 In this example, rows representing id, name, and qty combinations that occur more than once will not be returned in the result set: Name Type Price ------------------------ -------------------- ------------ JavaScript Tutorial Textbook 10.00 Learn Advanced SQL Textbook 8.00 Software Engineering Manual 15.00 7.14.2.7 Selecting duplicates with an odd occurrence countIf you're interested in duplicates where the occurrence count is an odd number other than 1, use a query such as this: SELECT Name, Type, Price, count(*) Count FROM Bookstore GROUP BY Name, Type, Price HAVING count(*) % 2 = 1 AND count(*)>1 Since 1 is considered an odd number, but does not imply duplication, the WHERE clause specifically excludes rows with an occurrence count of 1. 7.14.2.8 Selecting duplicates with an even occurrence countA similar query can be used to retrieve duplicates where the occurrence count is an even number: SELECT Name, Type, Price, count(*) Count FROM Bookstore GROUP BY Name, Type, Price HAVING count(*) % 2 = 0 Note that there is no need to specifically exclude rows with an occurrence count of 1, because 1 is not an even number. 7.14.2.9 Deleting duplicate rowsYou may find yourself faced with a situation where you want to arbitrarily delete duplicate data from a table. Data like this is often the result of running an import procedure twice. As long as you have at least one unique value, such as a primary key value, to identify each row, you can arbitrarily delete duplicates using a single DELETE statement. The statement in the following example deletes duplicate rows, arbitrarily retaining one record for each fruit and color combination: DELETE FROM Warehouse WHERE BookId NOT IN ( SELECT MAX(BookId) FROM Bookstore GROUP BY Name, Type HAVING COUNT(*) > 1) The subquery in this example identifies the highest ID number value for each name and color combination. This identifies the row that we are arbitrarily going to keep. All other rows with that name and color combination are deleted. The key here is that the ID is unique for all rows in a name and type combination. 7.14.2.10 Preventing duplicatesTo ensure that no duplicates can be inserted into a table, create a unique index using the UNIQUE INDEX clause, or create a UNIQUE constraint while creating a table. For example, the following statement creates an index named BookstoreInd that does not allow any two rows in the Bookstore table to have the same combination of values for the Name, Type, and Price columns: CREATE UNIQUE INDEX BookstoreInd on Bookstore (Name, Type, Price) 7.14.3 DiscussionSQL has built-in support for enforcing an occurrence count of 1 on a table. This support can be in the form of a UNIQUE constraint or a unique index on the table. To enforce an occurrence count other than 1, you'll need to use triggers. When using the SELECT DISTINCT clause in a SELECT statement, you should be very careful about performance. At first glance, SELECT DISTINCT may seem like a good tool to use; in reality, such a statement can be very expensive to execute. Using the DISTINCT keyword in a SELECT statement causes a sorting operation to occur when such query is run. The only way that SQL Server can identify distinct values is to first sort all the results enough to group duplicates together. Then only one value, the distinct value, is returned from each set of duplicates. Because of the performance impact of the sort, you should avoid the use of DISTINCT unless it is absolutely necessary. The following query shows a typical example of the DISTINCT keyword being used unnecessarily. The query uses a subquery within an EXISTS predicate to check for rows in table B that correspond to those in table A: SELECT * FROM A WHERE EXISTS( SELECT DISTINCT * FROM B WHERE B.id=A.id) This code returns the correct results, but the use of DISTINCT in the subquery hurts performance. This is because DISTINCT forces a sort to occur for each execution of the subquery. Since the subquery is a correlated subquery ” executed for each row ” the result is a lot of sort activity. The purpose of the EXISTS predicate is to test for the existence of at least one row by executing a subquery. Since only one row is needed to cause a TRUE result to be returned, there's no point in using DISTINCT. It doesn't matter if more than one row satisfies the subquery: one is enough . By omitting DISTINCT, you allow the optimizer to stop when that one row is found. |