With an anti-join, we retrieve all rows from one table for which there is no matching row in another table. There are a number of ways of expressing anti-joins in MySQL.
Perhaps the most natural way of writing an anti-join is to express it as a NOT IN subquery. For instance, Example 21-7 returns all of the customers who are not employees.
Example 21-7. Example of an anti-join using NOT IN
SELECT count(*) FROM customers WHERE (contact_surname,contact_firstname, date_of_birth) NOT IN (SELECT surname,firstname, date_of_birth FROM employees) Short Explain ------------- 1 PRIMARY select(ALL) on customers using no key Using where 2 DEPENDENT SUBQUERY select(ALL) on employees using no key Using where
Another way to express this query is to use a NOT EXISTS subquery. Just as MySQL will rewrite IN subqueries to use the EXISTS clause, so too will MySQL rewrite a NOT IN subquery as a NOT EXISTS. So, from MySQL's perspective, Example 21-7 and Example 21-8 are equivalent.
Example 21-8. Example of an anti-join using NOT EXISTS
SELECT count(*) FROM customers WHERE NOT EXISTS (SELECT * FROM employees WHERE surname=customers.contact_surname AND firstname=customers.contact_firstname AND date_of_birth=customers.date_of_birth) Short Explain ------------- 1 PRIMARY select(ALL) on customers using no key Using where 2 DEPENDENT SUBQUERY select(ALL) on employees using no key Using where
A third but somewhat less natural way to express this query is to use a LEFT JOIN. This is a join in which all rows from the first table are returned even if there is no matching row in the second table. NULLs are returned for columns from the second table that do not have a matching row.
In Example 21-9 we join customers to employees and return NULL values for all of the employees who are not also customers. We can use this characteristic to eliminate the customers who are not also employees by testing for a NULL in a normally NOT NULL customer column.
Example 21-9. Example of an anti-join using LEFT JOIN
SELECT count(*) FROM customers LEFT JOIN employees ON (customers.contact_surname=employees.surname and customers.contact_firstname=employees.firstname and customers.date_of_birth=employees.date_of_birth) WHERE employees.surname IS NULL Short Explain ------------- 1 SIMPLE select(ALL) on customers using no key 1 SIMPLE select(ALL) on employees using no key Using where; Not exists
21.2.1. Optimizing an Anti-Join
The guidelines for optimizing anti-joins using subqueries or left joins are identical to the guidelines for optimizing normal subqueries or joins. Scalability and good performance will be achieved only if we create an index to optimize the subquery or the join. For the previous examples, this would mean creating an index on customer names as follows:[*]
[*] It might occur to you that creating an index on customers would produce a better join than the index on employees. However, LEFT JOINs can only be performed with the table that will return all rows as the first table in the jointhis means that the join order can only be customers to employees, and therefore the index to support the join must be on employees.
CREATE INDEX i_customers_name ON employees(surname,firstname,date_of_birth);
Figure 21-6 shows the massive performance improvements that result when we create a supporting index for an anti-join.
Figure 21-6. Comparison of anti-join techniques
Figure 21-6 also shows a substantial performance advantage for the NOT IN subquery over NOT EXISTS or LEFT JOIN when there is no index to support the anti-join. We noted earlier that MySQL rewrites the NOT IN-based statement to a NOT EXISTS, so it is at first surprising that there should be a performance difference. However, examination of the NOT IN rewrite reveals a number of undocumented compiler directives within the rewritten SQL that appear to give NOT IN a substantial performance advantage in the absence of an index.
Not only is the LEFT JOIN technique slower than NOT IN or NOT EXISTS, but it degrades much faster as the quantity of data to be processed increases. Figure 21-7 shows that the LEFT JOIN version of the anti-join degrades much more rapidly as the size of the tables being joined increasesthis is the opposite of the effect shown for normal subqueries, where the join solution was found to be more scalable than the subquery solution (refer to Figure 21-3).