You can make several syntax choices when writing subqueries. The major subquery predicates are IN and EXISTS, but we'll also look at negation, duplicate elimination , double subqueries, and set operation alternatives.
In a slight plurality of cases, the plan for IN subqueries is in-to-outthat is, the DBMS starts on the inside and drives to the outer query. We will generalize and say that in-to-out plans occur for subqueries that (a) begin with <comparison operator> [ANY ALL] and (b) contain no correlations . Although far from reliable, this rule is easy to remember. Here's a simple example:
SELECT * from Table1 WHERE Table1.column1 IN (SELECT Table2.column1 FROM Table2 WHERE Table2.column1 = 5)
This example subquery has a restriction in it ( WHERE Table2.column1 = 5 ), which is a signal that the subquery has few rows. That's what you want. If a subquery is in-to-out, then the relationship should be few-rows-to-many.
Let's look at what one mature DBMS, IBM, does with this example. According to IBM's EXPLAIN facility, the general plan is as follows :
For the loop in the fifth step, it's vital that Table1.column1 be indexed, but it's irrelevant if Table2.column1 is indexed.
You can throw a wrench into the works by changing IN (subquery) to NOT IN (subquery) . Doing so makes the ratio many-to-few instead of few-to-many, and that means processing should be out-to-in. For an out-to-in query, IBM makes a temporary index on the work table and does the final step in reverse.
The IBM plan for IN includes steps for sorting and eliminating duplicates. That makes sense because the best optimizations are ones that reduce the row count in the driver. In addition, a slight speedup occurs if the rows in the outer query are in order by Table1.column1 , because the lookups will occur in ascending order by value. There is a way to force other DBMSs to follow a similar plan: It is legal to add the word DISTINCT in the inner query, like this:
SELECT * from Table1 WHERE Table1.column1 IN (SELECT DISTINCT Table2.column1 FROM Table2 WHERE Table2.column1 = 5) GAIN: 2/7
Logically, DISTINCT is superfluous here, but we're looking for some common side effects of DISTINCT: (a) it causes ORDER BY, and (b) it reduces the number of rows in the inner query. In our tests, Table2.column1 contained integers, most of them duplicates, and it had a high likelihood that smaller numbers would match values in Table1 . The explanation for the gain is that, by eliminating duplicates and by making it more probable that "matches" (i.e., small numbers) would appear early, DISTINCT sped things up with some DBMSs. Other DBMSs just ignored DISTINCT, so no harm was done.
Another side effect of DISTINCT, but more frequently seen with GROUP BY or HAVING, is materialization that is, to evaluate some expressions, the DBMS will create a temporary table, put the rows from the original table(s) into the temporary table, and select from the temporary copy. Scanning is faster if the inner loop happens on a materialized "view" of the table, which is smaller than the table itself. Materialization is also a factor that discourages the DBMS from flattening. As we've suggested before: If DISTINCT is a do-nothing operator because there's no duplication, then the relation is not one-to-many, and therefore a join is probably better than a subquery.
Though DISTINCT takes time, the rows are short if you're just retrieving one column in the subquery. But DISTINCT does involve a sort, so do consider the sort tips provided in Chapter 3, "ORDER BY."
The sort of subquery that [NOT] EXISTS introduces has a notable characteristic: It contains correlations. Again, we could say that technically it doesn't have to and we're generalizing shamelessly, but we really do want to talk about queries of the general form:
Query #1: SELECT * FROM Table1 WHERE EXISTS (SELECT * FROM Table2 WHERE Table1.column1 = Table2.column1)
Query #1 at least is better than the alternative expression:
Query #2: SELECT * FROM Table1 WHERE 0 < (SELECT COUNT(*) FROM Table2 WHERE Table1.column1 = Table2.column1) GAIN: -4/7
The reason the COUNT(*) alternative (Query #2) is slower than Query #1 is that a subquery's nested-loop breaks out as soon as there's a match. That's like stopping the count when you reach the number one. Not that anybody would really consider the COUNT(*) alternative, but it illustrates nicely the maximDon't ask for more precision than is needed to answer the question. The condition > 0 (or 0 < ) is often a hint of an underlying "Yes/No" question.
Some DBMSs will repeat a correlated subquery for each occurrence of the outer row. You can check if your DBMS is one of them by doing a simple test: Double the number of outer rows, and see if the statement takes about twice as long. We did this with the Big Eight (GAIN: 2/7). The result shows that two of the Big Eight do slow down if the outer query gets bigger. For such DBMSs, it's good if you can put a significant restriction on the outer query. Doing so won't harm performance overall.
Indexing both Table1.column1 and Table2.column1 is advantageous because either one might be the object of an index-lookup.
IN or EXISTS?
If IN and EXISTS have different plans, and if IN can be transformed to EXISTS, which should you use: IN or EXISTS? Consider these two equivalent queries:
SELECT * FROM Table1 WHERE Table1.column1 IN (SELECT Table2.column1 FROM Table2) SELECT * FROM Table1 WHERE EXISTS (SELECT * FROM Table2 WHERE Table1.column1 = Table2.column1)
Which query is better? It's impossible to say by just looking at them, but if the scenarios were slightly different, or if we knew a little more about the tables, we'd be able to make some tentative guesses:
For each scenario, the derivation follows from the reasoning we've gone through to this point.
Sometimes a subquery will add sets together. Here's an example:
SELECT * FROM Table1 WHERE EXISTS (SELECT * FROM Table2 WHERE Table1.column1 = Table2.column1 UNION SELECT * FROM Table3 WHERE Table1.column2 = Table3.column1)
There are two problems with this example. First, many DBMSs don't support UNION in a subquery. Second, it's slow! It's better to transform queries like this one to a SELECT with an outer query that contains two IN subqueries. Here's how:
SELECT * FROM Table1 WHERE Table1.column1 IN (SELECT Table2.column1 FROM Table2) OR Table1.column2 IN (SELECT Table3.column1 FROM Table3) GAIN: 2/3
Informix, Ingres, InterBase, and Sybase don't allow UNION in subqueries.The gain shown is for only three DBMSs.
Here's another example. Query #1 could be transformed to Query #2:
Query #1 SELECT * FROM Table1 WHERE Table1.column1 IN (SELECT Table2.column1 FROM Table2) AND Table1.column2 IN (SELECT Table2.column2 FROM Table2) Query #2 SELECT * FROM Table1 WHERE (Table1.column1, Table1.column2) IN (SELECT Table2.column1, Table2.column2 FROM Table2)
But this transform works only if the DBMS supports row subqueriesa rare feature. If your DBMS doesn't support row subqueries, you can still get the effect of the transform by using arithmetic or concatenation, like this:
SELECT * FROM Table1 WHERE Table1.column1 ',' Table1.column2 IN (SELECT Table2.column1 ',' Table2.column2 FROM Table2) GAIN: 4/6
Don't do this for Sybase; it shows a loss. The gain shown is for only six DBMSs.
Here's a final example. Query #1 transforms into Query #2:
Query #1: SELECT * FROM Table1 WHERE Table1.column1 IN (SELECT Table2.column1 FROM Table2 WHERE Table2.column1 IN (SELECT Table3.column1 FROM Table3)) Query #2: SELECT * FROM Table1 WHERE Table1.column1 IN (SELECT Table2.column1 FROM Table2) AND Table1.column1 IN (SELECT Table3.column1 FROM Table3) GAIN: 7/7
It's better to merge two subqueries into one.
Some DBMSs support a nonstandard SQL extension TOP <number> clause for the SELECT statement. TOP (or its equivalent) finds the first [number] rows that match the SELECT requirements. For example, Microsoft lets you do this:
SELECT * FROM Table1 X WHERE X.column1 IN (SELECT TOP 5 Y.column1 FROM Table2 Y WHERE Y.column2 = X.column2 ORDER BY Y.column1 DESC)
The example looks only for the first five rows that are true for the subquery, then evaluates the rest of the query with that data. (Microsoft also allows ORDER BY in a subquery if a TOP clause is in the select list.) Table 6-2 shows the TOP equivalent support provided by the Big Eight.
Table 6-2. DBMSs and TOP Equivalent
Conditions of the form > ALL (subquery) have a slight portability problem. If the subquery has zero rows, then most DBMSs correctly say the condition is true but some incorrectly say the condition is false . Fortunately, none of the Big Eight have this flaw.
Be that as it may, > ALL conditions should be replaced by > ANY conditions with a MAX in the subquery. For example, Query #1 should be replaced by Query #2:
Query #1: SELECT * FROM Table1 WHERE column1 > ALL (SELECT column1 FROM Table2) Query #2: SELECT * FROM Table1 WHERE column1 > ANY (SELECT MAX(column1) FROM Table2) GAIN: 5/7
The idea in this example is that if Table2.column1 is indexed, then the DBMS may have an optimal way to find MAX(column1) . (Note that these queries can return different results if Table2 is empty, or if Table2.column1 contains NULLs.)
A subquery is a type of set operation. Suppose you have a set of rows from an outer query ( OUTER ) and a set of rows from an inner query ( INNER ). For IN subqueries, ( OUTER ) values must contain ( INNER ) values. For NOT IN subqueries, ( OUTER ) values must not contain ( INNER ) values. Both of these operations could also be done with explicit set operators, namely INTERSECT and EXCEPT. Many DBMSs don't support INTERSECT or EXCEPT, although all DBMSs except MySQL support another set operator: UNION. Table 6-3 shows the SQL Standard requirements and the level of support the Big Eight have for the set operators. Notes on Table 6-3:
Table 6-3. ANSI/DBMS Support for Set Operators
It matters little whether a DBMS supports INTERSECT, but the EXCEPT operator is useful for transforming NOT IN. For example, Query #1 can be transformed to Query #2:
Query #1: SELECT column1 FROM Table1 WHERE Table1.column2 NOT IN (SELECT column2 FROM Table2) Query #2: SELECT column1 FROM Table1 EXCEPT SELECT column1 FROM Table1 WHERE Table1.column2 IN (SELECT column2 FROM Table2) GAIN: 2/3
Informix, Ingres, InterBase, and Microsoft don't support EXCEPT. The gain shown is for only three DBMSs.
The gain happens because IN conditions use indexes. NOT IN conditions don't use indexes.
The Bottom Line: Syntax Choices
In general, in-to-out plans occur for subqueries that (a) begin with <comparison operator> [ANY ALL] and (b) contain no correlations.
If a subquery is in-to-out, then the relationship should be few-rows-to-many.
The best optimizations are ones that reduce the row count in the driver.
Add DISTINCT to the inner query, even when it's logically superfluous. DISTINCT causes ORDER BY, and it reduces the number of rows in the inner query.
Scanning is faster if the inner loop happens on a materialized view of the table, which is smaller than the table itself. DISTINCT helps here as well.
Materialization is a factor that discourages the DBMS from flattening.
If DISTINCT is a do-nothing operator because there's no duplication, then the relation is not one-to-many. A join is probably better than a subquery in such cases.
[NOT] EXISTS subqueries generally contain correlations.
If you can put a significant restriction on the outer query, that's goodeven if the restriction has to be repeated in the inner query.
Because you can't break out of the outer loop when using [NOT] EXISTS, it's more effective to reduce the number of outer-loop iterations.
Use IN for subqueries when the outer table has many rows and the inner table has few rows.
Use EXISTS for subqueries when the outer query has a search condition in addition to the subquery condition.
Use NOT EXISTS for subqueries that have a WHERE NOT outer query.
Don't write queries that contain two subqueries. It's better to merge two subqueries into one.
Conditions of the form > ALL (subquery) have a portability problem. Replace > ALL conditions with > ANY conditions that have a MAX function in the subquery.
The EXCEPT operator is useful for transforming NOT IN subqueries. EXCEPT provides a gain because it changes NOT IN to IN. IN conditions use indexes. NOT IN conditions don't use indexes.