Join versus Subquery | SQL Performance Tuning

When you write a join, you're letting the DBMS optimizer decide among a variety of plans, such as merge- sort , nested-loop, or hash; see Chapter 5, "Joins." But when you write a subquery, you are strongly hinting that you want a nested-loop. Both out-to-in and in-to-out queries are nested-loop plansthey merely differ in direction or driver choice. That's one point in favor of writing joins instead of subqueries, because it's better not to tell the DBMS what to do. The best option is to give the DBMS choices.

Nested-loops for subqueries can be different from nested- loops for joins. Listing 6-1 shows a join nested-loop, while Listing 6-2 shows a subquery nested-loop.

Listing 6-1 Nested-Loop for Joins

 for (each row in driver) {   for (each row in driven) {     if "match" add to matchlist   } }

Listing 6-2 Nested-Loop for Subqueries

 for (each row in driver) {   for (each row in driven) {     if "match" EXIT LOOP   } }

The difference between nested-loops for joins and nested-loops for subqueries is that a subquery needs to find only a single match, so it might be able to break out early from the inner loop. In contrast, a join needs all matches, so it must chug through all rows. (Because the subquery loop can abort early the result is sometimes called a "semijoin," but that's a rare word.) And that's one point in favor of writing subqueries, because a subquery loop will have fewer iterations.

To test whether subqueries can beat joins, we made two one-column (INTEGER) tables. Table1 had 1,000 rows, and Table2 had 10,000 rows. Each table was populated with random numbers between 100 and 1,000. There was no clustering or indexing. Then we tested this join and a similar-looking subquery against Table1 and Table2 :

Use IN

How can you encourage early breakout ? Avoid the use of syntax like = (SELECT <query>) . Use = ANY (SELECT <query>) expressionsor the commoner shorthand IN (SELECT <query>) instead, because = (SELECT <query>) has to chug through the whole inner loop before it's certain that there is only one match. In other words, don't use forms like Query #1. Replace them with forms like Query #2 or Query #3:

 Query #1        /* avoid this */ SELECT column1 FROM Table1   WHERE column1 =      (SELECT column1 FROM Table2) Query #2        /* use this */ SELECT column1 FROM Table1   WHERE column1 = ANY      (SELECT column1 FROM Table2) Query #3        /* or use this */ SELECT column1 FROM Table1   WHERE column1 IN      (SELECT column1 FROM Table2)

 Join: SELECT MIN(Table1.column1)    FROM Table1, Table2    WHERE Table1.column1 = Table2.column1 Subquery: SELECT MIN(Table1.column1)    FROM Table1    WHERE Table1.column1 IN       (SELECT Table2.column1          FROM Table2) GAIN: 7/7

Portability

MySQL doesn't support subqueries. The gain on all tests shown in this chapter is for only seven DBMSs.

The result of the test was that the subquery was faster with every DBMS tested.

Here are some other arguments in the join-versus-subquery debate:

Pro join arguments

People who prefer joins over subqueries pose the following arguments:

If the outer query's WHERE clause contains multiple ANDed/ORed conditions, the optimizer can arrange them better if everything is on one levelthat is, if you flatten to a join. Optimizers tend to optimize each query separately.
Some DBMSs (e.g., Oracle) can parallelize joins better than they can parallelize subqueries.
When you do a join, data from both tables can be in the final select list. This isn't true for subqueries.
Simply because people think joins are better, they use them more, and therefore DBMS vendors work harder to make joins better. It's a self-fulfilling prophecy !

Pro subquery arguments

On the other side, people who prefer subqueries over joins argue these points:

Comparison rules might be more relaxed in subqueries. For example, SMALLINTs can be looked up in INTEGER subqueries without major penalties.
Subqueries can contain GROUP BY, HAVING, and set functions. It's more cumbersome to join when such syntax exists, especially if the syntax is hidden within a view definition.
Subqueries can appear in UPDATE statements. Most DBMSs can't update a join.
Subqueries that use a comparison operator without ANY, and subqueries that use NOT EXISTS within NOT EXISTS (the "relational divide" problem), are not expressible using joins with classic syntax.
In mature DBMSs, vendors utilize subquery optimizations similar to those that are used to optimize joins. For example, IBM was once unable to take advantage of indexes when doing in-to-out, but now IBM does use indexes in such situations.

Flattening

To flatten a query means to make everything one level. Based on our observations so far, there are clearly times when you should flatten a subquery. In this section, we'll discuss how the DBMS can flatten automatically, but first let's establish how you can do it yourself.

To flatten IN

Here's an example of a subquery using IN, and the flattened analogous query:

 Subquery with IN: SELECT * FROM Table1   WHERE Table1.column1 IN      (SELECT Table2.column1         FROM Table2         WHERE Table2.column1 = 5) Flattened: SELECT Table1.*   FROM Table1, Table2   WHERE Table1.column1 = Table2.column1     AND Table2.column1 = 5

This simple flattening seems like it should be equivalent to the IN query, but it has a flaw that can be seen if Table1 has these three rows: {1 , 5 , 5} and Table2 has these three rows: {2 , 5 , 5} . Notice that the subquery-based SELECT would correctly return two rows: {5 , 5} , but the flattened SELECT would return four rows: {5 , 5 , 5 , 5} . Notice further that if the flattened SELECT is "corrected" by using SELECT DISTINCT instead of just SELECT, the join would return only one row: {5} . In other words, flattening can cause too many duplicate rowsand if you try to get rid of the duplicates, you can end up with too few duplicate rows. Sometimes you'll see this flaw if the DBMS flattens automatically. (By the way, DISTINCT can be expensive so don't use it if Table2.column1 is uniquely indexed.)

To flatten NOT IN

Here's an example of a subquery using NOT IN, and the flattened equivalent query:

 Subquery with NOT IN: SELECT Table1.column1 FROM Table1   WHERE Table1.column1 NOT IN      (SELECT Table2.column1 FROM Table2) Flattened equivalent: SELECT Table1.column1   FROM Table1 LEFT JOIN Table2        ON Table1.column1 = Table2.column1   WHERE Table2.column1 IS NULL

This type of flattening is rare, and has two flaws: it assumes that ON happens before WHERE as in the SQL Standard, and it assumes that Table2.column1 has no NULLs (the WHERE Table2.column1 IS NULL clause is meant to find dummy rows produced by the outer join).

When is flattening automatic? The rules vary slightly between DBMSs and often seem to reflect some foible in the DBMS rather than some strategy. Amalgamating the rules followed by the various DBMSs, we can say that a flattening is almost certain to happen to subqueries in this scenario:

Flattening is legalthat is, the subquery isn't in an UPDATE statement, contains no implied groupings, and so on.
The outer query does not have any of {<>, NOT EXISTS, AND, OR, CASE} directly before the subquery.
The inner query does not have a join or another inner query.
There is not a one-to-many relation. (The DBMS might determine that by finding a UNIQUE index on the subquery column.)

If you think the DBMS is likely to flatten incorrectly, you can discourage it by changing the conditions. If you think the DBMS is likely to flatten correctly, then you face the questionWhy code with a subquery in the first place? If it's going to be a join anyway, it would be clearer to code it as a join. In other words, if flattening is automatic, that's a reason to flatten manually! We think that's taking things a bit too far. Better to let the DBMS decide, and worry yourself about the portability. Keep in mind that the original subquery code is there because somebody thought it was better.

The Bottom Line: Join versus Subquery

One major point in favor of writing joins instead of subqueries is that it's better not to tell the DBMS what to do. By writing joins, you're giving the DBMS choices instead of forcing it to follow a specific plan.

One major point in favor of writing subqueries is that a subquery loop may have fewer iterations, so subqueries may be faster than joins.

Encourage the DBMS to break out of subquery loops early by using syntax like IN (SELECT <query>) rather than = (SELECT <query>) .

Generally joins are better, but if, on average, several rows are in the driven query for each row in the driver (for example, a primary-to-foreign or other one-to-many relation), and especially if matching rows come first in the driven query, then subqueries are better.