Subqueries | Inside Microsoft SQL Server 7.0 (Mps)

SQL Server has an extremely powerful capability to nest queries, which provides a natural and efficient way to express WHERE clause criteria in terms of the results of other queries. You can express most joins as subqueries, although this method is often less efficient than performing the join operation. For example, using the pubs database to find all employees of the New Moon Books publishing company, we can write the query as either a join (using ANSI join syntax) or as a subquery.

Here's the query as a join ( equijoin , or inner join):

 SELECT emp_id, lname FROM employee JOIN publishers ON employee.pub_id=publishers.pub_id WHERE pub_name='New Moon Books'

This is the query as a subquery:

 SELECT emp_id, lname FROM employee  WHERE employee.pub_id IN      (SELECT publishers.pub_id       FROM publishers WHERE pub_name='New Moon Books')

You can write a join (equijoin) as a subquery (subselect), but the converse isn't necessarily true. The equijoin has the advantage in that the two sides of the equation equal each other and the order doesn't matter. Clearly, in a subquery, it does matter which query is the nested query. This difference is why a join can often be evaluated more quickly than an apparently equivalent subquery ( subselect ). A join gives the optimizer more options to choose from.

Performing relatively complex operations is simple when you use subqueries. For example, earlier we saw that the pubs sample database has four rows in the authors table that have no related row in the titleauthor table (which prompted our outer-join discussion). The following simple subquery returns those four author rows:

 SELECT 'Author ID'=A.au_id,     'Author'=CONVERT(varchar(20), RTRIM(au_lname) +', '     + RTRIM(au_fname)), state FROM authors A  WHERE A.au_id NOT IN      (SELECT B.au_id FROM titleauthor B)

Here's the output:

 Author ID          Author                     state ---------          -------------------        ----- 341-22-1782        Smith, Meander             KS 527-72-3246        Greene, Morningstar        TN 724-08-9931        Stringer, Dirk             CA 893-72-1158        McBadden, Heather          CA

The IN operation is commonly used for subqueries, either to find matching values (similar to a join) or to find nonmatching values by negating it (NOT IN), as shown above. Using the IN predicate is actually equivalent to saying = ANY. If we wanted to find every row in authors that had at least one entry in the titleauthor table, we could use either of these queries.

Here's the query using IN:

 SELECT 'Author ID'=A.au_id,     'Author'=CONVERT(varchar(20), RTRIM(au_lname) + ', '     + RTRIM(au_fname)), state FROM authors A  WHERE A.au_id IN      (SELECT B.au_id FROM titleauthor B)

This is the query using equivalent formulation with = ANY:

 SELECT 'Author ID'=A.au_id,     'Author'=CONVERT(varchar(20), RTRIM(au_lname) + , '     + RTRIM(au_fname)), state FROM authors A  WHERE A.au_id=ANY      (SELECT B.au_id FROM titleauthor B)

Here's the output:

 Author ID          Author                 state ---------          --------------------   ----- 172-32-1176        White, Johnson         CA 213-46-8915        Green, Marjorie        CA 238-95-7766        Carson, Cheryl         CA 267-41-2394        O'Leary, Michael       CA 274-80-9391        Straight, Dean         CA 409-56-7008        Bennet, Abraham        CA 427-17-2319        Dull, Ann              CA 472-27-2349        Gringlesby, Burt       CA 486-29-1786        Locksley, Charlene     CA 648-92-1872        Blotchet-Halls, Regi   OR 672-71-3249        Yokomoto, Akiko        CA 712-45-1867        del Castillo, Innes    MI 722-51-5454        DeFrance, Michel       IN 724-80-9391        MacFeather, Stearns    CA 756-30-7391        Karsen, Livia          CA 807-91-6654        Panteley, Sylvia       MD 846-92-7186        Hunter, Sheryl         CA 899-46-2035        Ringer, Anne           UT 998-72-3567        Ringer, Albert         UT

Each of these formulations is equivalent to testing the value of au_id in the titles table to the au_id value in the first row in the titleauthor table, and then OR'ing it to a test of the au_id value of the second row, and then OR'ing it to a test of the value of the third row, and so on. As soon as one row evaluates to TRUE, the expression is TRUE, and further checking can stop because the row in titles qualifies. However, it's an easy mistake to conclude that NOT IN must be equivalent to <> ANY, and some otherwise good discussions of the SQL language have made this exact mistake. And, more significantly, some products have also erroneously implemented it as such. Although IN is equivalent to = ANY, NOT IN is instead equivalent to <> ALL, not to <> ANY.

NOTE
Careful reading of the ANSI SQL-92 specifications also reveals that NOT IN is equivalent to <> ALL but is not equivalent to <> ANY. Section 8.4 of the specifications shows that R NOT IN T is equivalent to NOT (R = ANY T). Furthermore, careful study of section 8.7 < quantified comparison predicate> reveals that NOT (R = ANY T) is TRUE if and only if R <> ALL T is TRUE. In other words, NOT IN is equivalent to <> ALL.

By using NOT IN, you're stating that none of the corresponding values can match. In other words, all of the values must not match (<> ALL), and if even one does match, it's FALSE. With <> ANY, as soon as one value is found to be not equivalent, the expression is TRUE. This, of course, is also the case for every row of authors ”rows in titleauthor will always exist for other au_id values, and hence all authors rows will have at least one nonmatching row in titleauthor . That is, every row in authors will evaluate to TRUE for a test of <> ANY row in titleauthor .

The following query using <> ALL returns the same four rows as the earlier one that used NOT IN:

 SELECT 'Author ID'=A.au_id,     'Author'=CONVERT(varchar(20), RTRIM(au_lname) + ', '     + RTRIM(au_fname)), state  FROM authors A  WHERE A.au_id <> ALL      (SELECT B.au_id FROM titleauthor B)

The output follows :

 Author ID          Author                     state ---------          -------------------        ----- 341-22-1782        Smith, Meander             KS 527-72-3246        Greene, Morningstar        TN 724-08-9931        Stringer, Dirk             CA 893-72-1158        McBadden, Heather          CA

If we made the mistake of thinking that because IN is equivalent to = ANY, then NOT IN is equivalent to <> ANY, we would write the query as follows. This returns all 23 rows in the authors table!

 SELECT 'Author ID'=A.au_id,     'Author'=CONVERT(varchar(20), RTRIM(au_lname) + ', '     + RTRIM(au_fname)), state  FROM authors A  WHERE A.au_id <> ANY      (SELECT B.au_id FROM titleauthor B)

Here's the output:

 Author ID          Author                 state ---------          --------------------   ----- 172-32-1176        White, Johnson         CA 213-46-8915        Green, Marjorie        CA 238-95-7766        Carson, Cheryl         CA 267-41-2394        O'Leary, Michael       CA 274-80-9391        Straight, Dean         CA 341-22-1782        Smith, Meander         KS 409-56-7008        Bennet, Abraham        CA 427-17-2319        Dull, Ann              CA 472-27-2349        Gringlesby, Burt       CA 486-29-1786        Locksley, Charlene     CA 527-72-3246        Greene, Morningstar    TN 648-92-1872        Blotchet-Halls, Regi   OR 672-71-3249        Yokomoto, Akiko        CA 712-45-1867        del Castillo, Innes    MI 722-51-5454        DeFrance, Michel       IN 724-08-9931        Stringer, Dirk         CA 724-80-9391        MacFeather, Stearns    CA 756-30-7391        Karsen, Livia          CA 807-91-6654        Panteley, Sylvia       MD 846-92-7186        Hunter, Sheryl         CA 893-72-1158        McBadden, Heather      CA 899-46-2035        Ringer, Anne           UT 998-72-3567        Ringer, Albert         UT

The examples just shown use IN, NOT IN, ANY, and ALL to compare values to a set of values from a subquery. This is common. However, it's also common to use expressions and compare a set of values to a single, scalar value. For example, to find titles whose royalties exceed the average of all royalty values in the roysched table by 25 percent or more, you could use this simple query:

 SELECT titles.title_id, title, royalty FROM titles WHERE titles.royalty >=     (SELECT 1.25 * AVG(roysched.royalty) FROM roysched)

This query is perfectly good, because the aggregate function AVG ( expression ) stipulates that the subquery must return exactly one value and no more. Without using IN, ANY, or ALL (or their negations), a subquery that returned more than one row would result in an error. If we incorrectly rewrote the query as follows, without the AVG function, we'd get run-time error 512:

 SELECT titles.title_id, title, royalty FROM titles WHERE titles.royalty >=     (SELECT 1.25 * roysched.royalty FROM roysched)

This returns the following output:

 Msg 512, Level 16, State 1 Subquery returned more than 1 value. This is illegal when the  subquery follows =, !=, <, <=, >, >=, or when the subquery is used as  an expression.

It is significant that this error is a run-time error and not a syntax error, meaning that in the SQL Server implementation, if that subquery didn't produce more than one row, the query would be considered valid and would execute. For example, with the knowledge that the subquery here would return only one row, this query is valid and returns six rows:

 SELECT titles.title_id, title, royalty FROM titles WHERE titles.royalty >=     (SELECT 1.25*roysched.royalty FROM roysched      WHERE roysched.title_id='MC3021' AND lorange=0)

The output follows:

 title_id      title --------      ----------------------------------------- BU2075        You Can Combat Computer Stress! MC2222        Silicon Valley Gastronomic Treats MC3021        The Gourmet Microwave PC1035        But Is It User Friendly? PS2091        Is Anger the Enemy? TC4203        Fifty Years in Buckingham Palace Kitchens

However, this sort of query can be dangerous, and you should avoid it or use it only when you know a PRIMARY KEY or UNIQUE constraint will ensure that the subquery returns only one value. The query here appears to work, but it's a bug waiting to happen. As soon as another row is added to the roysched table, say with title_id of MC3021 and a lorange of 0, the query returns an error ”and no constraint exists to prevent such a row from being added.

You might argue that SQL Server should determine whether a query formation could conceivably return more than one row regardless of the data at the time and then disallow such a subquery formulation. The decision to allow this stemmed from the philosophy that such a query might be quite valid when the database relationships are properly understood , so the power shouldn't be limited to try to protect na ve users. Whether you agree with this philosophy or not, it's consistent with SQL in general ”and you should know by now that you can easily write a perfectly legal, syntactically correct query that answers a question entirely different from the one you thought you were asking!

Correlated Subqueries

You can use powerful correlated subqueries to compare specific rows of one table to a condition in a matching table. For each row otherwise qualifying in the main (or top) query, the subquery is evaluated. Conceptually, a correlated subquery is similar to a loop in programming, although it's entirely without procedural constructs such as do-while or for . The results of each execution of the subquery must be correlated to a row of the main query. In the next example, for every row in the titles table that has a price of $19.99 or less, the row is compared with sales in stores in California, for which the revenue ( price — qty ) is greater than $250. In other words, "Show me titles with prices of under $20 that have sales of more than $250 in California."

 SELECT T.title_id,title  FROM titles T WHERE price <= 19.99 AND T.title_id IN (     SELECT S.title_id FROM sales S, stores ST     WHERE S.stor_id=ST.stor_id     AND ST.state='CA' AND S.qty*T.price > 250     AND T.title_id=S.title_id)

Here's the result:

 title_id        title --------        ----------------------------- BU7832          Straight Talk About Computers PS2091          Is Anger the Enemy? TC7777          Sushi, Anyone?

Notice that this correlated subquery, like many subqueries, could have been written as a join (here using the old-style JOIN syntax):

 SELECT T.title_id, T.title  FROM sales S, stores ST, titles T WHERE S.stor_id=ST.stor_id AND T.title_id=S.title_id  AND ST.state='CA' AND T.price <= 19.99 AND S.qty*T.price > 250

It becomes nearly impossible to create alternative joins when the subquery isn't doing a simple IN or when it uses aggregate functions. For example, suppose that we want to find titles that lag in sales for each store. This could be defined as "Find any title for every store in which the title's sales in that store are below 80 percent of the average of sales for all stores that carry that title, and ignore titles that have no price established (that is, the price is NULL)." An intuitive way to do this is to first think of the main query that will give us the gross sales for each title and store, and then for each such result, do a subquery that finds the average gross sales for the title for all stores. Then we correlate the subquery and the main query, keeping only rows that fall below the 80 percent standard.

Such an example follows. For clarity, notice the two distinct queries, each of which answers a separate question. Then notice how they can be combined into a single correlated query to answer the specific question posed here. All three queries use the old-style JOIN syntax.

 -- This query computes gross revenues by -- title for every title and store SELECT T.title_id, S.stor_id, ST.stor_name, city, state,     T.price*S.qty FROM titles AS T, sales AS S, stores AS ST WHERE T.title_id=S.title_id AND S.stor_id=ST.stor_id -- This query computes 80% of the average gross revenue for each -- title for all stores carrying that title: SELECT T2.title_id, .80*AVG(price*qty) FROM titles AS T2, sales AS S2  WHERE T2.title_id=S2.title_id GROUP BY T2.title_id -- Correlated subquery that finds store-title combinations whose -- revenues are less than 80% of the average of revenues for that -- title for all stores selling that title SELECT T.title_id, S.stor_id, ST.stor_name, city, state,     Revenue=T.price*S.qty FROM titles AS T, sales AS S, stores AS ST WHERE T.title_id=S.title_id AND S.stor_id=ST.stor_id AND T.price*S.qty <      (SELECT 0.80*AVG(price*qty)     FROM titles T2, sales S2     WHERE T2.title_id=S2.title_id     AND T.title_id=T2.title_id )

And the answer is (from the third query):

 title_id  stor_id  stor_name            city       state  Revenue --------  -------  -------------------  ----       -----  ------- BU1032    6380     Eric the Read Books  Seattle    WA     99.95 MC3021    8042     Bookbeat             Portland   OR     44.85 PS2091    6380     Eric the Read Books  Seattle    WA     32.85 PS2091    7067     News & Brews         Los Gatos  CA     109.50 PS2091    7131     Doc-U-Mat: Quality   Remulade   WA     219.00                    Laundry and Books

When the newer ANSI JOIN syntax was first introduced, it wasn't obvious how to use it to write a correlated subquery. And it could be that the creators of the syntax forgot about the correlated subquery case, because using the syntax seems like a hybrid of the old and the new: the correlation is still done in the WHERE clause, rather than in the JOIN clause. For illustration, examine the two equivalent formulations of the above query using the ANSI JOIN syntax shown below.

 SELECT T.title_id, S.stor_id, ST.stor_name, city, state,     Revenue=T.price*S.qty FROM titles AS T JOIN sales AS S ON T.title_id=S.title_id     JOIN stores AS ST ON S.stor_id=ST.stor_id  WHERE T.price*S.qty <      (SELECT 0.80*AVG(price*qty)     FROM titles T2 JOIN sales S2 ON T2.title_id=S2.title_id     WHERE T.title_id=T2.title_id ) SELECT T.title_id, S.stor_id, ST.stor_name, city, state,     Revenue=T.price*S.qty FROM titles AS T JOIN sales AS S      ON T.title_id=S.title_id      AND T.price*S.qty <         (SELECT 0.80*AVG(T2.price*S2.qty)         FROM sales AS S2 JOIN titles AS T2             ON T2.title_id=S2.title_id         WHERE T.title_id=T2.title_id)     JOIN stores AS ST ON S.stor_id=ST.stor_id

Often, correlated subqueries use the EXISTS statement, which is the most convenient syntax to use when multiple fields of the main query are to be correlated to the subquery. (In practice, EXISTS is seldom used other than with correlated subqueries.) EXISTS simply checks for a nonempty set. It returns (internally) either TRUE or NOT TRUE, which we won't refer to as FALSE, given the issues of three-valued logic and NULL. Because no column value is returned, TRUE or NOT TRUE, convention dictates that a column list isn't specified and also dictates that the asterisk (*) character is used instead.

A common use for EXISTS is to answer a query such as "Show me the titles for which no stores have sales."

 SELECT T1.title_id, title FROM titles T1 WHERE NOT EXISTS (SELECT *  FROM titles T2 JOIN sales S ON (T2.title_id=S.title_id) WHERE T2.title_id=T1.title_id ) title_id    title --------    ---------------------------------- MC3026      The Psychology of Computer Cooking PC9999      Net Etiquette

Conceptually, this query is pretty straightforward. The subquery, a simple equijoin, finds all matches of titles and sales . Then NOT EXISTS correlates titles to those matches, looking for titles that don't have even a single row returned in the subquery.

Another common use of EXISTS determines whether a table is empty. The optimizer knows that as soon as it gets a single hit using EXISTS, the operation is TRUE, and further processing is unnecessary. For example, here's how you determine whether the authors table is empty:

 SELECT 'Not Empty' WHERE EXISTS (SELECT * FROM authors)

Earlier, when discussing outer join, we mentioned that it's now possible to use an outer-join formulation to address what was traditionally a problem in need of a correlated subquery solution. Here's an outer-join formulation for the problem described earlier, "Show me the titles for which no stores have sales."

 SELECT T1.title_id, title FROM titles T1      LEFT OUTER JOIN sales S ON T1.title_id=S.title_id  WHERE S.title_id IS NULL

TIP
Depending on your data and indexes, the outer-join formulation might be faster or slower than a correlated subquery. But before automatically deciding to write your query one way or the other, you might want to come up with a couple of alternative formulations, and choose the one that's fastest in your situation.
In this example, for which little data exists, both solutions run in subsecond elapsed time. But the outer-join query requires fewer than half the number of logical I/Os than does the correlated subquery. With more data, that difference would be significant.

This query works by joining the stores and title tables and by preserving the titles for which no store exists. Then, in the WHERE clause, it specifically chooses only the rows that it preserved in the outer join. Those rows are the ones for which a title had no matching store.

Other times, a correlated subquery might be preferable to a join, especially if it's a self-join back to the same table or some other exotic join. Here's an example. Given the following table (and assuming that the row_num column is guaranteed unique), suppose that we want to identify the rows for which col2 and col3 are duplicates of another row:

 row_num        col2        col3 -------        ----        ---- 1              C           D  2              A           A  3              A           D  4              C           B  5              C           C  6              B           C  7              C           A  8              C           B  9              C           D  10             D           D

We can do this in two standard ways. The first uses a self-join. In a self-join, the table (or view) is used multiple times in the FROM clause and aliased at least once. Then it can be treated as an entirely different table, comparing columns , and so on. A self-join to find the rows having duplicate values for col2 and col3 is easy to understand:

 SELECT DISTINCT A.row_num, A.col2, A.col3 FROM match_cols AS A, match_cols AS B WHERE A.col2=B.col2 AND A.col3=B.col3 AND A.row_num <> B.row_num ORDER BY A.col2, A.col3 row_num        col2        col3 -------        ----        ---- 4              C           B  8              C           B  1              C           D  9              C           D

But in this case, a correlated subquery using aggregate functions provides a considerably more efficient solution, especially if many duplicates exist:

 SELECT A.row_num, A.col2, A.col3 FROM match_cols AS A WHERE EXISTS (SELECT B.col2, B.col3 FROM match_cols AS B                   WHERE B.col2=A.col2                   AND B.col3=A.col3                   GROUP BY B.col2, B.col3 HAVING COUNT(*) > 1) ORDER BY A.col2, A.col3

This correlated subquery has another advantage over the self-join example ”the row_num column doesn't need to be unique to solve the problem at hand.

You can take a correlated subquery a step further to ask a seemingly simple question that's surprisingly tricky to answer in SQL: "Show me the stores that have sold every title." Ron Soukup uses a variation of this question when interviewing testing candidates who profess strong SQL knowledge. Even though it seems reasonable, relatively few candidates can answer this request, especially if we throw in the restrictions that they aren't allowed to use an aggregate function like COUNT(*) and that the solution must be a single SELECT statement (that is, creating temporary tables or the like isn't allowed).

The previous query already revealed two titles that no store has sold, so we know that with the existing dataset, no stores can have sales for all titles. For illustrative purposes, let's add sales records for a hypothetical store that does, in fact, have sales for every title. Following that, we'll see the query that finds all stores that have sold every title (which we know ahead of time is only the phony one we're entering here):

 -- The phony store INSERT stores (stor_id, stor_name, stor_address, city, state, zip) VALUES ('9999', 'WE SUPPLY IT ALL', 'One Main St', 'Poulsbo',     'WA','98370') -- By using a combination of hard-coded values and selecting every -- title, generate a sales row for every title INSERT sales (stor_id, title_id, ord_num, ord_date, qty, payterms) SELECT '9999', title_id, 'PHONY1', GETDATE(), 10, 'Net 60' FROM titles  -- Find stores that supply every title SELECT ST.stor_id, ST.stor_name, ST.city, ST.state FROM stores ST WHERE NOT EXISTS     (SELECT * FROM titles T1     WHERE NOT EXISTS         (SELECT *          FROM titles T2 JOIN sales S ON (T2.title_id=S.title_id)         WHERE T2.title_id=T1.title_id AND ST.stor_id=S.stor_id)     )

Here's the result:

 stor_id        stor_name           city       state -------        ---------           ----       ----- 9999           WE SUPPLY IT ALL    Poulsbo    WA

Although this query might be difficult to think of immediately, you can easily understand why it works. In English, it says "Show me the store(s) such that no titles exist that the store doesn't sell." This query consists of the two subqueries that are applied to each store. The bottommost subquery produces all the titles that the store has sold. The upper subquery is then correlated to that bottom one to look for any titles that are not in the list of those that the store has sold. The top query returns any stores that aren't in this list. This type of query is known as a relational division, and unfortunately , it isn't as easy to express as we'd like. Although the query shown above is quite understandable, once you have a solid foundation in SQL, it's hardly intuitive. As is almost always the case, you could probably use other formulations to write this query.

We've already alluded to doing this query without using an aggregate function like COUNT. There's nothing wrong with using an aggregate function ”Ron imposes the restriction only to make his test more of a brainteaser. If you think of the query in English as "Find the stores that have sold as many unique titles as there are total unique titles," you'll find the following formulation somewhat more intuitive. Of those who get this brainteaser right, this formulation (or a slight variation) is probably the most popular (if we don't have the restriction of not using the COUNT function). It actually runs slightly faster in SQL Server 7 than the NOT EXISTS formulation mentioned earlier.

 SELECT ST.stor_id, ST.stor_name  FROM stores ST, sales SA, titles T  WHERE SA.stor_id=ST.stor_id AND SA.title_id=T.title_id GROUP BY ST.stor_id,ST.stor_name  HAVING COUNT(DISTINCT SA.title_id)=(SELECT COUNT(*) FROM titles T1)

The following formulation runs much more efficiently than either of the previous two. The syntax is similar to the one above but still novel in its approach because it's just a standard subquery, not a join or a correlated subquery. You might think it's an illegal query, since it does a GROUP BY and a HAVING without an aggregate in the select list of the first subquery. But that's OK, both in terms of what SQL Server allows and in terms of the ANSI specification. What isn't allowed is having an item in the select list that isn't an aggregate function but then omitting it from the GROUP BY clause, if there is a GROUP BY clause.

 -- Find stores that have sold every title SELECT stor_id, stor_name FROM stores WHERE stores.stor_id IN  (SELECT stor_id FROM sales GROUP BY stor_id      HAVING COUNT(DISTINCT title_id)=(SELECT COUNT(*) FROM titles) )

And as a lead-in to the next topic, here's a formulation that uses a derived table, a feature that allows you to use a subquery in a FROM clause. This capability takes a SELECT statement and aliases it as though it were a table, and then it allows you to select from the select list. This query also runs efficiently.

 SELECT ST.stor_id, ST.stor_name FROM stores ST,          (SELECT stor_id, COUNT(DISTINCT title_id) AS title_count         FROM sales         GROUP BY stor_id     ) as SA WHERE ST.stor_id=SA.stor_id AND SA.title_count=     (SELECT COUNT(*) FROM titles)