Subqueries | Inside Microsoft SQL Server 2000

SQL Server has an extremely powerful capability for nesting queries that provides a natural and efficient way to express WHERE clause criteria in terms of the results of other queries. You can express most joins as subqueries, although this method is often less efficient than performing a join operation. For example, to use the pubs database to find all employees of the New Moon Books publishing company, you can write the query as either a join (using ANSI join syntax) or as a subquery.

Here's the query as a join (equijoin, or inner join):

 SELECT emp_id, lname FROM employee JOIN publishers ON employee.pub_id=publishers.pub_id WHERE pub_name='New Moon Books'

This is the query as a subquery:

 SELECT emp_id, lname FROM employee WHERE employee.pub_id IN (SELECT publishers.pub_id FROM publishers WHERE pub_name='New Moon Books')

You can write a join (equijoin) as a subquery (subselect), but the converse isn't necessarily true. The equijoin offers an advantage in that the two sides of the equation equal each other and the order doesn't matter. In certain types of subqueries, it does matter which query is the nested query. However, if the query with the subquery can be rephrased as a semantically equivalent JOIN query, the optimizer will do the conversion internally and the performance will be the same whether you write your queries as joins or with subqueries.

Relatively complex operations are simple to perform when you use subqueries. For example, earlier you saw that the pubs sample database has four rows in the authors table that have no related row in the titleauthor table (which prompted our outer-join discussion). The following simple subquery returns those four author rows:

 SELECT 'Author ID'=A.au_id, 'Author'=CONVERT(varchar(20), RTRIM(au_lname) + ', ' + RTRIM(au_fname)), state FROM authors A WHERE A.au_id NOT IN (SELECT B.au_id FROM titleauthor B)

Here's the output:

 Author ID Author state --------- ------------------- ----- 341-22-1782 Smith, Meander KS 527-72-3246 Greene, Morningstar TN 724-08-9931 Stringer, Dirk CA 893-72-1158 McBadden, Heather CA

The IN operation is commonly used for subqueries, either to find matching values (similar to a join) or to find nonmatching values by negating it (NOT IN), as shown above. Using the IN predicate is actually equivalent to = ANY. If you wanted to find every row in authors that had at least one entry in the titleauthor table, you could use either of these queries.

Here's the query using IN:

 SELECT 'Author ID'=A.au_id, 'Author'=CONVERT(varchar(20), RTRIM(au_lname) + ', ' + RTRIM(au_fname)), state FROM authors A WHERE A.au_id IN (SELECT B.au_id FROM titleauthor B)

This is the query using equivalent formulation with = ANY:

 SELECT 'Author ID'=A.au_id, 'Author'=CONVERT(varchar(20), RTRIM(au_lname) + ', ' + RTRIM(au_fname)), state FROM authors A WHERE A.au_id=ANY (SELECT B.au_id FROM titleauthor B)

Here's the output:

 Author ID Author state --------- -------------------- ----- 172-32-1176 White, Johnson CA 213-46-8915 Green, Marjorie CA 238-95-7766 Carson, Cheryl CA 267-41-2394 O'Leary, Michael CA 274-80-9391 Straight, Dean CA 409-56-7008 Bennet, Abraham CA 427-17-2319 Dull, Ann CA 472-27-2349 Gringlesby, Burt CA 486-29-1786 Locksley, Charlene CA 648-92-1872 Blotchet-Halls, Regi OR 672-71-3249 Yokomoto, Akiko CA 712-45-1867 del Castillo, Innes MI 722-51-5454 DeFrance, Michel IN 724-80-9391 MacFeather, Stearns CA 756-30-7391 Karsen, Livia CA 807-91-6654 Panteley, Sylvia MD 846-92-7186 Hunter, Sheryl CA 899-46-2035 Ringer, Anne UT 998-72-3567 Ringer, Albert UT

Each of these formulations is equivalent to testing the value of au_id in the authors table to the au_id value in the first row in the titleauthor table, and then OR'ing it to a test of the au_id value of the second row, and then OR'ing it to a test of the value of the third row, and so on. As soon as one row evaluates to TRUE, the expression is TRUE, and further checking can stop because the row in authors qualifies. However, it's an easy mistake to conclude that NOT IN must be equivalent to <> ANY, and some otherwise good discussions of the SQL language have made this exact mistake. More significantly, some products have also erroneously implemented it as such. Although IN is equivalent to = ANY, NOT IN is instead equivalent to <> ALL, not to <> ANY.

NOTE
Careful reading of the ANSI SQL-92 specifications also reveals that NOT IN is equivalent to <> ALL but is not equivalent to <> ANY. Section 8.4 of the specifications shows that R NOT IN T is equivalent to NOT (R = ANY T). Furthermore, careful study of section 8.7 <quantified comparison predicate> reveals that NOT (R = ANY T) is TRUE if and only if R <> ALL T is TRUE. In other words, NOT IN is equivalent to <> ALL.

By using NOT IN, you're stating that none of the corresponding values can match. In other words, all of the values must not match (<> ALL), and if even one does match, it's FALSE. With <> ANY, as soon as one value is found to be not equivalent, the expression is TRUE. This, of course, is also the case for every row of authors: rows in titleauthor will always exist for other au_id values, and hence all authors rows will have at least one nonmatching row in titleauthor. That is, every row in authors will evaluate to TRUE for a test of <> ANY row in titleauthor.

The following query using <> ALL returns the same four rows as the earlier one that used NOT IN:

 SELECT 'Author ID'=A.au_id, 'Author'=CONVERT(varchar(20), RTRIM(au_lname) + ', ' + RTRIM(au_fname)), state FROM authors A WHERE A.au_id <> ALL (SELECT B.au_id FROM titleauthor B)

Here is the output:

 Author ID Author state --------- ------------------- ----- 341-22-1782 Smith, Meander KS 527-72-3246 Greene, Morningstar TN 724-08-9931 Stringer, Dirk CA 893-72-1158 McBadden, Heather CA

If you had made the mistake of thinking that because IN is equivalent to = ANY, then NOT IN is equivalent to <> ANY, you would have written the query as follows. This returns all 23 rows in the authors table!

 SELECT 'Author ID'=A.au_id, 'Author'=CONVERT(varchar(20), RTRIM(au_lname) + ', ' + RTRIM(au_fname)), state FROM authors A WHERE A.au_id <> ANY (SELECT B.au_id FROM titleauthor B)

Here's the output:

 Author ID Author state --------- -------------------- ----- 172-32-1176 White, Johnson CA 213-46-8915 Green, Marjorie CA 238-95-7766 Carson, Cheryl CA 267-41-2394 O'Leary, Michael CA 274-80-9391 Straight, Dean CA 341-22-1782 Smith, Meander KS 409-56-7008 Bennet, Abraham CA 427-17-2319 Dull, Ann CA 472-27-2349 Gringlesby, Burt CA 486-29-1786 Locksley, Charlene CA 527-72-3246 Greene, Morningstar TN 648-92-1872 Blotchet-Halls, Regi OR 672-71-3249 Yokomoto, Akiko CA 712-45-1867 del Castillo, Innes MI 722-51-5454 DeFrance, Michel IN 724-08-9931 Stringer, Dirk CA 724-80-9391 MacFeather, Stearns CA 756-30-7391 Karsen, Livia CA 807-91-6654 Panteley, Sylvia MD 846-92-7186 Hunter, Sheryl CA 893-72-1158 McBadden, Heather CA 899-46-2035 Ringer, Anne UT 998-72-3567 Ringer, Albert UT

The examples just shown use IN, NOT IN, ANY, and ALL to compare values to a set of values from a subquery. This is common. However, it's also common to use expressions and compare a set of values to a single, scalar value. For example, to find titles whose royalties exceed the average of all royalty values in the roysched table by 25 percent or more, you can use this simple query:

 SELECT titles.title_id, title, royalty FROM titles WHERE titles.royalty >= (SELECT 1.25 * AVG(roysched.royalty) FROM roysched)

This query is perfectly good because the aggregate function AVG (expression) stipulates that the subquery must return exactly one value and no more. Without using IN, ANY, or ALL (or their negations), a subquery that returns more than one row will result in an error. If you incorrectly rewrote the query as follows, without the AVG function, you'd get run-time error 512:

 SELECT titles.title_id, title, royalty FROM titles WHERE titles.royalty >= (SELECT 1.25 * roysched.royalty FROM roysched)

Here is the output:

 Server: Msg 512, Level 16, State 1, Line 1 Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.

It is significant that this error is a run-time error and not a syntax error: in the SQL Server implementation, if that subquery didn't produce more than one row, the query would be considered valid and would execute. For example, the subquery in the following code returns only one row, so the query is valid and returns four rows:

 SELECT titles.title_id, royalty, title FROM titles WHERE titles.royalty >= (SELECT 1.25*roysched.royalty FROM roysched WHERE roysched.title_id='MC3021' AND lorange=0)

Here is the output:

 title_id royalty title -------- ----------- --------------------------------------------- BU2075 24 You Can Combat Computer Stress! MC3021 24 The Gourmet Microwave PC1035 16 But Is It User Friendly? TC4203 14 Fifty Years in Buckingham Palace Kitchens

However, this sort of query can be dangerous, and you should avoid it or use it only when you know that a PRIMARY KEY or UNIQUE constraint will ensure that the subquery returns only one value. The query here appears to work, but it's a bug waiting to happen. As soon as another row is added to the roysched table—say, with a title_id of MC3021 and a lorange of 0—the query returns an error. No constraint exists to prevent such a row from being added.

You might argue that SQL Server should determine whether a query formation could conceivably return more than one row regardless of the data at the time and then disallow such a subquery formulation. The reason it doesn't is that such a query might be quite valid when the database relationships are properly understood, so the power shouldn't be limited to try to protect naïve users. Whether you agree with this philosophy or not, it's consistent with SQL in general—and you should know by now that you can easily write a perfectly legal, syntactically correct query that answers a question in a way that's entirely different from what you thought you were asking!

Correlated Subqueries

You can use powerful correlated subqueries to compare specific rows of one table to a condition in a matching table. For each row otherwise qualifying in the main (or top) query, the subquery is evaluated. Conceptually, a correlated subquery is similar to a loop in programming, although it's entirely without procedural constructs such as do-while or for. The results of each execution of the subquery must be correlated to a row of the main query. In the next example, for every row in the titles table that has a price of $19.99 or less, the row is compared with each sales row for stores in California for which the revenue (price × qty) is greater than $250. In other words, "Show me titles with prices of under $20 for which any single sale in California was more than $250."

 SELECT T.title_id, title FROM titles T WHERE price <= 19.99 AND T.title_id IN ( SELECT S.title_id FROM sales S, stores ST WHERE S.stor_id=ST.stor_id AND ST.state='CA' AND S.qty*T.price > 250 AND T.title_id=S.title_id)

Here's the result:

 title_id title -------- ----------------------------- BU7832 Straight Talk About Computers PS2091 Is Anger the Enemy? TC7777 Sushi, Anyone?

Notice that this correlated subquery, like many subqueries, could have been written as a join (here using the old-style JOIN syntax):

 SELECT T.title_id, T.title FROM sales S, stores ST, titles T WHERE S.stor_id=ST.stor_id AND T.title_id=S.title_id AND ST.state='CA' AND T.price <= 19.99 AND S.qty*T.price > 250

It becomes nearly impossible to create alternative joins when the subquery isn't doing a simple IN or when it uses aggregate functions. For example, suppose we want to find titles that lag in sales for each store. This could be defined as "Find any title for every store in which the title's sales in that store are below 80 percent of the average of sales for all stores that carry that title and ignore titles that have no price established (that is, the price is NULL)." An intuitive way to do this is to first think of the main query that will give us the gross sales for each title and store, and then for each such result, do a subquery that finds the average gross sales for the title for all stores. Then we correlate the subquery and the main query, keeping only rows that fall below the 80 percent standard.

Such an example follows. For clarity, notice the two distinct queries, each of which answers a separate question. Then notice how they can be combined into a single correlated query to answer the specific question posed here. All three queries use the old-style JOIN syntax.

 -- This query computes gross revenues by -- title for every title and store SELECT T.title_id, S.stor_id, ST.stor_name, city, state, T.price*S.qty FROM titles AS T, sales AS S, stores AS ST WHERE T.title_id=S.title_id AND S.stor_id=ST.stor_id -- This query computes 80% of the average gross revenue for each -- title for all stores carrying that title: SELECT T2.title_id, .80*AVG(price*qty) FROM titles AS T2, sales AS S2 WHERE T2.title_id=S2.title_id GROUP BY T2.title_id -- Correlated subquery that finds store-title combinations whose -- revenues are less than 80% of the average of revenues for that -- title for all stores selling that title SELECT T.title_id, S.stor_id, ST.stor_name, city, state, Revenue=T.price*S.qty FROM titles AS T, sales AS S, stores AS ST WHERE T.title_id=S.title_id AND S.stor_id=ST.stor_id AND T.price*S.qty < (SELECT 0.80*AVG(price*qty) FROM titles T2, sales S2 WHERE T2.title_id=S2.title_id AND T.title_id=T2.title_id )

And the answer is (from the third query):

 title_id stor_id stor_name city state Revenue -------- ------- ------------------- ---- ----- ------- BU1032 6380 Eric the Read Books Seattle WA 99.95 MC3021 8042 Bookbeat Portland OR 44.85 PS2091 6380 Eric the Read Books Seattle WA 32.85 PS2091 7067 News & Brews Los Gatos CA 109.50 PS2091 7131 Doc-U-Mat: Quality Remulade WA 219.00 Laundry and Books

When the newer ANSI JOIN syntax was first introduced, it wasn't obvious how to use it to write a correlated subquery. It could be that the creators of the syntax forgot about the correlated subquery case, because using the syntax seems like a hybrid of the old and the new: the correlation is still done in the WHERE clause rather than in the JOIN clause. For illustration, examine the two equivalent formulations of the above query using the ANSI JOIN syntax:

 SELECT T.title_id, S.stor_id, ST.stor_name, city, state, Revenue=T.price*S.qty FROM titles AS T JOIN sales AS S ON T.title_id=S.title_id JOIN stores AS ST ON S.stor_id=ST.stor_id WHERE T.price*S.qty < (SELECT 0.80*AVG(price*qty) FROM titles T2 JOIN sales S2 ON T2.title_id=S2.title_id WHERE T.title_id=T2.title_id ) SELECT T.title_id, S.stor_id, ST.stor_name, city, state, Revenue=T.price*S.qty FROM titles AS T JOIN sales AS S ON T.title_id=S.title_id AND T.price*S.qty < (SELECT 0.80*AVG(T2.price*S2.qty) FROM sales AS S2 JOIN titles AS T2 ON T2.title_id=S2.title_id WHERE T.title_id=T2.title_id) JOIN stores AS ST ON S.stor_id=ST.stor_id

To completely avoid the old-style syntax with the join condition in the WHERE clause, we could write this using a subquery with a GROUP BY in the FROM clause (creating a derived table, which I'll describe in more detail later in this section). However, although this gets around having to use the old syntax, it might not be worth it. The query is much less intuitive than either of the preceding two formulations, and it takes twice as many logical reads to execute it.

 SELECT T.title_id, S.stor_id, ST.stor_name, city, state, Revenue = T.price * S.qty FROM titles AS T JOIN sales AS S ON T.title_id = S.title_id JOIN stores AS ST ON S.stor_id = ST.stor_id JOIN (SELECT T2.title_id, .80 * AVG(price * qty) AS avg_val FROM titles AS T2 JOIN sales AS S2 ON T2.title_id = S2.title_id GROUP BY T2.title_id) AS AV ON T.title_id = AV.title_id AND T.price * S.qty < avg_val

Often, correlated subqueries use the EXISTS statement, which is the most convenient syntax to use when multiple fields of the main query are to be correlated to the subquery. (In practice, EXISTS is seldom used other than with correlated subqueries.) EXISTS simply checks for a nonempty set. It returns (internally) either TRUE or NOT TRUE (which we won't refer to as FALSE, given the issues of three-valued logic and NULL). Because no column value is returned and the only thing that matters is whether any rows are returned, convention dictates that a column list isn't specified. You can either use the * to indicate all columns or a constant such as the number 1 to indicate that the column list is really ignored.

A common use for EXISTS is to answer a query such as "Show me the titles for which no stores have sales."

 SELECT T.title_id, title FROM titles T WHERE NOT EXISTS (SELECT 1 FROM sales S WHERE T.title_id=S.title_id )

Here is the output:

 title_id title -------- ---------------------------------- MC3026 The Psychology of Computer Cooking PC9999 Net Etiquette

Conceptually, this query is pretty straightforward. The subquery, a simple equijoin, finds all matches of titles and sales. Then NOT EXISTS correlates titles to those matches, looking for titles that don't have even a single row returned in the subquery.

Another common use of EXISTS is to determine whether a table is empty. The optimizer knows that as soon as it gets a single hit using EXISTS, the operation is TRUE and further processing is unnecessary. For example, here's how you determine whether the authors table is empty:

 SELECT 'Not Empty' WHERE EXISTS (SELECT * FROM authors)

Earlier, when discussing outer joins, I mentioned that you can now use an outer-join formulation to address what was traditionally a problem in need of a correlated subquery solution. Here's an outer-join formulation for the problem described earlier: "Show me the titles for which no stores have sales."

 SELECT T1.title_id, title FROM titles T1 LEFT OUTER JOIN sales S ON T1.title_id=S.title_id WHERE S.title_id IS NULL

TIP
Depending on your data and indexes, the outer-join formulation might be faster or slower than a correlated subquery. But before deciding to write your query one way or the other, you might want to come up with a couple of alternative formulations and then choose the one that's fastest in your situation.
In this example, for which little data exists, both solutions run in subsecond elapsed time. But the outer-join query requires fewer than half the number of logical I/Os than does the correlated subquery. With more data, that difference would be significant.

This query works by joining the stores and titles tables and by preserving the titles for which no store exists. Then, in the WHERE clause, it specifically chooses only the rows that it preserved in the outer join. Those rows are the ones for which a title had no matching store.

At other times, a correlated subquery might be preferable to a join, especially if it's a self-join back to the same table or some other exotic join. Here's an example. Given the following table (and assuming that the row_num column is guaranteed unique), suppose we want to identify the rows for which col2 and col3 are duplicates of another row:

 row_num col2 col3 ------- ---- ---- 1 C D 2 A A 4 C B 5 C C 6 B C 7 C A 8 C B 9 C D 10 D D

We can do this in two standard ways. The first way uses a self-join. In a self-join, the table (or view) is used multiple times in the FROM clause and is aliased at least once. Then it can be treated as an entirely different table and you can compare columns between two "instances" of the same table. A self-join to find the rows having duplicate values for col2 and col3 is easy to understand:

 SELECT DISTINCT A.row_num, A.col2, A.col3 FROM match_cols AS A, match_cols AS B WHERE A.col2=B.col2 AND A.col3=B.col3 AND A.row_num <> B.row_num ORDER BY A.col2, A.col3 row_num col2 col3 ------- ---- ---- 4 C B 8 C B 1 C D 9 C D

But in this case, a correlated subquery using aggregate functions provides a considerably more efficient solution, especially if many duplicates exist:

 SELECT A.row_num, A.col2, A.col3 FROM match_cols AS A WHERE EXISTS (SELECT B.col2, B.col3 FROM match_cols AS B WHERE B.col2=A.col2 AND B.col3=A.col3 GROUP BY B.col2, B.col3 HAVING COUNT(*) > 1) ORDER BY A.col2, A.col3

This correlated subquery has another advantage over the self-join example—the row_num column doesn't need to be unique to solve the problem at hand.

You can take a correlated subquery a step further to ask a seemingly simple question that's surprisingly tricky to answer in SQL: "Show me the stores that have sold every title." Even though it seems like a reasonable request, relatively few people can come up with the correct SQL query, especially if I throw in the restrictions that you aren't allowed to use an aggregate function like COUNT(*) and that the solution must be a single SELECT statement (that is, you're not allowed to create temporary tables or the like).

The previous query already revealed two titles that no store has sold, so we know that with the existing dataset, no stores can have sales for all titles. For illustrative purposes, let's add sales records for a hypothetical store that does, in fact, have sales for every title. Following that, we'll see the query that finds all stores that have sold every title (which we know ahead of time is only the phony one we're entering here):

 -- The phony store INSERT stores (stor_id, stor_name, stor_address, city, state, zip) VALUES ('9999', 'WE SUPPLY IT ALL', 'One Main St', 'Poulsbo', 'WA', '98370') -- By using a combination of hard-coded values and selecting every -- title, generate a sales row for every title INSERT sales (stor_id, title_id, ord_num, ord_date, qty, payterms) SELECT '9999', title_id, 'PHONY1', GETDATE(), 10, 'Net 60' FROM titles -- Find stores that supply every title SELECT ST.stor_id, ST.stor_name, ST.city, ST.state FROM stores ST WHERE NOT EXISTS (SELECT * FROM titles T1 WHERE NOT EXISTS (SELECT * FROM titles T1 WHERE NOT EXISTS (SELECT * FROM sales S WHERE S.title_id=T1.title_id AND ST.stor_id=S.stor_id) )

Here's the result:

 stor_id stor_name city state ------- --------- ---- ----- 9999 WE SUPPLY IT ALL Poulsbo WA

Although this query might be difficult to think of immediately, you can easily understand why it works. In English, it says, "Show me the store(s) such that no titles exist that the store doesn't sell." This query consists of the two subqueries that are applied to each store. The bottommost subquery produces all the titles that the store has sold. The upper subquery is then correlated to that bottom one to look for any titles that are not in the list of those that the store has sold. The top query returns any stores that aren't in this list. This type of query is known as a relational division, and unfortunately, it isn't as easy to express as we'd like. Although the query shown is quite understandable, once you have a solid foundation in SQL, it's hardly intuitive. As is almost always the case, you could probably use other formulations to write this query.

I've already alluded to writing this query without using an aggregate function like COUNT. There's nothing wrong with using an aggregate function—I imposed this restriction only to make writing the query more of a challenge. If you think of the query in English as "Find the stores that have sold as many unique titles as there are total unique titles," you'll find the following formulation somewhat more intuitive:

 SELECT ST.stor_id, ST.stor_name FROM stores ST, sales SA, titles T WHERE SA.stor_id=ST.stor_id AND SA.title_id=T.title_id GROUP BY ST.stor_id,ST.stor_name HAVING COUNT(DISTINCT SA.title_id)=(SELECT COUNT(*) FROM titles T1)

The following formulation runs much more efficiently than either of the previous two. The syntax is similar to the preceding one but its approach is novel because it's just a standard subquery, not a join or a correlated subquery. You might think it's an illegal query, since it does a GROUP BY and a HAVING without an aggregate in the select list of the first subquery. But that's OK, both in terms of what SQL Server allows and in terms of the ANSI specification. What isn't allowed is having an item in the select list that isn't an aggregate function but then omitting it from the GROUP BY clause, if there is a GROUP BY clause.

 -- Find stores that have sold every title SELECT stor_id, stor_name FROM stores WHERE stores.stor_id IN (SELECT stor_id FROM sales GROUP BY stor_id HAVING COUNT(DISTINCT title_id)=(SELECT COUNT(*) FROM titles) )

And as a lead-in to the next topic, here's a formulation that uses a derived table—a feature that allows you to use a subquery in a FROM clause. This capability lets you alias a virtual table returned as the result set of a SELECT statement, and then lets you use this result set as if it were a real table. This query also runs efficiently.

 SELECT ST.stor_id, ST.stor_name FROM stores ST, (SELECT stor_id, COUNT(DISTINCT title_id) AS title_count FROM sales GROUP BY stor_id ) as SA WHERE ST.stor_id=SA.stor_id AND SA.title_count= (SELECT COUNT(*) FROM titles)