Subqueries | Beginning Transact-SQL With SQL Server 2000 and 2005

A subquery is simply a SELECT query within a SELECT query. Several forms of subqueries exist, ranging from expressions that return a single, or scalar, value to a multi-row result set. I'm going to show you several examples of each type, beginning with scalar expressions.

Scalar Expressions

Within the column selection list of a SELECT statement, embedded SELECT statements can be used to return a single-column value. One of the most common examples of a scalar expression is to use an aggregate function to return a value based on the scope of multiple rows.

How It Works

Here's an example. I'd like to compare the price paid for a product with the average price of all product sales. This can be done by using a simple SELECT statement that uses the AVG() function to produce a column in the query's result set given the alias AvgPrice:

 ELECT   ProductID , UnitPrice , (SELECT AVG(UnitPrice) FROM SalesOrderDetail) AS AvgPrice FROM SalesOrderDetail

In the results shown in Figure 10-1, you can see that the values in this column are constant. This is because the subquery expression has no correlation or dependency on the outer query. It simply calculates the average price for all product sales again and again, and returns this value in every row of the result.

image from book
Figure 10-1:

Fortunately, SQL Server is smart enough to perform the calculation once and then simply return a cached value for each request. Can you imagine how wasteful it would be to recalculate the average of 121 thousand values, and then repeat the same calculation 121 thousand times? It's important to note that the database engine doesn't just do exactly what you ask it to. It has intelligence built in to find shortcuts like this one.

Speaking of intelligence, I'll do something a little more intelligent with this data. Rather than showing the average price of all sales on each row, I'll use this value to calculate the difference between this row's sale price and the average for all sales:

 ELECT   ProductID , UnitPrice , UnitPrice - (SELECT Avg(UnitPrice) FROM SalesOrderDetail) As AvgPriceDifference FROM SalesOrderDetail

Now the result shows the difference between the value returned by the subquery and the sale price in the UnitPrice column (see Figure 10-2).

image from book
Figure 10-2:

As you can see, subqueries can be used to add dimension to data in flat tables. I'm going to build on this query to show some variations.

Try It Out

Rather than getting a scalar value from an aggregate on the same table, you can also use another table. Try out the following query. Use the ProductID value to join the SalesOrderDetail and Product tables and get the ListPrice for the corresponding product:

 ELECT   ProductID , UnitPrice AS SalePrice , (SELECT ListPrice FROM Product WHERE Product.ProductID = SalesOrderDetail.ProductID) AS ProductListPrice FROM SalesOrderDetail

In Figure 10-3, the results show the list price in the third column.

image from book
Figure 10-3:

Now take it one step further. Use this value to calculate the difference between the product's list price and price charged. By adding the expression, UnitPrice – , before the subquery, you can tell whether the sale price is inflated or discounted from the list price:

 ELECT   ProductID , UnitPrice AS SalePrice , UnitPrice - (SELECT ListPrice FROM Product WHERE Product.ProductID = SalesOrderDetail.ProductID) AS PriceDifference FROM SalesOrderDetail

The results are shown in Figure 10-4.

image from book
Figure 10-4:

The negative values in the third column indicate that these products were sold for less than the list price in the product table. Apparently, Adventure Works Cycles makes it a point to sell all of their products below list price (too bad that this is a fictitious business).

Alternate Join Operations

Before I show you the following examples, I want to answer an obvious question: Why learn to use sub-queries that do the same thing as join operations when joins are preferable? All by themselves, it probably doesn't make sense to use subqueries in place of joins. So why take this route? As you will see, some business rules are best implemented using a subquery expression. Under the right conditions join-type subqueries used in concert with specialized business logic subqueries may be the right choice. Although the exception rather than the rule, sometimes you may need more flexibility than an ANSI join will offer.

Inner Join Subqueries

An inner join subquery is a simple expression. The main difference between this technique and the ANSI-SQL INNER JOIN statement is that this join is performed in the WHERE clause rather than the FROM clause. Note that two tables are referenced in the FROM clause as a comma-delimited list:

 ELECT SalesOrderHeader.OrderDate , SalesOrderDetail.ProductID , SalesOrderDetail.UnitPrice FROM SalesOrderHeader, SalesOrderDetail WHERE SalesOrderHeader.SalesOrderID = SalesOrderDetail.SalesOrderID ORDER BY SalesOrderHeader.OrderDate

There are no surprises in the result set shown in Figure 10-5. Just like an inner join, you will only see matching records between the SalesOrderHeader and SalesOrderDetail tables.

image from book
Figure 10-5:

Before adding anything to the script in this example, I'd like to compare this query to its ANSI-SQL equivalent:

 ELECT SalesOrderHeader.OrderDate , SalesOrderDetail.ProductID , SalesOrderDetail.UnitPrice FROM SalesOrderHeader INNER JOIN SalesOrderDetail ON SalesOrderHeader.SalesOrderID = SalesOrderDetail.SalesOrderID ORDER BY SalesOrderHeader.OrderDate

There is no reason to show the result set because it's going to look exactly as that for the previous example.

You can compound join operations for more than two tables very easily. I'll add the Product table so you can see the product name for each sales record:

 ELECT SalesOrderHeader.OrderDate , SalesOrderDetail.ProductID , SalesOrderDetail.UnitPrice , Product.Name FROM   SalesOrderHeader, SalesOrderDetail, Product WHERE  SalesOrderHeader.SalesOrderID = SalesOrderDetail.SalesOrderID AND SalesOrderDetail.ProductID = Product.ProductID ORDER BY SalesOrderHeader.OrderDate

Using the AND operator, you can extend the WHERE clause with the join between the SalesOrderDetail and Product tables. The result (shown in Figure 10-6) shows the OrderDate column from the SalesOrderHeader table, the ProductID and UnitPrice from SalesOrderDetail, and the Name column from the table.

image from book
Figure 10-6:

Outer Join Subqueries

An outer join can be performed in a few different ways. A simple outer join without complex business rules is performed as a simple variation of an inner join subquery. Note the LEFT OUTER JOIN operator (*=) used for matching key values between the two tables:

 ELECT SalesOrderHeader.OrderDate , SalesOrderDetail.ProductID , SalesOrderDetail.UnitPrice FROM SalesOrderHeader, SalesOrderDetail WHERE SalesOrderHeader.SalesOrderID *= SalesOrderDetail.SalesOrderID ORDER BY SalesOrderHeader.OrderDate

Like the LEFT OUTER JOIN statement, this means that all rows will be selected from the left side of the expression and then matching rows are returned from the right side. I'm not going to show the results because they will be the same as the inner join example. In the AdventureWorks database, there are no SalesOrderHeader records without matching SalesOrderDetail records, so the outer join doesn't produce any orphaned records.

The following example does expose unmatched records. Let's see if there are any products that haven't been purchased:

 SELECT Product.ProductID, SalesOrderDetail.SalesOrderID  FROM Product, SalesOrderDetail WHERE Product.ProductID *= SalesOrderDetail.ProductID

Again, the *= comparison operator instructs SQL Server to retrieve all of the products first and then the matched sales records. The results are shown in Figure 10-7.

image from book
Figure 10-7:

The null value in the SalesOrderID column indicates that there are no order detail records matching these products.

I have included the *= and =* syntax here because you may encounter it, but a very important note to remember about this syntax is that in the future it may become an illegal operation. In some very specific scenarios the *= and =* syntax may return ambiguous results. As a result, the preferred method for outer joins is using the ANSI-SQL syntax of LEFT OUTER JOIN and RIGHT OUTER JOIN.

Table Aliasing 101

The mechanics of subqueries are sometimes considerably different than standard join operations. The need to refer to columns from one table within an expression using a different table often requires that tables be aliased. In Chapter 5 you learned how to alias columns. This is the same concept but now you're just giving a table an alternate name. Here's a simple example to introduce the technique:

 SELECT ProductID, Name FROM Product AS P

The alias in this query doesn't really accomplish anything in its basic form, but it would be useful if this query were used as a nested subquery, within another SELECT statement.

Using Aggregate Functions in Subqueries

In Chapter 7 you saw how aggregate functions are used to return summary values over a group of rows rolled up using the GROUP BY clause. In subqueries, a similar result can be achieved by using aggregate function in a filtered SELECT expression. In a scalar expression, you use the AVG() function to return a single column value and then use that value to perform calculations with other column values.

In row set queries, aggregate functions can be used along with various filtering and matching techniques to achieve more flexible results. For example, I can simulate a top values query by using the COUNT() function. I'll introduce some other techniques first. Aggregate functions allow you to do some very powerful things in correlated subqueries, which are discussed later in the chapter.

The HAVING Clause

Chapter 7 also showed you how to use the HAVING clause to filter aggregated rows when using the GROUP BY statement. Similar results can be achieved using subquery techniques. Until we have a chance to discuss the mechanics of correlated subqueries, it may not seem to make sense to use the HAVING clause without a GROUP BY. For example, the following expression serves only to include or exclude all rows from this query:

 SELECT SUM(UnitPrice) FROM SalesOrderDetail  HAVING SUM(UnitPrice) > 100000

In this example, one value is returned if the UnitPrice sum for the entire range meets this criterion; otherwise, the query returns nothing. In effect, the HAVING statement does the same thing as a WHERE clause, but only after the aggregate value has been calculated. The WHERE clause is used to qualify records before they are fed to the aggregate SUM() function within the subquery. This query returns the name of products that have generated sales exceeding two million dollars:

 ELECT Name FROM Product WHERE EXISTS ( SELECT SUM(UnitPrice) FROM SalesOrderDetail WHERE SalesOrderDetail.ProductID = Product.ProductID HAVING SUM(UnitPrice) > 2000000 )

When you need to use an aggregated value within a subquery, this technique can be quite useful.

Creating a Derived Table

Sometimes you may need to build values into a query that are not obtained from a table. You've seen that the SELECT statement allows you to return literal values that are just part of an expression. For example, the following statement returns a single row result as if it were read from a table:

 SELECT 'Fred' As FirstName, 'Flintstone' As LastName

Because tables can be aliased, an expression-type result set can also be aliased and then treated as if it were a table:

 SELECT * FROM (SELECT 'Fred' As FirstName, 'Flintstone' As LastName)  AS CartoonCharacter

Derived tables can be used in joins and subqueries. As far as the query processor is concerned, the CartoonCharacter alias is a table, like any other table, and can be joined or used in expressions like a physical table. Extending the previous example, a UNION statement is used to create two records, returned as an aliased table. This table is then joined to the physical Department table:

 ELECT FirstName, LastName, Department.Name As DeptName FROM  (SELECT ‘Fred’ As FirstName,  UNION ALL SELECT ‘Barney’ As FirstName,  As CartoonCharacter Inner Join Department On CartoonCharacter.DepartmentID = Department.DepartmentID

The result, shown in Figure 10-8, returns the department name based on the join to the Department table and the DepartmentID provided for each of these records.

image from book
Figure 10-8:

Using the IN() Function

In Chapter 5 you saw how the IN() function can be used to match a column to any value in a comma-delimited list of values. This function has a dual purpose in that it also knows how to match a column to the results of a SELECT statement when this statement returns a single column of values. Here's a simple example that returns all products that have been used in orders:

 ELECT ProductID, Name FROM Product WHERE ProductID IN (  SELECT ProductID FROM SalesOrderDetail )

Subqueries can be nested. Extending the previous example, I can return products that were ordered in March 2003. The OrderDate is stored in the SalesOrderHeader table and the ProductID for the order is in the SalesOrderDetail table. This requires three expressions — a query within a query within a query, like this:

 SELECT ProductID, Name FROM Product WHERE ProductID IN ( SELECT ProductID FROM SalesOrderDetail WHERE SalesOrderID IN (  SELECT SalesOrderID FROM SalesOrderHeader WHERE OrderDate BETWEEN ‘3-1- )

The first product/order example returns a large number of rows. Reversing the logic using the NOT operator will return a list of all the products that haven't sold:

 SELECT ProductID, Name FROM Product WHERE ProductID NOT IN  (SELECT ProductID FROM SalesOrderDetail)

Perhaps my purpose is to clean up the product inventory and remove those product records that haven't generated revenue. To this end, I can delete products that haven't sold by applying the DELETE command:

 DELETE FROM Product WHERE ProductID NOT IN  (SELECT ProductID FROM SalesOrderDetail)

Using the EXISTS() Function

The EXISTS() function is used to return a row in the outer query when any records are returned by a subquery. The subquery can be any SELECT statement, directly related or not, to the main SELECT statement. The EXISTS() function is often used in correlated subqueries to either mimic the behavior of a join or to implement business rules that wouldn't be possible with a standard join.

It doesn't really matter what column or columns are returned in the subquery because you don't actually use these values. For this reason, it's common to use the asterisk rather than column names. The asterisk is typically used to return all column values from a query but, in this case, it simply allows the query engine to test for the presence of any rows without unnecessarily wasting system resources.

Here's a simple example of this type of subquery:

 SELECT EmployeeID, FirstName, LastName  FROM Employee  WHERE EXISTS  (SELECT * FROM Customer WHERE SalespersonID = Employee.EmployeeID)

Note the correlation between the inner and outer query using a reference to the Employee table from the subquery based on the Customer table. The results, shown in Figure 10-9, return only employee records where the employee is assigned as a salesperson for a customer.

image from book
Figure 10-9:

NOT EXISTS()

To reverse the logic for this query (in this, to return all of the unassigned employees), simply add the NOT operator before the EXISTS statement:

 SELECT EmployeeID, FirstName, LastName  FROM Employee  WHERE NOT EXISTS  (SELECT * FROM Customer WHERE SalespersonID = Employee.EmployeeID)

The result set shown in Figure 10-10 returns all of the employees that were excluded in the previous example.

image from book
Figure 10-10:

Correlated Subqueries

You've just seen some examples of subqueries where the outer query refers to, and conditionally filters rows based on, the inner query. A correlated subquery is a subquery where the selection criterion of the inner query refers to values in the outer query. Correlated subqueries can be a bit tricky. One of the restrictions is that the inner query cannot return more than one row matching the outer row. In the following example, I am looking for mountain bike products (where the SubCategoryID is 1) that have been purchased (where there is an existing SalesOrderDetail record). This query contains the proper logic but returns an error because the inner query returns multiple results for some of the products:

 SELECT ProductID , Name FROM Product AS P WHERE 1 =  (SELECT ProductSubCategoryID FROM SalesOrderDetail As SD WHERE P.ProductID = SD.ProductID)

The problem is easily corrected by adding a TOP 1 statement:

 - Correlated subquery using TOP 1: SELECT ProductID , Name FROM Product AS P WHERE 1 =  (SELECT TOP 1 ProductSubCategoryID FROM SalesOrderDetail As SD WHERE P.ProductID = SD.ProductID)

The first few rows of the result set are shown in Figure 10-11.

image from book
Figure 10-11:

The same result could also be achieved using SELECT DISTINCT or a GROUP BY statement in the inner query, as shown in the following two examples:

 - Correlated subquery using Distinct: SELECT ProductID , Name FROM Product AS P WHERE 1 =  (SELECT DISTINCT ProductSubCategoryID FROM SalesOrderDetail As SD WHERE P.ProductID = SD.ProductID) -- Correlated subquery using Group By: SELECT ProductID , Name FROM Product AS P WHERE 1 =  (SELECT ProductSubCategoryID FROM SalesOrderDetail As SD WHERE P.ProductID = SD.ProductID GROUP BY ProductSubCategoryID)

Note

Just an aside: I've always found this logic a little awkward because the value comparison (1 = ...) seems backward. But this is a good example of where it's sometimes necessary to think outside the box. This might not be the way you would typically make a value comparison, putting the expression on the right side of the result, but it's an accurate, logical expression that provides one more way to solve a problem.

One of the significant differences between the outer query value matching (using IN or EXISTS) and correlated techniques is in how the query processor builds the execution plan. In the case of a correlated subquery, a row is selected from the outer query and then a row is selected from the inner query. Value comparisons are made to determine if an outer row should be output and, if so, the row is typically sent to an output buffer. This process is repeated, processing each of the inner rows for every one of the outer rows until all combinations have been considered. This process is more rigorous than subqueries that use the IN or EXISTS functions and don't match outer query column values within the inner query.

Benchmarking and Best Practices

Several subquery techniques can be used to return the same kind of results you would get from an INNER JOIN or OUTER JOIN query. I don't want to beat the proverbial dead horse, but I'll say it one more time: Using ANSI standard join expressions gives the query processor the opportunity to make intelligent decisions about your stated intentions and then to build an optimal execution plan. Using explicit subquery expressions to achieve the same result doesn't.

Don't just take my word for it. Let's test it out. Using either the SQL Server 2005 Management Studio or SQL Server 2000 Query Analyzer, enable the feature to show the actual execution plan. In either tool, you will find this option on the Query menu. Execute the first correlated subquery example using the AdventureWorks database and then view the execution plan (see Figure 10-12). You can do this by selecting the Execution Plan tab in the query results pane.

image from book
Figure 10-12:

The first thing to note is that this query was executed in seven steps. From right to left, you see all of the activities performed by the query-processing engine to process this query. The join was processed using an operation called a Nested Loop Inner Join. This is exactly what you told the query processor to do: perform an inner join by matching records in two nested loops. Now, place the mouse pointer over the leftmost icon. A pop-up window displays statistical performance information about the execution. Take note of the Subtree Cost value. In my case, it was about 3.1. Also, note that the vast majority of time was spent processing an Index Seek operation on the SalesOrderDetails table's index. According to the execution statistics, 266 rows were processed 1004 times.

Now, type the following ANSI Join version of the same query, highlight the query text, and execute the query:

 - ANSI Join: SELECT DISTINCT Product.ProductID , Name FROM Product INNER JOIN SalesOrderDetail ON Product.ProductID = SalesOrderDetail.ProductID WHERE ProductSubCategoryID = 1

Again, view the execution plan shown in Figure 10-13.

image from book
Figure 10-13:

This time it only took four steps. Note the total Subtree Cost in the statistics for the left-most icon. In my case, this is only .67. It took almost one-fifth of the time to process this query as it did before. Note that instead of spending nearly all of this time reprocessing the same rows using the index scan, it only scanned the index once. The query optimizer decided that it made more sense to scan all 121,371

SalesOrderDetail rows using this index, matching 12,826 in the next operation. It also chose to use a Merge Join instead of the less-efficient Nested Loop Join. This time it spent a greater percentage of time processing the join but achieved a much more efficient result overall.

The real question is when do you really need to use a subquery instead of a join? So far you've seen that it usually makes sense to use a join wherever you can. However, business doesn't always follow simple rules. The next section looks at some examples of unique business rules that can be solved using various forms of subqueries.

Business Cases for Subqueries

Creating subqueries to solve unique business problems is a fine art, and a source of pride among true database professionals. Often an evolutionary process, queries may progress from crude data selections based on other views or queries, or multi-step inserts and updates to populate temporary tables, finally arriving at a short and elegant solution. This process can take days to years in the life of a database solution. I've learned to think twice before making statements such as "that can't be done in a single query," only to later discover otherwise. The following examples, with one exception, use the AdventureWorks database so you can follow along.

Top Sales by Territory

The Director of Sales at Adventure Works Cycles would like to see a report showing the top 5 sales in each sales territory and the salespeople responsible for those sales. The trick here is that I want to see five sales records in the result for every territory so a top values query won't help unless it is nested in a subquery. I can use a different technique to achieve a similar result. This is done by ordering the SalesOrderHeader rows by the SubTotal column in descending order. This places the largest sales records at the top of the list. The MIN() function is just a trick used to enable the HAVING() function to work. The HAVING expression used in this correlated subquery essentially returns the top 5 values from matching records within the inner query:

 SELECT TerritoryID, SubTotal, SalesPersonID, SalesOrderID FROM SalesOrderHeader AS SO1 WHERE SubTotal >=  ( SELECT Min(SubTotal) FROM SalesOrderHeader AS SO2 WHERE SO1.TerritoryID = SO2.TerritoryID AND SO1.SubTotal <= SO2.SubTotal HAVING Count(*) <= 5 ) ORDER BY TerritoryID, SubTotal Desc

Figure 10-14 shows the first few records in the result set.

image from book
Figure 10-14:

Unshipped Product Orders

In this scenario, I want a list of sales order records and products where part of the order has been shipped and part has not been shipped. In the outer query, I ask for records from the SalesOrderDetail table (sales order line items) that don't have a tracking number. This indicates that this item has not shipped. For these rows to be output, there must be records with the same SalesOrderID (related to the same SalesOrderHeader record) that do have a tracking number:

 SELECT SalesOrderID, ProductID  FROM SalesOrderDetail As SD1 WHERE CarrierTrackingNumber IS NULL  AND SalesOrderID IN  (SELECT SalesOrderID FROM SalesOrderDetail AS SD2 WHERE CarrierTrackingNumber IS NOT NULL AND SD1.SalesOrderID = SD2.SalesOrderID)

Figure 10-15 shows the results.

image from book
Figure 10-15:

Consulting Billing Time

One of the quandaries in the software development consulting business is just what we should bill our customers for. Inevitably, projects require ramp-up time and, sometimes, projects get cancelled before they really get started. With new clients, the process usually progresses from the sales stage to initiation meetings, specification forming, requirement gathering, and then finally to the development work. We certainly can't bill our customers for the sales calls and all the schmoozing that goes on when building a new business relationship, prior to getting started on a billable project. So, when does the billing clock actually begin to run? Customers want assurance that we can deliver results and we don't want to waste our time on dead-end projects that we won't get paid for.

For this example, I've created a table called ConsultingActivity with columns containing a description of consulting activities, the number of hours, and the activity type. In a production solution, this data may exist in multiple tables but in this simplified example, I've used only one table. The data looks like that shown in Figure 10-16.

image from book
Figure 10-16:

This is the only example in this chapter that doesn't use the AdventureWorks sample database. If you would like to see these results, you will need to create this table and populate it with these values. The ActivityType column may contain a number of different values. The business rule is that if our consultants have performed any development work, we will bill the customer for all activities. Otherwise, we don't bill on this project at all.

 SELECT CustProject, Hours FROM ConsultingActivity as ca1 WHERE CustProject IN (SELECT CustProject FROM ConsultingActivity as ca2 WHERE ActivityType = AND ca1.CustProject = ca2.CustProject) ORDER BY CustProject

Of the four projects in various stages of progress, only two have had any development consulting activity performed. The results of this query, shown in Figure 10-17, include all of the activities for these two projects.

image from book
Figure 10-17: