Subqueries | A Lightning SQL Tutorial

Table of contents:

Nulls in NOT IN Predicates

A subquery is a SELECT statement embedded within another SQL statement. You can use subqueries in place of column references, in place of table references, to generate values for comparison in a WHERE or HAVING condition, and to generate values for inserts and updates.

4.7.1 Treating Rowsets as Tables

So far in this chapter, you've selected data from tables. A SELECT statement is executed, and a set of rows comes back as the result. Imagine if you could further treat that set of rows as a table against which you issue another SELECT statement or perhaps an UPDATE or a DELETE. Using a subquery, you can do just what I've described, and that can lead to some interesting and elegant solutions to SQL problems.

One use for a subquery in the FROM clause is to aggregate data that has already been aggregated. For example, you might be faced with the following business problem:

Find all employees who have worked on projects 1001 and 1002 during the year 2004. Sum the number of hours each of those employees has worked on each project during that year. Report the ranges. Show the lowest number of hours that any employee worked on each project during 2004, as well as the highest number of hours.

This is an interesting problem to solve because you must sum the hours for employee/project combinations using the SUM aggregate function, and you must apply MIN and MAX to your sums. Example 4-32 shows one way to approach this problem using a subquery.

Example 4-32. Aggregating aggregated data


SELECT phs.project_id, MIN(phs.hours), MAX(phs.hours)


FROM (SELECT employee_id, project_id, SUM(hours_logged) hours


FROM project_hours


WHERE project_id IN (1001, 1002)


AND TRUNC(time_log_date,'Year') = TO_DATE('1-Jan-2004',

 'dd-Mon-yyyy')


GROUP BY employee_id, project_id) phs


GROUP BY project_id;

PROJECT_ID MIN(PHS.HOURS) MAX(PHS.HOURS)

---------- -------------- --------------

 1001 4 20

 1002 8 24

The subquery appears in the FROM clause enclosed in parentheses. The outer query treats the rows from the subquery in the same manner as it would treat rows from a table. Tables have names , and so should subqueries. Example 4-32 gives the name , or alias, phs to the subquery. The outer query uses that name to refer to the columns from the subquery. The alias hours is given to the column represented by SUM(hours_logged) , making it easy to refer to that column from the outer query.

The call to TRUNC(time_log_date,'Year') converts each date to January 1 st of its year. Any date in 2004 will be truncated to 1-Jan-2004. It's easy, then, to eliminate time log entries for other years .

Because subqueries in the FROM clause are treated the same as tables, it stands to reason they can take part in joins. Example 4-33 expands on Example 4-32 by adding another level of subquery and a join to the project table in order to include project names in the query output.

Example 4-33. Joining a subquery to a table

SELECT minmax_hours.project_id, p.project_name,

 minmax_hours.min_hours, minmax_hours.max_hours

FROM (

 SELECT phs.project_id, MIN(phs.hours) min_hours, MAX(phs.hours) 

 max_hours

 FROM (SELECT employee_id, project_id, SUM(hours_logged) hours

 FROM project_hours

 WHERE project_id IN (1001, 1002)

 AND TRUNC(time_log_date,'Year') = TO_DATE('1-Jan-2004',

 'dd-Mon-yyyy')

 GROUP BY employee_id, project_id) phs

 GROUP BY project_id) minmax_hours JOIN project p 

 ON minmax_hours.project_id = p.project_id;



PROJECT_ID PROJECT_NAME MIN_HOURS MAX_HOURS

---------- ---------------------------------------- ---------- ----------

 1001 Corporate Web Site 4 20

 1002 Enterprise Resource Planning System 8 24

There are different ways you can approach the join to project shown in Example 4-33. The join could occur in the innermost subquery or in the middle subquery, and either of those alternatives would eliminate the need for a third subquery. Sometimes it's a judgment call as to which approach is best. Sometimes you need to make that call based on readability or on a desire not to mess with a working query. By adding an outer SELECT statement to an already working query, I avoided the need to tamper with a SELECT statement, the middle one in Example 4-33, which I knew worked.

There's a performance issue, too, that isn't obvious in the amount of sample data used for the examples in this book. Joining to project in the innermost query would force the join to take place before any aggregation at all. Given a large enough number of rows in project_hours , a join that early would significantly increase the expenditure of I/O and CPU resources by the query, as all those detail rows would need to be joined. Having the join occur where it does in Example 4-33 means that only two rows, the two returned by the middle query, need to be joined to project .

Subqueries in the FROM clause are sometimes referred to as inline views , and such a subquery can be considered as a dynamically created view, for the purpose of the one query. Any subquery in the FROM clause can be replaced by an equivalent view, but then you have the problem of creating that view, which is a permanent database object.

You can see another interesting use of inline views, this time to generate rows of data that don't already exist in the database, in the article Creating Pivot Tables at http:// gennick .com/pivot.html.

4.7.2 Testing for Representation

You can use subqueries to see whether a row is representative of a set. The query in Example 4-32 contains the following WHERE condition:

WHERE project_id IN (1001, 1002)

This condition tests whether a row from the project_hours table is associated with project 1001 or 1002. Imagine a more complicated scenario. Imagine that you're interested in all projects having budgets of $1,000,000 or more. You don't know which projects have such large budgets. Furthermore, budgets change from time to time, and you don't want to have to modify the project ID numbers in your query each time your budgets change. Instead of hard-coding the project_id list for the IN predicate, you can generate that list using a subquery in the IN predicate, as shown in Example 4-34.

Example 4-34. A subquery generating values for an IN predicate

SELECT phs.project_id, MIN(phs.hours), MAX(phs.hours)

FROM (SELECT employee_id, project_id, SUM(hours_logged) hours

 FROM project_hours

 WHERE project_id IN (SELECT project_id

 FROM project

 WHERE project_budget >= 1000000)

 AND TRUNC(time_log_date,'Year') = TO_DATE('1-Jan-2004','dd-Mon-yyyy')

 GROUP BY employee_id, project_id) phs

GROUP BY project_id;

When you execute the SELECT statement shown in Example 4-34, the IN predicate's subquery is executed first to generate a list of project IDs encompassing all projects having budgets of $1,000,000 or more. Another approach to this problem is to issue a subquery for each project_hour row to see whether the associated project has the required budget. This approach is shown in Example 4-35.

Example 4-35. An EXISTS subquery checking for projects with large budgets

SELECT phs.project_id, MIN(phs.hours), MAX(phs.hours)

FROM (SELECT employee_id, project_id, SUM(hours_logged) hours

 FROM project_hours ph

 WHERE EXISTS (SELECT *

 FROM project pb

 WHERE pb.project_budget >= 1000000

 AND pb.project_id = ph.project_id)

 AND TRUNC(time_log_date,'Year') = TO_DATE('1-Jan-2004','dd-Mon-yyyy')

 GROUP BY employee_id, project_id) phs

GROUP BY project_id;

The subquery in Example 4-35 is known as a correlated subquery , meaning that the subquery references a value from the parent statement. Aliases are important when writing such subqueries. The parent query's project_hours table is aliased as ph . The subquery references the project ID from each candidate row from the main query via the reference to ph.project_id . The subquery is executed once for each row that can potentially be returned by the main query.

Subqueries prior to Example 4-35 have all been non-correlated , which means that you can execute them independently of their enclosing statement. Non-correlated subqueries are executed just once per execution of their containing query. Correlated subqueries are executed once per row. Either subquery may be written using IN or EXISTS, but typically you'll find that IN subqueries are non-correlated and EXISTS subqueries are correlated.

It's pretty much impossible to provide a general rule as to when to use IN and EXISTS predicates when either can be used to solve a particular problem. Given the low number of rows in the project table as compared to the project_hours table, you might think that the SELECT in Example 4-34 would be more efficient than the one in Example 4-35. After all, the IN predicate's subquery must execute only once, and return only two values. However, in my database, with the release of Oracle that I'm running and the sample data I'm using for this book, the SELECT in Example 4-35 requires far less input/output than Example 4-34. A good practice is to test both approaches, perhaps using SET AUTOTRACE ON as described in Chapter 12, and then use the one that performs best for your particular query.

Nulls in NOT IN Predicates

When writing NOT IN predicates, be sensitive to the possible presence of nulls in the results of your subqueries. The following query attempts to use NOT IN to return a list of all employees terminated on any date on which at least one employee hired in 2004 was terminated :

SELECT employee_id, employee_termination_date

FROM employee

WHERE employee_termination_date NOT IN (

 SELECT employee_termination_date

 FROM employee

 WHERE '2004' =

 TO_CHAR(

 TRUNC(employee_hire_date,'year'), 'yyyy')

 );

Executed against the sample data for this book, this query will return no rows because the subquery returns one or more nulls. A single null prevents the NOT IN condition from ever returning true because the null is treated as an unknown. Is a given termination date not in the set? The answer is unknown because there's at least one unknown value in the set. Be very careful of this scenario when writing NOT IN subqueries. Take care that such subqueries do not return nulls.

4.7.3 Generating Data for INSERTs and UPDATEs

Subqueries are useful in generating values for INSERT and UPDATE statements. Examples Example 4-36 and Example 4-37 show two ways of creating a reporting table of project billing data summarized by week. You might generate, and periodically refresh, such a table to make it easier for end users to check on project status using ad hoc query tools.

Example 4-36 uses a CREATE TABLE AS SELECT FROM statement, which lets you create and populate the reporting table in one step. Example 4-37 creates the reporting table separately, and then populates the table using an INSERT...SELECT FROM statement. In each case, a subquery generates the data to be inserted into the new table.

Example 4-36. Using CREATE TABLE AS SELECT FROM

DROP TABLE project_time;



CREATE TABLE project_time AS

 SELECT EXTRACT (YEAR FROM time_log_date) year,

 TO_NUMBER(TO_CHAR(time_log_date,'ww')) week_number,

 p.project_id, p.project_name, SUM(ph.hours_logged) hours_logged

 FROM project p JOIN project_hours ph 

 ON p.project_id = ph.project_id

 GROUP BY EXTRACT (YEAR FROM time_log_date), 

 TO_NUMBER(TO_CHAR(time_log_date,'ww')),

 p.project_id, p.project_name;

Example 4-37. Using INSERT...SELECT FROM

DROP TABLE project_time;



CREATE TABLE project_time (

 year NUMBER,

 week_number NUMBER,

 project_id NUMBER(4),

 project_name VARCHAR2(40),

 hours_logged NUMBER);



INSERT INTO project_time

 SELECT EXTRACT (YEAR FROM time_log_date),

 TO_NUMBER(TO_CHAR(time_log_date,'ww')),

 p.project_id, p.project_name, SUM(ph.hours_logged)

 FROM project p JOIN project_hours ph 

 ON p.project_id = ph.project_id

 GROUP BY EXTRACT (YEAR FROM time_log_date), 

 TO_NUMBER(TO_CHAR(time_log_date,'ww')),

 p.project_id, p.project_name;

Having created the project_time table with its redundant project_name column, it's a given that someone, someday, will come along and change a project name on you, leaving you to sort out the resulting mess by somehow propagating the new project name to all the summary rows in project_time . That kind of update is easily done using a subquery in the SET clause of an UPDATE statement, as in Example 4-38.

Example 4-38. A subquery generating a value for an UPDATE

UPDATE project_time pt

SET pt.project_name = (SELECT p.project_name

 FROM project p

 WHERE p.project_id = pt.project_id)

WHERE pt.project_name <> (SELECT p.project_name

 FROM project p

 WHERE p.project_id = pt.project_id)

The UPDATE in Example 4-38 reads each row in the project_time table and updates those project_time rows subject to name changes.

Introduction to SQL*Plus

Command-Line SQL*Plus