5.3 Correlated Subqueries
A subquery that references one or more columns from its containing SQL statement is called a correlated subquery. Unlike noncorrelated subqueries, which are executed exactly once prior to execution of the containing statement, a correlated subquery is executed once for each candidate row in the intermediate result set of the containing query. For example, consider the following query, which locates all parts supplied by Acme Industries that have been purchased 10 or more times since July 2001:
SELECT p.part_nbr, p.name
FROM supplier s INNER JOIN part p
ON s.supplier_id = p.supplier_id
WHERE s.name = 'Acme Industries'
AND 10 <=
(SELECT COUNT(*)
FROM cust_order co INNER JOIN line_item li
ON li.order_nbr = co.order_nbr
WHERE li.part_nbr = p.part_nbr
AND co.order_dt >= TO_DATE('01-JUL-2001','DD-MON-YYYY'));
The reference to p.part_nbr is what makes the subquery correlated; values for p.part_nbr must be supplied by the containing query before the subquery can execute. If there are 10,000 parts in the part table, but only 100 are supplied by Acme Industries, the subquery will be executed once for each of the 100 rows in the intermediate result set created by joining the part and supplier tables.
It is possible to ask for the subquery to be evaluated earlier in the execution plan using the PUSH_SUBQ hint; once again, we suggest you pick up a good book on Oracle tuning if you are interested in learning more about how Oracle actually executes subqueries.
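To try the correlated count on a small scale, the sketch below runs the same query shape against an in-memory SQLite database through Python's sqlite3 module. SQLite merely stands in for Oracle here, so dates are stored as ISO strings rather than built with TO_DATE. The rows, the helper name acme_parts, and the configurable threshold are invented for illustration.

```python
import sqlite3

# Hypothetical miniature of the supplier/part/cust_order/line_item schema.
# SQLite stands in for Oracle; dates are ISO strings instead of TO_DATE values.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (supplier_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE part (part_nbr TEXT PRIMARY KEY, name TEXT, supplier_id INTEGER);
    CREATE TABLE cust_order (order_nbr INTEGER PRIMARY KEY, order_dt TEXT);
    CREATE TABLE line_item (order_nbr INTEGER, part_nbr TEXT);

    INSERT INTO supplier VALUES (1, 'Acme Industries'), (2, 'Other Supplier');
    INSERT INTO part VALUES
        ('A-1', 'Widget', 1), ('A-2', 'Gadget', 1), ('B-1', 'Gizmo', 2);
    INSERT INTO cust_order VALUES
        (100, '2001-06-15'), (101, '2001-08-01'), (102, '2001-09-10');
    INSERT INTO line_item VALUES
        (100, 'A-1'), (101, 'A-1'), (102, 'A-1'), (101, 'A-2'), (102, 'B-1');
""")

def acme_parts(min_orders):
    """Return Acme parts ordered at least min_orders times since July 2001."""
    return conn.execute("""
        SELECT p.part_nbr, p.name
        FROM supplier s INNER JOIN part p
          ON s.supplier_id = p.supplier_id
        WHERE s.name = 'Acme Industries'
          AND ? <=
            (SELECT COUNT(*)
             FROM cust_order co INNER JOIN line_item li
               ON li.order_nbr = co.order_nbr
             WHERE li.part_nbr = p.part_nbr   -- correlation to the outer row
               AND co.order_dt >= '2001-07-01')
        ORDER BY p.part_nbr
    """, (min_orders,)).fetchall()

print(acme_parts(2))
print(acme_parts(1))
```

Part A-1 appears on three orders, but only two of them fall on or after July 1, 2001, so it is the only part that survives a threshold of 2. Part B-1 is never considered because its supplier is filtered out before the subquery is evaluated for it.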
Correlated subqueries are often used to test whether relationships exist without regard to cardinality. We might, for example, want to find all parts that have shipped at least once since January 2002. The EXISTS operator is used for these types of queries, as illustrated by the following query:
SELECT p.part_nbr, p.name, p.unit_cost
FROM part p
WHERE EXISTS
(SELECT 1
FROM line_item li INNER JOIN cust_order co
ON li.order_nbr = co.order_nbr
WHERE li.part_nbr = p.part_nbr
AND co.ship_dt >= TO_DATE('01-JAN-2002','DD-MON-YYYY'));
As long as the subquery returns one or more rows, the EXISTS condition is satisfied without regard for how many rows were actually returned by the subquery. Since the EXISTS operator returns TRUE or FALSE depending only on whether the subquery returns any rows, the actual columns returned by the subquery are irrelevant. The SELECT clause requires at least one column, however, so it is common practice to use either the literal "1" or the wildcard "*".
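The claim that the select list of an EXISTS subquery is irrelevant can be checked with a small experiment. The sketch below uses an in-memory SQLite database through Python's sqlite3 module as a stand-in for Oracle; the tables are cut down to two columns each, and the data and the helper parts_with_orders are invented for illustration.

```python
import sqlite3

# Hypothetical two-table miniature; SQLite stands in for Oracle.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE part (part_nbr TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE line_item (order_nbr INTEGER, part_nbr TEXT);
    INSERT INTO part VALUES ('A-1', 'Widget'), ('A-2', 'Gadget'), ('A-3', 'Gizmo');
    -- A-1 appears on three orders, A-2 on one, A-3 on none.
    INSERT INTO line_item VALUES (100, 'A-1'), (101, 'A-1'), (102, 'A-1'), (101, 'A-2');
""")

def parts_with_orders(select_list):
    # select_list is interpolated only because this sketch passes fixed literals.
    sql = f"""
        SELECT p.part_nbr
        FROM part p
        WHERE EXISTS
          (SELECT {select_list}
           FROM line_item li
           WHERE li.part_nbr = p.part_nbr)
        ORDER BY p.part_nbr
    """
    return [row[0] for row in conn.execute(sql)]

with_literal = parts_with_orders("1")
with_star = parts_with_orders("*")
with_null = parts_with_orders("NULL")
print(with_literal, with_star, with_null)
```

All three variants return the same parts, including the one that selects NULL, because EXISTS tests only for the presence of a row and never inspects the values in it.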
Conversely, you can test whether a relationship does not exist:
UPDATE customer c
SET c.inactive_ind = 'Y', c.inactive_dt = TRUNC(SYSDATE)
WHERE c.inactive_dt IS NULL
AND NOT EXISTS
(SELECT 1 FROM cust_order co
WHERE co.cust_nbr = c.cust_nbr
AND co.order_dt > TRUNC(SYSDATE) - 365);
This statement marks as inactive every customer who has not placed an order in the past year. Such queries are commonly found in maintenance routines. For example, foreign key constraints might prevent child records from referring to a nonexistent parent, but it is possible to have parent records without children. If business rules prohibit this situation, you might run a utility each week that removes these records, as in:
DELETE FROM cust_order co
WHERE co.order_dt > TRUNC(SYSDATE) - 7
AND co.cancelled_dt IS NULL
AND NOT EXISTS
(SELECT 1 FROM line_item li
WHERE li.order_nbr = co.order_nbr);
A query that includes a correlated subquery using the EXISTS operator is referred to as a semi-join. A semi-join includes rows in table A for which corresponding data is found one or more times in table B. Thus, the size of the final result set is unaffected by the number of matches found in table B. Similar to the anti-join discussed earlier, the Oracle optimizer can employ multiple strategies for formulating execution plans for such queries, including a merge semi-join or a hash semi-join.
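The difference between a semi-join and an ordinary join shows up in the row counts. The following sketch compares the two, again using an in-memory SQLite database through Python's sqlite3 module in place of Oracle, with invented data. It illustrates only the result-set semantics; it says nothing about which execution plan either database would choose.

```python
import sqlite3

# Hypothetical data; SQLite stands in for Oracle.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE part (part_nbr TEXT PRIMARY KEY);
    CREATE TABLE line_item (order_nbr INTEGER, part_nbr TEXT);
    INSERT INTO part VALUES ('A-1'), ('A-2'), ('A-3');
    INSERT INTO line_item VALUES (100, 'A-1'), (101, 'A-1'), (102, 'A-1'), (101, 'A-2');
""")

def column(sql):
    """Run a one-column query and return its values as a list."""
    return [row[0] for row in conn.execute(sql)]

# Inner join: one output row per matching line item, so A-1 repeats.
join_rows = column("""
    SELECT p.part_nbr
    FROM part p INNER JOIN line_item li ON li.part_nbr = p.part_nbr
    ORDER BY p.part_nbr
""")

# Semi-join: each part appears at most once, however many matches exist.
semi_rows = column("""
    SELECT p.part_nbr
    FROM part p
    WHERE EXISTS (SELECT 1 FROM line_item li WHERE li.part_nbr = p.part_nbr)
    ORDER BY p.part_nbr
""")

# Anti-join: the parts with no matching line item at all.
anti_rows = column("""
    SELECT p.part_nbr
    FROM part p
    WHERE NOT EXISTS (SELECT 1 FROM line_item li WHERE li.part_nbr = p.part_nbr)
    ORDER BY p.part_nbr
""")

print(join_rows, semi_rows, anti_rows)
```

The inner join returns four rows because part A-1 matches three line items, while the semi-join returns each qualifying part exactly once. Together, the semi-join and anti-join results partition the part table.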
Although they are very often used together, the use of correlated subqueries does not require the EXISTS operator. If your database design includes denormalized columns, for example, you might run nightly routines to recalculate the denormalized data, as in:
UPDATE customer c
SET (c.tot_orders, c.last_order_dt) =
(SELECT COUNT(*), MAX(co.order_dt)
FROM cust_order co
WHERE co.cust_nbr = c.cust_nbr
AND co.cancelled_dt IS NULL);
Because a SET clause assigns values to columns in the table, the only operator allowed is =. The subquery returns exactly one row (thanks to the aggregate functions), so the results may be safely assigned to the target columns. Rather than recalculating the totals for every customer each day, a more efficient method might be to update only those customers who placed orders today:
UPDATE customer c SET (c.tot_orders, c.last_order_dt) =
(SELECT c.tot_orders + COUNT(*), MAX(co.order_dt)
FROM cust_order co
WHERE co.cust_nbr = c.cust_nbr
AND co.cancelled_dt IS NULL
AND co.order_dt >= TRUNC(SYSDATE))
WHERE c.cust_nbr IN
(SELECT co.cust_nbr
FROM cust_order co
WHERE co.order_dt >= TRUNC(SYSDATE)
AND co.cancelled_dt IS NULL);
As the previous statement shows, data from the containing query can be used for other purposes in the correlated subquery than just join conditions in the WHERE clause. In this example, the SELECT clause of the correlated subquery adds the number of orders placed today to the previous value of tot_orders in the customer table to arrive at the new value.
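The incremental update can be exercised on a pair of customers to see both halves at work: the correlated SELECT clause that builds on the existing total, and the outer WHERE clause that leaves other customers alone. This is a minimal sketch using SQLite through Python's sqlite3 module rather than Oracle. It assumes a SQLite build with row-value support (3.15 or later), replaces TRUNC(SYSDATE) with a fixed date so the run is repeatable, and refers to the table by name instead of an alias. All data is invented.

```python
import sqlite3

TODAY = "2002-03-15"  # fixed stand-in for TRUNC(SYSDATE); an assumption of this sketch

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer
        (cust_nbr INTEGER PRIMARY KEY, tot_orders INTEGER, last_order_dt TEXT);
    CREATE TABLE cust_order
        (order_nbr INTEGER PRIMARY KEY, cust_nbr INTEGER,
         order_dt TEXT, cancelled_dt TEXT);
    INSERT INTO customer VALUES (1, 5, '2002-03-01'), (2, 3, '2002-02-10');
    -- Customer 1 places three orders today, one of them cancelled.
    INSERT INTO cust_order VALUES
        (900, 1, '2002-03-15', NULL),
        (901, 1, '2002-03-15', NULL),
        (902, 1, '2002-03-15', '2002-03-15');
""")

conn.execute("""
    UPDATE customer
    SET (tot_orders, last_order_dt) =
        (SELECT customer.tot_orders + COUNT(*), MAX(co.order_dt)
         FROM cust_order co
         WHERE co.cust_nbr = customer.cust_nbr
           AND co.cancelled_dt IS NULL
           AND co.order_dt >= :today)
    WHERE cust_nbr IN
        (SELECT co.cust_nbr
         FROM cust_order co
         WHERE co.order_dt >= :today
           AND co.cancelled_dt IS NULL)
""", {"today": TODAY})

rows = conn.execute(
    "SELECT cust_nbr, tot_orders, last_order_dt FROM customer ORDER BY cust_nbr"
).fetchall()
print(rows)
```

Customer 1 gains two orders (the cancelled one is ignored), and customer 2 is untouched. The outer WHERE clause matters: without it, the subquery would find no rows for customer 2, MAX would return NULL, and that customer's last_order_dt would be overwritten.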
Along with the WHERE clause of SELECT, UPDATE, and DELETE statements, and the SET clause of UPDATE statements, another potent use of correlated subqueries is in the SELECT clause, as illustrated by the following:
SELECT d.dept_id, d.name,
(SELECT COUNT(*) FROM employee e
WHERE e.dept_id = d.dept_id) empl_cnt
FROM department d;
   DEPT_ID NAME                   EMPL_CNT
---------- -------------------- ----------
        10 ACCOUNTING                    3
        20 RESEARCH                      5
        30 SALES                         6
        40 OPERATIONS                    0
The empl_cnt column returned from this query is derived from a correlated subquery that returns the number of employees assigned to each department. Note that the OPERATIONS department has no assigned employees, so the subquery returns 0.
To appreciate the value of subqueries in the SELECT clause, let's compare the previous query to a more traditional method using GROUP BY:
SELECT d.dept_id, d.name, COUNT(e.emp_id) empl_cnt
FROM department d LEFT OUTER JOIN employee e
ON d.dept_id = e.dept_id
GROUP BY d.dept_id, d.name;
   DEPT_ID NAME                   EMPL_CNT
---------- -------------------- ----------
        10 ACCOUNTING                    3
        20 RESEARCH                      5
        30 SALES                         6
        40 OPERATIONS                    0
To include every department in the result set, and not just those with assigned employees, you must perform an outer join from department to employee. The results are sorted by department ID and name, and the number of employees is counted within each department. In our opinion, the previous query employing the scalar correlated subquery is easier to understand. It does not need an outer join (or any join at all), and does not necessitate a sort operation, making it an attractive alternative to the GROUP BY version.
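The equivalence of the two formulations, and the reason the GROUP BY version needs an outer join, can be confirmed with a short experiment. The sketch below uses an in-memory SQLite database through Python's sqlite3 module as a stand-in for Oracle. The employee rows are invented so that the head counts match the output shown above, and the sketch compares results only, not execution cost.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, dept_id INTEGER);
    INSERT INTO department VALUES
        (10, 'ACCOUNTING'), (20, 'RESEARCH'), (30, 'SALES'), (40, 'OPERATIONS');
""")

# Invented head counts chosen to match the output shown above: 3, 5, 6, and 0.
head_counts = {10: 3, 20: 5, 30: 6}
employees = [(dept_id,) for dept_id, n in head_counts.items() for _ in range(n)]
conn.executemany("INSERT INTO employee (dept_id) VALUES (?)", employees)

# Scalar correlated subquery in the SELECT clause.
scalar_rows = conn.execute("""
    SELECT d.dept_id, d.name,
      (SELECT COUNT(*) FROM employee e
       WHERE e.dept_id = d.dept_id) AS empl_cnt
    FROM department d
    ORDER BY d.dept_id
""").fetchall()

# Outer join with GROUP BY.
grouped_rows = conn.execute("""
    SELECT d.dept_id, d.name, COUNT(e.emp_id) AS empl_cnt
    FROM department d LEFT OUTER JOIN employee e
      ON d.dept_id = e.dept_id
    GROUP BY d.dept_id, d.name
    ORDER BY d.dept_id
""").fetchall()

# Same GROUP BY query with an inner join: departments without employees vanish.
inner_rows = conn.execute("""
    SELECT d.dept_id, d.name, COUNT(e.emp_id) AS empl_cnt
    FROM department d INNER JOIN employee e
      ON d.dept_id = e.dept_id
    GROUP BY d.dept_id, d.name
    ORDER BY d.dept_id
""").fetchall()

print(scalar_rows == grouped_rows)
print([row[1] for row in inner_rows])
```

The scalar subquery and the outer-join versions agree row for row, while the inner-join variant silently drops OPERATIONS, which is the mistake the scalar subquery form makes impossible.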