| < Day Day Up > |
|
Tuning subqueries is a highly complex topic. Quite often subqueries can be used to partially replace subset parts of very large mutable joins, with possible enormous performance improvements.
A correlated subquery allows a correlation between a calling query and a subquery. A value for each row in the calling query is passed into the subquery to be used as a constraint by the subquery. A noncorrelated or regular subquery does not contain a correlation between calling query and subquery and thus the subquery is executed in its entirety, independently of the calling query, for each row in the calling query. Tuning correlated subqueries is easier because values in subqueries can be precisely searched for in relation to each row of the calling query.
A correlated subquery will access a specified row or set of rows for each row in the calling query. Depending on circumstances a correlated subquery is not always faster than a noncorrelated subquery. Use of indexes or small tables inside a subquery, even for noncorrelated subqueries, does not necessarily make a subquery perform poorly.
We have already seen substantial use of IN and EXISTS in the section on comparison conditions. We know already that IN is best used for small tables or lists of literal values. EXISTS is best used to code queries in a correlated fashion, establishing a link between a calling query and a subquery. To reiterate it is important to remember that using EXISTS is not always faster than using IN.
Subqueries can be nested where a subquery can call another subquery. The following example using the Employees schema shows a query calling a subquery, which in turn calls another subquery.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT * FROM division WHERE division_id IN (SELECT division_id FROM department WHERE department_id IN (SELECT department_id FROM project));
Notice in the query plan how the largest table is scanned using an INDEX FAST FULL SCAN. The Optimizer is intelligent enough to analyze this nested query and discern that the Project table is much larger than both of the other two tables. The other two tables are so small that the only viable option is a full table scan.
Query Cost Rows Bytes ----------------------------------- ---- ------ ------- SELECT STATEMENT on 14 10 590 HASH JOIN SEMI on 14 10 590 TABLE ACCESS FULL on DIVISION 1 10 460 VIEW on VW_NSO_1 8 10000 130000 HASH JOIN on 8 10000 60000 TABLE ACCESS FULL on DEPARTMENT 1 100 400 INDEX FAST FULL SCAN on XFKPROJECT_DEPT 4 10000 20000
Nested subqueries can be difficult to tune but can often be a viable and sometimes highly effective tool for the tuning of mutable complex joins, with three and sometimes many more tables in a single join. There is a point when there are so many tables in a join that the Optimizer can become less effective.
For very large complex mutable joins it is often possible to replace joins or parts of joins with subqueries. Very large joins can benefit the most because they are difficult to decipher and tune. Some very large joins are even beyond the intelligence of the Optimizer to assess in the best possible way. Two ways in which subqueries can replace joins in complex mutable joins are as follows:
A table in the join not returning a column in the primary calling query can be removed from the join and checked using a subquery.
FROM clauses can contain nested subqueries to break up joins much in the way that PL/SQL would use nested looping cursors.
Certain aspects of SQL coding placed in subqueries can cause problems:
An ORDER BY clause is always applied to a final result and should not be included in subqueries if possible.
DISTINCT will always cause a sort and is not always necessary. Perhaps a parent table could be used where a unique value is present.
When testing against subqueries retrieve, filter, and aggregate on indexes not tables. Indexes usually offer better performance.
Do not be too concerned about full table scans on very small static tables.
Tip | Instances where joins can be replaced with subqueries often involve databases with heavy outer join requirements. Excessive use of SQL outer joins is possibly indicative of an over-granular data model structure. However, it could also indicate orphaned child table rows or the opposite: redundant static data. Cleaning out redundant or orphaned rows can sometimes help performance immensely by negating the need for outer joins. |
Going back to the Accounts schema once again look at the following complex mutable join. We are joining four tables and selecting a column from only one of the tables. EXISTS comparisons can be placed into the WHERE clause to force index access, removing three tables from the join.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT c.name FROM customer c JOIN orders o USING(customer_id) JOIN ordersline ol USING(order_id) JOIN transactions t USING(customer_id) JOIN transactionsline tl USING(transaction_id) WHERE c.balance > 0;
Its query plan is a little scary. There are three full table scans and two full index scans. The objective is to remove full table scans and change as many index scans as possible into unique index scans.
Query Cost Rows Bytes --------------------------------- ------ ------ --------- SELECT STATEMENT on ###### ####### ######### MERGE JOIN on ###### ####### ######### SORT JOIN on ###### ####### ######### MERGE JOIN on 4988 ####### ######### SORT JOIN on 3209 100136 3805168 HASH JOIN on 762 100136 3805168 TABLE ACCESS FULL on CUSTOMER 9 2690 69940 MERGE JOIN on 316 100237 1202844 INDEX FULL SCAN on XFK_ORDERLINE_ORDER 26 540827 2704135 SORT JOIN on 290 31935 223545 TABLE ACCESS FULL on ORDERS 112 31935 223545 SORT JOIN on 1780 188185 1317295 TABLE ACCESS FULL on TRANSACTIONS 187 188185 1317295 SORT JOIN on 5033 570175 2850875 INDEX FAST FULL SCAN on XFK_TRANSLINE_TRANS 4 570175 2850875
Only the Customer.NAME column is selected. This query is an extreme case but we can actually remove every table from the join except the Customer table. Let's show this in two stages. Firstly, I will remove the transaction tables from the join.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT c.name FROM customer c JOIN orders o ON(c.customer_id = o.customer_id) JOIN ordersline ol USING(order_id) WHERE c.balance > 0 AND EXISTS( SELECT t.transaction_id FROM transactions t WHERE t.customer_id = c.customer_id AND EXISTS( SELECT transaction_id FROM transactionsline WHERE transaction_id = t.transaction_id ) );
We have now reduced the full table scans to two, have a single full index scan and most importantly index range scans on both of the transaction tables.
Query Cost Rows Bytes --------------------------------- ------ ------ --------- SELECT STATEMENT on 359 5007 190266 FILTER on HASH JOIN on 359 5007 190266 TABLE ACCESS FULL on CUSTOMER 9 135 3510 MERGE JOIN on 316 100237 1202844 INDEX FULL SCAN on XFK_ORDERLINE_ORDER 26 540827 2704135 SORT JOIN on 290 31935 223545 TABLE ACCESS FULL on ORDERS 112 31935 223545 NESTED LOOPS on 72 212 2544 TABLE ACCESS BY INDEX ROWID on TRANSACTIONS 2 70 490 INDEX RANGE SCAN on XFX_TRANS_CUSTOMER 1 70 INDEX RANGE SCAN on XFK_TRANSLINE_TRANS 1 570175 2850875
Now let's get completely ridiculous and remove every table from the join but the Customer table.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT c.name FROM customer c WHERE c.balance > 0 AND EXISTS( SELECT o.order_id FROM orders o WHERE o.customer_id = c.customer_id AND EXISTS( SELECT order_id FROM ordersline WHERE order_id = o.order_id ) ) AND EXISTS( SELECT t.transaction_id FROM transactions t WHERE t.customer_id = c.customer_id AND EXISTS( SELECT transaction_id FROM transactionsline WHERE transaction_id = t.transaction_id ) );
This is about the best that can be done with this query, now no longer a join. This final result has full table access on the Customer table only, along with four index range scans. We could possibly improve the query further by decreasing the number of Customer rows retrieved using filtering.
Query Cost Rows Bytes --------------------------------- ------ ------ --------- SELECT STATEMENT on 9 7 182 FILTER on TABLE ACCESS FULL on CUSTOMER 9 7 182 NESTED LOOPS on 66 201 2412 TABLE ACCESS BY INDEX ROWID on ORDERS 2 64 448 INDEX RANGE SCAN on XFK_ORDERS_CUSTOMER 1 64 INDEX RANGE SCAN on XFK_ORDERLINE_ORDER 1 540827 2704135 NESTED LOOPS on 72 212 2544 TABLE ACCESS BY INDEX ROWID on TRANSACTIONS 2 70 490 INDEX RANGE SCAN on XFX_TRANS_CUSTOMER 1 70 INDEX RANGE SCAN on XFK_TRANSLINE_TRANS 1 570175 2850875
Now what we want to do is to retrieve columns from different tables. Columns cannot be retrieved from an EXISTS comparison in the WHERE clause. We have to use another method. Nested subqueries in the FROM clause allow retrieval of columns.
In this example I am adding extra filtering to the TransactionsLine table; at over 500,000 rows it is the largest table in the query. Since the TransactionsLine table is larger than the Customer table it is filtered first.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT c.name, tl.amount FROM customer c JOIN orders o USING(customer_id) JOIN ordersline ol USING(order_id) JOIN transactions t USING(customer_id) JOIN transactionsline tl USING(transaction_id) WHERE tl.amount > 3170 AND c.balance > 0;
We start with three full table scans, one index range scan and a unique index hit.
Query Cost Rows Bytes --------------------------------- ------ ------ --------- SELECT STATEMENT on 1860 605 33880 NESTED LOOPS on 1860 605 33880 HASH JOIN on 1667 193 9843 TABLE ACCESS FULL on ORDERS 112 31935 223545 MERGE JOIN CARTESIAN on 436 43804 1927376 NESTED LOOPS on 292 16 288 TABLE ACCESS FULL on TRANSACTIONSLINE 276 16 176 TABLE ACCESS BY INDEX ROWID on TRANSACTIO 1 188185 1317295 INDEX UNIQUE SCAN on XPKTRANSACTIONS 188185 BUFFER SORT on 435 2690 69940 TABLE ACCESS FULL on CUSTOMER 9 2690 69940 INDEX RANGE SCAN on XFK_ORDERLINE_ORDER 1 540827 2704135
Firstly, some appropriate simple tuning can be done. Since the TransactionsLine table is the largest table with the smallest relative filtered result, it should be selected from first.
Tip | The first table to be processed should be the largest table with the largest relative row reduction filter. In other words, the biggest table with the lowest number of rows retrieved from it. This applies to both the FROM clause and the WHERE clause. Always reduce rows to be joined first. |
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT c.name, tl.amount FROM transactionsline tl JOIN transactions t USING(transaction_id) JOIN customer c USING(customer_id) JOIN orders o USING(customer_id) JOIN ordersline ol ON(ol.order_id = o.order_id) WHERE tl.amount > 3170 AND c.balance > 0;
Appropriate simple tuning yields one full table scan, two index range scans, and two unique index hits.
Query Cost Rows Bytes --------------------------------- ------ ------ --------- SELECT STATEMENT on 1381 3267 182952 NESTED LOOPS on 1381 3267 182952 NESTED LOOPS on 340 1041 53091 NESTED LOOPS on 308 16 704 NESTED LOOPS on 292 16 288 TABLE ACCESS FULL on TRANSACTIONSLINE 276 16 176 TABLE ACCESS BY INDEX ROWID on TRANSACTIO 1 33142 231994 INDEX UNIQUE SCAN on XPKTRANSACTIONS 33142 TABLE ACCESS BY INDEX ROWID on CUSTOMER 1 2690 69940 INDEX UNIQUE SCAN on XPKCUSTOMER 2690 TABLE ACCESS BY INDEX ROWID on ORDERS 2 172304 1206128 INDEX RANGE SCAN on XFK_ORDERS_CUSTOMER 1 172304 INDEX RANGE SCAN on XFK_ORDERLINE_ORDER 1 540827 2704135
Now let's use the FROM clause to create nested subqueries. The trick is to put the largest table with the most severe filter at the deepest nested level, forcing it to execute first. Thus we start with the TransactionsLine table.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT c.name, b.amount FROM customer c, ( SELECT t.customer_id, a.amount FROM transactions t,( SELECT transaction_id, amount FROM transactionsline WHERE amount > 3170 ) a WHERE t.transaction_id = a.transaction_id ) b, orders o, ordersline ol WHERE c.balance > 0 AND c.customer_id = b.customer_id AND o.customer_id = c.customer_id AND ol.order_id = o.order_id;
The cost is reduced further with the same combination of scans because fewer rows are being joined.
Query Cost Rows Bytes --------------------------------- ------ ------ --------- SELECT STATEMENT on 533 605 33880 NESTED LOOPS on 533 605 33880 NESTED LOOPS on 340 193 9843 NESTED LOOPS on 308 16 704 NESTED LOOPS on 292 16 288 TABLE ACCESS FULL on TRANSACTIONSLINE 276 16 176 TABLE ACCESS BY INDEX ROWID on TRANSACTIO 1 33142 231994 INDEX UNIQUE SCAN on XPKTRANSACTIONS 33142 TABLE ACCESS BY INDEX ROWID on CUSTOMER 1 2690 69940 INDEX UNIQUE SCAN on XPKCUSTOMER 2690 TABLE ACCESS BY INDEX ROWID on ORDERS 2 31935 223545 INDEX RANGE SCAN on XFK_ORDERS_CUSTOMER 1 31935 INDEX RANGE SCAN on XFK_ORDERLINE_ORDER 1 540827 2704135
Now let's combine WHERE clause comparison subqueries and FROM clause embedded subqueries.
EXPLAIN PLAN SET statement_id='TEST' FOR SELECT c.name, b.amount FROM customer c, ( SELECT t.customer_id, a.amount FROM transactions t,( SELECT transaction_id, amount FROM transactionsline WHERE amount > 3170 ) a WHERE t.transaction_id = a.transaction_id ) b WHERE c.balance > 0 AND EXISTS( SELECT o.order_id FROM orders o WHERE o.customer_id = c.customer_id AND EXISTS( SELECT order_id FROM ordersline WHERE order_id = o.order_id ) );
Using EXISTS makes the query just that little bit faster with lower cost since the number of tables joined is reduced.
Query Cost Rows Bytes --------------------------------- ------ ------ --------- SELECT STATEMENT on 420 2190 91980 FILTER on NESTED LOOPS on 420 2190 91980 MERGE JOIN CARTESIAN on 420 2190 81030 TABLE ACCESS FULL on TRANSACTIONSLINE 276 16 176 BUFFER SORT on 144 135 3510 TABLE ACCESS FULL on CUSTOMER 9 135 3510 INDEX UNIQUE SCAN on XPKTRANSACTIONS 188185 940925 NESTED LOOPS on 66 201 2412 TABLE ACCESS BY INDEX ROWID on ORDERS 2 64 448 INDEX RANGE SCAN on XFK_ORDERS_CUSTOMER 1 64 INDEX RANGE SCAN on XFK_ORDERLINE_ORDER 1 540827 2704135
| < Day Day Up > |
|