Almost all real-world queries with subqueries impose a special kind of condition on the rows in the outer, main query; they must match or not match corresponding rows in a related query. For example, if you need data about departments that have employees , you might query: SELECT ... FROM Departments D WHERE EXISTS (SELECT NULL FROM Employees E WHERE E.Department_ID=D.Department_ID) Alternatively, you might query for data about departments that do not have employees: SELECT ... FROM Departments D WHERE NOT EXISTS (SELECT NULL FROM Employees E WHERE E.Department_ID=D.Department_ID) The join E.Department_ID=D.Department_ID in each of these queries is the correlation join, which matches between tables in the outer query and the subquery. The EXISTS query has an alternate, equivalent form: SELECT ... FROM Departments D WHERE D.Department_ID IN (SELECT E.Department_ID FROM Employees E) Since these forms are functionally equivalent, and since the diagram should not convey a preconceived bias toward one form or another, both forms result in the same diagram. Only after evaluating the diagram to solve the tuning problem should you choose which form best solves the tuning problem and expresses the intended path to the data. 7.2.1 Diagramming Queries with SubqueriesIgnoring the join that links the outer query with the subquery, you can already produce separate, independent query diagrams for each of the two queries. The only open question is how you should represent the relationship between these two diagrams, combining them into a single diagram. As the EXISTS form of the earlier query makes clear, the outer query links to the subquery through a join: the correlation join. This join has a special property: for each outer-query row, the first time the database succeeds in finding a match across the join, it stops looking for more matches, considers the EXISTS condition satisfied, and passes the outer-query row to the next step in the execution plan. When it finds a match for a NOT EXISTS correlated subquery, it stops with the NOT EXISTS condition that failed and immediately discards the outer-query row, without doing further work on that row. All this behavior suggests that a query diagram should answer four special questions about a correlation join, questions that do not apply to ordinary joins:
When working with subqueries and considering these questions, remember that you still need to convey the properties you convey for any join: which end is the master table and how large are the join ratios. 7.2.1.1 Diagramming EXISTS subqueriesFigure 7-27 shows my convention for diagramming a query with an EXISTS -type subquery (which might be expressed by the equivalent IN subquery). The figure is based on the earlier EXISTS subquery: SELECT ... FROM Departments D WHERE EXISTS (SELECT NULL FROM Employees E WHERE E.Department_ID=D.Department_ID) Figure 7-27. A simple query with a subqueryFor the correlation join (known as a semi-join when applied to an EXISTS -type subquery) from Departments to Employees , the diagram begins with the same join statistics shown for Figure 5-1. Like any join, the semi-join that links the inner query to the outer query is a link with an arrowhead on any end of the join that points toward a primary key. Like any join link, it has join ratios on each end, which represent the same statistical properties the same join would have in an ordinary query. I use a midpoint arrow to point from the outer-query correlated node to the subquery correlated node. I place an E beside the midpoint arrow to show that this is a semi-join for an EXISTS or IN subquery. In this case, as in many cases with subqueries, the subquery part of the diagram is a single node, representing a subquery without joins of its own. Less frequently, also as in this case, the outer query is a single node, representing an outer query with no joins of its own. The syntax is essentially unlimited, with the potential for multiple subqueries linked to the outer query, for subqueries with complex join skeletons of their own, and even for subqueries that point to more deeply nested subqueries of their own. The semi-join link also requires up to two new numbers to convey properties of the subquery for purposes of choosing the best plan. Figure 7-27 shows both potential extra values that you sometimes need to choose the optimum plan for handling a subquery. The first value, next to the E (20 in Figure 7-27) is the correlation preference ratio . The correlation preference ratio is the ratio of I / E . E is the estimated or measured runtime of the best plan that drives from the outer query to the subquery (following EXISTS logic). I is the estimated or measured runtime of the best plan that drives from the inner query to the outer query (following IN logic). You can always measure this ratio directly by timing both forms, and doing so is usually not much trouble, unless you have many subqueries combined. Soon I will explain some rules of thumb to estimate I / E more or less accurately, and even a rough estimate is adequate to choose a plan when, as is common, the value is either much less than 1.0 or much greater than 1.0. When the correlation preference ratio is greater than 1.0, choose a correlated subquery with an EXISTS condition and a plan that drives from the outer query to the subquery. The other new value is the subquery adjusted filter ratio (0.8 in Figure 7-27), next to the detail join ratio. This is an estimated value that helps you choose the best point in the join order to test the subquery condition. This applies only to queries that should begin with the outer query, so exclude it from any semi-join link (with a correlation preference ratio less than 1.0) that you convert to the driving query in the plan.
Before I explain how to calculate the correlation preference ratio and the subquery adjusted filter ratios, let's consider when you even need them. Figure 7-28 shows a partial query diagram for an EXISTS -type subquery, with the root detail table of the subquery on the primary-key end of the semi-join. Figure 7-28. A semi-join to a primary keyHere, the semi-join is functionally no different than an ordinary join, because the query will never find more than a single match from table M for any given row from the outer query.
Since the semi-join is functionally no different than an ordinary join, you can actually buy greater degrees of freedom in the execution plan by explicitly eliminating the EXISTS condition and merging the subquery into the outer query. For example, consider this query: SELECT <Columns from outer query only> FROM Order_Details OD, Products P, Shipments S, Addresses A, Code_Translations ODT WHERE OD.Product_ID = P.Product_ID AND P.Unit_Cost > 100 AND OD.Shipment_ID = S.Shipment_ID AND S.Address_ID = A.Address_ID(+) AND OD.Status_Code = ODT.Code AND ODT.Code_Type = 'Order_Detail_Status' AND S.Shipment_Date > :now - 1 AND EXISTS (SELECT null FROM Orders O, Customers C, Code_Translations OT, Customer_Types CT WHERE C.Customer_Type_ID = CT.Customer_Type_ID AND CT.Text = 'Government' AND OD.Order_ID = O.Order_ID AND O.Customer_ID = C.Customer_ID AND O.Status_Code = OT.Code AND O.Completed_Flag = 'N' AND OT.Code_Type = 'ORDER_STATUS' AND OT.Text != 'Canceled') ORDER BY <Columns from outer query only> Using the new semi-join notation, you can diagram this as shown in Figure 7-29. Figure 7-29. A complex example of a semi-join with a correlation to the primary key of the subquery root detail tableIf you simply rewrite this query to move the table joins and conditions in the subquery into the outer query, you have a functionally identical query, since the semi-join is toward the primary key and the subquery is one-to-one with its root detail table: SELECT <Columns from the original outer query only> FROM Order_Details OD, Products P, Shipments S, Addresses A, Code_Translations ODT, Orders O, Customers C, Code_Translations OT, Customer_Types CT WHERE OD.Product_Id = P.Product_Id AND P.Unit_Cost > 100 AND OD.Shipment_Id = S.Shipment_Id AND S.Address_Id = A.Address_Id(+) AND OD.Status_Code = ODT.Code AND ODT.Code_Type = 'Order_Detail_Status' AND S.Shipment_Date > :now - 1 AND C.Customer_Type_Id = CT.Customer_Type_Id AND CT.Text = 'Government' AND OD.Order_Id = O.Order_Id AND O.Customer_Id = C.Customer_Id AND O.Status_Code = OT.Code AND O.Completed_Flag = 'N' AND OT.Code_Type = 'ORDER_STATUS' AND OT.Text != 'Canceled' ORDER BY <Columns from the original outer query only> I have indented this version to make obvious the simple transformation from one to the other in this case. The query diagram is also almost identical, as shown in Figure 7-30. Figure 7-30. The same query transformed to merge the subqueryThis new form has major additional degrees of freedom, allowing, for example, a plan joining to moderately filtered P after joining to highly filtered O but before joining to the almost unfiltered node OT . With the original form, the database would be forced to complete the whole subquery branch before it could consider joins to further nodes of the outer query. Since merging the subquery in cases like this can only help, and since this creates a query diagram you already know how to optimize, I will assume for the rest of this section that you will merge this type of subquery. I will only explain diagramming and optimizing other types. In theory, you can adapt the same trick for merging EXISTS -type subqueries with the semi-join arrow pointing to the detail end of the join too, but it is harder and less likely to help the query performance. Consider the earlier query against departments with the EXISTS condition on Employees : SELECT ... FROM Departments D WHERE EXISTS (SELECT NULL FROM Employees E WHERE E.Department_ID=D.Department_ID) These are the problems with the trick in this direction:
Therefore, you should rarely transform semi-joins to ordinary joins when the semi-join arrow points upward, except when the detail join ratio for the semi-join is near 1.0 or even less than 1.0. To complete a diagram for an EXISTS -type subquery, you just need rules to estimate the correlation preference ratio and the subquery adjusted filter ratio. Use the following procedure to estimate the correlation preference ratio:
Once you have found the correlation preference ratio, check whether you need the subquery adjusted filter ratio and determine the subquery adjusted filter ratio when you need it:
Try to see if you understand the rules to fill in the correlation preference ratio and the subquery adjusted filter ratio by completing Figure 7-31, which is missing these two numbers. Figure 7-31. A complex query with missing values for the semi-joinCheck your own work against this explanation. Calculate the correlation preference ratio:
To find the subquery adjusted filter ratio, follow these steps:
In the following section on optimizing EXISTS subqueries, I will illustrate optimizing the completed diagram, shown in Figure 7-32. 7.2.1.2 Diagramming NOT EXISTS subqueriesSubquery conditions that you can express with NOT EXISTS or NOT IN are simpler than EXISTS -type subqueries in one respect: you cannot drive from the subquery outward to the outer query. This eliminates the need for the correlation preference ratio. The E that indicates an EXISTS -type subquery condition is replaced by an N to indicate a NOT EXISTS -type subquery condition, and the correlation join is known as an anti-join rather than a semi-join, since it searches for the case when the join to rows from the subquery finds no match. It turns out that it is almost always best to express NOT EXISTS -type subquery conditions with NOT EXISTS , rather than with NOT IN . Consider the following template for a NOT IN subquery: SELECT ... FROM ... Outer_Anti_Joined_Table Outer WHERE... AND Outer.Some_Key NOT IN (SELECT Inner.Some_Key FROM ... Subquery_Anti_Joined_Table Inner WHERE <Conditions_And_Joins_On_Subquery_Tables_Only> ) ... You can and should rephrase this in the equivalent NOT EXISTS form: SELECT ... FROM ... Outer_Anti_Joined_Table Outer WHERE... AND Outer.Some_Key IS NOT NULL AND NOT EXISTS (SELECT null FROM ... Subquery_Anti_Joined_Table Inner WHERE <Conditions_And_Joins_On_Subquery_Tables_Only> AND Outer.Some_Key = Inner.Some_Key)
Both EXISTS -type and NOT EXISTS -type subquery conditions stop looking for matches after they find the first match, if one exists. NOT EXISTS subquery conditions are potentially more helpful early in the execution plan, because when they stop early with a found match, they discard the matching row, rather than retain it, making later steps in the plan faster. In contrast, to discard a row with an EXISTS condition, the database must examine every potentially matching row and rule them all out, a more expensive operation when there are many details per master across the semi-join. Remember the following rules, which compare EXISTS and NOT EXISTS conditions that point to detail tables from a master table in the outer query:
Since it is generally both more difficult and less rewarding to convert NOT EXISTS subquery conditions to equivalent simple queries without subqueries, you will often use NOT EXISTS subqueries at either end of the anti-join: the master-table end or the detail-table end. You rarely need to search for alternate ways to express a NOT EXISTS condition. Since selective NOT EXISTS conditions are inexpensive to check, it turns out to be fairly simple to estimate the subquery adjusted filter ratio:
7.2.2 Tuning Queries with SubqueriesAs for simple queries, optimizing complex queries with subqueries turns out to be straightforward, once you have a correct query diagram. Here are the steps for optimizing complex queries, including subqueries, given a complete query diagram:
As an example, consider Figure 7-32, which is Figure 7-31 with the correlation preference ratio and the subquery adjusted filter ratio filled in. Figure 7-32. Example problem optimizing a complex query with a subquerySince the correlated join is EXISTS -type, Step 1 does not apply. Since the midpoint arrow of the semi-join points upward, Step 2 does not apply. The lowest (the only) correlation preference ratio is 1.5 (next to the E ), so Step 3 does not apply. Applying Step 4, you find that the best driving node in the outer query is M . Applying Step 5, choose between downward joins to A1 and A2 with filter ratios of 0.2 and 0.7, respectively, or a virtual downward join to the virtual node representing the whole subquery, with a virtual filter ratio of 0.6. A1 is the best of these three candidates, with the best filter ratio, so join to it next. Finding nothing downstream of A1 , you next find the subquery as the next-best choice in the join order (again applying Step 5), so perform the semi-join to D . Applying Step 6, having begun the subquery, you must finish it, starting from D as the driving node. Following simple-query rules within that subquery, next join to S1 , S3 , S2 , and S4 , in that order. Returning to the outer query, applying the usual rules for simple queries, you find the remainder of the join order, A2 , B1 , B2 . The whole optimum join order, including the semi-join, is then (M, A1, D, S1, S3, S2, S4, A2 , B1, B2) .
|