When tuning queries, you can best picture nonjoin (single-table) conditions as filters. These filters pass through the table rows that the application needs (the rows satisfying the conditions), while discarding the rows that the application does not need (the rows failing to satisfy the query conditions). The application has functional reasons to exclude unneeded rows, but from a performance point of view the job of the filter is to save the database work and time. The trick of tuning is to avoid work, as much as possible, on rows destined for the trash heap. Selectivity is the power of a filter as a row-excluding tool, expressed as the fraction of table rows that the filter passes through. The database can apply filters at three points, with varying success at minimizing query cost:
2.6.1 Filter SelectivityIn this section, I explain how to calculate the selectivity of conditions on a table. Let's begin with some definitions:
When you tune a query and evaluate filter selectivity, start by asking if the query stands alone or if it represents a whole group of queries. If it represents a group of queries, ask about the distribution of values within that group . For example, consider the query: [5]
SELECT ... FROM Orders WHERE Unpaid_Flag='Y'; This query has (we would hope) high selectivity, being true for a small fraction of the complete history of orders. If you can count on finding a value of ' Y ' for Unpaid_Flag in the query, you might want an index on that column. But if the query is part of a group that just as often searches for the unselective condition Unpaid_Flag='N ', you should likely avoid the index. In this example, the meaning of the flag is likely special to the query, driving the very purpose of the query (to find bills to send out), so you can count on finding primarily queries against ' Y ', which is the rare value.
To calculate the selectivity of the condition Unpaid_Flag='Y ', begin by executing the following two queries: SELECT COUNT(*) FROM Orders WHERE Unpaid_Flag='Y'; SELECT COUNT(*) FROM Orders; The selectivity of the condition is the first count divided by the second. On the other hand, consider the query: SELECT ... FROM Order_Details WHERE Order_ID=:id; End users would query for details of orders roughly at random. You could assume this even if the application replaced the bind variable :id with an actual value, since an application would have no reason to always query the same order. The query against any particular ID does not stand alone, but represents a whole family of queries that you should tune as a unit. End users are as likely to query one order as another, so you could estimate the filter selectivity with: SELECT 1/COUNT(DISTINCT Order_ID) FROM Order_Details; A more subtle case arises when the end user might query on any value, but the end user is more likely to query on common values than uncommon values. For example, if operators bring up customers by finding all customers that have the last name of the calling customer, they will bring up common last names more often than uncommon ones with a query such as: SELECT ... FROM Customers WHERE Last_Name = 'SMITH'; Here, if you just counted distinct names, as you counted order IDs earlier, you would see an over-optimistic selectivity that assumed you were just as likely to search for Last_Name='KMETEC ' as for Last_Name='SMITH '. Each last name has a selectivity of n ( i )/ C , where n ( i ) is the count of rows with the i th nonnull last name and C is the count of all rows in the table. If choosing any last name were equally probable, you could just average n ( i )/ C over all the last names. That average would equal one over the number of distinct names. However, the probability of searching on a last name in this scenario is n ( i )/ C ', where C ' is the count of rows having nonnull last names. Therefore, you really need the sum of the selectivities times the probability of seeing each selectivity ”i.e., the sum of ( n ( i )/ C ') x ( n ( i )/ C ) ”over all last names. Since C ' is also the sum of the individual n ( i ) values, you can compute the filter selectivity in SQL as follows : SELECT SUM(COUNT(LastName)*COUNT(Last_Name))/ (SUM(COUNT(Last_Name))*SUM(COUNT(*))) FROM Customers GROUP BY Last_Name; 2.6.2 Index Range-Condition SelectivityIndex range-condition selectivity is the fraction of table rows the database examines in an index while scanning the index. Every index range scan must have two endpoints that define the beginning and end of the range. These endpoints can be equivalent to positive or negative infinity, meaning that the range can be unbounded on either end, though not generally on both ends. The range can exclude either or both of the range endpoints, depending on the nature of the bounding condition. For example, the range condition (Salary > 4000 AND Salary < 8000) excludes both endpoints with > and < bounding conditions. There are common problems that prevent a database from finding a concrete beginning and end for a range scan, even given a very selective condition on the indexed column, generally preventing effective use of the index. It is hard or even impossible (depending on the function) to translate a condition on SomeFunction(Some_Column) into a single range condition on a simple index leading with Some_Column . Generally, a database does not even try to do so.
Therefore, expressions that do more than simply name a column do not usually enable indexed access to ranges defined on that column, often making indexes useless for queries using those expressions. This is a double-edged sword: it forces you to rewrite some queries to enable the index that you want to use, but it is also a useful tool, allowing to you rewrite other queries to prevent the database from using indexes that you do not want to use. A subtle example of a function that disables use of an index is when you compare expressions of different types and the database applies a type-conversion function implicitly. For example, the type-inconsistent condition CharacterColumn=94303 implicitly becomes, in Oracle, TO_NUMBER(CharacterColumn)=94303 . To resolve this problem and enable use of an index on the character column, perform the conversion explicitly on the other side. For example: Replace
Even when comparing numbers to numbers, type conversion can lead to difficulties when a database vendor supports different number types, such as integer and decimal types. Type conversions also interfere with efficient, indexed join paths for which a foreign key of one type points to a primary key of another type.
Several rules apply when you figure out how large an index range scan will be:
If you have conditions that do not specify a restricted range on at least the first column of an index, the only way to use an index at all is to perform a full index scan , a scan of every index entry, across all leaf blocks. Databases will not usually choose full index scans, because to read an entire index is expensive and full index scans usually end up pointing to much of the table as well, with a net cost greater than the competing full table scan. You can force full index scans (often without meaning to) by adding a query hint (more on these later) that instructs the database to use an index even though it would not choose to use that index on its own. More often than not, adding such a hint is a mistake, a desperate attempt to get something better than a full table scan. Usually, this happens when a developer values an index too much, actually leading to a worse plan than the plan the database would choose without help. Even in cases for which a full index scan is better than the competing full table scan, it is almost surely worse than a third alternative: usually driving from a different, sometimes new index or changing a condition on a function to enable a narrow range scan. 2.6.3 Selectivity on Table Rows Reached from the IndexSince indexes are compact and better cached objects, compared to tables, selectivity of an index range scan matters less than efficient table access. Even when an index and table are perfectly cached, the selectivity on the table matters more than on the index, because you read about 300 index rows in a single logical I/O to a leaf block but you must do a separate logical I/O for every table row. This brings us to the second part of evaluating the usefulness of an index: how narrowly will the index restrict access to the table? Fortunately, database implementations are smart enough to check all conditions as early in the execution plan as possible, before touching a table, when the index allows. This reduces table reads whenever an index includes columns required to evaluate a condition, even if the database cannot use that condition to determine the range scan endpoints. For example, consider a Persons table with one index that includes the Area_Code , Phone_Number , Last_Name , and First_Name columns. Consider a query against this table with the conditions: WHERE Area_Code=916 AND UPPER(First_Name)='IVA' Only the first condition contributes to the endpoints of the index range scan. The second condition, on the fourth indexed column, fails to narrow that index range scan for three reasons:
Fortunately, none of these reasons prevents the database from checking the second condition before accessing the table. Because the index contains the First_Name column, the database can test conditions against that column using data from the index. In this case, the database uses the Area_Code condition to define an index range scan. Then, as it looks at each index entry in the range, the database discards entries with first names other than Iva . The likely result is just one or two table-row reads on this selective combination of conditions. Let's examine this situation more closely to decide whether it would be worthwhile to create a better index for this combination of conditions. Both Area_Code and First_Name will see skewed distributions. You will query the common area codes and common first names more often than the uncommon ones, so follow the approach for skewed distributions and find the individual selectivity factors: SELECT SUM(COUNT(Area_Code)*COUNT(Area_Code))/ (SUM(COUNT(Area_Code))*SUM(COUNT(*))) FROM Customers GROUP BY Area_Code; SELECT SUM(COUNT(First_Name)*COUNT(First_Name))/ (SUM(COUNT(First_Name))*SUM(COUNT(*))) FROM Customers GROUP BY First_Name; Let's say that the first calculation yields a selectivity of 0.0086, and the second yields a selectivity of 0.018. Let's further assume that you have a 1,000,000-row table and the index references 200 rows per leaf block (less than usual, because this is an unusually wide index key). The condition on Area_Code alone determines the number of index blocks scanned, so first find the number of rows that condition references. The calculation is 0.0086 x 1,000,000=8,600, indicating that the database scans 43 (8600/20) leaf blocks. (You can always neglect root and branch block reads in single-table queries.) These 43 leaf blocks are likely well cached and require about a millisecond to read.
Going to the table with 8,600 reads could be a problem, but, fortunately, you pick up the additional selectivity of the condition on First_Name before reaching the table, reducing the row count to about 155 (8,600 x 0.018) rows. You will see about 155 logical I/Os to the table, with a few physical I/Os, since you will see a less favorable hit ratio on the table, compared to the more compact index. The multicolumn filter selectivity, 0.0086 x 0.018 = 0.000155, is well into the range that favors indexed access, even though the individual selectivity of the first column alone is in the gray area. Notice that even if you modify the query or the data model to pick up the full selectivity in the index range conditions, the cost will go down only marginally, eliminating most of the 43 well-cached index leaf-block reads and speeding the query by about a millisecond. This will have no effect on the 155 more expensive table-block reads. (You could modify the application to do a case-sensitive match on the first name, and create a new index to cover just these two columns. Alternatively, you could modify the table to store the uppercase first names and index the new column together with area code.) Therefore, even though this query and index are not perfectly suited to one another, the index is good enough to use. It is not worthwhile to add a new index and change the query for the small potential gain. 2.6.4 Combining IndexesOccasionally, the database finds equality-condition filters that point to different single-column indexes. By combining index range scans on these indexes, the database can pick up the multicolumn filter selectivity before reaching the table. Let's reuse the earlier example of a query that specifies area code and customer first name, but replace the multicolumn index with single-column indexes on the two columns. Further, replace the condition on UPPER(First_Name) with a simple match on the column, assuming the column already stores uppercase values. Typical conditions are then: WHERE Area_Code=415 AND First_Name='BRUCE'; You can now assume that you have closer to the typical 300 rows or more per index leaf block, since these are single-column indexes on short columns. Using the same single-column filter selectivity factors and assuming as before that the table holds 1,000,000 rows, you can again predict that the Area_Code condition will yield 8,600 (0.0086 x 1,000,000) rows, requiring 29 (8,600/300) index leaf blocks. The second index range scan, for First_Name , will yield 18,000 (0.018 x 1,000,000) rows, requiring 60 (18,000/300) index leaf blocks. Since these are both equality conditions, the database finds the two rowid lists for these two conditions presorted, and it can easily merge these two sorted lists to find the same 155 table rowids found with the earlier multicolumn index. The operation just described, merging lists of rowids from two indexes, is called the index AND-EQUAL MERGE operation , and the database can do this between any number of equality conditions on single-column indexes. The cost of the table access is the same as finding the rows with one multicolumn index, but you must read more index leaf blocks ”in this case, 89 (29+60) instead of the earlier example of 43. As you can see, the AND-EQUAL MERGE is more expensive than a multicolumn index, but it might be better than using just one single-column index. But this case, in which the AND-EQUAL MERGE is reasonably attractive, is rare. Almost always, the most selective single-column condition is fine by itself and the cost of the less selective index range scan exceeds the savings in table access, or the added cost of the AND-EQUAL MERGE justifies creating a multicolumn index. When you see an AND-EQUAL MERGE in an execution plan, you should almost always either suppress use of the less selective index or create a multicolumn index to use instead. The new multicolumn index should start with the more selective column ”in this case, Area_Code ”and should usually take the place of any index on that column alone. Such an index sometimes enables you to drop the index on the other column as well. |