6.1. Small Result Set, Direct Specific Criteria

The typical online transaction-processing query is a query returning a small result set from a few tables and with very specific criteria applied to those tables. When we are looking for a few rows that match a selective combination of conditions, our first priority is to pay attention to indexes.

The trivial case of a single table or even a join between two tables that returns few rows presents no more difficulty than ensuring that the query uses the proper index. However, when many tables are joined together, and we have input criteria referring to, for instance, two distinct tables TA and TB, then we can either work our way from TA to TB or from TB to TA. The choice depends on how fast we can get rid of the rows we do not want. If statistics reflect the contents of tables with enough accuracy, the optimizer should, hopefully, be able to make the proper decision as to the join order.

When writing a query that must return few rows against direct, specific criteria, we must identify the criteria that are most efficient at filtering the rows. For the most critical of those criteria, we must make sure, before anything else, that the corresponding columns are indexed and that the query is written so that the indexes can actually be used.

6.1.1. Index Usability

You've already seen in Chapter 3 that whenever a function is applied to an indexed column, a regular index cannot be used. Instead, you would have to create a functional index, which means that you index the result of the function applied to the column instead of indexing the column.
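As a quick reminder of that discussion, here is a minimal sketch (the employees table and its columns are invented for illustration, and the create index syntax is Oracle's):

     -- A plain index on hire_date is unusable here, because the column
     -- is wrapped inside a function in the where clause:
     select employee_id
     from employees
     where to_char(hire_date, 'YYYY') = '2006';

     -- Preferred fix: leave the column bare so that a regular index applies
     select employee_id
     from employees
     where hire_date >= to_date('01-JAN-2006', 'DD-MON-YYYY')
       and hire_date <  to_date('01-JAN-2007', 'DD-MON-YYYY');

     -- Alternative: index the result of the function itself
     create index employees_hire_year
         on employees (to_char(hire_date, 'YYYY'));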

Remember too that you don't have to explicitly invoke a function to see a function applied; if you compare a column of a given type to a column or literal value of a different type, the DBMS may perform an implicit type conversion (an implicit call to a conversion function), with the performance hit that one can expect.
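For instance (a sketch, with an assumed orders table in which order_ref is a character column):

     -- Comparing the varchar2 column to a number forces an implicit
     -- conversion of the column itself and may disable its index:
     select order_id
     from orders
     where order_ref = 12345;

     -- Comparing like with like keeps the plain index on order_ref usable:
     select order_id
     from orders
     where order_ref = '12345';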

Once we are certain that there are indexes on our critical search criteria and that our query is written in such a way that it can take full advantage of them, we must distinguish between unique index fetches of a single row, and other fetches: a fetch through a non-unique index, or a range scan of a unique index.

6.1.2. Query Efficiency and Index Usage

Unique indexes are excellent when joining tables. However, when the input to a query is a primary key and the value of the primary key is not a primitive input to the program, then you may have a poorly designed program on your hands.

What I call primitive input is data that has been fed into the program, either typed in by a user or read from a file. If the primary key value has been derived from some primitive input and is itself the result of a query, the odds are very high that there is a massive design flaw in the program. Because this situation often means that the output of one query is used as the input to another one, you should check whether the two queries can be combined.
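The pattern to watch for is sketched below (customers and orders are invented names); two round trips where a single join driven by the primitive input would do:

     -- Suspect program logic:
     --   step 1: select customer_id from customers where customer_name = :name
     --   step 2: select ... from orders where customer_id = <value from step 1>
     -- The two steps usually collapse into a single statement:
     select o.order_id,
            o.order_date,
            o.amount
     from orders o,
          customers c
     where c.customer_name = :name
       and o.customer_id = c.customer_id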

Excellent queries don't necessarily come from excellent programs.

6.1.3. Data Dispersion

When indexes are not unique, or when a condition on a unique index is expressed as a range, for instance:

     where customer_id between ... and ...

or:

     where supplier_name like 'SOMENAME%'

the DBMS must perform a range scan. Rows associated with a given key may be spread all over the table being queried, and this is something that a cost-based optimizer often understands. There are therefore cases when an index range scan would require the DBMS kernel to fetch, one by one, a large number of table data pages, each with very few rows of relevance to the query, and when the optimizer decides that the DBMS kernel is better off scanning the table and ignoring the index.

You saw in Chapter 5 that many database systems offer facilities such as table partitions or clustered indexes to direct the storage of data that we would like to retrieve together. But the mere nature of data insertion processes may well lead to clumping of data. When we associate a timestamp with each row and do mostly inserts into a table, the chances are that most rows will be inserted next to one another (unless we have taken special measures to limit contention, as I discuss in Chapter 9). The physical proximity of the inserted rows is not an absolute necessity and, in fact, the notion of order as such is totally foreign to relational algebra. But, in practice, it is what may happen. Therefore, when we perform a range scan on the index on the timestamp column to look for index entries close together in time, the chances are that the rows in question will be close together too. Of course, this will be even truer if we have tweaked the storage so as to get such a result.

Now, if the value of a key bears no relation to any peculiar circumstance of insertion nor to any hidden storage trick, the various rows associated with a key value or with a range of key values can be physically placed anywhere on disk. The keys in the index are always, by construction, held in sorted order. But the associated rows will be randomly located in the table. In practice, we shall have to visit many more blocks to answer a query involving such an index than would be the case were the table partitioned or the index clustered. We can have, therefore, two indexes on the same table, with strictly identical degrees of selectivity, one of which gives excellent results, and the other one, significantly worse results, a situation that was mentioned in Chapter 3 and that it is now time to prove.

To illustrate this case I have created a 1,000,000-row table with three columns c1, c2, and c3, c1 being filled with a sequence number (1 to 1,000,000), c2 with all different random numbers in the range 1 to 2,000,000, and c3 with random values that can be, and usually are, duplicated. At face value, and from a logical point of view, c1 and c2 are both unique and therefore have identical selectivity. In the case of the index on column c1, the order of the rows in the table matches the order of the keys in the index. In a real case, some activity against the table might lead to "holes" left by deletions and subsequently filled with out-of-order records due to new insertions. By contrast, the order of the rows in the table bears no relation to the ordering of the keys in the index on c2.
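One possible way to build such a table is sketched below (Oracle syntax; the table and index names are mine, and c2 is made unique by shuffling a sequence rather than by drawing raw random values):

     create table test_dispersion (c1 number,
                                   c2 number,
                                   c3 number);

     -- c1 follows the physical insertion order; c2 takes unique, even
     -- values between 2 and 2,000,000 in an order unrelated to storage;
     -- c3 takes random, freely duplicated values.
     insert into test_dispersion (c1, c2, c3)
     select rownum,
            shuffled * 2,
            trunc(dbms_random.value(1, 1000))
     from (select rownum shuffled
           from dual
           connect by level <= 1000000
           order by dbms_random.value);

     create unique index i_ordered on test_dispersion (c1);
     create unique index i_random  on test_dispersion (c2);
     commit;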

When we fetch c3, based on a range condition of the type:

     where column_name between some_value and some_value + 10

it makes a significant difference whether we use c1 and its associated index (the ordered index, where keys are ordered as the rows in the table) or c2 and its associated index (the random index), as you can see in Figure 6-1. Don't forget that we have such a difference because additional accesses to the table are required in order to fetch the value of c3; there would be no difference if we had two composite indexes, on (c1, c3) and (c2, c3), because then we could return everything from an index in which the keys are ordered.
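To make the composite-index remark concrete, here is a sketch against the test table described above (the index names are mine):

     -- Either covering index answers the query on c3 by itself; since the
     -- table is never visited, the physical dispersion of its rows no
     -- longer matters.
     create index i_ordered_covering on test_dispersion (c1, c3);
     create index i_random_covering  on test_dispersion (c2, c3);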

The type of difference illustrated in Figure 6-1 also explains why sometimes performance can degrade over time, especially when a new system is put into production with a considerable amount of data coming from a legacy system. It may happen that the initial data loading imposes some physical ordering that favors particular queries. If a few months of regular activity subsequently destroys this order, we may suffer over this period a mysterious 30 to 40% degradation of performance.

Figure 6-1. Difference of performance when the order in the index matches the order of the rows in the table


It should be clear by now that the solution "can't the DBAs reorganize the database from time to time?" is indeed a fudge, not a solution. Database reorganizations were once quite in vogue. Ever-increasing volumes, 99.9999% uptime requirements, and the like have made them, for the most part, an administrative task of the past. If the physical implementation of rows really is crucial for a critical process, then consider one of the self-organizing structures discussed in Chapter 5, such as clustered indexes or index-organized tables. But keep in mind that what favors one type of query sometimes disadvantages another type of query, and that we cannot win on all fronts.

Performance variation between comparable indexes may be due to physical data dispersion.

6.1.4. Criterion Indexability

Understand that the proper indexing of specific criteria is an essential component of the "small set, direct specific criteria" situation. There are cases in which the result set is small and some criteria are indeed quite selective, yet of a nature that makes them unsuitable for indexing: the following real-life example, a search for differences between amounts in an accounting program, illustrates a criterion that is very selective and yet unfit for indexing.

In the example to follow, a table named glreport contains a column named amount_diff that ought to contain zeroes. The purpose of the query is to track accounting errors, and identify where amount_diff isn't zero. Directly mapping ledgers to tables and applying a logic that dates back to a time when these ledgers were inked with a quill is rather questionable when using a modern DBMS, but unfortunately one encounters questionable databases on a routine basis. Irrespective of the quality of the design, a column such as amount_diff is typical of a column that should not be indexed: ideally amount_diff should contain nothing but zeroes, and furthermore, it is obviously the result of a denormalization and the object of numerous computations. Maintaining an index on a column that is subjected to computations is even costlier than maintaining an index on a static column, since a modified key will "move" inside the index, causing the index to undergo far more updates than it would from the simple insertion or deletion of nodes.

All specific criteria are not equally suitable for indexing. In particular, columns that are frequently updated increase maintenance costs.

Returning to the example, a developer came to me one day saying that he had to optimize the following Oracle query, and he asked for some expert advice about the execution plan:

     select
        total.deptnum,
        total.accounting_period,
        total.ledger,
        total.cnt,
        error.err_cnt,
        cpt_error.bad_acct_count
     from
      -- First in-line view
      (select
           deptnum,
           accounting_period,
           ledger,
           count(account) cnt
       from
           glreport
       group by
           deptnum,
           ledger,
           accounting_period) total,
      -- Second in-line view
      (select
           deptnum,
           accounting_period,
           ledger,
           count(account) err_cnt
       from
           glreport
       where
           amount_diff <> 0
       group by
           deptnum,
           ledger,
           accounting_period) error,
      -- Third in-line view
      (select
           deptnum,
           accounting_period,
           ledger,
           count(distinct account) bad_acct_count
       from
           glreport
       where
           amount_diff <> 0
       group by
           deptnum,
           ledger,
           accounting_period) cpt_error
     where
        total.deptnum = error.deptnum(+)
     and
        total.accounting_period = error.accounting_period(+)
     and
        total.ledger = error.ledger(+)
     and
        total.deptnum = cpt_error.deptnum(+)
     and
        total.accounting_period = cpt_error.accounting_period(+)
     and
        total.ledger = cpt_error.ledger(+)
     order by
        total.deptnum,
        total.accounting_period,
        total.ledger

For readers unfamiliar with Oracle-specific syntax, the several occurrences of (+) in the outer query's where clause indicate outer joins. In other words:

     select whatever
     from ta,
          tb
     where ta.id = tb.id (+)

is equivalent to:

     select whatever
     from ta
          left outer join tb
                       on tb.id = ta.id

The following SQL*Plus output shows the execution plan for the query:

     10:16:57 SQL> set autotrace traceonly
     10:17:02 SQL> /

     37 rows selected.

     Elapsed: 00:30:00.06

     Execution Plan
     ----------------------------------------------------------
        0      SELECT STATEMENT Optimizer=CHOOSE
                            (Cost=1779554 Card=154 Bytes=16170)
        1    0   MERGE JOIN (OUTER) (Cost=1779554 Card=154 Bytes=16170)
        2    1     MERGE JOIN (OUTER) (Cost=1185645 Card=154 Bytes=10780)
        3    2       VIEW (Cost=591736 Card=154 Bytes=5390)
        4    3         SORT (GROUP BY) (Cost=591736 Card=154 Bytes=3388)
        5    4           TABLE ACCESS (FULL) OF 'GLREPORT'
                            (Cost=582346 Card=4370894 Bytes=96159668)
        6    2       SORT (JOIN) (Cost=593910 Card=154 Bytes=5390)
        7    6         VIEW (Cost=593908 Card=154 Bytes=5390)
        8    7           SORT (GROUP BY) (Cost=593908 Card=154 Bytes=4004)
        9    8             TABLE ACCESS (FULL) OF 'GLREPORT'
                              (Cost=584519 Card=4370885 Bytes=113643010)
       10    1     SORT (JOIN) (Cost=593910 Card=154 Bytes=5390)
       11   10       VIEW (Cost=593908 Card=154 Bytes=5390)
       12   11         SORT (GROUP BY) (Cost=593908 Card=154 Bytes=5698)
       13   12           TABLE ACCESS (FULL) OF 'GLREPORT'
                            (Cost=584519 Card=4370885 Bytes=161722745)

     Statistics
     ----------------------------------------------------------
             193  recursive calls
               0  db block gets
         3803355  consistent gets
         3794172  physical reads
            1620  redo size
            2219  bytes sent via SQL*Net to client
             677  bytes received via SQL*Net from client
               4  SQL*Net roundtrips to/from client
              17  sorts (memory)
               0  sorts (disk)
              37  rows processed

I must confess that I didn't waste too much time on the execution plan, since its most striking feature was fairly apparent from the text of the query itself: it shows that the table glreport, a tiny 4 to 5 million-row table, is accessed three times, once per subquery, and each time through a full scan.

Nested queries are often useful when writing complex queries, especially when you mentally divide each step, and try to match a subquery to every step. But nested queries are not silver bullets, and the preceding example provides a striking illustration of how easily they may be abused.

The very first inline view in the query computes the number of accounts for each department, accounting period, and ledger, and represents a full table scan that we cannot avoid. We need to face realities; we have to fully scan the table, because we are including all rows when we check how many accounts we have. We need to scan the table once, but do we absolutely need to access it a second or third time?

If a full table scan is required, indexes on the table become irrelevant.

What matters is not only to have a very analytic view of processing, but also to be able to stand back and consider what we are doing in its entirety. The second inline view counts exactly the same things as the first one, except that there is a condition on the value of amount_diff. Instead of counting with a separate count( ), we can, at the same time as we compute the total count, add 1 if amount_diff is not 0, and 0 otherwise. This is very easy to write with the Oracle-specific decode(u, v, w, x) function, or with the more standard case when u = v then w else x end construct.
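To make the equivalence concrete, here is a minimal sketch using the columns of the glreport example; both expressions count, in a single pass, the rows for which amount_diff isn't zero:

     select count(account)                                    cnt,
            sum(decode(amount_diff, 0, 0, 1))                 err_cnt_oracle,
            sum(case when amount_diff = 0 then 0 else 1 end)  err_cnt_standard
     from glreport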

The third inline view filters the same rows as the second one; however, here we want to count distinct account numbers. This counting is a little trickier to merge into the first subquery; the idea is to replace the account numbers (which, by the way, are defined as varchar2[*] in the table) by a value which is totally unlikely to occur when amount_diff is 0; chr(1) (Oracle-speak to mean the character corresponding to the ASCII value 1) seems to be an excellent choice (I always feel a slight unease at using chr(0) with something written in C like Oracle, since C terminates all character strings with a chr(0)). We can then count how many distinct accounts we have and, of course, subtract one to avoid counting the dummy chr(1) account.

[*] To non-Oracle users, the varchar2 type is, for all practical purposes, the same as the varchar type.

So this is the suggestion that I returned to the developer:

     select deptnum,
            accounting_period,
            ledger,
            count(account) nb,
            sum(decode(amount_diff, 0, 0, 1)) err_cnt,
            count(distinct decode(amount_diff, 0, chr(1), account)) - 1
                                         bad_acct_count
     from glreport
     group by
            deptnum,
            ledger,
            accounting_period

My suggestion was reported to be four times as fast as the initial query, which came as no real surprise since the three full scans had been replaced by a single one.

Note that there is no longer any where clause in the query: we could say that the condition on amount_diff has "migrated" to both the logic performed by the decode( ) function inside the select list and the aggregation performed by the group by clause. The replacement of a filtering condition that looked specific with an aggregate demonstrates that we are here in another situation, namely a result set obtained on the basis of an aggregate function.
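For reference, the same single-pass query can be written with the portable case construct instead of the Oracle-specific decode( ); this is only a sketch and has not been run against the original data:

     select deptnum,
            accounting_period,
            ledger,
            count(account) nb,
            sum(case when amount_diff = 0 then 0 else 1 end) err_cnt,
            count(distinct case when amount_diff = 0
                                then chr(1)
                                else account
                           end) - 1 bad_acct_count
     from glreport
     group by
            deptnum,
            ledger,
            accounting_period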

In-line queries can simplify a query, but can result in excessive and duplicated processing if used without care.



