8.4 Rule-Based versus Cost-Based Optimization




Rule-based optimization uses a set of predetermined rules to find the fastest access path to data in the database. Cost-based optimization uses statistical measurements of the data in the database and calculates the fastest access path. Cost-based optimization uses a realistic picture of data and thus is potentially much more efficient than rule-based optimization, assuming statistics are maintained adequately.

Note 

 Oracle Database 10g: Rule-based optimization is desupported.

8.4.1 Setting the Optimization Mode

Optimization can be set at a database-wide level in the Oracle Database configuration parameter called OPTIMIZER_MODE. The default value is CHOOSE, which allows cost-based optimization if statistics exist; otherwise rule-based optimization is used.

The OPTIMIZER_MODE parameter can be changed online using the ALTER SESSION command, affecting the current session only. If you wish to use rule-based optimization for a specific SQL statement, the RULE hint is perhaps the better option. All OPTIMIZER_MODE settings are covered by equivalent hints, and a hint can be placed in an individual SQL statement. Hints are covered later in this chapter. OPTIMIZER_MODE settings are as follows:

Tip 

Setting OPTIMIZER_MODE = CHOOSE is by far the best all-round option. Oracle Database 10g: ALL_ROWS is the default value for the OPTIMIZER_MODE parameter in Oracle Database 10g.

  • OLTP Small Transactional Databases.

    • FIRST_ROWS_n. Cost-based optimizing even without statistics for fastest performance of the first 1, 10, 100, or 1,000 rows retrieved. In other words, full index and table scans could suffer immensely but short transactions retrieving small numbers of rows should perform better.

    • FIRST_ROWS. Fast delivery of the first few rows using a mixed rule and cost-based approach. FIRST_ROWS_n is more precise if your database needs this level of precision.

    Tip 

    FIRST_ROWS_1000 might always be the best option of these. Any full index or table scanning over 1,000 rows could suffer using either of these options. There are hints applying behavior similar to FIRST_ROWS settings.

  • CHOOSE. Rule-based is selected when no statistics are present and cost-based when some statistics are available, regardless of the reliability of those statistics. CHOOSE is optimized for full index and table scans but is the most common choice for all types of databases.

    Note 

     Oracle Database 10g: CHOOSE is desupported.

  • Data Warehouse or Reporting Databases.

    • CHOOSE. Rule-based is selected when no statistics are present and cost-based when some statistics are available, regardless of the reliability of those statistics. CHOOSE is optimized for full index and table scans.

    • ALL_ROWS. The opposite of FIRST_ROWS optimization using a cost-based approach even without statistics. Optimized for full index and table scans.

    Note 

     Oracle Database 10g: ALL_ROWS is the default value for the OPTIMIZER_MODE parameter.

  • Preferably Never.

    • RULE. Rule-based optimization ignores any existing statistics, presenting an unrealistic picture of the data. Rule-based optimization can be used for backward compatibility with outlines tuned under previous versions of Oracle Database. Otherwise avoid it!

    Note 

     Oracle Database 10g: CHOOSE and RULE are desupported since rule-based optimization is desupported. The functionality still exists but will be removed in a future release of Oracle Database.
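For instance, an Oracle 9i session-level change or a statement-level hint can be sketched as follows; the query uses the GeneralLedger table from this book's Accounts schema:

```sql
-- Change the optimizer goal for the current session only.
ALTER SESSION SET OPTIMIZER_MODE = FIRST_ROWS_10;

-- Equivalent statement-level control using a hint:
SELECT /*+ FIRST_ROWS(10) */ generalledger_id
FROM generalledger
WHERE coa# = '30001';
```

The session setting reverts when the session ends; the hint affects only the single statement containing it.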

8.4.2 What is Rule-Based Optimization?

Rule-based optimization uses a set of rules that make a best guess at how SQL code can be optimally executed. Rule-based optimization will eventually be removed from Oracle Database. Many advanced SQL code tuning features and useful bells and whistles introduced in recent Oracle Database versions are only available with cost-based optimization. Any SQL statement can be persuaded to use a rule-based approach by including the RULE hint, regardless of the OPTIMIZER_MODE setting in the configuration parameter file.

Note 

 Oracle Database 10g: Rule-based optimization and the RULE hint are desupported. Applications currently using rule-based optimization should be updated.

Outlines

Outlines can be used under a rule-based optimization approach to store previously tuned execution plans for SQL code. Under cost-based optimization an execution plan is the sequence of steps written to the PLAN_TABLE when the EXPLAIN PLAN command is used; we have seen extensive use of the EXPLAIN PLAN command in recent chapters. Outlines are stored in a specific schema, in tables containing what are effectively previously tuned hint additions. These hint additions override SQL code statements to tune their performance during execution. Like rule-based optimization, outlines will probably be removed in a future version of Oracle Database, possibly to be replaced by the use of hints in SQL code.
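As a hedged sketch of the mechanics, a stored outline can be created for a statement and then applied by category; the outline name OL_GL_BY_COA and the category LEDGER below are illustrative, not from the book:

```sql
-- Store the execution plan of a statement as an outline in a
-- hypothetical category called LEDGER.
CREATE OR REPLACE OUTLINE ol_gl_by_coa
   FOR CATEGORY ledger
   ON SELECT generalledger_id FROM generalledger
      WHERE coa# = '30001';

-- Ask the session to apply stored outlines from that category.
ALTER SESSION SET USE_STORED_OUTLINES = ledger;
```

Creating outlines requires the CREATE ANY OUTLINE privilege.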

Hints and Rule-Based Optimization

Hints can be used in a rule-based optimized database to produce results similar to those of cost-based optimization. Perhaps you are unable to generate accurate statistics regularly, or statistics rapidly become out of date. Statistics can quickly go stale in large, active databases, and can be expensive in resources to maintain constantly.

Tip 

If statistics cannot be regularly maintained, improve SQL code performance using hints.

The OPTIMIZER_MODE in my database is set to CHOOSE and I have statistics. Thus in order to show hints functioning I can temporarily use rule-based optimization or remove statistics, or both. I do both.

ALTER SESSION SET OPTIMIZER_MODE=RULE;

ANALYZE TABLE generalledger DELETE STATISTICS;
ANALYZE INDEX xpk_generalledger DELETE STATISTICS;

This first query plan uses a full table scan under rule-based optimization; note that no cost, rows, or bytes values are shown.

EXPLAIN PLAN SET statement_id='TEST' FOR
   SELECT gl.generalledger_id FROM generalledger gl;

Query                                    Cost     Rows     Bytes
--------------------------------------   ----   ------   -------
1. SELECT STATEMENT on
2.  TABLE ACCESS FULL on GENERALLEDGER

The INDEX_FFS hint is used to override the rule-based Optimizer, changing the full table access on the GeneralLedger table to an index fast full scan on its primary key index.

EXPLAIN PLAN SET statement_id='TEST' FOR
   SELECT /*+ INDEX_FFS(gl XPK_GENERALLEDGER) */
      gl.generalledger_id
   FROM generalledger gl;

Query                                 Cost     Rows     Bytes
-----------------------------------   ----   ------   -------
1. SELECT STATEMENT on                   8     2000     26000
2.  INDEX FAST FULL SCAN on
      XPK_GENERALLEDGER                  8     2000     26000

I wanted to show that under rule-based optimization hints can be used to override a query plan chosen by the rule-based Optimizer, regardless of the presence of statistics. Let's do some cleanup.

ALTER SESSION SET OPTIMIZER_MODE=CHOOSE;

ANALYZE TABLE generalledger COMPUTE STATISTICS;
ANALYZE INDEX xpk_generalledger COMPUTE STATISTICS;

The problem with using hints to override the rule-based Optimizer is this: if the relative sizes of objects in a database change over time, any SQL code tuned with hints and executed under the rule-based Optimizer could eventually perform very poorly.

Note 

 Oracle Database 10g: Rule-based optimization is desupported.

8.4.3 What is Cost-Based Optimization?

As already stated in the introduction to this section, cost-based optimization uses statistical measurements of the data as it is in the database. These statistics are used to calculate the fastest access path to data. Cost-based optimization uses a realistic picture of data and thus is potentially much more efficient than rule-based optimization. Statistics must exist on at least one object accessed in an SQL code statement for cost-based optimization to function. Keeping statistics up to date is absolutely critical.

There are numerous Oracle Database configuration parameters which can be used to influence the way the Optimizer behaves with the cost-based approach.

Configuration Parameters and Cost-Based Optimization

Some configuration parameters affect cost-based optimization. Most of these parameters should not normally be altered, and then only as a last resort. If you have a small database and are changing these parameters frequently, you may be avoiding essential SQL code tuning. SQL code tuning is a lot easier than altering configuration parameters, and altering configuration parameters can sometimes have very unexpected results. These parameters are only listed in this chapter; more explicit explanations are included in Part III.

  • DB_FILE_MULTIBLOCK_READ_COUNT.

  • CURSOR_SHARING.

  • SORT_AREA_SIZE.

  • QUERY_REWRITE_ENABLED.

  • HASH_JOIN_ENABLED.

  • HASH_AREA_SIZE.

  • BITMAP_MERGE_AREA_SIZE.

  • OPTIMIZER_INDEX_CACHING.

  • OPTIMIZER_INDEX_COST_ADJ.

  • OPTIMIZER_MAX_PERMUTATIONS.

  • OPTIMIZER_DYNAMIC_SAMPLING.

    Note 

     Oracle Database 10g: Some parameters may have changed or been removed. Further discussion will be in Part III.

The Importance of Statistics and Realistic Statistics

The SQL code Optimizer utilizes statistics to compile the most efficient methods of executing SQL statements. Statistics are measurements of the data itself, such as how large a table is and how useful an index is. When an SQL statement accesses a table, the states of both the table and its indexes are important, and those states are captured in statistics. If statistics are out of date the Optimizer is not functioning realistically, and out-of-date statistics have the same detrimental effect on all types of databases. It is very important to duplicate statistics from production to tuning environments, either by copying them onto the tuning database or by gathering statistics there in a manner consistent with the production database.

Making a copy of a production database to a tuning database is not an issue when the production database is small. When the production database is large, continuous copying to a tuning database can be very time consuming. Be aware that using the database import utility, even for single-schema imports on a small database, can take a lot longer than the production database export. The DBMS_STATS package can also be used to copy only the statistics between databases, allowing for more realistic query plan testing.
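A minimal sketch of copying statistics with DBMS_STATS follows; the staging table STATS_TAB and the ACCOUNTS schema name are illustrative:

```sql
-- On the production database: create a staging table and export
-- the schema statistics into it.
BEGIN
   DBMS_STATS.CREATE_STAT_TABLE('ACCOUNTS', 'STATS_TAB');
   DBMS_STATS.EXPORT_SCHEMA_STATS('ACCOUNTS', 'STATS_TAB');
END;
/

-- Transport STATS_TAB to the tuning database (export/import or a
-- database link), then load the statistics into its dictionary.
BEGIN
   DBMS_STATS.IMPORT_SCHEMA_STATS('ACCOUNTS', 'STATS_TAB');
END;
/
```

This way query plans on the tuning database are costed as if it held production data volumes.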

Dynamic Sampling

For statistics to be realistic and effective they must be frequently generated. In an OLTP database data is changing all the time.

Statistics can rapidly become stale. Cost-based optimization against stale statistics can sometimes cause as big a performance problem as not having those statistics at all, depending of course on how out of date the statistics are. If statistics are out of date or not present, dynamic sampling may be used. Dynamic sampling reads a small number of blocks in a table to make a best guess at statistics.

Tip 

The configuration parameter controlling dynamic sampling is OPTIMIZER_DYNAMIC_SAMPLING, which is on by default (set to 1).

Note 

 Oracle Database 10g: OPTIMIZER_DYNAMIC_SAMPLING defaults to 2.

One of the most significant issues with respect to statistics is maintaining consistency of statistics between production, testing, and development databases. Most SQL code tuning is performed in development databases, but it is preferably done in a test or dedicated tuning database. At least it should be! Test databases are more easily copied from production databases, or parts of them, and are often copied to get a realistic snapshot of a live production database. Development databases are often very messy and convoluted. Tuning on a production database, particularly tuning involving SQL code changes, is extremely risky for the stability and current performance level of a 24 × 7 × 365 OLTP Oracle installation.

The OPTIMIZER_DYNAMIC_SAMPLING parameter sets the level of dynamic sampling of data, substituting for cost-based statistics, on a scale of 0 to 10. At 0 dynamic sampling is disabled; at 10 the entire table is read. Settings in between change the number of blocks read for the sample.
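Dynamic sampling can be raised for a session, or requested per statement with a hint; the sampling levels chosen below are illustrative:

```sql
-- Raise the dynamic sampling level for the current session.
ALTER SESSION SET OPTIMIZER_DYNAMIC_SAMPLING = 2;

-- Or request sampling for a single table in one statement.
SELECT /*+ DYNAMIC_SAMPLING(gl 4) */ COUNT(*)
FROM generalledger gl;
```

The hint form is useful when only one or two unanalyzed tables need sampling and the rest of the database has good statistics.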

Generating Statistics

Statistics can be generated using the ANALYZE command or the DBMS_STATS package. The ANALYZE command is potentially subject to deprecation in a future version of Oracle Database. In fact Oracle Corporation discourages use of the ANALYZE command for statistics generation.

What to Generate Statistics For

Statistics can be generated for tables and indexes plus general system statistics. System-level statistics will be covered in Part III. Schema-level statistics are generated for tables, indexes, and clusters for each schema. Schema statistics are accessible through the views USER_TABLES, USER_INDEXES, and USER_TAB_COLUMNS.

Tables
  • NUM_ROWS.   Rows in table.

  • BLOCKS.   Blocks used by a table.

  • EMPTY_BLOCKS.   Blocks not used between highest appended block and high water mark. Reclaim these blocks by resizing the datafile.

  • AVG_SPACE.   Average empty space for all blocks in the table. Depends on PCTFREE (space reserved for updates) and PCTUSED (space freed by deletions).

  • CHAIN_CNT.   Rows spread across multiple blocks (row chaining), or rows whose data has been moved (migrated) while retaining the ROWID pointer in the original block.

  • AVG_ROW_LEN.   Average byte length of a row.

  • SAMPLE_SIZE.   Sample size of most recent statistics generation.

  • LAST_ANALYZED.   Date of last statistics generation.

Indexes
  • NUM_ROWS.   Rows in index.

  • LEAF_BLOCKS.   Total leaf blocks containing actual ROWID and indexed column values.

  • BLEVEL.   Depth of a BTree index from root to leaf blocks, usually 0 for small static tables and ranging from 1 to 2 (3 levels) for very large tables.

  • DISTINCT_KEYS.   Unique values for a key.

  • AVG_LEAF_BLOCKS_PER_KEY.   Average number of leaf blocks containing each distinct key value; in other words, how many leaf blocks each key value occupies.

  • AVG_DATA_BLOCKS_PER_KEY.   Average number of table data blocks pointed to by each distinct key value; in other words, how many table blocks hold the rows for each unique index value.

  • CLUSTERING_FACTOR.   A measure of how randomly index entries map to table blocks. A reverse key index will be more random.

  • SAMPLE_SIZE.   Sample size of most recent statistics generation.

  • LAST_ANALYZED.   Date of last statistics generation.

Columns
  • NUM_DISTINCT.   Unique column values.

  • LOW_VALUE.   Lowest column value.

  • HIGH_VALUE.   Highest column value.

  • DENSITY.   Column density.

  • NUM_NULLS.   Null values in column.

  • NUM_BUCKETS.   Histogram buckets.

  • SAMPLE_SIZE.   Sample size of most recent statistics generation.

  • LAST_ANALYZED.   Date of last statistics generation.
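Assuming statistics have been gathered, many of the values listed above can be read straight from the dictionary views, for example:

```sql
-- Table statistics for the GeneralLedger table.
SELECT num_rows, blocks, avg_row_len, last_analyzed
FROM user_tables WHERE table_name = 'GENERALLEDGER';

-- Index statistics for its indexes.
SELECT index_name, blevel, leaf_blocks, distinct_keys, clustering_factor
FROM user_indexes WHERE table_name = 'GENERALLEDGER';

-- Column statistics.
SELECT column_name, num_distinct, num_nulls, num_buckets
FROM user_tab_columns WHERE table_name = 'GENERALLEDGER';
```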

The ANALYZE Command

Let's look at syntax first. The parts of the following syntax diagram dealing with Optimizer statistics are the interesting ones.

ANALYZE { TABLE | INDEX | CLUSTER }
      [ PARTITION ( partition ) | SUBPARTITION ( subpartition ) ]
   COMPUTE [ SYSTEM ] STATISTICS [ FOR ... object specifics ... ]
   ESTIMATE [ SYSTEM ] STATISTICS [ FOR ... object specifics ... ]
      [ SAMPLE n { ROWS | PERCENT } ]
   [ ... referential integrity | structural ... validation ]
   LIST CHAINED ROWS INTO table
   DELETE [ SYSTEM ] STATISTICS;
  • COMPUTE.   Calculates statistics for all rows and columns in an object (table, index, or cluster). Precise results are produced.

  • ESTIMATE.   Calculates statistics on a sample number or percentage of rows, which defaults to 1,064 rows.

  • DELETE.   Clears statistics.

  • SYSTEM.   Collects only system statistics, not user schema statistics.

COMPUTE can be time consuming and should be executed in times of low activity. Executing full statistics computations during highly active concurrent cycles could cause performance problems, and perhaps even an unpopular response from end users. ESTIMATE is far less intrusive but less accurate, and is also preferably executed at times of low activity.
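A sampled estimate on the GeneralLedger table might be sketched like this (the 10% sample size is illustrative):

```sql
-- Estimate statistics from a 10% row sample rather than
-- computing them from every row in the table.
ANALYZE TABLE generalledger ESTIMATE STATISTICS SAMPLE 10 PERCENT;

-- Clear the statistics again if required.
ANALYZE TABLE generalledger DELETE STATISTICS;
```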

A statistics generating script is included in Appendix B.

The DBMS_STATS Package

The DBMS_STATS package is more versatile and potentially faster than the ANALYZE command for generating statistics, though it is a little more complex to use. Some non-Optimizer statistics can only be gathered with the ANALYZE command; however, DBMS_STATS offers parallel execution and Oracle Partitioning benefits, plus performance tuning capabilities that the ANALYZE command does not have.

The DBMS_STATS package can even be used to copy statistics between databases thus allowing accurate simulated tuning between testing and production databases.
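A minimal DBMS_STATS sketch for the GeneralLedger table follows; the ACCOUNTS schema name, sample percentage, and parallel degree are illustrative choices:

```sql
BEGIN
   DBMS_STATS.GATHER_TABLE_STATS(
      ownname          => 'ACCOUNTS',      -- illustrative schema name
      tabname          => 'GENERALLEDGER',
      estimate_percent => 10,              -- sample 10% of rows
      cascade          => TRUE,            -- gather index statistics too
      degree           => 4);              -- parallel degree
END;
/
```

The cascade and degree parameters are two of the capabilities the ANALYZE command lacks.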

Generating Statistics Automatically

Some Optimizer statistics generation can be performed automatically using the MONITORING clause of the CREATE TABLE or ALTER TABLE commands and otherwise using the DBMS_STATS package.

The MONITORING clause may be a performance risk for very busy databases. Automated statistics gathering monitors table DML activity and automatically invokes statistics generation when a small percentage of a table's rows have changed. This automatic generation can cause performance problems for highly active concurrent databases, so automating the generation of statistics is not recommended. Instead, schedule statistics generation using scheduling software or the DBMS_JOB package at times of known low activity.
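With those caveats in mind, a hedged sketch of the monitored approach looks like this; the ACCOUNTS schema name is illustrative:

```sql
-- Enable DML monitoring on a table.
ALTER TABLE generalledger MONITORING;

-- Later, from a scheduled job, regather statistics only for tables
-- whose monitored DML volume marks them as stale.
BEGIN
   DBMS_STATS.GATHER_SCHEMA_STATS(
      ownname => 'ACCOUNTS',
      options => 'GATHER STALE');
END;
/
```

Running the GATHER STALE step from a scheduled off-peak job keeps the regathering cost out of busy periods.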

Timed Statistics

Timed statistics gathering is controlled by the TIMED_STATISTICS and TIMED_OS_STATISTICS configuration parameters. Setting these parameters has a negligible effect on performance; they will be covered in Part III.

Histograms

A histogram is a mechanism used to provide an accurate mathematical picture to the Optimizer of the distribution of values across a range of values. A traditional mathematical statistics histogram looks something like that shown in Figure 8.1.

Figure 8.1: An Example Histogram

Oracle Database uses histograms to allow faster access to skewed distributions. The histogram in Figure 8.1 shows a skewed or unbalanced distribution because most of the values are to the right of the diagram. A balanced distribution would have all the bars at exactly the same height, in which case we would not need a histogram. It follows that histograms are completely pointless on unique indexes, since every bar would be the same height and would contain a single value. Additionally, the Optimizer ignores histograms when bind variables are used in SQL statements. Histograms are therefore sometimes used where SQL code cannot or should not be altered to use bind variables, which would lower shared pool SQL code reparsing.

The Accounts schema GeneralLedger table has a non-unique foreign key index on the COA# column. Looking at Figure 7.3 in the previous chapter we can see that the GeneralLedger.COA# foreign key index is potentially skewed, or at least heavier in certain parts. Figure 7.3 from the previous chapter is included in this chapter for convenience as Figure 8.2.

Figure 8.2: A Skewed BTree Index

Following is the count of the rows in the GeneralLedger table. Notice how COA# = '30001' has almost 400,000 rows while COA# = '60001' has only about 75,000. As a result of these varying row counts it is likely that many more leaf index blocks are referenced by the branch containing '30001' than by the branch containing '60001'.

This is a skewed or unbalanced distribution, leading to a potentially skewed BTree index.

Tip 

These counts are somewhat different from those in the previous chapter since the database is highly active and has grown.

SELECT coa#, COUNT(coa#) FROM generalledger GROUP BY coa#;

COA#     COUNT(COA#)
------   -----------
30001         383357
40003         151107
41000         252552
50001         206357
50028              1
60001          75555

Let's compare the costs of three queries reading rows of three different COA# column values. Before doing this I will refresh statistics for the GeneralLedger table, making sure we have no histograms to begin with.

ANALYZE TABLE generalledger DELETE STATISTICS;
ANALYZE INDEX xfk_coa# DELETE STATISTICS;
ANALYZE INDEX xpk_generalledger DELETE STATISTICS;

ANALYZE TABLE generalledger COMPUTE STATISTICS;
ANALYZE INDEX xfk_coa# COMPUTE STATISTICS;
ANALYZE INDEX xpk_generalledger COMPUTE STATISTICS;

Now let's compare the costs. Note that all cost, row, and byte values are identical.

EXPLAIN PLAN SET statement_id='TEST' FOR
   SELECT coa# FROM generalledger WHERE coa# = '30001';

Query                                 Cost     Rows     Bytes
-----------------------------------   ----   ------   -------
1. SELECT STATEMENT on                 425   178155    890775
2. INDEX RANGE SCAN on XFK_COA#        425   178155    890775

EXPLAIN PLAN SET statement_id='TEST' FOR
   SELECT coa# FROM generalledger WHERE coa# = '50028';

Query                                 Cost     Rows     Bytes
-----------------------------------   ----   ------   -------
1. SELECT STATEMENT on                 425   178155    890775
2. INDEX RANGE SCAN on XFK_COA#        425   178155    890775

EXPLAIN PLAN SET statement_id='TEST' FOR
   SELECT coa# FROM generalledger WHERE coa# = '60001';

Query                                 Cost     Rows     Bytes
-----------------------------------   ----   ------   -------
1. SELECT STATEMENT on                 425   178155    890775
2. INDEX RANGE SCAN on XFK_COA#        425   178155    890775

Let's create a histogram on the GeneralLedger.COA# column. Histograms can be generated using the ANALYZE command or the DBMS_STATS package. I will stick with the ANALYZE command for now.

ANALYZE TABLE generalledger COMPUTE STATISTICS FOR COLUMNS coa#;
Tip 

Histograms require regeneration just like statistics do, and can become stale and useless.
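As a sketch of the DBMS_STATS alternative mentioned above (the schema name and bucket count are illustrative), the same histogram could be gathered with:

```sql
BEGIN
   DBMS_STATS.GATHER_TABLE_STATS(
      ownname    => 'ACCOUNTS',                    -- illustrative schema
      tabname    => 'GENERALLEDGER',
      method_opt => 'FOR COLUMNS SIZE 254 COA#');  -- up to 254 buckets
END;
/
```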

Created histograms can be examined using the USER_HISTOGRAMS view. Let's look at the query plans for each of the statements separately. Here we can see a dramatic difference in cost, rows, and bytes values as a result of creating a histogram. These query plan figures are now consistent, relative to each other, with the ratios of the numbers of rows for each COA# column value.

EXPLAIN PLAN SET statement_id='TEST' FOR
   SELECT coa# FROM generalledger WHERE coa# = '30001';

Query                                 Cost     Rows     Bytes
-----------------------------------   ----   ------   -------
1. SELECT STATEMENT on                 609   383357   1916785
2. INDEX FAST FULL SCAN on XFK_COA#    609   383357   1916785

EXPLAIN PLAN SET statement_id='TEST' FOR
   SELECT coa# FROM generalledger WHERE coa# = '50028';

Query                                 Cost     Rows     Bytes
-----------------------------------   ----   ------   -------
1. SELECT STATEMENT on                   3        1         5
2. INDEX RANGE SCAN on XFK_COA#          3        1         5

EXPLAIN PLAN SET statement_id='TEST' FOR
   SELECT coa# FROM generalledger WHERE coa# = '60001';

Query                                 Cost     Rows     Bytes
-----------------------------------   ----   ------   -------
1. SELECT STATEMENT on                 182    75555    377775
2. INDEX RANGE SCAN on XFK_COA#        182    75555    377775

Histograms can be very useful.
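As a quick sketch, the buckets behind this behavior can be examined through the USER_HISTOGRAMS view mentioned earlier:

```sql
-- List histogram buckets for the GeneralLedger table's columns.
SELECT column_name, endpoint_number, endpoint_value
FROM user_histograms
WHERE table_name = 'GENERALLEDGER'
ORDER BY column_name, endpoint_number;
```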






Oracle High Performance Tuning for 9i and 10g
ISBN: 1555583059
Year: 2003
Pages: 164
