There are several different ways to retrieve any given row in the database. An index could be used to locate the row, or, if the table is small, it might be faster to perform a full table scan, reading each row sequentially. If there are multiple indexes, the optimizer chooses the best one to use. If there is a materialized view, the query may be rewritten to use it. The job of the optimizer is to choose a strategy to execute the query in the fastest possible time.
There are two approaches to query optimization:
The rule-based optimizer uses a set of rules to rank each of the possible different access paths available according to the speed of execution of each path. It chooses the access path with the best rank. The rule-based optimizer does not take into account the size of the table or the data distribution. Thus, it may always choose to use an index even though a full table scan may be more efficient.
The cost-based optimizer uses statistics, such as the cardinality of the table, number of distinct values of a column, and the data distribution, to determine the cost of an access path. It then picks the access path with the least cost to execute the query. The cost is a measure of how much I/O, CPU time, and memory will be required to execute the query. To use the cost-based optimizer effectively, statistics describing the cardinality and data distribution must be collected for each table, index, and materialized view. Chapter 7 will describe the use of the DBMS_STATS package to collect statistics.
Note that the rule-based optimizer is no longer being enhanced by Oracle. Many new features, such as partitioning, bitmap indexes, star query optimization, and query rewrite, are only available using the cost-based optimizer. Therefore, you should plan on using the cost-based optimizer for your data warehouse so you can take advantage of these features.
We will now look at various query optimizations used by the cost-based optimizer. Before we do so, let us take a brief look at the EXPLAIN PLAN facility used in several examples in this book.
Oracle provides several tools to display the execution plan. You can use the autotrace option of SQL*Plus to display the plan when you issue any query. Alternatively, to just get the query plan without executing the query, you can use EXPLAIN PLAN. The output of EXPLAIN PLAN is placed in a table known as PLAN_TABLE in your schema. You must create the PLAN_TABLE using the script $ORACLE_HOME/rdbms/admin/utlxplan.sql in your schema. You can also ask EXPLAIN PLAN to place the output in some other table, but it must have the same columns as the PLAN_TABLE. You can display the plan using scripts $ORACLE_ HOME/rdbms/admin/utlxpls.sql (for regular query plans) and utlxplp.sql (for parallel query plans).
Let us look at the output of an EXPLAIN PLAN statement. The output consists of the access path for each table, the cost of the access path, and the number of rows retrieved. The plan output is indented so that tables that are being joined are shown at the same level of indentation. Operations that are performed earlier are indented further.
EXPLAIN PLAN FOR SELECT t.month, t.year, p.product_id, SUM (purchase_price) as sum_of_sales, COUNT (purchase_price) as total_sales, COUNT(*) as cstar FROM time t, product p, purchases f WHERE t.time_key = f.time_key AND f.product_id = p.product_id GROUP BY t.month, t.year, p.product_id; @utlxpls; PLAN_TABLE_OUTPUT ------------------------------------------------------------------ Id|Operation | Name |Rows |Bytes|Cost|Pstart|Pstop ------------------------------------------------------------------ 0|SELECT STATEMENT | | 567|23814| 415| | | 1| SORT GROUP BY | | 567|23814| 415| | | 2| NESTED LOOPS | |70585|2895K| 43| | | *3| HASH JOIN | |70585|2412K| 43| | | 4| TABLE ACCESS FULL |TIME | 15| 225| 2| | | 5| PARTITION RANGE ALL| | | | | 1 | 7 | 6| TABLE ACCESS FULL |PURCHASES|94113|1838K| 40| 1 | 7 | *7| INDEX RANGE SCAN |PRODUCT_ | 1| 7| | | | | | PK_INDEX| | | | | | ------------------------------------------------------------------ Predicate Information (identified by operationid): --------------------------------------------------- 3 - access("T"."TIME_KEY"="F"."TIME_KEY") 7 - access("F"."PRODUCT_ID"="P"."PRODUCT_ID") Note: cpu costing is off 21 rows selected.
In this example, the innermost operation is a hash join between the tables, PURCHASES, and TIME (rows 3-6). The output of this is joined with the PRODUCT table using a nested loops join (rows 2-7). Note that an index, PRODUCT_PK_INDEX (row 7), is used to access the PRODUCT table. The final operation is a sort in order to perform the GROUP BY. In Oracle 9i Release 2, the utlxpls.sql script has been enhanced to show the predicates being applied during each operation. In the previous example, the predicates being applied are the join predicates. The PARTITION RANGE clause (row 5) indicates the first and last partitions being scanned. The keyword ALL means that no partition pruning was done, causing all seven partitions to be scanned. Therefore, pstart = 1 and pstop = 7.
Now let us look at various techniques that the optimizer might use to return the data for our queries.
One of the most common operations in a query is a join between two tables. A join operation combines data from two or more tables, based on a condition (known as the join predicate) involving columns from the tables. There are three basic join methods used by the query optimizer in Oracle:
Nested Loops join
In a sort-merge join, the data from each table is sorted by the values of the columns in the join condition. The sorted tables are then merged, so that each pair of rows with matching columns is joined. The SORT_AREA_ SIZE initialization parameter specifies the amount of memory available for sorting. Once this is exceeded, intermediate data is written to temporary segments in the TEMPORARY TABLESPACE on disk. If there is an index on the join columns of either table, the optimizer may use it to directly retrieve the data in sorted order, thereby avoiding the sort for that table.
In a nested loops join, one table is chosen as the outer table; the other is chosen as the inner table. For each row in the outer table, all matching rows that satisfy the join condition in the inner table are found. A nested loops join can be extremely efficient if the inner table has an index on the join column and there are few rows in the outer table.
In a hash join, the smaller table is loaded into memory and a hashing function applied to the join columns, to build a hash table. Then, the second table is scanned; its join columns are hashed and compared with the hash table in memory to look for matching rows. If memory is not sufficient to hold the smaller table, the tables will be broken into smaller partitions. Hash joins can only be used for equi-joins (i.e., joins of the type table1.column1 = table2.column2). The HASH_AREA_SIZE specifies the amount of memory available for building the hash table before needing to write to disk.
Hash joins are especially useful when there are no indexes on the columns being joined. Given the ad hoc nature of decision-support queries, it is not always possible for the DBA to index all the columns the users may use to join tables together. When using Oracle as the staging area for data transformations, data may be moved from one temporary table into another. It may be more efficient to use a hash join than to take the time to build indexes on the tables. To enable a hash join, the HASH_JOIN_ ENABLED initialization parameter must be set to TRUE. If you use the Database Configuration Assistant tool to create a database for data warehousing, this parameter will automatically be enabled for you.
We will now look at some join optimizations that are of great value in a data warehouse.
When the tables being joined are partitioned, the optimizer may choose to perform a partition-wise join. Rather than performing a large join between the tables, the join operation is broken up into a series of smaller joins between the partitions or subpartitions. Note that a partition-wise join can use any of the methods discussed previously to perform the join: sortmerge, hash, or nested loops joins.
A full partition-wise join can be done when the tables being joined are equi-partitioned on the join key in the query. Equi-partitioning means that there is a correspondence between the partitions (or subpartitions) of one table with the partitions (or subpartitions) of the other. So every partition of the first table must only be joined to its corresponding partition in the other table. When executing the query in parallel, each join can be executed on a separate processor.
If only one of the tables is partitioned on the join key, a partial partitionwise join can be done when executing the query in parallel. The table that is not partitioned by the join key is dynamically partitioned according to the partitioning criteria of the partitioned table, and each query server joins a pair of partitions from the two tables as in a full partition-wise join.
The decision to use partition-wise join or any of its variants is taken by the optimizer based on the cost of the execution plan.
Note that if the query only requires some of the partitions of any table, then only those partitions will participate in the partition-wise join.
A star query is a typical query executed on a star schema. Each of the dimension tables is joined to the fact table using the primary-key/foreign-key relationship. If the fact table has bitmap indexes on the foreign-key columns, Oracle can optimize the performance of such star queries using an algorithm known as the star transformation. Star transformation is based on combining bitmap indexes on the fact table columns that correspond to the dimension tables in the query. First, bitmap indexes are used to retrieve the necessary rows from the fact table. The result is then joined to the dimension tables to get any columns required in the final answer. A bitmap join index may also be used instead of a combination of bitmap indexes. Star transformation is also possible if only some of the foreign-key columns have bitmap indexes.
If your data warehouse has a star schema, you should enable star transformation. This is done by setting the initialization parameter, STAR_ TRANSFORMATION_ENABLED, to TRUE. Alternatively, you can use the STAR_TRANSFORMATION hint on queries that join the fact and dimension tables. Note that star transformation is only available with the cost-based optimizer. The optimizer will weigh the execution plans, with and without star transformation, and pick the most efficient plan.
Suppose there were bitmap indexes on each of the foreign-key columns of the purchases table (i.e., purchases.customer_id, purchases.time_key, and purchases.product_id). The following example shows a query that can benefit from star transformation. This query joins the fact (purchases) table and dimension tables (product, time, customer) in a star schema and has selections on each of the dimension table as (month = 1 and county = 'Hants' and category = 'HDRW') representing a small fraction of the entire data.
SELECT c.town, t.quarter, p.product_name, SUM(f.purchase_price) sales FROM purchases f, time t, customer c, product p WHERE f.time_key = t.time_key and f.customer_id = c.customer_id and f.product_id = p.product_id and t.month = 1 and c.county = 'Hants' and p.category = 'HDRW' GROUP BY c.town, t.quarter, p.product_name;
The execution plan is shown below.
------------------------------------------------------------------ |ID | Operation | Name |Rows| Bytes|Cost| ------------------------------------------------------------------ | 0| SELECT STATEMENT | | 54| 3834| 715| | 1| SORT GROUP BY | | 54| 3834| 715| |* 2| HASH JOIN | | 328| 23288| 712| |* 3| TABLE ACCESS FULL |PRODUCT | 54| 1188| 2| |* 4| HASH JOIN | | 984| 48216| 709| | 5| MERGE JOIN CARTESIAN | | 3| 78| 4| |* 6| TABLE ACCESS FULL |CUSTOMER | 1| 12| 2| | 7| BUFFER SORT | | 1| 14| 2| |* 8| TABLE ACCESS FULL |TIME | 1| 14| 2| | 9| TABLE ACCESS BY INDEX ROWID |PURCHASES|7030| 157K| 703| | 10| BITMAP CONVERSION TO ROWIDS| | | | | | 11| BITMAP AND | | | | | | 12| BITMAP MERGE | | | | | | 13| BITMAP KEY ITERATION | | | | | |*14| TABLE ACCESS FULL |CUSTOMER | 1| 12| 2| |*15| BITMAP INDEX RANGE SCAN|CUST_IDX | | | | | 16| BITMAP MERGE | | | | | | 17| BITMAP KEY ITERATION | | | | | |*18| TABLE ACCESS FULL |TIME | 2| 28| 2| |*19| BITMAP INDEX RANGE SCAN|TIME_IDX | | | | | 20| BITMAP MERGE | | | | | | 21| BITMAP KEY ITERATION | | | | | |*22| TABLE ACCESS FULL |PRODUCT | 54| 1188| 2| |*23| BITMAP INDEX RANGE SCAN|PROD_IDX | | | | ------------------------------------------------------------------
Let us try to understand how star transformation works. Notice that the rows corresponding to category = 'HDRW' can be retrieved using the subquery (this is also known as a semijoin):
SELECT * FROM purchases WHERE product_id IN (SELECT product_id FROM product WHERE category = 'HDRW');
This query can be executed using the prod_idx bitmap index, by retrieving and merging the bitmaps corresponding to all product_ids for the HDRW category. Similar subqueries are generated for each of the other dimension tables and, finally, the resulting bitmaps are combined using a bitmap AND operation into a single bitmap. This is shown in the execution plan in rows 10 through 23. This bitmap is used to retrieve the relevant rows from the purchases tables. Finally, a join is done to the dimension tables (shown in plan rows 0 through 9) to obtain the other column values (town, quarter, and product_name). This is efficient because only a small subset of the purchases table is now involved in the join.
Creating a single column bitmap index for each foreign-key column in the fact table will allow the optimizer to consider star transformation for your star queries.