Data partitioning | Oracle9iR2 Data Warehousing

3.3 Data partitioning

3.3.1 Benefits of partitioning

Whenever any task seems daunting, breaking it up into smaller tasks often makes it easier to accomplish. Imagine packing your belongings, and getting ready to move. Dividing it up room by room makes it easier. If each member of the family packs up a room at the same time, you can get all your belongings packed faster. This is the idea behind partitioning and parallelism.

Very large tables and indexes can be divided into smaller, more manageable partitions. Partitioning the data makes it possible for data management operations to be performed at the partition level. Many operations, such as loading data, building local indexes, enforcing constraints, gathering optimizer statistics, purging data, and backup and recovery, can be done at the granularity of a partition.

Partitioning of data also allows queries on the tables to take advantage of the parallel processing features of Oracle. In addition, DML operations on the table can be done in parallel, providing significant improvements in performance.

If a table is partitioned, the query optimizer can determine if a certain query can be answered by reading only specific partitions. Thus, an expensive table scan can be avoided. This feature is known as partition elimination or dynamic partition pruning. If the warehouse was partitioned by month and a query asked for sales from December 1998, the optimizer would know which partition the data was stored in, and would eliminate partitions containing other months from its search.

3.3.2 How to partition data

A table can be partitioned using any column of the table. This column is known as the partition key. It is important to partition the data by a column that does not change, such as the purchase date. If partitioning is done by product_id and the business changed the encoding scheme for its products, then updating the partitioning column would require moving data to a different partition, which can be a time-consuming operation.

In data warehouses, it is common to partition by time. This allows us to perform "rolling window operations." If our warehouse contained one year's worth of data, at the end of April 2002 we could add a new partition with that month's data and delete the data for April 2001. In Chapter 5 we will look at loading data into a new partition.

Partitioning improves the availability of the data warehouse. By placing each partition on its own disk, if one disk fails and is no longer accessible, only the data in that partition is unavailable-not the entire table. Maintenance operations can be performed on one partition, while users continue to access data from the other partitions in the table.

For maximum performance, each partition should be stored in a separate tablespace and each tablespace should be stored on one or more separate storage devices, as appropriate. On larger systems, the I/O should also be spread across different controllers and channels. By spreading the data across several physical devices, you balance the I/O and improve query performance, availability, and manageability.

There are several ways to partition your data, including the following:

By ranges of data values (range partitioning)
By a hash function (hash partitioning)
Using discrete values (list partitioning)
Using a combination of methods (composite partitioning)

Range partitioning

In range partitioning, data is partitioned into non-overlapping ranges of values. As each row gets inserted into the table, it is placed into the appropriate partition based on the value of the partition-key column. Range partitioning is especially useful when the partition key is time. It allows the optimizer to perform partition pruning for queries asking for a specific value or a range of partition-key values.

In the EASYDW schema, each month's purchases are stored in their own partition. The purchases fact table is partitioned by range, using the time_ key as the partition key. Each partition is stored in its own tablespace, spread across multiple disks, as illustrated in Figure 3.2.

click to expand
Figure 3.2: Range partitioning.

The SQL that would create this partitioned table is shown in the following code segment. The VALUES LESS THAN clause specifies the upper bound on the partition-key values in that partition. The lower bound is specified by the VALUES LESS THAN clause of the previous partition, if any. In our example, the purchases_feb2002 partition has data values with time_key ≥01-02-2002 and <01-03-2002.

 CREATE TABLE easydw.purchases (product_id varchar2(8),  time_key date,  customer_id varchar2(10),  purchase_date date,  purchase_time number(4,0),  purchase_price number(6,2),  shipping_charge number(5,2),  today_special_offer varchar2(1)) PARTITION by RANGE (time_key)  (partition purchases_jan2002     values less than (TO_DATE('01-02-2002', 'DD-MM-YYYY'))     pctfree 0 pctused 99     storage (initial 64k next 64k pctincrease 0)     tablespace purchases_jan2002,   partition purchases_feb2002     values less than (TO_DATE('01-03-2002', 'DD-MM-YYYY'))     pctfree 0 pctused 99     storage (initial 64k next 64k pctincrease 0)     tablespace purchases_feb2002,   partition purchases_mar2002     values less than (TO_DATE('01-04-2002', 'DD-MM-YYYY'))     pctfree 0 pctused 99     storage (initial 64k next 64k pctincrease 0)     tablespace purchases_mar2002,   partition purchase_catchall     values less than (MAXVALUE)     pctfree 0 pctused 99     storage (initial 64k next 64k pctincrease     0)     tablespace purchases_maxvalue);

Note that the last partition in the purchases table has a special bound called MAXVALUE. This is an optional catchall partition, which collects all rows that do not correspond to any defined partition ranges.

Hash partitioning

With range partitioning, it is possible to end up with a situation where the data is not evenly divided among the partitions. Some partitions may be very large and others small. If the data was partitioned by month, and some months had more sales than others, this would result in partitions that are very different in size. When the data is skewed in this way, hot spots form where there is contention for resources in one area.

Hash partitioning reduces this type of data skew by applying a hashing function to the partitioning key. The resulting output value is used to determine which partition to store the row in. So instead of partitioning by month, if we hash partitioned the purchases table by product_id, all rows for a month would get scattered across several partitions, as shown in Figure 3.3. Notice that all rows for the same product fall into the same partition; however, there is no way to specify which partition will have which products.

click to expand
Figure 3.3: Hash partitioning.

The following example illustrates the SQL required to create a hash partitioned table.

 CREATE TABLE easydw.purchases  (product_id          varchar2(8),  time_key             date,  customer_id          varchar2(10),  purchase_date        date,  purchase_time        number(4,0),  purchase_price       number(6,2),  shipping_charge      number(5,2),  today_special_offer  varchar2(1)) PARTITION BY HASH(product_id) PARTITIONS 3;

Hash partitioning is good for finding rows corresponding to a single value but cannot be used by the optimizer to perform partition elimination for queries requesting a range of values. If we had a query based on a specific product, such as: How many tents did we sell each month? the optimizer could determine which partition to examine. However, if we had a query: What products did we sell during the month of January? we would need to search all partitions.

In hash partitioning, the user has no control or knowledge of which products go into which partitions. You can only specify how many partitions you would like. This makes it difficult to do maintenance operations that require knowledge of the data, such as archiving old data from the table. Hence, hash partitioning is not commonly used by itself but is used in conjunction with range partitioning.

List partitioning

In some cases, it may not be convenient to organize data into ranges of values. The data may not have a natural partitioning key such as time. For business reasons, values that are far apart may need to be grouped together. For instance, if we have sales data for various states of the United States, it is not very easy to put data for all states in New England into the same partition. In Oracle 9i, a new type of partitioning was introduced to solve this problem. This partitioning method, known as list partitioning, allows data to be distributed according to discrete column values. Figure 3.4 shows an example of a list-partitioned table.

click to expand
Figure 3.4: List partitioning.

The following SQL statement uses list partitioning to organize the sales data by region. The last partition is a catchall partition, which captures the rows that do not map to any other partition.

 CREATE TABLE easydw.regional_sales (state            varchar2(2),  store_number     number,  dept_number      number,  dept_name        varchar2(10),  sales_amount     number (6,2) ) PARTITION BY LIST(state) (  PARTITION northeast VALUES ('NH', 'VT', 'MA', 'RI', 'CT'),  PARTITION southeast VALUES ('NC', 'GA', 'FL'),  PARTITION northwest VALUES ('WA', 'OR'),  PARTITION midwest  VALUES ('IL', 'WI', 'OH'),  PARTITION west   VALUES ('CA', 'NV', 'AZ'),  PARTITION otherstates VALUES (DEFAULT));

List partitioning allows the query optimizer to perform partition pruning on queries that ask for specific values of the partitioning key. For instance, a query requesting data for Massachusetts or New Hampshire only needs to access the northeast partition.

Composite partitioning

Oracle provides a two-level partitioning scheme known as composite partitioning to combine the benefits of two partitioning methods. In composite partitioning, data is divided into partitions using one partitioning scheme and further subdivided into subpartitions using another scheme. Currently, Oracle supports two types of composite partitioning schemes:

Range-Hash
Range-List

We mentioned earlier that hash partitioning does not allow a user control over the distribution of data. On the other hand, range partitioning suffers from the potential problem of data skew. Range-hash composite partitioning combines the benefits of range and hash partitioning. The data is first partitioned by range and then further subdivided into subpartitions by using a hash function. When partitioning by date, the last partition is often a hot spot. By further subpartitioning using a hash function, I/O contention is reduced because access is divided among the subpartitions.

All partitions of a table or index have the same logical attributes but can have different physical attributes. For example, all partitions in a table have the same column and constraint definitions, but they can have different storage attributes and physical placement. Subpartitions of a single partition can be placed in different tablespaces. This allows data to be distributed across many physical devices to reduce I/O contention.

In Figure 3.5, the data is first partitioned by month and then further partitioned by product. Each partition has three subpartitions. Each subpartition has been stored in its own tablespace. The SQL to create a composite partitioned table is shown in the following example. The STORE IN clause allows you to name the tablespace where each subpartition will reside.

click to expand
Figure 3.5: Range-hash composite partitioning.

 CREATE TABLE easydw.purchases (product_id varchar2(8),  time_key date,  customer_id varchar2(10),  purchase_date date,  purchase_time number(4,0),  purchase_price number(6,2),  shipping_charge number(5,2),  today_special_offer varchar2(1)) PARTITION by RANGE (time_key)   SUBPARTITION BY HASH(product_id)   SUBPARTITIONS 3  (partition purchases_jan2002    values less than (TO_DATE('01-02-2002', 'DD-MM-YYYY'))    pctfree 0 pctused 99    storage (initial 64k next 16k pctincrease 0)    STORE IN (purchases_jan2002,              purchases_jan2002_2,              purchases_jan2002_3),   partition purchases_feb2002    values less than (TO_DATE('01-03-2002', 'DD-MM-YYYY'))    pctfree 0 pctused 99    storage (initial 64k next 16k pctincrease 0)    STORE IN (purchases_feb2002,              purchases_feb2002_2,              purchases_feb2002_3),   partition purchases_mar2002    values less than (TO_DATE('01-04-2002', 'DD-MM-YYYY'))    pctfree 0 pctused 99    storage (initial 64k next 16k pctincrease 0)    STORE IN (purchases_mar2002,              purchases_mar2002_2,              purchases_mar2002_3));

The range-list composite partitioning method first partitions a table by a continuous key, such as time_key, and then subpartitions each partition with discrete values. In the following example, we have a table range partitioned by month and further list partitioned by state. A query that asks for sales for the dates in the month of January for the New England states can be evaluated efficiently if the sales table is partitioned using range-list partitioning.

 CREATE TABLE sales (state varchar2(2),  store_number number,  dept_number number,  dept_name varchar2(10),  sales_amount number (6,2),  sale_date date,  item_number number (10) ) PARTITION BY RANGE (sale_date)   SUBPARTITION BY LIST(state)   SUBPARTITION TEMPLATE   (SUBPARTITION "NorthEast" VALUES ('NH','VT','MA','RI','CT')                               TABLESPACE sales_ne,    SUBPARTITION "SouthEast" VALUES ('NC','GA','FL')                               TABLESPACE sales_se,    SUBPARTITION "NorthWest" VALUES ('WA','OR')                               TABLESPACE sales_nw,    SUBPARTITION "MidWest"   VALUES ('IL','WI','OH')                               TABLESPACE sales_mw,    SUBPARTITION "West"      VALUES ('CA','NV','AZ')                               TABLESPACE sales_w) (   PARTITION sales_jan_2002    VALUES LESS THAN (TO_DATE('01-Feb-2002', 'DD-Mon-YYYY')),   PARTITION sales_feb_2002    VALUES LESS THAN (TO_DATE('01-Mar-2002', 'DD-Mon-YYYY')),   PARTITION sales_mar_2002    VALUES LESS THAN (TO_DATE('01-Apr-2002', 'DD-Mon-YYYY')) );

In this example, we show the use of the SUBPARTITION TEMPLATE clause. The specification of range-list partitioning can get quite verbose, since you have to specify the detailed subpartition clause for each range partition. Often, you want the same list subpartitioning within each of your range partitions. The SUBPARTITION TEMPLATE makes it convenient to specify the same subpartition information for all range partitions in the table. Oracle will generate the subpartition name using a combination of the partition name and the name specified in the template. It then generates the subpartitions according to the definition in the template. The partition and subpartition information for a table can be obtained from the user_tab_partitions and user_tab_subpartitions dictionary views. For our example, the partition and subpartition names are as follows:

 SELECT partition_name, subpartition_name FROM user_tab_subpartitions WHERE table_name = 'SALES'; PARTITION_NAME         SUBPARTITION_NAME ---------------------- ------------------------------ SALES_JAN_2002         SALES_JAN_2002_NorthEast SALES_JAN_2002         SALES_JAN_2002_SouthEast SALES_JAN_2002         SALES_JAN_2002_NorthWest SALES_JAN_2002         SALES_JAN_2002_MidWest SALES_JAN_2002         SALES_JAN_2002_West SALES_FEB_2002         SALES_FEB_2002_NorthEast SALES_FEB_2002         SALES_FEB_2002_SouthEast SALES_FEB_2002         SALES_FEB_2002_NorthWest SALES_FEB_2002         SALES_FEB_2002_MidWest SALES_FEB_2002         SALES_FEB_2002_West ...

You can choose not to use the SUBPARTITION TEMPLATE but instead specify the list sub-partition values and names for each range partition.

The following query asks for total sales for NH, MA and CT states for a range of dates. The optimizer will determine that this query can be answered quickly by reading the first sub-partition of the partitions for February and March. The Pstart and Pstop values in the output of EXPLAIN LAN below shows the range of partitions used to answer the query, in this case partitions 2 to 3.

 EXPLAIN PLAN FOR SELECT store_number, dept_number, SUM(sales_amount) as q1_sales FROM sales WHERE sale_date between TO_DATE('15-Feb-2002', 'DD-Mon-YYYY')   and TO_DATE('15-Mar-2002', 'DD-Mon-YYYY')   and state in ('NH', 'MA', 'CT') GROUP BY store_number, dept_number; ----------------------------------------------------------------- |Id| Operation                 |Name |Rows|Cost| Pstart | Pstop | ----------------------------------------------------------------- |0 | SELECT STATEMENT          |     | 1  | 16 |        |       | |1 |  SORT GROUP BY            |     | 1  | 16 |        |       | |2 |   PARTITION RANGE ITERATOR|     |    |    |     2  |     3 | |3 |    PARTITION LIST INLIST  |     |    |    |KEY(I)  |KEY(I) | |4 |     TABLE ACCESS FULL     |SALES| 1  | 14 |   KEY  |   KEY | -----------------------------------------------------------------

Hint:

Range-list partitioning is particularly suited as a partitioning scheme for materialized views, which typically contain summarized data across multiple dimensions, one of which is often time.

Multicolumn partition keys

A partition key used for range partitioning can have multiple columns. The following SQL code segment creates a table with range partitioning using a multicolumn partitioning key.

 CREATE TABLE easydw.purchases (product_id varchar2(8),  time_key date,  customer_id varchar2(10),  purchase_date date,  purchase_time number(4,0),  purchase_price number(6,2),  shipping_charge number(5,2),  today_special_offer varchar2(1)) PARTITION by RANGE (time_key, product_id)   (partition purchases_jan2002_100      values less than (TO_DATE('31-01-2002','DD-MM-YYYY'), 100)      pctfree 0 pctused 99      storage (initial 64k next 64k pctincrease 0)      tablespace purchases_jan2002_100,    partition purchases_jan2002_200      values less than (TO_DATE('31-01-2002','DD-MM-YYYY'), 200)      pctfree 0 pctused 99      storage (initial 64k next 64k pctincrease 0)      tablespace purchases_jan2002_200 ,    partition purchases_feb2002_all      values less than (TO_DATE('28-02-2002','DD-MM-YYYY'), 100)      pctfree 0 pctused 99      storage (initial 64k next 64k pctincrease 0)      tablespace purchases_feb2002,    partition purchases_mar2002_all      values less than (TO_DATE('31-03-2002','DD-MM-YYYY'), 100)      pctfree 0 pctused 99      storage (initial 64k next 64k pctincrease 0)      tablespace purchases_mar2002 );

To understand how data get mapped to partitions, let us now insert some data values into this table and see which partitions they go into.

 insert into purchases (product_id, time_key)     values (1, TO_DATE('15-01-2002', 'DD-MM-YYYY')); insert into purchases (product_id, time_key)     values (150, TO_DATE('15-01-2002', 'DD-MM-YYYY')); insert into purchases (product_id, time_key)     values (101, TO_DATE('31-01-2002', 'DD-MM-YYYY')); insert into purchases (product_id, time_key)     values (170, TO_DATE('31-01-2002', 'DD-MM-YYYY')); insert into purchases (product_id, time_key)     values (200, TO_DATE('31-01-2002', 'DD-MM-YYYY')); insert into purchases (product_id, time_key)     values (1, TO_DATE('28-02-2002', 'DD-MM-YYYY')); SELECT product_id, time_key FROM purchases partition(purchases_jan2002_100); PRODUCT_ID TIME_KEY ---------- ----------- 1          15-JAN-2002 150        15-JAN-2002 SELECT product_id, time_key FROM purchases partition(purchases_jan2002_200); PRODUCT_ID TIME_KEY ---------- ----------- 101        31-JAN-2002 170        31-JAN-2002 SELECT product_id, time_key FROM purchases partition(purchases_feb2002_all); PRODUCT_ID TIME_KEY ---------- ----------- 200        31-JAN-2002 1          28-FEB-2002

Notice that the row with 15-Jan-2002 for product_id = 150 goes into the first partition rather than the second. This may not be what you expect. In this case, the condition checked for the first partition is (time_key < '31-01-2002') or (time_key = '31-01-2002' and product_id < 100). If the (time_key, product_id) values do not map to the first partition, the second partition is checked using (time_key < '31-01-2002') or (time_key = '31-01-2002' and product_id < 200). Multikey partitioning can be useful if you have one time_key value with lots of product_id values (e.g., several purchases may be made on Christmas Eve). In this case, you can partition on that value to distribute the data into several partitions.

A range-partitioning scheme with a multicolumn partition key must not be confused with range-range composite partitioning. As of the time of writing, Oracle does not support range-range composite partitioning. If it were supported, the row with time_key = 15-Jan-2002 and product_id = 150 would indeed map to the second partition in the previous example.

To summarize, range partitioning should be used when your table has a continuous key, such as time. List partitioning is ideal for tables where you would like to place specific discrete values in one partition. Hash partitioning distributes data uniformly among all partitions and can be used in combination with range partitioning to avoid hot spots in the table. Finally, range-list partitioning may be used when the table stores data along multiple dimensions: one continuous, such as time, and the other discrete, such as product or geography. Use of partitioning provides great benefits in a warehouse by improving query performance, manageability, and availability of data.

In the next section, we will see how partitioning can be applied to an index.

3.3.3 Partitioned indexes

Both b*tree and bitmap indexes can be partitioned using any of the available partitioning methods. They may be partitioned the same way that the table is partitioned or by different criteria. A b*tree index may be partitioned even if the underlying table is not. Partitioned indexes can be of two types:

Global
Local

Global indexes

In a global index, the keys in an index partition need not correspond to any single table partition or subpartition. A global index may not be partitioned at all or can have a partition key completely different from the table. However, a partitioned global index must have the partitioning key as the leading column of the index key. Bitmap indexes on partitioned tables cannot be global.

Figure 3.6 shows a global index purchases_product_index on the purchases table. Here, the purchases table is partitioned by time_key, but the index is partitioned by product_id. The leading column of the index is also product_id.

click to expand
Figure 3.6: Global index.

The following code segment shows the SQL used to create this index:

 CREATE INDEX easydw.global_product_index on purchases  (product_id)  global  partition by range (product_id)  (partition sp1000 values less than ('SP1000') ,   partition sp2000 values less than ('SP2000') ,   partition other values less than (maxvalue));

Figure 3.6 shows why global indexes are not very efficient for query processing. Since there is no correlation between the table and index partitions, accessing values in a table partition can involve access to several or all of the index partitions.

In addition, when the data in an underlying table partition is moved or removed using a partition maintenance operation, all partitions of a global index are affected and the index is marked UNUSABLE. An ALTER INDEX REBUILD statement must be issued to rebuild the index. (Note that in Oracle 9i release2, there is an option to efficiently rebuild the affected global indexes by specifying the UPDATE GLOBAL INDEXES clause on the partition operation.)

These shortcomings of a global index are addressed by local indexes.

Local indexes

A local index inherits its partitioning criteria from the underlying table. It has the same number of partitions, subpartitions, and partition bounds as the table. When partitions are added, dropped, split, or merged in the underlying table, the index partitions are automatically modified correspondingly.

There are two types of local partitioned indexes:

Prefixed
Non-prefixed

If the partitioning key of the index appears as a leading prefix of the index, it is called a prefixed index. Otherwise, the index is said to be non-prefixed.

Prefixed local indexes

In our example, the purchases table is partitioned by the time key. Figure 3.7 shows an example of a prefixed local index, purchase_time_index on this table. The partitioning key is time_key and the index columns are (time_key, customer_id). The partitioning key, time_key, is a prefix of the index key.

click to expand
Figure 3.7: Prefixed local index.

Following is the SQL required to create this index. The local index follows the same partitioning scheme as the table, but you can name the partitions and also the tablespace where each partition resides.

 CREATE INDEX easydw.purchase_time_index  ON purchases (time_key, customer_id) local  (partition indexJan2002 tablespace purchases_jan2002_idx,   partition indexFeb2002 tablespace purchases_feb2002_idx,   partition indexMar2002 tablespace purchases_mar2002_idx);

With this index, if we wanted to know the purchases made by a certain customer in January, we need only to search the partition indexJan2002.

Nonprefixed local indexes

Local indexes that do not include the partitioning key as the leading column of the index are known as nonprefixed local indexes. Such indexes are useful when we would like to partition by one column for ease of maintenance but index other columns for data retrieval. For instance, we may want our indexes and tables to be partitioned by time_key, so it's easy to add a new month's data and build data for that month. However, to get good performance for queries for sales by product_id, we need an index on product_id.

Figure 3.8 shows a local nonprefixed index on the product_id column of the purchases table. The partitioning scheme is the same as that of the purchases table (i.e., on the time_key column). The following code segment shows how to create this index.

click to expand
Figure 3.8: Nonprefixed local index.

 CREATE BITMAP INDEX easydw.purchase_product_index ON purchases  (product_id) local  (partition indexJan2002 tablespace purchases_jan2002_idx,   partition indexFeb2002 tablespace purchases_feb2002_idx,   partition indexMar2002 tablespace purchases_mar2002_idx);

With this index, when searching for sales of some product, say "Tents" the optimizer is not able to perform partition elimination and must search for the data for all months. However, when searching for "Tents" sold in January, it can eliminate the February and March partitions of the index.

To summarize, local prefixed indexes are the most efficient type of indexes for query performance, since the optimizer can use partition elimination to avoid looking at unnecessary partitions. Local indexes also facilitate rolling window operations. Avoid using global indexes if possible, since additional maintenance is required and bitmap indexes cannot be global.

3.3.4 RAID technology

An alternative technique for spreading the I/O requests across multiple devices is by using RAID storage technology-redundant arrays of inexpensive disks. By using multiple disks together, performance and reliability can be improved. Rather than storing all data in one file on a single disk, the data is striped across multiple disks, reducing contention for disk arms. RAID technologies can be used with all the data warehousing features discussed in this book.