Indexing has always been a very important technique for efficient query processing in database systems. Unlike OLTP systems, which have mostly update activity, data warehouses tend to read large amounts of data to answer queries. Hence, it is important to understand the indexing needs of a warehouse.
Deciding which indexes to create is an important part of the physical design of a database. Indexes should be built on columns that are often part of the selection criteria of a query. Columns that are frequently referenced in the SQL WHERE clause are good candidates for indexing. Most decision-support queries require specific rows from a dimension table, and so it is important to have good indexing on the dimension tables. For example, if a user is looking for the location of all stores in the Northeast, as in the following query, an index built on the column "region" could be used to locate just those rows in the Northeast rather than reading every row in the table.
SELECT store, location FROM stores WHERE region = 'Northeast';
Typical warehouses have low update activity, other than for loading new data. Hence, a warehouse can have many more indexes than an OLTP system. Space requirements for indexes in a warehouse are often significantly larger than the space needed to store the data, especially for the fact table. Therefore, you may want to keep indexing on the fact table to a minimum. Typically, you may have one or two concatenated indexes on the fact table. You may also want to consider bitmap indexes on foreign-key columns to speed up star joins. On the other hand, dimension tables are much smaller compared with the fact table and could be indexed much more extensively. Any column of the dimension table that is frequently used in selections or is a level in a dimension object, described in Chapter 4, is a good candidate for indexing.
An index is built on one or more columns of a table. These columns are known as the index keys. For each key value, the index contains a pointer to the location of the rows with that key value. Whenever data in the table changes, the index is automatically updated to reflect the changed data.
Oracle offers several kinds of indexes. We will discuss three of these in this book:
B*tree index
Bitmap index
Bitmap join index
B*tree indexes are hierarchical structures that allow rapid searching in order to obtain the address of a row in a table with a particular value. B*tree indexes store pointers to rows of the tables using rowids. There are two varieties of b*tree indexes:
Unique
Non-unique
A unique index ensures that each row has a distinct value for its key. No duplicates are allowed. A unique index is automatically created when a PRIMARY KEY or UNIQUE constraint is enabled on a table. B*tree indexes can also be nonunique. Nonunique indexes improve query performance when a small number of rows are selected.
When multiple columns are used together in the WHERE clause, you can build an index on that group of columns. Indexes made up of multiple columns are called composite, multikey, or concatenated indexes. For example, city and state are both needed to differentiate Portland, Maine, from Portland, Oregon. The column that is used most frequently by itself should be specified as the first column in the index. In the example, state should be the leading column, since we can anticipate queries on state alone or city and state used together.
Whenever a row is inserted, deleted, or when the key columns are updated, the index is automatically updated to reflect the change. A b*tree index is designed such that the time required to search any particular value in the index is nearly constant. This design is called a balanced index. As new index nodes are created, the b*tree is automatically rebalanced. Index maintenance, therefore, adds overhead to the DML statement. When loading a large number of rows, it is sometimes faster to mark the index unusable for the duration of the insert and rebuild the index, as shown in the following code segment:
ALTER INDEX customer_gender_index UNUSABLE; INSERT /*+ APPEND */ INTO customers ... COMMIT; ALTER INDEX customer_gender_index REBUILD;
A b*tree index is useful when the index key has many distinct values, each leading to a few rows in the table. For instance, a primary key implies each value of the key corresponds to one row in a table. However, if the index-key column has only a few distinct values, then each value would lead to retrieval of large numbers of rows in the table. This provides little, if any, performance benefit. For instance, consider the query: How many women who live in California buy tents? There are only two possible values for the column "sex"; hence, finding all the rows corresponding to women may result in reading all the data blocks in the table, since multiple records are typically stored in one block. A full table scan may be more efficient. Also, for large tables, space requirements for b*tree indexes could become prohibitive.
Bitmap indexes are designed to solve these problems and, hence, are more commonly used in a warehouse.
Bitmap indexes are designed to answer queries involving columns with few distinct values but potentially large numbers of rows for each value. The number of distinct values of a column is known as its cardinality. Unlike b*tree indexes, which store pointers to rows of the tables using rowids, a bitmap index stores a bitmap for each distinct value of a column.
The bitmap for each distinct value has as many bits as the number of rows in the table. The bit is set if the corresponding row has that value. Table 3.1 shows a bitmap index for the "sex" column (cardinality = 2, distinct values - M and F). Two bitmaps are created: one for the value M and one for the value F. In the bitmap for M, all rows for male customers would have their bits set to 1 and all rows for female customers would have their bits set to 0.
Row | Sex | Male Bitmap | Female Bitmap |
---|---|---|---|
1 | M | 1 | 0 |
2 | F | 0 | 1 |
3 | M | 1 | 0 |
4 | M | 1 | 0 |
5 | M | 1 | 0 |
6 | F | 0 | 1 |
7 | M | 1 | 0 |
8 | M | 1 | 0 |
9 | F | 0 | 1 |
10 | M | 1 | 0 |
Bitmap indexes on two columns of a table can be combined efficiently using the AND and OR Boolean operators. This allows them to be used for queries involving multiple conditions on bitmapped columns of a table. This gives the benefit that you can answer a wide variety of queries with just a few single-column bitmap indexes. For instance, suppose we had a customer table that contained the customer name, sex, and occupation (teacher, engineer, self-employed, housewife, doctor, etc.). We are looking at a new promotion targeted toward self-employed women and would like to find out the towns to target. This can be expressed as a query:
SELECT customer_name, town FROM customer WHERE sex = 'F' AND occupation = 'Self-Employed';
Table 3.2 shows the bitmap for the occupation column.
Row | Occupation | Teacher | Engineer | Self-Employed | Housewife |
---|---|---|---|---|---|
1 | Teacher | 1 | 0 | 0 | 0 |
2 | Engineer | 0 | 1 | 0 | 0 |
3 | Teacher | 1 | 0 | 0 | 0 |
4 | Teacher | 1 | 0 | 0 | 0 |
5 | Engineer | 0 | 1 | 0 | 0 |
6 | Self-Employed | 0 | 0 | 1 | 0 |
7 | Self-Employed | 0 | 0 | 1 | 0 |
8 | Housewife | 0 | 0 | 0 | 1 |
9 | Engineer | 0 | 1 | 0 | 0 |
10 | Housewife | 0 | 0 | 0 | 1 |
The sex bitmap defined in Table 3.1 and the occupation bitmap defined in Table 3.2 can be combined together, as shown in Table 3.3, to quickly determine the rows in the customer table that satisfy this query.
Row | Female Bitmap | Self-Employed | Result |
---|---|---|---|
1 | 0 | 0 | 0 |
2 | 1 | 0 | 0 |
3 | 0 | 0 | 0 |
4 | 0 | 0 | 0 |
5 | 0 | 0 | 0 |
6 | 1 | 1 | 1 (self-employed women) |
7 | 0 | 1 | 0 |
8 | 0 | 0 | 0 |
9 | 1 | 0 | 0 |
10 | 0 | 0 | 0 |
The SQL statement used to create the bitmap index on the customer.sex column is as follows:
CREATE BITMAP INDEX easydw.customer_gender_index ON customer(sex) pctfree 5 tablespace indx storage (initial 64k next 64k pctincrease 0);
Alternatively, you can create a bitmap index in Oracle Enterprise Manager, as shown in Figure 3.1. In order to create a bitmap index rather than a b*tree index, check the bitmap box. You can specify storage parameters and see the SQL generated for the index creation.
Figure 3.1: Creating a bitmap index in Oracle Enterprise Manager.
Bitmap indexes can be compressed and therefore require much less space than a corresponding b*tree index. The space savings can be significant if the number of distinct values of the bitmap is small compared with the number of rows in the table.
The disadvantage of bitmap indexes is that they are expensive to maintain when the data in the table changes. This is because the bitmaps must be uncompressed, recompressed, and possibly rebuilt. Fortunately, in a warehouse individual rows are updated very rarely. Bitmap index maintenance is optimized for the common case of loading new data into a warehouse via bulk insert. In this case, existing bitmaps do not need change and are simply extended to include the new rows. Also, in the case of a b*tree index, when updating a row, only that row is locked; however, with bitmap indexes, a large part of the bitmap may need to be locked. Thus, bitmap indexes reduce the concurrency in the system. Hence, bitmap indexes are not suited for systems with a lot of concurrent update activity.
Oracle 9i introduced a variation on bitmap indexes called the bitmap join index. A bitmap join index can be used to index the result of a join between a fact table and one or more dimension tables. To illustrate the concept of bitmap join indexes, consider the following query: How much did women spend in our store?:
SELECT sum(p.purchase_price) FROM purchases p, customer c WHERE p.customer_id = c.customer_id AND c.sex = 'F'
We could create a bitmap index on the customer.sex column, as discussed in the previous section to find those customers who are women quickly. However, we would still need to compute the join between purchases and customers to find the purchases made by women.
Instead, if we take the answer to the join and then create a bitmap to identify rows corresponding to purchases made by men and women, we get a bitmap join index. Table 3.4 shows the answer to the previous join between purchases and customer table (only the first ten rows are shown).
Row | Customer_id | Sex | Purchase_price |
---|---|---|---|
1 | AB123456 | F | 28.01 |
2 | AB123457 | F | 28.01 |
3 | AB123457 | F | 28.01 |
4 | AB123457 | F | 28.01 |
5 | AB123456 | F | 67.23 |
6 | AB123457 | F | 67.23 |
7 | AB123458 | M | 67.23 |
8 | AB123459 | M | 67.23 |
9 | AB123460 | F | 67.23 |
10 | AB123461 | M | 50.71 |
... |
The bitmap join index has two bitmaps one for the value male and another for the value female, as shown in Table 3.5. Each row in the bitmap corresponds to a single row in the purchases table. Thus, we can immediately identify the rows for purchases made by women from the bitmap for value female.
Row | Sex | Male Bitmap | Female Bitmap |
---|---|---|---|
1 | F | 0 | 1 |
2 | F | 0 | 1 |
3 | F | 0 | 1 |
4 | F | 0 | 1 |
5 | F | 0 | 1 |
6 | F | 0 | 1 |
7 | M | 1 | 0 |
8 | M | 1 | 0 |
9 | F | 0 | 1 |
10 | M | 1 | 0 |
... |
The bitmap join index on purchases, joined with customer, is created using the following SQL statement. The joining tables and their join conditions are specified using the FROM and WHERE clauses, similar to those in a SELECT statement. The table on which the index is built is specified by the ON clause, as with any other index.
CREATE BITMAP INDEX easydw.purchase_cust_index ON purchases (customer.sex) FROM purchases, customer WHERE purchases.customer_id = customer.customer_id;
We can immediately note some differences between a bitmap index and a bitmap join index. While a bitmap index is created on columns of a single table, a bitmap join index is built on the fact table (purchases), but the index columns are from dimension tables (customer).
As with a bitmap index, columns included in a bitmap join index must be low-cardinality columns. For a bitmap join index to make sense, the result of the join should have the same number of rows as the fact table on which it is created (in our example the purchases table). To ensure this, a unique constraint must be present on the dimension table column that joins it to the fact table (in our example the customer.customer_id column).
Bitmap join indexes put some restrictions on concurrent DML activity on the tables involved in the index, but this is not a major issue for most warehouse applications.
An alternative to a bitmap join index is a materialized view with joins. We will briefly compare the two in Chapter 4.