Physical Database Design


Because BI applications usually require detailed operational data as well as summarized data, and because they often need to store some or all of that data redundantly, the size of some BI target databases can be enormous. Databases approaching or exceeding one terabyte of data are called very large databases (VLDBs). Designing VLDBs is a major challenge, and the day-to-day chores of maintaining them are demanding. Many difficult physical design decisions must be made, and some highly effective performance enhancements must be applied. The following sections present some suggested guidelines.

Implementation Options

Almost every DBMS lets the database administrator choose from a number of implementation options. Give considerable attention to selecting the right options when implementing a BI target database. It takes experience to know which combination of options will meet the desired performance level. Implementation decisions include the following:

  • How much free space to choose

  • How much buffer space to declare

  • How large to set the blocksize

  • Whether to use any compaction technique
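The first three options interact: reserving free space in each block leaves room for future inserts (reducing page splits), but it also reduces how many rows fit per block. A back-of-the-envelope sketch of that trade-off, with purely illustrative numbers:

```python
# Illustrative arithmetic only -- the row size, blocksize, and free-space
# percentage below are made-up numbers, not recommendations.
row_size = 120          # bytes per row
block_size = 4096       # candidate blocksize
free_space_pct = 10     # space reserved in each block for future inserts

usable = block_size * (100 - free_space_pct) // 100   # bytes left for rows
rows_per_block = usable // row_size

# A larger blocksize or a smaller free-space percentage packs more rows per
# block (better sequential scans); more free space delays page splits.
```

Running the numbers this way for a few candidate blocksizes is a quick first pass before testing the options against the DBMS itself.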

Physical Dataset Placement

Another basic issue that affects performance is the placement of the datasets. Methods for achieving fast response include combinations of:

  • Storing frequently referenced data on fast devices.

  • Storing different aggregation levels on different platforms. For performance reasons, it may be necessary to store aggregate data on distributed midrange servers while keeping detail data on the mainframe.

  • Striping disks in an interleaved fashion to optimize input/output (I/O) controller usage. Using lots of small disks instead of a few large disks, separating those disks onto separate controllers, and writing the data across devices increases I/O throughput.

  • Placing datasets so that lengthy seeks are avoided whenever possible.

  • Selecting address and search schemes that require few seeks, preferably only one per retrieval.

  • Running multiple operations in parallel.

Also consider whether to separate indices from data and put them on separate disks.
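The striping idea above can be sketched in a few lines. This is a simulation, not a real device driver: the "disks" are just temporary directories, and the round-robin placement stands in for what an I/O subsystem does across controllers.

```python
# Sketch of interleaved striping: consecutive data blocks are written
# round-robin across several devices (simulated here as directories), so a
# sequential scan can drive all I/O controllers at once.
import os
import tempfile

def stripe_blocks(data: bytes, devices: list, block_size: int = 4096):
    """Write fixed-size blocks round-robin across devices; return placement."""
    placement = []
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for n, block in enumerate(blocks):
        device = devices[n % len(devices)]   # interleave across "controllers"
        path = os.path.join(device, f"block_{n:06d}.dat")
        with open(path, "wb") as f:
            f.write(block)
        placement.append((n, device))
    return placement

devices = [tempfile.mkdtemp(prefix=f"disk{i}_") for i in range(3)]
placement = stripe_blocks(b"x" * 10000, devices)
# Adjacent blocks land on different devices, so reads can proceed in parallel.
```

Because adjacent blocks never share a device, a large sequential read is spread across all spindles instead of queueing on one.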

Partitioning

Ensure that tables are partitioned effectively across multiple disks. This is particularly important for VLDBs where fact tables can reach several hundred gigabytes. Partitioning allows the data of one "logical" table to be spread across multiple physical datasets. The physical data distribution is based on a partitioning column, which is most commonly date. Since a partitioning column must be part of the table's primary key, the partitioning column cannot be a derived column, and it cannot contain NULL values. Partitioning enables you to back up and restore a portion of a table without impacting the availability of other portions of the same table that are not being backed up or restored.
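The routing logic behind date-based partitioning can be sketched as follows. The table and column names are hypothetical, and the dictionary stands in for the physical datasets a DBMS would manage:

```python
# Hypothetical sketch of date-based partition routing: the partitioning
# column (a date) deterministically maps each row to one physical dataset,
# which is why one partition can be backed up or restored on its own.
from datetime import date

def partition_for(sale_date: date) -> str:
    """Route a row to a monthly partition based on its date column."""
    return f"fact_sales_{sale_date.year}_{sale_date.month:02d}"

partitions = {}                    # partition name -> rows (simulated datasets)

def insert(row: dict):
    # The partitioning column may not contain NULLs.
    assert row["sale_date"] is not None
    partitions.setdefault(partition_for(row["sale_date"]), []).append(row)

insert({"sale_date": date(2003, 1, 15), "amount": 100})
insert({"sale_date": date(2003, 2, 3), "amount": 250})
# January and February rows now live in separate physical datasets.
```

Backing up only `fact_sales_2003_01` touches one dataset and leaves the February partition fully available, which is the operational benefit described above.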

Clustering

Define cluster table requirements, and physically co-locate related tables on the disk drive. Clustering is a very useful technique for sequential access of large amounts of data. Clustering is accomplished through clustering indices, which determine the sequential order in which the rows of a table are physically stored in the datasets. Ideally, you want to cluster the primary key of each table to avoid page splits; that is, to make sure that new rows inserted into the tables will be stored sequentially on the disk according to the columns in their clustering index. This technique can dramatically improve performance because sequential access of data is the norm in BI applications. When the rows of a table are no longer stored in the same order as its clustering index (data fragmentation), performance suffers and the table must be reorganized.
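A minimal sketch of the clustering idea, with an in-memory list standing in for the physical dataset and a tuple key standing in for the clustering index:

```python
# Illustrative only: rows are kept physically ordered by their clustering
# key, so a sequential scan reads them in key order without extra seeks.
import bisect

table = []          # rows kept sorted by clustering key (simulated dataset)

def clustered_insert(key, payload):
    """Insert a row so the table stays in clustering-index order."""
    bisect.insort(table, (key, payload))

for k in (5, 1, 3):
    clustered_insert(k, f"row-{k}")

keys = [k for k, _ in table]
# If keys ever stopped matching their sorted order (data fragmentation),
# the table would need to be reorganized.
is_fragmented = keys != sorted(keys)
```

In a real DBMS the insert lands in whichever page the clustering index dictates; when that page is full it splits, which is exactly the overhead that clustering on an ever-increasing key (such as the primary key) avoids.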

Indexing

There are two extreme indexing strategies, neither of which is advisable: one strategy is to index everything, and the other is to index nothing. Instead of veering to these extremes, index those columns that are frequently searched and that have many distinct values (high cardinality), such as Account Open Date. Do not index columns that have few distinct values (low cardinality), such as Gender Code.

Once you have decided which columns to index, determine the index strategy to use. Most DBMSs provide several access methods to choose from, either sequential access or direct access using any of the following well-known indexing algorithms:

  • B-tree

  • Hash

  • Inverted file

  • Sparse

  • Binary

Consult with your DBMS vendor to choose the optimal access method (indexing algorithm) for the DBMS product you are using.
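The practical difference between two of the access methods listed above can be shown with plain data structures. A dictionary behaves like a hash index (one probe for an exact match, no ordered scans), while a sorted key list behaves like a B-tree index (exact matches and range scans). The table contents are made up:

```python
# Hash vs. B-tree-style access, sketched with stdlib structures.
import bisect

rows = [("2001-03-04", "acct1"), ("2002-07-19", "acct2"), ("2003-01-05", "acct3")]

hash_index = {open_date: acct for open_date, acct in rows}   # exact match only
btree_keys = sorted(open_date for open_date, _ in rows)      # ordered keys

def range_scan(lo: str, hi: str):
    """Return keys in [lo, hi] -- cheap on a sorted (B-tree-like) index."""
    left = bisect.bisect_left(btree_keys, lo)
    right = bisect.bisect_right(btree_keys, hi)
    return btree_keys[left:right]

exact = hash_index["2002-07-19"]                      # hash: one probe
in_range = range_scan("2002-01-01", "2003-12-31")     # B-tree: range query
```

This is why a column like Account Open Date, which is often queried by date range, tends to favor a B-tree, while a hash index only pays off for pure equality lookups.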

Reorganizations

Occasionally you will need to reorganize the databases because incremental loads will fragment the datasets over time, and inserted rows will no longer be stored in a logical sequence. This fragmentation may result in long data retrieval chains, and performance can drop off significantly. Most DBMSs provide reorganization routines to rearrange the fragmented database in order to reclaim space occupied by deleted data or to move records from overflow areas into free space in prime data areas.

The basic activities involved in reorganizing a database are to copy the old database onto another device, reblock the rows, and reload them. This is not a trivial effort for BI target databases. The good news is that all DBMSs can perform a partial reorganization routine on database partitions, which is another reason for the database administrator to partition the BI target databases.
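The copy, reblock, and reload cycle can be sketched in miniature. The row format is hypothetical; the point is only the three steps and the reclaimed space:

```python
# Minimal sketch of a reorganization: copy rows out, drop logically deleted
# ones (reclaiming their space), and reload the rest in clustering-key order.
def reorganize(fragmented_rows):
    copied = list(fragmented_rows)                       # copy to another device
    live = [r for r in copied if not r.get("deleted")]   # reclaim deleted space
    return sorted(live, key=lambda r: r["key"])          # reblock and reload

old = [{"key": 3}, {"key": 1, "deleted": True}, {"key": 2}]
new = reorganize(old)
# Surviving rows are now stored sequentially by key: 2, then 3.
```

Run per partition, only the rows of that one partition need to be copied and reloaded, which is why partitioned tables make this chore tractable for VLDBs.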

Backup and Recovery

Since software and hardware may fail, it is necessary to establish backup and recovery procedures. DBMSs provide utilities to take full backups as well as incremental backups. Many organizations are under the misguided impression that the BI target databases can always be recreated from the original source data. They fail to realize that it may take a very long time to recreate the BI target databases if they have to rerun all the initial and historical extract/transform/load (ETL) programs, assuming the original source files are still available.

Disaster recovery is also an issue for BI applications. If the backup tapes or cartridges are destroyed during a disaster, it could be difficult to recreate your BI target databases, and it could take a very long time (if recovery is possible at all). For this reason, many companies choose to store their database backups in remote locations.
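The relationship between full and incremental backups can be sketched as follows. The data and functions are illustrative, not any vendor's utility; the key point is that a restore replays the full backup plus every later incremental, in order:

```python
# Sketch: an incremental backup captures only rows changed since the last
# backup, so recovery = full backup + chronological replay of incrementals.
def full_backup(table: dict) -> dict:
    return {"type": "full", "rows": dict(table)}

def incremental_backup(table: dict, since: dict) -> dict:
    changed = {k: v for k, v in table.items()
               if k not in since or since[k] != v}
    return {"type": "incremental", "rows": changed}

def restore(full: dict, incrementals: list) -> dict:
    state = dict(full["rows"])
    for inc in incrementals:        # must be replayed in chronological order
        state.update(inc["rows"])
    return state

table = {1: "a", 2: "b"}
full = full_backup(table)
table[2] = "B"                      # row updated after the full backup
table[3] = "c"                      # row inserted after the full backup
inc = incremental_backup(table, full["rows"])
recovered = restore(full, [inc])    # matches the current table state
```

Losing any link in that chain (for example, a destroyed incremental cartridge) leaves the restore incomplete, which is why off-site copies of all backups matter.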

Parallel Query Execution

To improve the performance of a query, break down a single query into components to be run concurrently. Some DBMS products offer transparent parallel execution, which means you do not need to know how to break down a query into components because the DBMS does it for you. Performance is greatly increased when multiple portions of one query run in parallel on multiple processors. Other applications of parallel query execution are loading tablespace partitions, building indices, and scanning or sorting tables. Parallel processing is a very important concept for BI applications and should be considered whenever possible.
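The split-and-merge pattern described above can be sketched with a thread pool standing in for the DBMS's parallel workers. The partitions and the SUM query are made up; a real DBMS would schedule the components across processors transparently:

```python
# Sketch of parallel query execution: one logical SUM(amount) is split into
# per-partition scans that run concurrently, and the partial sums are merged.
from concurrent.futures import ThreadPoolExecutor

partitions = {
    "2003_01": [100, 250, 75],
    "2003_02": [300, 50],
    "2003_03": [125],
}

def scan_partition(rows):
    """The per-partition component of the aggregate query."""
    return sum(rows)

def parallel_sum():
    with ThreadPoolExecutor(max_workers=3) as pool:
        partials = pool.map(scan_partition, partitions.values())
    return sum(partials)            # merge step combines the partial results

total = parallel_sum()              # same answer a serial scan would give
```

The merge step is what makes the parallelism transparent to the query's caller: the decomposition changes elapsed time, never the answer.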



Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications
ISBN: 0201784203
Year: 2003
Pages: 202
