When you are constructing a data warehouse, it is easy to focus on ensuring that queries are processed quickly. However, blindly following this approach can easily result in a database that is difficult to manage or use.
It is no good building a warehouse that answers every question in under a minute if the data inside it is at risk because the database cannot easily be backed up. Therefore, when designing a database, always identify the crucial management tasks and determine whether they can be performed easily using this design. In Chapter 7 we will discuss the management tasks for a data warehouse in more detail, but let us briefly review some of those tasks and see how they affect the design. Some of the important management tasks include:
Loading new data
Aggregating new data
Data maintenance activities such as indexing and archiving
All databases should be backed up regularly, and Oracle 9i provides the RMAN utility, which allows on-line backups and incremental backups containing only the data that has changed. On the surface, backup may seem a trivial task, but backing up a terabyte warehouse takes time, even when it is an on-line backup. Therefore, the designer should carefully consider the tablespaces (explained in Chapter 3) where the data is stored, so that tablespace backups can be taken of read-only tablespaces, or use partitioning (also explained in Chapter 3) so that a full backup can be taken by running parallel backup tasks; you could even consider using Data Guard to protect the data warehouse.
Full database backups are likely to be a luxury in very large warehouses; therefore, you should design the warehouse to allow incremental backups that contain only the changes to the warehouse data. Because of the huge volumes of data in the warehouse, any operation that can be performed in parallel will significantly reduce the time required to complete the task, especially when many parallel processes run concurrently. Most of the new data will be stored in the fact table; therefore, during the design phase, the designer should ascertain when and how much data is going to be loaded, and then calculate the anticipated load time. If it cannot all be loaded in the available time, techniques such as partitioning the fact table, so that the data can be loaded in parallel, should be considered.
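As a sketch of the incremental-backup idea, an RMAN script might look like the following. The channel names and degree of parallelism are purely illustrative; your backup policy will differ, and a level 0 (full) backup must have been taken once beforehand.

```sql
RUN {
  -- Two channels so that backup pieces are written in parallel
  ALLOCATE CHANNEL ch1 DEVICE TYPE DISK;
  ALLOCATE CHANNEL ch2 DEVICE TYPE DISK;
  -- A level 1 backup contains only the blocks changed since the
  -- last level 0 backup; read-only tablespaces, already backed up
  -- once, are skipped entirely
  BACKUP INCREMENTAL LEVEL 1 DATABASE SKIP READONLY;
}
```

Combining incremental backups with read-only tablespaces in this way means the nightly backup window has to accommodate only the data that actually changed.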
When the design team tests the data warehouse, they should not concentrate on query performance alone. Testing the load process as well will enable them to advise how long loads will take and to discuss with the operations department how backups will be performed and how much time those backups will need.
In Chapter 4, we will see how summary management, which was introduced in Oracle 8i, can be used to maintain aggregated data. One of the performance techniques widely used in a warehouse is to create summary tables of preaggregated data, known in Oracle 9i as materialized views. A query then transparently reads the materialized view instead of having to read all of the detail data. Hence, the performance improvements can be enormous, depending on the reduction in rows between the detail data and the materialized view.
Unfortunately, we get nothing in this world for free, and materialized views have to be maintained. This can involve considerable I/O, depending on how many new records are added and whether the materialized view is rebuilt completely or refreshed incrementally with only the new data. If many materialized views are defined and they are all refreshed at the same time, then consideration should be given to placing them in tablespaces on different disks. Failure to do this will result in all of the I/O occurring on the same disk, slowing the refresh process considerably. This is an even more important consideration if the refresh operations are to be performed in parallel.
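To illustrate, a materialized view that preaggregates a sales fact table by month might be defined as follows. The table and column names here are hypothetical, and a fast (incremental) refresh additionally requires a materialized view log on the fact table, which is omitted from this sketch.

```sql
-- Hypothetical fact table SALES(product_id, sale_date, amount).
-- The view stores one row per product per month instead of one row
-- per sale, so queries against it read far fewer rows.
CREATE MATERIALIZED VIEW sales_by_month
  BUILD IMMEDIATE
  REFRESH FAST ON DEMAND      -- incremental refresh, not a full rebuild
  ENABLE QUERY REWRITE        -- let the optimizer use it transparently
AS
SELECT product_id,
       TRUNC(sale_date, 'MM') AS sale_month,
       SUM(amount)            AS total_amount,
       COUNT(*)               AS row_count
FROM   sales
GROUP  BY product_id, TRUNC(sale_date, 'MM');
```

A STORAGE or TABLESPACE clause could be added to place each materialized view on a different disk, as discussed above.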
As we will see in Chapters 3 and 4, there are various techniques that the designer can employ to improve query performance. Some involve how queries are constructed, but many are actually part of the database design. For example, all databases benefit from indexes, and a data warehouse is no exception. Therefore, do not forget to decide which types of index to use, where to create them, and how much space to allocate for them. In a data warehouse, the designer does not have to worry about many users inserting new entries into the index and the associated problems that can result. Instead, the designer is concerned with the time required to maintain or build an index. For instance, re-creating an index on a fact table with 100 million rows will take more than a few minutes to complete!
Oracle 9i offers different types of indexes, and the designer should select the one that is most appropriate, which could mean that the design will contain many bitmapped indexes. Different types of queries in a data warehouse use different access methods to the data and benefit from different index types (e.g., a star transformation uses bitmapped indexes).
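For example, a bitmapped index suits the low-cardinality dimension-key columns of a fact table. The table, column, and tablespace names below are hypothetical:

```sql
-- Hypothetical fact table PURCHASES with a foreign key to a small
-- TIME dimension; a bitmap index on the key is compact and enables
-- the star transformation access path.
CREATE BITMAP INDEX purchases_time_bix
  ON purchases (time_key)
  TABLESPACE indx    -- illustrative tablespace for index segments
  NOLOGGING          -- faster to build; take a backup afterward
  PARALLEL;          -- build the index using parallel processes
```

The NOLOGGING and PARALLEL options address exactly the concern raised above: keeping the time to build or rebuild a large index within the available maintenance window.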
Physical placement of the data is another important consideration, especially when it is used in conjunction with partitioning and parallel operations. If the data is physically located on different disks, then queries and tasks can run without saturating the I/O limits of any single disk drive.
One of the significant performance gains is when materialized views, which are described in Chapter 4, are used. Since they have to be created and maintained, once again space must be reserved for this data, and the improvement in query response time must be balanced against the time required to maintain this data.
New data for a warehouse often arrives in batches. Ideally, it will be loaded into the database when the warehouse is not in use, but this cannot be guaranteed. Therefore, if, during your investigations of the proposed system, you discover that data will be loaded while the warehouse is in use, review techniques that allow you to insert the data into an area different from the one the users are querying (e.g., partitioning).
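One such technique is a partition exchange load, sketched below with hypothetical names: the new batch is loaded and indexed in a standalone staging table, away from the users, and then swapped into the partitioned fact table.

```sql
-- Hypothetical fact table SALES, range-partitioned by month.
-- 1. Load the new batch into an ordinary staging table; users
--    querying SALES are unaffected while this runs.
CREATE TABLE sales_stage AS
  SELECT * FROM sales WHERE 1 = 0;   -- empty copy with the same shape

-- (load and index SALES_STAGE here, e.g., with SQL*Loader)

-- 2. Swap the staging table in as the partition for the new month;
--    this is a data dictionary update, not a data copy, so it
--    completes almost instantly.
ALTER TABLE sales
  EXCHANGE PARTITION p_jan2002 WITH TABLE sales_stage;
```

Because step 2 is only a metadata operation, the window during which users could see the fact table changing is reduced to almost nothing.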
Another consideration is whether the fact table is likely to be updated. With so many records in the fact table, updates could have a significant impact on performance; therefore, procedures may have to be put in place to prevent unauthorized updates to the fact table.
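A simple safeguard, assuming the older fact data resides in its own tablespace, is to mark that tablespace read-only once its load is complete and to grant end users query access only. The tablespace, table, and role names below are hypothetical:

```sql
-- Once loading for the period is finished, no one can update the
-- data, and the tablespace needs to be backed up only once.
ALTER TABLESPACE fact2001 READ ONLY;

-- Allow end users to query the fact table but never change it.
GRANT SELECT ON sales TO report_users;
```

This ties back to the backup discussion earlier in this section: read-only tablespaces both protect the data and shrink the regular backup workload.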