1338-1339 | Oracle Development Unleashed (3rd Edition)

Previous Table of Contents Next

Page 1338

to deal with multihundred-megabyte databases. This chapter discusses two key techniques that can speed up the queries you issue against your data warehouse. The first method is aggregation. Here, you combine individual records of information over some interval of time, such as days, weeks, and months. This level of detail typically is sufficient for most reporting needs and greatly reduces the number of rows through which your queries need to sort . The second process is summarization. Here, you look at typical queries run by your users. You look for the most common and complicated summation and other calculations (total sales by store, for example). You then consider the possibility of building tables that contain these calculated results. This will cost you some disk space, but it can save you CPU and response time, because the reports fetch a handful of records from the summation table instead of processing thousands (or millions) of rows in the raw tables every time a user issues a query.
A data warehouse needs to be implemented to provide performance using a reasonable amount of resources. One of the first things most businesspeople will tell you is that you have to weigh the costs against the benefits ”until, of course, it is time for the IS department to buy disk drives to store information that accounting might want to analyze some time in the distant future. Your challenge is to strike that delicate balance between meeting user needs and wasting money on disk drives containing information that no one will access, as well as processing capacity and memory to sort through that mountain of data. The key to this balance is knowing what to purge and when to purge it. This topic is discussed in "Purge Processes," later in this chapter.
A data warehouse typically is coupled with a standard series of reports that answer the bulk of users' questions without those users having to develop their own reports. This series of standard reports is the basis for the analyses by the users. Earlier, I mentioned how it was important to provide the capability to answer ad-hoc questions. However, what if business users coded their own reports and aggregated the numbers in a slightly different manner? It would take a lot of time to sort out whose analysis is correct. Therefore, for common analyses, it is easier to have standard reports designed to be flexible in areas such as on which departments or products to perform the standard analysis.

With these simple concepts in mind, you should be ready to start building the data warehouse. Well, not quite. As with all too many disciplines, the difference between a cost-effective system that works well and a disaster is a little bit of experience and effort. Entire books are devoted to the subject of data warehousing; this chapter covers a few of the basic concepts in a little more detail.

A typical data warehouse is a system that has a series of data stores (tables, files, and so on) connected by a series of data-transfer processes.

Page 1339

Data Stores

The first topic to discuss is the various data stores that make up a data warehouse system. The following are typical components in this architecture:

Source data: This component of a data warehouse can have the widest variety of formats. This data can take the form of mainframe flat files produced by a COBOL-based order-entry system. Or, the data can be a delimited ASCII text file produced from some machinery that automatically records readings from monitoring instruments. The key is that this is raw data with a format optimized for data-entry and processing purposes; it is not optimized for long- term storage and analysis. Also you might well have more than one data source for your data warehouse. These data sources may have different formats and be located on different computers (perhaps even in different countries ).
Extracted source data: In many cases, source data is not in a format or on a computer you can connect to directly from your Oracle data warehouse. Therefore, an intermediate file format is used to transfer the information. This file is produced on the host system using whatever tools make sense in that environment. The file format is determined by a combination of what is possible on the source host system and what makes sense for import into the new data warehouse system. This file then can be transferred in a number of ways, which include using magnetic tapes or the File Transfer Protocol (FTP) to the computer that houses the data warehouse.
Imported source data: Another common intermediate step in the data-extraction process is a series of Oracle tables that contain the extracted source data in a format that mirrors how it was written to the transfer files. This method is used because using high-speed data-loading tools, such as Oracle's Loader product, often is not the best way to perform complicated calculations or reformatting operations. These operations are needed to transform the data from the format produced on the host system to the format that makes the most sense for the data warehouse. The extracted data files typically match the rows of information stored in the source data system. Generally, the format used is a format that makes the most sense for OLTP. These data-storage formats are not the best formats for data warehouses. Some of the data warehouse storage formats, such as the star schema, are discussed in "Star Join Schema," later in this chapter.
Base tables: After the basic data transferred from the source system is reformatted for data warehousing, it is placed into the base tables of the system. The term base tables refers to the fact that these tables are the basis for producing aggregate and summary tables later. You can think of these tables as the ones that contain the key business information in which you are interested.
Reference tables: A lot of reports and analyses performed in a data warehouse need to know what the list of legal values are for a certain field. Suppose that you want to analyze sales by region. You need to know all the regions in your company. Of course,

Previous Table of Contents Next