  • You could build that list dynamically by issuing a query that selects a unique list of regions from the base tables (for example, select distinct(region) from...). If you have base tables that contain millions of rows, however, this query will take quite some time and really slow down your reports. Also, you might miss certain values that do not yet have data in the base tables (for example, the mid-Atlantic region, where you are just starting up operations and therefore have no current sales data). These reference tables can be populated by data-extraction routines from the appropriate source databases, or you can develop forms to manually update this information in the data warehouse, depending on what makes the most sense for your environment. (A sketch of such a reference table appears after this list.)
  • Aggregate tables: As I mentioned earlier, you might have so much data being sent to your data warehouse that it is not practical to keep every scrap of information online and available for queries. Suppose that you bring down one million transactions per day and each takes 100 bytes of information. This would consume more than 36GB of disk storage space per year. Disk space has become relatively cheap these days; however, the processing capacity to wade through that level of detail is not inexpensive. Also, some data warehouses receive much more than one million transactions per day or track more detailed information about each transaction. Finally, the fine level of detail you receive from your source database may not be of much use to your end users. Very few analysts care how many hairbrushes were sold minute by minute during the lunch hour. Instead, they care about how many are sold per store on a weekly or even monthly basis. Therefore, aggregating the raw data into larger buckets, where you lose some of the finer detail, makes sense in many environments.
  • Summary tables: Whereas aggregation saves disk storage space and gets rid of unnecessary detail, summary tables make queries faster. With summary tables, you intentionally duplicate some of the information you store online to achieve speed in your queries. You look at the queries your users most typically perform to determine whether a number of common calculations would consume a lot of CPU resources (the total sales by region mentioned earlier, for example). You then run a job that calculates these common results in advance and stores them in a separate set of summary tables. When your users write their queries, they can simply pull out a single row from these summary tables (a fast operation) instead of grinding through thousands or millions of rows of raw data. (A sketch of this technique also appears after this list.)
  • Data archives: Organizations spend a lot of time and money storing, formatting, and working with information in computer systems. However, you eventually will run into limitations on how much data you can store online and how much processing capacity you have to process that data. Therefore, in almost all cases, you have to develop some form of purge routines to remove information from your databases. You do not have to permanently destroy this information. Many systems store the data that is about to be purged in an offline format (similar to the data extracts you made from your source systems discussed earlier). Users can request that this information be restored online temporarily in reasonably sized chunks (one month of data at a time, for example). The advent of CD technology makes it inexpensive even to build an archive of data that can be accessed much more quickly than would be possible with magnetic tape.
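
As a rough sketch of the reference-table idea from the first item above, the following statements build a small lookup of regions that reports can read instead of scanning the base tables. The table and column names are only illustrative, not part of any particular schema:

    CREATE TABLE region_ref (
        region_code  VARCHAR2(10) PRIMARY KEY,
        region_name  VARCHAR2(40)
    );

    INSERT INTO region_ref VALUES ('NE', 'Northeast');
    INSERT INTO region_ref VALUES ('MA', 'Mid-Atlantic');  -- present even before any sales data arrives
    INSERT INTO region_ref VALUES ('SE', 'Southeast');

    -- A report builds its pick list from the small lookup table
    -- rather than running select distinct(region) against millions of rows.
    SELECT region_name FROM region_ref ORDER BY region_name;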
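
In the same spirit, here is a sketch of a summary table that a nightly job might populate with the total sales by region mentioned earlier. Again, sales_base, sale_date, and sale_amount are assumed names standing in for whatever your base tables actually contain:

    CREATE TABLE sales_summary (
        region       VARCHAR2(10),
        sales_month  DATE,
        total_sales  NUMBER
    );

    -- Grind through the detail rows once, after each load...
    INSERT INTO sales_summary (region, sales_month, total_sales)
    SELECT region, TRUNC(sale_date, 'MM'), SUM(sale_amount)
    FROM   sales_base
    GROUP  BY region, TRUNC(sale_date, 'MM');

    -- ...so that a typical report query becomes a one-row fetch.
    SELECT total_sales
    FROM   sales_summary
    WHERE  region = 'NE'
    AND    sales_month = TO_DATE('01-JAN-1997', 'DD-MON-YYYY');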

Programs and Processes

The other part of a data warehouse consists of the programs and processes used to move and analyze the data. In my conceptual architecture, I have the following software components:

  • Source-extraction routines: These processes take the information from the source database (or files) and transfer it to a neutral file format, which then is imported into a data warehouse. These routines usually are written in whatever programming tool is most common on the system that runs the source database. If COBOL is the common programming tool on the OLTP system, for example, it is a likely candidate for the extraction routines. The output of the routines is determined by the information needed in the data warehouse, the format in which the information is stored on the OLTP system, and the file or data-transfer format that is compatible between the two systems (for example, column-delimited ASCII text files).
TIP
You should be as generous as possible when it comes to sending information from the OLTP system to the data warehouse. It is easy to ignore information that is being sent down, but it can be a massive effort (and may not even be possible) to go back and run processing jobs against old source data to add historical information to your warehouse to meet a new user requirement.
  • Data-import routines: If your OLTP and data warehouse systems are not connected via some electronic data exchange technology such as SQL*Net, you usually have to use a flat file to transfer the extracted information from the OLTP system to the database where the data warehouse resides. One of the most common tools for this task that I have run across is Oracle's SQL*Loader product. This tool is optimized to read flat files into Oracle tables, and many bulk-loading options are available to speed up the data-loading process. (A sample control file appears after this list.)
  • Base table population software: If you import your data into a series of import tables, you need to develop software to convert this data into the format of the base tables. These conversion routines usually amount to copying data, splitting it across tables, and occasionally performing some basic math operations. Many data warehouses also take this opportunity to perform some basic summarization. Suppose that your import data contains every transaction that comes across the registers in your stores. You might choose to store in the base table only the number of each product sold in each store, eliminating a lower level of detail that is of no interest to an analyst. (A sketch of this step follows this list.)
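
To make the loading step concrete, here is a minimal SQL*Loader control file that reads a comma-delimited ASCII extract into an import table. The file, table, and column names are assumptions for the sake of illustration:

    LOAD DATA
    INFILE 'sales_extract.dat'
    APPEND
    INTO TABLE sales_import
    FIELDS TERMINATED BY ','
    (store_id,
     product_id,
     sale_date    DATE "YYYY-MM-DD",
     quantity,
     sale_amount)

You would invoke it with something like sqlldr control=sales.ctl, and the direct path option (direct=true) is one of the bulk-loading switches worth investigating when load times become an issue.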
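
And as a sketch of the population step, a conversion routine might roll the imported register transactions up into a base table like this, discarding the transaction-level detail (again, all object names are hypothetical):

    -- Summarize register-level transactions into per-store, per-product, per-day totals.
    INSERT INTO sales_base (store_id, product_id, sale_date, units_sold, sale_amount)
    SELECT store_id,
           product_id,
           TRUNC(sale_date),
           SUM(quantity),
           SUM(sale_amount)
    FROM   sales_import
    GROUP  BY store_id, product_id, TRUNC(sale_date);

    -- Clear the import table to make room for the next load.
    TRUNCATE TABLE sales_import;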