Page 1348
The result of the data-extraction process is usually a set of flat files that is transferred to the data warehouse system. I have found that OLTP and legacy systems usually deal best with a column-delimited flat file format, in which fixed column positions in the file correspond to the various pieces of information. The alternative is a character-delimited format, in which, for example, a semicolon separates the pieces of information. Note that on some computer systems, you might have to convert the data storage format of the flat file before transferring it to your data warehouse system. One of the more common conversions is translating the IBM mainframe EBCDIC format to the ASCII format used on UNIX systems. The following is an example of a column-delimited output file:
PURCHASING 123.4534246.90
OPERATIONS 33.2056 66.40
. . .
This example simulates a download of expenditures by various departments. The departments occupy the first column of data (PURCHASING and OPERATIONS). The next item is some form of per-unit cost. The third item is some form of cost code. Notice that this column runs together with the numbers on either side of it. This is common in column-delimited flat files; therefore, you have to map out the column assignments in writing with both the people performing the data extraction and the people importing the data. The final column is some form of burdened cost. Notice also that column-delimited formats result in a number of blank spaces. This can be significant in some cases (for example, a large text field that is typically only 20 percent full). These blank spaces are valid characters, and they take up transmission time and data storage space during the transfer to the data warehouse. You might want to consider some form of character-delimited format if these spaces are a problem on your system.
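The import side of this process can be sketched in Python. The column positions below are assumptions chosen to match the sample file; in practice they are exactly the kind of layout you would agree on in writing with the extraction team. The sketch also shows the EBCDIC-to-ASCII conversion mentioned earlier, using Python's "cp037" codec (one common EBCDIC code page) as an illustration:

```python
# Sketch: importing one record of a column-delimited (fixed-width) extract.
# The field positions in LAYOUT are assumptions for illustration only; map
# out the real column assignments in writing with the extraction team.

# A record transferred from an IBM mainframe may arrive in EBCDIC; the
# "cp037" codec (a common EBCDIC code page) converts it to a text string.
ebcdic_record = "PURCHASING 123.4534246.90".encode("cp037")
record = ebcdic_record.decode("cp037")

# Assumed layout: department (cols 0-10), per-unit cost (11-16), cost code
# (17-19), burdened cost (20-24).  Note that the cost code column runs
# together with the numbers on either side of it.
LAYOUT = {
    "department":    (0, 11),
    "per_unit_cost": (11, 17),
    "cost_code":     (17, 20),
    "burdened_cost": (20, 25),
}

def parse_record(line):
    """Slice one fixed-width record into named, whitespace-stripped fields."""
    return {name: line[start:end].strip()
            for name, (start, end) in LAYOUT.items()}

row = parse_record(record)
```

With a written layout like `LAYOUT`, the run-together cost code is unambiguous to the importer even though it is unreadable to the eye.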
The aggregation and summarization processes are similar, because they move data from one set of tables to another in your Oracle database. This is a nice, controlled environment over
which you typically have complete design authority. As always, though, you'll need to deal with a few design considerations, including the following:
The following statement shows the basic form of the SQL used in an aggregation script. This example takes weekly expense information for a department and uses a cross-reference table to aggregate the expenses into fiscal months:
INSERT INTO monthly_expenses
       (month_number, department, expense_amount)
SELECT week_month.month_number,
       weekly_expenses.department,
       SUM(weekly_expenses.expense_amount)
  FROM weekly_expenses, week_month
 WHERE weekly_expenses.week_number = week_month.week_number
 GROUP BY week_month.month_number, weekly_expenses.department;
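The aggregation above can be demonstrated end to end with Python's built-in sqlite3 module standing in for the Oracle tables. The table and column names come from the statement above; the sample rows and the week-to-month mapping are invented for illustration:

```python
# Sketch of the weekly-to-monthly aggregation, with sqlite3 standing in for
# the Oracle database.  Sample rows are invented for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE weekly_expenses (department TEXT, week_number INTEGER,
                                  expense_amount REAL);
    CREATE TABLE week_month      (week_number INTEGER, month_number INTEGER);
    CREATE TABLE monthly_expenses(month_number INTEGER, department TEXT,
                                  expense_amount REAL);
""")
con.executemany("INSERT INTO weekly_expenses VALUES (?, ?, ?)",
                [("PURCHASING", 1, 100.0), ("PURCHASING", 2, 50.0),
                 ("OPERATIONS", 1, 75.0),  ("PURCHASING", 5, 25.0)])
# The cross-reference table: weeks 1-4 fall in fiscal month 1, week 5 in 2.
con.executemany("INSERT INTO week_month VALUES (?, ?)",
                [(1, 1), (2, 1), (3, 1), (4, 1), (5, 2)])

# The aggregation itself: join each weekly row to its fiscal month and sum.
con.execute("""
    INSERT INTO monthly_expenses (month_number, department, expense_amount)
    SELECT wm.month_number, we.department, SUM(we.expense_amount)
      FROM weekly_expenses we
      JOIN week_month wm ON we.week_number = wm.week_number
     GROUP BY wm.month_number, we.department
""")
rows = sorted(con.execute("SELECT * FROM monthly_expenses"))
```

Because the cross-reference table drives the grouping, changing the fiscal calendar is a matter of reloading `week_month`, not rewriting the script.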
I already covered some of the basics of the purge processes. You will need to consider archiving before executing the purge processes (tapes and even writable CDs are relatively cheap these days). The purge process is another example of a process that is well suited to stored procedures. These processes typically execute a series of SQL statements that delete records based on values in the primary key. A few words of wisdom on these purge processes follow:
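An archive-then-purge pass can be sketched as follows, again with sqlite3 standing in for Oracle. The table reuses the `monthly_expenses` names from the aggregation example; the cutoff value, archive path, and sample rows are invented for illustration:

```python
# Sketch of an archive-then-purge pass.  Table and column names follow the
# monthly_expenses example; the cutoff and sample data are invented.
import csv
import os
import sqlite3
import tempfile

def archive_then_purge(con, cutoff_month, archive_path):
    """Write rows older than cutoff_month to a flat file, then delete them."""
    old_rows = con.execute(
        "SELECT month_number, department, expense_amount "
        "FROM monthly_expenses WHERE month_number < ?",
        (cutoff_month,)).fetchall()
    # Archive first: the DELETE runs only after the rows are safely on disk.
    with open(archive_path, "w", newline="") as f:
        csv.writer(f).writerows(old_rows)
    con.execute("DELETE FROM monthly_expenses WHERE month_number < ?",
                (cutoff_month,))
    con.commit()
    return len(old_rows)

# Small demonstration with invented data.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE monthly_expenses (month_number INTEGER, "
            "department TEXT, expense_amount REAL)")
con.executemany("INSERT INTO monthly_expenses VALUES (?, ?, ?)",
                [(1, "PURCHASING", 100.0), (2, "OPERATIONS", 75.0),
                 (5, "PURCHASING", 25.0)])
archive = os.path.join(tempfile.mkdtemp(), "expenses_archive.csv")
purged = archive_then_purge(con, 3, archive)
remaining = con.execute(
    "SELECT COUNT(*) FROM monthly_expenses").fetchone()[0]
```

In an Oracle stored procedure the same ordering applies: confirm the archive write succeeded before issuing the DELETE against the primary-key range.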
The star join schema is very popular in the data warehousing world. An example of this schema is a basic data table (referred to as a fact table) linked to the basic information items that users are likely to form their queries against (dimension tables for store and time, for example). This schema makes heavy use of primary and foreign keys to ensure that the fact table contains no rows that cannot be referenced through the associated dimension tables.
STAR JOIN SCHEMA TIPS
The key to making the star join schema work is ensuring that your queries always process the dimension tables first and then the fact table. This makes sense, because a dimension table may have only a few dozen rows, whereas the fact table may contain millions. Also, because the dimension keys make up part of the primary key or alternate indexes of the fact table, processing the dimension rows first turns the fact-table access into an indexed search. I have seen this method make the difference between queries that take hours to complete and queries that finish in less than 10 seconds. You might have to use hints on your SQL statements to get the query to behave as you want, but it is worth the effort.
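A minimal star join schema and query can be sketched with sqlite3. The fact and dimension roles follow the description above; all table names, columns, and rows here are invented for illustration:

```python
# Sketch of a star join schema: a sales fact table with foreign keys into
# store and time dimension tables.  All names and rows are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    PRAGMA foreign_keys = ON;
    CREATE TABLE store_dim (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE time_dim  (time_id  INTEGER PRIMARY KEY,
                            fiscal_month INTEGER);
    CREATE TABLE sales_fact (
        store_id INTEGER REFERENCES store_dim(store_id),
        time_id  INTEGER REFERENCES time_dim(time_id),
        amount   REAL,
        PRIMARY KEY (store_id, time_id)   -- built from the dimension keys
    );
""")
con.executemany("INSERT INTO store_dim VALUES (?, ?)",
                [(1, "EAST"), (2, "WEST")])
con.executemany("INSERT INTO time_dim VALUES (?, ?)", [(10, 1), (11, 2)])
con.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                [(1, 10, 500.0), (1, 11, 250.0), (2, 10, 900.0)])

# A star join: the small dimension tables are restricted first, so the
# planner can probe the fact table through its primary-key index instead
# of scanning every fact row.
east_total = con.execute("""
    SELECT SUM(f.amount)
      FROM store_dim s
      JOIN sales_fact f ON f.store_id = s.store_id
      JOIN time_dim   t ON t.time_id  = f.time_id
     WHERE s.region = 'EAST' AND t.fiscal_month = 1
""").fetchone()[0]
```

The foreign-key constraints are what guarantee that no fact row exists without matching dimension rows, which is exactly the referential property the star schema depends on.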