Preparing for the ETL Process


The ETL process begins with preparations for reformatting, reconciling, and cleansing the source data.

  • Reformatting: The source data, residing in various source files and source databases, each with its own format, will have to be unified into a common format during the ETL process.

  • Reconciling: The tremendous amount of data in organizations points to staggering redundancy, which invariably results in staggering inconsistencies. These have to be found and reconciled during the ETL process.

  • Cleansing: Dirty data found during data analysis and prototyping will have to be cleansed during this process.

Before designing the ETL process, it is necessary to review the following:

  • Record layouts of the current as well as the historical source files

  • Data description blocks for the current as well as the historical source databases

  • Data-cleansing specifications for the source data elements

Most source data for the ETL process is current operational data from the operational systems, but some of the source data may be archived historical data.

Table 9.1. Sets of ETL Programs

  1. Initial Load: Initial population of BI target databases with current operational data
  2. Historical Load: Initial population of BI target databases with archived historical data
  3. Incremental Load: Ongoing population of BI target databases with current operational data

If the data requirements include a few years of history to be backfilled from the start, three sets of ETL programs must be designed and developed, as listed in Table 9.1.

If the decision is made to write the ETL programs in a procedural language (e.g., C++ or COBOL), the transformation specifications for the three sets of programs must be prepared and given to the ETL developers. If an ETL tool will be used, ETL instructions (technical meta data) must be created for the three sets of load processes. The ETL technical meta data will reflect the same logic that would have been written in custom programs if no ETL tool had been available. The technical meta data should be stored in a meta data repository.
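
Although the book does not prescribe a format for this technical meta data, it can help to picture a transformation specification captured as data rather than buried in code. The short Python sketch below illustrates one possible representation; every table, column, and rule name in it is a hypothetical example, not taken from the source.

```python
# A minimal sketch of a transformation specification held as technical meta data.
# All source/target names and rules are hypothetical illustrations.

customer_load_spec = {
    "target_table": "DIM_CUSTOMER",
    "load_set": "initial",          # "initial", "historical", or "incremental"
    "mappings": [
        {"source": "CUST_MSTR.CNAME",  "target": "CUSTOMER_NAME",
         "rule": "trim; uppercase"},
        {"source": "CUST_MSTR.CSTATE", "target": "STATE_CODE",
         "rule": "lookup in STATE_CODES; reject if not found"},
        {"source": "CUST_MSTR.CDOB",   "target": "BIRTH_DATE",
         "rule": "parse MMDDYYYY; null if invalid"},
    ],
}

def describe(spec):
    """Print the specification the way it might be reviewed from a meta data repository."""
    print(f"Load set '{spec['load_set']}' -> {spec['target_table']}")
    for m in spec["mappings"]:
        print(f"  {m['source']:<18} -> {m['target']:<15} [{m['rule']}]")

describe(customer_load_spec)
```

Keeping the specification in this form, rather than only in program logic, is what makes it possible to store it in a meta data repository and reuse it across the three load sets.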

The Initial Load

The process of preparing the initial load programs is very similar to a system conversion process, such as the one many organizations perform when they move their old operational systems to an enterprise resource planning (ERP) product. In general, the first task of a system conversion process is to map selected data elements from the source files or source databases to the most appropriate data elements in the target files or target databases. A "most appropriate data element" in a target file or target database is one that is the most similar in name, definition, size, length, and functionality to the source data element. The second task of a system conversion process is to write the conversion (transformation) programs to transform the source data. These conversion programs must also resolve duplicate records, match the primary keys, and truncate or enlarge the size of the data elements.
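
To make the mapping and conversion tasks concrete, the following Python sketch shows one simplified way an initial-load conversion might map source data elements to target data elements, truncate oversized values, and resolve duplicate records by keeping the last occurrence. The field names, lengths, and the duplicate-resolution rule are illustrative assumptions, not the book's method.

```python
# A minimal sketch of an initial-load conversion step. Field names, target
# lengths, and the "last duplicate wins" rule are hypothetical assumptions.

SOURCE_TO_TARGET = {
    "CUST_NO":   ("CUSTOMER_KEY", 10),   # target column name, target length
    "CUST_NAME": ("CUSTOMER_NAME", 40),
    "CUST_CITY": ("CITY", 30),
}

def convert(source_records):
    """Transform source records into target rows keyed by primary key."""
    target_rows = {}
    for rec in source_records:
        row = {}
        for src_field, (tgt_field, max_len) in SOURCE_TO_TARGET.items():
            value = str(rec.get(src_field, "")).strip()
            row[tgt_field] = value[:max_len]          # truncate to target size
        key = row["CUSTOMER_KEY"]
        target_rows[key] = row                        # duplicate keys: last one wins
    return list(target_rows.values())

sample = [
    {"CUST_NO": "1001", "CUST_NAME": "Acme Ltd ", "CUST_CITY": "Springfield"},
    {"CUST_NO": "1001", "CUST_NAME": "Acme Ltd.", "CUST_CITY": "Springfield"},  # duplicate
]
print(convert(sample))
```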

Usually missing from conversion programs, and unfortunately also missing from most ETL processes, are data cleansing and reconciliation. Organizations repeatedly miss prime opportunities to bring order to their data chaos when they continue to "suck and plunk" the data from source to target as is. Their only concern is that the receiving database structure does not reject the source data for technical reasons, such as duplicate keys, or data type and length violations. That is not good enough for BI applications because business people expect data quality and data consistency for business reasons. Thus, when designing the load processes, data cleansing and reconciliation must become part of the ETL process flow.
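
As an illustration of what making cleansing "part of the ETL process flow" can mean, the sketch below adds an explicit cleansing step that routes rule violations to a reject list instead of loading them as is. The status domain and the cleansing rules are hypothetical.

```python
# A minimal sketch of cleansing as an explicit step in the load flow,
# rather than a straight source-to-target copy. Rules are hypothetical.

VALID_STATUS = {"A", "I"}          # assumed domain for a status code

def cleanse(row, rejects):
    """Apply data-cleansing rules; route violations to a reject list."""
    if row.get("STATUS") not in VALID_STATUS:
        rejects.append((row, "invalid STATUS"))
        return None
    row["CUSTOMER_NAME"] = row["CUSTOMER_NAME"].strip().title()
    return row

def load(rows):
    rejects, cleansed = [], []
    for row in rows:
        result = cleanse(dict(row), rejects)
        if result is not None:
            cleansed.append(result)
    return cleansed, rejects        # rejects go back for data quality review

good, bad = load([
    {"CUSTOMER_NAME": " acme ltd ", "STATUS": "A"},
    {"CUSTOMER_NAME": "Widget Co",  "STATUS": "X"},   # dirty: unknown status
])
print(good, bad)
```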

The Historical Load

The historical load process could be viewed as an extension of the initial load process, but this type of conversion is slightly different because historical data is static data. In contrast to live operational data, static data has served its operational purpose and has been archived to offline storage devices. The implication is that, as some old data expires and some new data is added over the years, the record layouts of archived files are usually not in synch with the record layouts of the current operational files. Therefore, the conversion programs written for the current operational files usually cannot be applied to archived historical files without some changes. For example, in a frequently changing operational system, it is not unusual for five years of archived historical files to have five (or more) slightly different record layouts. Even though the differences in the record layouts may not be drastic, they still have to be reconciled. In addition, the cleanliness of the data may not be the same across all archived files. What was once valid in a historical file may no longer be valid. The data transformation specifications have to address these differences and reconcile them. All these factors contribute to the reasons why the ETL process can get very lengthy and very complicated.
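
One simplified way to deal with several archived record layouts is to drive the parsing from a table of layout versions, as in the Python sketch below. The layouts and field positions are invented for illustration; in practice they would come from the historical record layouts or copybooks.

```python
# A minimal sketch of handling multiple archived record layouts in one
# historical load. Layout versions and field positions are hypothetical.

LAYOUTS = {
    # layout version -> fixed-width slices for each source field
    "1999": {"CUST_NO": (0, 6),  "CUST_NAME": (6, 36)},
    "2001": {"CUST_NO": (0, 8),  "CUST_NAME": (8, 48)},   # key was widened
}

def parse_archive_record(line, layout_version):
    """Parse one fixed-width archive record using the layout in force that year."""
    layout = LAYOUTS[layout_version]
    return {field: line[start:end].strip() for field, (start, end) in layout.items()}

print(parse_archive_record("001001Acme Ltd                      ", "1999"))
print(parse_archive_record("00001001Acme Ltd                            ", "2001"))
```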

The Incremental Load

Once the processes for populating the BI target databases with initial and historical data have been devised, another process must be designed for the ongoing incremental load (monthly, weekly, or daily). Incremental loads can be accomplished in two ways: extracting all records or extracting deltas only, as shown in Table 9.2. The design of the ETL extract process will differ depending on which option is selected.

Table 9.2. Incremental Load Options

  • Extract All Records: Extract source data from all operational records, regardless of whether any data values have changed since the last ETL load.

  • Extract Deltas Only: Extract source data only from those operational records in which some data values have changed since the last ETL load ("net change").

Extracting all records is often not a viable option because of the huge data volumes involved. Therefore, many organizations opt for delta extracts (extracting only the records that changed). Designing ETL programs for delta extraction is much easier when the source data resides in relational databases and a timestamp can be used to determine the deltas. But when the data is stored in flat files without a timestamp, the extract process can be significantly more complex. You may have to resort to reading the operational audit trails to determine which records have changed.
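
The following sketch illustrates a timestamp-based delta extract against a relational source, using an in-memory SQLite table as a stand-in for the operational database. The table name, column names, and stored load timestamp are assumptions made for the example.

```python
# A minimal sketch of a timestamp-based delta extract from a relational source.
# SQLite stands in for the operational database; names are hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ORDERS (
    ORDER_NO TEXT PRIMARY KEY,
    AMOUNT   REAL,
    LAST_UPDATE_TS TEXT)""")            # assumed operational timestamp column
conn.executemany("INSERT INTO ORDERS VALUES (?, ?, ?)", [
    ("A1", 100.0, "2024-06-01 08:00:00"),
    ("A2", 250.0, "2024-06-03 14:30:00"),
])

last_etl_load = "2024-06-02 00:00:00"    # recorded by the previous ETL run

# Extract only the deltas: rows changed since the last load.
deltas = conn.execute(
    "SELECT ORDER_NO, AMOUNT FROM ORDERS WHERE LAST_UPDATE_TS > ?",
    (last_etl_load,),
).fetchall()
print(deltas)                            # -> [('A2', 250.0)]
```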

An alternative may be to extract a complete copy of the source file for every load, then compare the new extract to the previous extract to find the records that changed and create your own delta file. Another alternative is to ask the operational systems staff to add a system timestamp to their operational files. Occasionally they may agree to do that if the change to their operational systems is trivial and does not affect many programs. However, in most cases operations managers will not agree to that because any changes to their file structures would also require changes to their data entry and update programs. Additional code would have to be written for those programs to capture the system timestamp. It would not be cost-effective for them to change their mission-critical operational systems and spend a lot of time on regression testing just for the benefit of a BI application.
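
The "compare two full extracts" alternative can be sketched in a few lines of Python; the record layout assumed here (a key followed by the rest of the record, delimited by "|") is a hypothetical simplification.

```python
# A minimal sketch of building a home-grown delta file by comparing the
# previous full extract with the new one. File layout is hypothetical.

def read_extract(lines):
    """Build {key: full_record} from 'key|rest-of-record' lines."""
    return {line.split("|", 1)[0]: line.rstrip("\n") for line in lines}

previous = read_extract(["1001|Acme Ltd|Springfield\n",
                         "1002|Widget Co|Shelbyville\n"])
current  = read_extract(["1001|Acme Ltd|Capital City\n",   # changed
                         "1002|Widget Co|Shelbyville\n"])  # unchanged

delta = [rec for key, rec in current.items()
         if previous.get(key) != rec]                      # new or changed records
print(delta)    # -> ['1001|Acme Ltd|Capital City']
```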

Processing Deleted Records

Another aspect that needs to be carefully considered for incremental loads is that of deleted operational source records. When certain records are logically deleted from the source files and source databases (flagged as deleted but not physically removed), the corresponding rows cannot automatically be deleted from the BI target databases. After all, one of the main requirements of BI target databases is to store historical data.

The ETL process must follow a set of business rules, which should define when an operational deletion should propagate into the BI target databases and when it should not. For example, perhaps an operational record is being deleted because it was previously created in error, or because the record is being archived, or because the operational system stores only "open" transactions and deletes the "closed" ones. Most likely, the business rules would state that you should delete the related row from the BI target database only in the case where the record was created in error. Since your BI target database stores historical data, the business rules would probably not allow you to delete the related row in the other two instances.
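
A rule-driven decision of this kind might be sketched as follows, with invented delete-reason codes standing in for whatever the operational system actually records.

```python
# A minimal sketch of applying deletion business rules during the incremental
# load. The reason codes and the rule table are hypothetical illustrations.

DELETE_RULES = {
    "CREATED_IN_ERROR": True,    # remove the related row from the BI target
    "ARCHIVED":         False,   # keep it: the warehouse stores history
    "CLOSED":           False,   # keep it for the same reason
}

def propagate_delete(delete_reason):
    """Return True if the logical delete should be applied to the BI target."""
    return DELETE_RULES.get(delete_reason, False)   # default: keep the history

for reason in ("CREATED_IN_ERROR", "ARCHIVED", "CLOSED"):
    print(reason, "->", "delete row" if propagate_delete(reason) else "keep row")
```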

When records are physically deleted from the source files or source databases, you would never know it if you are extracting only deltas. Delta extract programs are designed to extract only those existing records in which one of the data values changed; they cannot extract records that do not exist. One way to find the physically deleted records is to read the operational audit trails. Another option is to extract a complete copy of the source file, compare the new extract to the previous extract to find the records that were deleted, and then create your own delta files. In either case, once the deleted records are identified, the ETL process has to follow a set of business rules to decide whether or not to physically remove the related rows from the BI target databases.
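
Finding physically deleted records by comparing the key sets of two full extracts can be sketched in a few lines; the keys below are, of course, invented.

```python
# A minimal sketch of detecting physically deleted source records by comparing
# the key sets of the previous and the new full extract. Keys are hypothetical.

previous_keys = {"1001", "1002", "1003"}
current_keys  = {"1001", "1003"}

physically_deleted = previous_keys - current_keys   # keys missing from the new extract
print(physically_deleted)                           # -> {'1002'}

# Each deleted key would then be passed through the same business rules that
# govern logical deletes before any row is removed from the BI target database.
```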


