Designing the Extract Programs


From an operational systems perspective, the most favored way to create extracts might be simply to duplicate the entire contents of the operational source files and source databases and to give the duplicates to the BI project team. However, the ETL developers would then have the burden of working with huge files when they need only a subset of the source data.

From the BI project perspective, the most favored way to create extracts might be to sort, filter, cleanse, and aggregate all the required data in one step if possible and to do it right at the source. However, in some organizations that would impact the operational systems to such a degree that operational business functions would have to be suspended for several hours.

The solution is usually a compromise: the extract programs are designed for the most efficient ETL processing, but always with a focus on getting the required source data as quickly as possible. The goal is to get out of the way of operational systems so that the daily business functions are not affected. This is easier said than done, for a number of reasons.

Selecting and merging data from source files and source databases can be challenging because of the high data redundancy in operational systems. The extract programs must know which of the redundant source files or source databases are the systems of record. For example, the same source data element (e.g., Customer Name) can exist in dozens of source files and source databases. These redundant occurrences have to be sorted out and consolidated, which involves a series of sort and merge steps driven by lookup tables that cross-reference specific keys and data values.
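To make the consolidation concrete, here is a minimal sketch in Python of merging redundant occurrences of Customer Name from two extract files through a key cross-reference lookup table. All file names, field layouts, and the choice of system of record are hypothetical illustrations, not prescriptions:

```python
# Sketch: consolidate a redundant data element (Customer Name) from two
# source extracts. File and field names are hypothetical; billing is
# assumed (for this example only) to be the system of record.
import csv

def load_extract(path, key_field, name_field):
    """Read one source extract into a dict keyed by the source's local key."""
    with open(path, newline="") as f:
        return {row[key_field]: row[name_field] for row in csv.DictReader(f)}

# Cross-reference lookup table: maps each source system's local key to a
# common enterprise customer key (assumed to exist in the staging area).
with open("xref_customer_keys.csv", newline="") as f:
    xref = {row["local_key"]: row["enterprise_key"] for row in csv.DictReader(f)}

billing = load_extract("billing_extract.csv", "cust_no", "cust_name")
orders = load_extract("orders_extract.csv", "customer_id", "name")

# Merge step: process orders first, then billing, so the system of
# record (billing) overwrites any conflicting orders value.
consolidated = {}
for local_key, name in list(orders.items()) + list(billing.items()):
    enterprise_key = xref.get(local_key)
    if enterprise_key:
        consolidated[enterprise_key] = name

# Sort step: write the consolidated occurrences in enterprise-key order.
with open("customer_consolidated.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["enterprise_key", "customer_name"])
    writer.writerows(sorted(consolidated.items()))
```

In practice the cross-reference tables themselves are a deliverable of source data analysis; the sketch simply shows why the sort and merge steps depend on them.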

Another way to produce small and relatively clean extract files is to extract only those source data elements that are needed for the BI application and to resolve only those source data quality problems that pertain to the business data domain rules, without attempting to sort out and consolidate redundant occurrences of data. However, even that compromise will not work in many large organizations because the data-cleansing process would slow down the extract process, which in turn would tie up the operational systems longer than is acceptable.
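As an illustration of this narrower approach, the following sketch extracts only the data elements the BI application needs and applies only business data domain rules, deferring any consolidation of redundant occurrences. The column names and domain rules are invented for the example:

```python
# Sketch: a narrow extract that keeps only needed elements and enforces
# only domain rules (no cross-file consolidation). Names are illustrative.
import csv

NEEDED = ["cust_no", "cust_name", "credit_limit", "region_code"]

def passes_domain_rules(row):
    """Check element-level domain rules only (assumed rules for the example)."""
    try:
        credit = float(row["credit_limit"])
    except ValueError:
        return False  # non-numeric credit limit violates the domain rule
    return credit >= 0 and row["region_code"] in {"N", "S", "E", "W"}

with open("customer_master.csv", newline="") as src, \
     open("customer_extract.csv", "w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=NEEDED)
    writer.writeheader()
    for row in csv.DictReader(src):
        if passes_domain_rules(row):
            writer.writerow({col: row[col] for col in NEEDED})
```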

In many large organizations, the BI project team is lucky to get three to four hours of processing time against the operational systems before those operational systems have to "go live" for the operational functions of the next business day. This is the main reason why populating the BI target databases is split into three separate processes: extract, transform, and load (Figure 9.3).

Figure 9.3. ETL Processes

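A minimal sketch of that three-way split, assuming illustrative staging file names: only the extract step touches the operational source inside the batch window, while transform and load run later against staging data:

```python
# Sketch: the ETL split of Figure 9.3 as three separate steps. File names
# and the transform logic are assumptions made for this example.
import csv
import shutil

def extract():
    """Fast pass against the operational source inside the batch window:
    copy the raw rows out and disconnect. Nothing else runs here."""
    shutil.copyfile("operational_orders.csv", "staging_orders_raw.csv")

def transform():
    """Runs after the operational systems go live again; touches only
    staging files, never the operational source."""
    with open("staging_orders_raw.csv", newline="") as src, \
         open("staging_orders_clean.csv", "w", newline="") as dst:
        writer = None
        for row in csv.DictReader(src):
            row["order_total"] = f'{float(row["order_total"]):.2f}'
            if writer is None:
                writer = csv.DictWriter(dst, fieldnames=row.keys())
                writer.writeheader()
            writer.writerow(row)

def load():
    """Final step: move cleansed rows into the BI target (a flat file
    here; a target database table in practice)."""
    shutil.copyfile("staging_orders_clean.csv", "bi_target_orders.csv")

if __name__ == "__main__":
    extract()    # within the nightly batch window
    transform()  # after the source systems are back online
    load()
```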


