Source Data Transformation


The technical rules and the business rules for the required source data transformations were accumulated and defined throughout the steps of project planning, project requirements definition, data analysis, application prototyping, and meta data repository analysis. During those steps, the rules were probably extracted from old manuals, old memos, e-mails, programs (operational and decision support), and computer-aided software engineering (CASE) tools, and provided by people who remember when and why a business rule was created. These rules are now reflected as data transformation activities in the ETL process.

Data Transformation Activities

BI projects present the best opportunity to eliminate dead and useless data because they allow the business people to see their information requirements in a different light. When properly implemented, the data transformation activities of cleansing, summarization, derivation, aggregation, and integration will produce data that is clean, condensed, new, complete, and standardized, respectively (Figure 11.1).

Figure 11.1. Data Transformation Activities


  • Cleansing: By definition, cleansing is a BI transformation process in which source data that violates the business rules is changed to conform to those rules. Cleansing is usually accomplished through edits in the ETL programs, which use table lookups and program logic to determine or derive the correct data values and then write those data values into the load files used to populate the BI target databases.

  • Summarization: Numeric values are summarized to obtain total figures (amounts or counts), which can then be stored as business facts in multidimensional fact tables. Summary totals can be calculated and stored at multiple levels (e.g., departmental summary of sales, regional summary of sales, and total sales by country).

  • Derivation: During this process, new data is created from existing atomic (detailed) source data. Derivation is typically accomplished by calculations, table lookups, or program logic. Examples include the following:

    - Generating a new code for classifying customers based on a certain combination of existing data values

    - Calculating profit from income and expense items

    - Appending the last four digits of a ZIP code based on the address in a postal lookup table

    - Calculating a customer's age based on his or her date of birth and the current year

  • Aggregation: All the data about a business object is brought together. For example, data elements for a customer may be aggregated from multiple source files and source databases, such as a Customer Master file, a Prospect file, a Sales file, and demographic data purchased from a vendor. (In multidimensional database design jargon, the term aggregation also refers to the roll-up of data values.)

  • Integration: Data integration based on normalization rules forces the need to reconcile different data names and different data values for the same data element. The desired result is to have each unique data element known by one standard name, with one standard definition and an approved domain. Each data element should also be associated with its source files and source databases as well as its BI target databases. Standardizing the data should be a business objective.
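The five transformation activities above can be sketched as a small ETL pipeline. This is an illustrative sketch only: the toy in-memory records stand in for real source files, and all field names, status codes, and lookup tables are invented examples, not part of the book's material.

```python
# Toy source records, standing in for rows extracted from a Customer Master
# file. All names and codes here are hypothetical.
customers = [
    {"cust_id": 1, "name": "Acme Corp", "region": "WEST", "dob_year": 1971,
     "income": 1200.0, "expenses": 800.0, "status": "A"},
    {"cust_id": 2, "name": "Baker Ltd", "region": "EAST", "dob_year": 1985,
     "income": 500.0, "expenses": 650.0, "status": "X"},  # violates domain rule
]

VALID_STATUS = {"A": "Active", "I": "Inactive"}

def cleanse(rec):
    # Cleansing: a value that violates the business rule (status must be
    # A or I) is changed to conform to the rule via a table lookup.
    if rec["status"] not in VALID_STATUS:
        rec["status"] = "I"
    return rec

def derive(rec, current_year=2003):
    # Derivation: new data created from existing atomic source data,
    # e.g., profit from income and expenses, age from year of birth.
    rec["profit"] = rec["income"] - rec["expenses"]
    rec["age"] = current_year - rec["dob_year"]
    return rec

def summarize(records):
    # Summarization: numeric values totaled at a chosen level
    # (here, total income by region).
    totals = {}
    for r in records:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["income"]
    return totals

# Aggregation: all data about the business object brought together from
# multiple sources (here, demographic data "purchased from a vendor").
demographics = {1: {"segment": "SMB"}, 2: {"segment": "ENT"}}

def aggregate(rec):
    rec.update(demographics.get(rec["cust_id"], {}))
    return rec

# Integration: different data names for the same data element reconciled
# to one standard name (e.g., cust_id -> customer_number).
STANDARD_NAMES = {"cust_id": "customer_number", "dob_year": "birth_year"}

def integrate(rec):
    return {STANDARD_NAMES.get(k, k): v for k, v in rec.items()}

# The load file used to populate the BI target database.
load_file = [integrate(aggregate(derive(cleanse(dict(r))))) for r in customers]
```

In a real ETL tool these steps would be configured against actual source files and reference tables; the point of the sketch is only that each activity is a distinct, rule-driven step between extraction and load.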

Underestimating Data Transformation Efforts

Source data transformation is similar to opening a Russian doll: you open one and there is another inside. It could be an endless process. That is why the time required for the ETL process is chronically underestimated. The original estimates are usually based on the amount of technical data conversions required to transform data types and lengths, and they often do not take into account the overwhelming amount of transformations required to enforce business data domain rules and business data integrity rules.

The transformation specifications given to the ETL developer should never be limited to just technical data conversion rules. For some large organizations with many old file structures, the split of a particular data transformation effort could be as high as 80 percent toward enforcing business data domain rules and business data integrity rules and only 20 percent toward enforcing technical data conversion rules. Therefore, expect to multiply your original estimates for your ETL data transformation effort by four. Even if you think you have a very realistic timetable for the ETL process, do not be surprised if you still miss deadlines due to dirty data. If you do not miss deadlines, do not be surprised to discover that you have not cleansed the data sufficiently.

Insist on full-time involvement from the business representative, and insist on getting the right business representative: someone who is knowledgeable about the business and who has authority to make decisions about the business rules. These stipulations are essential for speeding up the ETL process. Furthermore, urge the business sponsor and the business representative to launch a data quality initiative in the organization, or at least in the departments under their control or influence. When business people drive a data quality initiative, they are more likely to assist with the ETL transformation process. Remind them that while IT technicians may know the process semantics, the business people know the data contents and business semantics. They understand what the data really means.



Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications
ISBN: 0201784203
Year: 2003
Pages: 202
