EXTRACTION, TRANSFORMATION, AND LOADING

The term extraction and transformation describes the processes of, first, extracting data from source systems and, second, modifying or transforming that data to put it into a form that is more acceptable to the data warehouse than its native form. There are many extraction and transformation tools, and most also provide a mechanism for loading the transformed data into the warehouse. Hence, they are usually referred to as extraction, transformation, and loading (ETL) tools. In order to facilitate the loading, the ETL tool must have an understanding of the structure of the data warehouse, and so one of the components of the ETL application is a source-to-target mapping of data.
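
By way of illustration only (this sketch is ours and is not drawn from any particular ETL product), a source-to-target mapping can be held as a simple table that records, for each target column, the source system, the source field, and the transformation to be applied. All of the system, field, and table names below are invented:

    # Hypothetical source-to-target mapping: one entry per target column.
    SOURCE_TO_TARGET = [
        {"source_system": "billing", "source_field": "CUST_NO",
         "target_table": "customer_dim", "target_column": "customer_key",
         "transform": "surrogate_key_lookup"},
        {"source_system": "billing", "source_field": "INV_DT",
         "target_table": "sales_fact", "target_column": "invoice_date",
         "transform": "yyyymmdd_to_date"},
        {"source_system": "crm", "source_field": "SEGMENT_CD",
         "target_table": "customer_dim", "target_column": "segment",
         "transform": None},  # copied through unchanged
    ]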

The diagram in Figure 10.1 shows the major components of most ETL products.

Figure 10.1. Extraction, transformation, and load processing.

The diagram shows that the extraction, transformation, and load can be depicted as three distinct processes. Each of the three processes is supported by its own internally held documentation. This can form the basis of a metadata repository. We will discuss metadata a little later on.

Extraction

It is the extraction process that reaches into the source systems and plucks the data that we need for the warehouse. The extraction routines can vary from the quite trivial, in the case of a file being delivered by the source system, to the impossibly complex, in the case of attempting to identify changes to customer circumstances. The extraction process needs to understand the layout of the source system data well enough to select just the data that it needs. Any source system will contain much data that is of little or no interest to the data warehouse. Individual records will contain many data fields that we need in the data warehouse as well as many fields that we do not. There is little point in extracting data that we do not need, especially in the case of behavioral data, where we may be reading and extracting from as many as hundreds of millions of records per day. At that scale, every extraneous byte extracted translates into hundreds of megabytes of wasted space per day. It is important, therefore, that the tool enables us to work at the field level when defining what it is we wish to extract.
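
As a minimal sketch of what field-level extraction means in practice, assume a delimited source file and an extract specification that names just the fields we want; everything else is discarded at source. The field names and file layout are hypothetical:

    import csv

    # Hypothetical extract specification: keep only these source fields.
    WANTED_FIELDS = ["CUST_NO", "INV_DT", "AMOUNT"]

    def extract(source_path):
        """Yield only the wanted fields from each source record."""
        with open(source_path, newline="") as f:
            for record in csv.DictReader(f):
                # Select just the data the warehouse needs; ignore the rest.
                yield {name: record[name] for name in WANTED_FIELDS}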

In terms of the actual processing involved, it is quite common for the extraction process to store an intermediate version of the extracted data in what is known as a staging area. The benefit of this is that it enables us to rerun the loading process from the raw data in the event of an upstream problem in the data warehouse load, without having to re-extract it. It also enables the raw data to be backed up and archived easily, should we wish to do so. The extraction part of the tool must, therefore, allow for the writing out of intermediate flat files.
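
A minimal sketch of staging, assuming the same hypothetical fields as above; the directory layout and file-naming convention are our own invention. The point is simply that the raw extract is preserved as a flat file so that the load can be rerun, or the data archived, without going back to the source system:

    import csv
    import os
    from datetime import date

    def stage(records, staging_dir="staging"):
        """Write extracted records to a dated flat file in the staging area."""
        os.makedirs(staging_dir, exist_ok=True)
        path = f"{staging_dir}/billing_extract_{date.today():%Y%m%d}.csv"
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["CUST_NO", "INV_DT", "AMOUNT"])
            writer.writeheader()
            writer.writerows(records)
        return path  # any rerun of the load starts from this file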

Transformation

In an ETL product, the transformation processor is the busiest component and is responsible for a great deal of the work. The VIM processing, which was described in Chapter 7, is the responsibility of the transformation process. The validation step that determines whether a record is OK, or can be made to be OK, is done here, and so, therefore, is the rejection of records that cannot be made OK. The integration step is also done here. The transformation process has access to a library of functions, some of which are shown in the diagram in Figure 10.1. There will be simple processes that convert numbers to characters, and vice versa, as well as date conversion routines and, sometimes, currency conversion.
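
The sort of simple conversion and validation functions described above might look like the following sketch; the formats and the rejection rule are invented for the example and are not taken from any product:

    from datetime import datetime

    def to_number(text):
        """Character-to-number conversion; raises ValueError if not numeric."""
        return float(text)

    def yyyymmdd_to_date(text):
        """Date conversion from a YYYYMMDD string to a date object."""
        return datetime.strptime(text, "%Y%m%d").date()

    def validate(record):
        """Return the cleaned record, or None if it cannot be made OK."""
        try:
            record["AMOUNT"] = to_number(record["AMOUNT"])
            record["INV_DT"] = yyyymmdd_to_date(record["INV_DT"])
            return record
        except (ValueError, KeyError):
            return None  # reject the record; route it to a reject file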

Also, lookup processes will be needed where, for instance, the customer identifiers are inconsistent between systems and we need to use a kind of surrogate key in the data warehouse to bring them all together. So we need to be able to specify valid values for each of the fields on our input data streams, as well as what changes need to be applied. Some fields will require many transformations, such as a format conversion from numeric to character, followed by a date conversion and then a lookup to determine whether the date falls, say, within the customer's period of validity. So it is important that the tool enables complex processes to be built up. The best way to achieve this, as our example shows, is by stringing simpler processes together. Obviously, we need the function library to be extensible so that we can add our own functions. Some tools provide for the construction of functions within the tool itself. These are usually quite restrictive, so it is important that we are able to call external functions that we have written ourselves.
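
To show what stringing simpler processes together might look like, here is a sketch that chains a format conversion, a surrogate-key lookup, and a validity check; the lookup table and the validity dates are assumptions made purely for illustration:

    # Hypothetical lookup: (source system, local customer id) -> warehouse key.
    SURROGATE_KEYS = {("billing", "C1001"): 1, ("crm", "9-44-17"): 1}

    def lookup_customer_key(system, local_id):
        # A KeyError here means the record is rejected or a new key is assigned.
        return SURROGATE_KEYS[(system, local_id)]

    def within_validity(invoice_date, valid_from, valid_to):
        """Lookup-style check: does the date fall within the customer's validity period?"""
        return valid_from <= invoice_date <= valid_to

    def transform(record, steps):
        """String simpler processes together by applying each step in turn."""
        for step in steps:
            record = step(record)
        return record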

Loading

The loading of data into the data warehouse is one of the most problematic issues with ETL products. Some products operate by physically inserting each record as a row into the target table in the warehouse. This involves an SQL insert statement usually embedded in the tool itself. Some products take this further and operate on a kind of sausage machine basis. They link the extraction, transformation, and loading processes together for each source system record. This means that a record is read from the source system, transformed, and written into the data warehouse; then the next record is read from the source system, transformed, and written, and so on. This type of approach is fine when we do not have large volumes of data to be transferred. When we do, this kind of product will usually become a bottleneck. Vendors of such products will point to multiprocessor hardware architectures as the salvation, but be very careful before selecting one of these types of products if you do have large volumes of data.
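
To make the contrast concrete, here is a sketch using Python's DB-API of the record-at-a-time insert style described above alongside a batched alternative; the table name, the columns, and the %s placeholder style of the driver are all assumptions:

    # Row at a time: one INSERT, and one round trip, per source record.
    def load_row_at_a_time(cursor, records):
        for r in records:
            cursor.execute(
                "INSERT INTO sales_fact (customer_key, invoice_date, amount) "
                "VALUES (%s, %s, %s)",
                (r["customer_key"], r["invoice_date"], r["amount"]),
            )

    # Batched: hand many rows to the database at once; far fewer round trips.
    def load_batched(cursor, records, batch_size=10000):
        rows = [(r["customer_key"], r["invoice_date"], r["amount"]) for r in records]
        for i in range(0, len(rows), batch_size):
            cursor.executemany(
                "INSERT INTO sales_fact (customer_key, invoice_date, amount) "
                "VALUES (%s, %s, %s)",
                rows[i:i + batch_size],
            )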

With large volumes we need to be thinking about using the loading tool provided by the RDBMS vendor, preferably bypassing SQL and with the recovery processing switched off. Also external sorting using a high-performance sort (see Sorts at the end of this chapter) can greatly assist us in the quest for performance.
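
As one hedged example of this approach, using Oracle's SQL*Loader (other RDBMS vendors provide equivalent bulk loaders): the transformed flat file is handed to the loader with the direct path option so that conventional SQL insert processing is bypassed, and the load is marked unrecoverable so that recovery logging is minimized. The table, file names, and connect string are invented, and the exact control-file syntax should be checked against the vendor's documentation:

    # Hypothetical control file (sales_fact.ctl) for a direct path, unrecoverable load:
    #
    #   UNRECOVERABLE
    #   LOAD DATA
    #   INFILE 'staging/sales_transformed.dat'
    #   APPEND INTO TABLE sales_fact
    #   FIELDS TERMINATED BY ','
    #   (customer_key, invoice_date DATE 'YYYY-MM-DD', amount)

    import subprocess

    # direct=true selects the direct path load, bypassing conventional SQL inserts.
    subprocess.run(
        ["sqlldr", "userid=dw_user/password@dw", "control=sales_fact.ctl",
         "direct=true"],
        check=True,
    )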

In operational terms, there are two kinds of ETL product. The first is the kind we have discussed in this section. These consist of a run-time environment that has to be started. Once up and running, they often come under the control of their own internal scheduler. They look for files in source systems; initiate extract, transformation, and load processes; handle errors; and issue alerts. These are highly functional systems, but they do come with some drawbacks, notably performance. The second type, much less common, enables us to design our extraction, transformation, and load in much the same way as before, but this time the ETL product generates program source code. The source code can be in any of several languages (e.g., C, Cobol). Once this is done, we have to compile the generated programs into executable code using our standard compiler. These systems are usually much more efficient to run than the first type, but they too have their problems. Whereas the first type of product makes it relatively easy to alter the processing dynamically, such as changing input or output mappings, the second type requires regenerating the source code and recompiling. In a live environment this can be problematic, especially if there are strict rules about checking programs in and out of the version control system. Another problem is that a support programmer, called out overnight to fix a problem, might be tempted to alter the program source code directly and recompile it. This is potentially disastrous, because the code was generated from the ETL product. The correct way to make changes is within the confines of the product; the new source code can then be generated, compiled, and executed. If the support staff do not know the ins and outs of the ETL product, they may not feel competent to do this and may take the shortcut approach.


