ETL System Physical Design

The last thing to document before pulling together the detailed ETL system specification is the ETL system's physical design. Consider where Integration Services will run in test and production, where and how big the staging area will be, and where package definitions will be stored and how they will be named.

It may surprise you to learn that there aren't any global configuration parameters for Integration Services. Beyond installing Integration Services and the BI Studio tools, there's really nothing else to set up.

System Architecture and Integration Services

Integration Services has two major components: a design environment that's part of the BI Studio, and a runtime environment. The design environment is where you create, edit, and debug packages. The only way to remotely execute an Integration Services package in development/debugging mode is to use a remote desktop connection to the remote machine.

On test or production systems, use the DTExecUI or DTExec utility to execute packages. You can run a package on one server even though it is stored on a different server. In production, Integration Services packages can be located anywhere, and can be run on any server that has the Integration Services runtime executables.
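For example (the package, path, and server names here are hypothetical), these dtexec command lines show the two common cases: running a package from a .dtsx file on the local server, and running a package stored in the msdb database of a remote server:

    dtexec /FILE "C:\ETL\Packages\DimCustomer_ETLPi.dtsx"

    dtexec /SQL "DimCustomer_ETLPi" /SERVER "ETLSRV02"

DTExecUI wraps the same options in a dialog box, which is handy when you're testing a deployment.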

You need a SQL Server 2005 license on any server that has the Integration Services runtime executables installed on it. Chapter 4 includes a discussion of how you may configure your SQL Server 2005 system, and concludes that most often the ETL system will run on the same physical server as the data warehouse relational database.

Design your ETL system with no assumptions about where Integration Services will be running. In other words, build in the concept of (and variables for) an ETL server that may or may not be co-located with another component of the DW/BI system. As much as we try to provide you with practical advice during the design phase, the only way to know for certain whether your ETL system needs a dedicated server is to build it and test it.
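One way to build in that flexibility (a sketch only; the variable name, file path, and server name are hypothetical) is to hold location-dependent settings, such as the target server or staging directory, in package variables and override them at execution time with dtexec's /SET option:

    dtexec /FILE "C:\ETL\Packages\FactOrderLineItem_ETLPi.dtsx" /SET "\Package.Variables[User::ETLServerName].Properties[Value]";ETLSRV02

Integration Services package configurations accomplish the same thing in a more structured way.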

You can even design your ETL system to run multiple Integration Services packages on multiple servers. For example, if your enterprise DW/BI system consists of several distributed but conformed business process dimensional models, you would have one central ETL server manage the dimensions, which are then distributed to the conformed dimensional models. Each distributed database could perform its own fact processing on its own ETL server, reporting back to the central server the details of processing success or failure.

Staging Area

Earlier books in the Toolkit series, particularly The Data Warehouse Lifecycle Toolkit and, to a lesser extent, The Data Warehouse ETL Toolkit, talk extensively about a staging area. The staging area is a non-user-queryable place to rest data after it's extracted from the source systems and before it's loaded into the DW/BI system.

STAGE FACT DATA FOR RECOVERABILITY

We recommend that you stage fact data, if only to provide a buffer between the source systems and the ETL system. If a load fails (What? Impossible!), you can always go back to the recovery area rather than having to touch the source systems again. It's a good idea to store a copy of dimension extracts as well, especially if they get overwritten in the source system. Set up your staging process to keep copies as far back as you might need them. We've kept anywhere from seven days of extracts for some tables to all of history for others. It depends on how easy it is to go back and get the exact same data set.

In systems with high performance requirements, you may store the data in the recovery area in parallel with transforming it, rather than performing these steps in series. The Integration Services multicast transform makes it easy to develop such parallel flows.


Some Integration Services ETL systems will not use a staging area at all. They kick off a package at some predetermined time, often in the middle of the night. The package pulls data from the source systems, performs all necessary transformations, populates the relational data warehouse database, and launches Analysis Services processing.

Other systems may execute a process on the source system to extract data to files, and then use Integration Services to move those files to the ETL server for processing. In this case the staging area would consist of some file space on both the source and ETL servers.

Another common architecture will be to design a set of Integration Services packages to perform extracts, and a second set of packages to perform transformations and loads. You might choose this architecture if you want to extract data more frequently than you want to transform and load it. In this case, the extract packages write the data as flat files or raw files (an Integration Services-specific format), and the staging area would consist of file space on the ETL server.

Fewer new ETL systems will use a relational staging area than in the past. Flat file and raw file storage is efficient and relatively easy to manage. However, a relational staging database is a concept that's comfortable and familiar to many ETL architects. There's nothing wrong with using a database rather than files, if your load window can handle the overhead.

Package Storage

An Integration Services package is an XML file, and has the file extension .dtsx. If you open an Integration Services package file in Notepad or any other text editor, you can verify this is so. And yes, if you're curious, you can theoretically develop a package simply by writing a correctly structured XML file. You would be crazy to do it this way. If you need to build or modify packages programmatically, use the object model.
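As a rough sketch of what the object model looks like (the file paths here are hypothetical), the managed API in Microsoft.SqlServer.Dts.Runtime can load, inspect, and save a package in a few lines of C#:

    using System;
    using Microsoft.SqlServer.Dts.Runtime;

    class PackageInspector
    {
        static void Main()
        {
            Application app = new Application();

            // Load an existing package definition from its .dtsx file.
            Package pkg = app.LoadPackage(@"C:\ETL\Packages\DimCustomer_ETLPi.dtsx", null);

            // Inspect (or modify) the package programmatically.
            Console.WriteLine("Package: {0}, executables: {1}", pkg.Name, pkg.Executables.Count);

            // Save the definition back out as a .dtsx XML file.
            app.SaveToXml(@"C:\ETL\Packages\DimCustomer_ETLPi_copy.dtsx", pkg, null);
        }
    }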

During development, when you're working with packages in BI Studio, those packages are stored in the file system, ideally under your source control system. After you've developed your ETL system, you will deploy it to your test and then production systems. When you install a set of packages on the target system, you can store the packages in the file system or in the msdb database in an instance of SQL Server 2005. We prefer to store production packages in the file system Package Store. The Package Store is a special file system folder in which Integration Services can find packages.
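The dtutil command-line utility handles this kind of deployment. For example (package names and paths are hypothetical), to copy a package from the file system into msdb, or into the Package Store on the local server:

    dtutil /FILE "C:\ETL\Packages\DimCustomer_ETLPi.dtsx" /COPY SQL;DimCustomer_ETLPi

    dtutil /FILE "C:\ETL\Packages\DimCustomer_ETLPi.dtsx" /COPY DTS;DimCustomer_ETLPi

dtutil can also move and delete packages, and can target a remote server.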

Note 

Readers who are experienced with DTS 2000 are usually biased against storing packages in SQL Server. In DTS 2000, there was no way to categorize packages stored in the database. Integration Services solves that problem.

There are two reasons we continue to prefer file system storage:

  • Integration with source control

  • Flexible backup and restore, compared to objects inside the msdb database

Package Naming Conventions

Develop a set of naming conventions for the packages. Our convention is to start with the table name, and append an acronym for the major sets of processing that the package handles. Some examples follow:

  • DimCustomer_ETLPi performs incremental extraction, transformation, load, and Analysis Services dimension processing (P) for the customer dimension (DimCustomer table).

  • DimCustomer_Eh performs historical extraction only for the customer dimension.

  • DimCustomer_TLh performs historical transformation and loading for the customer dimension.

  • FactOrderLineItem_ETLPi performs incremental extraction, transformation, loading, and Analysis Services partition processing for the Orders fact table.

  • Master_Orders_i runs subpackages for the incremental processing of any dimension tables used by the Orders business process, and then runs subpackages for the incremental processing of the one or more related fact tables in that dimensional model.

We start the file name with the name of the table, so that all the packages that have to do with any one table are sorted together.

Alternatively, group all the packages that do only historical loads into a separate project, as we illustrated earlier in this chapter.


