In this chapter we've talked about the features of Integration Services and walked through some general strategies for high-level planning and the physical design of your ETL system. Now it's time to pull everything together and develop a detailed specification for the entire ETL system.
All the documents you have developed so far (the source-to-target mappings, data profiling reports, and physical design decisions) should be rolled into the first sections of the ETL specification.
Document all the decisions we have discussed in this chapter, including:
The default strategy for a package error handler
The default strategy for extracting from each major source system
The default approach for handling partitioning
The design of the dimension manager
Locations of staging areas
Data model of the ETL process metadata
Requirements for system availability, and the basic approach to meeting those requirements
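The ETL process metadata mentioned above is worth sketching early, even informally. The following is a minimal, purely illustrative model of run-level and table-level logging; the names `EtlRun` and `TableLoad` are assumptions for this sketch, not part of Integration Services:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Hypothetical metadata model for ETL run logging. Record and field
# names are illustrative only, not an SSIS standard.

@dataclass
class TableLoad:
    table_name: str
    rows_extracted: int = 0
    rows_inserted: int = 0
    rows_updated: int = 0
    status: str = "pending"   # pending / running / succeeded / failed

@dataclass
class EtlRun:
    run_id: int
    started_at: datetime
    finished_at: Optional[datetime] = None
    table_loads: list = field(default_factory=list)

    def record(self, load: TableLoad) -> None:
        self.table_loads.append(load)

run = EtlRun(run_id=1, started_at=datetime(2024, 1, 1, 2, 0))
run.record(TableLoad("DimCustomer", rows_extracted=500,
                     rows_inserted=12, rows_updated=3,
                     status="succeeded"))
print(len(run.table_loads))  # 1
```

In practice this model would live in relational audit tables that the packages write to, but capturing the shape of it in the specification keeps the logging consistent across packages.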
The next section of the ETL specification describes the historical and incremental load strategies for each table. A good specification will include between two and ten pages of detail for each table, and document the following information and decisions:
Table design (column names, data types, keys, and constraints).
How much historical data to load (for example, 37 months).
Historical data volumes (row count).
Incremental data volumes, measured as new and updated rows/load cycle.
How to handle late-arriving data for facts and dimensions.
Load frequency (for example, daily).
How changes in each dimension attribute will be handled (for example, Type 1, 2, or 3).
Fact table and associated Analysis Services partitioning strategy. Discuss how to implement partitioning if today's load can include fact data for aged partitions.
Overview of data sources, including a discussion of any unusual characteristics of the sources, like an unusually short access window.
Detailed source-to-target mapping.
Source data profiling, including at least the minimum and maximum values for each numeric column, the count of distinct values in each column, and the incidence of NULLs.
Extract strategy for the source data (source system APIs, direct query from database, dump to flat files).
Dependencies: Which other tables need to be loaded before this table is processed?
Document the transformation logic. It's easiest to write this section as pseudo-code rather than trying to craft complete sentences. The more familiar you are with Integration Services, the more you can use shorthand like "Conditional split on ColA>1000."
Preconditions to avoid error conditions. For example, should the package check for file or database space before proceeding?
Cleanup steps, for example deleting working files.
Package error handler, if it differs from the default described previously.
The packages you will create to handle the table's processing, and the scope of work for each package.
An estimate of whether this portion of the ETL system will be easy, medium-difficulty, or difficult to implement.
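The profiling numbers called for above (minimum and maximum values, distinct counts, NULL incidence) are mechanical to compute. As a generic sketch only, over in-memory rows rather than any particular source system:

```python
# Minimal column profiler: min/max for numeric columns, distinct
# value counts, and NULL incidence. Illustrative helper code, not
# an Integration Services feature.

def profile(rows, columns):
    report = {}
    for col in columns:
        values = [r.get(col) for r in rows]
        non_null = [v for v in values if v is not None]
        stats = {
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
        }
        # Only report min/max when every non-NULL value is numeric.
        if non_null and all(isinstance(v, (int, float)) for v in non_null):
            stats["min"] = min(non_null)
            stats["max"] = max(non_null)
        report[col] = stats
    return report

rows = [
    {"amount": 10, "region": "East"},
    {"amount": 250, "region": "West"},
    {"amount": None, "region": "East"},
]
print(profile(rows, ["amount", "region"]))
```

In a real project you would run equivalent aggregate queries against the source database, but the specification should record the resulting numbers per column in exactly this per-column shape.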
The final section of the ETL specification describes any master packages and provides a first cut at job sequencing. Create a dependency tree that specifies which tables must be processed before others. Whether or not you choose to parallelize your processing, it's important to know the logical dependencies that cannot be broken.
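One way to turn such a dependency tree into a valid load order is a topological sort. The table names below are invented examples of the usual pattern, dimensions loading before the fact table that references them:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency tree: each table maps to the set of tables
# that must be loaded before it.
deps = {
    "DimDate": set(),
    "DimCustomer": set(),
    "DimProduct": set(),
    "FactSales": {"DimDate", "DimCustomer", "DimProduct"},
}

# static_order() yields tables with all prerequisites first.
order = list(TopologicalSorter(deps).static_order())
print(order)  # the three dimensions in some order, then FactSales
```

Tables with no ordering between them (the three dimensions here) are the candidates for parallel processing; the sort only fixes the dependencies that cannot be broken.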