Extraction, transformation, and loading (ETL) is a process that is common to data warehousing. It includes the process of extracting data from various data sources throughout an organization and manipulating it to conform to the standards of the data warehouse prior to populating the relational data warehouse. The manipulation falls into two basic categories:
This chapter focuses on the T (transformation) step in ETL because it is traditionally the most challenging and low-performing part of the ETL process. In fact, in SQL Server 2000's DTS, ETL was often referred to as ELT (or more appropriately, ELTL). It was far more efficient to bulk-load the data extracted from the source into a staging area, perform the transformation, and then load it into the destination star schema format. Fact table transformations, in their purest form, consisted of dimension surrogate key lookups based on business keys. The most useful means of doing this was by using T-SQL to perform an INSERT from the fact table joined to the various dimension tables using the business keys. It was possible to use DTS's Data Transformation task for this purpose and use the Lookup feature within that task. This is the equivalent of using many SELECT statements to perform the table lookup for each dimension of each fact table row! SSIS ETL ConceptsSSIS includes the data adapters and pipelines for dealing with ETL. A source data adapter provides access to an external data source and converts the native data types to SSIS data types. The pipeline is an in-memory series of tasks that operate on the data from the source to transform it to the desired values and format. The destination data adapter then converts data from the SSIS data types to the destination data types and sends the data to the connection string identified therein. The key to transforming data in SSIS is the fact that it occurs in memory. This results in increased performance, and tasks such as Lookup Table Reference, which really slowed down the ETL process, now occur in a few seconds. This is because the lookup tables are each referenced once, causing all the data to be pulled into memory, where subsequent lookups are performed. The resulting performance improvement leads back to the pure ETL process and potentially eliminates the need for an intermediate staging area. What impact does this have on memory on the SSIS server? This is an important question to consider. Memory-intensive tasks have advanced settings that can moderate these effects. Memory availability is something to consider with this type of processing, though. Data Source ViewsData source views (DSVs) are another new component of SQL Server 2005. They provide a means of defining particular tables of interest in a potentially complex underlying data model. In the ETL process, this can be used, for example, to focus on one subject area. Table drop-down lists in the SSIS designers can be simplified in this way, displaying only the tables defined in the DSV. This also has a positive performance impact on the time it takes to populate these dialog boxes. DSVs provide a layer of abstraction between the SSIS packages and the actual underlying data sources. Therefore, they can be used to provide disconnected package development capabilities. DSVs are defined within a BIDS solution and can be shared between SSIS and Analysis Services projects in the same solution. In addition, you can use the Import and Export Wizard as explained in the "World of Wizards" section to transfer data from one data source to another. |