Package Connections and Data Flow Sources


The data flow sources in SSIS provide the ability to connect to most standard data repositories, such as delimited files, Excel files, Oracle databases, and so on. These sources rely on package connections, and the connection types included out-of-the-box with the Windows and SSIS installs include SQL Server, flat files (of various code pages), Analysis Services, Oracle, MS Excel, MS Access, XML, and so on.

Important 

Many of the sources that SSIS can use come from the OS-level inclusion of the Microsoft Data Access Components (MDAC). Go to www.microsoft.com/data for more information.

Both data flow and control flow components use separate objects at the package level, called package connections, to facilitate connectivity, whether to an FTP site, a relational database, or a file system folder. All data flow sources (except the Raw File adapter) leverage package connections to extract data. As a review, the Connection Managers window appears at the bottom of the control flow and data flow design surfaces, and right-clicking in the window allows a new connection to be created, as Figure 3-2 highlights.

Figure 3-2: Creating a new connection

The choices include creating a connection from an existing project Data Source, plus the ability to create new connections of many other types, including flat file, ADO.NET, Analysis Services, and OLE DB connections. Creating a connection is the first step in extracting data; the second step is typically to create a new Data Flow Task (in the same package) that contains the source adapter, which references the connection. The next section examines all the different source adapters, but also be aware that when you create a new source adapter in the data flow, new connections can typically be created automatically right inside the specific adapter’s editor.

Important 

A Data Source created within an SSIS project (and viewable in the Solution Explorer window in BIDS) can be linked to and shared among packages. When using a Data Source in this manner, be aware that these shared connections are only updated in the designer: when you modify the project Data Source, you must open each package that shares it before the linked connection is updated with the new connection string. When you deploy packages from your development environment and run them outside BIDS, the connection is no longer shared among packages or updated by the Data Source. Instead, use a package configuration to update connections and to share connection strings.

Source Adapters

Although connections in the package point to source systems or files, the adapters are the objects that define what is extracted. For example, a connection may reference an Oracle database or SQL Server database, and a source adapter would define which specific table (or view or query) to extract data from. For Excel connections, the adapter references which worksheet and data range to pull, and for flat files, the adapter simply references the flat file connection, because the connection already defines the columns and file format metadata. Figure 3-3 shows the data flow source adapters within the data flow toolbox.

Figure 3-3: Data flow source adapters

Only the raw file source doesn’t have an associated connection, and that is because the file itself contains all the details about the columns and data types defined within. Professional SQL Server 2005 Integration Services (Wiley Publishing, 2006) covers the source adapters in Chapter 8, so this section only briefly reviews that information, looks at some advanced features of these components, and discusses some best practices around the adapters.

Flat File Source

Flat files are a very common extraction source, because many source systems push data to flat files (for various reasons) that can then be consumed by another process (such as SSIS) in a loosely coupled manner. The flat file source adapter is a powerful mechanism for bringing data into the data flow. Flat files come in many different structures and formats. SSIS supports delimited, fixed-width, and fixed-width with ragged-right formats, where the last column may have an undetermined number of characters. Delimited flat files may also use different column delimiters and text qualifiers, which can be set in the advanced properties of the connection.
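As a quick illustration (with hypothetical data), a comma-delimited file using a double-quote text qualifier might look like the following; the qualifier protects the embedded comma in the name column from being read as a column delimiter:

    EmployeeID,FullName,HireDate
    1,"Jones, Mary",2004-06-01
    2,"Smith, Bob",2005-11-15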

Two important features of the flat file source give the adapter extra flexibility and great performance. First of all, the flat file adapter supports various code page formats. Figure 3-4 shows the General tab of the Flat File Connection Manager Editor, with the Code page drop-down selected.

Figure 3-4: General tab of the flat file editor

The most common code page, of course, is ANSI 1252 – Latin 1. But beyond that, you may have files generated from legacy systems or applications, such as sources with double-byte characters, IBM EBCDIC files, or even other language-specific code pages.

Second, a little-known property of the flat file adapter is the FastParse property available for date, time, and integer source columns, as Figure 3-5 shows.

Figure 3-5: FastParse property

The FastParse property of output columns (found in the Advanced Editor, on the Input and Output Properties tab) reduces the conversion overhead of each column as it is imported into the data flow buffers, and overall gives significant performance benefits. By using the Fast Parse option, you give up the ability to translate values using locale-specific formats, and date columns must be formatted in YYYYMMDD or YYYY-MM-DD format. But if you are dealing with large flat files, you will want to work through these limitations to take advantage of the setting. In essence, you are telling the connection that the data in the source column can be trusted to be in a standard format.
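For example, with Fast Parse enabled on a date output column, values in these standard formats (hypothetical data) parse without invoking the locale-aware conversion routines:

    20070315
    2007-03-15

A locale-specific value such as 03/15/2007, by contrast, requires the standard (slower) parsing and should not be combined with Fast Parse.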

DataReader Source

The DataReader Source is the adapter used to connect to an ADO.NET source. There’s nothing fancy about the user interface: because the ADO.NET providers don’t expose the database objects and metadata, the source query must be entered manually. Figure 3-6 shows the main editor of the DataReader adapter.

Figure 3-6: Main editor of the DataReader adapter

Many have found that an ADO.NET provider is more often than not slower in SSIS than an OLE DB provider for the same source. The reason is not that the ADO.NET providers themselves are slow, but that marshaling data from a native source, through a managed provider, and into the native SSIS pipeline requires extra steps.
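As an example of what would be typed into the adapter’s SqlCommand property (the table and column names here are hypothetical):

    SELECT CustomerID, CustomerName, ModifiedDate
    FROM dbo.Customer
    WHERE Region = 'Northwest'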

OLE DB Source

The OLE DB source adapter provides the most flexibility, because it allows several mechanisms for customizing the source query, including parameterization and variable-bound queries, both covered later in this chapter in the section “Incremental Data Extraction.”

When using the OLE DB source adapter to connect to SQL Server, leverage the SQL Native Client (SQLNCLI), because it is the best choice for SQL Server sources, offering very fast extraction.
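For reference, an OLE DB connection string that uses the SQL Native Client looks roughly like the following (the server and database names are placeholders):

    Provider=SQLNCLI.1;Data Source=MyServer;Initial Catalog=MyDatabase;Integrated Security=SSPI;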

In addition, you can use OS-based Microsoft Data Access Components (MDAC) providers. Download updates for the generic providers at www.microsoft.com/data. These providers have been detached from SQL Server and are included with the operating system install.

Excel Source

With the release of SQL Server 2005 Service Pack 2, there are now two providers for accessing Excel (or Access). The first (now legacy) supports Excel 97-2005 and is merely a modified version of the JET provider (the OLE DB provider for connectivity to Access and Excel). New to SP2 is the latest provider for Office 2007 Excel and Access. This provider, called Microsoft ACE OLEDB 12.0, overcomes the 255-character limit when extracting cells. Both share the limitation that they are only supported in 32-bit execution mode (there is no native 64-bit version). This doesn’t mean that you cannot extract data from Excel on a 64-bit system; it just means that you must run the package in 32-bit mode, which is somewhat less efficient. (On 64-bit servers, both 64-bit and 32-bit versions of DTExec are installed to handle such compatibility issues.)

You can create an Excel connection in the data flow simply by using the Excel Source in the Data Flow Source adapter list, as shown earlier in Figure 3-3. The Excel source adapter interface will allow a new Excel connection to be generated, which includes the Excel file path property, shown in Figure 3-7.

Figure 3-7: New Excel connection

The Excel connection references the Excel file, and the data flow adapter for the Excel connection then allows a specific worksheet to be selected.

One thing that may cause issues when extracting from Excel sources is that all text columns are pulled back as Unicode (DT_WSTR) data types in SSIS. Even if you try to modify the advanced input and output properties of these columns, the adapter will not accept a change to these types; therefore, another approach is required to get back to native string (DT_STR) types. To change the data type of your columns, you must use the Data Conversion transformation and add a new column for each Unicode column that needs conversion back to a native string. The initial data flow will look like Figure 3-8.

Figure 3-8: Initial data flow

When looking at the details of the conversion, Figure 3-9 shows that the Data Conversion transformation changes several of the data types from DT_WSTR to DT_STR, since in many cases your destinations may be non-Unicode. When converting to non-Unicode, you must specify the code page; doing so is required to prevent destination errors.

Figure 3-9: Data Conversion transformation changing several of the data types

Any time you need to change the data type of a pipeline column, SSIS requires a new column to be added to the buffers, because the data type of an existing column cannot be changed in place. The extra columns make the rows wider and the buffers less efficient.

Raw File Source

The raw file is a highly efficient file-based storage mechanism unique to SSIS, in that it contains data persisted in the native format of the data flow. Therefore, a raw file can only be created by a data flow destination. As mentioned earlier, the raw file does not include a reference to a package connection. The file internally contains all the metadata it needs for columns and data types, but without a connection, the file location property cannot be added to a package configuration through a connection. Instead, the raw file adapter allows its AccessMode to be set to File name from variable, which makes the file path dynamic, based on the value of a variable. (Using this approach allows the file path to be put in a package configuration.) When AccessMode is set to File name, the file path is hard-coded in the adapter.

Uses of the raw file include temporarily staging data between data flows, which can be useful for restartability, and sharing data between multiple packages. Chapter 7 reviews the raw file in detail for these purposes. Additionally, the raw file can provide a data archiving mechanism for later processing, data lineage, or data validation requirements.

XML Source

Pulling XML data into the data flow may be valuable if you have large XML files that need to be imported into a database. By nature, XML files carry the overhead of tags, which makes their file sizes larger, so XML files are not often used as a bulk data transfer mechanism. Delimited files are more common because they can be created, transferred, and read faster; XML tends to be reserved for smaller data sets or business-to-business integration. However, you may have to deal with XML as a source if the data comes from another department or from a vendor that supplies the data in this way.

The XML Source adapter enables you to import XML data into the data flow. The Data access mode property enables you to point to XML data in a few different ways:

  • XML file location, where a hard-coded path to the XML file can be entered

  • XML file from variable, if the path and file name pointing to the XML file are stored in a variable

  • XML data from variable, if the XML you are importing is embedded in a string variable

If you have an XSD schema, you can enter it right in the XSD location property, or you can generate one right from the editor. The XML Source will present multiple outputs, depending on the schema. In some ways, the outputs act like those of a Conditional Split, where multiple outputs are possible and the output must be selected when connecting it to the next transformation or destination. One final note: you should go to the Advanced Editor and modify the output columns to the right data types and lengths, because the XML Source assigns every output column as a Unicode string (DT_WSTR) of length 255.
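For example, given a hypothetical file like the following, the XML Source would typically expose one output for the repeating Order elements and another for the nested OrderLine elements:

    <Orders>
      <Order OrderID="1001" Customer="Jones">
        <OrderLine LineNumber="1" Product="Widget" Qty="5" />
        <OrderLine LineNumber="2" Product="Gadget" Qty="2" />
      </Order>
    </Orders>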

Advanced Features and Concepts

As mentioned in the discussion of the OLE DB source adapter, when you use this adapter, you have the ability to dynamically filter data from a source system. This dynamic query capability is reviewed in detail in the section “Incremental Data Extraction” later in this chapter. This section examines delaying the validation of the source and updating connections by using property expressions.

Delaying Adapter and Connection Validation

An important property of source adapters is ValidateExternalMetadata, because it allows you to reference a source that may not exist at design time, or that may change before execution. For example, if you are creating a database view in one data flow, and then connecting to the newly created view in a second data flow, the object will not initially exist, and this property would need to be set to False to ensure that the package runs without failing validation. Connections in the Connection Managers window have a comparable property called DelayValidation, which works the same way but applies to the entire connection.

Updating Connections with Property Expressions

It is often the case that the database used at design time is not the same database that a production package will use. In such cases, it is useful to be able to tell the package at runtime which database it should point to instead.

Three general methods exist to update a connection at runtime. The most common is to leverage package configurations (discussed in Chapter 16 of Professional SQL Server 2005 Integration Services), which can update a connection when the package loads. This provides the ability to change connection strings without opening the package, such as when moving packages from a development server to a test server and on to a production server.
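As a rough sketch (the connection name, path, and values here are hypothetical), the relevant entry in an XML configuration (.dtsConfig) file looks something like this:

    <Configuration ConfiguredType="Property"
        Path="\Package.Connections[SourceDB].Properties[ConnectionString]"
        ValueType="String">
      <ConfiguredValue>Provider=SQLNCLI.1;Data Source=ProdServer;Initial Catalog=Sales;Integrated Security=SSPI;</ConfiguredValue>
    </Configuration>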

The drawback to configurations for updating connections is that the connection can only be updated when the package loads. This approach will not work if you are looping through identically structured files in a folder and using a data flow to import each one, because the connection must be updated every time the package loops. As an example, Figure 3-10 shows a control flow with a ForEach Loop Container that is looping over Excel files in a folder.

Figure 3-10: Control flow with a ForEach Loop Container

The Excel connection must be updated for every file, so for this situation a configuration will not work, since configurations are only applied when the package loads. Instead, a property expression is used. The ForEach Loop updates a package variable called FileName on every iteration, which can easily be used to update the connection. Property expressions can be found in the Properties window under the property called Expressions. Figure 3-11 highlights the Expressions property when selected on the Excel Connection Manager connection.

Figure 3-11: The Expressions property when selected on the Excel Connection Manager connection

To open the Property Expressions Editor, click the plus symbol to the left of the Expressions property name, and then the ellipsis button on the right-hand side of the property when it appears. This opens a window that allows several properties of the connection to be updated by an expression. For the purpose of this example, the ExcelFilePath property is selected from the drop-down list, and the expression @[User::FileName] is entered, as Figure 3-12 shows.

Figure 3-12: Selecting the expression @[User::FileName]

In this case, the expression itself only references a package variable, but property expressions can leverage the full SSIS expression language. To create a more complicated expression, click the ellipsis button next to @[User::FileName] to open the full SSIS Expression Builder window.
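For instance, an expression along these lines (the variable names are hypothetical) builds a full path from separate folder and file name variables; note that backslashes must be escaped inside SSIS string literals:

    @[User::FolderPath] + "\\" + @[User::FileName]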

When the package shown in Figure 3-11 executes, the ForEach Loop iterates through every file in the designated folder, and the full name and path are passed into the FileName variable. When the data flow runs and the adapter calls the connection, the connection is automatically updated by the property expression and the correct file is extracted in the data flow, making the extraction flexible enough to handle multiple sources with the same connection.

Optimal Data Staging Method

Data staging provides a valuable mechanism for data archiving and data validation. In some cases, a data warehouse may only use a portion of the source data provided; at some later date, that information may become valuable for extending the solution. Having raw staged data allows re-loading and data validation.

When dealing with flat files generated or pushed from a source, the flat file itself can be the archive mechanism. At times, however, it may be valuable to have your raw source data persisted into a database for tracking or validating with queries. In these cases, when you need to stage the data with SSIS, one of two approaches can be taken.

The data can be landed in the staging/archive table in one data flow, and a second data flow can then extract the staged data back into the pipeline for further processing. This approach would look like Figure 3-13, with the control flow and data flows shown side by side.

Figure 3-13: Data flow to stage data

The second approach is to handle both the staging and the transformation logic within the same data flow by using a Multicast transformation. Figure 3-14 shows the same process handled in this approach.

Figure 3-14: Staging and the transformation logic within the same data flow

Both have their advantages and disadvantages. The second approach scales much better when dealing with large amounts of data, for two reasons: the staging process is handled in parallel with the transformation process, and disk I/O is cut in half because the staging table is only written to, as opposed to written to and then extracted from again.

However, if the transformation process fails, the staging process will also stop. With the first approach, if the transformation process failed, the staged data would already be in the archive table, and the data flow containing the transformation logic could be restarted after the data issue is resolved.


