Tracking Data Lineage Identifiers


Data lineage is a part of the metadata tracking in an ETL process associated with mapping data back to the source data points. But before considering an approach to data lineage tracking, to better understand data lineage requires understanding metadata in general.

Metadata is the data about the data. For ETL, it’s the integrated tracking mechanism for the process and the core data. Metadata is very important for ETL and data warehousing-not just for administrators to understand what data got loaded, but also for system developers to validate data sources, for data administrators to find out what happened when failures occur, and for users to be confident in the data they are looking at. In fact, the latter may be the most important, because if the users do not trust the data, then their adoption of the solution will be slow going. Tracking data lineage is directly related to knowing where each data point came from, not just what system or table or file, but knowing which source row(s) they came from.

As with other requirements, data lineage can be handled several ways in SSIS. The most common way is to add an auto-incrementing number to the data flow that maps data points back to their sources.

If you are staging raw data, this can be leveraged by sending this lineage number to the staged data and also to the destination tables. With raw staging tables, instead of having to track back to a source system, which may be volatile, you are able to go back to the raw staging database and see what the data was at the point in time it was extracted.

To do this, add a lineage sequence identifier to the data flow right after the extraction so that it could then flow to both the stage table as well as the primary transformation processes. A simple Script component can handle the incrementing identifier. Figure 3-35 shows an extraction data flow that was presented earlier. This time, a Script component has been added after the extraction.

image from book
Figure 3-35: Extraction data flow with Script component added after the extraction

To configure the Script component to output a sequence lineage number, a new output column must be added to the Script component output properties. Figure 3-36 shows the Inputs and Outputs properties page, where a new integer column called CustomerLineageID has been added as an output column.

image from book
Figure 3-36: Inputs and Outputs properties page

The script itself is very straightforward. For every row, the lineage is output and then the value is increased by 1 for the next row. The following code handles this lineage:

  Public Class ScriptMain     Inherits UserComponent     Private CustomerLineageID As Integer = 1     Public Overrides Sub Input_ProcessInputRow(ByVal Row As InputBuffer)         Row.CustomerLineageID = Me.CustomerLineageID         Me.CustomerLineageID = Me.CustomerLineageID + 1     End Sub End Class 

With this sequence lineage approach, the added Customer Lineage ID would flow to both the raw staging table and the destination tables. Alternately, if the entire source data does not need to be staged, then the lineage ID could be landed to a staging table with only the business keys from the source, rather than every column. This would provide the necessary link between the source keys and the destination lineage ID. The SSIS data flow process would look very similar.

It’s worth mentioning that in data warehousing, oftentimes a fact table will contain a degenerate dimension-or a granular-level number that is an attribute of the fact. But, instead of being broken out into a separate dimension table, it is stored in the fact table itself. Oftentimes this value is seen as a business key of the fact records. This is also a type of data lineage because the discrete values can relate back directly to the source system.

When the lineage ID is stored in the destination tables, the next step is to put a trigger or action in the destination database that can then reference the source data that is either stored in the raw staging environment, or links directly to the source. Tracking data lineage allows better data validation, but when you can enable the users to see the link between source and destination themselves by building that functionality into the application, you are ahead of the game. To get there requires adding it to your SSIS-based ETL.



Expert SQL Server 2005 Integration Services
Expert SQL Server 2005 Integration Services (Programmer to Programmer)
ISBN: 0470134119
EAN: 2147483647
Year: 2004
Pages: 111

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net