Things to Consider Tools Have we selected an ETL tool, or are we writing the ETL programs from scratch? Will the ETL tool run on the platform where the source files are? On a separate server? Do we have a separate data-cleansing tool? Will we run it before or during the ETL process? Do we have an efficient sort utility? ETL Staging How big is our ETL staging window? How many hours per night? Per week? Do we have a smaller window at month-end because of other month-end processes? How much smaller? Can we fit our ETL process into those windows or will we have to run over several days or nights? How many source data elements do we have to extract? And how many source files and source databases do we have to access? ETL Process Flow How many programs can we run in parallel to shorten the ETL runtime? How long will the initial load take? Have we prototyped it? How long will it take to load the historical data? How many years of history do we need to load? Do we know how long the incremental loads will run? Should we insert rows or use the database management system (DBMS) load utility? Should we use a third-party load utility to speed up the process? Do we already have a third-party load utility, or do we need to buy one? When and how will the data be archived? On disk? On tape? Do we have to write archival programs at this time or can we postpone that additional programming effort to a future release? Performance Considerations How would ETL load performance be affected if we left referential integrity (RI) turned on? How high would the risk of data corruption be if we turned RI off? How much RI checking do we want to perform in the ETL programs? Reconciliation At how many points in the ETL process do we need to count input and output records? Are the record layouts and database structures different on the old historical files and databases than they are on the current files and databases? How do we reconcile them? Do we need to reconcile changed codes? Reused and redefined fields? How many data elements do we have to reconcile? How many codes? How many amounts? Will dirty data be rejected? How will that be reflected in the reconciliation totals? Will the load counts and reconciliation totals be stored as meta data? Quality Metrics How will data quality errors be counted? What data quality metrics do we need to compile in the programs? Will we store those metrics as meta data in the meta data repository or print them in a report? |