ETL Testing


Unfortunately, testing, like reconciliation, is often done very poorly on BI projects, if at all. That is not acceptable, and neither is the excuse that "it can be fixed in the next release." The next release will be even larger and more complicated, and it will require more time to test. In other words, if it takes too long to test now, it will take even longer to test later, which usually means that adequate testing is never done. Rushing into deployment at the expense of testing is inexcusable, especially if it is done to meet some artificial deadline.

The same types of testing that apply to operational systems also apply to BI applications (Figure 11.5). The following sections briefly describe each type.

Figure 11.5. Types of Testing


Unit Testing

Unit testing refers to the testing of discrete program modules and scripts, not to testing the flow or the links between the programs. Every developer must test his or her program modules and scripts individually (if XP techniques are utilized, this is done in pairs). There are three components to unit testing: compilation, functionality, and edits.

  • Compilation: Obviously, each program module must compile successfully or it cannot be implemented. There are many testing tools on the market for every conceivable programming language. These tools allow the developer to trace every step of the code as it is being executed, displaying the before and after images of the data for each line of code.

  • Functionality: Each program module must perform the functions for which it was designed and must produce the expected test results. Therefore, each developer must create a small test file with valid as well as invalid data to test every function of his or her program modules. He or she must know in advance what the program modules should do with the valid and the invalid test data (expected test results). Unit testing is not completed until all program modules produce all of the expected results.

  • Edits: Each program module must catch errors and, depending on the severity of the error, either produce an error message or gracefully end the program. No program should ever stop abruptly ("crash" or abend) with a cryptic system error message. With the high degree of poor-quality data in the source files, it is quite common that more lines of code are written for edits than for functionality.
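The functionality and edits components above can be illustrated with a small sketch. This is not code from the book; the transform function, field names, and test records are hypothetical, standing in for one ETL program module tested against predefined expected results:

```python
# A minimal sketch of unit testing one ETL transform module with both
# valid and invalid input. Function and field names are hypothetical.

def transform_customer(record):
    """Parse one raw customer record; raise ValueError on bad data."""
    if not record.get("customer_id", "").isdigit():
        raise ValueError(f"invalid customer_id: {record.get('customer_id')!r}")
    return {
        "customer_id": int(record["customer_id"]),
        "name": record.get("name", "").strip().title(),
    }

# Functionality: the expected result is defined BEFORE the test is run.
valid = {"customer_id": "042", "name": "  ada LOVELACE "}
assert transform_customer(valid) == {"customer_id": 42, "name": "Ada Lovelace"}

# Edits: invalid data must produce a clear error message, not an abend.
try:
    transform_customer({"customer_id": "ABC", "name": "x"})
except ValueError as e:
    print("caught:", e)  # graceful handling with a readable message
```

Unit testing of the module is complete only when every such valid and invalid test record produces its expected result.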


Consider having someone other than the developer test any given piece of code. This practice, along with peer reviews, should catch most of the errors that would otherwise slip through because of pride of ownership or because the developer is too close to his or her own code.

If an ETL tool is used, unit testing still applies to the individual ETL tool modules. In that case, you are testing the validity of your ETL instructions, that is, the ETL technical meta data. It is important to note that if an ETL tool does not meet the standards of your organization, you will have to supplement the tool with your own custom-written code. In that case, unit testing will be a combination of testing the ETL technical meta data, testing the custom-written code, and testing the "handshake" between them.

Integration Testing

Integration testing, also known as system testing, is the first complete ETL process run. This includes all three sets of ETL processes: the initial load, the historical load, and the incremental load. Even when all program modules pass their individual unit tests, it cannot be assumed that the entire ETL process will run smoothly. The interactions and flow of all programs, as specified in the ETL process flow diagram, must be observed and validated.

  • Interactions: Program modules receive data, manipulate it, and hand it off to other program modules. This interaction between the modules must be tested. The test data used for integration testing is different from that used for unit testing. A copy of a relatively large subset of representative operational source data is used for integration testing.

  • Flow: The ETL process flow diagram should indicate which programs must run in which sequence, which programs can run in parallel, and where sort and merge utilities are interjected. This flow must be tested for functionality and efficiency. Testing for functionality ensures that the right process is performed on the right data at the right time, that is, that the programs run in the correct sequence. Testing for efficiency ensures that the entire ETL process can complete within the expected time frame. If it cannot, the flow must be redesigned, and the entire ETL process must be retested.
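The sequencing described in the flow bullet can be sketched in code: if the ETL process flow diagram is expressed as program-to-prerequisite dependencies, a topological sort yields the correct run order and shows which programs may run in parallel. The program names below are hypothetical:

```python
# A sketch of deriving run order from an ETL process flow diagram,
# expressed as program -> set of prerequisite programs (names hypothetical).
from graphlib import TopologicalSorter

flow = {
    "extract_orders": set(),
    "extract_customers": set(),          # may run in parallel with extract_orders
    "cleanse": {"extract_orders", "extract_customers"},
    "load_target": {"cleanse"},
}

ts = TopologicalSorter(flow)
ts.prepare()
while ts.is_active():
    batch = sorted(ts.get_ready())       # programs in one batch can run in parallel
    print("run in parallel:", batch)
    ts.done(*batch)
```

Testing for functionality then means verifying that each batch completes correctly before the next begins; testing for efficiency means timing the longest path through this graph against the available batch window.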

If an ETL tool is used, the entire ETL process must still be tested from beginning to end, except that you are running the ETL tool processes instead of custom-written programs.

The business representative and the subject matter expert should be involved in integration testing. They are the first to know whether a particular run was successful or not. This is also an excellent opportunity for some advanced training for them, and it will go a long way toward allaying any future suspicions about the accuracy and quality of the data in the BI target databases.

Integration testing, like unit testing, requires many test runs to remove all the defects and to tune the flow. Every time the actual test results do not equal the expected test results, the program producing the error must be corrected, and all programs must be rerun. But unlike unit testing, integration testing is far too complicated to be performed without a formal test plan, which should include a description of the test cases and the sequence in which the programs should be executed.

Regression Testing

The most complicated and most time-consuming of all types of testing is regression testing. It is similar to integration testing, but this time the programs that are being tested are not new. Since the BI decision-support environment is evolving and new BI applications are added to the ETL process continuously, the project team will have to perform extensive regression testing on all releases except the first one. With every new BI release the ETL process has to be modified (enhanced) to extract more data from the operational systems for the new BI application. The new BI application may have a separate set of data access and analysis programs, but the ETL process is shared.

The main goal of regression testing is to make sure that the modifications to existing ETL programs did not inadvertently produce some errors that did not exist before. If new programs were added to the ETL process flow, the new interactions between programs must be tested, and any affected subsequent old programs must be retested. If the ETL process flow had to be enhanced or redesigned to increase efficiency, the entire ETL process has to be retested.


To develop a new test plan for every regression test would be much too time-consuming. Thus, it is advisable to save the original test plan and the original test data created for integration testing of the first release. Then enhance the original test plan for subsequent regression tests with new programs, new data, and new test cases.
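The reuse of the original test data suggests a simple mechanical check: rerun the modified ETL process against the saved integration-test input and diff the output against the saved baseline results. The sketch below is illustrative; `run_etl()`, the records, and the values are hypothetical stand-ins:

```python
# A sketch of a regression check: rerun the ETL against saved test data
# and compare the output with the baseline saved from the first release.

def run_etl(rows):
    # stand-in for the real (modified) ETL process being regression-tested
    return [{"id": r["id"], "amount": round(r["amount"], 2)} for r in rows]

# saved from the integration test of the first release
saved_input = [{"id": 1, "amount": 10.5}, {"id": 2, "amount": 3.14159}]
baseline    = [{"id": 1, "amount": 10.5}, {"id": 2, "amount": 3.14}]

actual = run_etl(saved_input)
diffs = [(b, a) for b, a in zip(baseline, actual) if b != a]
assert not diffs, f"regression detected: {diffs}"
print("no regressions against the saved baseline")
```

New programs and new test cases are then appended to this baseline for each subsequent release, so the regression suite grows with the ETL process.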

Performance Testing

Performance testing, also known as stress testing, is performed to predict system behavior and performance. This is very similar to traditional performance testing on operational systems, but BI performance testing is more complicated because of the enormous volumes of data in the BI target databases. Since most organizations do not have more than three to four hours left in their nightly batch windows, the goal is to find out how much data can be processed during that time and how many nights it will take to complete the entire ETL process.

Unlike integration testing or regression testing, performance testing does not have to be performed on every program module. Performance testing could be limited to only the most critical program modules with the highest volumes of data and the longest runtimes. In addition to running physical tests, you can use stress test simulation tools. These simulation tools allow you to describe the production platform, including other programs running on the same server and sharing the same space. Based on your input, the tools calculate and project estimated performance numbers. It is highly recommended to run a simulation prior to actual performance testing with real data.
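The batch-window question reduces to simple arithmetic once the critical-path modules have been measured. All numbers in this sketch are illustrative assumptions, not figures from the book:

```python
# A back-of-the-envelope sketch of the batch-window calculation:
# given a measured load rate, how many nights does the full ETL run take?
rows_total = 500_000_000        # rows in the full historical load (assumed)
rows_per_hour = 40_000_000      # measured rate of the critical-path modules (assumed)
window_hours = 3.5              # nightly batch window of three to four hours

rows_per_night = rows_per_hour * window_hours
nights = -(-rows_total // int(rows_per_night))   # ceiling division
print(f"{rows_per_night:,.0f} rows per night -> {nights} night(s) for the full load")
```

If the projected number of nights is unacceptable, the flow must be redesigned (more parallelism, fewer passes over the data) and the calculation repeated with new measurements.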

Quality Assurance Testing

Most large organizations have strict procedures for moving an application into production. These procedures usually include QA testing, and in most cases a separate QA environment is established for such testing. Operations staff direct the developers in moving databases and programs into the QA environment. Then all operating instructions and scheduled jobs have to be turned over to the operations staff for testing. They will go through a simulated production run before allowing the application components to transfer to the production environment.

Acceptance Testing

Acceptance testing can be performed in one of two ways, depending on how testing as a whole is set up. Ideally, there should be a separate acceptance test environment, which could also be used for regression testing of future releases. With a separate acceptance test environment, QA testing and acceptance testing could be done at the same time. However, it may not be feasible or justifiable to maintain a separate acceptance test environment. A simpler alternative is to perform acceptance testing after QA testing in the same QA environment.

If the business representative actively participated during integration testing or regression testing, there should be very few surprises during acceptance testing. In fact, if the business representative is comfortable with the integration or regression test results, and barring any unforeseen problems detected during QA testing, separate acceptance testing may not be necessary at all. However, if a traditional approach was followed in which the business representative was not involved in any design or testing activities except for occasional reviews, acceptance testing is the most important test of all.

Some project teams limit acceptance testing to the access and analysis portion of the BI application and exclude the business representative from ETL testing. That is a big mistake. When business analysts and business managers complain about incorrect data in the BI target databases, the reason may not be that the report programs do not work properly but that the ETL process is faulty. Therefore, testing how to get the data into the BI target databases correctly is more important than testing how to get it out correctly because an error in a report program is a lot easier to find and correct than an error in the ETL process. Moreover, since the business representative is involved in source data analysis and in providing the business rules for data cleansing, it is only logical that he or she should test the ETL process that implements those rules. The business representative should ask some of the following questions.

  • Is the appropriate data being extracted?

  • If the source data element is split into multiple columns, is it done correctly during the transformation process?

  • If some data elements are merged together, did any integrity problems result from this transformation process?

  • Is the data loaded correctly into the appropriate BI target databases and the appropriate BI tables?

  • Can the data in the BI target databases be reconciled with the source files and source databases? Where are the reconciliation totals stored?

  • Are data values correctly transformed and cleansed? Is bad data slipping through without notice?

  • Is the load performance adequate? And is the BI data available to the business people when they expect it?
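The reconciliation question in the list above lends itself to an automated check: compare record counts and amount totals between the source extract and the BI target tables. The data and measure names below are illustrative:

```python
# A sketch of one acceptance-test reconciliation: compare record counts
# and amount totals between source and BI target. Data is illustrative.
source = [{"id": 1, "amount": 100.0}, {"id": 2, "amount": 250.5}]
target = [{"id": 1, "amount": 100.0}, {"id": 2, "amount": 250.5}]

checks = {
    "record count": (len(source), len(target)),
    "amount total": (sum(r["amount"] for r in source),
                     sum(r["amount"] for r in target)),
}
for name, (src, tgt) in checks.items():
    status = "OK" if src == tgt else "MISMATCH"
    print(f"{name}: source={src} target={tgt} [{status}]")
```

Storing these totals with each load run also answers the question of where the reconciliation totals are kept: they become part of the load's audit trail.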



Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications
ISBN: 0201784203
Year: 2003
Pages: 202
