Error Outputs


Error outputs can obviously be used to improve reliability, but they also have an important part to play in scalability. From a reliability perspective, they are a critical feature for coping with bad data. An appropriately configured component will direct failing rows down the error output rather than the main output. These rows are removed from the main Data Flow path and can then receive additional treatment and cleansing to enable them to be recovered and merged back into the main flow. They can be merged explicitly, such as with a Union All transform, or implicitly through a second adapter directed at the final destination. Alternatively, they could be discarded. Rows are rarely discarded entirely; more often they are logged and dealt with at a later point in time.

Recovering rows is perhaps the most useful course of action. If a data item is missing in the source extract but required in the final destination, the error flow path can be used to fix this. If the data item is available from a secondary system, a lookup could be used to retrieve it. If the data item is not available elsewhere, then perhaps a default value can be used instead.
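For example, a Derived Column transform on the error path could supply a default where the value is missing. A minimal sketch in the SSIS expression language follows; the SpecialtyCode column and the "UNK" default are illustrative assumptions:

    ISNULL(SpecialtyCode) ? "UNK" : SpecialtyCode

Where the value is available from a secondary system, the expression would instead reference the column returned by a preceding Lookup transform.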

In other situations, the data may be out of range for the process or destination. If the data causes an integrity violation, the failed rows could be used to populate the constraining reference with new values, after which the data itself could be processed successfully. If a data type conflict occurs, then maybe a simple truncation would suffice, or an additional set of logic could be applied to try to detect the real value, such as with date and time values held in strings. The data could then be converted into the required format.
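As a sketch of both options in the SSIS expression language, a Derived Column transform on the error path could truncate an over-length value or attempt to convert a date held in a string; the RawDescription and RawDateText column names and the length of 50 are illustrative assumptions:

    SUBSTRING(RawDescription, 1, 50)
    (DT_DBTIMESTAMP)RawDateText

If the conversion can still fail, the Derived Column transform's own error output can be used to catch the rows that cannot be interpreted.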

When assumptions or fixes have been made to data in this way, it is best practice to always mark rows as having been manipulated so that if additional information becomes available later, they can be targeted directly. In addition, whenever an assumption is made, it should be clearly identified as such to the end user.
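One simple way to mark such rows is to add a flag column with a Derived Column transform on the recovery path; the IsAssumedValue column name below is an illustrative assumption:

    Derived Column Name: IsAssumedValue    Expression: TRUE    Data Type: DT_BOOL

The flag is carried through to the destination, making manipulated rows easy to identify and reprocess if better information arrives later.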

All of the scenarios described above revolve around trying to recover from poor data within the pipeline and the current session, allowing processing to continue and ideally fixing the problem so that the data is recovered. This is a new concept compared with DTS and several other products, but the ability to fix errors in real time is a very valuable option that you should always consider when building solutions.

The obvious question that then occurs is this: Why not include the additional transformations used to correct and cleanse the data in the main Data Flow path, so that any problems are dealt with before they cause an error? All data would then flow down a single path, and the overall Data Flow design might appear simpler, with no branching and merging flows. This is where the scalability factor should come into your solution design. Ideally, you would always build the simplest Data Flow possible, using as few transformations as possible. The less work you perform, the greater the performance, and therefore the scalability.

Figure 9-32 illustrates a simple Data Flow used to load some data. In this contrived scenario, some of the rows will be missing values for SpecialtyCode and ConsultantCode. The source data also contains text descriptions, so these are used in a lookup to retrieve the missing values. The initial design logic is that you evaluate each column for NULL values in a Conditional Split transform. Bad rows are directed to an alternate output that connects to the Lookup transform. Once the lookup has populated the missing value, the rows are fed back into the main pipeline through the Union All transform. The same pattern is followed for both the SpecialtyCode and ConsultantCode columns, ensuring that the final insert through the OLE DB Destination has all good data. This is the base design for solving the problem, and it follows the procedural logic quite closely.

Figure 9-32
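
The Conditional Split in this design might use expressions such as the following to identify the bad rows; this is a sketch in the SSIS expression language, and the output names are illustrative:

    Missing SpecialtyCode   : ISNULL(SpecialtyCode)
    Missing ConsultantCode  : ISNULL(ConsultantCode)

Rows that satisfy neither condition follow the default output directly towards the destination.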

Figure 9-33 shows two alternative Data Flow designs, presented side by side for easy comparison. In the first design, you disregard any existing data in the SpecialtyCode and ConsultantCode columns and populate them entirely through the lookups. Although this may seem like wasted effort, the overall design is simpler, and in testing it was 2% faster than the more complicated design in Figure 9-32. This was with a test data set that had a bad row ratio of 1 in 3; that is, one row in three had missing values. When the ratio of bad rows dropped to 1 in 6, the two methods performed the same.

Figure 9-33

The second design assumes that all data is good until proven otherwise, so you insert directly into the destination. Rows that fail due to the missing values pass down the error output, "OLE DB Destination Error Output," and are then processed through the two lookups. The choice between the two designs is whether you fix all rows or only those that fail. Using the 1 in 3 bad rows test data, fixing only the failed rows was 20% faster than fixing all rows. When the bad row ratio dropped to 1 in 6, the performance gain also dropped, to only 10%.

As demonstrated by the previous examples, the decision on where to include the corrective transformations is based on the ratio of good rows to bad rows, weighed against how much work is required to validate the quality of the data. The cost of fixing the data should be excluded from this comparison if possible, as that work is required regardless of the design, but often the validation and the fix are inseparable.

The performance characteristics of the corrective transforms should also be considered. The examples above used lookups, which are inherently expensive transforms. The test data and lookup reference data included only six distinct values to minimize the impact on the overall testing. Lookups against reference data with more distinct values, that is, with higher cardinality, will be more expensive, as the caching becomes less effective and itself consumes more resources.

In summary, the more expensive the verification, the more bad rows you require to justify adding the validation and fix to the main flow. For fewer bad rows, or a more expensive validation procedure, you have increased justification for keeping the main flow simple and for using the error flow to perform the fix-up work.

The overall number of rows should also influence your design, since any advantages or disadvantages will be amplified as the row count grows, regardless of the ratio. For a smaller number of rows, the fixed costs may outweigh the benefits: every component has a cost to manage at runtime, so a more complicated Data Flow may not be worthwhile when the overall row count is small.

This concept of using error flows rather than the main flow to correct data quality issues and related errors is not confined to components that implement an error output explicitly. You can apply the same logic manually, primarily through the Conditional Split transform, as shown in the first example, Figure 9-32. You perform a simple test to detect potential issues and direct rows of differing quality down different outputs. Where expensive operations are required, your goal is to ensure that as few rows as possible follow that path and that the majority of rows follow cheaper, and usually simpler, routes to their destination.
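For example, a Conditional Split expression such as the following could serve as the cheap test, so that only suspect rows are sent down the expensive fix-up path; this is a sketch, and the column name is an illustrative assumption:

    ISNULL(SpecialtyCode) ? TRUE : LTRIM(RTRIM(SpecialtyCode)) == ""

Rows matching this condition are routed to the fix-up output, while the majority follow the default output to the destination.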

Finally, do not be put off by the name; error outputs are not something to be avoided at all costs. Component developers often take advantage of the rich underlying pipeline architecture, using error outputs as a simple way of indicating the result of a transformation for a given row. Rows directed down an error output do not affect the overall success or failure state of the package, so don't hesitate to use them.

You should be aware that the performance figures quoted here are for indicative purposes only. They illustrate the differences in the methods described but should not be taken as literal values that you can expect to reproduce, unless you're using exactly the same design, data, and environment. The key point is that testing such scenarios should be a routine part of your development practice.


