7.5 When Should Data Profiling Be Done?


Clearly, data profiling should be done on all data quality assessment projects as well as on all IT projects that move data to another structure or that migrate or consolidate data. If a data quality assurance department uncovers significant facts about a data source through the outside-in method, it should profile the data source after the fact to determine the extent of inaccuracies and to discover any additional inaccuracy problems that may exist in the same data.

It is important to perform all steps of the data profiling process whenever it is used. Analysts who think they can bypass a step because they understand the structure or believe that rules are enforced by the application programs will often be surprised by discoveries that contradict their assumptions. The biggest task in data profiling is generally getting started: gathering known metadata and getting the data extracted. Once you have done all of this and gone through the first step of data profiling, the other steps do not add that much more time to the process. Ending early risks missing inaccuracies, whereas finishing costs little.

Important databases should be reprofiled periodically. The rationale is that changes to applications occur all the time. Industry experts have consistently estimated that production applications change by 7% every year. Many of these changes have the potential to introduce new opportunities for generating inaccurate data. Other changes, such as business process changes or personnel changes, can introduce the possibility that data accuracy will deteriorate.

Once data profiling has been done on a source one time, much of the initial work has already been done, making a reprofiling exercise go much faster. Data profiling should also be repeated on data sources after remedies have been implemented and enough time has passed for them to have an impact. This is a good way to measure the effectiveness of the remedies as well as to ensure that new problems have not been introduced.
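As a rough illustration of measuring a remedy by reprofiling, the sketch below applies the same rules to an extract taken before the remedy and to one taken after it, then compares the violation rates. The record layout, the rule names, and the sample values are all invented for this example; they are assumptions, not part of any particular profiling product.

```python
# Hypothetical sketch: field names, rules, and sample data are invented.

def violation_rate(records, rule):
    """Fraction of records that fail the rule (0.0 for an empty input)."""
    if not records:
        return 0.0
    failures = sum(1 for r in records if not rule(r))
    return failures / len(records)

def compare_profiles(before, after, rules):
    """Map each rule name to (rate_before, rate_after, change)."""
    report = {}
    for name, rule in rules.items():
        b = violation_rate(before, rule)
        a = violation_rate(after, rule)
        report[name] = (b, a, a - b)
    return report

# Assumed data rules for an order record.
rules = {
    "quantity_positive": lambda r: r["quantity"] > 0,
    "ship_not_before_order": lambda r: r["ship_day"] >= r["order_day"],
}

# Extract taken before the remedy was implemented.
before = [
    {"quantity": 5, "order_day": 10, "ship_day": 12},
    {"quantity": 0, "order_day": 11, "ship_day": 13},
    {"quantity": 3, "order_day": 12, "ship_day": 14},
    {"quantity": -1, "order_day": 13, "ship_day": 15},
]
# Extract taken some time after the remedy went live.
after = [
    {"quantity": 5, "order_day": 20, "ship_day": 22},
    {"quantity": 2, "order_day": 21, "ship_day": 19},
    {"quantity": 3, "order_day": 22, "ship_day": 24},
    {"quantity": 1, "order_day": 23, "ship_day": 25},
]

report = compare_profiles(before, after, rules)
for name, (b, a, change) in report.items():
    status = "improved" if change < 0 else "worse" if change > 0 else "unchanged"
    print(f"{name}: {b:.2f} -> {a:.2f} ({status})")
```

In this invented data the quantity rule improves while the shipping rule gets worse, which is the kind of newly introduced problem a second profiling pass is meant to catch.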

Data profiling of secondary, derivative data stores is also helpful. For example, profiling the data warehouse can reveal problems that exist only at the data warehouse level. Aggregating and integrating data from multiple data sources can generate conditions that are illogical and discoverable only in the aggregation. For example, two data sources that maintain the same information at different levels of granularity will populate a data warehouse column with unusable data, even though each data source, profiled on its own, would pass just fine.
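The granularity problem can be sketched in a few lines. In this made-up example, one source records location as a state code and the other as a city name; each passes a domain check against its own documented values, but the combined warehouse column fails a check against the single domain it is documented to hold. The domains, column contents, and the `violations` helper are assumptions for illustration only.

```python
# Hypothetical illustration: all names and values below are invented.
STATE_CODES = {"TX", "CA", "NY"}
CITY_NAMES = {"Austin", "Fresno", "Albany"}

def violations(values, domain):
    """Return the values that fall outside the expected domain."""
    return [v for v in values if v not in domain]

source_a_location = ["TX", "CA", "NY", "TX"]        # state-level granularity
source_b_location = ["Austin", "Fresno", "Albany"]  # city-level granularity

# Each source passes when profiled against its own documented domain.
a_bad = violations(source_a_location, STATE_CODES)
b_bad = violations(source_b_location, CITY_NAMES)

# The warehouse column is assumed to be documented as holding state codes.
warehouse_location = source_a_location + source_b_location
w_bad = violations(warehouse_location, STATE_CODES)

print("source A violations:", a_bad)
print("source B violations:", b_bad)
print("warehouse violations:", w_bad)
```

Only the warehouse-level check reports problems, which is why the derivative store needs its own profiling pass.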

7.6 Closing Remarks

Data profiling is described in this book as a generic technology. Any specific implementation of software and process to support it will be more or less complete for each step. For example, in value analysis you could invent new analytical techniques endlessly to micro-define what is acceptable. Similarly, you can invent rules for business objects seemingly without end.

start sidebar

Data profiling is emerging as a unique and independent technology. However, analysts have performed data profiling throughout the years on all projects. You always have a phase of collecting information about data sources, mapping to targets, identifying issues, and crafting remedies.

The difference is that analysts lacked a set of analytical tools designed specifically for the task of data profiling. They used ad hoc queries to test data. As a result, they generally did not have the time or resources to perform rigorous data profiling, and they shortchanged the part about looking at the data. This meant that they tended to accept the gathered descriptions of data even though those descriptions could not be trusted. The result has been a high rate of project failure or significant overruns.

One very experienced analyst once told me that he had been doing projects for 20 years and that at the beginning of each project he promised himself that he would do it right that time. Doing it right meant thoroughly looking at the data to verify all gathered facts about the data, to uncover undocumented issues, and to discover the true state of data quality. In every case, he ended up stopping his investigation early in the project because of the enormous time and resources required to complete a comprehensive examination. In all cases the project suffered later because of information about the data that was missed.

The emergence of a discrete methodology backed by software explicitly crafted for these tasks has greatly reduced the time and effort required to perform thorough data profiling. This is enabling data quality staff to use this approach effectively.

end sidebar

It is easy to get into "analysis paralysis" in performing data profiling by trying to micro-define correctness to the ultimate level and then burning up machines for days trying to validate the resulting rules. At some point the process yields too little for the effort to be worthwhile. Practitioners need to find the right balance to get the most value from the work being performed.

Although overanalyzing data is a risk, it rarely happens in practice. The most common failing is not performing enough analysis. Too often the desire to get results quickly means the process is rushed, with too few rules defined and too little thinking about the data.

Used effectively, data profiling can be a core competency technology that will significantly improve data quality assessment findings, shorten the implementation cycles of major projects by months, and improve the understanding of data for end users. It is not the only technology that can be used. However, it is probably the single most effective one for improving the accuracy of data in our corporate databases.