Clearly, data profiling should be done on all data quality assessment projects as well as on all IT projects that either move data to another structure or migrate or consolidate data. If a data quality assurance department uncovers significant facts about a data source through the outside-in method, they should profile the data source after the fact to determine the extent of inaccuracies and to discover any additional inaccuracy problems that may exist in the same data.
It is important to do all steps of the data profiling process whenever it is used. Analysts who think they can bypass a step because they understand the structure or believe that rules are enforced by the application programs will often be surprised by discoveries that go beyond what they expected. The biggest task in data profiling is
Important databases should be reprofiled periodically. The rationale is that changes to applications occur all the time. Industry experts have consistently estimated that production applications incur a change of 7% every year. Many of these changes have the potential to introduce new opportunities for generating inaccurate data. Other changes, such as business process changes or personnel changes, can introduce the possibility that data accuracy will deteriorate.
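One way to picture a reprofiling pass is as a comparison between a saved baseline profile and a freshly computed one. The following is only a minimal sketch under stated assumptions: the function names `profile_column` and `compare_profiles`, the 5% null-rate tolerance, and the sample values are all hypothetical and are not taken from this book or from any particular profiling product.

```python
def profile_column(values):
    """Build a very small profile of one column: null rate and value set."""
    non_null = [v for v in values if v is not None]
    return {
        "null_rate": 1 - len(non_null) / len(values) if values else 0.0,
        "values": set(non_null),
    }


def compare_profiles(baseline, current, null_tolerance=0.05):
    """List differences between an earlier profile and a newer one.

    null_tolerance is an assumed threshold chosen for illustration only.
    """
    findings = []
    if current["null_rate"] - baseline["null_rate"] > null_tolerance:
        findings.append("null rate increased")
    new_values = current["values"] - baseline["values"]
    if new_values:
        findings.append(f"new values: {sorted(new_values)}")
    return findings


# Hypothetical snapshots of the same column taken at two points in time.
baseline = profile_column(["A", "B", "A", "C", "B", "A"])
current = profile_column(["A", "B", None, "X", None, "A"])

findings = compare_profiles(baseline, current)
for finding in findings:
    print(finding)
```

A real reprofiling exercise would cover far more than null rates and value sets, but the shape is the same: the earlier profile supplies the expectations, and only the differences need fresh investigation.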
Once data profiling has been done on a source one time, much of the initial work has already been done, making a reprofiling exercise go much faster. Data profiling should be done on data sources after remedies have been implemented and a period of time
Data profiling of secondary, derivative data stores is also helpful. For example, profiling the data warehouse can reveal problems that exist only at the data warehouse level. Aggregating and integrating data from multiple data sources can generate conditions that are illogical and discoverable only in the aggregation. For example, two data sources that maintain the same information at different levels of granularity will populate a data warehouse column with unusable data, even though each data source on its own would pass data profiling just fine.
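The granularity problem can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the two sources, their sample rows, the period-key formats, and the helper names `period_grain` and `profile_grain` are hypothetical. It shows each source looking consistent in isolation while the combined warehouse column mixes daily and monthly figures.

```python
from collections import Counter

# Hypothetical extracts: source A records sales per day, source B per month.
source_a = [("2024-01-01", 120.0), ("2024-01-02", 95.5), ("2024-01-03", 110.25)]
source_b = [("2024-01", 3150.0), ("2024-02", 2980.0)]


def period_grain(period):
    """Classify a period key as day or month, judged by its shape."""
    parts = period.split("-")
    if len(parts) == 3:
        return "day"
    if len(parts) == 2:
        return "month"
    return "unknown"


def profile_grain(rows):
    """Count how many rows fall at each level of granularity."""
    return Counter(period_grain(period) for period, _ in rows)


# Profiled separately, each source uses a single granularity.
grain_a = profile_grain(source_a)
grain_b = profile_grain(source_b)

# Loaded into one warehouse column, the amounts are no longer comparable.
warehouse_sales = source_a + source_b
grain_wh = profile_grain(warehouse_sales)
mixed = len(grain_wh) > 1

print("source A:", dict(grain_a))
print("source B:", dict(grain_b))
print("warehouse:", dict(grain_wh), "mixed granularity:", mixed)
```

Summing or averaging the warehouse column would blend daily and monthly totals, which is the kind of condition that only profiling at the warehouse level would expose.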
Data profiling is described in this book as a generic technology. Any specific implementation of software and process to support it will be more or less complete for each step. For example, in value analysis you could invent new analytical techniques endlessly to micro-define what is acceptable. Similarly, you can invent rules for business objects almost without end.
Data profiling is emerging as a unique and independent technology. However, analysts have performed data profiling throughout the
The difference is that the analyst lacked a set of analytical tools designed
The emergence of a discrete methodology
It is easy to get into "analysis paralysis" when performing data profiling by trying to micro-define correctness to the ultimate level and then burning up machines for days trying to validate those definitions. At some point the process yields too little for the effort to be worthwhile. Practitioners need to find the right balance to get the most value from the work being performed.
Although overanalyzing data is a risk, it rarely happens in practice. The most common failing is performing too little analysis. Too often the
Used effectively, data profiling can be a