DATA ACCURACY AND STANDARDIZATION


Another important aspect of DM is the accuracy of the data itself. It follows that poor data are a leading contributor to the failure of DM. This factor is a major business challenge. The emergence of electronic data processing and collection methods has led some to call recent times the "information age." However, the period may be more accurately described as one of "analysis paralysis." Most businesses either possess a large DB or have access to one. These DBs contain so much data that it can be quite difficult to understand what the data are telling us. Just about every transaction in the market generates a computer record somewhere. All those data have meaning with respect to making better business decisions or understanding customer needs and preferences. But discovering those needs and preferences in a DB that contains terabytes of seemingly incomprehensible numbers and facts is a big challenge (Abbot, 2001).

Accuracy and Cohesiveness

A model is only as good as the variables and data used to create it. Many dimensions of this issue apply to DM, the first being the quality and the sources of the data. Time and again, data accuracy proves imperative to crucial functions. Administrative data are not without problems, however. Of primary concern is that, unlike a purposeful data collection effort, the coding of administrative data is often not carefully quality controlled. Likewise, data objects may not be defined consistently across DBs or in the way a researcher would want. One of the most serious concerns is matching records across different DBs in order to build a more detailed individual record. In addition, administrative records may not accurately represent the population of interest, leading to issues similar to sampling and non-response bias. Transaction records and other administrative data are volatile, that is, rapidly changing over time, so a snapshot of information taken at one time may not reveal the same relationships as an equivalent snapshot taken at a later time (or using a different set of tables). Finally, programs and regulations themselves may introduce bias into administrative records, making them unreliable over time (Judson, 2000).
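To make the record-matching concern concrete, here is a minimal sketch of linking records across two DBs by comparing normalized names and addresses. The field names, sample records, and similarity threshold are hypothetical, and real record-linkage systems use far more sophisticated blocking and scoring than this simple string-similarity average.

```python
from difflib import SequenceMatcher

def normalize(s):
    """Crude normalization: lowercase, keep only letters, digits, and spaces."""
    return "".join(ch for ch in s.lower() if ch.isalnum() or ch == " ").strip()

def similarity(a, b):
    """String similarity between two normalized values (0.0 to 1.0)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_records(db_a, db_b, threshold=0.7):
    """Pair records from two DBs whose name and address look alike.
    The 'name'/'address' keys and the threshold are illustrative choices."""
    matches = []
    for a in db_a:
        for b in db_b:
            score = (similarity(a["name"], b["name"]) +
                     similarity(a["address"], b["address"])) / 2
            if score >= threshold:
                matches.append((a["name"], b["name"], round(score, 2)))
    return matches

billing   = [{"name": "J. Smith",   "address": "12 Oak St."}]
marketing = [{"name": "John Smith", "address": "12 Oak Street"}]
print(match_records(billing, marketing))
```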

No matter how huge a company's datasets may be, freshness and accuracy of the data are imperative for successful DM. Many companies have stores of outdated and duplicate data, as anyone who has ever received multiple copies of the same catalog can attest. Before a company can even consider what DM can do for its business, the data must be clean: fresh, up-to-date, and free of errors and duplication. Unfortunately, this is easier said than done. Computer systems at many large companies have grown complicated over the years, encompassing a wide variety of incompatible DBs and systems as each division or department bought and installed what it thought was best for its needs without any thought of the overall enterprise. Mergers and acquisitions may have further complicated matters by bringing together even more varied systems. "Bad data, duplicate data, and inaccurate and incomplete data are a stumbling block for many companies" (E-commag, 2001).

The first step toward proper DM results is to start with correct raw data. By mining their customer transaction data (which sometimes include additional demographic information), companies can project a long-run revenue value for each customer. Data accuracy is therefore essential to such forecasting functions; inaccurate data carry the risk of producing poor pro forma financial statements.
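As an illustration of the kind of per-customer revenue projection described above, the sketch below computes a rough long-run value from a transaction log. The transaction data, the assumed purchase frequency, and the assumed customer lifetime are all hypothetical; the chapter does not prescribe this particular formula.

```python
from collections import defaultdict

# Hypothetical transaction log: (customer_id, order_amount in dollars).
transactions = [
    ("C001", 120.0), ("C001", 80.0), ("C002", 45.0),
    ("C001", 95.0),  ("C002", 60.0),
]

def projected_revenue(transactions, orders_per_year=4, expected_years=5):
    """Rough long-run revenue projection per customer:
    average order value x assumed yearly orders x assumed tenure."""
    orders = defaultdict(list)
    for customer, amount in transactions:
        orders[customer].append(amount)
    return {customer: (sum(a) / len(a)) * orders_per_year * expected_years
            for customer, a in orders.items()}

print(projected_revenue(transactions))
```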

With accurate data, one should be able to achieve a single customer view. This will eliminate multiple counts of the same individual or household within and across an enterprise or within a marketing target. Additionally, this will normalize business information and data. With an information quality solution, one will be able to build and analyze relationships, manage a universal source of information, and make more informed business decisions. By implementing an information quality solution across an organization, one can maximize the ROI from CRM, business intelligence, and enterprise applications.

If dirty, incomplete, and poorly structured data are employed in the mining, the task of finding significant patterns in the data will be much harder. The elimination of errors, removal of redundancies, and filling of gaps in data, although tedious and time-consuming tasks, are integral to successful DM (Business Lounge, 2001).
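The sketch below shows one way such error elimination, redundancy removal, and gap filling might look in practice. The sample rows, the 0-120 age validity rule, and the choice to fill gaps with the mean are assumptions made purely for illustration.

```python
from statistics import mean

# Hypothetical raw rows: (customer_id, age, city); None marks a gap.
raw = [
    ("C001", 34, "Boston"),
    ("C001", 34, "Boston"),      # exact duplicate
    ("C002", None, "Chicago"),   # missing age
    ("C003", -5, "Denver"),      # impossible age, i.e., an error
]

def clean(rows):
    deduped = list(dict.fromkeys(rows))                    # remove redundancies
    valid = [r for r in deduped
             if r[1] is None or 0 < r[1] < 120]            # drop obvious errors
    known = [r[1] for r in valid if r[1] is not None]
    fill = round(mean(known)) if known else None
    return [(cid, age if age is not None else fill, city)  # fill gaps
            for cid, age, city in valid]

print(clean(raw))
```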

Standardization and Verification

Data in a DB or data store are typically inconsistent and lacking in conformity. In some cases there are small variations in the way even a subscriber's name or address appears in a DB. This leads to the allocation of organizational resources based on inaccurate information, which can be a very costly problem for any organization that routinely mails against a large customer DB. An added difficulty in obtaining correct source data is the sheer number of sources themselves. The number of enterprise data sources is growing rapidly, with new types of sources emerging every year. The newest source is, of course, enterprise e-business operations. Enterprises want to integrate clickstream data from their Web sites with other internal data in order to get a complete picture of their customers and integrate internal processes. Other sources of valuable data include Enterprise Resource Planning (ERP) programs, operational data stores, packaged and home-grown analytic applications, and existing data marts. The process of integrating these sources into one dataset can be complicated and is made even more difficult when an enterprise merges with or acquires another enterprise.
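One small part of bringing such varied sources into conformity is standardizing free-form fields such as addresses. The sketch below collapses common variations into a canonical form; the abbreviation table and the sample addresses are illustrative, and production systems rely on postal reference data rather than a hand-built map.

```python
import re

# Illustrative abbreviation map; real standardization uses postal reference data.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd", "apartment": "apt"}

def standardize_address(addr):
    """Reduce an address to one canonical form so that small variations
    ('12 Oak Street' vs. '12 oak St.') collapse to the same string."""
    addr = re.sub(r"[^\w\s]", "", addr.lower())   # drop punctuation
    addr = re.sub(r"\s+", " ", addr).strip()      # collapse whitespace
    return " ".join(ABBREVIATIONS.get(w, w) for w in addr.split())

print(standardize_address("12  Oak Street"))  # -> "12 oak st"
print(standardize_address("12 oak St."))      # -> "12 oak st"
```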

Enterprises also look to a growing number of external sources to supplement their internal data. These might include prospect lists, demographic and psychographic data, and business profiles purchased from third-party providers (Hoss, 2001). Enterprises might also want to use an external provider for help with address verification, where internal company sources are compared with a master list to ensure data accuracy. Additionally, some industries have their own specific sources of external data. For example, the retail industry uses data from store scanners, and the pharmaceutical industry uses prescription data that are totaled by an outsourced company.

Although data quality issues vary greatly from organization to organization, we can discuss them more effectively by referring to four basic categories of data quality that affect IT professionals and business analysts on a daily basis. These categories (standardization, matching, verification, and enhancement) make up the general data quality landscape (Moss, 2000).

Verification is the process of checking data against a known correct source. For example, if a company decided to import data from an outside vendor, the U.S. Postal Service DB might be used to make sure that the ZIP codes match the addresses and that the addresses are deliverable. If not, the organization could wind up with a great deal of undeliverable mail. The same principle applies to verifying any other type of data. Using verification techniques, a company ensures that data are correct with respect to an internal or external data source that has itself been verified and validated as correct.
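A minimal sketch of this kind of verification appears below: imported records are checked against a small reference table of ZIP codes and cities. The reference table and the records are invented stand-ins; an actual check would run against the full U.S. Postal Service data.

```python
# Hypothetical slice of a postal reference table (ZIP code -> city).
ZIP_REFERENCE = {"02134": "Allston", "60601": "Chicago", "80202": "Denver"}

def verify_addresses(records):
    """Split imported records into deliverable and suspect, depending on
    whether the ZIP code exists and matches the stated city."""
    deliverable, suspect = [], []
    for rec in records:
        expected_city = ZIP_REFERENCE.get(rec["zip"])
        if expected_city and expected_city.lower() == rec["city"].lower():
            deliverable.append(rec)
        else:
            suspect.append(rec)
    return deliverable, suspect

imported = [
    {"name": "A. Lee", "city": "Chicago", "zip": "60601"},
    {"name": "B. Kim", "city": "Chicago", "zip": "99999"},  # unknown ZIP
]
good, bad = verify_addresses(imported)
print(len(good), "deliverable,", len(bad), "suspect")
```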

Enhancement of data involves the addition of data to an existing data set or actually changing the data in some way to make it more useful in the long run. Enhancement technology allows a company to get more from its data by enriching the data with additional information.
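Enhancement can be as simple as joining an external attribute onto existing records, as in the sketch below. The demographic lookup table and customer records are hypothetical examples of the third-party data mentioned earlier.

```python
# Hypothetical third-party demographic data keyed by ZIP code.
DEMOGRAPHICS = {"60601": {"median_income": 72000, "urban": True}}

customers = [{"id": "C002", "zip": "60601"}, {"id": "C003", "zip": "99999"}]

def enhance(customers, lookup):
    """Append external attributes to each record when a match exists;
    records with no match are passed through unchanged."""
    return [{**c, **lookup.get(c["zip"], {})} for c in customers]

print(enhance(customers, DEMOGRAPHICS))
```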

Data Sampling and Variable Selection

Data accuracy can also be compromised by data sampling. This is not to say that the sampled information is necessarily wrong per se, but the sample that is taken may not be wholly representative of what is being mined.
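One common way to keep a sample representative, not discussed in detail in the chapter, is stratified sampling: drawing the same fraction from every subgroup so the sample mirrors the full dataset. The sketch below illustrates the idea with an invented "region" attribute and an arbitrary sampling fraction.

```python
import random

def stratified_sample(records, key, fraction, seed=42):
    """Draw the same fraction from every stratum so that the sample
    mirrors the composition of the full dataset."""
    random.seed(seed)
    strata = {}
    for r in records:
        strata.setdefault(r[key], []).append(r)
    sample = []
    for rows in strata.values():
        k = max(1, round(len(rows) * fraction))
        sample.extend(random.sample(rows, k))
    return sample

# Hypothetical customer records with an imbalanced 'region' attribute.
records = ([{"id": i, "region": "east"} for i in range(90)] +
           [{"id": i, "region": "west"} for i in range(90, 100)])
sample = stratified_sample(records, key="region", fraction=0.2)
print(sum(r["region"] == "west" for r in sample), "western customers in the sample")
```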

Apart from sampling, summarization may be used to reduce data sizes (Parsaye, 1995). But summarization can cause problems too. In fact, summarizing two different datasets may yield the same result, while summarizing the same dataset with two different methods may produce two different results.
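The point can be seen with a toy example: the same purchase history summarized by mean and by median tells two different stories, while two different datasets can share an identical summary. The numbers below are invented purely for illustration.

```python
from statistics import mean, median

# Same dataset, two summarization methods, two different pictures.
purchases = [20, 22, 25, 24, 21, 480]           # one very large purchase
print("mean:  ", round(mean(purchases), 1))     # ~98.7, dominated by the outlier
print("median:", median(purchases))             # 23.0, ignores the outlier

# Two different datasets, the same summary.
a = [10, 20, 30]
b = [0, 20, 40]
print(mean(a) == mean(b))                       # True: equal means hide different spreads
```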

Outliers and Noise

Outliers are values that lie far away from the rest of the values for a variable. One needs to investigate whether an outlier is a mistake caused by some external or internal factor, such as a failing sensor or a human error. If it can be established that the outlier is due to a mistake, Pyle (1999) suggests that it should be corrected by a filtering process or treated as a missing value.
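A minimal sketch of this treatment follows: values whose z-score exceeds a threshold are flagged and replaced with None, mirroring the treat-as-missing suggestion. The sensor readings and the threshold are assumptions; z-scores are only one of many outlier tests, and Pyle's filtering process is not reproduced here.

```python
from statistics import mean, stdev

def flag_outliers(values, z_threshold=2.0):
    """Replace values whose z-score exceeds the threshold with None,
    so they can later be handled like missing values."""
    mu, sigma = mean(values), stdev(values)
    cleaned = []
    for v in values:
        z = abs(v - mu) / sigma if sigma else 0.0
        cleaned.append(None if z > z_threshold else v)
    return cleaned

# Hypothetical sensor readings with one suspicious spike.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 85.0]
print(flag_outliers(readings))
```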

Noise is simply a distortion of the signal; it is integral to the nature of the world rather than the result of badly recorded values. The problem with noise is that it does not follow any easily detected pattern. Consequently, there is no easy way to eliminate noise, but there are ways to minimize its impact.
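One simple way to reduce (not remove) the impact of noise is smoothing, sketched below with a trailing moving average. The series and window size are invented, and many other smoothing and filtering techniques exist.

```python
def moving_average(series, window=3):
    """Smooth a noisy series with a trailing moving average; the noise
    is not eliminated, but its influence on each point is reduced."""
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1):i + 1]
        smoothed.append(round(sum(chunk) / len(chunk), 2))
    return smoothed

# Hypothetical noisy measurements around a slowly rising trend.
noisy = [1.0, 1.4, 0.8, 1.6, 1.1, 1.9, 1.3, 2.1]
print(moving_average(noisy))
```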

Missing Values or Null Values

Companies rely on data for a variety of reasons, from identifying opportunities with customers to ensuring a smooth manufacturing process, and it is impossible to make effective business decisions if the data are not good. In order to fully assess data quality solutions, it is helpful to have a thorough understanding of the basic types of data quality problems. As companies look to address the issue of data quality, there are five basic requirements to consider when looking for a total data quality management solution: ease of use, flexibility, performance, platform independence, and affordability (Atkins, 2000).

Because DBs are often constructed for tasks other than DM, attributes important for a DM task may be missing. Data that could have facilitated the DM process might not even be stored in the DB. Missing values can cause big problems for series data and series modeling techniques. There are many different methods for "repairing" series data with missing values, such as multiple regression and autocorrelation. Chapter 7 addresses this problem in detail.
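As a much simpler stand-in for the regression- and autocorrelation-based repairs mentioned above, the sketch below fills interior gaps in an ordered series by linear interpolation between the nearest known neighbours. The sales figures are invented, and this is only the most basic of the repair methods.

```python
def interpolate_missing(series):
    """Fill None gaps in an ordered series by linear interpolation
    between the nearest known neighbours; leading/trailing gaps stay None."""
    filled = list(series)
    for i, value in enumerate(filled):
        if value is None:
            left = next((j for j in range(i - 1, -1, -1) if filled[j] is not None), None)
            right = next((j for j in range(i + 1, len(filled)) if filled[j] is not None), None)
            if left is not None and right is not None:
                frac = (i - left) / (right - left)
                filled[i] = filled[left] + frac * (filled[right] - filled[left])
    return filled

# Hypothetical monthly sales series with two gaps.
sales = [100.0, 110.0, None, None, 140.0, 150.0]
print(interpolate_missing(sales))  # -> [100.0, 110.0, 120.0, 130.0, 140.0, 150.0]
```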

Therefore, data quality control is a serious problem. Data inconsistency and data redundancy might mire any data-mining effort. Data quality is a company-wide problem that requires company-wide support. A total data quality solution must be able to address the needs of every individual within an organization, from the business analysts to the IT personnel. To meet the needs of these diverse groups, a solution must integrate with current IT tools and still be easy to use. For companies to truly thrive and achieve successful business intelligence, data quality must be an issue that is elevated to top priority.
