TECHNICAL ISSUES | Data Mining: Opportunities and Challenges


Chapter IX - The Pitfalls of Knowledge Discovery in Databases and Data Mining
Data Mining: Opportunities and Challenges
by John Wang (ed)
Idea Group Publishing 2003



DM raises many technical challenges, and these can span a wide spectrum of possibilities. This discussion, however, is limited to the pitfalls of certain DM methods and to other general technical issues affecting a standard IT department. The DM methods reviewed for pitfalls are neural networks, decision trees, genetic algorithms, fuzzy logic, and data visualization. The general technical issues covered relate to the requirements of disaster planning.

When reviewing DM for technical issues, an obvious point that new users might initially miss is that DM does not happen automatically just by loading software onto a computer. Many issues must be painstakingly thought out before moving forward with DM. No single DM method is superior among the more than 25 different types of methods currently available. If one specific method were best and had no disadvantages, there would be no need for all of these different types of methods. One company even noted, with regard to DM, that there "is no predictive method that is best for all applications" (International Knowledge Discovery Institute, 1999). Therefore, when a company begins dealings with prospective clients, after having already gained a thorough understanding of the client's problem, it will use 8 to 10 different DM methods in an effort to find the best application for the client. Listed below are a few examples of DM methods in general use.

The Problems with Neural Networks (NNs)

Neural networks (or neural nets) are computer functions that are programmed to imitate the decision-making process of the human brain. NNs are one of the oldest and most frequently used techniques in DM. An NN is able to choose the best solution for a problem that has many different possible outcomes. Even though it is not known how human memory works, the way people learn through constant repetition is not a mystery. One of the most important abilities that humans possess is the capability to infer knowledge. NNs are very useful not because the computer arrives at a solution that a human would not arrive at, but because it arrives at that solution much faster. "In terms of application, NNs are found in systems performing image and signal processing, pattern recognition, robotics, automatic navigation, prediction and forecasting, and a variety of simulations" (Jones, 2001).

There are several possible pitfalls regarding NNs. One issue is related to learning: training can take a large amount of time and resources to complete. Since the results or the data being mined are often time critical, this can pose a large problem for the end-user. Therefore, most articles point out that NNs are better suited to learning on small to medium-sized datasets, because training becomes too time-inefficient on large datasets (Hand, Mannila, & Smyth, 2001).
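The training cost described above can be made concrete with a minimal sketch. It is not taken from the chapter: it assumes the simplest possible network (a single perceptron), a hypothetical four-row dataset, and illustrative names such as `train_perceptron` and `passes`. It shows only that learning proceeds by repeated passes over the data, so the work grows with the number of examples multiplied by the number of passes.

```python
def train_perceptron(samples, epochs=20, lr=1.0):
    """Train a single-neuron perceptron; return (weights, bias, passes)."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * n_inputs
    bias = 0.0
    passes = 0  # example presentations: a rough proxy for training cost
    for _ in range(epochs):
        errors = 0
        for inputs, target in samples:
            passes += 1
            activation = sum(w * x for w, x in zip(weights, inputs)) + bias
            predicted = 1 if activation > 0 else 0
            delta = target - predicted
            if delta != 0:
                # Learning by repetition: nudge the weights after each mistake.
                errors += 1
                weights = [w + lr * delta * x for w, x in zip(weights, inputs)]
                bias += lr * delta
        if errors == 0:
            break  # a full pass with no mistakes: training has converged
    return weights, bias, passes


def predict(weights, bias, inputs):
    """Classify one input vector with the trained perceptron."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


# Hypothetical toy data: the logical AND of two binary inputs.
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias, passes = train_perceptron(AND_DATA)
print("example presentations needed:", passes)
```

Even this four-row problem needs several complete passes before the weights settle. A realistic network has many more weights and many more rows, which is why training time becomes a concern on large datasets.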

Also, NNs lack explicitness. As several researchers have pointed out, the process an NN goes through is considered by most to be hidden and is therefore left unexplained. One article summarized, "It is almost as if the pattern-discovery process is handled within a black box procedure" (Electronic Textbook Statsoft.com, 2001). This lack of explicitness may lead to less confidence in the results and to an unwillingness to apply those results from DM, since there is no understanding of how the results came about. As the number of variables in a dataset increases, it becomes more difficult to understand how the NN came to its conclusions.

Another possible weakness is that the program's capability to solve problems can never exceed that of the user. This means that the computer will never be able to produce a solution that a human could not produce if given enough time. As a result, the user has to program problems and solutions into the computer so that it can decide which solutions are best. If the user has no answer(s), chances are neither will the computer.

A final drawback of NNs noted in one paper is that "there is a very limited tradition of experience on which to draw when choosing between the nets that are on offer" (Australian National University, 2001). This lack of knowledge or general review regarding which type of NN is best might result in the purchase of an application that will not make the best predictions or that will simply work poorly in general.

The Problems with Decision Trees (DTs)

The Decision Tree (DT) is a method that uses a hierarchy of simple if-then statements to classify data. It tends to be one of the most inexpensive methods to use. It is also faster to use, generates understandable rules more easily, and is simpler to explain, since any decision that is made can be understood by viewing the path of decisions. DTs provide an effective structure in which alternative decisions and the implications of taking those decisions can be laid down and evaluated. They also help to form an accurate, balanced picture of the risks and rewards that can result from a particular choice. However, for all its advantages, the method also has some disadvantages or possible pitfalls of which users should be aware.
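A minimal sketch can illustrate the "hierarchy of simple if-then statements" and why a decision can be explained by its path. The loan scenario, the thresholds, and the names `classify`, `income`, and `existing_debt` are hypothetical and are not drawn from the chapter.

```python
def classify(applicant):
    """Return (decision, path) for a hypothetical loan applicant.

    The nested if-then statements form a small hand-written decision tree;
    `path` records every test taken on the way to the leaf.
    """
    path = []
    if applicant["income"] >= 50000:
        path.append("income >= 50000")
        if applicant["existing_debt"] < 10000:
            path.append("existing_debt < 10000")
            return "approve", path
        path.append("existing_debt >= 10000")
        return "review", path
    path.append("income < 50000")
    return "decline", path


decision, path = classify({"income": 60000, "existing_debt": 5000})
print(decision, "because", " and ".join(path))
```

The recorded path is the whole explanation of the decision, which is the property that makes DT output easy to present to a non-technical audience.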

The problems with DTs can be divided into two categories: algorithmic problems that complicate the algorithm's goal of finding a small tree, and inherent problems with the representation. The conventional DT is useful for small problems but quickly becomes cumbersome and hard to read for intermediate-sized problems. In addition, special software is required to draw the tree. Representing all the information on the tree requires writing probability numbers and payoff numbers under the branches or at the ends of the nodes of the tree (KDNuggets, 2001).

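In DT learning, the goal is to arrive at a classification while minimizing the depth of the final tree. This is done by choosing attributes that divide the positive examples from the negative examples. If there is noise in the training set, however, some examples with identical attribute values may carry different classifications, so no choice of attributes can divide them cleanly, and DTs will fail to find a valid tree.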
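One common way to score how well an attribute divides the examples is information gain, which is based on entropy. The chapter does not name a scoring measure, so its use here is an assumption, as are the toy rows and the attribute names `income` and `region`. The sketch also shows the effect of a single noisy (contradictory) example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, attribute, label_key="label"):
    """Reduction in label entropy obtained by splitting `rows` on `attribute`."""
    base = entropy([row[label_key] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[label_key])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical training rows: "income" separates the classes, "region" does not.
ROWS = [
    {"income": "high", "region": "north", "label": "yes"},
    {"income": "high", "region": "south", "label": "yes"},
    {"income": "low", "region": "north", "label": "no"},
    {"income": "low", "region": "south", "label": "no"},
]

# One noisy row that contradicts an existing example with the same attributes.
NOISY_ROWS = ROWS + [{"income": "high", "region": "north", "label": "no"}]

best = max(["income", "region"], key=lambda a: information_gain(ROWS, a))
print("best attribute on clean data:", best)
print("gain of income, clean vs noisy:",
      information_gain(ROWS, "income"), information_gain(NOISY_ROWS, "income"))
```

On the clean rows, splitting on `income` separates the classes completely. After the contradictory row is added, no attribute can do so, which is the situation in which a tree consistent with every training example cannot be found.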

A big headache is that the data used must be interval or categorical. Therefore, any data not received in this format will have to be recoded to this format in order to be used. This recoding could possibly hide relationships that other DM methods would find. Also, overfitting can occur whenever there is a large set of possible hypotheses. A technique to deal with this problem is to use DT pruning. Pruning prevents recursive splitting if the relevance of the attribute is low.
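The pruning idea can be sketched as a stopping rule: the tree builder refuses to split when the best available attribute is barely relevant. Using an information-gain threshold as the relevance test is an assumption made for this sketch (statistical significance tests are another common choice), and the rows, the attribute names, and the `min_gain` parameter are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def split_gain(rows, attribute):
    """Information gain of splitting (features, label) rows on `attribute`."""
    groups = {}
    for features, label in rows:
        groups.setdefault(features[attribute], []).append(label)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([label for _, label in rows]) - remainder


def build_tree(rows, attributes, min_gain):
    """Grow a tree, but stop splitting when the best gain is below `min_gain`."""
    labels = [label for _, label in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: split_gain(rows, a))
    if split_gain(rows, best) < min_gain:
        return majority  # pruned: the attribute is not relevant enough
    remaining = [a for a in attributes if a != best]
    branches = {}
    for features, label in rows:
        branches.setdefault(features[best], []).append((features, label))
    return {best: {value: build_tree(subset, remaining, min_gain)
                   for value, subset in branches.items()}}


# Hypothetical rows; the last one is noise, and "id_parity" is irrelevant.
ROWS = [
    ({"owns_home": "yes", "id_parity": "even"}, "buy"),
    ({"owns_home": "yes", "id_parity": "odd"}, "buy"),
    ({"owns_home": "yes", "id_parity": "even"}, "buy"),
    ({"owns_home": "no", "id_parity": "odd"}, "skip"),
    ({"owns_home": "no", "id_parity": "even"}, "skip"),
    ({"owns_home": "no", "id_parity": "odd"}, "skip"),
    ({"owns_home": "no", "id_parity": "odd"}, "skip"),
    ({"owns_home": "no", "id_parity": "even"}, "buy"),
]

unpruned = build_tree(ROWS, ["owns_home", "id_parity"], min_gain=0.0)
pruned = build_tree(ROWS, ["owns_home", "id_parity"], min_gain=0.5)
print("unpruned:", unpruned)
print("pruned:  ", pruned)
```

Without the threshold, the tree splits on the irrelevant `id_parity` attribute in an attempt to fit the noisy row. With the threshold, that branch collapses to its majority class and the tree stays small. The threshold value itself is a judgment call and would need tuning on real data.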

Another possible drawback is that DTs generally represent a finite number of classes or possibilities. It is difficult for decision makers to reduce a problem to a finite number of variables. Due to this limitation, the accuracy of the output is limited by the number of classes selected, which may result in a misleading answer. Even worse, if the user does try to cover an acceptable number of variables, the if-then statements created become more complex as the list of variables increases.

Finally, there are two other separate drawbacks due to variable systematic risk. First, DTs are not appropriate for estimation, because complicated, but understandable, rules must be adhered to. Second, this method is not useful for all types of DM. It is "hard to use DTs for problems involving time series data unless a lot of effort is put into presenting the data in such a way that trends are made visible" (Knowledge Acquisition Discovery Management, 1997).

The Problems with Genetic Algorithms (GAs)

Genetic Algorithms (GAs) belong to evolutionary computing, which solves problems through the application of natural selection and evolution. Evolutionary computing is a component of machine learning, "a sub-discipline of artificial intelligence" (Sethi, 2001, p. 33). A GA is a "general-purpose search algorithm" (Chen, 2001, p. 257) that uses rules mimicking a population's natural genetic evolution to solve problems. Fundamentally, GAs are derivative-free (they do not need the gradient of the function being optimized) and are best utilized in optimization problems, with analogies founded in evolutionary operations such as selection, crossover, and mutation.

A chromosome, in GA terms, is a string of binary bits in which a possible solution is encoded. The "fitness function" measures the quality of a solution. Beginning with a random accumulation of potential solutions, the search creates what is termed the "gene pool or population" (Sethi, 2001, p. 37). Upon each completion of a search, the GA constructs a new pool, which has evolved with improved "fitness values." The most desirable solution is chosen from the resulting generations.
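The vocabulary above (chromosome, fitness function, population, selection, crossover, mutation) can be sketched in a few lines. The problem solved here is a standard toy task in which fitness is simply the number of 1 bits. The constants, the tournament selection rule, and the practice of carrying the best chromosome forward unchanged (elitism) are assumptions made for illustration, not details from the chapter.

```python
import random

CHROMOSOME_LENGTH = 20   # bits per candidate solution (hypothetical)
POPULATION_SIZE = 30
GENERATIONS = 40
MUTATION_RATE = 0.01     # per-bit probability of flipping


def fitness(chromosome):
    """Quality of a solution: here, the number of 1 bits."""
    return sum(chromosome)


def select(population, rng):
    """Tournament selection: the fitter of two random chromosomes wins."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def crossover(parent_a, parent_b, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    point = rng.randint(1, CHROMOSOME_LENGTH - 1)
    return parent_a[:point] + parent_b[point:]


def mutate(chromosome, rng):
    """Flip each bit independently with probability MUTATION_RATE."""
    return [1 - bit if rng.random() < MUTATION_RATE else bit
            for bit in chromosome]


def evolve(seed=0):
    """Run the GA; return the best chromosome and best fitness per generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(CHROMOSOME_LENGTH)]
                  for _ in range(POPULATION_SIZE)]
    history = [max(fitness(c) for c in population)]
    for _ in range(GENERATIONS):
        children = [max(population, key=fitness)]  # elitism: keep the best
        while len(children) < POPULATION_SIZE:
            child = crossover(select(population, rng),
                              select(population, rng), rng)
            children.append(mutate(child, rng))
        population = children  # the new, evolved gene pool
        history.append(max(fitness(c) for c in population))
    return max(population, key=fitness), history


best, history = evolve(seed=0)
print("best fitness in first and last generation:", history[0], history[-1])
```

Because the best chromosome is always copied into the next pool, the best fitness value never decreases from one generation to the next. Every generation still evaluates the fitness of the whole population, which hints at the computational effort involved.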

Despite these advantages, these algorithms have not yet been applied to very large-scale problems. One possible reason is that GAs require significant computational effort compared with other methods when parallel processing is not employed (Sethi, 2001, p. 39).

The Problems with Fuzzy Logic (FL)

Fuzzy Logic (FL), first developed by Lotfi Zadeh, uses fuzzy set theory. FL is a multi-valued (as opposed to binary) logic developed to deal with imprecise or vague data. Although there are some limitations to using FL, the main reason that it is not used very widely is that many people are misinformed about it. The difficulties of using FL are different from those of using conventional logic. In addition to technical difficulties, lack of knowledge about this rather new field also causes many problems.

Defining membership functions is a common problem. In practice, the rules are often based on trial and error rather than on the designer's intuition and knowledge. Critics argue that this process, in contrast to the rigid mathematical design process of conventional devices, could lead to undesired outcomes under abnormal circumstances (Pagallo & Haussler, 1990).
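A minimal sketch can show what a membership function is and why its definition involves judgment. The triangular shape, the temperature sets, and every breakpoint below are hypothetical choices made for illustration. The `min` and `max` operators used for fuzzy AND and OR are the classic ones from fuzzy set theory.

```python
def triangular(x, left, peak, right):
    """Degree of membership (0.0 to 1.0) in a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical fuzzy sets over temperature in degrees Celsius.
def warm(t):
    return triangular(t, 15.0, 20.0, 25.0)


def hot(t):
    return triangular(t, 20.0, 30.0, 40.0)


# The same concept with breakpoints moved by three degrees.
def warm_shifted(t):
    return triangular(t, 18.0, 23.0, 28.0)


temperature = 22.5
fuzzy_and = min(warm(temperature), hot(temperature))  # "warm AND hot"
fuzzy_or = max(warm(temperature), hot(temperature))   # "warm OR hot"

print("warm:", warm(temperature), "hot:", hot(temperature))
print("warm AND hot:", fuzzy_and, "warm OR hot:", fuzzy_or)
print("warm with shifted breakpoints:", warm_shifted(temperature))
```

A reading of 22.5 degrees belongs partly to both sets, which is the multi-valued behavior that binary logic cannot express. Moving the breakpoints of `warm` by a few degrees changes the membership value considerably, and nothing in the method itself says which set of breakpoints is correct. That is the trial-and-error step the critics point to.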

Lack of knowledge and/or misinformation about FL is the main reason it is still a relatively small field. Many critics argue that FL is a fad. They feel it will not be useful since there is no known way to guarantee that it will work under all circumstances. Also, many argue that conventional methods work just as well or nearly as well, so it is not worthwhile to develop new methods (Chen, 2001).

Another concern seems to be the language barrier. Japanese engineers have pursued FL even though the University of California was its birthplace, while many American researchers seem to have shunned it. Although technical terms often have a neutral connotation, the Japanese term for FL clearly has a positive tone (clever). In contrast to the trendy Japanese term, the English word "fuzzy" carries a negative connotation for Americans. Despite Japanese and worldwide successes, this technology is still somewhat underdeveloped in the United States.

The Problems with Data Visualization

Data visualization is a method that allows the user to gain a more intuitive understanding of the data. Generally, graphics tools are used to better illustrate relationships among the data. Data visualization also allows the user to focus on and see the patterns and trends in the data more easily. However, the major pitfall in this case is that as the volume of data increases, it can become increasingly difficult to discern accurate patterns from the data sets.

Disaster Planning

Finally, disaster recovery plans must be analyzed. This may result in the need to redesign data backup management techniques to accommodate business growth. Once a coherent plan to accommodate the current and future business needs of the corporation has been developed, it must be implemented quickly. Administrators must know that the backup systems that are in place will work each time and that the data are properly protected. These systems must be reviewed once or twice per year to ensure that they are delivering the protection that the business needs in the most cost-effective manner.

In conclusion, there are high administrative and maintenance costs to the firm for DM in terms of both time and money. Failure to recognize the investment necessary to maintain the system can lead to data corruption and system failure.

