PROCESS OF KNOWLEDGE DISCOVERY
The continuous explosion of data has prompted the development of data mining (DM), or knowledge discovery in databases (KDD), which derives concrete and concise information from data. DM is defined as an interactive, iterative, nontrivial process of deriving valid, interesting, accurate, potentially useful, and ultimately comprehensible structures from data (Fayyad et al., 1996; Freitas, 2001). The data mining process is usually divided into several subtasks, as illustrated in Figure 1. The following paragraphs describe each step of the DM process as applied to m-business data.
Figure 1: The data mining process
Ascertain Business Objective
Before commencing any data mining process, a business must clearly identify and define its goals, objectives, limitations, and challenges with regard to its operations and its economic and financial situation. It is essential to choose the business areas where DM should be applied to measure and predict possible future outcomes. Performing data mining without a clear objective will most likely incur additional cost without any benefit (Cabena, Hadjinian, Stadler, Verhees, & Zanasi, 1997).
The initial phase is data collection, in which data from multiple sources are collated according to the business objectives. Since each source delivers data in a different format, the data are merged into a common set of data formats. This is an important task for m-business data, since massive amounts of data are gathered from various user transactions and browsing activity.
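The merging of differently formatted sources can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two sources, their field names (`uid`, `ts`, `user`, `time`), the date layouts, and the target record layout are hypothetical and are not taken from any particular m-business system.

```python
from datetime import datetime

# Hypothetical raw records from two sources with different layouts.
transaction_rows = [
    {"uid": "u1", "ts": "2005-03-01 10:15:00", "amount": "4.50"},
]
browsing_rows = [
    {"user": "u1", "time": "01/03/2005 10:20", "page": "/ringtones"},
]

def from_transaction(row):
    """Map a transaction record onto the assumed common format."""
    return {
        "user_id": row["uid"],
        "timestamp": datetime.strptime(row["ts"], "%Y-%m-%d %H:%M:%S"),
        "event": "purchase",
        "detail": float(row["amount"]),
    }

def from_browsing(row):
    """Map a browsing record onto the assumed common format."""
    return {
        "user_id": row["user"],
        "timestamp": datetime.strptime(row["time"], "%d/%m/%Y %H:%M"),
        "event": "page_view",
        "detail": row["page"],
    }

# Convert both sources, then order the merged records by time.
merged = sorted(
    [from_transaction(r) for r in transaction_rows]
    + [from_browsing(r) for r in browsing_rows],
    key=lambda rec: rec["timestamp"],
)
```

Once every source has its own small adapter function, adding a further source only requires one more adapter; the downstream steps see a single record layout.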
A prerequisite for successful data mining is clean and well-understood data. Because the data are initially derived from multiple sources, the merged data are likely to be incomplete, noisy, and inconsistent. To perform data mining effectively and efficiently, it is essential to apply a set of preprocessing techniques that improve data quality. The preprocessing phase includes the following three main steps.
Data cleaning: This step rectifies missing values, incorrect attribute values (noisy data), and inconsistent data within the datasets, so that the cleaned data can be treated as accurate and complete.
Data transformation: To apply data mining effectively, the cleaned dataset is transformed into a standard format. The transformation operations include normalization, aggregation, smoothing, generalization, and attribute construction.
Data reduction: To perform data mining efficiently, the goal is to obtain a reduced dataset that is still representative of the overall cleaned data. This "ideal" dataset, derived by applying data reduction techniques, should yield better DM results with faster processing. Reduction includes removing irrelevant attributes that "confuse" the data mining procedure. For example, an attribute that has a unique value for each tuple in the database should be removed, since it has no predictive power and does not support any generalization.
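The three preprocessing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows and attribute names are hypothetical, mean imputation stands in for data cleaning, min-max scaling stands in for normalization, and the reduction step drops only attributes whose value is unique in every tuple (on a tiny sample a genuinely useful numeric attribute could also look unique, so a real check would need more care).

```python
# Hypothetical customer rows; None marks a missing value.
rows = [
    {"customer_id": 101, "age": 22, "monthly_spend": 30.0},
    {"customer_id": 102, "age": None, "monthly_spend": 50.0},
    {"customer_id": 103, "age": 24, "monthly_spend": 10.0},
    {"customer_id": 104, "age": 22, "monthly_spend": 50.0},
]

def fill_missing_with_mean(rows, attr):
    """Data cleaning: replace missing values of attr with the attribute mean."""
    known = [r[attr] for r in rows if r[attr] is not None]
    mean = sum(known) / len(known)
    return [{**r, attr: mean if r[attr] is None else r[attr]} for r in rows]

def min_max_normalize(rows, attr):
    """Data transformation: rescale attr linearly into the range [0, 1]."""
    values = [r[attr] for r in rows]
    lo, hi = min(values), max(values)
    span = hi - lo
    return [{**r, attr: 0.0 if span == 0 else (r[attr] - lo) / span} for r in rows]

def drop_unique_attributes(rows):
    """Data reduction: remove attributes whose value differs in every tuple."""
    unique = [a for a in rows[0] if len({r[a] for r in rows}) == len(rows)]
    return [{k: v for k, v in r.items() if k not in unique} for r in rows]

cleaned = fill_missing_with_mean(rows, "age")
transformed = min_max_normalize(cleaned, "monthly_spend")
reduced = drop_unique_attributes(transformed)
```

In this sample, `customer_id` is the only attribute with a distinct value per tuple, so it is the one removed by the reduction step.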
The data mining phase is concerned with analysing the data using mining techniques to derive hidden and unexpected patterns and relationships from the cleaned data. The task is to select a model that fits the end users' needs. There are four main operations associated with data mining techniques.
Predictive modelling: If the goal is to predict future needs, the most suitable operation is predictive modelling. It predicts future events from previous data by recognising the distinct characteristics of the dataset. A model such as a decision tree or a neural network is derived by analysing these characteristics, and it is developed over a training phase and a testing phase. There are two basic specializations within predictive modelling, classification and value prediction, according to the type of variable being inferred. Classification assigns each record in the dataset to one class from a finite set of classes. Value prediction, or regression, develops a model that estimates a continuous numeric value associated with a particular record.
Clustering: The purpose of this operation is to partition data into segments, or clusters. Records within a segment are highly similar to each other, while records in different segments have low similarity. This separates homogeneous records (those in close proximity) from heterogeneous records (those that are not similar to each other). The operation can be applied using either demographic or neural clustering methods. The two methods are distinguished by (1) the types of input data allowed, (2) the method of calculating the similarity between records, and (3) the way in which the resulting segments are organised for analysis.
Link analysis: This operation is concerned with establishing links between individual records or sets of records. There are three specializations of link analysis: association discovery, sequential pattern discovery, and similar time sequence discovery. The underlying principle of link analysis is to determine the confidence level of an association. The association rule A → B with a confidence level of 60% means that if a customer has item A, there is a 60% chance that the customer also wants item B. This gives businesses the opportunity to understand users' preferences and needs.
Deviation detection: This operation is concerned with detecting anomalies, or unusual activities, within a dataset using summarization and graphical representation. Its results alert the business to changes deemed material to its well-being.
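The confidence level used in link analysis can be made concrete with a small worked example. The sketch below computes the confidence of a rule as the fraction of transactions containing the antecedent that also contain the consequent; the six baskets are hypothetical data chosen so that the rule A → B comes out at the 60% level discussed above.

```python
# Hypothetical transactions; each set holds the items of one customer.
baskets = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"A"},
    {"B", "C"},
]

def confidence(baskets, antecedent, consequent):
    """Share of baskets containing the antecedent that also contain the consequent."""
    with_antecedent = [b for b in baskets if antecedent <= b]
    if not with_antecedent:
        return 0.0  # the rule never applies, so report no confidence
    with_both = [b for b in with_antecedent if consequent <= b]
    return len(with_both) / len(with_antecedent)

# Item A appears in five baskets, and three of those also contain B.
conf_a_b = confidence(baskets, {"A"}, {"B"})
```

Confidence is not symmetric: the same baskets give a different value for B → A, because the denominator then counts baskets containing B.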
The data mining phase includes the selection of data mining operations and then appropriate solving techniques.
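As one illustration of pairing an operation with a solving technique, deviation detection could be implemented with a simple summary statistic. The sketch below flags values that lie more than a chosen number of standard deviations from the mean. The z-score rule, the threshold of 2.0, and the daily message counts are all assumptions made for this example and are not the specific technique described in the text.

```python
from statistics import mean, stdev

# Hypothetical number of messages sent per day by one subscriber.
daily_message_counts = [102, 98, 105, 99, 101, 97, 103, 240, 100, 104]

def flag_deviations(values, threshold=2.0):
    """Return (index, value) pairs lying more than threshold standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]

deviations = flag_deviations(daily_message_counts)
```

A single extreme value inflates both the mean and the standard deviation, so on real data a more robust summary (for example one based on the median) may be preferable.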
The final phase, postprocessing, involves analysing the mined results. When the mined results are judged insufficient, preprocessing and data mining are repeated iteratively until adequate and useful information has been obtained. Once useful patterns and information have been mined, the postprocessing phase ensures the assimilation of knowledge. This involves two main challenges: (1) presenting the new findings to the business in an understandable and convincing way and (2) formulating ways to exploit the new information thoroughly to benefit the business.
Data Mining Versus Traditional Querying and Reporting Tools
Traditionally, the querying and reporting tools of relational database management systems are used to identify specific trends and patterns within the huge amounts of daily transaction data created by m-business. Users of these tools must already know what kind of information is to be accessed and analysed. Data mining, on the other hand, allows the user to uncover unknown facts, i.e., information hidden in the data. This type of data extraction allows business users to seek out new business opportunities and previously unknown data patterns. A further limitation of traditional database queries and reporting tools lies in the kind of output they produce. They can answer well-defined questions such as "which mobile service is the most used by users between 20 and 25 years of age", whereas data mining enables users to pose more complex queries. For example, DM can estimate the sales for the years of interest from the previous year's data. Producing the same output with a traditional SQL query is computationally very expensive. In addition, management of the time dimension is not well supported in the relational model.
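The contrast between a query and a prediction can be sketched in Python. The first part answers a question about values already stored; the second fits a least-squares line to past figures and extrapolates one year ahead, which is a very simple form of value prediction. The sales figures are hypothetical, and a straight-line fit is only an assumed stand-in for whatever model a real DM tool would build.

```python
# Hypothetical yearly sales figures (in thousands).
yearly_sales = {2001: 120.0, 2002: 150.0, 2003: 180.0, 2004: 210.0}

# Query-style question: which recorded year had the highest sales?
best_year = max(yearly_sales, key=yearly_sales.get)

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    x_mean = sum(x for x, _ in points) / n
    y_mean = sum(y for _, y in points) / n
    sxx = sum((x - x_mean) ** 2 for x, _ in points)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Prediction-style question: what might sales be in a year not yet recorded?
slope, intercept = fit_line(list(yearly_sales.items()))
forecast_2005 = slope * 2005 + intercept
```

The query can only return a value that already exists in the data, while the fitted model produces an estimate for a year that has no stored record.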