DATA MINING CHALLENGES AND THEIR SOLUTIONS
With the increasing use of mobile services, it is very likely that mobile devices will make use of data mining in the future. To apply data mining efficiently in m-business, certain requirements have to be met. Ideally, the methods used for mining mobile data should be able to (1) mine different kinds of knowledge in databases, (2) deal with diverse data types such as relational, temporal, and spatial data, (3) mine information from heterogeneous databases and global information systems, (4) handle noisy and incomplete data, which is mostly the case in the m-business domain, (5) perform the mining tasks efficiently regardless of the size and complexity of the dataset, (6) support interactive mining of knowledge at multiple levels of abstraction, (7) support integration of the discovered knowledge with existing knowledge, and (8) deal with issues related to the application of discovered knowledge and its social impacts, such as protection of data security, integrity, and privacy.
The process of obtaining useful information from voluminous records of actual mobile session data calls for powerful, parallel, distributed, scalable, integrated, and incremental data mining tools. The data mining software can be developed as a collection of components that may be based on object technology. By developing data mining modules as a collection of components, one can develop generic tools and then customize them for specialised applications. This section attempts to summarise the requirements and the future issues that need to be addressed when data mining is applied in the mobile sector.
In the m-business environment, data can reside in many different geographical locations. Most data mining systems are currently based on centrally located data; data is stored in a single database and the mining techniques are focused on this dataset. As a result of convergence between computation and communication, the new data mining approaches have to be concerned with distributed aspects of computation and information storage. This means that organisations will have to implement decentralised approaches for data storage and decision support. Distributed data mining typically involves local data compression and analysis for minimisation of network traffic as well as the generation of global data models and analysis by combining local data and models (Park & Kargupta, 2002).
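The idea of building a global model from local ones can be illustrated with a minimal sketch. It assumes that each site compresses a numeric attribute (for example, session duration) into a count, a sum, and a sum of squares, and that only these summaries travel over the network. The names `LocalSummary`, `summarise`, and `merge` are hypothetical and are not taken from Park and Kargupta (2002).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class LocalSummary:
    """Compact statistics computed at one site (hypothetical structure)."""
    count: int
    total: float
    total_sq: float


def summarise(values: Iterable[float]) -> LocalSummary:
    """Compress raw local records into three numbers before transmission."""
    vals = list(values)
    return LocalSummary(len(vals), sum(vals), sum(v * v for v in vals))


def merge(summaries: List[LocalSummary]) -> Dict[str, float]:
    """Combine local summaries into a global mean and population variance."""
    n = sum(s.count for s in summaries)
    if n == 0:
        raise ValueError("no records to merge")
    total = sum(s.total for s in summaries)
    total_sq = sum(s.total_sq for s in summaries)
    mean = total / n
    variance = total_sq / n - mean * mean
    return {"count": n, "mean": mean, "variance": variance}


# Illustrative session durations (seconds) held at two separate sites.
site_a = summarise([30.0, 45.0, 60.0])
site_b = summarise([20.0, 40.0])
global_model = merge([site_a, site_b])
```

Each site sends three numbers regardless of how many records it holds, which is one simple way to keep network traffic low; richer models would need their own merge rules.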
Integrating data mining applications, data mining systems, and business processes effectively will support e-business and m-business environments. For data mining in a distributed environment where data is collected from multiple sources, XML is proving to be an ideal means of realizing this potential. Every mobile device is able to transmit XML documents that can be read and processed easily, regardless of which platform the mobile device is running on. XML provides the facilities to exchange data on the Web as well as wirelessly, between applications or between users and applications, in a flexible and extensible representation (Graves, 2002).
However, incomplete, noisy, and inconsistent data are common. Even though the imperfect dataset is cleaned during the preprocessing step, the data are never cleaned perfectly. Appropriate procedures should be developed to cleanse the received and integrated imperfect data.
At present, the Web environment has catered to electronic commerce (Nayak, 2002). With millions of people accessing the Web every day, it has been possible to gather a large amount of "clickstream data" (Kohavi, 2001) to determine and predict the possible interests of users. But at the current level of technology, mobile devices have a number of limitations. One of these limitations is a small display screen (Madria, Mohania, Bhowmick, & Bhargava, 2002). As a result, a page on a WAP phone has an average of 5 links to other Web sites, while a standard Web page has an average of 25 links. If a user makes three clicks on the Web via a WAP phone, only 5^3 (= 125) pages are accessible, compared with 25^3 (= 15,625) pages accessible from a standard Web page (Barnes, 2002). Therefore, users of mobile devices are highly restricted in the Web pages that they can visit. It is quite unlikely that users will reach the sites they really want from the links available. As a result, data mining analyses of clickstream data collected from users of mobile devices are unlikely to be accurate, and predicting the user's interest will be difficult. Data mining is therefore limited by the current mobile Internet service, which is not able to reflect the user's interest sufficiently.
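As a rough illustration of platform-independent exchange, the sketch below parses a session record with Python's standard `xml.etree.ElementTree` module. The document layout (`session` and `page` elements and their attributes) is an assumption made for this example, not a format defined in the sources cited here.

```python
import xml.etree.ElementTree as ET

# Hypothetical session record as a device might transmit it.
SESSION_XML = """
<session device="wap-phone" user="u42">
  <page url="/news" seconds="12"/>
  <page url="/sport" seconds="30"/>
</session>
"""


def parse_session(xml_text: str) -> dict:
    """Turn one XML session document into a plain dictionary for mining."""
    root = ET.fromstring(xml_text.strip())
    pages = [
        {"url": page.get("url"), "seconds": int(page.get("seconds", "0"))}
        for page in root.findall("page")
    ]
    return {
        "device": root.get("device"),
        "user": root.get("user"),
        "pages": pages,
    }


record = parse_session(SESSION_XML)
```

The receiving side needs only an XML parser and knowledge of the agreed element names, whatever platform produced the document.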
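The arithmetic behind the comparison can be checked with a short sketch. It assumes that every page carries exactly the average number of links and that no two links lead to the same page, so the result is an upper bound; the function name `reachable_pages` is hypothetical.

```python
def reachable_pages(links_per_page: int, clicks: int) -> int:
    """Upper bound on distinct pages reachable after `clicks` clicks,
    assuming each page offers `links_per_page` links to unique pages."""
    if links_per_page < 0 or clicks < 0:
        raise ValueError("arguments must be non-negative")
    return links_per_page ** clicks


wap_pages = reachable_pages(5, 3)    # WAP phone: 5 links per page
web_pages = reachable_pages(25, 3)   # standard Web page: 25 links per page
ratio = web_pages // wap_pages       # how many times larger the Web's reach is
```

Under these assumptions, a desktop user can reach 125 times as many pages in three clicks as a WAP user can.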
Security and Privacy
With the technology of sending messages to mobile users, it has become possible for users to specify the types of information that they prefer and require. In a mobile network, this is known as creating "dynamic bookmarks" (Duri et al., 2001). As a result, a business is able to provide users with the information and services that they prefer. For example, if a user indicates a preference for a particular brand of product above a particular price, it can be inferred that the user may also be interested in a similar brand of the same standard. This opens data mining possibilities such as classifying users based on their reported needs using predictive mining and clustering, and finding correlations between various needs by performing a link analysis.
Unfortunately, there are also situations in which users are not comfortable declaring any information about themselves. As a result, the preferences indicated by users might not necessarily be correct. Thus, data mining results may classify a user as a suitable recipient of information when, in reality, the effort only adds expense: the cost of conducting the data mining and the cost of including irrelevant people in the mobile service.
Users also have a certain level of fear with regard to mobile data security. Security is a vital concern in m-business due to the type of communication medium. Because mobile devices are portable, there is always a potential risk of compromising the integrity, security, and availability of information (Madria et al., 2002). Users who do not trust the security of mobile data are therefore even more likely to declare their personal information and personal preferences inaccurately. In addition, the vocal mode of data transfer is not appropriate for applications with confidential data, where the user could be overheard.
Because data mining applications are usually computationally expensive, there is always a concern as to whether the benefits of data mining justify the cost incurred. Firstly, before data mining commences, the data must be preprocessed to ensure that it is cleaned and that all inconsistent or missing values are adequately rectified. A substantial amount of computational power is required to perform this process. Secondly, the cleaned data needs to be transformed to an appropriate format to facilitate the mining process. To address concerns about the security and privacy of XML documents transferred wirelessly, XML allows documents to be complex and tagged with meaningless names, so that a document is not useful to an unauthorized person who does not know how to transform it appropriately. However, the more complex the XML document is, the more computational power is needed to process it. Thus, it is difficult to strike a balance between the security and privacy of the data transferred and the computational cost required to process the "encrypted" documents, and likewise between the cost of the data mining process and its benefit.
Although a vast number of mobile technologies are available, several limitations and constraints of these technologies adversely affect the performance of data mining in the m-business domain. Limitations of mobile computing include low bandwidth, limited battery power, and unreliable communications that result in frequent disconnections and higher error rates. These factors increase communication latency, the cost of retransmitting data, time-out delays, error control protocol processing, and short disconnections (Madria et al., 2002). Connections can also be lost when a user moves into crowded areas (e.g., events or concerts) or areas with interference. These limitations pose significant problems in collecting the data for mining purposes.
Furthermore, technologies like GPS allow the location of users to be identified outdoors. The idea of knowing a user's location appears promising in data mining; for example, providing tourists with a map of their current location on their mobile devices (Brown, Chalmers, & MacColl, 2002) or possibly even sending messages to consumers when they pass by a shopping sale. However, present technologies are not mature enough to determine the location of a user indoors (Duri et al., 2001). Although knowledge gathered from a user's location is potentially very useful for m-business and data mining, this potential will not be realized until present technologies improve enough to support adequate location-aware applications on mobile devices.
At present, only a minority of consumers have mobile devices that allow wireless Internet connections. This is due to current technological limitations, such as devices with small screens that only allow Internet content to be viewed via WML (Wireless Markup Language), a scaled-down version of HTML (Hypertext Markup Language), which does not support Internet viewing as well as a personal computer does. Limited wireless network bandwidth (ranging from 9.6 to 19.2 Kbps) also does not live up to the high expectations of consumers (Kalakota & Robinson, 2002). Thus, mobile device technologies are perceived as not yet stable and reliable.
Data mining over distributed sources is limited by the technological limitations of the devices. For example, the present low bandwidth means that data transfer is slow, which implies that data mining processes have to be delayed until most data transfers have been completed. Web serving is already very popular in e-business. In the future there will be additional resources for Web access, one of which is wireless access to Web sites via mobile phones. Unfortunately, bandwidth is a big issue, and it is therefore essential to minimize the volume of data sent through the channels. As shown earlier in the paper, the data mining opportunities in m-business make it possible to overcome some of these limitations.
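One very simple stand-in for finding correlations between reported needs is to count how often two preferences are declared together. The sketch below computes this pairwise support; it is not a full link analysis, and the preference labels and the function name `pair_support` are invented for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def pair_support(profiles: List[Iterable[str]]) -> Dict[Tuple[str, str], float]:
    """Fraction of user profiles in which each pair of preferences co-occurs."""
    n = len(profiles)
    if n == 0:
        return {}
    counts: Counter = Counter()
    for prefs in profiles:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(set(prefs)), 2):
            counts[pair] += 1
    return {pair: c / n for pair, c in counts.items()}


# Hypothetical "dynamic bookmark" preferences declared by four users.
profiles = [
    {"sports", "news"},
    {"sports", "news", "travel"},
    {"news", "travel"},
    {"sports", "news"},
]
support = pair_support(profiles)
```

Pairs with high support, such as news and sports in this toy data, suggest that a user who declares one preference may also respond to the other.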
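The trade-off described above can be made concrete with a small sketch that renames XML tags through a mapping known only to the sender and the authorised receiver. This is an assumption about how such tagging might be done, and it is obfuscation rather than real encryption: element values remain readable. The mapping `TAG_MAP` and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical mapping shared only by sender and authorised receiver.
TAG_MAP = {"customer": "a1", "name": "b7", "balance": "c3"}


def rename_tags(xml_text: str, mapping: Dict[str, str]) -> str:
    """Return the document with every tag renamed according to `mapping`.
    Tags that are not in the mapping are left unchanged."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = mapping.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


def invert(mapping: Dict[str, str]) -> Dict[str, str]:
    """Build the reverse mapping used by the receiver."""
    return {obscure: plain for plain, obscure in mapping.items()}


plain = "<customer><name>Lee</name><balance>120</balance></customer>"
obscured = rename_tags(plain, TAG_MAP)
restored = rename_tags(obscured, invert(TAG_MAP))
```

Every document must be traversed once to obscure it and once to restore it, so the processing cost grows with document size and complexity, which is the balance the paragraph above describes.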
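The effect of low bandwidth on mining delays can be estimated with a back-of-the-envelope sketch. It assumes an ideal channel (no latency, retransmission, or protocol overhead) and uses the 9.6 to 19.2 Kbps range quoted above, taking 1 Kbps as 1,000 bits per second; the payload size is a made-up figure.

```python
def transfer_seconds(payload_bytes: int, bits_per_second: int) -> float:
    """Ideal transfer time for a payload, ignoring latency and overhead."""
    if bits_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    return (payload_bytes * 8) / bits_per_second


payload = 120_000  # hypothetical 120,000-byte batch of session logs
slowest = transfer_seconds(payload, 9_600)    # at 9.6 Kbps
fastest = transfer_seconds(payload, 19_200)   # at 19.2 Kbps
```

Even under these optimistic assumptions, the batch takes between 50 and 100 seconds to arrive, which is why reducing the volume of data sent matters more than any other optimisation on such links.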
Data Mining Process and Intelligent Agents
Several steps of the knowledge discovery process can be partly automated using intelligent agents (Russell & Norvig, 1995). Intelligent agents use domain knowledge with embedded simple rules. The use of training data helps to reduce the need for domain experts. A scanning agent goes through the rules and facts and displays the items that have valuable information.
Intelligent agents can help to automate the data selection steps by determining learning parameters, by applying triggers for database updates, and by managing invalid data. The agents can perform automatic sensitivity analysis to detect helpful parameters. As a result, fewer domain experts are needed, and experts no longer have to be consulted each time the environment changes. Data cleansing can be automated using intelligent agents with a rule base. Whenever a record is added or updated in a relational database, the agent's trigger examines the transaction data. Missing or invalid data can then be cleansed using the rules in the agent's rule base.
Agents can be used to implement classification, clustering, summarization, and generalization models that involve learning and rule generation. Intelligent agents support the search for patterns of interest by applying learning and intelligence to these tasks. Agents can derive new, related information from data by learning preferences from a user profile or from examples and user feedback. This information can be used to provide confidence in what the agent is predicting. The agents are implemented using machine learning and data mining techniques such as case-based reasoning, neural networks, association, and induction (Seydim, 1999).
One major advantage of intelligent agents is their support for online transaction data mining. This automated decision support is called "active data mining" (Moon, Kim, & Kim, 2001). In conclusion, intelligent agents are very important in the process of knowledge discovery, especially in distributed environments such as m-business, because they support the discovery process at many stages.
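A minimal sketch of such a rule-based cleansing agent follows. It assumes that rules are plain functions that either return a repaired record or reject it, and that `on_insert` is called by whatever trigger mechanism the database provides. The class `CleansingAgent`, the two example rules, and the record fields are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]
# A rule returns a (possibly repaired) record, or None to reject it.
Rule = Callable[[Record], Optional[Record]]


class CleansingAgent:
    """Applies a rule base to each record added or updated (hypothetical)."""

    def __init__(self, rules: List[Rule]) -> None:
        self.rules = rules
        self.rejected: List[Record] = []

    def on_insert(self, record: Record) -> Optional[Record]:
        """Trigger entry point: return the cleansed record, or None if invalid."""
        current = dict(record)  # work on a copy, keep the original intact
        for rule in self.rules:
            result = rule(current)
            if result is None:
                self.rejected.append(record)
                return None
            current = result
        return current


def fill_missing_device(record: Record) -> Optional[Record]:
    """Repair rule: replace an empty device field with a placeholder."""
    if not record.get("device"):
        return {**record, "device": "unknown"}
    return record


def reject_negative_duration(record: Record) -> Optional[Record]:
    """Validation rule: a session duration must be a non-negative number."""
    seconds = record.get("seconds")
    if not isinstance(seconds, (int, float)) or seconds < 0:
        return None
    return record


agent = CleansingAgent([fill_missing_device, reject_negative_duration])
cleaned = agent.on_insert({"user": "u1", "device": "", "seconds": 12})
invalid = agent.on_insert({"user": "u2", "device": "pda", "seconds": -5})
```

Keeping the rules separate from the agent means the rule base can grow or change without touching the trigger logic.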