Data Processing | Agility and Discipline Made Easy: Practices from OpenUP and RUP

The step that follows the collection of data from all sources and the building of user profiles is data processing. First, preparation activities clean the data and make them easier to manipulate: for instance, entries that do not reveal actual usage information are removed and missing data are filled in. Statistical and data-mining techniques are then applied to detect interesting patterns in the pre-processed data. The best-known techniques used for data analysis include clustering, classification, association rule mining, sequential pattern discovery and prediction. A more detailed description of each technique follows.
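The preparation step can be illustrated with a minimal Python sketch. The field names (`user`, `url`, `status`, `duration`), the filter rules (dropping failed requests and requests for images or stylesheets) and the choice to fill a missing duration with the mean of the known ones are all assumptions made for illustration, not rules prescribed by the text.

```python
from statistics import mean

# Hypothetical log entries; the field names are assumptions for illustration.
raw_entries = [
    {"user": "u1", "url": "/products.html", "status": 200, "duration": 30.0},
    {"user": "u1", "url": "/logo.png", "status": 200, "duration": 0.1},
    {"user": "u2", "url": "/help.html", "status": 200, "duration": None},
    {"user": "u2", "url": "/missing.html", "status": 404, "duration": 1.0},
    {"user": "u3", "url": "/offers.html", "status": 200, "duration": 50.0},
]

# Requests for these resources are assumed to carry no usage information.
IGNORED_SUFFIXES = (".png", ".gif", ".jpg", ".css", ".js")


def clean(entries):
    """Drop entries without usage information, then fill missing durations."""
    kept = [
        dict(e)  # copy so the raw entries are left untouched
        for e in entries
        if e["status"] == 200 and not e["url"].endswith(IGNORED_SUFFIXES)
    ]
    known = [e["duration"] for e in kept if e["duration"] is not None]
    fill = mean(known) if known else 0.0
    for e in kept:
        if e["duration"] is None:
            e["duration"] = fill
    return kept


cleaned = clean(raw_entries)
```

Here three entries survive, and the missing duration of the 'Help' visit is replaced by the mean of the two known durations. Other completion strategies (median, per-user averages) would fit the same structure.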

Clustering

Clustering algorithms are used mainly for segmentation purposes. Their aim is to detect 'natural' groups in data collections (e.g., customer profiles, product databases, transaction databases). They compute a measure of similarity between the items in the collection in order to group together items that have similar characteristics. The items may be either users that demonstrate similar online behavior or pages that are similarly utilized by users. The resulting groups (a segmentation of the database into clusters of similar people, e.g., customers, prospects, most valuable or profitable customers, most active customers, lapsed customers) can be based on many different customer attributes (e.g., navigation behavior, buying behavior or demographics). Several clustering algorithms are available: Hierarchical Agglomerative Clustering or HAC (Rasmussen, 1992; Willett, 1988), k-means clustering (MacQueen, 1967), and Self-Organizing Maps or SOMs (Kohonen, 1997).

Figure 3-2: Clustering
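As an illustration of how a clustering algorithm groups similar users, the following is a minimal sketch of k-means in plain Python. The user features (visits per month, pages per visit), the sample values and the naive initialization from the first k points are assumptions chosen to keep the example short; practical implementations use more careful initialization and feature scaling.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    # Naive initialization: use the first k points (an assumption of this sketch).
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical user features: (visits per month, pages per visit)
users = [(1, 2), (2, 1), (1, 1), (20, 15), (22, 14), (21, 16)]
labels, centroids = kmeans(users, k=2)
```

With this data, the three occasional visitors end up in one cluster and the three heavy visitors in the other, without any class labels being supplied.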

Classification

The main objective of classification algorithms is to assign items to a set of predefined classes. These classes usually represent different user profiles, and classification is performed using selected features that discriminate well between the classes describing each profile. For example, the profile of an active buyer can be:

sex = male

30 <= age <= 40

marital-status = single

number-of-children = 0

education = higher

Figure 3-3: Classification

This information can be used to attract potential customers. Unlike clustering, which involves unsupervised learning, classification requires a training set of data with pre-assigned class labels (classification is categorized as a supervised machine learning technique). By observing the class assignments in the training set, the classifier learns to assign new data items to one of the classes. Clustering is often applied before classification to determine the set of classes. Some widely used classification algorithms are K-Nearest Neighbor (KNN), Decision Trees, Naïve Bayes and Neural Networks (Chakrabarti, 2003).
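The supervised setting can be sketched with a small K-Nearest Neighbor classifier in plain Python. The training examples, the two features (age, number of children) and the class names are hypothetical, and the features are left unscaled for brevity, which a real application would normally not do.

```python
import math
from collections import Counter


def knn_classify(training, item, k=3):
    """Label `item` by majority vote among its k nearest training examples."""
    neighbors = sorted(training, key=lambda ex: math.dist(ex[0], item))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (age, number of children) -> pre-assigned class label
training = [
    ((32, 0), "active buyer"),
    ((35, 0), "active buyer"),
    ((38, 0), "active buyer"),
    ((55, 3), "occasional buyer"),
    ((60, 2), "occasional buyer"),
    ((22, 1), "occasional buyer"),
]

predicted = knn_classify(training, (34, 0))
```

A new 34-year-old user without children is assigned to the 'active buyer' class because the three closest labeled examples all carry that label.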

Association Rules

Association rules connect one or more events. The aim is to find out associations and correlations between different types of information without obvious semantic dependence. In the Web personalization domain, this method may indicate correlations between pages not directly connected and reveal previously unknown associations between groups of users with specific interests (Agrawal et al., 1993; Agrawal & Srikant, 1994; Chen et al., 1996, 1998). Such information may prove valuable for e-commerce and e-business Web sites since it can be used to improve Customer Relationship Management (CRM). Some examples of association rules are the following:

20% of the users that buy the book 'Windows 2000' also select 'Word 2000' next,
50% of the users who visited the 'Help' pages belong to the 25-30 age group,
30% of the users who accessed the Web page 'Special Offers' placed an online order for product 'DVD - Lord of the Rings,'
60% of the users who ordered the book 'Harry Potter' were in the 18-25 age group and lived in Athens,
or 80% of users who accessed the Web site started from page 'Products.'

Figure 3-4: Association rules
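The percentages in rules of this kind correspond to what the association rule literature usually calls confidence: among the transactions that contain the left-hand side, the fraction that also contain the right-hand side. The sketch below computes it, together with the rule's support (the fraction of all transactions that contain both sides), for a hypothetical set of purchase baskets. The basket contents and the function name `rule_stats` are assumptions for illustration.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    # Transactions containing the left-hand side of the rule.
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    # Transactions containing both sides of the rule.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical purchase baskets
baskets = [
    {"Windows 2000", "Word 2000"},
    {"Windows 2000"},
    {"Windows 2000", "Harry Potter"},
    {"Word 2000"},
    {"Windows 2000", "Word 2000", "Harry Potter"},
]

support, confidence = rule_stats(baskets, {"Windows 2000"}, {"Word 2000"})
```

In this toy data, two of the four baskets that contain 'Windows 2000' also contain 'Word 2000', so the rule holds for 50% of those buyers and is supported by two of the five baskets overall.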

Sequential Pattern Discovery

Sequential pattern discovery is an extension of association rule mining that reveals patterns of co-occurrence while incorporating the notion of time sequence. A pattern in this case may be a Web page, or a set of pages, accessed immediately after another set of pages. Examples of sequential patterns can be:

'45% of new customers who order a mobile phone will spend more than 50 Euros using it within 30 days.'
'Given the transactions of a customer who has not bought any products during the last three months, find all customers with a similar behavior.'
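What distinguishes a sequential pattern from a plain association is that order matters. A minimal sketch, assuming each session is an ordered list of page names (the sessions and the function name `followed_by_rate` are hypothetical), measures how often one page is accessed immediately after another:

```python
def followed_by_rate(sessions, first, second):
    """Among sessions containing `first`, the fraction where `second` comes immediately after it."""
    containing = [s for s in sessions if first in s]
    if not containing:
        return 0.0
    hits = sum(
        1 for s in containing
        # zip(s, s[1:]) yields each pair of consecutive pages in the session.
        if any(a == first and b == second for a, b in zip(s, s[1:]))
    )
    return hits / len(containing)


# Hypothetical click-stream sessions, pages listed in the order visited
sessions = [
    ["Products", "Special Offers", "Order"],
    ["Products", "Help"],
    ["Help", "Products", "Special Offers"],
    ["Special Offers", "Products"],
    ["Help"],
]

rate = followed_by_rate(sessions, "Products", "Special Offers")
reverse_rate = followed_by_rate(sessions, "Special Offers", "Products")
```

The two rates differ because the same pair of pages is counted only when it occurs in the stated order, which is exactly the information an unordered association rule discards.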

Prediction

Predictive modeling algorithms calculate scores for each customer. A score is a number that expresses the likelihood of the customer behaving in a certain way in the future. For example, it answers questions such as:

How likely is a user to click on a certain banner?
How likely is a user to re-visit the Web site in the next month?
How many orders will be placed by customers from abroad?

These scores are calculated based on rules that are derived from examples of past customer behavior. Predictive Modeling methods include Decision Trees, Regression Models and Neural Networks.
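As one example of a regression model that produces such scores, the sketch below fits a one-feature logistic regression by gradient descent in plain Python. The history data (visits last month and whether the user returned), the learning rate and the number of epochs are assumptions chosen for illustration, not values taken from the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit score = sigmoid(w * x + b) by batch gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the average log-loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical examples of past behavior:
# visits last month -> returned the following month (1) or not (0)
visits = [0, 1, 1, 2, 3, 5, 6, 8, 9, 10]
returned = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(visits, returned)


def score(x):
    """Estimated likelihood that a user with x past visits returns."""
    return sigmoid(w * x + b)
```

Calling `score(8)` then gives a number between 0 and 1 that can be read as the estimated likelihood of a re-visit; users with more past visits receive higher scores because the fitted weight is positive.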

The prediction can be:

Off-line: the decisions are pre-calculated and picked up at the start of a repeat visit (the click-stream of the current visit is not used).
Online: the current visit's click-stream is used for decisions.
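The difference between the two modes can be sketched as follows. The user ids, page names and the toy rule table are hypothetical; in practice both the pre-calculated decisions and the online rules would come from a predictive model rather than hand-written dictionaries.

```python
# Off-line: decisions pre-calculated from past behavior, keyed by user id.
precomputed = {"u1": "Special Offers", "u2": "Help"}


def offline_decision(user_id, default="Products"):
    """Look up a pre-calculated decision; the current click-stream is ignored."""
    return precomputed.get(user_id, default)


def online_decision(clickstream, default="Products"):
    """Decide from the current visit only: here, a toy rule on the last page seen."""
    rules = {"Products": "Special Offers", "Special Offers": "Order"}
    if not clickstream:
        return default
    return rules.get(clickstream[-1], default)


# A returning user gets the stored decision as soon as the visit starts.
start_page = offline_decision("u1")
# During the visit, the decision follows what the user has just done.
next_page = online_decision(["Help", "Products"])
```

The off-line lookup is cheap and available immediately, while the online variant can react to the current session at the cost of computing a decision on every request.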

Moreover, predictive models can be used to decide which content to display to a user, which constitutes a crucial part of CRM.