Data Collection | Agility and Discipline Made Easy: Practices from OpenUP and RUP

By understanding consumers' behavior, more efficient e-marketing strategies will become available to drive Internet use and e-commerce applications. Marketing efforts intended to enhance Web site use are expected to follow a two-fold strategy:

Turn non-users into users.
Expand usage of current users.

Consumers supported by a personalized system are more likely either to turn from nonshoppers into active shoppers or to increase their previous shopping volume. Web sites need to encourage users to discuss problems, and use this feedback to improve both products and services. Web sites should also try to collect customer information and use it to develop a relationship with customers. Customer satisfaction is the key to customer retention. Like traditional stores, online stores need to build strong relationships with their customers. Technology offers many advantages over traditional ways of doing business: commercial Web sites can use techniques such as online user groups, input from previous customers (rankings, comments, opinions, product assessments, etc.), order tracking, and more.

The first step in the personalization process is therefore the acquisition of data about the users (a task that, in most cases, runs continuously). User data must be transformed into an internal representation (modeling) that allows for further processing and easier updating. To produce satisfactory results, personalization needs different kinds of data. Some data can be observed by the system, while others have to be provided by the user. The collection of information that describes a particular user is called a user profile, and a good user model of this kind forms the basis for personalization activities. These profiles may be static or dynamic, depending on whether, and how often, they are updated. More specifically, the information incorporated in a user model may include: the user's identification profile, the preference profile, the socioeconomic profile, the user's ratings, reviews and opinions, the transaction profile, the interaction profile, the history profile, etc.
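The profile components listed above could be held in an internal representation along the following lines. This is only a minimal sketch: the class name `UserProfile`, its field names, and the rule that a profile counts as dynamic once it has been updated from observed behavior are illustrative assumptions, not something prescribed by the text.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class UserProfile:
    """Hypothetical internal representation of one user (all names are illustrative)."""
    user_id: str
    identification: dict = field(default_factory=dict)  # e.g. name, e-mail address
    preferences: dict = field(default_factory=dict)     # stated or inferred preferences
    ratings: dict = field(default_factory=dict)         # item id -> rating
    transactions: list = field(default_factory=list)    # purchase records
    interactions: list = field(default_factory=list)    # observed navigation events
    last_updated: Optional[datetime] = None             # None: never updated after creation

    def record_interaction(self, event: dict) -> None:
        """Append an observed event and timestamp the update."""
        self.interactions.append(event)
        self.last_updated = datetime.now(timezone.utc)

    def is_dynamic(self) -> bool:
        """Assumption: a profile is treated as dynamic once it has been updated."""
        return self.last_updated is not None


profile = UserProfile(user_id="u1", identification={"name": "Example User"})
profile.record_interaction({"page": "/home"})
```

Keeping each sub-profile in its own field makes it straightforward to update one part (for instance the interaction profile) without touching the data the user supplied explicitly.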

There are two general methodologies for acquiring user data depending on whether the user is required to be actively engaged in the process or not:

Reactive approach: the user is asked explicitly to provide the data using questionnaire forms, fill-in preference dialogs, or even via machine-readable data carriers, such as smart cards.
Non-reactive approach: the system implicitly derives such information without initiating any interaction with the user using acquisition rules, plan recognition, and stereotype reasoning.

Static profiles are usually acquired explicitly, while dynamic ones are acquired implicitly by recording and analyzing user navigational behavior. Both approaches face different but equally serious problems. In explicit profiling, users are often reluctant to fill in questionnaires and reveal personal information online; they comply only when required, and even then the data submitted may be false. In implicit profiling, the source of information is not biased by the users' negative attitude, but the problems once again stem from privacy concerns and the loss of anonymity. Personalization strives to identify users, record their online behavior in as much detail as possible, and extract needs and preferences in a way that users do not notice, understand, or control. This situation, in which the user is not in control, is known as the loss-of-control problem (Kramer et al., 2000; Mesquita et al., 2002; Nielsen, 1998).

Moreover, to maximize data gathering opportunities the Web site should collect data from every customer touch point, online and off-line.

Online customer touch points include:

Registration: the Web site asks for some basic information about the customer (e.g., name, address, phone number, fax, interests, preferences, etc.), including an e-mail address and a password. Being a registered user makes future purchases faster, easier and friendlier.
Transactions: purchase data or information requests.
Sign-ups: newsletters, e-mail notifications, samples, coupons, partner offers, etc.
Customer profiles or user preferences.
Customer surveys: research-related and entertaining content surveys.
Customer service.
Web log files: pages viewed, categories searched, links clicked, etc.
Incoming and outgoing URLs (URL linking to the store, and links leading outside the store).
Advertising banners.
Sweepstakes and other promotions requiring customer data.
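Of the online touch points above, Web log files are the easiest to illustrate in code. The sketch below assumes the server writes the widely used Common Log Format; the function name `pages_viewed` and the sample lines are made up for illustration, and real logs (for example the combined format) carry additional fields.

```python
import re
from collections import Counter

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def pages_viewed(log_lines):
    """Count successful GET requests per (host, path) pair."""
    counts = Counter()
    for line in log_lines:
        match = CLF_PATTERN.match(line)
        if not match:
            continue  # skip lines that do not follow the assumed format
        if match.group("method") == "GET" and match.group("status") == "200":
            counts[(match.group("host"), match.group("path"))] += 1
    return counts


# Illustrative sample lines (documentation-range IP addresses).
sample = [
    '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] "GET /catalog/books HTTP/1.0" 200 2326',
    '192.0.2.1 - - [10/Oct/2000:13:56:01 -0700] "GET /catalog/books HTTP/1.0" 200 2326',
    '192.0.2.7 - - [10/Oct/2000:13:57:12 -0700] "GET /missing HTTP/1.0" 404 209',
]
view_counts = pages_viewed(sample)
```

Counting by host is only a rough stand-in for counting by user, since several visitors may share one address; a registered account or a cookie-based identifier gives a more reliable key.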

Off-line customer touch points on the other hand may comprise:

Customer service by phone, stored in the customer profile database.
In-store transactions (meaning physical store purchases).
Various surveys.
Paper submissions (e.g., sweepstake or promotion entries).

Perhaps the most important data source is the initial registration. In most cases the registration process is more important than the first transaction, because the act of registering indicates that a customer wants to start a 'conversation' or a relationship and gives the store permission to begin this process. When adequate data is collected, subsequent interactions with the store may well exceed the visitor's expectations.

Ensuring that the store allows customers to update and modify their own profile data not only will keep the customer information up-to-date, but it will also engender more trust because customers know what information is maintained about them by the e-store.

Another equally effective way to gather data about the customer is for the system not to ask explicitly for any information at all. Many successful Web sites use cookies and unique identifiers to make customer-specific data collection invisible to the customer.
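A minimal sketch of the cookie approach follows, using only Python's standard library. The cookie name `visitor_id`, the one-year lifetime, and the helper `get_or_assign_visitor_id` are assumptions made for illustration; a production site would also need to consider consent and the privacy concerns discussed earlier.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"  # hypothetical cookie name


def get_or_assign_visitor_id(cookie_header):
    """Return (visitor_id, set_cookie_value).

    set_cookie_value is None when the visitor already carries an identifier,
    otherwise it is the value to send back in a Set-Cookie response header.
    """
    cookie = SimpleCookie()
    if cookie_header:
        cookie.load(cookie_header)
    if COOKIE_NAME in cookie:
        return cookie[COOKIE_NAME].value, None

    # First visit: generate a random identifier and build the cookie.
    visitor_id = uuid.uuid4().hex
    outgoing = SimpleCookie()
    outgoing[COOKIE_NAME] = visitor_id
    outgoing[COOKIE_NAME]["path"] = "/"
    outgoing[COOKIE_NAME]["max-age"] = 60 * 60 * 24 * 365  # assumed lifetime: one year
    return visitor_id, outgoing[COOKIE_NAME].OutputString()


# First request carries no cookie; the second one returns the identifier.
first_id, set_cookie = get_or_assign_visitor_id(None)
second_id, nothing_to_set = get_or_assign_visitor_id(f"{COOKIE_NAME}={first_id}")
```

The identifier itself carries no personal data; it only lets the site link later requests to the same browser, so that observed behavior can be attached to one profile.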

Different kinds of data are used in the personalization process:

data about the user.
data about the Web site usage.
data about the software and hardware available on the user's side.

User Data

This category denotes information about personal characteristics of the user. Several such types of data have been used in personalization applications. One source of information affecting customers' decision-making process and attitudes is their demographic traits. These traits include name, address, zip code, phone number, other geographic information, gender, age, marital status, education, income, etc. Not all customers are equal. Different customers and customer segments value different things: for some it is important that a Web site provides lower prices and faster delivery, while for others the priority is quality, number of choices and convenience. An example found in Liebermann and Stashevsky (2002) reveals differences in attitudes based on sex: according to it, males worry more than females about the vast volumes of Internet advertising.

Another source of information relates to user's knowledge of concepts and relationships between concepts in the application-specific domain (input that has been of extensive use in natural language processing systems) or domain-specific expertise.

Moreover, user skills and capabilities may be valuable types of data: apart from 'what' the user knows, in many cases it is equally important to know what the user knows 'how' to do, or even to distinguish between what the user is familiar with and what he/she can actually accomplish.

Finally, interests and preferences, goals and plans are used by plan recognition techniques where identified goals allow the Web site to predict interests and needs and adjust its contents' structure and presentation for easier and faster goal achievement.

Usage Data

Usage data may be directly observed and recorded, or acquired by analyzing observable data (whose amount and detail vary depending on the technologies used during Web site implementation, e.g., Java applets), a process known as Web usage mining (Markellou et al., 2004, see also section 'Personalization and Web Mining'). Usage data may either be:

Observable data comprising selective actions like clicking on a link, data regarding the temporal viewing behavior of users, ratings (using a binary or a limited, discrete scale) and other confirmatory or disconfirmatory actions (making purchases, e-mailing/saving/printing a document, bookmarking a Web page and more), or
Data that derive from observed data by further processing (measurements of frequency of selecting an option/link/service, production of suggestions/recommendations based on situation-action correlations, or variations of this approach, for instance recording action sequences).
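The step from observable to derived usage data can be sketched as follows. The event format (a chronological list of `(user_id, action)` pairs), the action labels, and the function `derive_usage` are hypothetical; the sketch computes two of the derived measures named above, selection frequencies and short action sequences (pairs of consecutive actions per user).

```python
from collections import Counter


def derive_usage(events):
    """Derive frequencies and consecutive-action pairs from observed events.

    events: iterable of (user_id, action) tuples in chronological order.
    Returns (frequency, transitions), both Counter objects.
    """
    events = list(events)
    # How often each action was selected, across all users.
    frequency = Counter(action for _, action in events)

    # Pairs of consecutive actions performed by the same user.
    last_action = {}
    transitions = Counter()
    for user, action in events:
        if user in last_action:
            transitions[(last_action[user], action)] += 1
        last_action[user] = action
    return frequency, transitions


# Illustrative observed events.
observed = [
    ("u1", "view:/books"),
    ("u1", "view:/books/python"),
    ("u2", "view:/books"),
    ("u1", "purchase:/books/python"),
    ("u2", "view:/books/python"),
]
frequency, transitions = derive_usage(observed)
```

Transition counts of this kind are one simple basis for situation-action correlations: if most users who view one page go on to view another, the second page is a candidate recommendation on the first.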

Environment Data

On the client side, the range of different hardware and software used is large and keeps growing with the widespread use of mobile phones and personal digital assistants (PDAs) for accessing the Web. Thus in many cases the adaptations to be produced should also take into account such information. Environment data addresses information about the available software and hardware at the client side (browser version and platform, availability of plug-ins, firewalls preventing applets from executing, available bandwidth, processing speed, display and input devices, etc.), as well as locale (geographical information that can be used, for instance, to automatically adjust the language, or other locale specific content, such as the local time or the shipping costs).
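As one example of using environment data, the language can be adjusted from the `Accept-Language` header that browsers send with each request. The sketch below is a simplified reading of that header (comma-separated language tags with optional `q` weights); the function name `preferred_language`, the fallback to the primary subtag, and the default of `"en"` are assumptions made for illustration.

```python
def preferred_language(accept_language, supported, default="en"):
    """Pick the best supported language from an Accept-Language header value.

    accept_language: header value such as "fr-CH, fr;q=0.9, en;q=0.8".
    supported: set of lowercase language tags the site can serve.
    """
    candidates = []
    for position, part in enumerate(accept_language.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a missing q value means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:  # q=0 marks a language as not acceptable
            candidates.append((-quality, position, tag.strip().lower()))

    # Highest quality first; header order breaks ties.
    for _, _, tag in sorted(candidates):
        if tag in supported:
            return tag
        primary = tag.split("-")[0]  # e.g. "fr-ch" falls back to "fr"
        if primary in supported:
            return primary
    return default


language = preferred_language("fr-CH, fr;q=0.9, en;q=0.8", {"en", "fr"})
```

The same pattern applies to other environment data: the site reads what the client reports, compares it with what it can serve, and falls back to a safe default when nothing matches.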