CASE 1: PERSONALIZED MEDIA DISTRIBUTION

Chapter I - A Survey of Bayesian Data Mining
Data Mining: Opportunities and Challenges
by John Wang (ed)
Idea Group Publishing 2003



The application concerns personalized presentation of news items. A related area is recommendation systems (Kumar, Raghavan, Rajagopalan, & Tomkins 2001). The data used are historical records of individual subscribers to an electronic news service. The purpose of the investigation is to design a presentation strategy where an individual is first treated as the "average customer"; then, as his record grows, he can be included in one of a set of "customer types"; and finally he can also get a profile of his own. Only two basic mechanisms are available for evaluating an individual's interest in an item: to what degree has he been interested in similar items before, and to what degree have similar individuals been interested in this item? This suggests two applications of the material of this chapter: segmentation or classification of customers into types, and evaluating a customer record against a number of types to find out whether or not he can confidently be said to differ. We address these problems here and leave out many practical implementation issues.

The database consists of a set of news items with a coarse classification; a set of customers, each with a type, a "profile," and a list of rejected and accepted items; and a set of customer types, each with a list of members. Initially, we have no types or profiles, but only classified news items and the different individuals' access records. The production of customer types is a fairly manual procedure. Even if many automatic classification programs can make a reasonable initial guess, it is inevitable that the type list will be scrutinized and modified by media professionals (the types are normally also used for direct marketing).

The assignment of new individuals to types cannot be done manually because of the large volumes. Our task is thus to say, for a new individual with a given access list, to which type he belongs. The input to this problem is a set of tables containing, for each type as well as for the new individual, the number of rejected and accepted offers of items from each class. The modeling assumption required is that, for each news category, there is a probability that the new individual, or an average member of a type, accepts an item. Our question is now: do these data support the conclusion that the individual has the same probability table as one of the types, or is he different from every type (and thus should get a profile of his own)?

We can formulate the model choice problem by transforming the access tables into a dependency problem for data tables, which we have already treated in depth. For a type t with a_i accepts and r_i rejects for news category i, we imagine a table with three columns and Σ (a_i + r_i) rows. Every row has t in Column 1 to indicate an access by the type. For each category i there are a_i + r_i rows with the category number i in Column 2; of these, a_i contain 1 (for accept) and r_i contain 0 (for reject) in Column 3. We add a similar set of rows for the access list of the individual, marked with 0 in Column 1. If the probability of a 0 (or 1) in Column 3 depends on the category (Column 2) but not on Column 1, then the user cannot be distinguished from the type. Columns 1 and 2 may nevertheless be dependent, if the user has seen a different mixture of news categories than the type has. In graphical modeling terms, we can use the model choice algorithm: the probability of the customer belonging to type t is equal to the probability of model M4 against M3, where variable C in Figure 2 corresponds to the category variable (Column 2). In a prototype implementation we have the following customer types, described by their accept probabilities:

Table 2: Customer types in a recommender system
Category	Typ1	Typ2	Typ3	Typ4
News-int	0.9	0.06	0.82	0.23
News-loc	0.88	0.81	0.34	0.11
Sports-int	0.16	0	0.28	0.23
Sports-loc	0.09	0.06	0.17	0.21
Cult-int	0.67	0.24	0.47	0.27
Cult-loc	0.26	0.7	0.12	0.26
Tourism-int	0.08	0.2	0.11	0.11
Tourism-loc	0.08	0.14	0.2	0.13
Entertainment	0.2	0.25	0.74	0.28
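
The three-column table described before Table 2 (one row per offer, holding the source, the category, and the outcome) can be sketched as follows. This is only a minimal illustration, not the prototype's code: the function names `expand_counts` and `build_table` and the small example counts are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple

# One row per offer: (Column 1: source, Column 2: category, Column 3: outcome)
Row = Tuple[Hashable, Hashable, int]
# Per category: (number of accepts a_i, number of rejects r_i)
Counts = Dict[Hashable, Tuple[int, int]]


def expand_counts(source: Hashable, counts: Counts) -> List[Row]:
    """Expand accept/reject counts into one row per offer."""
    rows: List[Row] = []
    for category, (accepts, rejects) in counts.items():
        rows.extend((source, category, 1) for _ in range(accepts))  # 1 = accept
        rows.extend((source, category, 0) for _ in range(rejects))  # 0 = reject
    return rows


def build_table(type_label: Hashable, type_counts: Counts,
                individual_counts: Counts) -> List[Row]:
    """Rows for the type (marked with its label) followed by rows for the
    individual (marked with 0 in Column 1, as in the text)."""
    return expand_counts(type_label, type_counts) + expand_counts(0, individual_counts)


# Hypothetical counts for two categories, for illustration only.
type_counts = {1: (2, 1), 2: (0, 2)}
individual_counts = {1: (1, 0), 2: (1, 1)}
table = build_table("t", type_counts, individual_counts)
for row in table:
    print(row)
```

The resulting list has Σ (a_i + r_i) rows for the type plus the corresponding number of rows for the individual, so it can be handed to any routine that tests dependence between the columns of a data table.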

Three new customers have arrived, with the following access records of presented (accepted) offers:

Table 3: Individuals' access records
Category	Ind1	Ind2	Ind3
News-int	3(3)	32(25)	17(8)
News-loc	1(1)	18(9)	25(14)
Sports-int	1(1)	7(2)	7(3)
Sports-loc	0(0)	5(5)	6(1)
Cult-int	2(2)	11(4)	14(6)
Cult-loc	1(1)	6(2)	10(3)
Tourism-int	0(0)	4(4)	8(8)
Tourism-loc	1(1)	5(1)	8(3)
Entertainment	1(1)	17(13)	15(6)
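
The "presented (accepted)" notation of Table 3 can be converted into accept and reject counts with a few lines of code. The sketch below assumes that each cell has exactly the form `n(k)` with k ≤ n; the helper name `parse_record` is hypothetical, and the data are the Ind2 column of Table 3.

```python
import re
from typing import Dict, Tuple


def parse_record(cell: str) -> Tuple[int, int]:
    """Turn a cell such as '32(25)' into (accepts, rejects)."""
    match = re.fullmatch(r"\s*(\d+)\((\d+)\)\s*", cell)
    if match is None:
        raise ValueError(f"not in presented(accepted) form: {cell!r}")
    presented, accepted = int(match.group(1)), int(match.group(2))
    if accepted > presented:
        raise ValueError(f"more accepted than presented: {cell!r}")
    return accepted, presented - accepted


# Ind2 column of Table 3.
ind2_cells: Dict[str, str] = {
    "News-int": "32(25)", "News-loc": "18(9)", "Sports-int": "7(2)",
    "Sports-loc": "5(5)", "Cult-int": "11(4)", "Cult-loc": "6(2)",
    "Tourism-int": "4(4)", "Tourism-loc": "5(1)", "Entertainment": "17(13)",
}
ind2_counts = {category: parse_record(cell) for category, cell in ind2_cells.items()}
print(ind2_counts["News-int"])  # (25, 7): 25 accepted, 7 rejected
```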

The results in our example, if the types are defined by a sample of 100 items of each category, are:

Table 4: Probabilities of assignments of types to customers
	Typ1	Typ2	Typ3	Typ4
Ind1	0.2500	0.2500	0.2500	0.2500
Ind2	0.0000	0.0000	1.0000	0.0000
Ind3	0.0000	0.0001	0.9854	0.0145

It is now clear that the access record for Individual 1 is inadequate, and that the third individual is not quite compatible with any type. It should be noted that throughout this example we have worked with uniform priors. These priors have no canonical justification but should be regarded as conventional. If specific information justifying other priors is available, they can easily be used, but this is seldom the case. The choice of prior will affect the assignment of an individual to a type in rare cases, but only when the access records are very short and when the individual does not really fit any type.
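
To experiment with type assignment and with the effect of the prior, a simplified Beta-Binomial calculation can be used. The sketch below is not the M4-versus-M3 graphical model comparison of this chapter and should not be expected to reproduce Table 4. It treats the categories as independent, assumes each type is described by 100 offers per category (so the accepts are the Table 2 probabilities times 100), puts a symmetric Beta(alpha, alpha) prior on every accept probability (alpha = 1 is the uniform prior), and gives the four types equal prior weight. All function names are hypothetical.

```python
from math import exp, lgamma

# Accepts per 100 sampled offers, read off Table 2 in its row order
# (assumption: rejects = 100 - accepts for every type and category).
TYPE_ACCEPTS = {
    "Typ1": [90, 88, 16, 9, 67, 26, 8, 8, 20],
    "Typ2": [6, 81, 0, 6, 24, 70, 20, 14, 25],
    "Typ3": [82, 34, 28, 17, 47, 12, 11, 20, 74],
    "Typ4": [23, 11, 23, 21, 27, 26, 11, 13, 28],
}

# Ind2 from Table 3 as (accepts, rejects) per category, same row order.
IND2 = [(25, 7), (9, 9), (2, 5), (5, 0), (4, 7), (2, 4), (4, 0), (1, 4), (13, 4)]


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_predictive(type_accepts, record, alpha=1.0, sample_size=100):
    """Log probability of the individual's accept/reject sequence if he shares
    the type's accept probabilities, which are integrated out category by category."""
    total = 0.0
    for a, (x, y) in zip(type_accepts, record):
        r = sample_size - a
        total += log_beta(a + x + alpha, r + y + alpha) - log_beta(a + alpha, r + alpha)
    return total


def type_posterior(record, alpha=1.0):
    """Posterior probability of each type, given equal prior weight per type."""
    logs = {t: log_predictive(acc, record, alpha) for t, acc in TYPE_ACCEPTS.items()}
    top = max(logs.values())  # subtract the maximum for numerical stability
    weights = {t: exp(v - top) for t, v in logs.items()}
    norm = sum(weights.values())
    return {t: w / norm for t, w in weights.items()}


def log_bf_same_vs_own(type_accepts, record, alpha=1.0, sample_size=100):
    """Log Bayes factor: 'same probabilities as the type' against
    'own, unrelated accept probabilities' (a profile of his own)."""
    own = sum(log_beta(x + alpha, y + alpha) - log_beta(alpha, alpha) for x, y in record)
    return log_predictive(type_accepts, record, alpha, sample_size) - own


for alpha in (1.0, 0.5):  # uniform prior and one alternative prior
    print(alpha, type_posterior(IND2, alpha))
print(log_bf_same_vs_own(TYPE_ACCEPTS["Typ3"], IND2))
```

Varying `alpha` gives a quick check of how sensitive an assignment is to the prior. An empty access record leaves all four types equally probable in this sketch, and a negative value from `log_bf_same_vs_own` for the best-fitting type would argue for giving the individual a profile of his own.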

