USE OF DATA MINING IN DESIGNING THE LEARNING PROCESS

Chapter XIX - Data Mining in Designing an Agent-Based DSS
Data Mining: Opportunities and Challenges
by John Wang (ed) 
Idea Group Publishing 2003

Conceptually, the learning process consists of analyzing the data of the stored cases to infer information that allows the discriminating function of the system domains to be updated. As previously indicated, this discriminating function defines the system knowledge base. The update is necessary to avoid the classification errors described in the previous section. Data mining is the appropriate technology for this task.

The first step in designing the learning process is to design a cases base in which the data of queries and their respective results are stored. Conventional data-mining processes are carried out on a database whose structure was designed without taking such a process into account. For this reason, their initial stages involve tasks such as selection, preprocessing, and transformation, which are needed to generate a data structure suitable for analysis. One way of simplifying the mining process is to take these tasks into account when designing the data structure that stores cases in the cases base.

Structure of Data Associated with a Query

As described earlier, whenever a user performs a query, the system classifies domains according to its knowledge base. The query is then sent to the domain dj ∈ DP having the greatest discriminating score RSV(dj, qi). If the query is not answered positively, it is sent to the domain dj ∈ DP with the next-highest discriminating score RSV(dj, qi), and so forth, until one of the domains answers the query positively. In general, if the classification error does not affect the system efficacy, at least one of the domains dj ∈ DP should answer positively.
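The routing just described can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the function name `route_query`, the `ask` callback, and the answer labels are assumptions; the scores and answers are the ones from the worked example later in this section (Table 3).

```python
from typing import Callable, Dict, List, Optional, Tuple


def route_query(
    rsv_scores: Dict[str, float],
    ask: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], List[Tuple[str, Optional[str]]]]:
    """Send a query to the potential domains in descending RSV order.

    rsv_scores maps each domain dj in DP to RSV(dj, qi).
    ask(domain) is a hypothetical callback returning "positive",
    "negative", or "null" (no answer within the allowed time).
    Returns the domain that answered positively (or None) and the
    list of (domain, answer) attempts in the order they were made.
    """
    trail = []
    for domain in sorted(rsv_scores, key=rsv_scores.get, reverse=True):
        answer = ask(domain)
        trail.append((domain, answer))
        if answer == "positive":
            return domain, trail  # stop at the first positive answer
    return None, trail


# Scores and answers from the worked example (Table 3).
scores = {"Marketing": 1.0, "Production": 1.1, "Forecast": 1.2, "Sales": 1.6}
answers = {"Marketing": "positive", "Production": "null",
           "Forecast": "negative", "Sales": "negative"}
winner, trail = route_query(scores, answers.get)
```

In this example the query reaches Marketing only after the three higher-ranked domains have failed to answer it positively, which is the kind of classification error the learning process is meant to reduce.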

The data associated with each information requirement qi that are necessary for the learning process will be stored as a case. Taking into account the possible classification errors previously discussed, the following data must be stored: the query qi; the RSV(dj, qi) (discriminating function value) of the domains classified into the DP criterion set (potential domain set); the taxonomy Kdj of each domain dj ∈ DP; and the valuation of the answer va(dj, qi) emitted by the user of domain dj when answering query qi. It should be stressed that, from the learning point of view, the information in the answer to the query does not matter; what matters is whether or not a domain provided the required information. This is why the variable called valuation of the answer, va(dj, qi), described in previous sections, was introduced.

The logical structure of data to be stored for a specific query is summarized in Table 2.

Table 2: Logical structure of data for a query qi

Domain   ValuationAnswer   RSV   wp1   ...   wpn
dj

The first field, called Domain, stores the name that identifies each domain dj ∈ DP. The second field, called ValuationAnswer (VA), stores the value of the qualitative variable va(dj, qi). The third field stores RSV(dj, qi), the value of the discriminating function that determines the position of the domain dj ∈ DP in the ranking.

The remaining fields store the values of the discriminating function weights, wpk for every pk ∈ Kqi ∩ Kdj with dj ∈ DP. These are the weights of the keywords that appear in the query and belong to the taxonomy of at least one domain classified into the DP criterion set. A field value equal to 0 indicates that the keyword associated with this field is not stated in the taxonomy of the respective domain.
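As a rough sketch of how such a case could be held in memory, the following builds one row per potential domain. The names `CaseRow` and `build_case` and the two-domain data are hypothetical and are not taken from the chapter; RSV is passed in as given rather than computed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class CaseRow:
    """One row of the Table 2 structure for a single query qi."""
    domain: str                      # Domain
    valuation_answer: Optional[str]  # "P", "N", or None when no answer arrived
    rsv: float                       # RSV(dj, qi), supplied by the classifier
    weights: Dict[str, float]        # one weight per keyword column


def build_case(
    query_keywords: Set[str],
    taxonomies: Dict[str, Dict[str, float]],  # domain -> {keyword: weight}
    rsv: Dict[str, float],
    answers: Dict[str, Optional[str]],
) -> List[CaseRow]:
    # Columns: query keywords found in the taxonomy of at least one potential domain.
    columns = sorted(k for k in query_keywords
                     if any(k in tax for tax in taxonomies.values()))
    return [
        CaseRow(
            domain=d,
            valuation_answer=answers.get(d),
            rsv=rsv[d],
            # 0 marks a keyword that is absent from this domain's taxonomy.
            weights={k: tax.get(k, 0.0) for k in columns},
        )
        for d, tax in taxonomies.items()
    ]


# Hypothetical two-domain illustration (not data from the chapter).
case = build_case(
    query_keywords={"price", "stock", "red"},
    taxonomies={"A": {"price": 0.5, "stock": 0.2},
                "B": {"stock": 0.8, "order": 0.1}},
    rsv={"A": 0.7, "B": 0.8},
    answers={"A": "P", "B": "N"},
)
```

Here "red" produces no column because no potential domain lists it, and domain B stores 0 for "price" because that keyword is not in its taxonomy.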

Let us consider the following example:

Query: How many items of product X were for sale in promotion?

Filtered Query: Kq1 = {items, product, X, sale, promotion}

In this example, four domains of the system were classified as potential: DP = {Marketing, Production, Forecast, Sales}.

The Marketing domain presents the smallest RSV value and is therefore fourth in the ranking. Moreover, this domain answers the query positively: va(Marketing, q1) = Positive.

The Forecast and Sales domains, which are second and first in the ranking, respectively, answer the query negatively: va(Forecast, q1) = va(Sales, q1) = Negative. The Production domain presents va(Production, q1) = Null, since it exceeded the time period it had to answer the query.

Finally, the intersection between the set of keywords of the query and the set of keywords that define the taxonomy of the Marketing domain consists of the keywords product and promotion. Thus, Kq1 ∩ K(Marketing) = {product, promotion}. For that reason, the respective weights of the discriminating function are higher than zero. As the keyword sale is not included in the Marketing taxonomy, the weight factor associated with the discriminating variable sale is wsale = 0.

Table 3: Logical structure of data for the example

Domain       VA     RSV   Wproduct   Wsale   Wpromotion
Marketing    P      1.0   0.4        0       0.6
Production   Null   1.1   0.9        0.2     0
Forecast     N      1.2   0.5        0.7     0
Sales        N      1.6   0.3        1.0     0.3
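The values in Table 3 can be checked with a short script. In this table, each RSV happens to equal the sum of the keyword weights listed for that domain; the sketch below verifies that observation and derives each domain's position in the ranking. It is not a statement of how the discriminating function is defined in general, and the variable names are illustrative.

```python
import math

# Rows of Table 3: domain -> (VA, RSV, weights for product, sale, promotion).
table3 = {
    "Marketing":  ("P",  1.0, {"product": 0.4, "sale": 0.0, "promotion": 0.6}),
    "Production": (None, 1.1, {"product": 0.9, "sale": 0.2, "promotion": 0.0}),
    "Forecast":   ("N",  1.2, {"product": 0.5, "sale": 0.7, "promotion": 0.0}),
    "Sales":      ("N",  1.6, {"product": 0.3, "sale": 1.0, "promotion": 0.3}),
}

# Observation for this table only: RSV equals the sum of the listed weights.
sums_match = all(
    math.isclose(sum(weights.values()), rsv)
    for _, rsv, weights in table3.values()
)

# Position of each domain in the ranking (1 = highest RSV).
ranking = sorted(table3, key=lambda d: table3[d][1], reverse=True)
order = {domain: position for position, domain in enumerate(ranking, start=1)}
```

The resulting order places Sales first and Marketing fourth, matching the description of the example.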

The logical structure of data represented in Table 2, which is defined to store a case, allows easy visualization of the results associated with a query qi. In other words, it is a view of the data in the queries (or cases) dimension. From the point of view of the learning need that has been posed, this is not a convenient structure: in order to evaluate possible domain classification errors, the data must be viewed from the domains dimension.

Visualization of Data of the Cases Base in the Domains Dimension

Table 4 presents a logical structure of data in the cases base that allows a domain's behavior across the various queries to be observed. For a given domain dj, this structure stores all queries qi for which RSV(dj, qi) > cutoff score. Essentially, this data structure uses binary fields to represent the relationship between a query qi, which is defined by the set of keywords Kqi, and the domain taxonomy Kdj.

Table 4: Logical structure of data for the domain dj

Query   VA   OR   P1   ...   Pk   P'1   ...   P'm
qi

In Table 4, the first field, Query, stores the name that identifies each query qi. The second field, called VA, stores the value of the qualitative variable va(dj, qi). As described in previous sections, what matters for detecting possible errors in classification efficiency is not the RSV(dj, qi) value itself but the relative position of each domain dj ∈ DP. Therefore, the third field, OR (Order), stores the position that the domain obtained in the ranking of domains classified as potential to answer the query qi. The remaining fields are divided into two groups. In the first group, each field represents a keyword of the domain taxonomy. Each field is designated with a keyword pk ∈ Kdj and represents a binary variable pk that takes the value 1 if the keyword pk of the domain taxonomy is in the query qi, and 0 if pk is not stated in that query.

In the second group, each field represents a keyword stated in a query qi that does not belong to the domain taxonomy. Each of these fields is designated with a keyword p'k and represents a binary variable p'k that takes the value 1 if the keyword p'k is stated in the query qi but does not belong to the taxonomy of domain dj, and 0 if p'k is not in that query.

Let us consider the previous example, and suppose that it is the first case for which each of these domains is classified as potential for answering the query:

Query: How many items of product X were for sale in promotion?

Filtered query: Kq1 ={items, product, X, sale, promotion}

Marketing Domain

Query   VA   OR   marketing   product   promotion   items   x   sale
q1      P    4    0           1         1           1       1   1

Production Domain

Query   VA     OR   production   product   machine   items   x   sale   promotion
q1      Null   3    0            1         0         1       1   1      1

Forecast Domain

Query   VA   OR   forecast   age   sale   product   items   x   promotion
q1      N    2    0          0     1      1         1       1   1

Sales Domain

Query   VA   OR   sale   product   customer   promotion   items   x
q1      N    1    1      1         0          1           1       1

Thus, we have a structure for each system domain. Each structure stores the way in which the domain behaved for the different queries for which it has been classified as potential.
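A minimal sketch of how one row of a domain's structure could be derived from a query is shown below. The function name `domain_view_row` and the dictionary layout are assumptions made for illustration; the taxonomy and query keywords are those of the Marketing example above.

```python
from typing import Dict, List, Optional


def domain_view_row(
    query_id: str,
    va: Optional[str],
    order: int,
    query_keywords: List[str],
    taxonomy: List[str],
) -> Dict[str, object]:
    """Build one Table 4 style row for a single domain and query."""
    in_query = set(query_keywords)
    in_taxonomy = set(taxonomy)
    return {
        "query": query_id,
        "va": va,
        "order": order,
        # First group: one binary field per taxonomy keyword pk.
        "taxonomy": {k: int(k in in_query) for k in taxonomy},
        # Second group: query keywords p'k that are not in the taxonomy.
        "other": {k: 1 for k in query_keywords if k not in in_taxonomy},
    }


# Filtered query and Marketing taxonomy from the example (lowercased).
kq1 = ["items", "product", "x", "sale", "promotion"]
marketing_row = domain_view_row(
    "q1", "P", 4, kq1, taxonomy=["marketing", "product", "promotion"]
)
```

When further queries are stored for the same domain, any second-group keyword that a query does not state would be recorded as 0 for that row, as the field definition above requires.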

In the following section, we will see how this logical data structure meets the posed learning needs.

Application of Data Mining to the Learning Process Design

Once the logical data structure is designed and the cases are stored, the cases must be analyzed using data mining. The objective is to analyze the data kept in the cases base so as to identify relationships among the data from which possible behavior patterns of cases can be defined. Such patterns are used to define rules for updating the discriminating function of each domain, which is stored in the knowledge base. This requires working with the case data associated with the domain involved. For this purpose, the logical data structure presented in Table 4 of the previous section will be used. As discussed, the update can be performed in two ways: either by updating the weights of the keywords (predicting variables) that define the domain taxonomy, or by modifying the domain taxonomy through the addition of new keywords.

Obtaining Patterns

Once a significant number of cases qi are stored, these data can be mined to look for patterns. For that purpose, we define Qdj as the set of stored cases qi associated with dj. That is, Qdj = {qi / RSV(dj, qi) > cutoff score}.
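The definition of Qdj amounts to a simple filter over the stored cases. The sketch below assumes a hypothetical mapping from each query to the RSV it obtained per domain; the scores and the cutoff value are invented for illustration.

```python
def cases_for_domain(rsv_by_query, domain, cutoff):
    """Return Qdj: the queries whose RSV for `domain` exceeds the cutoff score."""
    return {
        query
        for query, scores in rsv_by_query.items()
        if domain in scores and scores[domain] > cutoff
    }


# Hypothetical scores and cutoff, for illustration only.
rsv_by_query = {
    "q1": {"Sales": 1.6, "Marketing": 1.0},
    "q2": {"Sales": 0.2, "Marketing": 0.9},
    "q3": {"Marketing": 0.1},
}
q_sales = cases_for_domain(rsv_by_query, "Sales", cutoff=0.5)
q_marketing = cases_for_domain(rsv_by_query, "Marketing", cutoff=0.5)
```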

To start with, the cases qi ∈ Qdj are classified into four groups: a first group formed by cases in which no classification error occurred; a second group of cases in which the domain provided a positive answer but was not first in the ranking of potential domains (these are cases in which the classification error affected the system efficiency); a third group of cases in which the domain provided a negative answer; and, finally, a fourth group of cases in which the query was not answered.

To carry out this classification, va(dj, qi) and or(dj, qi) are defined as predicting variables.

Group of efficient cases Q+dj: comprises those cases that present va(dj, qi) = positive and or(dj, qi) = 1.

Group of non-efficient cases Q*dj: comprises those cases that present va(dj, qi) = positive but or(dj, qi) > 1.

Group of negative cases Q-dj: comprises those cases that were answered in a negative way.

Group of null cases Q0dj: comprises those cases in which an answer was not provided.
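The four groups can be expressed as a small classification function over the two predicting variables. This is a sketch: the group labels "Q+", "Q*", "Q-", and "Q0" and the encoding of va as "P", "N", or None are assumptions chosen to mirror the tables above.

```python
from typing import Optional


def classify_case(va: Optional[str], order: int) -> str:
    """Assign a stored case to one of the four groups for a domain.

    va is "P" (positive), "N" (negative), or None (no answer provided);
    order is the domain's position in the ranking for that query.
    """
    if va is None:
        return "Q0"                           # null case: no answer
    if va == "N":
        return "Q-"                           # negative case
    if va == "P":
        return "Q+" if order == 1 else "Q*"   # efficient vs. non-efficient
    raise ValueError(f"unknown valuation: {va!r}")


# Case q1 as seen from each domain in the worked example.
groups = {
    "Marketing": classify_case("P", 4),
    "Production": classify_case(None, 3),
    "Forecast": classify_case("N", 2),
    "Sales": classify_case("N", 1),
}
```

For the example query, Marketing falls into the non-efficient group because it answered positively from fourth place.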

Once the cases qi ∈ Qdj are classified into one of the four defined groups, the purpose is to infer rules to update the discriminating function of each domain that is stored in the knowledge base. The action of these rules will be to:

  • Affect the cases belonging to Q+dj as little as possible.

  • Determine the weights wpk of the keywords (predicting variables) of the domain taxonomy that must be increased in order to correct the classification error produced in the cases of group Q*dj. These rules will operate on keywords pk ∈ Kdj that are frequently present in queries qi ∈ Q*dj, since it can be inferred that these predicting variables are more important for classifying domains than their associated weights reflect. In other words, the current weight factors wpk are too low.

  • Encourage the admission of new keywords into the domain taxonomy. This means including new predicting variables in the discriminating function of domain dj. These rules will operate on keywords p'k ∉ Kdj that are frequently present in queries qi ∈ Q*dj. In other words, it is inferred that these predicting variables are important for classifying domains. However, if those words are also frequently present in queries answered by most of the remaining domains, they would not be useful for distinguishing among domains and thus should not be incorporated.

  • Diminish the weights of taxonomy keywords that are overrated. A domain may present many cases in which it answered negatively even though it was better positioned in the ranking than the domain that actually provided a positive answer. This means that the domain taxonomy has words whose weights are too high when compared to their importance in the domain.

With the aim of interpreting the relationships among variables, we present three main rules obtained by the mining process that will be used to develop the system learning process:

[Figure: rule for increasing the weight of a keyword of the domain taxonomy]

In the condition of this rule, we are saying that a word belonging to the domain taxonomy is a candidate for increasing its weight if:

  1. more than α1 cases are stored in Qdj and

  2. the number of times in which p is stated in queries of Q*dj is greater than α2 and

  3. np* is much higher than np
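The three conditions above can be combined into a single predicate, sketched below. The source does not quantify "much higher", so the `factor` parameter is an assumption, as are the function and argument names; `n_p` stands for the count the text calls np, whose exact definition is not given in this passage.

```python
def is_weight_increase_candidate(
    n_cases: int,     # number of cases stored in Qdj
    n_p_star: int,    # times p is stated in queries of Q*dj (np*)
    n_p: int,         # the count the text calls np
    alpha1: float,
    alpha2: float,
    factor: float = 2.0,  # assumed meaning of "much higher": np* > factor * np
) -> bool:
    """Check whether taxonomy keyword p is a candidate for a weight increase."""
    enough_cases = n_cases > alpha1
    frequent_in_non_efficient = n_p_star > alpha2
    much_higher = n_p_star > factor * n_p
    return enough_cases and frequent_in_non_efficient and much_higher


# Hypothetical counts and thresholds, for illustration only.
candidate = is_weight_increase_candidate(50, 12, 3, alpha1=30, alpha2=10)
too_few_cases = is_weight_increase_candidate(20, 12, 3, alpha1=30, alpha2=10)
```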

Next, we present the rule for admitting new words into the domain taxonomy.

[Figure: rule for admitting new keywords into the domain taxonomy]

In the first condition of this rule, we say that a word is a candidate for entering Kdj if:

  1. more than α3 cases are stored in Qdj and

  2. the ratio of the number of times p is stated in queries of Q*dj to the number of cases stored in Qdj is greater than α4 and

  3. np* is much higher than np.

A candidate word can enter Kdj if the number of domains in the system is much higher than the number of domains in which p is stated or is a candidate.
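A possible reading of this two-stage rule is sketched below: a candidacy test followed by a check that the word is not common to most domains. As before, `factor` is an assumed interpretation of "much higher", and all names and numbers are illustrative rather than taken from the source.

```python
def is_new_keyword_candidate(n_cases, n_p_star, n_p, alpha3, alpha4, factor=2.0):
    """First stage: is word p a candidate for entering the taxonomy Kdj?"""
    if n_cases == 0 or n_cases <= alpha3:
        return False  # not enough stored cases in Qdj
    frequent = n_p_star / n_cases > alpha4
    much_higher = n_p_star > factor * n_p  # assumed meaning of "much higher"
    return frequent and much_higher


def may_enter_taxonomy(is_candidate, n_domains, n_domains_with_p, factor=2.0):
    """Second stage: admit p only if few domains already state it or have it
    as a candidate, so that it still helps to distinguish among domains."""
    return is_candidate and n_domains > factor * n_domains_with_p


# Hypothetical counts and thresholds, for illustration only.
cand = is_new_keyword_candidate(40, 10, 2, alpha3=30, alpha4=0.2)
admitted = may_enter_taxonomy(cand, n_domains=12, n_domains_with_p=2)
too_common = may_enter_taxonomy(cand, n_domains=12, n_domains_with_p=8)
```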

[Figure: rule for diminishing the weight of a keyword of the domain taxonomy]

In the condition of this rule, we are saying that a word p is a candidate for diminishing its weight if:

  1. more than α5 cases are stored in Qdj and

  2. the ratio of the number of times p is stated in queries of Q-dj to the number of cases stored in Qdj is greater than α6 and

  3. np-, the number of times p is stated in queries with a negative answer, is much greater than the number of times p is stated in queries with a positive answer (np+ + np*)
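The sketch below shows how the counts used by this rule could be obtained from stored cases and then tested. It assumes that the count compared against (np+ + np*) is the number of negative cases stating p, that "much greater" means exceeding a multiple `factor`, and that each case is available as a (group, keywords) pair; these are illustrative assumptions, not the source's definitions.

```python
from collections import Counter

GROUPS = ("Q+", "Q*", "Q-", "Q0")


def keyword_counts(cases):
    """Count, per group, the number of cases in which each keyword is stated."""
    counts = {group: Counter() for group in GROUPS}
    for group, keywords in cases:
        counts[group].update(set(keywords))  # count each keyword once per case
    return counts


def is_weight_decrease_candidate(p, cases, alpha5, alpha6, factor=2.0):
    """Check whether keyword p is a candidate for a weight decrease."""
    cases = list(cases)
    n_cases = len(cases)
    if n_cases == 0 or n_cases <= alpha5:
        return False
    counts = keyword_counts(cases)
    n_neg = counts["Q-"][p]                    # times p appears in negative cases
    n_pos = counts["Q+"][p] + counts["Q*"][p]  # np+ + np*
    return n_neg / n_cases > alpha6 and n_neg > factor * n_pos


# Hypothetical stored cases for one domain, for illustration only.
stored = (
    [("Q-", ["sale", "product"])] * 6
    + [("Q+", ["sale"])]
    + [("Q*", ["promotion"])]
)
sale_flagged = is_weight_decrease_candidate("sale", stored, alpha5=5, alpha6=0.5)
promo_flagged = is_weight_decrease_candidate("promotion", stored, alpha5=5, alpha6=0.5)
```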

ISBN: 1591400511