USE OF DATA MINING IN DESIGNING THE LEARNING PROCESS

Chapter XIX - Data Mining in Designing an Agent-Based DSS
Data Mining: Opportunities and Challenges
by John Wang (ed)
Idea Group Publishing 2003


	Brought to you by Team-Fly

Conceptually, the learning process consists of analyzing the data of the stored cases to infer information that allows the discriminating function of the system domains to be updated. As previously indicated, this discriminating function defines the system knowledge base. The update is necessary to avoid the classification errors described in the previous section. Data mining is a suitable technology for this task.

The first step in designing the learning process is to design a cases base in which the data of queries and their respective results are stored. Conventional data-mining processes are carried out on a database whose structure was designed without taking such a process into account. For this reason, their initial stages involve tasks such as selection, preprocessing, and transformation, which are needed to generate a data structure convenient for analysis. One way of simplifying the mining process is to take these tasks into account when designing the data structure that stores cases in the cases base.

Structure of Data Associated with a Query

As described, whenever a user performs a query, the system classifies domains according to its knowledge base. The query is then sent to the domain d_j ∊ D_P having the greatest discriminating score RSV(d_j, q_i). If the query is not positively answered, it is sent to the domain d_j ∊ D_P having the next-highest discriminating score, and so forth, until one of the domains answers the query positively. In general, if the classification error does not affect the system efficacy, some d_j ∊ D_P should answer positively.

The data associated with each information requirement q_i that are necessary for the learning process will be stored as a case. Taking into account the possible classification errors previously discussed, the following data must be stored: the query q_i, the RSV(d_j, q_i) (discriminating function value) of the domains classified into the D_P criterion set (potential domain set), the taxonomy Kd_j of each domain d_j ∊ D_P, and the valuation of the answer va(d_j, q_i) given by the user of domain d_j when answering query q_i. It should be stressed that, from the learning point of view, the information contained in the answer does not matter; what matters is whether or not the required information was provided by a domain. This is why the variable called valuation of the answer, va(d_j, q_i), described in previous sections, was introduced.
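The routing just described can be sketched in a few lines. This is a minimal illustration, not the chapter's implementation: the function name `route_query`, the `ask` callback, and the answer labels are assumptions, and the scores are illustrative values only.

```python
from typing import Callable, Dict, Optional


def route_query(scores: Dict[str, float],
                ask: Callable[[str], Optional[str]]) -> Optional[str]:
    """Try the potential domains in descending RSV order.

    Returns the first domain that answers positively, or None if no
    potential domain does. `ask` is a hypothetical callback returning
    "positive", "negative", or None (no answer in time).
    """
    for domain in sorted(scores, key=scores.get, reverse=True):
        if ask(domain) == "positive":
            return domain
    return None


# Illustrative RSV values and answers (assumed for this sketch)
scores = {"Marketing": 1.0, "Production": 1.1, "Forecast": 1.2, "Sales": 1.6}
answers = {"Sales": "negative", "Forecast": "negative",
           "Production": None, "Marketing": "positive"}

answered_by = route_query(scores, answers.get)
```

With these values the query reaches three domains before the fourth one answers, which is the kind of classification error the learning process is meant to reduce.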

The logical structure of data to be stored for a specific query is summarized in Table 2.

Table 2: Logical structure of data for a query q_i
Domain	ValuationAnswer	RSV	w_p1	...	w_pn
d_j

The first field, called Domain, stores the name that identifies each domain d_j ∊ D_P. The second field, called ValuationAnswer (VA), stores the value of the qualitative variable va(d_j, q_i). The third field stores RSV(d_j, q_i), the value of the discriminating function that determines the position of the domain d_j ∊ D_P in the ranking.

The remaining fields store the values of the discriminating function weights, w_pk ∀ p_k ∊ Kq_i ∩ Kd_j / d_j ∊ D_P. These are the weights of the keywords that appear in the query and belong to the taxonomy of at least one domain classified into the D_P criterion set. A field value equal to 0 indicates that the keyword associated with this field is not stated in the taxonomy of the respective domain.

Let us consider the following example:

Query: How many items of product X were for sale in promotion?

Filtered Query: Kq₁ = {items, product, X, sale, promotion}

This example shows that four domains of the system were classified as potential, D_P = {Marketing, Production, Forecast, Sales}.

The Marketing domain presents the smallest RSV value and is therefore fourth in the ranking. Moreover, this domain answers the query positively, va(Marketing, q₁) = Positive.

The Forecast and Sales domains, which are second and first in the ranking, respectively, answer the query negatively, va(Forecast, q₁) = va(Sales, q₁) = Negative. The Production domain presents va(Production, q₁) = Null, since it exceeded the time period it had to answer the query.

Finally, the intersection between the set of keywords of the query and the set of keywords that define the taxonomy of the Marketing domain consists of the keywords product and promotion. Thus, Kq₁ ∩ K(Marketing) = {product, promotion}. For that reason, the respective weights of the discriminating function are greater than zero. As the keyword sale is not included in the Marketing taxonomy, the weight factor associated with the discriminating variable sale is w_sale = 0.

Table 3: Logical structure of data for the example
Domain	VA	RSV	W_product	W_sale	W_promotion
Marketing	P	1.0	0.4	0	0.6
Production	Null	1.1	0.9	0.2	0
Forecast	N	1.2	0.5	0.7	0
Sales	N	1.6	0.3	1.0	0.3
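As a sketch of how a case with the structure of Table 3 might be held in memory, the following stores one record per potential domain and derives each domain's position in the ranking from its RSV. The `DomainResult` class and the `ranking` helper are hypothetical names introduced for illustration; the values are those of Table 3.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DomainResult:
    domain: str
    va: Optional[str]            # "P", "N", or None for Null
    rsv: float
    weights: Dict[str, float]    # weight per query keyword, 0 if not in taxonomy


# Case for query q1, copied from Table 3
case_q1: List[DomainResult] = [
    DomainResult("Marketing", "P", 1.0, {"product": 0.4, "sale": 0.0, "promotion": 0.6}),
    DomainResult("Production", None, 1.1, {"product": 0.9, "sale": 0.2, "promotion": 0.0}),
    DomainResult("Forecast", "N", 1.2, {"product": 0.5, "sale": 0.7, "promotion": 0.0}),
    DomainResult("Sales", "N", 1.6, {"product": 0.3, "sale": 1.0, "promotion": 0.3}),
]


def ranking(case: List[DomainResult]) -> Dict[str, int]:
    """Map each domain to its 1-based position when sorted by descending RSV."""
    ordered = sorted(case, key=lambda r: r.rsv, reverse=True)
    return {r.domain: pos for pos, r in enumerate(ordered, start=1)}


order_q1 = ranking(case_q1)
```

Here `order_q1` places Sales first and Marketing fourth, matching the ranking discussed in the example.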

The logical structure of data represented in Table 2, which is defined to store a case, allows easy visualization of the results associated with a query q_i. In other words, it is a view of the data in the queries (cases) dimension. From the point of view of the learning need that has been posed, this is not a convenient structure: in order to evaluate possible domain classification errors, the data must be viewed in the domains dimension.

Visualization of Data of the Cases Base in the Domains Dimension

Table 4 presents a logical structure of data in the cases base that allows a domain's behavior across the various queries to be observed. For a given domain d_j, this structure stores all queries q_i for which RSV(d_j, q_i) > cutoff score. Essentially, this data structure uses binary fields to represent the relationship between a query q_i, which is defined by the set of keywords Kq_i, and the domain taxonomy Kd_j.

Table 4: Logical structure of data for the domain d_j
Query	VA	OR	p_1	...	p_k	p'_1	...	p'_m
q_i

In Table 4, the first field, Query, stores the name that identifies each query q_i. The second field, called VA, stores the value of the qualitative variable va(d_j, q_i). As described in previous sections, in order to detect errors that affect classification efficiency, the RSV(d_j, q_i) value itself does not matter, but the relative position of each domain d_j ∊ D_P does. Therefore, the third field (OR, for Order) stores the position that the domain obtained in the ranking of domains classified as potential to answer the query q_i. The remaining fields are divided into two groups. In the first group, each field represents a keyword of the domain taxonomy. Each field is designated with a keyword p_k ∊ Kd_j and represents a binary variable p_k that takes the value 1 if the keyword p_k of the domain taxonomy is in the query q_i, and 0 if p_k is not stated in that query.

In the second group, each field represents a keyword stated in a query q_i that does not belong to the domain taxonomy. Each of these fields is designated with a keyword p'_k and represents a binary variable p'_k that takes the value 1 if the keyword p'_k is stated in the query q_i but does not belong to the taxonomy of domain d_j, and 0 if p'_k is not in that query.

Let us consider the previous example, and suppose that it is the first case for which each domain is classified as having potential for answering the query:

Query: How many items of product X were for sale in promotion?

Filtered query: Kq₁ ={items, product, X, sale, promotion}

Marketing Domain

Query	VA	OR	marketing	product	promotion	items	x	sale
q₁	P	4	0	1	1	1	1	1

Production Domain

Query	VA	OR	production	product	machine	items	x	sale	promotion
q₁	Null	3	0	1	0	1	1	1	1

Forecast Domain

Query	VA	OR	forecast	age	sale	product	items	x	promotion
q₁	N	2	0	0	1	1	1	1	1

Sales Domain

Query	VA	OR	sale	product	customer	promotion	items	x
q₁	N	1	1	1	0	1	1	1

Thus, we have a structure for each system domain. Each structure stores the way in which the domain behaved for the different queries for which it has been classified as potential.
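The transformation from the query dimension to the domain dimension can be sketched as follows. The function `domain_row` is a hypothetical helper, not part of the chapter; it builds a single row in the layout of Table 4 and assumes keywords are compared as lowercase strings. In a full cases base, columns for non-taxonomy keywords introduced by other queries would hold 0 for this row.

```python
from typing import Dict, List, Optional, Union

Row = Dict[str, Union[str, int, None]]


def domain_row(query_id: str, va: Optional[str], order: int,
               query_keywords: List[str], taxonomy: List[str]) -> Row:
    """Build one Table 4 style row for a domain and a query."""
    row: Row = {"Query": query_id, "VA": va, "OR": order}
    # First group: one binary field per taxonomy keyword
    for p in taxonomy:
        row[p] = 1 if p in query_keywords else 0
    # Second group: query keywords that are not in the taxonomy
    for p in query_keywords:
        if p not in taxonomy:
            row[p] = 1
    return row


kq1 = ["items", "product", "x", "sale", "promotion"]
marketing_row = domain_row("q1", "P", 4, kq1,
                           taxonomy=["marketing", "product", "promotion"])
```

For the Marketing domain this yields the fields marketing = 0, product = 1, promotion = 1, followed by items, x, and sale, each equal to 1, as in the Marketing table above.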

In the following section, we will see how this logical data structure meets the posed learning needs.

Application of Data Mining to the Learning Process Design

Once the logical data structure is designed and the cases are stored, the cases must be analyzed using data mining. The objective is to analyze the data kept in the cases base so as to identify relationships among the data from which possible behavior patterns of cases can be defined. Such patterns are used to define rules for updating the discriminating function of each domain, which is stored in the knowledge base. This requires working with the case data associated with the domain involved. For this purpose, the logical data structure presented in Table 4 of the previous section will be used. As discussed, the update can be performed in two ways: either by updating the weights of the keywords (predicting variables) that define the domain taxonomy or by modifying the domain taxonomy through the addition of new keywords.

Patterns Obtainment

Once a significant number of cases q_i are stored, these data can be mined to look for patterns. For that purpose, we define Q_dj as the set of stored cases q_i associated with d_j. That is, Q_dj = {q_i / RSV(d_j, q_i) > cutoff score}.

To start with, cases q_i ∊ Q_dj are classified into four groups: a first group formed by cases in which no classification error occurred; a second group of cases in which the domain provided a positive answer, but it was not the first one in the ranking of potential domains (these are cases in which the classification error affected the system efficiency); a third group of cases, in which the domain provided a negative answer; and, finally, a fourth group of cases in which the query was not answered.

To carry out this classification, va(d_j, q_i) and or(d_j, q_i) are defined as predicting variables.

Group of efficient cases Q⁺_dj: consists of those cases that present va(d_j, q_i) = positive and or(d_j, q_i) = 1.

Group of non-efficient cases Q^*_dj: consists of those cases that present va(d_j, q_i) = positive but or(d_j, q_i) > 1.

Group of negative cases Q⁻_dj: consists of those cases that were answered negatively.

Group of null cases Q⁰_dj: consists of those cases in which no answer was provided.
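The four groups follow directly from the two predicting variables, so the classification can be sketched as a small function. The name `classify_case` and the string labels are assumptions made for this illustration.

```python
from typing import Optional


def classify_case(va: Optional[str], order: int) -> str:
    """Assign a stored case of a domain to one of the four groups.

    va    -- "positive", "negative", or None when no answer was provided
    order -- position of the domain in the ranking for that query (1 = first)
    """
    if va == "positive":
        # Positive answer: efficient only if the domain was ranked first
        return "efficient" if order == 1 else "non_efficient"
    if va == "negative":
        return "negative"
    return "null"


# The example query q1 seen from each domain
groups = {
    "Marketing": classify_case("positive", 4),
    "Production": classify_case(None, 3),
    "Forecast": classify_case("negative", 2),
    "Sales": classify_case("negative", 1),
}
```

For the example, the Marketing case falls into the non-efficient group, since the domain answered positively but was ranked fourth.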

Once the cases q_i ∊ Q_dj are classified into one of the four defined groups, the purpose is to infer rules to update the discriminating function of each domain that is stored in the knowledge base. The action of these rules will be to:

Affect the cases belonging to Q⁺_dj as little as possible.
Determine which weights w_pk of the keywords (predicting variables) of the domain taxonomy must be increased in order to correct the classification error produced in the cases of group Q^*_dj. These rules will operate on keywords p_k ∊ Kd_j that are frequently present in queries q_i ∊ Q^*_dj, since it can be inferred that these predicting variables are more important for classifying domains than their associated weights reflect. In other words, the current weight factors w_pk are too low.
Encourage the entry of new keywords into the domain taxonomy. This means including new predicting variables in the discriminating function of domain d_j. These rules will operate on keywords p'_k ∉ Kd_j that are frequently present in queries q_i ∊ Q^*_dj. In other words, it is inferred that these predicting variables are important for classifying domains. However, if those words are also frequently present in queries answered by most of the remaining domains, they would not be useful for distinguishing among domains and thus should not be incorporated.
Diminish the weights of overweighted keywords. A domain may present many cases in which it answered negatively although it was better positioned in the ranking than the domain that actually provided a positive answer. This means that the domain taxonomy has words whose weights are too high compared with their importance in the domain. Therefore, there should be a rule that diminishes the weights of these words.

With the aim of interpreting the relationships among variables, we present three main rules obtained by the mining process that will be used to develop the system learning process:

[Rule 1: increasing the weight of a taxonomy keyword (rule figure not reproduced)]

The condition of this rule states that a word p belonging to the domain taxonomy is a candidate for a weight increase if:

more than α₁ cases are stored in Q_dj and
the number of times p is stated in queries of Q^*_dj (n_p^*) is greater than α₂ and
n_p^* is much higher than n_p⁻, the number of times p is stated in queries of Q⁻_dj
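The three conditions above can be read as a simple predicate. The sketch below is only one possible reading: the function name is hypothetical, the thresholds are passed in as parameters, and "much higher" is interpreted as exceeding a multiple `ratio` of n_p⁻, which is an assumption not stated in the text.

```python
def is_weight_increase_candidate(n_cases: int, n_star: int, n_neg: int,
                                 alpha1: int, alpha2: int,
                                 ratio: float = 2.0) -> bool:
    """Check whether a taxonomy keyword p is a candidate for a weight increase.

    n_cases -- number of cases stored in Q_dj
    n_star  -- times p is stated in queries of the non-efficient group
    n_neg   -- times p is stated in queries of the negative group
    ratio   -- assumed meaning of "much higher": n_star > ratio * n_neg
    """
    return (n_cases > alpha1
            and n_star > alpha2
            and n_star > ratio * n_neg)


# Hypothetical counts and thresholds
candidate = is_weight_increase_candidate(n_cases=50, n_star=12, n_neg=3,
                                         alpha1=30, alpha2=10)
```

How the α thresholds and the ratio should be chosen is a tuning question that this sketch leaves open.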

Next, we present the rule for the entry of new words into the domain taxonomy.

[Rule 2: entry of a new keyword into the domain taxonomy (rule figure not reproduced)]

The first condition of this rule states that a word p is a candidate for entering Kd_j if:

more than α₃ cases are stored in Q_dj and
the proportion of the number of times p is stated in queries of Q^*_dj to the number of cases stored in Q_dj is greater than α₄ and
n_p^* is much higher than n_p⁻

A candidate word can enter Kd_j if the number of domains in the system is much higher than the number of domains in which p is stated or is a candidate.
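The two-stage structure of this rule (candidacy, then a check against the other domains) can be sketched as two predicates. Function names are hypothetical, and, as an assumption, "much higher" is again modeled with a multiplicative `ratio`.

```python
def is_entry_candidate(n_cases: int, n_star: int, n_neg: int,
                       alpha3: int, alpha4: float,
                       ratio: float = 2.0) -> bool:
    """First condition: a non-taxonomy keyword p is a candidate for entry."""
    if n_cases == 0 or n_cases <= alpha3:
        return False
    return n_star / n_cases > alpha4 and n_star > ratio * n_neg


def can_enter_taxonomy(candidate: bool, n_domains: int,
                       n_domains_with_p: int, ratio: float = 2.0) -> bool:
    """Second condition: p must be rare enough across domains to discriminate.

    n_domains_with_p -- domains in which p is already stated or is a candidate
    """
    return candidate and n_domains > ratio * n_domains_with_p


# Hypothetical counts and thresholds
cand = is_entry_candidate(n_cases=40, n_star=10, n_neg=2, alpha3=30, alpha4=0.2)
enters = can_enter_taxonomy(cand, n_domains=12, n_domains_with_p=2)
```

The second predicate captures the earlier remark that a word present in queries answered by most domains does not help to distinguish among them.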

[Rule 3: diminishing the weight of a taxonomy keyword (rule figure not reproduced)]

The condition of this rule states that a word p is a candidate for a weight decrease if:

more than α₅ cases are stored in Q_dj and
the proportion of the number of times p is stated in queries of Q^∧_dj to the number of cases stored in Q_dj is greater than α₆ and
n_p^∧ is much greater than the number of times p is stated in queries with a positive answer (n_p⁺ + n_p^*)
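A corresponding sketch for the weight-decrease condition is given below. It takes the count n_p^∧ as an input exactly as the text uses it, without assuming how the underlying set of cases is built. The function name and the multiplicative `ratio` used for "much greater" are assumptions.

```python
def is_weight_decrease_candidate(n_cases: int, n_hat: int,
                                 n_pos: int, n_star: int,
                                 alpha5: int, alpha6: float,
                                 ratio: float = 2.0) -> bool:
    """Check whether a keyword p is a candidate for a weight decrease.

    n_cases -- number of cases stored in Q_dj
    n_hat   -- the count written n_p^∧ in the text (taken as given)
    n_pos   -- times p is stated in queries of the efficient group
    n_star  -- times p is stated in queries of the non-efficient group
    """
    if n_cases == 0 or n_cases <= alpha5:
        return False
    return (n_hat / n_cases > alpha6
            and n_hat > ratio * (n_pos + n_star))


# Hypothetical counts and thresholds
decrease = is_weight_decrease_candidate(n_cases=40, n_hat=16, n_pos=2,
                                        n_star=3, alpha5=30, alpha6=0.3)
```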
