PATTERN MATCH ANALYSIS


While exploiting and researching the Market Basket Data Analysis System Based on Mining Associations Rules, we analyzed the pattern match included in mining association rules. We discovered some characteristics of the transaction data set and the set of candidates:

  1. Recording the item number and index can reduce the comparison time and enhance the efficiency of pattern match . Sort all the items of each transaction in the transaction data set alphabetically . When preprocessing the transaction data set, do the sort, record the item number of each transaction and set it as an index. Then, choose a suitable algorithm to generate the k-itemsets . We get alphabetized k-itemsets which match.

  2. The itemsets in the set of candidates can be alphabetized, not only in row, but also in a column . The set of candidate k-itemsets generated by the join and prune steps, based on the set of candidate (k ˆ’ 1)-itemsets and C 1, is alphabetical. Then, through the SQL language as follows , we can get the alphabetic set of candidate k-itemsets in a row and in a column:

     INSERT INTO C  k  SELECT TOP 100 PERCENT dbo.p.col  1  , dbo.p.col  2  ,..., dbo.p.col  k   1  , q. col  k   1  AS col  k  FROM dbo.l  k   1  p INNER JOIN dbo.l  k   1  q ON dbo.p.col  1  =q.col  1  AND dbo.p.col  2  =q.col  2  AND ... AND dbo.p.col  k   1  <q.col  k   1  ORDER BY P.col  1  , P.col  2  ,..., P.col  k   1  , Q.col  k   1  

    We can deal with the pattern match by the alphabetic characteristic. For two alphabetic linear objects, we can find the most efficient lookup algorithm to realize match easily. Then, we should look for a suitable algorithm to transform the alphabetic pre-match object into an alphabetic linear object. The simplest transformation method is to set the item number as weight and add up k-itemsets with weight as the object of match.

  3. The item number of the longest itemset in a transaction data set is far less than the total item number I, and the transactions having longer itemsets take up only a small part of the total transactions . When researching the market data set, which has 15,169 pieces of merchandise divided into 228 kinds of ware, we processed the mining based on 228 items, that is I=228. We discovered that the item number of the longest transaction in the transaction data set is 33; 0.098 percent of the transactions have an item number larger than 20; 0.62 percent have an item number larger than 15; and 3.9 percent have an item number larger than 10.

  4. The matching data object is larger, and the matching success ratio is lower . When the k-itemsets number of each transaction in the transaction data set is larger, or when the transaction item number is larger, the itemset number of the set of candidates is larger. Then, the matching success ratio is lower. Owing to match being a process of lookup in nature, we can choose an algorithm which fits the match characteristic. In this chapter, that algorithm is bisearch, which has an average comparison time of log 2 n ( n is the length of the object) and, therefore, can improve the efficiency.

  5. The match proceeded between two alphabetical objefound that the k-itemsets number of each transaction in the transaction data set, when k went from 1 to some scale, was always less than, or far less than, the itemsets number of the set of candidates. But above the scale, the k-itemsets number of each transaction in the transaction data set was more than the itemsets number of candidates. The results show determine data object should be applied to match the other one between the transaction data set and the set of candidates. But what's the determining principlecy.

Let the average transaction number of transaction itemsets in transaction data set DB be h . Then, the average k-itemsets number in each transaction is C k h . And, let there be m k-itemsets in candidate k-itemsets set C k . Using the bisearch method, if we take each candidate in candidate k-itemsets set C k to match the C k h -itemsets of each transaction in the transaction data set, then the total comparison time is mlog 2 C k h ; conversely, if we take the C k h k-itemsets of each transaction in the transaction data set to match each candidate in candidate k-itemsets set C k , the total compare time is C k h log 2 m. Which one measure we should take is decided by the big and small of m log 2 C k h and C k h log 2 m.

Then compare the big and small of m log 2 C k h and C k h log 2 m.

When x>e(const), then ln>1, h ² (x)<0, function h(x) is a monotone decrease function, and then is tenable. That is, when e<x<y, x log 2 y<y log 2 x.

To the pattern match of mining for association rules, there is C k h >h>e and m is a positive integer.

When m>e, if C k h <m, then C k h log 2 m<m log 2 C, and we choose to take the C k h k-itemsets of each transaction in the transaction data set to match each candidate in candidate k-itemsets set C k , which has a higher efficiency. On the contrary, if C k h >m, then C k h log 2 m>m log 2 C k h , and we choose to take each candidate in candidate k-itemsets set C k to match the C k h k-itemsets of each transaction in the transaction data set, which has the higher efficiency.

When m=2, then C k h >2 log 2 C k h ; and when m=1, the question is changed into the match between one single data and an array of data. Therefore, when m<e, take each itemset in candidate k-itemsets set C k to match the C k h k-itemsets of each transaction in the transaction data set.

From the above, when C k h <m, take k-itemsets of each transaction in the transaction data set to match the candidates set. Contrarily, when C k h >m, take the candidate k-itemsets set to match k-itemsets of each transaction in the transaction data set.

We can summarize the flow of the pattern match being fit for association rules as follows, through the preview analysis:

If k-itemsets of each transaction in the transaction data set match the candidate k-itemsets set, we should dispose of each k-itemsets of the candidate k-itemsets set by adding up their weights and putting them into a linear array to wait for match. Then, deal with current k-itemsets of the current transaction in the transaction data set by adding up the weight for each match, which is a match between a single data and an array, then choose, and then dispose and match the next k-itemsets of transaction. That method can be an effective use of the alphabetic characteristic of the transaction data set to make it more efficient. We can use a similar method when taking the candidate k-itemsets set to match k-itemsets of each transaction in the transaction data set .




(ed.) Intelligent Agents for Data Mining and Information Retrieval
(ed.) Intelligent Agents for Data Mining and Information Retrieval
ISBN: N/A
EAN: N/A
Year: 2004
Pages: 171

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net