QUERY-BY-STRUCTURE SYSTEMS

Chapter XIII - Query-By-Structure Approach for the Web
Data Mining: Opportunities and Challenges
by John Wang (ed.)
Idea Group Publishing, 2003

This section presents CLaP (Fotouhi, Grosky, & Johnson, 1999) and the neural network query-by-structure systems (Johnson, Fotouhi, & Draghici, 2001). These systems share some features with those presented in the review in the previous section. In particular, all three systems are capable of dynamically querying the Web, and each can identify Web documents based on content. What sets these systems apart, however, is their extensive use of document structure in the search process, hence the name query-by-structure.

CLaP

The CLaP system, which stands for Content, Link and Presentation, was designed in 1999 in an attempt to enhance Web searches by allowing the user to specify not only that the desired documents contain certain keywords, but also that those keywords be structured and presented in a specific way. This was achieved by allowing the user to request that the content information appear within particular presentation (HTML) tags. The CLaP system allowed the user to issue queries on a locally maintained database of Web documents; however, the system also allowed the user to query the Web directly. Hence, not only was the user able to search for Web documents already cataloged by the CLaP system, but he or she could also initiate a dynamic search of the Web by denoting a specific starting URL. Initial testing of the CLaP system showed promising results: comparisons between straight keyword searches and searches that also utilized presentation information showed significant reductions in the number of documents selected.

CLaP System Overview

The query process began with the user selecting a starting URL and various characteristics of interest, such as "all documents containing images, forms, and/or tables." In addition, the user could select the content for which he or she was searching, and specifically request that it appear between a particular set of tags. The request was then transformed into a database query that was processed by the system. The user was able to specify that the request was to be processed locally, in which case the results (URLs of selected Web documents) were returned to the user. Additionally, the user could provide a starting URL and request that a dynamic query of the Web be performed to find the documents matching the user-specified criteria.

A dynamic query started by scanning the document specified by the starting URL. If that document contained the criteria requested by the user, it was included with the results. In addition, during the scanning process, any links to other HTML pages within the document were identified. These links were then traversed, and the documents they pointed to were scanned as well. In a sense, the system was capable of mining the Web to find other documents. This scanning process could potentially continue until no further links were found. Clearly, though, this type of a scan could take a considerable amount of time. Hence, the user was given the ability to limit the scanning by specifying a depth at which to stop the search. In addition, to allow the user to further refine his or her searches without having to query the Web, all relevant information obtained from the selected Web documents was then stored in the system database for any future query.
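The depth-limited scan described above can be sketched as a breadth-first traversal. This is a minimal sketch, not CLaP's implementation; `matches_criteria` and `fetch_links` are hypothetical stand-ins for the system's HTML parsing and link extraction.

```python
from collections import deque

def dynamic_query(start_url, max_depth, matches_criteria, fetch_links):
    """Depth-limited scan of the Web starting from one URL.

    matches_criteria(url) and fetch_links(url) are hypothetical helpers
    standing in for CLaP's HTML Parser and URL Extractor modules.
    """
    results = []
    seen = {start_url}
    queue = deque([(start_url, 0)])          # (url, depth) pairs
    while queue:
        url, depth = queue.popleft()
        if matches_criteria(url):            # document satisfies the query
            results.append(url)
        if depth < max_depth:                # stop expanding at the depth limit
            for link in fetch_links(url):
                if link not in seen:         # discard duplicates, avoiding cycles
                    seen.add(link)
                    queue.append((link, depth + 1))
    return results
```

Limiting the depth bounds the otherwise open-ended scan, exactly as the user-specified stopping depth does in CLaP.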

CLaP Architecture

The CLaP system architecture was divided into three layers: User, Database, and WWW, as shown in Figure 1. Each layer was composed of specific modules that performed various functions within the system. The Visual Interface module was the user interface for the CLaP system; it allowed the user to enter his or her query via an HTML form (Figure 2). This form enabled the user to query based on several different HTML presentation tags, including title, anchor, image, and table. The user could also search on various attributes of these tags, as well as for content within the tags. When the form was submitted, a program within the database layer processed the information and converted the user selection into a SQL query.

Figure 1: CLaP system architecture.

Figure 2: Visual Interface module.

The Interpreter module was the program described above, whose function was to convert the user query into a standard SQL query. The Interpreter module directly interfaced with the SQL module and the URL Extractor module. When the SQL module received a SQL query from the Interpreter module, it initiated the query on the Database module, which returned the results to the user. The SQL module also interfaced with the URL Extractor module when a query needed to be initiated on the Web. Results from a query of the Web were loaded into the database, and the SQL module was then invoked to issue the query on the Database module. Figure 3 shows the SQL query used by the system to find all pages mentioning "Princess Diana" that contain an image and the text "Diana" inside the image tags.

Figure 3: Example of an SQL query.

The URL Extractor module initiated the search of the World Wide Web. Initially, the module received the user-specified starting URL from the SQL module. The search was initiated by invoking the HTML Parser module that scanned the Web document at the starting URL. The URL Extractor module then captured all the links from the Web document scanned by the parser. These new links were then passed, one-by-one, to the HTML parser, and the scan/extract process repeated itself. Each time the URL extraction process was started, the URL extractor module was able to identify duplicate URLs. All duplicates were simply discarded by the Extractor module, thus eliminating redundant scans as well as cycles. In addition, URLs containing complex query strings were also discarded. However, the Extractor module was able to recognize relative, absolute, and local links, and properly structure them for the Parser module.
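The Extractor module's link handling, as described, amounts to resolving each link against the current document's URL and filtering out duplicates and query strings. A rough sketch under those assumptions (not the original code), using Python's standard `urllib.parse`:

```python
from urllib.parse import urljoin, urlparse

def extract_urls(base_url, hrefs, seen):
    """Filter extracted links as the Extractor module is described to do.

    Resolves relative, absolute, and local links against base_url, then
    discards URLs with query strings and URLs already in `seen` (the set
    of URLs scanned so far, which also breaks cycles).
    """
    out = []
    for href in hrefs:
        absolute = urljoin(base_url, href)   # handles relative and local links
        if urlparse(absolute).query:         # discard complex query strings
            continue
        if absolute in seen:                 # discard duplicates / cycles
            continue
        seen.add(absolute)
        out.append(absolute)
    return out
```

Each surviving URL would then be handed, one by one, to the parser for the next scan/extract round.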

The HTML Parser module was responsible for retrieving documents from the Web. It also scanned each retrieved document and determined whether the document satisfied the user's query. If a document did satisfy the query, it was passed to the Loader module; if it did not, the document was discarded after its links had been extracted by the URL Extractor module. The Parser module utilized the built-in parser provided with Microsoft Internet Explorer 4.0. The Loader module extracted the relevant content and presentation information from each Web document it received from the HTML Parser module, and this information was then populated into the Web-Database module.

CLaP Performance Study

The query in Figure 3 was issued to the CLaP system for the following URL: http://www.mcs.drexel.edu/~gcmastra/diana.html. This URL was selected due to its relevance to the query. Table 1 below shows the results of the query at search depths of 1, 2, and 3.

Table 1: Test results

Depth | Unique URLs Scanned | Pages with "Princess Diana" in Content | Pages Also Containing an Image | Pages Also Containing "Diana" within IMG Tags
------|---------------------|----------------------------------------|--------------------------------|----------------------------------------------
1     | 35                  | 15                                     | 13                             | 4
2     | 187                 | 25                                     | 22                             | 8
3     | 595                 | 145                                    | 142                            | 26
Included in the results are the total number of unique URLs scanned and, from those URLs, the number of Web documents containing the text "Princess Diana." From this set of documents, the number of documents also containing an image, and the number having the text pattern "Diana" within the image tags, is shown as well. The results show that each additional structure criterion produces a significant reduction in the number of Web pages selected.

The concept of querying the Web by content, link, and particularly structure was studied extensively with the CLaP system, and the system did provide promising results when comparing straight keyword searches against searches that utilized structure information. However, the query process was quite cumbersome for users, as they were required to specify exactly where each search word of interest should appear. Primarily for this reason, as well as some other issues, another query-by-structure approach was needed.

A Neural Network Query-by-Structure Approach

Continuing the research into a query-by-structure approach, two new systems designed to capture and utilize structure information and content during a distributed query were developed. Both systems employ neural networks to organize the information based on the relevancy of not only the content but also the structure of each document. The primary goal of these systems was to test the feasibility of using neural networks to improve the query process. In addition, to become a viable querying system, the extensive amount of information required of the user during the query process, as in the CLaP system, needed to be reduced. Hence, another goal in the development of the neural network query-by-structure systems was to eliminate the need for complex user queries.

The first issue that needed to be addressed was determining which type of neural network to utilize. Some types of neural networks, when provided a specific input, will identify which of a set of prototypes the input most closely matches. These are called supervised learning networks. Systems utilizing supervised learning neural networks could be useful in categorizing semi-structured documents based on how closely the structure of one document matches the structure of other, already classified documents. Additionally, these networks, when given a set of documents as input, can be used to identify document relevancy and quantify the results so that a ranking of documents can be obtained. However, to perform the classification effectively, systems utilizing supervised learning networks must create a set of prototype vectors corresponding to a predetermined set of prototypes.

When a predetermined set of prototypes is not available, a different type of neural network, namely unsupervised competitive networks, can be utilized. Unsupervised competitive neural networks utilize only the inputs and attempt to group together or cluster the inputs into related classes. By performing the classification based on document structure, documents with similar structure will be clustered together. Additionally, the results yielded from this approach can be used in different fashions. One approach could be to deem the largest cluster the most relevant cluster, with the notion being that several documents of similar structure are likely more useful. Another approach could be to present the user with the groupings of clusters and allow him or her to select which documents to view, based on his or her opinion of the most relevant clusters. This approach provides the user with more flexibility, especially in light of the fact that ultimately it is the user who determines relevancy. But, it may not do as much in terms of reducing the number of relevant documents. So, it seems worthwhile to study both approaches, as it is likely that a hybrid of the two might serve best.

Neural Network Query-by-Structure Systems Overview

Although similar, the systems developed utilize different types of neural networks: a supervised Hamming Network and an unsupervised Competitive Network (Hagan, Demuth, & Beale, 1996). The query process for both systems involves five steps (as shown in Figure 4).

Figure 4: Neural network query-by-structure process.

First, the user initiates a query to the User Query module by entering a standard generic search using keywords. In order to limit the number of Web documents that must be retrieved and parsed during a dynamic Web query, the user must also select a maximum number of URLs he or she wishes to receive in the results. Additionally, the supervised neural network system requires that the user select a document type. This type field is used by the system to categorize document structure. Upon issuance of a user query, the Generic Search engine module simply performs a standard Web search with the search words provided by the user using a commercial search engine. Any commercial search engine that produces results containing the URLs and descriptions of relevant documents can perform this operation.

Although these results are actual documents on the Web, the only Web query performed at this point was to contact the commercial search engine, which in turn simply performed a query on its local database. As a result, in order to capture document structure information, it was necessary to actually retrieve each of the Web documents specified in the results of the Generic Search Engine query. Hence, the Generic Search Engine module takes the list of URLs returned by the commercial search engine, and one at a time, retrieves the source code of each of the Web documents. These documents are then parsed in order to analyze the document structure.

For each Web document, document structure was quantified by creating a feature vector based on whether specific words appeared between specific HTML tags in the Web document's source code. The words selected are words deemed relevant to one of the document types specified in the initial user query. As an example, consider that a resume document, regardless of the creator's occupation, will likely contain several of the following words: business, education, experience, objective, title, and work. Potentially more significant is the fact that these words will likely be highlighted in the document by specific HTML tags, such as bold (<b>), heading (<h#>), font (<font>), and italics (<i>). Hence, the feature vector would contain an entry for each of the resume words, and if any of those words appeared at least once between any one of the specified HTML tags, a 1 would be placed in the feature vector for that word, and a 0 otherwise. It should be noted that most of the entries in the feature vector should be 0's because the feature vector will contain entries for several other words (recall that WebWatcher utilized a 530-element feature vector) for other document types that are not expected to appear in a resume document. This feature vector was then used as the input vector to the neural networks.
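The feature-vector construction can be illustrated as follows. This is a simplified sketch: `WORDS` and `TAGS` are a hypothetical subset of the actual word and tag lists used by the system, and the regular expression only matches a word appearing directly inside a tag pair, whereas the real system parsed the HTML source.

```python
import re

# Hypothetical subset of the indicator words and HTML tags from the chapter.
WORDS = ["business", "education", "experience", "objective", "work"]
TAGS = ["b", "strong", "u", "font", "i"]

def feature_vector(html):
    """Return a Boolean vector: entry is 1 if the corresponding word
    appears at least once between any of the specified HTML tags."""
    html = html.lower()
    vec = []
    for word in WORDS:
        found = 0
        for tag in TAGS:
            # text between <tag ...> and </tag> that mentions the word
            pattern = r"<%s\b[^>]*>[^<]*\b%s\b[^<]*</%s>" % (tag, word, tag)
            if re.search(pattern, html):
                found = 1
                break
        vec.append(found)
    return vec
```

A resume fragment such as `<b>Education</b>` would set the "education" entry to 1, while the same word in plain, untagged text would leave it at 0, which is precisely the structural signal the systems exploit.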

After construction, the document feature vectors are added to the Document Database module. This allows future user queries to be initiated directly to a database that already has captured document structure information, hence bypassing the time-consuming Web query. In addition, the feature vectors are then sent to the Neural Network module. Both systems then employ their respective neural network to organize the information based on relevancy of not only the content but the structure of the document as well. Lastly, the neural network outputs the results of the user query.

The following two sections describe in greater detail the supervised Hamming neural network and the unsupervised Competitive network. This book is not designed to be a text about neural networks, so the objective here is simply to provide a high-level description of the networks utilized by the query-by-structure process. Readers unfamiliar with neural networks should note that most neural networks are simply a series of mathematical manipulations performed on a set of vectors and matrices.

Hamming Neural Network

A Hamming neural network is a supervised learning network that consists of two layers (Hagan et al., 1996). The first layer (feedforward layer) performs a correlation between an input vector and a set of prototype vectors and produces the output a1 (the superscript indicates the network layer) as described by the following formula:

a1 = purelin(W1p + b1)    (3.1)

In formula (3.1), W1 is called the weight matrix. Each row in the weight matrix represents one of the prototype vectors. The weight matrix is multiplied by the vector p, which represents an input vector. The resultant vector is then added to the bias vector b1 to produce the output vector a1. The purelin transfer function in the formula above is simply a linear function that produces the same output as its input.

The second layer (recurrent layer) performs a competition to determine which of the prototype vectors is closest to the input vector using the following formulas:

a2(0) = a1    (3.2)

a2(t + 1) = poslin(W2a2(t))    (3.3)

The initial input to the recurrent layer is shown in (3.2), and is simply the output, a1, from layer 1. The recurrent layer also has a weight matrix W2. This matrix is constructed so that each element on its diagonal has a value of 1 and all remaining elements have a value of −ε (where ε is some number less than 1/(S − 1), and S is the number of rows in the weight matrix). Hence, as shown in (3.3), the recurrent layer repeatedly multiplies the weight matrix by the a2 vector calculated in the previous iteration. Note that the poslin transfer function produces linear results for positive values and zero for negative values. The recurrent layer continually applies this function until every element but one in the resultant vector contains a zero. The nonzero ith element indicates that the initial input vector provided to the feedforward layer most closely matches the ith prototype vector.
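The two layers can be sketched numerically as follows. This is a minimal sketch with the bias vector omitted; since purelin is the identity, the feedforward step reduces to a matrix product.

```python
import numpy as np

def hamming_classify(p, W1, eps=0.1, max_iter=100):
    """Two-layer Hamming network (bias omitted).

    W1 holds one prototype vector per row; eps must be less than
    1/(S - 1). Returns the index of the closest prototype.
    """
    a = W1 @ p                                # feedforward layer: purelin(W1 p)
    S = len(a)
    # W2: 1 on the diagonal, -eps everywhere else
    W2 = np.full((S, S), -eps) + (1 + eps) * np.eye(S)
    for _ in range(max_iter):                 # recurrent competition layer
        a = np.maximum(0, W2 @ a)             # poslin transfer function
        if np.count_nonzero(a) <= 1:          # one survivor: competition over
            break
    return int(np.argmax(a))                  # index of the surviving element
```

With each iteration every element is suppressed by a fraction of the others, so only the largest correlation survives; that index identifies the winning prototype.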

A Hamming neural network is considered supervised because the prototype vectors are determined in advance and are designed to resemble the expected input vectors. Hence, the first step in designing the Hamming network for the query-by-structure system was to initialize the prototype vectors. Each prototype vector was designed so that each element in the vector would contain a Boolean value indicating whether a given word appeared between a given HTML tag or not. Since each prototype vector represented a different document type, if a specific word was expected to appear within a given structure context inside that specific type of document, the value of that entry was set to one, and if not, the value was set to zero. For instance, using the example from the previous section, the resume prototype vector would be a vector that contained all 0's except for the entries that represented the resume words.

After initialization of the prototype vectors, the approach taken by the supervised neural network query-by-structure system was to slightly modify the Hamming neural network. First, the system performs the correlation between each of the input vectors and the prototype vectors, one at a time. Here, the number of input vectors (n) is equivalent to the number of URLs returned by the initial Web query. So, given n input vectors, the feedforward layer would need to be applied n times, once for each input vector. As a result, instead of a single output vector a1, the system produces a matrix of vectors A1. Hence, the output of the feedforward layer is just the result of multiplying the matrix of input vectors P1 by the matrix of prototype vectors W1. Assuming that the system has no bias vector, the feedforward layer can be described by the formula:

A1 = W1P1    (3.4)

The recurrent layer performs the competition between each of the elements in the matrix A1 produced as output from the feedforward layer. Hence, this layer can be described mathematically as:

A2(0) = A1    (3.5)

A2(t + 1) = poslin(W2A2(t))    (3.6)

As a result, the output from the second layer of the neural network is a set of vectors representing which prototype vector most closely matches each input vector. Recall, though, that the objective of this system was to determine which of the Web documents (input vectors) most closely match the selected document type (prototype vector). Fortunately, the output from the second layer works rather well in solving this problem. Consider that each of the columns represents one output vector, and each of these vectors contains all zeros except for one element. Additionally, since the document type is known, and the corresponding prototype vector is known, all that needs to be done is to extract each of the vectors from the output matrix A2 that contains a nonzero value in the ith element. This set of vectors now represents all the Web documents that most closely match the selected document type. Furthermore, the vector with the largest value matches the prototype vector better than any other vector. Hence, sorting the vectors by these values, from highest to lowest, provides a new ranking of the Web documents. Finally, after sorting, the system simply outputs the URLs and descriptions captured in the initial search engine query, but in the order specified by the neural network.
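The batched correlation, selection, and sorting steps can be sketched as follows. This simplified sketch skips the recurrent competition and reads each column's winner directly from A1, which identifies the same maxima the competition would leave standing.

```python
import numpy as np

def rank_documents(P, W1, type_index):
    """Rank document feature vectors (the columns of P) against the
    prototype for the selected document type (row type_index of W1).

    Sketch of the modified feedforward step A1 = W1 P with no bias,
    followed by the selection and sort described in the text.
    """
    A1 = W1 @ P                              # correlate every document at once
    winners = A1.argmax(axis=0)              # prototype each document matches best
    matching = np.where(winners == type_index)[0]
    scores = A1[type_index, matching]        # correlation with the chosen prototype
    order = matching[np.argsort(-scores)]    # highest correlation first
    return [int(i) for i in order]
```

The returned indices give the re-ranked order in which the URLs and descriptions from the initial search engine query would be presented.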

Unsupervised Competitive Neural Network

The unsupervised competitive neural network replaces the recurrent layer in the Hamming network with a competition transfer function. Like the Hamming network, this competition layer still identifies the largest value in the vectors produced by the feedforward network. However, instead of applying the recurrence equation to the output vectors, the competition transfer function simply produces a vector of all 0's, except for the entry that contained the largest value in the vector produced by the feedforward network, which it sets to 1. In addition, the objective of the unsupervised network is significantly different from that of a supervised network. The goal of the supervised network is to match the input vectors to the predetermined prototype vectors. In the unsupervised network, however, the prototype vectors are not known in advance. Instead, the input vectors are actually used to train the weight matrix. This was accomplished using the Kohonen learning rule (Hagan et al., 1996):

i*w(q) = i*w(q − 1) + α(p(q) − i*w(q − 1))    (3.7)

and

iw(q) = iw(q − 1) for i ≠ i*    (3.8)

In equations (3.7) and (3.8), α represents a real value between 0 and 1 called the learning rate, p(q) represents the input vector, iw(q) and iw(q − 1) represent the ith row in the weight matrix at time q and time q − 1, respectively, and i* represents the row that had the largest value in the vector produced by the feedforward network, and hence the row that contained the 1 in the output vector. In general terms, all the rows in the weight matrix at time q remain the same as they were at time q − 1, except for the row that won the competition. The values in this row move from their values at time q − 1 toward the input vector at time q. How quickly the winning row moves toward the input vector is determined by the learning rate α. For example, if α = 1, then from (3.7) we can see that row i* would be set to the values of the input vector, whereas if α = 0, no change would ever result. Clearly, the value of the learning rate is of utmost importance for this type of neural network.

The implementation of the unsupervised competitive neural network for the query-by-structure system was fairly straightforward. First, each Web document is transformed into an input vector. The only difference between this input vector and the input vector in the Hamming neural network is that each element contains a numeric value recording the number of occurrences of a specific word within the Web document (rather than a Boolean value), provided that the word was contained within the appropriate structure tags, as described for the Hamming network. After being created, each input vector is then normalized. To initialize the network, a weight matrix for the prototype vectors was randomly generated, with each entry taking on a real value between −1 and 1, inclusive.

Once the network was properly initialized, the process of training the weight matrix was simply a matter of randomly selecting an input vector from the set of input vectors of Web documents, applying the competitive layer in the neural network to find the largest value in the vectors produced by the feedforward layer, and updating the appropriate row in the weight matrix using the Kohonen rule. This process was repeated for a predetermined number of iterations, at which time the final weight vectors have been formed. Using these weight vectors, each input is classified and placed into one of n class vector sets. These class vector sets make up the final results of the neural network, where users are presented with the ordered results in each class instead of all the results ordered together in one list.
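The training loop just described can be sketched as follows. This is an illustrative sketch, not the original system: the random weight range, input normalization, and the rule of iterating three times the number of inputs follow the description above, but the learning rate and random seed are arbitrary choices, and clustering quality varies from run to run.

```python
import numpy as np

def train_competitive(inputs, n_classes, alpha=0.5, rng=None):
    """Unsupervised competitive network trained with the Kohonen rule.

    inputs: one feature vector per row (occurrence counts per word/tag).
    Returns the class index assigned to each input vector.
    """
    if rng is None:
        rng = np.random.default_rng(0)        # arbitrary seed for the sketch
    # normalize each input vector
    X = inputs / np.linalg.norm(inputs, axis=1, keepdims=True)
    # random weight matrix with entries in [-1, 1]
    W = rng.uniform(-1, 1, (n_classes, X.shape[1]))
    for _ in range(3 * len(X)):               # 3x the number of input vectors
        p = X[rng.integers(len(X))]           # pick a random input vector
        i = int(np.argmax(W @ p))             # competition: winning row i*
        W[i] += alpha * (p - W[i])            # Kohonen rule: move winner toward p
    # classify every input with the trained weight matrix
    return [int(np.argmax(W @ p)) for p in X]
```

Each input ends up in the class whose weight row it drove toward itself during training; presenting these per-class groupings to the user corresponds to the clustered output described above.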

Experimental Performance and Results

Preliminary tests of the Hamming neural network system showed promising results. Initially, four document types were selected: resumes, conference papers, news articles, and e-commerce Web documents. Obviously, there could be an infinite number of different document types, along with variations within those types, but the objective was not to classify as many different types of documents as possible, but rather to test the feasibility of using neural networks to improve query-by-structure performance. The following is the list of words used to construct the feature vectors for the initial testing of the system: abstract, activities, bibliography, business, by, buy, cart, conclusion, education, experience, introduction, mall, MasterCard, news, objective, price, purchase, related, references, resume, shop, sports, store, Visa, vitae, work. The HTML tags utilized by the system were: bold (<b>), strong (<strong>), emphasis (<emp>), underline (<u>), heading (<h#>), font (<font>), and italics (<i>). Hence, four prototype vectors were utilized, one for each document type. Each vector contained only Boolean values, ordered according to the word list above, that represented whether or not the corresponding word was expected to appear within a given structure context inside that type of document.

As an example, consider a simple query that was applied to the system that retrieved only 10 URLs looking for a resume document type using the keywords: "Computer Science." The commercial search engine results for this query placed a career services Web site first and a Computer Science program's students resume repository second. After converting these results to input vectors and filtering them through the neural network, the career services Web site ended up ranked 7th, and the resume repository ended up dead last at 10th. In addition, in between these two documents were a very poorly structured resume and a Web page that simply contained a note stating that if you were interested in that individual's resume, you could contact him via e-mail. At the top of the rankings were Web documents containing actual resumes.

Other experiments were performed with different keywords, document types, and varying numbers of URLs returned in the results, with outcomes similar to the example above. Table 2 compares the results of the system to the commercial search engine results for some of these tests. To evaluate each result set, a brute-force approach of viewing each document to determine whether it actually was of the desired type was necessary.

Table 2: Hamming network vs. commercial search engine result comparison

Keywords          | Document Type   | Pages Retrieved | Hamming Network                     | Commercial Search Engine
------------------|-----------------|-----------------|-------------------------------------|------------------------------------
Biology           | Resume          | 50              | 9 of top 10 were resumes            | 5 of top 10 were resumes
Secretary         | Resume          | 30              | 4 of top 6 were resumes             | 2 of top 6 were resumes
Political Science | Research Paper  | 30              | 4 of top 6 were research papers     | 0 of top 6 were research papers
War               | Research Paper  | 30              | 3 of top 6 were research papers     | 1 of top 6 were research papers
Microsoft         | News Article    | 30              | 3 of top 6 contained news articles  | 1 of top 6 contained news articles
College Football  | News Article    | 40              | 5 of top 8 contained news articles  | 2 of top 8 contained news articles
Textbooks         | E-Commerce Site | 40              | 7 of top 8 were e-commerce sites    | 3 of top 8 were e-commerce sites
Flowers           | E-Commerce Site | 30              | 5 of top 6 were e-commerce sites    | 2 of top 6 were e-commerce sites

The performance of the unsupervised competitive neural network was not as consistent as that of the supervised Hamming network. The hope for this network is that the input vectors will be clustered based on similar structure. However, much of the preliminary testing resulted in most documents being placed in only one or two classes (currently the system can only cluster into four classes). Several factors contribute to these inconsistent results. First, unlike the supervised neural network, which creates its weight matrix from the prototype vectors, this network randomly generates its weight matrix; as a result, each test yields different results. Second, the number of times the network iterates also plays a role in determining the results, and this value can only be determined through extensive testing. Currently, the number of iterations performed is equal to three times the number of input vectors. Third, the choice of the learning rate is a variable factor. This value ranges between 0 and 1, with higher values resulting in fast but unstable learning and lower values resulting in slow but stable learning; once again, the optimal value can only be determined through extensive testing. Finally, there is the creation of the input vectors themselves. As discussed earlier, these vectors are generated based on document structure, so creating input vectors that properly distinguish between different types of documents is paramount to the performance of the network.



Data Mining: Opportunities and Challenges
John Wang (ed.), Idea Group Publishing, 2003. ISBN 1591400511.