KEYWORD-BASED INFORMATION SEARCHING | (ed.) Intelligent Agents for Data Mining and Information Retrieval

Domain Classification of Course Documents

In order to manage and search for course documents more accurately and efficiently, all course documents were analyzed and categorized according to the knowledge domain to which they belong. Each course was further decomposed into sub-course documents to implement a new course scaffolding, which improved the reusability of the course documents. The course documents were organized into tree hierarchy structures, as shown in Figure 10.

Figure 10: Course Document Structure

In our experiment, the training documents include "Computer Architecture" and "Bio machine". Each has ten sub-course documents.

Keyword Selections and Document Indexing

Because the course documents can be very large, searching them directly can degrade system performance and speed. To query and retrieve the course documents more efficiently, we selected a set of keywords from the sample documents and indexed them by occurrence frequency and knowledge domain.

Each index entry records the selected keyword, its number of occurrences, and the knowledge domain it belongs to. The keywords are a vocabulary selected from the corresponding course document. The occurrence count is the number of times the keyword appears in that document. The document in which the keyword appears is also recorded in the index table so that the location of the keyword can be traced. Together, the entries for a document reflect its general concept. An example of an index table is shown in Table 11.

Table 11: Example of Index Table
Keyword	Knowledge Domain	Document	Occurrence
genetic	biology	BioMachine1	6
DNA	biology	BioMachine1	4
zoology	biology	BioMachine2	5
memory	computer	ComputerArchitecture1	7
Processor	computer	ComputerArchitecture2	5
disk	computer	ComputerArchitecture3	4

The keywords in Table 11 were selected from each course document. To obtain the vocabulary that represents the general concept of a course document, text mining techniques were applied to all the course documents. The algorithm involves three processes: word recognition, stop word elimination, and keyword selection.

Word recognition refers to the process of scanning the course document while ignoring punctuation and letter case. In this process, word prefixes and suffixes were removed to obtain the root word. Since words with a common root often have similar meanings, the root words were used for synonym generation and query expansion when the user searched for course documents.

In the process of stop word elimination, a list of stop words was deleted from the text. Stop words are words that carry little semantic content. They include a large number of pronouns, prepositions, and adverbs, such as we, you, this, there, at, and in. Some high-frequency words that are too general to represent a document concept, such as say, listen, and help, were also deleted. Search engines usually do not index these stop words because doing so would result in the retrieval of a tremendous quantity of trivial records.

After word recognition and stop words elimination, the text was scanned to select the words with high occurrences and store them in the index table, as shown in Table 11.

Some words have high occurrences in many documents; they are too general to represent the content of any specific document, so they were ignored during our indexing process. What we were interested in were words that have high occurrences in only certain documents. These words were selected as the index keywords for those documents, which made data retrieval both faster and more accurate.
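The three processes described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list, the crude suffix stripper, and all function names are hypothetical, since the chapter does not give the actual list or stemming rules. In this sketch the stop words are filtered before suffixes are stripped, so that the stop list can be written in ordinary surface forms.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the chapter's actual list is not given.
STOP_WORDS = {"we", "you", "this", "there", "at", "in", "the", "a", "is",
              "of", "and", "say", "listen", "help"}

def recognize_words(text):
    """Word recognition: ignore letter case and punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def strip_suffix(word):
    """Crude suffix stripping (assumption; stands in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def select_keywords(text, top_n=3):
    """Return the top_n (root word, occurrences) pairs for one document."""
    words = [w for w in recognize_words(text) if w not in STOP_WORDS]
    roots = [strip_suffix(w) for w in words]
    return Counter(roots).most_common(top_n)

sample = ("The processor reads memory. Memory is fast; "
          "the processors share memory in this design.")
print(select_keywords(sample, top_n=2))
# [('memory', 3), ('processor', 2)]
```

Words that are frequent in many documents are not filtered here; that step depends on the document frequency, which the index score takes into account.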

The factors described above are combined in Equation 9, which calculates a word score that measures the weight of a word in each document.

In Equation 9, IS stands for the index score. tf_xj is the term frequency, which is the number of occurrences of term j in document x. df_j is the document frequency, which is the number of documents in a fixed-size collection in which term j occurs. N is the number of documents in the collection. The calculated score IS measures the weight of term j in document x.

By applying Equation 9 to the entries of Table 11, and thereby taking the document frequency into account, the index table was extended into Table 12.

Table 12: Index Table of Course Documents
Keyword	Knowledge Domain	Document	Term Frequency	Document Frequency	Collection Size	Index Score
genetic	biology	BioMachine1	6	2	10	10.75
DNA	biology	BioMachine1	4	4	10	4.39
zoology	biology	BioMachine2	5	1	10	11.99
memory	computer	ComputerArchitecture1	7	3	10	9.7
Processor	computer	ComputerArchitecture2	5	4	10	5.49
disk	computer	ComputerArchitecture3	4	2	10	7.17
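One form that agrees with the definitions of tf, df, and N and reproduces every index score in Table 12 to two decimals is IS = tf × ln(⌊N/df⌋ + 1), where ⌊N/df⌋ is integer division. This form is inferred from the table values and should be read as an assumption, not as the authors' stated formula. A minimal Python check, with a hypothetical function name:

```python
import math

def index_score(tf, df, n_docs):
    """Assumed form inferred from Table 12: tf * ln(floor(N / df) + 1)."""
    return tf * math.log(n_docs // df + 1)

# (keyword, term frequency, document frequency, collection size, score in Table 12)
TABLE_12 = [
    ("genetic",   6, 2, 10, 10.75),
    ("DNA",       4, 4, 10, 4.39),
    ("zoology",   5, 1, 10, 11.99),
    ("memory",    7, 3, 10, 9.7),
    ("Processor", 5, 4, 10, 5.49),
    ("disk",      4, 2, 10, 7.17),
]

for keyword, tf, df, n_docs, printed in TABLE_12:
    computed = index_score(tf, df, n_docs)
    print(f"{keyword:10s} computed={computed:6.2f}  table={printed}")
```

The score grows with the term frequency and shrinks as the term appears in more documents, which matches the selection criterion described earlier: "zoology" occurs in a single document and receives the highest score, while "DNA" occurs in four and receives the lowest.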

Keyword-Based Searching

The search for e-learning course documents is based on keyword matching. A mobile agent carries the user-entered keywords and roams the network to search for information on behalf of the user, according to the user's preferences. When the search agent finds information at a remote site that matches the user's requirements, it sends the information back to the user via an Aglet message. The monitor agent saves the keywords entered by the user to build the user's preference profile. Figure 11 shows the process of intelligent search with a mobile agent.

Figure 11: Document Search Process

The user-entered keyword was first looked up in the index table to find its knowledge domain. In the next step, only the course documents under the relevant domain were queried further. For example, if the user entered "hardware" as the search keyword, the index table might place it under the domain "computer" or "network"; other domains, such as "biology" and "mathematics", would be eliminated from further query processing. Domain matching reduced the computing time and, as a result, improved the efficiency and accuracy of data retrieval.
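Domain matching amounts to a lookup followed by a filter. The sketch below uses the rows of Table 11 as the index table; the function names and the case-insensitive comparison are assumptions made for illustration.

```python
# Rows from Table 11: (keyword, knowledge domain, document, occurrence)
INDEX_TABLE = [
    ("genetic",   "biology",  "BioMachine1",           6),
    ("DNA",       "biology",  "BioMachine1",           4),
    ("zoology",   "biology",  "BioMachine2",           5),
    ("memory",    "computer", "ComputerArchitecture1", 7),
    ("Processor", "computer", "ComputerArchitecture2", 5),
    ("disk",      "computer", "ComputerArchitecture3", 4),
]

def domains_for(keyword, index_table):
    """Return the knowledge domains under which the keyword is indexed."""
    key = keyword.lower()
    return {domain for kw, domain, _, _ in index_table if kw.lower() == key}

def documents_in(domains, index_table):
    """Return the documents that belong to any of the given domains."""
    return sorted({doc for _, domain, doc, _ in index_table if domain in domains})

domains = domains_for("processor", INDEX_TABLE)
print(domains)                             # {'computer'}
print(documents_in(domains, INDEX_TABLE))
# ['ComputerArchitecture1', 'ComputerArchitecture2', 'ComputerArchitecture3']
```

Only the documents returned by the second call need to be scored; the biology documents are never touched.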

A thesaurus module was used to expand the user's query by generating synonyms for each user keyword. Its main purpose was to broaden the search criteria and to improve the document retrieval precision. Figure 12 illustrates the process of user query expansion.
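Query expansion can be pictured as a dictionary lookup. In the sketch below the thesaurus is a hypothetical in-memory dictionary, and its single entry mirrors the expanded query reported for the keyword "Processor" in the experiment; the real thesaurus module is not described in enough detail to reproduce.

```python
# Hypothetical thesaurus; the one entry mirrors the experiment's expanded query.
THESAURUS = {
    "processor": ["CPU", "Controller", "Microcontroller"],
}

def expand_query(keyword, thesaurus):
    """Return the keyword followed by its synonyms, if any are known."""
    return [keyword] + thesaurus.get(keyword.lower(), [])

print(expand_query("Processor", THESAURUS))
# ['Processor', 'CPU', 'Controller', 'Microcontroller']
```

A keyword without a thesaurus entry is returned unchanged, so the search still proceeds with the original term.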

At the agent host server, the query criterion carried by the mobile agent was used to retrieve course documents from document storage, based on the matching score of expanded user keywords and index words for each document. The matching score reflects the similarity between the user query and the course documents. It was calculated based on Equation 10.

Figure 12: Keyword Expansion

Word1 denotes the user-entered keyword for the search, and Q = {Word1, Word2, Word3, ..., WordN} denotes the query expanded by the thesaurus module, which includes a list of synonyms of Word1. The matching score between the expanded query and a course document was then calculated as follows:

$$MS = \sum_{i=1}^{N} \mathrm{Occur}(\mathrm{Word}_i) \qquad (10)$$

In Equation 10, MS stands for the matching score; Word_i is the i-th word in the expanded query; and Occur(Word_i) is the number of times Word_i appears in a particular document.

After the Matching Score for each document within the relevant knowledge domain was computed, the documents were sorted and ranked.

In our experiment, "Processor" was entered as the search keyword. The mobile agent carried the keyword to a remote agent host. The station agent at the remote host first classified the keyword "Processor" under the domain "Computer", and then the keyword was expanded by the thesaurus module to widen the search criteria. The expanded query was {Processor, CPU, Controller, Microcontroller}. There were 10 course documents under the domain "Computer". The index table for the 10 course documents was scanned to retrieve the occurrences of each word in the expanded query. The occurrences of the expanded keywords in each document were summed to obtain the matching score, and the documents were ranked in descending order.

The ranked documents based on matching score are shown in Table 13.

Table 13: Ranking Retrieved Documents
User Keyword	Domain	Expanded Keywords	Document	Occurrences
Processor	Computer	Processor	ComputerArchitecture2	5
Processor	Computer	CPU	ComputerArchitecture2	3
Processor	Computer	Controller	ComputerArchitecture2	1
Processor	Computer	Microcontroller	ComputerArchitecture2	0
Processor	Computer	Processor	ComputerArchitecture5	3
Processor	Computer	CPU	ComputerArchitecture5	2
Processor	Computer	Controller	ComputerArchitecture5	1
Processor	Computer	Microcontroller	ComputerArchitecture5	0
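
Summing the occurrence counts of Table 13 per document, as Equation 10 prescribes, can be sketched as follows. The function and variable names are illustrative, and the two Microcontroller counts are taken as 0, which is what the totals reported in Table 14 imply.

```python
# Occurrence counts per document, from Table 13.
# Assumption: "Microcontroller" occurs 0 times in both documents.
OCCURRENCES = {
    "ComputerArchitecture2": {"Processor": 5, "CPU": 3, "Controller": 1, "Microcontroller": 0},
    "ComputerArchitecture5": {"Processor": 3, "CPU": 2, "Controller": 1, "Microcontroller": 0},
}

def matching_score(expanded_query, doc_occurrences):
    """Sum the occurrences of every expanded keyword in one document."""
    return sum(doc_occurrences.get(word, 0) for word in expanded_query)

def rank_documents(expanded_query, occurrences):
    """Return (document, score) pairs sorted by descending matching score."""
    scores = {doc: matching_score(expanded_query, occ) for doc, occ in occurrences.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

query = ["Processor", "CPU", "Controller", "Microcontroller"]
print(rank_documents(query, OCCURRENCES))
# [('ComputerArchitecture2', 9), ('ComputerArchitecture5', 6)]
```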

With reference to Equation 10, the matching score for the document "ComputerArchitecture2" was calculated as:

$$MS = 5 + 3 + 1 + 0 = 9$$

In the same way, the matching score for the document "ComputerArchitecture5" was calculated to be 6. The ranked document list is shown in Table 14.

Table 14: Ranked Documents
Rank	Document	Matching Score
1	ComputerArchitecture2	9
2	ComputerArchitecture5	6
3	ComputerArchitecture4	3
4

The documents in Table 14 to be presented to the user were selected by applying a threshold to the matching score. With a threshold value of 5, the search agent sent the two documents "ComputerArchitecture2" and "ComputerArchitecture5" back to the user terminal.
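
The threshold step can be sketched as a simple filter over the ranked list of Table 14. The function name is hypothetical, and treating a score equal to the threshold as passing is an assumption; the text does not say whether the comparison is strict, and with the scores in Table 14 the result is the same either way.

```python
# (document, matching score) pairs from Table 14, already in ranked order.
RANKED = [
    ("ComputerArchitecture2", 9),
    ("ComputerArchitecture5", 6),
    ("ComputerArchitecture4", 3),
]

def apply_threshold(ranked_docs, threshold):
    """Keep the documents whose matching score reaches the threshold."""
    return [doc for doc, score in ranked_docs if score >= threshold]

print(apply_threshold(RANKED, threshold=5))
# ['ComputerArchitecture2', 'ComputerArchitecture5']
```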