EVOLUTIONARY COMPUTING, FUZZY LOGIC AND I-AGENTS FOR INFORMATION FILTERING


Evolutionary computing algorithms are powerful search, optimization, and learning algorithms based on the mechanism of natural selection; among other operations, they use reproduction, crossover, and mutation on a population of solutions. An initial set (population) of candidate solutions is created. In this research, each individual in the population is a candidate relevant homepage that is represented as a URL string. A new population of such URL strings is produced at every generation by repeating a two-step cycle. First, each individual URL string's ability is assessed: each URL string is assigned a fitness value depending on how well it performed (how relevant the page is). Second, the fittest URL strings are preferentially chosen to form the next generation. A modified-mutation operator is used to add diversity within a small population of URL strings, preventing premature convergence to a non-optimal solution; when called, it adds new URL strings to the evolutionary computing population.
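The two-step cycle described above can be sketched as follows. This is a minimal illustration, not the chapter's implementation: `fitness_fn` stands for the i-agent's page-relevance scoring routine and `fresh_urls` for its pool of newly discovered URLs, both placeholder names of our own.

```python
import random

def evolve(population, fitness_fn, fresh_urls, generations=10, mutation_rate=0.2):
    """Evolve a population of URL strings over several generations."""
    for _ in range(generations):
        # Step 1: assess each URL string and assign it a fitness value
        # (how relevant the page is).
        scored = [(fitness_fn(url), url) for url in population]
        scored.sort(reverse=True)
        # Step 2: preferentially keep the fittest URL strings.
        survivors = [url for _, url in scored[: max(2, len(scored) // 2)]]
        # Modified-mutation: occasionally inject a new URL string to keep
        # diversity in a small population and avoid premature convergence.
        while len(survivors) < len(population) and fresh_urls:
            if random.random() < mutation_rate:
                survivors.append(fresh_urls.pop())
            else:
                survivors.append(random.choice(survivors))
        population = survivors
    return population
```

In practice the fitness function would be the fuzzy page-scoring routine described later in this section.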

Evolutionary computing is used to help improve i-agent performance. This research is based on successful simulations employing an i-agent.

The simulation assumes that, first, a connection to the WWW via a protocol, such as HTTP (HyperText Transfer Protocol), has been established. Next, it assumes that a URL (Uniform Resource Locator) object class can be easily created. The URL class represents a pointer to a "resource" on the WWW. A resource can be something as simple as a file or a directory, or it can be a reference to a more complicated object, such as the result of a query to a database or a search engine.

The resulting information obtained by the i-agent resides on a host machine. The information on the host machine is identified by a name, typically with an html extension. The exact meaning of this name on the host machine is both protocol-dependent and host-dependent. The information normally resides in an existing file, but it could be generated "on the fly". This component of the URL is called the file component, even though the information is not necessarily in a file. The i-agent facilitates the search for and retrieval of information from WWW searches according to keywords provided by the user. Filtering and retrieval of information from the WWW using the i-agent, with the use of evolutionary computing and fuzzy logic according to keywords provided by the user, is described as follows:
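The host and "file component" of a URL described above can be illustrated with Python's standard URL parser; the example address is hypothetical:

```python
from urllib.parse import urlparse

# A URL is a pointer to a resource on the WWW: the hostname names the
# machine holding the resource, and the path plays the role of the
# "file component" -- even when the page is generated on the fly
# (e.g. a search-engine query result).
url = urlparse("http://www.example.com/results/page1.html?q=agents")
print(url.hostname)  # the host machine
print(url.path)      # the "file component"
```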

Phase 1:

  1. Select the required search engine(s), such as AltaVista, Excite, Google, HotBot, Infoseek, Northernlight, etc.;

  2. Combine the keywords (k1, k2, ..., kn) given by the user in a form understandable to the search engine(s) and submit the keywords to the i-agent;

  3. Obtain the results of the search from the selected search engine(s). The host machine (of the search engine) returns the requested information and data with no specific format or acknowledgment.
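Step 2 of Phase 1, combining the user's keywords into a form the search engine understands, might look like the following sketch; the engine endpoint and the query parameter name `q` are assumptions, not details given in the chapter:

```python
from urllib.parse import urlencode

def build_query(base_url, keywords):
    """Combine keywords (k1, ..., kn) into a search-engine query URL."""
    return base_url + "?" + urlencode({"q": " ".join(keywords)})

# Hypothetical engine endpoint:
print(build_query("http://searchengine.example/search", ["fuzzy", "agents"]))
# -> http://searchengine.example/search?q=fuzzy+agents
```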

Phase 2:

  1. The i-agent program then calls its routines to identify all related URLs obtained from the search engine(s) and inserts them into a temporary list referred to as "TempList" (only the first 600 URLs returned are kept);

  2. For each URL in the TempList, the following tasks are performed:

    2.1  Once all URLs are retrieved, initialize generation zero (of the evolutionary computing population) using the URLs supplied by the i-agent (given a URL address from TempList, connect to that web page);

    2.2  Once the connection is established, read the web page and rank it as follows:

    More weight is assigned to query terms that occur on the web page with a higher frequency than the other terms (k1, k2, ..., kn). Both the position and the frequency of keywords are used to assign position and frequency scores to a page: the more frequent the keyword instances on the web page, and the earlier they appear on the page, the higher the web page's ranking. The following fuzzy rules are used to evaluate and assign a frequency score to a web page:

    If Frequency_of_keywords = High , then Frequency_Score = High;
    If Frequency_of_keywords = Medium , then Frequency_Score = Medium;
    If Frequency_of_keywords = Low , then Frequency_Score = Low;

    The score obtained from applying these fuzzy rules is called the Frequency_Score. The position of a keyword on a web page is used to assign a position score to the web page. The following fuzzy rules are used to evaluate and assign a position score to a web page:

    If Position_of_keywords = Close_To_Top , then Position_Score = High;
    If Position_of_keywords = More_or_Less_Close_To_Top , then Position_Score = Medium;
    If Position_of_keywords = Far_From_Top , then Position_Score = Low;

    The score obtained from the above fuzzy rules is called the Position_Score.

    The number of links on a web page is used to assign a link score for the web page. The following fuzzy rules are used to evaluate and assign a link score to a web page:

    If Number_of_Links = Large , then Link_Score = High;
    If Number_of_Links = Medium , then Link_Score = Medium;
    If Number_of_Links = Small , then Link_Score = Low;

    The score obtained from the previous fuzzy rules is called the Link_Score.

    A final score for each page is then calculated by aggregating all the scores obtained from the fuzzy rules above. That is, for each web page, a score is derived according to the following:

    Score = (2 * Frequency_Score) + Position_Score + Link_Score
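The fuzzy rules and the aggregation formula above can be sketched as follows. The crisp values for High/Medium/Low (3/2/1) and the category thresholds are our own assumptions; the chapter specifies only the rules and the final formula:

```python
def fuzzy_level(value, low_cut, high_cut):
    """Map a crisp measurement to Low/Medium/High as a score of 1/2/3.
    The cut points are illustrative assumptions."""
    if value >= high_cut:
        return 3  # High
    if value >= low_cut:
        return 2  # Medium
    return 1      # Low

def page_score(keyword_freq, keyword_pos_from_top, num_links):
    """Aggregate the three fuzzy scores for one web page."""
    frequency_score = fuzzy_level(keyword_freq, 3, 10)
    # Position is inverted: Close_To_Top (a small offset) means a High score.
    position_score = 4 - fuzzy_level(keyword_pos_from_top, 50, 500)
    link_score = fuzzy_level(num_links, 5, 20)
    # Score = (2 * Frequency_Score) + Position_Score + Link_Score
    return 2 * frequency_score + position_score + link_score
```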

    2.2.1  For web pages with high scores, identify every URL link in the web page (we call these links child URLs) and create a list of these URLs;

    2.2.2  For each child URL found on the web page, connect to that web page, evaluate it, and assign a score as described in 2.2. Store the URLs with their scores in a list called FitURLs.

    2.2.3  Process the information, read it, and save it locally.

  3. The next step (modified-crossover) involves selecting the two child URLs (see 2.2.1) that have the highest scores (from here on, a page's score is referred to as its "fitness").

  4. Modified-mutation is used to provide diversity in the pool of URLs in a generation. For modified-mutation, we choose a URL from FitURLs, the already created list of URLs with high fitness (see 2.2.2). The process of selection, modified-crossover, and modified-mutation is repeated for a number of generations until a satisfactory set of URLs is found or until a predefined number of generations (200 was the limit for our simulation) is reached. In some cases, the simulation found that the evolutionary computing system converged fairly quickly and could be stopped well before 200 generations.

  5. Finally, display the URLs with their fitness.
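Steps 3-5 above might be sketched as follows; the function names and the representation of FitURLs as a list of (fitness, url) pairs are our own assumptions:

```python
import random

def modified_crossover(child_urls, fitness_of):
    """Step 3: select the two child URLs with the highest fitness."""
    return sorted(child_urls, key=fitness_of, reverse=True)[:2]

def modified_mutation(fit_urls):
    """Step 4: draw a URL from FitURLs, the pool of high-fitness URLs,
    to inject diversity into the current generation."""
    return random.choice(fit_urls)[1]

def display(fit_urls):
    """Step 5: display the URLs with their fitness, best first."""
    for fitness, url in sorted(fit_urls, reverse=True):
        print(f"{fitness:.2f}  {url}")
```

The selection/crossover/mutation cycle would then be repeated until a satisfactory set of URLs is found or the generation limit is reached.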




(ed.) Intelligent Agents for Data Mining and Information Retrieval
Year: 2004
Pages: 171
