WEBPARSER: SEMI-AUTOMATIC WEB KNOWLEDGE EXTRACTION | (ed.) Intelligent Agents for Data Mining and Information Retrieval

This section describes our approach to designing flexible and simple methods for information extraction from web sources. We have designed and implemented a parser, named WebParser, which can extract knowledge from HTML pages through the definition of several rules. If the web page changes, it is only necessary to redefine the rules for the parser to work correctly again. The use of rules gives flexibility and adaptability to the web agents that use this parser.

However, it is necessary to define what kind of knowledge can be extracted and filtered from the available web pages. We will consider that all web pages can be roughly classified into two knowledge categories:

Non-structured knowledge. The information stored in the page is represented using natural language, so it is necessary to apply NLP (Natural Language Processing) techniques to extract it.
Semi-structured knowledge. It is possible to find, inside the page, a structure (e.g., a table or list) which stores the information by using some kind of marks to delimit the data (e.g., the <table>, </table>, <ul>, </ul>, <ol>, </ol> tags in HTML).

The proposed WebParser is a simple software module that specializes in extracting knowledge from the second kind of page. Therefore, the knowledge to be extracted by the parser must be stored in a specific structure inside the web page.

The WebParser Architecture

A parser can be defined as a module, library or program that is able to translate an input (usually a text file) into an output with an internal representation. The main goal of the WebParser is to accept web pages and generate a data-output structure that contains the filtered information. The WebParser takes as input the HTML page to be filtered, along with several sets of rules that the engineer must define to obtain the information.

Figure 2 shows the rule-based architecture for the semiautomatic web knowledge parser. The definition of several rules allows the engineer to modify the behavior of the parser and adapt it in a simple way. The parser uses these rules to represent, respectively, the knowledge to extract and the output structure in which to store that knowledge. Two main types of rules must be defined:

HTML-Rules. These rules define the type of knowledge to extract (tables, lists, etc.), and the position of this knowledge inside the page.
DataOutput-Rules. This set of rules defines the final data (output) structure that will be generated by the parser when the extraction process ends.

Figure 2: Semi-Automatic Web Parser Architecture

We have used the term "semiautomatic" because, once the engineer defines the two sets of rules that describe the knowledge to be extracted, the rest of the process is automatic. If the page changes, or if we want to extract other knowledge from the same page, it is only necessary to modify those rules. Several limitations and conditions have been considered in the process of designing the parser. These can be summarized as:

The WebParser uses a set of predefined rules (special characters) to preprocess the HTML page. Special characters (e.g., á, é, ñ) that have their own HTML representations (e.g., &aacute;, &eacute;, &ntilde;) are first translated into standard characters (e.g., a, e, n) to avoid possible problems in the extraction process.
These sets of rules are written by the engineer and stored in text files to make them easy to modify.
Only the following types of web pages are currently parsed:
- Web pages which contain one or more tables (<table> </table>) can be parsed.
- Web pages which contain one or more lists. It is possible to extract information from unordered (<ul> </ul>), ordered (<ol> </ol>), and definition (<dl> </dl>) lists.
- Web pages that contain nested structures built using tables and/or lists.
The final implementation of the WebParser is written in Java to obtain portable and reusable software. This implementation decision was made after taking into account that our main goal is to integrate this software into web agents. Java is a suitable and very popular language, currently used by a large number of researchers and companies to implement their agent-based and web applications.
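The special-character preprocessing step listed above can be illustrated with a minimal Java sketch. The class name SpecialCharRules, the method preprocess and the in-memory rule table are our assumptions for illustration only; the chapter states that the rules live in text files but does not give their format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: maps accented characters and their HTML entities to plain letters. */
final class SpecialCharRules {
    // Assumed rule table; the WebParser itself reads its rules from a text file.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("&aacute;", "a");
        RULES.put("&eacute;", "e");
        RULES.put("&ntilde;", "n");
        RULES.put("\u00e1", "a"); // literal a with acute accent
        RULES.put("\u00e9", "e"); // literal e with acute accent
        RULES.put("\u00f1", "n"); // literal n with tilde
    }

    /** Applies every rule in order and returns the normalized page text. */
    static String preprocess(String html) {
        String result = html;
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            result = result.replace(rule.getKey(), rule.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints: <td>Espana</td><td>Malaga</td>
        System.out.println(preprocess("<td>Espa&ntilde;a</td><td>M\u00e1laga</td>"));
    }
}
```

Keeping the rules in a table rather than in code mirrors the idea that the engineer edits rules, not the parser.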

Definition of the Rule Sets in the WebParser

Given the architecture designed for the WebParser (shown in Figure 2), two different kinds of rules must be provided to extract the information from a given page.

HTML-Rules. Although it is possible to define different rules, the WebParser uses one specific HTML-Rule to filter each page. This rule selects which structures will be filtered from the page. These filtering rules have two attributes:
- Type. This attribute tells the parser what type of structure will be filtered. Only the values list and table are allowed.
- Position. If the web page stores several structures (tables, lists, etc.), this attribute identifies which of those structures is the target of the extraction process. If there are nested structures, the dot (".") is used to give the exact position of the structure; e.g., struc1.struc2.strucj means that the information stored in the j-th structure, which is nested two levels deep, will be extracted.
DataOutput-Rules. These rules define the output data structure and what knowledge will be extracted from the page. As with the HTML-Rules, only one of these rules is used for each page. These rules are built using the following attributes:
- Data Level. This attribute shows where the data is located within the structure.
- Begin-mark/End-mark. Once the cells that store the data are fixed (using the previous attribute), it is necessary to set the begin and end patterns that enclose the data. For instance, when the data is stored in a table cell (<td> data </td>), the tag <td> can be used as the begin-mark to show where the string that represents the data begins, and </td> as the end-mark. Any string can be used to indicate the beginning and ending of the pattern.
- Attribute-name. Once the information is selected, it is necessary to provide the names of the attributes that will be associated with the retrieved information.
- Attribute distribution. This attribute shows how the attribute names are distributed when the structure to be filtered is a table. In this situation, the table can have a horizontal (Table 2 inside the third structure in Figure 3) or a vertical (Table 3 inside the third structure in the example) distribution. This attribute can have a null value if the structure does not have any attribute names (e.g., a table with only numerical information). If the structure to filter is a list, the value will be null because the parser needs no distribution (the different items retrieved will be stored in a Java vector).
- Data types. The default type of any attribute or data extracted is String. However, the parser can extract other types of data such as integer (int), float (flo), double (doub), etc. The WebParser will cast the extracted string into the desired type of data.
- Data structure. Finally, it is necessary to provide the final data output structure that the parser will generate. It can be either a vector or a table. It is possible to select a horizontal table (tableh: the attributes will be put in the first row and the data in the next rows) or a vertical table (tablev: the attributes will be put in the first column and the data in the next columns). If the extracted information is a list, it will be stored in a vector.
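One plausible reading of the Position attribute can be sketched in Java as follows. The class StructureLocator and its methods are hypothetical names, not part of the WebParser. The sketch assumes that positions are 1-based, that only the outermost tables and lists of the current fragment are counted, and that opening and closing tags are properly nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the Position attribute; not the WebParser's actual code. */
final class StructureLocator {
    // Opening or closing tag of a table or list; group 1 is "/" for closing tags.
    private static final Pattern TAG =
        Pattern.compile("<(/?)(table|ul|ol|dl)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns the inner HTML of every outermost table or list, in document order. */
    static List<String> topLevelStructures(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        int depth = 0;
        int innerStart = -1;
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                if (depth == 0) innerStart = m.end(); // content starts after the opening tag
                depth++;
            } else if (depth > 0) {
                depth--;
                if (depth == 0) found.add(html.substring(innerStart, m.start()));
            }
        }
        return found;
    }

    /** Follows a dotted position such as "3.1" down through nested structures. */
    static String locate(String html, String position) {
        String current = html;
        for (String part : position.split("\\.")) {
            int index = Integer.parseInt(part.trim());
            List<String> structures = topLevelStructures(current);
            if (index < 1 || index > structures.size()) {
                throw new IllegalArgumentException("No structure at position " + position);
            }
            current = structures.get(index - 1);
        }
        return current;
    }

    public static void main(String[] args) {
        String page = "<ul><li>a</li></ul>"
            + "<table><tr><td>b</td></tr></table>"
            + "<ul><li><table><tr><td>c</td></tr></table></li></ul>";
        System.out.println(locate(page, "2"));   // <tr><td>b</td></tr>
        System.out.println(locate(page, "3.1")); // <tr><td>c</td></tr>
    }
}
```

The depth counter does not check that a closing tag matches the type of the tag it closes, so malformed pages would need a more careful scanner.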

Figure 3: Web Page Example and HTML Code with Several Types of Structures

Figure 3 shows an example of a simple web page (and its related HTML code) that stores three different structures: a simple unordered list, a table, and a nested structure which combines lists and tables recursively. For instance, if we wish to extract only the second simple table and the ordered list stored in the third structure (which is nested inside a table that is itself nested inside a list) from the web page, it is only necessary to define the rules (HTML and DataOutput) shown in Figure 4. The parser interprets the attributes shown in these rules as follows:

HTML-Rule (a) describes that the structure to extract is a table, and that it is the second (position=2) structure stored in the page. DataOutput-Rule (a) shows that the data is in the second cell of the table, and that the <td> tag is used as the begin-end pattern. The names of the attributes are provided to the parser in order: since the distribution of the attributes in the table is horizontal/vertical (distrib=hv), we first indicate the names of the attributes in the rows ({att1,1+att1,2+att1,3}) and then the names of the attributes in the columns ({att2,1+att3,1}). The data type to retrieve is String and, finally, the WebParser will generate a table (data struc=tablehv) to store the retrieved data.
HTML-Rule (b) describes that the structure to extract is a list which is stored inside the third structure (position=3). DataOutput-Rule (b) shows that the data is stored in the second cell of the table, which is stored in the third position in the list (data Level=3.3.2), and that it uses the <li> tag as the begin-end pattern. There are no names associated with the data to retrieve (attrib={null}), and no distribution of them is necessary. The data type to retrieve is integer and, finally, the WebParser will generate a vector (data struc=sortlist) to store the retrieved data.
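The begin-mark/end-mark and data type attributes used by rule (b) can be sketched as follows. MarkExtractor, between and cast are hypothetical names. The sketch assumes marks are matched literally and are not nested, and it uses a List where the text speaks of a Java vector.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of mark-based extraction followed by a type cast. */
final class MarkExtractor {
    /** Collects every trimmed substring enclosed by beginMark and endMark. */
    static List<String> between(String html, String beginMark, String endMark) {
        List<String> values = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = html.indexOf(beginMark, from);
            if (start < 0) break;
            start += beginMark.length();
            int end = html.indexOf(endMark, start);
            if (end < 0) break;
            values.add(html.substring(start, end).trim());
            from = end + endMark.length();
        }
        return values;
    }

    /** Casts one extracted string using the type codes named in the text. */
    static Object cast(String value, String typeCode) {
        switch (typeCode) {
            case "int":  return Integer.valueOf(value);
            case "flo":  return Float.valueOf(value);
            case "doub": return Double.valueOf(value);
            default:     return value; // "str" and unknown codes stay as String
        }
    }

    public static void main(String[] args) {
        List<Object> vector = new ArrayList<>();
        for (String item : between("<ol><li>10</li><li> 20 </li></ol>", "<li>", "</li>")) {
            vector.add(cast(item, "int"));
        }
        System.out.println(vector); // [10, 20]
    }
}
```

A value that cannot be parsed as the requested type raises NumberFormatException here; how the WebParser itself reports such cases is not stated in the text.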

Rule (a): table

 - HTML rule:
      type= table
      position= 2
 - DataOutput rule:
      data Level= 2
      begin-mark= <td>
      end-mark= </td>
      attrib= {att1,1+att1,2+att1,3}
              {att2,1+att3,1}
      distrib= hv
      data type= (str)
      data struc= tablehv

Rule (b): list

 - HTML rule:
      type= list
      position= 3
 - DataOutput rule:
      data Level= 3.3.2
      begin-mark= <li>
      end-mark= </li>
      attrib= {null}
      distrib= (null)
      data type= (int)
      data struc= sortlist

Figure 4: HTML and DataOutput-Rules to Extract the Information Stored in the Selected Structures
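Because the rules are stored in text files, a reader for them could look like the sketch below. The exact file format is not given in the text, so this assumes one "key= value" pair per line, as laid out in Figure 4. It also assumes that lines ending in a colon are section labels and that a line without "=" continues the previous value. RuleReader is a hypothetical name, and the sample values follow the description of rule (a).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical reader for rule text in an assumed "key= value" layout. */
final class RuleReader {
    /** Parses rule text into an ordered map of attribute name to raw value. */
    static Map<String, String> parse(String ruleText) {
        Map<String, String> rule = new LinkedHashMap<>();
        String lastKey = null;
        for (String rawLine : ruleText.split("\\R")) {
            String line = rawLine.trim();
            // Skip blank lines and labels such as "- HTML rule:".
            if (line.isEmpty() || line.endsWith(":")) continue;
            int eq = line.indexOf('=');
            if (eq > 0) {
                lastKey = line.substring(0, eq).trim();
                rule.put(lastKey, line.substring(eq + 1).trim());
            } else if (lastKey != null) {
                // Continuation line: append to the previous attribute's value.
                rule.put(lastKey, rule.get(lastKey) + " " + line);
            }
        }
        return rule;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "- HTML rule:",
            "type= table",
            "position= 2",
            "- DataOutput rule:",
            "data Level= 2",
            "begin-mark= <td>",
            "end-mark= </td>",
            "attrib= {att1,1+att1,2+att1,3}",
            "        {att2,1+att3,1}",
            "distrib= hv",
            "data type= (str)",
            "data struc= tablehv");
        Map<String, String> rule = parse(text);
        System.out.println(rule.get("attrib"));     // {att1,1+att1,2+att1,3} {att2,1+att3,1}
        System.out.println(rule.get("data struc")); // tablehv
    }
}
```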

The output of the WebParser is a Java object (a vector or a table), so the agent can modify this output as needed.
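The two table layouts named by the Data structure attribute (tableh and tablev) can be sketched as follows. OutputTables is a hypothetical class, nested lists stand in for whatever table object the WebParser actually returns, and the attribute names and records are made-up sample values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the tableh and tablev output layouts. */
final class OutputTables {
    /** tableh: attribute names in the first row, one record per following row. */
    static List<List<String>> tableH(List<String> attributes, List<List<String>> records) {
        List<List<String>> table = new ArrayList<>();
        table.add(new ArrayList<>(attributes));
        for (List<String> record : records) {
            table.add(new ArrayList<>(record));
        }
        return table;
    }

    /** tablev: attribute names in the first column, one record per following column. */
    static List<List<String>> tableV(List<String> attributes, List<List<String>> records) {
        List<List<String>> table = new ArrayList<>();
        for (int i = 0; i < attributes.size(); i++) {
            List<String> row = new ArrayList<>();
            row.add(attributes.get(i));
            for (List<String> record : records) {
                row.add(record.get(i));
            }
            table.add(row);
        }
        return table;
    }

    public static void main(String[] args) {
        List<String> attrs = Arrays.asList("name", "price"); // sample attribute names
        List<List<String>> data = Arrays.asList(
            Arrays.asList("pen", "2"),
            Arrays.asList("ink", "5"));
        System.out.println(tableH(attrs, data)); // [[name, price], [pen, 2], [ink, 5]]
        System.out.println(tableV(attrs, data)); // [[name, pen, ink], [price, 2, 5]]
    }
}
```

Since both layouts are ordinary collections, an agent can reorder, filter or convert them after extraction without touching the parser.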