BACKGROUND


Information is a vital resource to individuals and organizations; its timely location can influence key decisions that affect both. It is, thus, no wonder that massive research efforts have been undertaken in recent years with the aim of improving upon existing search facilities, especially among unstructured and semi-structured resources, where the problem of information finding is most pronounced (Han & Chang, 2002). Ways to extract information from semi-structured texts have been investigated in many system integration projects (El-Beltagy, 1998), such as TSIMMIS (Garcia-Molina et al., 1995) and Lore (McHugh et al., 1997). These systems have tried to provide an integrated view of related data scattered across various structured and semi-structured resources, and have, thus, developed templates and wrappers to extract structured information from semi-structured texts. The primary goal of such systems was to unlock the wealth of information stored within legacy applications and to integrate it with related or similar data available in other resources. Toward this end, specific languages, representation models and ontologies were designed and adopted.

Much work has also been carried out within the knowledge acquisition community with the aim of providing automatic support for the extraction of information from unstructured texts. This task is still proving to be a rather challenging one. Information Extraction (IE) systems have, thus, appeared with the more focused goal of extracting information from specific domains or for particular tasks (Vargas-Vera et al., 2001).

IE systems often rely on templates, hand-generated annotations, or domain-dependent NLP knowledge. For example, the SoftMealy system (Hsu, 1998) and the system presented in Kushmerick et al. (1997) are both IE systems that attempt to extract information from web pages by learning from examples of such pages, all of which exhibit a similar structure. These systems work when structure templates with well-defined content fields exist. For example, a page containing some country codes may have the name of a country formatted in bold and the code for that country formatted in italics (Kushmerick et al., 1997). It is possible, then, to use this formatting information to extract country-code pairs.
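
The country-code case can be sketched in a few lines of Python. This is only an illustration of extraction driven by formatting, not the wrapper induction procedure of Kushmerick et al. (1997): the delimiters are written by hand rather than learned from example pages, and the page fragment, the `PAIR` pattern and the `extract_pairs` function are all hypothetical.

```python
import re

# Hypothetical page fragment; the markup and values are illustrative only.
PAGE = (
    "<HTML><BODY>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<B>Spain</B> <I>34</I><BR>"
    "</BODY></HTML>"
)

# Hand-written delimiters (an assumption): a country name is whatever sits
# between <B> and </B>, and its code is whatever sits in the <I>...</I>
# that immediately follows.
PAIR = re.compile(r"<B>(.*?)</B>\s*<I>(.*?)</I>", re.IGNORECASE | re.DOTALL)


def extract_pairs(html):
    """Return (country, code) tuples found through the bold/italic formatting."""
    return [(name.strip(), code.strip()) for name, code in PAIR.findall(html)]


if __name__ == "__main__":
    for country, code in extract_pairs(PAGE):
        print(country, code)
```

The sketch works only because every record on the page follows the same bold-then-italic layout; a page that departs from that layout yields no pairs.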

However, it is often the case that structure or formatting on its own cannot be used to extract information. One of the solutions intended to overcome this obstacle is to tag the information in a way that would enable its extraction. Indeed, XML (Bray et al., 1998) emerged as a way to achieve precisely that.
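
As a rough illustration of why tagging helps, the sketch below reads values by element name rather than by visual formatting. The element names (`countries`, `country`, `name`, `code`) and the function `extract_codes` are assumptions made for this example and do not come from any schema discussed in this chapter.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged document; the element names are illustrative only.
DOC = """<countries>
  <country><name>Egypt</name><code>20</code></country>
  <country><name>Spain</name><code>34</code></country>
</countries>"""


def extract_codes(xml_text):
    """Map each country name to its code using the tags, not the layout."""
    root = ET.fromstring(xml_text)
    return {
        country.findtext("name"): country.findtext("code")
        for country in root.iter("country")
    }


if __name__ == "__main__":
    print(extract_codes(DOC))
```

Because the tags name the content directly, the extraction no longer depends on how the page happens to be rendered.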

Taking this idea a step further is the approach adopted by SHOE (see Heflin & Hendler, 2000a; Heflin & Hendler, 2000b). SHOE is a web-based knowledge representation language that can be embedded in web pages. By explicitly specifying the ontology being used within a web page and tagging information within that page using that ontology, it is possible to extract information from that page appropriately and to infer relations and information not explicitly represented. This idea was the basis for the DARPA Agent Markup Language (DAML) (DARPA, 2000). DAML, RDF (Lassila & Swick, 1999) and a number of other languages are all part of the Semantic Web, the goal of which is to enrich information resources with semantics that can be processed by computers (Fensel, 2000).

What can be said regarding this approach, in general, is that its successful application to existing documents requires automatic metadata augmentation mechanisms. Manually re-authoring existing documents to comply with these emerging standards is simply not feasible because of their sheer volume. The work presented here attempts to provide such automatic augmentation, but only for documents that exhibit the characteristics outlined in the previous section and in the next.




(ed.) Intelligent Agents for Data Mining and Information Retrieval
Year: 2004
Pages: 171
