Chapter XIII - Query-By-Structure Approach for the Web
Data Mining: Opportunities and Challenges
by John Wang (ed) 
Idea Group Publishing 2003

A significant amount of research has already been done on the query-by-structure approach. The CLaP system was initially created to determine the feasibility of the approach. It allowed the user to dynamically search the Web for documents based on an elaborate set of structure criteria. Even early testing with the CLaP system indicated that the use of document structure could improve query results. However, the queries required the user to specify exactly which tags the search words would appear between, which proved far too cumbersome for the system to be usable. So, although it exhibited promising results, the CLaP system was essentially abandoned in favor of systems using the neural network query-by-structure approach.

The results from testing the query-by-structure approach using neural networks have been promising enough to warrant further research, and development of an enhanced version of the systems has already begun. The neural network systems did eliminate the need for extensive structure information during the query process. They do, however, require that the user select a document type. Consequently, the document structure criteria were predetermined by the system. However, the notion of how a specific type of document should look should really be determined by the user of the system rather than by its designer. As a result, future enhancements to the system will allow the user to specify structure criteria, not by providing detailed structure information, but rather by providing examples of document types. Hence, the prototype vectors for the system will be constructed not by the designers of the system, but dynamically, based on the examples provided by the users. In simple terms, the user will be able to say, "Give me a document that looks like this but contains these words."
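The idea of building prototype vectors from user-supplied examples can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the chapter does not say which structural features the system uses, so the tag list, the averaging step, the cosine similarity measure, and the threshold are all hypothetical stand-ins, and the sketch uses no neural network.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Hypothetical feature set: the tags counted here are an assumption,
# not the features used by the systems described in the chapter.
TAGS = ["h1", "h2", "p", "li", "table", "a", "img"]


class StructureParser(HTMLParser):
    """Counts start tags and collects the visible text of a document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1

    def handle_data(self, data):
        self.text_parts.append(data)


def parse(html):
    """Return (structure vector, lowercased text) for one HTML document."""
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts[t] for t in TAGS) or 1
    vector = [parser.counts[t] / total for t in TAGS]
    return vector, " ".join(parser.text_parts).lower()


def prototype(examples):
    """Average the structure vectors of the user's example documents."""
    vectors = [parse(doc)[0] for doc in examples]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def matches(html, proto, words, threshold=0.8):
    """True if the document contains every word and resembles the prototype."""
    vector, text = parse(html)
    has_words = all(word.lower() in text for word in words)
    return has_words and cosine(vector, proto) >= threshold
```

A user would pass one or more example pages to prototype() and then call matches() on each candidate page with the search words. Substring matching on the text is a deliberate simplification.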

In addition to structure, another new trend to consider in the query process is word semantics. According to the online American Heritage Dictionary (http://education.yahoo.com/reference/dictionary/), the definition of the word "semantics" is "the meaning or the interpretation of a word, sentence, or other language form." It should be noted that the meaning of a word and the interpretation of a word are two entirely different things. For example, the word "screen" can have several different meanings depending upon what it refers to. A computer has a screen; a window can have a screen; a movie theater has a screen; and, during the hiring process, it is standard practice to screen an applicant. Hence, the interpretation of the word "screen" is entirely dependent upon the context in which the word is used. Additionally, several different words can have the same meaning. For example, depending upon the context, the words auto, automobile, car, and vehicle could be interpreted as the same thing. Most search engines today, unless specifically designed to do so, cannot handle the semantics of words. As a result, although a search issued on a commercial search engine will return a significant number of purportedly relevant pages, the results will likely contain several irrelevant pages in which one or more of the search words is used with a different meaning. In addition, since other words may have a meaning identical to one of the search words, several relevant pages will not be returned by the search engine at all.
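Both failure modes can be seen in a toy example. The sketch below is only illustrative: the four documents and the synonym table are made up, and a real system would presumably draw its synonyms from a thesaurus or lexical database rather than a hand-written dictionary.

```python
# Hypothetical synonym table and document collection, for illustration only.
SYNONYMS = {"car": {"auto", "automobile", "vehicle"}}

DOCS = {
    "d1": "used car for sale",
    "d2": "automobile repair manual",
    "d3": "window screen replacement",
    "d4": "computer screen resolution",
}


def literal_search(word, docs):
    """Return ids of documents that contain the exact search word."""
    return sorted(doc_id for doc_id, text in docs.items() if word in text.split())


def expanded_search(word, docs, synonyms):
    """Return ids of documents that contain the word or any listed synonym."""
    terms = {word} | synonyms.get(word, set())
    return sorted(doc_id for doc_id, text in docs.items() if terms & set(text.split()))
```

Here literal_search("car", DOCS) finds only d1 and misses the relevant d2, while expanded_search("car", DOCS, SYNONYMS) finds both. In the other direction, literal_search("screen", DOCS) returns d3 and d4 even though the two documents use the word in different senses, and synonym expansion alone does nothing to separate them.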

Unfortunately, identifying word semantics is an extremely difficult task. A more feasible approach might be to identify document semantics instead. On a basic level, document semantics can be thought of as a document's type. In other words, almost every document has some overall meaning of its own. Resumes, research papers, and news articles, for example, are all specific types of documents. The general idea is that a document's type could provide information about the semantics of the words within the document. For example, if a document is classified as a resume and the user specified "restaurant" as one of the search words, the word "bar" would likely be synonymous with tavern. If, on the other hand, the user had specified "lawyer" as one of the search words, then the word "bar" would likely refer to the legal bar. Hence, by allowing the user to specify semantic information in the form of a document type, and subsequently using this information during the query process, valuable information about the semantics of the words within the document can potentially be captured.

Finally, it should be noted that a paradigm shift is currently taking place towards a new Semantic Web (Berners-Lee, Hendler, & Lassila, 2001). The basic notion is that new Web documents are being created with specific semantic tags, primarily using XML. These new types of documents provide additional meaning to the text within the document. In other words, the HTML tags within a document describe its structure or layout, while the XML tags provide a more detailed description of its content. New search tools are being created to identify documents containing these descriptive semantic tags. Such tools could potentially revolutionize the way Web searches are performed and greatly increase the quality of these searches.
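As a rough illustration of how descriptive tags could narrow a search, the sketch below looks for a word only inside elements with a given tag name. The markup and its tag names (resume, degree, employer) are hypothetical and do not come from any standard vocabulary. The parsing uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tag names are invented for this example.
XML_DOC = """<resume>
  <name>Jane Doe</name>
  <education><degree>MS Computer Science</degree></education>
  <experience><employer>Acme Bar and Grill</employer></experience>
</resume>"""


def find_in_tag(xml_text, tag, word):
    """Return the text of every <tag> element whose text contains the word."""
    root = ET.fromstring(xml_text)
    return [
        element.text
        for element in root.iter(tag)
        if element.text and word.lower() in element.text.lower()
    ]
```

With this document, find_in_tag(XML_DOC, "employer", "bar") returns the employer's name, while find_in_tag(XML_DOC, "degree", "bar") returns an empty list. The tag name supplies the context that a plain keyword search lacks.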

ISBN: 1591400511
Year: 2003
Pages: 194
Authors: John Wang
