Why Interpret Documents?


As we discussed in Chapter 3, a document is unstructured data. Until interpreted, it might as well say "blah, blah, blah." Once we have "interpreted" a document we may:

  • Know whether it applies to us

  • Know what "type" of document it is

  • Know what it is about

  • Understand the content (potentially at many different levels)

Does It Apply?

It has been said that "information is data that changes you." A forecast of snow will change you only if it is relevant to your future plans; for example, if you're planning a ski trip and the forecast is for heavy snow, you might choose to bring your good skis instead of your "rock skis."[27],[28] Until we interpret data, at least at some superficial level, it cannot possibly change us.

What "Type" of Document Is It?

The "type" of document is often shorthand for what it concerns (primarily a stereotypic format and a relationship to the reader). For example, we think of an "invoice" as a different type of document than a "testimonial." We could put all our documentation in letters in the exact same format, but this would require detailed reading of everything to find what we are looking for. The "type" of document sets up an expectation of the content.

What Is It About? (Superficial Understanding)

A mortgage broker can sift through a large pile of papers and rapidly find the ones that are appraisals and income verification, passing over reams of disclosures and estimates. Mortgage brokers can do this because they have practiced this interpretation and reduced it to sets of patterns that work rapidly and generally accurately. They also have a mechanism for checking whether their initial interpretations were accurate.

Once they know what type of document it is, and that it applies to the real estate property they are working on, they quickly move to understanding the content (e.g., extracting the appraised value or the income).
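To make the broker's two steps concrete, here is a minimal sketch in Python of pattern-based triage followed by simple content extraction. It is an illustration only: the patterns, the type names, and the functions `guess_type` and `first_amount` are assumptions invented for this example, not a description of how any broker or product actually works.

```python
import re

# Hypothetical patterns, chosen only for illustration. A practiced reader
# carries a far richer (and self-correcting) set of cues than these.
TYPE_PATTERNS = {
    "appraisal": re.compile(r"\bapprais(?:al|ed)\b", re.IGNORECASE),
    "income verification": re.compile(
        r"\bverification of (?:income|employment)\b", re.IGNORECASE
    ),
    "disclosure": re.compile(r"\bdisclosure\b", re.IGNORECASE),
}

# A dollar amount such as $450,000 or $6,250.00.
AMOUNT = re.compile(r"\$\s*(\d[\d,]*(?:\.\d{2})?)")


def guess_type(text):
    """Return the first document type whose pattern matches, or None."""
    for doc_type, pattern in TYPE_PATTERNS.items():
        if pattern.search(text):
            return doc_type
    return None


def first_amount(text):
    """Return the first dollar amount in the text as a float, or None."""
    match = AMOUNT.search(text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


pile = [
    "Uniform Residential Appraisal Report. Appraised value: $450,000.",
    "Truth in Lending Disclosure Statement",
    "Verification of Employment: gross monthly income $6,250.00",
]

# Step 1: sort the pile by type. Step 2: read only the documents we want.
wanted = {"appraisal", "income verification"}
for page in pile:
    doc_type = guess_type(page)
    if doc_type in wanted:
        print(doc_type, first_amount(page))
```

Note what the sketch leaves out: it has no way of checking whether its first guess was right, and it does not confirm that the document applies to the property at hand. Those are exactly the parts of the broker's skill that are hardest to reduce to patterns.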

What Is It Really About? (Levels of Content Understanding)

It is possible to interpret and therefore to understand a document at many levels. Key levels include the following:

  • Traditional document metadata—There is a set of information that is generally available as a by-product of creating or recording the document. This information is not specific to the content of the document. It includes author, date created, most recent modification date, language, document length, source, media, and format.

  • Keywords—Many techniques exist for extracting and indexing documents based on keywords. Generally, these words are not interpreted, but are indexed because they represent unusual words on which someone may want to base a search. One of the most common techniques is to eliminate all the common English words from a document and then consolidate the duplicate words in what remains. This gives a reasonable summary of words on which to base a search, again, without any interpretation of the meaning or context of the words.

  • Key concepts—The next level requires interpretation. As we discuss later in this chapter, several techniques exist, including human interpretation, but all share the goal of determining word meaning in context. This level of interpretation has to find meaning in phrases, as well as individual words.

  • Document type—For most of our correspondence, knowing the type of document is the most important factor. Remember, though, that types are a form of idiolect. Although we may have partial agreement on the difference between "bills" and "junk mail," many people make little or no distinction between "invoices" and "statements." There are three main ways to deduce a document type. The first is to have a human interpreter code it. The second is to have the type identified on the form or template from which the document was produced (this is how humans do a lot of their document recognition). The third is to interpret enough of the concepts to determine the type from the content.

  • Context—Sometimes context is easier to define than document type. There are two levels of context: a document within a broader group of documents (e.g., within a correspondence or within a compilation of related works) and the concepts within the document in the context of the document. For example, the word "frame" might be ambiguous, because it could mean the border around a picture, to rough in the walls of a house or a room, to implicate someone who is innocent, or to put in context. Context can often be derived from clues, and once derived can be used to further disambiguate other terms.

  • Relevance—Generally, we wait until we have established a document's type and context before we determine its relevance, but some approaches take advantage of surrogate measures of relevance. As we will discuss in the section on Google,[29] there are means of approximating relevance that can be effective even in the absence of any further interpretation. However, general relevance cannot help us when we wish to determine whether a document is specifically relevant to a particular task at hand.

  • Obligation and relationship—Finally, we interpret documents to determine if they specifically relate to us. In particular, does this document obligate us, or give us some special opportunity? Is this subpoena for me, or for someone else? Is this really a subpoena? How about this notification from Publishers Clearing House—is it legitimate, have I really "won" something? When our systems can sift incoming messages as efficiently as a personal assistant can, we will have made a leap forward in having computers help with semantic tasks.
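The keyword technique described in the list above (drop the common words, then consolidate the duplicates) is simple enough to sketch in a few lines of Python. This is a minimal illustration under stated assumptions: the stop list is a tiny hypothetical sample, and the function name `extract_keywords` is ours; real indexing tools use much longer lists and more careful tokenizing.

```python
import re

# A tiny, hypothetical list of common English words. Real stop lists
# run to hundreds of entries.
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "from",
    "has", "have", "in", "is", "it", "of", "on", "or", "that", "the",
    "this", "to", "was", "were", "will", "with",
})


def extract_keywords(text, stop_words=STOP_WORDS):
    """Return the distinct non-common words of text, in first-seen order.

    No meaning or context is considered: a word is kept simply because
    it is not on the stop list and has not been seen already.
    """
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    seen = set()
    keywords = []
    for word in words:
        if word in stop_words or word in seen:
            continue
        seen.add(word)
        keywords.append(word)
    return keywords


print(extract_keywords(
    "The forecast is for heavy snow, and the snow will be heavy."
))
# prints ['forecast', 'heavy', 'snow']
```

The result is a usable set of search terms, but nothing here knows that "frame" in one document is a picture border and in another is a verb. That gap is what the key concepts and context levels have to fill.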

The ability to automate some of the task of interpretation will become more significant as the knowledge explosion continues. We will not be able to know everything we need to know. It may be sufficient to be aware of whole fields of knowledge, as long as we can access them when needed.

[27]As pointed out by one of my reviewers, "rock skis" is not only a compound but is part idiolect (see Chapter 2). "Rock skis" are the skis with which you don't mind if you run over rocks.

[28]In another reviewer's idiolect, these are "rock hoppers."

[29]See http://www.google.com/ for further information.




Semantics in Business Systems: The Savvy Manager's Guide (The Savvy Manager's Guides)
ISBN: 1558609172
Year: 2005
Pages: 184
Authors: Dave McComb
