9.2. Controlled Vocabularies

Vocabulary control comes in many shapes and sizes. At its most vague, a controlled vocabulary is any defined subset of natural language. At its simplest, a controlled vocabulary is a list of equivalent terms in the form of a synonym ring, or a list of preferred terms in the form of an authority file. Define hierarchical relationships between terms (e.g., broader, narrower) and you've got a classification scheme. Model associative relationships between concepts (e.g., see also, see related) and you're working on a thesaurus. Figure 9-1 illustrates the relationships between different types of controlled vocabularies.

Figure 9-1. Types of controlled vocabularies


Since a full-blown thesaurus integrates all the relationships and capabilities of the simpler forms, let's explore each of these building blocks before taking a close look at the "Swiss Army Knife" of controlled vocabularies.

9.2.1. Synonym Rings

A synonym ring (see Figure 9-2) connects a set of words that are defined as equivalent for the purposes of retrieval. In practice, these words are often not true synonyms. For example, imagine you're redesigning a consumer portal that provides ratings information about household products from several companies.

Figure 9-2. A synonym ring


When you examine the search logs and talk with users, you're likely to find that different people looking for the same thing are entering different terms. Someone who's buying a food processor may enter "blender" or one of several product names (or their common misspellings). Take a look at the content, and you're likely to find many of these same variations.

There may be no preferred terms, or at least no good reason to define them. Instead, you can use the out-of-the-box capabilities of a search engine to build synonym rings. This can be as simple as entering sets of equivalent words into a text file. When a user enters a word into the search engine, that word is checked against the text file. If the word is found, then the query is "exploded" to include all of the equivalent words. For example, in Boolean logic:

(kitchenaid) becomes (kitchenaid or "kitchen aid" or blender or 
"food processor" or cuisinart or cuizinart)

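The lookup-and-explode step described above can be sketched in a few lines of Python. This is a minimal illustration, not the behavior of any particular search engine: the ring contents come from the examples in this section, while the names SYNONYM_RINGS, build_lookup, and explode_query are hypothetical.

```python
# Each ring is a list of terms treated as equivalent for retrieval.
# In practice these might be loaded from a plain text file, one ring per line.
SYNONYM_RINGS = [
    ["kitchenaid", "kitchen aid", "blender", "food processor",
     "cuisinart", "cuizinart"],
    ["pocketpc", "pocket pc"],
]


def build_lookup(rings):
    """Map every term to the full ring it belongs to."""
    lookup = {}
    for ring in rings:
        for term in ring:
            lookup[term.lower()] = ring
    return lookup


def quote(term):
    """Wrap multiword terms in quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def explode_query(term, lookup):
    """Return a Boolean OR query covering every term in the matching ring."""
    key = term.strip().lower()
    ring = lookup.get(key)
    if ring is None:
        return quote(key)  # no ring found: leave the query as a single term
    # Put the user's own term first, then the rest of the ring in order.
    others = [t for t in ring if t != key]
    return "(" + " or ".join(quote(t) for t in [key] + others) + ")"


LOOKUP = build_lookup(SYNONYM_RINGS)
print(explode_query("kitchenaid", LOOKUP))
# prints: (kitchenaid or "kitchen aid" or blender or "food processor" or cuisinart or cuizinart)
```

Note that there is no preferred term anywhere in this structure: every member of a ring leads to the same expanded query.
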
What happens when you don't use synonym rings? Consider Figure 9-3, which shows the results of a search for "pocketpc." Pretty discouraging, huh? Looks like we might have to look elsewhere. But look what happens when we put a space between "pocket" and "pc" (Figure 9-4).

Figure 9-3. Results of a search at Computershopper


Figure 9-4. Another search on the same site


Suddenly, the site has oodles of information about the Pocket PC. A simple synonym ring linking "pocketpc" and "pocket pc" would solve what is a common and serious problem from both user and business perspectives.

However, synonym rings can also introduce new problems. If the query term expansion operates behind the scenes, users can be confused by results that don't actually include their keywords. In addition, the use of synonym rings may result in less relevant results. This brings us back to the subject of precision and recall.

As you may recall from Chapter 8, precision refers to the proportion of documents in a given result set that are relevant. To request high precision, you might say, "Show me only the relevant documents." Recall refers to the proportion of all the relevant documents in the system that appear in the result set. To request high recall, you might say, "Show me all the relevant documents." Figure 9-5 shows the mathematics behind precision and recall ratios.

Figure 9-5. Precision and recall ratios
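
The two ratios are easy to compute once you know which documents are relevant. The following Python sketch follows the definitions given above; the document identifiers and counts are invented purely for illustration, and the function names are our own.

```python
def precision(retrieved, relevant):
    """Relevant documents retrieved / all documents retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Relevant documents retrieved / all relevant documents in the system."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical collection: 10 relevant documents exist in the system.
relevant_docs = {f"doc{i}" for i in range(1, 11)}

# A narrow query returns 2 documents, both relevant.
narrow = {"doc1", "doc2"}
# A widened query returns 8 relevant documents plus 8 irrelevant ones.
wide = {f"doc{i}" for i in range(1, 9)} | {f"junk{i}" for i in range(1, 9)}

print(precision(narrow, relevant_docs), recall(narrow, relevant_docs))  # 1.0 0.2
print(precision(wide, relevant_docs), recall(wide, relevant_docs))      # 0.5 0.8
```

In this made-up case, widening the query raises recall from 0.2 to 0.8 while cutting precision from 1.0 to 0.5.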


While both high precision and high recall may be ideal, it's generally understood in the information retrieval field that you usually increase one at the expense of the other. This has important implications for the use of controlled vocabularies.

As you might guess, synonym rings can dramatically improve recall. In one study conducted at Bellcore in the 1980s,[*] the use of synonym rings (they called it "unlimited aliasing") within a small test database increased recall from 20 to 80 percent. However, synonym rings can also reduce precision. Good interface design and an understanding of user goals can help strike the right balance. For example, you might use synonym rings by default but order the exact keyword matches at the top of the search results list. Or, you might ignore synonym rings for initial searches but provide the option to "expand your search to include related terms" if there were few or no results.

[*] The Trouble with Computers: Usefulness, Usability, and Productivity, by Thomas K. Landauer (MIT Press).
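
Both interface strategies can be sketched briefly in Python. This is only an illustration under simplifying assumptions: matching is a naive case-insensitive substring test, the caller passes in the ring that contains the query term, and the document titles and function names are hypothetical.

```python
RING = ["pocketpc", "pocket pc"]  # the ring that contains the user's query term

# Hypothetical document titles keyed by document ID.
DOCUMENTS = {
    1: "Review: the new Pocket PC handhelds",
    2: "PocketPC accessories buying guide",
    3: "Laptop battery tips",
    4: "Pocket PC vs. Palm: which to buy",
}


def matches(term, text):
    """Naive case-insensitive substring match."""
    return term.lower() in text.lower()


def search(query, ring, documents):
    """Strategy 1: expand by default, but list exact keyword matches first."""
    exact, expanded = [], []
    for doc_id, text in documents.items():
        if matches(query, text):
            exact.append(doc_id)
        elif any(matches(term, text) for term in ring):
            expanded.append(doc_id)
    return exact + expanded


def initial_search(query, documents, min_results=1):
    """Strategy 2: search literally, and flag when to offer expansion."""
    results = [d for d, text in documents.items() if matches(query, text)]
    offer_expansion = len(results) < min_results
    return results, offer_expansion


print(search("pocketpc", RING, DOCUMENTS))                     # [2, 1, 4]
print(initial_search("pocketpc", DOCUMENTS, min_results=2))    # ([2], True)
```

When initial_search reports that expansion should be offered, the interface can show the "expand your search" option and call search only if the user accepts it.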

In summary, synonym rings are a simple, useful form of vocabulary control. There is really no excuse for the conspicuous absence of this basic capability on many of today's largest web sites.

9.2.2. Authority Files

Strictly defined, an authority file is a list of preferred terms or acceptable values. It does not include variants or synonyms. Authority files have traditionally been used largely by libraries and government agencies to define the proper names for a set of entities within a limited domain.

As shown in Figure 9-6, the Utah State Archives & Records Service has published a listing of the authoritative names of public institutions in the state of Utah. This is primarily useful from content authoring and indexing perspectives. Authors and indexers can use this authority file as the source for their terms, ensuring accuracy and consistency.

Figure 9-6. An authority file


In practice, authority files commonly include both preferred and variant terms. In other words, they are synonym rings in which one term has been defined as the preferred term or acceptable value.

The two-letter codes that constitute the standard abbreviations for U.S. states as defined by the U.S. Postal Service provide an instructive example. Using the purist definition, the authority file includes only the acceptable codes:

AL, AK, AZ, AR, CA, CO, CT, DE, DC, FL, GA, HI, ID, 
IL, IN, IA, KS, KY, LA, ME, MD, MA, MI, MN, MS, MO, MT, NE, NV, NH, 
NJ, NM, NY, NC, ND, OH, OK, OR, PA, PR, RI, SC, SD, TN, TX, UT, VT, 
VA, WA, WV, WI, WY.

However, to make this list useful in most scenarios, it's necessary to include, at a minimum, a mapping to the names of states:

AL Alabama
AK Alaska
AZ Arizona
AR Arkansas
CA California
CO Colorado
CT Connecticut
 . . . 

To make this list even more useful in an online context, it may be helpful to include common variants beyond the official state name:

CT Connecticut, Conn, Conneticut, Constitution State

At this point, we run into some important questions about the use and value of authority files in the online environment. Since users can perform keyword searches that map many terms onto one concept, do we really need to define preferred terms, or can synonym rings handle things just fine by themselves? Why take that extra step to distinguish CT as the acceptable value?

First, there are a couple of backend reasons. An authority file can be a useful tool for content authors and indexers, enabling them to use the approved terms efficiently and consistently. Also, from a controlled vocabulary management perspective, the preferred term can serve as the unique identifier for each collection of equivalent terms, allowing for more efficient addition, deletion, and modification of variant terms.
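
To make the management argument concrete, here is a minimal Python sketch of an authority file in which the preferred term serves as the key for its variants. The entries come from the state code example above, except for "Nutmeg State," which we add to show maintenance; the structure and the function names are assumptions for illustration, not a description of any particular product.

```python
# Preferred term (the USPS code) -> list of variant terms.
AUTHORITY_FILE = {
    "AL": ["Alabama"],
    "AK": ["Alaska"],
    "CT": ["Connecticut", "Conn", "Conneticut", "Constitution State"],
}


def build_variant_index(authority_file):
    """Map every variant, and each preferred term itself, to its preferred term."""
    index = {}
    for preferred, variants in authority_file.items():
        index[preferred.lower()] = preferred
        for variant in variants:
            index[variant.lower()] = preferred
    return index


def resolve(term, index):
    """Return the preferred term for a supplied term, or None if unknown."""
    return index.get(term.strip().lower())


def add_variant(authority_file, index, preferred, variant):
    """Attach a new variant to an existing preferred term.

    Raises KeyError if the preferred term is not in the authority file.
    """
    authority_file[preferred].append(variant)
    index[variant.lower()] = preferred


INDEX = build_variant_index(AUTHORITY_FILE)
print(resolve("Conneticut", INDEX))  # CT

# Adding a variant touches only the record identified by the preferred term.
add_variant(AUTHORITY_FILE, INDEX, "CT", "Nutmeg State")
print(resolve("nutmeg state", INDEX))  # CT
```

Because every variant hangs off a single preferred term, adding, deleting, or modifying a variant never requires touching the other equivalent terms.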

There are also a number of ways that the selection of preferred terms can benefit the user. Consider Figure 9-7, where Drugstore.com is providing a mapping between the equivalent term "tilenol" and the authoritative brand name, "Tylenol." By showing users the preferred terms, you can educate them. In some cases, you'll be helping them to correct a misspelling. In others, you may be explaining industry terminology or building brand recognition.

Figure 9-7. Mapping between equivalent terms


These "lessons" may be useful in very different contexts, perhaps during the next telephone conversation or in-store interaction a customer has with your organization. It's an opportunity to nudge everyone toward speaking the same language, without assuming or requiring such conformity within the search system. In effect, the search experience can be similar to an interaction with a sales professional, who understands the language of the customer and translates it back to the customer using the company or industry terminology.

Preferred terms are also important as the user switches from searching to browsing mode. When designing taxonomies, navigation bars, and indexes, it would be messy and overwhelming to present all of the synonyms, abbreviations, acronyms, and common misspellings for every term.

At Drugstore.com, only the brand names are included in the index (see Figure 9-8); equivalent terms like "tilenol" don't show up. This keeps the index relatively short and uncluttered, and in this example, reinforces the brand names. However, a trade-off is involved. In cases where the equivalent terms begin with different letters (e.g., aspirin and Bayer), there is value in creating pointers:

Aspirin see Bayer

Figure 9-8. Brand index at Drugstore.com


Otherwise, when users look in the index under A for aspirin, they won't find Bayer. The use of pointers is called term rotation. Drugstore.com doesn't do it at all. To see a good example of term rotation used in an index to guide users from variant to preferred terms, we'll switch to the financial services industry.

In Figure 9-9, users looking for "before-tax contributions" are guided to the preferred term "pretax contributions." Such integration of the entry vocabulary can dramatically enhance the usefulness of the site index. However, it needs to be done selectively; otherwise, the index can become too long, harming overall usability. Once again, a careful balancing act is involved that requires research and good judgment.

Figure 9-9. A site index with term rotation
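
A selective set of pointers can be merged into an index with very little machinery. The Python sketch below is hypothetical: it reuses the aspirin and Bayer example from earlier in this section, and the function name and data layout are our own assumptions.

```python
def build_site_index(preferred_terms, pointers):
    """Return alphabetized index lines, mixing entries and "see" pointers.

    preferred_terms: terms that link directly to content.
    pointers: selected variant term -> preferred term.
    """
    entries = [(term.lower(), term) for term in preferred_terms]
    for variant, preferred in pointers.items():
        if preferred not in preferred_terms:
            raise ValueError(f"pointer target {preferred!r} is not a preferred term")
        entries.append((variant.lower(), f"{variant} see {preferred}"))
    # Sort case-insensitively so pointers file under their own first letter.
    return [line for _, line in sorted(entries)]


site_index = build_site_index(
    preferred_terms=["Bayer", "Tylenol"],
    pointers={"Aspirin": "Bayer"},
)
for line in site_index:
    print(line)
# prints:
# Aspirin see Bayer
# Bayer
# Tylenol
```

Keeping the pointers in their own small mapping makes the balancing act explicit: each added variant lengthens the index by exactly one line, so you can decide term by term whether it earns its place.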


9.2.3. Classification Schemes

We use "classification scheme" to mean a hierarchical arrangement of preferred terms. These days, many people prefer to use "taxonomy" instead. Either way, it's important to recognize that these hierarchies can take different shapes and serve multiple purposes, including:

  • A frontend, browsable Yahoo-like hierarchy that's a visible, integral part of the user interface

  • A backend tool used by information architects, authors, and indexers for organizing and tagging documents

Consider, for example, the Dewey Decimal Classification (DDC). First published in 1876, the DDC is now "the most widely used classification scheme in the world. Libraries in more than 135 countries use the DDC to organize and provide access to their collections."[*] In its purest form, the DDC is a hierarchical listing that begins with 10 top-level categories and drills down into great detail within each.

[*] From OCLC's Introduction to the Dewey Decimal Classification at http://www.oclc.org/dewey/about/about_the_ddc.htm.

000 Computers, information, & general reference
100 Philosophy & psychology
200 Religion
300 Social sciences
400 Language
500 Science
600 Technology
700 Arts & recreation
800 Literature
900 History & geography
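
Because the hierarchy is encoded in the notation itself, even the top level can be worked out mechanically. The Python sketch below uses only the 10 categories listed above; the function name is hypothetical, and it makes no attempt to model the deeper levels of the DDC.

```python
# The 10 top-level DDC categories, keyed by the start of each hundreds range.
DDC_TOP_LEVEL = {
    0: "Computers, information, & general reference",
    100: "Philosophy & psychology",
    200: "Religion",
    300: "Social sciences",
    400: "Language",
    500: "Science",
    600: "Technology",
    700: "Arts & recreation",
    800: "Literature",
    900: "History & geography",
}


def top_level_class(ddc_number):
    """Return (class number, caption) for the top-level class of a DDC number."""
    value = float(ddc_number)
    if not 0 <= value < 1000:
        raise ValueError(f"not a valid DDC number: {ddc_number!r}")
    hundreds = int(value // 100) * 100
    return f"{hundreds:03d}", DDC_TOP_LEVEL[hundreds]


print(top_level_class("595.7"))   # ('500', 'Science')
print(top_level_class("025.04"))  # ('000', 'Computers, information, & general reference')
```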

For better or worse, the DDC finds its way into all sorts of interface displays. As Figure 9-10 shows, the National Library of Canada uses it as a browsable hierarchy.

Figure 9-10. The Dewey Decimal Classification in action


Classification schemes can also be used in the context of searching. Yahoo! does this very effectively. You can see in Figure 9-11 that Yahoo!'s search results present "Category Matches," which reinforces users' familiarity with Yahoo!'s classification scheme.

Figure 9-11. Category Matches at Yahoo!


The important point here is that classification schemes are not tied to a single view or instance. They can be used on both the back end and the front end in all sorts of ways. We'll explore types of classification schemes in more detail later in this chapter, but first let's take a look at the "Swiss Army Knife" of vocabulary control, the thesaurus.

9.2.4. Thesauri

Dictionary.com defines "thesaurus" as a "book of synonyms, often including related and contrasting words and antonyms." This usage hearkens back to our high school English classes, when we chose big words from the thesaurus to impress our teachers.

Our species of thesaurus, the one integrated within a web site or intranet to improve navigation and retrieval, shares a common heritage with the familiar reference text but has a different form and function. Like the reference book, our thesaurus is a semantic network of concepts, connecting words to their synonyms, homonyms, antonyms, broader and narrower terms, and related terms.

However, our thesaurus takes the form of an online database, tightly integrated with the user interface of a web site or intranet. And though the traditional thesaurus helps people go from one word to many words, our thesaurus does the opposite. Its most important goal is synonym management (the mapping of many synonyms or word variants onto one preferred term or concept), so the ambiguities of language don't prevent people from finding what they need.

So, for the purposes of this book, a thesaurus is:

A controlled vocabulary in which equivalence, hierarchical, and associative relationships are identified for purposes of improved retrieval.[*]

[*] Guidelines for the Construction, Format, and Management of Monolingual Thesauri. ANSI/NISO Z39.19-1993 (R1998).

A thesaurus builds upon the constructs of the simpler controlled vocabularies, modeling these three fundamental types of semantic relationships.

As you can see from Figure 9-12, each preferred term becomes the center of its own semantic network. The equivalence relationship is focused on synonym management. The hierarchical relationship enables the classification of preferred terms into categories and subcategories. The associative relationship provides for meaningful connections that aren't handled by the hierarchical or equivalence relationships. All three relationships can be useful in different ways for the purposes of information retrieval and navigation.

Figure 9-12. Semantic relationships in a thesaurus
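
One way to picture these three relationships is as a small data structure in which each preferred term carries its own variants, broader and narrower terms, and related terms. The Python sketch below is a rough illustration under our own assumptions: the class and method names are hypothetical, and apart from "Tylenol" and "tilenol," the sample terms and their relationships are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A preferred term at the center of its own semantic network."""
    preferred: str
    variants: set = field(default_factory=set)   # equivalence
    broader: set = field(default_factory=set)    # hierarchical (up)
    narrower: set = field(default_factory=set)   # hierarchical (down)
    related: set = field(default_factory=set)    # associative


class Thesaurus:
    def __init__(self):
        self.terms = {}   # preferred term -> Term
        self.entry = {}   # any variant or preferred term (lowercased) -> preferred term

    def add_term(self, preferred, variants=()):
        term = self.terms.setdefault(preferred, Term(preferred))
        term.variants.update(variants)
        for word in (preferred, *variants):
            self.entry[word.lower()] = preferred
        return term

    def add_hierarchy(self, broader, narrower):
        self.add_term(broader).narrower.add(narrower)
        self.add_term(narrower).broader.add(broader)

    def add_related(self, a, b):
        self.add_term(a).related.add(b)
        self.add_term(b).related.add(a)

    def lookup(self, word):
        """Map many possible entry words onto one preferred Term (or None)."""
        preferred = self.entry.get(word.strip().lower())
        return self.terms.get(preferred)


thesaurus = Thesaurus()
thesaurus.add_term("Tylenol", variants=["tilenol"])
thesaurus.add_hierarchy("Pain relievers", "Tylenol")   # invented category
thesaurus.add_related("Tylenol", "Bayer")              # invented association

term = thesaurus.lookup("tilenol")
print(term.preferred, term.broader, term.related)
```

The lookup method reflects the synonym management goal: whatever word the user enters, the system lands on a single preferred term, and the hierarchical and associative links radiate out from there.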