Section 8.5. Search Algorithms | Information Architecture for the World Wide Web: Designing Large-Scale Web Sites

8.5. Search Algorithms

Search engines find information in many ways. In fact, there are about 40 different retrieval algorithms alone, most of which have been around for decades. We're not going to cover them all here; if you'd like to learn more, read any of the standard texts on information retrieval.^[*]

^[*] A good starting point is Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto (Addison-Wesley).

We bring up the topic because it's important to realize that a retrieval algorithm is essentially a tool, and just like other tools, specific algorithms help solve specific problems. And as retrieval algorithms are at the heart of search engines, it's important to note that there is absolutely no single search engine that will meet all of your users' information needs. Remember that fact the next time you hear a search engine vendor claim that his product's brand-new proprietary algorithm is the solution to all information problems.

8.5.1. Pattern-Matching Algorithms

Most retrieval algorithms employ pattern matching; that is, they compare the user's query with an index of, typically, the full texts of your site's documents, looking for the same string of text. When a matching string is found, the source document is added to the retrieval set. So a user types the textual query "electric guitar," and documents that include the text string "electric guitar" are retrieved. It all sounds quite simple. But this matching process can work in many different ways to produce different results.

8.5.1.1. Recall and precision

Some algorithms return numerous results of varying relevance, while some return just a few high quality results. The terms for these opposite ends of the spectrum are recall and precision. There are even formulas for calculating them (note the difference in the denominators):

Recall = # relevant documents retrieved / # relevant documents in collection

Precision = # relevant documents retrieved / # total documents in collection

Are your site's users doing legal research, learning about the current state of scientific research in a field, or performing due diligence about an acquisition? In these cases, they'll want high recall. Each of the hundreds or thousands (or more?) results retrieved will have some relevance to the user's search, although perhaps not very much. As an example, if a user is "ego-surfing," she wants to see every mention of her nameshe is hoping for high recall. The problem, of course, is that along with good results come plenty of irrelevant ones.

On the other hand, if she's looking for two or three really good articles on how to get stains out of a wool carpet, then she's hoping for high-precision results. It doesn't matter how many relevant articles there are if you get a good enough answer right away.

Wouldn't it be nice to have both recall and precision at the same time? Lots and lots of very high-quality results? Sadly, you can't have your cake and eat it, too: recall and precision are inversely related. You'll need to decide what balance of the two will be most beneficial to your users. You can then select a search engine with an algorithm biased toward either recall or precision or, in some cases, you can configure an engine to accommodate one or the other.

For example, a search tool might provide automatic stemming, which expands a term to include other terms that share the same root (or stem). If the stemming mechanism is very strong, it might treat the search term "computer" as sharing the same root ("comput") as "computers," "computation," "computational," and "computing." Strong stemming in effect expands the user's query by searching for documents that include any of those terms. This enhanced query will retrieve more related documents, meaning higher recall.

Conversely, no stemming means the query "computer" retrieves only documents with the term "computer" and ignores other variants. Weak stemming might expand the query only to include plurals, retrieving documents that include "computer" or "computers." With weak stemming or no stemming, precision is higher and recall is lower. Which way should you go with your search systemhigh recall or high precision? The answer depends on what kinds of information needs your users have.

Another consideration is how structured the content is. Are there fields, rendered in HTML or XML or perhaps in a document record, that the search engine can "see" and therefore search? If so, searching for "William Faulkner" in the author field will result in higher precision, assuming we're looking for books authored by Faulkner. Otherwise, we're left with searching the full text of each document and finding results where "William Faulkner" may be mentioned, whether or not he was the author.

8.5.2. Other Approaches

When you already have a "good" document on hand, some algorithms will convert that document into the equivalent of a query (this approach is typically known as document similarity). "Stop words," like "the," "is," and "he" are stripped out of the good document, leaving a useful set of semantically rich terms that, ideally, represent the document well. These terms are then converted into a query that should retrieve similar results. An alternative approach is to present results that have been indexed with similar metadata. In Figure 8-7, the first of WebMD's results for a search on "West Nile" is complemented by a link to "See More Content like this."

Figure 8-7. Many WebMD search results are accompanied by a link to"See More Content like this"

Approaches such as collaborative filtering and citation searching go even further to help expand results from a single relevant document. In the following example from CiteSeer (see Figure 8-8), we've identified an article that we like: "Application Fault Level Tolerance in Heterogeneous Networks of Workstations." Research Index automatically finds documents in a number of ways.

Figure 8-8. CiteSeer provides multiple ways to expand from a single search result

Cited by: What other papers cite this one? The relationship between cited and citing papers implies some degree of mutual relevance. Perhaps the authors even know each other.
Active bibliography (related documents): Conversely, this paper cites others in its own bibliography, implying a similar type of shared relevance.
Similar documents based on text: Documents are converted into queries automatically and are used to find similar documents.
Related documents from co-citation: Another twist on citation, co-citation assumes that if documents appear together in the bibliographies of other papers, they probably have something in common.

There are other retrieval algorithms, more than we can cover here. What's most important is to remember that the main purpose of these algorithms is to identify the best pool of documents to be presented as search results. But "best" is subjective, and you'll need to have a good grasp of what users hope to find when they're searching your site. Once you have a sense of what they wish to retrieve, begin your quest for a search tool with a retrieval algorithm that might address your users' information needs.