Chapter 5 -- Introducing Microsoft Full-Text Search Technologies | Microsoft SharePoint(TM) Portal Server 2001 Resource Kit (Examples & Explanations Series)

This chapter reviews the concept of full-text search and explains how different Microsoft® products implement full-text search. This information can help you determine which Microsoft products are best for your information retrieval needs.

Microsoft full-text search technology contributes to a number of server and client products. Search functionality varies, depending on the requirements of each product. However, all products benefit from the common advantage of efficient retrieval of unstructured, textual data by means of a full-text index.

The following Microsoft products use variants of Microsoft full-text search technology:

Index Server, Indexing Service for Microsoft Windows®
Microsoft SharePoint™ Portal Server 2001
Microsoft SQL Server™ 7 and SQL Server 2000
Microsoft Exchange Server 2000
Microsoft Site Server 3
Microsoft Office XP

The product you choose depends on your needs. For example, you might want to search intranet sites, Internet sites, or Exchange public folders, or you might want to search over structured or unstructured data. You might need to cater to an internal team, or you might need to serve the needs of customers over your extranet site. These and other considerations help you determine which product is best for you.

For more information about full-text search technology or these products, see Appendix B, "For More Information."

Full-Text Search

Full-text search provides relevant information from a collection of sources in response to a user's need. This need is typically expressed as a textual query that looks for each, or any, of the query terms in each of the documents in the collection. A simple approach opens and scans each document when a query is processed, looking for each of the query terms. However, opening every document at query processing time and searching for the query terms can be very time-consuming. This approach is impractical for anything beyond an individual user searching a small number of documents.

The solution is to do much of the work ahead of time. The search engine extracts information about the terms in each document and stores that information in an inverted index (called an inverse index later in this chapter), a structure that is easy to retrieve from. When the search engine processes a query, there is no need to scan each document. The search engine only needs to look up the query terms in the inverted index and compare the matching documents to each other. The search engine then chooses the documents that are most relevant to the query.
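The contrast between scanning at query time and doing the work ahead of time can be sketched in a few lines of Python. This is a minimal illustration only, not how any Microsoft product is implemented: the document collection, the function names, and the whitespace-based term splitting are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical three-document collection, used only for illustration.
DOCUMENTS = {
    "doc1": "full-text search uses an index",
    "doc2": "scanning every document is slow",
    "doc3": "an index makes search fast",
}


def scan_search(term, documents):
    """Naive approach: open and scan every document at query time."""
    return sorted(
        doc_id for doc_id, text in documents.items() if term in text.split()
    )


def build_index(documents):
    """Ahead-of-time work: map each term to the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def index_search(term, index):
    """Query-time work: a single lookup, with no document scanning."""
    return sorted(index.get(term, set()))


INDEX = build_index(DOCUMENTS)
```

Both `scan_search("index", DOCUMENTS)` and `index_search("index", INDEX)` return the same documents, but the second never touches the document text at query time. The cost of reading every document is paid once, when the index is built.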

The principle of doing much of the work ahead of query time serves as the foundation of all full-text search technologies, including Microsoft full-text search. To be effective, a search technology must:

Get documents from various document stores.
Extract text from various document formats.
Update the index with the document terms.
Rank the documents, bringing the most relevant documents to the top of a list.

Good search technology performs these tasks for documents in various languages, over many different types of formats, and across documents stored in a variety of document repositories. Good search technology returns those documents that are truly relevant to a user's need. At its best, full-text search technology fits into a complete knowledge solution, where direct textual query is the user's last resort. Full-text search technology should interpret the information the user needs by using advanced mechanisms, and it should answer the query with a combination of structured and unstructured information.

The following components of Microsoft full-text search technology provide an excellent full-text search solution:

Protocol handlers. A protocol handler accesses data over a particular protocol or from a particular store. Common protocol handlers include the file protocol, Hypertext Transfer Protocol (HTTP), Messaging Application Programming Interface (MAPI), and HTTP Distributed Authoring and Versioning (HTTPDAV). The protocol handler processes URLs passed to it by the Gatherer.
Gatherer. The Gatherer maintains the queue of URLs to be processed across protocols and manages the combined crawl. For example, a Web site crawl may include hundreds of pages and create network traffic by accessing each Web page one at a time. To increase efficiency, the Gatherer interleaves URLs from a remote Web location with URLs from other Web locations or with access to file system documents or other stores. The Gatherer also balances the load that the gathering process imposes on crawled servers, and it may use additional logic to improve crawl efficiency, such as SharePoint Portal Server adaptive crawling. For each document accessed, the Gatherer fetches the stream of content from the protocol handler and passes it on to the appropriate filter.
Filters. Filters (also known as IFilters) extract textual information from a specific document format, such as Microsoft Word documents or text files. For example, Microsoft provides the Microsoft Office filter, which can extract terms from Word, Microsoft Excel, and Microsoft PowerPoint® files. Other filters work with HTML or e-mail messages. There are also third-party filters, such as the PDF filter provided by Adobe.
The filter extracts a stream of textual information from a document, discarding all non-textual and formatting information. The filter produces strings of text and property/value pairs, which are passed in turn to the index engine. All filters are written to the same application programming interface (API). For more information about filters, see Appendix B.
Word breakers and stemmers. A word breaker is a component that determines where the word boundaries are in the stream of characters in the query or in the document being crawled. A stemmer extracts the root form of a given word. For example, "running," "ran," and "runner" are variants of the word "run." In some languages, a stemmer expands the root form of a word to include alternate forms.
SharePoint Portal Server provides word breakers for English, French, Spanish, Japanese, Thai, Korean, Traditional Chinese, and Simplified Chinese. SharePoint Portal Server uses the Windows 2000 Server Indexing Service word breakers for Dutch, Italian, Swedish, and German. When SharePoint Portal Server crawls documents that are in multiple languages, the customized word breaker for each language enables the resulting terms to be more accurate for that language. When there is a word breaker for the language family, but not for the specific sub-language, the major language is used. For example, SharePoint Portal Server uses the French word breaker to handle text that is French Canadian. If no word breaker is available for a particular language, SharePoint Portal Server uses the neutral word breaker. Words are broken at neutral characters, such as spaces and punctuation marks. The code for determining where words are broken is built into the Microsoft Search (MSSearch) service and cannot be changed. You cannot create custom word breakers.
Index engine. The function of the index engine is to prepare an inverse index of content. An inverse index is a data structure with a row for each term. Each row records the documents in which the term appears, along with the number of occurrences and the relative positions of the term within each document. The inverse index makes it possible to apply statistical and probabilistic formulas to compute the relevance of documents quickly.
Applications that do not have full-text search enabled, such as Windows or Microsoft Outlook®, access each document at query time. These applications traverse each document and use a filter or other outdated technology to find query terms. This process is very slow when compared to an inverse index lookup. With an inverse index, the search engine can feed term information directly into a ranking formula instead of going back to the source documents.
Ranking. Ultimately, the task of evaluating a query results in a set of relevant documents. In relational databases, each row either is in the result set or is not. For example, when a user queries for "all accounts with a balance lower than or equal to $30,000", it is easy to tell which rows in the accounts table to return. The task of full-text search, by contrast, is subtler. The queries are imperfect representations of an information need, and the documents retrieved vary in their relevance. Full-text search ranks the most relevant documents at the top of the result set. Less relevant documents can still be valuable to the user, so full-text search returns them as well, ranked lower in the result set.
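To make the word breaker and inverse index descriptions above more concrete, the following Python sketch builds a small inverse index with one row per term, recording the positions of the term in each document. It is an illustration under stated assumptions, not the MSSearch implementation: `neutral_word_break` is a simplified stand-in that breaks at spaces and punctuation, and the row layout and sample file names are hypothetical.

```python
import re
from collections import defaultdict


def neutral_word_break(text):
    """Break text at spaces and punctuation, lowercasing each term.

    A simplified stand-in for a neutral word breaker (an assumption for
    this example); real word breakers are language-aware.
    """
    return [term.lower() for term in re.findall(r"\w+", text)]


def build_inverse_index(documents):
    """Return {term: {doc_id: [positions]}}, that is, one row per term."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(neutral_word_break(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


# Hypothetical sample documents.
docs = {
    "a.txt": "Search the index. The index is fast.",
    "b.txt": "Filters extract text.",
}
index = build_inverse_index(docs)

# The row for one term: which documents contain it, and where.
row = index["index"]
# The number of occurrences per document follows from the position lists.
occurrences = {doc_id: len(positions) for doc_id, positions in row.items()}
```

Storing positions rather than only counts is a design choice in this sketch: the occurrence count can always be derived from the position list, and positions also allow checks such as whether two query terms appear next to each other.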

Microsoft full-text search products differ in the algorithm used for this ranking. Index Server and Site Server 3 use vector-based ranking algorithms, while later products employ an advanced probabilistic algorithm.
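As a rough illustration of what a vector-based ranking can look like, the sketch below scores documents by the cosine similarity between term-count vectors. This is the generic textbook vector-space model, shown only to convey the idea. It is not the algorithm used by Index Server, Site Server, or any other Microsoft product, and all names and sample documents are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent text as a vector of term counts (whitespace-split)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return the ids of matching documents, most similar first."""
    query_vector = term_vector(query)
    scored = [
        (cosine_similarity(query_vector, term_vector(text)), doc_id)
        for doc_id, text in documents.items()
    ]
    # Sort by descending score, breaking ties by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


# Hypothetical sample collection.
sample = {
    "d1": "full text search",
    "d2": "search search engine ranking",
    "d3": "unrelated words",
}
ranked = rank("text search", sample)
```

Here `d1` ranks above `d2` because it matches both query terms, while `d3` shares no terms with the query and is left out of the result.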

Query Languages

To express the information request to the system, the user depends on a language that describes the restrictions and conditions over the terms. For example, a user may be interested in all documents published in the previous week. To query for this, the user must express both the concept of "publishing" a document and the precise time range. For example, the time range might start on the previous Monday and end on the previous Sunday.
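The "previous week" example can be made precise with a short Python sketch that computes the Monday-to-Sunday range and applies it as a restriction. This is only an illustration of turning an informal request into exact conditions. It does not use any of the query languages discussed in this chapter, and the `published` property and the function names are assumptions made for the example.

```python
from datetime import date, timedelta


def previous_week_range(today):
    """Return (monday, sunday) of the calendar week before today's week."""
    this_monday = today - timedelta(days=today.weekday())
    previous_monday = this_monday - timedelta(days=7)
    previous_sunday = this_monday - timedelta(days=1)
    return previous_monday, previous_sunday


def published_last_week(documents, today):
    """Filter documents whose hypothetical 'published' date is in range."""
    start, end = previous_week_range(today)
    return [doc["title"] for doc in documents if start <= doc["published"] <= end]


# Hypothetical documents with a 'published' property.
library = [
    {"title": "Old report", "published": date(2001, 5, 28)},
    {"title": "Weekly summary", "published": date(2001, 6, 6)},
    {"title": "Draft", "published": date(2001, 6, 12)},
]
matches = published_last_week(library, date(2001, 6, 13))
```

A query language has to let the user state both parts of this request: the property being restricted (the publication date) and the exact boundaries of the range.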

Microsoft full-text search products evolved through three different query languages:

Query Dialect 1
Query Dialect 2
Structured Query Language (SQL) full-text extensions

The following sections discuss Microsoft products that incorporate Microsoft full-text search technology. Each section includes an overview of the product, its target user, and the way in which full-text search integrates with the product. For more information about these products and the related technologies, see Appendix B.