Chapter 5
This chapter reviews the concept of full-text search and explains how different Microsoft® products implement full-text search. This information can help you to determine which Microsoft products are best for your information retrieval needs.
Microsoft full-text search technology contributes to a number of server and client products. Search functionality varies, depending on the requirements of each product. However, all products benefit from the common advantage of efficient retrieval of unstructured, textual data by means of a full-text index.
The following Microsoft products use variants of Microsoft full-text search technology:
The product you choose depends on your needs. For example, you might want to search intranet sites, Internet sites, or Exchange public folders, or you might want to search over structured or unstructured data. You might need to cater to an internal team, or you might need to serve the needs of customers over your extranet site. These and other considerations help you determine which product is best for you.
For more information about full-text search technology or these products, see Appendix B, "For More Information."
Full-text search provides relevant information from a collection of sources in response to a user's need. This need is typically expressed as a textual query that looks for each, or any, of the query terms in each of the documents in the collection. A simple approach opens and scans each document when a query is processed, looking for each of the query terms. However, opening every document at query processing time and searching for the query terms can be very time-consuming. This approach is impractical for anything beyond an individual user searching a small number of documents.
The solution is to do much of the work ahead of time. The search engine extracts information about the terms in each document and stores it in a structure that is easy to retrieve, known as an inverted index. When the search engine processes a query, there is no need to scan each document. The search engine only needs to look up the query terms in the inverted index and compare the matching documents to each other. The search engine then chooses the documents that are most relevant to the query.
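The idea of doing the work ahead of query time can be sketched in a few lines of Python. This is only an illustrative sketch, not the way any Microsoft product builds its index: the function names (tokenize, build_inverted_index, search), the sample documents, and the simple lowercase tokenizer are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Index time: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query, match_all=True):
    """Query time: look up terms in the index; no document is opened or scanned."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    if not postings:
        return set()
    if match_all:
        # Documents that contain each of the query terms.
        return set.intersection(*postings)
    # Documents that contain any of the query terms.
    return set.union(*postings)


# Hypothetical sample collection.
documents = {
    "doc1": "Full-text search retrieves unstructured data.",
    "doc2": "An index is built ahead of query time.",
    "doc3": "Search uses the index at query time.",
}
index = build_inverted_index(documents)
all_terms = search(index, "search index")
any_term = search(index, "search index", match_all=False)
```

The index is built once, and each query then costs a few dictionary lookups and set operations rather than a scan of every document.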
The principle of doing much of the work ahead of query time serves as the foundation of all full-text search technologies, including Microsoft full-text search. To be effective, a search technology must:
Good search technology performs these tasks for documents in various languages, over many different types of formats, and across documents stored in a variety of document repositories. Good search technology returns those documents that are truly relevant to a user's need. At its best, full-text search technology fits into a complete knowledge solution, where direct textual query is the user's last resort. Full-text search technology should interpret the information the user needs by using advanced mechanisms, and it should answer the query with a combination of structured and unstructured information.
The following components of Microsoft full-text search technology provide an excellent full-text search solution:
The filter extracts a stream of textual information from a document, discarding all non-textual and formatting information. The filter produces strings of text and property/value pairs to pass in turn to the index engine. All filters are written to an application programming interface (API). For more information about filters, see Appendix B.
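As a rough illustration of what a filter does, the following Python sketch extracts text chunks and property/value pairs from an HTML document and discards markup and script content. It is not the filter API that this chapter refers to. The class HtmlTextFilter, the function filter_document, and the sample document are hypothetical names chosen for the example, built on Python's standard html.parser module.

```python
from html.parser import HTMLParser


class HtmlTextFilter(HTMLParser):
    """Hypothetical filter: collects text chunks and property/value pairs."""

    def __init__(self):
        super().__init__()
        self.text_chunks = []   # strings of text for the index engine
        self.properties = {}    # property/value pairs (title, meta tags)
        self._in_title = False
        self._skip_depth = 0    # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs and "content" in attrs:
            self.properties[attrs["name"]] = attrs["content"]

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return  # drop whitespace-only runs and non-textual content
        if self._in_title:
            self.properties["title"] = text
        else:
            self.text_chunks.append(text)


def filter_document(html):
    """Return (text chunks, properties) for one HTML document."""
    parser = HtmlTextFilter()
    parser.feed(html)
    parser.close()
    return parser.text_chunks, parser.properties


# Hypothetical sample document.
sample = (
    '<html><head><title>Report</title>'
    '<meta name="author" content="Pat"></head>'
    '<body><h1>Sales</h1><p>Revenue <b>rose</b>.</p>'
    '<script>var x = 1;</script></body></html>'
)
chunks, props = filter_document(sample)
```

The formatting tags and the script body never reach the output; only the text stream and the property/value pairs would be passed on to an index engine.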
SharePoint Portal Server provides word breakers for English, French, Spanish, Japanese, Thai, Korean, Traditional Chinese, and Simplified Chinese. SharePoint Portal Server uses the Windows 2000 Server Indexing Service word breakers for Dutch, Italian, Swedish, and German. When SharePoint Portal Server crawls documents that are in multiple languages, the customized word breaker for each language makes the resulting terms more accurate for that language. When there is a word breaker for the language family, but not for the specific sub-language, the word breaker for the major language is used. For example, SharePoint Portal Server uses the French word breaker to handle French Canadian text. If no word breaker is available for a particular language, SharePoint Portal Server uses the neutral word breaker, which breaks words at neutral characters, such as spaces and punctuation marks. The code for determining where words are broken is built into the Microsoft Search (MSSearch) service and cannot be changed. You cannot create custom word breakers.
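The fallback order described above (specific language, then major language, then neutral) can be sketched as follows. This is an illustration only: the language tags, the AVAILABLE_BREAKERS set, and the functions select_word_breaker and neutral_break are assumptions for the example, and the regular expression merely approximates breaking at spaces and punctuation marks. It does not reproduce the MSSearch word-breaking code.

```python
import re

# Assumed language tags for the word breakers listed in this chapter.
AVAILABLE_BREAKERS = {
    "en", "fr", "es", "ja", "th", "ko", "zh-tw", "zh-cn",
    "nl", "it", "sv", "de",
}


def select_word_breaker(language_tag):
    """Pick a breaker: exact language, then major language, then neutral."""
    tag = language_tag.lower()
    if tag in AVAILABLE_BREAKERS:
        return tag
    major = tag.split("-")[0]  # for example, "fr-ca" falls back to "fr"
    if major in AVAILABLE_BREAKERS:
        return major
    return "neutral"


def neutral_break(text):
    """Approximate a neutral word breaker: split at spaces and punctuation."""
    return [word for word in re.split(r"[\W_]+", text) if word]


french_canadian = select_word_breaker("fr-CA")
unsupported = select_word_breaker("pt-BR")
words = neutral_break("Full-text search, fast!")
```

In this sketch, French Canadian text is routed to the French breaker, while a language with no entry at all falls through to the neutral breaker.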
Applications that do not have full-text search enabled, such as Windows or Microsoft Outlook®, access each document at query time. These applications traverse each document, using a filter or other outdated technology, to find the query terms. This process is very slow compared with using an inverted index. With an inverted index, the search engine can pass the indexed term information directly into a ranking formula instead of going back to the source documents.
Microsoft full-text search products differ in the algorithm used for this ranking. Index Server and Site Server 3 use vector-based ranking algorithms, while later products employ an advanced probabilistic algorithm.
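To give a sense of what "vector-based" ranking means in general, the following sketch scores documents by the cosine similarity between term-frequency vectors. It is a generic textbook technique shown under stated assumptions, not the algorithm used by Index Server, Site Server, or any later product. The function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent text as a vector of term frequencies."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return ids of matching documents, most similar to the query first."""
    q = term_vector(query)
    scored = [
        (cosine_similarity(q, term_vector(text)), doc_id)
        for doc_id, text in documents.items()
    ]
    # Sort by descending score, breaking ties by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


# Hypothetical sample collection.
docs = {
    "a": "index server ranking",
    "b": "ranking ranking ranking",
    "c": "unrelated words",
}
ranked = rank("ranking", docs)
```

A probabilistic ranking algorithm would replace the similarity function with an estimate of how likely each document is to be relevant to the query; that variant is not shown here.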
To express an information request to the system, the user relies on a query language that describes the restrictions and conditions over the terms. For example, a user may be interested in all documents published in the previous week. To query for this, the user must express both the concept of "publishing" a document and the precise time range. For example, the time range might start on the previous Monday and end on the previous Sunday.
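The "previous week" example shows why a query needs a precise restriction. The sketch below computes one possible interpretation, the Monday through Sunday of the week before the current one, and applies it to a property holding each document's publication date. The function names and the sample data are hypothetical, and this is not the syntax of any of the Microsoft query languages.

```python
from datetime import date, timedelta


def previous_week_range(today):
    """Return (Monday, Sunday) of the week before the one containing today."""
    this_monday = today - timedelta(days=today.weekday())  # Monday is weekday 0
    start = this_monday - timedelta(days=7)
    end = this_monday - timedelta(days=1)
    return start, end


def published_last_week(publish_dates, today):
    """Return ids of documents whose publication date falls in the range."""
    start, end = previous_week_range(today)
    return [
        doc_id
        for doc_id, published in publish_dates.items()
        if start <= published <= end
    ]


# Hypothetical publication dates keyed by document id.
publish_dates = {
    "memo": date(2024, 5, 7),
    "report": date(2024, 5, 13),
    "archive": date(2024, 4, 30),
}
today = date(2024, 5, 15)  # a Wednesday
week_start, week_end = previous_week_range(today)
matches = published_last_week(publish_dates, today)
```

A query language has to let the user state both parts of this restriction: which property is meant by "published" and the exact start and end of the range.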
Microsoft full-text search products evolved through three different query languages:
The following sections discuss Microsoft products that incorporate Microsoft full-text search technology. Each section includes an overview of the product, its target user, and the way in which full-text search integrates with the product. For more information about these products and the related technologies, see Appendix B.