Understanding the Indexing Service

The Indexing Service provides Web-type indexing and querying to corporate intranets, Internet sites, and more conventional networks without reformatting documents. With the click of a button, end users can index and query the contents of intranet or Internet sites on Windows 2000 Server with Internet Information Services (IIS). The Indexing Service does more than just index documents, however. It provides a system for publishing information on your intranet or on the Web. Because the Indexing Service indexes both the content and properties of formatted documents, you don't need to convert existing documents to HTML to make them available to your users. Instead, documents in a variety of formats, such as Microsoft Word or Microsoft Excel, are directly available.

Even though its primary function is the indexing of Web servers, the Indexing Service is useful on any network where searches for documents are common, and it is essential on any network with frequent searches through large numbers of files.

The Indexing Service functions much as one would expect: it catalogs a set of documents, enabling dynamic full-text searches through the search function, a query form, or Microsoft Internet Explorer. Just as an index in a book maps an important word to a page inside the book, content indexing on a computer takes a word within a document and maps it back to that document. The documents to be indexed are specified in catalogs, and the index can include document properties as well as the actual text in the document. Once the Indexing Service is set up, it needs no ongoing maintenance; administration is required only if you need to change a basic configuration. If you didn't include the Indexing Service in your original installation of Windows 2000, you can add it through Add/Remove Programs in Control Panel.
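
The book-index analogy can be made concrete with a small sketch. The Python below is only an illustration of the word-to-document mapping, not the Indexing Service's actual data structures; the function names and the sample documents are hypothetical.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lowercased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(name)
    return index

def search(index, word):
    """Return the sorted names of documents that contain the word."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical sample documents, for illustration only.
docs = {
    "budget.doc": "Quarterly budget forecast for the sales team",
    "plan.htm": "Sales plan and marketing forecast",
}
index = build_index(docs)
print(search(index, "forecast"))  # ['budget.doc', 'plan.htm']
```

A query then becomes a lookup in the mapping rather than a scan of every file, which is why searches stay fast as the number of documents grows.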

Defining Terms

When administering the Indexing Service, you'll encounter a number of terms that have a special meaning when used in the Indexing Service context. Here are some of the most common ones, with their definitions:

  • Corpus: The entire collection of HTML pages and other documents indexed by the Indexing Service.
  • Virtual root: An alias to a physical location on disk. For example, in IIS, the virtual root /maildocs points to the physical disk location %SystemRoot%\help\mail.
  • Scope: The range of documents to be searched when executing a query. Physical paths or virtual roots can specify scopes.
  • Scan: The process by which files and directories are checked for modifications. Scanning is performed against virtual roots that have been selected for indexing.
  • Catalog: A directory where all temporary (word lists) and persistent (shadow and master) indexes and cached properties are stored for a particular scope.
  • CiDaemon: A child process created by the Indexing Service. CiDaemon works in the background, filtering documents for the Indexing Service.
  • Filter: Part of a dynamic link library (DLL) of filters, each designed to extract textual information and properties from a specific type of formatted document.
  • Query: A request to search files for specific data.
  • Word list: When a document is indexed, the index information goes first to a small temporary index, called a word list. Word lists are maintained in memory until the Indexing Service combines them into the existing indexes.
  • Persistent index: Data for an index that is stored on disk. Unlike word lists, which exist only in memory, a persistent index survives shutdowns and restarts. Persistent-index data is stored in a highly compressed format. There are two types of persistent indexes: shadow indexes (also referred to as saved indexes and as temporary indexes) and master indexes.
  • Shadow index: A persistent index created by merging word lists and occasionally other shadow indexes into a single index. A catalog can have multiple shadow indexes.
  • Shadow merge: The process by which word lists and shadow indexes are combined into a single shadow index. A shadow merge is performed to free up memory used by word lists and also to make the filtered data persistent.
  • Master index: A persistent index that contains the indexed data for a large number of documents. This is usually the largest persistent data structure. In an ideal state, this is the only index present because all of the indexed data is stored in the master index and there are no shadow indexes or word lists. A master index is created through a master merge.
  • Master merge: The process by which shadow indexes are combined with the current master index into a single master index. Unlike shadow merges, this is usually a fairly long process.
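
To illustrate the virtual root definition, here is a minimal Python sketch of how an alias might be translated into a physical location. Only the /maildocs mapping comes from the example above; the table, the function name, and the matching rules are assumptions made for illustration and do not describe how IIS resolves paths internally.

```python
# Hypothetical virtual-root table; the /maildocs entry mirrors the example above.
VIRTUAL_ROOTS = {
    "/maildocs": r"%SystemRoot%\help\mail",
}

def resolve(virtual_path, roots=VIRTUAL_ROOTS):
    """Translate a virtual path into a physical path, or return None if no root matches."""
    # Try the longest root first so that nested roots take precedence.
    for root in sorted(roots, key=len, reverse=True):
        if virtual_path == root or virtual_path.startswith(root + "/"):
            remainder = virtual_path[len(root):].replace("/", "\\")
            return roots[root] + remainder
    return None

print(resolve("/maildocs/setup/readme.txt"))
```

Because a scope can be given as either a physical path or a virtual root, a translation of this kind lets both forms refer to the same set of files.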

How Indexing Works

The Indexing Service uses filters that can read certain types of documents, extract the text and properties, and send that information to the indexing engine. The filters included with Windows 2000 will index the following kinds of documents: text, HTML, Microsoft Office 95 and later, and Internet Mail and News (provided IIS is installed). The Indexing Service can use other filters made available by software vendors. The vendor that supplies the filter will also supply installation instructions.
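
The idea of one filter per document type can be sketched as a lookup keyed by file type. The Python below is a simplified stand-in, not the service's filter DLL mechanism: the registry, the function names, and the crude tag stripping are all assumptions made for illustration.

```python
import re
from pathlib import PurePath

def filter_text(raw):
    """Plain text needs no processing."""
    return raw

def filter_html(raw):
    """Crudely strip tags; a real HTML filter would also handle entities and properties."""
    return re.sub(r"<[^>]+>", " ", raw)

# Hypothetical registry keyed by file extension.
FILTERS = {".txt": filter_text, ".htm": filter_html, ".html": filter_html}

def extract(filename, raw):
    """Return the extracted text, or None when no filter is registered for the type."""
    handler = FILTERS.get(PurePath(filename).suffix.lower())
    if handler is None:
        return None
    # Collapse runs of whitespace left behind by the filter.
    return " ".join(handler(raw).split())

print(extract("page.htm", "<h1>Sales</h1><p>Forecast</p>"))  # Sales Forecast
```

Supporting a new document format in this sketch means adding one entry to the registry, which parallels installing a vendor-supplied filter.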

After extracting the text and properties, the Indexing Service determines the language the document is written in and removes words that are on the language's exception list. The exception list contains prepositions, pronouns, articles, and so forth and is appropriately named Noise.xxx, where xxx represents the language. Figure 26-1 shows a portion of the Noise.eng file, which contains the exception list for American English. You can add words to or remove words from the exception list using any text editor, such as Notepad.
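
Exception-list filtering can be sketched in a few lines of Python. The sample entries below are a hypothetical stand-in for a Noise.eng file, and the whitespace-separated layout is an assumption; check the actual file on your system before editing it.

```python
def load_noise_words(lines):
    """Build a noise-word set from lines of text (whitespace-separated words assumed)."""
    words = set()
    for line in lines:
        words.update(line.lower().split())
    return words

def remove_noise(words, noise):
    """Keep only the words that are not on the exception list."""
    return [w for w in words if w.lower() not in noise]

# Hypothetical excerpt standing in for an exception list.
sample = ["a", "about", "the", "of", "and"]
noise = load_noise_words(sample)
print(remove_noise("The history of the budget and the plan".split(), noise))
# ['history', 'budget', 'plan']
```

Dropping these words keeps the index smaller, at the cost that a query for a noise word alone finds nothing.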

After words from the exception list are removed, the remaining words are stored first in a word list in memory. At least once a day, the word lists are combined to form temporary saved indexes (shadow indexes), and later the Indexing Service consolidates these shadow indexes into a single master index. All this is done automatically, although under certain circumstances you may need to intervene by initiating a merge manually, as described later in this chapter.
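
The progression from word lists to saved indexes to a master index can be sketched as follows. This Python model is a simplification under stated assumptions: every index is an in-memory dictionary (the real saved and master indexes are compressed and stored on disk), document updates and deletions are ignored, and the class and method names are hypothetical.

```python
from collections import defaultdict

def merge(*indexes):
    """Combine several word -> set-of-documents mappings into one."""
    merged = defaultdict(set)
    for index in indexes:
        for word, docs in index.items():
            merged[word] |= docs
    return dict(merged)

class Catalog:
    def __init__(self):
        self.word_lists = []      # small temporary indexes held in memory
        self.shadow_indexes = []  # saved indexes (on disk in the real service)
        self.master = {}          # the single consolidated index

    def add_word_list(self, word_list):
        self.word_lists.append(word_list)

    def shadow_merge(self):
        """Fold all pending word lists into one new saved index."""
        if self.word_lists:
            self.shadow_indexes.append(merge(*self.word_lists))
            self.word_lists = []

    def master_merge(self):
        """Fold all saved indexes into the master index."""
        self.master = merge(self.master, *self.shadow_indexes)
        self.shadow_indexes = []

    def lookup(self, word):
        """A query has to consult every index that currently exists."""
        sources = [self.master, *self.shadow_indexes, *self.word_lists]
        return sorted(merge(*sources).get(word, set()))

catalog = Catalog()
catalog.add_word_list({"budget": {"a.doc"}})
catalog.add_word_list({"budget": {"b.doc"}, "plan": {"b.doc"}})
catalog.shadow_merge()
catalog.master_merge()
print(catalog.lookup("budget"))  # ['a.doc', 'b.doc']
```

The lookup method hints at why merging matters: the fewer separate indexes exist, the fewer places a query has to search.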

Figure 26-1. A portion of the exception list for American English.



Microsoft Windows 2000 Server Administrator's Companion, Vol. 1
ISBN: 1572318198
Year: 2000
Pages: 366