Getting Started with the Indexing Service


The Indexing Service extracts information from designated documents and organizes the results into a catalog that can be searched quickly and easily. The extracted information includes the content (text) within documents as well as document properties, such as the document title and author. To understand how the Indexing Service works, let’s look at the following subjects:

  • How you can use and install the Indexing Service

  • How the Indexing Service builds indexes and catalogs

  • How you can search and manipulate indexes

Using the Indexing Service

The Indexing Service indexes the following types of documents:

  • HTML (.htm or .html)

  • American Standard Code for Information Interchange (ASCII) text files (.txt)

  • Microsoft Word documents (.doc)

  • Microsoft Excel spreadsheets (.xls)

  • Microsoft PowerPoint presentations (.ppt)

  • Internet mail and news (when you index NNTP virtual servers)

Other documents for which a document filter is installed can be indexed as well. The Indexing Service isn’t installed on your Web server by default, but you can install it using the Windows Components Wizard. To access and use this wizard, follow these steps:

  1. Log on to the computer using an account with administrator privileges.

  2. Access Control Panel. Double-click Add Or Remove Programs. This displays the Add Or Remove Programs dialog box.

  3. Start the Windows Components Wizard by clicking Add/Remove Windows Components.

  4. In the Components list box, select Indexing Service and then click Next to continue. The wizard then installs the Indexing Service.

  5. Click Finish when prompted.

Once you’ve installed the Indexing Service, you manage the service using the Indexing Service snap-in for the Microsoft Management Console (MMC) or the Indexing Service node in Computer Management. Regardless of the option you choose, you can work with both local and remote servers using the same techniques. The only task that’s different is connecting to remote servers.

With the Indexing Service snap-in, you set the server you want to work with when you add the snap-in to a management console. Here are the steps for adding the Indexing Service snap-in to a management console and selecting a server to work with:

  1. Open the Run dialog box by clicking Start and then clicking Run.

  2. Type mmc in the Open field and then click OK. This opens the MMC.

  3. In MMC, click File, and then click Add/Remove Snap-In. This opens the Add/Remove Snap-In dialog box.

  4. On the Standalone tab, click Add.

  5. In the Add Standalone Snap-In dialog box, click Indexing Service and then click Add.

  6. Select Local Computer to connect to the computer on which the console is running. Or select Another Computer and then type the name of a remote computer.

  7. Click Finish. Afterward, click Close and then click OK.

With the Computer Management console, you connect to the local server automatically when you start the utility. You can connect to a different computer by right-clicking the Computer Management node, selecting Connect To Another Computer, and then following the prompts. Figure 12-1 shows the Indexing Service node in the Computer Management console.

Figure 12-1: Use the Indexing Service node in the Computer Management console to manage the Indexing Service.

As you can see, selecting the Indexing Service node displays an overview of the currently installed catalogs, which include the default System and Web catalogs. The catalog summary provides the following information:

  • Catalog The descriptive name set when the catalog was created

  • Location The physical location of the catalog, such as D:\Catalogs\WWW\

  • Size (Mb) The size of the catalog in megabytes

    Note

    The typical catalog is 25 percent to 40 percent of the total size of the documents indexed. This means that if you index 1 GB of documents, you’ll need an additional 250 MB–400 MB of storage space for the associated catalog.

  • Total Docs The total number of documents designated for indexing in this catalog

  • Docs To Index The total number of documents that remain to be indexed

  • Deferred For Indexing The total number of documents that need to be indexed but can’t be indexed because they’re in use

    Note

    The Indexing Service defers indexing of documents being used and attempts to index the documents when they’re no longer in use.

  • Word Lists The number of word lists associated with the catalog and stored in system memory

  • Saved Indexes The number of indexes within the catalog that have been saved to disk

  • Status The status of the indexing process
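The sizing rule of thumb in the note on catalog size (25 to 40 percent of the indexed documents) can be turned into a quick planning calculation. The following is a minimal sketch; the function name is hypothetical, and 1 GB is treated as 1000 MB to match the figures in the note.

```python
def estimate_catalog_size_mb(corpus_mb: float) -> tuple[float, float]:
    """Return the (low, high) expected catalog size in MB for a document set."""
    low_ratio, high_ratio = 0.25, 0.40  # rule of thumb from the note on catalog size
    return corpus_mb * low_ratio, corpus_mb * high_ratio


# 1 GB of documents, taken here as 1000 MB (an assumption for round numbers)
low, high = estimate_catalog_size_mb(1000)
print(f"Plan for roughly {low:.0f} MB to {high:.0f} MB of catalog storage")
```

Treat the result as a planning range rather than a guarantee; the actual catalog size depends on the documents being indexed.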

If you access the Indexing Service using Computer Management, you’ll find that two default catalogs were created when you installed the service. These catalogs are the following:

  • System The System catalog contains an index of all documents on all hard disk drives attached to the server.

  • Web The Web catalog contains an index of the default Web site.

    Tip

    I recommend deleting the System catalog. This catalog typically isn’t used on an IIS server, and maintaining the catalog uses system resources that could be better used elsewhere.

You can create additional catalogs at any time. When you create a catalog, you can associate the catalog with a Web site and an NNTP virtual server. The service then uses the indexing settings on the directories associated with the site or virtual server to determine which documents should be indexed. You configure indexing settings on directories as detailed in the section of this chapter entitled “Setting Web Resources to Index.”

Indexing Service Essentials

The Indexing Service stores catalog information in Unicode format. This allows the service to index and query content in multiple languages. The Indexing Service performs three main functions to process document contents:

  • Indexing Indexing is the process of extracting information from documents. The index contains contents from the main body of documents but doesn’t include words on any exception word lists associated with the catalog. Indexes are compressed to save space.

  • Catalog building Catalog building is the process of storing the index information in a named location. Catalogs contain extracted content in the form of indexes and stored properties for a set of documents.

  • Merging Merging is the process of combining temporary indexes to create combined or master indexes. Merging indexes improves the performance of the Indexing Service and reduces the amount of RAM used to store temporary indexes in memory.

Indexing and catalog building take place automatically in the background when the Indexing Service is running. When first started, the Indexing Service takes an inventory of the directories associated with each catalog to determine which documents should be indexed. This process is referred to as scanning. The Indexing Service can perform two types of scans:

  • Full

  • Incremental

Full scans take a complete look at all documents associated with a catalog. The Indexing Service performs a full scan under the following circumstances:

  • When the service is run for the first time after installation

  • When a folder is added to a catalog

  • As part of recovery if a serious error occurs

  • When you manually choose to do so

Incremental scans look only at documents modified since the last full or incremental scan. The Indexing Service performs incremental scans under the following circumstances:

  • When you start or restart the Indexing Service

  • When you change a local document

  • When the Indexing Service loses change notifications

  • Any time you manually start an incremental scan

    Note

    File system change notifications are important parts of the incremental scanning process. Whenever local documents are modified, the operating system generates change notifications and the Indexing Service reads them. In most cases change notifications for documents on remote systems won’t reach the local Indexing Service. To account for this, the Indexing Service periodically performs incremental scans on any remote directories associated with a catalog.
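The two lists of scan triggers above can be summarized as a simple lookup. This is only an illustrative sketch: the event names are hypothetical labels for the circumstances listed in the text, not identifiers used by the Indexing Service.

```python
from enum import Enum


class Scan(Enum):
    FULL = "full"
    INCREMENTAL = "incremental"


# Hypothetical event labels, one per circumstance listed in the text
FULL_SCAN_EVENTS = {
    "first_run_after_install",
    "folder_added_to_catalog",
    "serious_error_recovery",
    "manual_full_scan",
}
INCREMENTAL_SCAN_EVENTS = {
    "service_started_or_restarted",
    "local_document_changed",
    "change_notifications_lost",
    "manual_incremental_scan",
}


def scan_for(event: str) -> Scan:
    """Map a triggering event to the type of scan the text says it causes."""
    if event in FULL_SCAN_EVENTS:
        return Scan.FULL
    if event in INCREMENTAL_SCAN_EVENTS:
        return Scan.INCREMENTAL
    raise ValueError(f"unknown event: {event}")


print(scan_for("folder_added_to_catalog").value)
print(scan_for("local_document_changed").value)
```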

After completing a scan of documents to be indexed, the Indexing Service begins to build the necessary catalogs. It does this by reading each document using a document filter. Filters are software components that interpret the structure of a particular kind of document, such as an ASCII text file, a Word document, or an HTML document. Using the appropriate filter, the Indexing Service extracts the document contents and property values, storing the property values and the path to the document in the index. Next, the Indexing Service uses the filter to determine the language in which the document is written and breaks the document body (content) into individual words. Each supported language has an exception list that provides a list of words that the Indexing Service should ignore.

You’ll find exception lists in the %SystemRoot%\System32 directory. These files are stored as ASCII text files and are named Noise.lang, where lang is a three-letter extension that indicates the language of the exception list. You can add entries to or remove entries from the exception list using a standard text editor or word processor.
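To illustrate the effect of an exception list, here is a minimal sketch of breaking a document body into words and discarding the ones to be ignored. It assumes the list is plain text with whitespace-separated words and that matching is case-insensitive; the function names and the sample word list are hypothetical, and the real word-breaking rules are language-specific.

```python
import re


def load_exception_list(text: str) -> set[str]:
    """Parse exception-list text, assumed to be whitespace-separated words."""
    return {word.lower() for word in text.split()}


def index_terms(body: str, exceptions: set[str]) -> list[str]:
    """Split the body into lowercase words and drop any word on the exception list."""
    words = re.findall(r"[a-z0-9']+", body.lower())
    return [word for word in words if word not in exceptions]


# Hypothetical exception-list contents, held in memory instead of read from a file
noise = load_exception_list("a an and the of to\nis in")
print(index_terms("The catalog is an index of the documents", noise))
# prints ['catalog', 'index', 'documents']
```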

The Indexing Service also stores values of selected document properties in the property cache. The property cache is a storage place for values of properties that you might want to search on or display in the list of search results. Within the property cache are two storage levels: primary and secondary. The primary storage level is for values that are frequently accessed, and, as such, these values are stored in a way that makes them quick and easy to retrieve. The secondary storage level is for additional values that are used infrequently.

After discarding words on the exception list and updating the property cache, the Indexing Service stores the remaining document content in a word list. Each document can have one or more word lists associated with it. Word lists are combined to form temporary indexes called shadow indexes. Shadow indexes are stored on disk in a compressed file format. Multiple shadow indexes can be, and usually are, in the catalog at any given time. The Saved Indexes entry, mentioned previously, lists the number of shadow and master indexes in a catalog. Over time, the number of shadow indexes can grow substantially. This occurs as documents are added to and modified within indexed directories.

The Indexing Service uses a process called shadow merging to combine word lists and temporary indexes, thereby reducing the number of temporary resources used and improving the service’s overall responsiveness. Shadow merges occur during scans and as part of the normal housekeeping process implemented by the Indexing Service. The key events that trigger a shadow merge are when there are too many word lists stored in memory (1012 by default) or when the total size of all word lists exceeds a preset value (2560 KB by default).

The result of the indexing process is a master index. Each catalog has one, and only one, master index. The master index is created the first time you create a catalog and is kept up to date by periodically merging it with shadow indexes to create a new master index. This process of merging shadow indexes with the master index is called master merging. Once a master merge has occurred, there’s only one saved index associated with a catalog—namely, the master index.
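The flow from word lists to shadow indexes to a master index can be pictured with a toy inverted index. The sketch below is only a model of the idea: the data layout, function names, and file names are assumptions for illustration, and it ignores compression, deletions, and everything else the real on-disk formats handle.

```python
from collections import defaultdict

Index = dict[str, set[str]]  # word -> set of document paths containing it


def build_word_list(path: str, words: list[str]) -> Index:
    """Create an in-memory word list for a single document."""
    return {word: {path} for word in words}


def merge(*indexes: Index) -> Index:
    """Combine several indexes into one by taking the union of paths per word."""
    merged: defaultdict[str, set[str]] = defaultdict(set)
    for index in indexes:
        for word, paths in index.items():
            merged[word] |= paths
    return dict(merged)


word_lists = [
    build_word_list("a.htm", ["catalog", "index"]),
    build_word_list("b.htm", ["index", "merge"]),
]
shadow = merge(*word_lists)                  # shadow merge: word lists -> shadow index
old_master: Index = {"catalog": {"old.htm"}}
master = merge(old_master, shadow)           # master merge: master + shadow -> new master
print(sorted(master["index"]))
# prints ['a.htm', 'b.htm']
```

After the master merge in this model, a single index answers every query, which mirrors the statement that only one saved index remains.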

Master merges are triggered automatically based on the size of the shadow indexes, the amount of free disk space on the catalog drive, and the number of document changes in indexed directories. In addition, a master merge is scheduled to run nightly at midnight regardless of these conditions. If necessary, you can force a master merge. The key reason for forcing a master merge is to cause the Indexing Service to update a catalog so that all changes are reflected in search results immediately. As you might imagine, the master merge process is resource-intensive, so you normally wouldn’t force a master merge during peak usage hours.

Settings that control scanning, merging, and other Indexing Service processes are found in the Registry and are stored here:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex

Registry settings, given in decimal value, that control scanning and merging include the following:

  • MasterMergeCheckpointInterval Sets the interval for determining whether a master merge should be performed. The default value is 8192 seconds.

  • MasterMergeTime Sets the default time for when a daily master merge should be performed. The default value is 60, meaning 60 seconds after the start of a new day.

  • MaxFilesizeFiltered Sets the maximum size of filtered content for a particular document. By default, this is set to 256 KB.

  • MaxFreshCount Sets the maximum number of document updates and changes that triggers a master merge. By default, if more than 10,000 documents are changed, a master merge is triggered.

  • MaxIndexes Sets the maximum number of indexes that should be associated with a catalog before shadow merging is forced. By default, if more than 25 indexes are associated with a catalog, the Indexing Service will perform a shadow merge.

  • MaxShadowIndexSize Sets a maximum size value for shadow indexes in 128 KB increments. Used with MinDiskFreeForceMerge to force master merges when disk space is low and the size of the shadow index exceeds this value. The default is 15 (15 × 128 KB = 1920 KB).

  • MaxWordLists Sets the maximum number of word lists that can exist in a catalog. When this number is exceeded, a shadow merge is triggered. By default, this value is set to 20.

  • MaxWordlistSize Sets the maximum size of all word lists associated with a catalog. This value is set in increments of 128 KB and when exceeded, a shadow merge is triggered. By default, this value is set to 20 (20 × 128 KB = 2560 KB).

  • MinDiskFreeForceMerge Sets a minimum free disk space value. If a drive containing catalogs has less disk space than this value and the total size used by shadow indexes exceeds MaxShadowIndexSize, the Indexing Service performs a master merge. The default is 15 MB.

  • MinSizeMergeWordlists Sets the minimum size threshold for merging word lists with a shadow index. If the word lists’ size exceeds this value, a shadow merge is triggered. The default is 256 KB.
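The way these thresholds interact can be sketched as two simple checks. This is a model built only from the defaults and descriptions in the list above, not the service's actual logic; the class, field, and function names are hypothetical, and the sketch does not read the Registry.

```python
from dataclasses import dataclass

UNIT_KB = 128  # MaxShadowIndexSize and MaxWordlistSize are expressed in 128 KB units


@dataclass(frozen=True)
class MergeSettings:
    """Defaults as given in the list above (hypothetical field names)."""
    max_fresh_count: int = 10_000
    max_indexes: int = 25
    max_shadow_index_size: int = 15        # 15 x 128 KB = 1920 KB
    max_word_lists: int = 20
    max_wordlist_size: int = 20            # 20 x 128 KB = 2560 KB
    min_disk_free_force_merge_mb: int = 15


def needs_shadow_merge(s: MergeSettings, word_lists: int,
                       wordlist_kb: int, indexes: int) -> bool:
    """True if any shadow-merge threshold described in the list is exceeded."""
    return (word_lists > s.max_word_lists
            or wordlist_kb > s.max_wordlist_size * UNIT_KB
            or indexes > s.max_indexes)


def needs_master_merge(s: MergeSettings, changed_docs: int,
                       shadow_kb: int, free_disk_mb: int) -> bool:
    """True if too many documents changed, or disk is low and shadow indexes are large."""
    low_disk = (free_disk_mb < s.min_disk_free_force_merge_mb
                and shadow_kb > s.max_shadow_index_size * UNIT_KB)
    return changed_docs > s.max_fresh_count or low_disk


settings = MergeSettings()
print(needs_shadow_merge(settings, word_lists=21, wordlist_kb=512, indexes=3))
print(needs_master_merge(settings, changed_docs=200, shadow_kb=4096, free_disk_mb=10))
```

Keeping the 128 KB unit in one constant makes the conversions (15 × 128 KB = 1920 KB and 20 × 128 KB = 2560 KB) explicit instead of scattering them through the checks.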

Searching Catalogs

Searching is the process of looking through the catalog to find information. Users can search the catalog in several ways. The technique most often used with Web servers is to build a query form that can be used to search the catalog. The Indexing Service includes a query form for each catalog that can be used to test the installation. You can also create query forms using Active Server Pages (ASP) and Internet Data Query (IDQ) files.

With ASP, you create the query form and handle the results using a combination of server-side scripts that use ASP objects, HTML, and client-side scripts. The scripts you use can be written in any installed scripting language, and both Microsoft VBScript and Microsoft JScript are installed by default. Typically, you’ll use the same page to implement the query form and display the results once the user has entered search parameters. For example, you could create a page called Query.asp that implements the query form and has an embedded script that submits the search parameters and then formats the search results.

IDQ, on the other hand, is a special language designed for submitting queries to the Indexing Service. With IDQ you create separate pages for handling each step in the query process. You use the following elements:

  • An HTML page that ends with the .htm or .html extension to implement the query form

  • An IDQ page that ends with the .idq extension to define the fixed query parameters for searches

  • An HTML extension file that ends with the .htx extension to format the results of the query

An advantage of IDQ over ASP is that IDQ queries are much faster and more efficient in their use of Indexing Service resources. Regardless of whether you use ASP or IDQ to handle searches, you must set basic parameters that provide default values for the Indexing Service. The parameters you should set are summarized in Table 12-1.

Table 12-1: Basic Parameters for the Indexing Service

| Parameter | Description | Sample Value for IDQ |
|---|---|---|
| CiCatalog | Sets the file location of the catalog to be searched. If you don’t set this parameter, the Indexing Service searches the Inetpub directory for a default catalog. | CiCatalog = D:\Catalogs\WWW |
| CiFlags | Sets the search flags for the query. The DEEP flag tells the Indexing Service to search all subdirectories within the current scope. | CiFlags = DEEP |
| CiMaxRecordsInResultSet | Sets the maximum number of records to return in the result set. | CiMaxRecordsInResultSet = 100 |
| CiMaxRecordsPerPage | Sets the maximum number of records to return in a single page. | CiMaxRecordsPerPage = 20 |
| CiRestriction | Stores the search values entered by the user as passed from the query form. | CiRestriction = %CiRestriction% |
| CiScope | Sets the scope of the query within the catalog. If scope is set to /, the search begins at the top (or root) of the document tree. | CiScope = /Docs |
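As a rough illustration of how the sample values in Table 12-1 fit together, the following sketch writes them out as the text of an .idq page. The [Query] section header and the one-setting-per-line layout are assumptions about the file format, the function name is hypothetical, and a working .idq page may need further settings that this table doesn't cover.

```python
def render_idq(params: dict[str, str]) -> str:
    """Render parameters as 'Name = value' lines under an assumed [Query] header."""
    lines = ["[Query]"]
    lines += [f"{name} = {value}" for name, value in params.items()]
    return "\n".join(lines) + "\n"


# Sample values taken from Table 12-1
sample_params = {
    "CiCatalog": r"D:\Catalogs\WWW",
    "CiFlags": "DEEP",
    "CiMaxRecordsInResultSet": "100",
    "CiMaxRecordsPerPage": "20",
    "CiRestriction": "%CiRestriction%",
    "CiScope": "/Docs",
}

print(render_idq(sample_params), end="")
```

Generating the page from a dictionary keeps the fixed query parameters in one place, which can help when the same defaults are shared across several query forms.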

Note

Most organizations have Web developers whose job is to create the Web pages needed for searching, handling, and displaying results. As the Web administrator, you assist the development team in setting parameters and publishing the Web pages when they’re completed.




Microsoft IIS 6.0 Administrator's Consultant
Year: 2003
Pages: 116