Architecture and Components of the Microsoft Search Engine


Search in SharePoint Server 2007 is a shared service that is available only through a Shared Services Provider (SSP). In a Windows SharePoint Services 3.0-only implementation, the basic search engine is installed, but it lacks many components that you'll most likely want in your environment. Table 16-1 provides a feature comparison between the search engine that is installed with a Windows SharePoint Services 3.0-only implementation and a SharePoint Server 2007 implementation.

Table 16-1: Feature Comparison between Windows SharePoint Services 3.0 and SharePoint Server 2007

| Feature | Windows SharePoint Services 3.0 | SharePoint Server 2007 |
| --- | --- | --- |
| Content that can be indexed | Local SharePoint content | SharePoint content, Web content, Exchange public folders, file shares, Lotus Notes, Line of Business (LOB) application data via the BDC |
| Relevant results | Yes | Yes |
| Search-based alerts | Yes | Yes |
| Create Really Simple Syndication (RSS) feeds from result set | Yes | Yes |
| The "Did You Mean...?" prompt | Yes | Yes |
| Duplicate collapsing | Yes | Yes |
| Scopes based on managed properties | No | Yes |
| Best Bets | No | Yes |
| Results removal | No | Yes |
| Query reports | No | Yes |
| Customizable tabs in Search Center | No | Yes |
| People Search/Knowledge Network | No | Yes |
| Crawl information via the BDC | No | Yes |
| Application programming interfaces (APIs) provided | Query | Query and Administration |

The architecture of the search engine includes the following elements:

  • Content source The term content source can sometimes be confusing because it is used in two different ways in the literature. The first way it is used is to describe the set of rules that you assign to the crawler to tell it where to go, what kind of content to extract, and how to behave when it is crawling the content. The second way this term is used is to describe the target source that is hosting the content you want to crawl. By default, the following types of content sources can be crawled (and if you need to include other types of content, you can create a custom content source and protocol handler):

    - SharePoint content, including content created with present and earlier versions

    - Web-based content

    - File shares

    - Exchange public folders

    - Any content exposed by the BDC

    - IBM Lotus Notes (must be configured before it can be used)

  • Crawler The crawler extracts data from a content source. Before crawling the content source, the crawler loads the content source's configuration information, including any site path rules, crawler configurations, and crawler impact rules. (Site path rules, crawler configurations, and crawler impact rules are discussed in more depth later in this chapter.) Once this configuration is loaded, the crawler connects to the content source using the appropriate protocol handler and uses the appropriate iFilter (defined later in this list) to extract the data from the content source.

  • Protocol handler The protocol handler tells the crawler which protocol to use to connect to the content source. The protocol handler that is loaded is based on the URL prefix, such as HTTP, HTTPS, or FILE.

  • iFilter The iFilter (Index Filter) tells the crawler what kind of content it will be connecting to so that the crawler can extract the information correctly from the document. The iFilter that is loaded is based on the URL's suffix, such as .aspx, .asp, or .doc.

  • Content index The indexer stores the words that have been extracted from the documents in the full-text index. In addition, each word in the content index has a relationship set up between that word and its metadata in the property store (the Shared Services Provider's Search database in SQL Server) so that the metadata for that word in a particular document can be enforced in the result set. For example, in the case of NTFS permissions, a document may or may not appear in the result set based on the permissions on the document that contained the queried word: all result sets are security-trimmed before they are presented to the user, so the user sees links only to documents and sites to which the user already has permissions. (A conceptual sketch of this security trimming follows this list.)

    The property store is the Shared Services Provider's (SSP) Search database in SQL Server that hosts the metadata on the documents that are crawled. The metadata includes NTFS and other permission structures, author name, date modified, and any other default or customized metadata that can be found and extracted from the document, along with data that is used to calculate relevance in the result set, such as frequency of occurrence, location information, and other relevance-oriented metrics that we'll discuss later in this chapter in the section titled "Relevance Improvements." Each row in the SQL table corresponds to a separate document in the full-text index. The actual text of the document is stored in the content index, so it can be used for content queries. For a Web site, each unique URL is considered to be a separate "document."
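
To make the content index/property store relationship concrete, here is a minimal Python sketch, not SharePoint code, in which all names and data are invented. It shows how full-text hits might be joined against per-document metadata and security-trimmed before being returned:

```python
# Hypothetical sketch: full-text hits are joined against per-document
# metadata (standing in for the property store) and filtered by the
# caller's permissions before any results are returned.

# Content index: word -> set of document IDs containing that word.
content_index = {
    "budget": {"doc1", "doc2"},
    "forecast": {"doc2", "doc3"},
}

# Property store: document ID -> metadata row, including an ACL.
property_store = {
    "doc1": {"author": "Ann", "allowed": {"ann", "bob"}},
    "doc2": {"author": "Bob", "allowed": {"bob"}},
    "doc3": {"author": "Cara", "allowed": {"ann", "bob", "cara"}},
}

def search(word: str, user: str) -> list[dict]:
    """Return metadata rows for documents that contain `word`
    AND that `user` is permitted to see (security trimming)."""
    hits = content_index.get(word, set())
    return [
        {"doc": doc, **property_store[doc]}
        for doc in sorted(hits)
        if user in property_store[doc]["allowed"]
    ]

# "ann" sees only doc1; doc2 also matches but is trimmed away.
print(search("budget", "ann"))
```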

Use the Right Tools for Index Backups and Restores

We want to stress that you need both the index on the file system (which is held on the Index servers and copied to the Query servers) and the SSP's Search database in order to successfully query the index.

The relationship between words in the index and metadata in the property store is a tight coupling that must be intact for the result set to be rendered properly, if at all. If either the property store or the index on the file system is corrupted or missing, users will not be able to query the index and obtain a result set. This is why it is imperative to ensure that your index backups successfully back up both the index on the file system and the SSP's Search database. SharePoint Server 2007's backup tool will back up the entire index at the same time and give you the ability to restore the index as well (several third-party tools will do this too).

But if you back up only the index on the file system without backing up the SQL database, you will not be able to restore the index. The same is true in reverse: back up only the SQL database and not the index on the file system, and you will not be able to restore the index. Do not let your SQL administrators or infrastructure administrators sway you on this point: to obtain a trustworthy backup of your index, you must use either a third-party tool written for precisely this job or the backup tool that ships with SharePoint Server 2007. If you use two different tools to back up the SQL property store and the index on the file system, it is highly likely that when you restore both parts, the index will contain inconsistencies, because the two parts were backed up at different times, and your results will vary accordingly.
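
For illustration, a typical invocation of the built-in stsadm.exe tool performs a full catastrophic backup (which captures the search index and the SSP databases together) to a UNC path and can restore from the same location. The share name below is a placeholder:

```
stsadm.exe -o backup -directory \\backupserver\spbackups -backupmethod full
stsadm.exe -o restore -directory \\backupserver\spbackups -restoremethod overwrite
```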


Crawler Process

When the crawler starts to crawl a content source, several things happen in succession very quickly. First, the crawler looks at the URL it was given and loads the appropriate protocol handler, based on the prefix of the URL, and the appropriate iFilter, based on the suffix of the document at the end of the URL.
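
This dispatch can be pictured as two table lookups: one keyed on the URL prefix (the scheme) to choose a protocol handler, and one keyed on the suffix (the file extension) to choose an iFilter. The following Python sketch is purely illustrative; the handler and filter names are invented, and real protocol handlers and iFilters are registered COM components:

```python
from urllib.parse import urlparse
import os

# URL scheme (prefix) -> protocol handler; extension (suffix) -> iFilter.
# Both registries are illustrative stand-ins.
PROTOCOL_HANDLERS = {"http": "HttpHandler", "https": "HttpsHandler", "file": "FileHandler"}
IFILTERS = {".doc": "WordFilter", ".aspx": "HtmlFilter", ".asp": "HtmlFilter"}

def select_components(url: str) -> tuple[str, str]:
    """Pick the protocol handler by URL prefix and the iFilter by suffix."""
    parsed = urlparse(url)
    scheme = parsed.scheme.lower()
    suffix = os.path.splitext(parsed.path)[1].lower()
    handler = PROTOCOL_HANDLERS.get(scheme)
    ifilter = IFILTERS.get(suffix)
    if handler is None or ifilter is None:
        raise ValueError(f"no protocol handler or iFilter registered for {url}")
    return handler, ifilter

print(select_components("http://intranet/docs/report.doc"))
# -> ('HttpHandler', 'WordFilter')
```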

Note 

The content source definitions are held in the Shared Services Provider Search SQL Server database and the registry. When initiating a crawl, the definitions are read from the registry because this gives better performance than reading them from the database. Definitions in the registry are synchronized with the database so that the backup/restore procedures can back up and restore the content source definitions. Never modify the content source definitions in the registry: this is not a supported action and should never be attempted.

Next, the crawler checks that any crawler impact rules, crawl rules, and crawl settings are loaded and enforced. The crawler then connects to the content source and produces two data streams from it. The first stream is the metadata, which is read, copied, and passed to the Indexer plug-in. The second stream is the content itself, which is also passed to the Indexer plug-in for further processing.

The crawler does only what we tell it to do through the crawl settings in the content source, the crawl rules (formerly known as site path rules in SharePoint Portal Server 2003), and the crawler impact rules (formerly known as site hit frequency rules in SharePoint Portal Server 2003). The crawler will not crawl documents whose types are not listed in the file types list, nor will it be able to crawl a file if it cannot load an appropriate iFilter. Once the content is extracted, it is passed off to the Indexer plug-in for processing. The sketch below summarizes this gating logic.
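
In other words, three gates apply before a document is crawled: its file type must be in the file types list, no crawl rule may exclude its URL, and an iFilter must be loadable for its extension. A minimal Python sketch with made-up data:

```python
import os

FILE_TYPES = {".doc", ".aspx", ".txt"}           # the file types list
EXCLUDE_RULES = ["http://intranet/private/"]     # crawl rules (exclusions)
IFILTERS = {".doc", ".aspx", ".txt"}             # extensions with a loadable iFilter

def should_crawl(url: str) -> bool:
    ext = os.path.splitext(url)[1].lower()
    if ext not in FILE_TYPES:
        return False                  # type not in the file types list
    if any(url.startswith(rule) for rule in EXCLUDE_RULES):
        return False                  # excluded by a crawl rule
    return ext in IFILTERS            # no iFilter -> content cannot be extracted

print(should_crawl("http://intranet/docs/plan.doc"))     # True
print(should_crawl("http://intranet/private/plan.doc"))  # False
```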

Indexer Process

When the Indexer receives the two data streams, it places the metadata into the SSP's Search database, which, as you'll recall, is also called the property store. In terms of workflow, the metadata is first passed to the Archival plug-in, which reads the metadata and adds any new fields to the crawled properties list. Then the metadata is passed to the SSP's Search database, or property store. What's nice here is that the Archival plug-in (formerly known as the Schema plug-in in SharePoint Portal Server 2003) automatically detects and adds new metadata types to the crawled properties list (formerly known as the Schema in SharePoint Portal Server 2003). It is the Archival plug-in that makes your life as a SharePoint administrator easier: you don't have to manually add the metadata type to the crawled properties list before that metadata type can be crawled.

For example, let's say a user entered a custom text metadata field named "AAA" with a value of "BBB" in a Microsoft Office Word document. When the Archival plug-in sees this field, it will notice that the crawled properties list doesn't yet contain a field called "AAA" and will therefore create one as a text field. It then writes that document's information into the property store. The Archival plug-in ensures that you don't have to know in advance all the metadata that could potentially be encountered in order to make that metadata useful as part of your search and indexing services. The sketch below illustrates this behavior.
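
Here is a rough Python sketch of that auto-detection behavior, with invented names; the real Archival plug-in writes to the SSP's Search database rather than to in-memory structures:

```python
# Crawled properties list: property name -> declared type.
crawled_properties: dict[str, str] = {"Author": "text", "Title": "text"}
property_store_rows: list[dict] = []   # stand-in for the property store

def archive(doc_metadata: dict) -> None:
    """Register any previously unseen metadata field in the crawled
    properties list, then write the document's values to the store."""
    for name in doc_metadata:
        if name not in crawled_properties:
            # A new field (e.g. the custom "AAA" field) is added
            # automatically; no administrator action is required.
            crawled_properties[name] = "text"
    property_store_rows.append(doc_metadata)

archive({"Author": "Ann", "AAA": "BBB"})
print(crawled_properties)   # now includes "AAA"
```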

After the metadata is written to the property store, the Indexer still has a lot of work to do. The Indexer performs a number of functions, many of which have been essentially the same since Index Server 1.1 in Internet Information Services 4.0. The Indexer takes the data stream and performs both word breaking and stemming. First, it breaks the data stream into 64-KB chunks (this size is not configurable) and then performs word breaking on the chunks. For example, the Indexer must decide whether a data stream that contains "nowhere" means "no where" or "now here." The stemming component is used to generate inflected forms of a given word. For example, if the crawled word is "buy," inflected forms of the word are generated, such as "buys," "buying," and "bought." After word breaking has been performed and inflection generation is finished, the noise words are removed to ensure that only words that have discriminatory value in a query are available for use. A simplified sketch of this pipeline follows.
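
The chunk/word-break/stem/noise-filter sequence can be sketched as a tiny Python pipeline. Everything below is illustrative: the word breaker is a naive whitespace split and the stemmer is a toy inflection table, whereas the real components are language-aware:

```python
CHUNK_SIZE = 64 * 1024          # the Indexer's fixed 64-KB chunk size
NOISE_WORDS = {"the", "a", "an", "and", "to"}
# Toy inflection table standing in for the stemming component.
INFLECTIONS = {"buy": ["buys", "buying", "bought"]}

def index_stream(text: str) -> set[str]:
    terms: set[str] = set()
    # 1. Break the stream into fixed-size chunks. (A real word breaker
    #    also handles words split across chunk boundaries.)
    chunks = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
    for chunk in chunks:
        # 2. Word breaking (naive whitespace split here).
        for word in chunk.lower().split():
            # 3. Noise-word removal: keep only discriminating terms.
            if word in NOISE_WORDS:
                continue
            terms.add(word)
            # 4. Stemming: add inflected forms so "buy" also matches
            #    "buys", "buying", and "bought".
            terms.update(INFLECTIONS.get(word, []))
    return terms

print(index_stream("To buy the house"))
# -> {'buy', 'buys', 'buying', 'bought', 'house'}
```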

Results of the crawler and indexing processes can be viewed using the log files that the crawler produces. We'll discuss how to view and use this log later in this chapter.



