Overview of Index Creation

When MSSearch crawls a document and includes it in an index, it goes through the following series of steps to apply the proper language resources to that document:

  • In filtering the document, MSSearch checks for a valid Language Code Identifier (LCID), a property that contains a standard numeric value that uniquely identifies the document locale. MSSearch uses the LCID to identify the appropriate word breaker for that document. Some applications, such as Microsoft Word, emit the language property at the word level. In this case, MSSearch switches word breakers within the document so that each word is read with the breaker for its language, and it switches only when it receives a new language property. If the LCID specifies an unsupported language, MSSearch attempts to find a primary language as an alternate. If the primary language is supported, MSSearch uses that language. For example, if the document uses the English_Australian language, MSSearch uses English_US for the word breaker.
  • If the filtered document does not have a language identifier, and the IFilter does not specify the default language for the server, MSSearch uses the neutral word breaker.
  • If the filtered document does not have a language identifier, the IFilter may suggest that MSSearch use the default locale specified for the operating system. If that language does not have a supported word breaker, MSSearch uses the neutral word breaker.
  • In all instances, MSSearch breaks the following properties using the neutral word breaker.

Properties That Always Use the Neutral Word Breaker

  • Filename
  • ContentClass
  • Shortfilename
  • Reverse Filename
  • Path
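
The selection rules above can be sketched in a few lines of Python. This is only an illustration of the decision order, not MSSearch code: the function name, the two lookup tables, and the "Body" property name are hypothetical, and the tables list just two languages. The LCID values (0x0409 for English_US, 0x0C09 for English_Australian, 0x0407 for German) and the rule that the low 10 bits of an LCID hold the primary language are standard Windows conventions. As a simplification, the sketch applies the primary-language fallback to the server default as well as to the document LCID.

```python
NEUTRAL = "neutral"

# Hypothetical, partial tables for illustration only.
BREAKER_BY_LCID = {0x0409: "English_US", 0x0407: "German"}
BREAKER_BY_PRIMARY_LANGUAGE = {0x09: "English_US", 0x07: "German"}

# Properties that are always broken with the neutral word breaker.
ALWAYS_NEUTRAL_PROPERTIES = {
    "Filename", "ContentClass", "Shortfilename", "Reverse Filename", "Path",
}


def choose_word_breaker(property_name, document_lcid,
                        filter_suggests_default, server_default_lcid):
    """Return the name of the word breaker to use for one property."""
    if property_name in ALWAYS_NEUTRAL_PROPERTIES:
        return NEUTRAL
    lcid = document_lcid
    if lcid is None:
        # No language identifier: use the server default only if the
        # filter suggests it, otherwise fall back to neutral.
        if not filter_suggests_default:
            return NEUTRAL
        lcid = server_default_lcid
    if lcid in BREAKER_BY_LCID:
        return BREAKER_BY_LCID[lcid]
    # Unsupported locale: try the primary language (low 10 bits).
    return BREAKER_BY_PRIMARY_LANGUAGE.get(lcid & 0x3FF, NEUTRAL)


# English_Australian falls back to the English_US word breaker.
print(choose_word_breaker("Body", 0x0C09, False, 0x0409))
```

Checking the always-neutral properties first mirrors the "in all instances" wording: a document's LCID never overrides the neutral breaker for those properties.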

A single installation of MSSearch can crawl documents containing content in multiple languages. The resultant index is language independent and contains words in any language without differentiation.

However, to ensure that SharePoint Portal Server crawls content in multiple languages efficiently and successfully, you must consider the following topics:

  • OS/Locale setup. When you install Microsoft Windows® 2000, you specify a particular language and locale. The language and locale affect many settings. These settings include numeric format, date format, currency format, uppercase and lowercase mapping, dictionary sort ordering, and others. Although this helps provide excellent localized support for Windows, unintended results can occur within MSSearch.
  • Word breakers. The first unintended result centers around word breakers for MSSearch. The default language determines the default word breaker. SharePoint Portal Server uses this default word breaker when it builds an index and when it processes search queries. For example, if a server has a default language of Simplified Chinese and SharePoint Portal Server crawls a document that contains no LCID property, MSSearch applies the Simplified Chinese word breaker if the IFilter indicates this default should be used. If this document is actually English, the words that MSSearch extracts from the document and puts in the index are unreliable.
  • Noise words. The second unintended result occurs at the index level. SharePoint Portal Server defines a list of noise words for each language and index. These lists are language specific. Using the wrong language resources, such as word breakers or noise files, can produce unexpected results. For example, in German, the word "die" is equivalent to the English word "the." However, in English, the word "die" means something entirely different. If MSSearch crawls an English document and applies the German noise word file, all instances of "die" are ignored and not placed in the index. Thus, a query for "die" does not return this document.
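
The "die" example can be reproduced with a toy index. The sketch below is not how MSSearch stores its index; the noise word lists are tiny, made-up excerpts, and build_index and search are hypothetical helpers that only show the effect of applying the wrong list.

```python
import re

# Tiny, made-up excerpts of noise word lists, for illustration only.
NOISE_WORDS = {
    "english": {"the", "a", "of"},
    "german": {"die", "der", "das"},
}


def build_index(documents, language):
    """Map each non-noise word to the set of documents that contain it."""
    noise = NOISE_WORDS[language]
    index = {}
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in noise:
                index.setdefault(word, set()).add(doc_id)
    return index


def search(index, term):
    """Return the sorted IDs of documents indexed under the term."""
    return sorted(index.get(term.lower(), set()))


docs = {"doc1": "The plants die without water"}
english_index = build_index(docs, "english")
german_index = build_index(docs, "german")

print(search(english_index, "die"))  # ['doc1']
print(search(german_index, "die"))   # []
```

With the German list applied, "die" never reaches the index, while the English noise word "the" is indexed even though it carries no search value.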

These actions extend to all content included in the index. Although MSSearch can include content stored across multiple servers in the index, it does not use the default language of the server it is crawling. Instead, if the document does not carry LCID properties, MSSearch uses the language setting of the server on which MSSearch is installed. If your deployment includes a server dedicated to creating and maintaining indexes, this behavior can dramatically affect the content that MSSearch returns for a search query.

The language specified on the server hosting the index workspace determines how MSSearch includes content in the index. In addition, SharePoint Portal Server installs noise word lists for all languages when it creates an index. You can modify these lists to add and remove terms, but any changes are in effect only for subsequent crawls. For the changes to take effect for all documents, you must reset the index and have MSSearch perform a full crawl.

Changes to the noise word list are not reflected in the index until the index is rebuilt. For example, if you add the word "Microsoft" to the noise word list, the search engine continues to return results containing "Microsoft" until SharePoint Portal Server performs a full update of the index. If you remove a term from the noise word list, you must follow the same steps.
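
A small simulation shows why a noise word change needs a reset and a full crawl. The Index class below is a hypothetical stand-in rather than the MSSearch index format: it filters noise words only at crawl time, so entries written by earlier crawls survive a later change to the list.

```python
class Index:
    """Toy index that applies the noise word list at crawl time only."""

    def __init__(self, noise_words):
        self.noise_words = set(noise_words)
        self.postings = {}

    def crawl(self, doc_id, text):
        for word in text.lower().split():
            if word not in self.noise_words:
                self.postings.setdefault(word, set()).add(doc_id)

    def reset(self):
        self.postings.clear()

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


corpus = {
    "a": "microsoft sharepoint portal",
    "b": "microsoft search service",
}

index = Index(noise_words={"the"})
index.crawl("a", corpus["a"])

# Add a noise word, then crawl another document.
index.noise_words.add("microsoft")
index.crawl("b", corpus["b"])
before_reset = index.search("microsoft")  # document "a" is still returned

# Reset the index and perform a full crawl with the new list.
index.reset()
for doc_id, text in corpus.items():
    index.crawl(doc_id, text)
after_reset = index.search("microsoft")   # no documents are returned

print(before_reset, after_reset)
```

Removing a term works the same way in this model: documents crawled while the term was on the list have no entry for it until they are crawled again.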

SharePoint Portal Server includes word breakers for the following languages: Japanese, Simplified Chinese, Traditional Chinese, Korean, Thai, English, Spanish, French, German, Italian, Swedish, and Dutch. If you install SharePoint Portal Server on a server with a language that is not from this list, then MSSearch uses the neutral word breaker. The neutral word breaker derives from the English word breaker. Therefore, the neutral word breaker works best when applied to documents written in western European languages.

Shared Service

MSSearch is a server-based, shared service. This means that any installations of SharePoint Portal Server and Microsoft SQL Server on the same computer use the same version of MSSearch to create indexes and to perform search queries. However, MSSearch creates individual indexes for each application. The service is shared but the data is independent. You can categorize MSSearch and its resources into shared resources and index-specific resources:

Shared Resources

  • MSSearch Application
  • Word breakers
  • Stemmers
  • Settings (resource management, failure settings, etc.)

Index-Specific Resources

  • Index files
  • Configuration files (Thesaurus, Noise Word List, etc.)
  • Performance settings
  • Filter associations
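
The split between shared and index-specific resources can be pictured as one shared object plus one resource bundle per application. The class and field names below are hypothetical and simply mirror the two lists above; they do not correspond to any MSSearch API.

```python
from dataclasses import dataclass, field


@dataclass
class SharedResources:
    """One copy per computer, used by every application."""
    word_breakers: dict = field(default_factory=dict)
    stemmers: dict = field(default_factory=dict)
    settings: dict = field(default_factory=dict)


@dataclass
class IndexResources:
    """One copy per index, independent of other applications."""
    index_files: list = field(default_factory=list)
    noise_words: set = field(default_factory=set)
    thesaurus: dict = field(default_factory=dict)
    performance_settings: dict = field(default_factory=dict)
    filter_associations: dict = field(default_factory=dict)


@dataclass
class SearchService:
    shared: SharedResources = field(default_factory=SharedResources)
    indexes: dict = field(default_factory=dict)

    def create_index(self, application):
        self.indexes[application] = IndexResources()
        return self.indexes[application]


service = SearchService()
portal = service.create_index("SharePoint Portal Server")
sql = service.create_index("SQL Server")

# A noise word change touches one index only.
portal.noise_words.add("microsoft")
# A word breaker change is visible to every application.
service.shared.word_breakers["English"] = "hypothetical-version"
```

In this model, editing a noise word list affects a single application, while replacing a word breaker affects all of them, which is why a word breaker update calls for a full crawl of every affected index.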

Before you install another application that uses MSSearch on the same server as SharePoint Portal Server, read the documentation included with that application. This helps ensure that, when MSSearch is updated, each application continues to use the correct word breaker for creating indexes and for processing search queries.

If you install other applications that use MSSearch, you must ensure that each application uses the same word breaker for creating indexes and for search queries. If you install a newer version of a word breaker, you must reset the index and have MSSearch conduct a full crawl of your content.

SharePoint Portal Server contains the most current version of MSSearch. When you install MSSearch, setup checks the version of the word breakers and always keeps the latest version. So, if you install SQL Server 2000 on a computer running SharePoint Portal Server, MSSearch retains the word breakers from SharePoint Portal Server.
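
The "keep the latest version" rule can be sketched as a merge of two version tables. The function name and the version numbers are made up for illustration, and the sketch assumes that a word breaker missing from the computer is simply installed; the source describes only the version comparison.

```python
def merge_word_breakers(installed, incoming):
    """Keep, for each language, whichever word breaker version is newer."""
    merged = dict(installed)
    for language, version in incoming.items():
        # Assumption: a language not yet present is installed as-is.
        if language not in merged or version > merged[language]:
            merged[language] = version
    return merged


# Hypothetical version numbers, compared as (major, minor) tuples.
from_portal_server = {"English": (2, 0), "German": (2, 0)}
from_sql_server = {"English": (1, 5), "German": (1, 5)}

result = merge_word_breakers(from_portal_server, from_sql_server)
print(result)  # {'English': (2, 0), 'German': (2, 0)}
```

Because the comparison is per language, installing an older product leaves newer word breakers in place, so the existing indexes need no reset in that case.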



Microsoft SharePoint(TM) Portal Server 2001 Resource Kit (Examples & Explanations Series)
ISBN: 0735615624
Year: 2001
Pages: 231
