Language Dependencies

                 

 
Special Edition Using Microsoft SharePoint Portal Server
By Robert Ferguson

Chapter 5. Overview of Indexing and Searching Content


One important aspect of dealing with text and text processing is that text is language dependent. This means that text must be treated differently based on the language in which it is written. Differences include, for example, word separation, commonly used words, word expansion and replacement, and word stemming. SharePoint Portal Server provides this kind of linguistic support for the following languages:

  • Chinese (both simplified and traditional)

  • Dutch

  • English

  • French

  • German

  • Italian

  • Japanese

  • Korean

  • Spanish

  • Swedish

  • Thai

The set of languages for which linguistic support is provided is larger than the set of languages for which a translated SharePoint Portal Server user interface is available. Currently the user interface is available in English, French, German, Japanese, Italian, and Spanish.

If linguistic support for a language is missing, SharePoint Portal Server falls back on a language-neutral default implementation. Unfortunately, the interfaces that are used to provide linguistic support are not documented, so the list of languages cannot be extended.

Word Breaking

In western languages, words are typically separated by spaces, but other characters such as the hyphen, the semicolon, and the colon also serve as separators. This separation, called word breaking, is less simple in some far eastern languages. For Japanese, a mapping to a common character set is implemented, so that the same word is treated identically regardless of how it is written in a particular document. SharePoint Portal Server provides a number of word breaking modules. It is not possible, though, to create custom word breakers.

For example, searching for the term "server-based" (without the quotes) will also return documents that contain only the word "server" or only the word "based". Luckily, relevance ranking kicks in: documents in which both words appear next to each other are ranked higher.

TIP

To exclude documents that match only one of the words, you need to indicate that the search term is a phrase. This is done by enclosing the search term in double quotes.
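
The behavior described above can be illustrated with a small Python sketch. It is not SharePoint Portal Server's word breaker, whose interfaces are undocumented; the separator set, the function names (break_words, score, matches), and the three-level score are assumptions made purely for illustration.

```python
import re

# Separators named in the text: whitespace, hyphen, semicolon, colon.
# This set is an illustrative assumption, not the product's actual rules.
_SEPARATORS = re.compile(r"[\s\-;:]+")


def break_words(text):
    """Split text into lowercase words on the separators above."""
    return [w for w in _SEPARATORS.split(text.lower()) if w]


def score(query, document):
    """Toy relevance score.

    0 = no query word found, 1 = at least one query word found,
    2 = all query words found next to each other, in order.
    """
    q = break_words(query)
    d = break_words(document)
    if not q:
        return 0
    n = len(q)
    if any(d[i:i + n] == q for i in range(len(d) - n + 1)):
        return 2
    return 1 if any(w in d for w in q) else 0


def matches(query, document, phrase=False):
    """A phrase search (double quotes) accepts only the adjacent case."""
    s = score(query, document)
    return s == 2 if phrase else s > 0
```

With this sketch, the query "server-based" is broken into the two words "server" and "based". A document containing only "server" still matches an ordinary search, with a lower score, but is rejected when the query is treated as a phrase.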


Word Stemming

Word stemming looks at the root of a word in order to include variations of that word in search results. This is not a simple process when different languages are involved. The grammar of languages differs substantially; for example, the rules to express past or future tense can be quite different. Other rules may involve the use of prefixes or suffixes. To illustrate the complexity just for the English language, think about the following:

In English, the past tense is generally formed by adding the suffix "-ed", as in "jumped". But some verbs, such as "go" (with the past tense "went" and the past participle "gone"), don't follow this simple rule. Other languages require other rules, making word stemming the most complex linguistic capability that is provided in SharePoint Portal Server.

CAUTION

SharePoint Portal Server only supports word stemming for verbs; plural forms of a noun will not be matched with a singular form (for example, babies will not match baby).
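
The difference between the regular "-ed" rule and irregular verbs can be sketched in a few lines of Python. This is a deliberately naive illustration, not the stemmer that ships with the product: the function name stem_verb, the tiny irregular-verb table, and the length check are all assumptions, and the suffix rule fails on many real words (it would turn "loved" into "lov", for instance).

```python
# Irregular forms cannot be derived by rule and must be listed explicitly.
# This two-entry table is illustrative only.
_IRREGULAR = {"went": "go", "gone": "go"}


def stem_verb(word):
    """Reduce an English verb form to a root using two naive rules."""
    word = word.lower()
    if word in _IRREGULAR:
        return _IRREGULAR[word]
    # Regular past tense: strip the "-ed" suffix from longer words.
    if word.endswith("ed") and len(word) > 4:
        return word[:-2]
    return word
```

Because the sketch contains no rule for noun plurals, "babies" and "baby" stay distinct, which mirrors the limitation described in the caution above.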


Many European languages use accents, which are often left out when typing on a keyboard. Mapping accented characters to the common underlying character therefore will produce more hits.
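
One common way to perform such a mapping, shown here as a sketch rather than as the product's actual method, is to decompose each character with Unicode normalization and drop the combining marks. The function name fold_accents is hypothetical; the calls come from Python's standard unicodedata module.

```python
import unicodedata


def fold_accents(text):
    """Map accented characters to their unaccented base characters."""
    # NFD splits a character such as "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Applied to both the indexed text and the query, this makes a search for "resume" also hit documents that contain "résumé". Characters that have no decomposition, such as "ß" or "ø", pass through unchanged.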

Noise Words

Some words are very common in a particular language and therefore will match with almost every document. These words would fill up the index with useless information and are therefore filtered. In other words, they are considered noise, and thus known as noise words. In an English document, you will frequently find the noise word "these". But if the document is in German, this word should not be excluded from the index: "These" means "thesis," a quite specific word when searching for science-related information.
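
The language dependence of noise words can be sketched as a lookup keyed by language. The word lists below are short, made-up samples, and the names NOISE_WORDS and filter_noise are hypothetical; they do not reflect the noise word files that SharePoint Portal Server actually uses. Filtering nothing for an unknown language is likewise an assumption of this sketch.

```python
# Illustrative per-language noise word lists (not the product's real lists).
NOISE_WORDS = {
    "en": {"the", "these", "a", "and", "of"},
    "de": {"der", "die", "das", "und"},
}


def filter_noise(words, language):
    """Drop the noise words of the given language before indexing."""
    noise = NOISE_WORDS.get(language, set())  # unknown language: keep all
    return [w for w in words if w.lower() not in noise]
```

Here "these" is dropped from an English document, while the German word "These" survives filtering and reaches the index.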

Thesaurus

Another linguistic feature built into SharePoint Portal Server is a thesaurus. It allows the replacement or expansion (the use of synonyms) of words at query time. This feature allows the user to use an acronym such as "MS", which will get replaced with "Microsoft"; or to use the term "IE" synonymously with the term "Internet Explorer".

NOTE

If an acronym is in common use, such as "MS" in a software engineering environment, you can define it as a synonym rather than a replacement, because many documents will also use the acronym itself.


To bypass the thesaurus, the user can specify a phrase match by enclosing the term or terms in double quotes.
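
Replacement, expansion, and the phrase bypass can be summarized in a short Python sketch. The in-memory dictionaries and the function name apply_thesaurus are hypothetical and do not represent the format of the product's thesaurus; only the two example entries are taken from the text.

```python
# Hypothetical in-memory thesaurus built from the two examples in the text.
REPLACEMENTS = {"ms": ["microsoft"]}        # the query term is replaced
EXPANSIONS = {"ie": ["internet explorer"]}  # synonyms are added to the term


def apply_thesaurus(query):
    """Return the list of terms to search for at query time."""
    # A phrase in double quotes bypasses the thesaurus entirely.
    if len(query) >= 2 and query.startswith('"') and query.endswith('"'):
        return [query[1:-1]]
    key = query.lower()
    if key in REPLACEMENTS:
        return list(REPLACEMENTS[key])
    if key in EXPANSIONS:
        return [query] + EXPANSIONS[key]
    return [query]
```

A replacement discards the original term, whereas an expansion searches for the original term and its synonyms together; a quoted query is passed through untouched.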

The terms in the thesaurus should be specific to the terminology that is used within the organization; therefore the thesaurus is by default empty.

To learn more about customization of the thesaurus, see "Customizing the Thesaurus," p. 492.


                 
ISBN: 0789725703
EAN: 2147483647
Year: 2002
Pages: 286
