Functional Considerations


In the next few sections, we will look at the functional considerations that must be addressed to enable crawling of the content sources detailed in the previous pages.

After Adding Content

When you add content sources to a workspace or change their settings, the content sources must be crawled again to update the index. Regularly scheduled updates make a lot of sense in this regard, ensuring that the content that appears on the dashboard site for search and viewing stays current.

CAUTION

Keep in mind that crawling and creating or updating indexes is quite processor- and disk-intensive, as well as potentially network-intensive. Therefore, schedule crawls accordingly.


Using and Configuring IFilters

To crawl documents that have proprietary file extensions, you must register the IFilter for that file type. When configuring content sources, specify the file types to include in the index. For example, a coordinator may want to include files with .gwa extensions in the index.

Each file type has an IFilter associated with it. The IFilter for a particular file type must be registered on the SPS computer crawling that file type. Once the IFilter is registered, documents of that file type can be crawled and included in the index. Note that if a file type is added to an index but no filter is registered, only the file properties are included in the index.

Refer to the documentation accompanying each IFilter for the procedure to register it; each is potentially quite unique. Fortunately, SharePoint Portal Server includes filters for Microsoft Office documents, HTML files, Tagged Image File Format (TIFF) files, and text files.
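As a rough illustration of what such a registration typically involves (a sketch, not taken from any particular filter's documentation), three registry entries are chained together: the file extension points to a persistent handler, the handler registers a filter class against the IFilter interface ID ({89BCB740-6119-101A-BCB7-00DD010655AF}), and the filter class points to its DLL. The other GUIDs and the DLL path below are hypothetical placeholders; an actual filter's setup program supplies its own.

  Windows Registry Editor Version 5.00

  ; 1. Tie the .gwa extension to a persistent handler (hypothetical GUID).
  [HKEY_CLASSES_ROOT\.gwa\PersistentHandler]
  @="{AAAAAAAA-0000-0000-0000-000000000001}"

  ; 2. Register the filter class against the IFilter interface ID.
  [HKEY_CLASSES_ROOT\CLSID\{AAAAAAAA-0000-0000-0000-000000000001}\PersistentAddinsRegistered\{89BCB740-6119-101A-BCB7-00DD010655AF}]
  @="{AAAAAAAA-0000-0000-0000-000000000002}"

  ; 3. Point the filter class at its DLL (hypothetical path).
  [HKEY_CLASSES_ROOT\CLSID\{AAAAAAAA-0000-0000-0000-000000000002}\InprocServer32]
  @="C:\\Filters\\GwaFilter.dll"
  "ThreadingModel"="Both"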

Should you experience difficulties in working with a particular TIFF file type, see "TIFF Issues" in the "Troubleshooting" section at the end of the chapter.

File Extension Type Rules

You can use file extension type rules to specify the file types (indicated by their extensions) to include in or exclude from the index across all content sources. These inclusion/exclusion rules apply only to content stored outside the workspace but included in the index through content sources; they do not apply to content stored in the workspace. SharePoint Portal Server does, however, update the index to include content from document shortcuts in the same way as other content sources.

To add a file extension type

  1. In the workspace, open the Management folder, and then open the Content Sources folder.

  2. Double-click Additional Settings.

  3. Click the Rules tab.

  4. Click File Types.

  5. Select the file type inclusion/exclusion method.

  6. Type the file extension that you want to include or exclude.

  7. Click Add.

  8. Click OK.

To remove a file extension type

  1. Select the Rules tab.

  2. Click File Types.

  3. In the list that appears, select a file extension type, and then click Remove.

  4. Click Yes.

  5. Click OK.

Crawling to Map Custom Metadata to Properties

In the process of crawling, SharePoint Portal Server gathers full-text information from documents and includes it in the index. However, the ability to natively map metadata (data about the data) to SharePoint Portal Server properties is limited to the contents of Lotus Notes databases. Mapping metadata from other sources (that is, from file share and Web site content sources) is possible, though not natively: no user interface exists for this mapping. Instead, it is accomplished by writing custom code.

If you need additional assistance troubleshooting Lotus Notes content source crawling, see "Troubles Crawling a Lotus Notes Content Source" in the "Troubleshooting" section at the end of the chapter.

Examples of metadata for various sources include the following:

  • Properties for HTML files are often maintained in <META> tags (see the example following this list).

  • Properties for Microsoft Office documents are usually maintained in OLE structured storage (this metadata can be displayed by clicking Properties on the File menu within the specific Office application, such as Excel or Word).
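For illustration, the following is a minimal sketch of the kind of <META> tags a crawler might harvest from the head of an HTML page; the property names and values here are hypothetical:

  <HEAD>
    <TITLE>Quarterly Budget</TITLE>
    <META name="Author" content="Jane Doe">
    <META name="Department" content="Finance">
  </HEAD>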

Several steps are required to configure a SharePoint Portal Server workspace to crawl external content while allowing the properties and meta tags in that content to be promoted as properties in SharePoint Portal Server. To promote properties from external content into SharePoint Portal Server properties

  1. Create a document profile that includes the list of profile properties to be made available through SharePoint Portal Server. This profile may include custom properties.

  2. Create a content source that points to the external data. When saving it, do not create an index.

  3. Modify and apply the property mapping code to map the external content meta tags and property tags to the SharePoint Portal Server document profile properties (a conceptual sketch follows these steps). See Microsoft's SharePoint Portal Server Resource Kit for more details and sample code to accomplish this.

  4. Flush any cached schema by stopping and starting the SharePoint Portal Server services (via Control Panel, Services).

  5. Start the full update for the content source.
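The real work in step 3 is done with the COM-based sample code from the Resource Kit. Purely as a conceptual sketch (not the actual SPS API), the mapping amounts to a lookup table from source metadata names to document profile property names, along these lines; all names here are hypothetical:

  # Conceptual sketch only -- the real mapping uses the COM-based sample
  # code from the SharePoint Portal Server Resource Kit, not this script.
  PROPERTY_MAP = {
      "author": "Author",          # from an HTML <META name="author"> tag
      "department": "Department",  # a custom document profile property
  }

  def promote_properties(crawled_metadata):
      """Return the profile properties promoted from one crawled item."""
      return {PROPERTY_MAP[name]: value
              for name, value in crawled_metadata.items()
              if name in PROPERTY_MAP}

  # Metadata such as the <META> tags shown earlier would promote as:
  print(promote_properties({"author": "Jane Doe", "department": "Finance"}))
  # {'Author': 'Jane Doe', 'Department': 'Finance'}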

Excluding and Restricting Crawling

SharePoint Portal Server complies with the rules of robots exclusion. Web servers use these rules to control access to their sites by preventing robots, a generic term for Web crawlers or spiders, from accessing certain areas of their Web sites. SPS always searches for the Robots.txt file when crawling, and conforms to the restrictions in it. A Robots.txt file indicates specifically where robots are permitted on the site, and also allows for specific crawlers to be blocked from crawling the site. For example, to prevent a specific robot known for frequently tying up valuable CPU and disk resources from accessing your portal, update the Robots.txt file. You can also simply limit access to specific workspaces on the server in this manner.

The Robots.txt file is not actually installed with SharePoint Portal Server, but you can create it manually or copy it from another server and place it in the root of the server. Keep in mind that SPS "reads" the Robots.txt file only once a day. If you copy a new Robots.txt file to the root node of the workspace, or change the existing one, the changes do not take effect at once; to apply them immediately, restart the SPS service.
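For example, a minimal Robots.txt along the following lines blocks one resource-hungry crawler from the entire site while keeping all other robots out of a single workspace; the robot and workspace names are placeholders:

  # Block one badly behaved crawler from the entire site
  User-agent: ResourceHogBot
  Disallow: /

  # Keep all other robots out of one workspace only
  User-agent: *
  Disallow: /MyWorkspace/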

In lieu of blocking crawling altogether, access to certain documents and their links may be blocked by using HTML meta tags. A meta tag tells a crawler whether indexing a document or following its links is permitted, via INDEX/NOINDEX and FOLLOW/NOFOLLOW values in the tag. Use NOINDEX and NOFOLLOW together, for example, to completely prevent a document from being crawled and its embedded links from being followed.
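For example, placing the following tag in a document's <HEAD> section tells a compliant crawler neither to index the document nor to follow its links:

  <META name="ROBOTS" content="NOINDEX,NOFOLLOW">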
