Performance Considerations

                 

 
Special Edition Using Microsoft SharePoint Portal Server
By Robert Ferguson

Chapter 18.  Configuring SPS to Crawl Other Content Sources


The crawling and indexing process can consume a significant amount of both wall-clock time and hardware resources if there is a large amount of text in the content being crawled. Fortunately, a number of SPS-related considerations, parameters, and components may be addressed or "tuned" to optimize these processes, including the overall architecture, the server configuration, and the crawling process itself.


If you find yourself in the unenviable position of trying to troubleshoot crawling issues arising after a power failure, or trying to resume crawling after restoring your crawling server from a tape backup, see "Troubleshooting the Impact of Power Failures During Crawling" and "Crawl Issues After Performing an SPS Restore," respectively, in the "Troubleshooting" section at the end of the chapter.

Optimizing SPS Architecture for Crawling

While SharePoint Portal Server may be installed on a single machine, for anything but the smallest of implementations this is simply not a realistic solution. Even a portal installation for 200 users should be spread across no fewer than two or three servers: one server dedicated to crawling and indexing, one to document management and other content-related services, and perhaps even another to searching.

For two real-world sample scenarios where specifying a dedicated crawl server is appropriate, see "Installing the Crawling Servers," p. 560 in Chapter 21, and "Configuring Crawling," p. 598 in Chapter 22.

This is true in most implementations: SharePoint Portal Server services should be distributed across multiple servers for optimum performance. The next few pages drill down into methods of optimizing crawling.

Shared Versus Dedicated Workspaces or Servers

Rather than maintaining a single server for both crawling/indexing and supporting user searches, an excellent practice is to split these functions out.

For example, one or more servers may be configured to create and update indexes. These servers are dedicated to crawling content sources; they are not used for document management or searching. Each index is then propagated to the workspace on another server or servers dedicated to searching, called a destination workspace. Typically, this amounts to what is known as an index workspace.

The servers dedicated to crawling/indexing create an index of this content and then propagate or copy the indexes to the index workspace(s) on the servers dedicated to searching.

Note that the server or servers dedicated to searching contain one workspace for every four indexes to be propagated. Like the crawling/indexing servers above, these servers are dedicated to a single function. They provide the dashboard site and store the documents displayed on it, such as announcements, holiday schedules, and organization information.
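The one-workspace-per-four-indexes rule above lends itself to a quick sizing calculation. The following is a minimal planning sketch in Python, not an SPS API; the function names are hypothetical and only the limit of four comes from the text.

```python
import math

# Limit stated in the text: one destination workspace per four propagated indexes.
INDEXES_PER_WORKSPACE = 4


def destination_workspaces_needed(index_count: int) -> int:
    """Return how many destination workspaces are needed for index_count indexes."""
    if index_count < 0:
        raise ValueError("index_count must be non-negative")
    return math.ceil(index_count / INDEXES_PER_WORKSPACE)


def assign_indexes(index_names):
    """Group index names into batches of up to four, one batch per workspace."""
    names = list(index_names)
    return [
        names[i:i + INDEXES_PER_WORKSPACE]
        for i in range(0, len(names), INDEXES_PER_WORKSPACE)
    ]
```

For example, five indexes call for two destination workspaces: one holding four indexes and one holding the fifth.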

Synchronizing Crawl/Index and Search Servers

SharePoint Portal Server supports synchronizing servers dedicated to crawling with servers dedicated to searching. During the synchronization process, the servers compare metadata about the index content, category hierarchy information, subscriptions, and auto-categorization rules.

However, synchronization becomes a problem if your organization uses firewalls. Placement and configuration of these firewalls must be methodically planned. That is, index propagation requires the standard Windows file sharing protocol. If you are using index propagation, most firewalls simply cannot exist between the server dedicated to creating and updating indexes and the server dedicated to searching. Best case: If a firewall must exist between these servers, it must allow Windows file share access.
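One way to check whether a firewall is likely to block index propagation is to test, from the crawl server, whether the search server accepts connections on the TCP ports used by Windows file sharing (445 for SMB over TCP, 139 for the NetBIOS session service). This is a rough diagnostic sketch in Python; the function names are hypothetical, and an open port does not by itself prove that propagation will succeed.

```python
import socket

# TCP ports used by Windows file sharing:
# 445 = SMB directly over TCP, 139 = NetBIOS session service.
FILE_SHARING_PORTS = (445, 139)


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def file_sharing_reachable(host: str, ports=FILE_SHARING_PORTS, timeout: float = 3.0):
    """Return a dict mapping each file-sharing port to its reachability."""
    return {port: port_is_reachable(host, port, timeout) for port in ports}
```

Calling `file_sharing_reachable("search01")`, where `search01` is a placeholder name for the search server, returns a per-port result. If both ports report `False`, the firewall rules between the two servers are a good place to start looking.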

Optimizing the Data Store

To increase performance or simplify file management, the location of the following data store and log files associated with the SharePoint Portal Server computer must be carefully addressed:

Search Indexes. By default, the index resides under the root node. SharePoint Portal Server Administration may be used to change this path, for example if a fast pair of dedicated hardware-mirrored RAID 1 disks is later needed to maximize write performance and availability. Note that if this path changes, the existing indexes do not automatically move to the new index location; only new indexes are automatically created in the new location.

TIP

To move existing indexes to a new location, see ToolsHowTo.txt in the Support\Tools directory on the SharePoint Portal Server CD.


Search Temporary Files. SharePoint Portal Server may need to create temporary files for documents being crawled. Again, use SharePoint Portal Server Administration to move the files to a different drive on the same computer. For best performance and availability, the temporary files location should point to a dedicated fast pair of hardware-mirrored disks.

In both of the preceding cases (indexes and temporary files), a fast pair of dedicated hardware-mirrored drives speaks to the need to keep other files (including data, property store files, indexes, WSS system files, operating system, pagefile, and so on) off of these dedicated drive pairs. That is, these other files and resources should reside on their own set or sets of disk drives.

Optimizing MSSearch

By default, the Microsoft Search (MSSearch) service temporary files are kept in the folder specified by the system TMP variable (typically WINNT\TEMP on the C: drive). If the C:\WINNT\TEMP directory does not exist, the temporary files are stored in the folder specified by the system TEMP variable (sometimes C:\TEMP, depending on IT standards and customs). To optimize performance, reset the TMP variable to point to a dedicated fast pair of hardware-mirrored drives.

CAUTION

It is imperative that enough space exists on this drive to store the MSSearch temporary files. Otherwise, MSSearch will fail to operate correctly.
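Before pointing TMP at a new drive, it can help to confirm which folder would actually be used and whether it has room. The sketch below, in Python, mirrors the TMP-then-TEMP fallback described above and checks free space; the function names and the space threshold are illustrative assumptions, not values documented for MSSearch.

```python
import os
import shutil


def resolve_search_temp_dir(env=None, isdir=os.path.isdir):
    """Return the TMP folder if it exists, otherwise the TEMP setting.

    env defaults to the current process environment; pass a dict to test
    a planned configuration without changing any system variables.
    """
    env = os.environ if env is None else env
    tmp = env.get("TMP")
    if tmp and isdir(tmp):
        return tmp
    return env.get("TEMP")


def has_free_space(path: str, required_bytes: int) -> bool:
    """Return True if the volume holding path has at least required_bytes free."""
    return shutil.disk_usage(path).free >= required_bytes
```

A call such as `has_free_space(resolve_search_temp_dir(), 2 * 1024**3)` checks for 2 GB of headroom; the 2 GB figure is only an example and should be sized to the content being crawled.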


WSS Optimization

Every SharePoint Portal Server computer contains one public store (wss.mdb), and all workspaces hosted on the server sit on this Web Storage System, the Microsoft Web Storage System Database file. Use SharePoint Portal Server Administration to change this path, for example to a fast and comfortably sized high-performance RAID set. Upon doing so, the existing file moves to the new location.

Figure 18.6. Click the Data tab in Server Properties to quickly identify the location of various SPS data and log files.


The Web Storage System-Streaming Database (wss.stm) is used for streaming files, and as such contains data and is a companion to the Web Storage System Database file mentioned above. Together, these two files form the SharePoint database. Like wss.mdb, this file should also be moved to a comfortably sized high-performance RAID set. Over time, these two files will tend to grow quickly.

The log files for the Web Storage System Database should also be placed on a separate pair of fast hardware-mirrored drives for optimum performance and availability. Given the "write" nature of logs in general, an array controller supporting battery-backed cached writes is most desirable. Use SharePoint Portal Server Administration to change this path, noting that when the location changes, the existing files will move to the new location.

Optimizing the Property Store

The placement of the Property Store and Property Store Log files, which contain the metadata from documents, should also be addressed. Unlike the other files above, the file location cannot be modified by using SharePoint Portal Server Administration. To modify these file locations, refer to the ToolsHowTo.txt document in the Support\Tools directory on the SharePoint Portal Server CD.

TIP

For optimal performance, the Property Store and Property Store Log files should be split out onto dedicated physical drives. These files are accessed and shared across all workspaces on the local SharePoint Portal Server.


As a general rule of thumb, assuming budget approval, if you crawl a large quantity of documents (millions of documents) or simply need the fastest portal implementation available, optimize performance by placing the indexes, property store, log files, Web Storage System files, and Web Storage System log files each on dedicated drives. Hardware-based RAID 1 or 0+1 implementations are best for both logs and data files, though data files (given their less stringent write requirements) can often be successfully housed on RAID 5 volumes, thereby reducing overall cost.

For a more thorough discussion of the various RAID types and the benefits of each, see "RAID 1 Versus RAID 5," p. 589.

Load Options Optimization

As creating an index requires resources both from the server that creates the index and from the server that stores the content included in the index, Microsoft determined early on that a method of specifying Load options or settings would be helpful in tuning SharePoint. Using these settings helps to ensure that the load on the computers being crawled is manageable.

Load options consist of site hit frequency rules and time-out settings. These are detailed in the following sections.

Site Hit Frequency

A site hit frequency rule determines how often SharePoint Portal Server requests documents from a Web site, and how many documents are requested. By default, the site hit frequency is limited to five simultaneous document requests (refer to Figure 18.7). You can use the site hit frequency rule to modify demand on specific sites. Though you may want a higher document request frequency for creating or updating an index on your own intranet, it is recommended that you specify a lower frequency for external Web sites, so that you do not overload the sites with document requests. Web sites can identify you from the email address provided when creating an index. If you overload a site with requests, you could be denied access to that site in the future.

Figure 18.7. Default Site Hit Frequency Rule settings.


To add a site hit frequency rule

  1. In the console tree, select the server for which you want to add a site hit frequency rule.

  2. On the Action menu, click Properties (or right-click the server name, and then select Properties on the shortcut menu).

  3. Click the Load tab.

  4. Click Add.

  5. The Add Site Hit Frequency Rule dialog box appears.

  6. In Site name, type the site name, such as http://example.microsoft.com. Multiple site name expressions may be entered, and these are evaluated in order (therefore, "*" should always be the last expression).

  7. Select one of the following frequency options:

    • Request documents simultaneously. SharePoint Portal Server uses all potential/allocated system resources to request as many documents as possible, with no delay between document requests. This setting is usually too resource-intensive for Internet sites, but may be acceptable for some intranet sites.

    • Limit the number of simultaneous document requests, thereby specifying the maximum number of documents that SharePoint Portal Server can request at one time from the site. The default setting for all sites is five simultaneous document requests.

    • Wait a specified amount of time after each document request, that is, delay a certain period of time between document requests. SharePoint Portal Server requests one document per site at one time, and then waits for the specified amount of time to elapse before requesting the next document.

  8. If the frequency is too high, SharePoint Portal Server can easily overload Web sites with requests. Consider specifying lower frequency rates for Internet sites over which you may have no control, and increasing the frequency for intranet sites over which you do have control. Otherwise, an astute Web server administrator will simply block you from crawling the site by updating its robots.txt file.

  9. Click OK.
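Step 6 notes that site name expressions are evaluated in order, which is why "*" belongs last. The following Python sketch models that first-match behavior. It is an illustration only: the `HitRule` fields are hypothetical names for the three frequency options, and shell-style wildcard matching through `fnmatch` is an assumption, since the text does not spell out the exact expression syntax SPS uses.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Optional


@dataclass(frozen=True)
class HitRule:
    pattern: str                             # site name expression, e.g. "*.example.com"
    max_simultaneous: Optional[int] = None   # None models "request documents simultaneously"
    delay_seconds: float = 0.0               # wait after each request, if any


# Default from the text: five simultaneous document requests for all sites.
DEFAULT_RULE = HitRule("*", max_simultaneous=5)


def first_matching_rule(site: str, rules) -> HitRule:
    """Return the first rule whose pattern matches site, else the default rule."""
    for rule in rules:
        if fnmatchcase(site.lower(), rule.pattern.lower()):
            return rule
    return DEFAULT_RULE
```

With the rules ordered as `intranet.example.com`, then `*.example.com`, then `*`, an intranet host picks up its own generous rule and everything else falls through to the catch-all. Moving `*` to the top would make it win for every site, so the more specific rules below it would never apply.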

Time-out Settings

Time-out settings determine how long SharePoint Portal Server waits for either a connection to a particular site or a response from a site. Use these settings to minimize time waiting for connections to servers that are down, too busy to respond, or otherwise unavailable.
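The distinction between waiting for a connection and waiting for a response can be illustrated with plain sockets. The Python sketch below is a generic model of the two time-outs, not the SPS implementation; the function name and the returned status strings are hypothetical.

```python
import socket


def try_site(host: str, port: int, connect_timeout: float,
             response_timeout: float, request: bytes = b""):
    """Attempt one request and report which wait, if any, ran out.

    Returns a (status, data) tuple where status is one of
    "ok", "no_connection", or "response_timeout".
    """
    try:
        # First wait: establishing the connection.
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except OSError:
        return ("no_connection", b"")
    with sock:
        # Second wait: the server has accepted us but must still answer.
        sock.settimeout(response_timeout)
        try:
            if request:
                sock.sendall(request)
            return ("ok", sock.recv(4096))
        except socket.timeout:
            return ("response_timeout", b"")
```

A server that is down typically produces `"no_connection"`, while one that is too busy to answer produces `"response_timeout"`. Keeping both waits short stops a few unresponsive sites from stalling an entire crawl.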

Optimizing Index Resource Usage

The General tab of the server properties allows for index resource usage to be tuned. In fact, it is here that a "dedicated" index server is created: by moving the slider control all the way over to the right, the server becomes a dedicated index server. This slider also allows for granular control of the amount of memory and other resources that the server allocates to updating indexes.

Figure 18.8. The Indexing resource usage slide bar enables you to configure how the SharePoint Portal Server's resources are used, varying from dedicated to background.


To tune indexing resource usage

  1. In the console tree, select the server for which you want to tune indexing resource usage.

  2. On the Action menu, click Properties (or right-click the server name, and then select Properties on the shortcut menu).

  3. Click the General tab, if necessary.

  4. Adjust the slide bar for Indexing resource usage, as required.

  5. Click Apply.

  6. Click OK.

For detailed indexing and search usage data, see "Indexing and Search Resource Usage," p. 494 in Chapter 19.

The Crawling Process: An Optimization Overview

In regard to optimizing the actual crawling process to increase performance, the crawl method employed to update indexes is of paramount importance. In this section, we look closely at three methods of tuning the crawl process: adaptive, incremental, and notification updates.

Regardless of the tuning method, we must use the Scheduled Updates tab on the Additional Settings Properties page to schedule incremental and adaptive updates for the content sources in the index.

Adaptive Updates

An adaptive build is an incremental build with an added statistical formula that allows SharePoint Portal Server to maintain a record of how often content changes. With this data, SPS will then crawl only the content that is statistically most likely to have changed. All content sources that do not support notification updates (see the next section) participate in adaptive updates by default. Note that if no updates have ever been done, the first time an adaptive update is performed is the same as performing a full update. Similarly, the second time an adaptive update is performed is the same as performing an incremental update. Only by the third time the adaptive update is performed is a real improvement in performance apparent.

To configure an adaptive update

  1. Select the Adaptive updates check box.

  2. On the Schedule tab of the Adaptive Updates Properties page, select the appropriate time and days for the index updates.

  3. Click OK twice.

An adaptive update does not provide a significant performance improvement on small corpuses (those with fewer than approximately 2,500 documents). On larger corpuses, however, it is faster than an incremental or full update, at the expense of occasionally missing some updated content. To compensate for this, Microsoft ensures that documents that have not been touched by SPS for two weeks will always be included, even if they have not been updated. Thus, worst case, the index has two-week-old data.

Performance improvement between an adaptive update and an incremental update (discussed more in the next section) depends on the number of documents and the frequency of changes to the documents. The higher the percentage of documents that change infrequently, the better the performance is.
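The idea behind adaptive updates can be sketched as a per-document decision: crawl a document if its history suggests it has probably changed, or if it has gone two weeks without being visited. The Python below is only a model. The text does not give the statistical formula SPS uses, so the smoothed change-rate estimate, the threshold value, and all names here are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Two-week guarantee described in the text.
MAX_AGE = timedelta(days=14)


@dataclass
class DocHistory:
    checks: int             # how many times the document has been examined
    changes: int            # how many of those checks found it changed
    last_crawled: datetime  # when the crawler last visited it


def change_probability(history: DocHistory) -> float:
    """Estimate the chance the document changed (assumed formula, Laplace-smoothed)."""
    return (history.changes + 1) / (history.checks + 2)


def should_crawl(history: DocHistory, now: datetime, threshold: float = 0.25) -> bool:
    """Crawl if the document is two weeks stale or is likely to have changed."""
    if now - history.last_crawled >= MAX_AGE:
        return True
    return change_probability(history) >= threshold
```

Under this model, a corpus in which most documents rarely change produces many skipped documents per pass, which is where the time savings over an incremental update come from.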

Incremental Updates

An incremental update of an index contains only changed content; deleted content is removed from the index, and unchanged content remains as is. Therefore, performing an incremental update will always be faster than performing a full update. If an incremental update is the first update that you create (that is, if you have not previously performed a full update), that incremental update is actually a full update. This occurs only if the incremental update is the first update you do. Subsequent incremental updates are true incremental updates.
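The behavior just described amounts to comparing what the index already holds against what the content source holds now. The following is a minimal Python sketch of that comparison, assuming each document can be identified by a path and a change stamp such as a last-modified time; the function name and data shapes are hypothetical and do not reflect SPS internals.

```python
def plan_incremental_update(indexed: dict, current: dict) -> dict:
    """Compare indexed and current {path: change_stamp} maps.

    Returns the paths to add, re-index, and remove, plus those left as is.
    """
    to_add = sorted(p for p in current if p not in indexed)
    to_update = sorted(
        p for p in current if p in indexed and current[p] != indexed[p]
    )
    to_remove = sorted(p for p in indexed if p not in current)
    unchanged = sorted(
        p for p in current if p in indexed and current[p] == indexed[p]
    )
    return {
        "add": to_add,
        "update": to_update,
        "remove": to_remove,
        "unchanged": unchanged,
    }
```

When `indexed` is empty, every document lands in the `add` list, which matches the observation that a first incremental update is effectively a full update.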

To configure an incremental update

  1. Select the Incremental updates check box.

  2. On the Schedule tab of the Incremental Updates Properties page, select the appropriate time and days for the index updates.

  3. Click OK twice.

Notification Updates

A notification update is the most efficient of all types of index updates. SharePoint Portal Server uses this method by default when possible. If a content source supports notification updates, it automatically notifies the index of any changes made to its content. This notification triggers an update of the individual content source in the index. Notifications are available only for crawling file shares located on an NTFS partition on a computer running Windows NT 4.0 or Windows 2000.

NOTE

SharePoint Portal Server also updates notification-based content sources when the index is reset.


Limitations of SPS Crawling

While SharePoint Portal Server is a powerful search tool, featuring strong search and portal core functionality, it does have certain limitations. For example, general mapping capabilities like those found in products from Orbital Software or Tacit do not exist. In SharePoint, then, such mapping is accomplished simply through writing custom code.

SharePoint Portal Server also does not include Web discussions in the index when crawling sister SPS machines or workspaces. Thus, while it finds and indexes other documents from those servers and workspaces, discussion items are sorely lacking in search results.

Filtering limitations also present a challenge in Microsoft SharePoint Portal Server. Filters remove formatting and extract both the text of the document and any properties defined in the file itself. SharePoint Portal Server has a limit of 16 megabytes (MB) of text data that it filters from a single document. After this limit is reached, SPS enters a warning in the gatherer log (a log file created by SPS after an index is updated) and considers the document successfully indexed.

TIP

The 16MB limit applies only to the text in the document, not to figures or graphs or other non-text material. The file size of the document as a whole does not matter.
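The effect of the filtering limit can be modeled as a cap on accumulated text: text is kept until 16MB is reached, anything beyond that is dropped, a warning is logged, and the document still counts as indexed. The Python sketch below illustrates this under the assumption that the limit is measured in bytes of extracted text; the function names and the log message wording are hypothetical.

```python
# 16 MB limit on filtered text, as stated in the text (byte-based by assumption).
TEXT_LIMIT_BYTES = 16 * 1024 * 1024


def filter_text(chunks, limit: int = TEXT_LIMIT_BYTES):
    """Accumulate text chunks (bytes) up to limit.

    Returns (kept_text, truncated), where truncated is True if any
    text had to be dropped because the limit was reached.
    """
    kept = bytearray()
    for chunk in chunks:
        room = limit - len(kept)
        if len(chunk) > room:
            kept += chunk[:room]
            return bytes(kept), True
        kept += chunk
    return bytes(kept), False


def index_document(name: str, chunks, gatherer_log: list,
                   limit: int = TEXT_LIMIT_BYTES) -> bytes:
    """Filter one document, logging a warning if its text was cut off."""
    text, truncated = filter_text(chunks, limit)
    if truncated:
        gatherer_log.append(f"WARNING: {name}: text limit of {limit} bytes reached")
    return text  # the document is still treated as indexed
```

One practical consequence is that terms appearing only near the end of a very large document may not be searchable, even though the document itself shows up as indexed.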



If you find yourself without a clear place to start in troubleshooting general index-related issues, see "Viewing the Gatherer Log" in the "Troubleshooting" section at the end of the chapter.

Finally, as we have read previously, destination workspaces can only accept up to four propagated indexes. If you create more than four index workspaces on a server dedicated to crawling, you must create additional workspaces on the destination server to propagate the additional indexes.

TIP

One workaround involves using multiple index workspaces. One index workspace, for example, might be dedicated to crawling your intranet site and other internal content sources, while a second index workspace might be dedicated to crawling Internet-based content.



                 
ISBN: 0789725703
EAN: 2147483647
Year: 2002
Pages: 286
