Analysis and Design

After collecting information and creating a deployment plan, the project team synthesized information to provide a description of the existing infrastructure and a vision of the new infrastructure.

Searching Using Site Server

Originally, most sites within Microsoft did not offer any type of search. Individual departments or groups built their own sites, and the overhead of setting up, running, and maintaining a search capability on each site was burdensome. The major business division portals—such as IT, HR, Product, Finance, Sales, Support, Legal, Operations, and Microsoft Corporate—offered some search capability. The basic problem was that they all set up their own environments and often crawled each other's sites, resulting in duplication of efforts, sometimes three or four times over.

Site Server 3 became the backbone of a centralized search solution. It was set up with dedicated servers for crawling and searching. Site Server created a catalog for each site. The owner of each site or portal specified what content to include in or exclude from the catalog for its site, in addition to what, if any, content on its site should not be crawled.

After developing the process for including content in an index, ITG created a set of custom ASP pages—one for querying and one for returning results. ITG modified these pages to fit each portal's needs for custom query capabilities and custom results sets. One by one, the major portals moved to this search solution because they could get better search capabilities for less effort. After all the major intranet sites had migrated, a number of second-tier sites also implemented this search solution.

Site Server 3 Infrastructure

The Site Server 3 architecture at Microsoft consisted of one search server and two crawl servers. The crawl servers included content in their indexes from their respective catalogs, and then propagated the information from the catalogs to the search server. This architecture, shown in Figure 27.2, ensured that the search capability was always available to users.

Figure 27.2. Site Server 3 search solution architecture

This solution crawled about 3 million corporate intranet documents and files, handling nearly 30,000 queries per day. There were 48 catalogs on these servers, and many sites requested that their searches include several of these catalogs.

Searching with SharePoint Portal Server

SharePoint Portal Server is a complete solution—integrated document management, corporate portal, and search. However, this deployment implements only the search and index creation aspects of SharePoint Portal Server. Because this migration does not include the dashboard site and document management features, separate teams started projects to test those features.

The project team modeled the new design largely on the existing Site Server 3 design. The team modified the existing set of custom search and results pages to handle SharePoint Portal Server in addition to Site Server 3. In this design, as each portal migrates to SharePoint Portal Server, the portals simply change their Web forms to point to the new query page on the SharePoint Portal Server computer.

SharePoint Portal Server Propagation Model

The propagation model includes two servers dedicated to creating and maintaining indexes and one server dedicated to searching as part of the centralized search service, as illustrated in Figure 27.3.

Figure 27.3. Enterprise search tiered server architecture

The server dedicated to searching stores a copy of the index propagated from the index workspaces of the servers dedicated to creating indexes.

The task of creating an index is resource intensive. Consequently, with SharePoint Portal Server, you can create an index workspace on a separate server to isolate the tasks associated with creating and maintaining indexes from other SharePoint Portal Server tasks. After you create the index, SharePoint Portal Server propagates it to the server dedicated to searching. SharePoint Portal Server propagates the index immediately after creating it, or you can schedule the creation of the index to coincide with times of low network traffic.

For more information about this scenario, see Chapter 3, Introducing SharePoint Portal Server.

Architecture Comparison

The migration to SharePoint Portal Server did not change the basic architecture for searching across the corpnet. The Site Server 3 architecture used two servers dedicated to crawling content and one search server. The hardware configuration for the Site Server 3 architecture included one server with four processors, used for searching, and two servers used for crawling: one with four processors and one with two processors. The largest Site Server 3 catalog existed on a server with four processors. The SharePoint Portal Server architecture is the same as the Site Server 3 architecture except that both servers used for creating and maintaining indexes have four processors. This difference in hardware configuration did not affect the results because most performance measures were made by using the largest catalog.

The project team estimated that additional RAM might also help performance. A master merge is an MSSearch process in which separate content index sub-files are merged into a single content index file. Because SharePoint Portal Server performs master merges less frequently while updating indexes, performance on the server used for creating and maintaining indexes improves with additional memory. Previous tests of additional RAM on the servers running Site Server 3 and Microsoft Windows NT® version 4 did not show significant performance gains. However, the ITG corporate server standard for operating systems changed from Windows NT 4 to Microsoft Windows® 2000 Advanced Server. Windows 2000 makes better use of additional memory than Windows NT 4. Therefore, the project team doubled RAM to 512 megabytes (MB) on each server that hosted an index workspace.

The project team estimated hard disk size requirements based on the index size in Site Server 3 and added room for growth. After determining this number, they doubled it to hold a backup copy of the indexes on the server. The ITG standard hard disk configuration for running SharePoint Portal Server places the document store that includes documents and associated metadata on one hard disk, the content indexes on a second disk, and the logs on a third disk to minimize bottlenecks and maximize input/output (I/O) throughput.
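The sizing rule in the preceding paragraph reduces to simple arithmetic. The following is a minimal sketch of it in Python. The function name and the example figures (a 20 GB index and 50 percent growth) are hypothetical; the source does not state the actual Site Server 3 index size or the growth allowance the team used.

```python
def estimate_disk_gb(current_index_gb: float, growth_factor: float) -> float:
    """Estimate disk space for the index disk.

    current_index_gb: size of the existing Site Server 3 index (hypothetical input).
    growth_factor:    multiplier for expected growth, e.g. 1.5 for 50% growth
                      (hypothetical input; the source gives no figure).
    The projected size is doubled to leave room for a backup copy of the
    indexes on the same server, as the text describes.
    """
    if current_index_gb < 0 or growth_factor < 1:
        raise ValueError("index size must be >= 0 and growth factor >= 1")
    projected_gb = current_index_gb * growth_factor
    return projected_gb * 2


# Hypothetical example: 20 GB index with 50% room for growth.
print(estimate_disk_gb(20, 1.5))  # 60.0
```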

Server Configurations

The following table lists the server configurations that ITG used for this project.

Table 27.2   Enterprise Search Hardware Configurations

| Hardware configuration | Enterprise search | Index 1 | Index 2 |
|---|---|---|---|
| Processor | 4 X 550 megahertz (MHz) | 4 X 550 MHz | 4 X 400 MHz |
| Memory (initial) | 512 MB RAM | 512 MB RAM | 512 MB RAM |
| Memory (final) | 2 gigabytes (GB) RAM | 2 GB RAM | 512 MB RAM |
| Disk space | 92 GB | 68 GB | 35 GB |
| OS | Windows 2000 Advanced Server SP1 | Windows 2000 Advanced Server SP1 | Windows 2000 Advanced Server SP1 |

As the table shows, the team increased RAM in one of the crawl servers to test scalability; this nearly doubled the crawl speed. The team also increased RAM in the search server to provide approximately 1 GB for the server to cache the property store. This reduced latency.

SharePoint Portal Server Architecture

Figure 27.4 shows the current architecture at Microsoft for enterprise search.

Figure 27.4. SharePoint Portal Server architecture

Reviewing the Catalog

Site Server 3 creates catalogs to enable searching of content. SharePoint Portal Server creates indexes. An index is a resource that is built to enable full-text search of documents, document properties, and content stored outside the workspace but made available through content sources. A workspace can include multiple propagated indexes. When you create the workspace, SharePoint Portal Server automatically creates one index. You can propagate indexes only from index workspaces and only to a single destination workspace on another server (usually a server that is used primarily for searching). A destination workspace can accept indexes from up to four index workspaces. An index workspace is designed to manage only content sources.
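The propagation constraints described above (an index workspace propagates to a single destination workspace, and a destination accepts indexes from up to four index workspaces) can be modeled as a small validation routine. This is an illustrative sketch only. The class and function names are hypothetical and are not part of any SharePoint Portal Server API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# From the text: a destination workspace accepts up to four index workspaces.
MAX_SOURCES_PER_DESTINATION = 4


@dataclass
class DestinationWorkspace:
    name: str
    sources: List[str] = field(default_factory=list)


@dataclass
class IndexWorkspace:
    name: str
    destination: Optional[DestinationWorkspace] = None


def set_propagation_target(index_ws: IndexWorkspace, dest: DestinationWorkspace) -> None:
    """Register dest as the single propagation target of index_ws."""
    if index_ws.destination is dest:
        return  # already configured; nothing to do
    if index_ws.destination is not None:
        raise ValueError(f"{index_ws.name} already propagates to {index_ws.destination.name}")
    if len(dest.sources) >= MAX_SOURCES_PER_DESTINATION:
        raise ValueError(f"{dest.name} already accepts {MAX_SOURCES_PER_DESTINATION} indexes")
    index_ws.destination = dest
    dest.sources.append(index_ws.name)


# Hypothetical layout: two index workspaces feeding one search workspace.
search = DestinationWorkspace("EnterpriseSearch")
for ws_name in ("Index1", "Index2"):
    set_propagation_target(IndexWorkspace(ws_name), search)
print(search.sources)  # ['Index1', 'Index2']
```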

The review identified 48 catalogs in the Site Server 3 environment. The primary intranet catalog included approximately 2.5 million documents; the remaining half million documents were spread across the other 47 catalogs.

Search Scopes

There were two main reasons to redefine the catalogs using search scopes. First, many of these catalogs wasted resources crawling the same content. Second, because the SharePoint Portal Server search service is multi-threaded, it was possible for the SharePoint Portal Server to have two threads crawling the same content at the same time.

Search scopes in SharePoint Portal Server offer the ability to restrict searching to a subset of an index. Scopes label entries in the full-text index so that they can be quickly identified by queries to deliver faster and more relevant information. The design of the index handles the search scopes by ensuring that the server passes the correct catalog parameters to the custom search page.

The project team created search scopes to help classify content for a single index without having to create additional workspaces. For example, suppose that Human Resources Web and Legal Web wanted to offer search of their own sites, but both wanted to include the Policy site. Instead of having two separate workspaces for each and crawling the Policy site twice, the team created a single workspace with three search scopes. The team created a scope of the content source pointing to the Policy site called "Policy" and then created a scope for all the content sources pointing to the Legal sites called "Legal." They also created a scope, called "HR," for all the content sources pointing to the Human Resources site. This reduced the number of index workspaces from three to one and prevented crawling the Policy site twice. From the Human Resources site, users can also search the Human Resources and Policy sites by using the different search scopes. Likewise, from the Legal site, users can also search the Policy and Legal sites by using the different search scopes. The queries return more relevant query results by using only the relevant search scopes.
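The Human Resources, Legal, and Policy example above can be sketched as a single index whose entries carry a scope label, with each site's query restricted to the scopes it cares about. This is a toy model in Python, not the SharePoint Portal Server query interface. The document URLs, the text snippets, and the `search` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class IndexedDocument:
    url: str
    scope: str  # scope label attached when the content source is crawled
    text: str


# One index, crawled once, with three scopes (all entries are hypothetical).
INDEX = [
    IndexedDocument("http://hrweb/benefits.htm", "HR", "vacation and benefits overview"),
    IndexedDocument("http://legalweb/contracts.htm", "Legal", "contract review process"),
    IndexedDocument("http://policy/vacation.htm", "Policy", "vacation policy for all employees"),
]


def search(index: Iterable[IndexedDocument], term: str, scopes: Iterable[str]) -> List[str]:
    """Return URLs of documents in the given scopes whose text contains term."""
    wanted = set(scopes)
    needle = term.lower()
    return [doc.url for doc in index if doc.scope in wanted and needle in doc.text.lower()]


# The HR site searches HR + Policy; the Legal site searches Legal + Policy.
hr_results = search(INDEX, "vacation", ["HR", "Policy"])
legal_results = search(INDEX, "vacation", ["Legal", "Policy"])
print(hr_results)     # ['http://hrweb/benefits.htm', 'http://policy/vacation.htm']
print(legal_results)  # ['http://policy/vacation.htm']
```

The Policy content appears in both result sets even though it is indexed only once, which is the saving the team was after.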

Query Performance

Another consideration in catalog review and redesign was query performance and load balancing. Although search scopes are useful, overusing them can cause performance issues. One logical extension of search scopes is to crawl everything in one workspace and create a scope for each content source. In that case, every query runs against the index of that single workspace with its many scopes. However, as the number of search scopes increases, query performance declines and the index size increases. Because of this, the project team decided to limit search scopes to only two or three, and to use them mainly in smaller workspaces.

An alternative approach is to create a workspace for each site or group of sites on the intranet, and then create a query that spans multiple workspaces. This also causes query performance to decline as you increase the number of index workspaces included in the query, so the team also decided to limit these types of queries to include only two or three workspaces.
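The team's two limits can be captured as a simple design-time check on each query definition. The following Python sketch is an assumption about how such a check might look. The function name, the warning text, and the choice of three as the hard upper bound for "two or three" are illustrative and do not come from the product.

```python
from typing import List, Sequence

# Upper bound of the team's "two or three" guideline (an interpretation).
MAX_TARGETS = 3


def check_query(scopes: Sequence[str], workspaces: Sequence[str]) -> List[str]:
    """Return warnings for a query definition that exceeds the team's limits."""
    warnings = []
    if len(scopes) > MAX_TARGETS:
        warnings.append(f"{len(scopes)} scopes exceeds the limit of {MAX_TARGETS}")
    if len(workspaces) > MAX_TARGETS:
        warnings.append(f"{len(workspaces)} workspaces exceeds the limit of {MAX_TARGETS}")
    return warnings


# Hypothetical query definitions.
print(check_query(["HR", "Policy"], ["Intranet"]))            # []
print(check_query(["HR", "Policy", "Legal", "IT"], ["Intranet"]))
```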

Duplication

The team reviewed the existing catalog structure to eliminate redundant crawling. They reviewed the content sources and created a better design. During the process, the team closely examined scopes or queries across index workspaces that might compromise performance. In certain cases, performance was improved by crawling the same content twice from different workspaces and having search run a query against one workspace rather than having multiple dashboard sites query only one workspace.

To conduct the review of the catalogs, the team described each Site Server 3 catalog in a Microsoft Excel spreadsheet, as shown in the following tables.

Table 27.3   Reviewing Content Sources

| Content source | Hops and depth | Adaptive | Scope | Schedule |
|---|---|---|---|---|
| \\server01\d$\Inetpub\handbook | This folder and all subfolders | Yes | Handbook | None |
| \\server01\d$\Inetpub\humanresourcesWeb | This folder and all subfolders | Yes | None | None |
| http://search1/sas/dir.asp?setid=1 | 1 page hop, 0 site hops | No | None | Weekly |

Table 27.4   Reviewing Site Path Rules

| Site path rules | Crawl account | Complex URLs |
|---|---|---|
| Avoid file://server01/d$\inetpub\handbook\*_vti*\*; Crawl file://server01/d$\inetpub\handbook\* | default | Yes |
| Avoid file://server01/d$\inetpub\humanresourcesrweb\*_vti*\*; Crawl file://server01/d$\inetpub\hrweb\* | default | No |
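To illustrate how an ordered list of Avoid and Crawl rules like those in Table 27.4 might be evaluated, the following Python sketch applies the handbook rules to sample paths. It makes several assumptions that the source does not confirm: rules are checked in the order listed and the first match wins, matching is case-insensitive, a path that matches no rule is not crawled, and Python's `fnmatch` wildcards stand in for the product's own pattern matching. The sample file names are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

# Handbook rules from Table 27.4, in the order listed (Avoid before Crawl).
RULES: List[Tuple[str, str]] = [
    ("avoid", r"file://server01/d$\inetpub\handbook\*_vti*\*"),
    ("crawl", r"file://server01/d$\inetpub\handbook\*"),
]


def should_crawl(url: str, rules: List[Tuple[str, str]], default: bool = False) -> bool:
    """Return True if the first matching rule is a crawl rule.

    Assumptions: first match wins, comparison ignores case, and unmatched
    paths fall back to `default`.
    """
    candidate = url.lower()
    for action, pattern in rules:
        if fnmatchcase(candidate, pattern.lower()):
            return action == "crawl"
    return default


# Hypothetical sample paths.
print(should_crawl(r"file://server01/d$\Inetpub\handbook\index.htm", RULES))           # True
print(should_crawl(r"file://server01/d$\Inetpub\handbook\_vti_cnf\index.htm", RULES))  # False
```

Listing the Avoid rule first matters under the first-match assumption: the broader Crawl pattern would otherwise also match the `_vti` folders.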

Table 27.5   Reviewing Catalog Information

| Source | Display Mappings |
|---|---|
| \\server01\d$\Inetpub\handbook | http://corphandbook/ |
| \\server01\d$\Inetpub\hrweb | http://hrwebsite/ |

The team then compared the catalogs and identified which ones to consolidate. The initial examination cut the number of catalogs by more than half, from 48 to 20. After several iterations, the team reduced the number of catalogs to 11.
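The team did this comparison by hand in a spreadsheet. As an illustration of the underlying idea, the following Python sketch flags content sources that more than one catalog crawls, which makes those catalogs candidates for consolidation. The catalog names and the `legalweb` path are hypothetical, and the path normalization (lowercasing and trimming trailing separators) is an assumption.

```python
from collections import defaultdict
from typing import Dict, List


def find_shared_sources(catalogs: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each content source crawled by more than one catalog to those catalogs."""
    catalogs_by_source = defaultdict(set)
    for catalog, sources in catalogs.items():
        for source in sources:
            # Normalize so that case and trailing separators do not hide duplicates.
            normalized = source.lower().rstrip("\\/")
            catalogs_by_source[normalized].add(catalog)
    return {
        source: sorted(names)
        for source, names in catalogs_by_source.items()
        if len(names) > 1
    }


# Hypothetical catalogs that both crawl the handbook folder.
CATALOGS = {
    "HR": [r"\\server01\d$\Inetpub\hrweb", r"\\server01\d$\Inetpub\handbook"],
    "Legal": [r"\\server01\d$\Inetpub\legalweb", r"\\server01\d$\Inetpub\Handbook" + "\\"],
}
shared = find_shared_sources(CATALOGS)
print(shared)
```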

Consolidation and Workspace Creation

As an outcome of this exercise, the team decided to create a one-to-one correspondence between remaining catalogs and workspaces. Figure 27.4 shows the final layout of the servers and workspaces.

Identifying Key Points

The key points learned in the Analysis and Design phase were:

  • Deployment requires no significant hardware change. Additional memory or processors improve performance.
  • Migration is a great time to review and clean up catalogs.
  • Catalog redesign requires a variety of approaches:
    • Remove duplicate crawls of content where possible.
    • Limit searches to no more than two or three scopes or workspaces.


Microsoft SharePoint(TM) Portal Server 2001 Resource Kit (Examples & Explanations Series)
ISBN: 0735615624
Year: 2001
Pages: 231
