Crawling Different Types of Content


One challenge of using a common search engine across multiple platforms is that the types of data, and the methods for accessing that data, change drastically from one platform to another. Let's look at four common scenarios.

Desktop Search

Rightly or wrongly (depending on how you look at it), people tend to store their information on their desktops, and the desktop is only one of several locations where information can be saved. Frustration often arises when people cannot find their documents because they can't remember where they saved them. A strong desktop search engine that indexes content on the local hard drives is now essential in most environments.

Intranet Search

Information that is crawled and indexed across an intranet site, or across the series of Web sites that make up your intranet, is exposed via links. Finding information in a site therefore means finding it in a linked environment and recognizing when multiple links point to a common content item. When multiple links point to the same item, that tends to indicate that the item is more important in terms of relevance in the result set. In addition, linked content can, through circuitous routes, link back to itself, so the crawler must know how deep and wide to crawl before it stops following available links to content it has already reached. Within a SharePoint site, this can be more easily defined: we just tell the crawler to crawl within a certain URL namespace and, often, that is all we need to do.
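The ideas in the preceding paragraph (staying inside a URL namespace, limiting crawl depth, not revisiting content reached through circular links, and counting how many links point to an item) can be sketched in a few lines of Python. This is a generic illustration only, not how the SharePoint crawler is implemented. The function name `crawl`, the `SAMPLE_LINKS` graph, and the `http://intranet/` URLs are all hypothetical, and an in-memory dictionary stands in for actually fetching pages.

```python
from collections import Counter, deque


def crawl(links, start_url, scope_prefix, max_depth):
    """Breadth-first crawl over an in-memory link graph.

    links        -- dict mapping a URL to the URLs it links to
                    (a stand-in for fetching and parsing a page)
    scope_prefix -- only URLs starting with this prefix are followed
    max_depth    -- maximum number of link hops from start_url
    Returns (visited, inbound): the crawl order and a count of
    in-scope links pointing at each URL.
    """
    visited = []
    seen = {start_url}              # prevents loops on circular links
    inbound = Counter()             # rough "importance" signal
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        for target in links.get(url, []):
            if not target.startswith(scope_prefix):
                continue            # outside the URL namespace
            inbound[target] += 1
            if target in seen or depth + 1 > max_depth:
                continue            # already queued, or too deep
            seen.add(target)
            queue.append((target, depth + 1))
    return visited, inbound


# Hypothetical intranet: two pages link to /policy, and /policy links back.
SAMPLE_LINKS = {
    "http://intranet/": ["http://intranet/hr", "http://intranet/it"],
    "http://intranet/hr": ["http://intranet/policy", "http://intranet/"],
    "http://intranet/it": ["http://intranet/policy", "http://example.com/"],
    "http://intranet/policy": ["http://intranet/hr"],
}

visited, inbound = crawl(SAMPLE_LINKS, "http://intranet/", "http://intranet/", 2)
for url in visited:
    print(url, "inbound links:", inbound[url])
```

In this sketch the `seen` set is what keeps the circular link from `/policy` back to `/hr` from causing an endless crawl, and the prefix check is the equivalent of restricting the crawler to a URL namespace.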

In many environments, Line of Business (LOB) information held in dissimilar databases that represent dissimilar data types is often displayed via customized Web sites. In the past, crawling this information has been very difficult, if not impossible. But with the introduction of the Business Data Catalog (BDC), you can now crawl and index information from any data source. Using the BDC to index LOB information will be important if you want to include LOB data in your index.

Enterprise Search

When searching for information in your organization's enterprise beyond your intranet, you're really looking for documents, Web pages, people, e-mail, postings, and bits of data sitting in disparate, dissimilar databases. To crawl and index all this information, you'll need to use a combination of the BDC and other, more traditional types of content sources, such as Web sites, SharePoint sites, file shares, and Exchange public folders. Content sources is the term we use to refer to the servers or locations that host the content that we want to crawl.

Note 

Moving forward in your SharePoint deployment, you'll want to strongly consider using the mail-enabling features for lists and libraries. The ability to include e-mail in your collaboration topology is compelling because so many of our collaboration transactions take place in e-mail, not in documents or Web sites. If e-mails can be warehoused in lists within the sites that the e-mails reference, this can only enhance the collaboration experience for your users.

Internet Search

Nearly all the data on the Internet is linked content. Because of this, crawling Web sites requires additional administrative effort in setting boundaries around the crawler process via crawl rules and crawler configurations. The crawler can be tightly configured to crawl individual pages or loosely configured to crawl entire sites that contain DNS name changes.

You'll find that there might be times when you'll want to "carve out" a portion of a Web site for crawling without crawling the entire Web site. In this scenario, you'll find that the crawl rules might be frustrating and might not achieve what you really want to achieve. Later in this chapter, we'll discuss how the crawl rules work and what their intended function is. But it suffices to say here that although the search engine itself is very capable of crawling linked content, throttling and customizing the limitations of what the search engine crawls can be tricky.
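To make the "carve out" scenario concrete, here is a minimal sketch of include/exclude rules evaluated in order, with the first matching rule winning. It is a generic model for illustration and should not be read as the product's actual rule evaluation, which is covered later in the chapter. The names `RULES` and `is_allowed`, the `www.example.com` URLs, and the choice of shell-style wildcards are all assumptions.

```python
from fnmatch import fnmatchcase

# Ordered (pattern, include) pairs; the first matching pattern decides.
# Hypothetical goal: crawl only the /docs/ portion of one Web site.
RULES = [
    ("http://www.example.com/docs/*", True),   # include the carved-out area
    ("http://www.example.com/*", False),       # exclude the rest of the site
]


def is_allowed(url, rules, default=True):
    """Return True if the URL may be crawled under the given rules."""
    for pattern, include in rules:
        if fnmatchcase(url, pattern):
            return include
    return default  # no rule matched


for url in (
    "http://www.example.com/docs/guide.html",
    "http://www.example.com/about.html",
    "http://www.example.com/docs",
):
    print(url, "->", "crawl" if is_allowed(url, RULES) else "skip")
```

Even in this small model the rules are easy to get subtly wrong: `http://www.example.com/docs` without a trailing slash falls through to the exclude rule, and because the site's home page is excluded, a crawl would need to start inside `/docs/` to reach the included pages at all.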




Microsoft Office SharePoint Server 2007 Administrator's Companion
ISBN: 0735622825
Year: 2004
Pages: 299
