Automatic Categorization

Automatic categorization with the Category Assistant is composed of two parts. First, you must "train" the Category Assistant to recognize documents belonging to particular categories. Second, SharePoint Portal Server crawls documents for inclusion in an index. During this latter process, SharePoint Portal Server associates documents with categories based on information from the training documents.

When using the Category Assistant, consider the following factors:

  • You must provide the Category Assistant with training documents. To do this, you check documents in to the workspace. During check-in, you specify at least one category. After check-in, you must publish the documents before they can be included in an index. You can also include external documents in the set of training documents by creating Web links in the workspace to the documents.
  • You can use the Category Assistant to categorize any document included in the index no matter where it resides. SharePoint Portal Server creates an index of searchable information that includes all workspace content. It can also include a variety of information stored outside the workspace on other SharePoint Portal Server workspaces, Web sites, file systems, Microsoft Exchange Servers, and Lotus Notes databases. When indexed, documents are categorized by the Category Assistant. The Category Assistant's precision can be controlled so that more or fewer documents are categorized.

After evaluating the Category Assistant in a test environment, the Cairn Energy project team decided that careful planning was needed to ensure the most accurate results. The following sections outline the planning process and describe how the team developed training documents.

Selecting Training Documents

The project team attributes the quality of categorization achieved by the Category Assistant to the quality of the training documents used. Although finding good training documents was time consuming, it greatly reduced the overall time spent categorizing documents.

The team found that understanding how the Category Assistant learns was useful when choosing training documents. The training process builds a list of definitive terms for each category by comparing training documents in a single category to those in other categories. The Category Assistant identifies the top 300 shared features among the training documents for a category. The Category Assistant then applies an algorithm to all documents included in the index to determine the proposed category membership.
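The comparison step described above can be illustrated with a minimal sketch. It is not the Category Assistant's actual algorithm, which the text does not specify. It assumes that a term is "definitive" for a category when it appears in a larger share of that category's training documents than of the other categories' documents. The function name definitive_terms, the scoring rule, and the sample categories and words are hypothetical; only the cap of 300 features comes from the text.

```python
from collections import Counter

MAX_FEATURES = 300  # cap on features per category, as stated in the text


def definitive_terms(training_docs, category, limit=MAX_FEATURES):
    """Rank terms that distinguish `category` from the other categories.

    training_docs maps a category name to a list of documents, each
    document being a list of lowercase words. The score is a simple
    difference in document frequency (an assumption for illustration).
    """
    def doc_frequency(docs):
        counts = Counter()
        for words in docs:
            counts.update(set(words))  # count each term once per document
        return {t: n / len(docs) for t, n in counts.items()} if docs else {}

    inside = doc_frequency(training_docs[category])
    others = [d for c, docs in training_docs.items() if c != category for d in docs]
    outside = doc_frequency(others)

    # A term scores high when it is common inside the category and rare outside.
    scored = {t: f - outside.get(t, 0.0) for t, f in inside.items()}
    ranked = sorted(scored, key=lambda t: (-scored[t], t))
    return [t for t in ranked if scored[t] > 0][:limit]


# Hypothetical training data: two categories, two short documents each.
docs = {
    "Drilling": [["well", "rig", "report"], ["rig", "mud", "report"]],
    "Finance": [["budget", "report"], ["budget", "invoice", "report"]],
}
print(definitive_terms(docs, "Drilling"))  # prints ['rig', 'mud', 'well']
```

In this sketch the word "report" is dropped because it appears in every document of both categories, which is one way to see why training documents should stay on the subject of their category.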

The project team chose ten categories for training from the overall list of categories. Each category included a minimum of 20 training documents.

The project team found the following points useful for selecting training documents:

  • Explain how the Category Assistant works to the people supplying the training documents.
  • Create at least ten categories for the Category Assistant to learn.
  • Use training documents that contain a minimum of 2,000 words each.
  • Choose documents with a large number of words per file. Files with high word counts, such as Microsoft Word documents, Adobe Acrobat files, and files in Tagged Image File Format (TIFF), make good training documents. Microsoft Excel spreadsheets and Microsoft PowerPoint® presentations frequently do not.
  • Use training documents that represent a broad range of examples from the subject category.
  • Use training documents that cover the category subject throughout the document. A document may start on the subject, but if much of its content is not relevant, accuracy decreases.
  • You can use training documents that belong to multiple categories.
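The numeric guidelines above lend themselves to a quick check before training. The following is a minimal sketch, assuming the training set has already been summarized as category names mapped to file names and word counts. That structure and the function name check_training_set are hypothetical. The thresholds come from the text: ten categories, 2,000 words per document, and the 20 documents per category that the project team used.

```python
MIN_CATEGORIES = 10         # "at least ten categories"
MIN_DOCS_PER_CATEGORY = 20  # the project team's minimum per category
MIN_WORDS_PER_DOC = 2000    # "a minimum of 2,000 words each"


def check_training_set(training_set):
    """Return a list of problems; an empty list means the set passes.

    training_set maps a category name to a list of
    (file_name, word_count) tuples (a hypothetical structure).
    """
    problems = []
    if len(training_set) < MIN_CATEGORIES:
        problems.append(
            f"only {len(training_set)} categories; need {MIN_CATEGORIES}")
    for category, docs in sorted(training_set.items()):
        if len(docs) < MIN_DOCS_PER_CATEGORY:
            problems.append(
                f"{category}: only {len(docs)} documents; "
                f"need {MIN_DOCS_PER_CATEGORY}")
        for name, words in docs:
            if words < MIN_WORDS_PER_DOC:
                problems.append(
                    f"{category}: {name} has {words} words; "
                    f"need {MIN_WORDS_PER_DOC}")
    return problems
```

A check like this covers only the countable guidelines. Whether a document represents its category well and stays on the subject throughout still requires review by someone who knows the content.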

Adding Training Documents

The team found that dragging documents into the workspace by using Web folders was the most efficient way to add documents. The team created document profiles that included the Categories attribute. After they added documents to the workspace, they checked them in and published them using the new document profile. The team used these documents as representative samples for training the Category Assistant.

Cairn Energy found that adding training documents individually was time consuming. With SharePoint Portal Server, you can check in multiple documents at once to speed the process. SharePoint Portal Server includes only published documents in the index, so you must ensure that documents used for training are approved and published before you run the Category Assistant.

If you do not want to add a document to the workspace but you do want to use it as a training document, you can create a Web link that points to the external document. To do this, create a blank document in the workspace to represent the external document and apply the Web link document profile. You can add the Categories attribute to the Web link profile. When you check in and publish the document in the workspace, you can assign the appropriate category to it. When SharePoint Portal Server crawls the workspace, it also crawls the URL associated with the Web link and the metadata included on the document profile. The external document is then included in the index but remains in its original location. By following this process, you can include external documents in the set of training documents.

The team developed the following process:

  • Add multiple documents to the workspace by using Web folders.
  • Categorize documents in the workspace by using the document profile.
  • Add Web links in the workspace to external documents.
  • Categorize the external documents by using the Web Link document profile.

After completing these steps, you can begin training the Category Assistant.

Training the Category Assistant

To categorize documents automatically, you must complete two tasks. First, train the Category Assistant with a set of documents that represent your categories. Second, apply the newly learned categories to all the documents included in the index. You can train the Category Assistant first and then schedule SharePoint Portal Server to perform a full crawl at the next appropriate time. At Cairn Energy, the team found this useful because categorization and crawling affect overall performance of SharePoint Portal Server.

To access the Category Assistant, in the workspace, right-click the Categories folder, and then click Properties.

Monitoring Training

You can monitor the training process by using the Microsoft Windows® 2000 Event Viewer Log.

If insufficient training documents are available, SharePoint Portal Server generates an error message in the Application log as MSSearch Gatherer Event 3065, <workspace name>_train$$$ Catalog.

When you initiate a training session, SharePoint Portal Server enters a message in the Application log as MSSearch Gatherer Event 3035, <workspace name>_train$$$ Catalog.

Upon successful completion of the training, SharePoint Portal Server generates a message in the Application log as MSSearch Gatherer Event 3018, <workspace name>_train$$$ Catalog.
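The three events above can be summarized as a small lookup. The following is a minimal sketch, assuming the relevant Application log entries have already been exported as (label, event ID) pairs in chronological order. That input format and the function name training_status are hypothetical, and the sketch does not read the event log itself. The event IDs and the "MSSearch Gatherer" label are taken from the text.

```python
# Event IDs and their meanings as given in the text.
TRAINING_EVENTS = {
    3035: "started",
    3018: "completed",
    3065: "insufficient training documents",
}

EVENT_LABEL = "MSSearch Gatherer"  # label as written in the text


def training_status(events):
    """Return the status implied by the most recent training event.

    events is an iterable of (label, event_id) tuples in chronological
    order (a hypothetical simplification of Application log records).
    Returns None when no training event is present.
    """
    status = None
    for label, event_id in events:
        if label == EVENT_LABEL and event_id in TRAINING_EVENTS:
            status = TRAINING_EVENTS[event_id]  # later events override earlier ones
    return status


log = [(EVENT_LABEL, 3035), ("Other Source", 1000), (EVENT_LABEL, 3018)]
print(training_status(log))  # prints completed
```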

All documents that you categorize using the document profile during check-in are potential training documents. It is important to maintain the accuracy of categories applied to documents in this way. If you retrain the Category Assistant by using poor-quality training documents, you reduce the accuracy of the automatic categorization that it performs. Cairn Energy trained the Category Assistant only when the team was certain that the training documents were of high quality.

Categorizing Documents

After you complete the training, you can manually start the crawl process so that SharePoint Portal Server includes the documents in the index. Alternatively, you can defer the crawl until the next scheduled time. Each time SharePoint Portal Server performs a full update, it also categorizes the documents included in the index.

You can limit the documents you automatically categorize to documents stored in the workspace, or you can choose to include documents stored outside the workspace. Cairn Energy automatically categorized all documents regardless of their location.

The team initially set the Category Assistant to "High Precision" when training it. You can update the index by using the same training documents. Cairn Energy found that reducing the precision increases the number of documents suggested by the Category Assistant. In addition, the quality of the training documents affects the accuracy of the suggested categories. Cairn Energy experimented with reducing precision, but the team decided that including fewer documents with higher accuracy was more suitable to their deployment.
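The trade-off described above can be pictured as a cutoff on a match score. The following is a minimal sketch, assuming each document receives a numeric score per category and that a higher precision setting corresponds to a higher cutoff. The text does not say how the Category Assistant implements its precision setting, so the scores, the threshold values, the category names, and the function name suggest_categories are all hypothetical.

```python
def suggest_categories(scores, threshold):
    """Return the categories whose score meets the threshold, best first.

    scores maps a category name to a hypothetical match score between
    0 and 1 for a single document.
    """
    kept = [(score, name) for name, score in scores.items() if score >= threshold]
    return [name for score, name in sorted(kept, key=lambda p: (-p[0], p[1]))]


# Hypothetical scores for one document.
scores = {"Drilling": 0.91, "Geology": 0.62, "Finance": 0.18}

high = suggest_categories(scores, 0.8)  # stricter cutoff, fewer suggestions
low = suggest_categories(scores, 0.5)   # looser cutoff, more suggestions
print(high)  # prints ['Drilling']
print(low)   # prints ['Drilling', 'Geology']
```

Under this assumption, lowering the cutoff adds the weaker "Geology" match, which mirrors the team's observation that reduced precision produces more suggestions at the cost of accuracy.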

After SharePoint Portal Server categorizes a document, you can view the proposed category structure.

To view the proposed category structure:

  1. In Web folders, right-click the document you want to view, and then click Properties.
  2. Click the Search and Categories tab.

The categories are listed under Categories suggested by the Category Assistant. SharePoint Portal Server displays documents from external content sources in the Categories folder in the workspace.

If the Category Assistant generates inaccurate results, you can override the suggested categories for documents stored within the workspace.

To override the categories suggested by the Category Assistant:

  1. Open the properties page for the document for which you want to override the suggested categories.
  2. On the Search and Categories tab, clear the Show Suggested Categories check box.

You cannot override this setting for documents stored outside of the workspace. Therefore, it is very important to use good training documents to achieve the highest possible levels of accuracy.



Microsoft SharePoint Portal Server 2001 Resource Kit
ISBN: 0735615624
Year: 2001
Pages: 231