Document Categorization

After you create a category hierarchy, the next step is to categorize the content in the workspace. There are two methods of associating documents with categories:

  • Manually assigning categories by editing the document properties (metadata).
  • Automatically assigning categories by using the Category Assistant.

Categorizing Documents Manually

There are two ways to assign individual documents to categories:

  • Editing the Search and Categories tab of the document's Properties page.
  • Adding the Categories property to the document profile.

In addition to these two methods, this section describes how to apply categories to shortcuts (links) to content stored outside the workspace.

Edit the Properties Page of the Document

You can manually categorize a document by editing the Search and Categories tab on the Properties page of the document. By using Windows Explorer, an author or coordinator can select one or more values from the checklist of workspace categories. If the document is stored in an enhanced folder, you must check out the document before you can change the document's category assignments. For a small number of documents, you can use this method of categorization exclusively.

Update the Document Profile

You can also categorize a document by using document profiles. If the coordinator has configured the document profile to display categories, the author will be able to select categories when they check in the document. Adding the Categories property to document profiles provides a way to enforce category assignment when authors check in a document. It also distributes the task of document categorization among multiple authors. This method is particularly useful for bulk categorization scenarios, such as when a large set of documents is migrated into the workspace. Both methods are illustrated in the following figure.

Figure 17.2  Two ways to manually assign categories to a document

Categorize Links to Content Outside the Workspace

When stored in a SharePoint Portal Server folder, shortcuts provide the ability to annotate content stored outside the workspace with metadata. By using shortcuts, you can manually assign categories to information stored outside the workspace. SharePoint Portal Server includes a special document profile called the Web Link profile, which includes a property called Link, for this purpose.

When you add a .URL file to a workspace folder and apply the Web Link profile, SharePoint Portal Server uses the Link property to determine the target of a shortcut. SharePoint Portal Server automatically updates the Link property for a .URL file but not for a .LNK file. To fill in this property, right-click the shortcut and then select Edit Profile to open the profile form. When the form opens, SharePoint Portal Server populates the Link property. You can then close the form by clicking OK.

If you categorize multiple shortcuts at the same time (bulk edit), SharePoint Portal Server does not automatically update this property. The shortcuts do not display correctly until you open each shortcut individually.

When you add a shortcut to the workspace, the object created in the store is a file called a stub. When the Link property is set on a stub, two things happen:

  • SharePoint Portal Server does a one-page crawl of the link target. As it does this, SharePoint Portal Server automatically applies any properties on the stub to the link target. For example, if you create a shortcut, set the Link property on the Web Link profile to the address of a Web site, and then set the Categories property to "Technology," SharePoint Portal Server includes the Web site in the content links that it displays when you browse the Technology category.
  • When it displays a stub that has the Link property set on it, the dashboard site renders a hyperlink not to the stub (which would be a URL to a shortcut in the workspace), but rather to the target of the Link property. In the previous example, when you browse the Technology category, you see a link to the Web site rather than to the shortcut stored in the workspace.
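
The two behaviors above can be illustrated with a minimal sketch. It is not SharePoint Portal Server code: the dictionaries, the key name workspace_url, the paths, and the example.com address are hypothetical stand-ins chosen only to show how a stub's Link property changes the hyperlink that a category listing displays.

```python
# Hypothetical in-memory stand-ins for workspace items; the keys "Link",
# "Categories", and "workspace_url" are illustrative, not a real API.
ITEMS = [
    {"workspace_url": "/workspace/Documents/spec.doc",
     "Categories": ["Technology"]},
    {"workspace_url": "/workspace/Links/site.url",
     "Link": "http://example.com/",
     "Categories": ["Technology"]},
]


def href_for(item):
    """Return the address a category listing would link to for an item."""
    # A stub with the Link property set is shown as a link to its target,
    # not as a link to the shortcut file stored in the workspace.
    return item.get("Link") or item["workspace_url"]


def browse_category(items, category):
    """List the link targets shown when browsing one category."""
    return [href_for(i) for i in items if category in i.get("Categories", [])]


print(browse_category(ITEMS, "Technology"))
```

In this sketch the ordinary document is listed by its workspace address, while the shortcut is listed by the address stored in its Link property.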

In a folder associated with the Web Link document profile, add a shortcut to the content that you want to categorize and apply the Web Link document profile to the shortcut. Edit the document profile to apply the appropriate categories. Ensure that you fill in the Link property correctly before closing the profile form. To preserve the Link property of the shortcut when you drag and drop it in the workspace, ensure that the default document profile for the folder includes the Link property.

The Title property of the Web Link document profile overwrites the actual title of the document retrieved by the shortcut. This is true even if the Title property remains empty. To avoid this problem, create a document profile for the shortcuts that includes the Link and Categories properties but not the Title property.

If you crawl a large quantity of content outside the workspace, you can also apply categories automatically. The next section of this chapter describes this method of automatic categorization.

Categorizing Documents Automatically

Efficiently categorizing documents presents a significant challenge to coordinators. A vast amount of information can be aggregated for the dashboard site, and this information typically lacks inherent structure, which makes it hard to organize sensibly. To solve this problem, SharePoint Portal Server includes technology that automatically categorizes crawled documents as well as documents published in the workspace. If you plan to use categories for a large number of files, the Category Assistant can efficiently assign categories from your category structure to existing documents and add them automatically to new documents. This reduces the time required to implement categories for your users.

The Category Assistant is based on an adaptive algorithm that can learn the "definition" of a topic if given sufficient training examples. Before using it, you must manually apply categories to a representative selection of documents for the Category Assistant to use as training examples. The Category Assistant compares documents assigned to one category with documents from other categories to identify the most characteristic features (words). Ultimately, the definition of a category is the list of words that best distinguish documents in one category from documents in other categories.

When SharePoint Portal Server updates the index, the Category Assistant compares the category definition to the list of words contained in each new document encountered. More distinguishing words, such as those in the document's title, are given greater weight in the category definitions. The comparison of category definition to document yields a number that represents the confidence with which the Category Assistant would place the document in the given category. SharePoint Portal Server tags the document with the category only if this confidence number is above the precision level set by the coordinator. SharePoint Portal Server can and often does automatically categorize a single document into multiple categories.
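
The following toy sketch illustrates the general idea described above: build a list of distinguishing words per category from training examples, score a new document against each list, and keep only the categories whose score exceeds a precision level. The actual Category Assistant algorithm is not documented here, so the scoring formula, the function names, and the sample data are all assumptions made for illustration. Extra weighting for title words is omitted.

```python
import re
from collections import Counter

# Hypothetical training examples: category name -> manually categorized text.
TRAINING = {
    "Hats": ["wool hats and felt hats", "straw hats for summer"],
    "Coats": ["wool coats and rain coats", "winter coats with hoods"],
}


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z]+", text.lower())


def build_definitions(training, top_n=5):
    """Return {category: {word: weight}} using a simple assumed score."""
    totals = {cat: Counter(w for doc in docs for w in tokenize(doc))
              for cat, docs in training.items()}
    definitions = {}
    for cat, counts in totals.items():
        elsewhere = Counter()
        for other_cat, other_counts in totals.items():
            if other_cat != cat:
                elsewhere.update(other_counts)
        # A word is distinctive when it is frequent here and rare elsewhere.
        scores = {w: c / (1 + elsewhere[w]) for w, c in counts.items()}
        best = sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
        definitions[cat] = {w: scores[w] for w in best}
    return definitions


def confidence(definition, text):
    """Fraction of the definition's total weight found in the text."""
    words = set(tokenize(text))
    total = sum(definition.values())
    matched = sum(wt for w, wt in definition.items() if w in words)
    return matched / total if total else 0.0


def categorize(definitions, text, precision):
    """Return every category whose confidence is above the precision level."""
    return sorted(cat for cat, definition in definitions.items()
                  if confidence(definition, text) > precision)


definitions = build_definitions(TRAINING)
print(categorize(definitions, "hats and coats for winter", precision=0.5))
```

Raising the precision argument in this sketch makes the match stricter, so fewer categories (possibly none) are returned for the same document, and a single document can still land in more than one category.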

SharePoint Portal Server associates documents with categories when it updates the index. For this reason, there may be a delay before you see a document appear in the assigned category. The length of the delay depends on the index method you use and the amount of content that is included in the index.

Extend Category Properties and Views

The Category Assistant categorizes documents by stamping them with metadata. Specifically, there is a hidden property on the base document profile called Autocategories (urn:attributes:autocategories). The Category Assistant populates this property with the categories that best describe the document. This property is different from the Categories property, which users update manually. There are two reasons for this difference:

  • To differentiate between a manually categorized and automatically categorized item.
  • To enable the Category Assistant to overwrite previous automatic categorization values for a given document without disturbing manual categorizations.

When you enable the Category Assistant, SharePoint Portal Server queries both properties to create category views in Web folders and the dashboard site. When you disable the Category Assistant, SharePoint Portal Server eliminates the query for Autocategories, leaving only the query for the Categories property. If the Category Assistant is not functioning as the coordinator expects, this makes it easy to turn it off and eliminate all automatically categorized documents from category views.

Configure the Category Assistant

You can access the Category Assistant from the Properties page of the top-level category folder. SharePoint Portal Server enables the feature by default but the Category Assistant does not perform any categorization until you train it.

Consider the following points before training the Category Assistant:

  • Try to provide as many examples as possible. These examples should encompass as many facets of the category as possible. For example, a category about Llamas might include training documents about the evolution of llamas, typical llama habitats, and llama behaviors.
  • Consider applying multiple categories to a document. You can assign a document to any category that a user might access. For example, if a user wants to find information about a utility that your group uses, she might look under a category called Internal Tools or a category whose name describes the purpose of the tool, such as Archiving.

To configure and train the Category Assistant:

  1. On the Category Assistant property page, select the set of documents you want to be automatically categorized. You can select only documents stored in the workspace, only documents stored outside of the workspace, or all documents (default selection).
  2. Set the precision level. If you set a high precision level, the Category Assistant will require a more precise match and might categorize fewer documents.
  3. Click the Train Now button. You must click this button to create the category definitions. After the Category Assistant is trained, all documents that are subsequently indexed are subject to automatic categorization.
  4. To disable the Category Assistant, clear the Enable Category Assistant check box.

Figure 17.3  Category Assistant property page

Train the Category Assistant

Training the Category Assistant is the most important step in categorizing documents automatically. The Category Assistant needs training examples for each category. Without good training examples, the accuracy of the Category Assistant is limited. It is recommended that you use a minimum of 10 documents per category to train the Category Assistant successfully.

Ideal training documents are:

  • All related to the same category topic. For example, if the category were Product Design, including a document about product specifications would be useful. However, including a training document about product inventory would reduce the Category Assistant's accuracy.
  • Primarily textual. Word processing documents are excellent training examples. Documents such as spreadsheets do not offer as much text for the Category Assistant to use for categorization.
  • Relatively long. There must be enough text for the Category Assistant to analyze the documents and identify the keywords that define a category.

Good training examples for each category improve the accuracy of the Category Assistant. The more training examples you provide, the more precise the Category Assistant can be.
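
Before training, a coordinator might want to check which categories still fall short of the recommended 10 examples. The sketch below assumes the documents have already been exported as plain dictionaries with a Categories list; that export format and the function name are hypothetical, and nothing here calls a SharePoint Portal Server interface.

```python
from collections import Counter

MIN_EXAMPLES = 10  # minimum number of training documents recommended per category


def undertrained_categories(documents, categories, minimum=MIN_EXAMPLES):
    """Return {category: example count} for categories below the minimum."""
    # Every manually categorized document counts as one training example
    # for each category it is assigned to.
    counts = Counter(cat for doc in documents
                     for cat in doc.get("Categories", []))
    return {cat: counts[cat] for cat in categories if counts[cat] < minimum}


# Hypothetical sample data: 10 documents in Hats, 3 in Coats, none in Scarves.
docs = [{"Categories": ["Hats"]}] * 10 + [{"Categories": ["Coats"]}] * 3
print(undertrained_categories(docs, ["Hats", "Coats", "Scarves"]))
```

A count alone does not show whether the examples are textual, long enough, or on topic, so a check like this complements the guidelines above rather than replacing them.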

You can assign the task of training the Category Assistant to one person or several. Two training models are:

  • Allow authors to categorize documents. If you want to distribute training responsibilities across a group of authors, you can add the Categories property to your document profiles. As authors check in and categorize their documents, they add training examples for the Category Assistant. The benefit of this model is that you use a greater number of documents as training examples. This procedure works best if the authors clearly understand the category structure.
  • Assign training responsibilities to one individual. If you want to control the Category Assistant training process, you can remove the Categories property from your document profiles. You can then assign categories by editing the Search and Categories tab of the Properties page of a specific document.

Note that SharePoint Portal Server treats any document that you manually categorize as a training example. Therefore, if contributors check in their documents and categorize them on a day-to-day basis, they are implicitly training the Category Assistant. The benefit of this design is that far more documents are treated as training examples and the coordinator need not manage a special set of training documents.

Override the Category Assistant

At times, you may want to override automatically chosen categories on individual documents. To support this, a property (urn:content-classes:item::issuggestedcategoryused) indicates whether the automatically selected categories are included in category views. If it is set to TRUE, the document appears under its automatically selected categories in the Web folder view and on the dashboard site. You set the property by selecting the Display document in suggested categories check box on the Search and Categories tab on the Properties page of a document.

If the Category Assistant does not select the appropriate categories for a document, a coordinator can override the Category Assistant by using the following methods:

  • For a single document. The coordinator may enable the Category Assistant for the workspace but occasionally override automatically chosen categories for specific documents. For example, the Category Assistant may place a document about hats in the Coats category. The coordinator can correct the category assignment by editing the document's Properties page. To do this, clear the Display document in suggested categories check box on the Search and Categories tab on the Properties page and manually assign the appropriate categories.
  • For all documents. If the Category Assistant is not performing as expected, the coordinator can disable it and neutralize all automatically assigned categories. When the Category Assistant categorizes documents, it updates a hidden property on the base document profile called Autocategories. When the Category Assistant is disabled, SharePoint Portal Server ignores the Autocategories property.
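
The interaction between the two override methods and the two category properties can be summarized in a small model. This is a sketch under stated assumptions, not the product's query logic: documents are plain dictionaries, the short key issuggestedcategoryused stands in for the full property name, and treating a missing flag as TRUE is an assumption.

```python
def listed_categories(doc, assistant_enabled=True):
    """Return the categories under which a document would be listed."""
    categories = set(doc.get("Categories", []))  # manual assignments
    # Assumption: a document without the flag behaves as if it were TRUE.
    use_suggested = doc.get("issuggestedcategoryused", True)
    if assistant_enabled and use_suggested:
        # Automatic assignments count only when the Category Assistant is
        # enabled and the document has not opted out of them.
        categories.update(doc.get("Autocategories", []))
    return sorted(categories)


def override(doc, manual_categories):
    """Model the single-document override: clear the flag, assign manually."""
    corrected = dict(doc)
    corrected["issuggestedcategoryused"] = False
    corrected["Categories"] = list(manual_categories)
    return corrected


# A document about hats that was automatically placed in Coats.
doc = {"Categories": [], "Autocategories": ["Coats"]}
print(listed_categories(doc))                            # automatic only
print(listed_categories(override(doc, ["Hats"])))        # single-document fix
print(listed_categories(doc, assistant_enabled=False))   # disabled for all
```

Note that in this model the Autocategories value is never erased. It is only ignored, which mirrors the description of the Autocategories property being kept separate from the manually maintained Categories property.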

It is difficult to return to an automatic categorization system after you override the Category Assistant for more than a few documents, because there is no automated way to do this. If you override the Category Assistant and then want to undo that action, you must manually update the Search and Categories tab on the Properties page of each affected document. Your changes take effect at the next index update.

Microsoft SharePoint Portal Server 2001 Resource Kit
ISBN: 0735615624
Year: 2001
Pages: 231