Auto Categorizing Documents

                 

 
Special Edition Using Microsoft SharePoint Portal Server
By Robert Ferguson

Chapter 12.  Creating Categories


Without assistance, categorizing documents can be difficult and tedious, even for the most content-knowledgeable Coordinators. To reduce the effort required, SharePoint Portal Server provides an administrative tool called the Category Assistant. The Category Assistant can automatically categorize published documents as well as crawled documents in the workspace. This feature is especially helpful when you plan to automatically assign categories (leveraging your category structure) to a large number of files that already exist in your workspace, or to documents as they are created.

The Category Assistant is a tool that requires a certain amount of configuration and training to optimize all of the functionality it can provide. By comparing documents assigned to a specific category with other categorized documents, the Category Assistant identifies the most common characteristics (words) and can then automatically assign an existing or new document to a category based on this information. This process of teaching the Category Assistant about your content is referred to as training.

The Category Assistant uses an algorithm to learn the meaning of a topic from its training examples. Before you can use the Category Assistant, you must therefore complete a series of activities, including manually applying categories to a selection of documents so that they can be used as examples for training.

Before we explain how to start using the Category Assistant, an explanation of how the tool works is in order. When the Category Assistant attempts to automatically assign a new document to one or more categories, it performs the following process:

  1. The Category Assistant compares the list of categories and their definitions to the list of words in each new document it reads.

  2. The words in each document are weighted in terms of importance. The comparison of category definitions to each document yields a confidence number that corresponds to the confidence with which the Category Assistant would place a document in a given category.

  3. The confidence number is then compared to a precision number set by the workspace Coordinator. The precision number allows the Coordinator to control to what degree each document is measured for its fit in a particular category.

  4. SPS tags each document with a category only if the confidence number is above the precision level set by the Coordinator. The document is automatically tagged with a single category or multiple categories in the "property" value of the document.
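The four steps above can be sketched in a few lines of Python. This is only an illustrative model: SharePoint Portal Server does not expose its algorithm, so the scoring rule (the share of a category's term weight found in the document), the function names, and the sample category definitions are all assumptions made for the example.

```python
def confidence(category_weights, document_words):
    """Hypothetical score: share of the category's term weight found in the document."""
    total = sum(category_weights.values())
    if total == 0:
        return 0.0
    present = set(document_words)
    matched = sum(w for term, w in category_weights.items() if term in present)
    return matched / total


def auto_categorize(definitions, document_words, precision):
    """Return every category whose confidence is above the precision level."""
    return sorted(
        name
        for name, weights in definitions.items()
        if confidence(weights, document_words) > precision
    )


# Assumed category definitions: important words and their weights.
definitions = {
    "Laptops": {"laptop": 3.0, "battery": 2.0, "screen": 1.0},
    "Printers": {"printer": 3.0, "toner": 2.0, "paper": 1.0},
}

words = "the laptop battery lasts longer when the screen is dimmed".split()
print(auto_categorize(definitions, words, precision=0.5))  # ['Laptops']
```

Raising the `precision` argument makes the comparison stricter, so fewer categories (possibly none) are returned for the same document.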


Should you run into problems with how documents are categorized, see "Documents Not Categorized as Expected" in the "Troubleshooting" section at the end of the chapter.

More explicitly, metadata is used to stamp a document to categorize it. Each document has a hidden property or attribute called Autocategories (urn:attributes:autocategories). This document metadata is populated with the categories that describe the document during auto categorization. The resulting property is different from the properties used by a user to manually categorize a document. There are two reasons why SharePoint Portal Server distinguishes between these two different types of properties:

  • To provide the ability to overwrite an automatic categorization of a document without affecting the manual categories already assigned.

  • To distinguish between a manual and automatically assigned category.

These events occur after the index has been updated in SharePoint Portal Server. Therefore, there may be a delay before you see a document appear in the assigned category. The length of the delay depends on two things: the amount of content that is in the index and the index method you used.

To learn more about the various indexing methods employed by SPS, see "Updating the Index," p. 104.

To create views for the Web Folders and dashboard site, SharePoint Portal Server performs queries on both types of properties except when the Category Assistant is disabled (see Figure 12.9). If the Category Assistant is disabled, the query for the attribute Autocategories is eliminated. This feature also allows the Coordinator to restrict the view of automatically categorized documents when the Category Assistant is not performing properly and troubleshooting is necessary. We will discuss how to configure the Category Assistant in the next section.

Figure 12.9. View of the Category Assistant.
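To make the relationship between the two kinds of properties concrete, here is a minimal Python sketch of how a category view might be assembled. The `Document` class, its field names, and the `documents_in_category` function are hypothetical stand-ins for the manual Categories property and the hidden Autocategories property; they are not part of any SharePoint Portal Server API.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    categories: set = field(default_factory=set)      # assigned manually by a user
    autocategories: set = field(default_factory=set)  # assigned by the Category Assistant


def documents_in_category(documents, category, assistant_enabled=True):
    """Titles shown in a category view; Autocategories is skipped when disabled."""
    titles = []
    for doc in documents:
        matched = category in doc.categories
        if assistant_enabled:
            matched = matched or category in doc.autocategories
        if matched:
            titles.append(doc.title)
    return titles


docs = [
    Document("Laptop buying guide", categories={"Laptops"}),
    Document("Crawled laptop review", autocategories={"Laptops"}),
]
print(documents_in_category(docs, "Laptops"))
# ['Laptop buying guide', 'Crawled laptop review']
print(documents_in_category(docs, "Laptops", assistant_enabled=False))
# ['Laptop buying guide']
```

Because the two properties are stored separately, switching the Category Assistant off hides the automatic assignments without touching the manual ones.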


Configuring the Category Assistant

You must take a planned approach to get the most from SharePoint Portal Server's Category Assistant. To access the Category Assistant, open the Properties page of the top-level category folder.

There are a number of ways to ensure efficiency and success when you configure the Category Assistant in SharePoint Portal Server.

This feature is available by default, but does not begin to provide any real value until it is trained. Therefore, you should create a plan that covers the number and types of examples (categorized documents) that you will provide to the Category Assistant as training documents. When developing this plan, there are a number of objectives you should seek to accomplish:

  • Determine the number of category examples you will provide. Remember, the more references the Category Assistant can use for comparison, the more detailed and precise the automatic categorization of your documents will be. It would be beneficial to create one or more examples for each category.

  • For each category, determine the different examples you want to provide. For example, the subcategory about Laptops might include training examples for each type of Laptop that may relate to this category.

  • Determine the different categories that each of your training documents may apply to. You will want to apply multiple categories to a document if they are applicable. This will increase the efficiency of your Category Assistant.

You may want to collaborate with your content owners and stakeholder committee to review the examples with them. They can help you achieve the previous objectives quickly and easily. This may decrease the amount of time you need to prepare before beginning to train the Category Assistant.

With the above in mind, you can now begin to configure and train the Category Assistant. Perform the following:

  1. From the Category Assistant property page, select the documents you wish to automatically categorize. You have three choices:

    All documents (default)

    Documents stored in the workspace

    Documents outside of the workspace

  2. Set the parameter for the precision level. Remember, the precision level measures the exactness of categorization. If you set a high precision level, you might have fewer documents that will be auto categorized. A high level will require a more exact match between the comparison of a document's characteristics (words) and the definition of the category.

  3. Finally, click the Train Now button (see Figure 12.10). When you click this button, the category definitions are created. After the training process occurs, all additional documents that are indexed will go through this automatic categorization process as well.

    Figure 12.10. Category Assistant properties page.


TIP

You can disable the Category Assistant by clearing the Enable Category Assistant check box.


It is very important for the Coordinator to develop and follow a plan regarding how the Category Assistant will be used. By determining an approach to using and configuring this feature, you will decrease the amount of manual categorization that is required and increase the quality of the functionality provided to the users of the workspace. Some of the best approaches include:

  • Explain how the Category Assistant works to the people supplying the training documents or otherwise managing the content.

  • Title documents descriptively. This is important because words in the title of a document are given greater weight in the category definitions.

  • Use the Precision setting consistently, so that the weighting applied to each document is uniform.

  • Develop a standard "suite" of text-dense sample documents to use as ideal examples of good training documents. Share these examples with the community responsible for training the Category Assistant for their particular workspaces. Details on training the Category Assistant, such as minimum words per document and types of documents, are covered in the next section.

  • Leverage the authors in your workspace to continually refine and add training examples to the Category Assistant (also covered next).

  • Use training documents that represent a broad range of subject category examples, including internal and external content.

  • Finally, test all new categories, as well as the changes or impact that new training examples have on the Category Assistant, in your Technical Sandbox or test environment.

If you find that your authors are not able to categorize your portal content, see "Authors Complain They Cannot Categorize Documents" in the "Troubleshooting" section at the end of the chapter.

Training the Category Assistant

It is very important to properly train the Category Assistant, which means providing excellent text-based example documents for each category. The accuracy of automatic categorization depends on the quality and variety of these examples. That is, to properly use the Category Assistant and begin the process of auto categorizing, you must provide a base foundation of training examples. To accomplish this:

  1. Determine who will be responsible for training the Category Assistant.

  2. Determine the most efficient types of document examples with which to begin.

Like many other features in SharePoint Portal Server, training the Category Assistant can be delegated to one or more resources. There are two different ways you can delegate this activity.

  • Author Categorization

    If you would like to maximize the number of examples you provide, this feature should be utilized, as it allows the workload of categorizing to be distributed across the content experts. This in turn increases the number and quality of training documents provided to the Category Assistant. In order to accomplish this, the Coordinator will distribute the training responsibilities to a group of authors by adding the Categories property to the document profiles. It is important to make sure that the authors understand the category structure. Why? Because as they check in and categorize their documents, they ultimately will add training examples to be used by the Category Assistant.

  • Single Resource Categorization

    You can control the Category Assistant process by removing the Categories property from your document profile. If this is the method you choose, you will then assign categories by editing the Search and Categories tab of the Properties page of a specific document.

To remove the Categories property from a document profile (for example, from the Add Document Profile Wizard), perform the following:

  1. Select the document profile to use as a template (note that the Base Document Profile is the default template).

  2. Click Next.

  3. Clear the check box next to the Categories property.

Figure 12.11. Categories property in document profile.


Figure 12.12. Search and Categories tab of document Properties page.


Note that SharePoint Portal Server acknowledges any document that is manually categorized as a training example. As documents are checked in and categorized, this activity therefore contributes to training your Category Assistant. With this functionality, and the approach described previously, the Coordinator is no longer solely responsible for managing all of the training documents for the Category Assistant.

NOTE

We suggest a minimum of 10 to 15 examples for each category, each with perhaps 2,000 words.


Because SharePoint Portal Server uses training examples to determine the definition of a specific category, it is important to provide examples with enough information to accomplish this. If there are not enough quality training documents, the Category Assistant will be limited in its ability to categorize a document. To make certain that you are providing good training documents, ensure that the documents you use while creating your first examples have the following characteristics:

  • Common category topic

  • Primarily text based

  • Contain a sufficient amount of text; 2,000 words is a good starting point
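Before training, you may find it useful to check a proposed training set against these guidelines. The following Python sketch is one way to do that outside of SharePoint Portal Server; it assumes you have already extracted the text of each example yourself, and the function name, the thresholds (10 examples, 2,000 words), and the input layout are our own choices rather than product settings.

```python
MIN_EXAMPLES = 10   # lower end of the suggested 10 to 15 examples per category
MIN_WORDS = 2000    # suggested starting point for words per example


def training_gaps(examples_by_category, min_examples=MIN_EXAMPLES, min_words=MIN_WORDS):
    """Map each under-trained category to a list of the problems found."""
    gaps = {}
    for category, texts in examples_by_category.items():
        problems = []
        if len(texts) < min_examples:
            problems.append(f"examples: {len(texts)} of {min_examples}")
        # Count examples whose whitespace-separated word count is too low.
        short = sum(1 for text in texts if len(text.split()) < min_words)
        if short:
            problems.append(f"examples under {min_words} words: {short}")
        if problems:
            gaps[category] = problems
    return gaps


# Hypothetical training set: category name -> extracted text of each example.
training_set = {
    "Laptops": ["word " * 2000] * 10,
    "Printers": ["a short memo about toner"],
}
print(training_gaps(training_set))
```

Categories that do not appear in the result meet both guidelines; anything listed needs more or longer examples before you click Train Now.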

TIP

Microsoft Word documents, Adobe Acrobat files, and other word processing or text-based documents make excellent training examples for the Category Assistant.


Training the Category Assistant can be a relatively easy task if it is well thought out and defined. By executing this activity efficiently, the Coordinator will have an easier job of managing the performance of auto categorizing documents in the workspace with the Category Assistant. After the Category Assistant has been run against a set of documents, determining the categories proposed for a particular document is quite easy. In Web folders, right-click the document you wish to view, and then click Properties. Note that the Search and Categories tab is available at this point. The categories are listed under the Categories Suggested by the Category Assistant heading. You can disable or override these suggested categories from here as well, as described next.

Overriding the Category Assistant

SharePoint Portal Server lets you override the Category Assistant for individual documents or for all documents when you deem it necessary. There may be times when the Category Assistant does not assign a document to the correct category or categories. This may be due to a conflict in the training examples provided.

To override the Category Assistant for an individual document, the Coordinator can correct the category assignment by editing the document's properties as follows:

  1. Go to the document's Properties page.

  2. Go to the Search and Categories tab.

  3. Clear the check box Display document in suggested categories.

  4. Manually assign the appropriate categories.

There may be an occasion when you want to override the Category Assistant for all documents. For example, you may find that there comes a time when the Category Assistant is not assigning categories properly to the documents in your workspace. Therefore you may wish to halt auto categorizing. You can disable the Category Assistant by doing the following:

  1. Go to the Properties page of the top-level category folder.

  2. Clear the Enable Category Assistant check box (see Figure 12.13).

    Figure 12.13. Disabling the Category Assistant.


CAUTION

Disabling the Category Assistant is a difficult decision and should be carefully reviewed before you take this action. It is very difficult to return to an automatic categorization system after you override the Category Assistant for all documents, simply because there is no automated way to do this. That is, you will need to manually update the Search and Categories tab on the Properties page of each document. Your changes will take effect after the next index update.



                 


Special Edition Using Microsoft SharePoint Portal Server
ISBN: 0789725703
Year: 2002
Pages: 286
