Maintaining Full-Text Indexes | Microsoft SQL Server 2000 Unleashed (2nd Edition)

Maintaining full-text indexes presents some new challenges to the database administrator. One of the most important choices that must be made with regard to full-text indexes is how to repopulate them.

Choosing the Best Refresh Schedule

Updating a full-text index is an expensive operation. If possible, restrict the refresh periods to an off-peak time window. Complete refreshes are the simplest method for maintaining your indexes, but large data sets (more than a million rows) can be unwieldy. The correct choice for repopulation schedules must be made with application design goals in mind. Do you require real-time indexing? Can you live with one hour , one day, or one week of lag time between updating the index?

You have several options for refreshing full-text data:

Perform a full repopulation ”This option can be scheduled or performed manually when required. This choice is best when you replace all of the full-text data in your indexes at once. For example, if you had a nightly job that truncated the indexed table and imported fresh data, a full repopulation would be the most efficient method. Note that this operation will cause the search server to issue a table scan against each indexed table, although if the table is in use by other processes, it will acquire less intrusive , shared, row-level locks.
Perform an incremental repopulation ”If the number of changes to your table is small relative to the total size, this form of repopulation will take the least amount of resources to complete. You could schedule an incremental repopulation every hour or half hour, which would update the full-text index with all changes to the data at a regular interval. The advantage of incremental repopulation is that it is usually faster than full repopulation. However, if your text data changes frequently, the incremental update could take longer than a full repopulation. In most cases, an incremental repopulation, in which all the text data in the table is not repopulated, will be the most efficient method.

SQL Server supports this method by using timestamps to identify rows that have changed since the last update. Your table must have a timestamp column to support incremental updates. If anything in the row changes, the timestamp will change and the row will be flagged for reindexing, even if the full-text indexed column didn't change. If you want to avoid the timestamp requirement, investigate change-tracking options.
Perform a scheduled change-tracking repopulation ”If you have a steady stream of updates to your index, and you do not have a timestamp column, this form of repopulation is best.

The difference between change tracking and incremental repopulation is the method used by SQL Server to support them. With incremental repopulation, SQL Server stores only the timestamp of the last successful repopulation, and scans the table during refreshes to identify rows with larger (later) timestamps. Change tracking uses a system table, sysfulltextnotify, to track the table, row, and full-text index key when changes occur.

With either of these last two options, you must choose a scheduled refresh that fits with the design goals of your application.

Backing Up Full-Text Indexes

The ability to back up the full-text indexes depends heavily on the update method you have chosen . In a worst-case scenario, you can restore your database backups (which are kept consistent by a transaction log), rebuild the catalog, and then perform a full repopulation. However, with large full-text indexes, this could be a lengthy process.

If you are using change tracking, it is more difficult to implement a solid backup policy because there are constant changes occurring to the full-text data. You could perform a backup of the files once a day, and restore these partial indexes in the event of a failure. This would provide limited functionality until you could take the system down for a full repopulation.

If you are using incremental updates, you can back up the full-text indexes that are in the file system separately by stopping the MSSearch service, backing up all the files in the catalog, backing up the SQL Server database, and restarting the MSSearch service. In the event of an index corruption or file system failure, stop the search service. Restore these files, restart the search server, and immediately perform an incremental full-text update. Search Server will receive all the changes to the table based on the timestamp from the last update.

TIP

Microsoft maintains a Knowledge Base article on recommended methods for performing full-text Index backups. See KB article #Q240867.

Effects of Inconsistent Data

Maintenance of full-text indexes is an asynchronous process, meaning that your indexes can easily be out of sync with your data. SQL Server uses transactions and locking to prevent inconsistency in its data, but the full-text search engine has neither of these features.

This section describes the effects of inconsistent data with Search Server. Following are descriptions of what would happen if a row is inserted, updated, or deleted in the SQL Server table, but the full-text index has not yet been updated to reflect the change.

Inserted Row

A row that is inserted to the table, but isn't in the index, will not be returned by any search that would normally find it. Because the unique key associated with the word did not exist at the time of the index creation, the new row will not be reachable by Search Server. Of course, you can still get to the row through conventional Transact -SQL commands.

Deleted Row

A deleted row will work correctly, although extra processing will be involved. When you write a query to find certain words, SQL Server asks the Search Server for the key values of rows that satisfy your request. When these results are returned, SQL Server attempts to retrieve the row using this key. If the row has been deleted, no row will be returned.

Updated Row

If the text for an indexed column is modified such that it no longer satisfies the full-text search criteria, the index will still find the old row, but will return the new information. Because the key value is unchanged, and the Search Server only returns the key values that satisfy your query, SQL Server has no way to know that the results are incorrect. This can be confusing to a user .