Microsoft Search Ser vice | Beginning Transact-SQL With SQL Server 2000 and 2005

To compensate for this shortcoming, Microsoft implemented a flexible, free-form text indexing technology very similar to the product I used to index our support call system. The Microsoft Search Service was originally adopted to index newsgroup servers and web sites. Because it was capable of indexing practically any volume of text stored in files, it was integrated into SQL Server several versions ago. Today, any SQL data type capable of storing text characters can be indexed for free-form searches using full-text indexing and the Microsoft Search Service. Keep in mind that this is not a capability of SQL Server, rather a separate service made accessible through extensions in both the SQL Server product and the Transact-SQL language. Because text in practically any form can be indexed, SQL Server can be used as a storage repository for content such as Office documents that can then be indexed and searched using full-text indexing. Word documents, for example, contain both text and binary markup information. This poses no problem because the non-textual data is simply ignored.

Full-text indexing works much differently than standard indexes in SQL Server. Indexed data is not stored in the database. Full-text catalogs store index data in separate catalog files on the server. When the index is populated, the search service weeds out all of the noise words such as "and" and "of." All of the remaining words are added to a table-specific index stored within a catalog. Multiple columns can be added to the same index within a catalog.

Most of us use full-text searching every day. Although the Microsoft Search Service is not implemented on the same scale as the Google and Yahoo web search engines, the fundamental technology is the same. If you have used any of the leading web search services on-line, then you're already familiar with some of the things you can do with the Microsoft Search Service and full-text indexing.

Soundex Matching

One of the great challenges when mixing the nuance of language with the exactness of computing is to make sense of things that are similar to other things, but not exactly the same. My friend, Steve, who writes a humorous newspaper column, says that there are people who like things to be black and white, and there are other people who are OK with things in that gray area in between. For the gray people, driving 70 on a 60-mile-per-hour road is perfectly acceptable for the surgeon on his way to the hospital. This is not to say that the gray people are all about breaking rules and cheating their employers, they just have a different way of looking at things. Likewise, at times you may need to match a word or phrase that is similar to another, in that gray area between equal to and not equal to. One of the great challenges, for those of us who spend our lives in the world of Boolean logic, is to cope with the concept of inexact comparisons.

Soundex, as applied in SQL Server, is a standard used to compare words based on their phonetic equivalents, using a mathematical algorithm. This standard is based on the Consensus Soundex, developed by Robert Russell and Margaret Odell in the early 1900s. It was used by the United States Census in the 19th and early 20th centuries and in genealogical research to index and deal with spelling variations in surnames. Although the rules are based on English language phonetic rules, it does work with many words in different languages. Here's something to keep in mind: Just as the rules of spoken language can be a bit arbitrary, so is this. Soundex matching is pretty accurate, most of the time, but on occasion some exceptions may occur. Use it for search and matching features to be validated by a user, but don't bet the farm on every result.

The sound of a word is represented by a letter, representing the first sound, followed by a three-digit integer, each numeral representing adjacent consonant sounds. Before processing a word, the letters A, E, I, O, U, H, W, and Y are ignored unless they have a phonetic significance when combined with another letter. The first three prominent consonant sounds (after the first letter, if it's a consonant) are translated as shown in the following table.

Letters	English Phonetic Description	Represented By
B, F, P, V	labials and labio-dentals	1
C, G, J, K, Q, S, X, Z	gutterals and sibilants	2
D, T	dental-mutes	3
L	palatal-fricative	4
M, N	labio-nasal and lingua-nasal	5
R	dental fricative	6

The resulting value is padded with zeros, if necessary. Here are some simple examples. The words "Two," "To," and "Too" all have the same pronunciation. I'll pass each to the Soundex function:

 SELECT SOUNDEX('Two') SELECT SOUNDEX ('To') SELECT SOUNDEX ('Too')

The result is the same for each word, T000, as shown in Figure 11-1.

image from book
Figure 11-1:

Because there are no consonants after the "T," zeros are added. This happens to be the same value returned for Tea, Tee, Tow, Toe, and Toy. Using a more complex word, the result is more precise. For example, the Soundex value for the word "Microsoft" is M262: 2 for C, 6 for R, and 2 for S.

Try a few different words. Generally, I've found this to work reasonably well for comparing the closeness of words, but using the Soundex function for this purpose is not an exact science. For example, the word Seattle has a Soundex value of S340, which is the same for the word Settle. However, the word Subtle has a Soundex value of S134 because the algorithm missed the fact that the B is silent. This confirms what I've known all along, that people from Seattle are not very subtle.

The SOUNDEX() function returns a character string. With the exception of an exact match, you would need to parse this string and convert the numeric value to a numeric type. This would allow you to make quantitative comparisons. In this example, I use variables to hold the input and output values. The SOUNDEX() function output is parsed using the SUBSTRING() function to return only the numerical value. The difference is calculated and converted to a positive value using the ABS() function. Using this approach, I'll compare Redmond and Renton, two neighboring Washington State cities that many people (including my wife) often confuse:

 ECLARE @Word1 VarChar(100) DECLARE @Word2 VarChar(100) DECLARE @Value1 Int DECLARE @Value2 Int DECLARE @SoundexDiff Int SET @Word1 = ‘Redmond’ SET @Word2 = ‘Renton’ SELECT @Value1 = CONVERT(Int, SUBSTRING(SOUNDEX(@Word1), 2, 3)) SELECT @Value2 = CONVERT(Int, SUBSTRING(SOUNDEX(@Word2), 2, 3)) SET @SoundexDiff = ABS(@Value1 - @Value2) PRINT @SoundexDiff

According to the SOUNDEX() function, these two words are quite different phonetically. My query returns a difference of 180. If you don't want to go to this much work and don't need such a granular comparison, all this effort isn't necessary.

The DIFFERENCE() Function

The DIFFERENCE() function is really just a wrapper around two SOUNDEX() function calls and some business logic to compare the values. It simplifies the comparison, reducing the result to a scale from 0 to 4, where the value 4 indicates a very close or exact match.

I'll use the DIFFERENCE() function to compare the words To and Two:

 SELECT DIFFERENCE('To', 'Two')

The result is 4, indicating a very close or exact match.

Using the DIFFERENCE() function to compare Redmond with Renton, as follows,

 SELECT DIFFERENCE ('Redmond', 'Renton')

returns 3, meaning a similar but not-so-close match.