Using a FULLTEXT Search with Short Words

4.13.1 Problem

FULLTEXT searches for short words return no records.

4.13.2 Solution

Change the indexing engine's minimum word length parameter.

4.13.3 Discussion

In a text like the KJV, certain words have special significance, such as "God" and "sin." However, if you perform FULLTEXT searches on the kjv table for those words using a MySQL 3.23 server, you'll observe a curious phenomenonboth words appear to be missing from the text entirely:

mysql> SELECT COUNT(*) FROM kjv WHERE MATCH(vtext) AGAINST('God');
+----------+
| COUNT(*) |
+----------+
| 0 |
+----------+
mysql> SELECT COUNT(*) FROM kjv WHERE MATCH(vtext) AGAINST('sin');
+----------+
| COUNT(*) |
+----------+
| 0 |
+----------+

One property of the indexing engine is that it ignores words that are "too common" (that is, words that occur in more than half the records). This eliminates words such as "the" or "and" from the index, but that's not what is going on here. You can verify that by counting the total number of records, and by using SQL pattern matches to count the number of records containing each word:[5]

[5] The use of COUNT( ) to produce multiple counts from the same set of values is described in Recipe 7.2.

mysql> SELECT COUNT(*) AS 'total verses',
 -> COUNT(IF(vtext LIKE '%God%',1,NULL)) AS 'verses containing "God"',
 -> COUNT(IF(vtext LIKE '%sin%',1,NULL)) AS 'verses containing "sin"'
 -> FROM kjv;
+--------------+-------------------------+-------------------------+
| total verses | verses containing "God" | verses containing "sin" |
+--------------+-------------------------+-------------------------+
| 31102 | 4118 | 1292 |
+--------------+-------------------------+-------------------------+

Neither word is present in more than half the verses, so sheer frequency of occurrence doesn't account for the failure of a FULLTEXT search to find them. What's really happening is that by default, the indexing engine doesn't include words less than four characters long. On a MySQL 3.23 server, there's nothing you can do about that (at least, nothing short of messing around with the MySQL source code and recompiling). As of MySQL 4.0, the minimum word length is a configurable parameter, which you can change by setting the ft_min_word_len server variable. For example, to tell the indexing engine to include words containing three or more characters, add a set-variable line to the [mysqld] group of the /etc/my.cnf file (or whatever option file you put server settings in):

[mysqld]
set-variable = ft_min_word_len=3

After making this change and restarting the server, rebuild the FULLTEXT index to take advantage of the new setting:

mysql> ALTER TABLE kjv DROP INDEX vtext;
mysql> ALTER TABLE kjv ADD FULLTEXT (vtext);

Then try out the new index to verify that it includes shorter words:

mysql> SELECT COUNT(*) FROM kjv WHERE MATCH(vtext) AGAINST('God');
+----------+
| COUNT(*) |
+----------+
| 3878 |
+----------+
mysql> SELECT COUNT(*) FROM kjv WHERE MATCH(vtext) AGAINST('sin');
+----------+
| COUNT(*) |
+----------+
| 389 |
+----------+

That's better!

But why do the MATCH( ) queries find 3878 and 389 records, whereas the earlier LIKE queries find 4118 and 1292 records? That's because the LIKE patterns match substrings and the FULLTEXT search performed by MATCH( ) matches whole words.

Using the mysql Client Program

Writing MySQL-Based Programs

Record Selection Techniques

Working with Strings

Working with Dates and Times

Sorting Query Results

Generating Summaries

Modifying Tables with ALTER TABLE

Obtaining and Using Metadata

Importing and Exporting Data

Generating and Using Sequences

Using Multiple Tables

Statistical Techniques

Handling Duplicates

Performing Transactions

Introduction to MySQL on the Web

Incorporating Query Resultsinto Web Pages

Processing Web Input with MySQL

Using MySQL-Based Web Session Management

Appendix A. Obtaining MySQL Software

Appendix B. JSP and Tomcat Primer

Appendix C. References



MySQL Cookbook
MySQL Cookbook
ISBN: 059652708X
EAN: 2147483647
Year: 2005
Pages: 412
Authors: Paul DuBois

Flylib.com © 2008-2020.
If you may any questions please contact us: flylib@qtcs.net