Performing FULLTEXT Searches


In Chapter 4, the LIKE keyword was introduced as a way to perform somewhat simple string matches like

 SELECT * FROM users WHERE last_name  LIKE 'Smith%' 

This type of conditional is effective enough but is still very limiting. For example, it would not allow you to do Google-like searches using multiple words. For those kinds of situations, you need FULLTEXT searches.

FULLTEXT searches require a FULLTEXT index, which you'll create next. Then you'll learn the special SQL syntax for performing these searches. Finally, you'll read about some capabilities added in MySQL 4.0.1 to make your FULLTEXT searches more exacting.

Creating a FULLTEXT Index

The FULLTEXT index is just a special type of index necessary to run FULLTEXT searches. You can create one on any CHAR, VARCHAR, or TEXT column in a MyISAM table.

 CREATE TABLE comments ( comment_id INT UNSIGNED NOT NULL  AUTO_INCREMENT, subject VARCHAR(100), comment TEXT, PRIMARY KEY (comment_id), FULLTEXT (comment) ) 

In this chapter's example, I'll add a FULLTEXT index to the urls table, on the description column. Since that table already exists, I'll use an ALTER command.

To create a FULLTEXT index

1.

Open the mysql client and select the contents database.

2.

Add the FULLTEXT index (Figure 5.33).

 ALTER TABLE urls ADD FULLTEXT INDEX  (description); 

Figure 5.33. The FULLTEXT index is added to the urls table.


This syntax is familiar to those used earlier to create standard and unique indexes.

3.

Confirm the table's structure by viewing its CREATE syntax (Figure 5.34).

 SHOW CREATE TABLE urls \G 

Figure 5.34. Viewing the table's creation statement reveals the FULLTEXT index, among other things.


You've already seen a couple of SHOW commands by now, so hopefully this one isn't too surprising. The SHOW CREATE TABLE tablename command asks MySQL to print out the SQL command used to create (or recreate) the named table. Using the \G trick (mentioned in a tip in the previous section) makes the results are easier to view.

Tips

  • Inserting records into tables with FULLTEXT indexes can be much slower because of the complex index that's required.

  • You can add FULLTEXT indexes on multiple columns, if those columns will all be used in searches.


Performing Basic FULLTEXT Searches

Once you've established a FULLTEXT index on a column, you can start querying against it, using the MATCH and AGAINST functions.

 SELECT * FROM tablename WHERE MATCH  (tablename WHERE MATCH  ( keyword2) 

MySQL will return matching rows in order of a mathematically-calculated relevance, just like a search engine. When doing so, certain rules apply:

  • Strings are broken down into their individual keywords.

  • Keywords less than four characters long are ignored.

  • Very popular words, called stopwords, are ignored.

  • If more than fifty percent of the records match the keywords, no records are returned.

This last fact is problematic to many users as they begin with FULLTEXT searches and wonder why no results are retrieved. When you have a sparsely populated table, there just won't be sufficient records for MySQL to return relevant results.

To perform FULLTEXT searches

1.

Thoroughly populate the urls table, focusing on adding lengthy, descriptive definitions.

Once again, SQL INSERT commands can be downloaded from this book's corresponding Web site.

2.

Run a simple FULLTEXT search on the word security (Figure 5.35).

 SELECT url FROM urls WHERE MATCH  (description) AGAINST('security); 

Figure 5.35. A basic FULLTEXT search.


This is a very simple example that will return some results as long as at least one and less than fifty percent of the records in the urls table have the word security in their description.

3.

Run the same FULLTEXT search while also showing the relevance (Figure 5.36).

 SELECT url, MATCH(description)  AGAINST('security) AS R FROM  urls WHERE MATCH(description)  AGAINST('security); 

Figure 5.36. You can also select a FULLTEXT search's relevance in your query.


If you use the same MATCH...AGAINST expression as a selected value, the actual relevance will be returned.

4.

Run a FULLTEXT search using multiple keywords (Figure 5.37).

 SELECT url FROM urls WHERE MATCH  (description) AGAINST('mysql  database); 

Figure 5.37. Using the FULLTEXT search, you can easily find descriptions that contain multiple keywords.


With this query, a match will be made only if a description contains both mysql and database.

Tips

  • Remember that if a FULLTEXT search returns no records, this means that either no matches were made or that over half of the records match.

  • For sake of simplicity, I wrote all of the queries in this section as simple SELECT statements. You can certainly use FULLTEXT searches within joins or more complex queries.

  • MySQL comes with several hundred stopwords already defined. These are part of the application's source code.

  • The minimum keyword lengthfour characters by defaultis a configuration setting you can change in MySQL.

  • FULLTEXT searches are case-insensitive by default.


Performing Boolean FULLTEXT Searches

The basic FULLTEXT search is nice, but a more sophisticated FULLTEXT search can be accomplished using its Boolean mode. This feature has been present in MySQL since version 4.0.1.

Boolean mode uses a number of operators (Table 5.9) to dictate how each keyword is treated. With the operators you also use the phrase IN BOOLEAN MODE in your query:

 SELECT * FROM tablename WHERE MATCH(column) AGAINST('+database  -mysql IN BOOLEAN MODE) 

Table 5.9. Use these operators to fine-tune your FULLTEXT searches.

Boolean Mode Operators

OPERATOR

MEANING

+

Must be present in every match

-

Must not be present in any match

~

Lowers a ranking if present

*

Wildcard

<

Decrease a word's importance

>

Increase a word's importance

""

Must match the exact phrase

()

Create subexpressions


In that example, a match will be made if the word database is found and mysql is not present. Alternatively, the tilde (~) is used as a milder form of the minus sign, meaning that the keyword can be present in a match, but such matches should be considered less relevant.

The wildcard character (*) matches variations on a word, so cata* matches catalog, catalina, and so on. Two operators explicitly state what keywords are more (>) or less (<) important. Finally, you can use double quotation marks to hunt for exact phrases and parentheses to make subexpressions.

The following query would look for records with the phrase Web develop with the word html being required and the word JavaScript detracting from a match's relevance:

 SELECT * FROM tablename WHERE MATCH(column) AGAINST('>"Web develop"  +html ~JavaScript IN BOOLEAN MODE) 

When using Boolean mode, there are several differences as to how FULLTEXT searches work:

  • If a keyword is not preceded by an operator, the word is optional but a match will be ranked higher if it is present.

  • Results will be returned even if more than fifty percent of the records match the search.

  • The results are not automatically sorted by relevance.

Because of this last fact, you'll also want to sort the returned records by their relevance, as I'll demonstrate in the next sequence of steps. One important rule that's the same with Boolean searches is that the minimum word length (four characters by default) still applies. So trying to require a shorter word using a plus sign (+php) still won't work.

To perform FULLTEXT Boolean searches

1.

Run a simple FULLTEXT search on variations of the word develop (Figure 5.38).

 SELECT url, description FROM  urls WHERE MATCH(description)  AGAINST('develop*' IN BOOLEAN  MODE) \G 

Figure 5.38. A simple Boolean mode FULLTEXT search.


This query will find records that contain develop, developer, development, etc. To confirm this, I'm also selecting the actual description column, and I'm using the \G TRick to make the result easier to view.

2.

Find matches involving secure programming, with the emphasis on security (Figure 5.39).

 SELECT url FROM urls WHERE MATCH  (description) AGAINST('>secur*  +program*' IN BOOLEAN MODE); 

Figure 5.39. This search looks for variations on two different keywords, ranking the one higher than the other.


This query first finds all records that have both secur* (secure, security, …) and program* (program, programmer, programming, …) in them. Then the results are ranked, with secur* outweighing program*.

3.

Run the same query but in order of relevance (Figure 5.40).

 SELECT url, MATCH(description)  AGAINST('>secur* +program*' IN  BOOLEAN MODE) AS R FROM urls WHERE  MATCH(description) AGAINST  ('>secur* +program*' IN BOOLEAN  MODE) ORDER BY R DESC; 

Figure 5.40. When doing Boolean mode searches, you must sort the returned results yourself (compare with Figure 5.39).


As I mentioned earlier, Boolean mode searches do not return the results in order. To make that happen, you'll need to also select the relevance and add an ORDER BY that sorts using those values.



    PHP and MySQL for Dynamic Web Sites. Visual QuickPro Guide
    PHP and MySQL for Dynamic Web Sites: Visual QuickPro Guide (2nd Edition)
    ISBN: 0321336577
    EAN: 2147483647
    Year: 2005
    Pages: 166
    Authors: Larry Ullman

    flylib.com © 2008-2017.
    If you may any questions please contact us: flylib@qtcs.net