Hack 15. Search for Keywords Without LIKE

You can do a simple keyword search using the LIKE operator. Unfortunately, this can be slow. Fortunately, an efficient keyword search is available in many systems.

Often you must store large chunks of text in a table. For example, suppose you have a table called story, which contains the author of a story and the story itself:

CREATE TABLE story (
 author varchar(100),
 body varchar(1000)
);
INSERT INTO story (author,body) VALUES('Atzeni'
 ,'Many database systems, through the use of SQL,↵
 are wonderful at collating...');
INSERT INTO story (author,body) VALUES('Adams'
 ,'The definitions involved in understanding SQL databases are big.↵
 You may have thought the distance from your chair to the fridge↵
 was big, but that''s peanuts compared to standard definitions.');
INSERT INTO story (author,body) VALUES('Russell and Cumming'↵
 ,'Often you must store large chunks of text in a table.');

If you wanted to find out which body has the phrase "database system" in it, you could do the following:

SELECT author FROM story
 WHERE body LIKE '%database system%'

This accurately returns matches where the exact match is found. However, for some text searches, partial matches would also be useful, as well as common roots (such as "system" and "systems") and result weighting (a higher score for "database system" than for "the database used a system").

In Oracle, the LIKE operator is case sensitive. If you want to do a case-insensitive search you can force the value into lowercase:

SELECT author FROM story WHERE LOWER(body) LIKE '%database system%'

The LIKE clause forces the database system to do a linear scan of the text fields in order to find the words of interest, and therefore performance will be slow. What you really need is an index on the words in these text strings. The FULLTEXT construct supports this type of indexing. It has other advantages as well: it can use a natural language engine to aid the matching algorithm, and it can return the quality of the match (rather than just trUE or FALSE).

3.1.1. MySQL

In order to do a FULLTEXT pattern match, you must first create a FULLTEXT index:

ALTER TABLE story ADD FULLTEXT(body);

Now that the index is created, you can perform the query:

SELECT author FROM story
WHERE MATCH (body) AGAINST ('database systems');

MATCH returns a floating-point number, where 0.0 is irrelevant and higher numbers indicate an increasingly better match quality. You can specify the match quality in the SELECT line:

mysql> SELECT author, MATCH (body) AGAINST ('database systems')
 -> FROM story
 -> ORDER BY 2 DESC;
+---------------------+-------------------------------------------+
| author | MATCH (body) AGAINST ('database systems') |
+---------------------+-------------------------------------------+
| Atzeni | 1.3253291845322 |
| Adams | 0 |
| Russell and Cumming | 0 |
+---------------------+-------------------------------------------+

By default, words are not indexed unless they are at least four characters long, are composed of characters from only a particular range, and are not too popular (popular words are those that appear in more than 50 percent of the rows). Also, query elements using "filler words" are silently ignored. All of these limits are administrator configurable (see http://dev.mysql.com/doc/refman/5.1/en/fulltext-fine-tuning.html).

MySQL also has an IN BOOLEAN MODE text-searching capability. This can work without a FULLTEXT index (but it might be much slower if you use it this way). It modifies the behavior of AGAINST to allow Boolean tests to be defined. It does not give scores other than 1 or 0. You could use this to perform the search:

mysql> SELECT author,
 -> MATCH (body) AGAINST ('+database +systems' IN BOOLEAN MODE)
 ->  AS SCORE
 -> FROM story
 -> ORDER BY 2 DESC;
+---------------------+-------+
| author | SCORE |
+---------------------+-------+
| Atzeni | 1 |
| Adams | 0 |
| Russell and Cumming | 0 |
+---------------------+-------+

3.1.2. PostgreSQL

To get full text searching in PostgreSQL, you need to use the Tsearch2 module. A more detailed guide on how to do this is available from devx (http://www.devx.com/opensource/Article/21674).

To install Tsearch2 (from a source-code install) go to your source directory for PostgreSQL and type the following at the Linux or Unix shell prompt (you may need to be root for the install step):

$ cd contrib/tsearch2
$ make
$ make install

To use Tsearch2 in a particular database, you need to issue this command:

$ psql dbname < tsearch2.sql

tsearch2.sql should be in your install directory's contrib directory (for instance, /usr/local/pgsql/share/contrib/tsearch2.sql). If you encounter permission errors you might be better off using the Postgres user account for this procedure.

The script creates a number of helper tables, all of which should have GRANTs to allow the required users to access the tables. These tables are pg_ts_cfg, pg_ts_cfgmap, pg_ts_dict, and pg_ts_parser. If you want to try things out you can just continue to use the Postgres user account.

To use this new searching capability, you need to add a column to the tables to be searched (to hold some system vector data concerning the field to be searched), add an index, and prepare the new column for searching:

ALTER TABLE story ADD COLUMN vectors tsvector;
CREATE INDEX story_index 
 ON story USING gist(vectors);
SELECT set_curcfg('default');
UPDATE story
 SET vectors = to_tsvector(body);

Finally, you can perform your search:

dbname=> SELECT author,rank (vectors,q)
dbname-> FROM story, to_tsquery('database&systems') AS q
dbname-> ORDER BY rank(vectors,q) DESC;
 author | rank
---------------------+-----------
 Atzeni | 0.0991032
 Adams | 1e-20
 Russell and Cumming | 1e-20

3.1.3. SQL Server

Implementation of full text searching in SQL Server utilizes the Microsoft Search Engine. This is external to the database and has to be configured separately first. Part of this configuration requires you to specify a location for saving the full text indexes. Because these indexes are stored outside the normal database structure, you need to remember to back these up separately when you are doing a database backup.

Make sure you have the Microsoft Search Engine installed on your machine. Then, using SQL Server Management Studio (which is the current name for the Enterprise Manager), expand the nodes by selecting Databasesimages/U2192.jpg border=0> Storage. Right-click on Full-Text Catalog and select New Full-Text Catalog. You will be requested for the filename and location to use for the new catalog.

If you are using the Express edition of SQL Server, you will have to download SQL Server Management Studio separately.

To build and use the index on a table, right-click on the table and choose Full-Text Index Table and then Define Full-Text Indexing on a Table. The Full-Text Wizard will start, which requires you to specify the following: a unique index name, the columns to index, the catalog in which to store the index, and the schedule on which you want the index to be rebuilt. Note that if you did not create a catalog as described earlier, you can create one from this wizard.

With the index defined, you still need to populate the index. Right-click on the table, and choose Full-Text Index Table and then Start Full Population; the index will be built using the table's current data set. You have to repeat this whenever the table is modified.

With the index built, you can query it with FREETEXT. Alternatively, you can use FREETEXTTABLE, which will return the answer as a table object. You use FREETEXT as an operator, along with the TABLE version, if you want to use the result of the search directly in a JOIN:

SELECT author
FROM story
WHERE FREETEXT(body,'database systems')

You can use CONTAINS (and the associated CONTAINSTABLE) as an alternative to FREETEXT. It offers more flexibility and a considerable number of extended options, allowing a wide range of ways to weight the match. This includes how far or near a search term is from another term. The details are available at the MSDN library at http://msdn2.microsoft.com/en-us/library/ms189760.aspx.

3.1.4. Oracle

In Oracle, you have many different extensions and options available for text string indexing. You can find one of the best, simple tutorials on this at http:// www.oracle.com/technology/oramag/oracle/04-sep/o54text.html.

To use the indexing in Oracle, the user who will be maintaining the index must have ctxapp rights (permission to use Oracle's text features). He also needs to set up a lexer preference. Different lexers are available for different languages (such as English and French):

GRANT ctxapp to andrew;

BEGIN CTX_DDL.CREATE_PREFERENCE(
'english_lexer','basic_lexer');

CTX_DDL.SET_ATTRIBUTE(
'english_lexer','index_themes',
'no');

Once this is complete, the user can create an index for body:

CREATE INDEX song_index ON story(body)
 INDEXTYPE IS CTXSYS.CONTEXT
 PARAMETERS('LEXER english_lexer 
 STOPLIST ctxsys.default_stoplist');

You can use this index in your SQL query:

SELECT author
FROM story
WHERE contains(body,'database systems',1) > 0;

SQL Fundamentals

Joins, Unions, and Views

Text Handling

Date Handling

Number Crunching

Online Applications

Organizing Data

Storing Small Amounts of Data

Locking and Performance

Reporting

Users and Administration

Wider Access

Index