If you've ever used Google to search for a web site (and who hasn't), you've used a full-text search engine. A full-text search engine catalogs the words and phrases found in a set of documents and then lets you search for documents that match a given pattern. The tsearch2 contributed module adds full-text search features to a PostgreSQL server. When you add the tsearch2 package to a database, you get a small collection of new data types, a large collection of new functions (most of which you'll never invoke directly), and a new search operator.

When you use tsearch2, the documents that you catalog and search are the values stored in your database. If you have a customers table that stores miscellaneous notes about each customer, the notes column in each row might be a searchable document. If your database includes a parts list, you may want to catalog the description of each part. In tsearch2 terms, a document is a string that you want to catalog and search.

A pattern is a logical collection of words that you want to find. A pattern may be as simple as a single word. If a pattern contains multiple words, you can use pattern operators to define how the words relate to the documents that you want to search. For example, you can create a pattern that matches any document that contains every word in the pattern, a pattern that matches any document that contains any of the words in the pattern, a pattern that matches all documents that contain none of the words in the pattern, or any combination.
Before you try out any of the examples in this section, install tsearch2 into your database and then execute the following command:
movies=# SELECT set_curcfg( 'default' );
 set_curcfg
------------

(1 row)
If you don't call set_curcfg() first, the queries that I show you will fail with the error:
ERROR: could not find tsearch config by locale
I'll explain the set_curcfg() function in a moment, but let's look at a few sample queries first.
The @@ operator is similar in concept to LIKE, ILIKE, or the regular-expression operators (~, ~*, and so on): It compares a pattern against a string. To use the @@ operator, you must convert the pattern into an object of type tsquery and the string you want to compare against into an object of type tsvector. For example, to search the tapes table for titles that include the word "Godfather":
movies=# SELECT tape_id, title FROM tapes WHERE
movies-#   to_tsvector( title ) @@ to_tsquery( 'Godfather' );
 tape_id  |     title
----------+---------------
 AB-12345 | The Godfather
 AB-67472 | The Godfather
(2 rows)
If you want to search for all titles that include Rear and Window, the search pattern should look like this:
to_tsquery( 'Rear & Window' )
To search for all titles that include Rear or Window, separate the words with the | operator, like this:
to_tsquery( 'Rear | Window' )
To search for all titles that include the word Rear, but not the word Window, use the ! operator:
to_tsquery( 'Rear & ! Window' )
You can combine operators to create complex patterns. Use parentheses to group expressions. For example, this pattern
to_tsquery( 'Island & (Earth | Gilligan)' )
will match This Island Earth and Escape From Gilligan's Island, but not Escape From Devil's Island.
It's easy to confuse the tsvector and tsquery data types since they appear to be so similar. They are, in fact, very different. A tsvector contains a catalog of the words that appear in a document. A tsquery contains a search pattern, not just a list of words. In fact, if you try to create a tsquery from a string that contains a list of words, tsearch2 will reward you with an error message (ERROR: syntax error). Instead, you have to tell tsearch2 how those words relate to the document that you're searching; you have to separate multiple words with a tsquery operator (&, |, or !).
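A quick experiment shows the difference (the exact tsvector output depends on your configuration):

movies=# SELECT to_tsvector( 'Rear Window' );
     to_tsvector
---------------------
 'rear':1 'window':2
(1 row)

movies=# SELECT to_tsquery( 'Rear Window' );
ERROR:  syntax error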
Of course, if pattern matching were the only feature offered by tsearch2, it wouldn't be very exciting; after all, you can match patterns with LIKE/ILIKE or the regular-expression operators. tsearch2 offers three distinct advantages over the other pattern-matching mechanisms offered by PostgreSQL: stop words, stemming, and indexing.
A stop word is a word that tsearch2 automatically ignores when searching and indexing. If you've ever Googled for a phrase like the world's best margarita recipe, you may have noticed the following reply:

"the" is a very common word and was not included in your search.

"The" is a stop word; it's a word that would match just about every (English) document ever searched (other examples include "a," "and," "not," "some," and so on). Stop words take up extra space (and slow down searches) without contributing to the task of identifying interesting documents. When you convert a search pattern into a tsquery object, tsearch2 strips out any stop words that it finds in the pattern.
Stemming is the process of identifying word variations. When you convert a search pattern into a tsquery object, to_tsquery() replaces each word in the pattern with its stem. For example, donate, donation, donating, donates, and donated are all variations of the same word. If to_tsquery() finds one of those variants in a search pattern, it replaces each occurrence with the stem: donat. That means that a search for donate will match documents that contain any variant of donate.
Of course, if you stem a search pattern, you must also stem the documents that you are searching through. to_tsvector() removes stop words and stems word variations using the same set of rules used by to_tsquery().
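For example (again, exact output depends on your configuration), to_tsvector() drops the stop word the and folds both variants of donate down to the stem donat:

movies=# SELECT to_tsvector( 'the donated donations' );
 to_tsvector
-------------
 'donat':2,3
(1 row)

movies=# SELECT to_tsquery( 'donates & donating' );
    to_tsquery
-------------------
 'donat' & 'donat'
(1 row)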
The process of stemming, stopping, and cataloging the words in a document is expensive. When you execute a query that invokes to_tsvector() like this:
SELECT tape_id, title FROM tapes WHERE to_tsvector( title ) @@ to_tsquery( 'Godfather' );
the PostgreSQL server will stem, stop, and catalog every row in the table. If you execute the same query (or a similar query) again, the server has to stem, stop, and catalog every row a second time. You can greatly improve performance by building a tsvector for each document at the time you add the document to the database.
The movies sample database that we've been using in most of this book doesn't really contain enough data to thoroughly exercise tsearch2, but the recalls table presented in Chapter 4, "Performance," does. The recalls table (in the perf database) contains 39,241 rows of information about automobile recalls. Each row contains three large VARCHAR fields that contain a description of a defect, the possible consequences of the defect, and the corrective action promised by the manufacturer.
To demonstrate the performance benefits offered by tsearch2, I'll use tsearch2 to count the number of recalls that contain the word hydraulic (in the desc_defect column).
It takes my computer approximately 6.4 seconds to execute the following query:
SELECT COUNT(*) FROM recalls WHERE to_tsvector( desc_defect ) @@ to_tsquery( 'hydraulic' );
That query stems, stops, and catalogs every desc_defect in the recalls table, and identifies 808 rows that match the given pattern. If I execute the same query repeatedly, each iteration takes (approximately) the same amount of time.
Just for purposes of comparison, it takes approximately 2.7 seconds to perform a similar query using a regular-expression:
SELECT COUNT(*) FROM recalls WHERE desc_defect ~* 'hydraulic';
Again, if I execute the same query repeatedly, each iteration takes roughly the same amount of time.
Why does a tsearch2-based search take nearly two and a half times longer than a regular-expression search? Because the to_tsvector() function is stemming, stopping, and cataloging every word found in the desc_defect column. The regular-expression search simply scans through the desc_defect column and stops as soon as it finds the word hydraulic. In fact, the tsearch2 and the regular-expression search identify a different set of matches (because of the stemming rules and parsing rules used by tsearch2).
It's usually a waste of time to stem, stop, and catalog every row for each tsearch2 query, because the vast majority of the documents remain unchanged from query to query. Instead, I'll add a tsvector column to my recalls table and "precompute" the stem, stop, and catalog information required for a tsearch2 query:
perf=# ALTER TABLE recalls
perf-#   ADD COLUMN fts_desc_defect TSVECTOR;
ALTER TABLE
perf=# UPDATE recalls SET
perf-#   fts_desc_defect = to_tsvector( desc_defect );
UPDATE 39241
perf=# VACUUM FULL ANALYZE recalls;
VACUUM
Now I can run a tsearch2 query again, but this time, I search the new fts_desc_defect column instead:
SELECT COUNT(*) FROM recalls WHERE fts_desc_defect @@ to_tsquery( 'hydraulic' );
Notice that I don't have to convert fts_desc_defect into a tsvector because it already is a tsvector. This query identifies the same set of rows selected by the first query, but this query runs in 0.22 seconds. That's a considerable improvement over the original query (6.4 seconds). But I can make it faster still.
tsearch2 provides the infrastructure required to index tsvector values. To create an index that @@ can use:
perf=# CREATE INDEX fti_desc_defect ON recalls
perf-#   USING GIST( fts_desc_defect );
CREATE INDEX
perf=# VACUUM FULL ANALYZE recalls;
VACUUM
Now when I search the fts_desc_defect column (using the same query), it takes less than .04 seconds (4/100ths of a second) to identify the same set of rows. PostgreSQL uses the index on fts_desc_defect to read only those rows that match the search pattern.
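You can verify that the planner has chosen the index by prefixing the query with EXPLAIN. The plan looks something like this (a sketch only; I've abbreviated the cost estimates, which vary from system to system):

perf=# EXPLAIN SELECT COUNT(*) FROM recalls
perf-#   WHERE fts_desc_defect @@ to_tsquery( 'hydraulic' );
                            QUERY PLAN
-------------------------------------------------------------------
 Aggregate  (cost=...)
   ->  Index Scan using fti_desc_defect on recalls  (cost=...)
         Index Cond: (fts_desc_defect @@ '''hydraul'''::tsquery)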
At this point, I've improved performance, but I've introduced a bug. If I search the desc_defect column, PostgreSQL has to stem, stop, and catalog every row in the recalls table (every time I search) and I get poor performance. If I search the pre-cataloged fts_desc_defect column, I get good performance. But what happens if I add a new row to the recalls table? Or UPDATE an existing row (and change the words in the desc_defect column)? Searching against desc_defect guarantees that I'll see the most recent data. Searching against fts_desc_defect means that I may be searching obsolete data. Fortunately, this problem is easy to fix; in fact, there are two different solutions.
The most obvious way to keep fts_desc_defect up-to-date is to create a trigger that recomputes the tsvector whenever I add a new row or update an existing row. tsearch2 even comes with a function that you can use to implement the trigger:
perf=# CREATE TRIGGER tg_fts_recalls
perf-#   BEFORE UPDATE OR INSERT ON recalls
perf-#   FOR EACH ROW
perf-#   EXECUTE PROCEDURE tsearch2( fts_desc_defect, desc_defect );
CREATE TRIGGER
The tsearch2() function expects two arguments: the name of a tsvector column and the name of a text (or other string-valued) column. The trigger will effectively call to_tsvector( desc_defect ) and copy the result into the fts_desc_defect column.
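The trigger function isn't limited to a single source column. If I wanted one tsvector to cover several columns, I could list each source column in the trigger definition and tsearch2() would concatenate them before cataloging (a sketch, assuming a hypothetical tsvector column named fts_all that I haven't actually added to recalls):

perf=# CREATE TRIGGER tg_fts_recalls_all
perf-#   BEFORE UPDATE OR INSERT ON recalls
perf-#   FOR EACH ROW
perf-#   EXECUTE PROCEDURE tsearch2( fts_all, desc_defect, con_defect, cor_action );
CREATE TRIGGER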
I can test this trigger pretty easily:
perf=# SELECT COUNT(*) FROM recalls
perf-#   WHERE fts_desc_defect @@ to_tsquery( 'hydraulic' );
 count
-------
   808
(1 row)

perf=# UPDATE recalls
perf-#   SET desc_defect = 'busted hydraulic line'
perf-#   WHERE record_id = 4909;
UPDATE 1

perf=# SELECT COUNT(*) FROM recalls
perf-#   WHERE fts_desc_defect @@ to_tsquery( 'hydraulic' );
 count
-------
   809
(1 row)
Another way to solve this problem is to drop the fts_desc_defect column; you don't need it to gain the benefits offered by pre-cataloging and indexing. Instead of adding an fts_desc_defect column and then creating an index that covers that column, just create a function-based index. First, I'll clean out the tsvector column that I added earlier:
perf=# DROP TRIGGER tg_fts_recalls ON recalls;
DROP TRIGGER
perf=# DROP INDEX fti_desc_defect;
DROP INDEX
perf=# ALTER TABLE recalls DROP COLUMN fts_desc_defect;
ALTER TABLE
Now I'll create a new function-based index:
perf=# CREATE INDEX fti_desc_defect ON recalls
perf-#   USING GIST( to_tsvector( desc_defect ));
CREATE INDEX
As you might expect, it takes a while to create the function-based index. PostgreSQL reads through every row in the recalls table; invokes the to_tsvector() function to stem, stop, and catalog the desc_defect column; and then stores the result in the new index. Of course, if I add a new row to the recalls table (or update an existing row), PostgreSQL ensures that the index is kept up-to-date.
By creating a function-based index (or, more properly, an expression-based index), I eliminate the need for a trigger, I can get rid of the extra tsvector column (and save quite a bit of space), and I still get the performance boost offered by a pre-cataloged index.
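One point worth remembering: PostgreSQL considers an expression-based index only when a query's WHERE clause contains the same expression that defines the index. Since this index covers to_tsvector( desc_defect ), a query must repeat that expression to benefit:

SELECT COUNT(*) FROM recalls
  WHERE to_tsvector( desc_defect ) @@ to_tsquery( 'hydraulic' );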
Searching Multiple Columns
The recalls table contains three VARCHAR fields that we might want to search: desc_defect (a description of the defect), con_defect (possible consequences of the defect), and cor_action (the corrective action promised by the manufacturer). I could search all three columns using a query such as
perf=# SELECT COUNT(*) FROM recalls WHERE
perf-#   to_tsvector(desc_defect) @@ to_tsquery( 'hydraulic' )
perf-#   OR
perf-#   to_tsvector(con_defect) @@ to_tsquery( 'hydraulic' )
perf-#   OR
perf-#   to_tsvector(cor_action) @@ to_tsquery( 'hydraulic' );
 count
-------
   902
(1 row)
That works, but it's not a simple query to write. Instead of searching through each column individually, I can string all three columns together and search through the concatenation:
perf=# SELECT COUNT(*) FROM recalls WHERE
perf-#   to_tsvector( desc_defect || con_defect || cor_action )
perf-#   @@ to_tsquery( 'hydraulic' );
 count
-------
   902
(1 row)
Unfortunately, the simplicity of this query is misleading: It works for the recalls table, but it won't produce correct results if your documents contain any NULL values. (The document columns in recalls contain no NULL values.) A NULL value wonks out this query because the concatenation operator (||) returns NULL whenever either operand is NULL. In short: 'hydraulic line' || NULL || 'hydraulic piston' evaluates to NULL. That means that a NULL value in desc_defect, con_defect, or cor_action would effectively hide the other values in that row. To fix this problem, I can rewrite the query using the COALESCE() function to map NULL values into some other value (in this case, an empty string):
perf=# SELECT COUNT(*) FROM recalls WHERE
perf-#   to_tsvector(
perf(#     COALESCE( desc_defect, '' ) || ' ' ||
perf(#     COALESCE( con_defect, '' ) || ' ' ||
perf(#     COALESCE( cor_action, '' ))
perf-#   @@ to_tsquery( 'hydraulic' );
 count
-------
   902
(1 row)
This query takes about 15 seconds on my computer. (The OR version takes the same amount of time.) So much for simplicity; the OR version was easier to write and easier to understand.
However, I've been leading you down this tortuous path for a good reason. Remember that in the previous section I showed you how to create an expression-based index? I can create an index defined by the rather complex expression that I wrote in that last query, and PostgreSQL will use that index to search for patterns in desc_defect, con_defect, and cor_action.
perf=# CREATE INDEX fti_recalls ON recalls USING GIST(
perf(#   to_tsvector(
perf(#     COALESCE( desc_defect, '' ) || ' ' ||
perf(#     COALESCE( con_defect, '' ) || ' ' ||
perf(#     COALESCE( cor_action, '' ))));
CREATE INDEX
Now, when I run the previous query (the one with all the COALESCE noise in it), I see the results in 0.44 seconds.
Simplifying tsearch2 with Customized Functions
tsearch2 queries tend to be rather unwieldy. You can simplify tsearch2 by creating a few wrapper functions that hide the details of the complicated queries. For example, I can create a function named documents() that will return the (properly coalesced) concatenation of desc_defect, con_defect, and cor_action:
perf=# CREATE FUNCTION documents( recall recalls ) RETURNS TEXT AS
perf-# $$
perf$#   SELECT
perf$#     COALESCE( $1.desc_defect, '' ) || ' ' ||
perf$#     COALESCE( $1.con_defect, '' ) || ' ' ||
perf$#     COALESCE( $1.cor_action, '' );
perf$# $$ LANGUAGE 'SQL' IMMUTABLE;
CREATE FUNCTION
I can also define a function named document() (singular this time) that converts documents() into a tsvector:
perf=# CREATE FUNCTION document( recall recalls ) RETURNS TSVECTOR AS
perf-# $$
perf$#   SELECT to_tsvector( documents( $1 ));
perf$# $$ LANGUAGE 'SQL' IMMUTABLE;
CREATE FUNCTION
Now I can use the document() function in conjunction with the @@ operator:
perf=# SELECT COUNT(*) FROM recalls r
perf-#   WHERE document(r) @@ to_tsquery( 'hydraulic' );
 count
-------
   902
(1 row)
That's much easier to type in and much easier to read. I can even create a function-based index based on document():
perf=# DROP INDEX fti_recalls;
DROP INDEX
perf=# CREATE INDEX fti_recalls ON recalls
perf-#   USING GIST( document( recalls ));
CREATE INDEX
The document() function returns a tsvector based on desc_defect, con_defect, and cor_action. The only thing that I can do with a tsvector is search it, so I may as well add another function that simplifies the search:
perf=# CREATE FUNCTION find_recalls( pattern text ) RETURNS SETOF recalls AS
perf-# $$
perf$#   SELECT * FROM recalls WHERE
perf$#     to_tsvector(
perf$#       COALESCE( desc_defect, '' ) || ' ' ||
perf$#       COALESCE( con_defect, '' ) || ' ' ||
perf$#       COALESCE( cor_action, '' ))
perf$#     @@ to_tsquery( $1 );
perf$# $$ LANGUAGE 'SQL';
CREATE FUNCTION
Now I can simply invoke find_recalls() to search for a pattern in desc_defect, con_defect, or cor_action:
perf=# SELECT COUNT(*) FROM find_recalls( 'hydraulic' );
 count
-------
   902
(1 row)
Searching for Phrases
You can't use tsearch2 to search for phrases: tsearch2 catalogs the individual words in a document, but doesn't keep enough information to know when one word directly follows another. You can't use tsearch2, for example, to search for a phrase such as hydraulic line. (Enclosing the phrase in quotes won't help.) You can search for a pattern such as hydraulic & line, but that will match all documents that contain both words, even if line appears before hydraulic or if hydraulic is separated from line by a number of other words.
To search for a phrase, you have to use LIKE, ILIKE, or a regular-expression operator. For example
perf=# SELECT COUNT(*) FROM recalls
perf-#   WHERE documents(recalls) ~* 'hydraulic line';
 count
-------
    53
(1 row)
But you can still use tsearch2 to speed up a phrase search. It stands to reason that any document that contains the phrase hydraulic line will contain the individual words hydraulic and line, right? Looking at it the other way around, a document cannot contain the phrase hydraulic line unless it contains the individual words hydraulic and line. You already know how to identify the set of rows that contain the words hydraulic and linejust add AND document(recalls) @@ to_tsquery( 'hydraulic & line' ) to the WHERE clause.
The original version of this query (regular-expression only) takes about 8.2 seconds to run on my computer. By adding tsearch2 to a regular-expression based phrase search, I can drastically reduce the number of rows that the server will have to search. The new query looks like this:
perf=# SELECT COUNT(*) FROM recalls
perf-#   WHERE documents(recalls) ~* 'hydraulic line'
perf-#   AND document(recalls) @@ to_tsquery('hydraulic & line');
 count
-------
    53
(1 row)
The new version takes 0.12 seconds to identify the same set of rows. I can use the timer utility (described in Chapter 4) to see what the server does with each query. Here's the timer output from the first (regular-expression only) query:
$ timer "SELECT COUNT(*) FROM recalls
>          WHERE documents(recalls) ~* 'hydraulic line'"
+--------+------+--------+-----------+------+------+--------+----------+------+
|        |          SEQUENTIAL I/O          |           INDEXED I/O           |
|        | scans| tuples | heap_blks |cached| scans| tuples | idx_blks |cached|
|--------+------+--------+-----------+------+------+--------+----------+------+
|recalls |     1|  39241 |      5399 |    0 |     0|      0 |        0 |    0 |
+--------+------+--------+-----------+------+------+--------+----------+------+
You can see that the server evaluated this query by scanning every one of the 39,241 rows in the table.
$ timer "SELECT COUNT(*) FROM recalls
>          WHERE documents(recalls) ~* 'hydraulic line'
>          AND document(recalls) @@ to_tsquery('hydraulic & line');"
+--------+------+--------+-----------+------+------+--------+----------+------+
|        |          SEQUENTIAL I/O          |           INDEXED I/O           |
|        | scans| tuples | heap_blks |cached| scans| tuples | idx_blks |cached|
|--------+------+--------+-----------+------+------+--------+----------+------+
|recalls |     0|      0 |        71 |    5 |     1|    126 |      173 |  298 |
+--------+------+--------+-----------+------+------+--------+----------+------+
The server first uses an index scan (on fti_recalls) to quickly identify the 126 rows that satisfy the tsearch2 criteria and then matches each of those rows against the regular expression.
Configuring tsearch2
Each time you start a new client session, tsearch2 tries to find a configuration that matches the server's locale. tsearch2 configurations are stored in a small collection of tables: pg_ts_cfg, pg_ts_cfgmap, pg_ts_parser, and pg_ts_dict. tsearch2 comes with three predefined configurations: default, default_russian, and simple:
perf=# SELECT * FROM pg_ts_cfg;
     ts_name     | prs_name |    locale
-----------------+----------+--------------
 default         | default  | C
 default_russian | default  | ru_RU.KOI8-R
 simple          | default  |
(3 rows)
To find the proper configuration, tsearch2 searches pg_ts_cfg for a row where the locale column matches your server's locale. If it can't find a matching configuration, you'll see a message stating ERROR: could not find tsearch config by locale. You can find the locale used by your server with the following query:
perf=# SELECT setting FROM pg_settings WHERE name = 'lc_ctype';
   setting
-------------
 en_US.UTF-8
(1 row)
If your server's locale doesn't match any of the locales in pg_ts_cfg, you have four options:

- Specify a configuration name explicitly in each call to to_tsvector() and to_tsquery()

- Call set_curcfg() at the start of every client session

- Clone an existing configuration, labeling the clone with your server's locale

- Create a new configuration from scratch
The first option is simple, but it complicates your code. The to_tsvector() function comes in two flavors[2]. To use the first flavor, you invoke to_tsvector() with a single string argument and it converts that string into a tsvector using the "current" configuration. To use the second flavor, call to_tsvector() with two arguments: the name of a configuration and a string. to_tsvector() will convert the string into a tsvector using the configuration that you specified in the first argument. to_tsquery() comes in two flavors as well.
[2] tsearch2 provides a third flavor for to_tsvector() and to_tsquery(). You must know the OID of a pg_ts_cfg row to use the third form.
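For example, using the two-argument flavors, the first sample query in this section would look like this (assuming you want the default configuration):

SELECT tape_id, title FROM tapes WHERE
  to_tsvector( 'default', title ) @@ to_tsquery( 'default', 'Godfather' );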
The second option is inconvenient and somewhat dangerous. You have to call set_curcfg() in every client session that might use tsearch2; if you've created a trigger or index based on tsearch2, that means you have to call set_curcfg() in any session that could update a cataloged column.
Cloning an existing configuration is often the easiest and safest choice. Cloning an existing configuration is a two-step process. First, you add a new row to the pg_ts_cfg table, then you make a copy of the corresponding entries in the pg_ts_cfgmap table. For example, if your server's locale is en_US.UTF-8, you can clone the default configuration with the following commands:
perf=# INSERT INTO pg_ts_cfg
perf-#   VALUES( 'default_enUS', 'default', 'en_US.UTF-8' );
INSERT
perf=# INSERT INTO pg_ts_cfgmap
perf-#   SELECT 'default_enUS', tok_alias, dict_name
perf-#     FROM pg_ts_cfgmap WHERE ts_name = 'default';
INSERT
The first command creates a new (empty) configuration named default_enUS; tsearch2 will select this configuration when the server's locale is en_US.UTF-8. The second command clones the default entries in pg_ts_cfgmap, creating an identical set of entries that belong to default_enUS. Once you've created a clone that matches your server's locale, you should be able to use tsearch2 without specifying an explicit configuration in each call to to_tsquery() and to_tsvector().
Creating a new configuration from scratch is not too complex, but you'll need an understanding of the stemming, stopping, and cataloging process before you start.
When you create a tsvector from a text string, tsearch2 starts by invoking a parser that picks apart the text string into its component parts. tsearch2 comes with a single parser (named default), but you can write your own parser if you have special requirements. The default parser was designed to parse plain-text and HTML documents: It knows how to process HTML tags, HTTP headers, email and host addresses, and so on. The parser identifies and classifies each token in a text string. For example, given the string "send 42 messages to bruce@example.com", the default parser will identify three words (send, messages, and to), one unsigned integer (42), and an email address (bruce@example.com). The default parser classifies each token into one (or more) of the categories shown in Table 25.1[3].
[3] The default parser also defines three categories for words and word fragments composed of Cyrillic characters: nlword, nlhword, and part-nlhword.
Table 25.1. Token categories recognized by the default parser

Category      Description                                       Examples
------------  ------------------------------------------------  -------------------------------------
lword         Any word composed entirely of alphabetic          bruce
              characters
word          Any word composed of alphabetic and numeric       bruce42, postgres81
              characters
email         An Internet email address (user@host)             bruce@example.com
url           An HTTP or FTP URL                                http://www.postgresql.org/index.html,
                                                                ftp://ftp.postgresql.org/index.html
host          An Internet hostname                              www.postgresql.org,
                                                                localhost.localdomain
sfloat        A floating-point number in scientific notation    325.667E12, 6.626E-34
version       A generic version number (a number with more      8.0.0, 2.6.9.1
              than one decimal point)
part_hword    Parts of a hyphenated word                        post-gres-sql8
lpart_hword   Latin parts of a hyphenated word                  post-gre-sql8
blank         Whitespace and any characters not matched by      (parens are considered blanks),
              other rules                                       $so are other special characters!
tag           An HTML tag                                       <b>
http          The protocol component of an HTTP URL             http://www.postgresql.org
hword         A hyphenated word                                 postgre-sql8
lhword        A hyphenated Latin word                           postgre-sql
uri           A uniform resource identifier (usually the        http://www.postgresql.org/index.html
              filename component of a URL)
file          A relative or absolute Linux/Unix pathname        /tmp/README.txt, ../README.txt
float         A floating-point number                           3.14159, 6.626
int           A signed integer                                  -32, +45
uint          An unsigned integer                               32, 45
entity        An HTML entity                                    &amp;, &nbsp;
When the parser finishes tokenizing and classifying the text string, it ends up with a collection of token values, and each token is assigned to a category. Some of the "words" in the text string may result in multiple tokens. For example, the string http://www.postgresql.org/index.html produces four tokens (you can call the ts_debug() function to see the result of the parsing process):
perf=# SELECT token, tok_type
perf-#   FROM ts_debug( 'http://www.postgresql.org/index.html' );
             token             | tok_type
-------------------------------+----------
 http://                       | http
 www.postgresql.org/index.html | url
 www.postgresql.org            | host
 /index.html                   | uri
(4 rows)
Next, the parser iterates through the list of tokens and weeds out any that are deemed uninteresting. To decide which tokens to discard, to_tsvector() uses the token type (and the configuration name) to locate a record in the pg_ts_cfgmap table. If to_tsvector() can't find a matching entry in pg_ts_cfgmap, it discards the token. For example, given the tokens parsed from http://www.postgresql.org/index.html, to_tsvector() finds:
perf=# SELECT * FROM pg_ts_cfgmap
perf-#   WHERE ts_name = 'default'
perf-#   AND tok_alias IN( 'http', 'url', 'host', 'uri' );
 ts_name | tok_alias | dict_name
---------+-----------+-----------
 default | url       | {simple}
 default | host      | {simple}
 default | uri       | {simple}
(3 rows)
Notice that tsearch2 won't find an entry for ts_name = 'default' and tok_alias = 'http', so it discards that token (the http:// header). The default configuration discards blank, tag, http, and entity tokens. Discarding a "word" based on its token classification is similar to stopping an entire category of words. Discarded tokens are not cataloged by tsearch2, so you won't be able to search for them. Of course, you can tell tsearch2 that you want it to catalog a given category by adding that category to the pg_ts_cfgmap table. Similarly, you can tell tsearch2 to ignore a given category (say, the file category) by removing that category from pg_ts_cfgmap.
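For example, to start cataloging http tokens (handing them to the simple dictionary processor), or to stop cataloging file tokens, you could adjust pg_ts_cfgmap directly (a sketch, assuming the default_enUS configuration created earlier):

perf=# INSERT INTO pg_ts_cfgmap
perf-#   VALUES( 'default_enUS', 'http', '{simple}' );
INSERT
perf=# DELETE FROM pg_ts_cfgmap
perf-#   WHERE ts_name = 'default_enUS' AND tok_alias = 'file';
DELETE 1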
For each token that makes it through the pg_ts_cfgmap filter, tsearch2 starts the stemming and stopping process. When to_tsvector() finds an entry in pg_ts_cfgmap that matches the configuration name and token type, that entry identifies a dictionary processor. to_tsvector() feeds the token into that dictionary processor and adds the result (if any) to the tsvector. The dictionary processor may stem the token by translating it into a new token. The dictionary processor may instead stop the word by returning a NULL value. Or, the dictionary processor may pass the token through without modification.
The tsearch2 package comes with five sample dictionary processors.
The simple dictionary processor converts each token into lowercase characters and then searches for the result in a list of stop words; if it finds the (lowercased) token in the list, it returns NULL, otherwise it returns the lowercased token to to_tsvector() (and to_tsvector() adds the token to the tsvector that it's building). tsearch2 installs the simple dictionary processor with an empty stop word list (which means that every token makes it through the simple dictionary after it's been translated to lowercase). To add a stop word list (which is just a newline-separated list of words), save the name of your stop word file in the dict_initoption column of the pg_ts_dict row corresponding to the simple dictionary processor. For example, if you've stored a list of stop words in a file named /usr/share/stopwords.english, execute the following command:
perf=# UPDATE pg_ts_dict
perf-#   SET dict_initoption = '/usr/share/stopwords.english'
perf-#   WHERE dict_name = 'simple';
UPDATE
The en_stem dictionary processor handles stop words and stemming. en_stem searches for the token in a list of stop words and discards the token if found. (Like the simple dictionary processor, en_stem finds the stop word list in its pg_ts_dict.dict_initoption.) If the token is not found in the stop word list, en_stem tries to convert the token into its root form by stripping off common English prefixes and suffixes. For example, en_stem converts donate, donation, donating, donates, and donated into the stem donat. to_tsvector() stores the stem in the tsvector that it's building. If you search for the word donate, tsearch2 will match donate, donation, donating, donates, and donated.
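If you want to see what a dictionary processor does with a given word, feed the word through the lexize() function (a sketch; the result is an array of lexemes, and the exact formatting may differ):

perf=# SELECT lexize( 'en_stem', 'donating' );
 lexize
---------
 {donat}
(1 row)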
The ru_stem dictionary processor is identical to the en_stem processor except it stems each token using rules designed for Russian text. (You would most likely use a different list of stop words too.)
The synonym dictionary processor doesn't do any stop word processing (or stemming). When you feed a token to the synonym processor, it searches for a match in a list of word-synonym pairs. If it finds a match, the processor returns the synonym. For example, given the list of synonyms:
zaurus    pda
newton    pda
pocketpc  pda
nokia     phone
treo      phone
The synonym processor will translate zaurus, newton, and pocketpc into pda, and will translate nokia and treo into phone. If synonym can't find a match in the list, it returns NULL.
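Like the stop word lists, the synonym list lives in an external file. To point the synonym processor at your own list, store the filename in the matching pg_ts_dict row (a sketch; the path shown here is hypothetical):

perf=# UPDATE pg_ts_dict
perf-#   SET dict_initoption = '/usr/share/synonyms.txt'
perf-#   WHERE dict_name = 'synonym';
UPDATE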
The last dictionary processor is named ispell_template. ispell_template is based on the ispell program and it searches for each token in a separate dictionary file (not included with tsearch2). If ispell_template finds the token (or a variant of the token) in the dictionary, it returns the stemmed form of the word to to_tsvector(). If ispell_template can't find the token (or a variant of the token) in the dictionary, it returns NULL (and the token is discarded). ispell_template also uses a stop word list to filter out common words. There's an important difference between ispell_template and en_stem. Both dictionary processors convert tokens into stem form, but ispell_template will discard any token that it can't find in the dictionary; en_stem, on the other hand, simply passes through any token that it can't stem. The ispell_template processor won't work until you connect it to a dictionary, a process described in the "Tsearch Introduction" document that comes with tsearch2.
You can string multiple dictionary processors together by listing each one in the pg_ts_cfgmap.dict_name column. For example, to apply the synonym processor and then the en_stem processor to every lword token:
perf=# UPDATE pg_ts_cfgmap
perf-#   SET dict_name = '{"synonym","en_stem"}'
perf-#   WHERE ts_name = 'default_enUS' AND tok_alias = 'lword';
UPDATE
tsearch2 tries each dictionary processor, in order, and stops as soon as a processor returns a non-NULL value.
Now you know how all of the pieces fit together. tsearch2 uses the server's locale to find a configuration (in the pg_ts_cfg table). The configuration identifies a parser. tsearch2 uses that parser to split a text string into a set of tokens and assigns a category to each token. The pg_ts_cfgmap maps each configuration/token category combination into the name of a dictionary processor. (If a combination is not found in pg_ts_cfgmap, tsearch2 discards all tokens of that category.) The dictionary processor (typically) filters each token through a list of stop words and then stems anything that makes it through the filter.
To create a new configuration, you can write a new parser, change the pg_ts_cfgmap to include (or exclude) token categories, modify the pg_ts_cfgmap to apply a different dictionary processor to a token category, implement a new dictionary processor, or modify the list of stop words used by a dictionary. If you use the synonym dictionary processor, you can also modify the synonym map. In most cases, you won't need to write any code (unless you find that you have to implement a new parser or dictionary processor); just adjust a configuration table (or external file). If you do write a new parser or dictionary processor, consider donating it to the PostgreSQL community so other users can benefit from your efforts.
tsearch2 offers a number of other features that I haven't described here. You know that tsearch2 can identify the documents that match a given pattern; tsearch2 can also rank the matches according to relevance. (Check out the rank() and rank_cd() functions.) When tsearch2 finds a document that matches a pattern, you can ask the headline() function to produce a string that highlights the search words in context. See the tsearch2 documentation for more details.
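For example, a ranked version of one of the earlier searches might look something like this (a sketch only; rank() assigns higher scores to closer matches, and I haven't shown the output):

perf=# SELECT record_id,
perf-#        rank( document( recalls ), to_tsquery( 'hydraulic' )) AS score
perf-#   FROM recalls
perf-#   WHERE document( recalls ) @@ to_tsquery( 'hydraulic' )
perf-#   ORDER BY score DESC
perf-#   LIMIT 3;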
If tsearch2 doesn't have what you need, check out the OpenFTS package. OpenFTS is a user-friendly wrapper around tsearch2. You can use OpenFTS to expose the documents in your database to users that may not know how to formulate SQL queries (and may not understand the results). You can find OpenFTS at openfts.sourceforge.net.