22.2 Building Substring Indexes | PHP and PostgreSQL. Advanced Web Programming2002

Indexes are the key to performance. If data is selected and retrieved from a table, indexes will help you to speed up your queries significantly. However, in some operations indexes cannot be used. One of these operations is substring searching. If you are working with regular expressions, PostgreSQL is not capable of using indexes. Internally an index is a tree built on a sorted list. This way data can be retrieved efficiently. In substrings it is not possible to benefit from the advantages of a sorted list. Therefore it is necessary to build a workaround. This can easily be done by using the fulltextindex package, which can be found in PostgreSQL's contributed section ($PATH_TO_POSTGRES/contrib/fulltextindex).

22.2.1 Installation

To install the package, just go to the directory containing the code and run make as well as make install as shown in the next listing:

 [root@duron fulltextindex]# make sed 's,MODULE_PATHNAME,$libdir/fti,g' fti.sql.in >fti.sql gcc -O3 -march=athlon  -Wall -Wmissing-prototypes -Wmissing-declarations -fpic -I. -I../../src/include   -c -o fti.o fti.c gcc -shared -o fti.so fti.o rm fti.o [root@duron fulltextindex]# make install /bin/sh ../../config/install-sh -c -m 644 fti.sql /usr/local/pgsql/share/contrib /bin/sh ../../config/install-sh -c -m 755 fti.so /usr/local/pgsql/lib /bin/sh ../../config/install-sh -c -m 644 ./README.fti /usr/local/pgsql/doc/contrib /bin/sh ../../config/install-sh -c -m 755 ./fti.pl /usr/local/pgsql/bin

After compiling the source code of the package, you can add the package to the desired database. In this example fti.sql is sent to the database called phpbook:

 [hs@duron fulltextindex]$ psql phpbook < fti.sql CREATE

22.2.2 The Basic Concepts and an Example

The idea of PostgreSQL's substring index package is that for every column you want to index, a table for storing the substrings must exist. In this table a list of all possible substrings a string can have is stored. To use the substring index, a join must be performed. Let's create an empty table first:

 phpbook=# CREATE TABLE word (id int4, word text); CREATE

In the next step the table needed to store the substrings is created. Notice the names of the columns because these names are essential for making the system work correctly.

 phpbook=# CREATE TABLE fti_phpbook_word (string text, id oid); CREATE

Now a trigger has to be defined, which makes sure that every record added, modified, or removed is processed correctly. Every record added to the table is split up into substrings and added to the table storing the substrings. Modifications are performed in both the original and the substring table. Removing records has to be done in both tables as well. Let's take a look at the trigger:

 CREATE TRIGGER fti_word_word_trigger         AFTER  INSERT OR UPDATE OR DELETE ON word         FOR EACH ROW EXECUTE PROCEDURE fti(fti_phpbook_word, word);

The heart of the trigger is the fti function. It computes the substrings needed in the substring table and adds the values to it. Now that the trigger has been created, you can try to insert a record into the table:

 phpbook=# INSERT INTO word VALUES (1, 'PostgreSQL'); INSERT 25063 1

One record has been added to the table:

 phpbook=# SELECT * FROM word;  id |    word ----+------------   1 | PostgreSQL (1 row)

Let's see which records have been added to the table containing the substrings:

 phpbook=# SELECT * FROM fti_phpbook_word;    string   |  id ------------+-------  ql         | 25063  sql        | 25063  esql       | 25063  resql      | 25063  gresql     | 25063  tgresql    | 25063  stgresql   | 25063  ostgresql  | 25063  postgresql | 25063 (9 rows)

As you can see, nine substrings can be made out of the word PostgreSQL. Let's create three indexes:

 phpbook=# CREATE INDEX idx_word_word ON word (word); CREATE phpbook=# CREATE INDEX idx_fti_oid ON fti_phpbook_word (id); CREATE phpbook=# CREATE INDEX idx_fti_string ON fti_phpbook_word (string); CREATE

To make PostgreSQL use the indexes, whenever possible sequential scans must be turned off temporarily. This can easily be done by using the SET command:

 phpbook=# SET enable_seqscan TO off; SET VARIABLE

The basic idea of substring indexing is that only one type of regular expression can be processed using an index. If the ^ symbol is used to match the beginning of a string, it is possible to use an index but only if no other features of regular expressions, such as nested repetitions, are used. Let's take a look at the execution plan:

 phpbook=# EXPLAIN SELECT * FROM fti_phpbook_word WHERE string ~ '^gre'; NOTICE:  QUERY PLAN: Index Scan using idx_fti_string on fti_phpbook_word  (cost=0.00..4.68 rows=1 width=36) EXPLAIN

As you can see, an index is used. To join the two tables, a join has to be performed:

 SELECT word.*         FROM word, fti_phpbook_word         WHERE word.oid=fti_phpbook_word.id                 AND fti_phpbook_word.string ~ '^gresql';

One row is returned:

  id |    word ----+------------   1 | PostgreSQL (1 row)

To compute the execution plan, EXPLAIN has to be used:

 EXPLAIN SELECT word.*         FROM word, fti_phpbook_word         WHERE word.oid=fti_phpbook_word.id                 AND fti_phpbook_word.string ~ '^gresql';

As you can see, indexes are used.

 Hash Join  (cost=100000004.68..100000005.70 rows=1 width=44)   ->  Seq Scan on word  (cost=100000000.00..100000001.01 rows=1 width=40)   ->  Hash  (cost=4.68..4.68 rows=1 width=4)         ->  Index Scan using idx_fti_string on fti_phpbook_word (cost=0.00..4.68 rows=1 width=4)

Sometimes using substring indexes is the only way to speed up queries significantly. In the case of millions of records, it can be the only chance to avoid sequential scans.