14.3 Regular Expressions and SQL


Database-driven applications have to process huge amounts of data. If pattern matching with the help of regular expressions is performed by PHP, the entire data has to be transferred to the Web server where the PHP interpreter is running. With huge amounts of data, this is not a good solution because a large amount of memory and network bandwidth is wasted. Therefore, it can be useful to utilize regular expressions on the database level. PostgreSQL is capable of working with regular expressions. In some cases this is the more efficient way because of less overhead.

In this section you will learn about regular expressions and SQL.

14.3.1 Getting Started with SQL Regular Expressions

Working with SQL-style regular expressions is easy. However, it is important to understand the basic differences and specifics.

Let's create a table and insert some records into it:

 CREATE TABLE product (         id int4,                -- product id         name text,              -- name of product         prodtype text,          -- type of product         price numeric(9,2),     -- price of product         instock int4            -- number of products available ); COPY product FROM stdin USING DELIMITERS ';'; 34543532;The Pinky and the Brain in Vienna;Video;23.99;982 34543533;The Pinky and the Brain meet Elvis;Video;23.99;546 44545432;PostgreSQL Developer's Handbook;Book;44.99;40 44545433;PHP and PostgreSQL;Book;44.99;654 65665490;Falco Cybershow: Soundtrack;CD;23.49;60 65665491;Manifold Live;CD;23.69;129 \. 

A table for storing products has been created and some records have been added to it using the COPY command. To see what is in the table, you can execute a simple SELECT statement:

 phpbook=# SELECT * FROM product;     id    |                name                | prodtype | price | instock ----------+------------------------------------+----------+-------+---------  34543532 | The Pinky and the Brain in Vienna  | Video    | 23.99 |     982  34543533 | The Pinky and the Brain meet Elvis | Video    | 23.99 |     546  44545432 | PostgreSQL Developer's Handbook    | Book     | 44.99 |      40  44545433 | PHP and PostgreSQL                 | Book     | 44.99 |     654  65665490 | Falco Cybershow: Soundtrack        | CD       | 23.49 |      60  65665491 | Manifold Live                      | CD       | 23.69 |     129 (6 rows) 

The target of the next query is to retrieve only those records containing the string PostgreSQL. Therefore the ~ can be used. ~ performs case-sensitive substring matching and returns all records satisfying the expression:

 phpbook=# SELECT * FROM product WHERE name ~ 'PostgreSQL';     id    |              name               | prodtype | price | instock ----------+---------------------------------+----------+-------+---------  44545432 | PostgreSQL Developer's Handbook | Book     | 44.99 |      40  44545433 | PHP and PostgreSQL              | Book     | 44.99 |     654 (2 rows) 

In this example two records have been returned. If postgresql is spelled in lowercase letters, no records will be found:

 phpbook=# SELECT * FROM product WHERE name ~ 'postgresql';  id | name | prodtype | price | instock ----+------+----------+-------+--------- (0 rows) 

The ~ is case-sensitive, so no records match the expression. To perform case-insensitive pattern matching, PostgreSQL provides the ~* operator:

 phpbook=# SELECT * FROM product WHERE name ~* 'postgresql';     id    |              name               | prodtype | price | instock ----------+---------------------------------+----------+-------+---------  44545432 | PostgreSQL Developer's Handbook | Book     | 44.99 |      40  44545433 | PHP and PostgreSQL              | Book     | 44.99 |     654 (2 rows) 

Two records have been found.

Sometimes it is necessary to find all records that do not match a regular expression. Therefore, PostgreSQL provides the !~ operator. Just like when working with the ~ operator, PostgreSQL performs case-sensitive pattern matching:

 phpbook=# SELECT * FROM product WHERE name !~ 'PostgreSQL';     id    |                name                | prodtype | price | instock ----------+------------------------------------+----------+-------+---------  34543532 | The Pinky and the Brain in Vienna  | Video    | 23.99 |     982  34543533 | The Pinky and the Brain meet Elvis | Video    | 23.99 |     546  65665490 | Falco Cybershow: Soundtrack        | CD       | 23.49 |      60  65665491 | Manifold Live                      | CD       | 23.69 |     129 (4 rows) 

All four records that do not contain PostgreSQL are retrieved.

To perform case-insensitive matching, you can use the !~* operator:

 phpbook=# SELECT * FROM product WHERE name !~* 'postgresql';     id    |                name                | prodtype | price | instock ----------+------------------------------------+----------+-------+---------  34543532 | The Pinky and the Brain in Vienna  | Video    | 23.99 |     982  34543533 | The Pinky and the Brain meet Elvis | Video    | 23.99 |     546  65665490 | Falco Cybershow: Soundtrack        | CD       | 23.49 |      60  65665491 | Manifold Live                      | CD       | 23.69 |     129 (4 rows) 

Sometimes it is necessary to see if the beginning of a string matches a certain pattern. In this case you can include the ^ symbol in the regular expression. ^ matches the beginning of a string as shown in the next example:

 phpbook=# SELECT * FROM product WHERE name ~* '^postgresql';     id    |              name               | prodtype | price | instock ----------+---------------------------------+----------+-------+---------  44545432 | PostgreSQL Developer's Handbook | Book     | 44.99 |      40 (1 row) 

One record has been found because there is only one product whose name starts with PostgreSQL. The counterpart of ^ is the $ symbol, which matches the end of a string. Let's take a look at an example:

 phpbook=# SELECT * FROM product WHERE name ~* 'elvis$';     id    |                name                | prodtype | price | instock ----------+------------------------------------+----------+-------+---------  34543533 | The Pinky and the Brain meet Elvis | Video    | 23.99 |     546 (1 row) 

Only one record has been retrieved because the Pinky and the Brain videotape is the only one about Elvis.

If you don't know how to spell Pinky correctly, you can use square brackets to tell PostgreSQL that one of the characters in brackets is the correct one. In this scenario you don't know whether to spell Pinky with b or p:

 phpbook=# SELECT * FROM product WHERE name ~* '[bp]inky';     id    |                name                | prodtype | price | instock ----------+------------------------------------+----------+-------+---------  34543532 | The Pinky and the Brain in Vienna  | Video    | 23.99 |     982  34543533 | The Pinky and the Brain meet Elvis | Video    | 23.99 |     546 (2 rows) 

Two records have been found; both contain the string Pinky.

As you have already seen in the sections about Perl-style and POSIX-style regular expressions, the * symbol is used for working with repeatable characters or subpatterns. The * symbol is also supported by SQL-style regular expressions:

 phpbook=# SELECT * FROM product WHERE name ~* 'handbo*k';     id    |              name               | prodtype | price | instock ----------+---------------------------------+----------+-------+---------  44545432 | PostgreSQL Developer's Handbook | Book     | 44.99 |      40 (1 row) 

In the following scenario, the word handbook can be spelled with zero or more os. One record matches the regular expression. To define a subpattern, brackets can be used. The syntax is the same as you have already seen when dealing with Perl- and POSIX-style regular expressions:

 phpbook=# SELECT * FROM product WHERE name ~* 'handb(o)*k';     id    |              name               | prodtype | price | instock ----------+---------------------------------+----------+-------+---------  44545432 | PostgreSQL Developer's Handbook | Book     | 44.99 |      40 (1 row) 

The same record has been returned. In addition to the * symbol, PostgreSQL supports + and ? as well, as shown in the next two examples:

 phpbook=# SELECT * FROM product WHERE name ~* 'handbo+k';     id    |              name               | prodtype | price | instock ----------+---------------------------------+----------+-------+---------  44545432 | PostgreSQL Developer's Handbook | Book     | 44.99 |      40 (1 row) phpbook=# SELECT * FROM product WHERE name ~* 'handboo?k';     id    |              name               | prodtype | price | instock ----------+---------------------------------+----------+-------+---------  44545432 | PostgreSQL Developer's Handbook | Book     | 44.99 |      40 (1 row) 

14.3.2 Escaping Characters

Escaping characters is important when working with SQL-style regular expressions. The syntax is the same as the one you already saw when dealing with Perl- and POSIX-style regular expressions.

One way to escape a character is to use a backslash:

 phpbook=# SELECT * FROM product WHERE name ~* 'Developer\'s';     id    |              name               | prodtype | price | instock ----------+---------------------------------+----------+-------+---------  44545432 | PostgreSQL Developer's Handbook | Book     | 44.99 |      40 (1 row) 

The single quote in Developer's is escaped and this way the entire string can be retrieved easily. An additional method to escape a single quote would be to use two single quotes:

 phpbook=# SELECT * FROM product WHERE name ~* 'Developer''s';     id    |              name               | prodtype | price | instock ----------+---------------------------------+----------+-------+---------  44545432 | PostgreSQL Developer's Handbook | Book     | 44.99 |      40 (1 row) 

This syntax is already familiar to you when working with PL/pgSQL, you use the same syntax.

14.3.3 LIKE

LIKE is oneof the most widespread keywords when dealing with regular expressions on a database level. The idea of LIKE is to have a keyword for working with regular expressions rather than having a set of operators.

The way LIKE works differs slightly from the regular expressions you have just seen. Let's take a look at an example:

 phpbook=# SELECT * FROM product WHERE name LIKE 'handboo?k';  id | name | prodtype | price | instock ----+------+----------+-------+--------- (0 rows) 

No values are returned because LIKE looks for complete matches and does not perform substring searching. To tell PostgreSQL that some characters might be found before and after the string Handbook, you have to add a % symbol:

 phpbook=# SELECT * FROM product WHERE name LIKE '%Handbook%';     id    |              name               | prodtype | price | instock ----------+---------------------------------+----------+-------+---------  44545432 | PostgreSQL Developer's Handbook | Book     | 44.99 |      40 (1 row) 

This way a record will be retrieved. With the help of LIKE, case-sensitive matching is performed: If the string Handbook is spelled with lowercase letters, no records will be returned.

 phpbook=# SELECT * FROM product WHERE name LIKE '%handbook%';  id | name | prodtype | price | instock ----+------+----------+-------+--------- (0 rows) 

This is an important point that is often forgotten. To get around the problem, you can use the lower function:

 phpbook=# SELECT * FROM product WHERE lower(name) LIKE '%handbook%';     id    |              name               | prodtype | price | instock ----------+---------------------------------+----------+-------+---------  44545432 | PostgreSQL Developer's Handbook | Book     | 44.99 |      40 (1 row) 

lower computes the lowercase of a string. This way it does not matter that LIKE performs case-sensitive matching. Another way to solve the problem is to use operators instead of LIKE. The ~~ is equal to the LIKE statement:

 phpbook=# SELECT * FROM product WHERE lower(name) ~~ '%handbook%';     id    |              name               | prodtype | price | instock ----------+---------------------------------+----------+-------+---------  44545432 | PostgreSQL Developer's Handbook | Book     | 44.99 |      40 (1 row) 

The case-insensitive counterpart of the ~~ operator is the ~~* operator. The next example shows how the same target can be achieved without using the lower function:

 phpbook=# SELECT * FROM product WHERE name ~~* '%handbook%';     id    |              name               | prodtype | price | instock ----------+---------------------------------+----------+-------+---------  44545432 | PostgreSQL Developer's Handbook | Book     | 44.99 |      40 (1 row) 

To find all values that do not match a LIKE clause, you can use NOT LIKE:

 phpbook=# SELECT * FROM product WHERE name NOT LIKE '%Handbook%';     id    |                name                | prodtype | price | instock ----------+------------------------------------+----------+-------+---------  34543532 | The Pinky and the Brain in Vienna  | Video    | 23.99 |     982  34543533 | The Pinky and the Brain meet Elvis | Video    | 23.99 |     546  44545433 | PHP and PostgreSQL                 | Book     | 44.99 |     654  65665490 | Falco Cybershow: Soundtrack        | CD       | 23.49 |      60  65665491 | Manifold Live                      | CD       | 23.69 |     129 (5 rows) 

LIKE is widely used by many SQL programmers and it is an easy task to get used to working with the command. However, in some cases it is more flexible to work with PostgreSQL's operators than with LIKE.

14.3.4 Regular Expressions and Performance

Although regular expressions are a tremendous advantage, there are some conditions that can lead to real trouble in real-time systems. When the amount of data processed by your application grows, the situation might become dangerous for various reasons.

The key to high performance is to have the right indexes defined on the appropriate columns. Let's define two indexes:

 phpbook=# CREATE INDEX idx_product_id ON product(id); CREATE phpbook=# CREATE INDEX idx_product_name ON product(name); CREATE 

The target of the index is to optimize the query looking for the word postgresql:

 phpbook=# SELECT * FROM product WHERE name ~* 'postgresql';     id    |              name               | prodtype | price | instock ----------+---------------------------------+----------+-------+---------  44545432 | PostgreSQL Developer's Handbook | Book     | 44.99 |      40  44545433 | PHP and PostgreSQL              | Book     | 44.99 |     654 (2 rows) 

To find out what PostgreSQL does internally, you can take a look at the execution plan of the query. As you can see, PostgreSQL performs a sequential scan, which means that the entire table is read:

 phpbook=# EXPLAIN SELECT * FROM product WHERE name ~* 'postgresql'; NOTICE:  QUERY PLAN: Seq Scan on product  (cost=0.00..1.07 rows=1 width=87) EXPLAIN 

Sometimes sequential scans are performed even if they are not absolutely necessary. To make PostgreSQL use an index whenever possible, sequential scans can be turned off temporarily:

 phpbook=# SET enable_seqscan TO off; SET VARIABLE 

After turning off sequential scans, you can use EXPLAIN again to see if something has changed:

 phpbook=# EXPLAIN SELECT * FROM product WHERE name ~* 'postgresql'; NOTICE:  QUERY PLAN: Seq Scan on product  (cost=100000000.00..100000001.08 rows=1 width=87) EXPLAIN 

PostgreSQL still performs a sequential scan. Even if you are looking for PostgreSQL at the beginning of the cell, the situation won't change:

 phpbook=# EXPLAIN SELECT * FROM product WHERE name ~* '^postgresql'; NOTICE:  QUERY PLAN: Seq Scan on product  (cost=100000000.00..100000001.08 rows=1 width=87) EXPLAIN 

The reason for PostgreSQL's behavior is clear and easy to understand. When scanning an index, the database checks whether the value you are looking for and the data in the index match. This operation is a simple comparison of two strings. Because a regular expression is much more complex than a string, there is no way of scanning an index to find a value matching a regular expression. This is an important point that you have to keep in mind when working with regular expressions. As long as the amount of data is comparatively low, this won't be a problem, but if you are looking for values in a table that consists of millions of tables, it will take a long time and the overall performance of your system will decrease.

In most cases there is no way to get around the problem, so regular expressions should be avoided if a lot of data has to be checked within a short period of time or in the case of a lot of concurrent queries.



PHP and PostgreSQL. Advanced Web Programming2002
PHP and PostgreSQL. Advanced Web Programming2002
ISBN: N/A
EAN: N/A
Year: 2004
Pages: 201

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net