Contributed Software

Some programmers have developed tools, modules, and examples that have not been included in the core distribution of PostgreSQL, but can be found in the contributed section, which is included in the tar-ball containing the source code of PostgreSQL.

These tools are not maintained by the PostgreSQL core development team, because some of them address only a small audience or do not meet the quality standards of the PostgreSQL core.

Although contributed software is not a core part of PostgreSQL, it is worth having a closer look at some of the tools. This section presents an overview of the most useful tools included in the contributed section of the PostgreSQL distribution.

cube: A Self-Defined Datatype

Some specific applications have to store information about multidimensional cubes in a database. Cubes need not always be three-dimensional; in some cases, they are even five- or six-dimensional. In addition to storing the data, the module provides a huge set of operators and GiST support methods. Gene Selkov, Jr. (a computational scientist at the Mathematics and Computer Science Division of Argonne National Laboratory) has implemented this datatype, and we take a closer look at it in this section.

Installing cube

To compile and install this extension, go to the contrib/cube directory in your PostgreSQL source distribution and install the code by using make and make install. This will work fine if you have installed PostgreSQL from source.

In the next step, you insert the datatype into the template1 database so that it can be inherited by all databases you will create in the future; it does not affect databases that already exist. You use the following command:

 [hs@athlon cube]$  psql template1 < cube.sql  

Note

If you don't want cube to be inherited by all databases, insert cube.sql into only those databases that you want to support the datatype.


Now that the datatype has successfully been inserted into a database, we can test the datatype by performing a small regression test:

 [hs@athlon cube]$ make installcheck
 make -C ../../src/test/regress pg_regress
 make[1]: Entering directory `/usr/src/redhat/SOURCES/postgresql-7.1.2/src/test/regress'
 make[1]: `pg_regress' is up to date.
 make[1]: Leaving directory `/usr/src/redhat/SOURCES/postgresql-7.1.2/src/test/regress'
 ../../src/test/regress/pg_regress cube
 (using postmaster on Unix socket, default port)
 ============== dropping database "regression"         ==============
 ERROR:  DROP DATABASE: database "regression" does not exist
 dropdb: database removal failed
 ============== creating database "regression"         ==============
 CREATE DATABASE
 ============== installing PL/pgSQL                    ==============
 ============== running regression test queries        ==============
 test cube                 ... ok
 =====================
  All 1 tests passed.
 =====================

As you can see, the test has been passed successfully, and therefore it is safe to use cube now.

Working with cubes

This section introduces some simple examples of how you can work with cubes efficiently .

Cubes can be one- or multi-dimensional. There are no restrictions on the number of axes a cube might contain. Before starting, we create a database called contributed :

 [hs@athlon cube]$  createdb contributed  CREATE DATABASE 

The following example shows the easiest way to define a cube:

 contributed=#  SELECT '4'::cube AS cube;  cube ------  (4) (1 row) 

You can see that a one-dimensional cube can be defined by casting a float value to cube. Such a cube is actually a point in one-dimensional space.

Let's look at a slightly more complex example, which shows a point in space:

 contributed=#  SELECT '4, 5, 6'::cube AS cube;  cube -----------  (4, 5, 6) (1 row) 

We simply have to pass the various values to PostgreSQL as a comma-separated list. The list is cast to cube in order to tell PostgreSQL how to treat the input.

The following is an n-dimensional box represented by a pair of opposite corners:

 contributed=#  SELECT '(0,0,0,0),(1,-2,3,-4)'::cube AS cube;  cube -----------------------------  (0, 0, 0, 0),(1, -2, 3, -4) (1 row) 

This cube starts at (0, 0, 0, 0) and ends at (1, -2, 3, -4). The two points passed to PostgreSQL define two opposite corners of the cube. When the cube is defined, PostgreSQL makes sure that the lower-left corner is stored first (internally).

If you define cubes, make sure that the points defining the cube have the same dimensionality. If the first element in the cube is represented by a point in four-dimensional space, the second component must also be a four-dimensional point. If it is not, the following error is displayed:

 contributed=#  SELECT '(1,-2,3,-4),(0,0,0)'::cube;  ERROR:  (3) bad cube representation; different point dimensions in (1,-2,3,-4) and (0,0,0) 

Sometimes it is necessary to compute the intersection of two cubes. For this purpose, cube offers a function called cube_inter. The following shows how it can be used:

 contributed=#  SELECT cube_inter('(1,2,3,4),(0,0,0,0)','(0,0,0,0),(-1,-2,-3,-4)');  cube_inter --------------  (0, 0, 0, 0) (1 row) 

The result is (0, 0, 0, 0) , because this is the only point the two cubes have in common.

It is not only possible to compute the intersection of two cubes; you also can combine two cubes:

 contributed=#  SELECT cube_union('(1,2,3,4),(0,0,0,0)','(0,0,0,0),(-1,-2,-3,-4)');  cube_union -------------------------------  (-1, -2, -3, -4),(1, 2, 3, 4) (1 row) 

(-1, -2, -3, -4),(1, 2, 3, 4) contains all points that the two cubes passed to the function contain. If you want to combine more than two cubes in one, you combine the list of cubes recursively, as shown in the following example:

 contributed=#  SELECT cube_union(cube_union('(1,2,3,4),(0,0,0,0)','(0,0,0,0),(-   1,-2,-3,-4)'), '(0,0,0,0),(9,-10,11,-12)');  cube_union ----------------------------------  (-1, -10, -3, -12),(9, 2, 11, 4) (1 row) 

To find out whether a cube contains a certain point, cube provides a function called cube_contains . We have to pass a valid cube and a valid point to the function:

 contributed=#  SELECT cube_contains('(1,-2,3,-4),(0,0,0,0)','(0,-1,1,-2)');  cube_contains ---------------  t (1 row) 

t (true) is returned, which means that the point '0,-1,1,-2' is a part of the cube.

Let's look at the operators. As with all datatypes, = and != operators are provided to check whether two cubes are equal. = and != are essential for the datatype; otherwise, it would be impossible to perform join operations (when performing a join, two values have to be compared, so = is necessary).
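
To illustrate the point, here is a minimal sketch of a join on a cube column; the tables regions_a and regions_b are made up for this example and only show that the = operator is what makes the join condition possible:

 -- Hypothetical tables used only to illustrate joining on a cube column.
 CREATE TABLE regions_a (name text, bounds cube);
 CREATE TABLE regions_b (name text, bounds cube);
 INSERT INTO regions_a VALUES ('unit square', '(0,0),(1,1)');
 INSERT INTO regions_b VALUES ('square',      '(0,0),(1,1)');

 -- The join condition relies on the = operator defined for cube.
 SELECT a.name, b.name
 FROM regions_a a, regions_b b
 WHERE a.bounds = b.bounds;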

The operators < and > are also defined for cubes:

 contributed=#  SELECT '(0,0,0,0),(1,2,3,4)'::cube < '(0,0,0,0),(2,2,3,4)'::cube;  ?column? ----------  t (1 row) 

< and > are normally used for sorting (ORDER BY). In most cases, they have no practical meaning beyond that. To show you why these two operators are suitable only for sorting, let's look at how they work.

The first element of each cube is compared, and if the two elements differ, that comparison decides which cube is smaller or greater. If both values are equal, PostgreSQL checks the second element, and so on.
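
As a small sketch of how this ordering behaves, assume a hypothetical table called boxes with a single cube column; sorting it should simply use the < and > operators described above:

 -- Hypothetical table; ORDER BY relies on the comparison operators for cube.
 CREATE TABLE boxes (b cube);
 INSERT INTO boxes VALUES ('(0,0),(1,1)');
 INSERT INTO boxes VALUES ('(0,0),(2,2)');
 INSERT INTO boxes VALUES ('(-1,-1),(0,0)');

 -- The cube starting at (-1,-1) should come first, because its first
 -- coordinate is the smallest.
 SELECT b FROM boxes ORDER BY b;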

Note

< (less than) and > (greater than) have nothing to do with the volume of the two cubes.


cube is based on GiST (the Generalized Search Tree interface), which makes PostgreSQL's R-tree indexing, an index structure for geometric data types, available to arbitrary data types. GiST supports a set of operators that can be used in combination with a self-defined datatype.

The operator << finds out whether a cube is left of another cube:

 contributed=#  SELECT '(-2,-3),(-1,-2)'::cube << '(0,0),(2,2)'::cube;  ?column? ----------  t (1 row) 

'(-2,-3),(-1,-2)' is left of '(0,0),(2,2)', so true is returned. The counterpart of the << operator is the >> operator, which finds out whether a cube lies to the right of another cube.
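
A short sketch of the counterpart; since the first cube lies entirely to the right of the second one, the comparison should return true:

 -- Expected to return t, because (3,3),(4,4) lies entirely right of (0,0),(2,2).
 SELECT '(3,3),(4,4)'::cube >> '(0,0),(2,2)'::cube;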

Some other operators are also defined for cubes. The following table lists the most important ones; two of them are illustrated right after the table:

Operator Meaning
<< Left of
>> Right of
&< Overlaps or is left of
&> Overlaps or is right of
&& Overlaps
= Equal
!= Not equal
@ Contains
~ Contained in
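
To give you a feeling for two of these operators, here is a brief sketch; the expected results are noted as comments and follow the meanings listed in the table:

 -- && : do the two cubes share any points?  Expected result: t
 SELECT '(0,0),(2,2)'::cube && '(1,1),(3,3)'::cube;

 -- @ : does the first cube contain the second one?  Expected result: t
 SELECT '(0,0),(3,3)'::cube @ '(1,1),(2,2)'::cube;
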
cube and Indices

When working with large amounts of data, it is necessary to define indices in order to speed up queries. To show you how to define an index on a column, we create a table called mycubes :

 contributed=#  CREATE TABLE mycubes(a cube DEFAULT '0,0'::cube);  CREATE 

We define an index of type gist. PostgreSQL supports a number of different index types; gist is the appropriate one in this case:

 contributed=#  CREATE INDEX cube_idx ON mycubes USING gist (a);  CREATE 

To find out more about indexing, refer to Chapter 3, "An Introduction to SQL." Now that we have defined the table and the index, we can have a look at the overview of the table:

 contributed=# \d mycubes
           Table "mycubes"
  Attribute | Type |        Modifier
 -----------+------+------------------------
  a         | cube | default '(0, 0)'::cube
 Index: cube_idx
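
With the gist index in place, queries using the GiST operators listed above can, in principle, be answered by an index scan. A minimal sketch (note that on an empty or very small table the planner may still prefer a sequential scan):

 -- Find all cubes overlapping a given search box; && is one of the
 -- operators supported by the gist index defined above.
 SELECT * FROM mycubes WHERE a && '(0,0),(1,1)'::cube;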

Working with fulltext Indices

The more data you have to process, the longer queries will take if no indices are used. Indices are efficient data structures that enable the database to access single rows or groups of rows much faster. If no indices are used, the database has to perform a so-called sequential scan (or full table scan), which means that the table has to be read from the beginning to the end. On large tables, this can take a lot of time, and the overall performance of your database system will slow down significantly.

The Problem with Indices and Substring Searching

Not all problems can be solved by defining an index. The next example shows a common problem when combining indices and substring searching in PostgreSQL:

We define a table, which contains one text column. We also define an index for that column in order to access the column fast:

 fulltext=# \d mytexts
     Table "mytexts"
  Attribute | Type | Modifier
 -----------+------+----------
  a         | text |
 Index: mytext_idx

We insert four records into the table:

 fulltext=#  SELECT * FROM mytexts;  a ---------  Charles  Paul  Etschi  Epi (4 rows) 

To make sure that the database does not perform a sequential scan, if it is not absolutely necessary to complete the query, we set enable_seqscan to off :

 fulltext=#  SET enable_seqscan TO off;  SET VARIABLE 

In the next step, we perform a query to look for Etschi :

 fulltext=#  SELECT * FROM mytexts WHERE a='Etschi';  a --------  Etschi (1 row) 

As we have expected, one record has been found in the table. Let's look at the execution plan of the query:

 fulltext=# EXPLAIN SELECT * FROM mytexts WHERE a='Etschi';
 NOTICE:  QUERY PLAN:
 Index Scan using mytext_idx on mytexts  (cost=0.00..2.01 rows=1 width=12)
 EXPLAIN

The database performs an index scan to find the result as fast as possible.

Note

If enable_seqscan is not turned off, PostgreSQL performs a sequential scan anyway, because the table is too small to justify the overhead of using an index. The point of the example is that an index scan can be used by the database.


The situation is different if we use ~ (case-sensitive regular expression matching) instead of =:

 fulltext=#  SELECT * FROM mytexts WHERE a ~ 'Etschi';  a --------  Etschi (1 row) 

The result is still the same because we look for the full record, but the execution plan has changed:

 fulltext=# EXPLAIN SELECT * FROM mytexts WHERE a ~ 'Etschi';
 NOTICE:  QUERY PLAN:
 Seq Scan on mytexts  (cost=100000000.00..100000001.15 rows=1 width=12)
 EXPLAIN

The database has to perform a sequential scan, although enable_seqscan has been turned off. Unfortunately, substring searching cannot be done with indices in PostgreSQL. To work around the problem, we can use the fulltextindex package in the contributed directory of the PostgreSQL source code.

The package is not an extremely sophisticated solution, but it will solve most of the problems people normally have in combination with substring searching.

Working with the fulltextindex Package

Installing the package on your PostgreSQL server is an easy task. Go to the contrib/fulltextindex directory in your PostgreSQL source tree and use make and make install .

The goal of the next example is to make substring searching possible on the table mytexts. Therefore, we write an SQL script:

 -- creating a function needed for fulltext searching
 CREATE FUNCTION fti() RETURNS opaque AS
         '/usr/local/postgresql/lib/contrib/fti.so' LANGUAGE 'C';

 -- creating a table to store the substrings
 CREATE TABLE mytexts_fti (string varchar(25), id oid);

 -- creating the table that will contain the "real" data
 CREATE TABLE mytexts(a text);

 -- creating indices
 CREATE INDEX mytexts_fti_idx ON mytexts_fti (string);
 CREATE INDEX mytexts_idx ON mytexts (a);

 -- creating a trigger that causes the substrings to be created
 CREATE TRIGGER mytexts_fti_trigger AFTER update OR insert OR delete
 ON mytexts FOR EACH ROW EXECUTE PROCEDURE fti(mytexts_fti, a);

 -- inserting data into the "real" table
 COPY mytexts FROM stdin;
 Charles
 Paul
 Etschi
 Epi
 \.

 -- vacuuming db
 VACUUM VERBOSE ANALYZE;

The idea behind the whole package is to define a trigger, which splits up all data inserted into a column and stores a list of substrings in a separate table that can easily be indexed. First we create a function called fti , which will be needed by the trigger processing every record inserted into mytexts .

Note

The path to the shared object might differ on your system and depends on the location where you have installed PostgreSQL.


In the next step we define a table labeled mytexts_fti that will contain all substrings and mytexts (the table containing the original values). After that, we index all columns involved in the process; we define three indices.

Now that all indices and tables are ready, we can define the trigger. In the previous source code, it is executed on every insert, update, or delete operation. This is necessary to keep the table containing the substrings maintained. Depending on the event the trigger is fired for, the action performed by the fti function differs. This is possible because the function can access special variables containing the reason that caused the trigger to be fired (see Chapter 4, "PL/PGSQL").

The most crucial part when defining the trigger is the list of parameters the fti function has to be called with. The first parameter defines the relation in which to store the substrings. The second parameter tells the function which field to extract substrings from. The table we want the substrings to be added to has to look like the one in the next example:

 fulltext=# \d mytexts_fti
           Table "mytexts_fti"
  Attribute |         Type          | Modifier
 -----------+-----------------------+----------
  string    | character varying(25) |
  id        | oid                   |

The first column has to contain the text, and the second column must contain the object id of the row containing the full string. If the data structure differs, the trigger cannot be executed successfully.

After adding the script we have just discussed to the database, here is the content of the two tables involved in the process now:

 fulltext=#  SELECT * FROM mytexts;  a ---------  Charles  Paul  Etschi  Epi (4 rows) 

The mytexts table contains the records, as we saw before. mytexts_fti contains a list of substrings and an object id:

 fulltext=# SELECT * FROM mytexts_fti;
  string  |    id
 ---------+----------
  es      | 20272538
  les     | 20272538
  rles    | 20272538
  arles   | 20272538
  harles  | 20272538
  charles | 20272538
  ul      | 20272539
  aul     | 20272539
  paul    | 20272539
  hi      | 20272540
  chi     | 20272540
  schi    | 20272540
  tschi   | 20272540
  etschi  | 20272540
  pi      | 20272541
  epi     | 20272541
 (16 rows)

The list of the substrings is several times longer than the original list, and an additional column is needed to refer to the original list. When working with huge amounts of data, this costs a lot of storage.

Here's how we can perform the substring search now:

  SELECT p.*   FROM mytexts p, mytexts_fti f1   WHERE f1.string='schi'   AND p.oid=f1.id;  

To compute the same result we achieved with the help of the ~ operator, we have to perform a join operation now. At first sight this seems more complicated, especially for more complex queries. To get rid of the additional complexity, a view can be defined. Here is the result of the previous query:

 a --------  Etschi (1 row) 

Let's look at the execution plan of the query now. To avoid sequential scans, we set enable_seqscan to off again (this has to be done for every session, that is, for every new backend):

 fulltext=# SET enable_seqscan TO off;
 SET VARIABLE
 fulltext=# EXPLAIN SELECT p.* FROM mytexts p, mytexts_fti f1 WHERE
     f1.string='schi' AND p.oid=f1.id;
 NOTICE:  QUERY PLAN:
 Nested Loop  (cost=100000000.00..100000003.10 rows=1 width=20)
   ->  Index Scan using mytexts_fti_idx1 on mytexts_fti f1  (cost=0.00..2.01 rows=1 width=4)
   ->  Seq Scan on mytexts p  (cost=100000000.00..100000001.04 rows=4 width=16)
 EXPLAIN

The database still performs a sequential scan on mytexts, because we have not defined an index on the oid column yet; we first want to say a few more words about that index. Every object in PostgreSQL has a unique object identifier. In the case of rows, these object identifiers do not appear in the output of \d, but the oid column can be indexed like any other column. Let's index the column:

 fulltext=#  CREATE INDEX mytexts_fti_idx3 ON mytexts(oid);  CREATE 

Now that all columns needed for the join operation are indexed, let's look at the execution plan again:

 fulltext=# EXPLAIN SELECT p.* FROM mytexts p, mytexts_fti f1 WHERE
     f1.string='schi' AND p.oid=f1.id;
 NOTICE:  QUERY PLAN:
 Nested Loop  (cost=0.00..4.04 rows=1 width=20)
   ->  Index Scan using mytexts_fti_idx1 on mytexts_fti f1  (cost=0.00..2.01 rows=1 width=4)
   ->  Index Scan using mytexts_fti_idx3 on mytexts p  (cost=0.00..2.01 rows=1 width=16)
 EXPLAIN

PostgreSQL uses two indices now. When the amount of data grows, the query won't slow down significantly (at least not as much).
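
As mentioned earlier, a view can hide the join so that substring searches look like ordinary queries again. Here is a minimal sketch; the view name is made up for this example:

 -- Hypothetical convenience view wrapping the join between the data table
 -- and the substring table.
 CREATE VIEW mytexts_search AS
     SELECT p.a, f.string
     FROM mytexts p, mytexts_fti f
     WHERE p.oid = f.id;

 -- A substring search without writing the join by hand:
 SELECT a FROM mytexts_search WHERE string = 'schi';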

Working with ISBN and ISSN

At the Third International Conference on Book Market Research and Rationalization in the Book Trade, held in November 1966 in Berlin, an international numbering system for books was first discussed. Some publishers were already thinking of a computer-based system to make identification and numbering easier.

The International Standard Book Number (ISBN) system was developed out of the book numbering system introduced in the United Kingdom in 1967 by J. Whitaker & Sons, Ltd., and in the United States in 1968 by R. R. Bowker.

The International Organization for Standardization (ISO) Technical Committee 46 on Information and Documentation then investigated whether it was possible to use the English standard as the worldwide standard for book identification. ISBN was finally approved as ISO standard 2108 in 1970.

Now ISBN is an international standard used by publishers all over the world, and every new book is assigned to a unique ISBN number.

In this section, you explore the ISBN module, which was contributed to PostgreSQL by Garrett A. Wollman in 1998.

To show you how you can work with ISBN as a datatype, we create a table containing two columns:

 isbnissn=#  CREATE TABLE myisbn(name text, number isbn);  CREATE 

Let's insert a record into the table:

 isbnissn=#  INSERT INTO myisbn VALUES('Apache Administration', '3-8266-0554-3');  INSERT 20273569 1 

As you can see, the record has been inserted successfully, because the number we passed to the INSERT statement is a valid ISBN number. If we query the table, we can see that the value is stored in the table:

 isbnissn=#  SELECT * FROM myisbn;  name              number -----------------------+---------------  Apache Administration  3-8266-0554-3 (1 row) 

Let's try it with an invalid ISBN number:

 isbnissn=#  INSERT INTO myisbn VALUES('no book', '324324324324234');  ERROR:  isbn_in: invalid ISBN "324324324324234" 

The ISBN module makes sure that no invalid ISBN numbers can be inserted into a field. But why do we have to use a self-defined datatype if we could also use something like char(16)? The reason is simple. Imagine a user interface where someone can insert data into a table. If we used char(16), it would hardly be possible to check whether the input was okay or to work with operators such as < and >, because for plain character strings these operators carry no ISBN-specific meaning.

Let's look at the next example. We try to figure out whether the ISBN of the book Barbarians Led by Bill Gates is smaller than the ISBN of the book about Apache:

 isbnissn=#  SELECT * FROM myisbn WHERE number>'3-8266-0506-3'::isbn;  name              number -----------------------+---------------  Apache Administration  3-8266-0554-3 (1 row) 

The ISBN of the Apache book is higher than that of the book about Bill Gates, so the Apache book is displayed in the list. This example shows why such datatypes are useful.

The module also supports ISSN. The International Standard Serial Number (ISSN) is a code used on catalogs, databases, and commercial transactions each time serial publications are involved. It is also universally used.
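
The issn type works just like isbn. Here is a hedged sketch; the table and the journal are made up for this example, and the number is only meant to show the format (the module verifies the check digit just as it does for ISBN numbers):

 -- Hypothetical table using the issn datatype provided by the same module.
 CREATE TABLE myissn (name text, number issn);

 -- A correctly formed ISSN (made up for this example) is accepted;
 -- malformed input would be rejected with an error, just as with isbn.
 INSERT INTO myissn VALUES ('some journal', '1144-875X');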

PostgreSQL Large Object Dumper

You have already learned that binary large objects cannot be dumped as easily as other datatypes. The PostgreSQL large object dumper ( pg_dumplo ) by Karel Zak provides an easy way to do the job and is easy to use.

Backup

To install the package, we go to the contrib/pg_dumplo directory in the PostgreSQL source tree and use make and make install . With these commands, the software is compiled and installed on the system.

Let's get to some examples and see how pg_dumplo can be used. First we create a table where we will store the object handle:

 lodumper=#  CREATE TABLE lo (id oid);  CREATE 

Then we import a simple ASCII file into the database as a large object. To show you how that can be done, we use the /etc/passwd file.

 lodumper=# \lo_import /etc/passwd
 lo_import 20273581

The table is still empty, because we have only imported the file into the database; the object id is not automatically inserted into the table:

 lodumper=#  SELECT * FROM lo;  id ---- (0 rows) 

To insert the object handle into the table, we perform a simple INSERT statement:

 lodumper=#  INSERT INTO lo VALUES (20273581);  INSERT 20273583 1 

Note

The id passed to the INSERT statement has to be the id returned by the import function.


As you can see, the table contains one record now:

 lodumper=#  SELECT * FROM lo;  id ----------  20273581 (1 row) 

Now that we have added a large object to the database, we can show you how to back up a table containing binary objects only. For backup purposes, we use pg_dumplo . Here is a syntax overview of the program:

 [hs@athlon pg_dumplo]$ pg_dumplo --help
 pg_dumplo 7.1.0 - PostgreSQL large objects dump

 pg_dumplo [option]

 -h --help                    this help
 -u --user=<username>         username for connection to server
 -p --password=<password>     password for connection to server
 -d --db=<database>           database name
 -t --host=<hostname>         server hostname
 -s --space=<dir>             directory with dump tree (for export/import)
 -i --import                  import large obj dump tree to DB
 -e --export                  export (dump) large obj to dump tree
 -l <table.attr ...>          dump attribute (columns) with LO to dump tree
 -a --all                     dump all LO in DB (default)
 -r --remove                  if is set '-i' try remove old LO
 -q --quiet                   run quietly
 -w --show                    not dump, but show all LO in DB

 Example (dump):   pg_dumplo -d my_db -s /my_dump/dir -l t1.a t1.b t2.a
                   pg_dumplo -a -d my_db -s /my_dump/dir
 Example (import): pg_dumplo -i -d my_db -s /my_dump/dir
 Example (show):   pg_dumplo -w -d my_db

 Note:  * option '-l' must be last option!
        * option '-i' without option '-r' make new large obj in DB
          not rewrite old, the '-i' UPDATE oid numbers in table.attr only!
        * if is not set option -s, the pg_dumplo use $PWD

To back up the table, we can use a command like this one:

 [hs@athlon test]$  pg_dumplo -s . -d lodumper -l lo.id  dump lo.id (1 large obj) Exported 1 large objects. 

You can see that one object has been saved to disk successfully. The -s flag defines the directory where the result of the dump has to be stored. In this example, we want the output to be saved in the current directory. The -d flag defines the database we want to dump, and -l tells pg_dumplo which columns have to be dumped.

After executing the shell command, here is the output:

 [hs@athlon test]$ tree *
 lodumper
 |-- lo
 |   `-- id
 |       `-- 20273581
 `-- lo_dump.index

 2 directories, 2 files

pg_dumplo organizes the output in subfolders . The object ids of the various files are taken as filenames so that the objects can easily be identified. The file lo_dump.index contains a list of the dumped objects:

 #
 # This is the PostgreSQL large object dump index
 #
 #       Date:     Sun Jul 22 17:24:27 2001
 #       Host:     localhost
 #       Database: lodumper
 #       User:     hs
 #
 # oid   table   attribut        infile
 #
 20273581        lo      id      lodumper/lo/id/20273581

This file is needed for restoring the backup.

Recovery

Now that we have dumped the BLOBs in our database, we want to have a look at the recovery process, because dumping the files without knowing how to perform the recovery process is not useful.

The file we created before is not enough for recovering a database completely, because only information about BLOBs is stored. Therefore we have to perform an ordinary pg_dump :

 [hs@athlon lodumper]$  pg_dump lodumper > lo.sql  

Let's look at the recovery process. First we create the database where the data will be inserted during the recovery process. In the next step, we insert the dump of the database into the database we just created:

 [hs@athlon lodumper]$  createdb lodumper  CREATE DATABASE [hs@athlon lodumper]$  psql lodumper < lo.sql  

All data but the binary objects have been inserted into our database; next we deal with the BLOBs:

 [hs@athlon lodumper]$  pg_dumplo -s /tmp/test -i -d lodumper  20273581        lo      id      lodumper/lo/id/20273581 Imported 1 large objects. 

The recovery process works just like the backup process. We use the -i flag to tell pg_dumplo that an import has to be performed instead of an export.

pg_dump and pg_dumplo are two reliable programs for saving databases. However, there is one problem you have to be aware of when working with them: When dumping a database with pg_dump, the database makes sure that a consistent snapshot of the database is saved to disk. pg_dumplo also makes sure that no inconsistent data is saved, but if the two programs are executed one after the other, it cannot be guaranteed that both will see the same snapshot of the database. Consider this in production environments; otherwise, you might have problems during the recovery process.

Benchmarking with pgbench

To perform simple benchmarks, you can use a program called pgbench, which can be found in contrib/pgbench in your PostgreSQL sources. Compiling and installing the program works just like the installation process for the software we have already presented (use make and make install).

The program pgbench provides many command-line flags. Here is an overview of all options provided by the program:

 [hs@athlon pgbench]$ ./pgbench --help
 ./pgbench: invalid option -- -
 usage: pgbench [-h hostname][-p port][-c nclients][-t ntransactions][-s scaling_factor][-n][-v][-S][-d][dbname]
 (initialize mode): pgbench -i [-h hostname][-p port][-s scaling_factor][-d][dbname]

As you can see, we have quite a lot of options to define the way the benchmark has to be processed. Besides the flags needed to define the host and the database we want to benchmark, we can use the -c flag to tell pgbench the number of clients simulated by the benchmarking software. The -s flag defines the scaling factor, which influences the size of the test database (if it is set to 100, 10 million records will be used for testing). -t sets the number of transactions every client has to perform.

Before we can start the benchmark, we have to initialize the database, which can be done by using the -i flag:

 [hs@athlon pgbench]$ ./pgbench -i pgbench
 NOTICE:  CREATE TABLE/PRIMARY KEY will create implicit index 'branches_pkey' for table 'branches'
 NOTICE:  CREATE TABLE/PRIMARY KEY will create implicit index 'tellers_pkey' for table 'tellers'
 NOTICE:  CREATE TABLE/PRIMARY KEY will create implicit index 'accounts_pkey' for table 'accounts'
 creating tables...
 10000 tuples done.
 20000 tuples done.
 30000 tuples done.
 40000 tuples done.
 50000 tuples done.
 60000 tuples done.
 70000 tuples done.
 80000 tuples done.
 90000 tuples done.
 100000 tuples done.
 vacuum...done.

The test is performed with 100,000 records. If -s is set, the number of records is changed.

Let's perform a test:

 [hs@athlon pgbench]$ ./pgbench pgbench -c 20 -t 100
 starting vacuum...end.
 transaction type: TPC-B (sort of)
 scaling factor: 1
 number of clients: 20
 number of transactions per client: 100
 number of transactions actually processed: 2000/2000
 tps = 76.807278(including connections establishing)
 tps = 78.277629(excluding connections establishing)

Our test system can perform around 77 transactions per second ( tps ). Depending on the type of benchmark we run (size of transaction, and so forth), this value changes significantly.

The seg module

Often, scientific results contain intervals of floating-point numbers. Although it is possible to model an interval with the help of two columns, this is, in most cases, not the most elegant way to solve the problem. For these situations, a datatype called seg has been developed that performs all important interval operations inside PostgreSQL.

The seg package has been developed by Gene Selkov, Jr. and can be found in the contrib/seg directory of the PostgreSQL source tree. In this section, you take a closer look at this module and see how it can be used efficiently.

After compiling the module using make and make install , we insert the SQL code for adding the seg module to our database with the following command:

 [hs@athlon seg]$  psql myseg < seg.sql  

If the SQL script can be added without any problems, seg is available in the database called myseg now. To make sure that everything is working correctly, we perform a small regression test:

 [hs@athlon seg]$ make installcheck
 make -C ../../src/test/regress pg_regress
 make[1]: Entering directory `/data/postgresql-7.1.3/src/test/regress'
 make[1]: `pg_regress' is up to date.
 make[1]: Leaving directory `/data/postgresql-7.1.3/src/test/regress'
 ../../src/test/regress/pg_regress seg
 (using postmaster on Unix socket, default port)
 ============== dropping database "regression"         ==============
 ERROR:  DROP DATABASE: database "regression" does not exist
 dropdb: database removal failed
 ============== creating database "regression"         ==============
 CREATE DATABASE
 ============== installing PL/pgSQL                    ==============
 ============== running regression test queries        ==============
 test seg                  ... ok
 =====================
  All 1 tests passed.
 =====================

The test has been passed successfully, so we can start to work with the module safely. First we create a table consisting of one column:

 myseg=#  CREATE TABLE myseg (a seg);  CREATE 

The next examples show how data can be added to the table:

 myseg=# INSERT INTO myseg VALUES ('1');
 INSERT 24800338 1
 myseg=# INSERT INTO myseg VALUES ('2.0');
 INSERT 24800339 1
 myseg=# INSERT INTO myseg VALUES ('3.00');
 INSERT 24800340 1
 myseg=# INSERT INTO myseg VALUES ('~ 4.00');
 INSERT 24800341 1
 myseg=# INSERT INTO myseg VALUES ('< 5.00');
 INSERT 24800342 1
 myseg=# INSERT INTO myseg VALUES ('> 6.00');
 INSERT 24800343 1

Now that we have added six records to the table, we check whether the records have been inserted successfully:

 myseg=# SELECT * FROM myseg;
    a
 -------
  1
  2.0
  3.00
  ~4.00
  <5.00
  >6.00
 (6 rows)

All six records are stored in the table. Let's take a closer look at the INSERT statements now. You might wonder why the first three values have been defined with different precision and why the database seems to distinguish the number of digits on the right side of the decimal point. The reason lies in the way scientists report results. In scientific environments, it makes a difference whether the value is 1.0 or 1.00. Every digit in the number is a reliable one. 1.00 means that both digits right of the decimal point are reliable; therefore, 1.0 is not as precise as 1.00. The seg datatype takes care of things such as that. In record four, ~ means approximately and tells the user that the value is not defined in full precision. ~, <, and > have no influence on how the record is treated, but they help the user interpret the meaning of a value correctly.

In many cases, the seg datatype is used to define an interval of values:

 myseg=#  INSERT INTO myseg VALUES ('7(+-)0.3');  INSERT 24800344 1 

This example shows how an interval can be defined. In the next example, we define an interval ranging from 6.7 to 7.3 :

 myseg=#  INSERT INTO myseg VALUES ('6.7 .. 7.3');  INSERT 24800345 1 

Intervals do not need to have an upper or a lower border. If we want an interval to contain all values that are equal to or higher than 8, we define it like this:

 myseg=#  INSERT INTO myseg VALUES ('8 ..');  INSERT 24800346 1 
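
The opposite case should work the same way; a sketch, assuming the module also accepts the '.. value' notation for intervals with no lower border:

 -- Assumed syntax: an interval containing all values up to 5.
 INSERT INTO myseg VALUES ('.. 5');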

After inserting various records, we want to retrieve values from the table. In contrast to other datatypes, there are a few things that have to be considered.

As you have seen before, the interval 6.7 .. 7.3 is now stored in the table twice (once entered as '7(+-)0.3' and once as '6.7 .. 7.3'). Let's look for a value that is exactly 7:

 myseg=#  SELECT * FROM myseg WHERE a='7'::seg;  a --- (0 rows) 

No records are retrieved, because no records are exactly 7 . Let's find all values that are higher than 7 :

 myseg=#  SELECT * FROM myseg WHERE a>'7'::seg;  a ------  8 .. (1 row) 

One record is found: the only record that is entirely higher than 7. The interval '7(+-)0.3' is not found, because a part of the interval is smaller than 7.

If we want our result to contain the entire interval, we need a query like this one:

 myseg=# SELECT * FROM myseg WHERE a>'6.6'::seg;
      a
 ------------
  6.7 .. 7.3
  6.7 .. 7.3
  8 ..
 (3 rows)

seg has been built on the GiST interface, which is a generalized interface to R-trees (the ones used for spatial searching). When using this interface, some operators are always defined. The next example shows how we can find all records containing 7 :

 myseg=# SELECT * FROM myseg WHERE a @ '7'::seg;
      a
 ------------
  6.7 .. 7.3
  6.7 .. 7.3
 (2 rows)

As we expect, two records are returned.

The following table lists operators supported by the datatype:

Operator Meaning
<< Left of
>> Right of
&< Overlaps or is left of
&> Overlaps or is right of
&& Overlaps
= Equal
!= Not equal
@ Contains
~ Contained in
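
For example, the && operator can be used to find all stored intervals that overlap a given range; a short sketch based on the myseg table from above:

 -- Which of the stored intervals overlap the range 6.5 .. 7.0 ?
 SELECT * FROM myseg WHERE a && '6.5 .. 7.0'::seg;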

The soundex Code

This section introduces the soundex module, which implements the soundex code as a PostgreSQL user-defined function.

The Soundex system is a method of matching similar-sounding names (or any words) to the same code. In 1880, 1900, and 1910, the code was used by the United States Census Bureau. The idea behind the system is to retrieve similar-sounding values.

To install the module, we go to the soundex directory in the contributed section of our PostgreSQL sources and run make and make install . Then we add the required function to a database (in our example, the database is called soundex ):

 [hs@athlon soundex]$  psql soundex < soundex.sql  CREATE CREATE 

We define a table called mysoundex :

 soundex=# \d mysoundex
    Table "mysoundex"
  Attribute | Type | Modifier
 -----------+------+----------
  a         | text |

In the next step we insert two records into that table:

 soundex=#  INSERT INTO mysoundex VALUES('John');  INSERT 24800378 1 soundex=#  INSERT INTO mysoundex VALUES('Joan');  INSERT 24800379 1 

Before we try to retrieve the value from the database, we try to find out what the text_soundex function does:

 soundex=#  SELECT text_soundex('john');  text_soundex --------------  J500 (1 row) 

A code is generated that is based on the value that has to be encoded. The records sound similar, and the target is to retrieve both records from the table:

 soundex=#  SELECT * FROM mysoundex WHERE text_soundex(a) = text_soundex('john');  a ------  John  Joan (2 rows) 

With the help of the text_soundex function, we compute the soundex code of john and compare it with the soundex code of the values in the table. If the soundex code of john and the soundex code of the value in the table match, the row is retrieved.

Note

For queries such as these, the database has to perform a sequential scan; PostgreSQL cannot use an index on the column directly.


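One possible way around this limitation, sketched here as an assumption rather than a tested recipe, is a functional index on the soundex code. This only works if the text_soundex function is declared cachable (immutable), which may require adjusting the function definition shipped in soundex.sql:

 -- Hedged sketch: a functional index on the soundex code of column a.
 -- Assumes text_soundex is marked as cachable; otherwise PostgreSQL
 -- will refuse to build or use the index.
 CREATE INDEX mysoundex_sdx_idx ON mysoundex (text_soundex(a));

 -- With the index in place, the lookup can in principle use an index scan:
 SELECT * FROM mysoundex WHERE text_soundex(a) = text_soundex('john');
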
Let's search for josef and see what we can find:

 soundex=#  SELECT * FROM mysoundex WHERE text_soundex(a) = text_soundex('josef');  a --- (0 rows) 

No values are retrieved, because josef has a different soundex code, as we can see in the next example:

 soundex=#  SELECT text_soundex('josef');  text_soundex --------------  J210 (1 row) 

Removing Lost Binary Objects from a Database

When inserting a binary large object into a database, an object id is generated and the new object can be accessed using this object id. If the object id is not inserted into a table immediately, the object id might get lost and the object will become an orphan. A tool called vacuumlo has been implemented to remove orphaned large objects from a PostgreSQL database.
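
To make the problem more concrete, here is a sketch of how an orphan can arise in a psql session (compare the pg_dumplo example earlier):

 -- Import a file; psql prints the object id of the new large object.
 \lo_import /etc/passwd
 -- If the returned object id is never stored in a table
 -- (for example with INSERT INTO lo VALUES (<the id>)),
 -- no row references the object any more: it is an orphan that
 -- only a tool such as vacuumlo will find and remove.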

To install the tool, we go to the contrib/vacuumlo directory in the source tree of PostgreSQL and run make and make install . After installing the software, we have a look at the syntax overview of the command:

 [hs@athlon vacuumlo]$  ./vacuumlo  Usage: ./vacuumlo [-v] database_name [db2 ... dbn] 

The -v flag (verbose) tells the program to report what it is doing. If it is not set, vacuumlo performs the job silently. Let's run the program:

 [hs@athlon vacuumlo]$  ./vacuumlo -v lodumper  Connected to lodumper Checking id in lo Removed 3 large objects from lodumper. 

Three objects have been removed from the database. vacuumlo is a reliable and useful tool for all database administrators who have to do some database cleaning. If applications accessing a database have not been implemented properly, BLOBs might get lost. In this case, database administrators have to deal with garbage and growing databases. vacuumlo can solve these problems once and for all if you start it from time to time using cron.

pg_controldata

You might be annoyed if you have to collect information about your database. That was exactly the problem Oliver Elphick had to face when he implemented pg_controldata , which collects information about the database quickly and reliably.

To run the program, we have to set the environment variable $PGDATA first:

 [root@athlon pg_controldata]# export PGDATA=/var/lib/pgsql/data/ 

Then we can run the program:

 [root@athlon pg_controldata]# pg_controldata
 pg_control version number:            71
 Catalog version number:               200101061
 Database state:                       IN_PRODUCTION
 pg_control last modified:             Tue Jul 24 14:32:08 2001
 Current log file id:                  0
 Next log file segment:                164
 Latest checkpoint location:           0/A31CD444
 Prior checkpoint location:            0/A31CD404
 Latest checkpoint's REDO location:    0/A31CD444
 Latest checkpoint's UNDO location:    0/0
 Latest checkpoint's StartUpID:        96
 Latest checkpoint's NextXID:          41410
 Latest checkpoint's NextOID:          24800380
 Time of latest checkpoint:            Mon Jul 23 18:31:01 2001
 Database block size:                  8192
 Blocks per segment of large relation: 131072
 LC_COLLATE:                           C
 LC_CTYPE:                             C

Several values are collected and printed on the screen. Some of these messages might be useful for you, others are not. No matter which information you want to retrieve, pg_controldata offers an easy and reliable way to do the job.


