10.2 Maintenance and Database Internals

After a database has been set up, the work of an administrator is not over because some things have to be done to keep the system up and running. In contrast to other databases, it takes very little effort to administer a PostgreSQL database because the entire system has been implemented flexibly and reliably. In addition, PostgreSQL is capable of handling huge amounts of data.

However, some basic administration tasks have to be done and you will learn about these in this section. In addition, it is necessary to understand some basic internals of the database. You will need this knowledge for configuring PostgreSQL to your needs.

10.2.1 `VACUUM` and `ANALYZE`

VACUUM is the most important command when working with PostgreSQL. It is used for many purposes and should be run periodically. Here is a list of what VACUUM is used for:

Regaining disk space Running operations like UPDATE will raise the amount of disk space located internally. Because PostgreSQL supports concurrent transactions, it might be necessary to store various versions of the data in the file depending on who is reading what. This will cause the files on disk to grow depending on the operations you are running. Using VACUUM will reduce the size of the files on disk.
Updating optimizer statistics PostgreSQL's optimizer depends on statistics. VACUUM and ANALYZE will update the statistics stored in the system tables.
Avoiding problems occurring because of transaction ID wraparound PostgreSQL's transaction code depends on the comparison of transaction ids (XID). The problem now is that a transaction id is a 4-byte integer, which limits the number of transaction ids to about 4 billion transactions. PostgreSQL databases prior to version 7.2 had to be rebuilt using initdb after every 4 billion transactions in order not to run into trouble. Since PostgreSQL 7.2, this limitation no longer exists, but the database must be vacuumed at least once after every billion transactions. On sites with high load, this has to be taken into consideration.

Now that you have seen what VACUUM can be used for, let's take a look at a practical example.

Create an empty table called data:

 phpbook=# CREATE TABLE data (data int4); CREATE

Currently the database cluster we are working on is about 41.7MB in size. Just go to the directory where your database cluster can be found and run du:

 [postgres@athlon pgdb]$ du 1716    ./base/1 1716    ./base/16555 3576    ./base/25110 1716    ./base/26076 8728    ./base 140     ./global 32812   ./pg_xlog 12      ./pg_clog 41732   .

In the next step you can write a PHP script for generating a file containing some UPDATE and DELETE operations:

 <?php         $fp = fopen("/tmp/data.sql", "w+");         if      (!$fp)         {                 echo "cannot open file\n";                 exit;         }         for     ($i = 0; $i < 10000; $i++)         {                 $sql = "INSERT INTO data VALUES ($i);\n";                 $sql.="DELETE FROM data;\n";                 fwrite ($fp, $sql);         }         fclose($fp); ?>

To generate the file called data.sql, use the PHP command-line interpreter:

 [postgres@athlon php]$ php makedata.php X-Powered-By: PHP/4.0.6 Content-type: text/html

If no error occurred, the file has been generated successfully and contains a set of UPDATE and DELETE operations as shown in the next listing:

 [postgres@athlon tmp]$ head data.sql INSERT INTO data VALUES (0); DELETE FROM data; INSERT INTO data VALUES (1); DELETE FROM data; INSERT INTO data VALUES (2); DELETE FROM data; INSERT INTO data VALUES (3); DELETE FROM data; INSERT INTO data VALUES (4); DELETE FROM data;

This SQL code can be sent to the database easily:

 [postgres@athlon pgdb]$ psql phpbook < /tmp/data.sql

Although not a single additional record is in the database, the amount of disk space needed has grown:

 [postgres@athlon pgdb]$ du 1716    ./base/1 1716    ./base/16555 3984    ./base/25110 1716    ./base/26076 9136    ./base 140     ./global 32812   ./pg_xlog 12      ./pg_clog 42140   .

To solve the problem and to reduce the space allocated on disk, VACUUM can be run. As you have already seen, VACUUM can be run from PostgreSQL's interactive shell, but in some cases it is more comfortable to use a shell program called vacuumdb to perform the job:

 [postgres@athlon pgdb]$ vacuumdb --help vacuumdb cleans and analyzes a PostgreSQL database. Usage:   vacuumdb [options] [dbname] Options:   -h, --host=HOSTNAME             Database server host   -p, --port=PORT                 Database server port   -U, --username=USERNAME         Username to connect as   -W, --password                  Prompt for password   -d, --dbname=DBNAME             Database to vacuum   -a, --all                       Vacuum all databases   -t, --table='TABLE[(columns)]'  Vacuum specific table only   -f, --full                      Do full vacuuming   -v, --verbose                   Write a lot of output   -z, --analyze                   Update optimizer hints   -e, --echo                      Show the command being sent to the backend   -q, --quiet                     Don't write any output Read the description of the SQL command VACUUM for details. Report bugs to <pgsql-bugs@postgresql.org>.

As you can see, it is possible to define quite a few parameters that are mostly needed for vacuuming a remote database. To vacuum all your databases, you can use the following command:

 [postgres@athlon pgdb]$ vacuumdb -e -z -a Vacuuming template1 VACUUM   ANALYZE VACUUM Vacuuming phpbook VACUUM   ANALYZE VACUUM

Because the -e flag has been used, PostgreSQL lists the SQL commands sent to the back end on the screen. After that you can see how much disk space is still allocated:

 [postgres@athlon pgdb]$ du 1744    ./base/1 1716    ./base/16555 3576    ./base/25110 1744    ./base/26076 8784    ./base 140     ./global 32812   ./pg_xlog 12      ./pg_clog 41788   .

The amount of space needed has decreased. If you are running many UPDATE and DELETE operations, it is recommended that you run VACUUM using a cron job when the load on your database is low.

To rebuild the statistics needed by the optimizer, ANALYZE has to be used. Whenever the data in your table changes significantly, you should run ANALYZE to provide the optimizer with correct and reliable information.

10.2.2 WAL

Write Ahead Logs (WAL) are a sophisticated feature of PostgreSQL. The main idea of WAL is to write files to disk after the changes have been logged. This has significant advantages because the number of accesses to the disk is reduced, so a lot of speed can be gained. In case of disaster, it is easy to restore the data because the changes have already been logged. With the help of the log, it is an easy task to repair the data after the crash safely. The concept is in a way similar to the journaling used by some filesystems.

In addition to speed, PostgreSQL can guarantee that the data in the database is consistent and ready for action after a crash.

UNDO operations are not yet implemented, but they will be implemented in future releases of PostgreSQL. With the help of UNDOs, it will be possible to implement savepoints, which will be another important step in history of PostgreSQL.

10.2.3 Maintaining Log Files

PostgreSQL can be configured to produce a lot of logging information. If the database is running for a long time, the logfile will grow and will allocate a lot of disk space. Therefore old information should be removed from time to time.

PostgreSQL does not remove old logging information by itself, so it is necessary to use external software such as Apache's logrotate.

10.2.1 VACUUM and ANALYZE

10.2.2 WAL

10.2.3 Maintaining Log Files

10.2.1 `VACUUM` and `ANALYZE`