How PostgreSQL Organizes Data

Table of contents:

The perf database and the recalls Table

Before you can really dig into the details of performance tuning, you need to understand some of the basic architecture of PostgreSQL.

You already know that in PostgreSQL, data is stored in tables and tables are grouped into databases. At the highest level of organization, databases are grouped into clustersa cluster of databases is serviced by a postmaster.

Let's see how this data hierarchy is stored on disk. You can see all databases in a cluster using the following query:

perf=# SELECT datname, oid FROM pg_database;
 datname | oid
-----------+-------
 perf | 16556
 template1 | 1
 template0 | 16555

From this list, you can see that I have three databases in this cluster. You can find the storage for these databases by looking in the $PGDATA directory:

$ cd $PGDATA
$ ls
base pg_clog pg_ident.conf pg_xlog postmaster.opts
global pg_hba.conf PG_VERSION postgresql.conf postmaster.pid

The $PGDATA directory has a subdirectory named base. The base subdirectory is where your databases reside:

$ cd ./base
$ ls -l
total 12
drwx------ 2 postgres pgadmin 4096 Jan 01 20:53 1
drwx------ 2 postgres pgadmin 4096 Jan 01 20:53 16555
drwx------ 3 postgres pgadmin 4096 Jan 01 22:38 16556

Notice that there are three subdirectories underneath $PGDATA/base. The name of each subdirectory corresponds to the oid of one entry in the pg_database table: the subdirectory named 1 contains the template1 database, the subdirectory named 16555 contains the template0 database, and the subdirectory named 16556 contains the perf database.

Let's look a little deeper:

$ cd ./1
$ ls
1247 16392 16408 16421 16429 16441 16449 16460 16472
1249 16394 16410 16422 16432 16442 16452 16462 16474
1255 16396 16412 16423 16435 16443 16453 16463 16475
1259 16398 16414 16424 16436 16444 16454 16465 16477
16384 16400 16416 16425 16437 16445 16455 16466 pg_internal.init
16386 16402 16418 16426 16438 16446 16456 16468 PG_VERSION
16388 16404 16419 16427 16439 16447 16457 16469
16390 16406 16420 16428 16440 16448 16458 16471

Again, you see a lot of files with numeric filenames. You might guess that these numbers also correspond to oids, and (by chance) you would often be correct. Every table (and index) in a database is catalogued in the pg_class system table. To find a table in pg_class, search for a row where the relname column is equal to the name of the table. For example, to find the pg_class entry for the pg_group table, execute the command

SELECT * FROM pg_class WHERE relname = 'pg_group';

The pg_class.relfilenode value for a table determines the name of the file that stores the table (likewise, the pg_class.relfilenode value for an index determines the name of the file that stores in the index). In most cases, the OID of a table's pg_class entry matches the table's relfilenode, but that's not always the case. If you ALTER a table in such a way that PostgreSQL must first make a new copy of the table and then drop the original, the table's relfilenode will change. The relfilenode may also change if you CLUSTER the table (PostgreSQL makes a new copy of the table in the desired order and then drops the original). The relfilenode value for an index may change if you rebuild in the index with a REINDEX command, or if you ALTER the data type of a column covered by the index.

To see the correspondence between a table's relfilenode and its filename, simply compare the output from the following SELECT command to the filesystem directory that contains the database

test=# SELECT relfilenode, relname FROM pg_class ORDER BY relfilenode;
 relfilenode | relname
-------------+---------------------------------
 0 | pg_xactlock
 1247 | pg_type
 1249 | pg_attribute
 1255 | pg_proc
 1259 | pg_class
 ... | ...

Each table is stored in its own disk file and the name of the file is determined by the oid relfilenode of the table's entry in the pg_class table.

There are two more columns in pg_class that might help explain PostgreSQL's storage structure:

perf=# SELECT relname, oid, relpages, reltuples FROM pg_class
perf-# ORDER BY oid
 relname | oid | reltuples | relpages
--------------+------+-----------+----------
 pg_type | 1247 | 143 | 2
 pg_attribute | 1249 | 795 | 11
 pg_proc | 1255 | 1263 | 31
 pg_class | 1259 | 101 | 2
 pg_shadow | 1260 | 1 | 1
 pg_group | 1261 | 0 | 0
 ... | ... | ... | ...

The reltuples column tells you how many tuples are in each table. The relpages column shows how many pages are required to store the current contents of the table. How do these numbers correspond to the actual on-disk structures? If you look at the table files for a few tables, you'll see that there is a relationship between the size of the file and the number of relpages columns:

$ ls -l 1247 1249
-rw------- 1 postgres pgadmin 16384 Jan 01 20:53 1247
-rw------- 1 postgres pgadmin 90112 Jan 01 20:53 1249

The file named 1247 (pg_type) is 16,384 bytes long and consumes two pages. The file named 1249 (pg_attribute) is 90,122 bytes long and consumes 11 pages. A little math will show that 16,384/2 = 8,192 and 90,122/11 = 8,192: each page is 8,192 (8K) bytes long. In PostgreSQL, all disk I/O is performed on a page-by-page basis[1]. When you select a single row from a table, PostgreSQL will read at least one pageit may read many pages if the row is large. When you update a single row, PostgreSQL will write the new version of the row at the end of the table and will mark the original version of the row as invalid.

[1] Actually, most disk I/O is performed on a page-by-page basis. Some configuration files and log files are accessed in other forms, but all table and index access is done in pages.

The size of a page is fixed at 8,192 bytes. You can increase or decrease the page size if you build your own copy of PostgreSQL from source, but all pages within a database will be the same size. The size of a row is not fixeddifferent tables will yield different row sizes. In fact, the rows within a single table may differ in size if the table contains variable-length columns. Given that the page size is fixed and the row size is variable, it's difficult to predict exactly how many rows will fit within any given page.

The perf database and the recalls Table

The sample database that you have been using so far doesn't really hold enough data to show performance relationships. Instead, I've created a new database (named perf) that holds some large tables. I've downloaded the recalls database from the U.S. National Highway Traffic Safety Administration[2]. This database contains a single table with 39,241 rows. Here is the layout of the recalls table:

perf=# d recalls
 Table "recalls"
 Column | Type | Modifiers
-------------+-------------------------+-----------
 record_id | numeric(9,0) |
 campno | character(9) |
 maketxt | character(25) |
 modeltxt | character(25) |
 yeartxt | character(4) |
 mfgcampno | character(10) |
 compdesc | character(75) |
 mgftxt | character(30) |
 bgman | character(8) |
 endman | character(8) |
 vet | character(1) |
 potaff | numeric(9,0) |
 ndate | character(8) |
 odate | character(8) |
 influenced | character(4) |
 mfgname | character(30) |
 rcdate | character(8) |
 datea | character(8) |
 rpno | character(3) |
 fmvss | character(3) |
 desc_defect | character varying(2000) |
 con_defect | character varying(2000) |
 cor_action | character varying(2000) |
Indexes: recall_record_id

Notice that there is only one index and it covers the record_id column.

[2] This data (ftp://ftp.nhtsa.dot.gov/rev_recalls/) is in the form of a flat ASCII file. I had to import the data into my perf database.

The recalls table in the perf database contains 39,241 rows in 4,413 pages:

perf=# SELECT relname, reltuples, relpages, oid FROM pg_class
perf-# WHERE relname = 'recalls';
 relname | reltuples | relpages | oid
---------+-----------+----------+-------
 recalls | 39241 | 4413 | 96409

Given that a page is 8,192 bytes long, you would expect that the file holding this table ($PGDATA/base/16556/96409) would be 36,151,296 bytes long:

$ ls -l $PGDATA/base/16556/96409
-rw------- 1 postgres pgadmin 36151296 Jan 01 23:34 96409

Figure 4.1 shows how the recalls table might look on disk. (Notice that the rows are not sortedthey appear in the approximate order of insertion.)

Figure 4.1. The recalls table as it might look on disk.

If a row is too large to fit into a single 8K block[3], PostgreSQL will write part of the data into a TOAST[4] table. A TOAST table acts as an extension to a normal table. It holds values too large to fit inline in the main table.

[3] PostgreSQL tries to store at least four rows per heap page and at least four entries per index page.

[4] The acronym TOAST stands for "the oversized attribute storage technique."

Indexes are also stored in page files. A page that holds row data is called a heap page. A page that holds index data is called an index page. You can locate the page file that stores an index by examining the index's entry in the pg_class table. And, just like tables, it is difficult to predict how many index entries will fit into each 8K page[5]. If an index entry is too large, it is moved to an index TOAST table.

[5] If you want more information about how data is stored inside a page, I recommend the pg_filedump utility from Red Hat.

In PostgreSQL, a page that contains row data is a heap block. A page that contains index data is an index block. You will never find heap blocks and index blocks in the same page file.

Page Caching

Two of the fundamental performance rules in any database system are:

Memory access is fast; disk access is slow.
Memory space is scarce; disk space is abundant.

Accordingly, PostgreSQL tries very hard to minimize disk I/O by keeping frequently used data in memory. When the first server process starts, it creates an in-memory data structure known as the buffer cache. The buffer cache is organized as a collection of 8K pageseach page in the buffer cache corresponds to a page in some page file. The buffer cache is shared between all processes servicing a given database.

When you select a row from a table, PostgreSQL will read the heap block that contains the row into the buffer cache. If there isn't enough free space in the cache, PostgreSQL will move some other block out of the cache. If a block being removed from the cache has been modified, it will be written back out to disk; otherwise. it will simply be discarded. Index blocks are buffered as well.

In the "Gathering Performance Information" section, you'll see how to measure the performance of the cache and how to change its size.

Summary

This section gave you a good overview of how PostgreSQL stores data on disk. With some of the fundamentals out of the way, you can move on to more performance issues.

Part I: General PostgreSQL Use

Introduction to PostgreSQL and SQL

Working with Data in PostgreSQL

PostgreSQL SQL Syntax and Use

Performance

Performance
How PostgreSQL Organizes Data
Gathering Performance Information
Understanding How PostgreSQL Executes a Query
Execution Plans Generated by the Planner
The ARC Buffer Manager
Table Statistics
Performance Tips

Part II: Programming with PostgreSQL

Introduction to PostgreSQL Programming

Extending PostgreSQL

PL/pgSQL

The PostgreSQL C APIlibpq

A Simpler C APIlibpgeasy

The New PostgreSQL C++ APIlibpqxx

Embedding SQL Commands in C Programsecpg

Using PostgreSQL from an ODBC Client Application

Using PostgreSQL from a Java Client Application

Using PostgreSQL with Perl

Using PostgreSQL with PHP

Using PostgreSQL with Tcl and Tcl/Tk

Using PostgreSQL with Python

Npgsql: The .NET Data Provider

Other Useful Programming Tools

Part III: PostgreSQL Administration

Introduction to PostgreSQL Administration

PostgreSQL Administration

Internationalization and Localization

Security

Replicating PostgreSQL Data with Slony

Contributed Modules

Index