Table Statistics | PostgreSQL (2nd Edition)

You've seen all the operators that PostgreSQL can use to execute a query. Remember that the goal of the optimizer is to find the plan with the least overall expense. Each operator uses a different algorithm for estimating its cost of execution. The cost estimators need some basic statistical information to make educated estimates.

Table statistics are stored in two places in a PostgreSQL database: pg_class and pg_statistic .

The pg_class system table contains one row for each table defined in your database (it also contains information about views, indexes, and sequences). For any given table, the pg_class.relpages column contains an estimate of the number of 8KB pages required to hold the table. The pg_class.reltuples column contains an estimate of the number of tuples currently contained in each table.

Note that pg_class holds only estimates ”when you create a new table, the relpages estimate is set to 10 pages and reltuples is set to 1000 tuples. As you INSERT and DELETE rows, PostgreSQL does not maintain the pg_class estimates. You can see this here:

 movies=# SELECT * FROM tapes;  tape_id       title      dist_id ----------+---------------+---------  AB-12345  The Godfather        1  AB-67472  The Godfather        1  MC-68873  Casablanca           3  OW-41221  Citizen Kane         2  AH-54706  Rear Window          3 (5 rows) movies=# CREATE TABLE tapes2 AS SELECT * FROM tapes; SELECT movies=# SELECT reltuples, relpages FROM pg_class movies-#   WHERE relname = 'tapes2';  reltuples  relpages -----------+----------       1000        10

Create the tapes2 table by duplicating the tapes table. You know that tapes2 really holds five tuples (and probably requires a single disk page), but PostgreSQL has not updated the initial default estimate.

There are three commands that you can use to update the pg_class estimates: VACUUM , ANALYZE , and CREATE INDEX .

The VACUUM command removes any dead tuples from a table and recomputes the pg_class statistical information:

 movies=# VACUUM tapes2; VACUUM movies=# SELECT reltuples, relpages FROM pg_class WHERE relname = 'tapes2';  reltuples  relpages -----------+----------          5         1 (1 row)

The pg_statistic system table holds detailed information about the data in a table. Like pg_class , pg_statistic is not automatically maintained when you INSERT and DELETE data. The pg_statistic table is not updated by the VACUUM or CREATE INDEX command, but it is updated by the ANALYZE command:

 movies=# SELECT staattnum, stawidth, stanullfrace FROM pg_statistic movies-#   WHERE starelid = movies-#     ( movies(#       SELECT oid FROM pg_class WHERE relname = 'tapes2' movies(#     );  staattnum  stawidth  stanullfrac -----------+----------+-------------  (0 rows) movies=# ANALYZE tapes; ANALYZE movies=# SELECT staattnum, stawidth, stanullfrace FROM pg_statistic movies-#   WHERE starelid = movies-#     ( movies(#       SELECT oid FROM pg_class WHERE relname = 'tapes2' movies(#     );  staattnum  stawidth  stanullfrac -----------+----------+-------------          1        12            0          2        15            0          3         4            0 (3 rows)

PostgreSQL defines a view (called pg_stats ) that makes the pg_statistic table a little easier to deal with. Here is what the pg_stats view tells us about the tapes2 table:

 movies=# SELECT attname, null_frac, avg_width, n_distinct FROM pg_stats movies-#   WHERE tablename = 'tapes2';  attname  null_frac  avg_width  n_distinct ---------+-----------+-----------+------------  tape_id          0         12          -1  title            0         15        -0.8  dist_id          0          4        -0.6 (3 rows)

You can see that pg_stats (and the underlying pg_statistics table) contains one row for each column in the tapes2 table. The null_frac value tells you the percentage of rows where a given column contains NULL . In this case, there are no NULL values in the tapes2 table, so null_frac is set to 0 for each column. avg_width contains the average width (in bytes) of the values in a given column. The n_distinct value tells you how many distinct values are present for a given column. If n_distinct is positive, it indicates the actual number of distinct values. If n_distinct is negative, it indicates the percentage of rows that contain a distinct value. A value of “1 tells you that every row in the table contains a unique value for that column.

pg_stats also contains information about the actual values in a table:

 movies=# SELECT attname, most_common_vals, most_common_freqs movies-#   FROM pg_stats movies-#   WHERE tablename = 'tapes2';  attname  most_common_vals   most_common_freqs ---------+-------------------+-------------------  tape_id                      title    {"The Godfather"}  {0.4}  dist_id  {1,3}              {0.4,0.4} (3 rows)

The most_common_vals column is an array containing the most common values in a given column. The most_common_freqs value tells you how often each of the most common values appear. By default, ANALYZE stores the 10 most common values (and the frequency of those 10 values). You can increase or decrease the number of common values using the ALTER TABLE ... SET STATISTICS command.

Another statistic exposed by pg_stat is called histogram_bounds :

 movies=#  SELECT attname, histogram_bounds FROM pg_stats movies-#   WHERE tablename = 'tapes2';  attname                 histogram_bounds ---------+------------------------------------------------  tape_id  {AB-12345,AB-67472,AH-54706,MC-68873,OW-41221}  title    {Casablanca,"Citizen Kane","Rear Window"}  dist_id  (3 rows)

The histogram_bounds column contains an array of values for each column in your table. These values are used to partition your data into approximately equally sized chunks .

The last statistic stored in pg_stats is an indication of whether the rows in a table are stored in column order:

 movies=# SELECT attname, correlation FROM pg_stats movies-#   WHERE tablename = 'tapes2';  attname  correlation ---------+-------------  tape_id          0.7  title           -0.5  dist_id          0.9 (3 rows)

A correlation of 1 means that the rows are sorted by the given column. In practice, you will see a correlation of 1 only for brand new tables (whose rows happened to be sorted before insertion) or tables that you have reordered using the CLUSTER command.