Indexes and Performance


Indexes are one of the most important tools for achieving higher performance. In most cases, when working with indexes, we are not talking about performance gains of 1 or 2 percent; when a table stores a huge amount of data, a query can become thousands of times faster. As a rule of thumb, the more data a table contains, the more we will gain by using an index. Indexes are most efficient when many different values are stored in a column. We won't gain much performance with an index on a Boolean field, because a Boolean field can only hold three distinct values ( true , false , and NULL ).

We have discussed how indexes work and what types of indexes are available in PostgreSQL in the section about components of relational databases. To show you how an index can speed up your query, we will create a table and write a script that generates 10 million records. We will use the output of the script as a sample database in the following section.

Let's create a table first:

  CREATE TABLE perftest (id int, astring text, anumber int);  

We have created an empty table with no index. The following script, called gendata.pl, can be used to generate random records, which we will insert into a database called performance :

#!/usr/bin/perl

# number of records to generate
$maxrecords = 10000000;        # 10 million

# Change the condition to a false one to write the data to /dev/null
# instead of psql (useful for timing the data generation alone).
if      (1 eq 1) {
        open(PIPE, "| psql -U postgres performance") or
                die "can't open pipe: $!\n";
} else {
        open(PIPE, "> /dev/null") or
                die "can't open pipe: $!\n";
}

$command = 'COPY "perftest" FROM stdin;';
@chars  = ("A" .. "Z", "a" .. "z", 0 .. 9);
@figure = (0 .. 9);

print PIPE "$command\n";
for ($i = 1; $i <= $maxrecords; $i++) {
        # an 8-character random string and an 8-digit random number
        $astring = join("", @chars[ map { rand @chars } (1 .. 8) ]);
        $numbers = join("", @figure[ map { rand @figure } (1 .. 8) ]);
        $check = $numbers;
        if      ($check < 1) {
                # do nothing
        } else {
                print PIPE "$i\t$astring\t$numbers\n";
        }
}
print PIPE "\\.\n";        # end-of-data marker for COPY
close(PIPE);

Let's see how the script works. $maxrecords contains the number of records we want to create. Then we open a pipe to psql for inserting the data into the database performance ; we connect to the database as user postgres . In the else branch, we open the file handle for writing to /dev/null instead; /dev/null simply discards everything written to it. We have included this branch so that we can find out how long the program needs just to generate the data (measured with the help of the time command); we need this value to calculate the time PostgreSQL needs to insert the data.

The data is inserted into the database with the help of the COPY command; $command contains the header of the command. Then the loop is processed. Inside the loop we create two random values: $astring is an 8-character random string, and $numbers is an 8-digit random number.

The ID of the record and the two random fields are then written into the pipe. Finally, the pipe is closed. The following five sample records show what the random data might look like:

1       QqMwNEjR        90972712
2       Eqprz14z        74727204
3       0eRwFecC        16321694
4       DxVbTxhv        97166380
5       XgIuvPtl        37298281

This script will generate 10 million records, such as the ones shown previously. Let's execute the script.

[hs@duron code]$ time ./gendata.pl

real    17m51.043s
user    11m15.240s
sys     0m5.510s

We have started the script with the help of time . time is a Unix command that measures the time for executing a script. You can see that it took 17 minutes and 51 seconds to execute the Perl script.
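
If you want to make sure that the data has actually been loaded, you can simply count the rows. This check is optional and reads the whole table, so it takes a while on 10 million records; the result may also be marginally below 10 million, because the script skips records whose random number is smaller than 1:

-- optional sanity check; this reads the entire table
SELECT count(*) FROM perftest;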

Now we want to select a value from the table. We have not defined an index on the column yet:

bash-2.04$ time echo "SELECT * FROM perftest WHERE id=5000000" | psql performance
   id    | astring  | anumber
---------+----------+----------
 5000000 | 1wfi0e57 | 45326076
(1 row)

real    3m36.231s
user    0m0.010s
sys     0m0.020s

We simply pipe the output of the echo command to psql . We use the Unix shell instead of the psql shell so that we can measure the time the command takes to complete. Indeed, the query takes a lot of time. Because we haven't created an index, the database performs a so-called "full table scan." This means that the database reads the table from the beginning to the end and checks for every record whether it fulfills the required conditions.
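
If you want to see in advance what the database plans to do, you can ask for the execution plan instead of running the query. We will cover this in detail in the "Using EXPLAIN and Understanding the Optimizer" section later in this chapter; for now, just note that without an index the plan contains a sequential scan on perftest (the exact output and the cost figures vary between PostgreSQL versions):

-- show the execution plan without actually executing the query;
-- with no index on id, it will report a sequential scan ("Seq Scan on perftest")
EXPLAIN SELECT * FROM perftest WHERE id=5000000;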

Now we create an index called idx_id_perftest on the id column:

bash-2.04$ time echo "CREATE INDEX idx_id_perftest ON perftest(id)" | psql performance
CREATE

real    8m52.278s
user    0m0.020s
sys     0m0.000s

As you can see, it takes nearly nine minutes. We perform the query a second time:

bash-2.04$ time echo "SELECT * FROM perftest WHERE id=5000000" | psql performance
   id    | astring  | anumber
---------+----------+----------
 5000000 | 1wfi0e57 | 45326076
(1 row)

real    3m53.837s
user    0m0.030s
sys     0m0.010s

The database still seems to perform a full table scan. This happens because the information in the system tables read by the optimizer (the part of the database that is responsible for finding the best way to execute a query) has not been updated yet. We will discuss this in the "Using EXPLAIN and Understanding the Optimizer" section later in this chapter. To solve the problem concerning the index, we can use the VACUUM command, as shown in the following example:

bash-2.04$ time echo "VACUUM VERBOSE ANALYZE perftest" | psql performance
NOTICE:  --Relation perftest--
NOTICE:  Pages 68494: Changed 0, reaped 0, Empty 0, New 0; Tup 10000000: Vac 0,
Keep/VTL 0/0, Crash 0, UnUsed 0, MinLen 52, MaxLen 52; Re-using: Free/Avail.
Space 0/0; EndEmpty/Avail. Pages 0/0. CPU 23.63s/66.78u sec.
NOTICE:  Index idx_id_perftest: Pages 19688; Tuples 10000000. CPU 5.81s/13.38u sec.
VACUUM

real    5m38.913s
user    0m0.020s
sys     0m0.030s

After running VACUUM, the query runs significantly faster:

bash-2.04$ time echo "SELECT * FROM perftest WHERE id=5000000" | psql performance
   id    | astring  | anumber
---------+----------+----------
 5000000 | 1wfi0e57 | 45326076
(1 row)

real    0m0.085s
user    0m0.010s
sys     0m0.000s

It takes only 0.085 seconds to process the query, which means that the query is about 2,750 times faster than before, simply because the index is now being used.

When measuring performance, we recommend that you use more than just one SQL statement, because the result may vary when executing it several times. Good input for a performance test would be a logfile. If you write all SQL statements issued by your application into a logfile, you can take the SELECT statements out of that file to test your database. This is a good way to run performance tests under real-world circumstances.
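
How you obtain such a logfile depends on your PostgreSQL version. In recent releases, statement logging can be switched on in postgresql.conf (older releases use differently named settings, so check the documentation of your version); the following lines are only a sketch, and the log directory is just an example:

# postgresql.conf (excerpt): log every statement sent to the server
log_statement = 'all'
logging_collector = on
log_directory = 'pg_log'        # example path, relative to the data directory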

Note

The more SQL statements you use for testing, the more representative the result of the performance test will be.


By default, PostgreSQL creates B-tree indexes, but we can also use hash or R-tree indexes instead (depending on the data type of the column we want to index).
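
The index type is chosen with the USING clause of CREATE INDEX. The following lines are only a sketch: the index names are our own, and whether a hash index actually pays off depends strongly on your PostgreSQL version, so in most cases the default B-tree is the right choice:

-- explicitly request a B-tree (the default) and a hash index
CREATE INDEX idx_btree_perftest ON perftest USING btree (anumber);
CREATE INDEX idx_hash_perftest  ON perftest USING hash (astring);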

Indexes are not restricted to a single column; one index can also cover two or more columns at once. A multicolumn index will significantly speed up your query when you retrieve values from exactly those columns covered by the index. We will drop the index we created before and create a new index that covers id and astring :

DROP INDEX idx_id_perftest;
CREATE INDEX idx_id_perftest ON perftest (id, astring);

We will now perform two queries. As you can see, the second query is much faster than the first one, because it restricts both columns covered by the index and can therefore use it very efficiently:

bash-2.04$ time echo "SELECT * FROM perftest WHERE id=6000001" | psql performance
   id    | astring  | anumber
---------+----------+----------
 6000001 | FnHdo3RL | 53523975
(1 row)

real    0m0.419s
user    0m0.000s
sys     0m0.020s

bash-2.04$ time echo "SELECT * FROM perftest WHERE id=6000001 AND astring='FnHdo3RL'" | psql performance
   id    | astring  | anumber
---------+----------+----------
 6000001 | FnHdo3RL | 53523975
(1 row)

real    0m0.054s
user    0m0.020s
sys     0m0.000s

But indexes are not only used to speed up queries. Unique indexes can be used to ensure that a value occurs only once in a table; primary keys are implemented as unique indexes as well. If we define a unique index on multiple columns, PostgreSQL ensures that no combination of values exists more than once. Additionally, unique indexes are faster than indexes that can contain multiple entries of one value.
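
As a small sketch (the index names are only suggestions), a unique index on id rejects duplicate IDs, while a unique index on id and astring only rejects duplicates of the combination of the two columns:

-- no id value may occur twice
CREATE UNIQUE INDEX idx_unique_id ON perftest (id);

-- individual values may repeat, but the combination (id, astring) must be unique
CREATE UNIQUE INDEX idx_unique_id_astring ON perftest (id, astring);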

Whenever a relationship between data occurs, a join can be performed. Joining means that the database has to find the corresponding data for certain values in another table. To find the right values quickly, it is extremely useful to create indexes on all columns involved in the join operation; this can speed up the operation significantly.
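
The following sketch illustrates the idea with a hypothetical detail table; perfdetail and its columns are not part of our sample data and are only used to show where the indexes belong:

-- hypothetical second table referencing perftest.id
CREATE TABLE perfdetail (id int, comment text);
CREATE INDEX idx_id_perfdetail ON perfdetail (id);

-- with indexes on the join columns of both tables, the matching rows
-- can be found without scanning either table completely
SELECT p.id, p.astring, d.comment
    FROM perftest p, perfdetail d
    WHERE p.id = d.id
      AND p.id = 5000000;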


