7.8 Indexes | PHP and PostgreSQL. Advanced Web Programming2002

The more records you want to store in a table, the longer it will take to find the appropriate values because the database has to check every record in the table at least once to generate the result. If you are working with millions of records, this will take far too long and the performance of your application will suffer if the number of records in the database increases.

Trees are a fundamental data structure and can be used to speed up the access to huge amounts of data. Indexes (non-American readers will spell the plural of the word index as indices, which is the correct Latin plural of the word) are based on trees, and in this section you will learn how to create and drop indexes. You will also have a brief look at PostgreSQL internals.

7.8.1 Creating and Removing Indexes

Creating indexes is as simple as creating any other object in PostgreSQL. To define an index, use the CREATE INDEX command. Let's look at the syntax overview of the command:

 phpbook=# \h CREATE INDEX Command:     CREATE INDEX Description: define a new index Syntax: CREATE [ UNIQUE ] INDEX index_name ON table     [ USING acc_method ] ( column [ ops_name ] [, ...] )     [ WHERE predicate ] CREATE [ UNIQUE ] INDEX index_name ON table     [ USING acc_method ] ( func_name( column [, ... ]) [ ops_name ] )     [ WHERE predicate ]

As you can see, the syntax is easy to understand. Let's create an index:

 phpbook=# CREATE INDEX idx_student_id ON student(id); CREATE

The name of the index is idx_student_id and it has been defined on the column called id in table student. When looking for data in the specific column, PostgreSQL might use the index in order to retrieve the data faster. PostgreSQL might use the index PostgreSQL is not forced to use the index. Depending on the decision of the optimizer, which is responsible for finding the best way through a query, PostgreSQL might neglect the index if it thinks that it is faster to perform a sequential scan. Performing a sequential scan means that the table is read from the beginning to the end without using an index. Query optimization is covered in detail in Chapter 18, "Tuning."

When creating the index, PostgreSQL will sort the data internally and a tree structure will be built. If a table is huge, the operation will take some time.

Indexes can also defined for more than just one column. This does not mean that both columns can be accessed fast when looking for values in one column. Defining an index on more than just one column means that PostgreSQL will retrieve values fast when you are looking for a pair of values that is in the column you have defined an index on. To generate an index on more than just one column, a list of columns has to be passed to the CREATE INDEX command:

 phpbook=# CREATE INDEX idx_student_id_name ON student(id, name); CREATE

So far, two indexes have been generated. To see which indexes have been defined on one table, you can use \d:

 phpbook=# \d tudent        Table "student"  Column |  Type   | Modifiers --------+---------+-----------  id     | integer |  name   | text    | Indexes: idx_student_id,          idx_student_id_name

As you can see, two indexes have been generated. The indexes are listed when retrieving information about the tables, but the columns they have been defined on will not be listed. Therefore we recommend assigning useful names to indexes that contain the names of the column the indexes have been defined for. Of course, it is possible to ask PostgreSQL about the index, but it is better to store this information in the name as well.

Let's retrieve some information about an index:

 phpbook=# \d idx_student_id Index "idx_student_id"  Column |  Type --------+---------  id     | integer btree

PostgreSQL returns the name of the column the index has been defined for, but this time you will not find the table the index has been defined for. In addition, the data type of the values in the index is returned. In this example the index is based on B-trees.

To remove an index from a column, use DROP INDEX:

 phpbook=# \h DROP INDEX Command:     DROP INDEX Description: remove an index Syntax: DROP INDEX index_name [, ...]

Let's remove an index:

 phpbook=# DROP INDEX idx_student_id_name; DROP

If no error occurred, the index has been removed successfully.

7.8.2 Repairing a Corrupt System Index

This section deals with an extremely unlikely situation. We have been working with PostgreSQL for years, but we have never seen PostgreSQL collapse because it is an extremely reliable piece of software. However, let's look at a command called REINDEX, which can be used to recover corrupted system indexes.

 phpbook=# \h REINDEX Command:     REINDEX Description: recover a corrupted system index Syntax: REINDEX { TABLE | DATABASE | INDEX } name [ FORCE ]

As you can see, REINDEX can be used not only for restoring indexes, but corrupted indexes are the most likely scenario. Let's try the command:

 phpbook=# REINDEX INDEX idx_student_id; REINDEX

As you can see, everything works just the way it should. You will have a closer look at REINDEX in Chapter 10, "PostgreSQL Administration."

7.8.3 PostgreSQL Internals

PostgreSQL provides three types of indexes:

B-trees B-tree indexes are an implementation of Lehman-Yao high-concurrency B-trees and are the standard index structure of PostgreSQL. If you are not working with geometric data types, PostgreSQL will automatically use B-trees.
R-trees R-trees are used for spatial searching and are based on Guttman's quadratic split algorithm. Spatial searching means that it is possible to perform a basic set of geometric operations using indexes. Therefore PostgreSQL is an excellent database for systems storing geographic data.
Hashes PostgreSQL's hashes are based on Litwin's linear hashing. Hashes are not used very often.

You have already heard about trees and indexes, but what do trees look like and how do they work? Take a look at Figure 7.1.

Figure 7.1. A B-tree.

graphics/07fig01.gif

The root node contains the value 100. Every value lower than the root node is on the left side all values higher than the root node are stored on the right side. In this scenario every node has two children. Those nodes that don't have children are called leaves. If you are reading the values from left to right, you will find that the tree contains a sorted list of the data. The advantage of a sorted list is that a record can be found fast. Assume that you are looking for the value 154. It will take 2 steps to find the correct value out of 7 records. In a sequential scan, the average would be 3.5 steps (7 divided by 2). If the number of records in the table doubles, PostgreSQL will need just one more step to find the correct value in an unsorted list it would take twice the time. To retrieve one record out of 1,000,000 records, it takes about 20 steps when using an index if no index is used it would take about 500,000 steps. This example shows how much faster it can be to use an index instead of a sequential scan.

Figure 7.1 shows the idea of working with trees. The real implementation of PostgreSQL is far more complex because things like transactions and overhead due to hardware-related questions have to be taken into consideration, but the most important thing is to understand why things can be much faster when working with indexes.