Optimizing Database Performance


No matter how well you have designed your database schema, as your site becomes more and more popular and heavily used, it is likely that certain parts of its functionality will bog down. Perhaps you've forgotten an index in a certain crucial table. Perhaps you're doing a complex join where a series of temporary tables would be computationally cheaper. Or perhaps your dynamic pages are simply far too complex, and you need to create a cache of static information that's updated on a regular basis in order to speed up on-demand queries at the expense of instantaneous accuracy.

At the top-level view, it might seem impossible to figure out what's slowing things down; all you see might be a perpetually high load average in top. Fortunately, there are some ways to find out just what the database is doing in those queries. The following sections discuss these techniques.

Monitoring Database Usage

MySQL has a command called SHOW PROCESSLIST. Entered at the MySQL command line, this command will show you all the queries currently in progress:


mysql> SHOW PROCESSLIST;
+--------+-------+-----------+----------------+---------+------+----------------+------------------+
| Id     | User  | Host      | db             | Command | Time | State          | Info             |
+--------+-------+-----------+----------------+---------+------+----------------+------------------+
| 625441 | root  | localhost | PictureArchive | Query   |    0 | NULL           | show processlist |
| 625443 | frank | localhost | PictureArchive | Query   |    0 | Copying to tmp | select name,dirname,id,count(*) from artists,pictures where id=pictures.artistid and display='Y' gro |
| 625444 | frank | localhost | PictureArchive | Sleep   |    1 |                | NULL             |
+--------+-------+-----------+----------------+---------+------+----------------+------------------+
3 rows in set (0.01 sec)


If you see dozens of processes listed in this output table, with a Time value in the hundreds (of seconds), then you know you have a problem, particularly if all the queries (shown in the Info field) seem to be doing the same thing. If you can determine from the structure of the query where in your code it's being called from, you can look at the tables involved and make sure that the query is doing its joins on indexed columns and that the joins are sensibly constructed. You may also be able to redesign the code so that the query in question is called at a different time, when the CPU is not quite so busy with other queries.
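Note that SHOW PROCESSLIST truncates long queries in the Info field, as in the output shown earlier. As a brief sketch of how you might dig further (the process Id here is taken from that earlier listing), you can ask for the untruncated query text and, as a last resort, terminate a runaway query by its Id:

mysql> SHOW FULL PROCESSLIST;
mysql> KILL 625443;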

The EXPLAIN command offers a more direct, if somewhat cryptic, way to drill down into your queries to see what's costing CPU cycles. This command, when placed in front of a SELECT statement, will display a table showing the tables involved in the query, the possible keys it will search on, the estimated number of rows it must examine in each table, and other information about how the query is being performed. You can check the table definitions to see whether all the possible keys are being indexed properly. If they aren't, you can index them using ALTER TABLE (as explained earlier in "Indexes and Keys").
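For example, if EXPLAIN were to show that the join column Artist_ID in the Pictures table (used in the examples that follow) had no usable key, a statement like this sketch would add one:

mysql> ALTER TABLE Pictures ADD INDEX (Artist_ID);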

The following examples use a more complex version of the same database you've seen in this chapter, including some tables that haven't been used before:


mysql> EXPLAIN SELECT NAME,DIRNAME,ID,COUNT(*) FROM Artists,Pictures
    -> WHERE ID=Pictures.Artist_ID AND Display='Y'
    -> GROUP BY Artist_ID ORDER BY Created DESC LIMIT 5;
+----------+-------+-------------------+-----------+---------+------------+------+---------------------------------------------+
| table    | type  | possible_keys     | key       | key_len | ref        | rows | Extra                                       |
+----------+-------+-------------------+-----------+---------+------------+------+---------------------------------------------+
| Artists  | range | PRIMARY,Display   | Display   |       2 | NULL       | 1615 | where used; Using temporary; Using filesort |
| Pictures | ref   | PRIMARY,Artist_ID | Artist_ID |       4 | Artists.ID |   38 | Using index                                 |
+----------+-------+-------------------+-----------+---------+------------+------+---------------------------------------------+
2 rows in set (0.18 sec)


In PostgreSQL, EXPLAIN helps you root out the expensive operations that cause slowdowns in your queries through explicit "cost" analysis. The following example shows the use of EXPLAIN in PostgreSQL:


PictureArchive=# EXPLAIN SELECT Artists.Name, Pictures.name FROM Artists
PictureArchive-#     LEFT JOIN Pictures ON Artists.ID=Pictures.Artist_ID;
NOTICE:  QUERY PLAN:

Merge Join  (cost=823.39..1012.89 rows=10000 width=32)
  ->  Index Scan using Artists_pkey on Artists  (cost=0.00..52.00 rows=1000 width=16)
  ->  Sort  (cost=823.39..823.39 rows=10000 width=16)
        ->  Seq Scan on Pictures  (cost=0.00..159.00 rows=10000 width=16)

EXPLAIN


Here, you see a join being performed between the Artists and Pictures tables, but there is no key in Pictures to help the database engine search for the elements upon which to perform the join. Costs of each operation are summed upward; as you work your way up the tree, you can see the overall cost of the query increasing. The sequential scan and sort operation at the bottom of the tree is causing a significant hit on system performance. If you had an index on the Artist_ID key in Pictures, the database wouldn't have to do a sequential scan and could instead do an index scan, which is much faster, as shown in this example:


PictureArchive=# CREATE INDEX Pictures_Artist_ID_key ON Pictures ( Artist_ID );
CREATE
PictureArchive=# EXPLAIN SELECT Artists.Name, Pictures.Name FROM Artists
PictureArchive-#     LEFT JOIN Pictures ON Artists.ID=Pictures.Artist_ID;
NOTICE:  QUERY PLAN:

Merge Join  (cost=0.00..628.56 rows=10000 width=32)
  ->  Index Scan using Artists_pkey on Artists  (cost=0.00..52.00 rows=1000 width=16)
  ->  Index Scan using Pictures_Artist_ID_key on Pictures  (cost=0.00..439.06 rows=10000 width=16)

EXPLAIN


Here, the overall cost range is reduced from 823.39 through 1012.89 down to 0.00 through 628.56. (These units aren't very meaningful outside of the database context, but they can be compared against each other very effectively.) This maneuver has clearly yielded some very positive results.

Creating Temporary Tables

This may not seem obvious at first blush, but sometimes it's more efficient to create a temporary table than to do a complex join. Programmatically, it may appear that the most elegant solution to a problem is to accomplish the task in as few lines of code as possible, and in traditional programming languages this is mostly true. But databases don't quite work that way.

If you find yourself joining three or four tables together in a single query, and you notice that the query takes an extremely long time to execute, consider for a moment what the database is being asked to do. It must align four separate (and possibly very large) sets of data in memory, associating keys from each table with keys from the others. This process takes place in a conceptual matrix whose complexity rises by another degree with each table that's added to the mix. The multidimensional data structure that the database must hold in memory during this join is enormous, and searching through it becomes a monstrous task.

To avoid this problem, experiment with temporary tables. You can create a temporary table by simply preceding a SELECT statement with CREATE TEMPORARY TABLE t_name, where t_name is the name of the table; you can then select from this table (using the SELECT command) later in your session. Be sure to apply the appropriate indexes after the table is created. If you do this, a direct SELECT from the new table, even including the temporary table creation and indexing queries, might be orders of magnitude faster than a complex join, however cleverly you've managed to condense it into a single query.
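Here is a minimal sketch of the pattern in MySQL, using the Artists and Pictures tables from this chapter's examples (the Artist_Pics table name is made up for illustration):

-- Stage the expensive join once, into a session-local table
CREATE TEMPORARY TABLE Artist_Pics
    SELECT Artists.ID AS Artist_ID, Artists.Name AS Artist_Name,
           Pictures.Name AS Picture_Name
    FROM Artists, Pictures
    WHERE Artists.ID = Pictures.Artist_ID;

-- Index the column the later lookups will search on
ALTER TABLE Artist_Pics ADD INDEX (Artist_ID);

-- Subsequent queries hit the small, indexed table directly
-- (the value 42 is just an example)
SELECT Artist_Name, Picture_Name FROM Artist_Pics WHERE Artist_ID = 42;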

Note

Temporary tables only last until the end of the current database session; in the command-line context, temporary tables that you create will be there until you log out. In a web scripting context, the tables are dropped when the database handle is destroyed: when you issue a disconnect command through the programmatic toolkits or when the script completes.

Temporary table names exist in private namespaces, so multiple temporary tables with the same name can exist as long as they've been created by different database sessions.


Using Cache Tables

If your queries are large and complex but they just can't be made any faster or more efficient by indexing, you may have no other option than to go to a cache table. This is a table whose contents are nothing more than values that are refreshed on occasion from the results of much more complex queries. What you're doing, in effect, is creating a layer of indirection: a table from which users can get their information very quickly, with a direct and uncomplicated query, but whose contents might not be accurate until refreshed by a periodic process that performs the complex queries and inserts their results into the cache table.
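As a sketch of the idea, a cache table summarizing per-artist picture counts from this chapter's example database might look like the following (the Picture_Counts table and its columns are hypothetical):

-- The cache table: one simple row per artist
CREATE TABLE Picture_Counts (
    Artist_ID   INT NOT NULL PRIMARY KEY,
    Artist_Name VARCHAR(64),
    Num_Pics    INT
);

-- What users actually run: a trivial, fast query
SELECT Artist_Name, Num_Pics FROM Picture_Counts
    ORDER BY Num_Pics DESC LIMIT 10;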

This is one area where Perl scripts that are not CGI programs will come in handy. For instance, you can run a cron job every 10 minutes that executes a Perl script to connect to the database and refresh the cache by running the complex queries. The script can operate in an atomic fashion, by issuing a BEGIN statement, followed by the series of queries, and finally a COMMIT; this way, queries that would potentially change the data involved in the complex queries are deferred until the refresh is complete.
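The SQL that such a script issues might look like the following sketch, reusing the hypothetical Picture_Counts cache table from earlier (and assuming a storage engine that supports transactions); the surrounding Perl is omitted:

BEGIN;

-- Throw away the stale cache contents
DELETE FROM Picture_Counts;

-- Rerun the expensive aggregate query and store its results
INSERT INTO Picture_Counts (Artist_ID, Artist_Name, Num_Pics)
    SELECT Artists.ID, Artists.Name, COUNT(*)
    FROM Artists, Pictures
    WHERE Artists.ID = Pictures.Artist_ID
    GROUP BY Artists.ID, Artists.Name;

COMMIT;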

Note

See Chapter 15, "Performance Monitoring, Process Control, and Job Automation," for more information about using cron to schedule processes.


The system will be a little slower to respond while the refresh process is running, but in between, for 10 minutes at a stretch, or an hour, or however frequently you run your cron job, everything will be almost immeasurably faster.



