Optimizing Table Structure

< Day Day Up >

MySQL lets you specify a number of performance-related settings when you create or alter a table. You can do this in two ways:

Via SQL in the CREATE TABLE or ALTER TABLE command
Via the MySQL Table Editor from within MySQL Query Browser or MySQL Administrator, as shown in Figure 4.2
Figure 4.2. Specifying performance-related issues.

The following sections take a look at some of these options.

Specifying Row Format

When you create or modify a table using the MyISAM engine, you can request that the engine store rows in fixed or dynamic format. If your table contains no TEXT or BLOB columns, MySQL chooses the fixed option by default, which automatically converts VARCHAR columns to CHAR. On the other hand, if you select dynamic, MySQL converts all columns of type CHAR to VARCHAR. As a third choice, running the myisampack command instructs MySQL to compress the table into a smaller, read-only format.

Because of their consistent row size, fixed format tables translate to less work (and better response) for the MySQL engine when accessing, caching, and updating information. They are also less likely to become corrupted. If disk space is not at a premium, fixed format tables are probably the best choice.

Conversely, dynamic tables use less space, but run the risk of fragmentation and/or eventual corruption. The OPTIMIZE TABLE command is reviewed in Chapter 6, "Understanding the MySQL Optimizer," to see how to correct fragmentation problems. Note that if your table has any columns of type TEXT, BLOB, or VARCHAR, MySQL automatically chooses this option.

Finally, compressed tables use the least amount of space.

The following examines the storage implications of these types of tables. We created two identical tables, except for their ROW_FORMAT, and then inserted 100,000 rows of random data:

 CREATE TABLE fixed_demo (     col1 INT NOT NULL AUTO_INCREMENT PRIMARY KEY,     col2 VARCHAR(255),     col3 CHAR(40),     col4 INT,     INDEX (col2),     INDEX (col3) ) ENGINE = MYISAM ROW_FORMAT = FIXED; CREATE TABLE dynamic_demo (     col1 INT NOT NULL AUTO_INCREMENT PRIMARY KEY,     col2 VARCHAR(255),     col3 CHAR(40),     col4 INT,     INDEX (col2),     INDEX (col3) ) ENGINE = MYISAM ROW_FORMAT = DYNAMIC;

As expected, the fixed_demo table used significantly more disk space than dynamic_demo (29MB data and 48MB index versus 15.5MB data and 23.3MB index).

We then ran a simple query against both tables:

 SELECT COUNT(*) FROM fixed_demo WHERE col4 > 345 AND col4 < 1099; SELECT COUNT(*) FROM dynamic_demo WHERE col4 > 345 AND col4 < 1099;

Both queries ran in less than a quarter second. However, the dynamic table query took approximately 50% longer than the fixed table query. For more complex operations, the fixed table savings could be dramatic.

Finally, we ran myisampack against the fixed_demo table. This reduced the table to 18.8MB data and 37.8MB index. Rerunning the SELECT COUNT(*) query against this newly compressed table returned results 10% faster than before compression.

Specifying Table Size

When you're using a dynamic, MyISAM table, you can use the AVG_ROW_LENGTH and MAX_ROWS options to tell MySQL how large you expect your table to be. Knowing this value helps the engine construct index keys in a more optimal way. This can translate into faster performance, especially in large tables.

To test the performance benefits of specifying table size, we created two sample tables, one of which included instructions on expected table size:

 CREATE TABLE size_defined (     col1 INT NOT NULL AUTO_INCREMENT PRIMARY KEY,     col2 VARCHAR(255),     col3 CHAR(40),     col4 INT,     INDEX (col2),     INDEX (col3) ) ENGINE = MYISAM ROW_FORMAT = DYNAMIC AVG_ROW_LENGTH = 200 MAX_ROWS = 250000; CREATE TABLE size_not_defined (     col1 INT NOT NULL AUTO_INCREMENT PRIMARY KEY,     col2 VARCHAR(255),     col3 CHAR(40),     col4 INT,     INDEX (col2),     INDEX (col3) ) ENGINE = MYISAM ROW_FORMAT = DYNAMIC;

We then loaded 100,000 rows of random data into each table. They both consumed the same amount of disk space, but simple queries ran approximately 10% faster against the size-defined table versus the size-undefined table.

Index Key Compression

When using MyISAM tables, you can request that the engine compress index keys by specifying the PACK_KEYS option when creating a table. You might find that your application's read operations improve; there might be an offsetting added cost for write operations.

Compressing index keys can save substantial disk space. To demonstrate, we created the following tables:

 CREATE TABLE pack_keys_demo (     col1 INT NOT NULL AUTO_INCREMENT PRIMARY KEY,     col2 VARCHAR(255),     col3 INT,     INDEX (col2),     INDEX (col3) ) ENGINE = MYISAM PACK_KEYS = 1; CREATE TABLE pack_keys_demo_no (     col1 INT NOT NULL AUTO_INCREMENT PRIMARY KEY,     col2 VARCHAR(255),     col3 INT,     INDEX (col2),     INDEX (col3) ) ENGINE = MYISAM PACK_KEYS = 0;

We then inserted 100,000 rows of random data into each table. For this test, we made the data in col2 highly random while the data in col3 was highly duplicate: Its values only ranged between 1 and 10.

After the tables were loaded, the pack_keys_demo table consumed 13.6MB and 20.6MB of storage for data and indexes, respectively. The pack_keys_demo_no table used 13.6MB and 43.9MB of storage for data and indexes, respectively.

The impact of index compression is discussed as part of our detailed review of indexing later in this section.

Checksum Integrity Management

If you choose MyISAM for a particular table, you might elect to have MySQL compute a checksum for each row. This adds slight overhead to data update operations, but helps you locate (and then correct) corrupted tables. This value is then used by the CHECKSUM TABLE command:

 mysql> CHECKSUM TABLE checksum_demo; +--------------------+-----------+ | Table              | Checksum  | +--------------------+-----------+ | demo.checksum_demo | 544091798 | +--------------------+-----------+ 1 row in set (0.00 sec)

To test the overhead of this option, we created two identical tables, and asked MySQL to track a checksum for only one:

 CREATE TABLE checksum_demo (     col1 INT NOT NULL AUTO_INCREMENT PRIMARY KEY,     col2 VARCHAR(255) ) ENGINE = MYISAM CHECKSUM = 1; CREATE TABLE checksum_demo_no (     col1 INT NOT NULL AUTO_INCREMENT PRIMARY KEY,     col2 VARCHAR(255) ) ENGINE = MYISAM;

We then repeatedly ran tests that inserted 1,000 rows of random data into each table. The overhead averaged less than 5% for the first table: This is a small price to pay to improve data integrity.

Alternatively, you can still run CHECKSUM TABLE for a table without this option set. However, MySQL takes some time to process the results.

Column Types and Performance

The decisions you make regarding column types can have a significant impact on performance. MySQL offers numerous choices within each major column class (string, numeric, and date/time). Although it is assumed that you already understand the difference between these data types, as well as which ones to use for your application, this section takes a few moments to focus on the interplay of column type and responsiveness.

Making the right choice when deciding how to store dates and times can be confusing: The combination of the various column types, system settings, and date/time-specific SQL can be intimidating. However, because this book is focused on performance, a detailed discussion of the nuances of storing and calculating dates is avoided. Instead, this section reviews only string and numeric column-type considerations.

Before we begin, let's review a very handy utility provided by MySQL both as an example for open source developers who want to write functions to extend the product, as well as a tool for database designers to help pick optimal column types. The analyse() procedure takes the output of a SELECT statement and returns a report detailing the information shown in Table 4.1 for each column.

Table 4.1. PROCEDURE `analyse()` Results
Column	Purpose
`field_name`	The name of the returned column
`min_value`	The smallest value in the result set for this column
`max_value`	The largest value in the result set for this column
`min_length`	The smallest number of bytes in the result set for this column
`max_length`	The largest number of bytes in the result set for this column
`empties_or_zeros`	The number of returned rows with either an empty field or zero value in this column
`nulls`	The number of returned rows with nulls in this column
`avg_value_or_avg_length`	For numeric columns, the average value in the result set for this column; for string columns, the average length of all the values in this column for the result set
`std`	The standard deviation for the result set for this column, assuming it is a numeric column
`optimal_fieldtype`	The MySQL-preferred data type for this column, based on the data sample contained in the result set

For example, take a look at the following table:

 CREATE TABLE demo_pa (     col1 INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,     col2 SMALLINT,     col3 CHAR(10),     col4 VARCHAR(40) ) ENGINE = INNODB;

After loading 10,000 rows of random data into the table and running the following SQL statement:

 SELECT * FROM demo_page PROCEDURE analyse(4);

The first parameter for analyse() refers to the number of columns to process. Optionally, you can also specify a maximum memory that MySQL can allocate to finding unique values in each column. The results are shown in Figure 4.3.

Figure 4.3. MySQL can allocate memory to finding unique values.

What did you learn about the table from this report, based on the sample data set?

col1 can probably be shortened from INT to SMALLINT.
col2 can also be shortened, from SMALLINT to TINYINT.
col3 can likely be specified with an ENUM list because only four distinct values are in the table. ENUM is discussed in more detail in Chapter 5, "Using Constraints to Improve Performance."
col4 is probably set just right.

Try experimenting with analyse() for your database; you might be surprised with what you learn.

String Considerations

MySQL offers database designers a rich choice of options when storing text and binary-based information. The following list explores the performance implications of these choices.

CHAR versus VARCHAR Database designers face a never-ending struggle when trying to decide between the disk space savings of storing strings as VARCHAR and the improved performance of using fixed CHAR columns. Remember that, as discussed earlier in this chapter, this decision is taken out of your hands if you choose a dynamic row format for a MyISAM table: MySQL automatically converts CHAR columns to VARCHAR. On the other hand, certain fixed row format MyISAM tables see all of the VARCHAR columns converted to CHAR.
Despite these limitations, and given the ever-decreasing cost of disk space, it's probably best to opt for the improved speed of the CHAR type whenever possible. As an added benefit, you are less likely to experience costly data fragmentation with fixed record tables.
CHAR BINARY versus VARCHAR BINARY Although these columns hold binary information, you can apply the same decision criteria as you do for CHAR versus VARCHAR when deciding which option to select.
BLOBs and TEXT Binary large objects (BLOBs) typically hold images, sound files, executables, and so on. This column type is further subdivided into TINYBLOB, BLOB, MEDIUMBLOB, and LONGBLOB. Maximum storage for these columns ranges from 257 bytes for TINYBLOBs all the way up to nearly 4.3GB for LONGBLOB. TEXT columns follow a similar nomenclature and storage profile as BLOB columns, with the chief differences being that TEXT columns are processed using their character set for searching and collation.

Database administrators are often torn between stowing BLOBs and TEXT data in the MySQL database versus placing the information on the file system and simply storing a link to it in the database. There are arguments in favor of both approaches from the perspective of optimal performance.

You might find that your MySQL-driven applications bog down under the increased processing and storage costs of BLOBs and TEXT, especially if that data is merely ancillary to your primary database operations. Storing variable-length information also leads to eventual fragmentation, and you might also experience data truncation if you undersize your BLOB and/or TEXT columns.

On the other hand, keeping your BLOBs and TEXT data inside MySQL lets you take advantage of its backup/restore and replication features. In addition, you now can create indexes on these types of columns. You must use prefixes to make searches that employ these indexes meaningful. To improve sorting speed, you can configure the max_sort_length option. Indexing is reviewed in Chapter 6, "Understanding the MySQL Optimizer," and engine parameters are discussed in Chapter 10, "General Server Performance and Parameters Tuning," Chapter 11, "MyISAM Performance Enhancement," and Chapter 12, "InnoDB Performance Enhancement."

Numeric Considerations

After struggling through the decisions among all of the string options, you'll be relieved to know that making a choice among the numeric column options requires much less mental exertion.

A good general rule of thumb is to use the smallest numeric type that suits your needs; this helps conserve space and can also make some joins and searches more efficient. The following lists some examples.

Integers When you need to store an integer (that is, a number with no potential fractional/decimal values), MySQL gives you a choice of five options: TINYINT, SMALLINT, MEDIUMINT, INT, and BIGINT, along with the ability to declare them SIGNED or UNSIGNED. Storage requirements for each type range from a single byte for TINYINT with a possible range of values from either 0 to 255 or 127 to +127 all the way up to eight bytes for BIGINT with a possible range from either 0 to approximately 1.8442E+19 or 9.22E+18 to +9.22E+18.
Decimals Your choices are somewhat less dramatic for decimal-based columns (that is, numbers with potential fractional/decimal values). Your options include DECIMAL/NUMERIC, FLOAT, and DOUBLE. Just selecting FLOAT or DOUBLE without specifying a precision consumes four bytes and eight bytes, respectively. You also have the opportunity to specify a precision (on both sides of the decimal place) for the DECIMAL type. This directly determines storage consumption. For example, if you define a column of type DECIMAL(10,2), you consume at least 10 bytes, along with anywhere between 0 and 2 additional bytes, depending on if there is a fraction or sign involved.

With this many choices, many database designers fall into the expedient trap of just picking a numeric type that they've used many times before, regardless of the application's needs.

For example, suppose that High-Hat Airways' customer_master table includes a field that stores the customer's age. The database designer defines this column as an INT. However, because there are not many members of the flying public greater than 255 years old, an UNSIGNED TINYINT would suffice for this column, and save three bytes per record. Considering that the table will hold tens of millions of rows, the savings could be substantial, especially if the column were used in queries, joins, and groupings.