Accelerating Data Loading | MySQL Database Design and Tuning

< Day Day Up >

The first part of this chapter discussed how you can speed up the data extraction process, as well as prepare the results for a quick reload into their destination. This section changes focus and looks at how to insert large quantities of information into MySQL as quickly as possible.

The majority of this section explores the performance-specific aspects of the mysqlimport utility. Because this utility nearly directly corresponds to the LOAD DATA INFILE SQL statement, this statement is specifically included as part of the review.

Finally, if you are running your MySQL server with binary logging enabled (to help with data recovery and/or replication), you might also elect to use the mysqlbinlog utility to extract data-manipulation statements from the log. You can then feed them into the mysql utility to re-create information on the same (or another) database server. However, since this is typically used for other purposes such as point-in-time recovery, this is a much more unwieldy and complex approach than simply using the utilities mentioned previously.

Managing Concurrency

Loading hefty volumes of information has the potential to cause significant concurrency disruption issues for other users of your database. From the users' perspective, it will appear as if your system has ground to a halt. In fact, performance might be fine, but all resources are dedicated to the data-loading process and these users are blocked. The following sections look at two ways to minimize these inconveniences when running LOAD DATA INFILE.

`LOW_PRIORITY`

During Chapter 8's exploration of optimal SQL you saw how the LOW_PRIORITY attribute, -low-priority-updates server variable, and SET LOW_PRIORITY_UPDATES command causes a data-modification operation to wait until no other threads are reading a table. This also has value when running LOAD DATA INFILE, especially for MyISAM tables: The loading halts until these other threads are through, but once started it will proceed until completion. However, your load process might end up being delayed for a long time if many lengthy queries are already under way.

`CONCURRENT`

You examined how to take advantage of MyISAM's concurrent insert capabilities in Chapter 11, "MyISAM Performance Enhancement." For now, note that supplying the CONCURRENT keyword with your LOAD DATA INFILE statement causes MySQL to append this new information to the end of the table, as long as no free blocks are in the middle of the table. Other applications will then be able to retrieve information from the table in spite of your data load.

Finally, note that LOW_PRIORITY and CONCURRENT are mutually exclusive when loading data into MyISAM-based tables.

Handling Errors

During Chapter 5, "Using Constraints to Improve Performance," you learned how MySQL uses primary keys and UNIQUE constraints to guarantee data integrity. Later, in Chapter 8, you examined how you can take advantage of "upsert" (update or insert) functionality by using the REPLACE SQL statement rather than INSERT. This is much simpler than writing logic to handle situations in which a record's primary key or unique identifier is already present in the database.

Both mysqlimport and LOAD DATA INFILE give you the same flexibility and control. By including -replace when you launch the import, you request that MySQL overlay any existing, primary key or uniquely identified rows with new data from the input file. Conversely, the --ignore directive instructs MySQL to simply bypass the values from the input file when it encounters a potential primary key or unique index breach.

Although these directives are very helpful, it's also important to comprehend the underlying reason(s) why your soon-to-be-loaded data violates your well-planned integrity constraints. This could be a symptom of a larger database design difficulty or data cleansing issue.

Engine-Specific Tips

Up until now, this chapter has focused on general, cross-engine import and export suggestions. At this point, it's time to look at engine-specific recommendations to boost performance. Note that many of these proposals were first offered during the book's earlier chapters dedicated to MyISAM and InnoDB tuning (Chapters 11, "MyISAM Performance Enhancement," and 12, "InnoDB Performance Enhancement," respectively).

MyISAM

You reviewed MyISAM's many optimization features earlier in Chapter 11. Two tuning parameters bear repeating here: bulk_insert_buffer_size and myisam_max_sort_file_size.

The bulk_insert_buffer_size variable is used to control the amount of memory devoted to MyISAM's large-scale-operation internal cache. With a default value of 8MB, it can significantly improve response time when loading data into a MyISAM-based table, as long as the table is not completely empty.

In the context of a LOAD DATA INFILE operation, the myisam_max_sort_file_size variable controls the amount of space that a temporary file consumes while MySQL rebuilds any relevant indexes. The max_extra_sort_file_size setting also plays a part: If you set either of these values too low, MySQL might elect to use a slower, key cache based algorithm to re-create these indexes. Aside from the possibility that the temporary file necessary to support a large, heavily indexed table might fill your file system, there's little reason to make this value any smaller than the default found in newer versions of MySQL.

In addition, remember to configure the myisam_sort_buffer_size setting to be large enough to accommodate your index rebuilding.

Finally, don't be afraid to alter some vital engine settings just for the duration of your load operation. For example, consider making the key_buffer and myisam_sort_buffer_size variables much larger. By raising these values to a very high percentage of available memory, you help MySQL perform much more of its data import work in memory.

InnoDB

The effort involved in validating foreign key constraints is a big potential performance bottleneck when reloading InnoDB tables. In addition, it's possible that legitimate rows will be rejected because tables are inadvertently reloaded (and the foreign key checks are executed) in the wrong order.

For example, you might have written your reload script to first load records for a child table, followed by its parent. This triggers many error messages when the data is, in fact, valid. The risks of these types of false issues multiply when your database includes many cross- referencing foreign keys.

To avoid this kind of problem, use the LOAD DATA INFILE statement and disable foreign key validation by entering SET FOREIGN_KEY_CHECKS = 0 prior to loading information. Of course, for administrators using mysqldump, you saw earlier that this utility now disables foreign key checks when extracting information.

InnoDB's rich transaction capabilities were discussed in Chapter 12. Recall that overhead is associated with every COMMIT statement, which might become substantial when loading massive blocks of information. To avoid this prospective bottleneck, be certain to disable AUTOCOMMIT (via SET AUTOCOMMIT = 0) prior to launching any large-scale import operation. For small tables or multiple-value inserts, you might not notice any benefit; it is the most sizable tables as well as single-value inserts that should see the most noticeable improvement.

It's also a good idea to disable the checking of any unique constraints during data loading, as long as you're confident that your data is clean. To do this, enter SET UNIQUE_CHECKS = 0 for the session that is doing the import. When disabled, InnoDB is then free to leverage its insert buffer for mass operations by grouping secondary index writes into larger units.

< Day Day Up >