Large-scale data imports often cause serious headaches for administrators. Fortunately, MySQL offers numerous ways to make these jobs more predictable and less exciting.
Every night, transaction summary data from each partner is first loaded into a work table. A batch operation then processes all rows in the table, granting the appropriate credit to each partner's customer within the main High-Hat customer database. The work table (partner_transactions) and another relevant table are defined as follows:
CREATE TABLE partners (
    partner_id   INTEGER PRIMARY KEY AUTO_INCREMENT,
    partner_name VARCHAR(40)
) ENGINE = INNODB;

CREATE TABLE partner_transactions (
    partner_id       INTEGER NOT NULL,
    ff_account       CHAR(10) NOT NULL,
    last_name        VARCHAR(50) NOT NULL,
    first_name       VARCHAR(50) NOT NULL,
    transaction_date DATE NOT NULL,
    daily_total      DECIMAL(5,2) NOT NULL,
    PRIMARY KEY (partner_id, ff_account, transaction_date),
    INDEX (partner_id),
    INDEX (ff_account),
    INDEX (last_name, first_name),
    INDEX (transaction_date),
    INDEX (daily_total),
    FOREIGN KEY (partner_id) REFERENCES partners (partner_id)
) ENGINE = INNODB;
Now that you know the high-level information flow, what is the best way to proceed? First, when faced with a group of sluggish data import processes, pick one on which to focus your energies rather than working on all of them in parallel. Once this process has been identified, start by examining the procedure that created the import file, then the file itself, and only then turn your attention to the import operation and the server configuration.
Using the order of analysis listed in the preceding section, here are your observations:
After speaking with the partner (who also uses MySQL), you learn that when generating the file, they sort the transactions by amount. Looking at their data file confirms this report.
The data file consists of individual INSERT statements; each statement records a single mileage-qualifying purchase for a particular member.
After the file is received at High-Hat, the import operation is run on a different machine than the database server. Both machines are on the same secure network, but are separated by 2,000 kilometers. Furthermore, Secure Sockets Layer (SSL) is used between the client and database server.
The database server is dedicated to running the import process; no other applications are running at the same time. No server parameter configuration changes are made: The database runs with the same settings as during normal operation.
Your research really paid off: The partner and High-Hat can make several easily implemented yet very powerful changes to facilitate information loading. All of these concepts are discussed at length in Chapter 15, "Improving Import and Export Operations." For this situation, the suggestions are categorized into the following areas:
Export file generation The partner can undertake two steps to make things better:
Generate the export file in the order of the High-Hat primary key Because InnoDB stores each table in the order of its primary key (or of a system-generated key if none is defined), a structure known as the clustered index, presorting the file reduces the amount of reorganization necessary to complete the load operation. This can have a dramatic impact on performance.
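As a sketch, the partner could replace the sort-by-amount step with an ORDER BY that matches High-Hat's composite primary key. The staging table name (outbound_transactions) and file path here are hypothetical; however the file is actually produced, the point is the sort order:

```sql
-- Export in (partner_id, ff_account, transaction_date) order --
-- the same order as High-Hat's clustered primary key.
SELECT partner_id, ff_account, last_name, first_name,
       transaction_date, daily_total
  FROM outbound_transactions
 ORDER BY partner_id, ff_account, transaction_date
  INTO OUTFILE '/tmp/partner_transactions.dat';
```

SELECT ... INTO OUTFILE produces a tab-delimited file rather than INSERT statements; if the partner continues to generate INSERT statements, the same ORDER BY simply determines the sequence in which those statements are written.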
Generate the data file so that multiple INSERT operations are performed within the same batch Grouping these activities together can speed the importing process.
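A minimal illustration of this multirow INSERT syntax follows; the sample member rows are invented:

```sql
-- One statement carrying many rows costs far less than
-- an equivalent series of single-row INSERT statements.
INSERT INTO partner_transactions
    (partner_id, ff_account, last_name, first_name,
     transaction_date, daily_total)
VALUES
    (1, 'FF10000001', 'Lundegaard', 'Jerry', '2006-03-01',  45.00),
    (1, 'FF10000002', 'Gunderson',  'Marge', '2006-03-01', 112.50),
    (1, 'FF10000003', 'Grimsrud',   'Gaear', '2006-03-01',  19.95);
```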
Import server configuration You can employ several tactics to help your import server handle information more efficiently:
Run the import closer to the database server Loading large volumes of information is time consuming enough. Having to send this data over a widely distributed network doesn't help. In fact, if enough resources are available, try to run the import process on the same machine as the database server: This reduces network traffic costs to zero.
Avoid SSL Regardless of whether you are able to alter your network topology, you can speed things up by disabling SSL. Because both machines are on the same, secure network, there is no need to incur the extra cost of encrypting and then decrypting information.
Temporarily alter InnoDB engine settings If it doesn't cause problems for other users or processes, you can allocate a very large percentage of system memory to the InnoDB buffer pool as well as increase the log buffer size (via the innodb_buffer_pool_size and innodb_log_buffer_size settings, respectively). Granting extra space to these memory-based structures helps InnoDB perform more work in memory, rather than on disk. You can also reduce disk I/O by boosting the size of InnoDB's log files via the innodb_log_file_size setting.
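A configuration-file sketch of these three settings follows. The sizes shown are illustrative only; appropriate values depend on the memory available on the import server. Note also that changing innodb_log_file_size requires a clean server shutdown and removal of the old log files before restart:

```ini
# Temporary my.cnf values for the nightly import window only.
[mysqld]
innodb_buffer_pool_size = 2G     # let InnoDB cache as much as RAM allows
innodb_log_buffer_size  = 16M    # buffer more log records before flushing
innodb_log_file_size    = 256M   # larger logs mean fewer checkpoints
```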
Import process During your diagnostic work, you noticed two glaring problems with how the import process is handled. Your recommendations include the following:
Reduce the number of indexes The partner_transactions table is heavily indexed. As you talk with the developers, you learn that the primary key is really the only necessary index. Because this is the case, you request that the other indexes be removed from the table definition: There is no need for InnoDB to do the work of keeping all of them up to date.
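One way to implement this, assuming the table already exists with the definition shown earlier, is a single ALTER TABLE. The index names below are MySQL's defaults for that definition; verify yours first with SHOW INDEX FROM partner_transactions. Because partner_id is the leftmost column of the primary key, the primary key can continue to back the foreign key constraint after its dedicated index is dropped:

```sql
-- Remove the secondary indexes; the composite primary key remains.
ALTER TABLE partner_transactions
    DROP INDEX partner_id,
    DROP INDEX ff_account,
    DROP INDEX last_name,
    DROP INDEX transaction_date,
    DROP INDEX daily_total;
```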
Disable foreign key validation To ensure data integrity, the partner_transactions table references the partners table. However, it's probably safe to assume that the partner will send you accurate information in this field, thus obviating the need for an expensive foreign key lookup for each imported row. To disable these lookups, be certain to enter SET FOREIGN_KEY_CHECKS = 0 prior to starting the import. The same holds true for uniqueness checks: They can be disabled by using SET UNIQUE_CHECKS = 0.
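Wrapped around the nightly load, the sequence looks like this (the import file name is hypothetical, and SOURCE is a mysql command-line client command rather than an SQL statement). Both variables are session-scoped, so other connections retain full integrity checking:

```sql
-- Suspend the checks for this session only...
SET FOREIGN_KEY_CHECKS = 0;
SET UNIQUE_CHECKS = 0;

SOURCE /data/imports/partner_transactions.sql

-- ...and restore them once the load completes.
SET UNIQUE_CHECKS = 1;
SET FOREIGN_KEY_CHECKS = 1;
```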