INSERT | SQL Performance Tuning

The SQL Standard description of the typical INSERT format is:

 INSERT INTO <Table> [ ( <column> [,...] ) ]   <query expression>

IBM Logging

Every data-change statement causes a write to a log, but every DBMS handles the situation in a slightly different manner. As an example, here's how IBM does data-change logging.

With IBM, the logging is of rows or parts of rows:

INSERT causes a write of the new row's contents (an "after image").
DELETE causes a write of the old row's contents (a "before image").
UPDATE causes a write of the changed parts of both old and new rows.

For example, suppose Table1 's definition is:

 CREATE TABLE Table1 (    column1 CHAR(3),    column2 CHAR(3))

Table1 has one row, containing {'ABC', 'DEF'} . This UPDATE statement:

 UPDATE Table1 SET column1 = 'AZ'

causes this log record to be created:

 [Row Identifier][bytes #2-3]BEFORE-IMAGE='BC',AFTER-IMAGE='Z '

On the other hand, if Table1 's definition is:

 CREATE TABLE Table1 (    column1 VARCHAR(3),    column2 CHAR(3))

then the same UPDATE statement would affect all bytes as far as the end of the row, and the log record would be larger.

Portability

The keyword INTO is optional for Microsoft, MySQL, and Sybase, but there is no reason to leave it out and cause portability trouble.

Here are two examples of INSERT:

 INSERT INTO Table1 (column1)    VALUES ('The rain in Spain falls in the plain') INSERT INTO Table1 (column1)    SELECT column1 FROM Table2

If a particular column value appears frequently and is lengthy, then you can save network transmission by making the value the default for the column. For example:

 CREATE TABLE Table1 (    column1 VARCHAR(40) DEFAULT 'The rain in Spain',    ... ) INSERT INTO Table1 DEFAULT VALUES

This trick results in some gain if the column value is at least 100 bytes long (GAIN: 6/8).

There is no point in reordering the INSERT columns so that they are in the same order as in the storage pageit doesn't improve performance at all. (And by the way, that order isn't necessarily the same as the defined order, as we discussed in Chapter 7, "Columns.") It does, however, help slightly if the primary key or unique columns appear first in the column list. That is, if this is your table definition:

 CREATE TABLE Table1 (    column1 DECIMAL(4,2),    column2 VARCHAR(40) UNIQUE,    column3 DATE,    column4 INTEGER PRIMARY KEY)

then performance is slightly better if you do your INSERT like this:

 INSERT INTO Table1 (column4, column2, column1, column3)   VALUES (10, 'The Rain', 24.5, DATE '2001-01-01') GAIN: 2/7

WARNING

Don't do this for Sybase on a PRIMARY KEY; it shows a loss. The gain shown is for only seven DBMSs.

When there are many separate INSERT statements to execute, it's helpful if the statements are in order according to the value in some index. This increases the chance that the appropriate index page will be in cache.

Bulk INSERT

It's always good for performance if several insertions are combined. That's because there's a per-statement overhead, including the time needed to look up the table name in a catalog, to check privileges, and to form a network packet. Several ways exist to perform a multiple-row INSERT (usually called a bulk INSERT or just a load ).

The SQL-Standard way to do a bulk INSERT is to repeat the row-expression part of the statement multiple times, as in this example:

 INSERT INTO Table1   (column1, column2, column3, column4, column5)   VALUES (1, 2, 3, 4, 5),          (2, 3, 4, 5, 6),          (3, 4, 5, 6, 7)

Portability

Only IBM and MySQL support this feature.

This method is rare.

An alternate, but still SQL-Standard way to do a bulk INSERT is with a compound statement:

 BEGIN   INSERT INTO Table1 VALUES (1, 2, 3, 4, 5);   INSERT INTO Table1 VALUES (2, 3, 4, 5, 6);   INSERT INTO Table1 VALUES (3, 4, 5, 6, 7); END

Portability

Only IBM, Microsoft, Oracle, and Sybase support this feature outside of a stored procedure.

Compound statements are an excellent idea in many situations, and IBM recommends their use quite strongly. In this particular case, where the same statement is being repeated, they're less than ideal because each repetition must be parsed.

It's also possible to do a bulk INSERT with an ANSI SQL subquery, if the data you want is already in the database:

 INSERT INTO Table2   SELECT * FROM Table1 WHERE column1 < 100

Portability

MySQL doesn't support subqueries, but the DBMS does support this construct.

The nonstandard way to do a bulk INSERT is to use a special SQL-extension statement that reads data from a file, for example:

 BULK INSERT ... BATCH SIZE = n INPUT FILE = 'xxx';     /* Microsoft stuff */

This is a common method, but because it's DBMS-specific, a discussion of it is beyond the scope of this book.

The final way to do a bulk INSERT is to load multiple values into an array and pass the entire array as a parameter for a function in an API such as ODBC or JDBC. In effect, you now have to do set management inside an application program, because there is no logical difference between an array of record values and a set of records. The difficulties with this approach are: (a) you need a lot of application memory, and (b) you need a special error-handling routine because a single statement can generate hundreds of errorsone for each row that has a problem. And you must take care to stay within the maximum buffer size for a network, which can be as little as 128KB.

You can prepare for bulk INSERTs by dropping indexes, disabling constraints, and locking the table. Generally such drastic acts only make sense when the number of rows to add is huge, and some validity testing has already been done on the input values. After a huge bulk INSERT, it's a good idea to run the "statistics update" statement for your DBMS so that the optimizer knows about the new situation; sometimes it's necessary to update statistics even during the bulk INSERT, as well as after it.

The Bottom Line: INSERT

If a particular, lengthy, column value appears frequently in INSERT, save network transmission by making the value the default value for that column.

There is no point in reordering INSERT columns so that they are in the same order as in the storage page, though it does help slightly if the primary key or unique columns appear first in the column list.

When you're executing many separate INSERT statements, put the statements in order according to the value in some index.

It's always good for performance if several insertions are combined.

Prepare for bulk INSERTs by dropping indexes, disabling constraints, and locking the table.

After a huge bulk INSERT, run the "statistics update" statement for your DBMS so that the optimizer knows about the new situation.