Section 2.2. Stable Database Connections

2.2. Stable Database Connections

A new database connection can be created quickly and easily, but this ease can disguise the high cost of making repeated connections. You must manage the use of database connections with great care. The consequences of allowing multiple connections to occur, perhaps hidden within an application, can be substantial, as the next example illustrates.

Some time ago I came across an application in which numerous small files of up to an arbitrary maximum of 100 lines were being processed. Each line in these small text files contained both data and the identification of the database instance into which that data had to be loaded. In this particular case, there was a single server, but the principle being illustrated is exactly the same as if there were a hundred database instances.

The process for each file was coded as follows:

     Open the file     Until the end of file is reached           Read a row           Connect to the server specified by the row           Insert the data           Disconnect     Close the file

This process worked quite satisfactorily, except for the occasional circumstance in which a large number of small files would arrive in a very short space of time, and at a rate greater than the ability of the application to process them. This resulted in a substantial backlog, which took considerable time to clear.

I explained the problem of performance degradation as a consequence of frequent connection and disconnection to the customer with the help of a simple program (written in C) emulating the current application. Table 2-1 gives the results from that demonstration.

The program generating the results in Table 2-1 used a conventional insert statement. I mentioned in passing to the customer the existence of direct-loading techniques that are even faster.

Table 2-1. Result of connect/disconnect performance tests
Test	Results
Connect/disconnect for each line in turn	7.4 lines loaded per second
Connect once, all candidate lines individually inserted	1,681 lines loaded per second
Connect once, all candidate lines inserted in arrays of 10 lines	5,914 lines loaded per second
Connect once, all candidate lines inserted in arrays of 100 lines	9,190 lines loaded per second

The demonstration showed the importance of trying to minimize the number of separate database connections that had to be made. Thus, there was an obvious and enormous advantage in applying a simple check to determine whether the "next" insert was into the same database as the previous one. The rationalization could go further, as the number of database instances was of course finite. You could likely achieve further performance gain by setting up an array of handlers, one for each specific database connection, opening a new connection each time a new database is referenced, and thus connecting at most once to each database. As Table 2-1 shows, the simple technique of connecting only once (or a very few times) improved performance by a factor of more than 200 with very little additional effort.

Of course, this was an excellent opportunity to show that minimizing the number of round-trips between a program and the database kernel, using arrays and populating them with incoming data, can also lead to spectacular improvements in performance. By inserting several rows at once, the throughput could be radically improvedby another factor of five. The results in Table 2-1 demonstrate that improvements in the process could reach a modest factor of 1,200.

Why such dramatic improvement?

The reason for the first and biggest improvement is that a database connection is fundamentally a "heavy," or high-resource operation.

In the familiar client/server environment (which is still very widely used), the simple connection routine hides the fact that the client program first has to establish contact with a listener program on a remote machine; and then, depending on whether shared servers are being used on this machine, the listener must either spawn another process or thread and make it run some database kernel program, or hand the request, directly or indirectly, to an existing server process.

Whatever the number of system operations (process spawning or thread creation and the start of executions) your database system will need to create a new environment for each session, to keep track of what it does. Your DBMS will need to check the password provided against the encrypted password of the account for which a new session is to be created. Your DBMS may also have to execute the code for some logon trigger. It may have to execute some initialization code for stored procedures or packages the first time they are called. This does not include the base machine handshaking protocols between client and server processes. This is the reason techniques that allow the upkeep of permanent connections to the database, such as connection pooling, are so important to performance.

The reason for the second improvement is that a round-trip between your program (and even a stored procedure) and the database also has its costs.

Even when you are connected and maintain a connection, context switches between your program and the DBMS kernel take their toll. Therefore if your DBMS allows you to communicate through an array interface of some kind, you should not hesitate to use it. If, as sometimes happens, the array interface is implicit (the application program interface [API] uses arrays when you use only scalar values), it is wise to check the default array size that is used and perhaps tailor it to your particular needs. And of course, any row-by-row logic suffers the same context-switch mechanisms and is a cardinal sinas you shall have several opportunities to see throughout this chapter.

Database connections and round-trips are like Chinese Wallsthe more you have, the longer it takes to receive the correct message.

Section 2.2. Stable Database Connections

2.2. Stable Database Connections

Table 2-1. Result of connect/disconnect performance tests