I/O Management

Database performance depends heavily on fast, efficient I/O operations. With potentially terabytes of data to process, any I/O bottleneck can render a database server unable to meet business demands. A database administrator (DBA) typically spends a lot of time and money optimizing the I/O subsystem to reduce I/O latencies and maximize I/O throughput.

Early Linux kernels had many deficiencies that hindered I/O performance. Lack of features such as raw I/O, vectored I/O, asynchronous I/O, and direct I/O limited a DBA's ability to leverage technologies that are available on other platforms and that are known to help database performance. Also, functional deficiencies, such as bounce buffering and single I/O request-queue lock, added unnecessary system costs and introduced serialization problems. Fortunately, all of these problems have been eliminated in the 2.6 kernels or via patches to the 2.4 kernels.

Bounce Buffers

In early Linux kernels (early 2.4 and prior kernel versions), device drivers could not directly access virtual addresses in high memory. In other words, these device drivers could not perform direct memory access I/O to high memory. Instead, the kernel allocated buffers in low memory and transferred data between high memory and the device drivers through them. Such a kernel buffer is commonly referred to as a bounce buffer, and the process is referred to as bouncing, or bounce buffering. Because of the intensive I/O characteristics of a database server, bouncing severely degrades its performance. First, bounce buffers consume low memory, which can lead to memory shortage problems. Second, excessive bouncing leads to high system time, which can cause a database system to become completely CPU-bound. The elimination of bounce buffering in recent kernels is a major achievement in database server performance.

Raw I/O

Introduced in the 2.4 kernel, the raw I/O interface for block devices provides the capability to perform I/O directly from a user-space buffer, without additional copying through a file system interface. Using the raw interface can dramatically improve the I/O performance of a database server because the file system layer is avoided. However, not all database workloads benefit from raw I/O: if the same data is referenced frequently, raw I/O cannot take advantage of the file system buffer cache. The raw I/O interface gives a DBA more flexibility to design a database for performance. Depending on the I/O characteristics of the database server and how database tables are accessed, a DBA can choose the raw interface for faster I/O operations, use a file system to leverage caching and reduce disk accesses, or combine both approaches.
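As a rough illustration, the following sketch reads one 4KB block from a raw device node. It assumes the device has already been bound to a block device with the raw(8) utility; the path /dev/raw/raw1 is purely illustrative, and raw I/O requires sector-aligned buffers and transfer sizes.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *buf;

    /* Raw I/O requires sector-aligned buffers and transfer sizes. */
    if (posix_memalign(&buf, 512, 4096) != 0) {
        perror("posix_memalign");
        return 1;
    }

    /* /dev/raw/raw1 is assumed to have been bound with raw(8). */
    int fd = open("/dev/raw/raw1", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Read 4KB directly from the device, bypassing the file system. */
    ssize_t n = pread(fd, buf, 4096, 0);
    if (n < 0)
        perror("pread");
    else
        printf("read %zd bytes\n", n);

    close(fd);
    free(buf);
    return 0;
}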

Vectored I/O

The 2.6 Linux kernel provides a full implementation of vectored I/O (also referred to as scatter/gather I/O) through the readv and writev interfaces. The vectored read interface, readv, reads contiguous pages on disk into noncontiguous pages in memory in a single call. Conversely, the vectored write interface, writev, writes noncontiguous pages in memory onto disk in a single function call. This I/O mechanism is advantageous for database servers that frequently perform large sequential I/Os. Without a proper vectored read and write implementation, an application performing sequential I/Os does one of the following:

  • Issues database page size I/Os

  • Issues large block I/Os and uses the memcpy function to copy data pages between the read/write buffers and the database buffers

Both of these approaches incur expensive computational overhead when scanning terabytes of data, which is a common database size in today's data warehouses.
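The following minimal sketch shows how a single readv call can fill several separate in-memory buffers from one contiguous region of a file; the file name tablespace.dat is illustrative only.

#include <fcntl.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    char page0[4096], page1[4096], page2[4096];
    struct iovec iov[3] = {
        { .iov_base = page0, .iov_len = sizeof(page0) },
        { .iov_base = page1, .iov_len = sizeof(page1) },
        { .iov_base = page2, .iov_len = sizeof(page2) },
    };

    int fd = open("tablespace.dat", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Read 12KB of contiguous file data into three noncontiguous buffers
     * with one system call instead of three reads or one read plus memcpy. */
    ssize_t n = readv(fd, iov, 3);
    if (n < 0)
        perror("readv");
    else
        printf("read %zd bytes into 3 buffers\n", n);

    close(fd);
    return 0;
}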

Asynchronous I/O

The asynchronous I/O mechanism gives an application the capability to issue an I/O request without having to block until the I/O is complete. This mechanism is ideal for database servers. Other platforms have provided asynchronous I/O interfaces for quite some time, but this feature is relatively new to Linux. To drive I/O throughput without an asynchronous I/O mechanism, database servers usually create many processes or threads that are dedicated to performing I/O. With many processes/threads doing the I/O activity, the database application is no longer blocked on a single I/O request. The downside of using so many processes/threads is the extra overhead costs associated with creating, managing, and scheduling these processes/threads.

The GNU C Library (GLIBC) asynchronous I/O interface utilizes user-level threads to perform blocking I/O operations. This approach simply makes the request appear asynchronous to the application. However, it is no different from the approach previously described and suffers from the same performance drawbacks. To eliminate the problem, the 2.6 kernel introduced a kernel asynchronous I/O (KAIO) interface. KAIO implements asynchronous I/O in the kernel rather than in user space via threads and guarantees true asynchrony.
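As a hedged illustration, the following sketch submits one asynchronous read through the kernel AIO interface using the libaio wrapper library (io_setup, io_submit, io_getevents). The file name and transfer size are assumptions made for the example, and the program must be linked with -laio.

#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("tablespace.dat", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Create an AIO context able to hold up to 32 in-flight requests. */
    io_context_t ctx = 0;
    if (io_setup(32, &ctx) < 0) {
        fprintf(stderr, "io_setup failed\n");
        return 1;
    }

    void *buf;
    if (posix_memalign(&buf, 512, 4096) != 0) {
        perror("posix_memalign");
        return 1;
    }

    /* Prepare and submit one read; io_submit returns immediately
     * rather than blocking until the I/O completes. */
    struct iocb cb;
    struct iocb *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);
    if (io_submit(ctx, 1, cbs) != 1) {
        fprintf(stderr, "io_submit failed\n");
        return 1;
    }

    /* ... the application can do useful work here ... */

    /* Reap the completion event when convenient. */
    struct io_event ev;
    if (io_getevents(ctx, 1, 1, &ev, NULL) == 1)
        printf("I/O completed, result = %ld\n", (long)ev.res);

    io_destroy(ctx);
    close(fd);
    free(buf);
    return 0;
}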

Direct I/O

Direct I/O provides performance comparable to raw I/O, but with the added flexibility of a file system. As such, it is attractive to database administrators who want to leverage the flexibility of maintaining their databases when using file systems as the storage medium.

The main attraction of using file systems as a storage medium over raw devices is that resizing the database storage can be done easily; increasing the size of a file system can be done by simply adding more disks. Another advantage of file systems is the available tools, such as fsck, that the DBA can use to help maintain the data integrity.

There are some disadvantages to using file systems as a storage medium without the direct I/O interface. One is the added cost of the buffer cache daemon, which actively flushes dirty pages to disk. This activity consumes CPU cycles, which are a valuable resource for CPU-intensive database workloads. Another disadvantage relates to buffer-cache misses, which can be expensive if data is not frequently reused.

For large database systems, much of the available memory is allocated to the database buffers. Large database buffers reduce the frequency with which the database applications have to read or write to disks. They also reduce the amount of memory available for the file system buffer cache. This scenario has a twofold effect on server performance. First, the buffer-cache daemon is more active flushing dirty pages to disk (pages become dirty more quickly). Second, with a smaller buffer cache, the likelihood that data resides in the cache is much lower, which increases the likelihood of a buffer cache miss.
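The following minimal sketch shows a direct I/O read through a file system file using the O_DIRECT open flag. The file name is illustrative, and the alignment requirements (sector or page alignment, depending on kernel and file system) are assumed to be satisfied by a page-aligned buffer.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *buf;

    /* A page-aligned buffer satisfies typical O_DIRECT alignment rules. */
    if (posix_memalign(&buf, 4096, 4096) != 0) {
        perror("posix_memalign");
        return 1;
    }

    /* O_DIRECT bypasses the file system buffer cache, so data moves
     * directly between the device and this buffer. */
    int fd = open("tablespace.dat", O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    ssize_t n = pread(fd, buf, 4096, 0);
    if (n < 0)
        perror("pread");
    else
        printf("read %zd bytes with direct I/O\n", n);

    close(fd);
    free(buf);
    return 0;
}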

Block Size I/O

Device drivers transfer data in groups of bytes called blocks. The block size is set to the sector size of the device and is normally 512 bytes (although many hardware devices and their associated drivers can handle larger transfer sizes). Associated with each block is a buffer head structure in memory. When a read request arrives from the application via a read call, the device driver breaks the read buffer up into sector-sized blocks, fills the buffer head associated with each block with data from the physical device, and then coalesces the blocks back into the original read buffer. Similarly, when a write request arrives from an application, the device driver divides the write buffer into sector-sized blocks and updates the physical data on disk with the values of the associated buffer heads.

With the 2.6 kernel, the block size for raw I/O is set to 4096 bytes instead of 512 bytes. This simple change improves overall I/O throughput and CPU utilization by reducing the number of buffer heads required for raw I/O operations. Having fewer buffer heads reduces the kernel overhead required to maintain them, and using larger blocks reduces the number of divide-and-coalesce operations the device driver has to perform for each I/O request.
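As a small illustration, the following sketch queries a block device's logical sector size with the BLKSSZGET ioctl; the device path /dev/sda is an assumption made for the example.

#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    /* /dev/sda is an illustrative path; use any available block device. */
    int fd = open("/dev/sda", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    int sector_size = 0;
    if (ioctl(fd, BLKSSZGET, &sector_size) == 0)
        printf("logical sector size: %d bytes\n", sector_size);
    else
        perror("ioctl(BLKSSZGET)");

    close(fd);
    return 0;
}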

I/O Request Lock

The Linux kernel maintains an I/O request queue for each block device. The queue is used to order I/O requests in such a way as to maximize system performance. In the 2.2 and early 2.4 kernels, all request queues were protected by a global spin lock called io_request_lock. This I/O request lock can cause serious serialization problems for database servers running on SMP machines with many disks.

In more recent kernel levels, this single global lock (io_request_lock) has been removed from the SCSI subsystem. In its place, each I/O request queue is protected by its own lock, request_queue_lock. Serialization of I/O requests still occurs, but on only a single request queue and not at the SCSI subsystem for all request queues.
