I/O Scheduling and the Block I/O (BIO) Layer

The I/O scheduler in Linux forms the interface between the generic block layer and the low-level device drivers. The block layer provides functions that are utilized by the file systems and the virtual memory manager to submit I/O requests to block devices. These requests are transformed by the I/O scheduler and made available to the low-level device drivers. The device drivers consume the transformed requests and forward them, using device-specific protocols, to the actual device controllers that perform the I/O operations. Because prioritized resource management seeks to regulate an application's use of the disk subsystem, the I/O scheduler is an important kernel component in the I/O path.

It is further possible to regulate the disk usage in the kernel layers above and below the I/O scheduler. Adjusting the I/O pattern generated by the file system or the virtual memory manager (VMM) is one option. Another option is to adjust the way specific device drivers or device controllers consume and manipulate the I/O requests.

The 2.6 Linux I/O Schedulers

The various Linux 2.6 I/O schedulers can be abstracted into a generic I/O model. I/O requests are generated by the block layer on behalf of threads that access file systems or perform raw I/O, or by VMM components of the kernel, such as the kswapd or pdflush threads. The producers of I/O requests call __make_request(), which invokes I/O scheduler functions such as elevator_merge_fn(). The enqueuing functions in the I/O framework attempt to merge the newly submitted block I/O unit (bio) with previously submitted requests, and they sort or insert the request into one or more internal I/O queues. Together, the internal queues form a single logical queue that is associated with each block device.
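
The scheduler hooks are function pointers that each I/O scheduler supplies to the block layer. The following is a simplified, user-space model of that hook table; the callback names follow those mentioned above, while the surrounding types and the struct name are illustrative rather than the kernel's exact definitions.

    /* Simplified model of a 2.6-era elevator hook table; the struct
     * and helper types are illustrative, not the kernel's own. */
    struct request;          /* one (possibly merged) I/O request */
    struct request_queue;    /* the per-device logical queue      */
    struct bio;              /* one block I/O unit                */

    struct elevator_ops_sketch {
        /* called from __make_request(): can this bio be merged with a
         * request that is already queued? */
        int  (*elevator_merge_fn)(struct request_queue *q,
                                  struct request **rq, struct bio *bio);
        /* insert a request that could not be merged */
        void (*elevator_add_req_fn)(struct request_queue *q,
                                    struct request *rq);
        /* called via elv_next_request(): choose the next request to
         * hand to the device driver */
        struct request *(*elevator_next_req_fn)(struct request_queue *q);
    };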

At a later stage, the low-level device driver calls the generic kernel function elv_next_request() to obtain the next request from the logical queue. The elv_next_request() call interacts with the I/O scheduler's dequeue function, elevator_next_req_fn, which selects the appropriate request from one of the internal queues. The device driver processes the request by converting it into scatter-gather lists and protocol-specific commands that are submitted to the device controller.
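
The consumer side of this model is visible in a typical 2.6-era block driver request function, sketched below after the pattern popularized by the sbull example in Linux Device Drivers, 3rd edition; error handling and the actual data transfer are elided.

    #include <linux/blkdev.h>

    /* 2.6-era request function: pull requests chosen by the I/O
     * scheduler and complete them one by one. */
    static void sketch_request_fn(request_queue_t *q)
    {
        struct request *req;

        while ((req = elv_next_request(q)) != NULL) {
            if (!blk_fs_request(req)) {    /* skip non-filesystem commands */
                end_request(req, 0);
                continue;
            }
            /* a real driver would transfer req->current_nr_sectors
             * sectors starting at req->sector to or from req->buffer,
             * with the direction given by rq_data_dir(req), typically
             * by building a scatter-gather list for DMA */
            end_request(req, 1);           /* 1 == success */
        }
    }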

From an I/O scheduler perspective, the block layer is the producer of I/O requests, and the device drivers are the consumers. An application issues read and write requests either through the respective I/O system calls or by memory mapping (mmap) a file into its address space. I/O operations normally result in allocating PAGE_SIZE units of physical memory. These pages are indexed, which enables the system to locate them in the buffer cache. Any cache subsystem improves performance only if the data in the cache is reused. The read cache abstraction allows the system to implement file-system-dependent read-ahead functionalities, as well as to construct large contiguous (SCSI) I/O commands that can be served via a single Direct Memory Access (DMA) operation. In circumstances where the cache represents pure memory bus overhead, I/O features such as direct I/O should be explored, especially in situations where the system is CPU-bound. In a general write scenario, the system is not necessarily concerned with the previous content of a file, because a write() operation normally results in overwriting the contents.
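
For illustration, the following minimal user-space sketch opens a file with O_DIRECT to bypass the page cache; direct I/O requires a transfer buffer aligned to the device's logical block size (512 bytes is assumed here, and the file name is hypothetical).

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
        if (fd < 0)
            return 1;
        /* O_DIRECT buffers must be aligned; 512 bytes assumed */
        if (posix_memalign(&buf, 512, 4096))
            return 1;
        ssize_t n = read(fd, buf, 4096);   /* bypasses the read cache */
        (void)n;
        free(buf);
        close(fd);
        return 0;
    }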

The write cache emphasizes other aspects such as asynchronous updates. There is also the possibility of omitting some write requests in the case where multiple write() operations into the cache subsystem result in a single I/O write to a physical disk. Such a scenario may occur in an environment where updates to the same or a similar inode offset are processed within a short time span.

The block layer in Linux 2.4 is organized around the buffer_head data structure. It is a daunting task to create a truly effective and performance-enhanced block I/O subsystem if the underlying buffer_head structures force each I/O request to be decomposed into 512-byte chunks.

The new representation of the block I/O layer in Linux 2.6 encourages large I/O operations. The block I/O layer now tracks data buffers by using struct page pointers. Linux 2.4 systems are prone to losing sight of the logical form of the writeback cache when flushing the cache subsystem. Linux 2.6 utilizes logical pages attached to inodes to flush dirty data, which allows combining multiple pages that belong to the same inode into a single bio. The single bio can then be submitted to the I/O layer, a process that works fine if the file is not fragmented on disk.

The 2.4 Linux I/O Scheduler

The default 2.4 Linux I/O scheduler primarily manages disk utilization. The scheduler has a single internal queue. For each new bio, the I/O scheduler determines whether the bio can be merged with an existing request. If it cannot, a new request is placed in the internal queue, sorted by the starting device block number of the request. This approach minimizes disk seek times, provided that the actual disk device driver processes the requests in first-in first-out (FIFO) order. An aging mechanism implemented by the I/O scheduler limits the number of times an existing I/O request in the queue can be passed over by a more recent request, which prevents starvation. The dequeue function in the I/O framework is a simple removal of requests from the head of the internal queue.
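
The following toy model illustrates the sorted insertion with aging just described; the structure, the helper, and the aging limit are all illustrative and do not mirror the 2.4 kernel's actual data structures.

    #define MAX_PASSOVERS 128   /* assumed aging limit, not the kernel's value */

    struct toy_req {
        unsigned long start_block;   /* starting device block number      */
        int passovers;               /* times newer requests jumped ahead */
        struct toy_req *next;
    };

    /* insert sorted by start_block, but never jump ahead of a request
     * that has already been passed over MAX_PASSOVERS times */
    static void toy_enqueue(struct toy_req **head, struct toy_req *rq)
    {
        struct toy_req **p = head;

        while (*p && ((*p)->start_block <= rq->start_block ||
                      (*p)->passovers >= MAX_PASSOVERS))
            p = &(*p)->next;

        /* every request now behind the new one was passed once more */
        for (struct toy_req *r = *p; r; r = r->next)
            r->passovers++;

        rq->passovers = 0;
        rq->next = *p;
        *p = rq;
    }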

The 2.6 Deadline I/O Scheduler

The deadline I/O scheduler incorporates a per-request expiration-based approach and operates on five I/O queues. The basic idea behind the implementation is to aggressively reorder requests to improve I/O performance while simultaneously ensuring that no I/O request is starved. More specifically, the scheduler introduces the notion of a per-request deadline, which is used to assign a higher preference to read requests.

During the enqueue phase, each I/O request is assigned a deadline and inserted into queues that are organized either by starting logical block number (a sort list) or by deadline (a FIFO list). The scheduler maintains separate sort and FIFO lists for read and write requests. The fifth I/O queue is the dispatch queue, which contains the requests that are to be handed off to the device driver.

During a dequeue operation, if the dispatch queue is empty, requests are first moved from one of the four I/O lists (sort or FIFO) in batches; the head request on the dispatch queue is then passed to the device driver, just as it is when the dispatch queue is not empty. The logic behind moving I/O requests from either the sort or the FIFO lists reflects the scheduler's goal of ensuring that each read request is processed by its effective deadline without starving the queued-up write requests. The goal of economizing disk seek time is accomplished by moving a larger batch of requests from the sort list and balancing it with a controlled number of requests from the FIFO list. Hence, the deadline I/O scheduler effectively emphasizes average read request response time over disk utilization and total average I/O request response time.

The basic idea behind the deadline scheduler is that all read requests are satisfied within a specified time period, whereas write requests have no specific deadlines. When the block device driver is ready to launch another disk I/O request, the core algorithm of the deadline scheduler is invoked. In simplified form, the first action is to check whether I/O requests are already waiting in the dispatch queue; if so, no additional decision needs to be made about what to execute next. Otherwise, a new set of I/O requests must be moved to the dispatch queue. The scheduler searches for work in the following places, but it migrates requests from only the first source that results in a hit:

  1. If there are pending write I/O requests, and the scheduler has not selected any write requests for a certain amount of time, a set of write requests is selected (see Appendix A, "Tuning Kernel Parameters").

  2. If there are expired read requests in the read_fifo list, the system moves a set of these requests to the dispatch queue.

  3. If there are pending read requests in the sort list, the system migrates some of these requests to the dispatch queue.

  4. Last, if there are any pending write I/O operations, the dispatch queue is populated with requests from the sorted write list.

In general, the certain amount of time for write request starvation is normally two iterations of the scheduler algorithm. After two sets of read requests have been moved to the dispatch queue, the scheduler migrates some write requests to the dispatch queue. For example, a batch set of requests can be 64 contiguous requests, but a request that requires a disk seek operation counts the same as 16 contiguous requests.
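
The following sketch condenses the four-step search just described into a single selection function; the structure and helper names are illustrative, and the caller is assumed to reset the starvation counter whenever writes are served and to increment it otherwise.

    struct ddsched {
        int starved;          /* batches dispatched since writes were served */
        int writes_starved;   /* tunable: allowed read preference, e.g. 2    */
    };

    enum dd_source { READ_FIFO, READ_SORT, WRITE_SORT, NONE };

    static enum dd_source dd_pick_source(struct ddsched *dd,
                                         int reads_pending,
                                         int writes_pending,
                                         int read_fifo_expired)
    {
        /* 1. writes have waited long enough: serve them now */
        if (writes_pending && dd->starved >= dd->writes_starved)
            return WRITE_SORT;
        /* 2. expired reads take precedence */
        if (read_fifo_expired)
            return READ_FIFO;
        /* 3. then reads in sorted (seek-friendly) order */
        if (reads_pending)
            return READ_SORT;
        /* 4. otherwise fall back to any pending writes */
        return writes_pending ? WRITE_SORT : NONE;
    }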

Scheduler Tunables

The following definitions discuss the deadline scheduler's tuning potential, elaborating on the exposed tunables that may have to be adjusted to optimize I/O performance. Each I/O queue discussed incorporates a set of tunables that control the actual working behavior of the deadline scheduler (see /sys/block/device/iosched). Although it is impossible to stipulate generic tuning recommendations for these tunables, see Chapter 19, "Case Study: Tuning the I/O Schedulers in Linux 2.6," for experiments that show how to tune them.

  • The read_expire parameter, which is specified in milliseconds, is part of the deadline equation. The goal of the scheduler is to guarantee a start service time for a given I/O request. Because the design focuses mainly on read requests, each read I/O that enters the scheduler is assigned a deadline factor that consists of the current time plus the read_expire value.

  • The fifo_batch parameter governs the number of requests moved to the dispatch queue. As a read request expires, it becomes necessary to move some I/O requests from the sorted I/O scheduler list to the dispatch queue of the block device. Therefore, the fifo_batch parameter controls the batch size based on the cost of each I/O request. A request is qualified by the scheduler as either a seek or stream request. (For additional information, see the definitions of the seek_cost and stream_unit parameters, discussed in the following bullet.)

  • The seek_cost parameter quantifies the cost of a seek operation compared to a stream_unit, which is expressed in kilobytes. The stream_unit parameter dictates the number of kilobytes used to describe a single stream unit. A stream unit has an associated cost of 1. If a request consists of XY kilobytes, the actual cost = (XY + stream_unit - 1) / stream_unit. The combination of the stream_unit, seek_cost, and fifo_batch parameters determines the number of requests moved as an I/O request expires (see the sketch following this list).

  • The write_starved parameter, expressed in number of dispatches, indicates how many times the I/O scheduler assigns preference to read requests over write requests. When the I/O scheduler has to move requests to the dispatch queue, the preference scheme in the design favors reads over writes. However, the write requests cannot be starved indefinitely; after the read requests have been favored write_starved times, write requests are dispatched.

  • The front_merges parameter controls the request merge technique used by the scheduler. In some circumstances, a request may enter the scheduler that is contiguous to a request that is already in the I/O queue. It is feasible to assume that the new request may have a correlation to either the front or the back of the already queued request. The new request is labeled as either a front or a back merge candidate. Based on the way files are laid out, back merge operations are more common than front merges. For some workloads, it is unnecessary to even consider front merge operations; in that case, setting the front_merges flag to 0 disables that functionality. Even with the flag set to 0, front merges may still occur because of the cached merge_last hint component. Because this feature has an almost zero cost factor, it is not considered an I/O performance issue.
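
The sketch below works through the cost arithmetic defined for fifo_batch, seek_cost, and stream_unit; all parameter values are illustrative defaults, not the scheduler's shipped settings.

    #include <stdio.h>

    /* cost of one request in stream units, per the formula above:
     * (XY + stream_unit - 1) / stream_unit, plus seek_cost if the
     * request requires a head seek */
    static int request_cost(int kbytes, int needs_seek,
                            int seek_cost, int stream_unit)
    {
        int cost = (kbytes + stream_unit - 1) / stream_unit; /* ceiling */
        if (needs_seek)
            cost += seek_cost;
        return cost;
    }

    int main(void)
    {
        int stream_unit = 128;   /* illustrative: one unit per 128 KB */
        int seek_cost   = 16;    /* a seek counts as 16 stream units  */
        int fifo_batch  = 64;    /* dispatch budget, in units         */

        /* a 256 KB contiguous request vs. a 4 KB request with a seek;
         * requests are moved until the fifo_batch budget is spent */
        printf("contiguous 256 KB: %d units\n",
               request_cost(256, 0, seek_cost, stream_unit));
        printf("seeking 4 KB: %d units (budget %d)\n",
               request_cost(4, 1, seek_cost, stream_unit), fifo_batch);
        return 0;
    }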

The 2.6 Anticipatory I/O Scheduler

The design of the anticipatory I/O scheduler (AS) attempts to reduce the per-thread read response time. It introduces a controlled delay component into the dispatching equation. The delay is invoked on any new read request to the device driver. This allows a thread that just finished its read I/O request to submit a new read request, enhancing the chances that this scheduling behavior will result in smaller seek operations. The trade-off between reduced seeks and decreased disk utilization, due to the additional delay factor in dispatching a request, is managed by utilizing an actual cost-benefit analysis.

The next few paragraphs discuss the general design of an anticipatory I/O scheduler, outlining the different components that comprise the I/O framework. As a read I/O request completes, the I/O framework stalls for a brief amount of time, waiting for additional requests to arrive before dispatching a new request to the disk subsystem. The focus of this design is on application threads that rapidly generate another I/O request that could potentially be serviced before the scheduler chooses another task, possibly avoiding deceptive idleness. Deceptive idleness is defined as a condition that forces the scheduler into making a decision too early by assuming that the thread issuing the last request has no further disk requests lined up. This causes the scheduler to select an I/O request from another task.

The design discussed here argues that keeping the disk idle during the short stall period is not necessarily detrimental to I/O performance. The question of whether to wait at any given decision point and for how long is key to the effectiveness and performance of the implementation. The framework waits for the shortest possible period of time in which the scheduler expects the benefits of actively waiting to outweigh the costs of keeping the disk subsystem in an idle state. An assessment of the costs and benefits is possible only relative to a particular scheduling policy. To elaborate, a seek-reducing scheduler may want to wait for contiguous or proximal requests, whereas a proportional share scheduler may prefer weighted fairness as one of its primary criteria.

To allow for such a high degree of flexibility, while trying to minimize the burden on the development efforts for any particular disk scheduler, the anticipatory scheduling framework consists of three components:

  • The original disk scheduler, which implements the scheduling policy and is unaware of any anticipatory scheduling techniques

  • A scheduler-independent anticipation core

  • An adaptive scheduler-specific anticipation heuristic for seek reducing, such as SPTF or C-SCAN, as well as any potential proportional share (CFQ or YFQ) scheduler

The anticipation core implements the generic logic and timing mechanisms for waiting. It relies on the anticipation heuristic to decide whether and for how long to wait. The actual heuristic is implemented separately for each disk scheduler and has access to the internal state of the scheduler. To apply anticipatory scheduling to a new scheduling policy, it is merely necessary to implement an appropriate anticipation heuristic.

Any traditional work-conserving I/O scheduler operates in two states: idle or busy. Applications may issue I/O requests at any time, and these requests are normally placed in the scheduler's pool of requests. If the disk subsystem is idle at this point, or whenever another request completes, a new request is scheduled: the scheduler's select function is called, whereupon a request is chosen from the pool and dispatched to the disk device driver. The anticipation core forms a wrapper around this traditional scheduler scheme. Whenever the disk becomes idle, it invokes the scheduler to select a candidate request. However, instead of dequeuing and dispatching the request immediately, the framework first passes it to the anticipation heuristic for evaluation. A return value of 0 indicates that the heuristic has deemed it pointless to wait, and the core therefore dispatches the candidate request. A positive integer as a return value represents the waiting period, in microseconds, that the heuristic deems suitable. The core initiates a timeout for that period and enters a new wait state; although the disk is inactive, this state differs from idling because there are pending requests and an active timeout. If the timeout expires before any new request arrives, the previously chosen request is dispatched without further delay. New requests may, however, arrive during the wait period; they are added to the pool of I/O requests, and the anticipation core immediately asks the scheduler to select a new candidate from the pool and has the heuristic evaluate this new candidate.

This scenario may lead to an immediate dispatch of the new candidate request, or it may cause the core to remain in the wait state, depending on the scheduler's selection and the evaluation of the anticipation heuristic. In the latter case, the original timeout remains in effect, thus preventing unbounded waiting situations by repeatedly retriggering the timeout.
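
The control flow just outlined can be condensed into the following compilable sketch; the four helper routines are placeholders for the scheduler-specific pieces, and, unlike the real design, the sketch re-arms the timer on each re-evaluation rather than keeping the original timeout in effect.

    #include <stdbool.h>

    struct request;

    struct request *select_request(void);         /* scheduling policy     */
    unsigned heuristic_eval(struct request *rq);  /* 0, or wait time in us */
    void dispatch(struct request *rq);
    bool wait_for_request(unsigned usec);         /* false on timeout      */

    void anticipation_core(void)
    {
        struct request *candidate = select_request();

        while (candidate) {
            unsigned wait_us = heuristic_eval(candidate);
            if (wait_us == 0) {
                dispatch(candidate);          /* no benefit in waiting      */
                return;
            }
            /* disk idles with a pending request and an armed timeout */
            if (!wait_for_request(wait_us)) {
                dispatch(candidate);          /* timeout: stop anticipating */
                return;
            }
            /* a new request arrived: re-select and re-evaluate */
            candidate = select_request();
        }
    }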

Because the heuristic used is disk-scheduler-dependent, the discussion here only generalizes on the actual implementation techniques that may be utilized. Therefore, the next few paragraphs discuss a shortest positioning time first (SPTF)-based implementation. This is where the disk scheduler determines the positioning time for each available request based on the current head position and chooses the request that results in the shortest seek distance.

The heuristic has to evaluate the candidate request that was chosen by the scheduling policy. If the candidate I/O request is located close to the current head position, there is no need to wait on any other requests. Assuming synchronous I/O requests initiated by a single thread, the task that issued the last request is likely to submit the next request soon. If this request is expected to be close to the current request, the heuristic decides to wait. The waiting period is chosen as the expected 95th-percentile think-time, that is, the interval within which there is a 95% probability that the next request will arrive. This simple approach is transformed and generalized into a succinct cost-benefit equation that is intended to cover the entire range of values for the head positioning, as well as the think-times. The adaptive component of the heuristic consists of collecting online statistics on all the disk requests to estimate the different time variables that are used in the decision-making process. The expected positioning time for each process represents a weighted average over time of the positioning time for requests from that process, as measured upon request completion. Expected median and percentile think-times are estimated by maintaining a decayed frequency table of request think-times for each process.
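
A minimal sketch of those adaptive statistics follows; it keeps an exponentially weighted average per process, and the decay factor is an assumed value. For brevity, a decayed mean stands in for the decayed frequency table of think-times described above.

    #define DECAY 0.875   /* illustrative: 7/8 old estimate, 1/8 new sample */

    struct proc_stats {
        double avg_positioning_us;   /* weighted average positioning time */
        double avg_thinktime_us;     /* decayed mean think-time           */
    };

    /* update the per-process estimates when a request completes */
    static void update_stats(struct proc_stats *ps,
                             double positioning_us, double thinktime_us)
    {
        ps->avg_positioning_us = DECAY * ps->avg_positioning_us
                               + (1.0 - DECAY) * positioning_us;
        ps->avg_thinktime_us   = DECAY * ps->avg_thinktime_us
                               + (1.0 - DECAY) * thinktime_us;
    }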

The Linux 2.6 implementation of the anticipatory I/O scheduler follows the basic idea that if the disk drive just operated on a read request, it is assumed that another read request is in the pipeline and is worth the wait. The I/O scheduler starts a timer, and during this time no more I/O requests are passed to the device driver. If a close read request arrives during the wait time, it is serviced immediately. In the process, the actual distance that the kernel considers close grows as time passes. Eventually the close requests dry up, and the scheduler decides to submit some of the write requests.

The next few paragraphs discuss the AS I/O scheduler's tuning potential, elaborating on the exposed tunables that may have to be adjusted to optimize I/O performance. (See /sys/block/device/iosched.)

  • The parameter read_expire governs the timeframe until a read request is labeled as expired. This parameter further controls, to a certain extent, the interval in which expired requests are serviced. This approach equates to determining the timeslice that a single read request is allowed to use in the general presence of other I/O requests. The approximation 100 / ((seek time / read_expire) + 1) describes the percentage of streaming read efficiency that a physical disk should receive in an environment that consists of multiple concurrent read requests.

  • The parameter read_batch_expire governs the time assigned to a batch (or set) of read requests prior to serving any (potentially) pending write requests. A higher value increases the priority allotted to read requests. Setting the value to less than read_expire would reverse the scenario, at which point the write requests would be favored over the read requests. The literature suggests setting the parameter to a multiple of the read_expire value. The write_expire and write_batch_expire parameters describe and govern the corresponding behavior for any (potential) write requests.

  • The antic_expire parameter controls the maximum amount of time the AS scheduler will idle before moving on to another request. The literature suggests initializing the parameter to a slightly higher value for devices with large seek times (see the sketch after this list).
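
The sketch below shows one way to inspect and set such a tunable from user space. The path follows the /sys/block/device/iosched layout referenced above; the exact location and the device name (hda here) vary across 2.6 releases and systems, so both are assumptions.

    #include <stdio.h>

    int main(void)
    {
        const char *path = "/sys/block/hda/queue/iosched/antic_expire";
        char buf[32];

        FILE *f = fopen(path, "r");
        if (f && fgets(buf, sizeof(buf), f))
            printf("antic_expire = %s", buf);
        if (f)
            fclose(f);

        f = fopen(path, "w");          /* writing requires root */
        if (f) {
            fputs("10\n", f);          /* illustrative new value */
            fclose(f);
        }
        return 0;
    }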

The 2.6 CFQ Scheduler

The Completely Fair Queuing (CFQ) I/O scheduler can be considered to represent an extension to the better-known Stochastic Fair Queuing (SFQ) implementation. The focus of both implementations is on the concept of fair allocation of I/O bandwidth among all the initiators of I/O requests. An SFQ-based scheduler design was initially proposed for some network scheduling-related subsystems. The goal is to distribute the available I/O bandwidth as equally as possible among the I/O requests. The implementation utilizes n (normally 64) internal I/O queues, as well as a single I/O dispatch queue. During an enqueue operation, the process ID (PID) of the currently running process is utilized to select one of the internal queues, and the request is inserted into one of the queues in FIFO order. During dequeue, the SFQ design calls for a round-robin-based scan through the nonempty I/O queues, and it selects requests from the head of the queues. To avoid encountering too many seek operations, an entire round of requests is collected, sorted, and ultimately merged into the dispatch queue. The head request in the dispatch queue is then passed to the device driver.
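
The enqueue and dequeue scheme just described can be modeled in a few lines; the structure and function names below are illustrative, and the PID-modulo hash stands in for whatever hash the real implementation uses.

    #include <stddef.h>
    #include <sys/types.h>

    #define SFQ_QUEUES 64

    struct toy_req { struct toy_req *next; };

    static struct toy_req *queues[SFQ_QUEUES];   /* 64 internal FIFOs  */
    static int scan;                             /* round-robin cursor */

    /* hash the submitting PID to a queue and append in FIFO order */
    static void sfq_enqueue(pid_t pid, struct toy_req *rq)
    {
        struct toy_req **p = &queues[pid % SFQ_QUEUES];
        while (*p)
            p = &(*p)->next;
        rq->next = NULL;
        *p = rq;
    }

    /* round-robin scan of the nonempty queues, taking head requests */
    static struct toy_req *sfq_dequeue(void)
    {
        for (int i = 0; i < SFQ_QUEUES; i++) {
            int idx = (scan + i) % SFQ_QUEUES;
            if (queues[idx]) {
                struct toy_req *rq = queues[idx];
                queues[idx] = rq->next;
                scan = (idx + 1) % SFQ_QUEUES;   /* resume after this queue */
                return rq;
            }
        }
        return NULL;
    }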

Conceptually, a CFQ implementation does not utilize a hash function, so each I/O process is assigned its own internal queue; the number of I/O processes therefore determines the number of internal queues. In Linux 2.6.5, however, the CFQ I/O scheduler utilizes a hash function and a fixed number of request queues, thereby resembling an SFQ implementation. Both the CFQ and SFQ implementations strive to manage per-process I/O bandwidth and provide fairness at the granularity of a process.

The 2.6 noop I/O Scheduler

The Linux 2.6 noop I/O scheduler is a minimal-overhead I/O scheduler that provides basic merging and sorting functionality. The main usage of the noop scheduler revolves around non-disk-based block devices, such as memory devices, as well as specialized software or hardware environments that incorporate their own I/O scheduling and large caching functionality, and therefore require only minimal assistance from the kernel. In large I/O subsystems that incorporate RAID controllers and a vast number of contemporary physical disk drives that support tagged command queuing (TCQ), the noop scheduler has the potential to outperform the other three I/O schedulers as the workload increases.

I/O Scheduler Performance Implications

The next few paragraphs augment the I/O scheduler discussion and introduce some additional performance issues that have to be taken into consideration while conducting an I/O performance analysis. The current AS implementation consists of several different heuristics and policies that determine when and how I/O requests are dispatched to the I/O controller. The elevator algorithm that is utilized in AS is similar to the one used for the deadline scheduler. The main difference is that the AS implementation allows limited backward movement and supports backward seek operations. A backward seek operation may occur while choosing between two I/O requests, where one request is located behind the current head position of the elevator and the other request is ahead of the current position of the elevator.

The AS scheduler utilizes the lowest logical block of a request as the yardstick for sorting, as well as for determining the seek distance. If the seek distance to the request behind the elevator is less than half the seek distance to the request in front of the elevator, the request behind the elevator is chosen. Backward seek operations are limited to a maximum of MAXBACK (1024 * 1024) blocks. This approach favors the elevator's forward movement while still allowing short backward seeks. The expiration time for the requests held on the FIFO lists is tunable via the read_expire and write_expire parameters (see Appendix A).
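
The backward-seek rule reduces to a simple comparison, sketched below with illustrative names; block addresses stand in for the lowest-logical-block yardstick described above.

    #define MAXBACK (1024 * 1024)   /* maximum backward seek, in blocks */

    /* return nonzero if the request behind the head should be chosen
     * over the request ahead of it */
    static int choose_backward(unsigned long head,
                               unsigned long behind, unsigned long ahead)
    {
        unsigned long back_dist  = head - behind;   /* behind <= head */
        unsigned long front_dist = ahead - head;    /* ahead >= head  */

        return back_dist <= MAXBACK && 2 * back_dist < front_dist;
    }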

When a read or write operation expires, the AS I/O scheduler interrupts either the current elevator sweep or the read anticipation process to service the expired request. In the next section, we'll delve further into the world of read and write requests.
