5.2. Devices

Two kinds of device files exist: block device files and character device files. Block devices transfer data in chunks, and character devices (as the name implies) transfer data one character at a time. A third device type, the network device, is a special case that exhibits attributes of both block and character devices. However, network devices are not represented by files.

The old method of assigning device numbers, in which the major number usually referred to a device driver or controller and the minor number to a particular device within that controller, is giving way to a new dynamic method called devfs. The reason for this change is that the major and minor numbers are both 8-bit values; this allows for little more than 200 statically allocated major devices for the entire planet. (Block and character devices each have their own list of 256 entries.) You can find the official listing of the allocated major and minor device numbers in /Documentation/devices.txt.
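To make the 8-bit limit concrete, the following user-space sketch packs a major/minor pair into a single 16-bit value. The names old_mkdev(), old_major(), and old_minor() are hypothetical helpers invented for this illustration; they are not the kernel's own macros, and the layout (major in the high byte) is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: 8-bit major in the high byte, 8-bit minor in the low byte. */
#define OLD_MINOR_BITS 8
#define OLD_MINOR_MASK ((1u << OLD_MINOR_BITS) - 1)

static uint16_t old_mkdev(unsigned major, unsigned minor)
{
    /* Each half must fit in 8 bits, which caps each list at 256 entries. */
    assert(major <= 0xff && minor <= 0xff);
    return (uint16_t)((major << OLD_MINOR_BITS) | minor);
}

static unsigned old_major(uint16_t dev)
{
    return dev >> OLD_MINOR_BITS;
}

static unsigned old_minor(uint16_t dev)
{
    return dev & OLD_MINOR_MASK;
}
```

With this layout, a device with major 3 and minor 0 packs to 0x0300, and old_major() and old_minor() recover the two halves.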

The Linux Device Filesystem (devfs) has been in the kernel since version 2.3.46. devfs is not included by default in the 2.6.7 kernel build, but it can be enabled by setting CONFIG_DEVFS_FS=y in the configuration file. With devfs, a module can register a device by name rather than by a major/minor number pair. For compatibility, devfs allows the use of old major/minor numbers or generates a unique 16-bit device number on any given system.

5.2.1. Block Device Overview

As previously mentioned, the Linux operating system sees all devices as files. Any given element in a block device can be randomly referenced. A good example of a block device is the disk drive. The filesystem name for the first IDE disk is /dev/hda. The associated major number of /dev/hda is 3, and the minor number is 0. The disk drive itself usually has a controller and is electro-mechanical by nature (that is, it has moving parts). The "General System File" section in Chapter 6, "Filesystems," discusses the basic construction of a hard disk.

5.2.1.1. Generic Block Device Layer

The device driver registers itself at driver initialization time. This adds the driver to the kernel's driver table, mapping the device number to the block_device_operations structure. The block_device_operations structure contains the functions for starting and stopping a given block device in the system:

-------------------------------------------------------------------------
include/linux/fs.h
760  struct block_device_operations {
761   int (*open) (struct inode *, struct file *);
762   int (*release) (struct inode *, struct file *);
763   int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long);
764   int (*media_changed) (struct gendisk *);
765   int (*revalidate_disk) (struct gendisk *);
766   struct module *owner;
767  };
-------------------------------------------------------------------------

The interfaces to the block device are similar to other devices. The functions open() (on line 761) and release() (on line 762) are synchronous (that is, they run to completion when called). The most important functions, read() and write(), are implemented differently with block devices because of their mechanical nature. Consider accessing a block of data from a disk drive. The time it takes to position the head on the proper track and for the disk to rotate to the desired block is long from the processor's point of view. This latency is the driving force for the implementation of the system request queue. When the filesystem requests a block (or more) of data, and it is not in the local page cache, it places the request on a request queue and passes this queue on to the generic block device layer. The generic block device layer then determines the most efficient way to mechanically retrieve (or store) the information, and passes this on to the hard disk driver.

Most importantly, at initialization time, the block device driver registers a request queue handler with the kernel (specifically with the block device manager) to facilitate the read/write operations for the block device. The generic block device layer acts as an interface between the filesystem and the register-level interface of the device and allows for per-queue tuning of the read and write queues to make better use of the new and smarter devices available. This is accomplished through the tagged command queuing helper utilities. For example, if a device on a given queue supports command queuing, read and write operations can be optimized to exploit the underlying hardware by reordering requests. An example of per-queue tuning in this case would be the ability to set how many requests are allowed to be pending. See Figure 5.4 for an illustration of how the application layer, the filesystem layer, the generic block device layer, and the device driver interrelate. The file biodoc.txt under /Documentation/block has more helpful information on this layer and information regarding changes from earlier kernels.

Figure 5.4. Block Read/Write

5.2.2. Request Queues and Scheduling I/O

When a read or write request traverses the layers from VFS, through the filesystem drivers and page cache,^[6] it eventually ends up entering the block device driver to perform the actual I/O on the device that holds the data requested.

^[6] This traversal is described in Chapter 6.

As previously mentioned, the block device driver creates and initializes a request queue upon initialization. This initialization also determines the I/O scheduling algorithm to use when a read or write is attempted on the block device. The I/O scheduling algorithm is also known as the elevator algorithm.

The default I/O scheduling algorithm is determined by the kernel at boot time with the default being the anticipatory I/O scheduler.^[7] By setting the kernel parameter elevator to the following values, you can change the type of I/O scheduler:

^[7] Some block device drivers can change their I/O scheduler during runtime, if it's visible in sysfs.

deadline. For the deadline I/O scheduler
noop. For the no-operation I/O scheduler
as. For the anticipatory I/O scheduler

As of this writing, a patch exists that makes the I/O schedulers fully modular. Using modprobe, the user can load the modules and switch between them on the fly.^[8] With this patch, at least one scheduler must be compiled into the kernel to begin with.

^[8] For more information, do a Web search on "Jens Axboe" and "Modular IO Schedulers."

Before we can describe how these I/O schedulers work, we need to touch on the basics of request queues.

Block devices use request queues to order the many block I/O requests the devices are given. Certain block devices, such as a RAM disk, might have little need for ordering requests because the I/O requests to the device have little overhead. Other block devices, like hard drives, need to order requests because there is a great overhead in reading and writing. As previously mentioned, the head of the hard drive has to move from track to track, and each movement is agonizingly slow from the CPU's perspective.

Request queues solve this problem by attempting to order block I/O read and write requests in a manner that optimizes throughput but does not indefinitely postpone requests. A common and useful analogy of I/O scheduling is to look at how elevators work.^[9] If you were to order the stops an elevator took by the order of the requests, you would have the elevator moving inefficiently from floor to floor; it could go from the penthouse to the ground floor without ever stopping for anyone in between. By responding to requests that occur while the elevator travels in the same direction, it increases the elevator's efficiency and the riders' happiness. Similarly, I/O requests to a hard disk should be grouped together to avoid the high overhead of repeatedly moving the disk head back and forth. All the I/O schedulers mentioned (no-op, deadline, and anticipatory) implement this basic elevator functionality. The following sections look at these elevators in more detail.
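The cost difference in the elevator analogy can be shown with a little arithmetic. The following user-space sketch compares total head travel when requests are serviced in arrival order against a single ascending sweep. The function names and the idea of measuring travel as the sum of sector distances are assumptions made for illustration; real schedulers are considerably more involved.

```c
#include <stdlib.h>

/* Total head travel when servicing sectors in the order given. */
static long total_travel(long start, const long *sectors, size_t n)
{
    long pos = start, travel = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        travel += labs(sectors[i] - pos);
        pos = sectors[i];
    }
    return travel;
}

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;

    return (x > y) - (x < y);
}

/* Reorder the requests into one ascending sweep (a simplified elevator). */
static void sort_ascending(long *sectors, size_t n)
{
    qsort(sectors, n, sizeof(*sectors), cmp_long);
}
```

For requests at sectors 90, 10, 80, and 20 with the head starting at 0, arrival order costs 300 sectors of travel, while the sorted sweep costs 90.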

^[9] This analogy is why I/O schedulers are also referred to as elevators.

5.2.2.1. No-Op I/O Scheduler

The no-op I/O scheduler^[10] takes a request and scans through its queue to determine if it can be merged with an existing request. This occurs if the new request is close to an existing request. If the new request is for I/O blocks before an existing request, it is merged on the front of the existing request. If the new request is for I/O blocks after an existing request, it is merged on the back of the existing request. In normal I/O, we read the beginning of a file before the end, and thus, most requests are merged onto the back of existing requests.

^[10] The code for the no-op I/O scheduler is located in drivers/block/noop-iosched.c.

If the no-op I/O scheduler finds that the new request cannot be merged into the existing request because it is not near enough, the scheduler looks for a place within the queue between existing requests. If the new request calls for I/O to sectors between existing requests it is inserted into the queue at the determined position. If there are no places the request can be inserted, it is placed on the tail of the request queue.
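The merge-then-insert behavior described above can be sketched in user space as follows. This is a simplified model under stated assumptions, not the code in noop-iosched.c: the struct and function names are hypothetical, a request is just a sector range, and "close" is taken to mean exactly adjacent.

```c
#include <stddef.h>

#define MAX_REQS 16

struct req { long start; long nsect; };          /* covers [start, start + nsect) */
struct queue { struct req r[MAX_REQS]; size_t n; };

enum outcome { BACK_MERGE, FRONT_MERGE, INSERTED, APPENDED, QUEUE_FULL };

static enum outcome add_request(struct queue *q, struct req in)
{
    size_t i, j;

    /* 1. Try to merge with an adjacent existing request. */
    for (i = 0; i < q->n; i++) {
        struct req *e = &q->r[i];

        if (e->start + e->nsect == in.start) {   /* new blocks follow e */
            e->nsect += in.nsect;
            return BACK_MERGE;
        }
        if (in.start + in.nsect == e->start) {   /* new blocks precede e */
            e->start = in.start;
            e->nsect += in.nsect;
            return FRONT_MERGE;
        }
    }
    if (q->n == MAX_REQS)
        return QUEUE_FULL;

    /* 2. Look for two neighbors whose sectors bracket the new request. */
    for (i = 0; i + 1 < q->n; i++) {
        if (q->r[i].start < in.start && in.start < q->r[i + 1].start) {
            for (j = q->n; j > i + 1; j--)
                q->r[j] = q->r[j - 1];
            q->r[i + 1] = in;
            q->n++;
            return INSERTED;
        }
    }

    /* 3. No suitable position: place the request on the tail. */
    q->r[q->n++] = in;
    return APPENDED;
}
```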

5.2.2.2. Deadline I/O Scheduler

The no-op I/O scheduler^[11] suffers from a major problem; with enough close requests, new requests are never handled. Many new requests that are close to existing ones would be either merged or inserted between existing elements, and new requests would pile up at the tail of the request queue. The deadline scheduler attempts to solve this problem by assigning each request an expiration time and uses two additional queues to manage time efficiency as well as a queue similar to the no-op algorithm to model disk efficiency.

^[11] The code for the deadline I/O scheduler is located in drivers/block/deadline-iosched.c.

When an application makes a read request, it typically waits until that request is fulfilled before continuing. Write requests, on the other hand, will not normally cause an application to wait; the write can execute in the background while the application continues on to other tasks. The deadline I/O scheduler uses this information to favor read requests over write requests. A read queue and write queue are kept in addition to the queue sorted by a request's sector proximity. In the read and write queue, requests are ordered by time (FIFO).

When a new request comes in, it is placed on the sorted queue as in the no-op scheduler. The request is also placed on either the read queue or write queue depending on its I/O request. When the deadline I/O scheduler handles a request, it first checks the head of the read queue to see if that request has expired. If that request's expiration time has been reached, it is immediately handled. Similarly, if no read request has expired, the scheduler checks the write queue to see if the request at its head has expired; if so, it is immediately handled. The standard queue is checked only when no reads or writes have expired and requests are handled in nearly the same way as the no-op algorithm.

Read requests also expire faster than write requests: half a second versus 5 seconds in the default case. This expiration difference and the preference of handling read requests over write requests can lead to write requests being starved by numerous read requests. As such, a parameter tells the deadline I/O scheduler the maximum number of times reads can starve a write; the default is 2, but because sequential requests can be treated as a single request, 32 sequential read requests could pass before a write request is considered starved.^[12]
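The order in which the deadline scheduler consults its three queues can be sketched as a small decision function. This is a user-space illustration only: the type and function names are hypothetical, time is an abstract tick count, and the write-starvation limit described above is deliberately left out to keep the sketch short.

```c
#include <stddef.h>

struct dl_req { long sector; long deadline; };    /* deadline in ticks */

enum source { FROM_READ_FIFO, FROM_WRITE_FIFO, FROM_SORTED, NOTHING };

/* Each pointer is the head of its queue, or NULL when that queue is empty. */
static enum source pick_next(const struct dl_req *read_head,
                             const struct dl_req *write_head,
                             const struct dl_req *sorted_head,
                             long now)
{
    /* An expired read at the head of the read FIFO is handled first. */
    if (read_head && now >= read_head->deadline)
        return FROM_READ_FIFO;

    /* Otherwise, an expired write at the head of the write FIFO. */
    if (write_head && now >= write_head->deadline)
        return FROM_WRITE_FIFO;

    /* Nothing has expired: fall back to the sector-sorted queue. */
    if (sorted_head)
        return FROM_SORTED;

    return NOTHING;
}
```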

^[12] See lines 24-27 of deadline-iosched.c for parameter definitions.

5.2.2.3. Anticipatory I/O Scheduling

One of the problems with the deadline I/O scheduling algorithm occurs during intensive write operations. Because of the emphasis on maximizing read efficiency, a write request can be preempted by a read, have the disk head seek to new location, and then return to the write request and have the disk head seek back to its original location. Anticipatory I/O scheduling^[13] attempts to anticipate what the next operation is and aims to improve I/O throughput in doing so.

^[13] The code for anticipatory I/O scheduling is located in drivers/block/as-iosched.c.

Structurally, the anticipatory I/O scheduler is similar to the deadline I/O scheduler. There exist a read and write queue each ordered by time (FIFO) and a default queue that is ordered by sector proximity. The main difference is that after a read request, the scheduler does not immediately proceed to handling other requests. It does nothing for 6 milliseconds in anticipation of an additional read. If another read request does occur to an adjacent area, it is immediately handled. After the anticipation period, the scheduler returns to its normal operation as described under the deadline I/O scheduler.

This anticipation period helps minimize the I/O delay associated with moving the disk head from sector to sector across the block device.
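The anticipation step can be pictured as a three-way decision made while the scheduler idles after a read. The sketch below is a user-space illustration under assumptions: the names are hypothetical, and the max_dist parameter (how near a new read must be to count as "adjacent") is invented for the example rather than taken from as-iosched.c.

```c
#include <stddef.h>
#include <stdlib.h>

enum as_action { AS_SERVICE_READ, AS_KEEP_WAITING, AS_RESUME_NORMAL };

/*
 * last_sector: where the previous read finished
 * waited_ms:   how long the scheduler has idled so far
 * window_ms:   length of the anticipation period
 * next_read:   sector of a newly arrived read, or NULL if none arrived
 * max_dist:    hypothetical threshold for "adjacent"
 */
static enum as_action as_decide(long last_sector, long waited_ms, long window_ms,
                                const long *next_read, long max_dist)
{
    /* The anticipation period is over: return to normal operation. */
    if (waited_ms >= window_ms)
        return AS_RESUME_NORMAL;

    /* A read near the previous one arrived in time: handle it immediately. */
    if (next_read && labs(*next_read - last_sector) <= max_dist)
        return AS_SERVICE_READ;

    /* Otherwise stay idle and leave the disk head where it is. */
    return AS_KEEP_WAITING;
}
```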

Like the deadline I/O scheduler, a number of parameters control the anticipatory I/O scheduling algorithm. The default time for reads to expire is 1/8 second and the default time for writes to expire is 1/4 second. Two parameters control when to check to switch between streams of reads and writes.^[14] A stream of reads checks for expired writes after 1/2 second and a stream of writes checks for expired reads after 1/8 second.

^[14] See lines 30-60 of as-iosched.c for parameter definitions.

The default I/O scheduler is the anticipatory I/O scheduler because it optimizes throughput for most applications and block devices. The deadline I/O scheduler is sometimes better for database applications or other applications with high disk performance requirements. The no-op I/O scheduler is usually used in systems where I/O seek time is nearly negligible, such as embedded systems running from RAM.

We now turn our attention from the various I/O schedulers in the Linux kernel to the request queue itself and the manner in which block devices initialize request queues.

5.2.2.4. Request Queue

In Linux 2.6, each block device has its own request queue that manages I/O requests to that device. A process can only update a device's request queue if it has obtained the lock of the request queue. Let's examine the request_queue structure:

-------------------------------------------------------------------------
include/linux/blkdev.h
270 struct request_queue
271 {
272   /*
273   * Together with queue_head for cacheline sharing
274   */
275   struct list_head  queue_head;
276   struct request   *last_merge;
277   elevator_t    elevator;
278
279   /*
280   * the queue request freelist, one for reads and one for writes
281   */
282   struct request_list  rq;
-------------------------------------------------------------------------

Line 275

This line is a pointer to the head of the request queue.

Line 276

This is the last request placed into the request queue.

Line 277

The scheduling function (elevator) used to manage the request queue. This can be one of the standard I/O schedulers (noop, deadline, or anticipatory) or a new type of scheduler specifically designed for the block device.

Line 282

The request_list is a structure composed of two wait_queues: one for queuing reads to the block device and one for queuing writes.

-------------------------------------------------------------------------
include/linux/blkdev.h
283
284   request_fn_proc   *request_fn;
285   merge_request_fn  *back_merge_fn;
286   merge_request_fn  *front_merge_fn;
287   merge_requests_fn  *merge_requests_fn;
288   make_request_fn   *make_request_fn;
289   prep_rq_fn    *prep_rq_fn;
290   unplug_fn    *unplug_fn;
291   merge_bvec_fn   *merge_bvec_fn;
292   activity_fn    *activity_fn;
293
-------------------------------------------------------------------------

Lines 283-293

These scheduler- (or elevator-) specific functions can be defined to control how requests are managed for the block device.

-------------------------------------------------------------------------
include/linux/blkdev.h
294   /*
295   * Auto-unplugging state
296   */
297   struct timer_list  unplug_timer;
298   int      unplug_thresh; /* After this many requests */
299   unsigned long   unplug_delay; /* After this many jiffies*/
300   struct work_struct  unplug_work;
301
302   struct backing_dev_info backing_dev_info;
303
-------------------------------------------------------------------------

Lines 294-303

These fields control the unplugging of the request queue for the block device. Plugging refers to the practice of waiting for more requests to fill the request queue, with the expectation that more requests allow the scheduling algorithm to order and sort I/O requests in a way that reduces the time it takes to perform them. For example, a hard drive "plugs" a certain number of read requests with the expectation that it moves the disk head less when more reads exist. It's more likely that the reads can be arranged sequentially or even clustered together into a single large read. Unplugging refers to the method in which a device decides that it can wait no longer and must service the requests it has, regardless of possible future optimizations. See Documentation/block/biodoc.txt for more information.
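The two unplug triggers named in the listing's comments (a request count and a delay) can be modeled with a short user-space sketch. The field names unplug_thresh and unplug_delay come from the listing; the struct, the functions, and the use of a plain tick counter in place of kernel timers are assumptions made for illustration.

```c
#include <stdbool.h>

struct plug_state {
    bool plugged;
    int pending;                 /* requests gathered while plugged */
    int unplug_thresh;           /* unplug after this many requests */
    unsigned long unplug_delay;  /* ...or after this many ticks */
    unsigned long plugged_at;    /* tick at which the queue was plugged */
};

/* Plug the queue if it is not plugged already. */
static void plug(struct plug_state *p, unsigned long now)
{
    if (!p->plugged) {
        p->plugged = true;
        p->plugged_at = now;
    }
}

/* Queue one request; returns true when the queue should be unplugged. */
static bool add_and_check(struct plug_state *p, unsigned long now)
{
    plug(p, now);
    p->pending++;
    return p->pending >= p->unplug_thresh ||
           now - p->plugged_at >= p->unplug_delay;
}

/* Unplug: hand everything gathered so far to the driver. */
static int unplug(struct plug_state *p)
{
    int n = p->pending;

    p->plugged = false;
    p->pending = 0;
    return n;
}
```

In this model, a queue with a threshold of 4 and a delay of 3 ticks unplugs on whichever condition is met first.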

-------------------------------------------------------------------------
include/linux/blkdev.h
304   /*
305   * The queue owner gets to use this for whatever they like.
306   * ll_rw_blk doesn't touch it.
307   */
308   void     *queuedata;
309
310   void     *activity_data;
311
-------------------------------------------------------------------------

Lines 304-311

As the inline comments suggest, these lines hold request queue management data that is specific to the device and/or device driver.

-------------------------------------------------------------------------
include/linux/blkdev.h
312   /*
313   * queue needs bounce pages for pages above this limit
314   */
315   unsigned long   bounce_pfn;
316   int      bounce_gfp;
317
-------------------------------------------------------------------------

Lines 312-317

Bouncing refers to the practice of the kernel copying high-memory buffer I/O requests to low-memory buffers. In Linux 2.6, the kernel allows the device itself to manage high-memory buffers if it wants. Bouncing now typically occurs only if the device cannot handle high-memory buffers.

-------------------------------------------------------------------------
include/linux/blkdev.h
318   /*
319   * various queue flags, see QUEUE_* below
320   */
321   unsigned long   queue_flags;
322
-------------------------------------------------------------------------

Lines 318-321

The queue_flags variable stores one or more of the queue flags shown in Table 5.1 (see include/linux/blkdev.h, lines 368-375).

Table 5.1. queue_flags
Flag Name	Flag Function
`QUEUE_FLAG_CLUSTER`	`/* cluster several segments into 1 */`
`QUEUE_FLAG_QUEUED`	`/* uses generic tag queuing */`
`QUEUE_FLAG_STOPPED`	`/* queue is stopped */`
`QUEUE_FLAG_READFULL`	`/* read queue has been filled */`
`QUEUE_FLAG_WRITEFULL`	`/* write queue has been filled */`
`QUEUE_FLAG_DEAD`	`/* queue being torn down */`
`QUEUE_FLAG_REENTER`	`/* Re-entrancy avoidance */`
`QUEUE_FLAG_PLUGGED`	`/* queue is plugged */`
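Because queue_flags is a single unsigned long, each flag in Table 5.1 occupies one bit. The following user-space sketch shows how such flags could be set, cleared, and tested. The flag names come from the table, but the bit positions assigned by the enum and the three helper functions are assumptions of this example, not the kernel's definitions.

```c
#include <stdbool.h>

/* Bit positions are assumed here (table order); check blkdev.h for the real values. */
enum {
    QUEUE_FLAG_CLUSTER,
    QUEUE_FLAG_QUEUED,
    QUEUE_FLAG_STOPPED,
    QUEUE_FLAG_READFULL,
    QUEUE_FLAG_WRITEFULL,
    QUEUE_FLAG_DEAD,
    QUEUE_FLAG_REENTER,
    QUEUE_FLAG_PLUGGED
};

/* Hypothetical helpers operating on a queue_flags-style word. */
static void flag_set(unsigned long *flags, int bit)
{
    *flags |= 1UL << bit;
}

static void flag_clear(unsigned long *flags, int bit)
{
    *flags &= ~(1UL << bit);
}

static bool flag_test(unsigned long flags, int bit)
{
    return (flags >> bit) & 1UL;
}
```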

-------------------------------------------------------------------------
include/linux/blkdev.h
323   /*
324   * protects queue structures from reentrancy
325   */
326   spinlock_t    *queue_lock;
327
328   /*
329   * queue kobject
330   */
331   struct kobject kobj;
332
333   /*
334   * queue settings
335   */
336   unsigned long   nr_requests; /* Max # of requests */
337   unsigned int   nr_congestion_on;
338   unsigned int   nr_congestion_off;
339
340   unsigned short   max_sectors;
341   unsigned short   max_phys_segments;
342   unsigned short   max_hw_segments;
343   unsigned short   hardsect_size;
344   unsigned int   max_segment_size;
345
346   unsigned long   seg_boundary_mask;
347   unsigned int   dma_alignment;
348
349   struct blk_queue_tag *queue_tags;
350
351   atomic_t    refcnt;
352
353   unsigned int   in_flight;
354
355   /*
356   * sg stuff
357   */
358   unsigned int   sg_timeout;
359   unsigned int   sg_reserved_size;
360 };
-------------------------------------------------------------------------

Lines 323-360

These variables define manageable resources of the request queue, such as locks (line 326) and kernel objects (line 331). Specific request queue settings, such as the maximum number of requests (line 336) and the physical constraints of the block device (lines 340-347) are also provided. SCSI attributes (lines 355-359) can also be defined, if they're applicable to the block device. If you want to use tagged command queuing, use the queue_tags structure (on line 349). The refcnt and in_flight fields (on lines 351 and 353) count the number of references to the queue (commonly used in locking) and the number of requests that are in process ("in flight").

Request queues used by block devices are initialized simply in the 2.6 Linux kernel by calling the following function in the devices' __init function. Within this function, we can see the anatomy of a request queue and its associated helper routines. In the 2.6 Linux kernel, each block device controls its own locking, which is contrary to some earlier versions of Linux, and passes a spinlock as the second argument. The first argument is a request function that the block device driver provides.

-------------------------------------------------------------------------
drivers/block/ll_rw_blk.c
1397 request_queue_t *blk_init_queue(request_fn_proc *rfn, spinlock_t *lock)
1398  {
1399   request_queue_t *q;
1400   static int printed;
1401
1402   q = blk_alloc_queue(GFP_KERNEL);
1403   if (!q)
1404    return NULL;
1405
1406   if (blk_init_free_list(q))
1407    goto out_init;
1408
1409   if (!printed) {
1410    printed = 1;
1411    printk("Using %s io scheduler\n", chosen_elevator->elevator_name);
1412   }
1413
1414   if (elevator_init(q, chosen_elevator))
1415    goto out_elv;
1416
1417   q->request_fn   = rfn;
1418   q->back_merge_fn   = ll_back_merge_fn;
1419   q->front_merge_fn   = ll_front_merge_fn;
1420   q->merge_requests_fn  = ll_merge_requests_fn;
1421   q->prep_rq_fn   = NULL;
1422   q->unplug_fn   = generic_unplug_device;
1423   q->queue_flags   = (1 << QUEUE_FLAG_CLUSTER);
1424   q->queue_lock   = lock;
1425
1426   blk_queue_segment_boundary(q, 0xffffffff);
1427
1428   blk_queue_make_request(q, __make_request);
1429   blk_queue_max_segment_size(q, MAX_SEGMENT_SIZE);
1430
1431   blk_queue_max_hw_segments(q, MAX_HW_SEGMENTS);
1432   blk_queue_max_phys_segments(q, MAX_PHYS_SEGMENTS);
1433
1434   return q;
1435  out_elv:
1436   blk_cleanup_queue(q);
1437  out_init:
1438   kmem_cache_free(requestq_cachep, q);
1439   return NULL;
1440  }
-------------------------------------------------------------------------

Line 1402

Allocate the queue from kernel memory and zero its contents.

Line 1406

Initialize the request list that contains a read queue and a write queue.

Line 1414

Associate the chosen elevator with this queue and initialize.

Lines 1417-1424

Associate the elevator-specific functions with this queue.

Line 1426

This function sets the boundary for segment merging and checks that it is at least a minimum size.

Line 1428

This function sets the function used to get requests off the queue by the driver. It allows an alternate function to be used to bypass the queue.

Line 1429

Initialize the upper-size limit on a combined segment.

Line 1431

Initialize the maximum segments the physical device can handle.

Line 1432

Initialize the maximum number of physical segments per request.

The values for lines 1429-1432 are set in include/linux/blkdev.h.

Line 1434

Return the initialized queue.

Lines 1435-1439

Routines to clean up memory in the event of an error.

We now have the request queue in place and initialized.
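The staged setup and goto-based unwinding used by blk_init_queue() is a common C idiom, and it can be tried out in user space. The sketch below mirrors only the shape of that function: the toy_ names, the struct members, and the two failure-injection parameters are hypothetical, and malloc()/free() stand in for the kernel allocators.

```c
#include <stdlib.h>

struct toy_queue {
    int *free_list;   /* stands in for the request free list */
    int *elevator;    /* stands in for the elevator's private data */
};

/* The fail_* arguments force a stage to fail so the unwind paths can be exercised. */
static struct toy_queue *toy_init_queue(int fail_free_list, int fail_elevator)
{
    struct toy_queue *q = calloc(1, sizeof(*q));   /* allocate and zero */

    if (!q)
        return NULL;

    q->free_list = fail_free_list ? NULL : malloc(sizeof(int));
    if (!q->free_list)
        goto out_init;

    q->elevator = fail_elevator ? NULL : malloc(sizeof(int));
    if (!q->elevator)
        goto out_elv;

    return q;

out_elv:
    free(q->free_list);   /* undo the stage that succeeded */
out_init:
    free(q);
    return NULL;
}

static void toy_cleanup_queue(struct toy_queue *q)
{
    if (!q)
        return;
    free(q->elevator);
    free(q->free_list);
    free(q);
}
```

Each label undoes only the work completed before the failing step, so a later failure falls through the earlier labels and releases everything in reverse order.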

Before we explore the generic device layer and the generic block driver, let's quickly trace the layers of software it takes to get to the manipulation of I/O in the block device. (Refer to Figure 5.4.)

At the application level, an application has initiated a file operation such as fread(). This request is taken by the virtual filesystem (VFS) layer (covered in Chapter 4), where the file's dentry structure is found, and through the inode structure, where the file's read() function is called. The VFS layer tries to find the requested page in its buffer cache, but if it is a miss, the filesystem handler is called to acquire the appropriate physical blocks. The inode is linked to the filesystem handler, which is associated with the correct filesystem. The filesystem handler calls on the request queue utilities, which are part of the generic block device layer to create a request with the correct physical blocks and device. The request is put on the request queue, which is maintained by the generic block device layer.

5.2.3. Example: "Generic" Block Driver

We now look at the generic block device layer. Referring to Figure 5.4, it resides above the physical device layer and just below the filesystem layer. The most important job of the generic block layer is to maintain request queues and their related helper routines.

We first register our device with register_blkdev(major, dev_name, fops). This function takes in the requested major number, the name of this block device (this appears in the /dev directory), and a pointer to the file operations structure. If successful, it returns the desired major number.

Next, we create the gendisk structure.

The function alloc_disk(int minors) in include/linux/genhd.h takes in the number of partitions and returns a pointer to the gendisk structure. We now look at the gendisk structure:

-------------------------------------------------------------------------
include/linux/genhd.h
081  struct gendisk {
082   int major;    /* major number of driver */
083   int first_minor;
084   int minors;
085   char disk_name[16];   /* name of major driver */
086   struct hd_struct **part;  /* [indexed by minor] */
087   struct block_device_operations *fops;
088   struct request_queue *queue;
089   void *private_data;
090   sector_t capacity;
091
092   int flags;
093   char devfs_name[64];   /* devfs crap */
094   int number;    /* more of the same */
095   struct device *driverfs_dev;
096   struct kobject kobj;
097
098   struct timer_rand_state *random;
099   int policy;
100
101   unsigned sync_io;   /* RAID */
102   unsigned long stamp, stamp_idle;
103   int in_flight;
104  #ifdef  CONFIG_SMP
105   struct disk_stats *dkstats;
106  #else
107   struct disk_stats dkstats;
108  #endif
109  };
-------------------------------------------------------------------------

Line 82

The major field is filled in from the result of register_blkdev().

Line 83

A block device for a hard drive could handle several physical drives. Although it is driver dependent, the minor number usually labels each physical drive. The first_minor field is the first of the physical drives.

Line 85

The disk_name, such as hda or sdb, is the text name for an entire disk. (Partitions within a disk are named hda1, hda2, and so on.) These are logical disks within a physical disk device.

Line 87

The fops field is the block_device_operations initialized to the file operations structure. The file operations structure contains pointers to the helper functions in the low-level device driver. These functions are driver dependent in that they are not all implemented in every driver. Commonly implemented file operations are open, close, read, and write. Chapter 4, "Memory Management," discusses the file operations structure.

Line 88

The queue field points to the list of requested operations that the driver must perform. Initialization of the request queue is discussed shortly.

Line 89

The private_data field is for driver-dependent data.

Line 90

The capacity field is to be set with the drive size (in 512-byte sectors). A call to set_capacity() should furnish this value.

Line 92

The flags field indicates device attributes. In case of a disk drive, it is the type of media, such as CD, removable, and so on.

Now, we look at what is involved with initializing the request queue. With the queue already declared, we call blk_init_queue(request_fn_proc, spinlock_t). This function takes, as its first parameter, the transfer function to be called on behalf of the filesystem. The function blk_init_queue() allocates the queue with blk_alloc_queue() and then initializes the queue structure. The second parameter to blk_init_queue() is a lock to be associated with the queue for all operations.

Finally, to make this block device visible to the kernel, the driver must call add_disk():

-------------------------------------------------------------------------
drivers/block/genhd.c
193  void add_disk(struct gendisk *disk)
194  {
195   disk->flags |= GENHD_FL_UP;
196   blk_register_region(MKDEV(disk->major, disk->first_minor),
197     disk->minors, NULL, exact_match, exact_lock, disk);
198   register_disk(disk);
199   blk_register_queue(disk);
200  }
-------------------------------------------------------------------------

Line 196

This device is mapped into the kernel based on size and number of partitions.

The call to blk_register_region() has the following six parameters:

The disk major number and first minor number are built into this parameter.
This is the range of minor numbers after the first (if this driver handles multiple minor numbers).
This is the loadable module containing the driver (if any).
exact_match is a routine to find the proper disk.
exact_lock is a locking function for this code once the exact_match routine finds the proper disk.
Disk is the handle used for the exact_match and exact_lock functions to identify a specific disk.

Line 198

register_disk checks for partitions and adds them to the filesystem.

Line 199

blk_register_queue() registers the request queue associated with this disk.

5.2.4. Device Operations

The basic generic block device has open, close (release), ioctl, and most important, the request function. At the least, the open and close functions could be simple usage counters. The ioctl() interface can be used for debug and performance measurements by bypassing the various software layers. The request function, which is called when a request is put on the queue by the filesystem, extracts the request structure and acts upon its contents. Depending on whether the request is a read or write, the device takes the appropriate action.

The request queue is not accessed directly, but by a set of helper routines. (These can be found in drivers/block/elevator.c and include/linux/blkdev.h.) In keeping with our basic device model, we want to include the ability to act on the next request in our request function:

-------------------------------------------------------------------------
drivers/block/elevator.c
186  struct request *elv_next_request(request_queue_t *q)
-------------------------------------------------------------------------

This helper function returns a pointer to the next request structure. By examining the elements, the driver can glean all the information needed to determine the size, direction, and any other custom operations associated with this request.

When the driver finishes this request, it indicates this to the kernel by using the end_request() helper function:

-------------------------------------------------------------------------
drivers/block/ll_rw_blk.c
2599  void end_request(struct request *req, int uptodate)
2600  {
2601   if (!end_that_request_first(req, uptodate, req->hard_cur_sectors)) {
2602    add_disk_randomness(req->rq_disk);
2603    blkdev_dequeue_request(req);
2604    end_that_request_last(req);
2605   }
2606  }
-------------------------------------------------------------------------

Line 2599

Pass in the request acquired from elv_next_request().

Line 2601

end_that_request_first() transfers the proper number of sectors. (If sectors are pending, end_request() simply returns.)

Line 2602

Add to the system entropy pool. The entropy pool is the system method for generating random numbers from a function fast enough to be called at interrupt time. The basic idea is to collect bytes of data from various drivers in the system and generate a random number from them. Chapter 10, "Adding Your Code to the Kernel," discusses this. Another explanation is at the head of the file /drivers/char/random.c.

Line 2603

Remove request structure from the queue.

Line 2604

Collect statistics and make the structure available to be freed.

From this point on, the generic driver services requests until it is released.
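The service loop described above can be sketched in user space. This is a simplified model under stated assumptions, not the kernel implementation: every toy_ name is hypothetical, the "disk" is an in-memory array, and the sketch ignores locking, elevator ordering, and partially completed requests. toy_next_request() plays the role of elv_next_request(), and toy_end_request() plays the role of end_request().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_SECTOR_SIZE 512
#define TOY_NSECTORS    8
#define TOY_QUEUE_LEN   4

enum toy_dir { TOY_READ, TOY_WRITE };

/* Hypothetical, much-reduced request: direction, position, size, data. */
struct toy_request {
    enum toy_dir dir;
    unsigned int sector;
    unsigned int nr_sectors;
    unsigned char *buffer;
};

/* Hypothetical fixed-size FIFO standing in for the request queue. */
struct toy_queue {
    struct toy_request *slot[TOY_QUEUE_LEN];
    unsigned int head, count;
};

static unsigned char toy_disk[TOY_NSECTORS * TOY_SECTOR_SIZE];

/* Filesystem side: put a request on the queue (-1 if the queue is full). */
static int toy_enqueue(struct toy_queue *q, struct toy_request *req)
{
    if (q->count == TOY_QUEUE_LEN)
        return -1;
    q->slot[(q->head + q->count) % TOY_QUEUE_LEN] = req;
    q->count++;
    return 0;
}

/* Peek at the next request without removing it; NULL if the queue is empty. */
static struct toy_request *toy_next_request(struct toy_queue *q)
{
    return q->count ? q->slot[q->head] : NULL;
}

/* Tell the queue that the request at the head has been finished. */
static void toy_end_request(struct toy_queue *q)
{
    assert(q->count > 0);
    q->head = (q->head + 1) % TOY_QUEUE_LEN;
    q->count--;
}

/* Request function: drain the queue; returns how many requests succeeded. */
static int toy_request_fn(struct toy_queue *q)
{
    struct toy_request *req;
    int done = 0;

    while ((req = toy_next_request(q)) != NULL) {
        size_t off = (size_t)req->sector * TOY_SECTOR_SIZE;
        size_t len = (size_t)req->nr_sectors * TOY_SECTOR_SIZE;

        /* Service only requests that lie entirely on the toy disk. */
        if (req->nr_sectors <= TOY_NSECTORS &&
            req->sector <= TOY_NSECTORS - req->nr_sectors) {
            if (req->dir == TOY_WRITE)
                memcpy(toy_disk + off, req->buffer, len);
            else
                memcpy(req->buffer, toy_disk + off, len);
            done++;
        }
        toy_end_request(q);  /* completed or rejected, it leaves the queue */
    }
    return done;
}
```

Queuing a write of one sector followed by a read of the same sector, and then calling toy_request_fn(), leaves the read buffer holding the written data and the queue empty.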

Referring to Figure 5.4, we now have the generic block device layer constructing and maintaining the request queue. The final layer in the block I/O system is the hardware (or specific) device driver. The hardware device driver uses the request queue helper routines from the generic layer to service requests from its registered request queue and send notifications when the request is complete.

The hardware device driver has intimate knowledge of the underlying hardware with regard to register locations, I/O, timing, interrupts, and DMA (discussed in the "Direct Memory Access [DMA]" section of this chapter). The complexities of a complete driver for IDE or SCSI are beyond the scope of this chapter. We offer more on hardware device drivers in Chapter 10 and a series of projects to help you produce a skeleton driver to build on.

5.2.5. Character Device Overview

Unlike the block device, the character device sends a stream of data. All serial devices are character devices. When we use the classic examples of a keyboard controller or a serial terminal as a character stream device, it is intuitively clear we cannot (nor would we want to) access the data from these devices out of order. This introduces the gray area for packetized data transmission. The Ethernet medium at the physical transmission layer is a serial device, but at the bus level, it uses DMA to transfer large chunks of data to and from memory.

As device driver writers, we can make anything happen in the hardware, but real-time practicality is the governing force keeping us from randomly accessing an audio stream or streaming data to our IDE drive. Although both sound like attractive challenges, we still have two simple rules we must follow:

All Linux device I/O is based on files.
All Linux device I/O is either character or block.

The parallel port driver at the end of this chapter is a character device driver. The similarity between character and block drivers is the file I/O-based interface. Externally, both types use file operations such as open, close, read, and write. Internally, the most obvious difference between a character device driver and a block device driver is that the character device does not have the block device system of request queues for read and write operations (as previously discussed). For a non-buffered character device, an interrupt is often asserted for each element (character) received. In contrast, a block device retrieves one or more chunks of data and then asserts an interrupt.
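The difference in interrupt load can be illustrated with a little arithmetic. The sketch below assumes an idealized model that the text only implies: a non-buffered character device raises exactly one interrupt per byte, and a block device raises exactly one interrupt per chunk. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Non-buffered character device model: one interrupt per byte received. */
static size_t toy_char_interrupts(size_t nbytes)
{
    return nbytes;
}

/* Block device model: one interrupt per chunk, counting a partial chunk. */
static size_t toy_block_interrupts(size_t nbytes, size_t chunk)
{
    assert(chunk > 0);  /* a zero-sized chunk is meaningless */
    return nbytes / chunk + (nbytes % chunk != 0);
}
```

Under these assumptions, moving 4,096 bytes costs 4,096 interrupts one character at a time, but only 8 when moved in 512-byte chunks.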

5.2.6. A Note on Network Devices

Network devices have attributes of both block and character devices and are often thought of as a special set of devices. Like a character device, at the physical level, data is transmitted serially. Like a block device, data is packetized and moved to and from the network controller via direct memory access (discussed in the "Direct Memory Access [DMA]" section).

Network devices need to be mentioned as I/O in this chapter, but because of their complexity, they are beyond the scope of this book.

5.2.7. Clock Devices

Clocks are I/O devices that count the hardware heartbeat of the system. Without the concept of elapsed time, Linux would cease to function. Chapter 7, "Scheduling and Kernel Synchronization," covers the system and real-time clocks.

5.2.8. Terminal Devices

The earliest terminals were teletype machines (hence the name tty for the serial port driver). The teletypewriter had been in development since the turn of the twentieth century, driven by the desire to send and read real text over telegraph lines. By the early 1960s, the teletype had matured with the early RS-232 standard, and it seemed to be a match for the growing number of the day's minicomputers. For communicating with computers, the teletype gave way to the terminal of the 1970s. True terminals are becoming a rare breed. Popular with mainframes and minicomputers from the 1970s to the mid-1980s, they have given way to PCs running terminal-emulator software packages. The terminal itself (often called a "dumb" terminal) was simply a monitor and keyboard device that communicated with a mainframe by using serial communications. Unlike the PC, it had only enough "smarts" to send and receive text data.

The main console (configurable at boot time) is the first terminal to come up on a Linux system. Often, a graphical interface is launched, and terminal emulator windows are used thereafter.

5.2.9. Direct Memory Access (DMA)

The DMA controller is a hardware device that is situated between an I/O device and (usually) the high-performance bus in the system. The purpose of the DMA controller is to move large amounts of data without processor intervention. The DMA controller can be thought of as a dedicated processor programmed to move blocks of data to and from main memory. At the register level, the DMA controller takes a source address, a destination address, and a length to complete its task. Then, while the main processor is idle, it can send a burst of data from a device to memory, from memory to memory, or from memory to a device.

Many controllers (disk, network, and graphics) have a DMA engine built in and can therefore transfer large amounts of data without using precious processor cycles.
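The register-level view of a DMA transfer (source, destination, and length) can be modeled in user space. The following is only a sketch: struct toy_dma_regs and both functions are hypothetical, no real controller's register layout is implied, and memmove() stands in for the burst that the hardware would perform on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical register set: where to read, where to write, how much. */
struct toy_dma_regs {
    const void *src;
    void *dst;
    size_t len;
    int busy;  /* nonzero while a programmed transfer is still pending */
};

/* Processor side: program one transfer, then go do other work. */
static void toy_dma_program(struct toy_dma_regs *regs,
                            const void *src, void *dst, size_t len)
{
    assert(!regs->busy);  /* one transfer at a time in this model */
    regs->src = src;
    regs->dst = dst;
    regs->len = len;
    regs->busy = 1;
}

/* Controller side: move the data in one burst; returns bytes moved. */
static size_t toy_dma_run(struct toy_dma_regs *regs)
{
    size_t moved = 0;

    if (regs->busy) {
        memmove(regs->dst, regs->src, regs->len);
        moved = regs->len;
        regs->busy = 0;  /* real hardware would raise an interrupt here */
    }
    return moved;
}
```

The split into two functions mirrors the division of labor: the processor touches only the three registers, and the copy itself happens without it.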