Section 3.5. Block Devices in MINIX 3 | Operating Systems Design and Implementation (3rd Edition)

[Page 261 (continued)]

3.5. Block Devices in MINIX 3

MINIX 3 supports several different block devices, so we will begin by discussing common aspects of all block devices. Then we will discuss the RAM disk, the hard disk, and the floppy disk. Each of these is interesting for a different reason. The RAM disk is a good example to study because it has all the properties of block devices in general except the actual I/Obecause the "disk" is actually just a portion of memory. This simplicity makes it a good place to start. The hard disk shows what a real disk driver looks like. One might expect the floppy disk to be easier to support than the hard disk, but, in fact, it is not. We will not discuss all the details of the floppy disk, but we will point out several of the complications to be found in the floppy disk driver.

[Page 262]

Looking ahead, after the discussion of block drivers, we will discuss the terminal (keyboard + display) driver, which is important on all systems, and, furthermore is a good example of a character device driver.

Each of these sections describes the relevant hardware, the software principles behind the driver, an overview of the implementation, and the code itself. This structure may make the sections useful reading even for readers who are not interested in the details of the code itself.

3.5.1. Overview of Block Device Drivers in MINIX 3

We mentioned earlier that the main procedures of all I/O drivers have a similar structure. MINIX 3 always has at least two block device drivers compiled into the system: the RAM disk driver, and either one of several possible hard disk drivers or a floppy disk driver. Usually, there are three block devicesboth the floppy disk driver and an IDE (Integrated Drive Electronics) hard disk driver are present. The driver for each block device driver is compiled independently, but a common library of source code is shared by all of them.

In older versions of MINIX a separate CD-ROM driver was sometimes present, and could be added if necessary. Separate CD-ROM drivers are now obsolete. They used to be necessary to support the proprietary interfaces of different drive manufacturers, but modern CD-ROM drives are usually connected to the IDE controller, although on notebook computers some CD-ROMs are USB. The full version of the MINIX 3 hard disk device driver includes CD-ROM support, but we have taken the CD-ROM support out of the driver as described in this text and listed in Appendix B.

Each block device driver has to do some initialization, of course. The RAM disk driver has to reserve some memory, the hard disk driver has to determine the parameters of the hard disk hardware, and so on. All of the disk drivers are called individually for hardware-specific initialization. After doing whatever may be necessary, each driver then calls the function containing its main loop. This loop is executed forever; there is no return to the caller. Within the main loop a message is received, a function to perform the operation needed by each message is called, and then a reply message is generated.

The common main loop called by each disk driver process is compiled when drivers/libdriver/driver.c and the other files in its directory are compiled, and then a copy of the object file driver.o is linked into each disk driver's executable file. The technique used is to have each driver pass to the main loop a parameter consisting of a pointer to a table of the addresses of the functions that driver will use for each operation and then to call these functions indirectly.

[Page 263]

If the drivers were compiled together in a single executable file only one copy of the main loop would be needed. This code was, in fact, first written for an earlier version of MINIX in which all the drivers were compiled together. The emphasis in MINIX 3 is on making individual operating system components as independent as possible, but using common source code for separate programs is still a good way to increase reliability. Assuming you get it right once, it will be right for all the drivers. Or, a bug found in one use might very well exist unnoticed in other uses. Thus, shared source code gets tested more thoroughly.

A number of other functions potentially useful to multiple disk drivers are defined in drivers/libdriver/drvlib.c, and linking drvlib.o makes these available. All of the functionality could have been provided in a single file, but not all of it is needed by every disk driver. For instance, the memory driver, which is simpler than other drivers, links in only driver.o. The at_wini driver links in both driver.o and drvlib.o.

Figure 3-19 shows an outline of the main loop, in a form similar to that of Fig. 3-18. Statements like

code = (*entry_points->dev_read)(&mess);

are indirect function calls. A different dev_read function is called by each driver, even though each driver is executing a main loop compiled from the same source file. But some other operations, for example close, are simple enough that more than one device can call the same function.

Figure 3-19. An I/O driver main procedure using indirect calls. (This item is displayed on page 264 in the print version)

message mess;                           /* message buffer*/ void shared_io_driver(struct driver_table *entry_points){ /* initialization is done by each driver before calling this */    while (TRUE) {         receive(ANY, &mess);         caller = mess.source;         switch(mess.type) {             case READ:      rcode = (*entry_points->dev_read)(&mess); break;             case WRITE:     rcode = (*entry_points->dev_write)(&mess); break;             /* Other cases go here, including OPEN, CLOSE, and IOCTL */             default:        rcode = ERROR;         }         mess.type = DRIVER_REPLY;         mess.status = rcode;          /* result code* /         send(caller, &mess);   } }

There are six possible operations that can be requested of any device driver. These correspond to the possible values that can be found in the m.m_type field of the message of Fig. 3-17. They are:

1.	OPEN
2.	CLOSE
3.	READ
4.	WRITE
5.	IOCTL
6.	SCATTERED_IO

Many of these operations are most likely familiar to readers with programming experience. At the device driver level most operations are related to system calls with the same name. For instance, the meanings of READ and WRITE should be fairly clear. For each of these operations, a block of data is transferred from the device to the memory of the process that initiated the call, or vice versa. A READ operation normally does not result in a return to the caller until the data transfer is complete, but an operating system may buffer data transferred during a WRITE for actual transfer to the destination at a later time, and return to the caller immediately. That is fine as far as the caller is concerned; it is then free to reuse the buffer from which the operating system has copied the data to write. OPEN and CLOSE for a device have similar meanings to the way the open and close system calls apply to operations on files: an OPEN operation should verify that the device is accessible, or return an error message if not, and a CLOSE should guarantee that any buffered data that were written by the caller are completely transferred to their final destination on the device.

[Page 264]

The IOCTL operation may not be so familiar. Many I/O devices have operational parameters which occasionally must be examined and perhaps changed. IOCTL operations do this. A familiar example is changing the speed of transmission or the parity of a communications line. For block devices, IOCTL operations are less common. Examining or changing the way a disk device is partitioned is done using an IOCTL operation in MINIX 3 (although it could just as well have been done by reading and writing a block of data).

The SCATTERED_IO operation is no doubt the least familiar of these. Except with exceedingly fast disk devices (for example, the RAM disk), satisfactory disk I/O performance is difficult to obtain if all disk requests are for individual blocks, one at a time. A SCATTERED_IO request allows the file system to make a request to read or write multiple blocks. In the case of a READ operation, the additional blocks may not have been requested by the process on whose behalf the call is made; the operating system attempts to anticipate future requests for data. In such a request not all the transfers requested are necessarily honored by the device driver. The request for each block may be modified by a flag bit that tells the device driver that the request is optional. In effect the file system can say: "It would be nice to have all these data, but I do not really need them all right now." The device can do what is best for it. The floppy disk driver, for instance, will return all the data blocks it can read from a single track, effectively saying, "I will give you these, but it takes too long to move to another track; ask me again later for the rest."

[Page 265]

When data must be written, there is no question of its being optional; every write is mandatory. Nevertheless, the operating system may buffer a number of write requests in the hope that writing multiple blocks can be done more efficiently than handling each request as it comes in. In a SCATTERED_IO request, whether for reading or writing, the list of blocks requested is sorted, and this makes the operation more efficient than handling the requests randomly. In addition, making only one call to the driver to transfer multiple blocks reduces the number of messages sent within MINIX 3.

3.5.2. Common Block Device Driver Software

Definitions that are needed by all of the block device drivers are located in drivers/libdriver/driver.h. The most important thing in this file is the driver structure, on lines 10829 to 10845, which is used by each driver to pass a list of the addresses of the functions it will use to perform each part of its job. Also defined here is the device structure (lines 10856 to 10859) which holds the most important information about partitions, the base address, and the size, in byte units. This format was chosen so no conversions are necessary when working with memorybased devices, maximizing speed of response. With real disks there are so many other factors delaying access that converting to sectors is not a significant inconvenience.

The source of the main loop and common functions of all the block device drivers are in driver.c. After doing whatever hardware-specific initialization may be necessary, each driver calls driver_task, passing a driver structure as the argument to the call. After obtaining the address of a buffer to use for DMA operations the main loop (lines 11071 to 11120) is entered.

In the switch statement in the main loop, the first five message types, DEV_OPEN, DEV_CLOSE, DEV_IOCTL, DEV_CANCEL, and DEV_SELECT result in indirect calls using addresses passed in the driver structure. The DEV_READ and DEV_WRITE messages both result in direct calls to do_rdwt; DEV_GATHER and DEV_SCATTER messages both result in direct calls to do_vrdwt. The driver structure is passed as an argument by all the calls from within the switch, whether direct or indirect, so all called functions can make further use of it as needed. Do_rdwt and do_vrdwt do some preliminary processing, but then they too make indirect calls to device-specific routines.

[Page 266]

The other cases, HARD_INT, SYS_SIG, and SYN_ALARM, respond to notifications. These also result in indirect calls, but upon completion each of these executes a continue statement. This causes control to return to the top of the loop, bypassing the cleanup and reply message steps.

After doing whatever is requested in the message, some sort of cleanup may be necessary, depending upon the nature of the device. For a floppy disk, for instance, this might involve starting a timer to turn off the disk drive motor if another request does not arrive soon. An indirect call is used for this as well. Following the cleanup, a reply message is constructed and sent to the caller (lines 11113 to 11119). It is possible for a routine that services one of the message types to return a EDONTREPLY value to suppress the reply message, but none of the current drivers use this option.

The first thing each driver does after entering the main loop is to make a call to init_buffer (line 11126), which assigns a buffer for use in DMA operations. That this initialization is even necessary at all is due to a quirk of the hardware of the original IBM PC, which requires that the DMA buffer not cross a 64K boundary. That is, a 1-KB DMA buffer may begin at 64510, but not at 64514, because a buffer starting at the latter address extends just beyond the 64K boundary at 65536.

This annoying rule occurs because the IBM PC used an old DMA chip, the Intel 8237A, which contains a 16-bit counter. A bigger counter is needed because DMA uses absolute addresses, not addresses relative to a segment register. On older machines that can address only 1M of memory, the low-order 16 bits of the DMA address are loaded into the 8237A, and the high-order 4 bits are loaded into a 4-bit latch. Newer machines use an 8-bit latch and can address 16M. When the 8237A goes from 0xFFFF to 0x0000, it does not generate a carry into the latch, so the DMA address suddenly jumps down by 64K in memory.

A portable C program cannot specify an absolute memory location for a data structure, so there is no way to prevent the compiler from placing the buffer in an unusable location. The solution is to allocate an array of bytes twice as large as necessary at buffer (line 11044) and to reserve a pointer tmp_buf (line 11045) to use for actually accessing this array. Init_buffer makes a trial setting of tmp_buf pointing to the beginning of buffer, then tests to see if that allows enough space before a 64K boundary is hit. If the trial setting does not provide enough space, tmp_buf is incremented by the number of bytes actually required. Thus some space is always wasted at one end or the other of the space allocated in buffer, but there is never a failure due to the buffer falling on a 64K boundary.

Newer computers of the IBM PC family have better DMA controllers, and this code could be simplified, and a small amount of memory reclaimed, if one could be sure that one's machine were immune to this problem. If you are considering this, however, consider how the bug will manifest itself if you are wrong. If a 1K DMA buffer is desired, the chance is 1 in 64 that there will be a problem on a machine with the old DMA chip. Every time the kernel source code is modified in a way that changes the size of the compiled kernel, there is the same probability that the problem will manifest itself. Most likely, when the failure occurs next month or next year, it will be attributed to the code that was last modified. Unexpected hardware "features" like this can cause weeks of time spent looking for exceedingly obscure bugs (all the more so when, like this one, the technical reference manual says nary a word about them).

[Page 267]

Do_rdwt (line 11148) is the next function in driver.c. It, in turn calls two device-dependent functions pointed to by the dr_prepare and dr_transfer fields in the driver structure. Here and in what follows we will use the C language-like notation (*function_pointer) to indicate we are talking about the function pointed to by function_pointer.

After checking to see that the byte count in the request is positive, do_rdwt calls (*dr_prepare). This operation fills in the base and size of the disk, partition, or subpartition being accessed in a device structure. For the memory driver, which does not support partitions, it just checks that the minor device number is valid. For the hard disk it uses the minor device number to get the size of the partition or subpartition indicated by the minor device number. This should succeed, since (*dr_prepare) can fail only if an invalid device is specified in an open operation. Next, an iovec_t structure (which is defined on lines 2856 to 2859 in include/minix/type.h), iovec1, is filled in. This structure specifies the virtual address and size of the local buffer to or from which data will be copied by the system task. This is the same structure that is used as an element of an array of requests when the call is for multiple blocks. The address of a variable and the address of the first element of an array of the same type of variable can be handled exactly the same way. Then comes another indirect call, this time to (*dr_transfer), which performs the data copy and I/O operations required. The routines that handle transfers all expect to receive an array of requests. In do_rdwt the last argument to the call is 1, specifying an array of one element.

As we will see in the discussion of disk hardware in the next section, responding to disk requests in the order they are received can be inefficient, and this routine allows a particular device to handle requests in the way that is best for the device. The indirection here masks much possible variation in the way individual devices perform. For the RAM disk, dr_transfer points to a routine that makes a kernel call to ask the system task to copy data from one part of physical memory to another, if the minor device being accessed is /dev/ram, /dev/mem, /dev/kmem, /dev/boot, or /dev/zero. (No copying is required to access /dev/null, of course.) For a real disk, the code pointed to by dr_transfer also has to ask the system task for a data transfer. But before the copy operation (for a read) or after it (for a write) a kernel call must also be made to ask the system task to do actual I/O, writing bytes to registers that are part of the disk controller to select the location on the disk and the size and direction of the transfer.

In the transfer routine the iov_size count in the iovec1 structure is modified, returning an error code (a negative number) if there was an error or a positive number indicating the number of bytes transferred. It is not necessarily an error if no bytes are transferred; this indicates that the end of the device has been reached. Upon returning to the main loop, the error code or the byte count is returned in the REP_STATUS field in the reply message from driver_task.

[Page 268]

The next function, do_vrdwt (line 11182), handles scattered I/O requests. A message that requests a scattered I/O request uses the ADDRESS field to point to an array of iovec_t structures, each of which specifies the address of a buffer and the number of bytes to transfer. In MINIX 3 such a request can be made only for contiguous blocks on the disk; the initial offset on the device and whether the operation is a read or a write are in the message. So all the operations in one request will be for either reading or writing, and they will be sorted into block order on the device. On line 11198 a check is done to see if this call is being done on behalf of a kernel-space I/O task; this is a vestige of an early phase of the development of MINIX 3 before all the disk drivers had been rewritten to run in user space.

Fundamentally, the code for this operation is very similar to that for the simple read or write performed by do_rdwt. The same indirect calls to the device-dependent (*dr_prepare) and (*dr_transfer) routines are made. The looping in order to handle multiple requests is all done internal to the function pointed to by (*dr_transfer). The last argument in this case is not 1, it is the size of the array of iovec_t elements. After termination of the loop the array of requests is copied back where it came from. The io_size field of each element in the array will show the number of bytes transferred for that request, and although the total is not passed back directly in the reply message that driver_task constructs, the caller can extract the total from this array.

The next few routines in driver.c are for general support of the above operations. A (*dr_name) call can be used to return the name of a device. For a device with no specific name the no_name function returns the string "noname". Some devices may not require a particular service, for instance, a RAM disk does not require that anything special be done upon a DEV_CLOSE request. The do_nop function fills in here, returning various codes depending upon the kind of request made. Additional functions, nop_signal, nop_alarm, nop_prepare, nop_cleanup, and nop_cancel, are similar dummy routines for devices that do not need these services.

Finally, do_diocntl (line 11216) carries out DEV_IOCTL requests for a block device. It is an error if any DEV_IOCTL operation other than reading (DIOCGETP) or writing (DIOCSETP) partition information is requested. Do_diocntl calls the device's (*dr_prepare) function to verify the device is valid and to get a pointer to the device structure that describes the partition base and size in byte units. On a request to read, it calls the device's (*dr_geometry) function to get the last cylinder, head, and sector information about the partition. In each case a sys_datacopy kernel call is made to request that the system task copy the data between the memory spaces of the driver and the requesting process.

[Page 269]

3.5.3. The Driver Library

The files drvlib.h and drvlib.c contain system-dependent code that supports disk partitions on IBM PC compatible computers.

Partitioning allows a single storage device to be divided up into subdevices. It is most commonly used with hard disks, but MINIX 3 provides support for partitioning floppy disks, as well. Some reasons to partition a disk device are:

1.	Disk capacity is cheaper per unit in large disks. If two or more operating systems with different file systems are used, it is more economical to partition a single large disk than to install multiple smaller disks for each operating system.
2.	Operating systems may have limits to the device size they can handle. The version of MINIX 3 discussed here can handle a 4-GB file system, but older versions are limited to 256 MB. Any disk space beyond that is wasted.
3.	Two or more different file systems may be used by an operating system. For example, a standard file system may be used for ordinary files and a differently structured file system may be used for virtual memory swap space.
4.	It may be convenient to put a portion of a system's files on a separate logical device. Putting the MINIX 3 root file system on a small device makes it easy to back up and facilitates copying it to a RAM disk at boot time.

Support for disk partitions is platform specific. This specificity is not related to the hardware. Partition support is device independent. But if more than one operating system is to run on a particular set of hardware, all must agree on a format for the partition table. On IBM PCs the standard is set by the MS-DOS fdisk command, and other OSs, such as MINIX 3, Windows, and Linux, use this format so they can coexist with MS-DOS. When MINIX 3 is ported to another machine type, it makes sense to use a partition table format compatible with other operating systems used on the new hardware. Thus the MINIX 3 source code to support partitions on IBM computers is put in drvlib.c, rather than being included in driver.c, for two reasons. First, not all disk types support partitions. As noted earlier, the memory driver links to driver.o but does not use the functions compiled into drvlib.o. Second, this makes it easier to port MINIX 3 to different hardware. It is easier to replace one small file than to edit a large one with many sections to be conditionally compiled for different environments.

The basic data structure inherited from the firmware designers is defined in include/ibm/partition.h, which is included by a #include statement in drvlib.h (line 10900). This includes information on the cylinder-head-sector geometry of each partition, as well as codes identifying the type of file system on the partition and an active flag indicating if it is bootable. Most of this information is not needed by MINIX 3 once the file system is verified.

[Page 270]

The partition function (in drvlib.c, line 11426) is called the first time a block device is opened. Its arguments include a driver structure, so it can call device-specific functions, an initial minor device number, and a parameter indicating whether the partitioning style is floppy disk, primary partition, or subpartition. It calls the device-specific (*dr_prepare) function to verify the device is valid and to get the base address and the size into a device structure of the type mentioned in the previous section. Then it calls get_part_table to determine if a partition table is present and, if so, to read it. If there is no partition table, the work is done. Otherwise the minor device number of the first partition is computed, using the rules for numbering minor devices that apply to the style of partitioning specified in the original call. In the case of primary partitions the partition table is sorted so the order of the partitions is consistent with that used by other operating systems.

At this point another call is made to (*dr_prepare), this time using the newly calculated device number of the first partition. If the subdevice is valid, then a loop is made over all the entries in the table, checking that the values read from the table on the device are not out of the range obtained earlier for the base and size of the entire device. If there is a discrepancy, the table in memory is adjusted to conform. This may seem paranoid, but since partition tables may be written by different operating systems, a programmer using another system may have cleverly tried to use the partition table for something unexpected or there could be garbage in the table on disk for some other reason. We put the most trust in the numbers we calculate using MINIX 3. Better safe than sorry.

Still within the loop, for all partitions on the device, if the partition is identified as a MINIX 3 partition, partition is called recursively to gather subpartition information. If a partition is identified as an extended partition, the next function, extpartition, is called instead.

Extpartition (line 11501) has nothing to do with MINIX 3 itself, so we will not discuss details. Some other operating systems (e.g., Windows) use extended partitions. These use linked lists rather than fixed-size arrays to support subpartitions. For simplicity MINIX 3 uses the same mechanism for subpartitions as for primary partitions. However, minimal support for extended partitions is provided to support MINIX 3 commands to read and write files and directories of other operating systems. These operations are easy; providing full support for mounting and otherwise using extended partitions in the same way as primary partitions would be much more complicated.

Get_part_table (line 11549) calls do_rdwt to get the sector on a device (or subdevice) where a partition table is located. The offset argument is zero if it is called to get a primary partition or nonzero for a subpartition. It checks for the magic number (0xaa55) and returns true or false status to indicate whether a valid partition table was found. If a table is found, it copies it to the table address that was passed as an argument.

[Page 271]

Finally, sort (line 11582) sorts the entries in a partition table by lowest sector. Entries that are marked as having no partition are excluded from the sort, so they come at the end, even though they may have a zero value in their low sector field. The sort is a simple bubble sort; there is no need to use a fancy algorithm to sort a list of four items.