7.1 Device Overview

This chapter describes the part of the system that interfaces to the hardware, as shown in the bottom part of Figure 6.1 (on page 216). Historically, the device interface was static and simple. Devices were discovered as the system was booted and did not change thereafter. Filesystems were built in a partition of a single disk. When a disk driver received a request from a filesystem to write a block, it would add the base offset of the partition and do a bounds check based on information from its disk label. It would then do the requested I/O and return the result or error to the filesystem. A typical disk driver could be written in a few hundred lines of code.

As the system has evolved, the complexity of the I/O system has increased with the addition of new functionality. The new functionality can be broken into two categories:

- Disk management
- I/O routing and control

Each of these areas is handled by a new subsystem in FreeBSD.

Disk management consists of organizing the myriad ways that disks can be used to build a filesystem. A disk may be broken up into several slices, each of which can be used to support a different operating system. Each of these slices may be further subdivided into partitions that can be used to support filesystems as they did historically. However, it is also possible to combine several slices and/or partitions to create a virtual partition on which to build a filesystem that spans several disks. The virtual partition may concatenate together several partitions to stripe the filesystem across several disks, thus providing a high-bandwidth filesystem. Or the underlying partitions may be put together in a RAID (Redundant Array of Inexpensive Disks) to provide a higher level of reliability and accessibility than a disk alone. Or the partitions may be organized into two equal-sized groups and mirrored to provide an even higher level of reliability and accessibility than RAID.
The aggregation of physical disk partitions into a virtual partition in these ways is referred to as volume management. Rather than building all this functionality into all the filesystems or disk drivers, it has been abstracted out into the GEOM (geometry) layer. The GEOM layer takes as input the set of disks available on the system and is responsible for doing volume management. At a low level, volume management creates, maintains, and interprets the slice tables and the disk labels defining the partitions within each slice. At a higher level, GEOM combines the physical disk partitions through striping, RAID, or mirroring to create the virtual partitions that are exported to the filesystem layer above. The virtual partition appears to the filesystem as a single large disk. As the filesystem does I/O within the virtual partition, the GEOM layer determines which disk(s) are involved and breaks up and dispatches the I/O request to the appropriate physical drives. The operation of the GEOM layer is described in Section 7.2.

The PC I/O Architecture

Historically, architectures had only one or two I/O busses and types of disk controllers. As described in the next subsection, a modern PC today can have several types of disks connected to the machine through five or more different types of interfaces. The complexity of the disk controllers themselves rivals that of the entire early UNIX operating system. Early controllers could handle only one disk I/O at a time. Today's controllers can typically juggle up to 64 simultaneous requests through a scheme called tagged queueing. A request works its way through the controller, being posted as it is received, scheduled to be done, completed, and reported back to the requester. I/O may also be cached in the controller to allow future requests to be handled more quickly. Another task handled by the controller is replacing a disk sector that has a permanent error with an alternate good sector.
The current PC I/O architecture is shown in Figure 7.1. Far greater detail is available at Arch [2003]. On the left of the figure is the CPU, which has a high-speed interconnect through the northbridge to the system's main memory and the graphics memory that drives the system display. Note that the L1 and L2 caches are not shown in this picture because they are considered part of the CPU. Continuing to the right of the northbridge is a moderate-speed connection to the southbridge. The southbridge connects all the high-throughput I/O busses to the system. These busses include the following:

- The PCI (Peripheral Component Interconnect) bus. The PCI-X and PCI Express busses are modern, backward-compatible implementations of the PCI standard. Most modern I/O cards are designed to plug into some variant of the PCI bus, since this bus architecture provides high throughput and is well designed for fully automated autoconfiguration. This bus also has the advantage of being available on many other computer architectures besides the PC.

- The ATA (Advanced Technology Attachment) bus. The original ATA parallel bus remains as a vestige of earlier PC designs. It supports a maximum of two drives per channel, does not have hot-plug support (changing of drives while the system is powered up and operational), and has a maximum transfer speed of 133 Mbyte per second, compared with the maximum transfer speed of 320 Mbyte per second of the SCSI (Small Computer System Interface) parallel interface. ATA parallel support remains because of the many cheap disk drives that are available to connect to it. The newer Serial ATA adds hot-plug support and transfers data at rates comparable to the SCSI parallel interface.

- The USB (Universal Serial Bus) bus. The USB bus provides a moderate-speed (up to 60 Mbyte per second) interface typically used for video cameras, memory cards, scanners, printers, and some of the newer human-input devices such as keyboards, mice, and joysticks.
- The Firewire (IEEE 1394) bus. Firewire is faster than the USB bus, running at up to 100 Mbyte per second. It is used by memory-card readers, external disks, and some professional digital cameras. Firewire was invented by Apple Computer but is now commonly found on PC-architecture machines.

- CardBus cards, or their slower cousins, PC cards, historically known as PCMCIA (Personal Computer Memory Card International Association) cards. CardBus cards are the ubiquitous devices about the size of a thick credit card that are primarily used in laptop computers to provide Ethernet, modem, FAX, 802.11 wireless access, minidisk, SCSI and ATA disk controllers, and many other functions.

- The PIC (Programmable Interrupt Controller). The PIC maps the device interrupts to IRQ (Interrupt ReQuest) values for the CPU. Most modern machines use an IOAPIC (I/O Advanced Programmable Interrupt Controller) that provides much finer-grain control over the device interrupts. All processors since the Pentium Pro (1995) have had an LAPIC (Local Advanced Programmable Interrupt Controller) that works with the IOAPIC to support distribution of interrupts among the CPUs.

Figure 7.1. The PC I/O architecture. Key: ATA, Advanced Technology Attachment; USB, Universal Serial Bus; PCI, Peripheral Component Interconnect; (A)PIC, (Advanced) Programmable Interrupt Controller.

Also hanging off the southbridge is the Super I/O chip, which provides slow-speed access to many legacy PC interfaces. These interfaces include the following:

- The PS2 keyboard and mouse ports. Newer machines connect the keyboard and mouse through the USB port, but many legacy PS2 keyboards and mice remain.

- Support for the AC97 (Audio CODEC) sound standard. This standard allows a single DSP (Digital Signal Processor) to be used to support both a modem and sound.

- The embedded controller.
The embedded controller is present on many mobile systems and controls various components, including the power and sleep buttons, the back-light intensity of the screen, and status lights. On modern systems, the embedded controller is accessed via the ACPI (Advanced Configuration and Power Interface). The ACPI standard evolves the existing collection of power-management BIOS (Basic Input Output System) code, numerous APM (Advanced Power Management) APIs (Application Programming Interfaces), and PNPBIOS (Plug and Play BIOS) APIs into a well-specified power-management and configuration mechanism. The ACPI specification provides support for an orderly transition from legacy hardware to ACPI hardware and allows for multiple mechanisms to exist in a single machine and be used as needed [ACPI, 2003].

- Support for legacy ISA (Industry Standard Architecture) bus-based peripherals. The ISA bus was the mainstay of the PC for many years, and many embedded systems still use this bus. Because of their small size and formerly high volume, these peripherals tend to be cheap. Today they are relegated to slow functions such as dialup modems, 10-Mbit Ethernet, and 16-bit sound cards.

The Structure of the FreeBSD Mass Storage I/O Subsystem

There were several disk subsystems in early versions of FreeBSD. The first support for ATA and SCSI disks came from Mach 2.5. Both of these were highly device-specific. Efforts to replace both resulted in CAM (Common Access Method), introduced in FreeBSD 3.0, and ATA, introduced in FreeBSD 4.0. As the ATA effort was proceeding, the CAM maintainers attempted to have it become a CAM attachment. However, the strange reservation and locking rules of the ATA register-file model were a poor match for the CAM implementation, so the ATA implementation, with one exception, remained separate. CAM is an ANSI (American National Standards Institute) standard (X3.232-1996).
A revised and enhanced version of CAM was proposed by the X3T10 group but was never approved [ANSI, 2002]. Although primarily used for SCSI, CAM is a way of interfacing HBA (Host Bus Adapter) drivers (in CAM terminology, SIM (Software Interface Module) drivers), midlayer transport glue, and peripheral drivers. CAM seems unlikely ever to be approved as a standard, but it still provides a useful framework for implementing a storage subsystem. The FreeBSD CAM implementation supports SPI (SCSI Parallel Interface), Fibre Channel [ANSI, 2003], UMASS (USB Mass Storage), IEEE 1394 (Firewire), and ATAPI (Advanced Technology Attachment Packet Interface). It has peripheral drivers for disks (da), CD-ROMs (cd), tapes (sa), tape changers (ch), processor devices (pt), and enclosure services (ses). Additionally, there is a target emulator that allows a computer to emulate any of the supported devices and a pass-through interface that allows user applications to send I/O requests to any CAM-controlled peripheral. The operation of the CAM layer is described in Section 7.3.

ATA devices have their own parallel system that is more integrated but less flexible than CAM. It implements its own midlayer transport that supports ATA disks (ad), RAID arrays (ar), CD-ROMs (acd), floppy disks (afd), and tapes (ast). It also supports operation of ATA hardware drivers connected to the CardBus, PCI, and ISA busses. At the time of this writing, ATA has implemented fine-grained locking for multiprocessor support, which is not yet complete for CAM. The operation of the ATA layer is described in Section 7.4.

The structure of the FreeBSD disk I/O subsystem is shown in Figure 7.2. As the figure shows, disk drives may be attached to the system through many busses.

Figure 7.2. The structure of the FreeBSD disk I/O subsystem.

The fastest and most expensive connection is through a fiber-optic link.
Such disk systems are usually used on large servers or when the data must travel farther than just within the case of the computer or to an adjacent rack. The more common fast choice is a controller that plugs into the PCI bus, such as an Adaptec parallel SCSI controller, which can support up to 15 disks and tapes. SCSI disks generally are faster and more reliable under heavy load than the more consumer-desktop-oriented ATA disks. Parallel-interface ATA disks are the cheapest and most ubiquitous. The serial-interface ATA disks have faster access and have capabilities that rival those of their more costly SCSI cousins. CAM support for ATA has begun to appear: CAM can operate ATAPI disk and tape drives. The ATAPI protocol is a subset of the SCSI MMC (Multi Media Command) specification and is used only for CD-ROM, DVD, and tape drives.

Disks may also be connected through the other busses available on the PC architecture, including the CardBus, Firewire, and USB. Usually the disks are connected through an interface that acts as a bridge from that interface to a PCI or ISA bus. The CardBus, Firewire, and USB busses may also support other types of devices that will be directly connected to their device drivers rather than be managed by the CAM layer.

Network device drivers provide another important piece of functionality within the kernel. Rather than discuss them in this chapter, we shall defer describing them until we begin discussing the network stack in Section 12.1.

Autoconfiguration is the procedure carried out by the system to recognize and enable the hardware devices present in a system. Historically, autoconfiguration was done just once when the system was booted. In current machines, particularly portable machines such as laptop computers, devices routinely come and go while the machine is operating.
Thus, the kernel must be prepared to configure, initialize, and make available hardware when it arrives and to drop operations with hardware that has departed. FreeBSD uses a device-driver infrastructure called newbus to manage the devices on the system. Newbus builds a tree rooted at an abstract root0 node that descends in a treelike structure down the various I/O paths and terminates at the various devices connected to the machine. On a uniprocessor system, the root0 node is synonymous with the CPU. On a multiprocessor system, the root0 node is logically connected to each of the CPUs.

At the time of this writing, not all devices have been brought in under newbus. Not yet supported, but soon to be added, are CPUs, timers, and PICs. Most drivers that do not understand newbus are for old legacy devices for which support is likely to be dropped. The most notable example of missing support is for disks, which have problems because they may be multihomed (e.g., accessible through multiple busses). Newbus is currently designed to operate as a tree structure. It will need to be extended to allow a directed acyclic graph structure before it will be able to support multihomed disks. Device autoconfiguration is described in Section 7.5, which gives the details of configuring devices when they appear and cleaning up after them when they disappear.

Device Naming and Access

Historically, FreeBSD has used static device nodes located in /dev to provide access to the hardware devices on the system. This approach has several problems:

- The device nodes are persistent entities in the filesystem and do not necessarily represent the hardware that is really connected to and available on the machine.

- When new hardware is added to the kernel, the system administrator needs to create new device nodes to access the hardware. If the hardware is later removed, the device nodes remain even though they are no longer usable.
- Device nodes require coordination of the major and minor numbering scheme between the device-driver tables in the kernel and the shell scripts that create them.

FreeBSD 5.2 replaces the static /dev directory with a new DEVFS filesystem that is mounted on /dev when the kernel is booted. As devices are discovered, either at boot or while the system is running, their names appear in the /dev filesystem. When a device disappears or becomes unavailable, its entries in /dev disappear. DEVFS has several benefits over the old static /dev directory:

- Only devices that are currently available appear in /dev.

- Adding a device to the system causes its device nodes to appear in /dev, obviating the need for a system administrator to create new device nodes.

- It is no longer necessary to coordinate device major and minor numbers between the kernel and device-creation scripts or filesystem device nodes.

One benefit of the old static /dev was that device nodes could be given nonstandard names, access permissions, owners, or groups. To provide the same flexibility, DEVFS has a rule-set mechanism that allows these changes to be automated in the new /dev implementation. These rule sets can be put in place when the system is booted and can be created or modified at any time that the system is running. Each rule provides a pattern to identify the device nodes to be affected. For each matched device node, it specifies one or more actions that should be taken. Actions include creating a symbolic link to provide a nonstandard name as well as setting nonstandard permissions, owner, or group. The rule sets are checked and applied whenever a device node is created or destroyed. They may also be checked and applied when explicitly requested to do so by the system administrator, either manually or through a system-initiated script.
Zero or more dev_t entries (major and minor numbers) in /dev may be created by the device drivers each time that a device_t is created as part of the autoconfiguration process. Most device drivers create a single /dev entry, but network device drivers do not create any entries, whereas disk devices may create dozens. Additional entries may appear in /dev as the result of cloning devices. For example, a cloning device such as a pseudo-terminal creates a new device each time that it is opened.