7.2. The GEOM Layer

The GEOM layer provides a modular transformation framework for disk-I/O requests. This framework supports an infrastructure in which classes can do nearly arbitrary transformations on disk-I/O requests on their path from the upper kernel to the device drivers and back. GEOM can support both automatic data-directed configuration and manually or script-directed configuration. Transformations in GEOM include the following:

  • Simple base and bounds calculations needed for disk partitioning (a sketch of this transformation follows the list)

  • Aggregation of disks to provide a RAID, mirrored, or striped logical volume

  • A cryptographically protected logical volume

  • Collection of I/O statistics

  • I/O optimization such as disk sorting
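The first transformation in this list is simple enough to sketch in full. The function below is a hypothetical illustration, written against the bio structure that FreeBSD uses to describe disk-I/O requests (only the bio_offset and bio_length fields are assumed): it rejects requests that fall outside a partition and rebases in-range offsets onto the underlying disk.

    #include <sys/param.h>
    #include <sys/bio.h>        /* struct bio: bio_offset, bio_length */
    #include <sys/errno.h>

    /*
     * Hypothetical base-and-bounds translation for one partition.
     * "base" is the partition's starting offset on the underlying
     * provider; "size" is the partition's length in bytes.
     */
    static int
    slice_translate(off_t base, off_t size, struct bio *bp)
    {
            if (bp->bio_offset < 0 || bp->bio_offset > size ||
                bp->bio_length > size - bp->bio_offset)
                    return (EIO);       /* falls outside the slice */
            bp->bio_offset += base;     /* rebase onto the underlying disk */
            return (0);
    }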

Unlike many of its predecessors, GEOM is both extensible and topologically agnostic.

Terminology and Topology Rules

GEOM is object oriented and consequently borrows much of its context and semantics from object-oriented terminology. A transformation is the concept of a particular way to modify I/O requests. Examples include partitioning a disk, mirroring two or more disks, and operating several disks together in a RAID.

A class implements one particular transformation. Examples of classes are a master boot record (MBR) disk partition, a BSD disk label, and a RAID array.

An instance of a class is called a geom. In a typical FreeBSD system, there will be one geom of class MBR for each disk. The MBR subdivides a disk into up to four pieces. There will also be one geom of class BSD for each slice with a BSD disk label.

A provider is the front gate at which a geom offers service. A typical provider is a logical disk, for example /dev/da0s1. All providers have three main properties: name, media size, and sector size.

A consumer is the backdoor through which a geom connects to another geom provider and through which I/O requests are sent. For example, an MBR label will typically be a consumer of a disk and a provider of disk slices.
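These four entities and their linkage can be made concrete with a few structure declarations. The sketch below is deliberately simplified and only loosely modeled on FreeBSD's sys/geom/geom.h; the real structures carry many more fields and use the queue(3) list macros rather than bare next pointers.

    struct g_class {                         /* one per transformation */
            const char         *name;        /* e.g., "MBR", "BSD", "MIRROR" */
    };

    struct g_geom {                          /* an instance of a class */
            struct g_class     *class;
            int                 rank;        /* position in the DAG */
            struct g_consumer  *consumers;   /* attachments to geoms below */
            struct g_provider  *providers;   /* services offered to geoms above */
    };

    struct g_provider {                      /* front gate: offers service */
            const char         *name;        /* e.g., "da0s1" */
            off_t               mediasize;   /* size of the media in bytes */
            u_int               sectorsize;  /* native block size in bytes */
            struct g_geom      *geom;        /* the geom providing the service */
            struct g_provider  *next;        /* next provider of the same geom */
    };

    struct g_consumer {                      /* backdoor: uses a provider */
            struct g_geom      *geom;        /* the geom it belongs to */
            struct g_provider  *provider;    /* the provider it is attached to */
            struct g_consumer  *next;        /* next consumer of the same geom */
    };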

The topological relationships between these entities are as follows:

  • A class has zero or more geom instances.

  • A geom is derived from exactly one class.

  • A geom has zero or more consumers.

  • A geom has zero or more providers.

  • A consumer can be attached to only one provider.

  • A provider can have multiple consumers attached.

  • The GEOM structure may not have loops; it must be an acyclic directed graph.

All geoms have a rank number assigned that is used to detect and prevent loops in the acyclic directed graph. This rank number is assigned as follows:

  • A geom with no attached consumers has a rank of one.

  • A geom with attached consumers has a rank one higher than the highest rank of the geoms of the providers to which its consumers are attached.
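With the simplified structures shown earlier, these two rules reduce to a few lines. The helper below is hypothetical; in FreeBSD the equivalent work is done when a consumer attaches (g_attach() and its helpers in sys/geom/geom_subr.c re-rank the affected part of the graph and refuse any attachment that cannot be ordered, which is how loops are prevented).

    /*
     * Compute a geom's rank: 1 if it has no attached consumers,
     * otherwise one more than the highest-ranked geom that any of
     * its consumers is attached to.
     */
    static int
    geom_rank(const struct g_geom *gp)
    {
            const struct g_consumer *cp;
            int rank = 1;

            for (cp = gp->consumers; cp != NULL; cp = cp->next) {
                    if (cp->provider != NULL &&
                        cp->provider->geom->rank >= rank)
                            rank = cp->provider->geom->rank + 1;
            }
            return (rank);
    }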

Figure 7.3 shows a sample GEOM configuration. At the bottom is a geom that communicates with the CAM layer and produces the da0 disk. It has two consumers. On the right is the DEVFS filesystem that exports the complete disk image as /dev/da0. On the left is stacked an MBR geom that interprets the MBR label found in the first sector of the disk to produce the two slices da0s1 and da0s2. Both of these slices have DEVFS consumers that export them as /dev/da0s1 and /dev/da0s2. The first of these two slices has a second consumer, a BSD label geom, that interprets the BSD label found near the beginning of the slice. The BSD label subdivides the slice into up to eight (possibly overlapping) partitions da0s1a through da0s1h. All the defined partitions have DEVFS consumers that export them as /dev/da0s1a through /dev/da0s1h. When one of these partitions is mounted, the filesystem that has mounted it also becomes a consumer of that partition.

Figure 7.3. A sample GEOM configuration.


Changing Topology

The basic operations are attach, which attaches a consumer to a provider, and detach, which breaks the bond. Several more complex operations are available to simplify automatic configuration.

Tasting is a process that happens whenever a new class or new provider is created. It gives the class a chance to automatically configure an instance on providers that it recognizes as its own. A typical example is the MBR disk-partition class, which will look for the MBR label in the first sector and, if one is found and valid, will instantiate a geom to multiplex according to the contents of the MBR. Exactly what a class does to recognize whether it should accept the offered provider is not defined by GEOM, but the sensible options are:

  • Examine specific data structures on the disk.

  • Examine properties like sector size or media size for the provider.

  • Examine the rank number of the provider's geom.

  • Examine the method name of the provider's geom.

A new class will be offered to all existing providers and a new provider will be offered to all classes.
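The skeleton of a taste method looks like the following. This is a sketch against the FreeBSD GEOM KPI (g_new_geomf(), g_new_consumer(), g_attach(), g_access(), g_read_data(), and their tear-down counterparts); the metadata check is a hypothetical stand-in, and a real class would go on to create one provider per subdivision it finds.

    static struct g_geom *
    g_example_taste(struct g_class *mp, struct g_provider *pp, int flags)
    {
            struct g_geom *gp;
            struct g_consumer *cp;
            u_char *buf;
            int error;

            g_topology_assert();

            /* Build a geom with one consumer and attach it to the provider. */
            gp = g_new_geomf(mp, "example/%s", pp->name);
            cp = g_new_consumer(gp);
            error = g_attach(cp, pp);
            if (error == 0)
                    error = g_access(cp, 1, 0, 0);  /* open for reading */
            if (error != 0)
                    goto fail;

            /* Read the first sector; drop the topology lock while sleeping. */
            g_topology_unlock();
            buf = g_read_data(cp, 0, pp->sectorsize, &error);
            g_topology_lock();
            g_access(cp, -1, 0, 0);
            if (buf == NULL)
                    goto fail;

            error = example_parse_metadata(gp, buf);  /* hypothetical: creates
                                                         providers on success */
            g_free(buf);
            if (error == 0)
                    return (gp);            /* we claim this provider */
    fail:
            /* Not ours: dismantle everything we created. */
            if (cp->provider != NULL)
                    g_detach(cp);
            g_destroy_consumer(cp);
            g_destroy_geom(gp);
            return (NULL);
    }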

Configure is the process in which the administrator issues instructions for a particular class to instantiate itself. There are multiple ways to express intent. For example, a BSD label module can be specified with a level of override forcing a BSD disk-label geom to attach to a provider that was not found palatable during the taste operation. A configure operation is typically needed when first labeling a disk.

Orphaning is the process by which a provider is removed while it is potentially still in use. When a geom orphans a provider, all future I/O requests will bounce on the provider with an error code set by the geom. All consumers attached to the provider will receive notification of the orphaning and are expected to act appropriately. A geom that came into existence as a result of a normal taste operation should self-destruct unless it has a way to keep functioning without the orphaned provider. Single-point-of-operation geoms, such as those interpreting a disk label, should self-destruct. Geoms with redundant points of operation, such as those supporting a RAID or a mirror, will be able to continue as long as they do not lose quorum.

An orphaned provider may not result in an immediate change in the topology. Any attached consumers are still attached. Any opened paths are still open. Any outstanding I/O requests are still outstanding. A typical scenario is as follows:

  • A device driver detects a disk has departed and orphans the provider for it.

  • The geoms on top of the disk receive the orphaning event and orphan all their providers. Providers that are not in use will typically self-destruct immediately.

    This process continues in a recursive fashion until all relevant pieces of the tree have responded to the event.

  • Eventually the traversal stops when it reaches the device geom at the top of the tree. The geom will refuse to accept any new requests by returning an error. It will sleep until all outstanding I/O requests have been returned (usually as errors). It will then explicitly close, detach, and destroy its geom.

  • When all the geoms above the provider have disappeared, the provider will detach and destroy its geom. This process percolates all the way down through the tree until the cleanup is complete.
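For a geom whose existence depends entirely on its one provider, the appropriate response to orphaning is self-destruction, and the resulting method is tiny. The sketch below follows the pattern used by several in-tree classes, assuming the FreeBSD KPI's g_wither_geom(), which marks the geom and its providers for removal as soon as the last outstanding request drains.

    /*
     * Typical orphan method for a single-provider geom such as a
     * disk-label interpreter: give up and dismantle ourselves.
     */
    static void
    g_example_orphan(struct g_consumer *cp)
    {
            g_topology_assert();
            g_wither_geom(cp->geom, ENXIO);
    }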

While this approach seems byzantine, it provides maximum flexibility and robustness in handling disappearing devices. Ensuring that the tree does not unravel until all the outstanding I/O requests have returned guarantees that no application will be left hanging because a piece of hardware has disappeared.

Spoiling is a special case of orphaning used to protect against stale metadata. It is probably easiest to understand spoiling by going through an example. Consider the configuration shown in Figure 7.3: disk da0 sits below an MBR geom that provides da0s1 and da0s2. On top of da0s1, a BSD geom provides da0s1a through da0s1h. Both the MBR and BSD geoms have autoconfigured based on data structures on the disk media. Now consider the case where da0 is opened for writing and the MBR is modified or overwritten. The MBR geom now would be operating on stale metadata unless some notification system can inform it otherwise. To avoid stale metadata, the opening of da0 for writing causes all attached consumers to be notified, resulting in the eventual self-destruction of the MBR and BSD geoms. When da0 is closed, it will be offered for tasting again, and if the data structures for MBR and BSD are still there, new geoms will instantiate themselves.

To avoid the havoc of changing a disk label for an active filesystem, changing the size of open geoms can be done only with their cooperation. If any of the paths through the MBR or BSD geoms were open (for example, as a mounted filesystem), they would have propagated an exclusive-open flag downward, rendering it impossible to open da0 for writing. Conversely, the exclusive-open flag requested when opening da0 to rewrite the MBR would render it impossible to open a path through the MBR geom until da0 is closed. Spoiling only happens when the write count goes from zero to non-zero, and the tasting only happens when the write count goes from non-zero to zero.
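The transition rule in the last sentence can be expressed as simple bookkeeping on a provider's writer count. The helper below is hypothetical and tracks only writers (FreeBSD's g_access() in the GEOM core also tracks read and exclusive counts, and the KPI provides g_spoil() to notify every attached consumer except the writer itself).

    /*
     * Hypothetical writer-count bookkeeping for one provider.  "acw" is
     * the provider's current count of writing consumers; "delta" is +1
     * for an open-for-write and -1 for the matching close.
     */
    static void
    provider_write_access(struct g_provider *pp, int acw, int delta)
    {
            if (acw == 0 && delta > 0)
                    spoil_consumers(pp);     /* zero to nonzero: spoil */
            else if (acw > 0 && acw + delta == 0)
                    offer_for_tasting(pp);   /* nonzero to zero: retaste */
    }

Both spoil_consumers() and offer_for_tasting() are stand-ins for the corresponding GEOM events.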

Insert is an operation that allows a new geom to be instantiated between an existing consumer and provider. Delete is an operation that allows a geom to be removed from between an existing consumer and provider. These capabilities can be used to move an active filesystem. For example, we could insert a mirror module into the GEOM stack pictured in Figure 7.3, as shown in Figure 7.4. The mirror operates on da0s1 and da1s1 between the BSD label consumer and its MBR label provider da0s1. The mirror is initially configured with da0s1 as its only copy and consequently is transparent to the I/O requests on the path. Next, we ask it to mirror da0s1 to da1s1. When the mirror copy is complete, we drop the mirror copy on da0s1. Finally, we delete the mirror geom from the path, instructing the BSD label consumer to consume from da1s1. The result is that we have moved a mounted filesystem from one disk to another while it was in use.

Figure 7.4. Using a mirror module to copy an active filesystem.


Operation

The GEOM system needs to be able to operate in a multiprocessor kernel. The usual method for ensuring proper operation is to use mutex locks on all the data structures. Because of the large size and complexity of the code and data structures implementing the GEOM classes, GEOM uses a single-threading approach rather than traditional mutex locking to ensure data-structure consistency. GEOM uses two threads to operate its stack: a g_down thread to process requests moving from the consumers at the top to the providers at the bottom and a g_up thread to process results moving from the providers at the bottom back to the consumers at the top.

Requests entering the GEOM layer at the top are queued awaiting the g_down thread. The g_down thread pops each request from its queue, moves it down through the stack, and sends it out through the provider at the bottom. Similarly, results coming back from the providers are queued awaiting the g_up thread. The g_up thread pops each result from its queue, moves it up through the stack, and sends it back to the consumer at the top.
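The discipline becomes clearer with a conceptual rendering of one of the two threads. The loop below is a hypothetical simplification of the g_down side (the production loop, g_io_schedule_down() in FreeBSD's sys/geom/geom_io.c, also handles sorting and error delivery); the g_up side is symmetrical, running completion routines instead of start methods.

    /*
     * Hypothetical skeleton of the g_down thread.  The only place it may
     * sleep is on its own empty queue; the start methods it calls must
     * neither sleep nor compute at length.
     */
    static void
    g_down_loop(void)
    {
            struct bio *bp;

            for (;;) {
                    bp = dequeue_or_sleep(&down_queue);  /* hypothetical */
                    bp->bio_to->geom->start(bp);  /* destination provider's geom */
            }
    }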

Because there is only ever a single thread running up and down in the stack, the only locking that is needed is on the few data structures that coordinate between the upward and downward paths. There are two rules required to make this single-thread method work effectively:

  1. A geom may never sleep. If a geom ever puts the g_up or g_down thread to sleep, the entire I/O system will grind to a halt until the geom reawakens. The GEOM framework checks that its worker threads never sleep, panicking if they attempt to do so.

  2. No geom can compute excessively. If a geom computes excessively, pending requests or results may be unacceptably delayed. There are some geoms, such as the one that provides cryptographic protection for filesystems, that are compute intensive. These compute-intensive geoms must provide their own threads. When the g_up or g_down thread enters a compute-intensive geom, it will simply enqueue the request, schedule the geom's own worker thread, and proceed on to process the next request in its queue. When scheduled, the compute-intensive geom's thread will do the needed work and then enqueue the result for the g_up or g_down thread to finish pushing the request through the stack.
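The hand-off described in rule 2 can be sketched as a pair of routines. Apart from g_io_request() and the standard mutex and sleep primitives, every name here is hypothetical; the pattern resembles what the GEOM disk-encryption classes do with their own queues and worker threads. A production class would also clone the bio before modifying it and deliver the original when the work completes.

    /* Start method, run by the g_down thread: queue the request and return. */
    static void
    g_crypt_start(struct bio *bp)
    {
            mtx_lock(&crypt_mtx);
            enqueue(&crypt_queue, bp);        /* hypothetical queue helper */
            mtx_unlock(&crypt_mtx);
            wakeup(&crypt_queue);             /* rouse the class's own thread */
    }

    /* The class's private worker thread: free to sleep and to compute. */
    static void
    g_crypt_worker(void *arg)
    {
            struct bio *bp;

            for (;;) {
                    mtx_lock(&crypt_mtx);
                    while ((bp = dequeue(&crypt_queue)) == NULL)
                            msleep(&crypt_queue, &crypt_mtx, 0, "gcrypt", 0);
                    mtx_unlock(&crypt_mtx);
                    encrypt_payload(bp);              /* the expensive part */
                    g_io_request(bp, crypt_consumer); /* back into the stack */
            }
    }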

The set of commands that may be passed through the GEOM stack are read, write, and delete. The read and write commands have the expected semantics. Delete specifies that a certain range of data is no longer used and can be erased or freed. Technologies such as flash-adaptation layers can arrange to erase the relevant blocks before they are reassigned, and cryptographic devices may fill the range with random bits to reduce the amount of data available for attack. A delete request carries no assurance that the data really will be erased or made unavailable unless that is guaranteed by specific geoms in the graph. If a secure-delete semantic is required, a geom that converts delete requests into a sequence of write requests should be pushed onto the stack, as sketched below.
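A secure-delete geom of that kind could have a start method shaped like this sketch. It assumes the FreeBSD KPI's g_clone_bio(), g_std_done(), and g_io_request(); the overwrite-pattern buffer and the consumer variable are hypothetical, and a production version would overwrite in sector-sized chunks, possibly several times.

    static void
    g_shred_start(struct bio *bp)
    {
            struct bio *cbp;

            cbp = g_clone_bio(bp);        /* child request for the geom below */
            cbp->bio_done = g_std_done;   /* completion routes back to bp */
            if (bp->bio_cmd == BIO_DELETE) {
                    /* Rewrite the doomed range rather than passing it down. */
                    cbp->bio_cmd = BIO_WRITE;
                    cbp->bio_data = shred_pattern;   /* hypothetical buffer */
            }
            g_io_request(cbp, shred_consumer);       /* hypothetical consumer */
    }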

Topological Flexibility

GEOM is both extensible and topologically agnostic. The extensibility of GEOM makes it easy to write a new class of transformation. For example, if you needed to mount an IBM MVS disk, a class recognizing and configuring its Volume Table of Contents information would be trivial.

In a departure from many previous volume managers, GEOM is topologically agnostic. Most volume-management implementations have strict notions of how classes can fit together, often providing only a single fixed hierarchy. Figure 7.5 shows a typical hierarchy. It requires that the disks first be divided into partitions and that the partitions then be grouped into mirrors, which are exported as volumes.

Figure 7.5. Fixed class hierarchy.


With fixed hierarchies, it is impossible to express intent efficiently. In the fixed hierarchy of Figure 7.5, it is impossible to mirror two physical disks and then partition the mirror into slices, as is done in Figure 7.6. Instead, one is forced to make slices on the physical volumes and then create mirrors for each of the corresponding slices, resulting in a more complex configuration. Being topologically agnostic means that new orderings of classes are treated no differently from existing orderings. GEOM does not care in which order things are done. The only restriction is that cycles in the graph are not allowed.

Figure 7.6. Flexible class hierarchy.


