Section 11.3. Memory Management | The Design and Implementation of the FreeBSD Operating System

11.3. Memory Management

The requirements placed on a memory-management scheme by interprocess-communication and network protocols tend to be substantially different from those of other parts of the operating system. Although all require the efficient allocation and reclamation of memory, communication protocols in particular need memory in widely varying sizes. Memory is needed for variable-sized structures such as communication protocol packets. Protocol implementations must frequently prepend headers or remove headers from packetized data. As packets are sent and received, buffered data may need to be divided into packets, and received packets may be combined into a single record. In addition, packets and other data objects must be queued when awaiting transmission or reception. A special-purpose memory-management facility exists for use by the interprocess-communication and networking systems to address these needs.

Mbufs

The memory-management facilities revolve around a data structure called an mbuf (see Figure 11.3). Mbufs, or memory buffers, vary in size depending on what they contain. All mbufs contain a fixed m_hdr structure that keeps track of various bits of bookkeeping about the mbuf. An mbuf that contains only data has space for 234 bytes (256 bytes total for the mbuf minus 22 bytes for the mbuf header). All structure sizes are calculated for 32-bit processors.

Figure 11.3. Memory-buffer (mbuf) data structure.

For large messages, the system can associate larger sections of data with an mbuf by referencing an external mbuf cluster from a private virtual memory area. The size of an mbuf cluster may vary by architecture, and is specified by the macro MCLBYTES (which is 2 Kbytes on the PC).

Data are stored either in the internal data area or in an external cluster, but never in both. A data pointer within the mbuf is used to access data in either location. In addition to the data-pointer field, a length field is maintained; it gives the number of bytes of valid data found at the data-pointer location. The data and length fields allow routines to trim data efficiently at the start or end of an mbuf. To delete data from the start of an mbuf, the pointer is incremented and the length is decremented. To delete data from the end of an mbuf, the length is decremented, but the data pointer is left unchanged. When space is available within an mbuf, data can be added at either end. This flexibility to add and delete space without copying is particularly useful in communication-protocol implementation. Protocols routinely strip protocol information off the front or back of a message before the message's contents are handed to a higher-level processing module, or they add protocol information as a message is passed to lower levels.
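The pointer-and-length arithmetic described above can be sketched in a few lines of C. This is only an illustration: struct sk_buf, its 64-byte storage area, and the two functions are hypothetical names invented here, not the kernel's struct mbuf or its routines.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified buffer; not the real struct mbuf layout. */
struct sk_buf {
    char  *data;         /* start of the valid data */
    size_t len;          /* number of valid bytes at data */
    char   storage[64];  /* internal data area (size chosen for the sketch) */
};

/* Delete n bytes from the front: advance the pointer, shrink the length. */
void sk_trim_front(struct sk_buf *b, size_t n)
{
    assert(n <= b->len);
    b->data += n;
    b->len  -= n;
}

/* Delete n bytes from the end: shrink the length, leave the pointer alone. */
void sk_trim_back(struct sk_buf *b, size_t n)
{
    assert(n <= b->len);
    b->len -= n;
}
```

In neither case are any data bytes moved, which is the property the protocols rely on when they strip headers or trailers.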

Multiple mbufs can be linked to hold an arbitrary quantity of data. This linkage is done with the m_next field of the mbuf. By convention, a chain of mbufs linked in this way is treated as a single object. For example, the communication protocols build packets from chains of mbufs. A second field, m_nextpkt, links objects built from chains of mbufs into lists of objects. Throughout our discussions, a collection of mbufs linked together with the m_next field will be called a chain; chains of mbufs linked together with the m_nextpkt field will be called a queue.
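The distinction between a chain and a queue can be illustrated with a minimal sketch. The field names m_next, m_nextpkt, and m_len follow the text; struct sk_mbuf and the two helper functions are hypothetical and carry only the fields needed for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down mbuf carrying only the linkage fields. */
struct sk_mbuf {
    struct sk_mbuf *m_next;     /* next mbuf in the same chain (same object) */
    struct sk_mbuf *m_nextpkt;  /* next chain in the queue (next object) */
    size_t          m_len;      /* valid bytes held by this mbuf */
};

/* Total bytes in one chain: follow m_next only. */
size_t sk_chain_length(const struct sk_mbuf *m)
{
    size_t total = 0;

    for (; m != NULL; m = m->m_next)
        total += m->m_len;
    return total;
}

/* Number of objects on a queue: follow m_nextpkt from each chain head. */
size_t sk_queue_count(const struct sk_mbuf *head)
{
    size_t n = 0;

    for (; head != NULL; head = head->m_nextpkt)
        n++;
    return n;
}
```

Note that m_nextpkt is consulted only on the first mbuf of each chain in this sketch, which matches the convention that a chain is treated as a single object.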

Each mbuf is typed according to its use. This type serves two purposes. The only operational use of the type is to distinguish optional components of a message in an mbuf chain that is queued for reception on a socket data queue. Otherwise, the type information is used in maintaining statistics about storage use and, if there are problems, as an aid in tracking mbufs.

The mbuf flags are logically divided into two sets: flags that describe the usage of an individual mbuf and those that describe an object stored in an mbuf chain. The flags describing an mbuf specify whether the mbuf references external storage (M_EXT), whether the second set of header fields is present (M_PKTHDR), and whether the mbuf completes a record (M_EOR). A packet normally would be stored in an mbuf chain (of one or more mbufs) with the M_PKTHDR flag set on the first mbuf of the chain. The mbuf flags describing the packet would be set in the first mbuf and could include either the broadcast flag (M_BCAST) or the multicast flag (M_MCAST). The latter flags specify that a transmitted packet should be sent as a broadcast or multicast, respectively, or that a received packet was sent in that manner.
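As a rough sketch of how the two sets of flags might be tested, consider the following. The flag names mirror those in the text with an SK_ prefix; the bit values and the helper functions are assumptions made for this illustration and should not be read as the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values are assumed for the sketch. */
enum {
    SK_M_EXT    = 0x0001,  /* mbuf references external storage */
    SK_M_PKTHDR = 0x0002,  /* second set of header fields is present */
    SK_M_EOR    = 0x0004,  /* mbuf completes a record */
    SK_M_BCAST  = 0x0100,  /* packet sent or received as a broadcast */
    SK_M_MCAST  = 0x0200   /* packet sent or received as a multicast */
};

/* True if this mbuf is the first mbuf of a packet. */
bool sk_is_packet_head(unsigned flags)
{
    return (flags & SK_M_PKTHDR) != 0;
}

/* Packet-level flags are honored only on the first mbuf of the chain. */
bool sk_is_link_level_cast(unsigned flags)
{
    return sk_is_packet_head(flags) &&
        (flags & (SK_M_BCAST | SK_M_MCAST)) != 0;
}

/* Mark a packet as a broadcast; the caller must pass the head's flags. */
unsigned sk_mark_broadcast(unsigned flags)
{
    assert(flags & SK_M_PKTHDR);
    return flags | SK_M_BCAST;
}
```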

If the M_PKTHDR flag is set on an mbuf, the mbuf has a second set of header fields immediately following the standard header. This addition causes the mbuf data area to shrink from 234 bytes to 210 bytes. The packet header, which is shown in Figure 11.4, is only used on the first mbuf of a chain. It includes several fields: a pointer to the interface on which the packet was received, the total length of the packet, a pointer to the packet header, two fields relating to packet checksum calculation, and a pointer to a list of arbitrary tags.

Figure 11.4. Memory-buffer (mbuf) data structure with M_PKTHDR.

An mbuf that uses external storage is marked with the M_EXT flag. Here, a different header area overlays the internal data area of an mbuf. The fields in this header, which is shown in Figure 11.5, describe the external storage, including the start of the buffer and its size. One field is designated to point to a routine to free the buffer, in theory allowing various types of buffers to be mapped by mbufs. In the current implementation, however, the free function is not used, and the external storage is assumed to be a standard mbuf cluster. An mbuf may be both a packet header and have external storage, in which case the standard mbuf header is followed by the packet header and then the external storage header.

Figure 11.5. Memory-buffer (mbuf) data structure with external storage.

The ability to refer to mbuf clusters from an mbuf permits data to be referenced by different entities within the network code without a memory-to-memory copy operation. When multiple copies of a block of data are required, the same mbuf cluster is referenced from multiple mbufs. A single, global array of reference counts is maintained for the mbuf clusters to support this style of sharing (see the next subsection).

Mbufs have fixed-sized, rather than variable-sized, data areas for several reasons. First, the fixed size minimizes memory fragmentation. Second, communication protocols are frequently required to prepend or append headers to existing data areas, split data areas, or trim data from the beginning or end of a data area. The mbuf facilities are designed to handle such changes without reallocation or copying whenever possible. Finally, the dtom() function, described in the subsection on mbuf utility routines later in this section, would be much more expensive if mbufs were not fixed in size.

Since the mbuf is the central object of all of the networking subsystems, it has undergone modification with each large change in the code. It now contains a flags field and two optional sets of header fields. The data pointer replaces a field used as an offset in the initial version of the mbuf. The use of an offset was not portable when the data referenced could be in an mbuf cluster. The addition of a flags field allowed the use of a flag indicating external storage. Earlier versions tested the magnitude of the offset to see whether the data were in the internal mbuf data area. The addition of the broadcast flag allowed network-level protocols to know whether packets were received as link-level broadcasts, as was required for standards conformance. Several other flags have been added for use by specific protocols as well as to handle fragment processing.

The optional header fields have undergone the largest changes since 4.4BSD. The two headers were originally designed to avoid redundant calculations of the size of an object, to make it easier to identify the incoming network interface of a received packet, and to generalize the use of external storage by an mbuf. Since 4.4BSD the packet header has been expanded to include information on checksum calculation (a traditionally expensive operation that can now be done in hardware) and an arbitrary set of tags.

Tags are fixed-size structures that can point to arbitrary pieces of memory and are used to store information relevant to different modules within the networking subsystem. Each tag has a link to the next tag in the list, a 16-bit ID, a 16-bit length, and a 32-bit cookie. The cookie is used to identify which module owns the tag, and the ID is a type value that is private to that module. Tags are used to carry information about a packet that should not be placed into the packet itself. Examples of these tags are given in Section 13.10.
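A tag list of this shape, and a lookup keyed on the cookie and ID, might look like the sketch below. The field widths follow the description above; struct sk_tag, its field names, and sk_tag_find() are hypothetical and are not the kernel's tag interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag header; widths follow the text, names are invented. */
struct sk_tag {
    struct sk_tag *next;    /* link to the next tag in the list */
    uint16_t       id;      /* type value private to the owning module */
    uint16_t       len;     /* length of the data the tag describes */
    uint32_t       cookie;  /* identifies the module that owns the tag */
};

/* Return the first tag matching both cookie and ID, or NULL if none. */
struct sk_tag *sk_tag_find(struct sk_tag *head, uint32_t cookie, uint16_t id)
{
    for (struct sk_tag *t = head; t != NULL; t = t->next) {
        if (t->cookie == cookie && t->id == id)
            return t;
    }
    return NULL;
}
```

Matching on the cookie first lets two modules reuse the same ID values without colliding.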

The design has not been completely successful. It is probably not worth the complexity of having a variable-sized header on an mbuf for the packet header; instead, those fields probably should have been included in all mbufs, even if they were not used. Several of the new packet header fields could be combined because they duplicate functionality; for example, the checksum fields could be implemented using tags.

Storage-Management Algorithms

Providing the system with a network stack capable of symmetric multiprocessing required a complete rework of the memory allocation algorithms underlying the mbuf code. Whereas previous versions of BSD allocated memory with the system allocator and then carved it up for mbufs and clusters, such a simple technique does not work when using multiple CPUs.

FreeBSD 5.2 allocates virtual memory among a series of lists for use by the network memory allocation code. Each CPU has its own private container of mbufs and clusters. There is also a single, general pool of mbufs and clusters from which allocations are attempted when a per-CPU list is empty or to which memory is freed when a per-CPU list is full. A uniprocessor system acts as if it is an SMP system with one CPU, which means that it has one per-CPU list as well as the general one.
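The relationship between the per-CPU lists and the general pool can be sketched as follows. Everything here is a simplifying assumption: the names are hypothetical, buffers are reduced to bare list nodes, the per-CPU limit is a single number, and the locking that a real SMP allocator needs is omitted entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_NCPU 2  /* number of CPUs assumed for the sketch */

struct sk_node { struct sk_node *next; };  /* stand-in for an mbuf or cluster */

struct sk_list {
    struct sk_node *head;
    size_t          count;  /* buffers currently on the list */
    size_t          limit;  /* most buffers a per-CPU list may hold */
};

struct sk_pool {
    struct sk_list percpu[SK_NCPU];
    struct sk_list general;  /* shared fallback list */
};

void sk_pool_init(struct sk_pool *p, size_t percpu_limit)
{
    memset(p, 0, sizeof(*p));
    for (int i = 0; i < SK_NCPU; i++)
        p->percpu[i].limit = percpu_limit;
}

static struct sk_node *sk_pop(struct sk_list *l)
{
    struct sk_node *n = l->head;

    if (n != NULL) {
        l->head = n->next;
        l->count--;
    }
    return n;
}

static void sk_push(struct sk_list *l, struct sk_node *n)
{
    n->next = l->head;
    l->head = n;
    l->count++;
}

/* Allocate from this CPU's list; fall back to the general pool if empty. */
struct sk_node *sk_alloc(struct sk_pool *p, int cpu)
{
    assert(cpu >= 0 && cpu < SK_NCPU);
    struct sk_node *n = sk_pop(&p->percpu[cpu]);

    if (n == NULL)
        n = sk_pop(&p->general);
    return n;  /* NULL means both lists are empty */
}

/* Free to this CPU's list; spill to the general pool if the list is full. */
void sk_free(struct sk_pool *p, int cpu, struct sk_node *n)
{
    assert(cpu >= 0 && cpu < SK_NCPU);
    struct sk_list *l = &p->percpu[cpu];

    if (l->count >= l->limit)
        l = &p->general;
    sk_push(l, n);
}
```

The point of the arrangement is that the common case touches only the list private to the running CPU; the shared pool is reached only when a per-CPU list is empty or full.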

At system boot time the memory allocator initially fills each per-CPU list with a tunable number of mbufs and clusters (see Section 14.6). The general list is initialized and left empty; it is used only in cases of high load, when a per-CPU list is starved for memory.

The reference counts for clusters are managed as an array, separately from the clusters themselves. The array is large enough for every mbuf cluster that could be allocated by the system. The memory dedicated to mbufs and clusters is set based on the kernel parameter maxusers, which is itself based on the amount of physical memory in the system. Basing the amount of memory dedicated to the networking subsystem on the amount of physical memory gives a good default value but should be overridden when a system is dedicated to networking tasks such as a Web server, firewall, or router.

Mbuf-allocation requests indicate either that they must be fulfilled immediately or that they can wait for available resources. If a request is marked as "can wait" and the requested resources are unavailable, the process is put to sleep to await available resources. Although code that executes at interrupt level is no longer required to make nonblocking allocation requests, the networking code still operates as if it were. If mbuf allocation has reached its limit or memory is unavailable, the mbuf-allocation routines ask the network-protocol modules to give back any available resources that they can spare. A nonblocking request will fail if no resources are available.

An mbuf-allocation request is made through a call to m_get() or m_gethdr(), or through an equivalent macro. An mbuf is retrieved from the appropriate per-CPU list by the mb_alloc() function and is then initialized. For m_gethdr(), the mbuf is initialized with the optional packet header. The MCLGET macro adds an mbuf cluster to an mbuf.

Release of mbuf resources is straightforward: m_free() frees a single mbuf, and m_freem() frees a chain of mbufs. When an mbuf that references an mbuf cluster is freed, the reference count for the cluster is decremented. Mbuf clusters are placed onto the appropriate per-CPU list when their reference counts reach zero.
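The interaction between freeing mbufs and the separate array of cluster reference counts can be sketched as below. The names, the eight-entry array, and the counter standing in for "return the cluster to a per-CPU list" are all assumptions for this illustration; the kernel's m_free() and m_freem() are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

#define SK_NCLUSTERS 8  /* size assumed for the sketch */

/* One count per cluster, kept apart from the clusters themselves. */
unsigned sk_refcnt[SK_NCLUSTERS];
unsigned sk_clusters_released;  /* stand-in for returning a cluster to a list */

struct sk_mbuf {
    struct sk_mbuf *m_next;
    int             cluster;  /* index of the referenced cluster, or -1 */
    int             freed;    /* set once the mbuf has been released */
};

/* Attach cluster idx to m, taking one reference on it. */
void sk_cluster_ref(struct sk_mbuf *m, int idx)
{
    assert(idx >= 0 && idx < SK_NCLUSTERS);
    m->cluster = idx;
    sk_refcnt[idx]++;
}

/* Free a single mbuf and return its successor in the chain. */
struct sk_mbuf *sk_free_one(struct sk_mbuf *m)
{
    struct sk_mbuf *next = m->m_next;

    if (m->cluster >= 0) {
        assert(sk_refcnt[m->cluster] > 0);
        if (--sk_refcnt[m->cluster] == 0)
            sk_clusters_released++;  /* last reference is gone */
    }
    m->m_next = NULL;
    m->freed = 1;
    return next;
}

/* Free every mbuf in a chain. */
void sk_free_chain(struct sk_mbuf *m)
{
    while (m != NULL)
        m = sk_free_one(m);
}
```

A cluster shared by two mbufs therefore survives the release of the first and is given back only when the second is freed.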

Mbuf Utility Routines

Many useful utility routines exist for manipulating mbufs within the kernel networking subsystem. Those routines that will be used in Chapter 12 are described briefly here.

The m_copym() routine makes a copy of an mbuf chain starting at a logical offset, in bytes, from the start of the data. This routine may be used to copy all or only part of a chain of mbufs. If an mbuf is associated with an mbuf cluster, the copy will reference the same data by incrementing the reference count on the cluster; otherwise, the data portion is copied as well. The m_copydata() function is similar, but it copies data from an mbuf chain into a caller-provided buffer. This buffer is not an mbuf, or chain, but a plain area of memory pointed to by a traditional C pointer.
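Copying a byte range out of a chain into a flat buffer, in the spirit of m_copydata(), can be sketched as follows. The function sk_copydata() and struct sk_mbuf are hypothetical; in particular, returning the number of bytes copied is a choice made for this sketch and is not a claim about the kernel routine's interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, cut-down mbuf for the sketch. */
struct sk_mbuf {
    struct sk_mbuf *m_next;
    const char     *m_data;  /* start of the valid data */
    size_t          m_len;   /* valid bytes at m_data */
};

/*
 * Copy up to len bytes, starting off bytes into the chain's data, into dst.
 * Returns the number of bytes copied, which is less than len if the chain
 * ends first.
 */
size_t sk_copydata(const struct sk_mbuf *m, size_t off, size_t len, char *dst)
{
    size_t copied = 0;

    assert(dst != NULL);

    /* Skip whole mbufs that lie entirely before the offset. */
    while (m != NULL && off >= m->m_len) {
        off -= m->m_len;
        m = m->m_next;
    }

    /* Copy from each remaining mbuf until len bytes have been gathered. */
    while (m != NULL && len > 0) {
        size_t n = m->m_len - off;

        if (n > len)
            n = len;
        memcpy(dst + copied, m->m_data + off, n);
        copied += n;
        len -= n;
        off = 0;  /* only the first mbuf copied from has a nonzero offset */
        m = m->m_next;
    }
    return copied;
}
```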

The m_adj() routine adjusts the data in an mbuf chain by a specified number of bytes, shaving data off either the front or back. No data are ever copied; m_adj() operates purely by manipulating the data-pointer and length fields in the mbuf structures.

The mtod() macro takes a pointer to an mbuf header and a data type and returns a pointer to the data in the buffer, cast to the given type. The dtom() function is the inverse: It takes a pointer to an arbitrary address in the data of an mbuf and returns a pointer to the mbuf header (rather than to the head of the mbuf chain). This operation is done through simple truncation of the data address to an mbuf-sized boundary. This function works only when data reside within the mbuf.
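The truncation trick behind dtom() depends on every mbuf being the same size and aligned on a boundary of that size. The sketch below demonstrates the idea with hypothetical names (SK_MSIZE, sk_mtod, sk_dtom, sk_pool) and a simplified 256-byte structure; it uses C11 alignment support and is not the kernel's definition of either operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MSIZE 256  /* fixed buffer size; must be a power of two */

/* Hypothetical fixed-size buffer: two header fields plus internal data. */
struct sk_mbuf {
    char  *m_data;
    size_t m_len;
    char   m_dat[SK_MSIZE - sizeof(char *) - sizeof(size_t)];
};

_Static_assert(sizeof(struct sk_mbuf) == SK_MSIZE,
    "buffer must be exactly SK_MSIZE bytes");

/* A small pool aligned so that every element starts on an SK_MSIZE boundary. */
_Alignas(SK_MSIZE) struct sk_mbuf sk_pool[2];

/* Return the data pointer cast to the requested type. */
#define sk_mtod(m, t) ((t)((m)->m_data))

/*
 * Recover the buffer header from an address inside its internal data area
 * by clearing the low-order bits of the address.
 */
struct sk_mbuf *sk_dtom(const void *p)
{
    return (struct sk_mbuf *)((uintptr_t)p & ~(uintptr_t)(SK_MSIZE - 1));
}
```

The same truncation applied to an address inside an external cluster would not yield the mbuf header, which is why the operation is valid only for data held within the mbuf.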

The m_pullup() routine rearranges an mbuf chain such that a specified number of bytes reside in a contiguous data area within the mbuf (not in external storage). This operation is used so that objects such as protocol headers are contiguous and can be treated as normal data structures, and so that dtom() will work when the object is freed. If there is room, m_pullup() will increase the size of the contiguous region up to the maximum size of a protocol header in an attempt to avoid being called in the future.

The M_PREPEND() macro adjusts an mbuf chain to prepend a specified number of bytes of data. If possible, space is made in place, but an additional mbuf may have to be allocated at the beginning of the chain. It is not currently possible to prepend data within an mbuf cluster because different mbufs might refer to data in different portions of the cluster.
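The two cases, making space in place or adding an mbuf at the head of the chain, can be sketched as follows. This is an assumption-laden illustration: the names are hypothetical, the new mbuf is supplied by the caller instead of being allocated, and packet-header bookkeeping and clusters are ignored.

```c
#include <assert.h>
#include <stddef.h>

#define SK_DATA 64  /* internal data area size assumed for the sketch */

struct sk_mbuf {
    struct sk_mbuf *m_next;
    char           *m_data;  /* start of the valid data */
    size_t          m_len;   /* valid bytes at m_data */
    char            m_dat[SK_DATA];
};

/* Unused bytes between the start of the data area and the valid data. */
size_t sk_leading_space(const struct sk_mbuf *m)
{
    return (size_t)(m->m_data - m->m_dat);
}

/*
 * Make room for n bytes in front of the chain headed by m. If m has enough
 * leading space, the space is made in place; otherwise spare (an unused
 * mbuf supplied by the caller, standing in for an allocation) becomes the
 * new head. Returns the head of the chain, or NULL if no room can be made.
 */
struct sk_mbuf *sk_prepend(struct sk_mbuf *m, size_t n, struct sk_mbuf *spare)
{
    assert(m != NULL);

    if (sk_leading_space(m) >= n) {
        m->m_data -= n;  /* grow toward the front; no data are moved */
        m->m_len  += n;
        return m;
    }
    if (spare == NULL || n > SK_DATA)
        return NULL;

    spare->m_next = m;
    /* Place the new bytes at the end so later prepends can reuse the gap. */
    spare->m_data = spare->m_dat + (SK_DATA - n);
    spare->m_len  = n;
    return spare;
}
```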