12.2. Fundamental Concepts

Before we look at the details of HFS+, let us familiarize ourselves with some fundamental terminology and data structures.

12.2.1. Volumes

In Chapter 11, we defined a file system as a scheme for arranging data on a storage medium, with a volume being an instance of a file system. An HFS+ volume may span an entire disk, or it may use only a portion (that is, a slice or a partition) of a disk. HFS+ volumes can also span multiple disks or partitions, although such spanning occurs at the device level and thus is not specific to HFS+, which still sees a single logical volume. Figure 12-1 shows a conceptual view of a disk containing two HFS+ volumes.

Figure 12-1. A disk containing two HFS+ volumes

12.2.2. Allocation Blocks

Space on an HFS+ volume is allocated to files in fundamental units called allocation blocks. For any given volume, its allocation block size is a multiple of the storage medium's sector size (i.e., the hardware-addressable block size). Common sector sizes for disk drives and optical drives are 512 bytes and 2KB, respectively. In contrast, the default (and optimal, for the Mac OS X implementation of HFS+) allocation block size is 4KB. As shown in Figure 12-1, the storage on an HFS+ volume is divided into some number of equal-size allocation blocks. The blocks are conceptually numbered sequentially. The file system implementation addresses volume contents using allocation block numbers, which are 32-bit quantities represented by the u_int32_t data type in the kernel.
Unlike HFS+, HFS supports only 16-bit allocation block numbers. Therefore, the total space on an HFS volume can be divided into at most 2^16 (65,536) allocation blocks. Note that the file system will allocate space in multiples of the allocation block size regardless of the space that is actually used. Consider a somewhat contrived example: If you use the HFS volume format on a 100GB disk, the allocation block size will be 1,638,400 bytes. In other words, the file system will allocate 1.6MB even for a 1-byte file. The allocation block size is a fundamental property of a given volume. You can choose an allocation block size other than the default when you construct a new HFS+ file system, say, using the newfs_hfs command-line program. The following rules apply when choosing an alternate allocation block size.
It is possible for the capacity of a volume to not be a multiple of its allocation block size. In such a case, there will be trailing space on the volume that is not covered by any allocation block.
12.2.3. Extents

An extent is a range of contiguous allocation blocks. It is represented in HFS+ by the extent descriptor data structure (struct HFSPlusExtentDescriptor [bsd/hfs/hfs_format.h]). An extent descriptor contains a pair of numbers: the allocation block number where the range starts and the number of allocation blocks in the range. For example, the extent descriptor { 100, 10 } represents a sequence of 10 consecutive allocation blocks, beginning at block number 100 on the volume.

struct HFSPlusExtentDescriptor {
    u_int32_t startBlock; // first allocation block in the extent
    u_int32_t blockCount; // number of allocation blocks in the extent
};
typedef struct HFSPlusExtentDescriptor HFSPlusExtentDescriptor;

typedef HFSPlusExtentDescriptor HFSPlusExtentRecord[8];

An eight-element array of HFS+ extent descriptors constitutes an extent record.[7] HFS+ uses an extent record as an inline extent list for a file's contents; that is, up to the first eight extents of a file (specifically, a file fork; see the next section) are stored as part of the file's basic metadata. For a file that has more than eight extents, HFS+ maintains one or more additional extent records, but they are not kept inline in the metadata.
12.2.4. File Forks

A file is traditionally equivalent to a single stream of bytes. HFS+ supports multiple byte-streams per file, with two special streams that are always present, although one or both may be empty (zero-size). These are the data fork and the resource fork. Each fork is a distinct part of the file and may be perceived as a file in itself. A fork is represented in HFS+ by the HFSPlusForkData structure [bsd/hfs/hfs_format.h].

struct HFSPlusForkData {
    u_int64_t           logicalSize; // fork's logical size in bytes
    u_int32_t           clumpSize;   // fork's clump size in bytes
    u_int32_t           totalBlocks; // total blocks used by this fork
    HFSPlusExtentRecord extents;     // initial set of extents
};
typedef struct HFSPlusForkData HFSPlusForkData;

Both forks have their own HFSPlusForkData structures (and therefore, extent records) that are stored along with the file's standard metadata. The traditional view of a file maps to its data fork. Most files on a typical Mac OS X installation have only the data fork; their resource forks are empty.
The Launch Services framework accesses a file's resource fork to retrieve the path to the application to use to open that file, provided such an application has been specified for that particular file, say, through the "Open with" section of the Finder's information window. Another noteworthy aspect of the data and resource forks is that their names cannot be changed. Any additional byte-streams created have Unicode names. These named streams can have arbitrary contents, although an HFS+ implementation may limit the amount of data a named stream can hold. Beginning with Mac OS X 10.4, named streams are used to provide native support for extended attributes, which in turn are used to provide native support for access control lists.

12.2.5. Clumps

Whereas an allocation block is a fixed-size group (for a given volume) of contiguous sectors, a clump is a fixed-size group of contiguous allocation blocks. Although every clump is an extent, not every extent is a clump. When allocating space to a fork, an HFS+ implementation may do so in terms of clumps, rather than individual allocation blocks, to avoid external fragmentation. HFS+ has a provision for default clump sizes for the volume as a whole, for each B-Tree, for all data forks, and for all resource forks. Note in Section 12.2.4 that the HFSPlusForkData structure has a field called clumpSize. Although this field could be used to support per-fork clump size specifications, HFS+ implementations with support for Hot File Clustering use this field to record the number of allocation blocks read from that fork.

12.2.6. B-Trees

HFS+ uses B-Trees to implement its critical indexing data structures, which make it possible to locate both file content and metadata residing on a volume.
A B-Tree is a generalization of a balanced binary search tree. Whereas a binary tree has a branching factor of two, a B-Tree can have an arbitrarily large branching factor. This is achieved by having very large tree nodes. A B-Tree node may be thought of as an encapsulation of many levels of a binary tree. Having a very large branching factor leads to very low tree height, which is the essence of B-Trees: They are exceptionally suited for cases where the tree structure resides on an expensive-to-access storage medium, such as a disk drive. The lower the height, the fewer the number of disk accesses required to perform a B-Tree search. B-Trees offer guaranteed worst-case performance for common tree operations such as insertion, retrieval, and deletion of records. The operations can be implemented using reasonably simple algorithms, which are extensively covered in computing literature.
We will not discuss the theory behind B-Trees in this chapter. Please refer to an algorithms textbook for further details on B-Trees.

12.2.6.1. B+ Trees in HFS+

HFS+ specifically uses a variant of B+ Trees, which themselves are B-Tree variants. In a B+ Tree, all data resides in leaf (external) nodes, with index (internal) nodes containing only keys and pointers to subtrees. Consequently, index and leaf nodes can have different formats and sizes. Moreover, the leaf nodes, which are all at the same (lowest) level in the balanced tree,[10] are chained together from left to right in a linked list to form a sequence set. Whereas the index nodes allow random searching, the list of leaf nodes can be used for sequential access to the data. Note that since data corresponding to a key can be found only in a leaf node, a B+ Tree search, starting from the root node, always ends at a leaf node.
The HFS+ implementation of B+ Trees differs from the standard definition in one notable respect. In a B+ Tree, an index node I containing N keys has N + 1 pointers, one to each of its N + 1 children. In particular, the first (leftmost) pointer points to the child (subtree) containing keys that are less than the first key of node I. This way, node I serves as an (N + 1)-way decision point while searching, with each pointer leading to the next level in the search based on which of the N + 1 ranges, if any, the search key falls in. The B+ Trees used in HFS+ do not have the leftmost pointer in their index nodes; that is, for an index node I, there is no leftmost subtree containing keys that are less than the first key of I. This means each index node with N keys has N pointers. Hereafter, we will use the term B-Tree to refer to the HFS+ implementation of B+ Trees. Although HFS+ uses several B-Trees, with differing key and data formats, all its trees share the same basic structure. In fact, all HFS+ B-Trees can be accessed and manipulated using mostly the same set of functions; only certain operations (such as key comparison) require code specific to tree content. The following are common characteristics of HFS+ B-Trees.
An HFS+ system file is designated as such by the CD_ISMETA bit being set in the cd_flags field of the in-memory catalog node descriptor corresponding to that file.
12.2.6.2. Nodes

Figure 12-2 shows the structure of a generic B-Tree node. Note that the nodes are only logically contiguous; like any other file, a B-Tree may or may not be physically contiguous on disk.

Figure 12-2. The structure of an HFS+ B-Tree node

The node structure shown in Figure 12-2 is shared by all node types. There is a node descriptor (struct BTNodeDescriptor [bsd/hfs/hfs_format.h]) at the beginning of each node. The bLink and fLink fields of this structure chain together nodes of a particular type (indicated by the kind field) in the tree. Immediately following the node descriptor is the records segment, which contains the node's records. Since node records can be of varying lengths, the latter part of the node contains a list of 16-bit offsets, each being a record's offset from the beginning of the node. The last entry in the offset list is the offset to the unused space in the node, that is, the space immediately following the records segment and before the offset list. Note that if there is no free space left, there still is a free space offset entry; it points to its own offset in that case. Figure 12-3 shows the structure of a B-Tree header node. The header node has exactly three records, namely, the following:
Figure 12-3. The structure of an HFS+ B-Tree header node

As we noted earlier, a tree may have more nodes than can be represented by the header node's map record, whose size depends on the node size. The sum of the sizes of the node descriptor (14 bytes), the header record (106 bytes), the user data record (128 bytes), and the offset entries (4 × 2 bytes) is 256 bytes, leaving the remaining space for the map record. If additional space is required, the tree uses map nodes to house the extension of the bitmap. If a tree has one or more map nodes, the header node's fLink field will contain the number of the next map node. The first map node's fLink field will contain the next map node's number, if any, and so on, with the last map node's fLink field being set to zero. The bLink fields of all map nodes, and that of the header node, are always set to zero.

12.2.6.3. Records

Whereas the header and map nodes of a B-Tree contain administrative information for the tree itself, the index and leaf nodes contain file system information, where the type of information depends on the specific B-Tree in question. Nevertheless, the records in index and leaf nodes have the same general structure, which is shown in Figure 12-4.

Figure 12-4. The structure of an HFS+ B-Tree record
At the beginning of the record is a key length (keyLength), which is stored using either one or two bytes, depending on whether the attributes field in the B-Tree's header node has the kBTBigKeysMask bit clear or set, respectively. Immediately following the key length is the actual key. The key length may or may not represent the actual key length, which is determined as follows.
As shown in Figure 12-4, a record's data may be preceded and succeeded by single pad bytes. Record data is required to be aligned on a two-byte boundary and to have a size that is an even number of bytes. If the combined size of the key length and the actual key is such that the data would start on an odd-numbered byte, a pad byte is inserted before the data. Similarly, if the data's size is an odd number of bytes, a pad byte is inserted after the data. Index and leaf nodes contain only index and leaf records, respectively. Since these are B+ Trees, the actual data is stored only in leaf nodes. An index record's data is merely a node number, a pointer to another index node or a leaf node. In other words, index nodes together constitute an index for arbitrarily searching the data stored in leaf nodes.

12.2.6.4. Searching

A fundamental operation involved in B-Tree access and manipulation is key comparison. An important property of a B-Tree node is that all records within the node are stored such that their keys are in increasing order. For simple keys, say, integers, comparison could be as trivial as numerical comparison. Complex keys, such as those used in HFS+ B-Trees, have several components and therefore require more complicated comparison operations. Typically, the various components of a complex key are assigned precedence values. When two keys of a given type are compared, the individual components are compared in decreasing order of precedence. If an individual comparison results in equality, the overall comparison operation moves to the next component. This process continues until there is an inequality or until all components are exhausted, in which case the keys are deemed equal. Figure 12-5 shows a hypothetical B-Tree that uses fixed-size integral keys. The tree's height is 3. In general, leaf nodes, all of which are at the same level and therefore have the same height, are assigned 1 as their height.
An index node immediately above the leaf node has 2 as its height, and so on. The index node with the highest height is the root node, a reference to which is maintained in the header node. Each node's height is contained in the height field of the node descriptor. The header node's height field is always set to zero, but its treeDepth field contains the tree's depth, which is the same as the root node's height.

Figure 12-5. The contents of a hypothetical HFS+ B-Tree
In an empty tree, there is no root node. Moreover, a root node does not have to be an index node. If all records are contained within a single node, that node is both the root node and the solitary leaf node. The following are noteworthy observations about the tree shown in Figure 12-5.
Let us see how data corresponding to a given key (the search key) can be searched for in this tree. The search always begins at the root node, which can always be found by examining the header record. The latter is at a fixed location within the tree, and the tree itself is assumed to be at a known location. Thereafter, the search proceeds downward, eventually ending at a leaf node that contains the search key, unless the key does not exist in the tree. In particular, the search algorithm used by HFS+ does not back up; it accesses a node at most once during a given search operation. Suppose the search key is 38. We begin by examining the root node's records, with the goal of finding the greatest key that is at most equal to but not greater than the search key. In this case, we would choose 32, which is the greatest key in the root node but is still less than 38. The record's data is a pointer that leads us to an index node that is one level down in the B-Tree. This node has three records. Again, we search for the record with the largest key that does not exceed the search key: we choose 35. The corresponding pointer leads us to a leaf node. The search key matches the key of the first record in the leaf node. Therefore, the search is successful.
A B-Tree search is analogous to a binary search, except that at each decision point, we decide between multiple paths instead of two. Searches within a node can be performed using any algorithm, with binary search and linear search (for small nodes) being common alternatives. The HFS+ implementation uses binary search. For all leaf nodes to be at the same level, a B-Tree must be balanced. There exist several techniques to balance a B-Tree. HFS+ uses the left-rotate and left-split operations to maintain a balanced tree. Intuitively speaking, existing records are moved to the left and new records are inserted at the rightmost points in the tree.