12.2. Fundamental Concepts

Before we look at the details of HFS+, let us familiarize ourselves with some fundamental terminology and data structures.

12.2.1. Volumes

In Chapter 11, we defined a file system as a scheme for arranging data on a storage medium, with a volume being an instance of a file system. An HFS+ volume may span an entire disk, or it may use only a portion (that is, a slice or a partition) of a disk. HFS+ volumes can also span multiple disks or partitions, although such spanning occurs at the device level and thus is not specific to HFS+, which still sees a single logical volume. Figure 12-1 shows a conceptual view of a disk containing two HFS+ volumes.

Figure 12-1. A disk containing two HFS+ volumes

12.2.2. Allocation Blocks

Space on an HFS+ volume is allocated to files in fundamental units called allocation blocks. For any given volume, its allocation block size is a multiple of the storage medium's sector size (i.e., the hardware-addressable block size). Common sector sizes for disk drives and optical drives are 512 bytes and 2KB, respectively. In contrast, the default (and optimal, for the Mac OS X implementation of HFS+) allocation block size is 4KB. As shown in Figure 12-1, the storage on an HFS+ volume is divided into some number of equal-size allocation blocks. The blocks are conceptually numbered sequentially. The file system implementation addresses volume contents using allocation block numbers, which are 32-bit quantities represented by the u_int32_t data type in the kernel.
Unlike HFS+, HFS supports only 16-bit allocation block numbers. Therefore, the total space on an HFS volume can be divided into at most 2^16 (65,536) allocation blocks. Note that the file system will allocate space in multiples of the allocation block size regardless of the space that is actually used. Consider a somewhat contrived example: If you use the HFS volume format on a 100GB disk, the allocation block size will be 1,638,400 bytes. In other words, the file system will allocate 1.6MB even for a 1-byte file. The allocation block size is a fundamental property of a given volume. You can choose an allocation block size other than the default when you construct a new HFS+ file system, say, using the newfs_hfs command-line program. The following rules apply when choosing an alternate allocation block size.
It is possible for the capacity of a volume to not be a multiple of its allocation block size. In such a case, there will be trailing space on the volume that is not covered by any allocation block.
12.2.3. Extents

An extent is a range of contiguous allocation blocks. It is represented in HFS+ by the extent descriptor data structure (struct HFSPlusExtentDescriptor [bsd/hfs/hfs_format.h]). An extent descriptor contains a pair of numbers: the allocation block number where the range starts and the number of allocation blocks in the range. For example, the extent descriptor { 100, 10 } represents a sequence of 10 consecutive allocation blocks, beginning at block number 100 on the volume.

struct HFSPlusExtentDescriptor {
    u_int32_t startBlock; // first allocation block in the extent
    u_int32_t blockCount; // number of allocation blocks in the extent
};
typedef struct HFSPlusExtentDescriptor HFSPlusExtentDescriptor;

typedef HFSPlusExtentDescriptor HFSPlusExtentRecord[8];

An eight-element array of HFS+ extent descriptors constitutes an extent record.[7] HFS+ uses an extent record as an inline extent list for a file's contents; that is, up to the first eight extents of a file (specifically, a file fork; see the next section) are stored as part of the file's basic metadata. For a file that has more than eight extents, HFS+ maintains one or more additional extent records, but they are not kept inline in the metadata.
12.2.4. File Forks

A file is traditionally equivalent to a single stream of bytes. HFS+ supports multiple byte-streams per file, with two special streams that are always present, although one or both may be empty (zero-size). These are the data fork and the resource fork. Each fork is a distinct part of the file and may be perceived as a file in itself. A fork is represented in HFS+ by the HFSPlusForkData structure [bsd/hfs/hfs_format.h].

struct HFSPlusForkData {
    u_int64_t           logicalSize; // fork's logical size in bytes
    u_int32_t           clumpSize;   // fork's clump size in bytes
    u_int32_t           totalBlocks; // total blocks used by this fork
    HFSPlusExtentRecord extents;     // initial set of extents
};
typedef struct HFSPlusForkData HFSPlusForkData;

Both forks have their own HFSPlusForkData structures (and therefore, extent records) that are stored along with the file's standard metadata. The traditional view of a file maps to its data fork. Most files on a typical Mac OS X installation have only the data fork; their resource forks are empty.
The Launch Services framework accesses a file's resource fork to retrieve the path to the application to use to open that file, provided such an application has been specified for that particular file, say, through the "Open with" section of the Finder's information window. Another noteworthy aspect of the data and resource forks is that their names cannot be changed. Any additional byte-streams created have Unicode names. These named streams can have arbitrary contents, although an HFS+ implementation may limit the amount of data a named stream can hold. Beginning with Mac OS X 10.4, named streams are used to provide native support for extended attributes, which in turn are used to provide native support for access control lists.

12.2.5. Clumps

Whereas an allocation block is a fixed-size group (for a given volume) of contiguous sectors, a clump is a fixed-size group of contiguous allocation blocks. Although every clump is an extent, not every extent is a clump. When allocating space to a fork, an HFS+ implementation may do so in terms of clumps, rather than individual allocation blocks, to avoid external fragmentation. HFS+ has a provision for default clump sizes for the volume as a whole, for each B-Tree, for all data forks, and for all resource forks. Note in Section 12.2.4 that the HFSPlusForkData structure has a field called clumpSize. Although this field could be used to support per-fork clump size specifications, HFS+ implementations with support for Hot File Clustering use this field to record the number of allocation blocks read from that fork.

12.2.6. B-Trees

HFS+ uses B-Trees to implement its critical indexing data structures, which make it possible to locate both file content and metadata residing on a volume.
A B-Tree is a generalization of a balanced binary search tree. Whereas a binary tree has a branching factor of two, a B-Tree can have an arbitrarily large branching factor. This is achieved by having very large tree nodes. A B-Tree node may be thought of as an encapsulation of many levels of a binary tree. Having a very large branching factor leads to very low tree height, which is the essence of B-Trees: They are exceptionally suited for cases where the tree structure resides on an expensive-to-access storage medium, such as a disk drive. The lower the height, the fewer the number of disk accesses required to perform a B-Tree search. B-Trees offer guaranteed worst-case performance for common tree operations such as insertion, retrieval, and deletion of records. The operations can be implemented using reasonably simple algorithms, which are extensively covered in computing literature.
We will not discuss the theory behind B-Trees in this chapter. Please refer to an algorithms textbook for further details on B-Trees.

12.2.6.1. B+ Trees in HFS+

HFS+ specifically uses a variant of B+ Trees, which themselves are B-Tree variants. In a B+ Tree, all data resides in leaf (external) nodes, with index (internal) nodes containing only keys and pointers to subtrees. Consequently, index and leaf nodes can have different formats and sizes. Moreover, the leaf nodes, which are all at the same (lowest) level in the balanced tree,[10] are chained together from left to right in a linked list to form a sequence set. Whereas the index nodes allow random searching, the list of leaf nodes can be used for sequential access to the data. Note that since data corresponding to a key can be found only in a leaf node, a B+ Tree search, starting from the root node, always ends at a leaf node.
The HFS+ implementation of B+ Trees differs from the standard definition in one notable respect. In a B+ Tree, an index node I containing N keys has N + 1 pointers, one to each of its N + 1 children. In particular, the first (leftmost) pointer points to the child (subtree) containing keys that are less than the first key of node I. This way, node I serves as an (N + 1)-way decision point while searching, with each pointer leading to the next level in the search based on which of the N + 1 ranges, if any, the search key falls in. The B+ Trees used in HFS+ do not have the leftmost pointer in their index nodes; that is, for an index node I, there is no leftmost subtree containing keys that are less than the first key of I. This means each index node with N keys has N pointers. Hereafter, we will use the term B-Tree to refer to the HFS+ implementation of B+ Trees. Although HFS+ uses several B-Trees, with differing key and data formats, all its trees share the same basic structure. In fact, all HFS+ B-Trees can be accessed and manipulated using mostly the same set of functions; only certain operations (such as key comparison) require code specific to tree content. The following are common characteristics of HFS+ B-Trees.
An HFS+ system file is designated as such by the CD_ISMETA bit being set in the cd_flags field of the in-memory catalog node descriptor corresponding to that file.
12.2.6.2. Nodes

Figure 12-2 shows the structure of a generic B-Tree node. Note that the nodes are only logically contiguous; like any other file, a B-Tree may or may not be physically contiguous on disk.

Figure 12-2. The structure of an HFS+ B-Tree node

The node structure shown in Figure 12-2 is shared by all node types. There is a node descriptor (struct BTNodeDescriptor [bsd/hfs/hfs_format.h]) at the beginning of each node. The bLink and fLink fields of this structure chain together nodes of a particular type (indicated by the kind field) in the tree. Immediately following the node descriptor is the records segment, which contains the node's records. Since node records can be of varying lengths, the latter part of the node contains a list of 16-bit offsets, each being a record's offset from the beginning of the node. The last entry in the offset list is the offset to the unused space in the node, that is, the space immediately following the records segment and before the offset list. Note that if there is no free space left, there still is a free space offset entry; it points to its own offset in that case. Figure 12-3 shows the structure of a B-Tree header node. The header node has exactly three records, namely, the following:
Figure 12-3. The structure of an HFS+ B-Tree header node

As we noted earlier, a tree may have more nodes than can be represented by the header node's map record, whose size depends on the node size. The sum of the sizes of the node descriptor (14 bytes), the header record (106 bytes), the user data record (128 bytes), and the offset entries (4 × 2 bytes) is 256 bytes, leaving the remaining space for the map record. If additional space is required, the tree uses map nodes to house the extension of the bitmap. If a tree has one or more map nodes, the header node's fLink field will contain the number of the next map node. The first map node's fLink field will contain the next map node's number, if any, and so on, with the last map node's fLink field being set to zero. The bLink fields of all map nodes, and that of the header node, are always set to zero.

12.2.6.3. Records

Whereas the header and map nodes of a B-Tree contain administrative information for the tree itself, the index and leaf nodes contain file system information, where the type of information depends on the specific B-Tree in question. Nevertheless, the records in index and leaf nodes have the same general structure, which is shown in Figure 12-4.

Figure 12-4. The structure of an HFS+ B-Tree record
At the beginning of the record is a key length (keyLength), which is stored using either one or two bytes, depending on whether the attributes field in the B-Tree's header node has the kBTBigKeysMask bit clear or set, respectively. Immediately following the key length is the actual key. The key length may or may not represent the actual key length, which is determined as follows.
As shown in Figure 12-4, a record's data may be preceded and succeeded by single pad bytes. Record data is required to be aligned on a two-byte boundary and to have a size that is an even number of bytes. If the combined size of the key length and the actual key is such that the data would start on an odd-numbered byte, a pad byte is inserted before the data. Similarly, if the data's size is an odd number of bytes, a pad byte is inserted after the data. Index and leaf nodes contain only index and leaf records, respectively. Since these are B+ Trees, the actual data is stored only in leaf nodes. An index record's data is merely a node number, a pointer to another index node or a leaf node. In other words, index nodes together constitute an index for arbitrarily searching the data stored in leaf nodes.

12.2.6.4. Searching

A fundamental operation involved in B-Tree access and manipulation is key comparison. An important property of a B-Tree node is that all records within the node are stored such that their keys are in increasing order. For simple keys, say, integers, comparison could be as trivial as numerical comparison. Complex keys, such as those used in HFS+ B-Trees, have several components and therefore require more complicated comparison operations. Typically, the various components of a complex key are assigned precedence values. When two keys of a given type are compared, the individual components are compared in decreasing order of precedence. If an individual comparison results in equality, the overall comparison operation moves to the next component. This process continues until there is an inequality or until all components are exhausted, in which case the keys are deemed equal. Figure 12-5 shows a hypothetical B-Tree that uses fixed-size integral keys. The tree's height is 3. In general, leaf nodes, all of which are at the same level and therefore have the same height, are assigned 1 as their height.
An index node immediately above the leaf node has 2 as its height, and so on. The index node with the highest height is the root node, a reference to which is maintained in the header node. Each node's height is contained in the height field of the node descriptor. The header node's height field is always set to zero, but its treeDepth field contains the tree's depth, which is the same as the root node's height.

Figure 12-5. The contents of a hypothetical HFS+ B-Tree
In an empty tree, there is no root node. Moreover, a root node does not have to be an index node. If all records are contained within a single node, that node is both the root node and the solitary leaf node. The following are noteworthy observations about the tree shown in Figure 12-5.
Let us see how data corresponding to a given key (the search key) can be searched for in this tree. The search always begins at the root node, which can always be found by examining the header record. The latter is at a fixed location within the tree, and the tree itself is assumed to be at a known location. Thereafter, the search proceeds downward, eventually ending at a leaf node that contains the search key, unless the key does not exist in the tree. In particular, the search algorithm used by HFS+ does not back up; it accesses a node at most once during a given search operation. Suppose the search key is 38. We begin by examining the root node's records, with the goal of finding the greatest key that is at most equal to but not greater than the search key. In this case, we would choose 32, which is the greatest key in the root node but is still less than 38. The record's data is a pointer that leads us to an index node that is one level down in the B-Tree. This node has three records. Again, we search for the record with the largest key that does not exceed the search key: we choose 35. The corresponding pointer leads us to a leaf node. The search key matches the key of the first record in the leaf node. Therefore, the search is successful.
A B-Tree search is analogous to a binary search, except that at each decision point, we decide between multiple paths instead of two. Searches within a node can be performed using any algorithm, with binary search and linear search (for small nodes) being common alternatives. The HFS+ implementation uses binary search. For all leaf nodes to be at the same level, a B-Tree must be balanced. There exist several techniques to balance a B-Tree. HFS+ uses the left-rotate and left-split operations to maintain a balanced tree. Intuitively speaking, existing records are moved to the left and new records are inserted at the rightmost points in the tree.