12.9. Optimizations

The Mac OS X HFS+ implementation contains adaptive optimizations to improve performance and reduce fragmentation. We will look at these optimizations in this section.

12.9.1. On-the-Fly Defragmentation

When a user file is opened on an HFS+ volume, the kernel checks whether the file qualifies for on-the-fly defragmentation. All of the following conditions must be met for the file to be eligible (a sketch of the check follows the list).

  • The file system is not read-only.

  • The file system is journaled.

  • The file is a regular file.

  • The file is not already open.

  • The file fork being accessed is nonzero and no more than 20MB in size.

  • The fork is fragmented into eight or more extents.

  • The system has been up for more than three minutes (to ensure that bootstrapping has finished).
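Taken together, these checks amount to a simple predicate over file and volume state. The following is a minimal sketch of such a predicate in C; the structure, field, and constant names are illustrative, not the kernel's actual interfaces.

#include <stdint.h>
#include <stdbool.h>

#define MAX_DEFRAG_FORK_SIZE (20 * 1024 * 1024) /* 20MB fork-size ceiling */
#define MIN_EXTENT_COUNT     8                  /* fragmentation threshold */
#define MIN_UPTIME_SECS      (3 * 60)           /* let bootstrapping finish */

/* Illustrative stand-ins for state the kernel already has at hand. */
struct defrag_check_args {
    bool     fs_read_only;    /* is the file system read-only? */
    bool     fs_journaled;    /* is the file system journaled? */
    bool     is_regular_file; /* a regular file, not a directory or device */
    bool     already_open;    /* is the file already open elsewhere? */
    uint64_t fork_size;       /* size of the fork being accessed, in bytes */
    uint32_t extent_count;    /* number of extents backing the fork */
    uint64_t uptime_secs;     /* seconds since boot */
};

static bool
eligible_for_defrag(const struct defrag_check_args *a)
{
    return !a->fs_read_only &&
           a->fs_journaled &&
           a->is_regular_file &&
           !a->already_open &&
           a->fork_size > 0 && a->fork_size <= MAX_DEFRAG_FORK_SIZE &&
           a->extent_count >= MIN_EXTENT_COUNT &&
           a->uptime_secs > MIN_UPTIME_SECS;
}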

If all the preceding conditions are satisfied, the file is relocated by calling hfs_relocate() [bsd/hfs/hfs_readwrite.c], which attempts to find contiguous allocation blocks for the file. A successful relocation results in a defragmented file. Let us see this mechanism in action by creating a fragmented file and causing its relocation. We will use a somewhat unsavory method to create our fragmented file. Recall that we wrote a program (hfs_change_next_allocation, shown earlier in Figure 12-11) to provide a hint to the kernel regarding the location of the next allocation block search. We will use that program to our advantage in the following algorithm to create the file we desire.

  1. Start with a small file F.

  2. Use hfsdebug to determine the location of F's last extent.

  3. Use hfs_change_next_allocation to set the next allocation pointer immediately after F ends.

  4. Create a nonempty dummy file d. This should consume the allocation block immediately after F's last allocation block.

  5. Append an allocation block's worth of data to F. Since F cannot grow contiguously any more, it will require another extent to house the newly written data.

  6. Delete the dummy file d.

  7. If F has eight extents, we are done; otherwise, go back to step 2.

Figure 12-29 shows a Perl program that implements the algorithm.

Figure 12-29. A Perl program to create a file with eight fragments on an HFS+ volume

#! /usr/bin/perl -w

my $FOUR_KB  = "4" x 4096;
my $BINDIR   = "/usr/local/bin";
my $HFSDEBUG = "$BINDIR/hfsdebug";
my $HFS_CHANGE_NEXT_ALLOCATION = "$BINDIR/hfs_change_next_allocation";

sub usage() {
    die "usage: $0 <volume>\n\twhere <volume> must not be the root volume\n";
}

(-x $HFSDEBUG && -x $HFS_CHANGE_NEXT_ALLOCATION) or die "$0: missing tools\n";
($#ARGV == 0) or usage();

my $volume = $ARGV[0];
my @sb = stat($volume);
((-d $volume) && @sb && ($sb[0] != (stat("/"))[0])) or usage();

my $file = "$volume/fragmented.$$";
(! -e $file) or die "$0: file $file already exists\n";

`echo -n $FOUR_KB > "$file"`; # create a file
(-e "$file") or die "$0: failed to create file ($file)\n";

WHILE_LOOP: while (1) {

    my @out = `$HFSDEBUG "$file" | grep -B 1 'allocation blocks'`;
    $out[0] =~ /^\s+([^\s]+)\s+([^\s]+)..*$/;
    my $lastStartBlock = $1; # starting block of the file's last extent
    my $lastBlockCount = $2; # number of blocks in the last extent
    $out[1] =~ /[\s*\d+] allocation blocks in (\d+) extents total.*/;
    my $nExtents = $1;       # number of extents the file currently has

    if ($nExtents >= 8) {    # do we already have 8 or more extents?
        print "\ncreated $file with $nExtents extents\n";
        last WHILE_LOOP;
    }

    # set volume's next allocation pointer to the block right after our file
    my $conflict = sprintf("0x%x", hex($lastStartBlock) + hex($lastBlockCount));
    `$HFS_CHANGE_NEXT_ALLOCATION $volume $conflict`;

    print "start=$lastStartBlock count=$lastBlockCount extents=$nExtents ".
          "conflict=$conflict\n";

    `echo hello > "$volume/dummy.txt"`; # create dummy file to consume space
    `echo -n $FOUR_KB >> "$file"`;      # extend our file to cause discontiguity
    `rm "$volume/dummy.txt"`;           # remove the dummy file

} # WHILE_LOOP

exit(0);

Now that we have the means of creating a file that should be eligible for on-the-fly defragmentation, let us test the feature on a disk image.

$ hdiutil create -size 32m -fs HFSJ -volname HFSFrag /tmp/hfsfrag.dmg
...
$ open /tmp/hfsfrag.dmg
$ ./mkfrag.pl /Volumes/HFSFrag
start=0xaf9 count=0x1 extents=1 conflict=0xafa
start=0xafb count=0x1 extents=2 conflict=0xafc
start=0xafd count=0x1 extents=3 conflict=0xafe
start=0xaff count=0x1 extents=4 conflict=0xb00
start=0xb01 count=0x1 extents=5 conflict=0xb02
start=0xb03 count=0x1 extents=6 conflict=0xb04
start=0xb05 count=0x1 extents=7 conflict=0xb06

created /Volumes/HFSFrag/fragmented.2189 with 8 extents
$ hfsdebug /Volumes/HFSFrag/fragmented.2189
...
  extents              =   startBlock   blockCount      % of file
                                0xaf9          0x1        12.50 %
                                0xafb          0x1        12.50 %
                                0xafd          0x1        12.50 %
                                0xaff          0x1        12.50 %
                                0xb01          0x1        12.50 %
                                0xb03          0x1        12.50 %
                                0xb05          0x1        12.50 %
                                0xb07          0x1        12.50 %
                         8 allocation blocks in 8 extents total.
...
$ cat /Volumes/HFSFrag/fragmented.2189 > /dev/null # open the file
$ hfsdebug /Volumes/HFSFrag/fragmented.2189
...
  extents              =   startBlock   blockCount      % of file
                               0x1b06          0x8       100.00 %
                         8 allocation blocks in 1 extents total.
...


We see that opening a fragmented file that is eligible for on-the-fly defragmentation indeed caused relocation of the file to a single extent.

12.9.2. The Metadata Zone

The HFS+ implementation in Mac OS X 10.3 introduced an allocation policy that reserves space for several volume metadata structures, placing them next to each other, if possible, in an area near the beginning of the volume. This area is called the metadata allocation zone. Unless disk space is scarce, the HFS+ allocator will not consume space from the metadata zone for normal file allocations. Similarly, unless the metadata zone is exhausted, HFS+ will allocate space for metadata from within the zone. Thus, various types of metadata are likely to be physically adjacent and to have higher contiguity in general; the consequent reduction in seek times improves file system performance. The policy is enabled for a volume at runtime, when the volume is mounted, provided the volume is journaled and at least 10GB in size. The hfsmount structure stores runtime details of the metadata zone, which we can view using hfsdebug (a sketch of the allocation preference follows the output).

$ sudo hfsdebug -V /Volumes/HFSFrag -m # metadata zone should not be enabled
...
# Metadata Zone
  metadata zone start block               = 0
  metadata zone end block                 = 0
  hotfile start block                     = 0
  hotfile end block                       = 0
  hotfile free blocks                     = 0
  hotfile maximum blocks                  = 0
  overflow maximum blocks                 = 0
  catalog maximum blocks                  = 0
...
$ sudo hfsdebug -m # metadata zone should be enabled for the root volume
...
# Metadata Zone
  metadata zone start block               = 0x1
  metadata zone end block                 = 0x67fff
  hotfile start block                     = 0x45bed
  hotfile end block                       = 0x67fff
  hotfile free blocks                     = 0x1ebaa
  hotfile maximum blocks                  = 0x22413
  overflow maximum blocks                 = 0x800
  catalog maximum blocks                  = 0x43f27
...
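In effect, the allocator applies a two-sided preference: ordinary file allocations avoid the zone unless free space is scarce, and metadata allocations prefer the zone unless it is exhausted. The sketch below captures the idea; the names and the single decision function are hypothetical simplifications, since the real allocator expresses this policy through its block-search logic.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical volume state; the real fields live in the hfsmount structure. */
struct vol_state {
    uint32_t zone_start, zone_end; /* metadata zone block range */
    bool     zone_has_space;       /* free blocks left inside the zone */
    bool     volume_space_scarce;  /* free blocks outside the zone nearly gone */
};

/* Returns the block at which a search for free space should begin. */
static uint32_t
pick_search_start(const struct vol_state *v, bool is_metadata,
                  uint32_t next_allocation)
{
    if (is_metadata && v->zone_has_space)
        return v->zone_start;      /* keep metadata clustered in the zone */
    if (!is_metadata && !v->volume_space_scarce)
        return v->zone_end + 1;    /* keep file data out of the zone */
    return next_allocation;        /* fall back to the volume-wide hint */
}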


Figure 12-30 shows a representative layout of the metadata zone. Note that a given volume may not have all of the constituents shown in use; for example, quotas are typically not enabled on Mac OS X systems, so there will be no quota files. The last part of the metadata zone is used for an optimization called Hot File Clustering and is therefore called the Hot File area.

Figure 12-30. Layout of the HFS+ metadata zone


12.9.3. Hot File Clustering

Hot File Clustering (HFC) is an adaptive, multistage clustering scheme based on the premise that frequently accessed files are small both in number and in size. Such files are termed hot files in the context of HFC. As part of HFC, Apple's HFS+ implementation performs the following operations to improve performance while accessing hot files.

  • It keeps track of blocks read from file forks to identify candidate hot files.

  • After a predetermined recording period, it analyzes the list of candidate hot files to determine those that should be moved to the Hot File area within the latter part of the metadata zone.

  • If necessary, it evicts existing files from the Hot File area to make space for newer and hotter files.

  • It moves selected hot files to the Hot File area, allocating contiguous space for them as they are moved.

HFC records each file's temperature, which is defined as the ratio of the number of bytes read from that file during the recording period to the file's size. Thus, the more frequently a file is accessed, the higher its temperature. HFS+ uses the clumpSize field of the HFSPlusForkData structure to record the amount of data read from a fork.[23]

[23] HFC stores the number of allocation blocks read, rather than the number of bytes read, in the clumpSize field, which is a 32-bit number.
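For example, a file that is read in its entirety five times during the recording period has a temperature of 5, regardless of its size. Since both the amount read and the file size are tracked in allocation blocks, the metric reduces to a block-wise ratio. The following is a minimal sketch of the computation; the function and parameter names are illustrative, not the kernel's.

#include <stdint.h>

/* temperature = (data read during the recording period) / (file size),
 * computed in allocation blocks. */
static uint32_t
file_temperature(uint32_t blocks_read, uint32_t file_blocks)
{
    if (file_blocks == 0)
        return 0;               /* empty forks are never hot */
    return blocks_read / file_blocks;
}

With 4KB allocation blocks, a 1MB file spans 256 blocks; reading it in full five times accumulates 1280 blocks read, for a temperature of 5. A file must be read roughly 24 times over before its temperature reaches the residency threshold of 24 discussed later.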

12.9.3.1. Hot File Clustering Stages

At any given time, HFC on a volume can be in one of the following stages.

  • HFC_DISABLED: HFC is currently disabled, typically because the volume is not a root volume. HFC also enters this stage if the volume was unmounted while HFC was in its recording stage.

  • HFC_IDLE: HFC is waiting to start recording. This stage is entered after HFC is initialized during mount. It can also be entered from the evaluation and adoption stages.

  • HFC_BUSY: This is a temporary stage that HFC remains in while performing the work needed to transition from one stage to another.

  • HFC_RECORDING: HFC is recording file temperatures.

  • HFC_EVALUATION: HFC has stopped recording file temperatures and is now processing the list of newly recorded hot files to determine whether to adopt new files or to evict old files before adopting.

  • HFC_EVICTION: HFC is relocating colder and older files to reclaim space in the Hot File area.

  • HFC_ADOPTION: HFC is relocating hotter and newer files to the Hot File area.

Figure 12-31 shows a state diagram of the transitions between the various HFC stages. If the current stage is adoption, eviction, idle, or recording, the transition to the next stage is triggered as a side effect of a sync operation on the volume.

Figure 12-31. Transitions between Hot File Clustering stages


12.9.3.2. The Hot Files B-Tree

HFC uses a B-Tree file, the Hot Files B-Tree, for tracking hot files or, more specifically, hot file forks. Unlike the other HFS+ special files, this tree's extents are not recorded in the volume header: it is an on-disk file that the kernel accesses by its pathname (/.hotfiles.btree).

$ ls -l /.hotfiles.btree
640 -rw-------   1 root  wheel  327680 Oct  7 05:25 /.hotfiles.btree


The Hot Files B-Tree is similar to the Catalog B-Tree in that each fork being tracked has a thread record and a Hot File record. Figure 12-32 shows the key format used by the Hot Files B-Tree. Given a file's CNID and fork type, the thread record for that fork can be looked up by setting the search key's temperature field to the special value HFC_LOOKUPTAG (0xFFFFFFFF). A Hot File thread record's data is a 32-bit unsigned integer that represents the fork's temperature. If no thread record can be found, HFC is not tracking that fork as a hot file. By including the fork's temperature in the search key, the corresponding Hot File record can be looked up. The data for this record is also a 32-bit unsigned integer, but it has no relevance to HFC: it contains one of two values for debugging purposes, either the first four bytes of the file's UTF-8-encoded Unicode name or the ASCII string "????". A sketch of the key structure follows the figure.

Figure 12-32. Searching in the Hot Files B-Tree
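The search key pairs a fork's identity with its temperature. The following sketch follows the HotFileKey declaration in bsd/hfs/hfs_hotfiles.h; treat the exact layout as approximate, and the comments as descriptive rather than verbatim.

#include <sys/types.h>

struct HotFileKey {
    u_int16_t keyLength;    /* key length, excluding this field */
    u_int8_t  forkType;     /* 0 = data fork, 0xFF = resource fork */
    u_int8_t  pad;          /* aligns the following fields on a 32-bit boundary */
    u_int32_t temperature;  /* the fork's temperature, or HFC_LOOKUPTAG
                               (0xFFFFFFFF) to look up the thread record */
    u_int32_t fileID;       /* the file's CNID */
};

Because keys sort primarily by temperature, thread records, whose key temperature is HFC_LOOKUPTAG, gravitate to one end of the tree, while an in-order traversal visits forks from coldest to hottest, which suits eviction.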


hfsdebug can display details and contents (leaf records) of the Hot Files B-Tree. Note that unlike other HFS+ B-Trees, this B-Tree contains a user data record in its header node. The record holds a HotFilesInfo structure [bsd/hfs/hfs_hotfiles.h], sketched after the following output.

$ sudo hfsdebug -b hotfile # HFS+ Hot File Clustering (HFC) B-Tree
...
# User Data Record
  magic                = 0xFF28FF26
  version              = 1
  duration             = 216000 seconds
...
  timeleft             = 42710 seconds
  threshold            = 24
  maxfileblks          = 2560 blocks
  maxfilecnt           = 1000
  tag                  = CLUSTERED HOT FILES B-TREE
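The user data record holds a structure along the following lines (again following bsd/hfs/hfs_hotfiles.h; the layout is approximate and the comments are descriptive). Note how the output above maps onto it: duration is 216000 seconds, the 60-hour HFC_DEFAULT_DURATION; threshold is 24, that is, HFC_MINIMUM_TEMPERATURE; and maxfileblks is 2560 blocks, which is the 10MB HFC_MAXIMUM_FILESIZE for 4KB blocks (see Table 12-3).

#include <sys/types.h>

struct HotFilesInfo {
    u_int32_t magic;        /* identifies the record */
    u_int32_t version;      /* layout version */
    u_int32_t duration;     /* length of the recording period, in seconds */
    u_int32_t timebase;     /* start of the current recording period */
    u_int32_t timeleft;     /* time remaining in the recording period */
    u_int32_t threshold;    /* minimum temperature for Hot File residency */
    u_int32_t maxfileblks;  /* maximum hot file size, in allocation blocks */
    u_int32_t maxfilecnt;   /* maximum number of files to track */
    u_int8_t  tag[32];      /* human-readable tag string */
};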


12.9.3.3. The Working of Hot File Clustering

Let us now look at details of some key HFC operations while referring to Figure 12-31. A typical starting point is the HFC_DISABLED stage, which a volume is considered to be in after it is unmounted.

When a volume is mounted, hfs_recording_init() ensures that it is a root volume, disabling HFC otherwise. Next, if the Hot Files B-Tree already exists, HFC transitions to the idle stage; otherwise, hfs_recording_init() creates a B-Tree and initiates a scan of the Catalog B-Tree. During this scan, HFC examines each leaf record in the catalog, performing the following operations for each file record.

  • It ignores resource forks[24] and empty data forks.

    [24] In Mac OS X 10.4, HFC works only on data forks.

  • It ignores files whose extents are all outside of the Hot File area.

  • It skips over the two journal files, /.journal_info_block and /.journal.

  • It adds a thread record and a Hot File record to the Hot Files B-Tree for each remaining file, all of which will have at least one block within the Hot File area. The initial data values for the thread and Hot File records are HFC_MINIMUM_TEMPERATURE and 0x3f3f3f3f (the ASCII string "????"), respectively.

A volume's current HFC stage is contained in the hfc_stage field of the hfsmount structure, which is initially zero-filled. The numerical value of HFC_DISABLED is zero as well. Therefore, HFC is implicitly disabled at the start for every volume.


After the Catalog B-Tree scan is complete, hfs_recording_init() places HFC in the idle stage. The transition to the next stage, the recording stage, occurs when hfs_hotfilesync() calls hfs_recording_start().

A sync operation results in a call to hfs_hotfilesync(), which is responsible for calling the appropriate functions when the current stage is one of HFC_ADOPTION, HFC_EVICTION, HFC_IDLE, or HFC_RECORDING, as sketched below. A sync operation normally occurs because the update daemon invokes the sync() system call periodically.
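The dispatch amounts to a switch on the volume's current stage. Below is a minimal sketch, assuming a pared-down hfsmount; the stubs stand in for the real routines in bsd/hfs/hfs_hotfiles.c and merely illustrate the stage transitions of Figure 12-31.

/* Stage values; HFC_DISABLED is 0, matching the zero-filled hfsmount. */
enum hfc_stage {
    HFC_DISABLED, HFC_IDLE, HFC_BUSY,
    HFC_RECORDING, HFC_EVALUATION, HFC_EVICTION, HFC_ADOPTION
};

struct hfsmount_sketch {
    enum hfc_stage hfc_stage;
};

/* Stubs standing in for the real transition routines. */
static void hfs_recording_start(struct hfsmount_sketch *m) { m->hfc_stage = HFC_RECORDING; }
static void hfs_recording_stop(struct hfsmount_sketch *m)  { m->hfc_stage = HFC_EVALUATION; }
static void hotfiles_evict(struct hfsmount_sketch *m)      { m->hfc_stage = HFC_ADOPTION; }
static void hotfiles_adopt(struct hfsmount_sketch *m)      { m->hfc_stage = HFC_IDLE; }

static void
hfs_hotfilesync_sketch(struct hfsmount_sketch *hfsmp, int recording_period_over)
{
    switch (hfsmp->hfc_stage) {
    case HFC_IDLE:
        hfs_recording_start(hfsmp);     /* begin a recording period */
        break;
    case HFC_RECORDING:
        if (recording_period_over)
            hfs_recording_stop(hfsmp);  /* move on to evaluation */
        break;
    case HFC_EVICTION:
        hotfiles_evict(hfsmp);          /* reclaim Hot File area space */
        break;
    case HFC_ADOPTION:
        hotfiles_adopt(hfsmp);          /* move hot files into the area */
        break;
    default:
        break;                          /* disabled or busy: nothing to do */
    }
}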


For HFC to record file temperatures on a volume, several conditions must hold.

  • The volume must not be read-only.

  • The volume must be journaled.

  • The volume's hfsmount structure must indicate that the metadata zone has been established for the volume.

  • The number of free allocation blocks on the volume must be at least twice the total number of Hot File area blocks possible.

Table 12-3 shows various other constraints that HFC uses.

Table 12-3. Constraints Used by Hot File Clustering

  • HFC_BLKSPERSYNC (300): The maximum number of allocation blocks that can be moved, whether for eviction or adoption, during a single sync-triggered HFC operation. Adoption never moves only part of a file; therefore, this effectively limits the size of a hot file to 1.2MB for the default allocation block size of 4KB.

  • HFC_FILESPERSYNC (50): The maximum number of files that can be moved during a single adoption or eviction.

  • HFC_DEFAULT_DURATION (60 hours): The default temperature-recording duration.

  • HFC_DEFAULT_FILE_COUNT (1000): The default number of hot files to track.

  • HFC_MAXIMUM_FILE_COUNT (5000): The upper limit on the number of hot files to track.

  • HFC_MAXIMUM_FILESIZE (10MB): The upper limit on the size of files to track during recording; larger files are not tracked.

  • HFC_MINIMUM_TEMPERATURE (24): The threshold temperature for residency in the Hot File area.


hfs_recording_start() allocates memory for data structures used during recording. In particular, an instance of the hotfile_data_t structure [bsd/hfs/hfs_hotfiles.c] is used as an anchor for a runtime recording list of hotfile_entry_t structures [bsd/hfs/hfs_hotfiles.c]. The hfc_recdata field of the hfsmount structure refers to the hotfile_data_t structure.

typedef struct hotfile_entry {
    struct hotfile_entry *left;
    struct hotfile_entry *right;
    u_int32_t  fileid;
    u_int32_t  temperature;
    u_int32_t  blocks;
} hotfile_entry_t;


During the recording stage, read operations accumulate the number of bytes read for each file, whereas write operations reset that count to zero (a file whose size is changing is not a desirable hot file candidate). Moreover, even during read operations, the bytes-read count for a file is initialized to the I/O count of the current read operation, rather than being added cumulatively, if the file's access time is older than the start of the current recording period.
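A minimal sketch of this accounting, with illustrative structure and function names:

#include <stdint.h>

/* Illustrative per-file recording state. */
struct recording_state {
    uint64_t bytes_read;   /* bytes read during the current recording period */
    uint64_t access_time;  /* the file's last access time */
};

/* Called on every read of io_bytes from the file. */
static void
record_read(struct recording_state *f, uint64_t io_bytes,
            uint64_t recording_period_start, uint64_t now)
{
    if (f->access_time < recording_period_start)
        f->bytes_read = io_bytes;   /* count predates this period: restart */
    else
        f->bytes_read += io_bytes;  /* accumulate within the period */
    f->access_time = now;
}

/* Called on every write: a changing file is not a good candidate. */
static void
record_write(struct recording_state *f)
{
    f->bytes_read = 0;
}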

When an active vnode[25] becomes inactive, the HFS+ reclaim operation, hfs_vnop_reclaim(), calls hfs_addhotfile() to add the file, if appropriate, to the runtime recording list. A vnode must satisfy several criteria to be added to the list, such as the following.

[25] Technically, a catalog node (cnode).

  • It must be either a regular file or a symbolic link.

  • It must not be a system file.

  • It must not be a resource fork.

  • It must have a nonzero size that is less than the maximum hot file size.

  • It must have an access time newer than the start of the current recording period.

  • It must have its temperature above the threshold.

If a file on the runtime recording list becomes active, it is removed from the list by hfs_getnewvnode() [bsd/hfs/hfs_cnode.c] through a call to hfs_removehotfile(). Active vnodes are examined separately once the recording period has ended.

After the recording period ends, the next sync-triggered call to hfs_hotfilesync() will invoke hfs_recording_stop(), which first calls hotfiles_collect() to add all active hot files to the recording list. hotfiles_collect() iterates over each active vnode associated with the volume's mount point and calls hfs_addhotfile_internal(), the back end of hfs_addhotfile(), to update the recording list.

Once hotfiles_collect() returns, hfs_recording_stop() moves to the evaluation stage, during which it performs the following operations.

  • It ages the existing records in the Hot Files B-Tree (only the coldest 50%) by halving their temperatures, while limiting the lowest possible temperature to 4 (see the sketch after this list).

  • It sorts the runtime recording list entries by temperature.

  • It identifies the list entries that are already in the Hot Files B-Tree. For each such entry, its B-Tree information is updated, after which the entry is invalidated in the list. This operation results in a refined list of hot files eligible for adoption.
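The aging step reduces to one line per record; a minimal sketch, with an illustrative name for the lower bound:

#include <stdint.h>

#define COLDEST_TEMPERATURE 4  /* illustrative name for the lower bound */

/* Age one Hot Files B-Tree record: halve its temperature, clamped below. */
static uint32_t
age_temperature(uint32_t temperature)
{
    uint32_t aged = temperature / 2;
    return (aged < COLDEST_TEMPERATURE) ? COLDEST_TEMPERATURE : aged;
}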

At this point, the next HFC stage is set to either adoption or eviction, depending on whether the free space available in the Hot File area is, respectively, more than or less than the total space required by the hot files ready for adoption.

If the current stage is HFC_EVICTION, the next sync will trigger the invocation of hotfiles_evict(), which attempts to reclaim space by moving files out of the Hot File area through calls to hfs_relocate(). It begins with the coldest files but may end up evicting all files, depending on how much space must be reclaimed. However, as listed in Table 12-3, at most HFC_BLKSPERSYNC allocation blocks, corresponding to no more than HFC_FILESPERSYNC files, can be moved during a single HFC eviction or adoption. If hotfiles_evict() is unable to finish its work within these constraints, HFC remains in the eviction stage and continues when the next sync occurs. Once eviction finishes, HFC moves to the adoption stage, which is handled by hotfiles_adopt(). Adoption is similar to eviction: it is also performed through hfs_relocate() and is subject to the same transfer constraints. The budget-limited loop structure common to both operations is sketched below.
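The following is a minimal sketch of that loop under stated assumptions: the candidate list, the relocation stub, and all names are illustrative, not the kernel's.

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define HFC_BLKSPERSYNC  300    /* per-sync budget, in allocation blocks */
#define HFC_FILESPERSYNC 50     /* per-sync budget, in files */

struct hot_file {
    struct hot_file *next;      /* next candidate, e.g., sorted by temperature */
    uint32_t blocks;            /* allocation blocks occupied by this file */
};

/* Stub standing in for hfs_relocate(); assume the move succeeds. */
static bool
relocate_file(struct hot_file *f)
{
    (void)f;
    return true;
}

/* Process candidates until done or the per-sync budget runs out. Returns
 * true when all candidates were handled; false means the current stage
 * persists and work resumes on the next sync. */
static bool
process_candidates(struct hot_file *list)
{
    uint32_t blocks_moved = 0;
    uint32_t files_moved = 0;

    for (struct hot_file *f = list; f != NULL; f = f->next) {
        if (files_moved >= HFC_FILESPERSYNC ||
            blocks_moved + f->blocks > HFC_BLKSPERSYNC)
            return false;       /* budget exhausted for this sync */
        if (relocate_file(f)) {
            blocks_moved += f->blocks;
            files_moved++;
        }
    }
    return true;
}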

After adoption, HFC moves to the idle stage, from which it will enter the next recording period.



