Section 8.6. Soft Updates

8.6. Soft Updates

In filesystems, metadata (e.g., directories, inodes, and free block maps) gives structure to raw storage capacity. Metadata provides pointers and descriptions for linking multiple disk sectors into files and identifying those files. To be useful for persistent storage, a filesystem must maintain the integrity of its metadata in the face of unpredictable system crashes, such as power interruptions and operating system failures. Because such crashes usually result in the loss of all information in volatile main memory, the information in nonvolatile storage (i.e., disk) must always be consistent enough to deterministically reconstruct a coherent filesystem state. Specifically, the on-disk image of the filesystem must have no dangling pointers to uninitialized space, no ambiguous resource ownership caused by multiple pointers, and no unreferenced live resources. Maintaining these invariants generally requires sequencing (or atomic grouping) of updates to small on-disk metadata objects.

Traditionally, the UFS filesystem used synchronous writes to properly sequence stable storage changes. For example, creating a file involves first allocating and initializing a new inode and then filling in a new directory entry to point to it. With the synchronous write approach, the filesystem forces an application that creates a file to wait for the disk write that initializes the on-disk inode. As a result, filesystem operations like file creation and deletion proceed at disk speeds rather than processor/memory speeds [McVoy & Kleiman, 1991; Ousterhout, 1990; Seltzer et al., 1993]. Since disk access times are long compared to the speeds of other computer components, synchronous writes reduce system performance.

The metadata update problem can also be addressed with other mechanisms. For example, one can eliminate the need to keep the on-disk state consistent by using NVRAM technologies, such as an uninterruptible power supply or Flash RAM [Moran et al., 1990; Wu & Zwaenepoel, 1994]. Filesystem operations can proceed as soon as the block to be written is copied into the stable store, and updates can propagate to disk in any order and whenever it is convenient. If the system fails, unfinished disk operations can be completed from the stable store when the system is rebooted.

Another approach is to group each set of dependent updates as an atomic operation with some form of write-ahead logging [Chutani et al., 1992; Hagmann, 1987] or shadow-paging [Chamberlin & Astrahan, 1981; Rosenblum & Ousterhout, 1992; Stonebraker, 1987]. These approaches augment the on-disk state with a log of filesystem updates on a separate disk or in stable store. Filesystem operations can then proceed as soon as the operation to be done is written into the log. If the system fails, unfinished filesystem operations can be completed from the log when the system is rebooted. Many modern filesystems successfully use write-ahead logging to improve performance compared to the synchronous write approach.

In Ganger & Patt [1994], an alternative approach called soft updates was proposed and evaluated in the context of a research prototype. Following a successful evaluation, a production version of soft updates was written for FreeBSD. With soft updates, the filesystem uses delayed writes (i.e., write-back caching) for metadata changes, tracks dependencies between updates, and enforces these dependencies at write-back time. Because most metadata blocks contain many pointers, cyclic dependencies occur frequently when dependencies are recorded only at the block level. Therefore, soft updates track dependencies on a per-pointer basis, which allows blocks to be written in any order. Any still-dependent updates in a metadata block are rolled back before the block is written and rolled forward afterwards. Thus, dependency cycles are eliminated as an issue. With soft updates, applications always see the most current copies of metadata blocks, and the disk always sees copies that are consistent with its other contents.

Update Dependencies in the Filesystem

Several important filesystem operations consist of a series of related modifications to separate metadata structures. To ensure recoverability in the presence of unpredictable failures, the modifications often must be propagated to stable storage in a specific order. For example, when creating a new file, the filesystem allocates an inode, initializes it, and constructs a directory entry that points to it. If the system goes down after the new directory entry has been written to disk but before the initialized inode is written, consistency may be compromised, since the contents of the on-disk inode are unknown. To ensure metadata consistency, the initialized inode must reach stable storage before the new directory entry. We refer to this requirement as an update dependency, because safely writing the directory entry depends on first writing the inode. The ordering constraints map onto three simple rules:

Never point to a structure before it has been initialized (e.g., an inode must be initialized before a directory entry references it).
Never reuse a resource before nullifying all previous pointers to it (e.g., an inode's pointer to a data block must be nullified before that disk block may be reallocated for a new inode).
Never reset the old pointer to a live resource before the new pointer has been set (e.g., when renaming a file, do not remove the old name for an inode until after the new name has been written).

There are eight filesystem activities that require update ordering to ensure postcrash recoverability: file creation, file removal, directory creation, directory removal, file/directory rename, block allocation, indirect block manipulation, and free map management.

The two main resources managed by the filesystem are inodes and data blocks. Two bitmaps are used to maintain allocation information about these resources. For each inode in the filesystem, the inode bitmap has a bit that is set if the inode is in use and cleared if it is free. For each block in the filesystem, the data block bitmap has a bit that is set if the block is free and cleared if it is in use. Each filesystem is broken down into fixed-size pieces called cylinder groups (described more fully in Section 8.9). Each cylinder group has a cylinder group block that contains the bitmaps for the inodes and data blocks residing within that cylinder group. For a large filesystem, this organization allows just those sub-pieces of the filesystem bitmap that are actively being used to be brought into the kernel memory. Each of these active cylinder group blocks is stored in a separate I/O buffer and can be written to disk independently of the other cylinder group blocks.

When a file is created, three metadata structures located in separate blocks are modified. The first is a new inode, which is initialized with its type field set to the new file type, its link count set to one to show that it is live (i.e., referenced by some directory entry), its permission fields set as specified, and all other fields set to default values. The second is the inode bitmap, which is modified to show that the inode has been allocated. The third is a new directory entry, which is filled in with the new name and a pointer to the new inode. To ensure that the bitmaps always reflect all allocated resources, the bitmap must be written to disk before the inode or directory entry. Because the inode is in an unknown state until after it has been initialized on the disk, rule #1 specifies that there is an update dependency requiring that the relevant inode be written before the relevant directory entry. Although not strictly necessary, most BSD fast filesystem implementations also immediately write the directory block before the system call creating the file returns. This second synchronous write ensures that the filename is on stable storage if the application later does an fsync system call. If it were not done, then the fsync call would have to be able to find all the unwritten directory blocks containing a name for the file and write them to disk. A similar update dependency between inode and directory entry exists when adding a second name for the same file (a.k.a. a hard link), since the addition of the second name requires the filesystem to increment the link count in the inode and write the inode to disk before the entry may appear in the directory.

When a file is deleted, a directory block, an inode block, and one or more cylinder group bitmaps are modified. In the directory block, the relevant directory entry is "removed" by reclaiming its space or by nullifying the inode pointer. In the inode block, the relevant inode's type field, link count, and block pointers are zeroed out. The deleted file's blocks and inode are then added to the appropriate free block/inode maps. Rule #2 specifies that there are update dependencies between the directory entry and the inode and between the inode and any modified free map bits. To keep the link count conservatively high (and reduce complexity in practice), the update dependency between a directory entry and inode also exists when removing one of multiple names (hard links) for a file.

Creation and removal of directories is largely as just described for regular files. However, the .. entry is a link from the child directory to the parent, which adds additional update dependencies. Specifically, during creation, the parent's link count must be incremented on disk before the new directory's .. pointer is written. Likewise, during removal, the parent's link count must be decremented after the removed directory's .. pointer is nullified. (Note that this nullification is implicit in deleting the child directory's pointer to the corresponding directory block.)

When a new block is allocated, its bitmap location is updated to reflect that it is in use and the block's contents are initialized with newly written data or zeros. In addition, a pointer to the new block is added to an inode or indirect block (see following). To ensure that the on-disk bitmap always reflects allocated resources, the bitmap must be written to disk before the pointer. Also, because the contents of the newly allocated disk location are unknown, rule #1 specifies an update dependency between the new block and the pointer to it. Because enforcing this update dependency with synchronous writes can reduce data creation throughput by a factor of two [Ganger & Patt, 1994], many implementations ignore it for regular data blocks. This implementation decision reduces integrity and security, since newly allocated blocks generally contain previously deleted file data. Soft updates allow all block allocations to be protected in this way with near-zero performance reduction.

Manipulation of indirect blocks does not introduce fundamentally different update dependencies, but they do merit separate discussion. Allocation, both of indirect blocks and of blocks pointed to by indirect blocks, is as just discussed. File deletion, and specifically deallocation, is more interesting for indirect blocks. Because the inode reference is the only way to identify indirect blocks and blocks connected to them (directly or indirectly), nullifying the inode's pointer to an indirect block is enough to eliminate all recoverable pointers to said blocks. Once the pointer is nullified on disk, all its blocks can be freed. The exception to this rule is when a file is partially truncated. Here, the pointer from the inode to the indirect block remains. Some of the indirect block pointers will be zeroed and and their corresponding blocks freed while the rest of the pointers are left intact.

When a file is being renamed, two directory entries are affected. A new entry (with the new name) is created and set to point to the relevant inode, and the old entry is removed. Rule #3 states that the new entry should be written to disk before the old entry is removed to avoid having the file unreferenced on reboot. If link counts are being kept conservatively, rename involves at least four disk updates in sequence: one to increment the inode's link count, one to add the new directory entry, one to remove the old directory entry, and one to decrement the link count. If the new name already existed, then the addition of the new directory entry also acts as the first step of file removal as discussed above. Interestingly, rename is the one POSIX file operation that should have an atomic update to multiple user-visible metadata structures to provide ideal semantics. POSIX does not require said semantics and most implementations, including FreeBSD, cannot provide it.

On an active filesystem, the bitmaps change constantly. Thus, the copy of the bitmaps in the kernel memory often differs from the copy that is stored on the disk. If the system halts without writing out the incore state of the bitmaps, some of the recently allocated inodes and data blocks may not be reflected in the out-of-date copies of the bitmaps on the disk. So the filesystem check program, fsck, must be run over all the inodes in the filesystem to ascertain which inodes and blocks are in use and bring the bitmaps up to date [McKusick & Kowalski, 1994]. An added benefit of soft updates is that it tracks the writing of the bitmaps to disk and uses this information to ensure that no newly allocated inodes or pointers to newly allocated data blocks will be written to disk until after the bitmap that references them has been written to disk. This guarantee ensures that there will never be an allocated inode or data block that is not marked in the on-disk bitmap. This guarantee, together with the other guarantees made by the soft update code, means that it is no longer necessary to run fsck after a system crash.

The next 12 subsections describe the soft updates data structures and their use in enforcing the update dependencies just described. The structures and algorithms described eliminate all synchronous write operations from the filesystem except for the partial truncation of a file and the fsync system call, which explicitly requires that all the state of a file be committed to disk before the system call returns.

The key attribute of soft updates is dependency tracking at the level of individual changes within cached blocks. Thus, for a block containing 64 inodes, the system can maintain up to 64 dependency structures with one for each inode in the buffer. Similarly for a buffer containing a directory block containing 50 names, the system can maintain up to 50 dependency structures with one for each name in the directory. With this level of detailed dependency information, circular dependencies between blocks are not problematic. For example, when the system wishes to write a buffer containing inodes, those inodes that can be safely written can go to the disk. Any inodes that cannot yet be safely written are temporarily rolled back to their safe values while the disk write proceeds. After the disk write completes, such inodes are rolled forward to their current values. Because the buffer is locked throughout the time that the contents are rolled back, the disk write is being done, and the contents are rolled forward, any processes wishing to use the buffer will be blocked from accessing it until it has been returned to its correct state.

Dependency Structures

The soft updates implementation uses a variety of data structures to track pending update dependencies among filesystem structures. Table 8.2 (on page 330) lists the dependency structures used in the BSD soft updates implementation, their main functions, and the types of blocks with which they can be associated. These dependency structures are allocated and associated with blocks as various file operations are completed. They are connected to the in-core blocks with which they are associated by a pointer in the corresponding buffer header. Two common aspects of all listed dependency structures are the worklist structure and the states used to track the progress of a dependency.

Table 8.2. Soft updates and dependency tracking.
Name	Function	Associated Structures
bmsafemap	track bitmap dependencies (points to lists of dependency structures for recently allocated blocks and inodes)	cylinder group block
inodedep	track inode dependencies (information and list head pointers for all inode-related dependencies, including changes to the link count, the block pointers, and the file size)	inode block
allocdirect	track inode-referenced blocks (linked into lists pointed to by an inodedep and a bmsafemap to track inode's dependence on the block and bitmap being written to disk)	data block or indirect block or directory block
indirdep	track indirect block dependencies (points to list of dependency structures for recently allocated blocks with pointers in the indirect block)	indirect block
allocindir	track indirect-block referenced block (linked into lists pointed to by an indirdep and a bmsafemap to track the indirect block's dependence on that block and bitmap being written to disk)	data block or indirect block or directory block
pagedep	track directory block dependencies (points to lists of diradd and dirrem structures)	directory block
diradd	track dependency between a new directory entry and the referenced inode	inodedep and directory block
mkdir	track new directory creation (used in addition to standard diradd structure when doing a mkdir)	inodedep and directory block
dirrem	track dependency between a deleted directory entry and the unlinked inode	first pagedep then tasklist
freefrag	tracks a single block or fragment to be freed as soon as the corresponding block (containing the inode with the now-replaced pointer to it) is written to disk	first inodedep then tasklist
freeblks	tracks all the block pointers to be freed as soon as the corresponding block (containing the inode with the now-zeroed pointers to them) is written to disk	first inodedep then tasklist
freefile	tracks the inode that should be freed as soon as the corresponding block (containing the inode block with the now-reset inode) is written to disk	first inodedep then tasklist

The worklist structure is really just a common header included as the first item in each dependency structure. It contains a set of linkage pointers and a type field to show the type of structure in which it is embedded. The worklist structure allows several different types of dependency structures to be linked together into a single list. The soft updates code can traverse one of these heterogenous lists, using the type field to determine which kind of dependency structure it has encountered, and take the appropriate action with each.

The typical use for the worklist structure is to link together a set of dependencies associated with a buffer. Each buffer in the system has a pointer to a worklist added to it. Any dependencies associated with that buffer are linked onto its worklist. After the buffer has been locked and just before the buffer is to be written, the I/O system passes the buffer to the soft updates code to let it know that a disk write is about to be started. The soft updates code then traverses the list of dependencies associated with the buffer and does any needed roll-back operations. After the disk write completes but before the buffer is unlocked, the I/O system calls the soft updates code to let it know that a write has completed. The soft updates code then traverses the list of dependencies associated with the buffer, does any needed roll-forward operations, and deallocates any dependencies that are fulfilled by the data in the buffer having been written to disk.

Another important list maintained by the soft updates code is the tasklist that contains background tasks for the work daemon. Dependency structures are generally added to the tasklist during the disk write completion routine, describing tasks that have become safe given the disk update but that may need to block for locks or I/O and therefore cannot be completed during the interrupt handler. Once per second, the syncer daemon (in its dual role as the soft updates work daemon) wakes up and calls into the soft updates code to process any items on the tasklist.

The work done for a dependency structure on this list is type-dependent. For example, for a freeblks structure, the listed blocks are marked free in the block bitmaps. For a dirrem structure, the associated inode's link count is decremented, possibly triggering file deletion.

Most dependency structures have a set of flags that describe the state of completion of the corresponding dependency. Dirty cache blocks can be written to the disk at any time. When the I/O system hands the buffer to the soft updates code (before and after a disk write), the states of the associated dependency structures determine what actions are taken. Although the specific meanings vary from structure to structure, the three main flags and their general meanings are:

ATTACHED	The ATTACHED flag shows that the buffer with which the dependency structure is associated is not currently being written. When a disk write is started for a buffer with a dependency that must be rolled back, the ATTACHED flag is cleared in the dependency structure to show that it has been rolled back in the buffer. When the disk write completes, updates described by dependency structures that have the ATTACHED flag cleared are rolled forward, and the ATTACHED flag is set. Thus, a dependency structure can never be deleted while its ATTACHED flag is cleared, since the information needed to do the roll-forward operation would then be lost.
DEPCOMPLETE	The DEPCOMPLETE flag shows that all associated dependencies have been completed. When a disk write is started, the update described by a dependency structure is rolled back if the DEPCOMPLETE flag is clear. For example, in a dependency structure that is associated with newly allocated inodes or data blocks, the DEPCOMPLETE flag is set when the corresponding bitmap has been written to disk.
COMPLETE	The COMPLETE flag shows that the update being tracked has been committed to the disk. For some dependencies, updates will be rolled back during disk writes when the COMPLETE flag is clear. For example, for a newly allocated data block, the COMPLETE flag is set when the contents of the block have been written to disk.

In general, the flags are set as disk writes complete, and a dependency structure can be deallocated only when its ATTACHED, DEPCOMPLETE, and COMPLETE flags are all set. Consider the example of a newly allocated data block that will be tracked by an allocdirect structure. The ATTACHED flag will initially be set when the allocation occurs. The DEPCOMPLETE flag will be set after the bitmap allocating that new block is written. The COMPLETE flag will be set after the contents of the new block are written. If the inode claiming the newly allocated block is written before both the DEPCOMPLETE and COMPLETE flags are set, the ATTACHED flag will be cleared while the block pointer in the inode is rolled back to zero, the inode is written, and the block pointer in the inode is rolled forward to the new block number. Where different, the specific meanings of these flags in the various dependency structures are described in the subsections that follow.

Bitmap Dependency Tracking

Bitmap updates are tracked by the bmsafemap structure shown in Figure 8.18. Each buffer containing a cylinder group block will have its own bmsafemap structure. As with every dependency structure, the first entry in the bmsafemap structure is a worklist structure. Each time an inode, direct block, or indirect block is allocated from the cylinder group, a dependency structure is created for that resource and linked onto the appropriate bmsafemap list. Each newly allocated inode will be represented by an inodedep structure linked to the bmsafemap inodedep head list. Each newly allocated block directly referenced by an inode will be represented by an allocdirect structure linked to the bmsafemap allocdirect head list. Each newly allocated block referenced by an indirect block will be represented by an allocindir structure linked to the bmsafemap allocindir head list. Because of the code's organization, there is a small window between the time a block is first allocated and the time at which its use is known. During this period of time, it is described by a newblk structure linked to the bmsafemap new blk head list. After the kernel chooses to write the cylinder group block, the soft updates code will be notified when the write has completed. At that time, the code traverses the inode, direct block, indirect block, and new block lists, setting the DEPCOMPLETE flag in each dependency structure and removing said dependency structure from its dependency list. Having cleared all its dependency lists, the bmsafemap structure can be deallocated. There are multiple lists as it is slightly faster and more type-safe to have lists of specific types.

Figure 8.18. Bitmap update dependencies.

Inode Dependency Tracking

Inode updates are tracked by the inodedep structure shown in Figure 8.19. The worklist and state fields are as described for dependency structures in general. The filesystem ptr and inode number fields identify the inode in question. When an inode is newly allocated, its inodedep is attached to the inodedep head list of a bmsafemap structure. Here, deps list chains additional new inodedep structures and dep bp points to the cylinder group block that contains the corresponding bitmap. Other inodedep fields are explained in later subsections.

Figure 8.19. Inode update dependencies.

Before detailing the rest of the dependencies associated with an inode, we need to discuss the steps involved in updating an inode on disk as pictured in Figure 8.20.

The kernel calls the vnode operation, VOP_UPDATE, which requests that the disk resident part of an inode (referred to as a dinode) be copied from its in-memory vnode structure to the appropriate disk buffer. This disk buffer holds the contents of an entire disk block, which is usually big enough to include 64 dinodes. Some dependencies are fulfilled only when the inode has been written to disk. These dependencies need dependency structures to track the progress of the writing of the inode. Therefore, during step 1, a soft update routine, softdep_update_inodeblock(), is called to move allocdirect structures from the incore update list to the buffer update list and to move freefile, freeblks, freefrag, diradd, and mkdir structures (described below) from the inode wait list to the buffer wait list.
The kernel calls the vnode operation, VOP_STRATEGY, that prepares to write the buffer containing the dinode, pointed to by bp in Figure 8.20. A soft updates routine, softdep_disk_io_initiation(), identifies inodedep dependencies and calls initiate_write_inodeblock() to do roll-backs as necessary.
Output completes on the buffer referred to by bp and the I/O system calls a routine, biodone(), to notify any waiting processes that the write has finished. The biodone() routine then calls a soft updates routine, softdep_disk_write_complete(), that identifies inodedep dependencies and calls handle_written_inodeblock() to revert roll-backs and clear any dependencies on the buffer wait and buffer update lists.

Figure 8.20. Inode update steps.

Direct Block Dependency Tracking

Figure 8.21 illustrates the dependency structures involved in allocation of direct blocks. Recall that the key dependencies are that, before the on-disk inode points to a newly allocated block, both the corresponding bitmap block and the new block itself must be written to the disk. The order in which the two dependencies complete is not important. The figure introduces the allocdirect structure that tracks blocks directly referenced by the inode. The three recently allocated logical blocks (1, 2, and 3) shown are each in a different state. For logical block 1, the bitmap block dependency is complete (as shown by the DEPCOMPLETE flag being set), but the block itself has not yet been written (as shown by the COMPLETE flag being cleared). For logical block 2, both dependencies are complete. For logical block 3, neither dependency is complete, so the corresponding allocdirect structure is attached to a bmsafemap allocdirect head list (recall that this list is traversed to set DEPCOMPLETE flags after bitmap blocks are written). The COMPLETE flag for logical blocks 1 and 3 will be set when their initialized data blocks are written to disk. The figure also shows that logical block 1 existed at a time that VOP_UPDATE was called, which is why its allocdirect structure resides on the inodedep buffer update list. Logical blocks 2 and 3 were created after the most recent call to VOP_UPDATE and thus their structures reside on the inodedep incore update list.

Figure 8.21. Direct block allocation dependencies.

For files that grow in small steps, a direct block pointer may first point to a fragment that is later promoted to a larger fragment and eventually to a full-sized block. When a fragment is replaced by a larger fragment or a full-sized block, it must be released back to the filesystem. However, it cannot be released until the new fragment or block has had its bitmap entry and contents written and the inode claiming the new fragment or block has been written to the disk. The fragment to be released is described by a freefrag structure (not shown). The freefrag structure is held on the freefrag list of the allocdirect for the block that will replace it until the new block has had its bitmap entry and contents written. The freefrag structure is then moved to the inode wait list of the inodedep associated with its allocdirect structure where it migrates to the buffer wait list when VOP_UPDATE is called. The freefrag structure eventually is added to the tasklist after the buffer holding the inode block has been written to disk. When the tasklist is serviced, the fragment listed in the freefrag structure is returned to the free-block bitmap.

Indirect Block Dependency Tracking

Figure 8.22 shows the dependency structures involved in allocation of indirect blocks that includes the same dependencies as with direct blocks. This figure introduces two new dependency structures. A separate allocindir structure tracks each individual block pointer in an indirect block. The indirdep structure manages all the allocindir dependencies associated with an indirect block. The figure shows a file that recently allocated logical blocks 14 and 15 (the third and fourth entries, at offsets 8 and 12, in the first indirect block). The allocation bitmaps have been written for logical block 14 (as shown by its DEPCOMPLETE flag being set), but not for block 15. Thus, the bmsafemap structure tracks the allocindir structure for logical block 15. The contents of logical block 15 have been written to disk (as shown by its COMPLETE flag being set), but not those of block 14. The COMPLETE flag will be set in 14's allocindir structure once the block is written.

Figure 8.22. Indirect block allocation dependencies.

The list of allocindir structures tracked by an indirdep structure can be long (e.g., up to 2048 entries for 8-Kbyte indirect blocks). To avoid traversing lengthy dependency structure lists in the I/O routines, an indirdep structure maintains a second version of the indirect block: the saved data ptr always points to the buffer's up-to-date copy and the safe copy ptr points to a version that includes only the subset of pointers that can be safely written to disk (and NULL for the others). The former is used for all filesystem operations and the latter is used for disk writes. When the allocindir head list becomes empty, the saved data ptr and safe copy ptr point to identical blocks and the indirdep structure (and the safe copy) can be deallocated.

Dependency Tracking for New Indirect Blocks

Figure 8.23 shows the structures associated with a file that recently expanded into its single-level indirect block. Specifically, this expansion involves inodedep and indirdep structures to manage dependency structures for the inode and indirect block, an allocdirect structure to track the dependencies associated with the indirect block's allocation, and an allocindir structure to track the dependencies associated with a newly allocated block pointed to by the indirect block. These structures are used as described in the previous three subsections. Neither the indirect block nor the data block that it references have had their bitmaps set, so they do not have their DEPCOMPLETE flag set and are tracked by a bmsafemap structure. The bitmap entry for the inode has been written, so the inodedep structure has its DEPCOMPLETE flag set. The use of the buffer update head list by the inodedep structure shows that the in-core inode has been copied into its buffer by a call to VOP_UPDATE. Neither of the dependent pointers (from the inode to the indirect block and from the indirect block to the data block) can be safely included in disk writes yet, since the corresponding COMPLETE and DEPCOMPLETE flags are not set. Only after the bitmaps and the contents have been written will all the flags be set and the dependencies complete.

Figure 8.23. Dependencies for a file expanding into an indirect block.

New Directory Entry Dependency Tracking

Figure 8.24 shows the dependency structures for a directory that has two new entries, foo and bar. This figure introduces two new dependency structures. A separate diradd structure tracks each individual directory entry in a directory block. The pagedep structure manages all the diradd dependencies associated with a directory block. For each new file, there is an inodedep structure and a diradd structure. Both files' inodes have had their bitmap's written to disk, as shown by the DEPCOMPLETE flags being set in their inodedep structures. The inode for foo has been updated with VOP_UPDATE but has not yet been written to disk, as shown by the COMPLETE flag on its inodedep structure not being set and by its diradd structure still being linked onto its buffer wait list. Until the inode is written to disk with its increased link count, the directory entry may not appear on disk. If the directory page is written, the soft updates code will roll back the creation of the new directory entry for foo by setting its inode number to zero. After the disk write completes, the roll-back is reversed by restoring the correct inode number for foo.

Figure 8.24. Dependencies associated with adding new directory entries.

The inode for bar has been written to disk, as shown by the COMPLETE flag being set in its inodedep and diradd structures. When the inode write completed, the diradd structure for bar was moved from the inodedep buffer wait list to the inodedep pending ops list. The diradd also moved from the pagedep diradd list to the pagedep pending ops list. Since the inode has been written, it is safe to allow the directory entry to be written to disk. The diradd entries remain on the inodedep and pagedep pending ops list until the new directory entry is written to disk. When the entry is written, the diradd structure is freed. One reason to maintain the pending ops list is so that when an fsync system call is done on a file, the kernel is able to ensure that both the file's contents and directory reference(s) are written to disk. The kernel ensures that the reference(s) are written by doing a lookup to see if there is an inodedep for the inode that is the target of the fsync. If it finds an inodedep, it checks to see if it has any diradd dependencies on either its pending ops or buffer wait lists. If it finds any diradd structures, it follows the pointers to their associated pagedep structures and flushes out the directory inode associated with that pagedep. This backtracking recurses on the directory inodedep.

New Directory Dependency Tracking

Figure 8.25 (on page 340) shows the two additional dependency structures involved with creating a new directory. For a regular file, the directory entry can be committed as soon as the newly referenced inode has been written to disk with its increased link count. When a new directory is created, there are two additional dependencies: writing the directory data block containing the . and .. entries (MKDIR_BODY) and writing the parent inode with the increased link count for .. (MKDIR_PARENT). These additional dependencies are tracked by two mkdir structures linked to the associated diradd structure. The soft-updates design dictates that any given dependency will correspond to a single buffer at any given point in time. Thus, two structures are used to track the action of the two different buffers. When each completes, it clears its associated flag in the diradd structure. The MKDIR_PARENT is linked to the inodedep structure for the parent directory. When that directory inode is written, the link count will be updated on disk. The MKDIR_BODY is linked to the buffer that contains the initial contents of the new directory. When that buffer is written, the entries for . and .. will be on disk. The last mkdir to complete sets the DEPCOMPLETE flag in the diradd structure so that the diradd structure knows that these extra dependencies have been completed. Once these extra dependencies have been completed, the handling of the directory diradd proceeds exactly as it would for a regular file.

Figure 8.25. Dependencies associated with adding a new directory.

All mkdir structures in the system are linked together on a list. This list is needed so that a diradd can find its associated mkdir structures and deallocate them if it is prematurely freed (e.g., if a mkdir system call is immediately followed by a rmdir system call of the same directory). Here, the deallocation of a diradd structure must traverse the list to find the associated mkdir structures that reference it. The deletion would be faster if the diradd structure were simply augmented to have two pointers that referenced the associated mkdir structures. However, these extra pointers would double the size of the diradd structure to speed an infrequent operation.

Directory Entry Removal Dependency Tracking

Figure 8.26 shows the dependency structures involved with the removal of a directory entry. This figure introduces one new dependency structure, the dirrem structure, and a new use for the pagedep structure. A separate dirrem structure tracks each individual directory entry to be removed in a directory block. In addition to previously described uses, pagedep structures associated with a directory block manage all dirrem structures associated with the block. After the directory block is written to disk, the dirrem request is added to the work daemon's tasklist list. For file deletions, the work daemon will decrement the inode's link count by one. For directory deletions, the work daemon will decrement the inode's link count by two, truncate its size to zero, and decrement the parent directory's link count by one. If the inode's link count drops to zero, the resource reclamation activities described in the "file and directory inode reclamation" section are started.

Figure 8.26. Dependencies associated with removing a directory entry.

File Truncation

When a file is truncated to zero length without soft updates enabled, the block pointers in its inode are saved in a temporary list, the pointers in the inode are zeroed, and the inode is synchronously written to disk. When the inode write completes, the list of its formerly claimed blocks are added to the free-block bitmap. With soft updates, the block pointers in the inode being truncated are copied into a freeblks structure, the pointers in the inode are zeroed, and the inode is marked dirty. The freeblks structure is added to the inode wait list, and it migrates to the buffer wait list when VOP_UPDATE is called. The freeblks structure is eventually added to the tasklist after the buffer holding the inode block has been written to disk. When the tasklist is serviced, the blocks listed in the freeblks structure are returned to the free-block bitmap.

File and Directory Inode Reclamation

When the link count on a file or directory drops to zero, its inode is zeroed to show that it is no longer in use. When running without soft updates, the zeroed inode is synchronously written to disk, and the inode is marked as free in the bitmap. With soft updates, information about the inode to be freed is saved in a freefile structure. The freefile structure is added to the inode wait list, and it migrates to the buffer wait list when VOP_UPDATE is called. The freefile structure eventually is added to the tasklist after the buffer holding the inode block has been written to disk. When the tasklist is serviced, the inode listed in the freefile structure is returned to the free inode map.

Directory Entry Renaming Dependency Tracking

Figure 8.27 shows the structures involved in renaming a file. The dependencies follow the same series of steps as those for adding a new file entry, with two variations. First, when a roll-back of an entry is needed because its inode has not yet been written to disk, the entry must be set back to the previous inode number rather than to zero. The previous inode number is stored in a dirrem structure. The DIRCHG flag is set in the diradd structure so that the roll-back code knows to use the old inode number stored in the dirrem structure. The second variation is that, after the modified directory entry is written to disk, the dirrem structure is added to the work daemon's tasklist list so that the link count of the old inode will be decremented as described in the section on "Directory Entry Removal Dependency Tracking."

Figure 8.27. Dependencies associated with renaming a directory entry.

Tracking File Removal Dependencies

This section will go through a short example describing the dependencies that track the removal of a file. These three ordering constraints must be maintained:

The filename in the on-disk copy of the directory must be deleted.
The inode describing the file must be deallocated by zeroing out its on-disk dinode.
The blocks formerly referenced by the inode for the file must be released to the free-space bitmap, and the inode needs to be released to the free-inode bitmap.

These constraints are maintained by soft updates as follows:

The block of the directory containing the name to be deleted is read into a kernel buffer. The entry is deleted by changing the entry that precedes it to point to the entry that follows it (see Section 8.3). Before releasing the buffer, a set of dependencies needs to be constructed, as shown in Figure 8.26. If this deletion is the first dependency for the directory block, it needs to have a pagedep structure allocated that is linked onto the dependency list for the buffer. Next a dirrem structure is allocated that records the inode number of the entry being deleted. The dirrem structure is linked onto the dirrem list of the pagedep structure for the directory block. The buffer is then marked dirty, unlocked, and released.
Some time in the next 30 seconds, the kernel will decide to write the dirty directory buffer. When the write completes, the pagedep associated with the buffer is passed to soft updates for processing. One processing step is to handle each of the dirrem entries. Each dirrem entry causes the inode formerly referenced by the directory to have its reference count decremented by one. If the reference count drops to zero (meaning that the last name for the file was removed), then the inode needs to be deallocated and freed. Before zeroing out the contents of the on-disk dinode, its list of allocated blocks must be saved in a freeblks structure and information needed to free the inode must be saved in a freefile structure. The block of the filesystem containing the dinode to be freed is read into a kernel buffer, as shown in Figure 8.20. The part of the buffer containing the dinode is zeroed out. If the deallocation is the first dependency for the inode, it needs to have an inodedep structure allocated that is linked onto the dependency list for the buffer. The freeblks and freefile structures are linked onto the buffer wait list of the inodedep structure. The buffer is then marked dirty, unlocked, and released. The dirrem structure is freed as is the pagedep structure if it is no longer tracking any dependencies.
Some time in the next 30 seconds, the kernel will decide to write the buffer containing the zeroed dinode. When the write completes, the inodedep associated with the buffer is passed to soft updates for processing. One processing step is to handle each of the buffer wait entries. The handling of the freeblks entry causes all its listed blocks to be marked free in the appropriate cylinder-group bitmaps. The handling of the freefile entry causes the deleted inode to be marked free in the appropriate cylinder-group bitmap. The freeblks and freefile structures are freed as is the inodedep structure if it is no longer tracking any dependencies.

The file has now been completely removed and ceases to be tracked by soft updates.

Fsync Requirements for Soft Updates

The fsync system call requests that the specified file be written to stable storage and that the system call not return until all its associated writes have completed. The task of completing an fsync requires more than simply writing all the file's dirty data blocks to disk. It also requires that any unwritten directory entries that reference the file also be written, as well as any unwritten directories between the file and the root of the filesystem. Simply getting the data blocks to disk can be a major task. First, the system must check to see if the bitmap for the inode has been written, finding the bitmap and writing it if necessary. It must then check for, find, and write the bitmaps for any new blocks in the file. Next, any unwritten data blocks must go to disk. Following the data blocks, any first-level indirect blocks that have newly allocated blocks in them are written, followed by any double indirect blocks, then triple-level indirect blocks. Finally, the inode can be written that will ensure that the contents of the file are on stable store. Ensuring that all names for the file are also on stable store requires data structures that can determine whether there are any uncommitted names and, if so, in which directories they occur. For each directory containing an uncommitted name, the soft updates code must go through the same set of flush operations that it has just done on the file itself.

The addition of extended attribute data to the inode required that the soft updates code be extended so that it could ensure the integrity of these new data blocks. As with the file data blocks, soft updates ensure that the extended data blocks and the bitmaps, which show that they are in use, are written to disk before they are claimed by the inode. Soft updates also ensure that any updated extended attribute data are committed to disk as part of an fsync of the file.

Although the fsync system call must ultimately be done synchronously, this requirement does not mean that the flushing operations must each be done synchronously. Instead, whole sets of bitmaps or data blocks are pushed into the disk queue, and the soft updates code then waits for all the writes to complete. This approach is more efficient because it allows the disk subsystem to sort all the write requests into the most efficient order for writing. Still, the fsync part of the soft updates code generates most of the remaining synchronous writes in the filesystem.

Another issue related to fsync is unmounting of filesystems. Doing an unmount requires finding and flushing all the dirty files that are associated with the filesystem. Flushing the files may lead to the generation of background activity, such as removing files whose reference count drops to zero as a result of their nullified directory entries being written. Thus, the system must be able to find all background activity requests and process them. Even on a quiescent filesystem, several iterations of file flushes followed by background activity may be required. FreeBSD allows for the forcible unmount of a filesystem, which may take place while the filesystem is actively in use. The ability to cleanly suspend operations on an active filesystem is described in Section 8.7.

File Removal Requirements for Soft Updates

For correct operation, a directory's .. entry should not be removed until after the directory is persistently unlinked. Correcting this dependency ordering in the soft updates code introduced a delay of up to two minutes between the time a directory is unlinked and the time that it is really deallocated (when the .. entry is removed). Until the directory's .. entry is really removed, the link count on its parent will not be decremented. Thus, when a user removes one or more directories, the link count of their former parent still reflects their being present for several minutes. This delayed link count decrement not only causes some questions from users, but also causes some applications to break. For example, the rmdir system call will not remove a directory that has a link count over two. This restriction means that a directory that recently had directories removed from it cannot be removed until its former directories have been fully deleted.

To fix these link-count problems, the soft updates implementation augments the inode nlink field with a new field called effnlink. The nlink field is still stored as part of the on-disk metadata and reflects the true link count of the inode. The effnlink field is maintained only in kernel memory and reflects the final value that the nlink field will reach once all its outstanding operations are completed. All interactions with user applications report the value of the effnlink field, which results in the illusion that everything has happened immediately.

When a file is removed on a filesystem running with soft updates, the removal appears to happen quickly, but the process of removing the file and returning its blocks to the free list may take up to several minutes. Before UFS2, the space held by the file did not show up in the filesystem statistics until the removal of the file had been completed. Thus, applications that clean up disk space such as the news expiration program would often vastly overshoot their goal. They work by removing files and then checking to see if enough free space has showed up. Because of the time lag in having the free space recorded, they would remove far too many files. To resolve problems of this sort, the soft updates code now maintains a counter that keeps track of the amount of space that is held by the files that it is in the process of removing. This counter of pending space is added to the actual amount of free space as reported by the kernel (and thus by utilities like df). The result of this change is that free space appears immediately after the unlink system call returns or the rm utility finishes.

The second and related change to soft updates has to do with avoiding false out-of-space errors. When running with soft updates on a nearly full filesystem with high turnover rate (for example, when installing a whole new set of binaries on a root partition), the filesystem can return a filesystem full error even though it reports that it has plenty of free space. The filesystem full message happens because soft updates have not managed to free the space from the old binaries in time for it to be available for the new binaries.

The initial attempt to correct this problem was to simply have the process that wished to allocate space wait for the free space to show up. The problem with this approach is that it often had to wait for up to a minute. In addition to making the application seem intolerably slow, it usually held a locked vnode that could cause other applications to get blocked waiting for it to become available (often referred to as a lock race to the root of the filesystem). Although the condition would clear in a minute or two, users often assumed that their system had hung and would reboot.

To remedy this problem, the solution devised for UFS2 is to co-opt the process that would otherwise be blocked and put it to work helping soft updates process the files to be freed. The more processes trying to allocate space, the more help that is available to soft updates and the faster free blocks begin to appear. Usually in under one second enough space shows up that the processes can return to their original task and complete. The effect of this change is that soft updates can now be used on small nearly full filesystems with high turnover.

Although the common case for deallocation is for all data in a file to be deleted, the truncate system call allows applications to delete only part of a file. This semantic creates slightly more complicated update dependencies, including the need to have deallocation dependencies for indirect blocks and the need to consider partially deleted data blocks. Because it is so uncommon, the soft-updates implementation does not optimize this case; the conventional synchronous write approach is used instead.

One concern with soft updates is the amount of memory consumed by the dependency structures. In daily operation, we have found that the additional dynamic memory load placed on the kernel memory allocation area is about equal to the amount of memory used by vnodes plus inodes. For each 1000 vnodes in the system, the additional peak memory load from soft updates is about 300 Kbytes. The one exception to this guideline occurs when large directory trees are removed. Here, the filesystem code can get arbitrarily far ahead of the on-disk state, causing the amount of memory dedicated to dependency structures to grow without bound. The soft-update code was modified to monitor the memory load for this case and not allow it to grow past a tunable upper bound. When the bound is reached, new dependency structures can only be created at the rate at which old ones are retired. The effect of this limit is to slow down the rate of removal to the rate at which the disk updates can be done. While this restriction slows the rate at which soft updates can normally remove files, it is still considerably faster than the traditional synchronous-write filesystem. In steady-state, the soft-update remove algorithm requires about one disk write for each 10 files removed, whereas the traditional filesystem requires at least two writes for every file removed.

Soft Updates Requirements for fsck

As with the dual tracking of the true and effective link count, the changes needed to fsck became evident through operational experience. In a non-soft-updates filesystem implementation, file removal happens within a few milliseconds. Thus, there is a short period of time between the directory entry being removed and the inode being deallocated. If the system crashes during a bulk tree removal operation, there are usually no inodes lacking references from directory entries, though in rare instances there may be one or two. By contrast, in a system running with soft updates, many seconds may elapse between the times when the directory entry is deleted and the inode is deallocated. If the system crashes during a bulk tree removal operation, there are usually tens to hundreds of inodes lacking references from directory entries. Historically, fsck placed any unreferenced inodes into the lost+found directory. This action is reasonable if the filesystem has been damaged because of disk failure that results in the loss of one or more directories. However, it results in the incorrect action of stuffing the lost+found directory full of partially deleted files when running with soft updates. Thus, the fsck program was modified to check that a filesystem is running with soft updates and clear out, rather than save, unreferenced inodes (unless it has determined that unexpected damage has occurred to the filesystem, in which case the files are saved in lost+found).

A peripheral benefit of soft updates is that fsck can trust the allocation information in the bitmaps. Thus, it only needs to check the subset of inodes in the filesystem that the bitmaps show are in use. Although some of the inodes marked "in use" may be free, none of those marked "free" will ever be in use.

Performance of Soft Updates

We give only a cursory look at soft updates performance. For a detailed analysis, including comparisons with other techniques, see Ganger, McKusick et al. [2000].

We place the performance of soft updates in context by comparing it to the filesystem running without soft updates enabled (referred to below as "normal") that uses synchronous writes for update ordering and a filesystem mounted with the O_ASYNC option (referred to below as "asynchronous") that ignores all update dependencies. In asynchronous mode, all metadata updates are converted into delayed writes (see Section 6.6). Thus, the O_ASYNC data provides an upper bound on the performance of an update ordering scheme. As expected, we have found that soft updates eliminate almost all synchronous writes and usually allow the filesystem to achieve performance within 5 percent of the upper bound. Compared to using synchronous writes, file creation and removal performance increase by factors of 2 and 20, respectively. Generally, a filesystem running with soft updates tends to require 40 percent fewer disk writes and complete tasks 25 percent more quickly than when running without soft updates.

To provide a feeling for how the system performs in normal operation, we present measurements from three different system tasks. The first task is our "filesystem torture test." This test consists of 1000 runs of the Andrew benchmark, 1000 copy and removes of /etc with randomly selected pauses of 0 to 60 seconds between each copy and remove, and 500 executions of the find application from / with randomly selected pauses of 100 seconds between each run. The run of the torture test is shown in Table 8.3. The overall result is that asynchronous and soft updates require 42 percent fewer writes (with almost no synchronous writes) and have a 28 percent shorter running time. This result is particularly impressive when one considers that the finds and the pauses involve no update dependencies, and the Andrew benchmark is largely CPU bound.

Table 8.3. Running-time comparison for filesystem tests.
Filesystem	Disk Writes		Running
Configuration	Sync	Async	Time
Normal	1,459,147	487,031	27hr, 15min
Asynchronous	0	1,109,711	19hr, 43min
Soft Updates	6	1,113,686	19hr, 50min

The second test consists of building and installing the FreeBSD system. This task is a real-world example of a program development environment. The results are shown in Table 8.4. The overall result is that soft updates require 75 percent fewer writes and have a 21 percent shorter running time. Although soft updates do 30 percent more writes than asynchronous, the two result in the same running time.

Table 8.4. Running-time comparison for building FreeBSD.
Filesystem	Disk Writes		Running
Configuration	Sync	Async	Time
Normal	162,410	39,924	2hr, 12min
Asynchronous	0	38,262	1hr, 44min
Soft Updates	1124	48,850	1hr, 44min

The third test compares the performance of the central mail server for Berkeley Software Design, Inc., run with and without soft updates. The administrator was obviously unwilling to run it in asynchronous mode, since it is a production machine and people will not abide by losing their mail. Unlike the preceding tests that involve a single disk, the mail spool on this system is striped across three disks. The statistics were gathered by averaging the results from 30 days of nonweekend operation in each mode. The results for a 24-hour period are shown in Table 8.5. The normal filesystem averaged over 40 writes per second with a ratio of synchronous to asynchronous writes of 1:1. With soft updates, the write rate dropped to 12 per second, and the ratio of synchronous to asynchronous writes dropped to 1:8. For this real-world application, soft updates require 70 percent fewer writes, which triples the mail-handling capacity of the machine.

Table 8.5. Running-time comparison for a mail server.
Filesystem	Disk Writes
Configuration	Sync	Async
Normal	1,877,794	1,613,465
Soft Updates	118,102	946,519