10.1 Filesystem Types

Before any disk partition can be used, a filesystem must be built on it. When a filesystem is made, certain data structures are written to disk that will be used to access and organize the physical disk space into files (see Section 10.3, later in this chapter).

Table 10-1 lists the most important filesystem types available on the various systems we are considering.

Table 10-1. Important filesystem types
Use	AIX	FreeBSD	HP-UX	Linux	Solaris	Tru64
Default local	jfs or jfs2	ufs	vxfs^[1]	ext3, reiserfs	ufs	ufs or advfs
NFS	nfs	nfs	nfs	nfs	nfs	nfs
CD-ROM	cdrfs	cd9660	cdfs	iso9660	hsfs	cdfs
Swap	not needed	swap	swap, swapfs	swap	swap	not needed
DOS	not supported	msdos	not supported	msdos	pcfs	pcfs
/proc	procfs	procfs	not supported	procfs	procfs	procfs
RAM-based	not supported	mfs^[2]	not supported	ramfs, tmpfs	tmpfs	mfs
Other		union	hfs	ext2	cachefs	

^[1] HP-UX defines the default filesystem type in /etc/default/fs's LOCAL variable.

^[2] This feature is deprecated under FreeBSD and will be replaced by the md facility in Version 5.

10.1.1 About Unix Filesystems: Moments from History

In the beginning was the System V filesystem. Well, not really, but that's where we'll start. This filesystem type once dominated System V-based operating systems.^[3]

^[3] The filesystem that came to be known as the System V filesystem (s5fs) actually predates System V.

The superblock of standard System V filesystems contained information about currently available free space in the filesystem in addition to information about how the space in the filesystem is allocated. It held the number of free inodes and data blocks, the first 50 free inode numbers, and the addresses of the first 100 free disk blocks. After the superblock came the inodes, followed by the data blocks.

The System V filesystem was designed for storage efficiency. It generally used a small filesystem block size: 2K bytes or less (minuscule, in fact, by modern standards). Traditionally, a block is the basic unit of disk storage;^[4] all files consume space in multiples of the block size, and any excess space in the last block cannot be used by other files and is therefore wasted. If a filesystem has a lot of small files, a small block size minimizes waste. However, small block sizes are much less efficient when transferring large files.

^[4] This block is not related to the blocks used in the default output from commands like df and du. Use -k with either command to avoid having to worry about units.
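
The tradeoff between block size and wasted space can be made concrete with a little arithmetic. The following is a minimal sketch, not a model of any real filesystem: the function names and the list of file sizes are invented for illustration, and it assumes only that space is handed out in whole blocks.

```python
def allocated_bytes(file_size, block_size):
    """Bytes consumed on disk when space is handed out in whole blocks."""
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * block_size


def wasted_bytes(file_sizes, block_size):
    """Total unusable space left over in the last block of each file."""
    return sum(allocated_bytes(size, block_size) - size for size in file_sizes)


# Hypothetical file sizes in bytes, chosen to be mostly small.
sizes = [100, 700, 1500, 3000, 10_000]

for block_size in (512, 2048, 8192):
    print(block_size, wasted_bytes(sizes, block_size))
```

For this set of mostly small files, the waste grows rapidly as the block size increases, which is the storage-efficiency argument for the small blocks of the System V filesystem.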

The System V filesystem type is obsolete at this point. It is still supported on some systems for backward compatibility purposes only.

The BSD Fast File System (FFS) was designed to remedy the performance limitations of the System V filesystem. It supports filesystem block sizes of up to 64 KB. Because merely increasing the block size to this level would have had a horrendous effect on the amount of wasted space, the designers introduced a subunit to the block: the fragment. While the block remains the I/O transfer unit, the fragment becomes the disk storage unit (although only the final chunk of a file can be a fragment). Each block may be divided into one, two, four, or eight fragments.
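
The effect of fragments on disk usage can be sketched in the same way. The function below is a simplified illustration under stated assumptions: its name and default values are hypothetical, and it assumes that every file is stored as whole blocks followed by just enough fragments to hold the final partial block. The allocation rules of a real FFS implementation are more involved.

```python
def ffs_allocated_bytes(file_size, block_size=8192, frags_per_block=8):
    """Space used when only the final chunk of a file may occupy fragments."""
    if frags_per_block not in (1, 2, 4, 8):
        raise ValueError("a block divides into 1, 2, 4, or 8 fragments")
    frag_size = block_size // frags_per_block
    # Whole blocks first; whatever is left over is the tail of the file.
    full_blocks, tail = divmod(file_size, block_size)
    tail_frags = -(-tail // frag_size)  # ceiling division
    return full_blocks * block_size + tail_frags * frag_size


# A 10,000-byte file with 8 KB blocks, with and without fragments.
print(ffs_allocated_bytes(10_000, frags_per_block=8))
print(ffs_allocated_bytes(10_000, frags_per_block=1))
```

With eight fragments per block, the tail of the file occupies two 1 KB fragments rather than a second full 8 KB block, so the large block size costs very little extra space.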

Whatever its absolute performance status, the BSD filesystem is an unequivocal improvement over System V. For this reason, it was included in the System V.4 standard as the UFS filesystem type. This is its name on Solaris and Tru64 systems (as well as under FreeBSD). For a while, this filesystem dominated in the Unix arena.

In addition to performance advantages, the BSD filesystem introduced reliability improvements. For example, it replicates the superblock at various points in the filesystem (which are all kept synchronized). If the primary superblock is damaged, an alternate one may be used to access the filesystem (instead of it becoming unreadable). The utilities that create new filesystems report where the spare superblocks are located. In addition, the FFS spreads the inodes throughout the filesystem rather than storing them all at the start of the partition.

The BSD filesystem format has a more complex organizational structure as well. It is organized around cylinder groups: logical subcylinders of the total partition space. Each cylinder group has a copy of the superblock, a cylinder group map recording block use in its domain, and a fraction of the inodes for that filesystem (as well as data blocks). The data structures are placed at a different offset into each cylinder group to ensure that they land on different platters. Thus, in the event of limited disk damage, a copy of the superblock will still exist somewhere on the disk, as well as a substantial portion of the inodes, enabling significant amounts of data to be potentially recoverable. In contrast, if all of the vital information is in a single location on the disk, damage at that location effectively destroys the entire disk.

The Berkeley Fast File System is an excellent filesystem, but it suffers from one significant drawback: fsck performance. Not only does the filesystem usually need to be checked at every boot, the fsck process is also very slow. In fact, on current large disks, it can take hours.

10.1.1.1 Journaled filesystems

As a result, a different filesystem strategy was developed: journaled filesystems. Many operating systems now use such filesystems by default. Indeed, the current Solaris UFS filesystem type is a journaled version of FFS. In these filesystems, filesystem structure integrity is maintained using techniques from real-time transaction processing. They use a transaction log which is stored either in a designated location within the filesystem or in a separate disk partition set aside for this purpose.

As the filesystem changes, all metadata changes are recorded to the log, and writing entries to the log always precedes writing the actual buffers to disk.^[5] In the case of a system crash, the entries in the log are replayed, which ensures that the filesystem is in a consistent state. This operation is very fast, and so the filesystem is available for essentially immediate use. Note that this mechanism is exactly equivalent to traditional fsck in terms of ensuring filesystem integrity. Like fsck, it has no effect on the integrity of the data.

^[5] Writes to the log itself can be synchronous (forced to disk immediately) or buffered (written to disk only when the buffer fills up).
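
The log-then-write ordering and the replay step can be illustrated with a toy model. This is only a sketch: the class and method names are invented, metadata is reduced to key/value pairs, and the log is assumed to be written synchronously. Real journaled filesystems group changes into transactions and use far more elaborate on-disk formats.

```python
class ToyJournaledFS:
    """Toy model of metadata journaling (hypothetical, for illustration)."""

    def __init__(self):
        self.disk = {}     # metadata that has reached its home location on disk
        self.journal = []  # log entries, assumed to be on stable storage
        self.cache = {}    # in-memory buffers not yet written to disk

    def update(self, key, value):
        # The log entry is always recorded before the buffer is modified.
        self.journal.append((key, value))
        self.cache[key] = value

    def flush(self):
        # Write the buffers to disk; the log entries are then no longer needed.
        self.disk.update(self.cache)
        self.cache.clear()
        self.journal.clear()

    def crash(self):
        # A crash loses everything held in memory, but not the log.
        self.cache.clear()

    def replay(self):
        # After a reboot, reapply every logged change in order.
        for key, value in self.journal:
            self.disk[key] = value
        self.journal.clear()


fs = ToyJournaledFS()
fs.update("inode 12", "allocated")
fs.crash()    # the buffered change never reached the disk
fs.replay()   # the log restores it
print(fs.disk)
```

Note that only the metadata change is recovered; nothing in the model protects the contents of the file itself, which matches the caveat about data integrity above.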

Journaled filesystems can also be more efficient than traditional filesystems. For instance, the actual disk writes for multiple changes to the same metadata can be combined into a single operation: when four files are added to a directory, each one causes an entry to be written to the log, but all four of them can be combined in a single write to disk of the block containing the directory.
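
The saving from combined writes can be seen by counting operations. The class below is a hypothetical sketch, not any filesystem's actual interface: it counts one log entry per file added and one block write per flush of the directory block.

```python
class DirectoryBlockCache:
    """Counts log entries and block writes for one directory (illustrative)."""

    def __init__(self):
        self.log_entries = 0   # one per metadata change
        self.block_writes = 0  # writes of the directory block itself
        self.pending = []      # names added in memory but not yet on disk
        self.on_disk = []      # names present in the on-disk directory block

    def add_file(self, name):
        self.log_entries += 1  # each addition is logged individually
        self.pending.append(name)

    def flush(self):
        # All pending additions go out in a single write of the block.
        if self.pending:
            self.on_disk.extend(self.pending)
            self.pending.clear()
            self.block_writes += 1


directory = DirectoryBlockCache()
for name in ("a", "b", "c", "d"):
    directory.add_file(name)
directory.flush()
print(directory.log_entries, directory.block_writes)
```

Four log entries are recorded, but the directory block is written only once.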

10.1.1.2 BSD soft updates

In the BSD world, development of the FFS continues. The current version offers a feature called soft updates designed to make filesystems available immediately at boot time.^[6]

^[6] For technical details about soft updates, see the articles "Metadata Update Performance in File Systems" by Gregory Ganger and Yale Patt, published in the USENIX Symposium on Operating Systems Design and Implementation (1994; available in an expanded version online at http://www.ece.cmu.edu/~ganger/papers/CSE-TR-243-95.pdf) and "Soft Updates: A Technique for Eliminating Most Synchronous Writes in the Fast Filesystem" by Marshall Kirk McKusick and Gregory R. Ganger, published in the Proceedings of 1999 USENIX Annual Technical Conference (available online at http://www.usenix.org/publications/library/proceedings/usenix1999/mckusick.html). For a comparison of FFS with soft updates to journaled filesystems, see the paper "Journaling versus Soft Updates: Asynchronous Meta-data Protection in File Systems" by Margo I. Seltzer, Gregory R. Ganger, M. Kirk McKusick, Keith A. Smith, Craig A. N. Soules, and Christopher A. Stein, published in the Proceedings of 2000 USENIX Annual Technical Conference (available online at http://www.usenix.org/publications/library/proceedings/usenix2000/general/seltzer.html).

The usual FFS writes blocks to disk in a synchronous manner: in order, and waiting for each write operation to complete before starting the next one. In contrast, the soft updates method uses a delayed, asynchronous approach by maintaining a write-back cache for metadata blocks (a technique referred to as delayed writes). This often produces significant performance improvements in that many modifications to metadata can take place in memory rather than each one having to be performed on disk. For example, consider a directory tree removal. With soft updates, the metadata changes for the entire delete operation might be made in only a single write, a great savings compared to the traditional approach.

Of course, overlapping changes to metadata can also occur. To account for these situations, the soft updates facility maintains dependency data specifying the other metadata changes that a given update assumes have already taken place.

Blocks are selected for writing to disk according to an algorithm designed for overall filesystem efficiency. When it is time to write a metadata block to disk, soft updates reviews the dependencies associated with the selected block. If there are any dependencies that assume that other pending blocks will have been written first, the changes creating the dependencies are temporarily undone (rolled back). This allows the block to be written to disk while ensuring that the filesystem remains consistent. After the write operation completes, the rolled back updates to the block are restored, ensuring that the in-memory version contains the current data state. The system also removes dependency list entries that have been fulfilled by writing out that block.^[7]

^[7] Occasionally, soft updates require more write operations than the traditional method. Specifically, block roll forwards immediately make the block dirty again. If the block doesn't change again before it gets flushed to disk, an extra write operation occurs that would not otherwise have been necessary. The block selection algorithm attempts to minimize the number of rollbacks in order to avoid these situations.
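
The rollback and roll-forward cycle can be sketched with a toy model. Everything here is an assumption made for illustration: the Block class, its fields, and the write_block function are invented, a dependency is simplified to "this field may not reach disk while some other block is still dirty," and the rollback is modeled by building a rolled-back disk image rather than by undoing and redoing changes in memory. The actual soft updates code tracks dependencies in much finer detail.

```python
class Block:
    """A cached metadata block (hypothetical model, for illustration)."""

    def __init__(self, name):
        self.name = name
        self.memory = {}    # current in-memory contents
        self.disk = {}      # contents as last written to disk
        self.dirty = False  # True if memory holds changes not yet on disk
        self.waits_on = {}  # field -> Block that must be written first

    def set(self, field, value, after=None):
        self.memory[field] = value
        self.dirty = True
        if after is not None:
            self.waits_on[field] = after


def write_block(block):
    # Changes whose prerequisite block is still dirty are held back.
    held_back = {f for f, dep in block.waits_on.items() if dep.dirty}
    # Build the rolled-back image: held-back fields keep their old disk value.
    image = {f: v for f, v in block.memory.items() if f not in held_back}
    for field in held_back:
        if field in block.disk:
            image[field] = block.disk[field]
    block.disk = image
    # Drop the dependencies that are now fulfilled.
    block.waits_on = {f: d for f, d in block.waits_on.items() if f in held_back}
    # The in-memory copy still holds the held-back changes, so the block
    # is dirty again whenever anything was rolled back.
    block.dirty = bool(held_back)


inode_block = Block("inode block")
dir_block = Block("directory block")
inode_block.set("inode 5", "initialized")
# The new directory entry must not reach disk before the inode it names.
dir_block.set("newfile", "inode 5", after=inode_block)

write_block(dir_block)    # entry is rolled back; nothing reaches the disk
write_block(inode_block)  # the prerequisite is now on disk
write_block(dir_block)    # the entry can be written at last
print(dir_block.disk)
```

In this sequence the directory block is written twice, which is the kind of extra write operation described in the footnote above; had the inode block been selected first, a single write of each block would have sufficed.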

Soft updates have the advantage that the only filesystem inconsistencies that can be caused by a crash are inodes and data blocks marked as in use that are actually free (consult the papers listed in the earlier footnote to see why this is true). Because these errors are benign, the filesystem can be made available for immediate use after rebooting. A background process similar to fsck is used to locate and correct these errors.

10.1.2 Default Local Filesystems

Table 10-2 lists the characteristics of the default local filesystem types for the various Unix versions.

Table 10-2. Default local filesystem characteristics
Item	AIX	FreeBSD	HP-UX	Linux (Red Hat)	Linux (SuSE)	Solaris	Tru64	Tru64
Type	jfs	ufs	vxfs	ext3	reiserfs	ufs	ufs	advfs
Journaled	yes	soft updates	yes	yes	yes	yes	no	yes
64 bit (files>2 GB)	yes	yes	yes	yes	yes	yes	yes	yes
Dynamic resizing	yes	yes	yes	yes	yes	yes^[8]	no	yes^[9]
Sparse file support	yes	yes	yes	no	yes	yes	yes	yes
NFSv3 support	yes	yes	yes	yes	yes	yes	yes	yes
`dump` version provided	yes	yes	yes	yes	no	yes	yes	yes

^[8] Solaris 9 only

^[9] Requires the AdvFS utilities (additional cost option)
