Journaled File Systems | Performance Tuning for Linux Servers

Before the year 2000, Ext2 was the de facto file system for most Linux machines. Ext2 is robust, reliable, and suitable for most deployments. However, as Linux displaces UNIX and other operating systems in more and larger server and computing environments, Ext2 is being pushed to its limits. In fact, many now-common requirements (large hard-disk partitions, quick recovery from crashes, high-performance I/O, and the need to store thousands and thousands of files representing terabytes of data) exceed the capabilities of Ext2.

Fortunately, a number of other Linux file systems take up where Ext2 leaves off. Indeed, Linux now offers four alternatives to Ext2: Ext3, ReiserFS, XFS, and JFS. In addition to meeting some or all of the requirements just listed, each of these alternative file systems supports journaling, a feature demanded by enterprises and beneficial to anyone running Linux. A journaling file system can simplify restarts, reduce fragmentation, and accelerate I/O. Journaling file systems minimize the need to run file system checkers.

System administrators who maintain complex systems or those who require high availability should consider deploying one or more journaling file systems.

When Good File Systems Go Bad

Consider the steps a file system takes when a file is modified and grows from three blocks to five blocks:

Two new blocks are allocated to hold the new data.
The file's inode is updated to record the two new block pointers and the file's new size.
The actual data is written into the blocks.

Although writing data to a file appears to be a single atomic operation, the actual process involves a number of steps (even more steps than shown here, considering all the accounting required to remove free blocks from a list of free space, among other possible metadata changes).

If all the steps to write a file are completed perfectly (and this happens most of the time), the file is saved successfully. However, if the process is interrupted at any time (perhaps due to power failure or other systemic failure), a non-journaled file system can end up in an inconsistent state. Corruption occurs because the logical operation of writing (or updating) a file is actually a sequence of I/O operations, and the entire operation might not be fully reflected on the media at any given point in time.

If the metadata or the file data is left in an inconsistent state, the file system no longer functions properly.
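The failure mode described above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: the names ToyFileSystem, CrashError, and crash_after_step are hypothetical, the "disk" is a Python dictionary, and a real file system performs far more accounting than the three steps shown.

```python
class CrashError(Exception):
    """Stands in for a power failure in the middle of an update."""


class ToyFileSystem:
    """Hypothetical model: one file, a free-block list, and a block store."""

    def __init__(self, total_blocks=16):
        self.total_blocks = total_blocks
        self.free_blocks = list(range(total_blocks))  # free-space list
        self.inode = {"blocks": [], "size": 0}        # the file's metadata
        self.disk = {}                                # block number -> data

    def append(self, chunks, crash_after_step=None):
        # Step 1: allocate one new block per chunk of data.
        new_blocks = [self.free_blocks.pop(0) for _ in chunks]
        if crash_after_step == 1:
            raise CrashError("crashed after allocation")

        # Step 2: record the new block pointers and size in the inode.
        self.inode["blocks"].extend(new_blocks)
        self.inode["size"] += len(chunks)
        if crash_after_step == 2:
            raise CrashError("crashed after inode update")

        # Step 3: write the actual data into the blocks.
        for block, chunk in zip(new_blocks, chunks):
            self.disk[block] = chunk

    def is_consistent(self):
        referenced = set(self.inode["blocks"])
        # Every block the inode points to must hold data...
        data_present = all(block in self.disk for block in referenced)
        # ...and every block must be either free or owned by the file.
        no_leaks = len(referenced) + len(self.free_blocks) == self.total_blocks
        return data_present and no_leaks


def demo():
    fs = ToyFileSystem()
    fs.append(["a", "b", "c"])              # the original three blocks
    before = fs.is_consistent()
    try:
        fs.append(["d", "e"], crash_after_step=2)   # grow to five, then crash
    except CrashError:
        pass
    return before, fs.is_consistent()


if __name__ == "__main__":
    print(demo())
```

After the simulated crash, the inode claims five blocks but only three contain data. A crash after step 1 instead leaves two blocks that are neither free nor owned by any file. Both are the kind of damage a checker such as fsck has to hunt for.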

Non-journaled file systems rely on fsck to examine all the file system's metadata and detect and repair structural integrity problems before restarting. If Linux shuts down smoothly, fsck typically returns a clean bill of health. However, after a power failure or crash, fsck is likely to find some kind of error in metadata.

Because file systems contain significant amounts of metadata, running fsck can be very time-consuming. fsck has to scan a file system's entire repository of metadata to ensure consistency and error-free operation; therefore, the time fsck needs on a disk partition grows with the size of the partition, the number of directories, and the number of files in each directory.

For large file systems, journaling becomes crucial. Journaled file systems provide improved structural consistency, better recovery, and faster restart times than non-journaled file systems. In most cases, journaled file systems can restart in less than a second.

Transactions Are the Solution

The magic of journaling file systems lies in transactions. Like database transactions, journaling file system transactions treat a sequence of changes as a single, atomic operation. However, instead of tracking updates to tables, the journaling file system tracks changes to file system metadata or user data. The transaction guarantees that either all or none of the file system updates are done.

For example, the process of creating a new file modifies several metadata structures (inodes, free lists, and directory entries). Before the file system makes those changes, it creates a transaction that describes what it is about to do. After the transaction has been recorded (on disk), the file system goes ahead and modifies the metadata. The journal in a journaling file system is simply a list of transactions.

In the event of a system failure, the file system is restored to a consistent state by replaying the journal. Rather than examine all metadata (the fsck way), the file system inspects only those portions of the metadata that have recently changed. Recovery is much faster, usually only a matter of seconds. Better yet, recovery time is not dependent on the size of the partition.
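The record-then-modify sequence and the replay step can be sketched as a small in-memory model. This is a minimal sketch, not the design of any particular file system: the class JournaledMetadata, its method names, and the explicit "committed" marker are assumptions made for illustration, and both the journal and the metadata are ordinary Python objects standing in for on-disk structures.

```python
class JournaledMetadata:
    """Hypothetical write-ahead journal for metadata updates."""

    def __init__(self):
        self.journal = []    # stands in for the on-disk journal
        self.metadata = {}   # stands in for on-disk metadata structures

    def begin(self, updates):
        """Describe the intended changes before touching any metadata."""
        txn = {"updates": dict(updates), "committed": False}
        self.journal.append(txn)
        return txn

    def commit(self, txn):
        """Assumption: a marker shows the record was written completely."""
        txn["committed"] = True

    def apply(self, txn, crash_after=None):
        """Modify metadata in place; stop early to mimic a crash."""
        for count, (key, value) in enumerate(txn["updates"].items()):
            if crash_after is not None and count >= crash_after:
                return False             # "power failure" mid-update
            self.metadata[key] = value
        self.journal.remove(txn)         # fully applied, record can go
        return True

    def recover(self):
        """Replay committed transactions; discard incomplete ones."""
        for txn in self.journal:
            if txn["committed"]:
                # Replaying is safe to repeat: it sets final values.
                self.metadata.update(txn["updates"])
        self.journal.clear()


if __name__ == "__main__":
    fs = JournaledMetadata()
    txn = fs.begin({
        "inode 7": "allocated",
        "free list": "block 42 removed",
        "directory entry": "report.txt -> inode 7",
    })
    fs.commit(txn)
    fs.apply(txn, crash_after=1)   # only the first change reaches metadata
    fs.recover()                   # replay finishes the other two
    print(fs.metadata)
```

Recovery walks only the journal, never the full metadata, which is why its cost in this model depends on the number of pending transactions rather than on the size of the file system.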

In addition to faster restart times, most journaling file systems also address another significant problem: scalability. Combining even a few large-capacity disks makes it easy to assemble some massive (certainly by early-'90s standards) file systems. Features of modern file systems include the following:

Faster allocation of free blocks. Extents (as described previously) and B+ trees are used individually or together to find and allocate several free blocks quickly, either by size or location.
Large (or very large) numbers of files in a directory. A directory is a special file that contains a list of files. If a directory needs to contain thousands or tens of thousands of files, something better than a linked list of (name, inode) pairs is needed. Again, advanced file systems use B+ trees to store directory entries. In some cases, a single B+ tree is used for the entire system.
Large files. The old technique of storing direct, indirect, double-indirect, and even triple-indirect pointers to blocks does not scale well. For very large files, the number of disk accesses needed to retrieve a block in the data file would be prohibitively expensive.
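To make the cost of indirect pointers concrete, the following sketch counts how many pointer blocks must be read before a given data block of a file can be reached. The layout figures are assumptions, not values from the text: 12 direct pointers in the inode and 1,024 pointers per indirect block (as with 4 KB blocks and 4-byte block numbers). The function name indirection_level is hypothetical.

```python
def indirection_level(block_index, direct=12, ptrs_per_block=1024):
    """Return how many indirect blocks are read to locate a data block.

    0 = direct pointer, 1 = single indirect, 2 = double, 3 = triple.
    The default layout values are assumptions chosen for illustration.
    """
    if block_index < 0:
        raise ValueError("block index must be non-negative")
    if block_index < direct:
        return 0
    remaining = block_index - direct
    span = ptrs_per_block               # blocks reachable at this level
    for level in (1, 2, 3):
        if remaining < span:
            return level
        remaining -= span
        span *= ptrs_per_block
    raise ValueError("block index beyond triple-indirect range")


if __name__ == "__main__":
    block_size = 4096                   # assumed block size in bytes
    for offset in (0, 48 * 1024, 5 * 1024**2, 5 * 1024**3):
        level = indirection_level(offset // block_size)
        print(f"byte offset {offset}: {level} extra block read(s)")
```

Under these assumptions, only the first 48 KB of a file is reachable without an extra read, and any offset past roughly 4 GB costs three additional block reads per lookup unless those pointer blocks are already cached.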

More advanced file systems also manage sparse files, internal fragmentation, and the allocation of inodes better than Ext2.

A Wealth of Options

Although advanced file systems are tailored primarily for the high throughput and high uptime requirements of servers (from single-processor systems to clusters), these file systems can also benefit client machines where performance and reliability are wanted or needed.

Recent releases of Linux include not one, but four journaling file systems. JFS from IBM, XFS from SGI, and ReiserFS from Namesys have all been "open sourced" and subsequently included in the Linux kernel. In addition, Ext3 was developed as a journaling add-on to Ext2.

Figure 11-2 shows where file systems fit into Linux. Note that JFS, XFS, ReiserFS, and Ext3 are independent "peers." It is possible for a single Linux machine to use all these types of file systems at the same time. A system administrator can configure a system to use JFS on one partition and ReiserFS on another.

Figure 11-2. Where file systems fit in the operating system.

The following output from the mount command shows a system with all four of the journaling systems:

 # mount
 /dev/hdb6 on / type reiserfs (rw)
 proc on /proc type proc (rw)
 devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
 shmfs on /dev/shm type shm (rw)
 usbdevfs on /proc/bus/usb type usbdevfs (rw)
 /dev/hda1 on /xfs type xfs (rw)
 /dev/hdb1 on /jfs type jfs (rw)
 /dev/hda4 on /ext3 type ext3 (rw)
 /dev/hda2 on /ext2 type ext2 (rw)
 /dev/hda3 on /reiserfs type reiserfs (rw)
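As a rough sketch, output in this format can be filtered to list only the mount points that use a journaling file system. The function name journaled_mounts and the regular expression are assumptions written for the "device on mountpoint type fstype (options)" layout shown above; other versions of mount may print a different format.

```python
import re

# The four journaling file systems named in this chapter.
JOURNALING_TYPES = {"ext3", "reiserfs", "xfs", "jfs"}

# Matches lines such as: /dev/hda1 on /xfs type xfs (rw)
MOUNT_LINE = re.compile(r"^(\S+) on (\S+) type (\S+) \((.*)\)$")


def journaled_mounts(mount_output):
    """Map each mount point to its type, keeping journaling types only."""
    result = {}
    for line in mount_output.splitlines():
        match = MOUNT_LINE.match(line.strip())
        if match is None:
            continue                     # skip prompts and blank lines
        _device, mount_point, fs_type, _options = match.groups()
        if fs_type in JOURNALING_TYPES:
            result[mount_point] = fs_type
    return result


SAMPLE = """\
/dev/hdb6 on / type reiserfs (rw)
proc on /proc type proc (rw)
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
shmfs on /dev/shm type shm (rw)
usbdevfs on /proc/bus/usb type usbdevfs (rw)
/dev/hda1 on /xfs type xfs (rw)
/dev/hdb1 on /jfs type jfs (rw)
/dev/hda4 on /ext3 type ext3 (rw)
/dev/hda2 on /ext2 type ext2 (rw)
/dev/hda3 on /reiserfs type reiserfs (rw)
"""

if __name__ == "__main__":
    for mount_point, fs_type in journaled_mounts(SAMPLE).items():
        print(f"{mount_point}: {fs_type}")
```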

The df command shows all these file systems and their available space:

 # df -k
 Filesystem     1K-blocks      Used   Available  Use%   Mounted on
 /dev/hdb6        4441800   1770448     2671352   40%   /
 shmfs             192736         0      192736    0%   /dev/shm
 /dev/hda1         806448       144      806304    1%   /xfs
 /dev/hdb1        3999504    659320     3340184   17%   /jfs
 /dev/hda4        1739324     32828     1618140    2%   /ext3
 /dev/hda2         798508        20      757924    1%   /ext2
 /dev/hda3         811248     32840      778408    5%   /reiserfs