Section 13.3. The File System | Linux for Programmers and Users

[Page 541]

13.3. The File System

Linux uses files for long-term storage and RAM for short-term storage. Programs, data, and text are all stored in files. Files are usually stored on hard disks, but can also be stored on other media such as tape and floppy disks. Linux files are organized by a hierarchy of labels, commonly known as a directory structure. The files referenced by these labels may be of three kinds:

Regular files, which contain a sequence of bytes generally corresponding to code or data. They may be referenced via the standard I/O system calls.
Directory files, which are stored on disk in a special format and form the backbone of the file system. They may be referenced only via directory-specific system calls.
Special files, which correspond to peripherals such as printers and disks, and interprocess communication mechanisms such as pipes and sockets. They may be referenced via the standard I/O system calls.

Conceptually, a Linux file is a linear sequence of bytes. The Linux kernel does not support any higher order of file structure, such as records and/or fields, as some older operating systems did. This is evident if you consider the lseek () system call, which allows you to position the file pointer only in terms of a byte offset.

While most of the concepts in this section apply to any file system supported by Linux, we will specifically examine the Native Linux (ext2) file system. But first, let's examine the hardware architecture of the most common file system medium: the disk drive.

13.3.1. Disk Architecture

Figure 13-9 is a diagram of a typical disk architecture.

Figure 13-9. Disk architecture.

A disk is split up in two ways: it's sliced like a pizza into areas called sectors, and further subdivided into concentric rings called tracks. The individual areas bounded by the intersection of sectors and tracks are called blocks, and form the basic unit of disk storage. A typical disk block can hold 4K bytes. A single read/write head accesses information as the disk rotates and its surface passes underneath. A special chip called a disk controller moves the read/write head in response to instructions from the disk device driver, which is a special piece of software located in the Linux kernel.

[Page 542]

There are several variations of this simple disk architecture. Many disk drives actually contain several platters, stacked one upon the other. In these systems, the collection of tracks with the same index number is called a cylinder. In most multiplatter systems, the disk arms are connected to each other so that the read/write heads all move synchronously, rather like a comb moving through hair (Figure 13-10). The read/write heads of such disk systems therefore move through cylinders of media.

Figure 13-10. A multiplatter architecture.

Notice that the blocks on the outside track are larger than the blocks on the inside track, due to the way that a disk is partitioned. If a disk always rotates at the same speed, the density of data on outer blocks is less than it could be, thus wasting potential storage (Figure 13-11).

Figure 13-11. Disk storage techniques.

13.3.1.1. Interleaving

When a sequence of contiguously numbered blocks is read, there's a latency delay between each block due to the overhead of the communication between the disk controller and the device driver. Logically contiguous blocks are therefore spaced apart on the surface of the disk so that, by the time the latency delay is over, the head is positioned over the correct area. The spacing between blocks due to this delay effect is called the interleave factor. Figure 13-12 illustrates two different interleave factors.

[Page 543]

Figure 13-12. Disk interleaving.

13.3.1.2. Block I/O

Input and output to and from the disk is always performed in whole blocks. If you issue a read () system call to read the first byte of data from a file, the device driver issues an I/O request to the disk controller to read the first 4K block into a kernel buffer, and then copies the first byte from the buffer to your process. More information about I/O buffering is presented later in this chapter.

Most disk controllers handle one block I/O request at a time. When a disk controller completes the current block I/O request, it issues a hardware interrupt back to the device driver to signal completion. At this point, the device driver usually makes the next block I/O request. Figure 13-13 illustrates the sequence of events that might occur during a 9K read ().

Figure 13-13. Block I/O.

[Page 544]

13.3.1.3. Fragmentation

Block size can be specified at 1K, 2K, or 4K, when the file system is created with the mkfs utility (if not specified, mkfs will pick a size suitable to the size of the file system). Assuming a 4K block size, a single 9K file requires 3 blocks of storageone to hold the first 4K, one to hold the next 4K, and the last to hold the remaining 1K. As a file is modified, blocks are added to and removed from the file as data is added and deleted. Disk blocks are rarely contiguous on the disk device and, over time, the disk blocks making up the file tend to become scattered all over the disk (Figure 13-14). This situation is called fragmentation.

Figure 13-14. A file's blocks are scattered.

13.3.2. Virtual File System

The central connection between the Linux kernel and all file systems is the Virtual File System (VFS) abstraction layer. The Linux kernel knows nothing of the details of any particular file system, it simply talks to the VFS to deal with any file system. Every file system provides a standard interface that hooks into VFS. Figure 13-15 shows the relationship between a process and a file system.

Figure 13-15. The VFS layer between process and file system.

[Page 545]

This abstraction not only allows many different types of real file system to be plugged in underneath and run smoothly, it also allows a file system interface to be created for data structures other than file systems. The /proc file system (discussed in Chapter 14, "System Administration") is an example of such a use and is a good example of the power of VFS. A file system interface to kernel data provides an easy-to-navigate interface into a running kernel.

13.3.3. Inodes

Linux uses a structure called an inode (Index Node) to store information about each file. The inode of a regular or directory file contains pointers to the locations of its disk blocks, and the inode of a special file contains information that allows the peripheral to be identified. Each inode also holds other information about the file.

An inode is a fixed size structure containing pointers to disk blocks and additional indirect pointers (for large files). Every inode in a particular file system is allocated a unique inode number and every file has exactly one inode (Figure 13-16).

Figure 13-16. Every file has an inode.

Each inode contains information about the file such as:

the type of the file: regular, directory, block-oriented or character-oriented special, etc.
file permissions (mode)

[Page 546]

the owner and group ID of the file
a hard link count (described later in this chapter)
the last modification and last access times
if it's a regular or directory file, the location of the blocks
if it's a special file, the major and minor device numbers (described later in this chapter)
if it's a symbolic link, the value of the symbolic link (unless it's so long it must be in a data block)

In other words, an inode contains most of the information that you see when you perform an "ls -l," except the filename.

Only the locations of the first 12 blocks of a file are stored directly in the inode. Many Linux files are smaller than 48K in size, so in a 4K-block file system, this is sufficient for many files. An indirect accessing scheme is used for addressing larger files. In this scheme, a disk block is used to hold the locations of other data blocks. This block is called an indirect block. Its location is stored in the inode, and it is used to address the next group of blocks in the file. If this is still insufficient to hold references to all the required data blocks for a file, a double-indirect block is used. This block points to a group of indirect blocks, which in turn point to data blocks. A Linux inode contains a place to store references to one indirect, one double-indirect, and one triple-indirect block as shown in Figure 13-17.

Figure 13-17. A file's blocks are scattered.

Note that, as the file gets larger, the amount of indirection required to access a particular block increases. This overhead is minimized by buffering the contents of the inode and commonly referenced indirect blocks in RAM.

[Page 547]

13.3.4. File System Layout

The first logical block of a disk is called the boot block, and contains some executable code that is used when Linux is first activated. See Chapter 2, "Installing Your Linux System," and Chapter 14, "System Administration," for more information.

The remainder of the file system is a repeated series of block groups, each containing a group of inodes, disk blocks, and some housekeeping information about them.

Each block group contains the following data:

a copy of the file system's superblock
group descriptors
block bitmap
inode bitmap
the inode table
data blocks

Figure 13-18 is a simplified diagram of the layout of the ext2 file system.

Figure 13-18. Simplified view of the ext2 file system layout.

13.3.4.1. The Superblock

The superblock contains information pertaining to the entire file system, such as:

a "magic number" denoting type of file system
number of inodes in the file system
number of currently free inodes
number of blocks in the file system
number of currently free blocks
the size of a block (in bytes) used by the file system
number of data blocks per block group
number of inodes per block group

[Page 548]

Without the superblock, the file system could not be used. Therefore, multiple copies of the superblock are maintained throughout the file system area in case of a failure of the primary copy. A copy of the superblock is maintained at the started of each block group.

13.3.4.2. Group Descriptors

The group descriptor is a sort of superblock for the individual block group. As in superblock, a copy of all group descriptors is stored in each block group, so this area contains a copy of all group descriptors in the file system. Information stored here includes:

pointer to a disk block containing the block allocation bitmap
pointer to a disk block containing the inode allocation bitmap
pointer to the beginning of the inode table
free block count
free inode count
used directory count

13.3.4.3. Block and Inode Bitmaps

The block and inode bitmaps are each a string of bits corresponding to each block and inode, respectively, in the block group. If the bit is set to 1, then block or inode is allocated, otherwise it is available. These bitmaps are used during allocation and deallocation of block and inodes.

13.3.4.4. Inode Table

The inode table is simply an array of inode structures for each inode defined in the block group.

13.3.5. Bad Blocks

A disk always contains several blocks that, for one reason or another, are not fit for use. Many disk drives map out bad blocks internally, but those that are not can be mapped out by the file system itself. The utilities that create a new file system also create a single "worst nightmare" file composed of all the bad blocks on the disk and record the locations of all these blocks in inode number 1. This prevents the blocks from being allocated to other files. When fsck runs to check the file system, bad blocks can be added to this file and thus mapped out of the file system.

13.3.6. Directories

Inode number 2 contains the location of the block(s) containing the root directory of the file system. A directory contains a list of associations between filenames and inode numbers. When a directory is created, it is automatically allocated entries for "..", its parent directory, and ".", itself. Since a <filename, inode number> pair effectively links a name to a file, these associations are termed "hard links." Since filenames are stored in the directory blocks, they are not stored in a file's inode. In fact, it wouldn't make any sense to store the name in the inode, as a file may have more than one name. Because of this observation, it's more accurate to think of the directory hierarchy as being a hierarchy of file labels, rather than a hierarchy of files. Most file systems allow filenames up to 255 characters in length.

[Page 549]

Figure 13-19 is an illustration of the root inode corresponding to a simple root directory. The inode numbers associated with each filename are shown as subscripts.

Figure 13-19. The root directory is associated with inode two.

13.3.7. Translating Pathnames into Inode Numbers

System calls such as open () must obtain a file's inode from its pathname. They perform the translation as follows:

The inode from which to start the pathname search is located. If the pathname is absolute, the search starts from inode #2. If the pathname is relative, the search starts from the inode corresponding to the process's current working directory.
The components of the pathname are then processed from left to right. Every component except the last should correspond to either a directory or a symbolic link. Let's call the inode from which the pathname search is started the current inode.
If the current inode corresponds to a directory, the directory corresponding to the current inode is searched to see if the current pathname component is there. If it's not found, an error occursotherwise, the value of the current inode number becomes the inode number associated with the located pathname component.
If the current inode corresponds to a symbolic link, the pathname up to and including the current path component is replaced by the contents of the symbolic link, and the pathname is reprocessed.
The inode corresponding to the final pathname component is the inode of the file referenced by the entire pathname.

To illustrate this algorithm, I'll list the steps required to translate the pathname "/usr/test.c" into an inode number. Figure 13-20 contains the disk layout that I assume during the translation process. It indicates the translation path using bold lines, and the final destination using a circle.

[Page 550]

Figure 13-20. A sample directory layout.

Here's the logic that the kernel uses to translate the pathname "/usr/test.c" into an inode number:

The pathname is absolute, so the current inode number is 2.
The directory corresponding to inode 2 is searched for the pathname component "usr." The matching entry is found, and the current inode number is set to 4.

[Page 551]

The directory corresponding to inode 4 is searched for the pathname component "test.c". The matching entry is found, and the current inode number is set to 6.
"test.c" is the final pathname component, so the algorithm returns the inode number 6.

As you can see, the translation bounces between inodes and directory blocks until the pathname is fully processed.

13.3.8. Mounting File Systems

When Linux is booted, the directory hierarchy corresponds to the file system located on a single disk called the root device. Linux allows you to create file systems on other devices and attach them to the original directory hierarchy using a mechanism termed mounting. The mount utility allows a super-user to splice the root directory of a file system into the existing directory hierarchy. The hierarchy of a large Linux system may be spread across many devices, each containing a subtree of the total hierarchy. For example, the "/usr" subtree is commonly stored on a device other than the root device. Non-root file systems are usually mounted automatically at boot time. See Chapter 14, "System Administration," for more details. For example, assume that a file system is stored on a floppy disk in the "/dev/fd0" device. To attach it to the "/mnt" subdirectory of the main hierarchy, you'd execute the command:

$ mount /dev/fd0 /mnt

Figure 13-21 illustrates the effect of this command.

Figure 13-21. Mounting directories.

File systems may be detached from the main hierarchy by using the umount utility. The following command would detach the file system stored in "/dev/fd0":

$ umount /dev/fd0 or $ umount /mnt