Section 13.6. Input and Output

[Page 560 (continued)]

13.6. Input and Output

In this section, I'll describe the data structures and algorithms that the Linux kernel uses to support I/O-related system calls. Specifically, I'll look at the implementation of these calls in relation to three main types of file:

regular files
directory files
special files (i.e., peripherals, pipes, and sockets)

[Page 561]

13.6.1. I/O Objects

I like to think of files as being special kinds of objects that have I/O capabilities. Linux I/O objects may be arranged according to the hierarchy in Figure 13-28.

Figure 13-28. The I/O object hierarchy.

13.6.2. I/O System Calls

As I described in Chapter 12, "Systems Programming," the I/O system calls may be applied in a uniform way to all I/O objects. A few exceptions exist; for example, you can't use lseek () on a pipe or a socket. Here's a list of the system calls that are described in this section:

close
dup
ioctl
link
lseek
mkdir
mount
open
read
sync
umount
unlink
write

13.6.3. I/O Buffering

The kernel avoids unnecessary device I/O by buffering most I/O in a fixed-size, systemwide data structure call the buffer cache. The buffer cache is a collection of buffers that are used for caching file blocks in RAM.

When a process reads from a block for the very first time, the block is copied from the file or device into the buffer cache and then copied from there into the process's data space. Subsequent reads from the same block are serviced directly from RAM. Similarly, if a process writes to a block that isn't in the buffer cache, the block is copied from the file into the cache and then the buffered copy is modified. If the block is already in the cache, the buffered version is modified without any need for physical I/O. Several hash lists based on the block's device and block number are maintained for the buffers in the cache so that the kernel can quickly locate a buffered block.

[Page 562]

It's tempting to think that the kernel copies all of a file's modified buffered blocks back to disk when the file is closed, but it may not. It is not necessary to write the buffered blocks back to the device until the buffer needs to be reused for another stream of data. This scheme delays physical I/O until the last possible moment. If another process performs I/O on the same data block, it will receive data from the buffer cache or write its own data into the buffer as long as it is associated with the file or device. When the buffer is disassociated from the file or device, the entire contents of the buffer is written to the device or flushed.

Linux includes a kernel daemon called bdflush, the buffer-dirty flush daemon, that periodically flushes "dirty" buffers (buffers that have data that needs to be written to disk).

13.6.3.1. sync ()

sync () causes the kernel to flush all of the buffers to disk.

13.6.4. Regular File I/O

The next few sections describe the implementation of regular file I/O, including the implementations of open (), read (), write (), lseek (), close (), dup (), and unlink ().

13.6.4.1. open ()

Let's take a look at what happens when a process opens an existing regular file for read-only access. Later, we'll examine the way the kernel creates a new file. Assume that the process is the first process to open the file since the system was last rebooted, and that it executes the following code:

fd = open ("/home/glass/sample.txt", O_RDONLY);

The kernel begins by translating the filename into an inode number, using the algorithm described earlier in this chapter. If the inode of the file is not found, an error code is returned. Otherwise, the kernel takes an available file structure from the free list to hold information about the file and adds a pointer to it to the process's files_struct structure and links it into the kernel's open file list. Finally, the kernel returns the index of the file descriptor entry as the return value of open ().

If a process opens a nonexistent file and specifies the O_CREAT option, the kernel creates the named file. To do this, it allocates a free inode from the file system's inode list, sets the fields within it to indicate that the file is empty, and then adds a hard link to the appropriate directory file. Recall that a hard link is an entry consisting of a filename and its associated inode number.

Now that you've seen the way the kernel handles an open () system call, I'll describe the read (), write (), lseek (), and close () system calls. For now, assume that the example file is being accessed by just one process.

[Page 563]

13.6.4.2. read ()

Let's see what happens when the example process executes the following sequence of read () system calls:

read (fd, buf1, 100);  /* read 100 bytes into buffer buf1 */ read (fd, buf2, 200);  /* read 200 bytes into buffer buf2 */ read (fd, buf3, 5000); /* read 5000 bytes into buffer buf3 */

Here's the sequence of events that would occur:

The data requested by the first read () resides in the first block of the file. The kernel determines that the block is not in the buffer cache, and so copies it from disk into a free buffer. It then copies the first 100 bytes from the buffer into buf1. Finally, the file position stored in the file struct is updated to its new value of 100.
The data requested by the second read () also resides in the first block of the file. The kernel finds that the block is already in the buffer cache, and so copies the next 200 bytes from the buffer into buf2. It then updates the file position to 300.
The data requested by the third read resides partly in the first block of the file and partly in the second block. The kernel transfers the remainder of the first block (3796 bytes) from the buffer cache into buf3. It then copies the second block from disk into a free buffer in the cache and copies the remaining data (1204 bytes) from the buffer cache into buf3. Finally, it updates the file position to 5300.

Note that a single read may cause more than one block to be copied from disk into the buffer cache. If a process reads from a block that does not have an allocated user block (see Chapter 12, "Systems Programming," for a discussion of sparse files), then read () doesn't buffer anything, but instead treats the block as if it were filled with NULL (ASCII 0) characters.

13.6.4.3. write ()

The example process now executes the following series of write () system calls:

write (fd, buf4, 100);  /* write 100 bytes from buffer buf4 */ write (fd, buf5, 4000); /* write 4000 bytes from buffer buf5 *

Recall that the current value of the file position is 5300, which is situated near the start of the file's second block. Recall also that this block is currently buffered, courtesy of the last read (). Here's the sequence of events that would occur:

The data to be overwritten by the first write () resides entirely in the second block. This block is already in the buffer cache, and so 100 bytes of buf4 are copied into the appropriate bytes of the buffered second block.
The data to be overwritten by the second write () resides partly in the second block and partly in the third block. The kernel copies the first 3792 bytes of buf5 into the remaining 3792 bytes of the buffered second block. Then it copies the third block from the file into a free buffer. Finally, it copies the remaining 208 bytes of buf5 into the first 208 bytes of the buffered third block.
[Page 564]

13.6.4.4. lseek ()

The implementation of lseek () is trivial. The kernel simply changes the value of the descriptor's associated file position, located in the file struct. Note that no physical I/O is necessary. Changing the current location to 3000 is performed by the following code:

lseek (fd, 3000, SEEK_SET);

13.6.4.5. close ()

When a file descriptor is closed and it's the only one associated with a particular file, the kernel copies the file's inode back to disk and then marks the corresponding file structure as free. When a process terminates, the kernel automatically closes all of its open file descriptors.

As I mentioned earlier, the kernel has special mechanisms to support multiple file descriptors associated with the same file. To implement these mechanisms, the kernel keeps a reference count field in the structure representing each open file as well as the file's inode structure. When a file is opened for the first time, the reference count is set to one. Subsequent opens by other processes increment the reference count.

Multiple file structures can point to the same inode, in which case the inode's reference count will be greater than one. Multiple fds, either in a process or in multiple processes, can point to the same file structure, in which case its reference count will be greater than one.

The algorithm for close () handles the reference count fields as follows: when a file descriptor is closed, the kernel decrements the reference count field in its associated file struct. If the reference count remains greater than zero, nothing else occurs. If the reference count drops to zero, the file struct is marked as free and the reference count field in the file's inode structure is decremented. If the inode structure's reference count remains greater than zero, nothing else happens. If the inode's reference count drops to zero, the inode is copied back to disk and the inode structure is freed.

13.6.4.6. dup ()

The implementation of dup () is simple; it copies the specified file descriptor into the next free file descriptor array entry and increments the corresponding file structure reference count.

13.6.4.7. unlink ()

unlink () removes a hard link from a directory and decrements its associated inode's hard link count. If the hard link count drops to zero, the file's inode and data blocks are deallocated when the last process using it exits. Notice that this means that a process may unlink a file and continue to access it until the process exits (and all the while it will not be visible in the directory).

[Page 565]

13.6.5. Directory File I/O

Directory files are different from regular files in a few ways:

They may only be created using mknod () or mkdir ().
They may only be read using readdir ().
They may only be modified using link ().

This ensures the integrity of the directory hierarchy. Directory files may be opened just like regular files. Let's take a look at the implementation of mkdir () and link ().

13.6.5.1. mkdir ()

mkdir () creates a directory by allocating a new inode on disk, setting its type field accordingly, and adding it via a hard link into the directory hierarchy. A user block is associated with the inode and filled with the default "." and ".." entries. Here's an example of mkdir ():

mkdir ("/home/glass/tmpdir", 0666);

13.6.5.2. link ()

link () adds a hard link into a directory. Here's an example of link ():

link ("/home/glass/file1.c", "/home/glass/file2.c");

In this example, the kernel would find the inode number of the source filename "/home/glass/file1.c" and then associate it with the label "file2.c" in the destination directory "/home/glass." It would then increment the inode's hard link count. Only a super-user may link directories, to prevent unwary users from creating circular directory structures.

13.6.6. Mounting File Systems

The kernel maintains a data structure called the mount table that allows multiple file systems to be accessed via a single directory hierarchy. The mount () and umount () system calls modify this table, and are executable only by a super-user.

13.6.6.1. mount ()

When a file system is mounted using mount (), an entry is added to the mount table containing the following fields:

the number of the device that contains the newly mounted file system
a pointer to the root inode of the newly mounted file system
a pointer to the inode of the mount point
a pointer to the mount data structure specific to the newly mounted file system

The directory associated with the mount point becomes synonymous with the root node of the newly mounted file system, and its previous contents become inaccessible to processes until the file system is later unmounted. To enable the correct translation of pathnames that cross mount points, the inode of the mount directory is marked as a mount point, and is set to point to the associated mount-table entry. For example, Figure 13-29 shows the effect of the following system call, which mounts the file system contained on the "/dev/da0" device onto the "/mnt" directory:

[Page 566]

Figure 13-29. Mounting a directory.

mount ("/dev/da0", "/mnt", 0);

13.6.7. Translation of Filenames

The name-translation algorithm uses the contents of the mount table when translating pathnames that cross mount points. This can occur when moving up or down the directory hierarchy. For example, consider the following example:

$ cd /mnt/tmp1 $ cd ../../bin

The first cd command crosses from the root device to the "/dev/da0" device, and the second cd command crosses back across to the root device. Here's how the algorithm incorporates mounted file systems into the translation process:

When an inode is encountered during the translation process that is a mount point, a pointer to the root inode of the mounted file system is returned instead. For example, when the "/mnt" portion of the "/mnt/dir1" is translated, a pointer to the root node of the mounted file system is returned. This pointer is used as the starting point for the rest of the pathname translation.
When a ".." pathname component is encountered, the kernel checks to see whether a mount point is about to be crossed. If the current inode pointer of the translation process points to a root node and ".." also points to a root node, then a crossing point has been reached (Figure 13-30). It replaces the current inode pointer of the translation process with a pointer to the inode of the mount point in the parent file system, which it finds by scanning the mount table for the entry corresponding to the device number of the current inode.
[Page 567]

Figure 13-30. Crossing point.

13.6.7.1. umount ()

When unmounting a file system, there are several things that the kernel must do:

It checks that there are no open files in the file system about to be unmounted. If any open files are found, the system call fails.
It flushes the superblock and buffered inodes back to the file system.
It removes the mount-table entry and removes the mount-point mark from the mount-point directory.

13.6.8. Special File I/O

Most special files correspond to peripherals such as printers, terminals, and disk drives, so for the rest of this section I'll use the terms special file and peripheral synonymously.

Every peripheral in the system has an associated device driver, which is a custom-crafted piece of software that contains all of the peripheral-specific code. For example, a tape drive's device driver contains the code for rewinding and retensioning the tape. A single device driver may control all instances of a particular kind of peripheral. In other words, three tape drives of the same type can share a single device driver. The device drivers can be linked into the kernel during configuration of the kernel or can be in loadable kernel modules and loaded only when they need to run.

13.6.8.1. Device Interface

A peripheral's device driver supplies the peripheral's interface, which can come in two flavors:

block-oriented, which means that I/O is buffered, and that physical I/O is performed on a block-by-block basis. Disk drives and tape drives have a block-oriented interface.
character-oriented, which means that I/O is unbuffered, and that physical I/O occurs on a character-by-character basis. A character-oriented interface is sometimes known as a raw interface. All peripherals usually have a raw interface, including disk drives and tape drives.

A peripheral's device driver sometimes contains both kinds of interfaces. The kind of interface that you choose depends on how you're going to access the device. When performing random access and repeated access to a common set of blocks, it makes good sense to access the peripheral via its block-oriented interface. However, if you're going to access the blocks in a single linear sequence, as you would when making a backup tape, it makes more sense to access the peripheral via its character-oriented interface. This avoids the overhead of the kernel's internal buffering mechanism and sometimes allows the kernel to use the hardware's Direct Memory Access (DMA) capabilities.

[Page 568]

It's perfectly possible, although not advisable, to access a single device simultaneously via both interfaces. The trouble with this is that the character-oriented interface bypasses the buffering system, possibly leading to confusing I/O results. Here's an example:

Process A opens a floppy disk using its block-oriented interface, "/dev/fd0." It then writes 1000 bytes to the disk. This output is stored in the buffer cache, and marked for delayed writing.
Process B then opens the same floppy disk using its character-oriented interface, "/dev/rfd0." When it reads 1000 bytes from the disk, the data that was written by process A is ignored, since it's still in the buffer cache.

The solution to this problem is easy: Don't open a device via different interfaces simultaneously!

13.6.8.2. Major/Minor Numbers

The major and minor device numbers are used to locate the device driver associated with a particular device. The major device number specifies which device driver configured into the kernel will be used to access the device. The minor device number specifies which one of the (possibly many) devices will be used. For example, if you have three tape drives on a system, and the device driver for the tape drive corresponds to major number 15, if you use the ls command to list the block-oriented tape devices, you might see:

brw--w--w-  1    root 15,  0 Feb 13 14:27 /dev/mt0 brw--w--w-  1    root 15,  1 Feb 13 14:29 /dev/mt1 brw--w--w-  1    root 15,  2 Feb 13 14:27 /dev/mt2

From this, we see that all three tape drives are accessed by the same device driver (signified by the index of 15), and each minor number uniquely identifies a specific tape drive. The major and minor numbers are used to index into switch tables to locate the appropriate device driver.

13.6.8.3. Device Access

All Linux device drivers provide a set of standard operational functions that perform open, close, read, and write operations on the peripheral device. Pointers to these functions are maintained in the file structure of each open file. In this way, when a process requests an operation on a special file, the proper function in the device driver gets called to perform the I/O operation.

lseek (), chmod (), and stat () work the same way for special files as they do for regular files. For block-oriented devices, blocks of data are buffered in the buffer cache.

[Page 569]

13.6.8.4. Terminal I/O

Although terminals are a kind of peripheral, terminal device drivers are special even for special files. The main difference between terminal device drivers and other device drivers is that they must support several different kinds of pre- and post-processing on their input and output, respectively. Each variety of processing is called a line discipline. A terminal's line discipline can be set using ioctl (). Most terminal drivers support the following three common line disciplines:

raw mode, which performs no special processing at all. Characters entered at the keyboard are made available to the reading process based on the ioctl () parameters. Key sequences such as Control-C do not generate any kind of special action, and are passed as regular ASCII characters. For example, Control-C would be read as the character with ASCII value 3. Raw mode is used by applications such as editors that prefer to do all of their own character processing.
cbreak mode, which processes only some key sequences specially. For example, flow control via Control-S and Control-Q remains active. Similarly, Control-C generates an interrupt signal for every process in the foreground job. As with raw mode, all other characters are available to the reading process based on the ioctl () parameters.
cooked mode (sometimes known as canonical mode), which performs full pre- and postprocessing. In this mode, the delete and backspace keys take on their special meanings, together with the less common word-erase and line-erase characters. Input is made available to a reading process only when the Enter key is pressed. Similarly, tabs have a special meaning when output, and are expanded by the line discipline to the correct number of spaces. A newline character is expanded into a carriage return/newline pair.

Terminal devices are controlled through the use of the ioctl () system call and termios, a group of library functions provided by Linux especially for driving terminal I/O.