13.6. Input and OutputIn this section, I'll describe the data structures and algorithms that the Linux kernel uses to support I/O-related system calls. Specifically, I'll look at the implementation of these calls in relation to three main types of file:
13.6.1. I/O ObjectsI like to think of files as being special kinds of objects that have I/O capabilities. Linux I/O objects may be arranged according to the hierarchy in Figure 13-28. Figure 13-28. The I/O object hierarchy.
13.6.2. I/O System CallsAs I described in Chapter 12, "Systems Programming," the I/O system calls may be applied in a uniform way to all I/O objects. A few exceptions exist; for example, you can't use lseek () on a pipe or a socket. Here's a list of the system calls that are described in this section:
13.6.3. I/O BufferingThe kernel avoids unnecessary device I/O by buffering most I/O in a fixed-size, systemwide data structure call the buffer cache. The buffer cache is a collection of buffers that are used for caching file blocks in RAM. When a process reads from a block for the very first time, the block is copied from the file or device into the buffer cache and then copied from there into the process's data space. Subsequent reads from the same block are serviced directly from RAM. Similarly, if a process writes to a block that isn't in the buffer cache, the block is copied from the file into the cache and then the buffered copy is modified. If the block is already in the cache, the buffered version is modified without any need for physical I/O. Several hash lists based on the block's device and block number are maintained for the buffers in the cache so that the kernel can quickly locate a buffered block. It's tempting to think that the kernel copies all of a file's modified buffered blocks back to disk when the file is closed, but it may not. It is not necessary to write the buffered blocks back to the device until the buffer needs to be reused for another stream of data. This scheme delays physical I/O until the last possible moment. If another process performs I/O on the same data block, it will receive data from the buffer cache or write its own data into the buffer as long as it is associated with the file or device. When the buffer is disassociated from the file or device, the entire contents of the buffer is written to the device or flushed. Linux includes a kernel daemon called bdflush, the buffer-dirty flush daemon, that periodically flushes "dirty" buffers (buffers that have data that needs to be written to disk). 13.6.3.1. sync ()sync () causes the kernel to flush all of the buffers to disk. 13.6.4. Regular File I/OThe next few sections describe the implementation of regular file I/O, including the implementations of open (), read (), write (), lseek (), close (), dup (), and unlink (). 13.6.4.1. open ()Let's take a look at what happens when a process opens an existing regular file for read-only access. Later, we'll examine the way the kernel creates a new file. Assume that the process is the first process to open the file since the system was last rebooted, and that it executes the following code: fd = open ("/home/glass/sample.txt", O_RDONLY); The kernel begins by translating the filename into an inode number, using the algorithm described earlier in this chapter. If the inode of the file is not found, an error code is returned. Otherwise, the kernel takes an available file structure from the free list to hold information about the file and adds a pointer to it to the process's files_struct structure and links it into the kernel's open file list. Finally, the kernel returns the index of the file descriptor entry as the return value of open (). If a process opens a nonexistent file and specifies the O_CREAT option, the kernel creates the named file. To do this, it allocates a free inode from the file system's inode list, sets the fields within it to indicate that the file is empty, and then adds a hard link to the appropriate directory file. Recall that a hard link is an entry consisting of a filename and its associated inode number. Now that you've seen the way the kernel handles an open () system call, I'll describe the read (), write (), lseek (), and close () system calls. For now, assume that the example file is being accessed by just one process. 13.6.4.2. read ()Let's see what happens when the example process executes the following sequence of read () system calls: read (fd, buf1, 100); /* read 100 bytes into buffer buf1 */ read (fd, buf2, 200); /* read 200 bytes into buffer buf2 */ read (fd, buf3, 5000); /* read 5000 bytes into buffer buf3 */ Here's the sequence of events that would occur:
Note that a single read may cause more than one block to be copied from disk into the buffer cache. If a process reads from a block that does not have an allocated user block (see Chapter 12, "Systems Programming," for a discussion of sparse files), then read () doesn't buffer anything, but instead treats the block as if it were filled with NULL (ASCII 0) characters. 13.6.4.3. write ()The example process now executes the following series of write () system calls: write (fd, buf4, 100); /* write 100 bytes from buffer buf4 */ write (fd, buf5, 4000); /* write 4000 bytes from buffer buf5 * Recall that the current value of the file position is 5300, which is situated near the start of the file's second block. Recall also that this block is currently buffered, courtesy of the last read (). Here's the sequence of events that would occur:
13.6.4.4. lseek ()The implementation of lseek () is trivial. The kernel simply changes the value of the descriptor's associated file position, located in the file struct. Note that no physical I/O is necessary. Changing the current location to 3000 is performed by the following code: lseek (fd, 3000, SEEK_SET); 13.6.4.5. close ()When a file descriptor is closed and it's the only one associated with a particular file, the kernel copies the file's inode back to disk and then marks the corresponding file structure as free. When a process terminates, the kernel automatically closes all of its open file descriptors. As I mentioned earlier, the kernel has special mechanisms to support multiple file descriptors associated with the same file. To implement these mechanisms, the kernel keeps a reference count field in the structure representing each open file as well as the file's inode structure. When a file is opened for the first time, the reference count is set to one. Subsequent opens by other processes increment the reference count. Multiple file structures can point to the same inode, in which case the inode's reference count will be greater than one. Multiple fds, either in a process or in multiple processes, can point to the same file structure, in which case its reference count will be greater than one. The algorithm for close () handles the reference count fields as follows: when a file descriptor is closed, the kernel decrements the reference count field in its associated file struct. If the reference count remains greater than zero, nothing else occurs. If the reference count drops to zero, the file struct is marked as free and the reference count field in the file's inode structure is decremented. If the inode structure's reference count remains greater than zero, nothing else happens. If the inode's reference count drops to zero, the inode is copied back to disk and the inode structure is freed. 13.6.4.6. dup ()The implementation of dup () is simple; it copies the specified file descriptor into the next free file descriptor array entry and increments the corresponding file structure reference count. 13.6.4.7. unlink ()unlink () removes a hard link from a directory and decrements its associated inode's hard link count. If the hard link count drops to zero, the file's inode and data blocks are deallocated when the last process using it exits. Notice that this means that a process may unlink a file and continue to access it until the process exits (and all the while it will not be visible in the directory). 13.6.5. Directory File I/ODirectory files are different from regular files in a few ways:
This ensures the integrity of the directory hierarchy. Directory files may be opened just like regular files. Let's take a look at the implementation of mkdir () and link (). 13.6.5.1. mkdir ()mkdir () creates a directory by allocating a new inode on disk, setting its type field accordingly, and adding it via a hard link into the directory hierarchy. A user block is associated with the inode and filled with the default "." and ".." entries. Here's an example of mkdir (): mkdir ("/home/glass/tmpdir", 0666); 13.6.5.2. link ()link () adds a hard link into a directory. Here's an example of link (): link ("/home/glass/file1.c", "/home/glass/file2.c"); In this example, the kernel would find the inode number of the source filename "/home/glass/file1.c" and then associate it with the label "file2.c" in the destination directory "/home/glass." It would then increment the inode's hard link count. Only a super-user may link directories, to prevent unwary users from creating circular directory structures. 13.6.6. Mounting File SystemsThe kernel maintains a data structure called the mount table that allows multiple file systems to be accessed via a single directory hierarchy. The mount () and umount () system calls modify this table, and are executable only by a super-user. 13.6.6.1. mount ()When a file system is mounted using mount (), an entry is added to the mount table containing the following fields:
The directory associated with the mount point becomes synonymous with the root node of the newly mounted file system, and its previous contents become inaccessible to processes until the file system is later unmounted. To enable the correct translation of pathnames that cross mount points, the inode of the mount directory is marked as a mount point, and is set to point to the associated mount-table entry. For example, Figure 13-29 shows the effect of the following system call, which mounts the file system contained on the "/dev/da0" device onto the "/mnt" directory: Figure 13-29. Mounting a directory.
mount ("/dev/da0", "/mnt", 0); 13.6.7. Translation of FilenamesThe name-translation algorithm uses the contents of the mount table when translating pathnames that cross mount points. This can occur when moving up or down the directory hierarchy. For example, consider the following example: $ cd /mnt/tmp1 $ cd ../../bin The first cd command crosses from the root device to the "/dev/da0" device, and the second cd command crosses back across to the root device. Here's how the algorithm incorporates mounted file systems into the translation process:
13.6.7.1. umount ()When unmounting a file system, there are several things that the kernel must do:
13.6.8. Special File I/OMost special files correspond to peripherals such as printers, terminals, and disk drives, so for the rest of this section I'll use the terms special file and peripheral synonymously. Every peripheral in the system has an associated device driver, which is a custom-crafted piece of software that contains all of the peripheral-specific code. For example, a tape drive's device driver contains the code for rewinding and retensioning the tape. A single device driver may control all instances of a particular kind of peripheral. In other words, three tape drives of the same type can share a single device driver. The device drivers can be linked into the kernel during configuration of the kernel or can be in loadable kernel modules and loaded only when they need to run. 13.6.8.1. Device InterfaceA peripheral's device driver supplies the peripheral's interface, which can come in two flavors:
A peripheral's device driver sometimes contains both kinds of interfaces. The kind of interface that you choose depends on how you're going to access the device. When performing random access and repeated access to a common set of blocks, it makes good sense to access the peripheral via its block-oriented interface. However, if you're going to access the blocks in a single linear sequence, as you would when making a backup tape, it makes more sense to access the peripheral via its character-oriented interface. This avoids the overhead of the kernel's internal buffering mechanism and sometimes allows the kernel to use the hardware's Direct Memory Access (DMA) capabilities. It's perfectly possible, although not advisable, to access a single device simultaneously via both interfaces. The trouble with this is that the character-oriented interface bypasses the buffering system, possibly leading to confusing I/O results. Here's an example:
The solution to this problem is easy: Don't open a device via different interfaces simultaneously! 13.6.8.2. Major/Minor NumbersThe major and minor device numbers are used to locate the device driver associated with a particular device. The major device number specifies which device driver configured into the kernel will be used to access the device. The minor device number specifies which one of the (possibly many) devices will be used. For example, if you have three tape drives on a system, and the device driver for the tape drive corresponds to major number 15, if you use the ls command to list the block-oriented tape devices, you might see: brw--w--w- 1 root 15, 0 Feb 13 14:27 /dev/mt0 brw--w--w- 1 root 15, 1 Feb 13 14:29 /dev/mt1 brw--w--w- 1 root 15, 2 Feb 13 14:27 /dev/mt2 From this, we see that all three tape drives are accessed by the same device driver (signified by the index of 15), and each minor number uniquely identifies a specific tape drive. The major and minor numbers are used to index into switch tables to locate the appropriate device driver. 13.6.8.3. Device AccessAll Linux device drivers provide a set of standard operational functions that perform open, close, read, and write operations on the peripheral device. Pointers to these functions are maintained in the file structure of each open file. In this way, when a process requests an operation on a special file, the proper function in the device driver gets called to perform the I/O operation. lseek (), chmod (), and stat () work the same way for special files as they do for regular files. For block-oriented devices, blocks of data are buffered in the buffer cache. 13.6.8.4. Terminal I/OAlthough terminals are a kind of peripheral, terminal device drivers are special even for special files. The main difference between terminal device drivers and other device drivers is that they must support several different kinds of pre- and post-processing on their input and output, respectively. Each variety of processing is called a line discipline. A terminal's line discipline can be set using ioctl (). Most terminal drivers support the following three common line disciplines:
Terminal devices are controlled through the use of the ioctl () system call and termios, a group of library functions provided by Linux especially for driving terminal I/O. |