Section 6.5. VFS System Calls and the Filesystem Layer

6.5. VFS System Calls and the Filesystem Layer

Until this point, we covered all the structures that are associated with the VFS and the page cache. Now, we focus on two of the system calls used in file manipulation and trace their execution down to the kernel level. We see how the open(), close(), read(), and write() system calls make use of the structures previously described.

We mentioned that in the VFS, files are treated as complete abstractions. You can open, read, write, or close a file, but the specifics of what physically happens are unimportant to the VFS layer. Chapter 5 covers these specifics.

Hooked into the VFS is the filesystem-specific layer that translates the VFS' file I/O to pages and blocks. Because you can have many specific filesystem types on a computer system, like an ext2 formatted hard disk and an iso9660 cdrom, the filesystem layer can be divided into two main sections: the generic filesystem operations and the specific filesystem operations (refer to Figure 6.3).

Following our top-down approach, this section traces a read and write request from the VFS call of read(), or write(), tHRough the filesystem layer until a specific block I/O request is handed off to the block device driver. In our travels, we move between the generic filesystem and specific filesystem layer. We use the ext2 filesystem driver as the example of the specific filesystem layer, but keep in mind that different filesystem drivers could be accessed depending on what file is being acted upon. As we progress, we will also encounter the page cache, which is a construct within Linux that is positioned in the generic filesystem layer. In older versions of Linux, a buffer cache and page cache exist, but in the 2.6 kernel, the page cache has consumed any buffer cache functionality.

6.5.1. open ()

When a process wants to manipulate the contents of a file, it issues the open()system call:

 ----------------------------------------------------------------------- synopsis #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> int open(const char *pathname, int flags); int open(const char *pathname, int flags, mode_t mode); int creat(const char *pathname, mode_t mode); -----------------------------------------------------------------------

The open syscall takes as its arguments the pathname of the file, the flags to identify access mode of the file being opened, and the permission bit mask (if the file is being created). open() returns the file descriptor of the opened file (if successful) or an error code (if it fails).

The flags parameter is formed by bitwise ORing one or more of the constants defined in include/linux/fcntl.h. Table 6.9 lists the flags for open() and the corresponding value of the constant. Only one of O_RDONLY, O_WRONLY, or O_RDWR flags has to be specified. The additional flags are optional.

Table 6.9. open() Flags
Flag Name	Value	Description
`O_RDONLY`	0	Opens file for reading.
`O_WRONLY`	1	Opens file for writing.
`O_RDWR`	2	Opens file for reading and writing.
`O_CREAT`	100	Indicates that, if the file does not exist, it should be created. The `creat()` function is equivalent to the `open()` function with this flag set.
`O_EXCL`	200	Used in conjunction with `O_CREAT`, this indicates the `open()` should fail if the file does exist.
`O_NOCTTY`	400	In the case that pathname refers to a terminal device, the process should not consider it a controlling terminal.
`O_TRUNC`	0x1000	If the file exists, truncate it to 0 bytes.
`O_APPEND`	0x2000	Writes at the end of the file.
`O_NONBLOCK`	0x4000	Opens the file in non-blocking mode.
`O_NDELAY`	0x4000	Same value as `O_NONBLOCK`.
`O_SYNC`	0x10000	Writes to the file have to wait for the completion of physical I/O. Applied to files on block devices.
`O_DIRECT`	0x20000	Minimizes cache buffering on I/O to the file.
`O_LARGEFILE`	0x100000	The large filesystem allows files of sizes greater than can be represented in 31 bits. This ensures they can be opened.
`O_DIRECTORY`	0x200000	If the pathname does not indicate a directory, the open is to fail.
`O_NOFOLLOW`	0x400000	If the pathname is a symbolic link, the open is to fail.

Let's look at the system call:

 ----------------------------------------------------------------------- fs/open.c 927  asmlinkage long sys_open (const char __user * filename, int flags, int mode) 928  { 929   char * tmp; 930   int fd, error; 931   932 #if BITS_PER_LONG != 32 933   flags |= O_LARGEFILE; 934  #endif 935   tmp = getname(filename); 936   fd = PTR_ERR(tmp); 937   if (!IS_ERR(tmp)) { 938    fd = get_unused_fd(); 939    if (fd >= 0) { 940     struct file *f = filp_open(tmp, flags, mode); 941     error = PTR_ERR(f); 942     if (IS_ERR(f)) 943       goto out_error; 944     fd_install(fd, f); 945    } 946  out: 947    putname(tmp); 948   } 949   return fd; 950   951  out_error: 952   put_unused_fd(fd); 953   fd = error; 954   goto out; 955  } -----------------------------------------------------------------------

Lines 932934

Verify if our system is non-32-bit. If so, enable the large filesystem support flag O_LARGEFILE. This allows the function to open files with sizes greater than those represented by 31 bits.

Line 935

The getname() routine copies the filename from user space to kernel space by invoking strncpy_from_user().

Line 938

The get_unused_fd() routine returns the first available file descriptor (or index into fd array: current->files->fd) and marks it busy. The local variable fd is set to this value.

Line 940

The filp_open() function performs the bulk of the open syscall work and returns the file structure that will associate the process with the file. Let's take a closer look at the filp_open() routine:

 ----------------------------------------------------------------------- fs/open.c 740  struct file *filp_open(const char * filename, int flags, int mode) 741  { 742   int namei_flags, error; 743   struct nameidata nd; 744 745   namei_flags = flags; 746   if ((namei_flags+1) & O_ACCMODE) 747    namei_flags++; 748   if (namei_flags & O_TRUNC) 749    namei_flags |= 2; 750   751   error = open_namei(filename, namei_flags, mode, &nd); 752   if (!error) 753    return dentry_open(nd.dentry, nd.mnt, flags); 754   755   return ERR_PTR(error); -----------------------------------------------------------------------

Lines 745749

The pathname lookup functions, such as open_namei(), expect the access mode flags encoded in a specific format that is different from the format used by the open system call. These lines copy the access mode flags into the namei_flags variable and format the access mode flags for interpretation by open_namei().

The main difference is that, for pathname lookup, it can be the case that the access mode might not require read or write permission. This "no permission" access mode does not make sense when trying to open a file and is thus not included under the open system call flags. "No permission" is indicated by the value of 00. Read permission is then indicated by setting the value of the low-order bit to 1 whereas write permission is indicated by setting the value of the high-order bit to 1. The open system call flags for O_RDONLY, O_WRONLY, and O_RDWR evaluate to 00, 01, and 02, respectively as seen in include/asm/fcntl.h.

The namei_flags variable can extract the access mode by logically bit ANDing it with the O_ACCMODE variable. This variable holds the value of 3 and evaluates to true if the variable to be ANDed with it holds a value of 1, 2, or 3. If the open system call flag was set to O_RDONLY, O_WRONLY, and O_RDWR, adding a 1 to this value translates it into the pathname lookup format and evaluates to true when ANDed with O_ACCMODE. The second check just assures that if the open system call flag is set to allow for file truncation, the high-order bit is set in the access mode specifying write access.

Line 751

The open_namei() routine performs the pathname lookup, generates the associated nameidata structure, and derives the corresponding inode.

Line 753

The dentry_open() is a wrapper routine around dentry_open_it(), which creates and initializes the file structure. It creates the file structure via a call to the kernel routine get_empty_filp(). This routine returns ENFILE if the files_stat.nr_files is greater than or equal to files_stat.max_files. This case indicates that the system's limit on the total number of open files has been reached.

Let's look at the dentry_open_it() routine:

[View full width]
 ----------------------------------------------------------------------- fs/open.c 844  struct file *dentry_open_it(struct dentry *dentry, struct 845   vfsmount *mnt, int  flags, struct lookup_intent *it) 846  { 847   struct file * f; 848   struct inode *inode; 849   int error; 850 851   error = -ENFILE; 852   f = get_empty_filp(); ... 855   f->f_flags = flags; 856   f->f_mode = (flags+1) & O_ACCMODE; 857   f->f_it = it; 858   inode = dentry->d_inode; 859   if (f->f_mode & FMODE_WRITE) { 860    error = get_write_access(inode); 861    if (error) 862      goto cleanup_file; 863   } ... 866   f->f_dentry = dentry; 867   f->f_vfsmnt = mnt; 868   f->f_pos = 0; 869   f->f_op = fops_get(inode->i_fop); 870   file_move(f, &inode->i_sb->s_files); 871 872   if (f->f_op && f->f_op->open) { 873     error = f->f_op->open(inode,f); 874     if (error) 875      goto cleanup_all; 876     intent_release(it); 877   } ... 891  return f; ... 907  } -----------------------------------------------------------------------

Line 852

The file struct is assigned by way of the call to get_empty_filp().

Lines 855856

The f_flags field of the file struct is set to the flags passed in to the open system call. The f_mode field is set to the access modes passed to the open system call, but in the format expected by the pathname lookup functions.

Lines 866869

The files struct's f_dentry field is set to point to the dentry struct that is associated with the file's pathname. The f_vfsmnt field is set to point to the vmfsmount struct for the filesystem. f_pos is set to 0, which indicates that the starting position of the file_offset is at the beginning of the file. The f_op field is set to point to the table of operations pointed to by the file's inode.

Line 870

The file_move() routine is called to insert the file structure into the filesystem's superblock list of file structures representing open files.

Lines 872877

This is where the next level of the open function occurs. It is called here if the file has more file-specific functionality to perform to open the file. It is also called if the file operations table for the file contains an open routing.

This concludes the dentry_open_it() routine.

By the end of filp_open(), we will have a file structure allocated, inserted at the head of the superblock's s_files field, with f_dentry pointing to the dentry object, f_vfsmount pointing to the vfsmount object, f_op pointing to the inode's i_fop file operations table, f_flags set to the access flags, and f_mode set to the permission mode passed to the open() call.

Line 944

The fd_install() routine sets the fd array pointer to the address of the file object returned by filp_open(). That is, it sets current->files->fd[fd].

Line 947

The putname() routine frees the kernel space allocated to store the filename.

Line 949

The file descriptor fd is returned.

Line 952

The put_unused_fd() routine clears the file descriptor that has been allocated. This is called when a file object failed to be created.

To summarize, the hierarchical call of the open() syscall process looks like this:

sys_open:

getname(). Moves filename to kernel space
get_unused_fd(). Gets next available file descriptor
filp_open(). Creates the nameidata struct
open_namei(). Initializes the nameidata struct
dentry_open(). Creates and initializes the file object
fd_install(). Sets current->files->fd[fd] to the file object
putname(). Deallocates kernel space for filename

Figure 6.16 illustrates the structures that are initialized and set and identifies the routines where this was done.

Figure 6.16. Filesystem Structures

Table 6.10 shows some of the sys_open() return errors and the kernel routines that find them.

Table 6.10. sys_open() Errors
Error Code	Description	Function Returning Error
`ENAMETOOLONG`	Pathname too long.	`getname()`
`ENOENT`	File does not exist (and flag `O_CREAT` not set).	`getname()`
`EMFILE`	Process has maximum number of files open.	`get_unused_fd()`
`ENFILE`	System has maximum number of files open.	`get_unused_filp()`

6.5.2. close ()

After a process finishes with a file, it issues the close() system call:

 synopsis #include <unistd.h> int close(int fd); -----------------------------------------------------------------------

The close system call takes as parameter the file descriptor of the file to be closed. In standard C programs, this call is made implicitly upon program termination. Let's delve into the code for sys_close():

 ----------------------------------------------------------------------- fs/open.c 1020  asmlinkage long sys_close(unsigned int fd) 1021  { 1022   struct file * filp; 1023   struct files_struct *files = current->files; 1024   1025   spin_lock(&files->file_lock); 1026   if (fd >= files->max_fds) 1027    goto out_unlock; 1028   filp = files->fd[fd]; 1029   if (!filp) 1030    goto out_unlock; 1031   files->fd[fd] = NULL; 1032   FD_CLR(fd, files->close_on_exec); 1033   __put_unused_fd(files, fd); 1034   spin_unlock(&files->file_lock); 1035   return filp_close(filp, files); 1036   1037  out_unlock: 1038   spin_unlock(&files->file_lock); 1039   return -EBADF; 1040  } -----------------------------------------------------------------------

Line 1023

The current task_struct's files field point at the files_struct that corresponds to our file.

Lines 10251030

These lines begin by locking the file so as to not run into synchronization problems. We then check that the file descriptor is valid. If the file descriptor number is greater than the highest allowable file number for that file, we remove the lock and return the error EBADF. Otherwise, we acquire the file structure address. If the file descriptor index does not yield a file structure, we also remove the lock and return the error as there would be nothing to close.

Lines 10311032

Here, we set the current->files->fd[fd] to NULL, removing the pointer to the file object. We also clear the file descriptor's bit in the file descriptor set referred to by files->close_on_exec. Because the file descriptor is closed, the process need not worry about keeping track of it in the case of a call to exec().

Line 1033

The kernel routine __put_unused_fd() clears the file descriptor's bit in the file descriptor set files->open_fds because it is no longer open. It also does something that assures us of the "lowest available index" assignment of file descriptors:

 ----------------------------------------------------------------------- fs/open.c 897  static inline void __put_unused_fd(struct files_struct *files,  unsigned int fd) 898  { 899  __FD_CLR(fd, files->open_fds); 890  if (fd < files->next_fd) 891   files->next_fd = fd; 892  } -----------------------------------------------------------------------

Lines 890891

The next_fd field holds the value of the next file descriptor to be assigned. If the current file descriptor's value is less than that held by files->next_fd, this field will be set to the value of the current file descriptor instead. This assures that file descriptors are assigned on the basis of the lowest available value.

Lines 10341035

The lock on the file is now released and the control is passed to the filp_close() function that will be in charge of returning the appropriate value to the close system call. The filp_close() function performs the bulk of the close syscall work. Let's take a closer look at the filp_close() routine:

 ----------------------------------------------------------------------- fs/open.c 987  int filp_close(struct file *filp, fl_owner_t id) 988  { 989  int retval; 990  /* Report and clear outstanding errors */ 991  retval = filp->f_error; 992  if (retval) 993   filp->f_error = 0; 994   995  if (!file_count(filp)) { 996   printk(KERN_ERR "VFS: Close: file count is 0\n"); 997   return retval; 998  } 999   1000  if (filp->f_op && filp->f_op->flush) { 1001   int err = filp->f_op->flush(filp); 1002   if (!retval) 1003     retval = err; 1004  } 1005   1006  dnotify_flush(filp, id); 1007  locks_remove_posix(filp, id); 1008  fput(filp); 1009  return retval; 1010  } -----------------------------------------------------------------------

Lines 991993

These lines clear any outstanding errors.

Lines 995997

This is a sanity check on the conditions necessary to close a file. A file with a file_count of 0 should already be closed. Hence, in this case, filp_close returns an error.

Lines 10001001

Invokes the file operation flush() (if it is defined). What this does is determined by the particular filesystem.

Line 1008

fput() is called to release the file structure. The actions performed by this routine include calling file operation release(), removing the pointer to the dentry and vfsmount objects, and finally, releasing the file object.

The hierarchical call of the close() syscall process looks like this:

sys_close():

__put_unused_fd(). Returns file descriptor to the available pool
filp_close(). Prepares file object for clearing
fput(). Clears file object

Table 6.11 shows some of the sys_close() return errors and the kernel routines that find them.

Table 6.11. sys_close() Errors
Error	Function	Description
`EBADF`	`sys_close()`	Invalid file descriptor

6.5.3. read()

When a user level program calls read(), Linux translates this to a system call, sys_read():

 ----------------------------------------------------------------------- fs/read_write.c 272 asmlinkage ssize_t sys_read(unsigned int fd, char __user * buf, size_t count) 273 { 274   struct file *file; 275   ssize_t ret = -EBADF; 276   int fput_needed; 277  278   file = fget_light(fd, &fput_needed); 279   if (file) { 280     ret = vfs_read(file, buf, count, &file->f_pos); 281     fput_light(file, fput_needed); 282   } 283  284   return ret; 285 } -----------------------------------------------------------------------

Line 272

sys_read() takes a file descriptor, a user-space buffer pointer, and a number of bytes to read from the file into the buffer.

Lines 273282

A file lookup is done to translate the file descriptor to a file pointer with fget_light(). We then call vfs_read(), which does all the main work. Each fget_light() needs to be paired with fput_light(,) so we do that after our vfs_read() finishes.

The system call, sys_read(), has passed control to vfs_read(), so let's continue our trace:

 ----------------------------------------------------------------------- fs/read_write.c 200 ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos) 201 { 202   struct inode *inode = file->f_dentry->d_inode; 203   ssize_t ret; 204  205   if (!(file->f_mode & FMODE_READ)) 206     return -EBADF; 207   if (!file->f_op || (!file->f_op->read && \            !file->f_op->aio_read)) 208     return -EINVAL; 209  210   ret = locks_verify_area(FLOCK_VERIFY_READ, inode, file, *pos, count); 211   if (!ret) { 212     ret = security_file_permission (file, MAY_READ); 213     if (!ret) { 214       if (file->f_op->read) 215         ret = file->f_op->read(file, buf, count, pos); 216       else 217         ret = do_sync_read(file, buf, count, pos); 218       if (ret > 0) 219         dnotify_parent(file->f_dentry, DN_ACCESS); 220     } 221   } 222  223   return ret; 224 } -----------------------------------------------------------------------

Line 200

The first three parameters are all passed via, or are translations from, the original sys_read() parameters. The fourth parameter is the offset within file, where the read should start. This could be non-zero if vfs_read() is called explicitly because it could be called from within the kernel.

Line 202

We store a pointer to the file's inode.

Lines 205208

Basic checking is done on the file operations structure to ensure that read or asynchronous read operations have been defined. If no read operation is defined, or if the operations table is missing, the function returns the EINVAL error at this point. This error indicates that the file descriptor is attached to a structure that cannot be used for reading.

Lines 210214

We verify that the area to be read is not locked and that the file is authorized to be read. If it is not, we notify the parent of the file (on lines 218219).

Lines 215217

These are the guts of vfs_read(). If the read file operation has been defined, we call it; otherwise, we call do_sync_read().

In our tracing, we follow the standard file operation read and not the do_sync_read() function. Later, it becomes clear that both calls eventually reach the same underlying point.

6.5.3.1. Moving from the Generic to the Specific

This is our first encounter with one of the many abstractions where we move between the generic filesystem layer and the specific filesystem layer. Figure 6.17 illustrates how the file structure points to the specific filesystem table or operations. Recall that when read_inode() is called, the inode information is filled in, including having the fop field point to the appropriate table of operations defined by the specific filesystem implementation (for example, ext2).

Figure 6.17. File Operations

When a file is created, or mounted, the specific filesystem layer initializes its file operations structure. Because we are operating on a file on an ext2 filesystem, the file operations structure is as follows:

 ----------------------------------------------------------------------- fs/ext2/file.c 42 struct file_operations ext2_file_operations = { 43   .llseek   = generic_file_llseek, 44   .read   = generic_file_read, 45   .write   = generic_file_write, 46   .aio_read  = generic_file_aio_read, 47   .aio_write  = generic_file_aio_write, 48   .ioctl   = ext2_ioctl, 49   .mmap   = generic_file_mmap, 50   .open   = generic_file_open, 51   .release  = ext2_release_file, 52   .fsync   = ext2_sync_file, 53   .readv   = generic_file_readv, 54   .writev   = generic_file_writev, 55   .sendfile  = generic_file_sendfile, 56 }; -----------------------------------------------------------------------

You can see that for nearly every file operation, the ext2 filesystem has decided that the Linux defaults are acceptable. This leads us to ask when a filesystem would want to implement its own file operations. When a filesystem is sufficiently unlike a UNIX filesystem, extra steps might be necessary to allow Linux to interface with it. For example, MSDOS- or FAT-based filesystems need to implement their own write but can use the generic read.^[10]

^[10] See fs/fat/file.c for more information.

Discovering that the specific filesystem layer for ext2 passes control to the generic filesystem layer, we now examine generic_file_read():

 ----------------------------------------------------------------------- mm/filemap.c 924 ssize_t 925 generic_file_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos) 926 { 927   struct iovec local_iov = { .iov_base = buf, .iov_len = count }; 928   struct kiocb kiocb; 929   ssize_t ret; 930  931   init_sync_kiocb(&kiocb, filp); 932   ret = __generic_file_aio_read(&kiocb, &local_iov, 1, ppos); 933   if (-EIOCBQUEUED == ret) 934     ret = wait_on_sync_kiocb(&kiocb); 935   return ret; 936 } 937  938 EXPORT_SYMBOL(generic_file_read); -----------------------------------------------------------------------

Lines 924925

Notice that the same parameters are simply being passed along from the upper-level reads. We have filp, the file pointer; buf, the pointer to the memory buffer where the file will be read into; count, the number of characters to read; and ppos, the position within the file to begin reading from.

Line 927

An iovec structure is created that contains the address and length of the user space buffer that the results of the read are to be stored in.

Lines 928 and 931

A kiocb structure is initialized using the file pointer. (kiocb stands for kernel I/O control block.)

Line 932

The bulk of the read is done in the generic asynchronous file read function.

Asynchronous I/O Operations

kiocb and iovec are two datatypes that facilitate asynchronous I/O operations within the Linux kernel.

Asynchronous I/O is desirable when a process wishes to perform an input or output operation without immediately waiting for the result of the operation. It is extremely desirable for high I/O environments, as you can allow the device the opportunity to order and schedule the I/O requests instead of the process.

In Linux, an I/O vector (iovec) represents an address range of memory and is defined as

[View full width]
 ------------------------------------------------------------------------- include/linux/uio.h 20 struct iovec 21 { 22   void __user *iov_base; /* BSD uses caddr_t (1003.1g requires  void *) */ 23   __kernel_size_t iov_len; /* Must be size_t (1003.1g) */ 24 }; -------------------------------------------------------------------------

This is simply a pointer to a section of memory and the length of the memory.

The kernel I/O control block (kiocb) is a structure that is required to help manage how and when the I/O vector gets operated upon asynchronously.

__generic_file_aio_read() function uses the kiocb and iovec structures to read the page_cache directly.

Lines 933935

After we send off the read, we wait until the read finishes and then return the result of the read operation.

Recall the do_sync_read() path in vfs_read(); it would have eventually called this same function via another path. Let's continue the trace of file I/O by examining __generic_file_aio_read():

 ----------------------------------------------------------------------- mm/filemap.c 835 ssize_t 836 __generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov, 837     unsigned long nr_segs, loff_t *ppos) 838 { 839   struct file *filp = iocb->ki_filp; 840   ssize_t retval; 841   unsigned long seg; 842   size_t count; 843  844   count = 0; 845   for (seg = 0; seg < nr_segs; seg++) { 846     const struct iovec *iv = &iov[seg]; ... 852     count += iv->iov_len; 853     if (unlikely((ssize_t)(count|iv->iov_len) <  0)) 854       return -EINVAL; 855     if (access_ok(VERIFY_WRITE, iv->iov_base, iv->iov_len)) 856       continue; 857     if (seg == 0) 858       return -EFAULT; 859     nr_segs = seg; 860     count -= iv->iov_len 861     break; 862   } ... -----------------------------------------------------------------------

Lines 835842

Recall that nr_segs was set to 1 by our caller and that iocb and iov contain the file pointer and buffer information. We immediately extract the file pointer from iocb.

Lines 845862

This for loop verifies that the iovec struct passed is composed of valid segments. Recall that it contains the user space buffer information.

 ----------------------------------------------------------------------- mm/filemap.c ... 863 864   /* coalesce the iovecs and go direct-to-BIO for O_DIRECT */ 865   if (filp->f_flags & O_DIRECT) { 866     loff_t pos = *ppos, size; 867     struct address_space *mapping; 868     struct inode *inode; 869  870     mapping = filp->f_mapping; 871     inode = mapping->host; 872     retval = 0; 873     if (!count) 874       goto out; /* skip atime */ 875     size = i_size_read(inode); 876     if (pos < size) { 877       retval = generic_file_direct_IO(READ, iocb, 878             iov, pos, nr_segs); 879       if (retval >= 0 && !is_sync_kiocb(iocb)) 880         retval = -EIOCBQUEUED; 881       if (retval > 0) 882         *ppos = pos + retval; 883     } 884     file_accessed(filp); 885     goto out; 886   } ... -----------------------------------------------------------------------

Lines 863886

This section of code is only entered if the read is direct I/O. Direct I/O bypasses the page cache and is a useful property of certain block devices. For our purposes, however, we do not enter this section of code at all. Most file I/O takes our path as the page cache, which we describe soon, which is much faster than the underlying block device.

 ----------------------------------------------------------------------- mm/filemap.c ... 887  888   retval = 0; 889   if (count) { 890     for (seg = 0; seg < nr_segs; seg++) { 891       read_descriptor_t desc; 892  893       desc.written = 0; 894       desc.buf = iov[seg].iov_base; 895       desc.count = iov[seg].iov_len; 896       if (desc.count == 0) 897         continue; 898       desc.error = 0; 899       do_generic_file_read(filp,ppos,&desc,file_read_actor); 900       retval += desc.written; 901       if (!retval) { 902         retval = desc.error; 903         break; 904       } 905     } 906   } 907 out: 08   return retval; 909 } -----------------------------------------------------------------------

Lines 889890

Because our iovec is valid and we have only one segment, we execute this for loop once only.

Lines 891898

We translate the iovec structure into a read_descriptor_t structure. The read_descriptor_t structure keeps track of the status of the read. Here is the description of the read_descriptor_t structure:

 ----------------------------------------------------------------------- include/linux/fs.h 837  typedef struct { 838   size_t written; 839   size_t count; 840   char __user * buf; 841   int error; 842  } read_descriptor_t; -----------------------------------------------------------------------

Line 838

The field written keeps a running count of the number of bytes transferred.

Line 839

The field count keeps a running count of the number of bytes left to be transferred.

Line 840

The field buf holds the current position into the buffer.

Line 841

The field error holds any error code encountered during the read operation.

Lines 899

We pass our new read_descriptor_t structure desc to do_generic_file_read(), along with our file pointer filp and our position ppos. file_read_actor() is a function that copies a page to the user space buffer located in desc.^[11]

^[11] file_read_actor() can be found on line 794 of mm/filemap.c.

Lines 900909

The amount read is calculated and returned to the caller.

At this point in the read() internals, we are about to access the page cache^[12] and determine if the sections of the file we want to read already exist in RAM, so we don't have to directly access the block device.

^[12] The page cache is described in Section 6.4, "Page Cache."

6.5.3.2. Tracing the Page Cache

Recall that the last function we encountered passed a file pointer filp, an offset ppos, a read_descriptor_t desc, and a function file_read_actor into do_generic_file_read().

 ----------------------------------------------------------------------- include/linux/fs.h 1420 static inline void do_generic_file_read(struct file * filp, loff_t *ppos, 1421           read_descriptor_t * desc, 1422           read_actor_t actor) 1423 { 1424   do_generic_mapping_read(filp->f_mapping, 1425         &filp->f_ra, 1426         filp, 1427         ppos, 1428         desc, 1429         actor); 1430 } -----------------------------------------------------------------------

Lines 14201430

do_generic_file_read() is simply a wrapper to do_generic_mapping_read(). filp->f_mapping is a pointer to an address_space object and filp->f_ra is a structure that holds the address of the file's read-ahead state.^[13]

^[13] See the "file Structure" section for more information about this field and read-ahead optimization.

So, we've transformed our read of a file into a read of the page cache via the address_space object in our file pointer. Because do_generic_mapping_read() is an extremely long function with a number of separate cases, we try to make the analysis of the code as painless as possible.

 ----------------------------------------------------------------------- mm/filemap.c 645 void do_generic_mapping_read(struct address_space *mapping, 646        struct file_ra_state *_ra, 647        struct file * filp, 648        loff_t *ppos, 649        read_descriptor_t * desc, 650        read_actor_t actor) 651 {       652   struct inode *inode = mapping->host; 653   unsigned long index, offset; 654   struct page *cached_page; 655   int error; 656   struct file_ra_state ra = *_ra; 657    658   cached_page = NULL;   659   index = *ppos >> PAGE_CACHE_SHIFT; 660   offset = *ppos & ~PAGE_CACHE_MASK; -----------------------------------------------------------------------

Line 652

We extract the inode of the file we're reading from address_space.

Lines 658660

We initialize cached_page to NULL until we can determine if it exists within the page cache. We also calculate index and offset based on page cache constraints. The index corresponds to the page number within the page cache, and the offset corresponds to the displacement within that page. When the page size is 4,096 bytes, a right bit shift of 12 on the file pointer yields the index of the page.

"The page cache can [be] done in larger chunks than one page, because it allows for more efficient throughput" (linux/pagemap.h). PAGE_CACHE_SHIFT and PAGE_CACHE_MASK are settings that control the structure and size of the page cache:

 ----------------------------------------------------------------------- mm/filemap.c 661  662   for (;;) { 663     struct page *page; 664     unsigned long end_index, nr, ret; 665     loff_t isize = i_size_read(inode); 666  667     end_index = isize >> PAGE_CACHE_SHIFT; 668  669     if (index > end_index) 670       break; 671     nr = PAGE_CACHE_SIZE; 672     if (index == end_index) { 673       nr = isize & ~PAGE_CACHE_MASK; 674       if (nr <= offset) 675         break; 676     } 677  678     cond_resched(); 679     page_cache_readahead(mapping, &ra, filp, index); 680  681     nr = nr - offset; -----------------------------------------------------------------------

Lines 662681

This section of code iterates through the page cache and retrieves enough pages to fulfill the bytes requested by the read command.

 ----------------------------------------------------------------------- mm/filemap.c 682 find_page: 683     page = find_get_page(mapping, index); 684     if (unlikely(page == NULL)) { 685       handle_ra_miss(mapping, &ra, index); 686       goto no_cached_page; 687     } 688     if (!PageUptodate(page)) 689       goto page_not_up_to_date; -----------------------------------------------------------------------

Lines 682689

We attempt to find the first page required. If the page is not in the page cache, we jump to the no_cached_page label. If the page is not up to date, we jump to the page_not_up_to_date label. find_get_page() uses the address space's radix tree to find the page at index, which is the specified offset.

 ----------------------------------------------------------------------- mm/filemap.c 690 page_ok: 691     /* If users can be writing to this page using arbitrary 692     * virtual addresses, take care about potential aliasing 693     * before reading the page on the kernel side. 694     */ 695     if (mapping_writably_mapped(mapping)) 696       flush_dcache_page(page); 697  698     /* 699     * Mark the page accessed if we read the beginning. 700     */ 701     if (!offset) 702       mark_page_accessed(page); ... 714     ret = actor(desc, page, offset, nr); 715     offset += ret; 716     index += offset >> PAGE_CACHE_SHIFT; 717     offset &= ~PAGE_CACHE_MASK; 718  719     page_cache_release(page); 720     if (ret == nr && desc->count) 721       continue; 722     break; 723  -----------------------------------------------------------------------

Lines 690723

The inline comments are descriptive so there's no point repeating them. Notice that on lines 656658, if more pages are to be retrieved, we immediately return to the top of the loop where the index and offset manipulations in lines 714716 help choose the next page to retrieve. If no more pages are to be read, we break out of the for loop.

 ----------------------------------------------------------------------- mm/filemap.c 724 page_not_up_to_date: 725     /* Get exclusive access to the page ... */ 726     lock_page(page); 727  728     /* Did it get unhashed before we got the lock? */ 729     if (!page->mapping) { 730       unlock_page(page); 731       page_cache_release(page); 732       continue; 734  735     /* Did somebody else fill it already? */ 736     if (PageUptodate(page)) { 737       unlock_page(page); 738       goto page_ok; 739     } 740 -----------------------------------------------------------------------

Lines 724740

If the page is not up to date, we check it again and return to the page_ok label if it is, now, up to date. Otherwise, we try to get exclusive access; this causes us to sleep until we get it. Once we have exclusive access, we see if the page attempts to remove itself from the page cache; if it is, we hasten it along before returning to the top of the for loop. If it is still present and is now up to date, we unlock the page and jump to the page_ok label.

 ----------------------------------------------------------------------- mm/filemap.c 741 readpage: 742  /* ... and start the actual read. The read will unlock the page. */ 743     error = mapping->a_ops->readpage(filp, page); 744  745     if (!error) { 746       if (PageUptodate(page)) 747         goto page_ok; 748       wait_on_page_locked(page); 749       if (PageUptodate(page)) 750         goto page_ok; 751       error = -EIO; 752     } 753  754    /* UHHUH! A synchronous read error occurred. Report it */ 755     desc->error = error; 756     page_cache_release(page); 757     break; 758 -----------------------------------------------------------------------

Lines 741743

If the page was not up to date, we can fall through the previous label with the page lock held. The actual read, mapping->a_ops->readpage(filp, page), unlocks the page. (We trace readpage() further in a bit, but let's first finish the current explanation.)

Lines 746750

If we read a page successfully, we check that it's up to date and jump to page_ok when it is.

Lines 751758

If a synchronous read error occurred, we log the error in desc, release the page from the page cache, and break out of the for loop.

 ----------------------------------------------------------------------- mm/filemap.c 759 no_cached_page: 760     /* 761     * Ok, it wasn't cached, so we need to create a new 762     * page.. 763     */ 764     if (!cached_page) { 765       cached_page = page_cache_alloc_cold(mapping); 766       if (!cached_page) { 767         desc->error = -ENOMEM; 768         break; 769       } 770     } 771     error = add_to_page_cache_lru(cached_page, mapping, 772             index, GFP_KERNEL); 773     if (error) { 774       if (error == -EEXIST) 775         goto find_page; 776       desc->error = error; 777       break; 778     } 779     page = cached_page; 780     cached_page = NULL; 781     goto readpage; 782   } -----------------------------------------------------------------------

Lines 698772

If the page to be read wasn't cached, we allocate a new page in the address space and add it to both the least recently used (LRU) cache and the page cache.

Lines 773775

If we have an error adding the page to the cache because it already exists, we jump to the find_page label and try again. This could occur if multiple processes attempt to read the same uncached page; one would attempt allocation and succeed, the other would attempt allocation and find it already existing.

Lines 776777

If there is an error in adding the page to the cache other than it already existing, we log the error and break out of the for loop.

Lines 779781

When we successfully allocate and add the page to the page cache and LRU cache, we set our page pointer to the new page and attempt to read it by jumping to the readpage label.

 ----------------------------------------------------------------------- mm/filemap.c 784   *_ra = ra; 785  786   *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset; 787   if (cached_page) 788     page_cache_release(cached_page); 789   file_accessed(filp); 790 } -----------------------------------------------------------------------

Line 786

We calculate the actual offset based on our page cache index and offset.

Lines 787788

If we allocated a new page and could add it correctly to the page cache, we remove it.

Line 789

We update the file's last accessed time via the inode.

The logic described in this function is the core of the page cache. Notice how the page cache does not touch any specific filesystem data. This allows the Linux kernel to have a page cache that can cache pages regardless of the underlying filesystem structure. Thus, the page cache can hold pages from MINIX, ext2, and MSDOS all at the same time.

The way the page cache maintains its specific filesystem layer agnosticism is by using the readpage() function of the address space. Each specific filesystem implements its own readpage(). So, when the generic filesystem layer calls mapping->a_ops->readpage(), it calls the specific readpage() function from the filesystem driver's address_space_operations structure. For the ext2 filesystem, readpage() is defined as follows:

 ----------------------------------------------------------------------- fs/ext2/inode.c 676 struct address_space_operations ext2_aops = { 677   .readpage    = ext2_readpage, 678   .readpages    = ext2_readpages, 679   .writepage    = ext2_writepage, 680   .sync_page    = block_sync_page, 681   .prepare_write   = ext2_prepare_write, 682   .commit_write   = generic_commit_write, 683   .bmap     = ext2_bmap, 684   .direct_IO    = ext2_direct_IO, 685   .writepages    = ext2_writepages, 686 }; -----------------------------------------------------------------------

Thus, readpage()actually calls ext2_readpage():

 ----------------------------------------------------------------------- fs/ext2/inode.c 616 static int ext2_readpage(struct file *file, struct page *page) 617 { 618   return mpage_readpage(page, ext2_get_block); 619 } -----------------------------------------------------------------------

ext2_readpage() calls mpage_readpage(),which is a generic filesystem layer call, but passes it the specific filesystem layer function ext2_get_block().

The generic filesystem function mpage_readpage() expects a get_block() function as its second argument. Each filesystem implements certain I/O functions that are specific to the format of the filesystem; get_block() is one of these. Filesystem get_block() functions map logical blocks in the address_space pages to actual device blocks in the specific filesystem layout. Let's look at the specifics of mpage_readpage():

 ----------------------------------------------------------------------- fs/mpage.c 358 int mpage_readpage(struct page *page, get_block_t get_block) 359 { 360   struct bio *bio = NULL; 361   sector_t last_block_in_bio = 0; 362  363   bio = do_mpage_readpage(bio, page, 1, 364       &last_block_in_bio, get_block); 365   if (bio) 366     mpage_bio_submit(READ, bio); 367   return 0; 368 } -----------------------------------------------------------------------

Lines 360361

We allocate space for managing the bio structure the address space uses to manage the page we are trying to read from the device.

Lines 363364

do_mpage_readpage() is called, which translates the logical page to a bio structure composed of actual pages and blocks. The bio structure keeps track of information associated with block I/O.

Lines 365367

We send the newly created bio structure to mpage_bio_submit() and return.

Let's take a moment and recap (at a high level) the flow of the read function so far:

Using the file descriptor from a call to read(), we locate the file pointer from which we obtain an inode.
The filesystem layer checks the in-memory page cache for a page, or pages, that correspond to the given inode.
If no page is found, the filesystem layer uses the specific filesystem driver to translate the requested sections of the file to I/O blocks on a given device.
We allocate space for pages in the page cache address_space and create a bio structure that ties the new pages with the sectors on the block device.

mpage_readpage() is the function that creates the bio structure and ties together the newly allocated pages from the page cache to the bio structure. However, no data exists in the pages yet. For that, the filesystem layer needs the block device driver to do the actual interfacing to the device. This is done by the submit_bio() function in mpage_bio_submit():

 ----------------------------------------------------------------------- fs/mpage.c 90 struct bio *mpage_bio_submit(int rw, struct bio *bio) 91 {   92   bio->bi_end_io = mpage_end_io_read; 93   if (rw == WRITE) 94     bio->bi_end_io = mpage_end_io_write; 95   submit_bio(rw, bio); 96   return NULL; 97 } -----------------------------------------------------------------------

Line 90

The first thing to notice is that mpage_bio_submit() works for both read and write calls via the rw parameter. It submits a bio structure that, in the read case, is empty and needs to be filled in. In the write case, the bio structure is filled and the block device driver copies the contents to its device.

Lines 9294

If we are reading or writing, we set the appropriate function that will be called when I/O ends.

Lines 9596

We call submit_bio() and return NULL. Recall that mpage_readpage() doesn't do anything with the return value of mpage_bio_submit().

submit_bio() is part of the generic block device driver layer of the Linux kernel.

 ----------------------------------------------------------------------- drivers/block/ll_rw_blk.c 2433 void submit_bio(int rw, struct bio *bio) 2434 {     2435   int count = bio_sectors(bio); 2436  2437   BIO_BUG_ON(!bio->bi_size); 2438   BIO_BUG_ON(!bio->bi_io_vec); 2439   bio->bi_rw = rw; 2440   if (rw & WRITE) 2441     mod_page_state(pgpgout, count); 2442   else 2443     mod_page_state(pgpgin, count); 2444  2445   if (unlikely(block_dump)) { 2446     char b[BDEVNAME_SIZE]; 2447     printk(KERN_DEBUG "%s(%d): %s block %Lu on %s\n", 2448       current->comm, current->pid, 2449       (rw & WRITE) ? "WRITE" : "READ", 2450       (unsigned long long)bio->bi_sector, 2451       bdevname(bio->bi_bdev,b)); 2452   } 2453  2454   generic_make_request(bio); 2455 } -----------------------------------------------------------------------

Lines 24332443

These calls enable some debugging: Set the read/write attribute of the bio structure, and perform some page state housekeeping.

Lines 24452452

These lines handle the rare case that a block dump occurs. A debug message is thrown.

Line 2454

generic_make_request() contains the main functionality and uses the specific block device driver's request queue to handle the block I/O operation.

Part of the inline comments for generic_make_request() are enlightening:

 ----------------------------------------------------------------------- drivers/block/ll_rw_blk.c 2336 * The caller of generic_make_request must make sure that bi_io_vec 2337 * are set to describe the memory buffer, and that bi_dev and bi_sector   are 2338 * set to describe the device address, and the 2339 * bi_end_io and optionally bi_private are set to describe how 2340 * completion notification should be signaled. -----------------------------------------------------------------------

In these stages, we constructed the bio structure, and thus, the bio_vec structures are mapped to the memory buffer mentioned on line 2337, and the bio struct is initialized with the device address parameters as well. If you want to follow the read even further into the block device driver, refer to the "Block Device Overview"section in Chapter 5, which describes how the block device driver handles request queues and the specific hardware constraints of its device. Figure 6.18 illustrates how the read() system call traverses through the layers of kernel functionality.

Figure 6.18. read() Top-Down Traversal

After the block device driver reads the actual data and places it in the bio structure, the code we have traced unwinds. The newly allocated pages in the page cache are filled, and their references are passed back to the VFS layer and copied to the section of user space specified so long ago by the original read() call.

However, we hear you ask, "Isn't this only half of the story? What if we wanted to write instead of read?"

We hope that these descriptions made it somewhat clear that the path a read() call takes through the Linux kernel is similar to the path a write() call takes. However, we now outline some differences.

6.5.4. write()

A write() call gets mapped to sys_write() and then to vfs_write() in the same manner as a read() call:

 ----------------------------------------------------------------------- fs/read_write.c 244 ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos) 245 { ... 259       ret = file->f_op->write(file, buf, count, pos); ... 268 } -----------------------------------------------------------------------

vfs_write() uses the generic file_operations write function to determine what specific filesystem layer write to use. This is translated, in our example ext2 case, via the ext2_file_operations structure:

 ----------------------------------------------------------------------- fs/ext2/file.c 42 struct file_operations ext2_file_operations = { 43   .llseek   = generic_file_llseek, 44   .read   = generic_file_read, 45   .write   = generic_file_write, ... 56 }; -----------------------------------------------------------------------

Lines 4445

Instead of calling generic_file_read(), we call generic_file_write().

generic_file_write() obtains a lock on the file to prevent two writers from simultaneously writing to the same file, and calls generic_file_write_nolock(). generic_file_write_nolock() converts the file pointers and buffers to the kiocb and iovec parameters and calls the page cache write function generic_file_aio_write_nolock().

Here is where a write diverges from a read. If the page to be written isn't in the page cache, the write does not fall through to the device itself. Instead, it reads the page into the page cache and then performs the write. Pages in the page cache are not immediately written to disk; instead, they are marked as "dirty" and, periodically, all dirty pages are written to disk.

There are analogous functions to the read() functions' readpage(). Within generic_file_aio_write_nolock(), the address_space_operations pointer accesses prepare_write() and commit_write(), which are both specific to the filesystem type the file resides upon. Recall ext2_aops, and we see that the ext2 driver uses its own function, ext2_prepare_write(), and a generic function generic_commit_write().

 ----------------------------------------------------------------------- fs/ext2/inode.c 628 static int 629 ext2_prepare_write(struct file *file, struct page *page, 630       unsigned from, unsigned to) 631 { 632   return block_prepare_write(page,from,to,ext2_get_block); 633 } -----------------------------------------------------------------------

Line 632

ext2_prepare_write is simply a wrapper for the generic filesystem function block_prepare_write(), which passes in the ext2 filesystem-specific get_block() function.

block_prepare_write() allocates any new buffers that are required for the write. For example, if data is being appended to a file enough buffers are created, and linked with pages, to store the new data.

generic_commit_write() takes the given page and iterates over the buffers within it, marking each dirty. The prepare and the commit sections of a write are separated to prevent a partial write being flushed from the page cache to the block device.

6.5.4.1. Flushing Dirty Pages

The write() call returns after it has insertedand marked dirtyall the pages it has written to. Linux has a daemon, pdflush, which writes the dirty pages from the page cache to the block device in two cases:

The system's free memory falls below a threshold. Pages from the page cache are flushed to free up memory.
Dirty pages reach a certain age. Pages that haven't been written to disk after a certain amount of time are written to their block device.

The pdflush daemon calls the filesystem-specific function writepages() when it is ready to write pages to disk. So, for our example, recall the ext2_file_operation structure, which equates writepages() with ext2_writepages().^[14]

^[14] The pdflush daemon is fairly involved, and for our purposes of tracing a write, we can ignore the complexity. However, if you are interested in the details, mm/pdflush.c, mm/fs-writeback.c, and mm/page-writeback.c contain the relevant code.

 ----------------------------------------------------------------------- 670 static int 671 ext2_writepages(struct address_space *mapping, struct writeback_control *wbc) 672 { 673   return mpage_writepages(mapping, wbc, ext2_get_block); 674 } -----------------------------------------------------------------------

Like other specific implementations of generic filesystem functions, ext2_ writepages() simply calls the generic filesystem function mpage_writepages() with the filesystem-specific ext2_get_block() function.

mpage_writepages() loops over the dirty pages and calls mpage_writepage() on each dirty page. Similar to mpage_readpage(), mpage_writepage() returns a bio structure that maps the physical device layout of the page to its physical memory layout. mpage_writepages() then calls submit_bio() to send the new bio structure to the block device driver to transfer the data to the device itself.