Section 6.4. Page Cache | The Linux Kernel Primer. A Top-Down Approach for x86 and PowerPC Architectures

6.4. Page Cache

In the introductory sections, we mentioned that the page cache is an in-memory collection of pages. Data that is accessed frequently must be quick to reach. When data is duplicated and kept synchronized across two devices, one of which is typically smaller but much faster than the other, we call the smaller device a cache. The page cache is how an operating system keeps parts of the hard drive in memory for faster access. We now look at how it works and how it is implemented.

When you write to a file on your hard drive, the file is handled in chunks called pages, which are swapped into memory (RAM). The operating system updates the page in memory and, at a later time, the page is written back to disk.

Once a page has been copied from the hard drive to RAM (which is called swapping into memory), it is either clean or dirty. A dirty page has been modified in memory, but the modifications have not yet been written to disk. A clean page exists in memory in the same state that it exists on disk.
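The clean and dirty states can be sketched in user space. The following is a minimal model, not kernel code: struct cached_page, CP_PAGE_SIZE, and the three helper functions are hypothetical names invented for this illustration, and the "disk" is simply a byte array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CP_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/* Hypothetical stand-in for a cached page; not the kernel's struct page. */
struct cached_page {
    unsigned char data[CP_PAGE_SIZE]; /* in-memory copy of the on-disk data */
    bool dirty;                       /* true if memory differs from disk */
};

/* Copy a page from "disk" into memory; the page starts out clean. */
static void page_read_in(struct cached_page *pg, const unsigned char *disk)
{
    memcpy(pg->data, disk, CP_PAGE_SIZE);
    pg->dirty = false;
}

/* Modify only the in-memory copy; the page becomes dirty. */
static void page_write(struct cached_page *pg, size_t off,
                       const void *src, size_t len)
{
    assert(off <= CP_PAGE_SIZE && len <= CP_PAGE_SIZE - off);
    memcpy(pg->data + off, src, len);
    pg->dirty = true;
}

/* Write a dirty page back to "disk"; returns true if a write was needed. */
static bool page_write_back(struct cached_page *pg, unsigned char *disk)
{
    if (!pg->dirty)
        return false;           /* clean page: disk is already up to date */
    memcpy(disk, pg->data, CP_PAGE_SIZE);
    pg->dirty = false;          /* memory and disk agree again */
    return true;
}
```

After page_write(), the disk array still holds the old contents; only page_write_back() brings the two copies back into agreement, which is the delay described above.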

In Linux, the memory is divided into zones.^[8] Each zone has a list of active and inactive pages. When a page is inactive for a certain amount of time, it gets swapped out (written back to disk) to free memory. Each page in a zone's list has a pointer to an address_space. Each address_space has a pointer to an address_space_operations structure. Pages are marked dirty by calling the set_page_dirty() function of the address_space_operations structure. Figure 6.12 illustrates this dependency.
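The pointer chain from a page to its operations table can be modeled in a few lines of user-space C. This is only a sketch: the three structures are reduced to the fields needed here, and the dirty flag, the nr_dirty counter, sketch_set_page_dirty(), sketch_aops, and mark_page_dirty() are hypothetical names that do not come from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

struct page;  /* forward declaration */

/* Reduced model of address_space_operations: a single method. */
struct address_space_operations {
    int (*set_page_dirty)(struct page *page);
};

/* Reduced model of address_space. */
struct address_space {
    const struct address_space_operations *a_ops; /* methods */
    unsigned long nr_dirty;  /* hypothetical counter, for illustration only */
};

/* Reduced model of struct page. */
struct page {
    struct address_space *mapping; /* the address_space this page belongs to */
    int dirty;                     /* hypothetical flag for this sketch */
};

/* Sample method: mark the page dirty; returns 1 if it was newly dirtied. */
static int sketch_set_page_dirty(struct page *page)
{
    if (page->dirty)
        return 0;
    page->dirty = 1;
    page->mapping->nr_dirty++;
    return 1;
}

static const struct address_space_operations sketch_aops = {
    .set_page_dirty = sketch_set_page_dirty,
};

/* Follow page -> address_space -> operations table and call the method. */
static int mark_page_dirty(struct page *page)
{
    assert(page && page->mapping && page->mapping->a_ops);
    return page->mapping->a_ops->set_page_dirty(page);
}
```

The point of the indirection is that the caller of mark_page_dirty() never needs to know which file system owns the page; the address_space supplies the right function.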

^[8] See Chapter 4 for more on memory zones.

Figure 6.12. Page Cache and Zones

6.4.1. address_space Structure

The core of the page cache is the address_space object. Let's take a close look at it.

-----------------------------------------------------------------------
include/linux/fs.h
struct address_space {
  struct inode   *host;  /* owner: inode, block_device */
  struct radix_tree_root page_tree; /* radix tree of all pages */
  spinlock_t    tree_lock; /* and spinlock protecting it */
  unsigned long   nrpages;  /* number of total pages */
  pgoff_t     writeback_index;/* writeback starts here */
  struct address_space_operations *a_ops; /* methods */
  struct prio_tree_root i_mmap;  /* tree of private mappings */
  unsigned int   i_mmap_writable;/* count VM_SHARED mappings */
  struct list_head  i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
  spinlock_t    i_mmap_lock; /* protect tree, count, list */
  atomic_t    truncate_count; /* Cover race condition with truncate */
  unsigned long   flags;   /* error bits/gfp mask */
  struct backing_dev_info *backing_dev_info; /* device readahead, etc */
  spinlock_t    private_lock; /* for use by the address_space */
  struct list_head  private_list; /* ditto */
  struct address_space *assoc_mapping; /* ditto */
};
-----------------------------------------------------------------------

The inline comments of the structure are fairly descriptive. Some additional explanation might help in understanding how the page cache operates.

Usually, an address_space is associated with an inode and the host field points to this inode. However, the page cache and the address_space structure are generic and do not require this field to be set. It could be NULL if the address_space is associated with a kernel object that is not an inode.

The address_space structure has a field that should be intuitively familiar to you by now: address_space_operations. Like the file structure's file_operations, address_space_operations contains information about what operations are valid for this address_space.

-----------------------------------------------------------------------
include/linux/fs.h
struct address_space_operations {
  int (*writepage)(struct page *page, struct writeback_control *wbc);
  int (*readpage)(struct file *, struct page *);
  int (*sync_page)(struct page *);

  /* Write back some dirty pages from this mapping. */
  int (*writepages)(struct address_space *, struct writeback_control *);

  /* Set a page dirty */
  int (*set_page_dirty)(struct page *page);

  int (*readpages)(struct file *filp, struct address_space *mapping,
      struct list_head *pages, unsigned nr_pages);

  /*
   * ext3 requires that a successful prepare_write() call be followed
   * by a commit_write() call - they must be balanced
   */
  int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
  int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
  /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
  sector_t (*bmap)(struct address_space *, sector_t);
  int (*invalidatepage) (struct page *, unsigned long);
  int (*releasepage) (struct page *, int);
  ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
      loff_t offset, unsigned long nr_segs);
};
-----------------------------------------------------------------------

These functions are reasonably straightforward. readpage() and writepage() read and write pages associated with an address space, respectively. Multiple pages can be written and read via readpages() and writepages(). Journaling file systems, such as ext3, can provide functions for prepare_write() and commit_write().

When the kernel checks the page cache for a page, the lookup must be fast. For this reason, each address space has a radix tree (the page_tree field), which allows a quick search to determine whether the page is in the page cache.
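To show why such a lookup is quick, here is a minimal user-space sketch of a radix tree keyed by page index. It is not the kernel's implementation: the fixed height, the 6 bits consumed per level, and all rt_ names are assumptions made for this illustration, and the nodes are never freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define RT_SHIFT  6                    /* bits consumed per level (assumed) */
#define RT_SLOTS  (1UL << RT_SHIFT)    /* 64 slots per node */
#define RT_MASK   (RT_SLOTS - 1)
#define RT_HEIGHT 3                    /* fixed height: indexes below 2^18 */

struct rt_node {
    void *slots[RT_SLOTS];  /* child nodes, or items at the lowest level */
};

/* Slot used by index at a given level (level 0 is the lowest level). */
static unsigned long rt_slot(unsigned long index, int level)
{
    return (index >> (level * RT_SHIFT)) & RT_MASK;
}

/* Store item at index; returns 0 on success, -1 on error. */
static int rt_insert(struct rt_node *root, unsigned long index, void *item)
{
    struct rt_node *node = root;
    int level;

    if (index >> (RT_HEIGHT * RT_SHIFT))
        return -1;                     /* index too large for this sketch */
    for (level = RT_HEIGHT - 1; level > 0; level--) {
        void **slot = &node->slots[rt_slot(index, level)];
        if (!*slot) {
            *slot = calloc(1, sizeof(struct rt_node));
            if (!*slot)
                return -1;             /* out of memory */
        }
        node = *slot;
    }
    node->slots[rt_slot(index, 0)] = item;
    return 0;
}

/* Return the item stored at index, or NULL if nothing is cached there. */
static void *rt_lookup(const struct rt_node *root, unsigned long index)
{
    const struct rt_node *node = root;
    int level;

    if (index >> (RT_HEIGHT * RT_SHIFT))
        return NULL;
    for (level = RT_HEIGHT - 1; level > 0; level--) {
        node = node->slots[rt_slot(index, level)];
        if (!node)
            return NULL;               /* missing subtree: page not cached */
    }
    return node->slots[rt_slot(index, 0)];
}
```

A lookup visits at most RT_HEIGHT nodes no matter how many pages are stored, and a missing page is often detected after the first or second step.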

Figure 6.13 illustrates how files, inodes, address spaces, and pages relate to each other; this figure is useful for the upcoming analysis of the page cache code.

Figure 6.13. Files, Inodes, Address Spaces, and Pages

6.4.2. buffer_head Structure

Each sector on a block device is represented by the Linux kernel as a buffer_head structure. A buffer_head contains all the information necessary to map a physical sector to a buffer in physical memory. The buffer_head structure is illustrated in Figure 6.14.

-----------------------------------------------------------------------
include/linux/buffer_head.h
struct buffer_head {
  /* First cache line: */
  unsigned long b_state;  /* buffer state bitmap (see above) */
  atomic_t b_count;    /* users using this block */
  struct buffer_head *b_this_page;/* circular list of page's buffers */
  struct page *b_page;   /* the page this bh is mapped to */

  sector_t b_blocknr;    /* block number */
  u32 b_size;      /* block size */
  char *b_data;     /* pointer to data block */

  struct block_device *b_bdev;
  bh_end_io_t *b_end_io;   /* I/O completion */
  void *b_private;    /* reserved for b_end_io */
  struct list_head b_assoc_buffers; /* associated with another mapping */
};
-----------------------------------------------------------------------

Figure 6.14. buffer_head Structure

The physical sector that a buffer_head structure refers to is logical block b_blocknr on device b_bdev.

The physical memory that a buffer_head structure refers to is a block of b_size bytes starting at b_data. This memory block lies within the physical page b_page.
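The two halves of this mapping can be illustrated with a reduced user-space model. The sketch keeps only four of the fields shown in the listing; struct sketch_bh, struct bh_page, BH_PAGE_SIZE, and the helper functions are hypothetical, and the device offset calculation assumes blocks of b_size bytes numbered from 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BH_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/* Hypothetical stand-in for a physical page. */
struct bh_page {
    char mem[BH_PAGE_SIZE];
};

/* Reduced model of buffer_head: only the mapping fields. */
struct sketch_bh {
    uint64_t b_blocknr;      /* logical block number on the device */
    uint32_t b_size;         /* block size in bytes */
    char *b_data;            /* start of the block's memory */
    struct bh_page *b_page;  /* page that contains b_data */
};

/* Point bh at the n-th block-sized slice of pg, backed by block blocknr. */
static void bh_map(struct sketch_bh *bh, struct bh_page *pg, unsigned int n,
                   uint32_t size, uint64_t blocknr)
{
    assert(size > 0 && ((uint64_t)n + 1) * size <= BH_PAGE_SIZE);
    bh->b_blocknr = blocknr;
    bh->b_size = size;
    bh->b_page = pg;
    bh->b_data = pg->mem + (size_t)n * size;
}

/* Byte offset of the block on the device (blocks numbered from 0). */
static uint64_t bh_device_offset(const struct sketch_bh *bh)
{
    return bh->b_blocknr * (uint64_t)bh->b_size;
}

/* Byte offset of the buffer within its page. */
static size_t bh_page_offset(const struct sketch_bh *bh)
{
    return (size_t)(bh->b_data - bh->b_page->mem);
}
```

For example, mapping the third 1,024-byte slice of a page to block 100 gives a page offset of 2,048 bytes and, under the assumption above, a device offset of 102,400 bytes.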

The other definitions within the buffer_head structure are used for managing housekeeping tasks for how the physical sector is mapped to the physical memory. (Because this is a digression on bio structures and not buffer_head structures, refer to mpage.c for more detailed information on struct buffer_head.)

As mentioned in Chapter 4, each physical memory page in the Linux kernel is represented by a struct page. Because an I/O block can be smaller than a page but never larger, a page is composed of one or more I/O blocks.

In older versions of Linux, block I/O was done only through buffers. Linux 2.6 introduced a new way, using bio structures, which allows the kernel to group block I/O together in a more manageable way.

Suppose we write a portion of the top of a text file and the bottom of a text file. This update would likely need two buffer_head structures for the data transfer: one that points to the top and one that points to the bottom. A bio structure allows file operations to bundle discrete chunks together in a single structure. It does so by describing buffers and pages in terms of the contiguous memory segments of a buffer. The bio_vec structure represents a contiguous memory segment in a buffer. The bio_vec structure is illustrated in Figure 6.15.

-----------------------------------------------------------------------
include/linux/bio.h
struct bio_vec {
  struct page  *bv_page;
  unsigned int bv_len;
  unsigned int bv_offset;
};
-----------------------------------------------------------------------

Figure 6.15. Bio Structure

The bio_vec structure holds a pointer to a page, the length of the segment, and the offset of the segment within the page.

A bio structure is composed of an array of bio_vec structures (along with other housekeeping fields). Thus, a bio structure represents a number of contiguous memory segments of one or more buffers on one or more pages.^[9]
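The earlier example of writing the top and the bottom of a file can be sketched with an array of bio_vec entries. This is a user-space illustration only: struct sketch_bio, its vecs and nr_vecs fields, struct bio_page, and the helper functions are hypothetical and do not reproduce the kernel's struct bio.

```c
#include <assert.h>
#include <stddef.h>

#define BIO_PAGE_SIZE 4096u  /* assumed page size for this sketch */
#define BIO_MAX_VECS  4      /* arbitrary capacity for this sketch */

/* Hypothetical stand-in for a physical page. */
struct bio_page {
    char mem[BIO_PAGE_SIZE];
};

/* Same three fields as the listing, with the page type swapped for the stand-in. */
struct bio_vec {
    struct bio_page *bv_page;  /* page that holds the segment */
    unsigned int bv_len;       /* length of the segment in bytes */
    unsigned int bv_offset;    /* offset of the segment within the page */
};

/* Hypothetical reduced bio: an array of segments plus a count. */
struct sketch_bio {
    struct bio_vec vecs[BIO_MAX_VECS];
    unsigned int nr_vecs;
};

/* Append one contiguous segment; returns 0 on success, -1 on error. */
static int sketch_bio_add(struct sketch_bio *bio, struct bio_page *pg,
                          unsigned int len, unsigned int offset)
{
    if (bio->nr_vecs == BIO_MAX_VECS)
        return -1;                         /* no room left */
    if (offset > BIO_PAGE_SIZE || len > BIO_PAGE_SIZE - offset)
        return -1;                         /* segment must fit in the page */
    bio->vecs[bio->nr_vecs].bv_page = pg;
    bio->vecs[bio->nr_vecs].bv_len = len;
    bio->vecs[bio->nr_vecs].bv_offset = offset;
    bio->nr_vecs++;
    return 0;
}

/* Total number of bytes described by all segments. */
static unsigned int sketch_bio_bytes(const struct sketch_bio *bio)
{
    unsigned int i, total = 0;

    for (i = 0; i < bio->nr_vecs; i++)
        total += bio->vecs[i].bv_len;
    return total;
}
```

Adding one segment for the page at the top of the file and one for the page at the bottom yields a single sketch_bio that describes both transfers, where the buffer-based approach would use two separate buffer_head structures.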

^[9] See include/linux/bio.h for detailed information on struct bio.