The Kernel View of File Systems


To isolate the rest of the kernel from the type-specific structures used by various file system types, a Virtual File System (VFS) abstraction layer is employed. Figure 8-11 gives us an overview of the VFS components.

Figure 8-11. Kernel File System Tables



The VFS abstraction layer was first introduced to UNIX by Sun Microsystems engineers while they were developing the Network File System (NFS). They wanted local users and applications to be oblivious to the actual location of the files they were using. Modern UNIX operating systems frequently build on this approach to mask the specifics of the local and remote file system types mounted within their kernels.

The HP-UX kernel uses a number of VFS structures, virtual inodes (vnode), caches, and hash headers to facilitate file access. To get started, we examine two key components of the virtual file system: the vfs mount table and the vnode.

The vfs Mount Table

The starting point for the HP-UX file management subsystem is the pointer rootvfs. This kernel symbol directs us to the "/" (or root) file system's vfs data structure (see Figure 8-12).

Figure 8-12. Mounting a File System



Each new file system mount creates an additional entry in the linked list of vfs structures. Once a mount has been made, its vfs entry is maintained until the kernel is rebooted, even if the file system is unmounted (if it is later remounted, the original vfs structure is reused). Mount points configured by the automounter also add vfs structures to the list.

There are several notable components within the vfs structure. Among them are a pointer to the next vfs, a type value, a pointer to an operations array (a vector jump table) that maps the type-specific kernel routines needed to manipulate the file system, and a pointer to its kernel-resident data structures. The kernel-resident data structures vary greatly in content between the various file system implementations. The type-specific routines mapped by the operations array know what to expect and how to work with the specifics of their design.
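
The following fragment sketches the idea of an operations array in C. The structure layout and function names here are simplified assumptions for illustration, not the actual HP-UX declarations (those appear in Listings 8.3 and 8.4).

/*
 * Illustrative sketch only: how an "operations array" (vector jump table)
 * dispatches to file system type-specific routines.
 */
struct vfs;                                   /* opaque for this sketch */

struct vfsops {
    int (*vfs_mount)(struct vfs *vfsp, const char *path, void *data);
    int (*vfs_unmount)(struct vfs *vfsp);
    int (*vfs_sync)(struct vfs *vfsp);
    /* ... vfs_root, vfs_statvfs, vfs_vget, and so on ... */
};

/* Each file system type registers its own table of routines
 * (the hfs_* names below are hypothetical). */
extern int hfs_mount(struct vfs *, const char *, void *);
extern int hfs_unmount(struct vfs *);
extern int hfs_sync(struct vfs *);

static const struct vfsops hfs_vfsops = {
    hfs_mount, hfs_unmount, hfs_sync,
};

/* Type-independent code never needs to know which file system it is
 * talking to; it simply calls through the pointer stored in the vfs. */
static int vfs_sync_one(struct vfs *vfsp, const struct vfsops *ops)
{
    return ops->vfs_sync(vfsp);
}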

Listings 8.3 and 8.4 are annotated listings of the vfs structure and the vfsops vector array it points to.

Listing 8.3. q4> fields struct vfs
A forward-pointing link to the next mounted file system (in chronological order)
    0 0    4 0 *          vfs_next
A hash chain for rapid lookup (used to see if a vfs structure already exists before we create a new one)
    4 0    4 0 *          vfs_hash
A pointer to the file system type-specific operations array
    8 0    4 0 *          vfs_op
A pointer to the in-core copy of the vnode this file system is mounted on
   12 0    4 0 *          vfs_vnodecovered
The vfs state flags
   16 0    4 0 int        vfs_flag

0x01   VFS_RDONLY      read-only
0x02   VFS_MLOCK       lock this vfs (keep subtree stable)
0x04   VFS_MWAIT       there is a waiter for the lock
0x08   VFS_NOSUID      disable suid for file system
0x10   VFS_EXPORTED    this file system is exported (NFS)
0x20   VFS_REMOTE      this is an imported file system (NFS)
0x40   VFS_HASQUOTA    this file system has quotas enabled
0x200  VFS_MOUNTING    used in conjunction with VFS_MLOCK
0x800  VFS_LARGEFILES  enable use of large files


This file system's native block size
   20 0    4 0 int        vfs_bsize
For exported file systems (NFS) this is the UID remote "root" users are remapped to and the export state flag
   24 0    2 0 u_short    vfs_exroot
   26 0    2 0 short      vfs_exflags
This points to the file system's type-specific in-core data structures
   28 0    4 0 *          vfs_data
A reference count of processes currently sleeping on the file system's mount point
   32 0    4 0 int        vfs_icount
The file system's specific type (HFS, VxFS, NFS, CDFS, ...) and its unique id
   36 0    2 0 short      vfs_mtype
   40 0    4 0 int        vfs_fsid[0]
   44 0    4 0 int        vfs_fsid[1]
A log pointer, the most recent mount timestamp, file system name, device number (major + minor device number), and a chain pointer to DNLC entries referencing this file system
   48 0    4 0 *          vfs_logp
   52 0    4 0 int        vfs_mnttime
   56 0 1024 0 char[1024] vfs_name
 1080 0    4 0 int        vfs_dev
 1084 0    4 0 *          vfs_ncachehd

Listing 8.4. q4> fields struct vfsops
The following pointers form a vector jump table to type-specific operational routines in the kernel
  0 0 4 0 * vfs_mount
  4 0 4 0 * vfs_unmount
  8 0 4 0 * vfs_root
 12 0 4 0 * vfs_statvfs
 16 0 4 0 * vfs_sync
 20 0 4 0 * vfs_vget
 24 0 4 0 * vfs_getmount
 28 0 4 0 * vfs_freeze
 32 0 4 0 * vfs_thaw
 36 0 4 0 * vfs_quota
 40 0 4 0 * vfs_mountroot
 44 0 4 0 * vfs_size

The Type-Specific In-Core Data Structures

Each file system type requires its own specific in-core data. Figure 8-13 illustrates the point.

Figure 8-13. Type-Specific Data Structures (HFS Example)



In general, the in-core data holds a copy of the file system's superblock, a pointer to the kernel's copy of the inode/vnode on which the file system is mounted, and copies of the file system's allocation tables.

In the case of HFS, the kernel-resident structures include a mount structure (with pointers to the mount point's inode/vnode), a pointer back to the file system's vfs structure, and a pointer to a copy of the superblock maintained in the buffer cache (more on the buffer cache later in this chapter). Let's examine the HFS mount and fs (superblock) structures in Listings 8.5 and 8.6.

Listing 8.5. q4> fields struct mount
First are forward and backward linkage pointers to a hash list
   0 0 4 0 *       m_hforw
   4 0 4 0 *       m_hback
Next is a pointer to the file system's vfs structure, the block device number (major and minor number) on which the file system resides, and a pointer to the buffer header which maps the in-core copy of the file system's superblock
   8 0 4 0 *       m_vfsp
  12 0 4 0 int     m_dev
  16 0 4 0 *       m_bufp
  20 0 4 0 int     m_maxbufs
The raw device number is also recorded here
  24 0 4 0 int     m_rdev
  28 0 4 0 *       m_nodes.m_inodp
  28 0 4 0 *       m_nodes.m_cdnodp
  32 0 4 0 *       m_qinod
Mount flags are stored here
  36 0 4 0 int     m_flag
  40 0 2 0 u_short m_qflags
  44 0 4 0 u_int   m_btimelimit
  48 0 4 0 u_int   m_ftimelimit
A pointer to the file system's root inode/vnode in-core, a reference count, and a pointer to the mount point's in-core inode/vnode
  52 0 4 0 *       m_rootvp
  56 0 4 0 u_int   m_ref
  60 0 4 0 *       m_iused

Listing 8.6. q4> fields struct fs
   0 0   4 0 int       fs_unused[0]
   4 0   4 0 int       fs_unused[1]
First we have the file system's superblock address, the offset of the cylinder group data block, the inode blocks, and the first data block following the cylinder groups
   8 0   4 0 int       fs_sblkno
  12 0   4 0 int       fs_cblkno
  16 0   4 0 int       fs_iblkno
  20 0   4 0 int       fs_dblkno
  24 0   4 0 int       fs_cgoffset
  28 0   4 0 int       fs_cgmask
Next is the last modification timestamp, the number of blocks (total space), and the number of data blocks (minus the metadata overhead) in the file system
  32 0   4 0 int       fs_time
  36 0   4 0 int       fs_size
  40 0   4 0 int       fs_dsize
Additional static and tunable parameters include the number of cylinders per group, block and fragment sizes, number of fragments in a block (1, 2, 4, 8), minimum reserved free space (applies only to non-superuser allocation requests), rotational delay for writing sequential blocks (also known as interleave), and drive speed (in revolutions per second)
  44 0   4 0 int       fs_ncg
  48 0   4 0 int       fs_bsize
  52 0   4 0 int       fs_fsize
  56 0   4 0 int       fs_frag
  60 0   4 0 int       fs_minfree
  64 0   4 0 int       fs_rotdelay
  68 0   4 0 int       fs_rps
  72 0   4 0 int       fs_bmask
  76 0   4 0 int       fs_fmask
  80 0   4 0 int       fs_bshift
  84 0   4 0 int       fs_fshift
Maximum contiguous blocks and maximum number of blocks allocated to a single file within a cylinder group
  88 0   4 0 int       fs_maxcontig
  92 0   4 0 int       fs_maxbpg
  96 0   4 0 int       fs_fragshift
 100 0   4 0 int       fs_fsbtodb
 104 0   4 0 int       fs_sbsize
 108 0   4 0 int       fs_csmask
 112 0   4 0 int       fs_csshift
 116 0   4 0 int       fs_nindir
 120 0   4 0 int       fs_inopb
 124 0   4 0 int       fs_nspf
This file system's unique identifier number
 128 0   4 0 int       fs_id[0]
 132 0   4 0 int       fs_id[1]
Mirror state information for primary swap and "/" partitions
 136 0   0 4 u_int     fs_mirror.state.root
 136 4   0 1 u_int     fs_mirror.state.rflag
 136 5   0 4 u_int     fs_mirror.state.swap
 137 1   0 1 u_int     fs_mirror.state.sflag
 137 2   2 6 u_int     fs_mirror.state.spare
 140 0   4 0 int       fs_mirror.mirtime
 144 0   4 0 int       fs_featurebits
 148 0   4 0 int       fs_optim
Cylinder group summary array address, summary array size, cylinder group size
 152 0   4 0 int       fs_csaddr
 156 0   4 0 int       fs_cssize
 160 0   4 0 int       fs_cgsize
Number of tracks per cylinder group, sectors per track, sectors per cylinder, total number of cylinders in the file system, cylinders per group, and inodes per group
 164 0   4 0 int       fs_ntrak
 168 0   4 0 int       fs_nsect
 172 0   4 0 int       fs_spc
 176 0   4 0 int       fs_ncyl
 180 0   4 0 int       fs_cpg
 184 0   4 0 int       fs_ipg
 188 0   4 0 int       fs_fpg
Calculated cylinder group totals (this must be audited following a power failure)
 192 0   4 0 int       fs_cstotal.cs_ndir
 196 0   4 0 int       fs_cstotal.cs_nbfree
 200 0   4 0 int       fs_cstotal.cs_nifree
 204 0   4 0 int       fs_cstotal.cs_nffree
Next are the file system modification flag, the clean/dirty flag, the read-only flag, the spare flag (currently not used), and a string holding the current mount point of the file system
 208 0   1 0 char      fs_fmod
 209 0   1 0 char      fs_clean
 210 0   1 0 char      fs_ronly
 211 0   1 0 char      fs_flags
 212 0 512 0 char[512] fs_fsmnt
 724 0   4 0 int       fs_cgrotor
 728 0   4 0 *         fs_csp
 732 0   4 0 int       fs_csp_pad
 736 0   4 0 int       fs_unused2[30]
 856 0   4 0 int       fs_cpc
 860 0   2 0 short     fs_postbl[32][8]
This file system's "magic number" (used to verify its type), its type name and pack name
1372 0   4 0 int       fs_magic
1376 0   6 0 char[6]   fs_fname
1382 0   6 0 char[6]   fs_fpack
1388 0   1 0 u_char    fs_rotbl[0]

The vnode

Next, we address the specifics of the vnode structure. We have learned that an inode contains all of a file's credentials and attributes, and is therefore the key to managing a file within a file system. The inode number is unique within the context of a single file system, and it is all that is needed to define a file. Within a kernel, there may be several file systems mounted together to create what appears to be a seamless, larger file system. Within this larger context, inode numbers are no longer guaranteed to be unique. Each mounted file system may have an inode 1, 2, and so on. To enable the kernel to refer to opened files with a unique handle, the virtual inode or vnode was created.

In a similar manner to the vfs structure, a vnode contains an operational array pointer to the kernel's type-specific routines for file manipulation tasks (see Figure 8-14).

Figure 8-14. The vnode



The key fields of the vnode include a type number, a pointer to a type-specific operations array, and a reference to the in-core copy of the file's inode data. Listing 8.7 is an annotated listing of a virtual file system vnode:

Listing 8.7. q4> fields struct vnode
  0 0 2 0 u_short v_flag 

0x01     VROOT       root of a file system
0x02     VTEXT       vnode is a text prototype
0x04     VSWAP       vnode maps a swap space
0x08     VXOFTHOLD   vnode allows soft holds
0x010    VEXLOCK     exclusive lock
0x020    VSHLOCK     shared lock
0x040    VLWAIT      there is a waiter on a lock
0x080    VXDEVOK     cross device rename and link is allowed
0x0100   VMMF        vnode is memory mapped
0x04000  VNOBCACHE   data caching disabled
0x08000  VN_MPSAFE   semaphore not required


Shared lock count, exclusive lock count, private data count, and reference count
  2 0 2 0 u_short v_shlockc
  4 0 2 0 u_short v_exlockc
  6 0 2 0 u_short v_tcount
  8 0 4 0 int     v_count
Pointer to vfs structure if this is a mount point
 12 0 4 0 *       v_vfsmountedhere
Pointers to the type-specific vnode operations array, unix ipc socket, stream head, and the vfs the file is in
 16 0 4 0 *       v_op
 20 0 4 0 *       v_socket
 24 0 4 0 *       v_stream
 28 0 4 0 *       v_vfsp
vnode type, device number, type-specific data (hfs inode, vxfs inode, rnode, ...), and parent file system type
 32 0 4 0 enum4   v_type
 36 0 4 0 int     v_rdev
 40 0 4 0 *       v_data
 44 0 4 0 enum4   v_fstype
Pointer to parent pseudo vas (if this maps to an executable image file)
 48 0 4 0 *       v_vas
Beta semaphore locking structure
 52 0 1 0 u_char  v_lock.b_lock
 54 0 2 0 u_short v_lock.order
 56 0 4 0 *       v_lock.owner
Mapped memory cache buffers for this file
 60 0 4 0 *       v_cleanblkhd
 64 0 4 0 *       v_dirtyblkhd
vnode write count, FS-independent locks, soft hold count, and a copy of the va_nodeid
 68 0 4 0 int     v_writecount
 72 0 4 0 *       v_locklist
 76 0 4 0 int     v_scount
 80 0 4 0 int     v_nodeid
DNLC entry pointers if this vnode is currently described in the directory name lookup cache
 84 0 4 0 *       v_ncachedhd
 88 0 4 0 *       v_ncachevhd
 92 0 4 0 *       v_pfdathd
 96 0 4 0 u_int   v_last_fsync

The HFS In-Core inode Contains Its vnode

For an HFS file, the in-core inode structure includes the vnode structure and a complete copy of its disk-resident inode (the icommon structure we examined earlier) plus additional fields of information (Listing 8.8):

Listing 8.8. q4> fields struct inode
  0 0  4 0 *         i_chain[0]
  4 0  4 0 *         i_chain[1]
The device number this inode came from
  8 0  4 0 int       i_dev
The inode number
 12 0  4 0 u_int     i_number
The state flags, lock word, and tid of the last thread to lock this inode
 16 0  4 0 u_int     i_flag
 20 0  2 0 u_short   i_lockword
 24 0  4 0 int       i_tid

The virtual file system's vnode for the file is stored between offset 28 and 127

inode device pointer
128 0  4 0 *         i_devvp
132 0  4 0 int       i_diroff
Continuation inode pointer
136 0  4 0 *         i_contip
File system pointer
140 0  4 0 *         i_fs
Quota pointer
144 0  4 0 *         i_dquot
148 0  4 0 int       i_rdev
152 0  4 0 int       i_un.if_lastr
152 0  4 0 *         i_un.is_socket
156 0  4 0 *         i_fr.if_freef
160 0  4 0 *         i_fr.if_freeb
164 0  4 0 *         i_fselr.i_selp
168 0  2 0 short     i_fselr.i_selflag
172 0  4 0 *         i_fselw.i_selp
176 0  2 0 short     i_fselw.i_selflag
180 0  4 0 *         i_mount

A copy of the disk-based inode (structure icommon) is stored between offset 184 and 311

Number of reader calls
312 0  2 0 u_short   i_rcount
Beta semaphore locking structures
316 0  1 0 u_char    i_b_sema.b_lock
318 0  2 0 u_short   i_b_sema.order
320 0  4 0 *         i_b_sema.owner
324 0  4 0 int       filler[0]
328 0  4 0 int       filler[1]

The relationship between the in-core inode data and the vnode varies between file system types, but the vnode structure itself is identical for all supported file systems. Because the vnode dispatches through operations array pointers, it is up to the programmers of the referenced routines to have intimate knowledge of the inode's type-specific content.
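
The following sketch shows how a type-specific routine might recover its private inode from the generic vnode. The VTOI() macro name follows the classic BSD convention and is an assumption here, not necessarily the HP-UX spelling.

/*
 * Sketch under assumption: the generic vnode carries a v_data pointer
 * that only type-specific code interprets.
 */
struct vnode {
    int   v_fstype;     /* which file system implementation owns this vnode */
    void *v_data;       /* opaque to everyone but that implementation       */
    /* ... remaining fields from Listing 8.7 ... */
};

struct inode {          /* HFS in-core inode (see Listing 8.8) */
    unsigned int i_number;
    /* ... */
};

/* Only HFS routines apply this conversion; VxFS or NFS code would cast
 * v_data to its own private structure instead. */
#define VTOI(vp)  ((struct inode *)((vp)->v_data))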

When a file is opened by the kernel, it is identified by the device number of the volume on which the file system is located and its inode number. By using these two values, each file is guaranteed to have a unique identity within the scope of the operating system.

Caching inode/vnode Data in the Kernel

To improve system performance and reduce the amount of time spent accessing metadata on the disk, inode/vnode data is maintained in kernel-resident caches. Each type of file system is responsible for the allocation and maintenance of its inode/vnode cache. In most implementations, caches are sized by kernel-tunable parameters or by an embedded formula. A hash is formed using the device number and the inode number. Before a copy of an inode is read in from the physical disk, the inode cache is always checked to see if a copy of the data is already present.
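
A minimal sketch of such a cache lookup is shown below. The hash function, bucket count, and structure layout are illustrative assumptions, since each file system type implements its own cache.

#include <stddef.h>

#define ICACHE_BUCKETS 2048                   /* assumed power of two */

struct cached_inode {
    int                  dev;                 /* device the inode lives on */
    unsigned int         ino;                 /* inode number              */
    struct cached_inode *hash_next;           /* hash chain linkage        */
    /* ... in-core inode/vnode data ... */
};

static struct cached_inode *ihash[ICACHE_BUCKETS];

static size_t ihash_bucket(int dev, unsigned int ino)
{
    return ((unsigned int)dev + ino) & (ICACHE_BUCKETS - 1);
}

/* Check the cache before scheduling a physical read of the disk inode. */
static struct cached_inode *icache_lookup(int dev, unsigned int ino)
{
    struct cached_inode *ip;

    for (ip = ihash[ihash_bucket(dev, ino)]; ip != NULL; ip = ip->hash_next)
        if (ip->dev == dev && ip->ino == ino)
            return ip;                        /* hit: no disk access needed */
    return NULL;                              /* miss: caller reads from disk */
}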

Smoke and Mirrors: Tying Together the File Subsystem Abstractions

Next, we need to tie the abstractions together into a seamless VFS implementation (see Figure 8-15).

Figure 8-15. Building a Seamless File System



Let's revisit our earlier scenario except that now /home is a separate file system mounted to the system's root file system. Again, our challenge is to follow the pathname /home/chris/stuff through the kernel's virtual structures.

We start our search by locating the kernel's copy of the / inode (2 on the first volume). The inode/vnode and root directory data of all mount points and file system roots are held in the caches and the buffer cache as long as a file system is mounted. To speed up this initial search, the kernel maintains a permanent shortcut pointer to the cached copy of the / inode/vnode in the symbol rootdir.

From this vnode, we determine the file system device number, and from the associated inode we get the data block pointer. These two numbers are used to hash our way into the buffer cache, where we find a copy of the root file system's root directory data. We take the inode number associated with the home directory and the device number to hash our way back into the inode cache for the next step of our search.

When we find the cached vnode for the /home directory (5 on the first volume), the v_vfsmountedhere pointer is valid and directs us to the kernel mount table and the vfs structure for the second mounted volume (did you ever play the children's game Chutes and Ladders?). We then follow the v_data pointer to find the root inode/vnode (2 on the second volume) and cached data block for the mounted /home file system.

We continue our search within the /home file system, locating the inode/vnode (10 on the second volume) and cached data for the /home/chris directory and finally the inode/vnode (14 on the second volume) and cached data for the /home/chris/stuff file. Accessing these last two stops on our search may or may not find the requested data in the appropriate caches. If they aren't loaded, then the system uses the routines referenced by the operational arrays and kernel functions to load copies from the disk into the caches.
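
The following pseudocode-style sketch condenses the walk just described. dir_search() and vfs_root_of() are hypothetical helpers standing in for the cache searches and the vfs_root operation of the mounted file system.

struct vfs;

struct vnode {
    struct vfs *v_vfsmountedhere;   /* non-NULL if a file system is mounted here */
    /* ... */
};

/* Hypothetical helpers: resolve a name via the cached directory data,
 * and fetch the root vnode of a mounted file system. */
struct vnode *dir_search(struct vnode *dvp, const char *name);
struct vnode *vfs_root_of(struct vfs *vfsp);

struct vnode *lookup_component(struct vnode *dvp, const char *name)
{
    struct vnode *vp = dir_search(dvp, name);

    /* Cross a mount point: /home's covered vnode leads, via the vfs
     * entry, to the root vnode of the mounted /home file system. */
    while (vp != NULL && vp->v_vfsmountedhere != NULL)
        vp = vfs_root_of(vp->v_vfsmountedhere);

    return vp;
}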

This brings us to the end of our quest. Next we consider the attachment of an open file to a process.

Connecting an Open File to a Process

We have seen that the proc structure is the base of all process-related kernel structures. The definition of a process's resources includes mapping opened files to per-process file descriptors (see Figure 8-16).

Figure 8-16. System File Table



From the kernel's point of view, each distinct file open() call must be tracked, which is accomplished by making entries in the system file table. A process is concerned only with its own file descriptors, but the kernel must manage all open() calls system-wide.

The number of file descriptors a single process may have is controlled by the kernel-tunable parameters max_files (for a regular user process) and max_files_lim (for a superuser process), and may be set up to 60,000. The default is 60 and is usually sufficient, considering that most processes only need to map stdin, stdout, and stderr.

To avoid wasting kernel memory space, process file descriptors are stored in blocks of eight for HP-UX 11.11 (HP-UX 11.0 used 32 descriptors per block), and the blocks are referenced through a partition table pointed to by the proc structure. Additional blocks are allocated as needed by the kernel.
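
The arithmetic behind this block-of-eight layout is simple; the sketch below illustrates it with invented structure and function names.

#define FDS_PER_BLOCK 8        /* 32 descriptors per block on HP-UX 11.0 */

struct file;                   /* system file table entry */

struct fd_block {
    struct file *fd_filep[FDS_PER_BLOCK];
};

struct fd_partition {
    struct fd_block **blocks;  /* table of pointers to allocated blocks */
    int               nblocks;
};

/* Translate a per-process descriptor number into its file table entry. */
static struct file *fd_to_file(struct fd_partition *p, int fd)
{
    int block = fd / FDS_PER_BLOCK;
    int slot  = fd % FDS_PER_BLOCK;

    if (block >= p->nblocks || p->blocks[block] == NULL)
        return NULL;           /* block not yet allocated by the kernel */
    return p->blocks[block]->fd_filep[slot];
}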

On HP-UX 11.11 the file_descriptor_t structure contained:

fd_lock        a spinlock structure
*fd_filep      a pointer to a system file table file entry
fd_locker_tid  a reference to a locking thread
fd_state       the current descriptor state
fd_pofile      reference to the ofile partition entry

The system file table consists of nfile entries (a kernel-tunable parameter) of type file. See Listings 8.9 and 8.10.

Listing 8.9. q4> fields struct file
The file descriptor table flags are:
  0 0 4 0 int       f_flag

0x01  FREAD    descriptor is readable
0x02  FWRITE   descriptor is writable
0x04  FNDELAY  no delay
0x08  FAPPEND  append on each write


The kernel object represented by this entry
  4 0 2 0 short     f_type

0x01  DTYPE_VNODE    regular file
0x02  DTYPE_SOCKET   Berkeley IPC Socket
0x03  DTYPE_STREAMS  Streams file type
0x05  DTYPE_UNSP     user nsp control
0x06  DTYPE_LLA      link-level LAN access


link count from the message queue
   6 0 2 0 short     f_msgcount
reference count
   8 0 4 0 int       f_count
type-specific file system operations array pointer
  12 0 4 0 *         f_ops
pointer to this file's vnode/socket kernel data
  16 0 4 0 *         f_data
  20 0 4 0 int       f_freeindex
  24 0 8 0 long long f_offset
  32 0 4 0 *         f_buf
credentials of the user who opened the file
  36 0 4 0 *         f_cred
kernel beta-semaphore
  40 0 1 0 u_char    f_b_sema.b_lock
  42 0 2 0 u_short   f_b_sema.order
  44 0 4 0 *         f_b_sema.owner
data intercept audit flag
  48 0 2 0 short     f_tap

Listing 8.10. q4> fields struct fileops
 0 0 4 0 * fo_rw
 4 0 4 0 * fo_ioctl
 8 0 4 0 * fo_select
12 0 4 0 * fo_close

The Buffer Cache

Just as the kernel keeps copies of open inode data in a cache, it also maintains a much larger cache of recently accessed file data blocks (or extents, depending on the type of file system). This buffer cache is a key component of the HP-UX operating system. Figure 8-17 provides an overview of the buffer cache design.

Figure 8-17. The Buffer Cache



A basic premise of the UNIX operating system is that everything is a file. This model serves many purposes and greatly reduces the overall complexity of the operating system. With this in mind, it is understandable that anything that improves file access performance provides a large payback.

Being Flexible: The Dynamic Buffer Cache

The HP-UX implementation of a buffer cache follows a dynamic model. The buffer cache is mapped into the kernel's address space, but the physical pages allocated to it come from the same available memory pools as those used for user processes. In this manner, the buffer cache competes with user code for system memory resources.

The kernel allows the administrator to set a maximum and minimum percentage of the available memory that may be allocated for buffer cache use. Remember that available memory means the amount of physical memory remaining after the kernel allocates pages for its exclusive use during system initialization. The kernel-tunable parameters to enable a dynamic buffer cache are:

  • dbc_min_pct: Minimum percentage of available memory that may be allocated for buffer cache use

  • dbc_max_pct: Maximum percentage of available memory pages that may be allocated for buffer cache use

If the dbc_min_pct is set equal to or larger than the dbc_max_pct, then the dynamic buffer cache is capped at the size dictated by dbc_min_pct. This presents what at first seems to be a conundrum. How can it be a dynamic buffer cache if the size is fixed? While the size is fixed, it is simply a targeted value and not an absolute. As long as the buffer cache is considered to be dynamic, its pages may be aged and stolen by the vhand daemon during times of heavy memory pressure. This means that while the overall size appears to be static, the specific pages allocated to the cache may change from time to time, so they are considered dynamic.
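
The sketch below walks through the sizing arithmetic implied by these tunables, assuming the percentages are applied to available memory counted in pages; the numbers used are examples only, not HP-UX defaults.

#include <stdio.h>

int main(void)
{
    long available_pages = 262144;   /* example: 1 GB of 4-KB pages        */
    int  dbc_min_pct     = 5;        /* tunable: minimum % for the cache   */
    int  dbc_max_pct     = 50;       /* tunable: maximum % for the cache   */

    long min_pages = available_pages * dbc_min_pct / 100;
    long max_pages = available_pages * dbc_max_pct / 100;

    if (dbc_min_pct >= dbc_max_pct)  /* cache is effectively capped at min */
        max_pages = min_pages;

    printf("buffer cache may float between %ld and %ld pages\n",
           min_pages, max_pages);
    return 0;
}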

If these two dynamic tunables are not set, then the following older kernel parameters take effect and the cache is flagged as static; that is, pages once allocated to the cache remain there and vhand cannot scan them.

  • nbuf: Number of buffer headers

  • bufpages: Number of buffer pages

Once a buffer has been populated with a copy of file data, the kernel needs an efficient method to search for it in the cache. This is accomplished through the use of a hash (by now you expected this, didn't you!).

The Buffer Cache Hash

Figure 8-18 illustrates the basic design of the buffer cache. Individual buffers are pointed to by buffer headers, which are arranged in hash chains. In the case of a read(), the hash offset is calculated using the device number and block offset (obtained from the vnode) for the requested disk data.

Figure 8-18. The Buffer Cache Hash



Once the hash has been calculated, the appropriate hash chain is searched to see if the requested data is present in the cache. In the case of a miss, the kernel works with the appropriate hardware driver, requests a physical data transfer from the disk, and allocates a free cache buffer to store the data in. Requests made and satisfied by the buffer cache are said to be logical reads and writes. Reads and writes between the buffer cache and a disk are called physical.
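
The read path can be summarized with the sketch below. bcache_lookup(), getnewbuf(), and disk_read() are hypothetical helpers used only to show the logical-hit versus physical-miss split.

struct buf;

struct buf *bcache_lookup(int dev, int blkno);   /* walk the hash chain       */
struct buf *getnewbuf(void);                     /* take one off a free list  */
void        disk_read(int dev, int blkno, struct buf *bp);

struct buf *bread_sketch(int dev, int blkno)
{
    struct buf *bp = bcache_lookup(dev, blkno);

    if (bp != NULL)
        return bp;                  /* logical read: data already cached  */

    bp = getnewbuf();               /* miss: claim a free buffer ...      */
    disk_read(dev, blkno, bp);      /* ... and schedule a physical read   */
    return bp;                      /* buffer is hashed in for next time  */
}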

A write() request is handled in a similar manner. If a buffer for the data has already been started, a simple logical write is performed. When a buffer is filled, or when a dirty buffer (one whose content differs from its counterpart on the disk) is found by the sync daemon during one of its scheduled scans, a physical write is scheduled. Physical writes may also be explicitly requested by a programmer or triggered when a file is closed.

Listings 8.11 and 8.12 show selected fields from q4 listings of the buffer header and buffer structure.

Listing 8.11. q4> fields struct bufhd
The buffer header state flag
   0 0 4 0 int b_flags

0x00800000  B_PAGEOUT  this is a buffer header, not a buffer


Forward/backward linkage pointers for the hash chain
   4 0 4 0 *   b_forw
   8 0 4 0 *   b_back
Modification timestamp
  12 0 4 0 int b_checkdup
Lock structure for this buffer hash chain
  16 0 4 0 *   bh_lock

Listing 8.12. q4> fields struct buf (a partial listing)
The buffer state flags
  0 0 4 0 int                b_flags

0x00000000  B_WRITE
0x00000001  B_READ
0x00000002  B_DONE       transaction completed
0x00000004  B_ERROR      transaction failed
0x00000008  B_BUSY       not currently on a free list
0x00000010  B_PHYS       physical I/O in progress
0x00000040  B_WANTED     send a wakeup when no longer BUSY
0x00000100  B_ASYNC      don't wait for I/O completion
0x00001000  B_PRIVATE    used by LVM
0x00004000  B_PFTIMEOUT  LVM power fail timeout
0x00008000  B_CACHE
0x00010000  B_INVAL      buffer not valid
0x00020000  B_FSYSIO     buffer resulted from bread or bwrite
0x00040000  B_CALL       call b_iodone() from iodone()
0x00080000  B_NOCACHE    don't hold block in buffer cache
0x00100000  B_RAW        raw access I/O
0x00200000  B_BCACHE     buffer is from the buffer cache
0x00400000  B_SYNC       buffer write is synchronous


Linkage pointers connecting a buffer to a hash chain
    4 0 4 0 *                  b_forw
    8 0 4 0 *                  b_back
Linkage pointers connecting a buffer to a free list
   12 0 4 0 *                  av_forw
   16 0 4 0 *                  av_back
Linkage pointers connecting a buffer to a vnode's clean/dirty list
   20 0 4 0 *                  b_blockf
   24 0 4 0 *                  b_blockb
   28 0 4 0 *                  b_strategy
   32 0 4 0 *                  b_un.b_addr
   32 0 4 0 *                  b_un.b_words
   32 0 4 0 *                  b_un.b_filsys
   32 0 4 0 *                  b_un.b_cs
   32 0 4 0 *                  b_un.b_cg
   32 0 4 0 *                  b_un.b_dino
   32 0 4 0 *                  b_un.b_daddr
   36 0 4 0 int                b_un2.b_sectno
   36 0 4 0 int                b_un2.b_byteno
   40 0 4 0 int                b_dev
   44 0 4 0 int                b_blkno
   48 0 4 0 *                  b_sc
The count of how many times this buffer has been returned to a free list since it was allocated
   52 0 4 0 int                b_bcount
   56 0 2 0 u_short            b2_flags
Error number (if B_ERROR is set in b_flags), buffer size, and offset
   58 0 2 0 short              b_error
   60 0 4 0 int                b_bufsize
   64 0 4 0 int                b_offset
   68 0 4 0 *                  b_merge
This pointer contains the address of the routine to call when a pending I/O is finished
   72 0 4 0 *                  b_iodone

Free Lists: Allocating a New Buffer

When a buffer is in use, its b_flags is set to B_BUSY. As soon as any pending data transfer is complete, it is returned to a free list. There are several free lists.

When a page is first allocated to a dynamic buffer cache, its buffers are placed on the BQ_EMPTY free list. Requests to grow the buffer cache depend on its level of use.

Once a buffer is allocated, used, and freed, it is placed on the BQ_AGE list, and its b_count is incremented. Once the b_count exceeds 4, the buffer is placed on the BQ_LRU list when it is freed. In this manner, a buffer that has been loaded to satisfy a truly random-access read request is more likely to be reallocated than one that is servicing repeated requests.
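
The promotion rule can be expressed in a few lines. The structure and function names below are simplified stand-ins; the threshold of four is taken from the description above.

enum freelist { BQ_EMPTY, BQ_AGE, BQ_LRU };

struct buf_sketch {
    int free_count;        /* times returned to a free list since allocation */
};

/* Decide which free list a buffer joins when its I/O completes. */
static enum freelist release_to(struct buf_sketch *bp)
{
    bp->free_count++;
    return (bp->free_count > 4) ? BQ_LRU : BQ_AGE;
}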

Each processor has its own set of free list headers. The rules for finding an available buffer with a fixed buffer cache are as follows:

  • Search the processor's BQ_AGE list, then its BQ_LRU.

  • Next, search the BQ_AGE lists of all the other processors.

  • Finally, search their BQ_LRU lists.

In the case of a dynamic buffer cache:

  • Check the current processor's BQ_AGE and BQ_LRU.

  • Next we check the BQ_EMPTY list (we wait to check it because the allocation of new pages to the cache is performed in the background).

  • Assuming that we can't find a buffer on one of our processor's free lists, we next check the neighboring processors' BQ_AGE and then their BQ_LRU lists.

If we are still unsuccessful, the thread blocks until a buffer is available.

An interesting observation is that items cached in the buffer cache don't behave like items in most other caches. When a buffer is involved in an I/O operation, it is B_BUSY; the rest of the time, it is placed on a free list. In this manner, the more a buffer is referenced, the smaller its chances of being reallocated. If the cache is not busy, a buffer may persist for quite some time following its use; if another process comes along requesting the data, it may be pleasantly surprised to find it already in the cache, where it may be accessed logically.

A fourth queue, BQ_SENDFILE, lists transient buffers in use by the kernel routine sendfile().

Bigger Isn't Always Better

The time required to search the hash chain is offset by the fact that a logical read or write is several orders of magnitude faster than a physical transaction. The greater the "hit rate" for data in the cache, the greater the payback is for this scheme. For files accessed in a sequential manner, the hit rate may exceed 90 percent! Adjusting the size of the buffer cache may help to increase the hit rate, but caution should be used to make sure that the system has plenty of available memory and that increasing the buffer cache doesn't cause an increase in the page-out rate, or the gains in the hit rate could be lost in the added overhead of vhand.
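
A back-of-the-envelope calculation shows why the hit rate dominates. The timings below are illustrative assumptions, not measured HP-UX figures.

#include <stdio.h>

int main(void)
{
    double t_logical_us  = 10.0;       /* assumed cache (logical) access   */
    double t_physical_us = 10000.0;    /* assumed disk (physical) access   */
    double hit_rate      = 0.90;       /* 90% of requests found in cache   */

    double avg = hit_rate * t_logical_us + (1.0 - hit_rate) * t_physical_us;

    printf("average access: %.1f us vs %.1f us uncached\n",
           avg, t_physical_us);        /* ~1009 us vs 10000 us             */
    return 0;
}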

Another concern is that if the buffer cache gets too large, then the added length of the hash chain requires more search time. At some point the payback of caching may be diminished.

Prior to HP-UX 11.11, the conventional wisdom was that the buffer cache should not exceed 0.5 GB. Recent modifications (in HP-UX 11.11) to the hash table mechanics suggest that sizes in the 1-GB to 2-GB range may be considered.

The Directory Name Lookup Cache, DNLC

The final component of the file system we examine in this chapter is the directory name lookup cache, or DNLC (Figure 8-19). Just as the kernel maintains caches for data blocks and inodes for improved performance, it also maintains a cache of recently referenced pathnames.

Figure 8-19. Directory Name Lookup Cache



We have seen how a user-supplied pathname is processed by the kernel through the VFS. If we could skip past most of the smoke-and-mirrors translation tasks, we could reduce the time required to open a file descriptor.

Consider a thread that has changed to a working directory and is accessing a number of the files in that directory. Each time an open() call is made, the kernel must convert the pathname to an appropriate vnode and data block. The DNLC caches individual directory names and accesses them via a hash based on their name strings.

Entries in the cache are kept on a free list in a least-recently-used fashion. One limitation is that individual names must not be longer than 39 characters, the maximum space allocated for a name in the cache entry. Longer names may be used, but they will not be cached (prior to HP-UX 11.0, the maximum length for cached names was 15 characters).

Hash chain headers are stored in the kernel's nc_hash[] array. The hash function uses the vnode address and the file/directory name to calculate the chain header to which it should be linked.
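
A sketch of such a hash is shown below; the mixing function and bucket count are assumptions made for illustration only.

#include <stddef.h>
#include <stdint.h>

#define NC_BUCKETS 1024                 /* assumed power of two                  */
#define NC_MAXNAME 39                   /* names longer than this are not cached */

struct vnode;

static size_t nc_hash(const struct vnode *dvp, const char *name, size_t len)
{
    size_t h = (uintptr_t)dvp >> 4;     /* fold in the parent vnode address */
    size_t i;

    for (i = 0; i < len; i++)           /* fold in the component name bytes */
        h = h * 31 + (unsigned char)name[i];

    return h & (NC_BUCKETS - 1);
}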

Consider Listings 8.13 and 8.14 of the DNLC structures.

Listing 8.13. q4> fields struct nc_hash
Each nc_hash[] array element contains two linkage pointers
   0 0 4 0 * hash_next
   4 0 4 0 * hash_prev

Listing 8.14. q4> fields struct ncache
First the entry is linked to its hash chain
   0 0  4 0 *        hash_next
   4 0  4 0 *        hash_prev
All entries are linked into a least recently used list
   8 0  4 0 *        lru_next
  12 0  4 0 *        lru_prev
This points to the directory's vnode
  16 0  4 0 *        vp
And a pointer to the parent directory's vnode
  20 0  4 0 *        dp
The entry's length, name, and credentials are stored next
  24 0  1 0 char     namlen
  25 0 39 0 char[39] name
  64 0  4 0 *        cred
Mount point directories are linked to the virtual file system table
  68 0  4 0 *        vfs_next
  72 0  4 0 *        vfs_prev
All entries sharing a parent directory are dual linked
  76 0  4 0 *        dp_next
  80 0  4 0 *        dp_prev
All entries sharing a vnode are dual linked
  84 0  4 0 *        vp_next
  88 0  4 0 *        vp_prev


