Flylib.com

Books Software

 
 
 

The Kernel View of File Systems

   

The Kernel View of File Systems

To isolate the rest of the kernel from the type-specific structures used by various file system types, a Virtual File System (VFS) abstraction layer is employed. Figure 8-11 gives us an overview of the VFS components .

Figure 8-11. Kernel File System Tables

graphics/08fig11.gif


The VFS abstraction layer was first introduced to UNIX by Sun Microsystems engineers while they were developing the Network File System (NFS). They wanted local users and applications to be oblivious to the actual location of the files they were using. Current UNIX operating systems frequently build on this approach to masquerade the specifics of local and remote file system types mounted within their kernels .

The HP-UX kernel uses a number of VFS structures, virtual inode s ( vnode ), caches, and hash headers to facilitate file access. To get started, we examine two key components of the virtual file system: the vfs mount table and the vnode .

The vfs Mount Table

The starting point for the HP-UX file management subsystem is the pointer rootvfs . This kernel symbol directs us to the " / " (or root) file system's vfs data structure (see Figure 8-12).

Figure 8-12. Mounting a File System

graphics/08fig12.gif


Each new file system mount creates an additional entry in the linked list of vfs structures. Once a mount has been made, its vfs entry is maintained until the kernel is rebooted, even if the file system is unmounted (if it is later remounted, the original vfs structure is reused). Mount points configured by the automounter also add vfs structures to the list.

There are several notable components within the vfs structure. Among them are a pointer to the next vfs , a type value, and a pointer to an operational array (vector jump table) that maps type-specific kernel routines needed to manipulate the file system, and a pointer to its kernel-resident data structures. The kernel-resident data structures vary greatly in content between the various file system implementations . The type-specific routines mapped by the operations array know what to expect and how to work with the specifics of their design.

Listings 8.3 and 8.4 are annotated listings of the vfs structure and the vfsops vector array it points to.

Listing 8.3. q4> fields struct vfs
A forward-pointing link to the next mounted file system (in 

chronological order)

   0 0    4 0 *          vfs_next

A hash chain for rapid lookup (used to see if a vfs structure 

already exists before we create a new one)

   4 0    4 0 *          vfs_hash

A pointer to the file system type-specific operations array

   8 0    4 0 *          vfs_op

A pointer to the in-core copy of the vnode this file system is

mounted on

  12 0    4 0 *          vfs_vnodecovered

The vfs state flags

  16 0    4 0 int        vfs_flag

0x01VFS_RDONLY

read-only

0x02VFS_MLOCK

lock this vfs (keep subtree stable)

0x04VFS_MWAIT

there is a waiter for the lock

0x08VFS_NOSUID

disable suid for file system

0x10VFS_EXPORTED

this file system is exported (NFS)

0x20VFS_REMOTE

this is an imported file system (NFS)

0x40VFS_HASQUOTA

this file system has quotas enabled

0x200VFS_MOUNTING

used in conjunction with VFS_MLOCK

0x800VFS_LARGEFILES

enable use of large files


This file system's native block size

  20 0    4 0 int        vfs_bsize

For exported file systems (NFS) this is the UID remote "root" 

users are remapped to and the export state flag

  24 0    2 0 u_short    vfs_exroot

  26 0    2 0 short      vfs_exflags

This points to the file system's type-specific in-core data 

structures

  28 0    4 0 *          vfs_data

A reference count of processes currently sleeping on the file

system's mount point

  32 0    4 0 int        vfs_icount

The file system's specific type (HFS, VxFS, NFS, CDFS, ...)and 

its unique id

  36 0    2 0 short      vfs_mtype

  40 0    4 0 int        vfs_fsid[0]

  44 0    4 0 int        vfs_fsid[1]

A log pointer, the most recent mount timestamp, file system

name, device number (Major+minor device number), and a chain

pointer to DNLC entries referencing this file system

  48 0    4 0 *          vfs_logp

  52 0    4 0 int        vfs_mnttime

  56 0 1024 0 char[1024] vfs_name

1080 0    4 0 int        vfs_dev

1084 0    4 0 *          vfs_ncachehd

Listing 8.4. q4> fields struct vfsops
The following pointers form a vector jump table to type-

specific operational routines in the kernel

 0 0 4 0 * vfs_mount

 4 0 4 0 * vfs_unmount

 8 0 4 0 * vfs_root

12 0 4 0 * vfs_statvfs

16 0 4 0 * vfs_sync

20 0 4 0 * vfs_vget

24 0 4 0 * vfs_getmount

28 0 4 0 * vfs_freeze

32 0 4 0 * vfs_thaw

36 0 4 0 * vfs_quota

40 0 4 0 * vfs_mountroot

44 0 4 0 * vfs_size

The Type-Specific In-Core Data Structures

Each file system type requires its own specific in-core data. Figure 8-13 illustrates the point.

Figure 8-13. Type-Specific Data Structures (HFS Example)

graphics/08fig13.gif


In general, the in-core data has a copy of the file system's superblock, a pointer to the kernel's copy of the inode / vnode to which the file system is mounted, and copies of the file system's allocation tables.

In the case of HFS, the kernel-resident structures include a mount structure (with pointers to the mount point's inode / vnode ), a pointer back to the file system's vfs structure, and a pointer to a copy of the superblock maintained in the buffer cache (more on the buffer cache later in this chapter). Let's examine the HFS mount and fs (superblock) structures in Listings 8.5 and 8.6.

Listing 8.5. q4> fields struct mount
First are forward and backward linkage pointers to a hash list

   0 0 4 0 *       m_hforw

   4 0 4 0 *       m_hback

Next is a pointer to the file system's vfs structure, the 

block device number (major and \minor number) on which the 

file system resides, and a pointer to the buffer header which

maps the in-core copy of the file system's superblock

   8 0 4 0 *       m_vfsp

  12 0 4 0 int     m_dev

  16 0 4 0 *       m_bufp

  20 0 4 0 int     m_maxbufs

The raw device number is also recorded here

  24 0 4 0 int     m_rdev

  28 0 4 0 *       m_nodes.m_inodp

  28 0 4 0 *       m_nodes.m_cdnodp

  32 0 4 0 *       m_qinod

Mount flags are stored here

  36 0 4 0 int     m_flag

  40 0 2 0 u_short m_qflags

  44 0 4 0 u_int   m_btimelimit

  48 0 4 0 u_int   m_ftimelimit

A pointer to the file system's root inode/vnode in-core, a 

reference count, and a pointer to the mount point's in-core

inode/vnode

  52 0 4 0 *       m_rootvp

  56 0 4 0 u_int   m_ref

  60 0 4 0 *       m_iused

Listing 8.6. q4> fields struct fs
0 0   4 0 int       fs_unused[0]

   4 0   4 0 int       fs_unused[1]

First we have the file system's super block address, the

offset of the cylinder group data block, the inode blocks, and 

the first data block following the cylinder groups

   8 0   4 0 int       fs_sblkno

  12 0   4 0 int       fs_cblkno

  16 0   4 0 int       fs_iblkno

  20 0   4 0 int       fs_dblkno

  24 0   4 0 int       fs_cgoffset

  28 0   4 0 int       fs_cgmask

Next is the last modification timestamp, the number of blocks

(total space), and number of data blocks (minus the metadata

overhead) in the file system.

  32 0   4 0 int       fs_time

  36 0   4 0 int       fs_size

  40 0   4 0 int       fs_dsize

Additional static and tunable parameters include the number of

cylinders per group, block and fragment sizes, number of 

fragments in a block (1,2,4,8), minimum reserved free space

(applies only to non-superuser allocation requests),

rotational delay for writing sequential blocks (also known as

interleave), and drive speed (in revolutions per second)

  44 0   4 0 int       fs_ncg

  48 0   4 0 int       fs_bsize

  52 0   4 0 int       fs_fsize

  56 0   4 0 int       fs_frag

  60 0   4 0 int       fs_minfree

  64 0   4 0 int       fs_rotdelay

  68 0   4 0 int       fs_rps

  72 0   4 0 int       fs_bmask

  76 0   4 0 int       fs_fmask

  80 0   4 0 int       fs_bshift

  84 0   4 0 int       fs_fshift

Maximum contiguous blocks and max number of blocks allocated 

to a single file within a cylinder group

  88 0   4 0 int       fs_maxcontig

  92 0   4 0 int       fs_maxbpg

  96 0   4 0 int       fs_fragshift

 100 0   4 0 int       fs_fsbtodb

 104 0   4 0 int       fs_sbsize

 108 0   4 0 int       fs_csmask

 112 0   4 0 int       fs_csshift

 116 0   4 0 int       fs_nindir

 120 0   4 0 int       fs_inopb

 124 0   4 0 int       fs_nspf

This file system's unique identifier number

 128 0   4 0 int       fs_id[0]

 132 0   4 0 int       fs_id[1]

Mirror state information for primary swap and "/" partitions

 136 0   0 4 u_int     fs_mirror.state.root

 136 4   0 1 u_int     fs_mirror.state.rflag

 136 5   0 4 u_int     fs_mirror.state.swap

 137 1   0 1 u_int     fs_mirror.state.sflag

 137 2   2 6 u_int     fs_mirror.state.spare

 140 0   4 0 int       fs_mirror.mirtime

 144 0   4 0 int       fs_featurebits

 148 0   4 0 int       fs_optim

Cylinder group summary array address, summary array size, 

cylinder group size

 152 0   4 0 int       fs_csaddr

 156 0   4 0 int       fs_cssize

 160 0   4 0 int       fs_cgsize

Number of tracks per cylinder group, sectors per track, 

sectors per cylinder, total number of cylinders in the file 

system, cylinders per group, and inodes per group

 164 0   4 0 int       fs_ntrak

 168 0   4 0 int       fs_nsect

 172 0   4 0 int       fs_spc

 176 0   4 0 int       fs_ncyl

 180 0   4 0 int       fs_cpg

 184 0   4 0 int       fs_ipg

 188 0   4 0 int       fs_fpg

Calculated cylinder group totals (this must be audited 

following a power failure)

 192 0   4 0 int       fs_cstotal.cs_ndir

 196 0   4 0 int       fs_cstotal.cs_nbfree

 200 0   4 0 int       fs_cstotal.cs_nifree

 204 0   4 0 int       fs_cstotal.cs_nffree

Next are the file system modification flag, the clean/dirty 

flag, the read-only flag, the spare flag (currently not used),

and a string holding the current mount point of the file 

system

 208 0   1 0 char      fs_fmod

 209 0   1 0 char      fs_clean

 210 0   1 0 char      fs_ronly

 211 0   1 0 char      fs_flags

 212 0 512 0 char[512] fs_fsmnt

 724 0   4 0 int       fs_cgrotor

 728 0   4 0 *         fs_csp

 732 0   4 0 int       fs_csp_pad

 736 0   4 0 int       fs_unused2[30]

 856 0   4 0 int       fs_cpc

 860 0   2 0 short     fs_postbl[32][8]

This file system's "magic number" (used to verify its type), 

its type name and pack name

1372 0   4 0 int       fs_magic

1376 0   6 0 char[6]   fs_fname

1382 0   6 0 char[6]   fs_fpack

1388 0   1 0 u_char    fs_rotbl[0]

The vnode

Next, we address the specifics of the vnode structure. We have learned that an inode contains all of a file's credentials and attributes, and is therefore the key to managing a file within a file system. The inode number is unique within the context of a single file system, and it is all that is needed to define a file. Within a kernel, there may be several file systems mounted together to create what appears to be a seamless, larger file system. Within this larger context, inode numbers are no longer guaranteed to be unique. Each mounted file system may have an inode 1, 2, and so on. To enable the kernel to refer to opened files with a unique handle, the virtual inode or vnode was created.

In a similar manner to the vfs structure, a vnode contains an operational array pointer to the kernel's type-specific routines for file manipulation tasks (see Figure 8-14).

Figure 8-14. The vnode

graphics/08fig14.gif


The key fields of the vnode include a type number, a pointer to a type-specific operations array, and a reference to the in-core copy of the file's inode data. Listing 8.7 is an annotated listing of a virtual file system vnode:

Listing 8.7. q4> fields struct vnode
0 0 2 0 u_short v_flag

0x01

VROOT

root of a file system

0x02

VTEXT

vnode is a text prototype

0x04

VSWAP

vnode maps a swap space

0x08

VXOFTHOLD

vnode allows soft holds

0x010

VEXLOCK

exclusive lock

0x020

VSHLOCK

shared lock

0x040

VLWAIT

there is a waiter on a lock

0x080

VXDEVOK

cross device rename and link is allowed

0x0100

VMMF

vnode is memory mapped

0x04000

VNOBCACHE

data caching disabled

0x08000

VN_MPSAFE

semaphore not required


Shared lock count, exclusive lock count, private data count, 

and reference count

 2 0 2 0 u_short v_shlockc

 4 0 2 0 u_short v_exlockc

 6 0 2 0 u_short v_tcount

 8 0 4 0 int     v_count

Pointer to vfs structure if this is a mount point

12 0 4 0 *       v_vfsmountedhere

Pointers to the type specific vnode operations array, unix ipc

socket, stream head, and the vfs the file is in

16 0 4 0 *       v_op

20 0 4 0 *       v_socket

24 0 4 0 *       v_stream

28 0 4 0 *       v_vfsp

vnode type, device number, type-specific data (hfs inode, vxfs 

inode, rnode, ...), and parent file system type

32 0 4 0 enum4   v_type

36 0 4 0 int     v_rdev

40 0 4 0 *       v_data

44 0 4 0 enum4   v_fstype

Pointer to parent pseudo vas (if this maps to an executable 

image file)

48 0 4 0 *       v_vas

Beta semaphore locking structure

52 0 1 0 u_char  v_lock.b_lock

54 0 2 0 u_short v_lock.order

56 0 4 0 *       v_lock.owner

Mapped memory cache buffers for this file

60 0 4 0 *       v_cleanblkhd

64 0 4 0 *       v_dirtyblkhd

vnode write count, FS independent locks, soft hold count, and 

a copy of the va_nodeid

68 0 4 0 int     v_writecount

72 0 4 0 *       v_locklist

76 0 4 0 int     v_scount

80 0 4 0 int     v_nodeid

DNLC entry pointers if this vnode is currently described in 

the directory name lookup cache

84 0 4 0 *       v_ncachedhd

88 0 4 0 *       v_ncachevhd

92 0 4 0 *       v_pfdathd

96 0 4 0 u_int   v_last_fsync

The HFS In-Core inode Contains Its vnode

For an HFS file, the in-core inode structure includes the vnode structure and a complete copy of its disk-resident inode (the icommon structure we examined earlier) plus additional fields of information (Listing 8.8):

Listing 8.8. q4> fields struct inode
0 0  4 0 *         i_chain[0]

  4 0  4 0 *         i_chain[1]

The device number this inode came from

  8 0  4 0 int       i_dev

The inode number

 12 0  4 0 u_int     i_number

The state flags, lock word, and tid of last thread to lock this inode

 16 0  4 0 u_int     i_flag

 20 0  2 0 u_short   i_lockword

 24 0  4 0 int       i_tid

The virtual file system's vnode for the file is stored between offset 28 and 127

inode device pointer

128 0  4 0 *         i_devvp

132 0  4 0 int       i_diroff

Continuation inode pointer

136 0  4 0 *         i_contip

File system pointer

140 0  4 0 *         i_fs

Quota pointer

144 0  4 0 *         i_dquot

148 0  4 0 int       i_rdev

152 0  4 0 int       i_un.if_lastr

152 0  4 0 *         i_un.is_socket

156 0  4 0 *         i_fr.if_freef

160 0  4 0 *         i_fr.if_freeb

164 0  4 0 *         i_fselr.i_selp

168 0  2 0 short     i_fselr.i_selflag

172 0  4 0 *         i_fselw.i_selp

176 0  2 0 short     i_fselw.i_selflag

180 0  4 0 *         i_mount

Acopy of the disk-based inode (structure icommon) is stored between offset 184 and 311

Number of reader calls

312 0  2 0 u_short   i_rcount

Beta semaphore locking structures

316 0  1 0 u_char    i_b_sema.b_lock

318 0  2 0 u_short   i_b_sema.order

320 0  4 0 *         i_b_sema.owner

324 0  4 0 int       filler[0]

328 0  4 0 int       filler[1]

The relationship between the in-core inode data and the vnode varies between file system types, but the vnode structure itself is identical for all supported file systems. As the vnode uses operation's array pointers, it is up to the programmers of the referenced routines to have intimate knowledge of inode type-specific content.

When a file is opened by the kernel, it is identified by the device number of the volume on which the file system is located and its inode number. By using these two values, each file is guaranteed to have a unique identity within the scope of the operating system.

Caching inode / vnode Data in the Kernel

To improve system performance and reduce the amount of time spent accessing metadata on the disk, inode / vnode data is maintained in kernel-resident caches. Each type of file system is responsible for the allocation and maintenance of its inode / vnode cache. In most implementations caches are sized by kernel-tunable parameters or by an imbedded formula. A hash is formed using the device number and the inode number. Before a copy of an inode is read in from the physical disk, the inode cache is always checked to see if a copy of the data is present.

Smoke and Mirrors: Tying Together the File Subsystem Abstractions

Next, we need to tie the abstractions together into a seamless VFS implementation (see Figure 8-15).

Figure 8-15. Building a Seamless File System

graphics/08fig15.gif


Let's revisit our earlier scenario except that now /home is a separate file system mounted to the system's root file system. Again, our challenge is to follow the pathname /home/chris/stuff through the kernel's virtual structures.

We start our search by locating the kernel's copy of the / inode (2 on the first volume). The inode / vnode and root directory data of all mount points and file system roots are held in the caches and the buffer cache as long as a file system is mounted. To speed up this initial search, the kernel maintains a permanent shortcut pointer to the cached copy of the / inode / vnode in the symbol rootdir .

From this vnode , we determine the file system device number, and from the associated inode we get the data block pointer. These two numbers are used to hash our way into the buffer cache, where we find a copy of the root file system's root directory data. We take the inode number associated with the home directory and the device number to hash our way back into the inode cache for the next step of our search.

When we find the cached vnode for the /home directory (5 on the first volume), the v_vfsmountedhere pointer is valid and directs us to the kernel mount table and the vfs structure for the second mounted volume (did you ever play the children's game Chutes and Ladders?). We then follow the v_data pointer to find the root inode / vnode (2 on the second volume) and cached data block for the mounted /home file system.

We continue our search within the /home file system, locating the inode / vnode (10 on the second volume) and cached data for the /home/chris directory and finally the inode / vnode (14 on the second volume) and cached data for the /home/chris/stuff file. Accessing these last two stops on our search may or may not find the requested data in the appropriate caches. If they aren't loaded, then the system uses the routines referenced by the operational arrays and kernel functions to load copies from the disk into the caches.

This brings us to the end of our quest. Next we consider the attachment of an open file to a process.

Connecting an Open File to a Process

We have seen that the proc structure is the base of all process- related kernel structures. The definition of a process's resources includes mapping opened files to per-process file descriptors (see Figure 8-16).

Figure 8-16. System File Table

graphics/08fig16.gif


From the kernel's point of view, each distinct file open() call must be tracked, which is accomplished by making entries in the system file table. From the process's point of view, it is only concerned with its own file descriptors, but the kernel must manage all open() calls.

The number of file descriptors a single process may have is controlled by the kernel-tunable parameters max_files (for a regular user process) and max_files_lim (for a superuser process), and may be set up to 60,000. The default is 60 and is usually sufficient, considering that most processes only need to map stdin , stdout , and stderr .

To avoid wasting kernel memory space, process file descriptors are stored in blocks of eight for HP-UX 11.11 (HP-UX 11.0 used 32 descriptors per block), and the blocks are referenced through a partition table pointed to by the proc structure. Additional blocks are allocated as needed by the kernel.

On HP-UX 11.11 the file_descriptor_t structure contained:

fd_lock       a spinlock structure

*fd_filep     a pointer to a system file table file entry

fd_locker_tid a reference to a locking thread

fd_state      the current descriptor state

fd_pofile     reference to the ofile partition entry

The kernel file system table consists of nfile entries (kernel-tunable) of type file . See Listings 8.9 and 8.10.

Listing 8.9. q4> fields struct file
The file descriptor table flags are:

  0 0 4 0 int       f_flag

0x01FREAD

descriptor is readable

0x02FWRITE

descriptor is writable

0x04FNDELAY

no delay

0x08FAPPEND

append on each write


The kernel object represented by this entry

  4 0 2 0 short     f_type

0x01DTYPE_VNODE

regular file

0x02DTYPE_SOCKET

Berkeley IPC Socket

0x03DTYPE_STREAMS

Streams file type

0x05DTYPE_UNSP

user nsp control

0x06DTYPE_LLA

link-level LAN access


{% if main.adsdop %}{% include 'adsenceinline.tpl' %}{% endif %}
link count from the message queue

   6 0 2 0 short     f_msgcount

reference count

   8 0 4 0 int       f_count

type-specific file system operations array pointer

  12 0 4 0 *         f_ops

pointer to this file's vnode/socket kernel data

  16 0 4 0 *         f_data

  20 0 4 0 int       f_freeindex

  24 0 8 0 long long f_offset

  32 0 4 0 *         f_buf

credentials of user who opened the file

  36 0 4 0 *         f_cred

kernel beta-semaphore

  40 0 1 0 u_char    f_b_sema.b_lock

  42 0 2 0 u_short   f_b_sema.order

  44 0 4 0 *         f_b_sema.owner

data intercept audit flag

  48 0 2 0 short     f_tap

Listing 8.10. q4> fields struct fileops
0 0 4 0 * fo_rw

 4 0 4 0 * fo_ioctl

 8 0 4 0 * fo_select

12 0 4 0 * fo_close

The Buffer Cache

Just as the kernel keeps copies of open inode data in a cache, it also maintains a much larger cache of recently accessed file data blocks (or extents, depending on the type of file system). This buffer cache is a key component of the HP-UX operating system. Figure 8-17 provides an overview of the buffer cache design.

Figure 8-17. The Buffer Cache

graphics/08fig17.gif


A basic premise of the UNIX operating system is that every thing is a file. This model serves many purposes and greatly reduces the overall complexity of the operating system. With this in mind, it is understandable that everything that may be done to improve file performance provides a large payback.

Being Flexible: The Dynamic Buffer Cache

The HP-UX implementation of a buffer cache follows a dynamic model. The buffer cache is mapped into the kernel's address space, but the physical pages allocated to it come from the same available memory pools as those used for user processes. In this manner, the buffer cache competes with user code for system memory resources.

The kernel allows the administrator to set a maximum and minimum percentage of the available memory that may be allocated for buffer cache use. Remember that available memory means the amount of physical memory remaining after the kernel allocates pages for its exclusive use during system initialization. The kernel-tunable parameters to enable a dynamic buffer cache are:

  • dbc_min_pct : Minimum percentage of available memory that may be allocated for buffer cache use

  • dbc_max_pct : Maximum percentage of available memory pages that may be allocated for buffer cache use

If the dbc_min_pct is set equal to or larger than the dbc_max_pct , then the dynamic buffer cache is capped at the size dictated by dbc_min_pct . This presents what at first seems to be a conundrum . How can it be a dynamic buffer cache if the size is fixed? While the size is fixed, it is simply a targeted value and not an absolute. As long as the buffer cache is considered to be dynamic, its pages may be aged and stolen by the vhand daemon during times of heavy memory pressure. This means that while the overall size appears to be static, the specific pages allocated to the cache may change from time to time, so they are considered dynamic.

If these two dynamic tunables are not set, then the following older kernel parameters take effect and the cache is flagged as static; that is, pages once allocated to the cache remain there and vhand cannot scan them.

  • nbuf : Number of buffer headers

  • bufpages : Number of buffer pages

Once a buffer has been populated with a copy of file data, the kernel needs an efficient method to search for it in the cache. This is accomplished through the use of a hash (by now you expected this, didn't you!).

The Buffer Cache Hash

Figure 8-18 illustrates the basic design of the buffer cache. Individual buffers are pointed to by buffer headers' which are arranged in hash chains. In the case of a read() , the hash offset is calculated using the device number and block offset (obtained from the vnode ) for the requested disk data.

Figure 8-18. The Buffer Cache Hash

graphics/08fig18.gif


Once the hash has been calculated, the appropriate hash chain is searched to see if the requested data is present in the cache. In the case of a miss , the kernel works with the appropriate hardware driver, requests a physical data transfer from the disk, and allocates a free cache buffer to store the data in. Requests made and satisfied by the buffer cache are said to be logical reads and writes. Reads and writes between the buffer cache and a disk are called physical .

A write() request is handled in a similar manner. If a buffer has already been started, a simple logical write is performed. When a buffer is filled or when a dirty buffer (one whose content differs from its counterpart on the disk) is found by the sync daemon during one of its scheduled scans , a physical write is scheduled. Physical writes may also be specifically requested by a programmer or when a file is closed.

Listings 8.11 and 8.12 show selected fields from q4 listings of the buffer header and buffer structure.

Listing 8.11. q4> fields struct bufhd
The buffer header state flag

   0 0 4 0 int b_flags

0x00800000

B_PAGEOUT

this is a buffer header, not a buffer


Forward/backward linkage pointers for the hash chain

   4 0 4 0 *   b_forw

   8 0 4 0 *   b_back

Modification timestamp

  12 0 4 0 int b_checkdup

Lock structure for this buffer hash chain

  16 0 4 0 *   bh_lock

Listing 8.12. q4> fields struct buf (a partial listing)
The buffer state flags

  0 0 4 0 int                b_flags

0x00000000

B_WRITE

 

0x00000001

B_READ

 

0x00000002

B_DONE

transaction completed

0x00000004

B_ERROR

transaction failed

0x00000008

B_BUSY

not currently on a free list

0x00000010

B_PHYS

physical I/O in progress

0x00000040

B_WANTED

send a wakeup when no longer BUSY

0x00000100

B_ASYNC

don't wait for I/O completion

0x00001000

B_PRIVATE

used by LVM

0x00004000

B_PFTIMEOUT

LVM power fail timeout

0x00008000

B_CACHE

 

0x00010000

B_INVAL

buffer not valid

0x00020000

B_FSYSIO

buffer resulted from bread or bwrite

0x00040000

B_CALL

call b_iodone() from iodone()

0x00080000

B_NOCACHE

don't hold block in buffer cache

0x00100000

B_RAW

raw access I/O

0x00200000

B_BCACHE

buffer is from the buffer cache

0x00400000

B_SYNC

buffer write is synchronous


Linkage pointers connecting a buffer to a hash chain

    4 0 4 0 *                  b_forw

    8 0 4 0 *                  b_back

Linkage pointer connecting a buffer to a free list

   12 0 4 0 *                  av_forw

   16 0 4 0 *                  av_back

Linkage pointers connecting a buffer to a vnodes clean/dirty list

   20 0 4 0 *                  b_blockf

   24 0 4 0 *                  b_blockb

   28 0 4 0 *                  b_strategy

   32 0 4 0 *                  b_un.b_addr

   32 0 4 0 *                  b_un.b_words

   32 0 4 0 *                  b_un.b_filsys

   32 0 4 0 *                  b_un.b_cs

   32 0 4 0 *                  b_un.b_cg

   32 0 4 0 *                  b_un.b_dino

   32 0 4 0 *                  b_un.b_daddr

   36 0 4 0 int                b_un2.b_sectno

   36 0 4 0 int                b_un2.b_byteno

   40 0 4 0 int                b_dev

   44 0 4 0 int                b_blkno

   48 0 4 0 *                  b_sc

The count of how many times this buffer has been returned to a 

free list since it was allocated

   52 0 4 0 int                b_bcount

   56 0 2 0 u_short            b2_flags

Error number (if B_ERROR is set in b_flags), buffer size, and offset

   58 0 2 0 short              b_error

   60 0 4 0 int                b_bufsize

   64 0 4 0 int                b_offset

   68 0 4 0 *                  b_merge

The pointer contains the address of the routine to call when a 

pending I/O is finished

   72 0 4 0 *                  b_iodone

---------------------------------------

Free Lists: Allocating a New Buffer

When a buffer is in use, its b_flags is set to B_BUSY . As soon as any pending data transfer is complete, it is returned to a free list. There are several free lists.

When a page is first allocated to a dynamic buffer cache, its buffers are placed on the BQ_EMPTY free list. Requests to grow the buffer cache depend on its level of use.

Once a buffer is allocated, used, and freed, it is placed on the BQ_AGE list, and its b_count is incremented. Once the b_count exceeds 4, the buffer is placed on the BQ_LRQ list when it is freed. In this manner, a buffer that has been loaded to satisfy a truly random access read request is more likely to be reallocated than one for which sequential requests are being processed .

Each processor has its own set of free list headers. The rules for finding an available buffer with a fixed buffer cache are as follows:

  • Search the processor's BQ_AGE list, then its BQ_LRU .

  • Next, search the BQ_AGE lists of all the other processors.

  • Finally, search their BQ_LRQ lists.

In the case of a dynamic buffer cache:

  • Check the current processor's BQ_AGE and BQ_LRU .

  • Next we check the BQ_EMPTY (we wait to check it, as the allocation of new pages to the cache is performed in background).

  • Assuming that we can't find a buffer on one of our processor's free lists, we next check the neighboring processors' BQ_AGE and then their BQ_LRU lists.

If we are still unsuccessful , the thread blocks until a buffer is available.

An interesting observation is that items cached in the buffer cache don't behave like items in most other caches. When a buffer is involved in an I/O operation, it is B_BUSY . The rest of the time, it is placed on a free list. In this manner the more a buffer is referenced, the less its chances are of being reallocated. If the cache is not busy, a buffer may persist for quite some time following its use, if another process comes along requesting the data, it may be pleasantly surprised to find that it is already in the cache and may be accessed logically.

A fourth queue, BQ_SENDFILE , is used to list transitive buffers in use by kernel routine sendfile() .

Bigger Isn't Always Better

The time required to search the hash chain is offset by the fact that a logical read or write is several orders of magnitude faster than a physical transaction. The greater the "hit rate" for data in the cache, the greater the payback is for this scheme. For files accessed in a sequential manner, the hit rate may exceed 90 percent! Adjusting the size of the buffer cache may help to increase the hit rate, but caution should be used to make sure that the system has plenty of available memory and that increasing the buffer cache doesn't cause an increase in the page-out rate, or the gains in the hit rate could be lost in the added overhead of vhand .

Another concern is that if the buffer cache gets too large, then the added length of the hash chain requires more search time. At some point the payback of caching may be diminished.

Prior to HP-UX 11.11, the conventional wisdom was that the buffer cache should not exceed 0.5 GB. Recent modifications (HP-UX 11.11) to the hashtable mechanics suggest that sizes in the 1-GB to 2-GB range may be considered

The Directory Name Lookup Cache, DNLC

The final component of the file system we examine in this chapter is the directory name lookup cache, or DNLC (Figure 8-19). Just as the kernel maintains caches for data blocks and inode s for improved performance, it also maintains a cache of recently referenced pathnames.

Figure 8-19. Directory Name Lookup Cache

graphics/08fig19.gif


We have seen how a user-supplied pathname is processed by the kernel through the VFS. If we could skip past a majority of the smoke-and-mirror translation tasks, we could improve the time required to open a file descriptor.

Consider a thread that has changed to a working directory and is accessing a number of the files in that directory. Each time an open() call is made, the kernel must convert the pathname to an appropriate vnode and data block. The DNLC caches individual directory names and accesses them via a hash based on their name strings.

Entries in the cache remain on a free list in a least recently used fashion. Another limit is that individual names must not be longer than 39 characters, the maximum space allocated for the directory name. Longer names may be used but they will not be cached (prior to HP-UX 11.0 the maximum length for cached names was 15 characters ).

Hash chain headers are stored in the kernel's nc_hash[] array. The hash function uses the vnode address and the file/directory name to calculate the chain header to which it should be linked.

Consider the Listings 8.13 and 8.14of the DNLC structures.

Listing 8.13. q4> fields struct nc_hash
Each nc_hash[] array element contains two linkage pointers

   0 0 4 0 * hash_next

   4 0 4 0 * hash_prev

Listing 8.14. q4> fields struct ncache
First the entry is linked to its hash chain

   0 0  4 0 *        hash_next

   4 0  4 0 *        hash_prev

All entries are linked into a least recently used list

   8 0  4 0 *        lru_next

  12 0  4 0 *        lru_prev

This points to the directory's vnode

  16 0  4 0 *        vp

And a pointer to the parent directory's vnode

  20 0  4 0 *        dp

The entry's length, name, and credentials are stored next

  24 0  1 0 char     namlen

  25 0 39 0 char[39] name

  64 0  4 0 *        cred

Mount point directories are linked to the virtual file system table

  68 0  4 0 *        vfs_next

  72 0  4 0 *        vfs_prev

All entries sharing a parent directory are dual linked

  76 0  4 0 *        dp_next

  80 0  4 0 *        dp_prev

All entries sharing a vnode are dual linked

  84 0  4 0 *        vp_next

  88 0  4 0 *        vp_prev