To isolate the rest of the kernel from the type-specific structures used by various file system types, a Virtual File System (VFS) abstraction layer is employed. Figure 8-11 gives us an overview of the VFS components.

Figure 8-11. Kernel File System Tables

The VFS abstraction layer was first introduced to UNIX by Sun Microsystems engineers while they were developing the Network File System (NFS). They wanted local users and applications to be oblivious to the actual location of the files they were using. Current UNIX operating systems frequently build on this approach to mask the specifics of the local and remote file system types mounted within their kernels. The HP-UX kernel uses a number of VFS structures, virtual inodes (vnodes), caches, and hash headers to facilitate file access. To get started, we examine two key components of the virtual file system: the vfs mount table and the vnode.

The vfs Mount Table

The starting point for the HP-UX file management subsystem is the pointer rootvfs. This kernel symbol directs us to the "/" (or root) file system's vfs data structure (see Figure 8-12).

Figure 8-12. Mounting a File System

Each new file system mount creates an additional entry in the linked list of vfs structures. Once a mount has been made, its vfs entry is maintained until the kernel is rebooted, even if the file system is unmounted (if it is later remounted, the original vfs structure is reused). Mount points configured by the automounter also add vfs structures to the list.

There are several notable components within the vfs structure. Among them are a pointer to the next vfs, a type value, a pointer to an operations array (a vector jump table) that maps the type-specific kernel routines needed to manipulate the file system, and a pointer to its kernel-resident data structures. The kernel-resident data structures vary greatly in content between the various file system implementations. The type-specific routines mapped by the operations array know what to expect and how to work with the specifics of their design. Listings 8.3 and 8.4 are annotated listings of the vfs structure and the vfsops vector array it points to.

Listing 8.3. q4> fields struct vfs

A forward-pointing link to the next mounted file system (in chronological order)
0 0 4 0 * vfs_next
A hash chain for rapid lookup (used to see if a vfs structure already exists before we create a new one)
4 0 4 0 * vfs_hash
A pointer to the file system type-specific operations array
8 0 4 0 * vfs_op
A pointer to the in-core copy of the vnode this file system is mounted on
12 0 4 0 * vfs_vnodecovered
The vfs state flags
16 0 4 0 int vfs_flag
0x01 | VFS_RDONLY | read-only |
0x02 | VFS_MLOCK | lock this vfs (keep subtree stable) |
0x04 | VFS_MWAIT | there is a waiter for the lock |
0x08 | VFS_NOSUID | disable suid for file system |
0x10 | VFS_EXPORTED | this file system is exported (NFS) |
0x20 | VFS_REMOTE | this is an imported file system (NFS) |
0x40 | VFS_HASQUOTA | this file system has quotas enabled |
0x200 | VFS_MOUNTING | used in conjunction with VFS_MLOCK |
0x800 | VFS_LARGEFILES | enable use of large files |
This file system's native block size
20 0 4 0 int vfs_bsize
For exported file systems (NFS), this is the UID that remote "root" users are remapped to, and the export state flag
24 0 2 0 u_short vfs_exroot
26 0 2 0 short vfs_exflags
This points to the file system's type-specific in-core data structures
28 0 4 0 * vfs_data
A reference count of processes currently sleeping on the file system's mount point
32 0 4 0 int vfs_icount
The file system's specific type (HFS, VxFS, NFS, CDFS, ...) and its unique ID
36 0 2 0 short vfs_mtype
40 0 4 0 int vfs_fsid[0]
44 0 4 0 int vfs_fsid[1]
A log pointer, the most recent mount timestamp, the file system name, the device number (major + minor), and a chain pointer to DNLC entries referencing this file system
48 0 4 0 * vfs_logp
52 0 4 0 int vfs_mnttime
56 0 1024 0 char[1024] vfs_name
1080 0 4 0 int vfs_dev
1084 0 4 0 * vfs_ncachehd

Listing 8.4. q4> fields struct vfsops

The following pointers form a vector jump table to type-specific operational routines in the kernel
0 0 4 0 * vfs_mount
4 0 4 0 * vfs_unmount
8 0 4 0 * vfs_root
12 0 4 0 * vfs_statvfs
16 0 4 0 * vfs_sync
20 0 4 0 * vfs_vget
24 0 4 0 * vfs_getmount
28 0 4 0 * vfs_freeze
32 0 4 0 * vfs_thaw
36 0 4 0 * vfs_quota
40 0 4 0 * vfs_mountroot
44 0 4 0 * vfs_size

The Type-Specific In-Core Data Structures

Each file system type requires its own specific in-core data. Figure 8-13 illustrates the point.

Figure 8-13. Type-Specific Data Structures (HFS Example)

In general, the in-core data includes a copy of the file system's superblock, a pointer to the kernel's copy of the inode/vnode on which the file system is mounted, and copies of the file system's allocation tables. In the case of HFS, the kernel-resident structures include a mount structure (with pointers to the mount point's inode/vnode), a pointer back to the file system's vfs structure, and a pointer to a copy of the superblock maintained in the buffer cache (more on the buffer cache later in this chapter). Let's examine the HFS mount and fs (superblock) structures in Listings 8.5 and 8.6.

Listing 8.5. q4> fields struct mount

First are forward and backward linkage pointers to a hash list
0 0 4 0 * m_hforw
4 0 4 0 * m_hback
Next is a pointer to the file system's vfs structure, the block device number (major and minor) on which the file system resides, and a pointer to the buffer header that maps the in-core copy of the file system's superblock
8 0 4 0 * m_vfsp
12 0 4 0 int m_dev
16 0 4 0 * m_bufp
20 0 4 0 int m_maxbufs
The raw device number is also recorded here
24 0 4 0 int m_rdev
28 0 4 0 * m_nodes.m_inodp
28 0 4 0 * m_nodes.m_cdnodp
32 0 4 0 * m_qinod
Mount flags are stored here
36 0 4 0 int m_flag
40 0 2 0 u_short m_qflags
44 0 4 0 u_int m_btimelimit
48 0 4 0 u_int m_ftimelimit
A pointer to the file system's in-core root inode/vnode, a reference count, and a pointer to the mount point's in-core inode/vnode
52 0 4 0 * m_rootvp
56 0 4 0 u_int m_ref
60 0 4 0 * m_iused

Listing 8.6. q4> fields struct fs

0 0 4 0 int fs_unused[0]
4 0 4 0 int fs_unused[1]
First we have the file system's superblock address, the offset of the cylinder group data block, the inode blocks, and the first data block following the cylinder groups
8 0 4 0 int fs_sblkno
12 0 4 0 int fs_cblkno
16 0 4 0 int fs_iblkno
20 0 4 0 int fs_dblkno
24 0 4 0 int fs_cgoffset
28 0 4 0 int fs_cgmask
Next is the last modification timestamp, the number of blocks (total space), and the number of data blocks (minus the metadata overhead) in the file system
32 0 4 0 int fs_time
36 0 4 0 int fs_size
40 0 4 0 int fs_dsize
Additional static and tunable parameters include the number of cylinders per group, block and fragment sizes, the number of fragments in a block (1, 2, 4, 8), the minimum reserved free space (applies only to non-superuser allocation requests), the rotational delay for writing sequential blocks (also known as interleave), and the drive speed (in revolutions per second)
44 0 4 0 int fs_ncg
48 0 4 0 int fs_bsize
52 0 4 0 int fs_fsize
56 0 4 0 int fs_frag
60 0 4 0 int fs_minfree
64 0 4 0 int fs_rotdelay
68 0 4 0 int fs_rps
72 0 4 0 int fs_bmask
76 0 4 0 int fs_fmask
80 0 4 0 int fs_bshift
84 0 4 0 int fs_fshift
Maximum contiguous blocks and the maximum number of blocks allocated to a single file within a cylinder group
88 0 4 0 int fs_maxcontig
92 0 4 0 int fs_maxbpg
96 0 4 0 int fs_fragshift
100 0 4 0 int fs_fsbtodb
104 0 4 0 int fs_sbsize
108 0 4 0 int fs_csmask
112 0 4 0 int fs_csshift
116 0 4 0 int fs_nindir
120 0 4 0 int fs_inopb
124 0 4 0 int fs_nspf
This file system's unique identifier number
128 0 4 0 int fs_id[0]
132 0 4 0 int fs_id[1]
Mirror state information for the primary swap and "/" partitions
136 0 0 4 u_int fs_mirror.state.root
136 4 0 1 u_int fs_mirror.state.rflag
136 5 0 4 u_int fs_mirror.state.swap
137 1 0 1 u_int fs_mirror.state.sflag
137 2 2 6 u_int fs_mirror.state.spare
140 0 4 0 int fs_mirror.mirtime
144 0 4 0 int fs_featurebits
148 0 4 0 int fs_optim
Cylinder group summary array address, summary array size, and cylinder group size
152 0 4 0 int fs_csaddr
156 0 4 0 int fs_cssize
160 0 4 0 int fs_cgsize
Number of tracks per cylinder group, sectors per track, sectors per cylinder, total number of cylinders in the file system, cylinders per group, and inodes per group
164 0 4 0 int fs_ntrak
168 0 4 0 int fs_nsect
172 0 4 0 int fs_spc
176 0 4 0 int fs_ncyl
180 0 4 0 int fs_cpg
184 0 4 0 int fs_ipg
188 0 4 0 int fs_fpg
Calculated cylinder group totals (these must be audited following a power failure)
192 0 4 0 int fs_cstotal.cs_ndir
196 0 4 0 int fs_cstotal.cs_nbfree
200 0 4 0 int fs_cstotal.cs_nifree
204 0 4 0 int fs_cstotal.cs_nffree
Next are the file system modification flag, the clean/dirty flag, the read-only flag, the spare flag (currently not used), and a string holding the current mount point of the file system
208 0 1 0 char fs_fmod
209 0 1 0 char fs_clean
210 0 1 0 char fs_ronly
211 0 1 0 char fs_flags
212 0 512 0 char[512] fs_fsmnt
724 0 4 0 int fs_cgrotor
728 0 4 0 * fs_csp
732 0 4 0 int fs_csp_pad
736 0 4 0 int fs_unused2[30]
856 0 4 0 int fs_cpc
860 0 2 0 short fs_postbl[32][8]
This file system's "magic number" (used to verify its type), its type name, and its pack name
1372 0 4 0 int fs_magic
1376 0 6 0 char[6] fs_fname
1382 0 6 0 char[6] fs_fpack
1388 0 1 0 u_char fs_rotbl[0]

The vnode

Next, we address the specifics of the vnode structure. We have learned that an inode contains all of a file's credentials and attributes and is therefore the key to managing a file within a file system. The inode number is unique within the context of a single file system, and it is all that is needed to define a file. Within a kernel, however, there may be several file systems mounted together to create what appears to be a seamless, larger file system. Within this larger context, inode numbers are no longer guaranteed to be unique: each mounted file system may have an inode 1, 2, and so on. To enable the kernel to refer to opened files with a unique handle, the virtual inode, or vnode, was created.
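To see why the inode number alone cannot serve as a system-wide handle, consider the following minimal sketch. It is hypothetical C, not kernel source; the names file_key and same_file are invented for illustration. The point is simply that the kernel must key on a pair of values, the device holding the file system plus the inode number, or on the vnode itself.

/* Hypothetical sketch: why an inode number alone is not a unique handle.
 * file_key and same_file are illustrative names, not HP-UX kernel symbols. */
#include <stdio.h>

struct file_key {
    int dev;    /* device (volume) the file system lives on */
    int inum;   /* inode number within that file system     */
};

/* Two references name the same file only if both members match. */
static int same_file(struct file_key a, struct file_key b)
{
    return a.dev == b.dev && a.inum == b.inum;
}

int main(void)
{
    struct file_key root_fs = { 0x1f000001, 2 };  /* inode 2 on the first volume  */
    struct file_key home_fs = { 0x1f000002, 2 };  /* inode 2 on the second volume */

    /* Same inode number, different file systems: not the same file. */
    printf("same file? %s\n", same_file(root_fs, home_fs) ? "yes" : "no");
    return 0;
}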
In a similar manner to the vfs structure, a vnode contains an operations array pointer to the kernel's type-specific routines for file manipulation tasks (see Figure 8-14).

Figure 8-14. The vnode

The key fields of the vnode include a type number, a pointer to a type-specific operations array, and a reference to the in-core copy of the file's inode data. Listing 8.7 is an annotated listing of a virtual file system vnode:

Listing 8.7. q4> fields struct vnode

0 0 2 0 u_short v_flag
0x01 | VROOT | root of a file system |
0x02 | VTEXT | vnode is a text prototype |
0x04 | VSWAP | vnode maps a swap space |
0x08 | VXOFTHOLD | vnode allows soft holds |
0x010 | VEXLOCK | exclusive lock |
0x020 | VSHLOCK | shared lock |
0x040 | VLWAIT | there is a waiter on a lock |
0x080 | VXDEVOK | cross device rename and link is allowed |
0x0100 | VMMF | vnode is memory mapped |
0x04000 | VNOBCACHE | data caching disabled |
0x08000 | VN_MPSAFE | semaphore not required |
Shared lock count, exclusive lock count, private data count, and reference count
2 0 2 0 u_short v_shlockc
4 0 2 0 u_short v_exlockc
6 0 2 0 u_short v_tcount
8 0 4 0 int v_count
Pointer to the vfs structure if this is a mount point
12 0 4 0 * v_vfsmountedhere
Pointers to the type-specific vnode operations array, the UNIX IPC socket, the stream head, and the vfs the file is in
16 0 4 0 * v_op
20 0 4 0 * v_socket
24 0 4 0 * v_stream
28 0 4 0 * v_vfsp
vnode type, device number, type-specific data (HFS inode, VxFS inode, rnode, ...), and parent file system type
32 0 4 0 enum4 v_type
36 0 4 0 int v_rdev
40 0 4 0 * v_data
44 0 4 0 enum4 v_fstype
Pointer to the parent pseudo vas (if this maps to an executable image file)
48 0 4 0 * v_vas
Beta semaphore locking structure
52 0 1 0 u_char v_lock.b_lock
54 0 2 0 u_short v_lock.order
56 0 4 0 * v_lock.owner
Mapped memory cache buffers for this file
60 0 4 0 * v_cleanblkhd
64 0 4 0 * v_dirtyblkhd
vnode write count, FS-independent locks, soft hold count, and a copy of the va_nodeid
68 0 4 0 int v_writecount
72 0 4 0 * v_locklist
76 0 4 0 int v_scount
80 0 4 0 int v_nodeid
DNLC entry pointers if this vnode is currently described in the directory name lookup cache
84 0 4 0 * v_ncachedhd
88 0 4 0 * v_ncachevhd
92 0 4 0 * v_pfdathd
96 0 4 0 u_int v_last_fsync

The HFS In-Core inode Contains Its vnode

For an HFS file, the in-core inode structure includes the vnode structure and a complete copy of its disk-resident inode (the icommon structure we examined earlier), plus additional fields of information (Listing 8.8):

Listing 8.8. q4> fields struct inode

0 0 4 0 * i_chain[0]
4 0 4 0 * i_chain[1]
The device number this inode came from
8 0 4 0 int i_dev
The inode number
12 0 4 0 u_int i_number
The state flags, lock word, and tid of the last thread to lock this inode
16 0 4 0 u_int i_flag
20 0 2 0 u_short i_lockword
24 0 4 0 int i_tid
The virtual file system's vnode for the file is stored between offsets 28 and 127
The inode device pointer
128 0 4 0 * i_devvp
132 0 4 0 int i_diroff
Continuation inode pointer
136 0 4 0 * i_contip
File system pointer
140 0 4 0 * i_fs
Quota pointer
144 0 4 0 * i_dquot
148 0 4 0 int i_rdev
152 0 4 0 int i_un.if_lastr
152 0 4 0 * i_un.is_socket
156 0 4 0 * i_fr.if_freef
160 0 4 0 * i_fr.if_freeb
164 0 4 0 * i_fselr.i_selp
168 0 2 0 short i_fselr.i_selflag
172 0 4 0 * i_fselw.i_selp
176 0 2 0 short i_fselw.i_selflag
180 0 4 0 * i_mount
A copy of the disk-based inode (structure icommon) is stored between offsets 184 and 311
Number of reader calls
312 0 2 0 u_short i_rcount
Beta semaphore locking structures
316 0 1 0 u_char i_b_sema.b_lock
318 0 2 0 u_short i_b_sema.order
320 0 4 0 * i_b_sema.owner
324 0 4 0 int filler[0]
328 0 4 0 int filler[1]

The relationship between the in-core inode data and the vnode varies between file system types, but the vnode structure itself is identical for all supported file systems. As the vnode uses operations array pointers, it is up to the programmers of the referenced routines to have intimate knowledge of the inode's type-specific content. When a file is opened by the kernel, it is identified by the device number of the volume on which the file system is located and its inode number. By using these two values, each file is guaranteed to have a unique identity within the scope of the operating system.

Caching inode/vnode Data in the Kernel

To improve system performance and reduce the amount of time spent accessing metadata on the disk, inode/vnode data is maintained in kernel-resident caches.
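To make the jump-table idea concrete, the following sketch shows dispatch through a type-specific operations vector. It is hypothetical C for illustration only; the names demo_vnode, demo_vnodeops, hfs_read, and nfs_read are invented and are not the HP-UX kernel's actual symbols. The point is that generic code never needs to know which file system type it is talking to.

/* Hypothetical sketch of an operations vector (jump table).
 * demo_vnode, demo_vnodeops, hfs_read, and nfs_read are illustrative names only. */
#include <stdio.h>
#include <stddef.h>

struct demo_vnode;                                   /* forward declaration      */

struct demo_vnodeops {                               /* one slot per operation   */
    int (*vop_read)(struct demo_vnode *vp, void *buf, size_t len);
};

struct demo_vnode {
    const struct demo_vnodeops *v_op;                /* type-specific routines   */
    void *v_data;                                    /* type-specific inode data */
};

/* Two type-specific implementations registered behind the same interface. */
static int hfs_read(struct demo_vnode *vp, void *buf, size_t len)
{ (void)vp; (void)buf; printf("HFS read of %zu bytes\n", len); return 0; }

static int nfs_read(struct demo_vnode *vp, void *buf, size_t len)
{ (void)vp; (void)buf; printf("NFS read of %zu bytes\n", len); return 0; }

static const struct demo_vnodeops hfs_ops = { hfs_read };
static const struct demo_vnodeops nfs_ops = { nfs_read };

int main(void)
{
    char buf[64];
    struct demo_vnode local  = { &hfs_ops, NULL };
    struct demo_vnode remote = { &nfs_ops, NULL };

    /* The generic caller dispatches blindly through the vector. */
    local.v_op->vop_read(&local, buf, sizeof buf);
    remote.v_op->vop_read(&remote, buf, sizeof buf);
    return 0;
}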
Each type of file system is responsible for the allocation and maintenance of its own inode/vnode cache. In most implementations, caches are sized by kernel-tunable parameters or by an embedded formula. A hash is formed using the device number and the inode number. Before a copy of an inode is read in from the physical disk, the inode cache is always checked to see if a copy of the data is already present.

Smoke and Mirrors: Tying Together the File Subsystem Abstractions

Next, we need to tie the abstractions together into a seamless VFS implementation (see Figure 8-15).

Figure 8-15. Building a Seamless File System

Let's revisit our earlier scenario, except that now /home is a separate file system mounted on the system's root file system. Again, our challenge is to follow the pathname /home/chris/stuff through the kernel's virtual structures.

We start our search by locating the kernel's copy of the / inode (2 on the first volume). The inode/vnode and root directory data of all mount points and file system roots are held in the caches and the buffer cache as long as a file system is mounted. To speed up this initial search, the kernel maintains a permanent shortcut pointer to the cached copy of the / inode/vnode in the symbol rootdir. From this vnode, we determine the file system device number, and from the associated inode we get the data block pointer. These two numbers are used to hash our way into the buffer cache, where we find a copy of the root file system's root directory data.

We take the inode number associated with the home directory and the device number to hash our way back into the inode cache for the next step of our search. When we find the cached vnode for the /home directory (5 on the first volume), the v_vfsmountedhere pointer is valid and directs us to the kernel mount table and the vfs structure for the second mounted volume (did you ever play the children's game Chutes and Ladders?). We then follow the v_data pointer to find the root inode/vnode (2 on the second volume) and the cached data block for the mounted /home file system.

We continue our search within the /home file system, locating the inode/vnode (10 on the second volume) and cached data for the /home/chris directory and finally the inode/vnode (14 on the second volume) and cached data for the /home/chris/stuff file. Accessing these last two stops on our search may or may not find the requested data in the appropriate caches. If they aren't loaded, then the system uses the routines referenced by the operations arrays and kernel functions to load copies from the disk into the caches. This brings us to the end of our quest. Next we consider the attachment of an open file to a process.

Connecting an Open File to a Process

We have seen that the proc structure is the base of all process-related kernel structures. The definition of a process's resources includes mapping opened files to per-process file descriptors (see Figure 8-16).

Figure 8-16. System File Table

From the kernel's point of view, each distinct file open() call must be tracked, which is accomplished by making entries in the system file table. From the process's point of view, it is concerned only with its own file descriptors, but the kernel must manage all open() calls.
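The descriptor-to-file-table indirection can be sketched in a few lines of hypothetical C. The names demo_proc, demo_file, demo_fd_block, and fd_to_file are invented for this illustration; the real HP-UX structures (the ofile partition blocks, struct file, and the vnode it points to) appear in the listings that follow.

/* Hypothetical sketch of the per-process descriptor to system file table
 * indirection. demo_file, demo_fd_block, demo_proc, and fd_to_file are
 * illustrative names, not HP-UX kernel symbols. */
#include <stdio.h>
#include <stddef.h>

struct demo_file {                  /* one system file table entry per open() */
    int   f_count;                  /* reference count                        */
    long  f_offset;                 /* current file offset                    */
    void *f_data;                   /* the underlying vnode/socket            */
};

#define FDS_PER_BLOCK 8             /* HP-UX 11.11 allocates descriptors in
                                       blocks of eight                        */

struct demo_fd_block {
    struct demo_file *fd_filep[FDS_PER_BLOCK];
};

struct demo_proc {
    struct demo_fd_block *ofile[4]; /* small partition table of fd blocks     */
};

/* Translate a numeric descriptor into its system file table entry. */
static struct demo_file *fd_to_file(struct demo_proc *p, int fd)
{
    struct demo_fd_block *blk = p->ofile[fd / FDS_PER_BLOCK];
    return blk ? blk->fd_filep[fd % FDS_PER_BLOCK] : NULL;
}

int main(void)
{
    static struct demo_file open_file = { 1, 0, NULL };
    static struct demo_fd_block block0;
    static struct demo_proc proc;

    proc.ofile[0]      = &block0;      /* first block of eight descriptors */
    block0.fd_filep[3] = &open_file;   /* descriptor 3 is open             */

    printf("fd 3 -> file entry %p\n", (void *)fd_to_file(&proc, 3));
    return 0;
}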
The number of file descriptors a single process may have is controlled by the kernel-tunable parameters max_files (for a regular user process) and max_files_lim (for a superuser process) and may be set as high as 60,000. The default is 60 and is usually sufficient, considering that most processes only need to map stdin, stdout, and stderr. To avoid wasting kernel memory space, process file descriptors are stored in blocks of eight for HP-UX 11.11 (HP-UX 11.0 used 32 descriptors per block), and the blocks are referenced through a partition table pointed to by the proc structure. Additional blocks are allocated as needed by the kernel. On HP-UX 11.11, the file_descriptor_t structure contained:

fd_lock | a spinlock structure |
*fd_filep | a pointer to a system file table file entry |
fd_locker_tid | a reference to the locking thread |
fd_state | the current descriptor state |
fd_pofile | a reference to the ofile partition entry |

The kernel's system file table consists of nfile (kernel-tunable) entries of type file. See Listings 8.9 and 8.10.

Listing 8.9. q4> fields struct file

The file descriptor table flags are:
0 0 4 0 int f_flag
0x01 | FREAD | descriptor is readable |
0x02 | FWRITE | descriptor is writable |
0x04 | FNDELAY | no delay |
0x08 | FAPPEND | append on each write |
The kernel object represented by this entry
4 0 2 0 short f_type
0x01 | DTYPE_VNODE | regular file |
0x02 | DTYPE_SOCKET | Berkeley IPC socket |
0x03 | DTYPE_STREAMS | streams file type |
0x05 | DTYPE_UNSP | user nsp control |
0x06 | DTYPE_LLA | link-level LAN access |
Link count from the message queue
6 0 2 0 short f_msgcount
Reference count
8 0 4 0 int f_count
Type-specific file system operations array pointer
12 0 4 0 * f_ops
Pointer to this file's vnode/socket kernel data
16 0 4 0 * f_data
20 0 4 0 int f_freeindex
24 0 8 0 long long f_offset
32 0 4 0 * f_buf
Credentials of the user who opened the file
36 0 4 0 * f_cred
Kernel beta-semaphore
40 0 1 0 u_char f_b_sema.b_lock
42 0 2 0 u_short f_b_sema.order
44 0 4 0 * f_b_sema.owner
Data intercept audit flag
48 0 2 0 short f_tap

Listing 8.10. q4> fields struct fileops

0 0 4 0 * fo_rw
4 0 4 0 * fo_ioctl
8 0 4 0 * fo_select
12 0 4 0 * fo_close

The Buffer Cache

Just as the kernel keeps copies of open inode data in a cache, it also maintains a much larger cache of recently accessed file data blocks (or extents, depending on the type of file system). This buffer cache is a key component of the HP-UX operating system. Figure 8-17 provides an overview of the buffer cache design.

Figure 8-17. The Buffer Cache

A basic premise of the UNIX operating system is that everything is a file. This model serves many purposes and greatly reduces the overall complexity of the operating system. With this in mind, it is understandable that anything that can be done to improve file performance provides a large payback.

Being Flexible: The Dynamic Buffer Cache

The HP-UX implementation of a buffer cache follows a dynamic model. The buffer cache is mapped into the kernel's address space, but the physical pages allocated to it come from the same available memory pools as those used for user processes. In this manner, the buffer cache competes with user code for system memory resources. The kernel allows the administrator to set a maximum and minimum percentage of the available memory that may be allocated for buffer cache use. Remember that available memory means the amount of physical memory remaining after the kernel allocates pages for its exclusive use during system initialization. The kernel-tunable parameters that enable a dynamic buffer cache are dbc_min_pct and dbc_max_pct.

If dbc_min_pct is set equal to or larger than dbc_max_pct, then the dynamic buffer cache is capped at the size dictated by dbc_min_pct. This presents what at first seems to be a conundrum: how can it be a dynamic buffer cache if the size is fixed? While the size is fixed, it is simply a targeted value and not an absolute. As long as the buffer cache is considered to be dynamic, its pages may be aged and stolen by the vhand daemon during times of heavy memory pressure. This means that while the overall size appears to be static, the specific pages allocated to the cache may change from time to time, so they are considered dynamic. If these two dynamic tunables are not set, then the older kernel parameters take effect and the cache is flagged as static; that is, pages once allocated to the cache remain there, and vhand cannot scan them.

Once a buffer has been populated with a copy of file data, the kernel needs an efficient method to search for it in the cache. This is accomplished through the use of a hash (by now you expected this, didn't you!).

The Buffer Cache Hash

Figure 8-18 illustrates the basic design of the buffer cache. Individual buffers are pointed to by buffer headers, which are arranged in hash chains. In the case of a read(), the hash offset is calculated using the device number and block offset (obtained from the vnode) for the requested disk data.
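A minimal sketch of this kind of lookup follows, using hypothetical names (bc_hash, demo_buf, bc_lookup, BC_HASH_SIZE). The actual HP-UX hash function and chain layout differ, but the principle is the same: hash the (device number, block number) pair to a chain header, then walk that single chain.

/* Hypothetical sketch of a buffer cache hash lookup keyed on
 * (device number, block number). bc_hash, demo_buf, and BC_HASH_SIZE
 * are illustrative names, not HP-UX kernel symbols. */
#include <stdio.h>
#include <stddef.h>

#define BC_HASH_SIZE 1024                      /* number of hash chains (power of two) */

struct demo_buf {
    struct demo_buf *b_forw;                   /* next buffer on this hash chain */
    int b_dev;                                 /* device the block came from     */
    int b_blkno;                               /* block number on that device    */
    char b_data[8192];                         /* cached copy of the disk block  */
};

static struct demo_buf *bc_hash[BC_HASH_SIZE]; /* chain headers                  */

static unsigned bc_hashidx(int dev, int blkno)
{
    return ((unsigned)dev ^ (unsigned)blkno) & (BC_HASH_SIZE - 1);
}

/* Return the cached buffer on a hit, or NULL on a miss (the caller would
 * then schedule a physical read and add the new buffer to the chain). */
static struct demo_buf *bc_lookup(int dev, int blkno)
{
    struct demo_buf *bp = bc_hash[bc_hashidx(dev, blkno)];

    for (; bp != NULL; bp = bp->b_forw)
        if (bp->b_dev == dev && bp->b_blkno == blkno)
            return bp;                         /* logical read: no disk I/O needed */
    return NULL;
}

int main(void)
{
    static struct demo_buf cached = { NULL, 0x1f000001, 128, {0} };

    bc_hash[bc_hashidx(cached.b_dev, cached.b_blkno)] = &cached;
    printf("block 128 hit? %s\n", bc_lookup(0x1f000001, 128) ? "yes" : "no");
    printf("block 129 hit? %s\n", bc_lookup(0x1f000001, 129) ? "yes" : "no");
    return 0;
}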
Figure 8-18. The Buffer Cache Hash

Once the hash has been calculated, the appropriate hash chain is searched to see if the requested data is present in the cache. In the case of a miss, the kernel works with the appropriate hardware driver, requests a physical data transfer from the disk, and allocates a free cache buffer to store the data in. Requests made and satisfied by the buffer cache are said to be logical reads and writes. Reads and writes between the buffer cache and a disk are called physical.

A write() request is handled in a similar manner. If a buffer has already been started, a simple logical write is performed. When a buffer is filled, or when a dirty buffer (one whose content differs from its counterpart on the disk) is found by the sync daemon during one of its scheduled scans, a physical write is scheduled. Physical writes may also be specifically requested by a programmer or when a file is closed.

Listings 8.11 and 8.12 show selected fields from q4 listings of the buffer header and buffer structures.

Listing 8.11. q4> fields struct bufhd

The buffer header state flag
0 0 4 0 int b_flags
0x00800000 | B_PAGEOUT | this is a buffer header, not a buffer |
Forward/backward linkage pointers for the hash chain
4 0 4 0 * b_forw
8 0 4 0 * b_back
Modification timestamp
12 0 4 0 int b_checkdup
Lock structure for this buffer hash chain
16 0 4 0 * bh_lock

Listing 8.12. q4> fields struct buf (a partial listing)

The buffer state flags
0 0 4 0 int b_flags
0x00000000 | B_WRITE | |
0x00000001 | B_READ | |
0x00000002 | B_DONE | transaction completed |
0x00000004 | B_ERROR | transaction failed |
0x00000008 | B_BUSY | not currently on a free list |
0x00000010 | B_PHYS | physical I/O in progress |
0x00000040 | B_WANTED | send a wakeup when no longer BUSY |
0x00000100 | B_ASYNC | don't wait for I/O completion |
0x00001000 | B_PRIVATE | used by LVM |
0x00004000 | B_PFTIMEOUT | LVM power fail timeout |
0x00008000 | B_CACHE | |
0x00010000 | B_INVAL | buffer not valid |
0x00020000 | B_FSYSIO | buffer resulted from bread or bwrite |
0x00040000 | B_CALL | call b_iodone() from iodone() |
0x00080000 | B_NOCACHE | don't hold block in buffer cache |
0x00100000 | B_RAW | raw access I/O |
0x00200000 | B_BCACHE | buffer is from the buffer cache |
0x00400000 | B_SYNC | buffer write is synchronous |
Linkage pointers connecting a buffer to a hash chain
4 0 4 0 * b_forw
8 0 4 0 * b_back
Linkage pointers connecting a buffer to a free list
12 0 4 0 * av_forw
16 0 4 0 * av_back
Linkage pointers connecting a buffer to a vnode's clean/dirty list
20 0 4 0 * b_blockf
24 0 4 0 * b_blockb
28 0 4 0 * b_strategy
32 0 4 0 * b_un.b_addr
32 0 4 0 * b_un.b_words
32 0 4 0 * b_un.b_filsys
32 0 4 0 * b_un.b_cs
32 0 4 0 * b_un.b_cg
32 0 4 0 * b_un.b_dino
32 0 4 0 * b_un.b_daddr
36 0 4 0 int b_un2.b_sectno
36 0 4 0 int b_un2.b_byteno
40 0 4 0 int b_dev
44 0 4 0 int b_blkno
48 0 4 0 * b_sc
The count of how many times this buffer has been returned to a free list since it was allocated
52 0 4 0 int b_bcount
56 0 2 0 u_short b2_flags
Error number (if B_ERROR is set in b_flags), buffer size, and offset
58 0 2 0 short b_error
60 0 4 0 int b_bufsize
64 0 4 0 int b_offset
68 0 4 0 * b_merge
This pointer contains the address of the routine to call when a pending I/O is finished
72 0 4 0 * b_iodone

Free Lists: Allocating a New Buffer

When a buffer is in use, its b_flags is set to B_BUSY. As soon as any pending data transfer is complete, it is returned to a free list. There are several free lists. When a page is first allocated to a dynamic buffer cache, its buffers are placed on the BQ_EMPTY free list. Requests to grow the buffer cache depend on its level of use. Once a buffer is allocated, used, and freed, it is placed on the BQ_AGE list, and its b_count is incremented. Once the b_count exceeds 4, the buffer is placed on the BQ_LRU list when it is freed (a simplified sketch of this placement policy appears at the end of this discussion). In this manner, a buffer that has been loaded to satisfy a truly random access read request is more likely to be reallocated than one for which sequential requests are being processed.

Each processor has its own set of free list headers. The rules for finding an available buffer with a fixed buffer cache are as follows: search the processor's BQ_AGE list, then its BQ_LRU list; next, search the BQ_AGE lists of all the other processors; finally, search their BQ_LRU lists. In the case of a dynamic buffer cache: check the current processor's BQ_AGE and BQ_LRU; next check the BQ_EMPTY (we wait to check it, as the allocation of new pages to the cache is performed in the background); assuming that we can't find a buffer on one of our processor's free lists, we next check the neighboring processors' BQ_AGE and then their BQ_LRU lists. If we are still unsuccessful, the thread blocks until a buffer is available.

An interesting observation is that items cached in the buffer cache don't behave like items in most other caches. When a buffer is involved in an I/O operation, it is B_BUSY; the rest of the time, it is placed on a free list. In this manner, the more a buffer is referenced, the less its chance of being reallocated. If the cache is not busy, a buffer may persist for quite some time following its use. If another process comes along requesting the data, it may be pleasantly surprised to find that the data is already in the cache and may be accessed logically.

A fourth queue, BQ_SENDFILE, is used to list transient buffers in use by the kernel routine sendfile().
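Here is the promised sketch of the placement policy. It is hypothetical C, not the HP-UX buffer cache implementation; demo_buf, demo_brelse, and the b_freecount field are simplified stand-ins. Buffers freed more than four times migrate to the LRU queue, everything else goes back to the age queue.

/* Hypothetical sketch of the free-list placement policy described above.
 * demo_buf, demo_brelse, and the queue names' layout here are simplified
 * stand-ins, not the actual HP-UX buffer cache code. */
#include <stdio.h>

enum freelist { BQ_EMPTY, BQ_AGE, BQ_LRU, BQ_SENDFILE };

struct demo_buf {
    int b_flags;              /* B_BUSY while an I/O is in progress       */
    int b_freecount;          /* times this buffer has been freed so far  */
    enum freelist b_queue;    /* which free list it currently sits on     */
};

#define B_BUSY 0x08

/* Release a buffer after its I/O completes: clear B_BUSY and pick a list. */
static void demo_brelse(struct demo_buf *bp)
{
    bp->b_flags &= ~B_BUSY;
    bp->b_freecount++;

    /* Frequently reused buffers migrate to the LRU list, making them less
     * likely to be stolen than buffers touched only once or twice. */
    bp->b_queue = (bp->b_freecount > 4) ? BQ_LRU : BQ_AGE;
}

int main(void)
{
    struct demo_buf bp = { B_BUSY, 0, BQ_EMPTY };
    int i;

    for (i = 0; i < 6; i++) {
        bp.b_flags |= B_BUSY;                 /* buffer used for an I/O      */
        demo_brelse(&bp);                     /* ... then returned to a list */
        printf("free #%d -> %s\n", bp.b_freecount,
               bp.b_queue == BQ_LRU ? "BQ_LRU" : "BQ_AGE");
    }
    return 0;
}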
Bigger Isn't Always Better

The time required to search the hash chain is offset by the fact that a logical read or write is several orders of magnitude faster than a physical transaction. The greater the "hit rate" for data in the cache, the greater the payback for this scheme. For files accessed in a sequential manner, the hit rate may exceed 90 percent!

Adjusting the size of the buffer cache may help to increase the hit rate, but caution should be used to make sure that the system has plenty of available memory and that increasing the buffer cache doesn't cause an increase in the page-out rate, or the gains in the hit rate could be lost in the added overhead of vhand. Another concern is that if the buffer cache gets too large, the added length of the hash chains requires more search time. At some point the payback of caching may be diminished. Prior to HP-UX 11.11, the conventional wisdom was that the buffer cache should not exceed 0.5 GB. Recent modifications (HP-UX 11.11) to the hash table mechanics suggest that sizes in the 1-GB to 2-GB range may be considered.

The Directory Name Lookup Cache, DNLC

The final component of the file system we examine in this chapter is the directory name lookup cache, or DNLC (Figure 8-19). Just as the kernel maintains caches of data blocks and inodes for improved performance, it also maintains a cache of recently referenced pathnames.

Figure 8-19. Directory Name Lookup Cache

We have seen how a user-supplied pathname is processed by the kernel through the VFS. If we could skip past a majority of the smoke-and-mirror translation tasks, we could improve the time required to open a file descriptor. Consider a thread that has changed to a working directory and is accessing a number of the files in that directory. Each time an open() call is made, the kernel must convert the pathname to an appropriate vnode and data block.

The DNLC caches individual directory names and accesses them via a hash based on their name strings. Entries in the cache remain on a free list in a least recently used fashion. One limitation is that individual names must not be longer than 39 characters, the maximum space allocated for the directory name. Longer names may be used, but they will not be cached (prior to HP-UX 11.0, the maximum length for cached names was 15 characters). Hash chain headers are stored in the kernel's nc_hash[] array. The hash function uses the vnode address and the file/directory name to calculate the chain header to which an entry should be linked. Consider Listings 8.13 and 8.14 of the DNLC structures.

Listing 8.13. q4> fields struct nc_hash

Each nc_hash[] array element contains two linkage pointers
0 0 4 0 * hash_next
4 0 4 0 * hash_prev

Listing 8.14. q4> fields struct ncache

First the entry is linked to its hash chain
0 0 4 0 * hash_next
4 0 4 0 * hash_prev
All entries are linked into a least recently used list
8 0 4 0 * lru_next
12 0 4 0 * lru_prev
This points to the directory's vnode
16 0 4 0 * vp
And this is a pointer to the parent directory's vnode
20 0 4 0 * dp
The entry's length, name, and credentials are stored next
24 0 1 0 char namlen
25 0 39 0 char[39] name
64 0 4 0 * cred
Mount point directories are linked to the virtual file system table
68 0 4 0 * vfs_next
72 0 4 0 * vfs_prev
All entries sharing a parent directory are dual linked
76 0 4 0 * dp_next
80 0 4 0 * dp_prev
All entries sharing a vnode are dual linked
84 0 4 0 * vp_next
88 0 4 0 * vp_prev
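As a closing illustration, here is a minimal, hypothetical sketch of a DNLC-style lookup: hash the parent directory's vnode pointer together with the component name, then walk that one chain comparing parent and name. The names demo_ncache, dnlc_demo_lookup, and NC_HASH_SIZE are invented for this example, and the hash function shown is not the one HP-UX actually uses.

/* Hypothetical sketch of a DNLC-style name lookup keyed on
 * (parent vnode, component name). demo_ncache, dnlc_demo_lookup, and
 * NC_HASH_SIZE are illustrative names only. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

#define NC_HASH_SIZE 256
#define NC_NAMLEN    39                    /* longer names are simply not cached */

struct demo_ncache {
    struct demo_ncache *hash_next;         /* hash chain linkage                 */
    void *dp;                              /* parent directory's vnode           */
    void *vp;                              /* vnode the cached name resolves to  */
    char  name[NC_NAMLEN + 1];
};

static struct demo_ncache *nc_hash[NC_HASH_SIZE];

static unsigned nc_hashidx(const void *dp, const char *name)
{
    unsigned h = (unsigned)(uintptr_t)dp;  /* mix in the parent vnode address */
    while (*name)
        h = h * 31 + (unsigned char)*name++;
    return h % NC_HASH_SIZE;
}

/* Return the cached vnode, or NULL if the name must be resolved the slow
 * way (a directory read through the buffer cache). */
static void *dnlc_demo_lookup(const void *dp, const char *name)
{
    struct demo_ncache *ncp = nc_hash[nc_hashidx(dp, name)];

    for (; ncp != NULL; ncp = ncp->hash_next)
        if (ncp->dp == dp && strcmp(ncp->name, name) == 0)
            return ncp->vp;
    return NULL;
}

int main(void)
{
    int parent_vnode, stuff_vnode;         /* stand-ins for real vnodes */
    static struct demo_ncache entry;

    entry.dp = &parent_vnode;
    entry.vp = &stuff_vnode;
    strcpy(entry.name, "stuff");
    nc_hash[nc_hashidx(entry.dp, entry.name)] = &entry;

    printf("stuff cached? %s\n",
           dnlc_demo_lookup(&parent_vnode, "stuff") ? "yes" : "no");
    printf("other cached? %s\n",
           dnlc_demo_lookup(&parent_vnode, "other") ? "yes" : "no");
    return 0;
}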