This file system's native block size
20 0 4 0 int vfs_bsize
For exported file systems (NFS) this is the UID remote "root"
users are remapped to and the export state flag
24 0 2 0 u_short vfs_exroot
26 0 2 0 short vfs_exflags
This points to the file system's type-specific in-core data
structures
28 0 4 0 * vfs_data
A reference count of processes currently sleeping on the file
system's mount point
32 0 4 0 int vfs_icount
The file system's specific type (HFS, VxFS, NFS, CDFS, ...)and
its unique id
36 0 2 0 short vfs_mtype
40 0 4 0 int vfs_fsid[0]
44 0 4 0 int vfs_fsid[1]
A log pointer, the most recent mount timestamp, file system
name, device number (Major+minor device number), and a chain
pointer to DNLC entries referencing this file system
48 0 4 0 * vfs_logp
52 0 4 0 int vfs_mnttime
56 0 1024 0 char[1024] vfs_name
1080 0 4 0 int vfs_dev
1084 0 4 0 * vfs_ncachehd
Listing 8.4.
q4> fields struct vfsops
The following pointers form a vector jump table to type-
specific operational routines in the kernel
0 0 4 0 * vfs_mount
4 0 4 0 * vfs_unmount
8 0 4 0 * vfs_root
12 0 4 0 * vfs_statvfs
16 0 4 0 * vfs_sync
20 0 4 0 * vfs_vget
24 0 4 0 * vfs_getmount
28 0 4 0 * vfs_freeze
32 0 4 0 * vfs_thaw
36 0 4 0 * vfs_quota
40 0 4 0 * vfs_mountroot
44 0 4 0 * vfs_size
The Type-Specific In-Core Data Structures
Each file system type requires its own specific in-core data. Figure 8-13 illustrates the point.
In general, the in-core data has a copy of the file system's superblock, a pointer to the kernel's copy of the
inode
/
vnode
to which the file system is mounted, and copies of the file system's allocation tables.
In the case of HFS, the kernel-resident structures include a
mount
structure (with pointers to the mount point's
inode
/
vnode
), a pointer back to the file system's
vfs
structure, and a pointer to a copy of the superblock maintained in the buffer cache (more on the buffer cache later in this chapter). Let's examine the HFS
mount
and
fs
(superblock) structures in Listings 8.5 and 8.6.
Listing 8.5.
q4> fields struct mount
First are forward and backward linkage pointers to a hash list
0 0 4 0 * m_hforw
4 0 4 0 * m_hback
Next is a pointer to the file system's vfs structure, the
block device number (major and \minor number) on which the
file system resides, and a pointer to the buffer header which
maps the in-core copy of the file system's superblock
8 0 4 0 * m_vfsp
12 0 4 0 int m_dev
16 0 4 0 * m_bufp
20 0 4 0 int m_maxbufs
The raw device number is also recorded here
24 0 4 0 int m_rdev
28 0 4 0 * m_nodes.m_inodp
28 0 4 0 * m_nodes.m_cdnodp
32 0 4 0 * m_qinod
Mount flags are stored here
36 0 4 0 int m_flag
40 0 2 0 u_short m_qflags
44 0 4 0 u_int m_btimelimit
48 0 4 0 u_int m_ftimelimit
A pointer to the file system's root inode/vnode in-core, a
reference count, and a pointer to the mount point's in-core
inode/vnode
52 0 4 0 * m_rootvp
56 0 4 0 u_int m_ref
60 0 4 0 * m_iused
Listing 8.6.
q4> fields struct fs
0 0 4 0 int fs_unused[0]
4 0 4 0 int fs_unused[1]
First we have the file system's super block address, the
offset of the cylinder group data block, the inode blocks, and
the first data block following the cylinder groups
8 0 4 0 int fs_sblkno
12 0 4 0 int fs_cblkno
16 0 4 0 int fs_iblkno
20 0 4 0 int fs_dblkno
24 0 4 0 int fs_cgoffset
28 0 4 0 int fs_cgmask
Next is the last modification timestamp, the number of blocks
(total space), and number of data blocks (minus the metadata
overhead) in the file system.
32 0 4 0 int fs_time
36 0 4 0 int fs_size
40 0 4 0 int fs_dsize
Additional static and tunable parameters include the number of
cylinders per group, block and fragment sizes, number of
fragments in a block (1,2,4,8), minimum reserved free space
(applies only to non-superuser allocation requests),
rotational delay for writing sequential blocks (also known as
interleave), and drive speed (in revolutions per second)
44 0 4 0 int fs_ncg
48 0 4 0 int fs_bsize
52 0 4 0 int fs_fsize
56 0 4 0 int fs_frag
60 0 4 0 int fs_minfree
64 0 4 0 int fs_rotdelay
68 0 4 0 int fs_rps
72 0 4 0 int fs_bmask
76 0 4 0 int fs_fmask
80 0 4 0 int fs_bshift
84 0 4 0 int fs_fshift
Maximum contiguous blocks and max number of blocks allocated
to a single file within a cylinder group
88 0 4 0 int fs_maxcontig
92 0 4 0 int fs_maxbpg
96 0 4 0 int fs_fragshift
100 0 4 0 int fs_fsbtodb
104 0 4 0 int fs_sbsize
108 0 4 0 int fs_csmask
112 0 4 0 int fs_csshift
116 0 4 0 int fs_nindir
120 0 4 0 int fs_inopb
124 0 4 0 int fs_nspf
This file system's unique identifier number
128 0 4 0 int fs_id[0]
132 0 4 0 int fs_id[1]
Mirror state information for primary swap and "/" partitions
136 0 0 4 u_int fs_mirror.state.root
136 4 0 1 u_int fs_mirror.state.rflag
136 5 0 4 u_int fs_mirror.state.swap
137 1 0 1 u_int fs_mirror.state.sflag
137 2 2 6 u_int fs_mirror.state.spare
140 0 4 0 int fs_mirror.mirtime
144 0 4 0 int fs_featurebits
148 0 4 0 int fs_optim
Cylinder group summary array address, summary array size,
cylinder group size
152 0 4 0 int fs_csaddr
156 0 4 0 int fs_cssize
160 0 4 0 int fs_cgsize
Number of tracks per cylinder group, sectors per track,
sectors per cylinder, total number of cylinders in the file
system, cylinders per group, and inodes per group
164 0 4 0 int fs_ntrak
168 0 4 0 int fs_nsect
172 0 4 0 int fs_spc
176 0 4 0 int fs_ncyl
180 0 4 0 int fs_cpg
184 0 4 0 int fs_ipg
188 0 4 0 int fs_fpg
Calculated cylinder group totals (this must be audited
following a power failure)
192 0 4 0 int fs_cstotal.cs_ndir
196 0 4 0 int fs_cstotal.cs_nbfree
200 0 4 0 int fs_cstotal.cs_nifree
204 0 4 0 int fs_cstotal.cs_nffree
Next are the file system modification flag, the clean/dirty
flag, the read-only flag, the spare flag (currently not used),
and a string holding the current mount point of the file
system
208 0 1 0 char fs_fmod
209 0 1 0 char fs_clean
210 0 1 0 char fs_ronly
211 0 1 0 char fs_flags
212 0 512 0 char[512] fs_fsmnt
724 0 4 0 int fs_cgrotor
728 0 4 0 * fs_csp
732 0 4 0 int fs_csp_pad
736 0 4 0 int fs_unused2[30]
856 0 4 0 int fs_cpc
860 0 2 0 short fs_postbl[32][8]
This file system's "magic number" (used to verify its type),
its type name and pack name
1372 0 4 0 int fs_magic
1376 0 6 0 char[6] fs_fname
1382 0 6 0 char[6] fs_fpack
1388 0 1 0 u_char fs_rotbl[0]
The
vnode
Next, we address the specifics of the
vnode
structure. We have learned that an
inode
contains all of a file's credentials and attributes, and is therefore the key to managing a file within a file system. The
inode
number is unique within the context of a single file system, and it is all that is needed to define a file. Within a kernel, there may be several file systems mounted together to create what appears to be a seamless, larger file system. Within this larger context,
inode
numbers
are no longer
guaranteed
to be unique. Each mounted file system may have an
inode
1, 2, and so on. To enable the kernel to refer to opened files with a unique handle, the virtual
inode
or
vnode
was created.
In a similar manner to the
vfs
structure, a
vnode
contains an operational array pointer to the kernel's type-specific routines for file manipulation
tasks
(see Figure 8-14).
The key fields of the
vnode
include a type number, a pointer to a type-specific operations array, and a reference to the in-core copy of the file's
inode
data. Listing 8.7 is an annotated listing of a virtual file system vnode:
Listing 8.7.
q4> fields struct vnode
0 0 2 0 u_short v_flag
|
0x01
|
VROOT
|
root of a file system
|
|
0x02
|
VTEXT
|
vnode is a text prototype
|
|
0x04
|
VSWAP
|
vnode maps a swap space
|
|
0x08
|
VXOFTHOLD
|
vnode allows soft holds
|
|
0x010
|
VEXLOCK
|
exclusive lock
|
|
0x020
|
VSHLOCK
|
shared lock
|
|
0x040
|
VLWAIT
|
there is a waiter on a lock
|
|
0x080
|
VXDEVOK
|
cross device rename and link is allowed
|
|
0x0100
|
VMMF
|
vnode is memory mapped
|
|
0x04000
|
VNOBCACHE
|
data caching disabled
|
|
0x08000
|
VN_MPSAFE
|
semaphore not required
|
Shared lock count, exclusive lock count, private data count,
and reference count
2 0 2 0 u_short v_shlockc
4 0 2 0 u_short v_exlockc
6 0 2 0 u_short v_tcount
8 0 4 0 int v_count
Pointer to vfs structure if this is a mount point
12 0 4 0 * v_vfsmountedhere
Pointers to the type specific vnode operations array, unix ipc
socket, stream head, and the vfs the file is in
16 0 4 0 * v_op
20 0 4 0 * v_socket
24 0 4 0 * v_stream
28 0 4 0 * v_vfsp
vnode type, device number, type-specific data (hfs inode, vxfs
inode, rnode, ...), and parent file system type
32 0 4 0 enum4 v_type
36 0 4 0 int v_rdev
40 0 4 0 * v_data
44 0 4 0 enum4 v_fstype
Pointer to parent pseudo vas (if this maps to an executable
image file)
48 0 4 0 * v_vas
Beta semaphore locking structure
52 0 1 0 u_char v_lock.b_lock
54 0 2 0 u_short v_lock.order
56 0 4 0 * v_lock.owner
Mapped memory cache buffers for this file
60 0 4 0 * v_cleanblkhd
64 0 4 0 * v_dirtyblkhd
vnode write count, FS independent locks, soft hold count, and
a copy of the va_nodeid
68 0 4 0 int v_writecount
72 0 4 0 * v_locklist
76 0 4 0 int v_scount
80 0 4 0 int v_nodeid
DNLC entry pointers if this vnode is currently described in
the directory name lookup cache
84 0 4 0 * v_ncachedhd
88 0 4 0 * v_ncachevhd
92 0 4 0 * v_pfdathd
96 0 4 0 u_int v_last_fsync
The HFS In-Core
inode
Contains Its
vnode
For an HFS file, the in-core
inode
structure includes the
vnode
structure and a complete copy of its
disk-resident
inode
(the
icommon
structure we examined earlier) plus additional fields of information (Listing 8.8):
Listing 8.8.
q4> fields struct inode
0 0 4 0 * i_chain[0]
4 0 4 0 * i_chain[1]
The device number this inode came from
8 0 4 0 int i_dev
The inode number
12 0 4 0 u_int i_number
The state flags, lock word, and tid of last thread to lock this inode
16 0 4 0 u_int i_flag
20 0 2 0 u_short i_lockword
24 0 4 0 int i_tid
The virtual file system's
vnode
for the file is stored between offset 28 and 127
inode device pointer
128 0 4 0 * i_devvp
132 0 4 0 int i_diroff
Continuation inode pointer
136 0 4 0 * i_contip
File system pointer
140 0 4 0 * i_fs
Quota pointer
144 0 4 0 * i_dquot
148 0 4 0 int i_rdev
152 0 4 0 int i_un.if_lastr
152 0 4 0 * i_un.is_socket
156 0 4 0 * i_fr.if_freef
160 0 4 0 * i_fr.if_freeb
164 0 4 0 * i_fselr.i_selp
168 0 2 0 short i_fselr.i_selflag
172 0 4 0 * i_fselw.i_selp
176 0 2 0 short i_fselw.i_selflag
180 0 4 0 * i_mount
Acopy of the disk-based inode (structure icommon) is stored between offset 184 and 311
Number of reader calls
312 0 2 0 u_short i_rcount
Beta semaphore locking structures
316 0 1 0 u_char i_b_sema.b_lock
318 0 2 0 u_short i_b_sema.order
320 0 4 0 * i_b_sema.owner
324 0 4 0 int filler[0]
328 0 4 0 int filler[1]
The relationship between the in-core
inode
data and the
vnode
varies between file system types, but the
vnode
structure itself is identical for all supported file systems. As the
vnode
uses operation's array pointers, it is up to the programmers of the referenced routines to have intimate knowledge of
inode
type-specific content.
When a file is opened by the kernel, it is identified by the device number of the volume on which the file system is located and its
inode
number. By using these two values, each file is guaranteed to have a unique identity within the scope of the operating system.
Caching
inode
/
vnode
Data in the Kernel
To improve system performance and reduce the amount of time spent accessing metadata on the disk,
inode
/
vnode
data is maintained in kernel-resident caches. Each type of file system is responsible for the allocation and maintenance of its
inode
/
vnode
cache. In most implementations caches are
sized
by kernel-tunable parameters or by an
imbedded
formula. A hash is
formed
using the device number and the
inode
number. Before a copy of an
inode
is read in from the physical disk, the
inode
cache is always checked to see if a copy of the data is present.
Smoke and Mirrors: Tying Together the File Subsystem Abstractions
Next, we need to tie the abstractions together into a seamless VFS implementation (see Figure 8-15).
Let's
revisit
our earlier scenario except that now
/home
is a separate file system mounted to the system's root file system. Again, our challenge is to follow the pathname
/home/chris/stuff
through the kernel's virtual structures.
We start our search by locating the kernel's copy of the
/
inode
(2 on the first volume). The
inode
/
vnode
and root directory data of all mount points and file system roots are held in the caches and the buffer cache as long as a file system is mounted. To speed up this initial search, the kernel maintains a permanent shortcut pointer to the cached copy of the
/ inode
/
vnode
in the symbol
rootdir
.
From this
vnode
, we determine the file system device number, and from the associated
inode
we get the data block pointer. These two numbers are used to hash our way into the buffer cache, where we find a copy of the root file system's root directory data. We take the
inode
number associated with the
home
directory and the device number to hash our way back into the
inode
cache for the next step of our search.
When we find the cached
vnode
for the
/home
directory (5 on the first volume), the
v_vfsmountedhere
pointer is valid and directs us to the kernel mount table and the
vfs
structure for the second mounted volume (did you ever play the children's game Chutes and Ladders?). We then follow the
v_data
pointer to find the root
inode
/
vnode
(2 on the second volume) and cached data block for the mounted
/home
file system.
We continue our search within the
/home
file system, locating the
inode
/
vnode
(10 on the second volume) and cached data for the
/home/chris
directory and finally the
inode
/
vnode
(14 on the second volume) and cached data for the
/home/chris/stuff
file. Accessing these last two stops on our search may or may not find the
requested
data in the appropriate caches. If they aren't loaded, then the system uses the routines referenced by the operational arrays and kernel functions to load copies from the disk into the caches.
This
brings
us to the end of our quest. Next we consider the attachment of an
open
file to a process.
Connecting an Open File to a Process
We have seen that the
proc
structure is the base of all process-
related
kernel structures. The definition of a process's resources includes mapping opened files to
per-process
file descriptors (see Figure 8-16).
From the kernel's point of view, each distinct file
open()
call must be tracked, which is accomplished by making entries in the system file table. From the process's point of view, it is only
concerned
with its own file descriptors, but the kernel must manage all
open()
calls.
The number of file descriptors a single process may have is controlled by the kernel-tunable parameters
max_files
(for a regular
user
process) and
max_files_lim
(for a superuser process), and may be set up to 60,000. The default is 60 and is usually sufficient, considering that most processes only need to map
stdin
,
stdout
, and
stderr
.
To avoid wasting kernel memory space, process file descriptors are stored in blocks of eight for HP-UX 11.11 (HP-UX 11.0 used 32 descriptors per block), and the blocks are referenced through a partition table pointed to by the
proc
structure. Additional blocks are allocated as needed by the kernel.
On HP-UX 11.11 the
file_descriptor_t
structure contained:
fd_lock a spinlock structure
*fd_filep a pointer to a system file table file entry
fd_locker_tid a reference to a locking thread
fd_state the current descriptor state
fd_pofile reference to the ofile partition entry
The kernel file system table consists of
nfile
entries (kernel-tunable) of type
file
. See Listings 8.9 and 8.10.
Listing 8.9.
q4> fields struct file
The file descriptor table flags are:
0 0 4 0 int f_flag
|
0x01FREAD
|
descriptor is readable
|
|
0x02FWRITE
|
descriptor is writable
|
|
0x04FNDELAY
|
no delay
|
|
0x08FAPPEND
|
append on each write
|
The kernel object represented by this entry
4 0 2 0 short f_type
|
0x01DTYPE_VNODE
|
regular file
|
|
0x02DTYPE_SOCKET
|
Berkeley IPC Socket
|
|
0x03DTYPE_STREAMS
|
Streams file type
|
|
0x05DTYPE_UNSP
|
user nsp control
|
|
0x06DTYPE_LLA
|
link-level LAN access
|
{% if main.adsdop %}{% include 'adsenceinline.tpl' %}{% endif %}
link count from the message queue
6 0 2 0 short f_msgcount
reference count
8 0 4 0 int f_count
type-specific file system operations array pointer
12 0 4 0 * f_ops
pointer to this file's vnode/socket kernel data
16 0 4 0 * f_data
20 0 4 0 int f_freeindex
24 0 8 0 long long f_offset
32 0 4 0 * f_buf
credentials of user who opened the file
36 0 4 0 * f_cred
kernel beta-semaphore
40 0 1 0 u_char f_b_sema.b_lock
42 0 2 0 u_short f_b_sema.order
44 0 4 0 * f_b_sema.owner
data intercept audit flag
48 0 2 0 short f_tap
Listing 8.10.
q4> fields struct fileops
0 0 4 0 * fo_rw
4 0 4 0 * fo_ioctl
8 0 4 0 * fo_select
12 0 4 0 * fo_close
The Buffer Cache
Just as the kernel keeps copies of open
inode
data in a cache, it also maintains a much larger cache of recently accessed file data blocks (or extents, depending on the type of file system). This buffer cache is a key component of the HP-UX operating system. Figure 8-17 provides an overview of the buffer cache design.
A basic
premise
of the UNIX operating system is that every thing is a file. This model serves many purposes and greatly
reduces
the overall complexity of the operating system. With this in mind, it is understandable that everything that may be done to improve file performance provides a large payback.
Being Flexible: The Dynamic Buffer Cache
The HP-UX implementation of a buffer cache
follows
a dynamic model. The buffer cache is mapped into the kernel's address space, but the physical pages allocated to it come from the same available memory pools as those used for user processes. In this manner, the buffer cache competes with user code for system memory resources.
The kernel allows the administrator to set a maximum and minimum percentage of the available memory that may be allocated for buffer cache use. Remember that
available memory
means the amount of physical memory remaining after the kernel
allocates
pages for its exclusive use during system initialization. The kernel-tunable parameters to enable a dynamic buffer cache are:
If the
dbc_min_pct
is set equal to or larger than the
dbc_max_pct
, then the dynamic buffer cache is capped at the size dictated by
dbc_min_pct
. This
presents
what at first seems to be a
conundrum
. How can it be a dynamic buffer cache if the size is fixed? While the size is fixed, it is simply a
targeted
value and not an absolute. As long as the buffer cache is
considered
to be dynamic, its pages may be aged and stolen by the
vhand
daemon during times of heavy memory pressure. This means that while the overall size appears to be static, the specific pages allocated to the cache may change from time to time, so they are considered dynamic.
If these two dynamic tunables are not set, then the following older kernel parameters take effect and the cache is flagged as static; that is, pages once allocated to the cache
remain
there and
vhand
cannot scan them.
Once a buffer has been
populated
with a copy of file data, the kernel needs an efficient method to search for it in the cache. This is accomplished through the use of a hash (by now you expected this, didn't you!).
The Buffer Cache Hash
Figure 8-18 illustrates the basic design of the buffer cache. Individual buffers are pointed to by buffer headers' which are arranged in hash chains. In the case of a
read()
, the hash offset is calculated using the device number and block offset (obtained from the
vnode
) for the requested disk data.
Once the hash has been calculated, the appropriate hash chain is searched to see if the requested data is present in the cache. In the case of a
miss
, the kernel works with the appropriate hardware driver, requests a physical data transfer from the disk, and allocates a free cache buffer to store the data in. Requests made and satisfied by the buffer cache are said to be
logical
reads and writes. Reads and
writes
between the buffer cache and a disk are called
physical
.
A
write()
request is handled in a similar manner. If a buffer has already been started, a simple logical write is performed. When a buffer is filled or when a dirty buffer (one whose content
differs
from its
counterpart
on the disk) is found by the
sync
daemon during one of its scheduled
scans
, a physical write is scheduled. Physical writes may also be
specifically
requested by a programmer or when a file is closed.
Listings 8.11 and 8.12 show selected fields from q4 listings of the buffer header and buffer structure.
Listing 8.11.
q4> fields struct bufhd
The buffer header state flag
0 0 4 0 int b_flags
|
0x00800000
|
B_PAGEOUT
|
this is a buffer header, not a buffer
|
Forward/backward linkage pointers for the hash chain
4 0 4 0 * b_forw
8 0 4 0 * b_back
Modification timestamp
12 0 4 0 int b_checkdup
Lock structure for this buffer hash chain
16 0 4 0 * bh_lock
Listing 8.12.
q4> fields struct buf (a partial listing)
The buffer state flags
0 0 4 0 int b_flags
|
0x00000000
|
B_WRITE
|
|
|
0x00000001
|
B_READ
|
|
|
0x00000002
|
B_DONE
|
transaction completed
|
|
0x00000004
|
B_ERROR
|
transaction failed
|
|
0x00000008
|
B_BUSY
|
not currently on a free list
|
|
0x00000010
|
B_PHYS
|
physical I/O in progress
|
|
0x00000040
|
B_WANTED
|
send a
wakeup
when no longer BUSY
|
|
0x00000100
|
B_ASYNC
|
don't wait for I/O completion
|
|
0x00001000
|
B_PRIVATE
|
used by LVM
|
|
0x00004000
|
B_PFTIMEOUT
|
LVM power fail timeout
|
|
0x00008000
|
B_CACHE
|
|
|
0x00010000
|
B_INVAL
|
buffer not valid
|
|
0x00020000
|
B_FSYSIO
|
buffer resulted from bread or bwrite
|
|
0x00040000
|
B_CALL
|
call b_iodone() from iodone()
|
|
0x00080000
|
B_NOCACHE
|
don't hold block in buffer cache
|
|
0x00100000
|
B_RAW
|
raw access I/O
|
|
0x00200000
|
B_BCACHE
|
buffer is from the buffer cache
|
|
0x00400000
|
B_SYNC
|
buffer write is synchronous
|
Linkage pointers connecting a buffer to a hash chain
4 0 4 0 * b_forw
8 0 4 0 * b_back
Linkage pointer connecting a buffer to a free list
12 0 4 0 * av_forw
16 0 4 0 * av_back
Linkage pointers connecting a buffer to a vnodes clean/dirty list
20 0 4 0 * b_blockf
24 0 4 0 * b_blockb
28 0 4 0 * b_strategy
32 0 4 0 * b_un.b_addr
32 0 4 0 * b_un.b_words
32 0 4 0 * b_un.b_filsys
32 0 4 0 * b_un.b_cs
32 0 4 0 * b_un.b_cg
32 0 4 0 * b_un.b_dino
32 0 4 0 * b_un.b_daddr
36 0 4 0 int b_un2.b_sectno
36 0 4 0 int b_un2.b_byteno
40 0 4 0 int b_dev
44 0 4 0 int b_blkno
48 0 4 0 * b_sc
The count of how many times this buffer has been returned to a
free list since it was allocated
52 0 4 0 int b_bcount
56 0 2 0 u_short b2_flags
Error number (if B_ERROR is set in b_flags), buffer size, and offset
58 0 2 0 short b_error
60 0 4 0 int b_bufsize
64 0 4 0 int b_offset
68 0 4 0 * b_merge
The pointer contains the address of the routine to call when a
pending I/O is finished
72 0 4 0 * b_iodone
---------------------------------------
Free Lists: Allocating a New Buffer
When a buffer is in use, its
b_flags
is set to
B_BUSY
. As soon as any pending data transfer is complete, it is returned to a free list. There are several free lists.
When a page is first allocated to a dynamic buffer cache, its buffers are placed on the
BQ_EMPTY
free list. Requests to grow the buffer cache depend on its level of use.
Once a buffer is allocated, used, and freed, it is placed on the
BQ_AGE
list, and its
b_count
is incremented. Once the
b_count
exceeds 4, the buffer is placed on the
BQ_LRQ
list when it is freed. In this manner, a buffer that has been loaded to
satisfy
a truly random access read request is more likely to be reallocated than one for which sequential requests are being
processed
.
Each processor has its own set of free list headers. The rules for finding an available buffer with a fixed buffer cache are as follows:
-
Search the processor's
BQ_AGE
list, then its
BQ_LRU
.
-
Next, search the
BQ_AGE
lists of all the other processors.
-
Finally, search their
BQ_LRQ
lists.
In the case of a dynamic buffer cache:
-
Check the current processor's
BQ_AGE
and
BQ_LRU
.
-
Next we check the
BQ_EMPTY
(we wait to check it, as the allocation of new pages to the cache is performed in background).
-
Assuming that we can't find a buffer on one of our processor's free lists, we next check the neighboring processors'
BQ_AGE
and then their
BQ_LRU
lists.
If we are still
unsuccessful
, the thread blocks until a buffer is available.
An interesting observation is that items cached in the buffer cache don't behave like items in most other caches. When a buffer is involved in an I/O operation, it is
B_BUSY
. The rest of the time, it is placed on a free list. In this manner the more a buffer is referenced, the less its
chances
are of being reallocated. If the cache is not busy, a buffer may persist for quite some time following its use, if another process comes along requesting the data, it may be pleasantly surprised to find that it is already in the cache and may be accessed logically.
A fourth queue,
BQ_SENDFILE
, is used to list transitive buffers in use by kernel routine
sendfile()
.
Bigger Isn't Always Better
The time required to search the hash chain is offset by the fact that a logical read or write is several orders of magnitude faster than a physical transaction. The greater the "hit rate" for data in the cache, the greater the payback is for this scheme. For files accessed in a sequential manner, the hit rate may exceed 90 percent! Adjusting the size of the buffer cache may help to increase the hit rate, but caution should be used to make sure that the system has plenty of available memory and that increasing the buffer cache doesn't cause an increase in the page-out rate, or the gains in the hit rate could be lost in the added overhead of
vhand
.
Another concern is that if the buffer cache gets too large, then the added length of the hash chain requires more search time. At some point the payback of caching may be diminished.
Prior to HP-UX 11.11, the conventional wisdom was that the buffer cache should not exceed 0.5 GB. Recent modifications (HP-UX 11.11) to the hashtable mechanics suggest that sizes in the 1-GB to 2-GB range may be considered
The Directory Name Lookup Cache, DNLC
The final component of the file system we examine in this chapter is the directory name lookup cache, or DNLC (Figure 8-19). Just as the kernel maintains caches for data blocks and
inode
s for improved performance, it also maintains a cache of recently referenced pathnames.
We have seen how a
user-supplied
pathname is processed by the kernel through the VFS. If we could skip past a majority of the smoke-and-mirror translation tasks, we could improve the time required to open a file descriptor.
Consider a thread that has changed to a working directory and is accessing a number of the files in that directory. Each time an
open()
call is made, the kernel must convert the pathname to an appropriate
vnode
and data block. The DNLC caches individual directory
names
and
accesses
them via a hash based on their name strings.
Entries in the cache remain on a free list in a least recently used fashion. Another limit is that individual names must not be longer than 39 characters, the maximum space allocated for the directory name. Longer names may be used but they will not be cached (prior to HP-UX 11.0 the maximum length for cached names was 15
characters
).
Hash chain headers are stored in the kernel's
nc_hash[]
array. The hash function uses the
vnode
address and the file/directory name to calculate the chain header to which it should be linked.
Consider the Listings 8.13 and 8.14of the DNLC structures.
Listing 8.13.
q4> fields struct nc_hash
Each nc_hash[] array element contains two linkage pointers
0 0 4 0 * hash_next
4 0 4 0 * hash_prev
Listing 8.14.
q4> fields struct ncache
First the entry is linked to its hash chain
0 0 4 0 * hash_next
4 0 4 0 * hash_prev
All entries are linked into a least recently used list
8 0 4 0 * lru_next
12 0 4 0 * lru_prev
This points to the directory's vnode
16 0 4 0 * vp
And a pointer to the parent directory's vnode
20 0 4 0 * dp
The entry's length, name, and credentials are stored next
24 0 1 0 char namlen
25 0 39 0 char[39] name
64 0 4 0 * cred
Mount point directories are linked to the virtual file system table
68 0 4 0 * vfs_next
72 0 4 0 * vfs_prev
All entries sharing a parent directory are dual linked
76 0 4 0 * dp_next
80 0 4 0 * dp_prev
All entries sharing a vnode are dual linked
84 0 4 0 * vp_next
88 0 4 0 * vp_prev