Section 11.6. The VFS Layer | Mac OS X Internals: A Systems Approach

11.6. The VFS Layer

Mac OS X provides a virtual file system interface, the vnode/vfs layer, often referred to simply as the VFS layer. First implemented by Sun Microsystems, the vnode/vfs concept is widely used by modern operating systems to allow multiple file systems to coexist in a clean and maintainable manner. A vnode (virtual node) is an in-kernel representation of a file, whereas a vfs (virtual file system) represents a file system. The VFS layer sits between the file-system-independent and file-system-dependent code in the kernel, thereby abstracting file system differences from the rest of the kernel, which uses VFS-layer functions to perform I/O regardless of the underlying file systems. Beginning with Mac OS X 10.4, a VFS kernel programming interface (KPI) is implemented in bsd/vfs/kpi_vfs.c.

The Mac OS X VFS is derived from FreeBSD's VFS, although there are numerous differences, usually minor in concept. One area of major difference is the file system layer's integration with virtual memory. The unified buffer cache (UBC) on Mac OS X is integrated with Mach's virtual memory layer. As we saw in Chapter 8, the ubc_info structure associates Mac OS X vnodes with the corresponding virtual memory objects.

Figure 11-12 shows a simplified visualization of the vnode/vfs layer. In object-oriented parlance, the vfs is akin to an abstract base class from which specific file system instances such as HFS Plus and UFS are derived. Continuing with the analogy, the vfs "class" contains several pure virtual functions that are defined by the derived classes. The vfsops structure [bsd/sys/mount.h] acts as a function-pointer table for these functions, which include the following (listed in the order they appear in the structure):

vfs_mount(): implements the mount() system call
vfs_start(): called by the mount() system call to perform any operations that the file system wishes to perform after a successful mount operation
vfs_unmount(): implements the unmount() system call
vfs_root(): retrieves the root vnode of the file system
vfs_quotactl(): implements the quotactl() system call (handles quota operations on the file system)
vfs_getattr(): populates a vfs_attr structure with file system attributes
vfs_statfs(): implements the statfs() system call (retrieves file system statistics by populating a statfs structure)
vfs_sync(): synchronizes in-memory dirty data with the on-disk data
vfs_vget(): retrieves an existing file system object given its ID, for example, the catalog node ID in the case of HFS Plus (see Chapter 12)
vfs_fhtovp(): translates a file handle to a vnode; used by the NFS server
vfs_init(): performs one-time initialization of the file system
vfs_sysctl(): handles file-system-level sysctl operations specific to this file system, for example, enabling or disabling the journal on an HFS Plus volume
vfs_setattr(): sets file system attributes, if any can be set, for example, the volume name in the case of HFS Plus
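The "abstract base class" analogy can be made concrete with a small user-space sketch. It is only an illustration of the function-pointer-table idea, not the kernel's vfsops declaration: the toy_vfsops and toy_mount structures, the toyfs_* functions, and the toy_do_* helpers are all hypothetical names, and only three of the operations listed above are modeled, with simplified signatures.

```c
#include <stddef.h>

/* Hypothetical user-space stand-ins; not the kernel's real declarations. */
struct toy_mount;

/* A reduced operations table in the spirit of vfsops. */
struct toy_vfsops {
    int (*vfs_mount)(struct toy_mount *mp, const char *device);
    int (*vfs_unmount)(struct toy_mount *mp);
    int (*vfs_sync)(struct toy_mount *mp);
};

/* One mounted instance; mnt_op selects the file system implementation. */
struct toy_mount {
    const struct toy_vfsops *mnt_op;
    int mounted;       /* 1 while the toy file system is mounted */
    int dirty_blocks;  /* pretend count of unwritten blocks */
};

/* The "derived class": one toy file system that fills in the table. */
static int toyfs_mount(struct toy_mount *mp, const char *device)
{
    if (device == NULL || device[0] == '\0')
        return -1;     /* refuse to mount without a device name */
    mp->mounted = 1;
    mp->dirty_blocks = 0;
    return 0;
}

static int toyfs_unmount(struct toy_mount *mp)
{
    mp->mounted = 0;
    return 0;
}

static int toyfs_sync(struct toy_mount *mp)
{
    mp->dirty_blocks = 0;  /* pretend everything was written out */
    return 0;
}

static const struct toy_vfsops toyfs_vfsops = {
    toyfs_mount, toyfs_unmount, toyfs_sync
};

/* File-system-independent callers: they only ever go through mnt_op. */
int toy_do_mount(struct toy_mount *mp, const struct toy_vfsops *ops,
                 const char *device)
{
    mp->mnt_op = ops;
    return mp->mnt_op->vfs_mount(mp, device);
}

int toy_do_sync(struct toy_mount *mp)
{
    return mp->mnt_op->vfs_sync(mp);
}

int toy_do_unmount(struct toy_mount *mp)
{
    return mp->mnt_op->vfs_unmount(mp);
}
```

A caller would pass &toyfs_vfsops to toy_do_mount() once and afterward use only the toy_do_* helpers; supporting a second file system then amounts to supplying another toy_vfsops table, with no change to the callers.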

Figure 11-12. An overview of the vnode/vfs layer's role in the operating system

Similarly, a vnode is an abstract base class from which files residing on various file systems are conceptually derived. A vnode contains all the information that the file-system-independent layer of the kernel needs. Just as the vfs has a set of virtual functions, a vnode too has a (larger) set of functions representing vnode operations. Normally, all vnodes representing files on a given file system type share the same function-pointer table.

As Figure 11-12 shows, a mount structure represents an instance of a mounted file system. Besides a pointer to the vfs operations table, the mount structure also contains a pointer (mnt_data) to instance-specific private data, which is private in that it is opaque to the file-system-independent code. For example, in the case of HFS Plus, mnt_data points to an hfsmount structure, which we will discuss in Chapter 12. Similarly, a vnode contains a private data pointer (v_data) that points to a file-system-specific per-file structure, for example, the cnode and inode structures in the case of HFS Plus and UFS, respectively.

Because of the arrangement shown in Figure 11-12, the code outside of the VFS layer usually need not worry about file system differences. Incoming file and file system operations are routed through the vnode and mount structures, respectively, to the appropriate file systems.
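The routing through a vnode and its private data pointer can be sketched in user space as follows. This is a minimal model under stated assumptions: toy_vnode, toy_vnodeops, toy_cnode, toy_inode, and the single vnop_getsize operation are hypothetical, and the fields of the two per-file structures are invented for illustration; they do not reflect the real cnode or inode layouts.

```c
#include <stddef.h>

/* Hypothetical user-space model; names and fields are illustrative only. */
struct toy_vnode;

/* A one-entry table of vnode operations. */
struct toy_vnodeops {
    long (*vnop_getsize)(struct toy_vnode *vp);
};

/* What the file-system-independent code sees. */
struct toy_vnode {
    const struct toy_vnodeops *v_op; /* shared by all vnodes of one type */
    void *v_data;                    /* opaque per-file private data */
};

/* Per-file structures known only to their own file systems. */
struct toy_cnode { long c_blocks; long c_blocksize; };
struct toy_inode { long i_size; };

/* Each file system interprets v_data in its own way. */
static long toyhfs_getsize(struct toy_vnode *vp)
{
    struct toy_cnode *cp = vp->v_data;
    return cp->c_blocks * cp->c_blocksize;
}

static long toyufs_getsize(struct toy_vnode *vp)
{
    struct toy_inode *ip = vp->v_data;
    return ip->i_size;
}

static const struct toy_vnodeops toyhfs_vnodeops = { toyhfs_getsize };
static const struct toy_vnodeops toyufs_vnodeops = { toyufs_getsize };

/* Generic entry point: never looks inside v_data. */
long toy_vnode_getsize(struct toy_vnode *vp)
{
    return vp->v_op->vnop_getsize(vp);
}
```

The caller of toy_vnode_getsize() cannot tell which file system a vnode belongs to; only the function reached through v_op knows how to cast and interpret v_data.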

Technically, code outside the VFS layer should see the vnode and mount structures as opaque handles. The kernel uses vnode_t and mount_t, respectively, as the corresponding opaque types.

Figure 11-13 shows a more detailed view of key vnode/vfs data structures. The mountlist global variable is the head of a list of mount structures, one per mounted file system. Each mount structure has a list of associated vnodes. In fact, it has multiple lists: the mnt_workerqueue and mnt_newvnodes lists are used when iterating over all vnodes in the file system. Note that the details shown correspond to a mounted HFS Plus file system.
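The two-level arrangement (a list of mounts, each with a list of vnodes) can be walked as in the sketch below. It is a simplification under stated assumptions: the kernel's actual list types are not shown in this section, so plain singly linked lists are used instead, only one vnode list per mount is modeled, and toy_mount, toy_vnode, their link fields, and toy_count_vnodes() are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical simplified structures using plain singly linked lists. */
struct toy_vnode {
    struct toy_vnode *v_next;       /* next vnode on the same mount */
};

struct toy_mount {
    struct toy_mount *mnt_next;     /* next mounted file system */
    struct toy_vnode *mnt_vnodes;   /* head of this mount's vnode list */
};

/* Count every vnode reachable from the head of the mount list. */
size_t toy_count_vnodes(const struct toy_mount *mountlist)
{
    size_t count = 0;
    const struct toy_mount *mp;
    const struct toy_vnode *vp;

    for (mp = mountlist; mp != NULL; mp = mp->mnt_next)
        for (vp = mp->mnt_vnodes; vp != NULL; vp = vp->v_next)
            count++;

    return count;
}
```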

Figure 11-13. A mounted file system and its vnodes

The kernel maintains an in-memory vfstable structure [bsd/sys/mount_internal.h] for each file system type supported. The global variable vfsconf points to a list of these structures. When there is a mount request, the kernel searches this list to identify the appropriate file system. Figure 11-14 shows an overview of the vfsconf list, which is declared in bsd/vfs/vfs_conf.c.
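The search performed on a mount request can be pictured as a walk over a linked list keyed by file system name. The following is a minimal sketch, assuming a cut-down toy_vfstable structure with only three fields (named after the vfc_ fields that the user-visible vfsconf structure exposes); toy_vfs_lookup() is a hypothetical helper, not a kernel function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cut-down stand-in for the kernel's vfstable structure. */
struct toy_vfstable {
    const char          *vfc_name;     /* file system type name */
    int                  vfc_typenum;  /* file system type number */
    struct toy_vfstable *vfc_next;     /* next entry in the list */
};

/* Return the entry whose name matches fstypename, or NULL if none does. */
struct toy_vfstable *
toy_vfs_lookup(struct toy_vfstable *head, const char *fstypename)
{
    struct toy_vfstable *vfsp;

    for (vfsp = head; vfsp != NULL; vfsp = vfsp->vfc_next) {
        if (strcmp(vfsp->vfc_name, fstypename) == 0)
            return vfsp;
    }

    return NULL; /* unknown file system type */
}
```

A NULL result corresponds to the case in which no matching file system type is registered, for example, because the file system has not been loaded into the kernel.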

Figure 11-14. Configuration information for file system types supported by the kernel

There also exists a user-visible vfsconf structure (not a list), which contains a subset of the information contained in the corresponding vfstable structure. The CTL_VFS → VFS_CONF sysctl operation can be used to retrieve the vfsconf structure for a given file system type. The program in Figure 11-15 retrieves and displays information about all file system types supported by the running kernel.

Figure 11-15. Displaying information about all available file system types

// lsvfsconf.c

#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/sysctl.h>
#include <sys/errno.h>

void
print_flags(int f)
{
    if (f & MNT_LOCAL)    // file system is stored locally
        printf("local ");
    if (f & MNT_DOVOLFS)  // supports volfs
        printf("volfs ");
    printf("\n");
}

int
main(void)
{
    int    i, ret, val;
    size_t len;
    int    mib[4];
    struct vfsconf vfsconf;

    mib[0] = CTL_VFS;

    mib[1] = VFS_NUMMNTOPS; // retrieve number of mount/unmount operations
    len = sizeof(int);
    if ((ret = sysctl(mib, 2, &val, &len, NULL, 0)) < 0)
        goto out;
    printf("%d mount/unmount operations across all VFSs\n\n", val);

    mib[1] = VFS_GENERIC;
    mib[2] = VFS_MAXTYPENUM; // retrieve highest defined file system type
    len = sizeof(int);
    if ((ret = sysctl(mib, 3, &val, &len, NULL, 0)) < 0)
        goto out;

    mib[2] = VFS_CONF; // retrieve vfsconf for each type
    printf("name        typenum refcount mountroot next     flags\n");
    printf("----        ------- -------- --------- ----     -----\n");
    for (i = 0; i < val; i++) {
        mib[3] = i;
        len = sizeof(vfsconf); // sysctl() may modify len, so reset it
        if ((ret = sysctl(mib, 4, &vfsconf, &len, NULL, 0)) != 0) {
            if (errno != ENOTSUP) // if error is ENOTSUP, let us ignore it
                goto out;
            ret = 0; // an ignored error must not be reported at exit
        } else {
            printf("%-11s %-7d %-8d %#09lx %#08lx ",
                   vfsconf.vfc_name, vfsconf.vfc_typenum, vfsconf.vfc_refcount,
                   (unsigned long)vfsconf.vfc_mountroot,
                   (unsigned long)vfsconf.vfc_next);
            print_flags(vfsconf.vfc_flags);
        }
    }

out:
    if (ret)
        perror("sysctl");

    exit(ret);
}

$ gcc -Wall -o lsvfsconf lsvfsconf.c
$ ./lsvfsconf
14 mount/unmount operations across all VFSs

name        typenum refcount mountroot next     flags
----        ------- -------- --------- ----     -----
ufs         1       0        0x020d5e8 0x367158 local
nfs         2       4        0x01efcfc 0x3671e8
fdesc       7       1        000000000 0x367278
cd9660      14      0        0x0112d90 0x3671a0 local
union       15      0        000000000 0x367230
hfs         17      2        0x022bcac 0x367110 local, volfs
volfs       18      1        000000000 0x3672c0
devfs       19      1        000000000 00000000

Note that the program output in Figure 11-15 would contain additional file system types if new file systems (such as MS-DOS and NTFS) were dynamically loaded into the kernel.

The vnode structure is declared in bsd/vfs/vnode_internal.h. Its internals are private to the VFS layer, although the VFS KPI provides several functions to access and manipulate vnode structures. vnode_internal.h also declares the vnodeop_desc structure, an instance of which describes a single vnode operation such as "lookup," "create," or "open." The file bsd/vfs/vnode_if.c contains the declaration of a vnodeop_desc structure for each vnode operation known to the VFS layer, as shown in this example.

struct vnodeop_desc vnop_mknod_desc = {
    0, // offset in the operations vector (initialized by vfs_op_init())
    "vnop_mknod", // a human-readable name -- for debugging
    0 | VDESC_VP0_WILLRELE | VDESC_VPP_WILLRELE, // flags
    // various offsets used by the nullfs bypass routine (unused in Mac OS X)
    ...
};

The shell script bsd/vfs/vnode_if.sh parses an input file (bsd/vfs/vnode_if.src) to automatically generate bsd/vfs/vnode_if.c and bsd/sys/vnode_if.h. The input file contains a specification of each vnode operation descriptor.

A vnodeop_desc structure is referred to by a vnodeopv_entry_desc [bsd/sys/vnode.h] structure, which represents a single entry in a vector of vnode operations.

// bsd/sys/vnode.h

struct vnodeopv_entry_desc {
    struct vnodeop_desc *opve_op; // which operation this is
    int (*opve_impl)(void *);     // code implementing this operation
};

The vnodeopv_desc structure [bsd/sys/vnode.h] describes a vector of vnode operations: it contains a pointer to a null-terminated list of vnodeopv_entry_desc structures.

// bsd/sys/vnode.h

struct vnodeopv_desc {
    int (***opv_desc_vector_p)(void *);
    struct vnodeopv_entry_desc *opv_desc_ops;
};

Figure 11-16 shows how vnode operation data structures are maintained in the VFS layer. There is a vnodeopv_desc for each supported file system. The file bsd/vfs/vfs_conf.c declares a list of vnodeopv_desc structures for built-in file systems.

Figure 11-16. Vnode operations vectors in the VFS layer

// bsd/vfs/vfs_conf.c

extern struct vnodeopv_desc ffs_vnodeop_opv_desc;
...
extern struct vnodeopv_desc hfs_vnodeop_opv_desc;
extern struct vnodeopv_desc hfs_specop_opv_desc;
extern struct vnodeopv_desc hfs_fifoop_opv_desc;
...

struct vnodeopv_desc *vfs_opv_descs[] = {
    &ffs_vnodeop_opv_desc,
    ...
    &hfs_vnodeop_opv_desc,
    &hfs_specop_opv_desc,
    &hfs_fifoop_opv_desc,
    ...
    NULL
};

Typically, each vnodeopv_desc is declared in a file-system-specific file. For example, bsd/hfs/hfs_vnops.c declares hfs_vnodeop_opv_desc.

// bsd/hfs/hfs_vnops.c

struct vnodeopv_desc hfs_vnodeop_opv_desc =
    { &hfs_vnodeop_p, hfs_vnodeop_entries };

hfs_vnodeop_entries, a null-terminated list of vnodeopv_entry_desc structures, is declared in bsd/hfs/hfs_vnops.c as well.

// bsd/hfs/hfs_vnops.c

#define VOPFUNC int (*)(void *)

struct vnodeopv_entry_desc hfs_vnodeop_entries[] = {
    { &vnop_default_desc, (VOPFUNC)vn_default_error },    // default
    { &vnop_lookup_desc,  (VOPFUNC)hfs_vnop_lookup  },    // lookup
    { &vnop_create_desc,  (VOPFUNC)hfs_vnop_create  },    // create
    { &vnop_mknod_desc,   (VOPFUNC)hfs_vnop_mknod   },    // mknod
    ...
    { NULL, (VOPFUNC)NULL }
};

During bootstrapping, bsd_init() [bsd/kern/bsd_init.c] calls vfsinit() [bsd/vfs/vfs_init.c] to initialize the VFS layer. Section 5.7.2 enumerates the important operations performed by vfsinit(). It calls vfs_op_init() [bsd/vfs/vfs_init.c] to set known vnode operation vectors to an initial state.

// bsd/vfs/vfs_init.c

void
vfs_op_init()
{
    int i;

    // Initialize each vnode operation vector to NULL
    // struct vnodeopv_desc *vfs_opv_descs[]
    for (i = 0; vfs_opv_descs[i]; i++)
        *(vfs_opv_descs[i]->opv_desc_vector_p) = NULL;

    // Initialize the offset value in each vnode operation descriptor
    // struct vnodeop_desc *vfs_op_descs[]
    for (vfs_opv_numops = 0, i = 0; vfs_op_descs[i]; i++) {
        vfs_op_descs[i]->vdesc_offset = vfs_opv_numops;
        vfs_opv_numops++;
    }
}

Next, vfsinit() calls vfs_opv_init() [bsd/vfs/vfs_init.c] to populate the operations vectors. vfs_opv_init() iterates over each element of vfs_opv_descs, checking whether the opv_desc_vector_p field of each entry points to a NULL. If so, it allocates the vector before populating it. Figure 11-17 shows the operation of vfs_opv_init().

Figure 11-17. Initialization of vnode operations vectors during bootstrap

// bsd/vfs/vfs_init.c

void
vfs_opv_init()
{
    int i, j, k;
    int (***opv_desc_vector_p)(void *);
    int (**opv_desc_vector)(void *);
    struct vnodeopv_entry_desc *opve_descp;

    for (i = 0; vfs_opv_descs[i]; i++) {
        opv_desc_vector_p = vfs_opv_descs[i]->opv_desc_vector_p;

        if (*opv_desc_vector_p == NULL) {
            // allocate and zero out *opv_desc_vector_p
            ...
        }

        opv_desc_vector = *opv_desc_vector_p;

        for (j = 0; vfs_opv_descs[i]->opv_desc_ops[j].opve_op; j++) {
            opve_descp = &(vfs_opv_descs[i]->opv_desc_ops[j]);

            // sanity-check operation offset (panic if it is 0 for an
            // operation other than the default operation)

            // populate the entry
            opv_desc_vector[opve_descp->opve_op->vdesc_offset] =
                    opve_descp->opve_impl;
        }
    }

    // replace unpopulated routines with defaults
    ...
}
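To see the two initialization passes work end to end, the following user-space sketch mimics them with three operations and one toy file system. It is a model under stated assumptions, not the kernel code: every toy_ name is hypothetical, malloc() stands in for the kernel's allocation, the sanity check is omitted, and the "replace unpopulated routines with defaults" step is assumed to copy the default operation's slot (slot 0 here, because the default descriptor is listed first).

```c
#include <stdlib.h>

typedef int (*toy_vopfunc)(void *);

/* Hypothetical reduced versions of the structures discussed above. */
struct toy_vnodeop_desc { int vdesc_offset; const char *vdesc_name; };

struct toy_vnodeopv_entry_desc {
    struct toy_vnodeop_desc *opve_op;   /* which operation this is */
    toy_vopfunc              opve_impl; /* code implementing it */
};

struct toy_vnodeopv_desc {
    toy_vopfunc                    **opv_desc_vector_p;
    struct toy_vnodeopv_entry_desc  *opv_desc_ops;
};

/* Every operation known to the toy layer; the default one comes first. */
static struct toy_vnodeop_desc toy_default_desc = { 0, "default" };
static struct toy_vnodeop_desc toy_lookup_desc  = { 0, "lookup" };
static struct toy_vnodeop_desc toy_create_desc  = { 0, "create" };
static struct toy_vnodeop_desc *toy_op_descs[] = {
    &toy_default_desc, &toy_lookup_desc, &toy_create_desc, NULL
};
static int toy_numops;

/* One toy file system that implements lookup but not create. */
static int toy_default_error(void *arg) { (void)arg; return -1; }
static int toyfs_lookup(void *arg)      { (void)arg; return 100; }

static toy_vopfunc *toyfs_vnodeop_p;
static struct toy_vnodeopv_entry_desc toyfs_entries[] = {
    { &toy_default_desc, toy_default_error },
    { &toy_lookup_desc,  toyfs_lookup },
    { NULL, NULL }
};
static struct toy_vnodeopv_desc toyfs_opv_desc =
    { &toyfs_vnodeop_p, toyfs_entries };
static struct toy_vnodeopv_desc *toy_opv_descs[] = { &toyfs_opv_desc, NULL };

/* Pass 1: reset each vector and number the operations. */
void toy_op_init(void)
{
    int i;

    for (i = 0; toy_opv_descs[i]; i++)
        *(toy_opv_descs[i]->opv_desc_vector_p) = NULL;

    for (toy_numops = 0, i = 0; toy_op_descs[i]; i++)
        toy_op_descs[i]->vdesc_offset = toy_numops++;
}

/* Pass 2: allocate each vector and fill it in; returns 0 on success. */
int toy_opv_init(void)
{
    int i, j, k;

    for (i = 0; toy_opv_descs[i]; i++) {
        toy_vopfunc **vecp = toy_opv_descs[i]->opv_desc_vector_p;
        toy_vopfunc *vec;

        if (*vecp == NULL) {
            *vecp = malloc((size_t)toy_numops * sizeof(toy_vopfunc));
            if (*vecp == NULL)
                return -1;
            for (k = 0; k < toy_numops; k++)
                (*vecp)[k] = NULL;
        }
        vec = *vecp;

        /* Place each implementation at its operation's offset. */
        for (j = 0; toy_opv_descs[i]->opv_desc_ops[j].opve_op; j++) {
            struct toy_vnodeopv_entry_desc *e =
                &toy_opv_descs[i]->opv_desc_ops[j];
            vec[e->opve_op->vdesc_offset] = e->opve_impl;
        }

        /* Assumed default policy: empty slots reuse the default routine. */
        for (k = 0; k < toy_numops; k++)
            if (vec[k] == NULL)
                vec[k] = vec[0];
    }
    return 0;
}

/* Dispatch: index the vector by the operation's assigned offset. */
int toy_call(toy_vopfunc *vec, struct toy_vnodeop_desc *desc, void *arg)
{
    return vec[desc->vdesc_offset](arg);
}
```

After toy_op_init() and toy_opv_init() have run, toy_call(toyfs_vnodeop_p, &toy_lookup_desc, NULL) reaches toyfs_lookup(), whereas a call through toy_create_desc falls back to toy_default_error() because the toy file system supplies no create routine.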

Figure 11-16 shows an interesting feature of the FreeBSD-derived VFS layer: there can be multiple vnode operations vectors for a given vnodeopv_desc.