6.2. Linux Virtual Filesystem

The implementation of filesystems varies from system to system. For example, the way a file maps to disk blocks in Windows differs from the way a file maps to disk blocks in a UNIX filesystem. In fact, Microsoft has several filesystem implementations that correspond to its various operating systems: MS-DOS for DOS and Win 3.x, VFAT for Windows 9x, and NTFS for Windows NT. UNIX operating systems also have various implementations, such as SYSV and MINIX. Linux specifically uses filesystems such as ext2, ext3, and ReiserFS.

One of the best attributes of Linux is the many filesystems it supports. Not only can you view files from its own filesystems (ext2, ext3, and ReiserFS), but you can also view files from filesystems pertaining to other operating systems. On a single Linux system, you can access files stored in numerous different formats. Table 6.1 lists some of the currently supported filesystems. To a user, there is no difference between one filesystem and another; the user can mount any of the supported filesystems anywhere in the existing tree namespace.

Table 6.1. Some of the Linux Supported Filesystems
Filesystem Name	Description
ext2	Second extended filesystem
ext3	ext3 journaling filesystem
Reiserfs	Journaling filesystem
JFS	IBM's journaled filesystem
XFS	SGI Irix's high-performance journaling filesystem
MINIX	Original Linux filesystem, minix OS filesystem
ISO9660	CD-ROM filesystem
JOLIET	Microsoft CD-ROM filesystem extensions
UDF	Alternative CD-ROM, DVD filesystem
MSDOS	Microsoft Disk Operating System
VFAT	Windows 95 Virtual File Allocation Table
NTFS	Windows NT, 2000, XP, 2003 filesystem
ADFS	Acorn Disk filesystem
HFS	Apple Macintosh filesystem
BEFS	BeOS filesystem
FreeVxfs	Veritas Vxfs support
HPFS	OS/2 support
SysVfs	System V filesystem support
NFS	Networking filesystem support
AFS	Andrew filesystem (also networking)
UFS	BSD filesystem support
NCP	NetWare filesystem
SMB	Samba

Linux supports more than on-disk filesystems. It also supports network-mounted filesystems and special filesystems that are used for things other than managing disk space. For example, procfs is a pseudo filesystem. This virtual filesystem provides information about different aspects of your system. A procfs filesystem does not take up hard disk space and files are created on the fly upon access. Another such filesystem is devfs,^[4] which provides an interface to device drivers.

^[4] In Linux 2.6, devfs has been made obsolete by udev, although minimal support is still available. For more information on udev, go to http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev-FAQ.

Linux achieves this "masquerading" of the physical filesystem specifics by introducing an intermediate layer of abstraction between user space and the physical filesystem. This layer is known as the virtual filesystem (VFS). It separates the filesystem-specific structures and functions from the rest of the kernel. The VFS manages the filesystem-related system calls and translates them to the appropriate filesystem type functions. Figure 6.3 gives an overview of the filesystem-management structure.

Figure 6.3. Linux VFS

The user application accesses the generic VFS through system calls. Each supported filesystem must have an implementation of a set of functions that perform the VFS-supported operations (for example, open, read, write, and close). The VFS keeps track of the filesystems it supports and the functions that perform each of the operations. You know from Chapter 5 that a generic block device layer exists between the filesystem and the actual device driver. This provides a layer of abstraction that allows the implementation of the filesystem-specific code to be independent of the specific device it eventually accesses.
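As a rough illustration of this dispatch, the following user-space C sketch keeps a small registry of filesystem types, each carrying its own table of function pointers. All the names used here (demo_fs_ops, demo_register_fs, demo_vfs_read, and so on) are hypothetical and are not kernel APIs; the sketch only shows the general shape of calling a filesystem-specific function through a generic entry point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-filesystem operations table (not a kernel structure). */
struct demo_fs_ops {
    /* Copies up to len bytes of "file contents" into buf; returns count. */
    size_t (*read)(char *buf, size_t len);
};

/* One registered filesystem type: a name plus its operations table. */
struct demo_fs_type {
    const char *name;
    const struct demo_fs_ops *ops;
    struct demo_fs_type *next;   /* singly linked registry, for brevity */
};

static struct demo_fs_type *demo_fs_list;   /* head of the registry */

/* Adds a filesystem type to the front of the registry. */
void demo_register_fs(struct demo_fs_type *fs)
{
    assert(fs != NULL && fs->ops != NULL);
    fs->next = demo_fs_list;
    demo_fs_list = fs;
}

/* Looks up a filesystem type by name; returns NULL if it is unknown. */
static struct demo_fs_type *demo_find_fs(const char *name)
{
    struct demo_fs_type *fs;

    for (fs = demo_fs_list; fs != NULL; fs = fs->next)
        if (strcmp(fs->name, name) == 0)
            return fs;
    return NULL;
}

/* Generic entry point: picks the table by type and calls through it. */
long demo_vfs_read(const char *fsname, char *buf, size_t len)
{
    struct demo_fs_type *fs = demo_find_fs(fsname);

    if (fs == NULL || fs->ops->read == NULL)
        return -1;               /* unsupported filesystem or operation */
    return (long)fs->ops->read(buf, len);
}

/* Shared helper: copies at most len bytes of src into buf. */
static size_t demo_copy(char *buf, size_t len, const char *src)
{
    size_t n = strlen(src);

    if (n > len)
        n = len;
    memcpy(buf, src, n);
    return n;
}

/* Two toy "filesystems" that differ only in what read() returns. */
static size_t demo_ext2_read(char *buf, size_t len)
{
    return demo_copy(buf, len, "ext2 data");
}

static size_t demo_vfat_read(char *buf, size_t len)
{
    return demo_copy(buf, len, "vfat data");
}

static const struct demo_fs_ops demo_ext2_ops = { demo_ext2_read };
static const struct demo_fs_ops demo_vfat_ops = { demo_vfat_read };
struct demo_fs_type demo_ext2 = { "ext2", &demo_ext2_ops, NULL };
struct demo_fs_type demo_vfat = { "vfat", &demo_vfat_ops, NULL };
```

After both types are registered with demo_register_fs(), a call such as demo_vfs_read("ext2", buf, sizeof(buf)) reaches demo_ext2_read() without the caller knowing which function sits behind the name. A request for an unregistered type returns -1.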

6.2.1. VFS Data Structures

The VFS relies on data structures to hold its generic representation of a filesystem.

The data structures are as follows:

superblock structure. Stores information relating to mounted filesystems
inode structure. Stores information relating to files
file structure. Stores information related to files opened by a process
dentry structure. Stores information related to a pathname and the file pointed to

In addition to these structures, the VFS also uses additional structures such as vfsmount and nameidata, which hold mounting information and pathname lookup information, respectively. We see how these two structures relate to the main ones just described, although we do not independently cover them.

The structures that compose the VFS are associated with actions that can be applied on the object represented by the structure. These actions are defined in a table of operations for each object. The tables of operations are lists of function pointers. We define the operations table for each object as we describe them. We now closely look at each of these structures. (Note that we do not focus on any locking mechanisms for the purposes of clarity and brevity.)

6.2.1.1. superblock Structure

When a filesystem is mounted, all information concerning it is stored in the super_block struct. One superblock structure exists for every mounted filesystem. We show the structure definition followed by explanations of some of the more important fields:

-----------------------------------------------------------------------
include/linux/fs.h
666  struct super_block {
667   struct list_head           s_list;
668   dev_t                      s_dev;
669   unsigned long              s_blocksize;
670   unsigned long              s_old_blocksize;
671   unsigned char              s_blocksize_bits;
672   unsigned char              s_dirt;
673   unsigned long long         s_maxbytes;
674   struct file_system_type    *s_type;
675   struct super_operations    *s_op;
676   struct dquot_operations    *dq_op;
677   struct quotactl_ops        *s_qcop;
678   struct export_operations   *s_export_op;
679   unsigned long              s_flags;
680   unsigned long              s_magic;
681   struct dentry              *s_root;
682   struct rw_semaphore        s_umount;
683   struct semaphore           s_lock;
684   int                        s_count;
685   int                        s_syncing;
686   int                        s_need_sync_fs;
687   atomic_t                   s_active;
688   void                       *s_security;
689
690   struct list_head           s_dirty;
691   struct list_head           s_io;
692   struct hlist_head          s_anon;
693   struct list_head           s_files;
694
695   struct block_device        *s_bdev;
696   struct list_head           s_instances;
697   struct quota_info          s_dquot;
698
699   char                       s_id[32];
700
701   struct kobject             kobj;
702   void                       *s_fs_info;
...
708   struct semaphore           s_vfs_rename_sem;
709  };
-----------------------------------------------------------------------

Line 667

The s_list field is of type list_head,^[5] which holds pointers to the next and previous elements in the circular doubly linked list in which this super_block is embedded. Like many other structures in the Linux kernel, the super_block structs are maintained in a circular doubly linked list. The list_head datatype contains pointers to two other list_heads: the list_head of the next superblock object and the list_head of the previous superblock object. (The global variable super_blocks (fs/super.c) points to the first element in the list.)

^[5] Chapter 2, "Exploration Toolkit," describes the list_head datatype in detail.
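To make the embedding concrete, here is a minimal user-space sketch of a circular doubly linked list built from an embedded list_head. The list_head layout mirrors the description above, but the helper names (demo_list_init, demo_list_add, demo_list_entry) and the demo_super_block type are our own stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal user-space re-creation of the list_head idea. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head whose next and prev point back at itself. */
void demo_list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Inserts entry right after head, keeping the list circular. */
void demo_list_add(struct list_head *entry, struct list_head *head)
{
    assert(entry != NULL && head != NULL);
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Recovers the enclosing structure from a pointer to its list_head. */
#define demo_list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for super_block: only the fields we need. */
struct demo_super_block {
    struct list_head s_list;     /* links this superblock into the list */
    int s_dev;                   /* device identifier (plain int here) */
};

/* Counts the superblocks reachable from head by walking ->next. */
int demo_count_supers(const struct list_head *head)
{
    const struct list_head *pos;
    int n = 0;

    for (pos = head->next; pos != head; pos = pos->next)
        n++;
    return n;
}
```

Because the links live inside each structure, the list code never needs to know what it is linking. demo_list_entry() subtracts the offset of the s_list member to get back from a list_head pointer to the enclosing demo_super_block.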

Line 672

On disk-based filesystems, the superblock structure is filled with information originally maintained in a special disk sector that is loaded into the superblock structure. Because the VFS allows editing of fields in the superblock structure, the information in the superblock structure can find itself out of sync with the on-disk data. This field identifies that the superblock structure has been edited and needs to sync up with the disk.

Line 673

This field of type unsigned long long defines the maximum file size allowed in the filesystem.

Line 674

The superblock structure contains general filesystem information. However, it needs to be associated with the specific filesystem information (for example, MSDOS, ext2, MINIX, and NFS). The file_system_type structure holds filesystem-specific information, one for each type of filesystem configured into the kernel. This field points to the appropriate filesystem-specific struct and is how the VFS manages the interaction from general request to specific filesystem operation.

Figure 6.4 shows the relation between the superblock and the file_system_type structures. We show how the superblock->s_type field points to the appropriate file_system_type struct in the file_systems list. (In the "Global and Local List References" section later in this chapter, we show what the file_systems list is.)

Figure 6.4. Relation Between superblock and file_system_type

Line 675

The field is a pointer of type super_operations struct. This datatype holds the table of superblock operations. The super_operations struct itself holds function pointers that are initialized with the particular filesystem's superblock operations. The next section explains super_operations in more detail.

Line 681

This field is a pointer to a dentry struct. The dentry struct holds the pathname of a file. This particular dentry object is the one associated with the mount directory whose superblock this belongs to.

Line 690

The s_dirty field (not to be confused with s_dirt) is a list_head struct that points to the first and last elements in the list of dirty inodes belonging to this filesystem.

Line 693

The s_files field is a list_head struct that points to the first element of a list of file structs that are both in use and assigned to the superblock. In the "file Structure" section, you see that this is one of the three lists in which a file structure can find itself.

Line 696

The s_instances field is a list_head structure that points to the adjacent superblock elements in the list of superblocks with the same filesystem type. The head of this list is referenced by the fs_supers field of the file_system_type structure.

Line 702

This void * data type points to additional superblock information that is specific to a particular filesystem (for example, ext3_sb_info). This acts as a sort of catch-all for any superblock data on disk for that specific filesystem that was not abstracted out into the virtual filesystem superblock concept.

6.2.1.2. superblock Operations

The s_op field of the superblock points to a table of operations that the filesystem's superblock can perform. This list is specific to each filesystem because it operates directly on the filesystem's implementation. The table of operations is stored in a structure of type super_operations:

-----------------------------------------------------------------------
include/linux/fs.h
struct super_operations {
  struct inode *(*alloc_inode)(struct super_block *sb);
  void (*destroy_inode)(struct inode *);
  void (*read_inode) (struct inode *);
  void (*dirty_inode) (struct inode *);
  void (*write_inode) (struct inode *, int);
  void (*put_inode) (struct inode *);
  void (*drop_inode) (struct inode *);
  void (*delete_inode) (struct inode *);
  void (*put_super) (struct super_block *);
  void (*write_super) (struct super_block *);
  int (*sync_fs)(struct super_block *sb, int wait);
  void (*write_super_lockfs) (struct super_block *);
  void (*unlockfs) (struct super_block *);
  int (*statfs) (struct super_block *, struct kstatfs *);
  int (*remount_fs) (struct super_block *, int *, char *);
  void (*clear_inode) (struct inode *);
  void (*umount_begin) (struct super_block *);
  int (*show_options)(struct seq_file *, struct vfsmount *);
};
-----------------------------------------------------------------------

When the superblock of a filesystem is initialized, the s_op field is set to point at the appropriate table of operations. In the "Moving from the Generic to the Specific" section later in this chapter, we show how this table of operations is implemented in the ext2 filesystem. Table 6.2 shows the list of superblock operations. Some of these functions are optional and are only filled in by a subset of the supported filesystems. Those that do not support a particular optional function set the field to NULL in the operations struct.
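The handling of optional operations can be sketched as follows. This is a cut-down, hypothetical table with only two entries; the demo_ names, the counters, and the demo_sync_filesystem() caller are illustrative assumptions rather than kernel code. The point is simply that generic code tests a function pointer against NULL before calling through it.

```c
#include <assert.h>
#include <stddef.h>

struct demo_super;               /* forward declaration */

/* Cut-down, hypothetical operations table with one optional entry. */
struct demo_super_ops {
    void (*write_super)(struct demo_super *sb);          /* provided */
    int (*sync_fs)(struct demo_super *sb, int wait);     /* optional */
};

struct demo_super {
    const struct demo_super_ops *s_op;
    int writes;                  /* counts write_super calls */
    int syncs;                   /* counts sync_fs calls */
};

static void demo_write_super(struct demo_super *sb)
{
    sb->writes++;
}

static int demo_journal_sync_fs(struct demo_super *sb, int wait)
{
    (void)wait;                  /* unused in this sketch */
    sb->syncs++;
    return 0;
}

/* A filesystem that does not support sync_fs leaves the field NULL. */
const struct demo_super_ops demo_simple_ops = {
    demo_write_super, NULL
};

/* A filesystem that does support it fills in both entries. */
const struct demo_super_ops demo_journal_ops = {
    demo_write_super, demo_journal_sync_fs
};

/*
 * Generic caller: checks each pointer before calling through it.
 * Returns 1 if sync_fs was invoked, 0 if the filesystem lacks it.
 */
int demo_sync_filesystem(struct demo_super *sb, int wait)
{
    assert(sb != NULL && sb->s_op != NULL);
    if (sb->s_op->write_super != NULL)
        sb->s_op->write_super(sb);
    if (sb->s_op->sync_fs == NULL)
        return 0;
    sb->s_op->sync_fs(sb, wait);
    return 1;
}
```

A superblock whose s_op points at demo_simple_ops has only its write_super entry called, while one that uses demo_journal_ops gets both calls.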

Table 6.2. Superblock Operations
Superblock Operations Name	Description
`alloc_inode`	New in 2.6. It allocates and initializes a `vfs` inode under the superblock. The specifics of initialization are left up to the particular filesystem. The allocation is done with a call to `kmem_cache_create()` or `kmem_cache_alloc()` (see Chapter 4) on the inode's cache.
`destroy_inode`	New in 2.6. It deallocates the specified inode pertaining to the superblock. The deallocation is done with a call to `kmem_cache_free()`.
`read_inode`	Reads the inode specified by the `inode->i_ino` field. The inode's fields are updated from the on-disk data. Particularly important is `inode->i_op`.
`dirty_inode`	Places an inode in the superblock's dirty inode list. The head and tail of the circular, doubly linked list is referenced by way of the `superblock->s_dirty` field. Figure 6.5 illustrates a superblock's dirty inode list.
`write_inode`	Writes the inode information to disk.
`put_inode`	Releases the inode from the inode cache. It's called by `iput()`.
`drop_inode`	Called when the last access to an inode is dropped.
`delete_inode`	Deletes an inode from disk. Used on inodes that are no longer needed. It's called from `generic_delete_inode()`.
`put_super`	Frees the superblock (for example, when unmounting a filesystem).
`write_super`	Writes the superblock information to disk.
`sync_fs`	Currently used only by ext3, ReiserFS, XFS, and JFS, this function writes out dirty `superblock` struct data to the disk.
`write_super_lockfs`	In use by ext3, JFS, ReiserFS, and XFS, this function blocks changes to the filesystem. It then updates the disk superblock.
`unlockfs`	Reverses the block set by the `write_super_lockfs()` function.
`statfs`	Called to get filesystem statistics.
`remount_fs`	Called when the filesystem is remounted to update any mount options.
`clear_inode`	Releases the inode and all pages associated with it.
`umount_begin`	Called when a mount operation must be interrupted.
`show_options`	Used to get filesystem information from a mounted filesystem.

Figure 6.5. Relation Between Superblock and Inode

This completes our introduction of the superblock structure and its operations. Now, we explore the inode structure in detail.

6.2.1.3. inode Structure

We mentioned that inodes are structures that keep track of file information, such as pointers to the blocks that contain all the file data. Recall that directories, devices, and pipes (for example) are also represented as files in the kernel, so an inode can represent one of them as well. Inode objects exist for the full lifetime of the file and contain data that is maintained on disk.

Inodes are kept in lists to facilitate referencing. One list is a hash table that reduces the time it takes to find a particular inode. An inode also finds itself in one of three types of doubly linked list. Table 6.3 shows the three list types. Figure 6.5 shows the relationship between a superblock structure and its list of dirty inodes.

Table 6.3. Inode Lists
List	i_count	Dirty	Reference Pointer
Valid, unused	i_count = 0	Not dirty	`inode_unused` (global)
Valid, in use	i_count > 0	Not dirty	`inode_in_use` (global)
Dirty inodes	i_count > 0	Dirty	superblock's `s_dirty` field
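The rule behind Table 6.3 can be written as a small decision function. This is a sketch under stated assumptions: the enum and the function name are hypothetical, and because the table lists dirty inodes only with i_count > 0, we assume here that any dirty inode belongs on the superblock's s_dirty list.

```c
#include <assert.h>

/* The three lists from Table 6.3, as a hypothetical enum. */
enum demo_inode_list {
    DEMO_INODE_UNUSED,           /* global inode_unused list */
    DEMO_INODE_IN_USE,           /* global inode_in_use list */
    DEMO_SB_DIRTY                /* the owning superblock's s_dirty list */
};

/*
 * Picks a list from the usage count and the dirty state.
 * Assumption: a dirty inode goes to s_dirty whatever its count is;
 * Table 6.3 only lists the i_count > 0 case for dirty inodes.
 */
enum demo_inode_list demo_classify_inode(int i_count, int dirty)
{
    assert(i_count >= 0);
    if (dirty)
        return DEMO_SB_DIRTY;
    return i_count > 0 ? DEMO_INODE_IN_USE : DEMO_INODE_UNUSED;
}
```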

The inode struct is large and has many fields. The following is a description of a small subset of the inode struct fields:

-----------------------------------------------------------------------
include/linux/fs.h
368  struct inode {
369   struct hlist_node          i_hash;
370   struct list_head           i_list;
371   struct list_head           i_dentry;
372   unsigned long              i_ino;
373   atomic_t                   i_count;
...
390   struct inode_operations    *i_op;
...
392   struct super_block         *i_sb;
...
407   unsigned long              i_state;
...
421  };
-----------------------------------------------------------------------

Line 369

The i_hash field is of type hlist_node.^[6] This contains a pointer to the hash list, which is used for speedy inode lookup. The inode hash list is referenced by the global variable inode_hashtable.

^[6] hlist_node is a type of list pointer for doubly linked lists, much like list_head. The difference is that the list head (type hlist_head) contains a single pointer that points at the first element rather than two (where the second one points at the tail of the list). This reduces overhead for hash tables.
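A user-space sketch of a hash table built from these two types might look like the following. The hlist_node and hlist_head layouts follow the footnote, but the table size, the hash function, and the demo_ helpers are illustrative assumptions and differ from the kernel's own inode hashing.

```c
#include <assert.h>
#include <stddef.h>

/* User-space copies of the two hlist types described in the footnote. */
struct hlist_node {
    struct hlist_node *next;
    struct hlist_node **pprev;   /* address of the pointer that points here */
};

struct hlist_head {
    struct hlist_node *first;    /* single pointer: no tail reference */
};

/* Hypothetical inode stand-in with only the fields used here. */
struct demo_inode {
    struct hlist_node i_hash;
    unsigned long i_ino;
};

#define DEMO_HASH_BUCKETS 8      /* illustrative size, not the kernel's */

struct hlist_head demo_inode_hashtable[DEMO_HASH_BUCKETS];

/* Toy hash function; the kernel's own differs. */
static unsigned int demo_hash(unsigned long ino)
{
    return (unsigned int)(ino % DEMO_HASH_BUCKETS);
}

/* Links an inode at the front of its bucket. */
void demo_insert_inode_hash(struct demo_inode *inode)
{
    struct hlist_head *head;

    assert(inode != NULL);
    head = &demo_inode_hashtable[demo_hash(inode->i_ino)];
    inode->i_hash.next = head->first;
    if (head->first != NULL)
        head->first->pprev = &inode->i_hash.next;
    head->first = &inode->i_hash;
    inode->i_hash.pprev = &head->first;
}

/* Searches only the one bucket that the inode number hashes to. */
struct demo_inode *demo_find_inode(unsigned long ino)
{
    struct hlist_node *pos;

    pos = demo_inode_hashtable[demo_hash(ino)].first;
    for (; pos != NULL; pos = pos->next) {
        struct demo_inode *inode = (struct demo_inode *)
            ((char *)pos - offsetof(struct demo_inode, i_hash));

        if (inode->i_ino == ino)
            return inode;
    }
    return NULL;
}
```

Each bucket costs one pointer instead of two, which matters when the table has many mostly empty buckets.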

Line 370

This field links to the adjacent structures in the inode lists. Inodes can find themselves in one of the three linked lists.

Line 371

This field points to a list of dentry structs that corresponds to the file. The dentry struct contains the pathname pertaining to the file being represented by the inode. A file can have multiple dentry structs if it has multiple aliases.

Line 372

This field holds the unique inode number. When an inode gets allocated within a particular superblock, this number is an automatically incremented value from a previously assigned inode ID. When the superblock operation read_inode() is called, the inode indicated in this field is read from disk.

Line 373

The i_count field is a counter that gets incremented with every inode use. A value of 0 indicates that the inode is unused and a positive value indicates that it is in use.

Line 392

This field holds the pointer to the superblock of the filesystem in which the file resides. Figure 6.5 shows how all the inodes in a superblock's dirty inode list will have their i_sb field pointing to a common superblock.

Line 407

This field corresponds to inode state flags. Table 6.4 lists the possible values.

Table 6.4. Inode States
Inode State Flags	Description
`I_DIRTY_SYNC`	See `I_DIRTY` description.
`I_DIRTY_DATASYNC`	See `I_DIRTY` description.
`I_DIRTY_PAGES`	See `I_DIRTY` description.
`I_DIRTY`	This macro correlates to any of the three `I_DIRTY_` flags. It enables a quick check for any of those flags. The `I_DIRTY` flags indicate that the contents of the inode have been written to and need to be synchronized.
`I_LOCK`	Set when the inode is locked and cleared when the inode is unlocked. An inode is locked when it is first created and when it is involved in I/O transfers.
`I_FREEING`	Gets set when an inode is being removed. This flag serves the purpose of tagging the inode as unusable as it is being deleted so no one takes a new reference to it.
`I_CLEAR`	Indicates that the inode is no longer useful.
`I_NEW`	Gets set upon inode creation. The flag gets removed the first time the new inode is unlocked.
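The way a combined macro such as I_DIRTY enables a quick check can be sketched with a few bit masks. The DEMO_ constants below use illustrative bit values that we chose for this example; consult include/linux/fs.h for the values the kernel actually assigns.

```c
#include <assert.h>

/* Illustrative bit values; check include/linux/fs.h for the real ones. */
#define DEMO_I_DIRTY_SYNC      1
#define DEMO_I_DIRTY_DATASYNC  2
#define DEMO_I_DIRTY_PAGES     4
#define DEMO_I_LOCK            8

/* One mask that covers all three dirty flags, in the manner of I_DIRTY. */
#define DEMO_I_DIRTY \
    (DEMO_I_DIRTY_SYNC | DEMO_I_DIRTY_DATASYNC | DEMO_I_DIRTY_PAGES)

/* Returns 1 if any of the three dirty bits is set in i_state. */
int demo_inode_is_dirty(unsigned long i_state)
{
    return (i_state & DEMO_I_DIRTY) != 0;
}

/* Clears every dirty bit but leaves other state, such as the lock bit. */
unsigned long demo_inode_mark_clean(unsigned long i_state)
{
    unsigned long clean = i_state & ~(unsigned long)DEMO_I_DIRTY;

    assert(!demo_inode_is_dirty(clean));
    return clean;
}
```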

An inode with the I_LOCK or I_DIRTY flags set finds itself in the inode_in_use list. Without either of these flags, it is added to the inode_unused list.

6.2.1.4. dentry Structure

The dentry structure represents a directory entry and the VFS uses it to keep track of relations based on directory naming, organization, and logical layout of files. Each dentry object corresponds to a component in a pathname and associates other structures and information that relates to it. For example, in the path /home/lkp/Chapter06.txt, there is a dentry created for /, home, lkp, and Chapter06.txt. Each dentry has a reference to that component's inode, superblock, and related information. Figure 6.6 illustrates the relationship between the superblock, the inode, and the dentry structs.

Figure 6.6. Relations Between superblock, dentry, and inode

We now look at some of the fields of the dentry struct:

-----------------------------------------------------------------------
include/linux/dcache.h
81  struct dentry {
...
85   struct inode              * d_inode;
86   struct list_head          d_lru;
87   struct list_head          d_child;    /* child of parent list */
88   struct list_head          d_subdirs;  /* our children */
89   struct list_head          d_alias;
90   unsigned long             d_time;     /* used by d_revalidate */
91   struct dentry_operations  *d_op;
92   struct super_block        * d_sb;
...
100   struct dentry            * d_parent;
...
105  } ____cacheline_aligned;
-----------------------------------------------------------------------

Line 85

The d_inode field points to the inode corresponding with the file associated with the dentry. In the case that the pathname component corresponding with the dentry does not have an associated inode, the value is NULL.

Lines 85-88

These are the pointers to the adjacent elements in the dentry lists. A dentry object can find itself in one of the kinds of lists shown in Table 6.5.

Table 6.5. Dentry Lists
Listname	List Pointer	Description
Used dentrys	`d_alias`	The inode with which these dentrys are associated points to the head of the list via the `i_dentry` field.
Unused dentrys	`d_lru`	These dentrys are no longer in use but are kept around in case the same components are accessed in a pathname.

Line 91

The d_op field points to the table of dentry operations.

Line 92

This is a pointer to the superblock associated with the component represented by the dentry. Refer to Figure 6.6 to see how a dentry is associated with a superblock struct.

Line 100

This field holds a pointer to the parent dentry, or the dentry corresponding to the parent component in the pathname. For example, in the pathname /home/paul, the d_parent field of the dentry for paul points to the dentry for home, and the d_parent field of this dentry in turn points to the dentry for /.
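The parent chain in this example can be sketched in user space as follows. The demo_dentry type keeps only a name and a parent pointer, and both helper functions are hypothetical. We assume here that the root dentry is its own parent, which gives the walk a place to stop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical dentry stand-in: a component name and a parent link. */
struct demo_dentry {
    const char *d_name;
    struct demo_dentry *d_parent;   /* the root is its own parent here */
};

/* Counts the steps from a dentry up to the root by following d_parent. */
int demo_dentry_depth(const struct demo_dentry *d)
{
    int depth = 0;

    while (d->d_parent != d) {
        d = d->d_parent;
        depth++;
    }
    return depth;
}

/*
 * Rebuilds the absolute pathname by walking up to the root and filling
 * buf from the end. Returns a pointer into buf, or NULL if buf is too
 * small to hold the whole path.
 */
char *demo_dentry_path(const struct demo_dentry *d, char *buf, size_t len)
{
    char *p;

    assert(len > 1);
    p = buf + len - 1;
    *p = '\0';
    if (d->d_parent == d) {          /* the root itself is just "/" */
        *--p = '/';
        return p;
    }
    while (d->d_parent != d) {
        size_t n = strlen(d->d_name);

        if ((size_t)(p - buf) < n + 1)
            return NULL;             /* no room for "/" plus the name */
        p -= n;
        memcpy(p, d->d_name, n);
        *--p = '/';
        d = d->d_parent;
    }
    return p;
}
```

With dentries for /, home, and paul linked as described above, demo_dentry_depth() on paul counts two steps to the root, and demo_dentry_path() reassembles the string /home/paul.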

6.2.1.5. file Structure

Another structure that the VFS uses is the file structure. When a process manipulates a file, the file structure is the datatype the VFS uses to hold information regarding the process/file association. Unlike other structures, no original on-disk data is held by a file structure; file structures are created on the fly when the open() syscall is issued and are destroyed when the close() syscall is issued. Recall from Chapter 3 that throughout the lifetime of a process, the file structures representing files opened by the process are referenced through the process descriptor (the task_struct). Figure 6.7 illustrates how the file structure associates with the other VFS structures. The task_struct points to the file descriptor table, which holds a list of pointers to all the file descriptors that the process has opened. Recall that the first three entries in the descriptor table correspond to the file descriptors for stdin, stdout, and stderr, respectively.

Figure 6.7. File Objects

The kernel keeps file structures in circular doubly linked lists. There are three lists in which a file structure can find itself embedded depending on its usage and assignment. Table 6.6 describes the three lists.

Table 6.6. File Lists
Name	Reference Pointer to Head of List	Description
The free file object list	Global variable `free_list`	A doubly linked list composed of all file objects that are available. The size of this list is always at least `NR_RESERVED_FILES` large.
The in-use but unassigned file object list	Global variable `anon_list`	A doubly linked list composed of all file objects that are being used but have not been assigned to a superblock.
Superblock file object list	Superblock field `s_files`	A doubly linked list composed of all file objects that have a file associated with a superblock.

The kernel creates the file structure by way of get_empty_filp(). This routine returns a pointer to the file structure or returns NULL if there are no more free structures or if the system has run out of memory.

We now look at some of the more important fields in the file structure:

-----------------------------------------------------------------------
include/linux/fs.h
506  struct file {
507   struct list_head          f_list;
508   struct dentry             *f_dentry;
509   struct vfsmount           *f_vfsmnt;
510   struct file_operations    *f_op;
511   atomic_t                  f_count;
512   unsigned int              f_flags;
513   mode_t                    f_mode;
514   loff_t                    f_pos;
515   struct fown_struct        f_owner;
516   unsigned int              f_uid, f_gid;
517   struct file_ra_state      f_ra;
...
527   struct address_space      *f_mapping;
...
529  };
-----------------------------------------------------------------------

Line 507

The f_list field of type list_head holds the pointers to the adjacent file structures in the list.

Line 508

This is a pointer to the dentry structure associated with the file.

Line 509

This is a pointer to the vfsmount structure that is associated with the mounted filesystem that the file is in. All filesystems that are mounted have a vfsmount structure that holds the related information. Figure 6.8 illustrates the data structures associated with vfsmount structures.

Figure 6.8. vfsmount Objects

Line 510

This is a pointer to the file_operations structure, which holds the table of file operations that can be applied to a file. (The inode's i_fop field points to the same structure.) Figure 6.7 illustrates this relationship.

Line 511

Numerous processes can concurrently access a file. The f_count field is set to 0 when the file structure is unused (and, therefore, available for use). The f_count field is set to 1 when it's associated with a file and incremented by one thereafter with each process that handles the file. Thus, if a file object that is in use represents a file accessed by four different processes, the f_count field holds a value of 5.

Line 512

The f_flags field contains the flags that are passed in via the open() syscall. We cover this in more detail in the "open()" section.

Line 514

The f_pos field holds the file offset. This is essentially the read/write pointer that some of the methods in the file operations table use to refer to the current position in the file.
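The role of the offset can be sketched with a toy open-file object. The demo_file type and its two functions are hypothetical: an in-memory string stands in for the file contents, and only an absolute seek is modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical open-file object: backing data plus a current offset. */
struct demo_file {
    const char *data;            /* stands in for the file's contents */
    size_t size;                 /* length of data in bytes */
    size_t f_pos;                /* current read/write offset */
};

/* Reads up to len bytes starting at f_pos and advances f_pos. */
size_t demo_read(struct demo_file *f, char *buf, size_t len)
{
    size_t left;

    assert(f->f_pos <= f->size);
    left = f->size - f->f_pos;
    if (len > left)
        len = left;              /* short read at end of file */
    memcpy(buf, f->data + f->f_pos, len);
    f->f_pos += len;
    return len;
}

/* Repositions the offset (absolute seek only); returns the new f_pos. */
size_t demo_seek(struct demo_file *f, size_t pos)
{
    f->f_pos = pos > f->size ? f->size : pos;
    return f->f_pos;
}
```

Because the offset lives in the file object rather than in the inode, two independent opens of the same file would each get their own position.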

Line 516

We need to know who the owner of the process is to determine file access permissions when the file is manipulated. These fields correspond to the uid and the gid of the user who started the process and opened the file structure.

Line 517

A file can read pages from the page cache, which is the in-memory collection of pages, in advance. The read-ahead optimization involves reading adjacent pages of a file prior to any of them being requested to reduce the number of costly disk accesses. The f_ra field holds a structure of type file_ra_state, which contains all the information related to the file's read-ahead state.

Line 527

This field points to the address_space struct, which corresponds to the page-caching mechanism for this file. This is discussed in detail in the "Page Cache" section.

6.2.2. Global and Local List References

The Linux kernel uses global variables that hold pointers to linked lists of the structures previously mentioned. All structures are kept in a doubly linked list. The kernel keeps a pointer to the head of the list using this as an access point to the list. The structures all have fields of type list_head,^[7] which they use to point to the previous and next elements in the list. Table 6.7 summarizes the global variables that the kernel holds and the type of list it keeps a reference to.

^[7] The inode struct has a variation of this called hlist_node, as we saw in Section 6.2.1.3, "inode Structure."

Table 6.7. VFS-Related Global Variables
Global Variable	Structure Type
`super_blocks`	`super_block`
`file_systems`	`file_system_type`
`dentry_unused`	`dentry`
`vfsmntlist`	`vfsmount`
`inode_in_use`	`inode`
`inode_unused`	`inode`

The super_block, file_system_type, dentry, and vfsmount structures are all kept in their own list. Inodes can find themselves in either the global inode_in_use or inode_unused list, or in the local list of the superblock to which they belong. Figure 6.9 shows how some of these structures interrelate.

Figure 6.9. VFS-Related Global Variables

The super_blocks variable points to the head of the superblock list with the elements pointing to the previous and next elements in the list by means of the s_list field. The s_dirty field of the superblock structure in turn points to the inodes it owns, which need to be synchronized with the disk. Inodes not in a local superblock list are in the inode_in_use or inode_unused lists. All inodes point to the next and previous elements in the list by way of the i_list field.

The superblock also points to the head of the list containing the file structs that have been assigned to that superblock by way of the s_files list. The file structs that have not been assigned are placed in either the free_list list or the anon_list list. Both lists have a dummy file struct as the head of the list. All file structs point to the next and previous elements in their list by using the f_list field.

Refer to Figure 6.6 to see how the inode points to the list of dentry structures by using the i_dentry field.