Section 15.3. The UFS Inode

15.3. The UFS Inode

The inode (Index Node) is UFS's internal descriptor for a file. Each file system has two forms of an inode: the on-disk inode and the in-core (in-memory) inode. The on-disk inode resides on the physical medium and represents the on-disk format and layout of the file.

15.3.1. In-Core UFS Inodes

The in-core inode, as you may have guessed, resides in memory and contains the file-system-dependent information, free-list pointers, hash anchors, kernel locks (covered in UFS locking below), and inode state.

typedef struct inode {         struct  inode *i_chain[2];      /* must be first */         struct inode *i_freef;  /* free list forward - must be before i_ic */         struct inode *i_freeb;  /* free list back - must be before i_ic */         struct  icommon i_ic;   /* Must be here */         struct  vnode *i_vnode; /* vnode associated with this inode */         struct  vnode *i_devvp; /* vnode for block I/O */         dev_t   i_dev;          /* device where inode resides */         ino_t   i_number;       /* i number, 1-to-1 with device address */         off_t   i_diroff;       /* offset in dir, where we found last entry */                                 /* just a hint - no locking needed */         struct ufsvfs *i_ufsvfs; /* incore fs associated with inode */         struct  dquot *i_dquot; /* quota structure controlling this file */         krwlock_t i_rwlock;     /* serializes write/setattr requests */         krwlock_t i_contents;   /* protects (most of) inode contents */         kmutex_t i_tlock;       /* protects time fields, i_flag */         offset_t i_nextr;       /*                                       */                                 /* next byte read offset (read-ahead)    */                                 /*   No lock required                    */                                 /*                                       */         uint_t  i_flag;         /* inode flags */         uint_t  i_seq;          /* modification sequence number */         boolean_t i_cachedir;   /* Cache this directory on next lookup */                                 /* - no locking needed  */         long    i_mapcnt;       /* mappings to file pages */         int     *i_map;         /* block list for the corresponding file */         dev_t   i_rdev;         /* INCORE rdev from i_oldrdev by ufs_iget */         size_t  i_delaylen;     /* delayed writes, units=bytes */         offset_t i_delayoff;    /* where we started delaying */         offset_t i_nextrio;     /* where to start the next clust */         long    i_writes;       /* number of outstanding bytes in write q */         kcondvar_t i_wrcv;      /* sleep/wakeup for write throttle */         offset_t i_doff;        /* dinode byte offset in file system */         si_t *i_ufs_acl;        /* pointer to acl entry */         dcanchor_t i_danchor;   /* directory cache anchor */         kthread_t *i_writer;    /* thread which is in window in wrip() */ } inode_t;                                               See usr/src/uts/common/sys/fs/ufs_inode.h

New with Solaris 10, an inode sequence number was added to the in-core inode structure to support NFSv3 and NFSv4 detection of atomic changes to the inode. Two caveats with this new value: i_seq must be updated if i_ctime and i_mtime are changed; the value of i_seq is only guaranteed to be persistent while the inode is active.

15.3.2. Inode Cache

When the last reference to a vnode is released, the vop_inactive() routine for the file system is called. (See vnode reference counts in Section 14.6.8.) UFS uses vop_inactive() to free the inode when it is no longer required. If we were to destroy each vnode when the last reference to a vnode is relinquished, we would throw away all the data relating to that vnode, including all the file pages cached in the page cache. This practice could mean that if a file is closed and then reopened, none of the file data that was cached would be available after the second open and would need to be reread from disk. To remedy the situation, UFS caches all unused inodes in its global cache.

The UFS inode cache contains an entry for every open inode in the system. It also attempts to keep as many closed inodes as possible so that inactive inodes/ vnodes and associated pages are around in memory for possible reuse. This is a global cache and not a per-file system cache, and that unfortunately leads to several performance issues.

The inode cache consists of several disconnected queues or chains, and each queue is linked with the inode's i_forw and i_backw pointers (see Figure 15.10). Starting with Solaris 10, hashing of inode entries is done with the inode number (because of recent devfs changes) rather than with the inode number and the device number (Solaris 9 and earlier). These queues are managed according to least recently used (LRU) scheme.

Figure 15.10. UFS Inode Hash Queues

An inode free list is also maintained within the cache which is built upon the i_freef and i_freeb pointers. These enable the free list to span several hash chains. If an inode is not on the free list, then the i_freef and i_freeb values point back to the inode itself.

Inodes on the free list can be part of two separate queues:

Idle queue. Holds the idle or unreferenced inodes (where the v_count equals 1, t and the i_nlink is greater than 0). This queue is managed by the global file system idle thread, which frees entries, starting at the head. When new entries are added, ufs_inactive() adds an inode to the head if the inode has no associated pages; otherwise, the inode is added to the tail. This ensures that pages are retained longer in memory for possible reusethe frees are done starting at the head.
Starting with Solaris 10, the idle queue architecture was reorganized into two separate hash queues: ufs_useful_iq and ufs_junk_iq. If an inode has pages associated with it (vn_has_cached_data(vnode)) or is a fast symbolic link (i_flag and IFASTSYMLNK), then it is attached to the useful idle queue. All other inodes are attached to the junk idle queue instead. These queues are not used for searching but only for grouping geographically local inodes for faster updates and fewer disk seeks upon reuse. Entries from the junk idle queue are destroyed first when ufs_idle_free() is invoked by the UFS idle thread so that cached pages pertaining to entries in the ufs_useful_iq idle queue stay in memory longer.
The idle thread is adjusted to run when there are 25% of ufs_ninode enTRies on the idle queue. When it runs, it gives back half of the idle queue until the queue falls below the low water mark of ufs_q->uq_lowat. Inodes on the junk queue get destroyed first. Figure 15.11 illustrates the process.

Figure 15.11. UFS Idle Queue
Delete queue. Is active if UFS logging is enabled and consists of inodes that are unlinked or deleted (v_count equals 1 and i_nlink is less than or equal to 0). This queue is a performance enhancer for file systems with logging turned on and observing heavy deletion activity. The delete queue is handled by the per-file system delete thread, which queues the inodes to be deleted by the ufs_delete() thread. This significantly boosts response times for removal of large amounts of data. If logging is not enabled, ufs_delete() is called immediately. ufs_delete() calls VN_RELE() after it has finished processing, which causes the inode to once again be processed by ufs_inactive, which this time puts it on the idle queue. While on the delete queue, the inode's i_freef and i_freeb point to the inode itself since the inodes are not free yet.

15.3.3. Block Allocation

The UFS file system is block based where each file is represented by a set of fixed sized units of disk space, indexed by a tree of physical-meta-data blocks.

15.3.3.1. Layout Policy

UFS uses block sizes of 4 and 8 Kbytes, which provides significantly higher performance than the 512-byte blocks used in the System V file system. The downside of larger blocks was that when partially allocated blocks were created, several kilo-bytes of disk space for each partly filled file system block was wasted. To overcome this disadvantage, UFS uses the notion of file system fragments. Fragments allow a single block to be broken up into 2, 4, or 8 fragments when necessary (4 Kbytes, 2 Kbytes or 1 Kbyte, respectively).

UFS block allocation tries to prevent excessive disk seeking by attempting to co-locate inodes within a directory and by attempting to co-locate a file's inode and its data blocks. When possible, all the inodes in a directory are allocated in the same cylinder group. This scheme helps reduce disk seeking when directories are traversed; for example, executing a simple ls -l of a directory will access all the inodes in that directory. If all the inodes reside in the same cylinder group, most of the data are cached after the first few files are accessed. A directory is placed in a cylinder group different from that of its parent.

Blocks are allocated to a file sequentially, starting with the first 96 Kbytes (the first 12 direct blocks), skipping to the next cylinder group and allocating blocks up to the limit set by the file system parameter maxbpg (maximum-blocks-per-cylinder-group). After that, blocks are allocated from the next available cylinder group.

By default, on a file system greater than 1 Gbyte, the algorithm allocates 96 Kbytes in the first cylinder group, 16 Mbytes in the next available cylinder group, 16 Mbytes from the next, and so on. The maximum cylinder group size is 54 Mbytes, and the allocation algorithm allows only one-third of that space to be allocated to each section of a single file when it is extended. The maxbpg parameter is set to 2,048 8-Kbyte blocks by default at the time the file system is created. It is also tunable but can only be tuned downward since the maximum cylinder group size is 16-Mybte allocation per cylinder group.

Selection of a new cylinder group for the next segment of a file is governed by a rotor and free-space algorithm. A per-file-system allocation rotor points to one of the cylinder groups; each time new disk space is allocated, it starts with the cylinder group pointed to by the rotor. If the cylinder group has less than average free space, then it is skipped and the next cylinder group is tried. This algorithm makes the file system attempt to balance the allocation across the cylinder groups.

Figure 15.12 shows the default allocation that is used if a file is created on a large UFS. The first 96 Kbytes of file 1 are allocated from the first cylinder group. Then, allocation skips to the second cylinder group and another 16 Mbytes of file 1 are allocated, and so on. When another file is created, we can see that it consumes the holes in the allocated blocks alongside file 1. There is room for a third file to do the same.

Figure 15.12. Default File Allocation in 16-Mbyte Groups

The actual on-disk layout will not be quite as simple as the example shown but does reflect the allocation policies discussed. We can use an add-on tool, filestat, to view the on-disk layout of a file, as shown below.

sol8# /usr/local/bin/filestat testfile Inodes per cyl group:   128 Inodes per block:       64 Cylinder Group no:      0 Cylinder Group blk:     64 File System Block Size: 8192 Block Size:             512 Number of 512b Blocks:  262288 Start Block    End Block    Length (512 byte Blocks) -----------    -----------  ------------------------         144 -> 335          192         400 -> 33167        32768      110800 -> 143567       32768      221264 -> 221343       80      221216 -> 221263       48      221456 -> 254095       32640      331856 -> 331999       144      331808 -> 331855       48      332112 -> 364687       32576      442448 -> 442655       208      442400 -> 442447       48      442768 -> 475279       32512

The filestat output shows that the first segment of the file occupies 192 (512-byte) blocks, followed by the next 16 Mbytes, which start in a different cylinder group. This particular file system was not empty when the file was created, which is why the next cylinder group chosen is a long way from the first.

We can observe the file system parameters of an existing file system with the fstyp command. The fstyp command simply dumps the superblock information for the file, revealing all the cylinder group and allocation information. The following example shows the output for a 4-Gbyte file system with default parameters. We can see that the file system has 8,247,421 blocks and has 167 cylinder groups spaced evenly at 6,272 (51-Mbyte) intervals. The maximum blocks to allocate for each group is set to the default of 2,048 8-Kbyte, 16 Mbytes.

sol8# fstyp -v /dev/vx/dsk/homevol |more ufs magic   11954   format  dynamic time    Sat Mar  6 18:19:59 1999 sblkno  16      cblkno  24      iblkno  32      dblkno  800 sbsize  2048    cgsize  8192    cgoffset 32     cgmask  0xffffffe0 ncg     167     size    8378368 blocks  8247421 bsize   8192    shift   13      mask     0xffffe000 fsize   1024    shift   10      mask     0xfffffc00 frag    8       shift   3        fsbtodb 1 minfree 1%      maxbpg  2048    optim   time maxcontig 32    rotdelay 0ms    rps     120 csaddr  800     cssize  3072    shift   9       mask     0xfffffe00 ntrak   32      nsect   64      spc     2048    ncyl     8182 cpg     49      bpg     6272    fpg     50176   ipg      6144 nindir  2048    inopb   64      nspf    2 nbfree  176719  ndir    10241   nifree  956753  nffree  21495 cgrotor 152     fmod    0       ronly   0       logbno  0

The UFS-specific version of the fstyp command dumps the superblock of a UFS file system, as shown below.

sol8# fstyp -v /dev/vx/dsk/homevol |more ufs magic   11954   format  dynamic time    Sat Mar  6 18:19:59 1999 sblkno  16      cblkno  24      iblkno  32      dblkno  800 sbsize  2048    cgsize  8192    cgoffset 32     cgmask  0xffffffe0 ncg     167     size    8378368 blocks  8247421 bsize   8192    shift   13      mask     0xffffe000 fsize   1024    shift   10      mask     0xfffffc00 frag    8       shift   3       fsbtodb  1 minfree 1%      maxbpg  2048    optim   time maxcontig 32    rotdelay 0ms    rps     120 csaddr  800     cssize  3072    shift   9       mask     0xfffffe00 ntrak   32      nsect   64      spc     2048    ncyl     8182 cpg     49      bpg     6272    fpg     50176   ipg      6144 nindir  2048    inopb   64      nspf    2 nbfree  176719  ndir    10241   nifree  956753  nffree  21495 cgrotor 152     fmod    0       ronly   0       logbno  0 fs_reclaim is not set file system state is valid, fsclean is 0 blocks available in each rotational position cylinder number 0:    position 0:      0    4    8   12   16   20   24   28   32   36   40    44                    48   52   56   60   64   68   72   76   80   84   88    92                    96  100  104  108  112  116  120  124    position 2:      1    5    9   13   17   21   25   29   33   37   41    45                    49   53   57   61   65   69   73   77   81   85   89    93                    97  101  105  109  113  117  121  125    position 4:      2    6   10   14   18   22   26   30   34   38   42    46                    50   54   58   62   66   70   74   78   82   86   90    94                    98  102  106  110  114  118  122  126    position 6:      3    7   11   15   19   23   27   31   35   39   43    47                    51   55   59   63   67   71   75   79   83   87   91    95                    99  103  107  111  115  119  123  127 cs[].cs_(nbfree,ndir,nifree,nffree):         (23,26,5708,102) (142,26,5724,244) (87,20,5725,132) (390,69,5737,80)         (72,87,5815,148) (3,87,5761,110) (267,87,5784,4) (0,66,5434,4)         (217,46,5606,94) (537,87,5789,70) (0,87,5901,68) (0,87,5752,20) . . cylinders in last group 48 blocks in last group 6144 cg 0: magic   90255   tell    6000    time    Sat Feb 27 22:53:11 1999 cgx     0       ncyl    49      niblk   6144    ndblk   50176 nbfree  23      ndir    26      nifree  5708    nffree  102 rotor   1224    irotor  144     frotor  1224 frsum   7       7       3       1       1       0       9 sum of frsum: 102 iused:  0-143, 145-436 free:   1224-1295, 1304-1311, 1328-1343, 4054-4055, 4126-4127, 4446-4447, 4455, 4637- 4638,

15.3.3.2. Mapping Files to Disk Blocks

At the heart of a disk-based file system are the block map algorithms, which implement the on-disk file system format. These algorithms map UFS file and offsets pairs into disk addresses on the underlying storage. For UFS, two main functionsbmap_read() and bmap_write()implement the on-disk format. Calling these functions has the following results:

bmap_read()queries the file system as to which physical disk sector a file block resides on; that is, requests a lookup of the direct/indirect blocks that contain the disk address(es) of the required blocks.
bmap_write() allocates, with the aid of helper functions, new disk blocks when extending or allocating blocks for a file.
The bmap_read() function reads file system block addresses. It accepts an inode and offset as input arguments, and a pointer to a disk address and contiguity length as output arguments.

int bmap_read(struct inode *ip, u_offset_t off, daddr_t *dap, int *lenp)                                                See usr/src/uts/common/fs/ufs/ufs_bmap.c

The file system uses the bmap_read() algorithm to locate the physical blocks for the file being read. The bmap_read() function searches through the direct, indirect, and double-indirect blocks of the inode to locate the disk address of the disk blocks that map to the supplied offset. The function also searches forward from the offset, looking for disk blocks that continue to map contiguous portions of the inode, and returns the length of the contiguous segment (in blocks) in the length pointer argument. The length and the file system block clustering parameters are used within the file system as bounds for clustering contiguous blocks to provide better performance by reading larger parts of a file from disk at a time. See ufs_getpage_ra(), defined in usr/src/uts/common/fs/ufs_vnops.c, for more information on read-aheads.

int bmap_write(struct inode *ip, u_offset_t off, int size,            int alloc_only, struct cred *cr);                                                See usr/src/uts/common/fs/ufs/ufs_bmap.c

The bmap_write() function allocates file space in the file system when a file is extended or a file with holes has blocks written for the first time and is responsible for storing the allocated block information in the inode. bmap_write() traverses the block free lists, using the rotor algorithm (discussed in Section 15.3.3), and updates the local, direct, and indirect blocks in the inode for the file being extended. bmap_write calls several helper functions to facilitate the allocation of blocks.

daddr_t blkpref(struct inode *ip, daddr_t lbn, int indx, daddr32_t *bap) Guides bmap_write in selecting the next desired block in the file. Sets the policy as described in Section 15.3.3.1. int realloccg(struct inode *ip, daddr_t bprev, daddr_t bpref, int osize, int nsize,     daddr_t *bnp, cred_t *cr) Re-allocates a fragment to a bigger size. The number and size of the old block size is specified and the allocator attempts to extend the original block. Failing that, the regular block allocator is called to obtain an appropriate block. int alloc(struct inode *ip, daddr_t bpref, int size, daddr_t *bnp, cred_t *cr) Allocates a block in the file system. The size of the block is specified which is a multiple of (fs_fsize <= fs_bsize). If a preference (usually obtained from blkpref()) is specified, the allocator will try to allocate the requested block. If that fails, a rotationally optimal block in the same cylinder is found. Failing that a block in the same cylinder group is searched for. And in case that fails, the allocator quadratically rehashes into other cylinder groups (see hashalloc() in uts/common/fs/ufs/ufs_alloc.c) to locate an available block. If no preference is given, a block in the same cylinder is found, and failing that the allocator quadratically searches other cylinder groups for one.                                                       See uts/common/fs/ufs/ufs_alloc.c static void ufs_undo_allocation(inode_t *ip,                                 int block_count,                                 struct ufs_allocated_block table[],                                 int inode_sector_adjust) In the case of an error, bmap_write() will call ufs_undo_allocation to free any blocks which were used during the allocation process.                                                        See uts/common/fs/ufs/ufs_bmap.c

15.3.3.3. Reading and Writing UFS Blocks

A file system read calls bmap_read() to find the location of the underlying physical blocks for the file being read. UFS then calls the device driver's strategy routine for the device containing the file system to initiate the read operation by calling bdev_strategy().

A file system write operation that extends a file first calls bmap_write() to allocate the new blocks and then calls bmap_read() to obtain the block location for the write. UFS then calls the device driver's strategy routine, by means of bdev_strategy(), to initiate the file write.

15.3.3.4. Buffering Block Metadata

The block map functions access metadata (single, double and triple indirect blocks) on the device media through the buffer cache, using the bread_common() and bwrite_common() buffered block I/O kernel functions. The block I/O functions read and write device blocks in 512-byte chunks, and they cache physical disk blocks in the block buffer cache (note: this cache is different from the page cache, used for file data). The UFS file system requires 1 Mbyte of metadata for every 2 Gbytes of file space. This relationship can be used as a rule to calculate the size of the block buffer cache, set by the bufhwm kernel parameter.

15.3.4. Methods to Read and Write UFS Files

Files can be read or written in two ways: by the read() or write() system calls or by mapped file I/O. The read() and write() system calls call the file system's ufs_read() and ufs_write() method. These methods map files into the kernel's address space and then use the file system's ufs_getpage() and ufs_putpage() methods to transfer data to and from the physical media.

15.3.4.1. `ufs_read()`

An example of the steps taken by a UFS read system call is shown in Figure 15.13. A read system call invokes the file-system-dependent read function, which turns the read request into a series of vop_getpage() calls by mapping the file into the kernel's address space with the seg_kpm driver (through the seg_map driver), as described in Section 14.7.

Figure 15.13. `ufs_read()`

The ufs_read method calls into the seg_map driver to locate a virtual address in the kernel address space for the file and offset requested with the segmap_getmapflt() function. The seg_map driver determines whether it already has a mapping for the requested offset by looking into its hashed list of mapping slots. Once a slot is located or created, an address for the page is located. segmap then calls back into the file system with ufs_getpage() to soft-initiate a page fault to read in the page at the virtual address of the seg_map slot. The page fault is initiated while we are still in the segmap_getmap() routine, by a call to segmap_fault(). That function in turn calls back into the file system with ufs_getpage(), which calls out file system's _getpage(). If not, then a slot is created and ufs_getpage() is called to read in the pages.

The ufs_getpage() routine brings the requested range of the file (vnode, offset, and length) from disk into the virtual address, and the length is passed into the ufs_getpage() function. The ufs_getpage() function locates the file's blocks (through the block map functions discussed in Section 15.3.3.2) and reads them by calling the underlying device's strategy routine.

Once the page is read by the file system, the requested range is copied back to the user by the uiomove() function. The file system then releases the slot associated with that block of the file by using the segmap_release() function. At this point, the slot is not removed from the segment, because we may need the same file and offset later (effectively caching the virtual address location); instead, it is added to a seg_map free list so that it can be reclaimed or reused later.

15.3.4.2. `ufs_write()`

Writing to the file system is performed similarly, although it is more complex because of some of the file system write performance enhancements, such as delayed writes and write clustering. Writing to the file system follows the steps shown in Figure 15.14.

Figure 15.14. ufs_write()

The write system call calls the file-system-independent write, which in our example calls ufs_write(). UFS breaks the write into 8-Kbyte chunks and then processes each chunk. For each 8-Kbyte chunk, the following steps are performed.

1.	UFS asks the `segmap` driver for an 8-Kbyte mapping of the file in the kernel's virtual address space. The page for the file and offset is mapped here so that the data can be copied in and then written out with paged I/O.
2.	If the file is being extended or a new page is being created within a hole of a file, then a call is made to the `segmap_pagecreate` function to create and lock the new pages. Next, a call is made `segmap_pageunlock()` to unlock the pages that were locked during the `page_create`.
3.	If the write is to a whole file system block, then a new zeroed page is created with `segmap_pagecreate()`. In the case of a partial block write, the block must first be read in so that the partial block contents can be replaced.
4.	The new page is returned, locked, to UFS. The buffer that is passed into the write system call is copied from user address space into kernel address space.

5.	The `ufs_write` tHRottle first checks to see if too many bytes are outstanding for this file as a result of previous delayed writes. If more than the kernel parameter `ufs_HW` bytes are outstanding, the write is put to sleep until the amount of outstanding bytes drops below the kernel parameter `ufs_LW`.

The file system calls the seg_map driver to map in the portion of the file we are going to write. The data is copied from the process's user address space into the kernel address space allocated by seg_map, and seg_map is then called to release the address space containing the dirty pages to be written. This is when the real work of write starts, because seg_map calls ufs_putpage() when it realizes there are dirty pages in the address space it is releasing.