The New and Improved UNIX File System

   

The early history of UNIX development consisted of two competing flavors of UNIX: the original AT&T/Bell Labs release and the Berkeley Software Distribution (BSD). The original code from AT&T could not be sold, as antitrust laws didn't allow AT&T to compete in this arena, but the company was allowed to provide copies of its operating system code for free, and many universities took them up on the offer. The University of California, Berkeley, became a hotbed of UNIX development efforts, and one of the first areas it addressed was the file system.

The original UNIX file system was unique in its approach to file data management but had several weak points in its design. Among them was its dependency on the all-important superblock. If this key structure was corrupted, then all access to data on the file system was lost. In addition, the design allowed the blocks of a single file to be spread across the entire disk space with no attempt to keep related blocks in any type of order. This played well when the kernel was trying to figure out where to put the next block of data, but the price had to be paid in performance when a file had to be accessed in its entirety (file copies, backups, etc.).

As disks got bigger, the tendency was to increase the file system block size to keep the management overhead to an acceptable limit. This was fine as long as you didn't have lots of very small files. A file with only a small number of bytes would require an entire block for its storage. This could result in an inefficient utilization of disk space.

The Berkeley File System (BFS) attempted to address these and other performance issues. For the record, the following names all refer to the same file system implementation:

  • High-performance File System or Hierarchical File System (HFS in HP-UX)

  • Berkeley File System

  • Fast File System (FFS in SYS V.4)

  • McKusick File System

  • UNIX File System (UFS)

HP-UX's HFS

The HFS has long been the main file system for the HP-UX kernel, and indeed for PA-RISC-based systems, the kernel image may be loaded only from HFS. Although recently there are contenders to HFS's position as the HP-UX default file system, we study it here, as it sets the baseline for HP-UX file system features and performance comparisons. Figure 8-8 illustrates a high-level view of this model.

Figure 8-8. The HFS File System Layout



File access time depends on a couple of basic factors. The first is the average seek time. Assume that a disk has 1,000 distinct tracks. The amount of time it takes to move the disk head from one track to another is the seek time. Sometimes the data you are seeking is one track away and sometimes it might be 1,000 tracks away. The average seek time is the time it takes to reposition the head across ½ of its total track count (500 in our example).

The other factor affecting disk access time is rotational latency. Consider a rotating disk platter with a number of sectors located around each track. The sector you require might be the next to come under the head, or it might have just passed by, in which case you will have to wait for a complete revolution of the disk. As such, the average rotational latency is the time it takes for half a revolution of the disk media.

From a designer's point of view, the only way to reduce the rotational latency is to get a faster disk (you may have noticed the trend from 5400 rpm disks to 10,000 rpm or higher in the PC market recently). The disadvantages of speeding up a drive are that the heads must be built to tighter tolerances and that speed equals heat: the faster the drive spins, the more heat dissipation the hardware designer has to deal with. These two issues add up to increased cost. As far as the seek time is concerned, the trend is to increase the number of tracks, not decrease them, as hardware designers are constantly trying to increase the capacity of their drives.

This is where the file system designer chose to implement a "soft" solution to the challenge. By dividing the disk into multiple cylinder groups and allocating inodes and their related data blocks within a cylinder group whenever possible, the average seek time may be greatly improved. Each cylinder group contains a portion of the inode table, a redundant copy of the superblock (the fsck utility can use one of these copies to recover a corrupted superblock), and localized metadata to manage and monitor utilization of the local cylinder group space.

Reconsider our example: assume that the whole disk was divided into 100 cylinder groups. Each cylinder group (CG) would be mapped to 10 adjacent tracks. While the rotational latency would remain constant, all the individual blocks containing a file's data would be located within a 10-track area. For access to this single file, the average seek time would be reduced from the time to move 500 tracks to the time to move 5 tracks! Simply by following these conventions for the placement of a file's data, we could see a vast improvement in access time. We realize that this is an oversimplified explanation, but it points our thinking in the right direction.
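The arithmetic of this example can be checked with a few lines of Python. This is only a sketch of the chapter's simplified model, in which the average seek distance is half the span of tracks the head may have to cross; the function and constant names are ours.

```python
def average_seek_tracks(span_tracks: int) -> float:
    """Average head travel under the simplified model: half the span."""
    return span_tracks / 2


TOTAL_TRACKS = 1000     # disk from the example
CYLINDER_GROUPS = 100   # cylinder groups from the example

tracks_per_group = TOTAL_TRACKS // CYLINDER_GROUPS
whole_disk = average_seek_tracks(TOTAL_TRACKS)
within_group = average_seek_tracks(tracks_per_group)

print(f"tracks per cylinder group:            {tracks_per_group}")
print(f"average seek, file spread over disk:  {whole_disk:.0f} tracks")
print(f"average seek, file within one group:  {within_group:.0f} tracks")
print(f"reduction factor:                     {whole_disk / within_group:.0f}x")
```

Under these assumptions the head travels 100 times less on average, which is the source of the "500 tracks versus 5 tracks" comparison.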

Direct and Indirect Block Pointers

You may be wondering how a fixed-size inode can point to all the data blocks of files of varying sizes. If the block size is fixed and the inode size is fixed, it seems that there would be an implied maximum file size. To work around this issue, the HFS inode employs four types of block pointers:

  • Direct pointers (12 per inode)

  • Single indirect pointers (one per inode)

  • Double indirect pointers (one per inode)

  • Triple indirect pointers (one per inode)

The original AT&T file system had a 512-byte block, and the inode was 128 bytes and had room for 15 block pointers (each pointer was 32 bits). Let's examine the limits of this model (Figure 8-9).

Figure 8-9. The Inode



There are 12 direct pointers that locate the first 12 blocks containing file data, allowing for 12 * 512, or 6 KB of storage. If the file grew beyond this size, then the 13th pointer (the single indirect pointer) was used to point to a data block that could hold additional direct pointers (a block size of 512 meant that an indirect block would hold an additional 128 pointers). The 14th pointer (the double indirect) could point to a block containing additional single indirect pointers, and the 15th pointer (the triple indirect) could point to a block containing additional double indirect pointers. In this manner, a fixed-size inode could map a very large number of file data blocks.

With the original 512-byte block size, we get the following capabilities:

  Direct pointers   →   12 × 512 bytes                = 6 KB
  Single indirect   →   128 × 512 bytes               = 64 KB
  Double indirect   →   128 × 128 × 512 bytes         = 8 MB
  Triple indirect   →   128 × 128 × 128 × 512 bytes   = 1 GB


With the more modern 8-KB block size, we get the following:

  Direct pointers   →   12 × 8 KB                      = 96 KB
  Single indirect   →   2,048 × 8 KB                   = 16 MB
  Double indirect   →   2,048 × 2,048 × 8 KB           = 32 GB
  Triple indirect   →   2,048 × 2,048 × 2,048 × 8 KB   = 64 TB
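The capacities for both block sizes follow from one rule: an indirect block holds (block size ÷ pointer size) pointers, and each level of indirection multiplies by that count again. The sketch below recomputes them, assuming 4-byte pointers and 12 direct pointers as described in the text; the function names are ours.

```python
def pointer_capacities(block_size: int, pointer_size: int = 4, direct: int = 12) -> dict:
    """Bytes addressable through each class of block pointer in one inode."""
    per_block = block_size // pointer_size  # pointers held by one indirect block
    return {
        "direct": direct * block_size,
        "single indirect": per_block * block_size,
        "double indirect": per_block ** 2 * block_size,
        "triple indirect": per_block ** 3 * block_size,
    }


def human(n: int) -> str:
    """Format a size in binary units (exact for the power-of-two sizes used here)."""
    for unit in ("bytes", "KB", "MB", "GB"):
        if n < 1024:
            return f"{n} {unit}"
        n //= 1024
    return f"{n} TB"


for block_size in (512, 8192):
    print(f"block size {block_size} bytes:")
    for name, size in pointer_capacities(block_size).items():
        print(f"  {name:<16} {human(size)}")
```

The figures are per pointer class; the largest mappable file is the sum of all four, and other limits in the system (such as the signed block pointer discussed later) may cap the usable size well below that.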


The HFS inode

The inode is the end-all, be-all metadata structure when it comes to defining a file. All the attributes of a file (its permissions, type, size, credentials, timestamps, linkage counts, and data block pointers) are held in this structure. The only attribute of the file that isn't held in the inode is its name! In actuality, a name is an abstraction used to locate a specific file's data and exists to make it easier for humans to reference a file.

By maintaining this degree of separation between a file's attributes (its inode data) and its name (directory entry), it is a simple task to create multiple file names pointing to the same inode. This is the basis for the creation of hard links (see the ln command).
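The effect of a hard link can be observed from user space. The following sketch is not HP-UX specific: it assumes any POSIX file system that supports hard links, and the file names are made up. It creates a second name for a file and confirms that both names resolve to the same inode, whose link count is now 2.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "stuff")
    alias = os.path.join(workdir, "stuff_link")

    with open(original, "w") as f:
        f.write("data\n")

    # Equivalent to running: ln stuff stuff_link
    os.link(original, alias)

    first, second = os.stat(original), os.stat(alias)
    same_inode = first.st_ino == second.st_ino  # both names share one inode
    link_count = first.st_nlink                 # link count kept in the inode

    print("same inode:", same_inode)
    print("link count:", link_count)
```

Removing either name only decrements the link count; the inode and its data blocks are released when the count reaches zero.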

To study the structure of an HFS inode, let's examine the icommon structure in the HP-UX kernel (Listing 8.1).

Listing 8.1. q4> fields struct icommon
 inode type (upper 4 bits) and access mode (lower 12 bits)
   0 0  2 0 u_short   ic_mode

     0x1000  IFIFO    FIFO or named pipe
     0x4000  IFDIR    Directory
     0x6000  IFBLK    Block special file
     0x7000  IFCONT   Continuation inode
     0x8000  IFREG    Regular file
     0x9000  IFNWK    Network special file (retired)
     0xA000  IFLNK    Symbolic link
     0xC000  IFSOCK   Socket

 current number of directory links (hard links) to this inode
   2 0  2 0 short     ic_nlink
 low 16 bits of the owner's UID and GID
   4 0  2 0 u_short   ic_uid_lsb
   6 0  2 0 u_short   ic_gid_lsb
 file size in bytes
   8 0  8 0 long long ic_size
 timestamps (access, data modification, and inode change)
 Note: the change timestamp (ctime) records the last time the
 contents of the inode were modified and does not necessarily
 mean that the file's contents were changed
  16 0  4 0 u_int     ic_atime_tv_sec
  20 0  4 0 u_int     ic_atime_tv_usec
  24 0  4 0 u_int     ic_mtime_tv_sec
  28 0  4 0 u_int     ic_mtime_tv_usec
  32 0  4 0 u_int     ic_ctime_tv_sec
  36 0  4 0 u_int     ic_ctime_tv_usec
 next are the 12 direct block pointers
  40 0  4 0 int       ic_un2.ic_reg.ic_db[0]
  44 0  4 0 int       ic_un2.ic_reg.ic_db[1]
  48 0  4 0 int       ic_un2.ic_reg.ic_db[2]
  52 0  4 0 int       ic_un2.ic_reg.ic_db[3]
  56 0  4 0 int       ic_un2.ic_reg.ic_db[4]
  60 0  4 0 int       ic_un2.ic_reg.ic_db[5]
  64 0  4 0 int       ic_un2.ic_reg.ic_db[6]
  68 0  4 0 int       ic_un2.ic_reg.ic_db[7]
  72 0  4 0 int       ic_un2.ic_reg.ic_db[8]
  76 0  4 0 int       ic_un2.ic_reg.ic_db[9]
  80 0  4 0 int       ic_un2.ic_reg.ic_db[10]
  84 0  4 0 int       ic_un2.ic_reg.ic_db[11]
 followed by the single indirect, double indirect, and triple
 indirect block pointers
  88 0  4 0 int       ic_un2.ic_reg.ic_un.ic_ib[0]
  92 0  4 0 int       ic_un2.ic_reg.ic_un.ic_ib[1]
  96 0  4 0 int       ic_un2.ic_reg.ic_un.ic_ib[2]

If this is a fast symbolic link, the block pointers are replaced with

 40 0 60 0 char[60]   ic_un2.ic_symlink 

A fast symbolic link allows the symbolic or "soft link" pathname to be stored in the inode instead of a data block. This is possible as long as the linkage path doesn't exceed 59 characters in length.

 the status flags
 100 0  4 0 int       ic_flags

     0x01  IC_FASTLINK    enable fast symbolic links
     0x02  IC_LARGEUIDS   enable use of large UIDs

 blocks held
 104 0  4 0 int       ic_blocks
 108 0  4 0 int       ic_gen
 upper 16 bits of the owner's UID and GID if large UIDs have
 been enabled
 112 0  2 0 u_short   ic_uid_msb
 114 0  2 0 u_short   ic_gid_msb
 116 0  4 0 int       ic_spare[0]
 120 0  4 0 int       ic_spare[1]
 continuation inode number (if needed)
 124 0  4 0 u_int     ic_contin

As we see from this listing, the size of the HFS inode is 128 bytes. If access control lists (ACLs) are being used, the last four bytes in an inode point to a continuation inode, which holds up to 13 additional access control settings. For more on ACLs, see the manual pages for chacl and lsacl.
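The offsets and sizes in Listing 8.1 can be cross-checked by expressing the layout as a Python struct format and confirming that it adds up to 128 bytes. This is a sketch for illustration, not kernel code: it assumes big-endian byte order (as on PA-RISC) and no padding between fields, it models the regular-file variant of the union (block pointers rather than a fast symbolic link), and the decode helper and the sample record are our own. The sample reuses a few values from the "stuff" file in Listing 8.2 and zero-fills everything else.

```python
import struct

# Field order and widths follow Listing 8.1; ">" means big-endian, unpadded.
ICOMMON = struct.Struct(
    ">"
    "H"    # ic_mode
    "h"    # ic_nlink
    "2H"   # ic_uid_lsb, ic_gid_lsb
    "q"    # ic_size
    "6I"   # atime, mtime, ctime (seconds and microseconds each)
    "12i"  # ic_db[0..11], direct block pointers
    "3i"   # ic_ib[0..2], single/double/triple indirect pointers
    "3i"   # ic_flags, ic_blocks, ic_gen
    "2H"   # ic_uid_msb, ic_gid_msb
    "2i"   # ic_spare[0..1]
    "I"    # ic_contin
)

FILE_TYPES = {
    0x1000: "IFIFO", 0x4000: "IFDIR", 0x6000: "IFBLK", 0x7000: "IFCONT",
    0x8000: "IFREG", 0x9000: "IFNWK", 0xA000: "IFLNK", 0xC000: "IFSOCK",
}


def decode(raw: bytes) -> dict:
    """Pick out a few fields from a 128-byte inode image."""
    fields = ICOMMON.unpack(raw)
    mode = fields[0]
    return {
        "type": FILE_TYPES.get(mode & 0xF000, "unknown"),  # upper 4 bits
        "perms": oct(mode & 0x0FFF),                       # lower 12 bits
        "nlink": fields[1],
        "size": fields[4],
        "direct": fields[11:23],
        "indirect": fields[23:26],
    }


# Synthetic sample: regular file, mode rw-rw-rw-, one link, gid 3,
# 22 bytes long, first direct pointer 8201; all other fields zeroed.
sample = ICOMMON.pack(
    0x8000 | 0o666, 1, 0, 3, 22,
    *([0] * 6),
    8201, *([0] * 11),
    0, 0, 0,
    0, 0, 0,
    0, 0,
    0, 0,
    0,
)

print("inode size:", ICOMMON.size)
print(decode(sample))
```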

Following Metadata Using fsdb

Another useful tool for studying the internals of file systems is fsdb. It is similar to the classic adb debugging utility except that it is designed to allow the root user to examine disk-resident metadata structures. To reinforce the information about HFS inodes and the basic organization of file system metadata, we use fsdb to examine an HFS file system mounted on /dev/vg00/lvol10 (Listing 8.2). The file system's mount point is /chris, and it has a data file stored under it named /chris/stuff containing the following three lines of text:

 This
 is my
 data file

Listing 8.2. # fsdb -F hfs /dev/vg00/lvol10 session output
 First we will point fsdb (actually fsdb_hfs) toward the file
 system mount point and then examine the contents of inode #2

 fsdb -F hfs /dev/vg00/lvol10   <- This opens the file system on
                                   /dev/vg00/lvol10 for examination by fsdb.
                                   Note: fsdb has no prompt string
 file system size = 53248(frags) isize/cyl group=160(Kbyte blocks)
 primary block size=8192(bytes)
 fragment size=1024 (bytes)
 2i   <- This command lists the contents of inode #2
 no. of cyl groups = 7
 i#:2  md: d---rwxr-xr-x ln:    4 uid:    0 gid:    0 sz: 1024 ci:0
 a0 :   208  a1 :     0  a2 :     0  a3 :     0  a4 :     0  a5 :     0
 a6 :     0  a7 :     0  a8 :     0  a9 :     0  a10:     0  a11:     0
 a12:     0  a13:     0  a14:     0
 at: Fri Aug  8 12:56:54 2003
 mt: Fri Aug  8 12:46:22 2003
 ct: Fri Aug  8 12:46:22 2003
 a0b.p0d   <- Next we list the block pointed to by "a0" and format it as directory data
 d0: 2      .
 d1: 2      .  .
 d2: 3      l  o  s  t  +  f  o  u  n  d
 d3: 1280   c  h  r  i  s
 1280i   <- We see that inode #1280 is the "chris" directory and list its contents
 i#:1280  md: d---rwxrwxrwx ln:    2 uid:    0 gid:    3 sz: 1024 ci:0
 a0 :  8200  a1 :     0  a2 :     0  a3 :     0  a4 :     0  a5 :     0
 a6 :     0  a7 :     0  a8 :     0  a9 :     0  a10:     0  a11:     0
 a12:     0  a13:     0  a14:     0
 at: Fri Aug  8 12:54:28 2003
 mt: Fri Aug  8 12:54:35 2003
 ct: Fri Aug  8 12:54:35 2003
 a0b.p0d   <- We again list the directory's data block
 d0: 1280   .
 d1: 2      .  .
 d2: 1281   s  t  u  f  f
 d3: 1283   y  y  y
 1281i   <- Next we list inode #1281, the "stuff" file metadata
 i#:1281  md: f---rw-rw-rw- ln:    1 uid:    0 gid:    3 sz: 22 ci:0
 a0 :  8201  a1 :     0  a2 :     0  a3 :     0  a4 :     0  a5 :     0
 a6 :     0  a7 :     0  a8 :     0  a9 :     0  a10:     0  a11:     0
 a12:     0  a13:     0  a14:     0
 at: Fri Aug  8 12:47:06 2003
 mt: Fri Aug  8 12:46:55 2003
 ct: Fri Aug  8 12:46:55 2003
 8201b.p0c   <- follow the pointer to the file's data block and dump it in character format
 40022000    :  T  h  i  s    \n  i  s     m  y \n  d  a  t  a     f  i  l  e \n \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
 40022040    : \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
 ...
 40023740    : \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
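The walk performed in this session (inode, then its first data block, then the directory entry, then the next inode) is the same one a path lookup performs. The toy model below replays it with plain dictionaries. The inode numbers, block addresses, and directory entries are copied from Listing 8.2; the data structures and the resolve function are ours and only illustrate the traversal, not how the kernel or fsdb implements it.

```python
ROOT_INODE = 2  # root directory inode of the file system, as in the listing

# inode number -> first direct pointer (a0), taken from Listing 8.2
INODES = {
    2: {"type": "d", "a0": 208},
    1280: {"type": "d", "a0": 8200},
    1281: {"type": "f", "a0": 8201},
}

# block address -> directory entries (name -> inode number)
DIR_BLOCKS = {
    208: {".": 2, "..": 2, "lost+found": 3, "chris": 1280},
    8200: {".": 1280, "..": 2, "stuff": 1281, "yyy": 1283},
}


def resolve(path: str) -> list:
    """Return the inode numbers visited while resolving a path that is
    relative to the root inode of this file system."""
    inode = ROOT_INODE
    visited = [inode]
    for name in path.strip("/").split("/"):
        block = INODES[inode]["a0"]       # follow the directory's data pointer
        inode = DIR_BLOCKS[block][name]   # look the name up in that block
        visited.append(inode)
    return visited


print(resolve("chris/stuff"))
```

Each step reads only one inode and one directory block, which is why keeping a directory's inode, its data block, and the inodes of its files in the same cylinder group pays off.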

Blocks and Fragments

As we noted, the block size of the file system has increased over the years from its original 512 bytes to a typical 8 KB. On HP-UX, the block size is tunable on a per-file system basis between 2 KB and 64 KB. With this flexibility comes the potential to waste file space when storing small files, so HFS file systems support the allocation of file system block fragments. Like the block size, the fragment size is a tunable parameter set at the time of file system creation. The fragment size may be set equal to the block size or to one-half, one-quarter, or one-eighth of the block size (with a hard minimum of 1 KB). To accommodate the fragment size variations, the last three bits of the block pointer are reserved to address a fragment offset within a block. Figure 8-10 illustrates this basic concept.

Figure 8-10. Blocks and Fragments



The inode points to specific disk blocks using a 32-bit signed integer. The last three bits are the fragment offset, which leaves 28 bits for the actual block number and one sign bit. Note that these bits are reserved for the fragment offset regardless of the actual file system configuration. Only the last assigned direct pointer in an inode may point to a fragment.
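As a quick illustration of this pointer layout, the sketch below splits a pointer value into its block number and fragment offset. It assumes the simple reading of the text (low three bits are the fragment offset, the remaining bits the block number); the function name is ours. The sample values are the a0 pointers seen in Listing 8.2, where the block size is 8 KB and the fragment size is 1 KB.

```python
FRAGMENT_BITS = 3                        # low bits reserved for the fragment offset
FRAGMENT_MASK = (1 << FRAGMENT_BITS) - 1


def split_pointer(pointer: int) -> tuple:
    """Return (block number, fragment offset within that block)."""
    return pointer >> FRAGMENT_BITS, pointer & FRAGMENT_MASK


for pointer in (208, 8200, 8201):
    block, fragment = split_pointer(pointer)
    print(f"pointer {pointer:>5}: block {block:>5}, fragment {fragment}")
```

Read this way, the "chris" directory (8200) and the "stuff" file (8201) occupy adjacent fragments of the same 8-KB block.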

Consider a file 27 KB in size. The first three direct pointers in its inode would point to full blocks (this would hold the first 24 KB, leaving 3 KB of space to be allocated). The fourth direct pointer could locate the remaining 3 KB of data in any fragmented block with three contiguous fragments available. As a file grows, whole free blocks are allocated; when the file is closed, the last direct pointer used is checked to see if its data may be relocated to an existing fragmented block or if it is a good candidate to allow remaining fragments to be used by other files. Each cylinder group maintains a list of fragmented blocks within its boundary.
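The 27-KB example works out as follows. This sketch assumes an 8-KB block and a 1-KB fragment, as in the file system examined earlier, and a file small enough to be mapped by direct pointers alone; the function name is ours.

```python
def direct_layout(file_size: int, block_size: int = 8192, fragment_size: int = 1024) -> tuple:
    """Return (full blocks, trailing fragments) needed to hold file_size bytes."""
    full_blocks, tail = divmod(file_size, block_size)
    fragments = -(-tail // fragment_size)  # ceiling division for the tail
    return full_blocks, fragments


size = 27 * 1024
blocks, fragments = direct_layout(size)
wasted_without_fragments = (blocks + 1) * 8192 - size

print(f"full blocks: {blocks}, trailing fragments: {fragments}")
print(f"space saved compared with a fourth full block: {wasted_without_fragments} bytes")
```

Without fragments, the last 3 KB would occupy a whole 8-KB block and leave 5 KB unusable by any other file.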

Fragment allocation and management require extra overhead within the kernel, and the payback is diminished once indirect pointers are required to map data. Once a file has grown to the point where all 12 direct pointers in the inode are used, all future allocations are of whole disk blocks, and no fragmentation is attempted for the file.

Allocation Policy

Specialized allocation policies are used by the kernel when an inode or data blocks are requested. An inode is allocated as follows:

  • Allocate inodes for files in the same cylinder group as their parent directory.

  • Allocate inodes for new directories in the cylinder group with a higher than average number of free inodes and the smallest number of directory entries.
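The second rule can be sketched as a two-step selection: keep the cylinder groups with at least the average number of free inodes, then take the one holding the fewest directories. This is our reading of the rule, not the kernel's code; the CylinderGroup class, its field names, and the sample numbers are hypothetical, and "at least average" is used instead of "higher than average" so that the candidate list can never be empty.

```python
from dataclasses import dataclass


@dataclass
class CylinderGroup:
    index: int
    free_inodes: int
    directories: int   # directories already placed in this group


def pick_group_for_directory(groups: list) -> CylinderGroup:
    """Choose a cylinder group for a new directory's inode."""
    average = sum(g.free_inodes for g in groups) / len(groups)
    candidates = [g for g in groups if g.free_inodes >= average]
    return min(candidates, key=lambda g: g.directories)


groups = [
    CylinderGroup(0, free_inodes=100, directories=40),
    CylinderGroup(1, free_inodes=900, directories=12),
    CylinderGroup(2, free_inodes=800, directories=5),
    CylinderGroup(3, free_inodes=200, directories=1),
]
chosen = pick_group_for_directory(groups)
print("new directory goes to cylinder group", chosen.index)
```

Group 3 has the fewest directories but is short on free inodes, so it is skipped; spreading directories this way leaves room for each directory's files to be placed beside it under the first rule.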

Data block allocation follows these rules:

  • If we are allocating a direct block, allocate the block in the same cylinder group as the inode that describes the file.

  • If the block is not available or if allocating an indirect block, allocate the block in a group with a greater than average number of free blocks.

A secondary set of allocation policies follows:

  • Allocate the requested block.

  • Allocate a rotationally equivalent block in the same cylinder group.

  • Allocate any block in the same cylinder group.

  • If no data blocks are available in the requested cylinder group, then try another cylinder group with a greater than average number of free blocks.

Exceptions to these policies are made if there is insufficient space within the cylinder group. In this manner, a single file may still occupy the entire available disk space (within tunable limits), but in general all the blocks of a file tend to exist within the narrow boundary of a cylinder group. This improves the locality of the data and greatly speeds up file access during operations such as copies and backups, or any other time the entire file needs to be accessed. Another tunable file system parameter in the kernel is maxbpg, which is set to 25 percent by default. This parameter states the maximum percentage of space in a cylinder group that may be allocated to a single file. This is an advisory limit and may be exceeded if a file can find no other available blocks.

Other File System Types

There are other file system types that may be utilized with the HP-UX operating system. The most common is the Veritas file system (VxFS). It is a third-party offering, so we will not go into the internal specifics of VxFS in this book. In general, the features offered by VxFS are transactional journaling (or intent logging) to speed up file system recovery following a system crash and a variety of data storage algorithms based on the allocation of large contiguous blocks of disk space for the storage of large files (it is also called an extent-based file system).

Following the recent merger of Hewlett-Packard and Compaq, features of the Compaq Tru64 operating system, particularly the Advanced File System, are being considered for porting to HP-UX.

File System Utility Wrappers

As the metadata utilized by various file system implementations may be quite different, the administrative commands used to manipulate, repair, and monitor them must be specific for the type they work with. When a developer creates a new file system type to be used with the HP-UX kernel, it is her or his responsibility to also create the utilities that work with the specific file system. So that the administrator doesn't have to learn a different set of command names for each type of file system, most of the common commands have been converted to command wrappers.

A wrapper command is merely a front end to the file system type-specific executable. Most wrapper commands allow you to pass the file system type as an option, or if the type is not passed, they simply examine the superblock for a magic number identifying the type. Once the type is known, the wrapper simply passes control to the type-specific executable. One way to see if a command is a wrapper is to find the "See also" section of its man page. For the fsdb command, you would find references to fsdb_hfs and fsdb_vxfs, among others. In most cases, you may also pass the type-specific executable name to the man command to learn about additional options and flags for the command.
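The dispatch logic of a wrapper can be sketched in a few lines. This is an illustration of the idea only, not the HP-UX implementation: the function name is ours, and the magic numbers in the table are placeholders invented for the example, not the real superblock values. Only the resulting names fsdb_hfs and fsdb_vxfs come from the text.

```python
from typing import Optional

# Placeholder magic numbers, invented for this illustration only.
MAGIC_TO_TYPE = {
    0x1111: "hfs",
    0x2222: "vxfs",
}


def type_specific_command(command: str,
                          fstype: Optional[str] = None,
                          superblock_magic: Optional[int] = None) -> str:
    """Return the name of the type-specific executable a wrapper would run."""
    if fstype is None:
        # No type given by the caller: fall back to the superblock magic number.
        if superblock_magic not in MAGIC_TO_TYPE:
            raise ValueError("cannot determine file system type")
        fstype = MAGIC_TO_TYPE[superblock_magic]
    return f"{command}_{fstype}"


print(type_specific_command("fsdb", fstype="hfs"))
print(type_specific_command("fsdb", superblock_magic=0x2222))
```

A real wrapper would then execute the resulting program with the remaining arguments unchanged.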



HP-UX 11i Internals
ISBN: 0130328618
Year: 2006
Pages: 167