15.7. Logging

Reliability and availability are important criteria for commercial systems, and both may be compromised if the file system does not provide the required level of robustness. The term journaling is often used as if it meant just one thing, but file system logging can in fact be implemented in several ways. The three most common forms of journaling are:

  • Metadata logging. Logs only file system structure changes

  • File and metadata logging. Logs all changes to the file system

  • Log-structured file system. Is an entire file system implemented as a log

The most common form of file system logging is metadata logging, and this is what UFS implements. When a file system makes changes to its on-disk structure, it uses several disconnected synchronous writes to make the changes. If an outage occurs halfway through an operation, the state of the file system is unknown, and the whole file system must be checked for consistency. For example, if a file is being extended, the free block bitmap must be updated to mark the newly allocated block as no longer free, and the inode block list must also be updated to indicate that the allocated block is owned by the file. If an outage occurs after the block is allocated but before the inode is updated, the file system is left inconsistent.

A metadata logging file system such as UFS has an on-disk, cyclic, append-only log area that it can use to record the state of each disk transaction. Before any on-disk structures are changed, an intent-to-change record is written to the log. The on-disk structure is then updated, and when the update is complete, the log entry is marked complete. Since every change to the file system structure is in the log, we can check the consistency of the file system by looking in the log, and we need not do a full file system scan. At mount time, if an intent-to-change entry is found but not marked complete, the changes are not applied to the file system. Figure 15.17 illustrates how metadata logging works.

Figure 15.17. File System Metadata Logging
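The mount-time rule described above can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: log_rec_t and log_replay() are hypothetical names, the "disk" is an array of integers, and none of this is the UFS implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical log record; not a UFS structure. */
typedef struct log_rec {
    int  block;      /* index of the metadata block to change */
    int  new_value;  /* value to store in that block */
    bool complete;   /* set once the log entry is marked complete */
} log_rec_t;

/*
 * Replay the log at "mount" time: apply only records that were
 * marked complete and skip incomplete intent records.
 * Returns the number of records applied.
 */
size_t
log_replay(const log_rec_t *log, size_t nrec, int *disk, size_t ndisk)
{
    size_t applied = 0;

    for (size_t i = 0; i < nrec; i++) {
        if (!log[i].complete)
            continue;       /* incomplete intent: discard */
        if (log[i].block < 0 || (size_t)log[i].block >= ndisk)
            continue;       /* ignore out-of-range records */
        disk[log[i].block] = log[i].new_value;
        applied++;
    }
    return applied;
}
```

Because only the log is scanned, the cost of the check depends on the size of the log and not on the size of the file system.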


Logging was first introduced in UFS in Solaris 2.4 and has come a long way since then, to the point of being turned on by default in Solaris 10. Enabling logging turns the file system into a transaction-based file system: either the entire transaction is applied or it is completely discarded. Logging can also be turned on manually with mount(1M) -o logging (which uses the _FIOLOGENABLE ioctl). Logging is not compatible with Solaris Logical Volume Manager (SVM) translogging, and any attempt to turn on logging on a UFS file system that resides on an SVM translogging device will fail.

15.7.1. On-Disk Log Data Structures

The on-disk log is allocated from contiguous blocks where possible and only in full-sized file system blocks; fragments are not allowed. The initial pool of blocks is allocated when logging is first enabled on a file system, and blocks are not freed until logging is disabled. UFS uses these blocks for its own metadata and for times when it needs to store file system changes that have not yet been applied to the file system. This space on the file system is known as the "on-disk" log, or log for short. It requires approximately 1 Mbyte per 1 Gbyte of file system space. The default minimum size for the log is 1 Mbyte, and the default maximum log size is 64 Mbytes. Figure 15.18 illustrates the on-disk log layout.

Figure 15.18. On-Disk Log Data Structure Layout
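The sizing rule quoted above (about 1 Mbyte of log per Gbyte of file system, with a 1-Mbyte minimum and a 64-Mbyte maximum) can be written as a small calculation. The helper name default_log_size() and the choice to round down to whole Gbytes are assumptions made for illustration; the real allocation code may round differently.

```c
#include <stdint.h>

#define MB (1024ULL * 1024ULL)
#define GB (1024ULL * MB)

/*
 * Hypothetical helper: default log size in bytes for a file system
 * of fs_bytes bytes, using 1 Mbyte per whole Gbyte and clamping the
 * result to the range [1 Mbyte, 64 Mbytes].
 */
uint64_t
default_log_size(uint64_t fs_bytes)
{
    uint64_t size = (fs_bytes / GB) * MB;

    if (size < 1 * MB)
        size = 1 * MB;
    if (size > 64 * MB)
        size = 64 * MB;
    return size;
}
```

Under these assumptions, any file system of 64 Gbytes or more gets the maximum log size.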


The file system superblock contains the block number of the main on-disk logging structure, extent_block_t, which is defined by the extent_block structure. Note that the extent_block structure and all the accompanying extent structures fit within a file system block.

typedef struct extent_block {
        uint32_t        type;           /* Set to LUFS_EXTENTS to identify */
                                        /*   structure on disk. */
        int32_t         chksum;         /* Checksum over entire block. */
        uint32_t        nextents;       /* Size of extents array. */
        uint32_t        nbytes;         /* #bytes mapped by extent_block. */
        uint32_t        nextbno;        /* blkno of next extent_block. */
        extent_t        extents[1];
} extent_block_t;
                                                See usr/src/uts/common/sys/fs/ufs_log.h


The extent_block structure describes logging metadata and is the main data structure used to find the on-disk log. It is followed by a series of extents that contain the physical block number for on-disk logging segments. The number of extents present for the file system is described by the nextents field in the extent_block structure.

typedef struct extent {
        uint32_t        lbno;   /* Logical block #within the space */
        uint32_t        pbno;   /* Physical block number of extent. */
                                /* in disk blocks for non-MTB ufs */
                                /* in frags for MTB ufs */
        uint32_t        nbno;   /* # blocks in this extent */
} extent_t;
                                                See usr/src/uts/common/sys/fs/ufs_log.h
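To show how an extent array maps the logical log space onto physical blocks, here is a minimal lookup sketch. The function extent_lookup() is hypothetical, and the sketch assumes that lbno, pbno, and nbno are all expressed in the same block units (the non-MTB case); it is not the kernel's translation routine.

```c
#include <stdint.h>

/* Same fields as the extent structure shown above. */
typedef struct extent {
    uint32_t lbno;   /* logical block number within the log space */
    uint32_t pbno;   /* physical block number of the extent */
    uint32_t nbno;   /* number of blocks in this extent */
} extent_t;

/*
 * Hypothetical helper: translate a logical log block number into a
 * physical block number. Returns 0 and stores the result in *pbnop,
 * or returns -1 if no extent maps the logical block.
 */
int
extent_lookup(const extent_t *ext, uint32_t nextents, uint32_t lbno,
    uint32_t *pbnop)
{
    for (uint32_t i = 0; i < nextents; i++) {
        if (lbno >= ext[i].lbno && lbno - ext[i].lbno < ext[i].nbno) {
            *pbnop = ext[i].pbno + (lbno - ext[i].lbno);
            return 0;
        }
    }
    return -1;
}
```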


Only the first extent structure is allowed to contain an ml_odunit structure (in short, the metadata logging on-disk unit structure).

typedef struct ml_odunit {
        uint32_t        od_version;      /* version number */
        uint32_t        od_badlog;       /* is the log okay? */
        uint32_t        od_unused1;

        /*
         * Important constants
         */
        uint32_t        od_maxtransfer;  /* max transfer in bytes */
        uint32_t        od_devbsize;     /* device bsize */
        int32_t         od_bol_lof;      /* byte offset to begin of log */
        int32_t         od_eol_lof;      /* byte offset to end of log */

        /*
         * The disk space is split into state and circular log
         */
        uint32_t        od_requestsize;  /* size requested by user */
        uint32_t        od_statesize;    /* size of state area in bytes */
        uint32_t        od_logsize;      /* size of log area in bytes */
        int32_t         od_statebno;     /* first block of state area */
        int32_t         od_unused2;

        /*
         * Head and tail of log
         */
        int32_t         od_head_lof;     /* byte offset of head */
        uint32_t        od_head_ident;   /* head sector id # */
        int32_t         od_tail_lof;     /* byte offset of tail */
        uint32_t        od_tail_ident;   /* tail sector id # */
        uint32_t        od_chksum;       /* checksum to verify ondisk contents */

        /*
         * Used for error recovery
         */
        uint32_t        od_head_tid;     /* used for logscan; set at sethead */

        /*
         * Debug bits
         */
        int32_t         od_debug;

        /*
         * Misc
         */
        struct timeval  od_timestamp;    /* time of last state change */
} ml_odunit_t;
                                                See usr/src/uts/common/sys/fs/ufs_log.h


The values in the ml_odunit_t structure represent the location, usage, and state of the on-disk log. The contents of the on-disk log consist of delta structures, which define the changes, followed by the actual changes themselves. Each 512-byte disk block of the on-disk log contains a sect_trailer at the end of the block, which identifies the disk block as containing valid deltas. The *_lof fields reference byte offsets in the logical on-disk layout, not in the physical on-disk contents.

struct delta {
        int64_t          d_mof;  /* byte offset on device to start writing */
                                 /*   delta */
        int32_t          d_nb;   /* # bytes in the delta */
        delta_t          d_typ;  /* Type of delta.  Defined in ufs_trans.h */
};
                                                See usr/src/uts/common/sys/fs/ufs_log.h


typedef struct sect_trailer {
        uint32_t         st_tid;         /* transaction id */
        uint32_t         st_ident;       /* unique sector id */
} sect_trailer_t;
                                                See usr/src/uts/common/sys/fs/ufs_log.h
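Since every 512-byte log sector ends with a sect_trailer, only part of each sector carries delta data. The following sketch works out that payload size and the number of sectors a given amount of delta data would occupy. SECTOR_SIZE, SECTOR_PAYLOAD, and sectors_needed() are illustrative names, not identifiers from the UFS source.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512u    /* log sector size, per the text */

/* Same fields as the sect_trailer structure shown above. */
typedef struct sect_trailer {
    uint32_t st_tid;    /* transaction id */
    uint32_t st_ident;  /* unique sector id */
} sect_trailer_t;

/* Bytes of delta data that fit in one sector ahead of the trailer. */
#define SECTOR_PAYLOAD (SECTOR_SIZE - sizeof (sect_trailer_t))

/* Hypothetical helper: log sectors needed for nbytes of delta data. */
size_t
sectors_needed(size_t nbytes)
{
    return (nbytes + SECTOR_PAYLOAD - 1) / SECTOR_PAYLOAD;
}
```

This is one reason the *_lof offsets are logical rather than physical: a logical byte range in the log does not line up one-to-one with raw sector contents.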


15.7.2. In-Core Log Data Structures

Figure 15.19 illustrates the data structures for in-core logging.

Figure 15.19. In-Core Log Data Structure Layout


ml_unit_t is the main in-core logging structure. There is only one per file system, and it contains all logging information or pointers to all logging data structures for the file system. The un_ondisk field contains an in-memory replica of the on-disk ml_odunit structure.

typedef struct ml_unit {
        struct ml_unit  *un_next;        /* next incore log */
        int             un_flags;        /* Incore state */
        buf_t           *un_bp;          /* contains memory for un_ondisk */
        struct ufsvfs   *un_ufsvfs;      /* backpointer to ufsvfs */
        dev_t           un_dev;          /* for convenience */
        ic_extent_block_t *un_ebp;       /* block of extents */
        size_t          un_nbeb;         /* # bytes used by *un_ebp */
        struct mt_map   *un_deltamap;    /* deltamap */
        struct mt_map   *un_logmap;      /* logmap includes moby trans stuff */
        struct mt_map   *un_matamap;     /* optional - matamap */

        /*
         * Used for managing transactions
         */
        uint32_t        un_maxresv;      /* maximum reservable space */
        uint32_t        un_resv;         /* reserved byte count for this trans */
        uint32_t        un_resv_wantin;  /* reserved byte count for next trans */

        /*
         * Used during logscan
         */
        uint32_t        un_tid;

        /*
         * Read/Write Buffers
         */
        cirbuf_t        un_rdbuf;        /* read buffer space */
        cirbuf_t        un_wrbuf;        /* write buffer space */

        /*
         * Ondisk state
         */
        ml_odunit_t     un_ondisk;       /* ondisk log information */

        /*
         * locks
         */
        kmutex_t        un_log_mutex;    /* allows one log write at a time */
        kmutex_t        un_state_mutex;  /* only 1 state update at a time */
} ml_unit_t;
                                                See usr/src/uts/common/sys/fs/ufs_log.h


mt_map_t tracks all the deltas for the file system. At least three mt_map_t structures are defined:

  • deltamap. Tracks all deltas for currently active transactions. When a file system transaction completes, all deltas from the delta map are written to the log map and all the entries are then removed from the delta map.

  • logmap. Tracks all committed deltas from completed transactions, not yet applied to the file system.

  • matamap. Is the debug map for delta verification.

See usr/src/uts/common/sys/fs/ufs_log.h for the definition of the mt_map structure.

struct mapentry {
        /*
         * doubly linked list of all mapentries in map -- MUST BE FIRST
         */
        mapentry_t      *me_next;
        mapentry_t      *me_prev;
        mapentry_t      *me_hash;
        mapentry_t      *me_agenext;
        mapentry_t      *me_cancel;
        crb_t           *me_crb;
        int             (*me_func)();
        ulong_t         me_arg;
        ulong_t         me_age;
        struct delta    me_delta;
        uint32_t        me_tid;
        off_t           me_lof;
        ushort_t        me_flags;
};
                                                See usr/src/uts/common/sys/fs/ufs_log.h


The mapentry structure defines changes to file system metadata. All existing mapentries for a given mt_map are linked into the mt_map at the mtm_next and mtm_prev fields. The mtm_hash field of the mt_map is a hash list of all the mapentries, hashed according to the master byte offset of the delta on the file system and the MAPBLOCKSIZE. For example, the MAP_HASH macro determines the hash list to which a mapentry for the offset mof belongs (where mtm_nhash is the total number of hash lists for the map). The default size used for MAPBLOCKSIZE is 8192 bytes, the hash size for the delta map is 512 entries, and the hash size for the log map is 2048 entries.

#define MAP_INDEX(mof, mtm) \
        (((mof) >> MAPBLOCKSHIFT) & (mtm->mtm_nhash-1))
#define MAP_HASH(mof, mtm) \
        ((mtm)->mtm_hash + MAP_INDEX((mof), (mtm)))
                                                See usr/src/uts/common/sys/fs/ufs_log.h
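The index calculation can be tried outside the kernel with a stand-alone version of MAP_INDEX. In this sketch, map_index() is a hypothetical function that takes the number of hash lists directly instead of an mt_map pointer, and MAPBLOCKSHIFT is taken to be 13 because that is the shift that corresponds to the 8192-byte MAPBLOCKSIZE given in the text.

```c
#include <stdint.h>

#define MAPBLOCKSHIFT 13                    /* log2(8192) */
#define MAPBLOCKSIZE  (1 << MAPBLOCKSHIFT)  /* 8192 bytes, per the text */

/*
 * Stand-alone version of the MAP_INDEX calculation. nhash must be a
 * power of two for the mask to select a valid hash list.
 */
static inline uint32_t
map_index(int64_t mof, uint32_t nhash)
{
    return (uint32_t)((mof >> MAPBLOCKSHIFT) & (nhash - 1));
}
```

All offsets inside the same 8192-byte block land on the same hash list, and block numbers wrap around modulo the number of lists.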


A canceled mapentry, one with the ME_CANCEL bit set in the me_flags field, is a special type of mapentry. This type of mapentry is basically a placeholder for freed blocks and fragments. It can also represent an old mapentry that is no longer valid because a new mapentry exists for the same offset. Freed blocks and fragments are not eligible for reallocation until all deltas have been written to the on-disk log. Any attempt to allocate a block or fragment for which a corresponding canceled mapentry exists in the logmap results in the allocation of a different block or fragment.

typedef struct crb {
        int64_t         c_mof;          /* master file offset of buffer */
        caddr_t         c_buf;          /* pointer to cached roll buffer */
        uint32_t        c_nb;           /* size of buffer */
        ushort_t        c_refcnt;       /* reference count on crb */
        uchar_t         c_invalid;      /* crb should not be used */
} crb_t;
                                                See sys/fs/ufs_log.h


The crb_t, or cache roll buffer, caches deltas that fall within the same disk block. It is purely a performance enhancement for the time when information is rolled to the file system: it helps reduce the reads and writes that can occur while the deltas of completed transactions are written to the file system. It also improves performance on read hits of deltas.

UFS logging maintains private buf_t structures used for reading and writing the on-disk log. These buf_t structures are managed through cirbuf_t structures. Each file system has two cirbuf_t structures: one to manage log reads and one to manage log writes.

typedef struct cirbuf {
        buf_t           *cb_bp;         /* buf's with space in circular buf */
        buf_t           *cb_dirty;      /* filling this buffer for log write */
        buf_t           *cb_free;       /* free bufs list */
        caddr_t         cb_va;          /* address of circular buffer */
        size_t          cb_nb;          /* size of circular buffer */
        krwlock_t       cb_rwlock;      /* r/w lock to protect list mgmt. */
} cirbuf_t;
                                                See sys/fs/ufs_log.h


15.7.3. Summary Information

Summary information is critical to maintaining the state of the file system. Summary information includes counts of directories, free blocks, free fragments, and free inodes. These counts exist in each cylinder group and are valid only for that cylinder group. All cylinder group summary information is totaled; these totals are kept in the fs_cstotal field of the superblock. A copy of all the cylinder groups' summary information is also kept in a buffer pointed to from the file system superblock's fs_csp field. Also kept on disk for redundancy is a copy of the fs_csp buffer, whose block address is stored in the fs_csaddr field of the file system superblock.

All cylinder group summary information can be determined by reading the cylinder groups themselves, as opposed to reading it from the fs_csaddr blocks on disk. Hence, updates to fs_csaddr are logged only for large file systems (those in which the total number of cylinder groups exceeds ufs_ncg_log, which defaults to 10,000). If a file system isn't logging deltas to the fs_csaddr area, then ufsvfs->vfs_nolog_si is set to 1 and UFS instead marks the fs_csaddr area as bad by setting the superblock's fs_si field to FS_SI_BAD. These changes are brought up to date when an unmount or a log roll takes place.
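A minimal sketch of the two ideas above: totaling per-cylinder-group summaries, and deciding whether summary-area updates are logged. The type cg_summary_t and both functions are hypothetical stand-ins written for this example; the field names are illustrative and the real structures live in the UFS headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative per-cylinder-group counts named in the text. */
typedef struct cg_summary {
    int64_t ndir;    /* directories */
    int64_t nbfree;  /* free blocks */
    int64_t nifree;  /* free inodes */
    int64_t nffree;  /* free fragments */
} cg_summary_t;

/* Total the summaries of all cylinder groups. */
cg_summary_t
total_summary(const cg_summary_t *cgs, size_t ncg)
{
    cg_summary_t total = {0, 0, 0, 0};

    for (size_t i = 0; i < ncg; i++) {
        total.ndir += cgs[i].ndir;
        total.nbfree += cgs[i].nbfree;
        total.nifree += cgs[i].nifree;
        total.nffree += cgs[i].nffree;
    }
    return total;
}

/*
 * Summary-area updates are logged only when the number of cylinder
 * groups exceeds the threshold (ufs_ncg_log in the text).
 */
int
summary_updates_logged(size_t ncg, size_t ncg_log)
{
    return ncg > ncg_log;
}
```

The totaling loop also shows why small file systems can skip logging the summary area: recomputing it from the cylinder groups is a single pass.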

15.7.4. Transactions

A transaction is defined as a file system operation that modifies file system metadata. A group of these file system transactions is known as a moby transaction.

Logging transactions are divided into two types:

  • Synchronous file system transactions are those that are committed and written to the log as soon as the file system transaction ends.

  • Asynchronous file system transactions are those for which the file system transactions are committed and written to the on-disk log after closure of the moby transaction. In this case the file system transaction may complete, but the metadata that it modified is not written to the log and is not considered committed until the moby transaction has been completed.

So what exactly are committed transactions? Well, they are transactions whose deltas (unit changes to the file system) have been moved from the delta map to the log map and written to the on-disk log.

There are four steps involved in logging metadata changes of a file system transaction:

1. Reserve space in the log.

2. Begin a file system transaction.

3. Enter deltas in the delta map for all the metadata changes.

4. End the file system transaction.
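The four steps, together with the synchronous and asynchronous commit behavior described earlier, can be followed in a toy model. Everything here is hypothetical (toy_moby_t, toy_begin(), toy_delta(), toy_end()): deltas are reduced to counters, and a caller that cannot reserve space gets an error, whereas the text says a real transaction would wait.

```c
#include <stddef.h>

/* Toy model of a moby transaction; not a UFS structure. */
typedef struct toy_moby {
    size_t free_bytes;  /* log space not yet reserved */
    size_t ndelta;      /* deltas in the delta map */
    size_t nlogmap;     /* deltas in the log map, not yet committed */
    size_t ncommitted;  /* deltas committed to the on-disk log */
} toy_moby_t;

/* Steps 1 and 2: reserve log space, then begin. Returns 0 or -1. */
int
toy_begin(toy_moby_t *mp, size_t resv)
{
    if (resv > mp->free_bytes)
        return -1;      /* a real transaction would wait here */
    mp->free_bytes -= resv;
    return 0;
}

/* Step 3: enter one delta in the delta map. */
void
toy_delta(toy_moby_t *mp)
{
    mp->ndelta++;
}

/*
 * Step 4: end the transaction. Deltas move from the delta map to the
 * log map; a synchronous end also commits everything in the log map.
 */
void
toy_end(toy_moby_t *mp, int sync)
{
    mp->nlogmap += mp->ndelta;
    mp->ndelta = 0;
    if (sync) {
        mp->ncommitted += mp->nlogmap;
        mp->nlogmap = 0;
    }
}
```

In this model the reserved space is never returned, which mirrors the point that log space is reclaimed only later, when the log is rolled.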

15.7.4.1. Reserving Space in the Log

A file system transaction that is to log metadata changes should first reserve space in the log. This prevents hangs if the on-disk log is full. A file system transaction that is part of the current moby transaction cannot complete if there isn't enough log space to log the deltas. Log space cannot be reclaimed until the current moby transaction completes and is committed, and the current moby transaction can't complete until all file system transactions in it complete. Thus the file system transaction must reserve space in the log when it enters the current moby transaction. If there is not enough log space available, the file system transaction waits until sufficient log space becomes available before entering the current moby transaction.

The amount of space reserved in the log for write and truncation operations varies, depending on the size of the operation. The macro TRANS_WRITE_RESV estimates how much log space is needed for the operation.

#define TRANS_WRITE_RESV(ip, uiop, ulp, resvp, residp)  \
        if ((TRANS_ISTRANS(ip->i_ufsvfs) != NULL) && (ulp != NULL)) \
                ufs_trans_write_resv(ip, uiop, resvp, residp);
                                                See sys/fs/ufs_trans.h


All other file system transactions have a constant transaction size, and UFS has predefined macros for these operations:

/*
 * size calculations
 */
#define TOP_CREATE_SIZE(IP)     \
        (ACLSIZE(IP) + SIZECG(IP) + DIRSIZE(IP) + INODESIZE)
#define TOP_REMOVE_SIZE(IP)     \
        DIRSIZE(IP)  + SIZECG(IP) + INODESIZE + SIZESB
#define TOP_LINK_SIZE(IP)       \
        DIRSIZE(IP) + INODESIZE
#define TOP_RENAME_SIZE(IP)     \
        DIRSIZE(IP) + DIRSIZE(IP) + SIZECG(IP)
#define TOP_MKDIR_SIZE(IP)      \
        DIRSIZE(IP) + INODESIZE + DIRSIZE(IP) + INODESIZE + FRAGSIZE(IP) + \
            SIZECG(IP) + ACLSIZE(IP)
#define TOP_SYMLINK_SIZE(IP)    \
        DIRSIZE((IP)) + INODESIZE + INODESIZE + SIZECG(IP)
#define TOP_GETPAGE_SIZE(IP)    \
        ALLOCSIZE + ALLOCSIZE + ALLOCSIZE + INODESIZE + SIZECG(IP)
#define TOP_SYNCIP_SIZE         INODESIZE
#define TOP_READ_SIZE           INODESIZE
#define TOP_RMDIR_SIZE          (SIZESB + (INODESIZE * 2) + SIZEDIR)
#define TOP_SETQUOTA_SIZE(FS)   ((FS)->fs_bsize << 2)
#define TOP_QUOTA_SIZE          (QUOTASIZE)
#define TOP_SETSECATTR_SIZE(IP) (MAXACLSIZE)
#define TOP_IUPDAT_SIZE(IP)     INODESIZE + SIZECG(IP)
#define TOP_SBUPDATE_SIZE       (SIZESB)
#define TOP_SBWRITE_SIZE        (SIZESB)
#define TOP_PUTPAGE_SIZE(IP)    (INODESIZE + SIZECG(IP))
#define TOP_SETATTR_SIZE(IP)    (SIZECG(IP) + INODESIZE + QUOTASIZE + \
                ACLSIZE(IP))
#define TOP_IFREE_SIZE(IP)      (SIZECG(IP) + INODESIZE + QUOTASIZE)
#define TOP_MOUNT_SIZE          (SIZESB)
#define TOP_COMMIT_SIZE         (0)
                                                See sys/fs/ufs_trans.h


15.7.4.2. Starting Transactions

Starting a transaction simply means that the transaction has successfully entered the current moby transaction. As a result, once started, the moby will not end until all active file system transactions have completed. A moby transaction can accommodate both synchronous and asynchronous transactions. Most file system transactions in UFS are asynchronous; however, a synchronous transaction occurs if any of the following are true:

  • If the file system is mounted syncdir

  • If a fsync() system call is executed

  • If DSYNC or O_SYNC open modes are set on reads and writes

  • If RSYNC is set on reads

  • During an unmount of a file system

A transaction can be started with one of the following macros:

  • TRANS_BEGIN_ASYNC. Enters a file system transaction into the current moby transaction. Once the file system transaction ends, the moby transaction may still be active and hence the changes the file system transaction has made have not yet been committed.

    #define TRANS_BEGIN_ASYNC(ufsvfsp, vid, vsize)\
    {\
            if (TRANS_ISTRANS(ufsvfsp))\
                    (void) top_begin_async(ufsvfsp, vid, vsize, 0); \
    }
                                                See sys/fs/ufs_trans.h

  • TRANS_BEGIN_SYNC. Enters a file system transaction into the current moby transaction with the requirement that the completion of the file system transaction forces a completion and commitment of the moby transaction. All file system transactions that have occurred within the moby transaction are also considered as committed.

    #define TRANS_BEGIN_SYNC(ufsvfsp, vid, vsize, error)\
    {\
            if (TRANS_ISTRANS(ufsvfsp)) { \
                    error = 0; \
                    top_begin_sync(ufsvfsp, vid, vsize, &error); \
            } \
    }
                                                See sys/fs/ufs_trans.h

  • TRANS_BEGIN_CSYNC. Does a TRANS_BEGIN_SYNC if the mount option syncdir is set; otherwise, does a TRANS_BEGIN_ASYNC.

  • TRANS_TRY_BEGIN_ASYNC and TRANS_TRY_BEGIN_CSYNC. Try to enter the file system transaction into the moby transaction. If doing so would cause the thread to block, they do not block and return EWOULDBLOCK instead. These macros are used in cases where the calling thread must not block.

    #define TRANS_TRY_BEGIN_ASYNC(ufsvfsp, vid, vsize, err)\
    {\
            if (TRANS_ISTRANS(ufsvfsp))\
                    err = top_begin_async(ufsvfsp, vid, vsize, 1); \
            else\
                    err = 0; \
    }
    #define TRANS_TRY_BEGIN_CSYNC(ufsvfsp, issync, vid, vsize, error)\
    {\
            if (TRANS_ISTRANS(ufsvfsp)) {\
                    if (ufsvfsp->vfs_syncdir) {\
                            ASSERT(vsize); \
                            top_begin_sync(ufsvfsp, vid, vsize, &error); \
                            ASSERT(error == 0); \
                            issync = 1; \
                    } else {\
                            error = top_begin_async(ufsvfsp, vid, vsize, 1); \
                            issync = 0; \
                    }\
            }\
    }
                                                See usr/src/uts/common/sys/fs/ufs_trans.h


15.7.4.3. Ending the Transaction

Once all metadata changes have been completed, the transaction must be ended. This is accomplished by calling one of the following macros:

  • TRANS_END_CSYNC. Calls TRANS_END_ASYNC or TRANS_END_SYNC, depending on which type of file system transaction was initially started.

  • TRANS_END_ASYNC. Ends an asynchronous file system transaction. If, at this point, the log is getting full (the number of mapentries in the logmap is greater than the global variable logmap_maxnme_async), committed deltas in the log are applied to the file system and removed from the log. This is known as "rolling the log" and is done by a separate thread.

    #define TRANS_END_ASYNC(ufsvfsp, vid, vsize)\
    {\
            if (TRANS_ISTRANS(ufsvfsp))\
                    top_end_async(ufsvfsp, vid, vsize); \
    }
                                                See usr/src/uts/common/sys/fs/ufs_trans.h

  • TRANS_END_SYNC. Closes and commits the current moby transaction, and writes all deltas to the on-disk log. A new moby transaction is then started.

    #define TRANS_END_SYNC(ufsvfsp, error, vid, vsize)\
    {\
            if (TRANS_ISTRANS(ufsvfsp))\
                    top_end_sync(ufsvfsp, &error, vid, vsize); \
    }
                                                See usr/src/uts/common/sys/fs/ufs_trans.h


15.7.5. Rolling the Log

Occasionally, the data in the log needs to be written back to the file system, a procedure called log rolling. Log rolling occurs for the following reasons:

  • To update the on-disk file system with committed metadata deltas

  • To free space in the log for new deltas

  • To roll the entire log to disk at unmount

  • To partially roll the on-disk log when it is getting full

  • To completely roll the log with the _FIOFFS ioctl (file system flush)

  • To partially roll the log every 5 seconds when no new deltas exist in the log

  • To roll some deltas when the log map is getting full (that is, when logmap has more than logmap_maxnme mapentries, by default, 1536)

The actual rolling of the log is handled by the log roll thread, which executes the trans_roll() function found in usr/src/uts/common/fs/ufs/lufs_thread.c. The trans_roll() function preallocates a number of rollbuf_t structures (based on LUFS_DEFAULT_NUM_ROLL_BUFS = 16, LUFS_DEFAULT_MIN_ROLL_BUFS = 4, LUFS_DEFAULT_MAX_ROLL_BUFS = 64) to handle rolling deltas from the log to the file system.

typedef uint16_t rbsecmap_t;

typedef struct rollbuf {
        buf_t rb_bh;            /* roll buffer header */
        struct rollbuf *rb_next; /* link for mof ordered roll bufs */
        crb_t *rb_crb;          /* cached roll buffer to roll */
        mapentry_t *rb_age;     /* age list */
        rbsecmap_t rb_secmap;   /* sector map */
} rollbuf_t;
                                                See usr/src/uts/common/sys/fs/ufs_log.h


Along with allocating memory for the rollbuf_t structures, trans_roll() also allocates MAPBLOCKSIZE * lufs_num_roll_bufs bytes to be used by the rollbuf_t's buf_t structure stored in rb_bh. These rollbuf_t's are populated according to information found in the rollable mapentries of the logmap. All rollable mapentries are rolled starting from the logmap's un_head_lof offset and continuing until an unrollable mapentry is found. Once a rollable mapentry is found, all other rollable mapentries within the same MAPBLOCKSIZE segment on the file system device are located and mapped by the same rollbuf structure.

If all mapentries mapped by a rollbuf have the same cache roll buffer (crb), then this crb maps the on-disk block and buffer containing the deltas for the rollbuf's buf_t. Otherwise, the rollbuf's buf_t uses MAPBLOCKSIZE bytes of kernel memory allocated by the trans_roll thread to do the transfer. The buf_t reads the MAPBLOCKSIZE bytes on the file system device into the rollbuf buffer. The deltas defined by each mapentry are then overlaid on the old data read into the rollbuf buffer. This buffer is then written to the file system device.
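The read, overlay, and write sequence for the non-crb case can be illustrated with a buffer in user space. This is a sketch under stated assumptions: toy_delta_t and roll_overlay() are hypothetical, the caller is assumed to have already read the old MAPBLOCKSIZE bytes into buf, and writing the buffer back is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAPBLOCKSIZE 8192   /* per the text */

/* Hypothetical delta: where the bytes go and what they are. */
typedef struct toy_delta {
    int64_t     mof;   /* byte offset on the device */
    int32_t     nb;    /* number of bytes in the delta */
    const void *data;  /* new contents */
} toy_delta_t;

/*
 * Overlay every delta that lies inside the MAPBLOCKSIZE block starting
 * at block_mof onto buf, which holds the old data read from the device.
 * Returns the number of deltas applied.
 */
size_t
roll_overlay(uint8_t *buf, int64_t block_mof, const toy_delta_t *d, size_t n)
{
    size_t applied = 0;

    for (size_t i = 0; i < n; i++) {
        int64_t off = d[i].mof - block_mof;

        if (d[i].nb <= 0 || off < 0 || off + d[i].nb > MAPBLOCKSIZE)
            continue;   /* delta does not belong to this block */
        memcpy(buf + off, d[i].data, (size_t)d[i].nb);
        applied++;
    }
    return applied;
}
```

Bytes not covered by any delta keep their old contents, which is why the block has to be read before it is written back.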

If the rollbufs contain holes, these rollbufs may have to issue more than one write to disk to complete writing the deltas. To asynchronously write these deltas, the rollbuf's buf_t structure is cloned for each additional write required for the given rollbuf. These cloned buf_t structures are linked into the rollbuf's buf_t structure at the b_list field. All writes defined by the rollbuf's buf_t structures and any clone buf_t structures are issued asynchronously.

The trans_roll() thread waits for all these writes to complete. If any fail, a warning is printed to the console and the log is marked as LDL_ERROR in the logmap->un_flags field. If the roll completes successfully, all corresponding mapentries are completely removed from the log map. The head of the log map is then adjusted to reflect this change, as illustrated in Figure 15.20.

Figure 15.20. Adjustment of Head of Log Map


15.7.6. Redirecting Reads and Writes to the Log

When the UFS module is loaded, the global variable bio_lufs_strategy is set to point to the lufs_strategy() function. As a result, the bread_common() and bwrite_common() functions redirect reads and writes to bio_lufs_strategy (if it exists and if logging is enabled). lufs_strategy() then determines if the I/O request is a read or a write and dispatches to either lufs_read_strategy() or lufs_write_strategy(). These functions are responsible for resolving the read/write request from and to the log. In some instances in UFS, the functions lufs_read_strategy() and lufs_write_strategy() are called directly, bypassing the bio_lufs_strategy() code path.

15.7.6.1. lufs_read_strategy() Behavior

The lufs_read_strategy() function is called for reading metadata in the log. Mapentries already in the log map that correspond to the requested byte range are linked in the me_agenext list and have the ME_AGE bit set to indicate that they are in use. If the bytes being read are not defined in a logmap mapentry, the data is read from the file system as normal. Otherwise, lufs_read_strategy() then calls ldl_read() to read the data from the log.

The function ldl_read() can get the requested data from a variety of sources:

  • A cache roll buffer

  • The write buffer originally used to write this data to the log (mlunit->un_wrbuf)

  • The buffer previously used to read this data from the log (mlunit->un_rdbuf)

  • The on-disk log itself
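The first decision in the read path, whether the requested byte range is covered by any logmap mapentry, is an interval-overlap test. The sketch below shows that test only; toy_mapentry_t and read_hits_logmap() are hypothetical names, and the real code walks hash lists and then picks one of the sources listed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapentry reduced to the byte range of its delta. */
typedef struct toy_mapentry {
    int64_t mof;  /* byte offset on the device */
    int32_t nb;   /* number of bytes in the delta */
} toy_mapentry_t;

/*
 * Return 1 if the read of nb bytes at offset mof overlaps any delta in
 * the array (the data must come from the log), or 0 if the read can go
 * to the file system as normal.
 */
int
read_hits_logmap(const toy_mapentry_t *me, size_t n, int64_t mof, int64_t nb)
{
    for (size_t i = 0; i < n; i++) {
        if (me[i].mof < mof + nb && mof < me[i].mof + me[i].nb)
            return 1;
    }
    return 0;
}
```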

15.7.6.2. lufs_write_strategy() Behavior

The lufs_write_strategy() function writes deltas defined by mapentries from the delta map to the log map, if any exist. It does so by calling logmap_add() or logmap_add_buf(): logmap_add_buf() is used when crb buffers are being used; otherwise logmap_add() is used. These functions in turn call ldl_write() to actually write the data to the log.

The function ldl_write() always writes data into the memory buffer of the buf_t contained in the write cirbuf_t structure. Hence, a requested write is not necessarily written to the physical on-disk log right away. Writes to the physical on-disk log occur when the log rolls the tail around back to the head, when the write buf_t buffer is full, or when a commit record is written.
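A toy buffered writer makes the flush conditions easier to see. It models two of the three triggers, a full buffer and a commit record, and leaves out the tail wrapping around to the head. The names toy_wrbuf_t and toy_ldl_write() are hypothetical, and flushing is reduced to incrementing a counter.

```c
#include <stddef.h>

/* Toy write buffer; not the cirbuf_t structure. */
typedef struct toy_wrbuf {
    size_t   used;    /* bytes currently buffered */
    size_t   size;    /* capacity of the buffer */
    unsigned nflush;  /* number of writes to the "on-disk" log */
} toy_wrbuf_t;

/*
 * Buffer nb bytes. The buffer is flushed whenever it fills up, and
 * once more at the end if this write carries a commit record.
 */
void
toy_ldl_write(toy_wrbuf_t *wb, size_t nb, int commit)
{
    if (wb->size == 0)
        return;             /* nothing can be buffered */

    while (nb > 0) {
        size_t room = wb->size - wb->used;
        size_t n = nb < room ? nb : room;

        wb->used += n;
        nb -= n;
        if (wb->used == wb->size) {
            wb->nflush++;   /* buffer full: write it out */
            wb->used = 0;
        }
    }
    if (commit && wb->used > 0) {
        wb->nflush++;       /* commit record forces a write */
        wb->used = 0;
    }
}
```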

15.7.7. Failure Recovery

An important aspect of file system logging is the ability to recover gracefully after an abnormal operating system halt. When the operating system is restarted and the file system remounted, the logging implementation completes any outstanding operations by replaying the committed log transactions. The on-disk log is read, and any committed deltas found are populated into the logmap as committed logmap mapentries. The roll thread then writes these to the file system and removes the mapentries from the logmap. All uncommitted deltas found in the on-disk log are discarded.

15.7.7.1. Reclaim Thread

A system panic can leave inodes in a partially deleted state. This panic can be caused by an interrupted delete thread (refer to Section 15.3.2 for more information on the delete thread) in which ufs_delete() never finished processing the inode. The sole purpose of the UFS reclaim thread (ufs_thread_reclaim() in usr/src/uts/common/fs/ufs/ufs_thread.c) is to clean up the inodes left in this state. This thread is started if the superblock's fs_reclaim field has either FS_RECLAIM or FS_RECLAIMING flags set, indicating that freed inodes exist or that the reclaim thread was previously running.

The reclaim thread reads each on-disk inode from the file system device, checking for inodes whose i_nlink is zero and whose i_mode isn't zero. This situation signifies that ufs_delete() never finished processing these inodes. The thread simply calls VN_RELE() for every inode in the file system. If the inode was partially deleted, the VN_RELE() forces the inode to go through ufs_inactive(), which in turn queues the inode in the vfs_delete queue to be processed later by the delete thread.
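The test the reclaim thread applies to each on-disk inode is small enough to show directly. In this sketch, toy_dinode_t keeps only the two fields the text mentions, and both functions are hypothetical helpers written for illustration; the real thread reads inode blocks from the device and releases vnodes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical on-disk inode reduced to the two fields of interest. */
typedef struct toy_dinode {
    uint16_t mode;   /* file type and permissions; 0 if unallocated */
    int16_t  nlink;  /* link count */
} toy_dinode_t;

/* An inode with no links but a nonzero mode was only partly deleted. */
int
inode_partially_deleted(const toy_dinode_t *ip)
{
    return ip->nlink == 0 && ip->mode != 0;
}

/* Count the inodes that the reclaim pass would need to clean up. */
size_t
count_partially_deleted(const toy_dinode_t *inodes, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (inode_partially_deleted(&inodes[i]))
            count++;
    }
    return count;
}
```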




Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture (2nd Edition)
ISBN: 0131482092
Year: 2004
Pages: 244