LVM: The Kernel View

   

In Figure 11-6 we see an overview of the LVM pseudodriver organization within the kernel. As mentioned earlier, an application thread requests a file open by referencing a file system pathname. The virtual file system resolves this to a specific vnode within the kernel file system table. The vnode contains the device number of the file system on which the file is located. The device number for LVM-based file systems directs all I/O requests to the LVM pseudodriver via the kernel's device switch table. Once an I/O request is received by the driver, the driver must pass it through several layers.

Strategy Layer: This layer receives the initial request for a block I/O transaction to a specific file system block. The kernel facilitates this request by passing a buf structure containing the logical volume number (b_dev), request flags defining the transaction (b_flags), the block number within the logical volume (b_blkno), the byte count of the request (b_bcount), and various options (b_options). The strategy layer must validate the request, checking that the requested volume is available and that the requested block range falls within the size of the volume.
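To make the flow concrete, here is a minimal sketch (in C, with simplified stand-in structures) of the kind of validation the strategy layer performs before passing a request on. Only the b_dev, b_blkno, and b_bcount names echo the buf fields mentioned above; the structure layouts, the DEV_BSIZE value, and the bounds check itself are illustrative assumptions rather than the actual HP-UX source.

#include <errno.h>
#include <stdint.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures (illustrative only). */
struct buf {
    uint32_t b_dev;      /* device number; minor encodes the logical volume */
    uint32_t b_flags;    /* transaction flags (read/write, etc.)            */
    uint32_t b_blkno;    /* starting block within the logical volume        */
    uint32_t b_bcount;   /* byte count of the request                       */
};

struct lvol_info {
    int      lv_status;      /* nonzero when the volume is open/available   */
    uint32_t lv_nblocks;     /* size of the volume in DEV_BSIZE blocks      */
};

#define DEV_BSIZE 1024u      /* assumed 1-KB block size                     */

/* Return 0 if the request can be passed to the next layer, else an errno. */
static int lv_strategy_check(const struct lvol_info *lv, const struct buf *bp)
{
    uint32_t nblocks;

    if (lv == NULL || lv->lv_status == 0)
        return ENXIO;                      /* volume not configured/open   */

    if (bp->b_bcount % DEV_BSIZE != 0)
        return EINVAL;                     /* not a whole number of blocks */

    nblocks = bp->b_bcount / DEV_BSIZE;
    if (bp->b_blkno >= lv->lv_nblocks ||
        nblocks > lv->lv_nblocks - bp->b_blkno)
        return EINVAL;                     /* request runs past the volume */

    return 0;
}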

Mirror Consistency Layer: If a logical volume has mirroring configured, then this layer must coordinate mirror writes. A volume may be configured to cache mirrored write requests in an MWC. A volume group is divided into logical track groups (LTGs), and cached write requests are first registered in one of the MWC's cache records (one per LTG) in kernel memory and also written to an mwc_entry on one of the physical volumes of the volume group. Originally, each volume group's MWC tracked 32 LTGs; with the 11i release this was increased to 126.

The cached data does not contain the user data, but it does register the intent to update data on the disk. When the write has been completed to all mirror copies, the cache record is cleared. In the case of a system crash and reboot, the cache records on the physical volumes may be used to identify which LTGs may be out of sync. If the MWC were not used, all logical extents on the mirrors would have to be resynchronized.
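The following sketch captures the intent-logging idea in simplified C: an LTG is marked before its mirrored write is issued and is cleared only after every copy has completed, so that after a crash only marked LTGs need attention. The structures and function names are hypothetical, and for brevity the cache is indexed directly by LTG number; the real MWC maps LTGs onto a limited set of cache records and also flushes each record to disk, as described above.

#include <stdbool.h>
#include <stdint.h>

#define MWC_ENTRIES 126           /* one cache record per tracked LTG (11i sizing) */

struct mwc_cache {
    bool     in_flight[MWC_ENTRIES];       /* intent: LTG may be out of sync */
    uint32_t pending_mirrors[MWC_ENTRIES]; /* writes not yet acknowledged    */
};

/* Register intent before issuing the write to each mirror copy. */
static void mwc_mark_ltg(struct mwc_cache *c, unsigned ltg, unsigned nmirrors)
{
    c->in_flight[ltg] = true;
    c->pending_mirrors[ltg] = nmirrors;
    /* ...the record would also be flushed to an mwc_entry on disk here... */
}

/* Called as each mirror copy completes; clears the intent on the last one. */
static void mwc_write_done(struct mwc_cache *c, unsigned ltg)
{
    if (c->pending_mirrors[ltg] > 0 && --c->pending_mirrors[ltg] == 0)
        c->in_flight[ltg] = false;         /* all copies consistent again    */
}

/* After a crash, only LTGs still marked in-flight need resynchronization. */
static bool mwc_needs_resync(const struct mwc_cache *c, unsigned ltg)
{
    return c->in_flight[ltg];
}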

Scheduling Layer: This layer makes full use of the kernel-resident copies of the volume group's configuration information. Using the actual locations and number of mirror copies, the logical request is converted into one or more physical requests. The scheduling layer accepts the buf pointer from the previous layers and directs it on its way through LVM.

A logical volume may be configured to follow one of several scheduling strategies. A request may be LVM_RESERVED if it is to the reserved area on the disk (the actual metadata structures used to configure and manage the volume). Normal read and write requests may be either LVM_SEQUENTIAL or LVM_PARALLEL; this affects the methodology used for mirrored read and write requests. Finally, the request could be flagged as LVM_STRIPE for parallel striped operations.
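As a rough illustration of how the scheduling flag changes mirrored writes, the hypothetical dispatcher below issues a sequential request one mirror at a time (each subsequent copy chained from the previous completion) and a parallel request to all mirrors at once. Only the flag names come from the text; the dispatch logic and helper functions are assumptions.

#include <stdio.h>

/* Scheduling strategies named in the text (the values are illustrative). */
enum lv_sched { LVM_RESERVED, LVM_SEQUENTIAL, LVM_PARALLEL, LVM_STRIPE };

struct mirror_set {
    int nmirrors;                          /* number of physical copies */
};

/* Hypothetical helper: queue one physical write to mirror copy 'm'. */
static void issue_physical_write(int m)
{
    printf("write -> mirror %d\n", m);
}

static void schedule_mirrored_write(const struct mirror_set *ms, enum lv_sched s)
{
    switch (s) {
    case LVM_SEQUENTIAL:
        /* Issue only the first copy now; each completion handler would
         * chain the write to the next mirror (chaining omitted here).   */
        issue_physical_write(0);
        break;
    case LVM_PARALLEL:
        /* Issue every copy immediately and collect completions later.   */
        for (int m = 0; m < ms->nmirrors; m++)
            issue_physical_write(m);
        break;
    default:
        /* Reserved-area and striped requests take different paths.      */
        break;
    }
}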

Physical Layer: This last layer is where the rubber meets the road. The LVM driver passes requests and their associated buf structures to the actual physical device drivers responsible for the mapped physical volumes.

Figure 11-6. The LVM Pseudodriver Architecture

graphics/11fig06.gif


Work Queues: Keeping a Request on Track

Because the LVM pseudodriver is a kernel resource, it is common for it to be processing multiple requests in the various layers at any one point in time. In addition to servicing multiple threads, the driver may queue buffers while they await access to a physical device or wait to be transferred to the next layer within the LVM driver itself.
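The queue headers that recur throughout the structure listings later in this section (lv_head, lv_tail, lv_count) suggest simple linked FIFOs of pending requests. A minimal sketch of such a queue follows; the element type and link field are invented for illustration.

#include <stddef.h>

/* Simplified request and queue header, modeled on the lv_head/lv_tail/
 * lv_count triplets seen in the kernel structure listings below.       */
struct lv_buf {
    struct lv_buf *next;            /* forward link while queued */
};

struct lv_queue {
    struct lv_buf *lv_head;
    struct lv_buf *lv_tail;
    int            lv_count;
};

static void lvq_enqueue(struct lv_queue *q, struct lv_buf *bp)
{
    bp->next = NULL;
    if (q->lv_tail != NULL)
        q->lv_tail->next = bp;      /* append behind the current tail */
    else
        q->lv_head = bp;            /* queue was empty                */
    q->lv_tail = bp;
    q->lv_count++;
}

static struct lv_buf *lvq_dequeue(struct lv_queue *q)
{
    struct lv_buf *bp = q->lv_head;

    if (bp != NULL) {
        q->lv_head = bp->next;
        if (q->lv_head == NULL)
            q->lv_tail = NULL;      /* queue is now empty             */
        q->lv_count--;
    }
    return bp;
}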

In Figure 11-7, we see that there are a number of queues that a request may find itself on.

Figure 11-7. Work Queues

graphics/11fig07.gif


They may be divided into four different categories.

Global queues: The pf_wait_Q contains all requests that may not be completed due to a power failure. A request on this queue is waiting for termination and cleanup.

Per-volume-group queues: The vg_cache_wait queue holds requests waiting on a free entry in their volume group's MWC. The vg_cache_write queue provides a linked list of physical volumes available for an MWC update to disk. When the LVM driver needs to store MWC data to a physical disk in its volume group, it selects the one at the head of this queue.

Per-physical-volume queues: All requests scheduled for a specific physical volume are linked to its pv_ready_Q. The pv_cache_wait queue holds requests waiting for their MWC data to be written to a physical disk.

The LVM system supports a feature known as physical volume links (pv links). This feature allows for the automatic switchover from one bus to an alternate bus; that is, when a failure occurs on a bus controller, if the system has an alternate path available, I/O is switched to it. The pv_wait_Q holds requests waiting for the pv links switch to take effect.

Currently, pv links support an active/passive mode of operation, meaning that only one interface may be active at a time. If the active interface fails, the passive standby interface is activated. In the future, this may be enhanced to allow active/active configurations, in which both interfaces share the load and increase overall throughput while still providing true hot-standby functionality.
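A simplified sketch of the active/passive idea: when the active path fails, the driver flips currentPhysicalLink to the alternate path and would then re-drive the requests parked on pv_wait_Q. The two-element link array, the field names other than currentPhysicalLink, and the function itself are illustrative assumptions.

#include <stddef.h>

struct phys_link {
    int   failed;                  /* nonzero once the controller path fails */
    void *devsw_handle;            /* handle used to reach the real driver   */
};

struct pvol_links {
    struct phys_link  link[2];             /* primary and alternate paths    */
    struct phys_link *currentPhysicalLink; /* the one path in active use     */
};

/* Hypothetical failover: called when an I/O error is traced to the bus. */
static int pv_switch_link(struct pvol_links *pv)
{
    struct phys_link *alt;

    pv->currentPhysicalLink->failed = 1;
    alt = (pv->currentPhysicalLink == &pv->link[0]) ? &pv->link[1]
                                                    : &pv->link[0];
    if (alt->failed)
        return -1;                 /* no healthy standby path remains        */

    pv->currentPhysicalLink = alt;
    /* ...requests held on pv_wait_Q would now be re-queued to the new path. */
    return 0;
}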

Per-logical-volume queues: The work_Q is actually a per-logical-volume array of all outstanding requests for the volume. The array entries identify the other queues on which the individual requests are currently linked. Since this master queue has knowledge of all current outstanding requests for a volume, the kernel strategy layer makes use of this information to serialize I/O requests whenever possible.

The lv_ready_Q is a holding place for requests waiting to be passed to the MWC layer in the pseudodriver.

Next, let's consider the data structures stored in the kernel to support the operations of the LVM subsystem.

Kernel Resident Data Structures

When a volume group is activated (at boot or via the vgchange -a y command), its metadata is copied to kernel-resident structures. Figure 11-8 presents an overview of these structures. The starting point is the kernel volgrp[] array. As individual volgrp structures are among the largest in the kernel, their number is limited by the kernel tunable maxvgs, which defaults to 10. Note: if you are creating a new volume group directory and group file, the volume group number passed to the mknod command (the first two digits of the minor number argument) should not exceed this tunable value. The volume group's number is used as the index into the kernel-resident volgrp[] array.
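Here is a small sketch of the lookup implied above, assuming the conventional 0xNN0000 minor-number layout for group files (e.g., 0x010000 for vg01): the volume group number extracted from the minor indexes directly into volgrp[]. The macro, table, and bounds check are simplified illustrations, with maxvgs standing in for the tunable just discussed.

#include <stddef.h>

#define MAXVGS_DEFAULT 10                /* default value of the maxvgs tunable */

struct volgrp;                           /* opaque here; laid out in Listing 11.10 */
static struct volgrp *volgrp_tbl[MAXVGS_DEFAULT]; /* stand-in for kernel volgrp[]  */

/* Assuming the group file minor number is laid out as 0xNN0000, where NN is
 * the volume group number (e.g., 0x010000 for vg01).                         */
#define VG_NUM(minor)  (((minor) >> 16) & 0xff)

static struct volgrp *vg_lookup(unsigned int minor)
{
    unsigned int vgnum = VG_NUM(minor);

    if (vgnum >= MAXVGS_DEFAULT)         /* must not exceed the maxvgs tunable */
        return NULL;
    return volgrp_tbl[vgnum];            /* NULL if the group is not activated */
}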

Figure 11-8. Kernel-Resident Configuration Structures

graphics/11fig08.gif


Let's begin by examining the volgrp structure (Listing 11.10).

Listing 11.10. q4> fields struct volgrp
We start with lock pointers and counters
   0 0 4 0 *                  vg_lock.interlock
   4 0 4 0 u_int              vg_lock.delay
   8 0 4 0 int                vg_lock.read_count
  12 0 1 0 char               vg_lock.want_write
  13 0 1 0 char               vg_lock.want_upgrade
  14 0 1 0 char               vg_lock.waiting
  15 0 1 0 char               vg_lock.no_swap
Next, a pointer to the lvol array (sized to 256), the number of logical volumes, a lock, a pointer to the pvol array and its size, the major number for the volume group pseudo-device file (0x40, acts as a sanity check), the volume group identifier, and a count of the open volumes
  16 0 4 0 *                  lvols
  20 0 4 0 u_int              num_lvols
  24 0 4 0 *                  vg_pvolsListLock.lvc_slock
  28 0 4 0 *                  pvols
  32 0 4 0 u_int              size_pvols
  36 0 4 0 u_int              num_pvols
  40 0 4 0 int                major_num
  44 0 4 0 u_int              vg_id.id1
  48 0 4 0 u_int              vg_id.id2
  52 0 2 0 short              vg_extshift
  54 0 2 0 short              vg_opencount
  56 0 4 0 u_int              vg_flags

VG_LOST_QUORUM    run quorum is lost
VG_ACTIVATED      volume group is activated
VG_NOLVOPENS      disallow lvol opens
VG_READONLY       volume group activated read-only


  60 0 4 0 *                  vg_intlock.lvc_slock
Total number of requests processed in the strategy layer and the current pending requests
  64 0 4 0 int                vg_totalcount
  68 0 4 0 int                vg_requestcount
  72 0 4 0 *                  vg_ca_intlock.lvc_slock

Byte offset 80 through 539 holds a variety of MWC information structures.

This section is examined later in this chapter.


Pointers and offsets to various related structures
 540 0 4 0 *                  vg_vgda
 544 0 4 0 u_int              vg_LVentry_off
 548 0 4 0 u_int              vg_PVentry_off
 552 0 4 0 u_int              vg_PVentry_len
 556 0 4 0 u_int              vg_VGtrail_off

Byte offset 560 through 1087 contains volume group status area data.


Configured limits for the volume group (the maximum number of logical volumes, physical volumes, and physical extents), the extent size, data area length, status area length, mirror cache size, the volume group number (a quick sanity check), available physical volumes, data area and status area block sizes, cluster locking ID, and configuration mode (used in conjunction with service guard configuration)
1088 0 2 0 u_short            vg_maxlvs
1090 0 2 0 u_short            vg_maxpvs
1092 0 2 0 u_short            vg_maxpxs
1096 0 4 0 u_int              vg_pxsize
1100 0 4 0 u_int              vgda_len
1104 0 4 0 u_int              vgsa_len
1108 0 4 0 u_int              mcr_len
1112 0 4 0 int                vg_num
1116 0 2 0 u_short            vg_npv_avail
1118 0 2 0 u_short            vg_npv_newavail
1120 0 2 0 u_short            vgda_blkfactor
1122 0 2 0 u_short            vgsa_blkfactor
1124 0 4 0 u_int              vg_cluster_id
1128 0 4 0 int                vg_config_mode

CLV_VG_CONF_STD     non-special mode
CLV_VG_CONF_EXCL    exclusive activation mode
CLV_VG_CONF_SHAR    shared activation mode


The remainder of the structure holds shared mode data if applicable, volume group switching, and spare information.

The lvol and pvol data is populated from that found in the PVRA and VGRA structures on the volume group's physical disks (Listings 11.11 and 11.12).

Listing 11.11. q4> fields struct lvol
Various queue pointers and addresses
   0 0 4 0 *                  work_Q
   4 0 4 0 *                  lv_ready_Q.lv_head
   8 0 4 0 *                  lv_ready_Q.lv_tail
  12 0 4 0 int                lv_ready_Q.lv_count
The logical extent array pointer (used during re-sync operations)
  16 0 4 0 *                  lv_lext
Three physical extent pointer maps for mapping mirrored extents
  20 0 4 0 *                  lv_exts[0]
  24 0 4 0 *                  lv_exts[1]
  28 0 4 0 *                  lv_exts[2]
Pointer to the schedule queue, the number of stripes, and the stripe size
  32 0 4 0 *                  lv_schedule
  36 0 2 0 u_short            lv_stripes
  38 0 2 0 u_short            lv_stripesize
An assortment of lock pointers and primitives
  40 0 4 0 *                  lv_lock.interlock
  44 0 4 0 u_int              lv_lock.delay
  48 0 4 0 int                lv_lock.read_count
  52 0 1 0 char               lv_lock.want_write
  53 0 1 0 char               lv_lock.want_upgrade
  54 0 1 0 char               lv_lock.waiting
  55 0 1 0 char               lv_lock.no_swap
  56 0 4 0 *                  lv_intlock.lvc_slock
  60 0 4 0 int                lv_complcnt
Next are the cumulative request count, the pending request count, and the current status flag
  64 0 4 0 int                lv_totalcount
  68 0 4 0 int                lv_requestcount
  72 0 2 0 short              lv_status
  74 0 2 0 short              lv_allow_cfgcmd_rslvr
  76 0 2 0 u_short            lv_ref
  78 0 2 0 u_short            lv_rawavoid
  80 0 2 0 u_short            lv_rawoptions
The lvol's physical extent count, maximum number of logical extents, and the number of in-use logical extents
  84 0 4 0 u_int              lv_curpxs
  88 0 2 0 u_short            lv_maxlxs
  90 0 2 0 u_short            lv_curlxs
  92 0 2 0 u_short            lv_flags

LVM_RESERVED      group file lvol0 strategy
LVM_SEQUENTIAL    sequential scheduling flag
LVM_PARALLEL      parallel scheduling flag
LVM_STRIPE        striping enabled
LVM_DYNAMIC       dynamic scheduling (not in current use)
LVM_STRIPE_NEW    new stripe (not in current use)


Current scheduling strategy, mirror count, and a pointer to the bit allocation map for the logical volume
  94 0 1 0 u_char             lv_sched_strat
  95 0 1 0 u_char             lv_maxmirrors
  96 0 4 0 *                  lv_bitmap
 100 0 2 0 u_short            lv_partner
 102 0 2 0 u_short            lv_mimwchit
 104 0 2 0 u_short            lv_mimwcmiss
 108 0 4 0 u_int              lv_mirxfers
 112 0 4 0 u_int              lv_mircount
 116 0 4 0 u_int              lv_miwxfers
 120 0 4 0 u_int              lv_miwcount
 124 0 4 0 *                  lv_vg

Byte offset 128 through 511 contains raw buffer data.

Byte offset 512 through 895 contains logical volume disk sort information.


Number of seconds for a request timeout
 896 0 4 0 u_int              lv_io_timeout

Listing 11.12. q4> fields struct pvol
Pointers to the volume group structure, the lvmrec structure, and the bad block directory for this pvol
   0 0 4 0 *                  pv_vg
   4 0 4 0 *                  pv_lvmrec
   8 0 4 0 *                  pv_bbdir
The maximum and current number of entries in the bad block directory
  12 0 4 0 u_int              pv_maxdefects
  16 0 4 0 u_int              pv_curdefects
  20 0 4 0 u_int              pv_vgdats[0].tv_sec
  24 0 4 0 long               pv_vgdats[0].tv_usec
  28 0 4 0 u_int              pv_vgdats[1].tv_sec
  32 0 4 0 long               pv_vgdats[1].tv_usec
  36 0 4 0 int                pv_vgra_psn
  40 0 4 0 int                pv_data_psn
  44 0 4 0 u_int              pv_pxspace
The total number of physical extents and the number of free extents for the pvol
  48 0 2 0 u_short            pv_pxcount
  50 0 2 0 u_short            pv_freepxs
  52 0 4 0 *                  pv_intlock.lvc_slock
  56 0 4 0 int                pv_armpos
Work queue pointers
  60 0 4 0 *                  pv_ready_Q.lv_head
  64 0 4 0 *                  pv_ready_Q.lv_tail
  68 0 4 0 int                pv_ready_Q.lv_count
Cumulative number of transfers to this pvol, number of pending requests, status flags, and the pvol's index number within its volume group
  72 0 4 0 int                pv_totxf
  76 0 2 0 short              pv_curxfs
  78 0 2 0 u_short            pv_flags
  80 0 1 0 u_char             pv_flags2
  81 0 1 0 u_char             pv_num
  84 0 4 0 int                pv_sa_psn[0]
  88 0 4 0 int                pv_sa_psn[1]
  92 0 4 0 u_int              pv_vgsats[0].tv_sec
  96 0 4 0 long               pv_vgsats[0].tv_usec
 100 0 4 0 u_int              pv_vgsats[1].tv_sec
 104 0 4 0 long               pv_vgsats[1].tv_usec
 108 0 4 0 *                  pv_cache_wait.lv_head
 112 0 4 0 *                  pv_cache_wait.lv_tail
 116 0 4 0 int                pv_cache_wait.lv_count
 120 0 4 0 *                  pv_cache_next
 124 0 4 0 *                  pv_mwc_rec
 128 0 4 0 u_int              pv_mwc_latest.tv_sec
 132 0 4 0 long               pv_mwc_latest.tv_usec
 136 0 4 0 int                pv_mwc_flags
 140 0 4 0 int                pv_mwc_loc[0]
 144 0 4 0 int                pv_mwc_loc[1]
 148 0 4 0 int                altpool_psn
 152 0 4 0 int                altpool_next
 156 0 4 0 int                altpool_end
Physical volume defects array
 160 0 4 0 *                  pv_defects[0]
 ----------------------------------------------
 412 0 4 0 *                  pv_defects[63]
 416 0 4 0 *                  freelist
 420 0 4 0 *                  freelist_ptr
 424 0 4 0 u_int              freelistsize
 428 0 4 0 u_int              bbdirsize

Byte offset 432 through 523 contains the physical volume attribute data.


vnode and pv links information
 524 0 4 0 *                  currentPhysicalLink
 528 0 4 0 *                  pv_wait_Q.lv_head
 532 0 4 0 *                  pv_wait_Q.lv_tail
 536 0 4 0 int                pv_wait_Q.lv_count

Byte offset 540 through 927 contains physical volume buffer data.


The size of a read/write request (a multiple of 1 KB)
 928 0 2 0 u_short            pv_blkfactor
 930 0 2 0 u_short            sgio_flags

Byte offset 934 through 1011 contains physical volume spare information.


Now we have in the kernel all the configuration data necessary to allow translation from a logical volume offset to a physical volume offset.
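The sketch below shows the general shape of that translation, assuming 1-KB blocks and a power-of-two extent size (the vg_extshift field in Listing 11.10 hints at a shift-based calculation): the logical block's extent index selects an entry in an extent map (the role played by the lv_exts pointers in Listing 11.11), yielding a physical volume and a starting physical block, and the offset within the extent carries over. All structure layouts and names here are simplified assumptions.

#include <stdint.h>
#include <stddef.h>

/* Simplified mapping structures; the real maps live behind lv_exts[] in
 * struct lvol (Listing 11.11).                                          */
struct px_map_entry {
    uint16_t pv_num;            /* index of the physical volume             */
    uint32_t px_start_blk;      /* first 1-KB block of the physical extent  */
};

struct lvol_map {
    uint16_t vg_extshift;       /* log2(extent size in 1-KB blocks)         */
    struct px_map_entry *map;   /* one entry per logical extent             */
    uint32_t nextents;
};

/* Translate a logical block number into (physical volume, physical block). */
static int lv_to_pv(const struct lvol_map *lv, uint32_t lblkno,
                    uint16_t *pv_num, uint32_t *pblkno)
{
    uint32_t lext   = lblkno >> lv->vg_extshift;              /* logical extent  */
    uint32_t offset = lblkno & ((1u << lv->vg_extshift) - 1); /* within extent   */

    if (lext >= lv->nextents)
        return -1;                                            /* past the volume */

    *pv_num = lv->map[lext].pv_num;
    *pblkno = lv->map[lext].px_start_blk + offset;
    return 0;
}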

One-Way and Two-Way Mirroring Mechanics and Options

LVM supports either one-way (two copies) or two-way (three copies) mirroring of individual logical volumes. The basic premise of mirroring is very straightforward, but the behind-the-scenes mechanics require additional considerations.

A major concern with mirrored operations is ensuring that each mirror copy has the same data. This really comes to bear when a write is requested. To help make sure that the data is consistent across all copies, LVM incorporates an MWC strategy. When a logical volume is configured to be mirrored, it may be configured to use one of three mirror-caching policies.

NONE: Choosing this policy disables all internal consistency checks. Extents are not marked as stale at activation, and mirrors are not synchronized. This may be suitable for a swap volume.

NOMWC: No MWC records are kept, so there is no performance cost during normal operation. At volume group activation time, all but one copy of a mirrored volume will be marked as stale. The activation process may take some time, as all extents of the stale copies will have to be copied from the non-stale copy.

MWC: If this policy is selected, then for any write request to proceed on a mirrored volume, a request is passed to the MWC layer of the driver. An entry must be made in an MWC structure and copied to one of the volume group's physical volumes before the request may continue to the next LVM layer. If a resynchronization is required, all track groups represented by an active entry in the disk-based copies of the cache data must be synchronized. Activation here proceeds more quickly than with NOMWC, since only LTGs with incomplete MWC entries need to be synchronized. The performance cost lies in the requirement that the MWC be written to disk before LVM advances the request through its various work queues.

Note that while all mirrors should be identical following resynchronization, there is no way to know which mirror had the most current data; the copy propagated to the others is selected at random.
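The activation-time consequences of the three policies can be summarized in a small decision sketch; the enum and function below are purely illustrative.

#include <stdbool.h>

enum mirror_policy { MIRROR_NONE, MIRROR_NOMWC, MIRROR_MWC };

/* Decide, at volume group activation, whether a given LTG of a mirrored
 * logical volume needs to be resynchronized.  'ltg_marked' reflects an
 * active entry in the on-disk MWC copies.                               */
static bool ltg_needs_resync(enum mirror_policy p, bool ltg_marked)
{
    switch (p) {
    case MIRROR_NONE:   return false;      /* consistency is not tracked      */
    case MIRROR_NOMWC:  return true;       /* all but one copy marked stale   */
    case MIRROR_MWC:    return ltg_marked; /* only LTGs with active entries   */
    }
    return true;
}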

The MWC data on a physical disk is stored in the mwc_entry structure we examined earlier in this chapter and is sized to hold 126 individual entries (as of HP-UX 11i). There is also a kernel-based copy of this information (see Figure 11-9).

Figure 11-9. Mirror Write Consistency Records

graphics/11fig09.gif


Listing 11.13 is an extracted portion of a listing (with annotation) created using q4> fields struct volgrp.

Listing 11.13. q4> fields struct volgrp
This contains the spinlock structure for controlling MP access to this data
  72 0 4 0 *                  vg_ca_intlock.lvc_slock

Byte offset 80 through 463 contains the vg_cache_lbuf structure used by the MWC to store MWC entries to a physical volume.


Next we have the linkage pointers to the vg_cache_wait and vg_cache_write wait queues
 464 0 4 0 *                  vg_cache_wait.lv_head
 468 0 4 0 *                  vg_cache_wait.lv_tail
 472 0 4 0 int                vg_cache_wait.lv_count
 476 0 4 0 *                  vg_cache_write.lv_pvhead
 480 0 4 0 *                  vg_cache_write.lv_pvtail
 484 0 4 0 int                vg_cache_write.lv_pvcount
The vg_mwc_rec points to the memory-resident copy of the MWC record data
 488 0 4 0 *                  vg_mwc_rec
This points to the beginning of the memory-resident cache array, followed by a pointer to the least recently used element in the list
 492 0 4 0 *                  ca_part2
 496 0 4 0 *                  ca_lst
This is the hash list used to speed searches for a cached entry
 500 0 4 0 *                  ca_hash[0]
 ---------------------------------------
 528 0 4 0 *                  ca_hash[7]
The number of current free entries, the total number of entries, and the number of changed entries in memory (dirty entries)
 532 0 1 0 u_char             ca_free
 533 0 1 0 u_char             ca_size
 534 0 1 0 u_char             ca_chgcount
The cache flags
 535 0 1 0 u_char             ca_flags

CACHE_ACTIVATED    cache has been initialized
CACHE_INFLIGHT     cache being written to disk
CACHE_CHANGED      memory cache is currently dirty
CACHE_CLEAN        something is waiting for the disk write to complete


  536 0 2 0 u_short            ca_clean_lvnum 
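The ca_hash[0] through ca_hash[7] pointers suggest a small hash table used to find an existing cache record for an LTG without scanning the whole array. The sketch below shows one plausible lookup scheme (eight chained buckets); the hash function and field names are assumptions, not the actual implementation.

#include <stddef.h>
#include <stdint.h>

#define CA_HASH_SIZE 8                    /* ca_hash[0] through ca_hash[7]  */

struct ca_entry {
    uint32_t         ltg;                 /* logical track group cached     */
    struct ca_entry *hash_next;           /* chain within a hash bucket     */
};

struct mwc_mem_cache {
    struct ca_entry *ca_hash[CA_HASH_SIZE];
};

/* Hypothetical lookup: find the in-memory cache record for an LTG, if any. */
static struct ca_entry *ca_lookup(struct mwc_mem_cache *c, uint32_t ltg)
{
    struct ca_entry *e;

    for (e = c->ca_hash[ltg % CA_HASH_SIZE]; e != NULL; e = e->hash_next)
        if (e->ltg == ltg)
            return e;
    return NULL;                          /* miss: a free entry must be claimed */
}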


