Volume Management

Windows introduces the concept of basic and dynamic disks. Windows calls disks that rely exclusively on the MBR-style or GPT partitioning scheme basic disks. Dynamic disks are new to Windows 2000 and implement a more flexible partitioning scheme than that of basic disks. The fundamental difference between basic and dynamic disks is that dynamic disks support the creation of new multipartition volumes. Recall from the list of terms in the preceding section that multipartition volumes provide performance, sizing, and reliability features not supported by simple volumes. Windows manages all disks as basic disks unless you manually create dynamic disks or convert existing basic disks (with enough free space) to dynamic disks. Microsoft recommends you use basic disks unless you require the multipartition functionality of dynamic disks.

Note

Basic disks support only multipartition volumes carried forward from an upgraded Windows NT 4 installation, with the exception of Windows Server 2003, which does not support multipartition volumes on basic disks. For a number of reasons, including the fact that laptops usually have only one disk and laptop disks typically don't move easily between computers, Windows uses only basic disks on laptops. In addition, only fixed disks can be dynamic, and disks located on IEEE 1394 or USB buses as well as on shared cluster server disks are always basic disks.


Windows Volume Management Evolution

The evolution of storage management in Windows begins with MS-DOS, Microsoft's first operating system. As hard disks became larger, MS-DOS needed to accommodate them. To do so, one of the first steps Microsoft took was to let MS-DOS create multiple partitions, or logical disks, on a physical disk. MS-DOS could format each partition with a different file system type (FAT12 or FAT16) and assign each partition a different drive letter. MS-DOS versions 3 and 4 were severely limited in the size and number of partitions they could create, but in MS-DOS 5 the partitioning scheme fully matured. MS-DOS 5 was able to divide a disk into any number of partitions of any size.

Windows NT borrowed the partitioning scheme that evolved in MS-DOS both to provide disk compatibility with MS-DOS and Windows 3.x and to let the Windows NT development team rely on proven tools for disk management. Microsoft extended the basic concepts of MS-DOS disk partitioning in Windows NT to support storage-management features that an enterprise-class operating system requires: disk spanning and fault tolerance. Starting with the first version of Windows NT, version 3.1, systems administrators have been able to create volumes that comprise multiple partitions, which allows large volumes to consist of partitions from multiple physical disks and to implement fault tolerance through software-based data redundancy.

Although this MS-DOS-style partitioning support in versions of Windows NT prior to Windows 2000 is flexible enough to support most storage-management tasks, it suffers from several drawbacks. One drawback is that most disk-configuration changes require a reboot before taking effect. In today's world of servers that must remain on line for months or even years at a time, any reboot, even a planned one, is a major inconvenience. Another drawback is that the Windows NT 4 registry stores multipartition disk-configuration information for MS-DOS-style partitions. This arrangement means that moving configuration information is onerous when you move disks between systems, and you can easily lose configuration information when you need to reinstall the operating system. Finally, a requirement that each volume have a unique drive letter in the A through Z range plagues users of all Microsoft operating systems prior to Windows 2000 with an upper limit on the number of possible local and remote volumes they can create.

Windows includes three types of partitioning that enable it to overcome the limitations mentioned when necessary: Master Boot Record (MBR) style, GUID Partition Table (GPT), and Logical Disk Manager (LDM).


Basic Disks

This section describes the two types of partitioning, MBR-style and GPT, that Windows uses to define volumes on basic disks, and FtDisk, the volume manager driver that presents the volumes to file system drivers. The Windows 2000 Disk Manager recommends that you make any unpartitioned disk a dynamic disk, but Windows Server 2003, like Windows XP and Windows 2000 Professional, silently defaults to defining all disks as basic disks.

MBR-Style Partitioning

The standard BIOS implementations that x86 hardware uses dictate one requirement of the partitioning format in Windows: the first sector of the primary disk must contain the Master Boot Record (MBR). When an x86 processor boots, the computer's BIOS reads the MBR and treats part of the MBR's contents as executable code. The BIOS invokes the MBR code to initiate an operating system boot process after the BIOS performs preliminary configuration of the computer's hardware. In Microsoft operating systems, including Windows, the MBR also contains a partition table. A partition table consists of four entries that define the locations of as many as four primary partitions on a disk. The partition table also records a partition's type. Numerous predefined partition types exist, and a partition's type specifies which file system the partition includes. For example, partition types exist for FAT32 and NTFS.

A special partition type, an extended partition, contains another MBR with its own partition table. The equivalent of a primary partition in an extended partition is called a logical drive. By using extended partitions, Microsoft's operating systems overcome the apparent limit of four partitions per disk. In general, the recursion that extended partitions permit can continue indefinitely, which means that no upper limit exists to the number of possible partitions on a disk.

The Windows boot process makes evident the distinction between primary partitions and logical drives. The system must mark one primary partition of the primary disk as active. The Windows code in the MBR loads the code stored in the first sector of the active partition (the system volume) into memory and then transfers control to that code. Because of the role this first sector plays in the boot process, Windows designates the first sector of any partition as the boot sector. Recall from Chapter 4 that every partition formatted with a file system has a boot sector that stores information about the structure of the file system on that partition.
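
The 16-byte format of each partition table entry is simple enough to sketch in C. The layout below reflects the commonly documented MBR entry format; the field names are illustrative rather than taken from Windows source code:

#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint8_t  BootIndicator;   /* 0x80 marks the active partition */
    uint8_t  StartCHS[3];     /* legacy cylinder/head/sector address of the first sector */
    uint8_t  Type;            /* partition type: 0x07 = NTFS, 0x0B/0x0C = FAT32, 0x05/0x0F = extended */
    uint8_t  EndCHS[3];       /* legacy CHS address of the last sector */
    uint32_t StartLBA;        /* 32-bit starting sector number */
    uint32_t SectorCount;     /* 32-bit partition length in sectors */
} MBR_PARTITION_ENTRY;        /* 16 bytes; the four-entry table starts at offset 0x1BE of the MBR */
#pragma pack(pop)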

GUID Partition Table Partitioning

As part of an initiative to provide a standardized and extensible firmware platform for operating systems to use during their boot process, Intel has designed the Extensible Firmware Interface (EFI) specification. EFI includes a mini operating system environment implemented in firmware (typically ROM) that operating systems use early in the system boot process to load system diagnostics and their boot code. One of the first targets of EFI is IA-64, so the IA-64 versions of Windows use EFI. (You can still choose to create disks with the old-style MBR partitioning.) You can find a complete description of EFI at http://developer.intel.com/technology/efi.

EFI defines a partitioning scheme, called the GUID (Globally Unique Identifier) Partition Table (GPT), that addresses some of the shortcomings of MBR-style partitioning. For example, the sector addresses that the GPT partition structures use are 64 bits wide instead of 32 bits. A 32-bit sector address is sufficient to access only 2 terabytes (TB) of storage, while GPT allows the addressing of disk sizes into the foreseeable future. Other advantages of the GPT scheme include the fact that it uses cyclic redundancy checksums (CRC) to ensure the integrity of the partition table, and it maintains a backup copy of the partition table. GPT takes its name from the fact that in addition to storing a 36-character Unicode partition name for each partition, it assigns each partition a GUID.
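
The fields just described, 64-bit sector addresses, a per-partition GUID, and a Unicode name, map onto the 128-byte partition entry that the EFI specification defines. The C sketch below is a simplified view of that published layout, not a Windows-internal definition:

#include <stdint.h>

#pragma pack(push, 1)
typedef struct {
    uint8_t  PartitionTypeGuid[16];   /* identifies what kind of data the partition holds */
    uint8_t  UniquePartitionGuid[16]; /* the GUID that GPT assigns to this particular partition */
    uint64_t StartingLBA;             /* 64-bit sector addresses instead of the MBR's 32-bit fields */
    uint64_t EndingLBA;
    uint64_t Attributes;
    uint16_t PartitionName[36];       /* 36-character Unicode (UTF-16) partition name */
} GPT_PARTITION_ENTRY;                /* 128 bytes per entry */
#pragma pack(pop)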

Figure 10-4 shows a sample GPT partition layout. As with MBR partitioning, the first sector of a GPT disk is an MBR that serves to protect the GPT partitioning in case the disk is accessed from an operating system that is not GPT aware. However, the second and last sectors of the disk store the GPT partition table headers, with the actual partition table following the second sector and preceding the last sector. With its extensible list of partitions, GPT partitioning doesn't require nested partitions, as MBR partitioning does.

Figure 10-4. Example GPT partition layout


Note

Because Windows doesn't support the creation of multipartition volumes on basic disks, a new basic disk partition is the equivalent of a volume. For this reason, the Disk Management MMC snap-in uses the term partition when you create a volume on a basic disk.


Basic Disk Volume Manager

The FtDisk driver (\Windows\System32\Drivers\Ftdisk.sys) creates disk device objects that represent volumes on basic disks and plays an integral role in managing all basic disk volumes, including simple volumes. For each volume, FtDisk creates a device object of the form \Device\HarddiskVolumeX, in which X is a number (starting from 1) that identifies the volume.

FtDisk is actually a bus driver because it's responsible for enumerating basic disks to detect the presence of basic volumes and report them to the Windows PnP manager. To implement this enumeration, FtDisk leverages the PnP manager, with the aid of the partition manager (Partmgr.sys) driver to determine what basic disk partitions exist. The partition manager registers with the PnP manager so that Windows can inform the partition manager whenever the disk class driver creates a partition device object. The partition manager informs FtDisk about new partition objects through a private interface and creates filter device objects that the partition manager then attaches to the partition objects. The existence of the filter objects prompts Windows to inform the partition manager whenever a partition device object is deleted so that the partition manager can update FtDisk. The disk class driver deletes a partition device object when a partition in the Disk Management Microsoft Management Console (MMC) snap-in is deleted. As FtDisk becomes aware of partitions, it uses the basic disk configuration information to determine the correspondence of partitions to volumes and creates a volume device object when it has been informed of the presence of all the partitions in a volume's description.

Windows volume drive-letter assignment, a process described shortly, creates drive-letter symbolic links under the \Global?? (\?? on Windows 2000) object manager directory that point to the volume device objects that FtDisk creates. When the system or an application accesses a volume for the first time, Windows performs a mount operation that gives file system drivers the opportunity to recognize and claim ownership for volumes formatted with a file system type they manage. (Mount operations are described in the section "Volume Mounting" later in this chapter.)
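
You can observe these links from user mode: the Win32 QueryDosDevice function returns the object manager target of a drive-letter symbolic link. A small sketch follows; the drive letter D: is just an example, so use one that exists on your system:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    char target[512];

    /* Ask the object manager what the D: drive-letter link points to. */
    if (QueryDosDeviceA("D:", target, sizeof(target)))
        printf("D: -> %s\n", target);   /* typically something like \Device\HarddiskVolume2 */
    else
        printf("QueryDosDevice failed: %lu\n", GetLastError());
    return 0;
}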

Dynamic Disks

As we've stated, dynamic disks are the disk format in Windows necessary for creating multipartition volumes such as mirrors, stripes, and RAID-5 volumes (described later in the chapter). Dynamic disks are partitioned using Logical Disk Manager (LDM) partitioning. The LDM subsystem in Windows, which consists of user-mode and device driver components, oversees dynamic disks. Microsoft licenses LDM from VERITAS Software, which originally developed LDM technology for UNIX systems. Working closely with Microsoft, VERITAS ported its LDM to Windows to provide the operating system with more robust partitioning and multipartition volume capabilities. A major difference between LDM's partitioning and MBR and GPT partitioning is that LDM maintains one unified database that stores partitioning information for all the dynamic disks on a system, including multipartition-volume configuration.

Note

LDM's UNIX version incorporates disk groups, in which all the dynamic disks that the system assigns to a disk group share a common database. VERITAS's commercial volume-management software for Windows also includes disk groups, but the Windows LDM implementation includes only one disk group.


The LDM Database

The LDM database resides in a 1-MB reserved space at the end of each dynamic disk. The need for this space is the reason Windows requires free space at the end of a basic disk before you can convert it to a dynamic disk. The LDM database consists of four regions, which Figure 10-5 shows: a header sector that LDM calls the Private Header, a table of contents area, a database records area, and a transactional log area. (The fifth region shown in Figure 10-5 is simply a copy of the Private Header.) The Private Header sector resides 1 MB before the end of a dynamic disk and anchors the database. As you spend time with Windows, you'll quickly notice that it uses GUIDs to identify just about everything, and disks are no exception. A GUID (globally unique identifier) is a 128-bit value that various components in Windows use to uniquely identify objects. LDM assigns each dynamic disk a GUID, and the Private Header sector notes the GUID of the dynamic disk on which it resides, hence the Private Header's designation as information that is private to the disk. The Private Header also stores the name of the disk group, which is the name of the computer concatenated with Dg0 (for example, DarylDg0 if the computer's name is Daryl), and a pointer to the beginning of the database table of contents. (As mentioned earlier, the Windows implementation of LDM includes only one disk group, so the disk group name will always end with Dg0.) For reliability, LDM keeps a copy of the Private Header in the disk's last sector.

Figure 10-5. LDM database layout
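
From the layout just described, with the Private Header 1 MB before the end of the disk, a backup copy in the last sector, and the PRIVHEAD signature that LDMDump displays, a diagnostic tool could locate the header with arithmetic like the following sketch. The 512-byte sector size and the placement of the signature at the start of the sector are assumptions of the sketch, not taken from an LDM on-disk specification:

#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE        512ULL                 /* assumed sector size */
#define LDM_RESERVED_BYTES (1024ULL * 1024ULL)    /* the 1-MB database area at the end of the disk */

/* Primary copy of the Private Header: 1 MB before the end of the dynamic disk. */
static uint64_t PrimaryPrivateHeaderLba(uint64_t diskSizeInSectors)
{
    return diskSizeInSectors - LDM_RESERVED_BYTES / SECTOR_SIZE;
}

/* Backup copy: the disk's last sector. */
static uint64_t BackupPrivateHeaderLba(uint64_t diskSizeInSectors)
{
    return diskSizeInSectors - 1;
}

/* Check whether a sector read from one of those locations carries the PRIVHEAD signature. */
static int LooksLikePrivateHeader(const uint8_t *sector)
{
    return memcmp(sector, "PRIVHEAD", 8) == 0;
}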


The database table of contents is 16 sectors in size and contains information regarding the database's layout. LDM begins the database record area immediately following the table of contents with a sector that serves as the database record header. This sector stores information about the database record area, including the number of records it contains, the name and GUID of the disk group the database relates to, and a sequence number identifier that LDM uses for the next entry it creates in the database. Sectors following the database record header contain 128-byte fixed-size records that store entries that describe the disk group's partitions and volumes.

A database entry can be one of four types: partition, disk, component, and volume. LDM uses the database entry types to identify three levels that describe volumes. LDM connects entries with internal object identifiers. At the lowest level, partition entries describe soft partitions, which are contiguous regions on a disk; identifiers stored in a partition entry link the entry to a component and disk entry. A disk entry represents a dynamic disk that is part of the disk group and includes the disk's GUID. A component entry serves as a connector between one or more partition entries and the volume entry each partition is associated with. A volume entry stores the GUID of the volume, the volume's total size and state, and a drive-letter hint. Disk entries that are larger than a database record span multiple records; partition, component, and volume entries rarely span multiple records.

LDM requires three entries to describe a simple volume: a partition, component, and volume entry. The following listing shows the contents of a simple LDM database that defines one 200-MB volume that consists of one partition:

Disk Entry          Volume Entry        Component Entry     Partition Entry
Name: Disk1         Name: Volume1       Name: Volume1-01    Name: Disk1-01
GUID: XXX-XX...     ID: 0x408           ID: 0x409           ID: 0x407
Disk ID: 0x404      State: ACTIVE       Parent ID: 0x408    Parent ID: 0x409
                    Size: 200MB                             Disk ID: 0x404
                    GUID: XXX-XX...                         Start: 300MB
                    Drive Hint: H:                          Size: 200MB

The partition entry describes the area on a disk that the system assigned to the volume, the component entry connects the partition entry with the volume entry, and the volume entry contains the GUID that Windows uses internally to identify the volume. Multipartition volumes require more than three entries. For example, a striped volume (which is described later in the chapter) consists of at least two partition entries, a component entry, and a volume entry. The only volume type that has more than one component entry is a mirror; mirrors have two component entries, each of which represents one-half of the mirror. LDM uses two component entries for mirrors so that when you break a mirror, LDM can split it at the component level, creating two volumes with one component entry each.

The final area of the LDM database is the transactional log area, which consists of a few sectors for storing backup database information as the information is modified. This setup safeguards the database in case of a crash or power failure because LDM can use the log to return the database to a consistent state.

EXPERIMENT: Using LDMDump to View the LDM Database

You can use LDMDump from Sysinternals to view detailed information about the contents of the LDM database. LDMDump takes a disk number as a command-line argument, and its output is usually more than a few screens in size, so you should pipe its output to a file for viewing in a text editor (for example, ldmdump /d0 > disk.txt). The following example shows excerpts of LDMDump output. The LDM database header displays first, followed by the LDM database records that describe a 4-GB disk with a 4-GB simple volume. The volume's database entry is listed as Volume1. At the end of the output, LDMDump lists the soft partitions and definitions of volumes it locates in the database.

C:\>ldmdump /d0
Logical Disk Manager Configuration Dump v1.03
Copyright (C) 2000-2002 Mark Russinovich

PRIVATE HEAD:
Signature           : PRIVHEAD
Version             : 2.11
Disk Id             : b78e4169-2e39-4a57-a98b-afd7131392f9
Host Id             : 1b77da20-c717-11d0-a5be-00a0c91db73c
Disk Group Id       : 2ea3c400-c92a-4172-9286-de46dda1098a
Disk Group Name     : Vmware03Dg0
Logical disk start  : 3F
Logical disk size   : 7FF54B (4094 MB)
Configuration start : 7FF7E0
Configuration size  : 800 (1 MB)
Number of TOCs      : 1
TOC size            : 7FE (1023 KB)
Number of Configs   : 1
Config size         : 5AC (726 KB)
Number of Logs      : 1
Log size            : DC (110 KB)

TOC 0:
Signature           : TOCBLOCK
Sequence            : 0x9
Config bitmap start : 0x11
Config bitmap size  : 0x5AC
...

VBLK DATABASE:
0x000004: [000004] <Disk>
        Name         : Disk1
        Object Id    : 0x0403
        Disk Id      : b78e4169-2e39-4a57-a98b-afd7131392f9
0x000005: [000002] <DiskGroup>
        Name         : Vmware03Dg0
        Object Id    : 0x0401
        GUID         : 2ea3c400-c92a-4172-9286-de46dda1098a
0x000006: [000006] <Volume>
        Name         : Volume1
        Object Id    : 0x0406
        Volume state : ACTIVE
        Size         : 0x007FB68A (4086 MB)
        GUID         : e6ee6edc-d1ba-11d8-813e-806e6f6e6963
        Drive Hint   : C:
...

PARTITION LAYOUT:
Disk Disk1:
        Disk1-01  Offset: 0x00000000  Length: 0x007FB68A (4086 MB)

VOLUME DEFINITIONS:
Volume1 Size: 0x007FB68A (4086 MB)
    Volume1-01 - Disk1-01  VolumeOffset: 0x00000000  Offset: 0x00000000  Length: 0x007FB68A


LDM and GPT or MBR-Style Partitioning

When you install Windows on a computer, one of the first things it requires you to do is to create a partition on the system's primary physical disk. Windows defines the system volume on this partition to store the files that it invokes early in the boot process. In addition, Windows Setup requires you to create a partition that serves as the home for the boot volume, onto which the setup program installs the Windows system files and creates the system directory (\Windows). The system and boot volumes can be the same volume, in which case you don't have to create a new partition for the boot volume. The nomenclature that Microsoft defines for system and boot volumes is somewhat confusing. The system volume is where Windows places boot files, including the boot loader (Ntldr) and Ntdetect, and the boot volume is where Windows stores operating system files such as Ntoskrnl.exe, the core kernel file.

Although the partitioning data of a dynamic disk resides in the LDM database, LDM implements MBR-style partitioning or GPT partitioning so that the Windows boot code can find the system and boot volumes when those volumes are on dynamic disks. (Ntldr and the IA64 firmware, for example, know nothing about LDM partitioning.) If a disk contains the system or boot volumes, partitions in the MBR or GPT partition table describe the location of those volumes. Otherwise, one partition encompasses the entire usable area of the disk. LDM marks this partition as type "LDM," a partition type new to Windows 2000. The region encompassed by this place-holding MBR-style or GPT partition is where LDM creates partitions that the LDM database organizes. On MBR-partitioned disks, the LDM database resides in hidden sectors at the end of the disk; on GPT-partitioned disks, an LDM metadata partition near the beginning of the disk contains the LDM database.

Another reason LDM creates an MBR-style or GPT partition table is so that legacy disk-management utilities, including those that run under Windows and under other operating systems in dual-boot environments, don't mistakenly believe a dynamic disk is unpartitioned.

Because LDM partitions aren't described in the MBR-style or GPT partition table of a disk, they are called soft partitions; MBR-style and GPT partitions are called hard partitions. Figure 10-6 illustrates this dynamic disk layout on an MBR-style partitioned disk.

Figure 10-6. Internal dynamic disk organization


Dynamic Disk Volume Manager

The Disk Management MMC snap-in DLL (DMDiskManager, located in \Windows\System32\Dmdskmgr.dll), shown in Figure 10-7, uses DMAdmin, the LDM Disk Administrator service (\Windows\System32\Dmadmin.exe), to create and change the contents of the LDM database. When you launch the Disk Management MMC snap-in, DMDiskManager loads into memory and starts DMAdmin, if it's not already running. DMAdmin reads the LDM database from each disk and returns the information it obtains to DMDiskManager. If DMAdmin detects a database from another computer's disk group, it notes that the volumes on the disk are foreign and lets you import them into the current computer's database if you want to use them. As you change the configuration of dynamic disks, DMDiskManager informs DMAdmin of the changes and DMAdmin updates its in-memory copy of the database. When DMAdmin commits changes, it passes the updated database to DMIO, the Dmio.sys device driver. DMIO is the dynamic disk equivalent of FtDisk, so it controls access to the on-disk database and creates device objects that represent the volumes on dynamic disks. When you exit Disk Management, DMDiskManager stops and unloads the DMAdmin service.

Figure 10-7. Disk Management MMC snap-in


DMIO doesn't know how to interpret the database it oversees. DMConfig (\Windows\System32\Dmconfig.dll), a DLL that DMAdmin loads, and another device driver, DMBoot (Dmboot.sys), are responsible for interpreting the database. DMConfig knows how to both read and update the database; DMBoot knows only how to read the database. DMBoot loads during the boot process if another LDM driver, DMLoad (Dmload.sys), determines that at least one dynamic disk is present on the system. DMLoad makes this determination by asking DMIO, and if at least one dynamic disk is present, DMLoad starts DMBoot, which scans the LDM database. DMBoot informs DMIO of the composition of each volume it encounters so that DMIO can create device objects to represent the volumes. DMBoot unloads from memory immediately after it finishes its scan. Because DMIO has no database interpretation logic, it is relatively small. Its small size is beneficial because DMIO is always loaded.

Like FtDisk, DMIO is a bus driver and creates a device object for each dynamic disk volume with a name in the form \Device\HarddiskDmVolumes\PhysicalDmVolumes\BlockVolumeX, in which X is an identifier that DMIO assigns to the volume. In addition, DMIO creates another device object that represents raw (unstructured) I/O in a volume named \Device\HarddiskDmVolumes\PhysicalDmVolumes\RawVolumeX. Figure 10-8 shows the device objects that DMIO created on a system that consists of three dynamic disk volumes. DMIO also creates numerous symbolic links in the object manager namespace for each volume, starting with one link in the form \Device\HarddiskDmVolumes\ComputerNameDg0\VolumeY for each volume. DMIO replaces ComputerName with the name of the computer and replaces Y with a volume identifier (different from the internal identifier that DMIO assigns to the device objects). These links refer to the block-device objects under the PhysicalDmVolumes directory.

Figure 10-8. DMIO driver device objects


Note

Another driver present in Windows 2000 is DiskPerf (Disk Performance driver, located at \Windows\System32\Drivers\Diskperf.sys). DiskPerf attaches device objects to the device objects that represent physical disks (for example, \Device\Harddisk0\Partition0) so that it can monitor I/O requests targeted at disks and generate performance-related statistics for the Performance tool to present. Statistics include bytes read and written per second, transfers per second, and the amount of time spent performing disk I/O. On Windows XP and Windows Server 2003, DiskPerf's functionality is implemented in the Partition Manager driver because it already filters physical disk device objects to implement its other functions (described earlier in this chapter).


Multipartition Volume Management

FtDisk and DMIO are responsible for presenting volumes that file system drivers manage and for mapping I/O directed at volumes to the underlying partitions that they're made of. For simple volumes, this process is straightforward: the volume manager ensures that volume-relative offsets are translated to disk-relative offsets by adding the volume-relative offset to the volume's starting disk offset.

Multipartition volumes are more complex because the partitions that make up a volume can occupy discontiguous regions of a disk or even reside on different disks. Some types of multipartition volumes use data redundancy, so they require more involved volume-to-disk offset translation. Thus, DMIO must process all I/O requests aimed at the multipartition volumes it manages by determining which partitions the I/O ultimately affects.

The following types of multipartition volumes are available in Windows:

  • Spanned volumes

  • Mirrored volumes

  • Striped volumes

  • RAID-5 volumes

After describing multipartition-volume partition configuration and logical operation for each of the multipartition-volume types, we'll cover the way that the FtDisk and DMIO drivers handle IRPs that a file system driver sends to multipartition volumes. The term volume manager is used to represent DMIO throughout the explanation of multipartition volumes because, as mentioned earlier in the chapter, FtDisk supports only multipartition volumes migrated from a Windows NT 4 installation.

Spanned Volumes

A spanned volume is a single logical volume composed of a maximum of 32 free partitions on one or more disks. The Windows Disk Management MMC snap-in combines the partitions into a spanned volume, which can then be formatted for any of the Windows-supported file systems. Figure 10-9 shows a 100-MB spanned volume identified by drive letter D: that has been created from the last third of the first disk and the first third of the second. Spanned volumes were called volume sets in Windows NT 4.

Figure 10-9. Spanned volume


A spanned volume is useful for consolidating small areas of free disk space into one larger volume or for creating a single, large volume out of two or more small disks. If the spanned volume has been formatted for NTFS, it can be extended to include additional free areas or additional disks without affecting the data already stored on the volume. This extensibility is one of the biggest benefits of describing all data on an NTFS volume as a file. NTFS can dynamically increase the size of a logical volume because the bitmap that records the allocation status of the volume is just another file, the bitmap file. The bitmap file can be extended to include any space added to the volume. Dynamically extending a FAT volume, on the other hand, would require the FAT itself to be extended, which would dislocate everything else on the disk.

A volume manager hides the physical configuration of disks from the file systems installed on Windows. NTFS, for example, views volume D: in Figure 10-9 as an ordinary 100-MB volume. NTFS consults its bitmap to determine what space in the volume is free for allocation. It then calls the volume manager to read or write data beginning at a particular byte offset on the volume. The volume manager views the physical sectors in the spanned volume as numbered sequentially from the first free area on the first disk to the last free area on the last disk. It determines which physical sector on which disk corresponds to the supplied byte offset.
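
The translation the volume manager performs can be pictured with a small sketch: given an ordered list of the partitions (extents) that make up a spanned volume, a volume-relative byte offset maps to a particular disk and a disk-relative offset. The structures and function below are illustrative only, not the actual FtDisk or DMIO data structures:

#include <stdint.h>
#include <stddef.h>

typedef struct {
    int      DiskNumber;     /* which physical disk holds this extent */
    uint64_t DiskOffset;     /* where the extent starts on that disk, in bytes */
    uint64_t Length;         /* extent length in bytes */
} VolumeExtent;

/* Maps a volume-relative offset to a (disk, disk-relative offset) pair by walking the
   extents in order, the way the text describes sectors being numbered sequentially
   across the free areas. Returns 0 on success, -1 if the offset is out of range. */
int MapSpannedOffset(const VolumeExtent *extents, size_t count,
                     uint64_t volumeOffset, int *disk, uint64_t *diskOffset)
{
    for (size_t i = 0; i < count; i++) {
        if (volumeOffset < extents[i].Length) {
            *disk = extents[i].DiskNumber;
            *diskOffset = extents[i].DiskOffset + volumeOffset;
            return 0;
        }
        volumeOffset -= extents[i].Length;
    }
    return -1;   /* offset lies beyond the end of the volume */
}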

Striped Volumes

A striped volume is a series of up to 32 partitions, one partition per disk, that gets combined into a single logical volume. Striped volumes are also known as RAID level 0 (RAID-0) volumes. Figure 10-10 shows a striped volume consisting of three partitions, one on each of three disks. (A partition in a striped volume need not span an entire disk; the only restriction is that the partitions on each disk be the same size.)

Figure 10-10. Striped volume


To a file system, this striped volume appears to be a single 450-MB volume, but a volume manager optimizes data storage and retrieval times on the striped volume by distributing the volume's data among the physical disks. The volume manager accesses the physical sectors of the disks as if they were numbered sequentially in stripes across the disks, as illustrated in Figure 10-11.

Figure 10-11. Logical numbering of physical sectors on a striped volume


Because each stripe is a relatively narrow 64 KB (a value chosen to prevent individual reads and writes from accessing two disks), the data tends to be distributed evenly among the disks. Stripes thus increase the probability that multiple pending read and write operations will be bound for different disks. And because data on all three disks can be accessed simultaneously, latency time for disk I/O is often reduced, particularly on heavily loaded systems.
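
Using the 64-KB stripe size mentioned above, the mapping from a volume-relative offset to a member disk and a disk-relative offset is simple modular arithmetic. The sketch below assumes equal-sized member partitions and that consecutive stripes go to consecutive disks; it also ignores I/O that crosses a stripe boundary, which the volume manager would split into more than one request:

#include <stdint.h>

#define STRIPE_SIZE (64 * 1024)   /* 64-KB stripes, as described in the text */

/* For an N-disk striped (RAID-0) volume, find which member holds a given
   volume-relative byte offset and where on that member it lands. */
void MapStripedOffset(uint64_t volumeOffset, int diskCount,
                      int *memberIndex, uint64_t *memberOffset)
{
    uint64_t stripeNumber   = volumeOffset / STRIPE_SIZE;   /* which stripe overall */
    uint64_t offsetInStripe = volumeOffset % STRIPE_SIZE;

    *memberIndex  = (int)(stripeNumber % diskCount);                 /* which disk holds this stripe */
    *memberOffset = (stripeNumber / diskCount) * STRIPE_SIZE + offsetInStripe;
}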

Spanned volumes make managing disk volumes more convenient, and striped volumes spread the I/O load over multiple disks. These two volume-management features don't provide the ability to recover data if a disk fails, however. For data recovery, a volume manager implements three redundant storage schemes: mirrored volumes, RAID-5 volumes, and sector sparing. (Sector sparing and NTFS support for sector sparing are described in Chapter 12.) These features are created with the Windows Disk Management administrative tool.

Mirrored Volumes

In a mirrored volume, the contents of a partition on one disk are duplicated in an equal-sized partition on another disk. Mirrored volumes are sometimes referred to as RAID level 1 (RAID-1). A mirrored volume is shown in Figure 10-12.

Figure 10-12. Mirrored volume


When a program writes to drive C:, the volume manager writes the same data to the same location on the mirror partition. If the first disk or any of the data on its C: partition becomes unreadable because of a hardware or software failure, the volume manager automatically accesses the data from the mirror partition. A mirror volume can be formatted for any of the Windows-supported file systems. The file system drivers remain independent and are not affected by the volume manager's mirroring activity.

Mirrored volumes can aid in read I/O throughput on heavily loaded systems. When I/O activity is high, the volume manager balances its read operations between the primary partition and the mirror partition (accounting for the number of unfinished I/O requests pending from each disk). Two read operations can proceed simultaneously and thus theoretically finish in half the time. When a file is modified, both partitions of the mirror set must be written, but disk writes are done asynchronously, so the performance of user-mode programs is generally not affected by the extra disk update.

Mirrored volumes are the only multipartition volume type supported for system and boot volumes. The reason for this is that the Windows boot code, including the MBR code and Ntldr, doesn't have the sophistication required to understand multipartition volumes; mirrored volumes are the exception because the boot code treats them as simple volumes, reading from the half of the mirror marked as the boot or system drive in the MBR-style partition table. Because the boot code doesn't modify the disk, it can safely ignore the other half of the mirror.

EXPERIMENT: Watching Mirrored Volume I/O Operations

Using the Windows Performance tool, you can verify that write operations directed at mirrored volumes copy to both disks that make up the mirror and that read operations, if relatively infrequent, occur primarily from one half of the volume. This experiment requires three hard disks and a Windows 2000 server or Windows Server 2003 system. If you don't have three disks or a server system, you can skip the experiment setup instructions and view the Performance tool screen shot in this experiment that demonstrates the experiment's results.

Use the Disk Management MMC snap-in to create a mirrored volume. To do this, perform the following steps:

  1. Run Disk Management by starting Computer Management, expanding the Storage tree, and selecting Disk Management (or by inserting Disk Management as a snap-in in an MMC console).

  2. Right-click on an unallocated space of a drive, and select Create Volume.

  3. Follow the instructions in the Create Volume Wizard to create a simple volume. (Make sure there's enough room on another disk for a volume of the same size as the one you're creating.)

  4. Right-click on the new volume, and select Add Mirror from the context menu.

Once you have a mirrored volume, run the Performance tool and add counters for the PhysicalDisk performance object for both disk instances that contain a partition belonging to the mirror. Select the Disk Reads/Sec and Disk Writes/Sec counters for each instance. Select a large directory from the third disk (the one that isn't part of the mirrored volume), and copy it to the mirrored volume. The Performance tool output window should look something like the following as the copy operation progresses.



The top two lines, which overlap throughout the timeline, are the Disk Writes/Sec counters for each disk. The bottom two lines are the Disk Reads/Sec lines. The screen shot reveals that the volume manager (in this case DMIO) is writing the copied file data to both halves of the volume but primarily reading from only one. This read behavior occurs because the number of outstanding I/O operations during the copy didn't warrant that the volume manager perform more aggressive read-operation load balancing.


RAID-5 Volumes

A RAID-5 volume is a fault-tolerant variant of a regular striped volume. RAID-5 volumes implement RAID level 5. They are also known as striped volumes with parity because they are based on the striping approach taken by striped volumes. Fault tolerance is achieved by reserving the equivalent of one disk for storing parity for each stripe. Figure 10-13 is a visual representation of a RAID-5 volume.

Figure 10-13. RAID-5 volume


In Figure 10-13, the parity for stripe 1 is stored on disk 1. It contains a byte-for-byte logical sum (XOR) of the first stripe on disks 2 and 3. The parity for stripe 2 is stored on disk 2, and the parity for stripe 3 is stored on disk 3. Rotating the parity across the disks in this way is an I/O optimization technique. Each time data is written to a disk, the parity bytes corresponding to the modified bytes must be recalculated and rewritten. If the parity were always written to the same disk, that disk would be busy continually and could become an I/O bottleneck.

Recovering a failed disk in a RAID-5 volume relies on a simple arithmetic principle: in an equation with n variables, if you know the value of n - 1 of the variables, you can determine the value of the missing variable by subtraction. For example, in the equation x + y = z, where z represents the parity stripe, the volume manager computes z - y to determine the contents of x; to find y, it computes z - x. The volume manager uses similar logic to recover lost data. If a disk in a RAID-5 volume fails or if data on one disk becomes unreadable, the volume manager reconstructs the missing data by using the XOR operation (bitwise logical addition).

If disk 1 in Figure 10-13 fails, the contents of its stripes 2 and 5 are calculated by XORing the corresponding stripes of disk 3 with the parity stripes on disk 2. The contents of stripes 3 and 6 on disk 1 are similarly determined by XORing the corresponding stripes of disk 2 with the parity stripes on disk 3. At least three disks (or rather, three same-sized partitions on three disks) are required to create a RAID-5 volume.
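
The parity and recovery computations described above are just byte-wise XOR loops. The following sketch shows the idea for one stripe's worth of data; the block layout and function names are illustrative, not the volume manager's actual implementation:

#include <stdint.h>
#include <stddef.h>

/* Compute the parity block for one stripe as the byte-for-byte XOR of the data blocks. */
void ComputeParity(const uint8_t *const dataBlocks[], size_t blockCount,
                   size_t blockSize, uint8_t *parity)
{
    for (size_t i = 0; i < blockSize; i++) {
        uint8_t p = 0;
        for (size_t d = 0; d < blockCount; d++)
            p ^= dataBlocks[d][i];
        parity[i] = p;
    }
}

/* Reconstruct the block of a failed disk by XORing the parity block
   with every surviving data block in the stripe. */
void ReconstructMissingBlock(const uint8_t *const survivingBlocks[], size_t survivorCount,
                             size_t blockSize, const uint8_t *parity, uint8_t *missing)
{
    for (size_t i = 0; i < blockSize; i++) {
        uint8_t v = parity[i];
        for (size_t d = 0; d < survivorCount; d++)
            v ^= survivingBlocks[d][i];
        missing[i] = v;
    }
}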

The Volume Namespace

Drive-letter assignment is an aspect of storage management that changed significantly from Windows NT 4 to Windows 2000. Even so, Windows includes support for migrating drive-letter assignments made in a Windows NT 4 installation that upgrades to Windows. Windows NT 4 drive-letter assignments are stored in HKLM\SYSTEM\Disk. After the upgrade procedure reads and stores the information in Windows-specific locations, the system no longer references the Disk key.

The Mount Manager

The Mount Manager device driver (Mountmgr.sys) assigns drive letters for dynamic disk volumes and basic disk volumes created after Windows is installed, CD-ROMs, floppies, and removable devices. Windows stores all drive-letter assignments under HKLM\SYSTEM\MountedDevices. If you look in the registry under that key, you'll see values with names such as \??\Volume{X} (where X is a GUID) and values such as \DosDevices\C:. Every volume has a volume name entry, but a volume doesn't necessarily have an assigned drive letter. Figure 10-14 shows the contents of an example Mount Manager registry key. Note that the MountedDevices key, like the Disk key in Windows NT 4, isn't included in a control set and so isn't protected by the last known good boot option. (See the section "Accepting the Boot and Last Known Good" in Chapter 4 for more information on control sets and the last known good boot option.)

Figure 10-14. Mounted devices listed in the Mount Manager's registry key


The data that the registry stores in values for basic disk volume drive letters and volume names is the Windows NT 4-style disk signature and the starting offset of the first partition associated with the volume. The data that the registry stores in values for dynamic disk volumes includes the volume's DMIO-internal GUID. When the Mount Manager initializes during the boot process, it registers with the Windows Plug and Play subsystem so that it receives notification whenever either FtDisk or DMIO creates a volume. When the Mount Manager receives such a notification, it determines the new volume's GUID or disk signature and uses the GUID or signature as a guide to look in its internal database, which reflects the contents of the MountedDevices registry key. The Mount Manager then determines whether its internal database contains the drive-letter assignment. If the volume has no entry in the database, the Mount Manager asks either FtDisk or DMIO (whichever created the volume) for a suggested drive-letter assignment and stores that in the database. FtDisk doesn't return suggestions, but DMIO looks at the drive-letter hint in the volume's database entry.

If no suggested drive-letter assignment exists for the volume, the Mount Manager uses the first unassigned drive letter (if one exists), defines a new assignment, creates a symbolic link for the assignment (for example, \Global??\D: on Windows XP and Windows Server 2003 or \??\D: on Windows 2000), and updates the MountedDevices registry key. If there are no available drive letters, no drive-letter assignment is made. At the same time, the Mount Manager creates a volume symbolic link (that is, \Global??\Volume{X}) that defines a new volume GUID if the volume doesn't already have one. This GUID is different from the volume GUIDs that DMIO uses internally.
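
From user mode you can enumerate the volume GUID names the Mount Manager exposes and see which drive letters or mount-point paths, if any, each one carries. The sketch below uses the Win32 volume enumeration APIs; GetVolumePathNamesForVolumeNameW is available on Windows XP and Windows Server 2003:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    WCHAR volume[MAX_PATH], paths[MAX_PATH];
    DWORD returned;

    HANDLE find = FindFirstVolumeW(volume, MAX_PATH);
    if (find == INVALID_HANDLE_VALUE)
        return 1;
    do {
        /* volume has the form \\?\Volume{GUID}\, the user-mode face of the
           \Global??\Volume{X} links described above. */
        paths[0] = L'\0';
        GetVolumePathNamesForVolumeNameW(volume, paths, MAX_PATH, &returned);
        wprintf(L"%ls  ->  %ls\n", volume,
                paths[0] ? paths : L"(no drive letter or mount point)");
    } while (FindNextVolumeW(find, volume, MAX_PATH));
    FindVolumeClose(find);
    return 0;
}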

Mount Points

Mount points let you link volumes through directories on NTFS volumes, which makes volumes with no drive-letter assignment accessible. For example, an NTFS directory that you've named C:\Projects could mount another volume (NTFS or FAT) that contains your project directories and files. If your project volume had a file you named \CurrentProject\Description.txt, you could access the file through the path C:\Projects\CurrentProject\Description.txt. What makes mount points possible is reparse point technology. (Reparse points are discussed in more detail in Chapter 12.)

A reparse point is a block of arbitrary data with some fixed header data that Windows associates with an NTFS file or directory. An application or the system defines the format and behavior of a reparse point, including the value of the unique reparse point tag that identifies reparse points belonging to the application or system and specifies the size and meaning of the data portion of a reparse point. (The data portion can be as large as 16 KB.) Reparse points store their unique tag in a fixed segment. Any application that implements a reparse point must supply a file system filter driver to watch for reparse-related return codes for file operations that execute on NTFS volumes, and the driver must take appropriate action when it detects the codes. NTFS returns a reparse status code whenever it processes a file operation and encounters a file or directory with an associated reparse point.
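
You can see a mount point's reparse tag from user mode by opening the directory with FILE_FLAG_OPEN_REPARSE_POINT and issuing the FSCTL_GET_REPARSE_POINT control code. The sketch below reads only the tag, which is the first 32-bit field of the returned buffer; C:\Projects is the hypothetical mount-point directory from the example above:

#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main(void)
{
    BYTE  buffer[16 * 1024];     /* reparse data can be as large as 16 KB */
    DWORD bytes, tag;

    HANDLE h = CreateFileW(L"C:\\Projects", FILE_READ_ATTRIBUTES,
                           FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, NULL,
                           OPEN_EXISTING,
                           FILE_FLAG_OPEN_REPARSE_POINT | FILE_FLAG_BACKUP_SEMANTICS, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    if (DeviceIoControl(h, FSCTL_GET_REPARSE_POINT, NULL, 0,
                        buffer, sizeof(buffer), &bytes, NULL)) {
        tag = *(DWORD *)buffer;                       /* the reparse tag leads the buffer */
        printf("Reparse tag: 0x%08lX%s\n", tag,
               tag == IO_REPARSE_TAG_MOUNT_POINT ? " (mount point)" : "");
    }
    CloseHandle(h);
    return 0;
}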

The Windows NTFS file system driver, the I/O manager, and the object manager all partly implement reparse point functionality. The object manager initiates pathname-parsing operations by using the I/O manager to interface with file system drivers. Therefore, the object manager must retry operations for which the I/O manager returns a reparse status code. The I/O manager implements pathname modification that mount points and other reparse points might require, and the NTFS file system driver must associate and identify reparse point data with files and directories. You can therefore think of the I/O manager as the reparse point file system filter driver for many Microsoft-defined reparse points.

An example of a reparse point application is a Hierarchical Storage Management (HSM) system, such as the Windows Remote Storage Service (RSS) included with Windows 2000 Server and Windows Server 2003, which uses reparse points to designate files that an administrator moves to offline tape storage. When a user tries to access an offline file, the HSM filter driver detects the reparse status code that NTFS returns, communicates with a user-mode service to retrieve the file from offline storage, deletes the reparse point from the file, and lets the file operation retry after the service retrieves the file. This is exactly how the RSS filter driver, Rsfilter.sys, uses reparse points.

If the I/O manager receives a reparse status code from NTFS and the file or directory for which NTFS returned the code isn't associated with one of a handful of built-in Windows reparse points, no filter driver claimed the reparse point. The I/O manager then returns an error to the object manager that propagates as a "file cannot be accessed by the system" error to the application making the file or directory access.

Mount points are reparse points that store a volume name (\Global??\Volume{X}) as the reparse data. When you use the Disk Management MMC snap-in to assign or remove path assignments for volumes, you're creating mount points. You can also create and display mount points by using the built-in command-line tool Mountvol.exe (\Windows\System32\Mountvol.exe).
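
Programs can do the same thing Mountvol does through the Win32 volume mount point APIs. A minimal sketch follows, assuming E: is the volume you want to mount and C:\Projects\ is an existing, empty NTFS directory; both are placeholders, and administrative rights are required:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    WCHAR volumeName[MAX_PATH];

    /* Translate the drive letter into the \\?\Volume{GUID}\ name the mount point will store. */
    if (!GetVolumeNameForVolumeMountPointW(L"E:\\", volumeName, MAX_PATH)) {
        printf("GetVolumeNameForVolumeMountPoint failed: %lu\n", GetLastError());
        return 1;
    }

    /* Graft the volume into the NTFS namespace at C:\Projects\ (note the trailing backslashes). */
    if (!SetVolumeMountPointW(L"C:\\Projects\\", volumeName)) {
        printf("SetVolumeMountPoint failed: %lu\n", GetLastError());
        return 1;
    }

    wprintf(L"Mounted %ls at C:\\Projects\\\n", volumeName);
    return 0;
}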

The Mount Manager maintains the Mount Manager remote database on every NTFS volume in which the Mount Manager records any mount points defined for that volume. The database file, $MountMgrRemoteDatabase, resides in the NTFS root directory. Because the remote database travels with the volume, mount points move with a disk when it moves from one system to another, and they remain defined in dual-boot environments, that is, when booting between multiple Windows installations. NTFS also keeps track of mount points in the NTFS metadata file \$Extend\$Reparse. (NTFS doesn't make any of its metadata files available for viewing by applications.) NTFS stores mount-point information in the metadata file so that Windows can easily enumerate the mount points defined for a volume when a Windows application, such as Disk Management, requests mount-point definitions.

EXPERIMENT: Recursive Mount Points

This experiment uses Filemon from http://www.sysinternals.com to show the interesting behavior caused by a recursive mount point. A recursive mount point is a mount point that links to the same volume it's on. Performing a recursive directory listing on a recursive mount point produces file access traces that clearly demonstrate how NTFS treats mount points.

To create and view a mount point, perform the following steps:

  1. Open a command prompt or Windows Explorer window, and create a directory on an NTFS drive named \Recurse.

  2. In the Disk Management MMC snap-in, right-click on the volume and select Change Drive Letter And Path.

  3. When the Add/Remove dialog box appears, enter the path to the directory you created (for example, I:\Recurse).

  4. Start Filemon, and in the Drives menu, uncheck all the volumes except the one on which you created the mount point.

Now you're ready to generate a recursive mount-point trace. Open a command prompt, and execute the command dir /s I:\Recurse. If you look at all the file accesses that reference Recurse in Filemon's trace of the subsequent file operations, you'll notice that the command prompt first accesses I:\Recurse, then I:\Recurse\Recurse, and so on, recursing deeper and deeper.

The application attempts to perform a directory listing at each level of the recursion, but whenever it encounters the mount point, it digs into it to try another directory listing. NTFS returns a reparse status code, which tells the object manager to back up one level and try again. Finally, when it gets back to the root directory, the application examines the file or directory that it had found deep in the mount-point recursion. The application never receives a reparse code because of the object manager-directed retry activity. The object manager processes the reparse codes as it receives them from NTFS when it performs directory lookups.

Filemon presents request types as their native IRP type, so a directory or file open appears as an IRP_MJ_CREATE request. A file or directory close is IRP_MJ_CLOSE, and a directory query is an IRP_MJ_DIRECTORY_CONTROL request with FileBothDirectoryInfo in Filemon's Other column.



To prevent buffer overflows and infinite loops, both command prompt and Windows Explorer halt their recursion when the directory depth reaches 32 or the pathname exceeds 256 characters, whichever comes first.


Volume Mounting

The fact that Windows assigns a drive letter to a volume doesn't mean that the volume contains data that has been organized in a file system format that Windows recognizes. The volume-recognition process consists of a file system claiming ownership for a partition; the process takes place the first time the kernel, a device driver, or an application accesses a file or directory on a volume. After a file system driver signals its responsibility for a partition, the I/O manager directs all IRPs aimed at the volume to the owning driver. Mount operations in Windows consist of three components: file system driver registration, volume parameter blocks (VPBs), and mount requests.

Note

In Windows Server 2003 Enterprise and Datacenter Edition, the automatic mounting of volumes as the Volume Manager reports their presence is disabled to prevent the system from aggressively mounting volumes attached to a System Area Network (SAN). You can use the Diskpart command-line utility, which Windows Server 2003 includes, to enable and disable automounting behavior, and the Mountvol utility, also included in the system, to mount volumes.


The I/O manager oversees the mount process and is aware of available file system drivers because all file system drivers register with the I/O manager when they initialize. The I/O manager provides the IoRegisterFileSystem function to local disk (rather than network) file system drivers for this registration. When a file system driver registers, the I/O manager stores a reference to the driver in a list that the I/O manager uses during mount operations.

Every device object contains a VPB data structure, but the I/O manager treats VPBs as meaningful only for volume device objects. A VPB serves as the link between a volume device object and the device object that a file system driver creates to represent a mounted file system instance for that volume. If a VPB's file system reference is empty, no file system has mounted the volume. The I/O manager checks a volume device object's VPB whenever an open API that specifies a filename or a directory name on a volume device object executes.

For example, if the mount manager assigns drive letter D to the second volume on a system, it creates a \Global??\D: symbolic link that resolves to the device object \Device\HarddiskVolume2. A Windows application that attempts to open the \Temp\Test.txt file on the D: drive specifies the name D:\Temp\Test.txt, which the Windows subsystem converts to \Global??\D:\Temp\Test.txt before invoking NtCreateFile, the kernel's file-open routine. NtCreateFile uses the object manager to parse the name, and the object manager encounters the \Device\HarddiskVolume2 device object with the path \Temp\Test.txt still unresolved. At that point, the I/O manager checks to see whether \Device\HarddiskVolume2's VPB references a file system. If it doesn't, the I/O manager asks each registered file system driver via a mount request whether the driver recognizes the format of the volume in question as the driver's own.

EXPERIMENT: Looking at VPBs

You can look at the contents of a VPB by using the !vpb kernel debugger command. Because the VPB is pointed to by the device object for a volume, you must first locate a volume device object. To do this, you must dump a volume manager's driver object, locate a device object that represents a volume, and display the device object, which reveals its VPB field.

If your system has a dynamic disk, you can use the !drvobj driver object viewing command on the DMIO driver; otherwise, you need to specify the FtDisk driver. Here's an example:

kd> !drvobj ftdisk
Driver object (818aec50) is for:
 \Driver\Ftdisk
Driver Extension List: (id, addr)

Device Object list:
818a5290  817e96f0  817e98b0  817e9030
818a73b0  818a7810  8182d030

The !drvobj command lists the addresses of the device objects a driver owns. In this example, there are seven device objects. One of them represents the programmatic interface to the device driver, and the rest are volume device objects. Because the objects are listed in reverse order of the way that they were created and the driver creates the device driver interface object first, you know the first device object listed is that of a volume. Now execute the !devobj kernel debugger command on the volume device object address:

kd> !devobj 818a5290
Device object (818a5290) is for:
 HarddiskVolume6 \Driver\Ftdisk DriverObject 818aec50
Current Irp 00000000 RefCount 3 Type 00000007 Flags 00001050
Vpb 818a5da8 Dacl e1000384 DevExt 818a5348 DevObjExt 818a53f8 Dope 818a50a8 DevNode 818a5ae8
ExtensionFlags (0xa8000000)  DOE_RAW_FDO, DOE_DESIGNATED_FDO
                             Unknown flags 0x08000000
AttachedDevice (Upper) 86b52b58 \Driver\VolSnap
Device queue is not busy.

The !devobj command shows the VPB field for the volume device object. (The device object shown is named HarddiskVolume6.) Now you're ready to execute the !vpb command:

kd> !vpb 818a5da8
Vpb at 0x818a5da8
Flags: 0x1  mounted
DeviceObject: 0x850dcac0
RealDevice:   0x818a5290
RefCount: 3
Volume Label:     BACKUP

The command reveals that the volume device object is mounted by a file system driver that has assigned the volume the name BACKUP. The RealDevice field in the VPB points back to the volume device object, and the DeviceObject field points to the mounted file system device object.


The convention followed by file system drivers for recognizing volumes mounted with their format is to examine the volume's boot record, which is stored in the first sector of the volume. Boot records for Microsoft file systems contain a field that stores a file system format type. File system drivers usually examine this field, and if it indicates a format they manage, they look at other information stored in the boot record. This information usually includes a file system name field and enough data for the file system driver to locate critical metadata files on the volume. NTFS, for example, will recognize a volume only if the type field is NTFS, the name field is "NTFS," and the critical metadata files described by the boot record are consistent.
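
As a concrete illustration, the NTFS boot sector carries the string "NTFS" in its name field near the start of the sector; a minimal user-mode check might look like the sketch below. The eight-character field at byte offset 3 is the commonly documented boot-sector layout, not the NTFS driver's actual recognition code, which also validates the metadata the boot record describes:

#include <stdint.h>
#include <string.h>

/* Given the first 512 bytes of a volume, check the file system name field:
   an NTFS boot sector stores "NTFS    " (padded with spaces) at byte offset 3. */
int LooksLikeNtfsBootSector(const uint8_t *bootSector)
{
    return memcmp(bootSector + 3, "NTFS    ", 8) == 0;
}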

If a file system driver signals affirmatively, the I/O manager fills in the VPB and passes the open request with the remaining path (that is, \Temp\Test.txt) to the file system driver. The file system driver completes the request by using its file system format to interpret the data that the volume stores. After a mount fills in a volume device object's VPB, the I/O manager hands subsequent open requests aimed at the volume to the mounted file system driver. If no file system driver claims a volume, Raw, a file system driver built into Ntoskrnl.exe, claims the volume and fails all requests to open files on that partition. Figure 10-15 shows a simplified example (that is, the figure omits the file system driver's interactions with the Windows cache and memory managers) of the path that I/O directed at a mounted volume follows.

Figure 10-15. Mounted volume I/O flow


Instead of having every file system driver loaded, regardless of whether or not they have any volumes to manage, Windows tries to minimize memory usage by using a surrogate driver named File System Recognizer (\Windows\System32\Drivers\Fs_rec.sys) to perform preliminary file system recognition. File System Recognizer knows enough about each file system format that Windows supports to be able to examine a boot record and determine whether it's associated with a Windows file system driver. When the system boots, File System Recognizer registers as a file system driver, and when the I/O manager calls it during a file system mount operation for a new volume, File System Recognizer loads the appropriate file system driver if the boot record corresponds to one that isn't loaded. After loading a file system driver, File System Recognizer forwards the mount IRP to the driver and lets the file system driver claim ownership of the volume.

Aside from the boot volume, which a driver mounts while the kernel is initializing, file system drivers mount most volumes when the Chkdsk file system consistency-checking application runs during a boot sequence. The boot-time version of Chkdsk is a native application (as opposed to a Windows application) named Autochk.exe (\Windows\System32\Autochk.exe), and the Session Manager (\Windows\System32\Smss.exe) runs it because it is specified as a boot-run program in the HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\BootExecute value. Chkdsk accesses each drive letter to see whether the volume associated with the letter requires a consistency check.

One place in which mounting can occur more than once for the same disk is with removable media. Windows file system drivers respond to media changes by querying the disk's volume identifier. If they see the volume identifier change, the driver dismounts the disk and attempts to remount it.

Volume I/O Operations

File system drivers manage data stored on volumes but rely on volume managers to interact with storage drivers to transfer data to and from the disk or disks on which a volume resides. File system drivers obtain references to a volume manager's volume objects through the mount process and then send the volume manager requests via the volume objects. Applications can also send the volume manager requests, bypassing file system drivers, when they want to directly manipulate a volume's data. File-undelete programs are an example of applications that do this, and so is the DiskProbe utility that's part of the Windows resource kits.

Whenever a file system driver or application sends an I/O request to a device object that represents a volume, the Windows I/O manager routes the request (which comes in the form of an IRP, a self-contained package) to the volume manager that created the target device object. Thus, if an application wants to read the boot sector of the second volume on the system (which is a simple volume in this example), it opens the device object \Device\HarddiskVolume2 and then sends the object a request to read 512 bytes starting at offset zero on the device. The I/O manager sends the application's request in the form of an IRP to the volume manager that owns the device object, notifying it that the IRP is directed at the HarddiskVolume2 device.

Because volumes are logical conveniences that Windows uses to represent contiguous areas on one or more physical disks, the volume manager must translate offsets that are relative to a volume to offsets that are relative to the beginning of a disk. If volume 2 consists of one partition that begins 4096 sectors into the disk, the volume manager would adjust the IRP's parameters to designate an offset with that value before passing the request to the disk class driver. The disk class driver uses a miniport driver to carry out physical disk I/O and read requested data into an application buffer designated in the IRP.
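In that example, volume-relative offset zero becomes disk-relative offset 4,096 sectors × 512 bytes per sector = 2,097,152 bytes, assuming standard 512-byte sectors. You can observe the same mapping from user mode with the IOCTL_VOLUME_GET_VOLUME_DISK_EXTENTS control code, which reports the disk number, starting offset, and length of each extent that makes up a volume. The following sketch assumes a simple, single-extent volume named \\.\HarddiskVolume2 and prints the disk-relative offset that corresponds to a volume-relative offset of zero:

// Sketch: map a volume-relative offset to a disk-relative offset for a
// simple (single-extent) volume, using the volume's disk-extent report.
#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

int main(void)
{
    HANDLE hVolume = CreateFileW(L"\\\\.\\HarddiskVolume2", 0,
                                 FILE_SHARE_READ | FILE_SHARE_WRITE,
                                 NULL, OPEN_EXISTING, 0, NULL);
    if (hVolume == INVALID_HANDLE_VALUE)
        return 1;

    VOLUME_DISK_EXTENTS extents;
    DWORD bytes;
    if (DeviceIoControl(hVolume, IOCTL_VOLUME_GET_VOLUME_DISK_EXTENTS,
                        NULL, 0, &extents, sizeof(extents), &bytes, NULL) &&
        extents.NumberOfDiskExtents == 1) {
        const DISK_EXTENT &e = extents.Extents[0];
        ULONGLONG volumeOffset = 0;   // the offset the application asked for
        printf("Volume offset %llu = disk %lu offset %llu\n",
               volumeOffset, e.DiskNumber,
               e.StartingOffset.QuadPart + volumeOffset);
    }
    CloseHandle(hVolume);
    return 0;
}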

Some examples of a volume manager's operations will help clarify its role when it handles requests aimed at multipartition volumes. If a striped volume consists of two partitions, partition 1 and partition 2, that are represented by the device object \Device\HarddiskDmVolumes\PhysicalDmVolumes\BlockVolume3, as Figure 10-16 shows, and an administrator has assigned drive letter D: to the stripe, the I/O manager defines the link \Global??\D: to reference \Device\HarddiskDmVolumes\ComputerNameDg0\Volume3, where ComputerName is the name of the computer. Recall from earlier that this link is also a symbolic link, and it points to the corresponding volume device object in the PhysicalDmVolumes directory (in this case, BlockVolume3). The DMIO device object intercepts file system disk I/O aimed at \Device\HarddiskDmVolumes\PhysicalDmVolumes\BlockVolume3, and the DMIO driver adjusts the request before passing it to the Disk class driver. The adjustment that DMIO makes configures the request to refer to the correct offset of the request's target stripe on either partition 1 or partition 2. If the I/O spans both partitions of the volume, DMIO must issue two subsidiary I/O requests, one aimed at each disk.
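The per-stripe adjustment DMIO makes is essentially the standard striping arithmetic. The following sketch is not DMIO's code; it merely illustrates, for a hypothetical two-member stripe with an assumed 64-KB stripe unit, how a volume-relative offset selects a member partition and an offset within that partition:

// Illustration only: classic stripe-set mapping of a volume-relative
// offset onto one member of a two-partition striped volume.
#include <stdio.h>

int main(void)
{
    const unsigned long long stripeUnit = 64 * 1024;  // assumed 64-KB stripe unit
    const unsigned int members = 2;                    // partition 1 and partition 2

    unsigned long long volumeOffset = 1 * 1024 * 1024 + 4096;  // example request
    unsigned long long stripeNumber = volumeOffset / stripeUnit;
    unsigned int member = (unsigned int)(stripeNumber % members);
    unsigned long long offsetInMember =
        (stripeNumber / members) * stripeUnit + (volumeOffset % stripeUnit);

    printf("Volume offset %llu -> partition %u, partition-relative offset %llu\n",
           volumeOffset, member + 1, offsetInMember);
    return 0;
}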

Figure 10-16. DMIO I/O operations


In the case of writes to a mirrored volume, DMIO splits each request so that each half of the mirror receives the write operation. For mirrored reads, DMIO performs a read from half of a mirror, relying on the other half when a read operation fails.

Virtual Disk Service

A company that makes storage products such as RAID adapters, hard disks, or storage arrays has to implement custom applications for installing and managing its devices. The use of different management applications for different storage devices has obvious drawbacks from the perspective of system administration. These drawbacks include learning multiple interfaces and the inability to use standard Windows storage management tools to manage third-party storage devices.

Windows Server 2003 introduces the Virtual Disk Service (or VDS, located at \Windows\System32\Vds.exe), which provides a unified high-level storage interface so that administrators can manage storage devices from different vendors using the same user interfaces. VDS is shown in Figure 10-17. VDS exports a COM-based API that allows applications and scripts to create and format disks, and to view and manage hardware RAID adapters. For example, a utility can use the VDS API to query the list of physical disks that map to a RAID logical unit number (LUN). Windows disk management utilities, including the Disk Management MMC snap-in and the Diskpart and Diskraid (available in the Windows Server 2003 Deployment Kit) command-line tools, use VDS APIs.

Figure 10-17. VDS service architecture


VDS supplies two interfaces, one for software providers and one for hardware providers:

  • Software providers implement interfaces to high-level storage abstractions such as disks, disk partitions, and volumes. Examples of operations supported by these interfaces include creating, extending, and deleting volumes; adding or breaking mirrors; and formatting volumes and assigning drive letters. VDS looks for registered software providers in HKLM\System\CurrentControlSet\Services\Vds\SoftwareProviders. Windows Server 2003 includes the VDS Dynamic Disk Provider (\Windows\System32\Vdsdyndr.dll) for interfacing to dynamic disks and the VDS Basic Provider (\Windows\System32\Vdsbas.dll) for interfacing to basic disks.

  • Hardware vendors implement VDS hardware providers as DLLs that register under HKLM\System\CurrentControlSet\Services\Vds\HardwareProviders and that translate device-independent VDS commands into commands for their hardware. A hardware provider allows for management of a storage subsystem such as a hardware RAID array or adapter card; supported operations include creating, extending, deleting, and masking or unmasking LUNs.

When an application initiates a connection to the VDS API and the VDS service isn't started, the Svchost process hosting the RPC service starts the VDS loader process (\Windows\System32\Vdsldr.exe), which starts the VDS service process and then exits. When the last connection to the VDS API closes, the VDS service process exits.
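A management application reaches VDS by instantiating the loader COM object and asking it for the service interface; the loader takes care of starting the service on demand. The following is a minimal sketch of that connection sequence, assuming the vds.h header from the Platform SDK:

// Sketch: connect to the Virtual Disk Service on the local machine.
#include <windows.h>
#include <initguid.h>
#include <vds.h>
#include <stdio.h>

#pragma comment(lib, "ole32.lib")

int main(void)
{
    if (FAILED(CoInitialize(NULL)))
        return 1;

    IVdsServiceLoader *pLoader = NULL;
    IVdsService *pService = NULL;

    // The loader object starts the VDS service process on demand.
    HRESULT hr = CoCreateInstance(CLSID_VdsLoader, NULL,
                                  CLSCTX_LOCAL_SERVER,
                                  IID_IVdsServiceLoader, (void **)&pLoader);
    if (SUCCEEDED(hr)) {
        // NULL means the local machine; a computer name connects remotely.
        hr = pLoader->LoadService(NULL, &pService);
        pLoader->Release();
    }
    if (SUCCEEDED(hr)) {
        pService->WaitForServiceReady();   // wait until providers are loaded
        printf("Connected to VDS\n");
        pService->Release();
    }
    CoUninitialize();
    return 0;
}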

Volume Shadow Copy Service

A limitation of many backup utilities relates to open files. If an application has a file open for exclusive access, a backup utility can't gain access to the file's contents. Even if the backup utility can access an open file, the utility runs the risk of creating an inconsistent backup. Consider an application that updates a file at its beginning and then at its end. A backup utility saving the file during this operation might record an image of the file that reflects the start of the file before the application's modification and the end after the modification. If the file is later restored, the application might deem the entire file corrupt because it might be prepared to handle the case where the beginning has been modified and not the end, but not vice versa. These two problems illustrate why most backup utilities skip open files altogether.

Windows XP introduces the Volume Shadow Copy service (\Windows\System32\Vssvc.exe), which allows the built-in backup utility to record consistent views of all files, including open ones. The Volume Shadow Copy service acts as the command center of an extensible backup core that enables independent software vendors (ISVs) to plug in "writers" and providers. A writer is a software component that enables shadow-copy-aware applications to receive freeze and thaw notifications to ensure that backup copies of their data files are internally consistent, whereas providers allow ISVs with unique storage schemes to integrate with the shadow copy service. For instance, an ISV with mirrored storage devices might define a shadow copy as the frozen half of a split mirrored volume. Figure 10-18 shows the relationship between the shadow copy service, writers, and providers.

Figure 10-18. Shadow copy service, writers, and providers


The Microsoft Shadow Copy Provider (\Windows\System32\Drivers\Volsnap.sys) is a provider that ships with Windows to implement software-based snapshots of volumes. It is a type of driver called a storage filter driver that layers between file system drivers and volume drivers (the drivers that present views of the disk sectors that represent a logical drive) so that it can see the I/O directed at a volume. When the backup utility starts a backup operation, it directs the Microsoft Shadow Copy Provider driver to create a volume shadow copy for the volumes that include files and directories being recorded. The driver freezes I/O to the volumes in question and creates a shadow volume for each. For example, if a volume's name in the Object Manager namespace is \Device\HarddiskVolume1, the shadow volume would have a name like \Device\HarddiskVolumeShadowCopyN, where N is a unique ID.

EXPERIMENT: Looking at Microsoft Shadow Copy Provider Filter Device Objects

You can see the Microsoft Shadow Copy Provider driver's device objects attached to each volume device on a Windows XP or Windows Server 2003 system in a kernel debugger. Every system has at least one volume, and the following command displays the device object of the first volume on a system:

0: kd> !devobj \device\harddiskvolume1
Device object (86b9daf8) is for:
 HarddiskVolume1 \Driver\Ftdisk DriverObject 86b6af38
Current Irp 00000000 RefCount 2 Type 00000007 Flags 00001050
Vpb 86b90008 Dacl e1000384 DevExt 86b9dbb0 DevObjExt 86b9dc98 Dope 86ba33e8
DevNode 86b71320
ExtensionFlags (0xa8000000) DOE_RAW_FDO, DOE_DESIGNATED_FDO
                            Unknown flags 0x08000000
AttachedDevice (Upper) 86b687f8 \Driver\VolSnap
Device queue is not busy.

The AttachedDevice field in the !devobj output displays the address of any device object attached to (filtering) the device object being examined, along with the name of the owning driver. For volume device objects, you should see a device object belonging to the Volsnap driver, as in the example output.


Instead of opening files to back up on the original volume, the backup utility opens them on the shadow volume. A shadow volume represents a point-in-time view of a volume, so whenever the volume shadow copy driver sees a write operation directed at an original volume, it reads a copy of the sectors that will be overwritten into a paging file-backed memory section that's associated with the corresponding shadow volume. It services reads of modified sectors directed at the shadow volume from this memory section, and it services reads of unmodified areas by reading from the original volume. Because the backup utility won't save the paging file or the contents of the system-managed \System Volume Information directory located on every volume, the snapshot driver uses the defragmentation API to determine the location of these files and directories and does not record changes to them. By relying on the shadow copy facility, the backup utility in Windows XP and Windows Server 2003 overcomes both of the backup problems related to open files.
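Shadow copy data is reachable from user mode once the shadow volume's device name is known, because the Win32 \\?\GLOBALROOT prefix lets CreateFile address an object manager path directly. In the following sketch, the shadow copy number and the file path are purely illustrative; a real backup application obtains the device name from the shadow copy service:

// Sketch: open a file through a shadow volume's device name. The shadow
// copy index (1) and the file path are illustrative; a real backup
// application obtains the device name from the shadow copy service.
#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE hFile = CreateFileW(
        L"\\\\?\\GLOBALROOT\\Device\\HarddiskVolumeShadowCopy1\\Temp\\Test.txt",
        GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING, 0, NULL);
    if (hFile == INVALID_HANDLE_VALUE) {
        printf("CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    char buffer[256];
    DWORD bytesRead;
    if (ReadFile(hFile, buffer, sizeof(buffer), &bytesRead, NULL))
        printf("Read %lu bytes from the point-in-time copy\n", bytesRead);

    CloseHandle(hFile);
    return 0;
}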

Figure 10-19 demonstrates the behavior of applications accessing a volume and a backup application accessing the volume's shadow volume. When an application writes to a sector after the snapshot time, the Volsnap driver first makes a backup copy, as it has done for sectors a, b, and c of volume C: in the figure. Subsequently, when an application reads from sector c, Volsnap directs the read to volume C:, but when the backup application reads from sector c, Volsnap reads the sector from the snapshot. When a read occurs for any unmodified sector, such as d, Volsnap routes the read to the volume.

Figure 10-19. Volsnap operation


EXPERIMENT: Viewing Shadow Volume Device Objects

You can see the existence of shadow volume device objects in the object manager namespace by running the Windows backup application (under System Tools in the Accessories folder of the Start menu) and selecting enough backup data that the backup process takes long enough for you to run Winobj and see the objects in the \Device directory.




File system drivers must handle two shadow copy related I/O control requests (IOCTL) to help ensure consistent snapshots: IOCTL_VOLSNAP_FLUSH_AND_HOLD_WRITES and IOCTL_VOLSNAP_RELEASE_WRITES. Their names are self-explanatory. The shadow copy API sends the IOCTLs to the logical drives for which snapshots are being taken so that all modifications initiated before the snapshot have completed when the shadow copy is taken, making the file data recorded from a shadow copy consistent in time.

Shadow Copies for Shared Folders

Windows Server 2003 takes advantage of Volume Shadow Copy to provide a feature that lets end users access backup versions of volumes so that they can recover old versions of files and folders that they might have deleted or changed. The feature alleviates the burden on systems administrators, who would otherwise have to load backup media and access previous versions on behalf of end users.

The properties dialog box for a volume on a Windows Server 2003 system includes a tab named Shadow Copies, where an administrator can enable scheduled snapshots of volumes, as seen in the following screen shot. Administrators can also limit the amount of space consumed by snapshots so that the system deletes old snapshots to honor space constraints.



Client systems that want to take advantage of Shadow Copies for Shared Folders must install an Explorer extension called the Previous Versions Client, which ships with Windows Server 2003 in the \Windows\System32\Clients\Twclient directory and which Microsoft makes available as a download from its Web site (the extension comes installed with Windows XP Service Pack 2 and higher). When a client Windows system that has the extension installed maps a share from a folder on a volume for which snapshots exist, a tab named Previous Versions appears in the Properties dialog box for folders and files on the share. The Previous Versions tab shows a list of snapshots that exist on the server, allowing the user to view or copy a file or folder's data as it existed in a previous snapshot.



