6.5 NTFS

The NT file system (NTFS) was designed to be the file system of choice for Windows NT. Since its introduction, several feature enhancements have been made, but the underlying robust design has remained the same. FAT and HPFS (High-Performance File System), the file systems Microsoft already had when NTFS was designed, were inadequate to meet the needs of Windows NT. In particular,

  • FAT does not offer the needed amount of file or object security.

  • FAT does not have the features needed to handle the extremely large disks available today. (Recall that FAT originally was designed to handle 1MB disks.)

  • Neither FAT nor HPFS offers transactional features needed to offer reliability and recovery from a system crash.

NTFS offers various features, summarized here and explored in detail later in this chapter:

  • NTFS transactional features log all file system metadata changes to a log file that allows recovery in the case of a system crash.

  • All data, including file system metadata, is stored in files.

  • NTFS and Win32 APIs use 64-bit pointers for file data structures.

  • NTFS supports file names that can be up to 255 characters long. NTFS also supports the Unicode character set for internationalization.

  • The data structures support fast directory traversal and navigation.

  • The file system supports compression and sparse files.

  • Starting with Windows 2000, an encrypting file system is also supported.

  • The file system supports fault-tolerance features, for example, bad disk cluster or sector remapping.

  • NTFS removes the 8.3 file name limit that MS-DOS introduced. NTFS supports case-sensitive, long file names, as well as the Unicode standard, and provides POSIX file name compatibility by supporting trailing dots and trailing spaces. However, sometimes problems arise when names that are not 8.3 compliant are used. The primary reason for the problems is that some of the tools and utilities used may not support long file names that are not 8.3 compliant. Individual file names in NTFS can be up to 255 characters long, and full pathnames are limited to 32,767 characters.

  • NTFS uses 64-bit file pointers and can theoretically support a file size of 2^64 bytes.

NTFS supports multiple data streams per file. A stream can be opened with the Win32 API CreateFile function by appending a stream name in the format ":StreamName" to the end of the file name, for example, File1:Stream25. Each stream can be read, written, or locked independently of other streams that may be open. The Windows NT Macintosh server uses this feature to support Mac clients, whose files have two "forks": a data fork and a resource fork.
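
As a minimal illustration, the following Win32 C sketch creates and writes a named stream. The file name File1 and stream name Stream25 are placeholder examples taken from the text above, and error handling is abbreviated.

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Open or create the named stream "Stream25" on File1, using the
           file:stream syntax described above. */
        HANDLE h = CreateFileA("File1:Stream25", GENERIC_WRITE, 0, NULL,
                               OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) {
            fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
            return 1;
        }

        const char data[] = "data stored in the named stream";
        DWORD written;
        WriteFile(h, data, sizeof(data) - 1, &written, NULL);
        CloseHandle(h);
        return 0;
    }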

Note that although NTFS supports multiple streams, many tools and applets do not. Thus a file with 1,024 bytes in the regular unnamed stream and 1MB of data in a named stream is reported to be of size 1,024 bytes by the "dir" command (which does not support multiple streams). When a file with multiple streams is copied from NTFS to FAT, for instance, only the default unnamed stream data is copied. The data in the other streams is lost.

Table 6.3 summarizes the differences between the FAT and NTFS file systems.

Table 6.3. Comparison of Windows NT File Systems

Comment                                    | FAT16 | FAT32       | NTFS
-------------------------------------------|-------|-------------|---------------------------
Maximum length of file name                | 8.3   | 255         | 255
Maximum file size                          | 2GB   | 4GB         | 16 exabytes (theoretical maximum)
Maximum size of volume                     | 2GB   | 2 terabytes | 2 terabytes
Compatible with floppy                     | Yes   | Yes         | No
Multiple drives in a single volume         | No    | No          | Yes
File- and directory-level security         | No    | No          | Yes
File- and directory-level access auditing  | No    | No          | Yes
Fault-tolerant features (multiple copies of critical data, journaling of metadata) | No | No | Yes
Encryption and compression of files        | No    | No          | Yes

6.5.1 NTFS System Files

NTFS organizes everything on disk as a series of files: not just the user files, but also the files containing metadata that pertains to the file system itself. This section describes the files that NTFS uses for its internal organization.

The master file table file ($Mft) is always the first file in an NTFS volume. The MFT contains multiple records and has at least one entry for every file or directory on the volume, including an entry for the MFT itself. Each entry in the MFT can be from 1,024 to 4,096 bytes in size, depending on the size of the volume on which the file system resides. Files that have many attributes or are extremely fragmented may require more than one record. The MFT is stored at the beginning of the volume.

System performance is significantly better when the MFT records are stored in contiguous disk clusters; that is, when the MFT is not fragmented and occupies a contiguous area on the disk. To facilitate this arrangement, NTFS reserves an area called the MFT zone at the start of the volume or partition and attempts not to use this area for anything but MFT records. The first files or directories are stored after the MFT zone. About 12 percent of the volume is reserved for the MFT zone. Starting with Windows NT 4.0 SP4, a new registry key was introduced to control the size of the MFT zone. This registry key can take a value between 1 and 4, indicating the size of the MFT zone from minimum (1) to maximum (4). The disk defragmenter indicates the current size of the MFT zone.

The first 24 entries in the MFT table are reserved for Microsoft use. Some of the entries have been reassigned between different releases of the operating system, particularly with the release of Windows 2000. Table 6.4 summarizes the various NTFS system files, also referred to as metadata files.

Table 6.4. NTFS System Files

File [a]  | Record Number | Description
----------|---------------|--------------------------------------------------------------
$Mft      | 0             | Master file table
$MftMirr  | 1             | Master file table mirror containing a copy of the first 16 files in the MFT
$LogFile  | 2             | Log file (for crash recovery and file system consistency)
$Volume   | 3             | Volume description, including volume serial number, date and time of creation, and volume dirty flag
$AttrDef  | 4             | Attribute definition
. (dot)   | 5             | Root directory
$Bitmap   | 6             | Cluster allocation bitmap
$Boot     | 7             | Boot record of drive
$BadClus  | 8             | Bad cluster list
$Quota    | 9             | Defined as user quota file in NT4, but never used
$Secure   | 9             | Redefined as security descriptors in Windows 2000 and now actually used
$UpCase   | 10            | Uppercase table
$Extend   | 11            | Directory that contains the $ObjId, $Quota, $Reparse, and $UsnJrnl files; used from Windows 2000 on
          | 12-23         | Reserved for future use

[a] By convention, the dollar sign ($) in front of a file name indicates that this is a metadata file.

The $MftMirr file mirrors the first 16 MFT entries and is simply a way to ensure that the volume is usable even if the sectors on which the MFT resides become corrupted for some reason. The $MftMirr file is stored in the middle of a volume. Larger volumes may have more than one MFT mirror.

The log file ($LogFile) is used to recover from system crashes and unexpected conditions. NTFS is a transactional, or journaling, file system: It logs all file system metadata changes to the log file before attempting to make the changes. The log file contains redo and undo information used to recover from a system crash and maintain file system consistency.

Note that the metadata stored in the transactional log file is only sufficient to ensure file system integrity, for example, to ensure that disk clusters are correctly marked as being free or as belonging to a particular file and constituting a particular portion of the file data. The content of these data clusters, the actual user data, is not tracked in the transaction log. Once the transaction has been committed and the metadata is changed, a completion record entry is added to the log file.

All log file operations are accomplished by means of the NTFS log file service, which is really a set of routines in the NTFS file system driver. NTFS uses a circular buffer for the log file. The beginning area of this buffer contains a pointer to a location within the buffer where the recovery process should begin. This pointer is stored twice (redundantly) in the log file to ensure recoverability in case of corruption in one area of the log file.

The $Volume file contains the name of the volume, the date and timestamp indicating when the volume was created, information about the NTFS version on the volume, and a dirty bit that is used to decide whether the system was properly shut down. This is also the bit that is checked when a system boots, to determine whether the infamous CHKDSK utility should be run.

The $AttrDef file lists all the attributes that are supported on that particular volume. For each attribute, information such as the attribute name, type, minimum length, and maximum length is stored.

The root directory is the starting point of name resolution: All file and directory lookup operations that are not cached from prior lookups must begin by searching this directory.

The $Bitmap file has a bit representing every disk cluster on the volume. The bit indicates whether the corresponding cluster on the disk is free or in use.

The $Boot file is a placeholder to protect the boot code that must always be at a fixed location toward the beginning of the volume. When a volume is formatted with NTFS, the formatting utility ensures that the $Boot file is recorded as the owner of the disk clusters where the boot code resides, thus protecting the boot code.

The $BadClus file has an entry for every bad cluster on the disk. The file is updated dynamically; that is, whenever a new bad cluster is discovered, a new entry is added to this file.

The $Secure file was introduced with Windows 2000. NTFS enforces security on each file and directory. Prior to Windows 2000, the security information was stored in each file or directory MFT entry. Because many files and directories have similar access information, the security information was heavily duplicated. For example, if a user has particular access rights, say, read and execute rights to the 100 files that constitute a particular application (Microsoft Office might be a good example), all those files will have the same security information. Starting with Windows 2000, the security information is stored only once in the $Secure file, and all the files simply refer to this security information.

The $UpCase file is a table used to convert file name and pathname characters from lowercase to uppercase; it lets NTFS compare names consistently even though applications may treat file names and pathnames as case sensitive.

The $Extend directory was introduced with Windows 2000 and contains files used to implement some NTFS features that are optional. The files within the $Extend directory are as follows:

  • The $ObjId file stores file and directory object identifiers. These object identifiers are used to track files and directories when they are migrated. For more details, see Section 6.5.15.

  • The $Quota file is used to store quota limit information on volumes that have quotas enabled. Quota tracking is a feature described in more detail in Section 6.5.9.

  • The $UsnJrnl file holds information related to changes made to files and directories. This is explained in more detail in Section 6.5.13.

  • The $Reparse file holds information about all the files and directories that have a reparse point tag associated with them. Reparse points are a mechanism used to implement symbolic links and are explained in Section 6.5.22.

6.5.2 NTFS Logical Cluster Numbers and Virtual Cluster Numbers

NTFS works with an integral number of disk sectors as the basic minimum allocation unit. This unit is called a cluster. The cluster size for a volume is set when the volume is formatted, via either the "format" command-line utility or the disk management GUI, and different volumes can have different cluster sizes. For the benefit of readers who are familiar with UNIX, the Windows term cluster is similar to the UNIX term file system block size. The file system decides the size of the disk cluster, taking into consideration the size of the disk and the type of file system being used; the range can be from 1K to as much as 64K. Obviously, a larger cluster size can be wasteful; for example, storing a file of size 1K still requires allocating one cluster, which may be 64K in size.

NTFS uses some important parameters that relate to clusters. The first one is the logical cluster number (LCN). NTFS divides the whole disk into clusters and assigns each cluster a number starting from zero. Thus the first cluster is Cluster 0, the next one is Cluster 1, the next is Cluster 2, and so on. This number, which uniquely identifies the position of the cluster within the volume, is the LCN. The second important parameter is the virtual cluster number (VCN). The VCN identifies the logical position of the cluster within a particular file. Thus an LCN of 25 indicates the twenty-sixth (counting from zero, not one) cluster in a volume, and a VCN of 25 indicates the twenty-sixth cluster in a particular file.

To summarize, the VCN allows calculation of the position of an attribute, for example, the offset of file data within a file. The LCN allows calculation of the offset of that particular data block relative to the volume or partition.
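
As a concrete illustration, the byte offset corresponding to either cluster number is a simple multiplication by the cluster size; the 4,096-byte cluster size in this sketch is just an assumed example value.

    /* Cluster-number arithmetic; a 4,096-byte cluster size is assumed. */
    #define CLUSTER_SIZE 4096LL

    /* Byte offset of a cluster within its file, from the VCN. */
    long long FileOffsetOfVcn(long long vcn)   { return vcn * CLUSTER_SIZE; }

    /* Byte offset of a cluster within the volume, from the LCN. */
    long long VolumeOffsetOfLcn(long long lcn) { return lcn * CLUSTER_SIZE; }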

6.5.3 NTFS MFT Record Structure

As discussed earlier, every file and directory on an NTFS volume has an entry in the master file table. This entry is also referred to as an MFT record. Each MFT entry is a fixed size, which is decided at disk formatting time and is typically from 1,024 bytes to 4,096 bytes. With Windows NT 3.51, the MFT entry size was set at 4K. With Windows NT 4.0, Microsoft changed this to a minimum of 1K or the cluster size, whichever is larger, after analysis showed that the MFT entries were wasting disk space.

The MFT record contains a standard header, followed by a series of attributes that are stored in the following form:

  • Attribute header

  • Attribute name

  • Attribute data

Examples of attributes include file name, file security ACLs, and the file data. Table 6.5 summarizes the various attributes that an NTFS file or directory may have.

Table 6.5. NTFS Attributes

Attribute               | Attribute Type Value | Description
------------------------|----------------------|-------------------------------------------
$STANDARD_INFORMATION   | 0x10                 | File standard information
$ATTRIBUTE_LIST         | 0x20                 | Used to indicate nonresident attributes
$FILENAME               | 0x30                 | File name; stored as a possibly multivalued attribute because files can have multiple names (NTFS name, DOS name, and hard links)
$VOLUME_VERSION         | 0x40                 | Defined but unused in Windows NT 4.0; deleted in Windows 2000
$OBJECT_ID              | 0x40                 | 64-byte value used from Windows 2000 on for link tracking; does not apply to prior Windows NT versions; see Section 6.5.15
$SECURITY_DESCRIPTOR    | 0x50                 | Security descriptor (file ACL); see Section 6.5.6
$VOLUME_NAME            | 0x60                 | Volume name; present only in the $Volume file
$VOLUME_INFORMATION     | 0x70                 | Volume information; present only in the $Volume file
$DATA                   | 0x80                 | File user data; stored as a possibly multivalued attribute because NTFS files can have multiple data streams
$INDEX_ROOT             | 0x90                 | Used in large directories
$INDEX_ALLOCATION       | 0xA0                 | Used in large directories
$BITMAP                 | 0xB0                 | Used in directories only
$SYMBOLIC_LINK          | 0xC0                 | Defined but unused in Windows NT 4.0
$REPARSE_POINT          | 0xC0                 | Presence indicates file has reparse point metadata; see Section 6.5.22
$EA_INFORMATION         | 0xD0                 | OS/2 extended attributes information
$EA                     | 0xE0                 | OS/2 extended attributes
$PROPERTY_SET           | 0xF0                 | Property set; defined but unused in Windows NT 4.0
$LOGGED_UTILITY_STREAM  | 0x100                | Used by the encrypting file system; see Section 6.5.20

If the attribute data is small, it may be stored directly in the MFT record; such an attribute is referred to as resident. Alternatively, when the data is too large to be contained within the MFT record, the MFT record contains information about the clusters where the data is stored; such an attribute is termed nonresident. There is nothing special about the file data; it is just another attribute, and any attribute can be resident or nonresident.

Consider Figure 6.5, which shows some file data being stored in a nonresident fashion. The data structure involved, a run list, has three elements:

  1. A virtual cluster number (VCN), which indicates the position of a cluster relative to a file. For example, a VCN of 0 indicates that the cluster in question is the first cluster of a file attribute.

  2. A logical cluster number (LCN), which indicates the position of the cluster relative to the volume or partition. For example, an LCN of 25 indicates that the cluster in question is the twenty-sixth cluster on the volume or partition.

  3. The number of clusters in a particular "run", that is, the number of contiguous clusters allocated to the file attribute.

Figure 6.5. MFT Record Structure

graphics/06fig05.gif

If the run list for a file does not fit into a single MFT entry, it is stored in additional MFT entries.
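
To make the run list concrete, here is a minimal C sketch of translating a VCN into an LCN by walking an in-memory run list. The RUN structure and its field names are invented for illustration; the actual on-disk NTFS encoding is more compact, and a return value of -1 here stands for a range with no allocated cluster (for example, a sparse hole).

    /* Hypothetical in-memory form of one run-list entry. */
    typedef struct {
        long long startVcn;      /* first VCN covered by this run   */
        long long startLcn;      /* LCN at which that VCN is stored */
        long long clusterCount;  /* length of the contiguous run    */
    } RUN;

    /* Translate a file-relative VCN to a volume-relative LCN.
       Returns -1 if no run covers the VCN. */
    long long VcnToLcn(const RUN *runs, int numRuns, long long vcn)
    {
        for (int i = 0; i < numRuns; i++) {
            if (vcn >= runs[i].startVcn &&
                vcn <  runs[i].startVcn + runs[i].clusterCount)
                return runs[i].startLcn + (vcn - runs[i].startVcn);
        }
        return -1;
    }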

NTFS also supports multiple data streams. The default data stream is opened when one uses the CreateFile API and specifies just the file name, using either a relative or an absolute path. One can open a different data stream by specifying the file name followed by a colon and a data stream name, for example, \directory1\File1:DataStream2. NTFS stores the stream name as just another attribute in the MFT, and it stores the data associated with this second data stream as another attribute as well.

6.5.4 NTFS Directories

NTFS directories are simply files that happen to contain directory information. NTFS stores directories in a way that facilitates quick browsing. When the data needed for a directory does not fit into the MFT, NTFS allocates a cluster and a run structure, similar to data or other attributes. All directory entries are stored as B+ trees.

A B+ tree is a data structure that maintains an ordered set of data and allows efficient operations to find, delete, insert, and browse data. A B+ tree consists of "node" records containing the keys, along with pointers that link the nodes of the B+ tree together. The advantage of using such a structure is that a B+ tree tends to widen rather than increase in depth, ensuring that performance does not degrade too much even when a directory has a large number of entries.

The entries within a directory are stored in sorted order. Each entry stores the file name, a pointer to the file MFT record, and the file date/timestamp stored redundantly in the directory entry (this information is already stored in the file MFT record) so as to facilitate a quick response time when somebody lists the contents of a directory. B+ trees are very efficient in terms of the number of comparisons required to find any given directory entry.

6.5.5 NTFS Recovery Log

NTFS has been designed to be a high-performance, reliable file system that can recover from system failures. To achieve this goal, NTFS logs all file system metadata changes to $LogFile before attempting to make the changes. The log file contains redo and undo information used to recover from a system crash and maintain file system consistency. Note that the metadata stored in the transactional log file is only sufficient to ensure file system integrity; for example, it ensures that disk clusters are correctly marked as being free, or as belonging to a particular file and constituting a particular portion of the file data. The content of these data clusters, that is, the user data, is not tracked in the transaction log. Once the transaction has been committed and the metadata has been changed, a completion record entry goes into the log file.

All log file operations are accomplished by means of the NTFS log file service, which is really a set of routines in the NTFS file system driver. NTFS uses a circular buffer for the log file. The beginning area of this buffer contains a pointer to a location within the buffer where the recovery process should begin. This pointer is stored twice (redundantly) in the log file to ensure recoverability in case of corruption in one area of the log file. The actual recovery operation is done in multiple phases.

With Windows NT 4.0, the log file was cleared with every successful reboot. With Windows 2000, the entries in the log file can survive multiple reboots.

6.5.6 NTFS Security

NTFS security is derived from the Windows NT security and object model. Each file and directory has a security descriptor associated with it. The security descriptor consists of the following:

  • A security token identifying the owner of the file.

  • A series of access control lists (ACLs) that explicitly or implicitly allow access to the file for the users described within those ACLs.

  • An optional series of ACLs that explicitly or implicitly disallow access to the file, for certain users; if a user exists on both the allowed and disallowed lists, no access is granted.

With Windows NT 4.0, the security descriptor was stored in the file's MFT record. Because many files and directories have similar access information, the security information was heavily duplicated. For example, if a user had particular access rights, such as read and execute rights to the 100 files that constitute a particular application (Microsoft Office is a good example), all those files would have the same security information. Starting with Windows 2000, the security information is stored only once in the $Secure file, and all the other files simply have a reference to this security information.

6.5.7 NTFS Sparse Files

NTFS supports a feature called sparse files that allows a file to store only nonzero data. This feature is especially useful when a file represents a data structure such as a sparse matrix. The feature can be enabled and disabled administratively for a whole volume, a directory (and the files and directories contained within that directory), or just an individual file. This administrative setting can be overridden by a program when it is creating a file or directory. When an existing volume or directory is marked as sparse, no action is taken on files already existing in it. The setting applies only to new files or directories created within that volume or directory.

Sparse files and compressed files are two completely different and independent implementations, both aimed at reducing disk resource consumption. A file may be compressed and not sparse, and vice versa. Compression is explained in Section 6.5.8.

The term sparse files refers to files that have some data, then no data for a large byte range thereafter, then a small amount of data, and again a huge gap between that data and the next. For these empty ranges of a sparse file, NTFS does not allocate any disk clusters. Recall that a VCN defines a cluster position relative to its position in a file, and an LCN defines a cluster position relative to its offset on the volume. For sparse files, NTFS allocates the file-relative VCN, but it allocates no clusters on the volume; the volume-relative LCN for some VCNs is simply unallocated. If an application tries to read from these gaps, NTFS zero-fills the corresponding ranges of the application's buffer. When an application writes data in these sparse ranges, NTFS allocates the required disk clusters as needed.

Consider Figure 6.6. Recall that a virtual cluster number (VCN) indicates a cluster position relative to a particular file, whereas the logical cluster number (LCN) indicates the position of a cluster relative to a volume.

Figure 6.6. NTFS Sparse File Cluster Allocation

graphics/06fig06.gif

Figure 6.6 shows two run lists: one when the file is stored via a nonsparse technique, and the other for the same file stored with the sparse technique. The run list for the nonsparse technique has three entries. The first entry starts with a VCN value of zero (indicating the start of the file), which is stored at logical cluster 125; a run length of four indicates that four clusters are stored contiguously. The next entry in the run list indicates that the next portion of the file, VCN 4 (the fifth cluster in the file), starts at LCN 251 and is eight clusters long. This run is shown shaded in Figure 6.6 because, unlike the other clusters, the file contains no data in the corresponding range. The last entry in this run list shows that the next portion of the file is stored starting at LCN 1251.

The second run list shows an identical first entry: The file still has four clusters allocated starting from LCN 125. The next entry in the run list covers the last seven clusters of the file; VCN 12 (the thirteenth cluster in the file) starts at LCN 1251. There is no entry in the run list for the intervening portion of the file, which has no data.

When data is requested from a file, NTFS accesses the file MFT record, locates the corresponding VCN in the file, looks up the corresponding LCN and translates that LCN into a volume-relative offset. If needed, the relevant part of the volume is read, via the services of the disk class driver and volume manager. If the LCN is not allocated, NTFS simply returns zeros in the data buffer. If an application writes to a portion of the file where no LCN is allocated, NTFS simply allocates clusters in that area and adds them to the run list. The data is then copied from specified buffers and written to those clusters.

With the falling prices of disk storage, the savings in disk space using sparse files is not so significant. What is more significant is that access to sparse files can be more efficient because a lot of disk I/O is avoided (to retrieve data that is just a stream of zeros).

Applications set a file to be sparse by sending the FSCTL_SET_SPARSE function code to the DeviceIoControl API. Applications can query whether a file is sparse by using the GetFileAttributes API.
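
A minimal sketch of both calls follows; sparse.dat is a placeholder file name, and error handling is abbreviated.

    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE h = CreateFileA("sparse.dat", GENERIC_READ | GENERIC_WRITE,
                               0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL,
                               NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        /* Mark the file sparse. */
        DWORD bytes;
        if (!DeviceIoControl(h, FSCTL_SET_SPARSE, NULL, 0, NULL, 0,
                             &bytes, NULL))
            fprintf(stderr, "FSCTL_SET_SPARSE failed: %lu\n", GetLastError());
        CloseHandle(h);

        /* The sparse state is visible as a file attribute. */
        DWORD attrs = GetFileAttributesA("sparse.dat");
        if (attrs != INVALID_FILE_ATTRIBUTES &&
            (attrs & FILE_ATTRIBUTE_SPARSE_FILE))
            printf("file is sparse\n");
        return 0;
    }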

6.5.8 NTFS Compressed Files

NTFS also supports compression of files stored on a volume with a cluster size of 4K or less. Data is compressed and decompressed on the fly, in a manner that is transparent to the user, when an application issues a read or write API call. Compression can be enabled and disabled administratively for a whole volume, a directory (and the files and directories contained within that directory), or just an individual file. Again, programs can override this setting when creating a file or directory. When an existing volume or directory is marked as compressed, no action is taken on files already existing in that volume or directory. The setting applies only to new files or directories created within that volume or directory.

Compressed files are stored in runs of 16 clusters. NTFS takes the first 16 clusters of a file and attempts to compress them. If the result of the compression is 15 or fewer clusters, the file is compressed; otherwise NTFS abandons the attempt to compress.

While reading a file, NTFS needs to detect whether that file is compressed. One way of doing so is to check the final LCN in a run: A value of zero there indicates that the run is compressed. Recall that LCN zero stores the boot sector; hence it can never be part of a normal file run. For a compressed file, when an application seeks to a random location, NTFS may have to decompress an entire run of clusters.

Figure 6.7 shows two run lists for the same file. With the first run list, at the top left of the diagram, no compression is applied. The file is stored in three runs of clusters, each 16 clusters long. The first 16 clusters start at LCN 125, the next 16 at LCN 251, and the last 16 at LCN 1251. The file occupies 48 clusters on the volume. With the second run list, at the bottom left in Figure 6.7, compression is applied. The file now occupies only 12 clusters, in three runs. The first 4 clusters start at LCN 125, the next 4 at LCN 251, and the last 4 at LCN 1251.

Figure 6.7. NTFS Compressed File Cluster Allocation

graphics/06fig07.gif

By looking at the next VCN and the number of clusters, NTFS can determine whether or not the run is compressed. Compressed data is decompressed into a temporary buffer and stored in the cache. The data is copied into an application buffer as needed.

Compression costs CPU time and delay in I/O, with a resulting drop in performance. With the rapidly decreasing prices of storage drives, it is not always an obvious advantage to use compression; hence, in Windows 2000, compression is turned off by default. Some Knowledge Base articles issued by Microsoft recommend not using compression, especially for applications that do a large amount of I/O.

Applications can set a file to be compressed by sending the FSCTL_SET_COMPRESSION function code to the DeviceIoControl API. Applications can use the GetFileAttributes API to inquire whether or not a file is compressed.
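
A minimal sketch of both calls follows; bulky.dat is a placeholder name, and the file is assumed to live on an NTFS volume with a cluster size of 4K or less.

    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE h = CreateFileA("bulky.dat", GENERIC_READ | GENERIC_WRITE,
                               0, NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL,
                               NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        /* Request compression; COMPRESSION_FORMAT_DEFAULT lets NTFS
           choose the compression format. */
        USHORT format = COMPRESSION_FORMAT_DEFAULT;
        DWORD bytes;
        if (!DeviceIoControl(h, FSCTL_SET_COMPRESSION,
                             &format, sizeof(format), NULL, 0, &bytes, NULL))
            fprintf(stderr, "FSCTL_SET_COMPRESSION failed: %lu\n",
                    GetLastError());
        CloseHandle(h);

        /* The compression state is visible as a file attribute. */
        DWORD attrs = GetFileAttributesA("bulky.dat");
        if (attrs != INVALID_FILE_ATTRIBUTES &&
            (attrs & FILE_ATTRIBUTE_COMPRESSED))
            printf("file is compressed\n");
        return 0;
    }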

6.5.9 NTFS User Disk Space Quotas

With Windows 2000, Microsoft introduced features whereby NTFS supports tracking, the issuance of alerts, and limits on disk resource usage on a per-user basis. These are features of NTFS, and not of the Windows 2000 operating system itself. Hence they are not available on other file systems, such as FAT or UDF. The highlights of the quota implementation are as follows:

  • Quotas are implemented and tracked on a per-volume and per-user basis. All data associated with quotas is kept within the NTFS file system on the volume in the $Quota file that resides in the $Extend directory. Different users may be assigned different quota limits. System administrators are not subject to quotas. Further, default quota limits may be defined that will apply to a new user when the user first starts using disk resources.

  • System administrators have tools for quota management by which the quotas on a volume can be set to one of three states:

    1. Disabled. In this state NTFS retains the information, taking no action. If quotas on the volume are reenabled, then this retained quota information is instantly available for use.

    2. Enabled for tracking purposes only.

    3. Enabled for tracking and limiting disk resource usage, that is, limiting the amount of disk space used.

  • When a user exceeds the warning limit on disk quota, NTFS creates an entry in the Windows NT event log file. Only one such entry per user is written within an hour, though the user may repeatedly exceed the warning quota.

  • Quota calculations report the length of a file as if the compressed clusters were uncompressed and as if the sparse clusters that were not allocated were allocated. The size calculation takes into account the size of the disk clusters allocated to the file. Thus a file that is only 5,000 bytes long may be reported as using 8,192 bytes toward disk resource usage if the file is using two disk clusters, each of size 4,096 bytes. The apparent thought behind such a design is to handle cases in which a user has files on a compressed volume and then attempts to copy those files to another volume that is uncompressed, only to get errors about quotas being exceeded. With the design as described, quotas may be compared across volumes irrespective of the compression state.

  • Quota settings are administered with the Microsoft-supplied GUI management tools. Quota information can be dragged and dropped from one volume to another in order to apply the same quota settings on both volumes, allowing easy administration of these settings. Further, quota settings can be exported into various formats, including CSV (comma-separated values) and Unicode text. These settings can then be imported into a tool of the user's choice, for example, Excel, and manipulated there.

6.5.10 NTFS Native Property Sets

Windows 2000 introduces support for native property sets, which consist simply of user-defined metadata that can be associated with a file. This metadata can also be indexed with the Index Server that now ships with the Windows 2000 server. For example, metadata can be used to track the name of a document's author or its intended audience. Users can then search for a document by these user-defined tags or metadata. NTFS treats files as a collection of attribute value pairs. The user-defined properties are simply stored as additional optional attributes on a file.

6.5.11 File Ownership by User

Windows 2000 assigns a security identifier (SID) to each user, group, and computer that has an account on the domain or system. All internal security checking is accomplished via the SID. NTFS in Windows 2000 can scan the MFT table and identify all files owned by a particular user on the basis of the SID for the user. One use of such functionality is to allow administrators to clean up files after a user ID has been deleted.

6.5.12 Improved Access Control List Checking

NTFS in Windows NT 4.0 kept access control lists (ACLs) on a per-file and per-directory basis. If a user had 50 files, the identical ACLs were stored 50 times, once in each file. NTFS in Windows 2000 stores ACLs in a directory and indexes them as well. Therefore, for the scenario just described, the ACL would be stored just once, and each of the 50 files would have a "pointer" that would help identify the ACL. The result would be a reduction in storage requirements. In addition, the internal implementation for ACL checking is now more efficient.

This change facilitates the bulk ACL checking used by the Indexing Service. When a user performs a search, the Indexing Service prepares a list of files; before returning the list to the user, it performs ACL checking and eliminates all files that the user cannot access. Thus a user will see only the files that the user can access.

The new mechanism allows other scenarios as well, for example, determining what a given user can and cannot do with a given set of files.

6.5.13 Change Log Journal, USN Journal, and Change Log File

Starting with Windows 2000, NTFS offers application developers a reliable way to track changes to files and directories using a mechanism called the update sequence number (USN). This optional NTFS service is designed for developers of storage management applications that handle things such as content indexing, file replication, and Hierarchical Storage Management. For any change made to a file or directory, a log record is written to a file. Every record is given a unique number: the update sequence number. The reasons for which a record is created include the following:

  • Creating or deleting a file

  • Creating or deleting a directory

  • Modifying, deleting, or adding data to a file data stream (any data stream, named or unnamed)

  • Modifying (including adding or deleting) attributes of a file or directory

These change records are stored in the $Extend\$UsnJrnl log file. This file can survive multiple reboots and is stored as a sparse file. To keep the log file from growing too big, older records are deleted in 4K chunks, implying that an application may not be able to access all the changes that have occurred. However, the APIs provided allow an application to determine that some log records are unavailable, and the application can take appropriate action, which can include a complete scan of the volume. To maximize performance, the USN value actually represents the offset within the file of the respective log record. When no information has been lost, an application queries all the log records one by one and identifies the files and directories that have changed.

Not only is the list of files and directories that have changed available, but the cause of change is also identified. On the basis of cause, there are three types of changes:

  1. Changes made by applications.

  2. Changes made by storage management applications (such as Hierarchical Storage Management) and replication applications.

  3. Changes made by applications that build auxiliary data on the basis of the primary data in a file. A good example is an imaging application that builds a thumbnail picture.

The idea behind identifying the cause of the change is to allow applications to make intelligent choices and ignore certain changes as they deem appropriate.

Applications can start the change-logging service using the FSCTL_CREATE_USN_JOURNAL function code with the DeviceIoControl API, query the journal's state using the FSCTL_QUERY_USN_JOURNAL function code, and read USN records using the FSCTL_READ_USN_JOURNAL function code.
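
The sketch below shows what creating and then querying the journal might look like. The volume path \\.\C: and the journal size values are illustrative assumptions, error handling is abbreviated, and opening a volume handle requires administrative rights.

    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE h = CreateFileA("\\\\.\\C:", GENERIC_READ | GENERIC_WRITE,
                               FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                               OPEN_EXISTING, 0, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        DWORD bytes;

        /* Start change logging (harmless if the journal already exists).
           The two fields are MaximumSize and AllocationDelta. */
        CREATE_USN_JOURNAL_DATA create = { 0x800000, 0x100000 };
        DeviceIoControl(h, FSCTL_CREATE_USN_JOURNAL, &create, sizeof(create),
                        NULL, 0, &bytes, NULL);

        /* Query the journal's state: its ID and the next USN to be issued. */
        USN_JOURNAL_DATA info;
        if (DeviceIoControl(h, FSCTL_QUERY_USN_JOURNAL, NULL, 0,
                            &info, sizeof(info), &bytes, NULL))
            printf("journal id %llu, next USN %lld\n",
                   (unsigned long long)info.UsnJournalID,
                   (long long)info.NextUsn);

        CloseHandle(h);
        return 0;
    }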

6.5.14 NTFS Stream Renaming

Windows NT NTFS has always shipped with support for multiple data streams per file. One example of an application that uses multiple data streams is the Windows NT Macintosh server. Streams can be created via the CreateFile API and deleted via the DeleteFile API. Note that when a file containing a non-default-named data stream is copied from an NTFS volume to a FAT volume (which does not support named streams), the named stream data is lost.

Until Windows 2000, there was no way to rename a data stream once it was created. One could create a new file with a new named data stream and then copy the contents of the old file to the new one, including the contents of the old data stream to the new (named) data stream, but this approach is rather inefficient. NTFS shipping with Windows 2000 introduced an API to allow an application to rename an existing named data stream.

6.5.15 Object IDs and Link Tracking

Windows 2000 implements link tracking. Links can be shortcuts for files or OLE (Object Linking and Embedding) objects such as Excel or PowerPoint documents embedded within a file. An application can track a link even when the source object behind the link moves in some way, such as

  • Moving a document representing the link source within the same volume on a Windows NT server

  • Moving a document representing the link source between volumes on the same Windows NT server

  • Moving a document representing the link source from one Windows NT server to another Windows NT server within the same domain

  • Moving a complete volume containing the link source from one Windows NT server to another Windows NT server within the same domain

  • Renaming a Windows NT server with a mounted volume that contains the link source

  • Renaming a network share on a Windows NT server that contains the source of the link

  • Renaming the document representing the link source

  • Any combination of the above

All of this functionality is based on a requirement that both the source and destination files reside on a volume that is a version of the Windows 2000 or higher NTFS file system.

Each file in Windows 2000 (and higher Windows NT versions) can have an optional unique object identifier (the 16-byte $OBJECT_ID structure described in Section 6.5.3). To track a file, an application refers to that file by its unique object identifier. When the file reference fails (e.g., when the file has been moved), a user mode link-tracking service is called (by the operating system) for assistance. The user mode service attempts by trial and error to locate the file, using its object ID, for all the scenarios just described.

To enable programmatic use of object IDs and link tracking, the following APIs are available (a short usage sketch follows the list):

  • Applications create an ID for a file or directory using the FSCTL_CREATE_OR_GET_OBJECT_ID file system control function code.

  • Applications delete an object ID using the FSCTL_DELETE_OBJECT_ID file system control function code.

  • Applications query an existing object ID using the FSCTL_GET_OBJECT_ID file system control function code.
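
As mentioned above, here is a minimal usage sketch; tracked.doc is a placeholder name, and error handling is abbreviated.

    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE h = CreateFileA("tracked.doc", GENERIC_READ | GENERIC_WRITE,
                               FILE_SHARE_READ, NULL, OPEN_EXISTING,
                               FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        /* Retrieve the file's object ID, creating one if none exists yet. */
        FILE_OBJECTID_BUFFER objId;
        DWORD bytes;
        if (DeviceIoControl(h, FSCTL_CREATE_OR_GET_OBJECT_ID, NULL, 0,
                            &objId, sizeof(objId), &bytes, NULL)) {
            for (int i = 0; i < 16; i++)
                printf("%02x", objId.ObjectId[i]);
            printf("\n");
        }
        CloseHandle(h);
        return 0;
    }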

6.5.16 CHKDSK Improvements

Windows 2000 NTFS reduces the number of situations in which CHKDSK needs to run, while significantly reducing the amount of time taken to run CHKDSK. The phrase "your mileage will vary" comes to mind in view of the fact that the exact amount of improvement depends on the size of the volume and the nature of the corruption. For volumes with millions of files, however, an improvement that reduces the amount of time needed to run CHKDSK by a factor of ten is quite possible.

6.5.17 File System Content Indexing

Windows 2000 Server ships with the Indexing Service fully integrated with the operating system and tools:

  • The Find Files or Folders dialog in Explorer provides access to the indexing functionality.

  • The Indexing Service can index file contents as well as file attributes, including user-defined attributes.

  • The Indexing Service can also index offline content managed by Remote Storage Services.

  • On NTFS volumes, the Indexing Service uses the change log journal to determine which files have changed since the last index run.

  • The Indexing Service uses the bulk ACL-checking feature and, in response to a user search, will return only files for which the user has permissions. Files that the user is not allowed to access do not appear in the search results.

  • The Indexing Service can also work on FAT volumes, but less efficiently than on NTFS volumes, because it cannot use NTFS-specific features such as the change log journal.

6.5.18 Read-Only NTFS

Starting with Windows XP, NTFS can now handle read-only volumes. The underlying volume itself is marked read-only. NTFS still checks the log file, and if the log file indicates that some log transactions need to be redone, the volume mount request will fail. Read-only volumes have some important applications, including mounting multiple versions of a single volume that have been created with a snapshot technique.

6.5.19 NTFS Fragmentation and Defragmentation

Fragmentation means that a file is stored as a series of clusters that are not contiguous. A 64K file would occupy 16 clusters of 4K each. If these clusters were all contiguous, there would be only one entry in the MFT mapping the file's VCNs to LCNs. However, if the disk were fragmented to the point where the file were stored as 16 noncontiguous clusters, each separate from the others, the MFT would have 16 entries, each one mapping a VCN/LCN pair. Fragmentation is bad because it degrades performance: Positioning the disk head once and reading 16 clusters is much more efficient than positioning the disk head 16 times and reading one cluster each time.

Fragmentation can occur for various reasons. To start with, a newly created NTFS file system is well laid out on the volume and would have

  • The MFT at the beginning of the volume

  • Free space for the MFT to grow

  • System and user files

  • Additional free space

Fragmentation can occur because of the file system behavior or the application behavior, and typically it is due to a combination of both. Examples include

  • Installations of a Windows NT service pack that involve allocating new files and deleting old files. On a perfectly defragmented disk, [5] the allocation of new files starts from the beginning of free space for files. When the old files are deleted, small holes are left behind.

    [5] The literature talks about disk defragmentation, and the same terminology is used here as well. However, this is really volume defragmentation, not disk defragmentation.

  • Other application activity, for example, Word or Excel saving a file as a temporary file, deleting the old file, and renaming the temporary file to the name of the file that was just deleted.

  • Application behavior that can lead to the allocation of file space that is not really used by the application. A good example is an OLE document that contains a mix of Microsoft Office files, such as a Word document that also contains an embedded PowerPoint slide or two and an Excel spreadsheet. When one of these is changed, the new version is saved at the end of the file, and the old version is marked deleted but stays embedded within the file. Microsoft shipped a reparse point filter driver to support a feature called Native Structured Storage to cater to this situation, whereby the Microsoft Office documents are actually stored in different files yet appear to be a single file to the Microsoft Office application. Although this feature was present in one of the Windows 2000 beta releases, it was withdrawn in the final release of Windows 2000. [6]

    [6] This is a concrete example of the caution throughout this book that at least parts of the book are forward-looking in nature and could be erroneous. This is also a concrete example of the reason that Microsoft strives to explain that the only way of determining features in a release is to examine the release once it has occurred.

  • Directory fragmentation, another problem, is caused by the fact that some directories, for example, an application directory that contains application executable files, rarely grow or shrink, whereas other directories, such as the My Documents and Temp directories, see a lot of files added and deleted. If the MFT records for such a directory are allocated early in system installation, the directory will end up with multiple MFT records, all of them potentially noncontiguous with respect to each other.

Windows NT 4.0 introduced support for a set of defragmentation APIs. These APIs allowed defragmentation applications to query file allocation data and manipulate it. The APIs worked on both FAT and NTFS and allowed a defragmentation application to be written without any knowledge of the on-disk structure. There is also an API to read an MFT record, but that obviously assumes knowledge of the MFT structure and its on-disk form. Documentation for this API is found on third-party Web sites but is not authored by Microsoft, perhaps to allow for changes to the format and structure in the future.

With Windows 2000, Windows XP, and Windows Server 2003, Microsoft has successively enhanced this support in various ways. Some of the highlights for Windows XP and Windows Server 2003 include

  • Support for defragmentation of the MFT. The first 16 entries of the MFT, however, cannot move. This is not an issue because these are not typically fragmented. The sole possible exception here is the root directory.

  • Support for defragmentation of disks with cluster sizes greater than 4K.

  • Use of the MFT zone for temporarily defragmenting files. When the disk is full and the user or an administrator attempts a defragmentation operation, temporary disk space is needed. Prior to this enhancement, the defragmentation operation would fail if free space were not available outside the MFT zone, even if there were plenty of free space in the MFT zone.

  • Defragmentation of the default data stream, as well as of its reparse point data and attribute lists. Reparse points and attribute lists can be opened and manipulated just as if they were named data streams.

  • Defragmentation of the area between the logical end of a file and the actual physical end of a file as indicated by allocated clusters. Applications such as backup/restore applications preallocate clusters to a file and then set the valid length of file using the Win32 SetFileValidData API. Windows 2000 can defragment only the valid data of a file and not the area between the logical end of the file and the end of the clusters allocated and assigned to the file.

  • Prevention of the defragmentation of open files, which an application accomplishes by opening the file and issuing a newly defined FSCTL code (FSCTL_MARK_HANDLE, with the option MARK_HANDLE_PROTECT_CLUSTERS).

  • Defragmentation of encrypted files without their being read (i.e., decrypted). This closes a security hole where the files may be decrypted into system cache and be available for users with malicious intentions.

6.5.20 Encrypting File System

Windows 2000 ships with an encrypting file system (EFS) that shuts down a major security hole. NTFS enforces security, provided that an application accesses disk resources through NTFS. A malicious user who managed to access a server and reboot using a different operating system, or who managed to steal a hard disk, could open the disk in "raw mode" without using any file system and read the data off the hard disk. To guard against this possibility, Windows 2000 provides the encrypting file system, which ensures that file data is encrypted before it is written to the disk. EFS can be enabled on a per-directory or per-file basis. In contrast, earlier solutions for the Windows 9X platform worked on a per-partition basis.

EFS uses both symmetric and asymmetric cryptography. The architecture allows different encryption algorithms to be plugged in as well. The data can be decrypted via the same key because the Data Encryption Standard (DES) is a symmetric cipher.

In Figure 6.8, the data is encrypted via a randomly generated 128-bit key with a variant of the DES encryption algorithm. Step 1 shows the file data being encrypted with the randomly generated key. Step 2 shows this randomly generated key being encrypted with the file user's public key and stored in the Data Decryption field attribute of the file. This field is used for decryption purposes, as will be shown later in this section. Finally, the randomly generated encryption key is encrypted again, via the public key of a different entity, called a recovery agent. This entity could be simply the system administrator or another designated user. This generation of the Data Recovery field is shown as step 3 in Figure 6.8. The Data Recovery field provides a secondary means of retrieving the file data, should the user not be available or should a disgruntled user attempt to render the data irretrievable.

Figure 6.8. EFS File Encryption Overview [7]

graphics/06fig08.gif

[7] RSA is a de facto standard for asymmetric key cryptography, and DESX is a de facto standard for symmetric key cryptography. More details are available at http://www.rsasecurity.com/rsalabs/faq/3-1-1.html.

When the file is read, the Data Decryption field is read and decrypted with the user's private key (step 1 in Figure 6.9) to retrieve the 128-bit key needed for decrypting the file data. The data is then decrypted with this 128-bit key (step 2). Figure 6.9 also shows an optional step (step 3): recovering the 128-bit encryption/decryption key by decrypting the Data Recovery field (instead of the Data Decryption field).

Figure 6.9. EFS File Decryption Overview

graphics/06fig09.gif

Figure 6.10 shows the architecture for EFS implementation. The EFS driver is a file system filter driver that layers itself over NTFS. (EFS does not work on other file systems, including FAT.) The driver implements runtime callouts, called the FSRTL (for File System Run-Time Library), that handle file operations such as read, write, and open on encrypted files. The FSRTL interacts with NTFS to read or write the encryption-related metadata such as the Data Decryption field or Data Recovery field.

Figure 6.10. EFS Architecture

graphics/06fig10.gif

The EFS service implements functionality to accomplish the encryption/decryption and generation of encryption keys using the Crypto API infrastructure that is part of Windows NT. The EFS service communicates with the EFS driver using local procedure call (LPC), a facility made available by the operating system.

Windows 2000 does not support the use of an encrypted file by multiple users, but Windows Server 2003 is expected to do so. However, this functionality can be achieved by encryption of the symmetric key multiple times via the public key for multiple users.

To allow programmatic access to encrypted files, the APIs EncryptFile and DecryptFile are provided.
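
A minimal usage sketch follows; secret.txt is a placeholder name on an NTFS volume, and the program links against Advapi32.lib.

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Encrypt the file; NTFS performs all further encryption and
           decryption transparently. */
        if (!EncryptFileA("secret.txt")) {
            fprintf(stderr, "EncryptFile failed: %lu\n", GetLastError());
            return 1;
        }

        /* Reverse the operation; the second parameter is reserved. */
        if (!DecryptFileA("secret.txt", 0))
            fprintf(stderr, "DecryptFile failed: %lu\n", GetLastError());
        return 0;
    }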

6.5.21 NTFS Hard Links

NTFS supports two kinds of links: soft links and hard links. Note that as of Windows Server 2003, hard links are supported only on files, whereas soft links are supported only on directories. Soft links are implemented with an architecture called reparse points, which is described in Section 6.5.22. Hard links are described in this section.

Hard links allow a single file to have multiple path names. Application of hard links is valid only on files and not on directories. Hard links are used, for example, in header files that need to be included in multiple build projects and thus need to have any changes reflected in all build projects. The alternative to hard links is to have multiple copies of the file. Hard links are implemented through a single MFT record for the file that simply stores multiple name attributes. The Win32 API called CreateHardLink creates hard links and takes as input parameters a pathname to an existing file and a pathname to a nonexistent file.
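
As a brief sketch of the API just described, the following gives an existing header file a second name; both paths are hypothetical and must reside on the same NTFS volume.

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* First argument: the new, not-yet-existing path name.
           Second argument: the existing file. Third: reserved, NULL. */
        if (!CreateHardLinkA("projectB\\common.h", "projectA\\common.h",
                             NULL)) {
            fprintf(stderr, "CreateHardLink failed: %lu\n", GetLastError());
            return 1;
        }
        return 0;
    }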

Hard links have been implemented in NTFS since Windows NT 3.X days and were a requirement for the POSIX subsystem. What has changed recently is simply the exposure of the API to create and delete hard links. A file is deleted only after the last name for the file has been deleted. To put this differently, if a file has hard links designated link1.doc and link2.doc, deleting link1.doc will still leave link2.doc in existence.

6.5.22 Reparse Points

Reparse points represent a significant new architectural feature in NTFS and the Windows NT I/O subsystem. Reparse points provide the foundation for implementing features such as

  • Volume mount points

  • Directory junction points

  • Single Instance Storage

  • Remote storage (Hierarchical Storage Management, or HSM)

This section is dedicated to examining reparse point architecture in detail. Sections 6.5.22.1 through 6.5.22.4 provide descriptions of the applications of reparse point listed above.

Note that this section describes reparse points as being integral with NTFS. Although it is true that the FAT file system does not support reparse points, it is conceivable that an independent software vendor (ISV) or Microsoft could write another file system, different from NTFS, that also supported reparse points. Such a task would not be trivial by any means, but three components are crucial to implement:

  1. The file system ”for example, NTFS

  2. The I/O subsystem and the Win32 API set

  3. The tools and utilities

Microsoft has obviously done the necessary work in all three areas; hence it is conceivable for a new file system to support reparse points as well.

A reparse point is an object on an NTFS directory or file. A reparse point can be created, manipulated, and deleted by an application via the Win32 API set in general and CreateFile, ReadFile, and WriteFile in particular. Recall that the Win32 API set allows an application to create and manipulate user-defined attributes on a file or directory. Think of reparse points as simply user-defined attributes that are handled in a special manner; this special handling includes ensuring the uniqueness of some portions of the attribute object, as well as special treatment in the I/O subsystem. An ISV would typically write the following:

  • Some user mode utilities to create, manage, and destroy reparse points

  • A file system filter driver that implements the reparse point “related functionality

Each reparse point consists of

  • A unique 32-bit tag that is assigned by Microsoft. ISVs can request that such a unique tag be assigned to them. Figure 6.11 shows the structure of the reparse tag, which has a well-defined substructure:

    • A bit (M) indicating whether or not a tag is for a Microsoft device driver.

    • A bit (L) indicating whether the driver will incur a high latency to retrieve the first data byte. An example here is the HSM solution, in which retrieving data from offline media will incur a high latency.

    • A bit (N) indicating whether the file or directory is an alias or redirection for another file or directory.

    • Some reserved bits.

    • The actual 16-bit tag value.

    Figure 6.11. Reparse Point Tag

    graphics/06fig11.gif

  • A data blob that is up to 16K in size. NTFS will make this data blob available to the vendor-written device driver as part of the I/O subsystem operation that handles reparse points.

To understand the sequence of operations and how reparse points are implemented, consider Figure 6.12. To keep the discussion simple, assume that the user has the required privileges for the requested operation. Also note that in the interest of keeping things simple and relevant, Figure 6.12 shows only one file system filter driver.

Figure 6.12. Reparse Point Architecture

graphics/06fig12.gif

The sequence of steps in creating reparse point functionality includes the following, as illustrated in Figure 6.12:

Step 1. Using the Win32 subsystem, an application makes a file open request.

Step 2. After some verification, the Win32 subsystem directs the request to the NT Executive subsystem.

Step 3. The Windows NT I/O Manager builds an I/O request packet (IRP) for the open request (IRP_MJ_CREATE). Normally this request would go to the NTFS driver. Because filter drivers are involved, in particular, a reparse point filter driver, the I/O Manager sends the request to the filter driver, giving it a chance to preprocess the IRP before the NTFS driver gets a chance to process it.

Step 4. The reparse point filter driver specifies a completion routine in its part of the IRP and sends the IRP on to the NTFS driver.

Step 5. The IRP reaches the file system. The file system looks at the IRP_MJ_CREATE request packet, locates the file or directory of interest, and notes the reparse point tag associated with it. NTFS puts the reparse point tag and data into the IRP and then fails the IRP with a special status code (STATUS_REPARSE).

Step 6. The I/O subsystem now calls each filter driver (one at a time) that has registered a completion routine for the IRP. Each completion routine looks at the status, and if it is the special reparse point status code, the completion routine inspects the reparse point tag in the IRP. If the driver does not recognize the tag as its own, it asks the I/O Manager to call the next driver's I/O completion routine. Assume instead that one of the drivers recognizes the reparse point tag as its own. That driver can then resubmit the IRP with changes based on the data in the reparse point; for example, the pathname is changed before the IRP is resubmitted.

Step 7. NTFS completes the resubmitted IRP operation. A typical example might be that the pathname was changed and the open request succeeds. The I/O Manager completes the open request; each file system filter driver may then be invoked at its completion routine again. The driver notices that the open request succeeded and takes appropriate action. Finally, the IRP is completed, and the application gets back a handle to the file.

If no filter driver recognizes the reparse point tag, the file or directory open request fails.
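
The following kernel-mode fragment sketches steps 4 and 6 from the filter driver's point of view. This is a rough sketch only, not a working filter driver: MY_REPARSE_TAG and MY_EXTENSION are hypothetical, and the assumption that the tag can be read from Irp->IoStatus.Information when the status is STATUS_REPARSE should be verified against the IFS Kit documentation.

    /* A rough sketch of steps 4 and 6; not a working filter driver.
       MY_REPARSE_TAG is a hypothetical ISV tag value. */
    #include <ntddk.h>

    #define MY_REPARSE_TAG 0x0000002AUL

    typedef struct _MY_EXTENSION {
        PDEVICE_OBJECT NextDevice;    /* the device we are layered above */
    } MY_EXTENSION, *PMY_EXTENSION;

    /* Step 6: invoked after NTFS has failed the create IRP. */
    NTSTATUS MyOpenCompletion(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
    {
        if (Irp->IoStatus.Status == STATUS_REPARSE &&
            (ULONG)Irp->IoStatus.Information == MY_REPARSE_TAG) {
            /* The tag is ours: use the reparse data carried in the IRP to,
               for example, rewrite the pathname and reissue the open. */
            /* ... driver-specific reissue logic elided ... */
        }
        /* Otherwise let the next filter's completion routine inspect it. */
        return STATUS_SUCCESS;
    }

    /* Step 4: register a completion routine and pass the IRP down to NTFS. */
    NTSTATUS MyCreateDispatch(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        PMY_EXTENSION ext = (PMY_EXTENSION)DeviceObject->DeviceExtension;

        IoCopyCurrentIrpStackLocationToNext(Irp);
        IoSetCompletionRoutine(Irp, MyOpenCompletion, NULL, TRUE, TRUE, TRUE);
        return IoCallDriver(ext->NextDevice, Irp);
    }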

Some applications may need to be aware of reparse point functionality; other applications may not care and never even realize that a reparse point exists at all. A Microsoft Office application simply opening a Word, PowerPoint, or Excel document may not care at all about reparse point functionality that redirects the open request to a different volume. However, some applications that walk a tree recursively may need to be aware of the possibility of having paths that create a loop.

Applications can suppress the reparse point functionality by appropriate options (the FILE_FLAG_OPEN_REPARSE_POINT flag) in the CreateFile, DeleteFile, and RemoveDirectory API requests. The GetVolumeInformation API returns the flag FILE_SUPPORTS_REPARSE_POINTS. The GetFileAttributes, FindFirstFile, and FindNextFile APIs return the flag FILE_ATTRIBUTE_REPARSE_POINT to indicate the presence of a reparse point. Reparse points are created via the FSCTL_SET_REPARSE_POINT function code with the DeviceIoControl API.
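
A minimal user-mode sketch ties these APIs together: detect a reparse point with GetFileAttributes, open the reparse point itself by passing FILE_FLAG_OPEN_REPARSE_POINT to CreateFile, and read the tag back with FSCTL_GET_REPARSE_POINT, the counterpart of the set operation. Error handling is abbreviated for brevity.

    /* Detect a reparse point and read its tag. Error handling abbreviated. */
    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        if (argc < 2)
            return 1;

        DWORD attrs = GetFileAttributesA(argv[1]);
        if (attrs == INVALID_FILE_ATTRIBUTES ||
            !(attrs & FILE_ATTRIBUTE_REPARSE_POINT)) {
            printf("%s carries no reparse point\n", argv[1]);
            return 0;
        }

        /* FILE_FLAG_OPEN_REPARSE_POINT suppresses reparse processing so that
           the reparse point itself is opened; FILE_FLAG_BACKUP_SEMANTICS is
           needed in case the path names a directory. */
        HANDLE h = CreateFileA(argv[1], 0,
                               FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                               NULL, OPEN_EXISTING,
                               FILE_FLAG_OPEN_REPARSE_POINT | FILE_FLAG_BACKUP_SEMANTICS,
                               NULL);
        if (h == INVALID_HANDLE_VALUE)
            return 1;

        BYTE buf[16 * 1024];          /* the data blob is at most 16K */
        DWORD got = 0;
        if (DeviceIoControl(h, FSCTL_GET_REPARSE_POINT, NULL, 0,
                            buf, sizeof(buf), &got, NULL)) {
            /* Both documented reparse buffer layouts begin with the tag. */
            printf("reparse tag: 0x%08lx\n", *(unsigned long *)buf);
        }
        CloseHandle(h);
        return 0;
    }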

Windows 2000 allows an application to enumerate all the reparse points and/or mount points on a volume. To facilitate this process, NTFS stores information about every reparse point on the volume (including mount points) in an index within the metadata file \$Extend\$Reparse. An application can thus quickly enumerate all reparse points that exist on a volume.

6.5.22.1 Volume Mount Points

Windows NT 4.0 required that a drive letter be used to mount volumes or partitions. This constraint limited a system to having 26 volumes or partitions at the most. Windows 2000 allows mounting a volume without using a drive letter. The only limitations are as follows:

  • A volume may be mounted only on a local directory; that is, a volume cannot be mounted on a network share.

  • A volume may be mounted only on an empty directory.

  • This empty directory must be on an NTFS volume (only NTFS supports reparse points).

Applications accessing the directory that hosts the mount point do not notice anything special about the directory unless the application explicitly requests such information.

APIs that allow the addition and modification of volume mount points are now included in the Windows SDK. Examples of these APIs, which the sketch after this list exercises, include

  • GetVolumeInformation, which may be used to retrieve volume information, including an indication of whether or not the volume supports mount points

  • FindFirstVolumeMountPoint and FindNextVolumeMountPoint, which are used to find the volume mount points

  • FindVolumeMountPointClose, which frees up the resources consumed by FindFirstVolumeMountPoint and FindNextVolumeMountPoint

  • GetVolumeNameForVolumeMountPoint, which returns the volume name to which a volume mount point resolves
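
The following sketch exercises these APIs to walk the volume mount points under a volume root and resolve each one to the volume name it mounts. Buffer sizes and error handling are simplified, and the root path "C:\" is arbitrary.

    /* Enumerate mount points under C:\ and resolve each to a volume name. */
    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        char mountPoint[MAX_PATH];
        char volumeName[MAX_PATH];

        HANDLE find = FindFirstVolumeMountPointA("C:\\", mountPoint,
                                                 sizeof(mountPoint));
        if (find == INVALID_HANDLE_VALUE)
            return 0;                 /* no mount points on this volume */

        do {
            /* mountPoint is relative to the volume root, e.g. "mnt\\data\\" */
            char full[MAX_PATH];
            _snprintf(full, sizeof(full), "C:\\%s", mountPoint);

            if (GetVolumeNameForVolumeMountPointA(full, volumeName,
                                                  sizeof(volumeName)))
                printf("%s -> %s\n", full, volumeName);
        } while (FindNextVolumeMountPointA(find, mountPoint, sizeof(mountPoint)));

        FindVolumeMountPointClose(find);
        return 0;
    }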

6.5.22.2 Directory Junction Points

Directory junction points are closely related to volume mount points. The difference is that whereas volume mount points resolve a directory to a new volume, directory junction points resolve a directory to another directory on the same local volume where the directory junction point itself resides. Directory junction points may be created with the linkd.exe tool, which ships with the Windows 2000 Resource Kit, or with the junction.exe companion tool.

6.5.22.3 Single Instance Storage

Windows 2000 ships with Single Instance Storage (SIS) for Remote Installation Services (RIS). RIS provides functionality to efficiently create boot images and application images on a network share and allow clients to use these images. Very often enterprise customers create multiple images: for example, one for the Engineering department, a different image for the Accounting department, yet a different image for the Human Resources department, and so on. Many files are common across all images.

Using symbolic links to share a single file across multiple installations runs the risk that a change being made for one installation will be visible to all the installations. Consider a .ini file that is shared across all three departments mentioned here. If the Engineering department makes a change to this .ini file, the install images for the other two departments also pick up this change. SIS is a way to have a single file copy where possible, yet automatically break that single copy into multiple different versions when needed.

Windows customers typically have multiple images for different clients. For example, Engineering workstations are configured slightly differently from PCs used in the Accounting department, and so on. As a result, many duplicate files on these various images are created. SIS provides a way to store these files just once.

The whole SIS architecture is geared toward accomplishing the following functionality:

  • Detecting files that are identical and stored multiple times.

  • Copying the files that are stored multiple times into a special store. This functionality is similar to Hierarchical Storage Management, except that the file is always kept in a special SIS store on the disk.

  • Implementing SIS links for these files. SIS links are simply stub files in the original location with a SIS reparse point that points toward the copy of the file in the SIS store.

As Figure 6.13 illustrates, the SIS implementation architecture has the following components:

  • A SIS store

  • A SIS filter driver

  • SIS APIs

  • A SIS groveler

Figure 6.13. Single Instance Storage Architecture


SIS implements a protected store that contains all files identified as SIS candidates. The advantage of such a scheme is that it avoids the problems associated with operations such as delete, move, and rename on common files. The disadvantage is the overhead of a file copy operation. The files in the common store contain a back-pointer to the files they represent.

The SIS file system filter driver implements reparse point functionality that provides a link between a file and its copy in the SIS common store. The SIS driver implements two important IOCTL functions.

The first is SIS_COPYFILE, which copies a file into the SIS common store and turns the source file into a link file; this link file is actually created as a sparse file with no data blocks and only the MFT entry. The file contains a SIS reparse point whose data contains a link to the actual file in the SIS common store that holds the data for the file. This IOCTL function can be called by any application, as long as the application has read permission for the source and write permission for the destination. The copy happens if the source file is not already a SIS file; if the source file is already a SIS file, a back-pointer linking it with the file in the common store is simply added. One reason for copying files rather than moving them is that applications can open a file by its file identifier (ID), and a file's ID remains the same when the file is moved or renamed. Thus, if the file were simply moved into the store, applications opening it by ID would be opening the file in the SIS store rather than the link file. The drawback is the performance penalty incurred while large files are copied from one disk location to another.
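
A hedged sketch of driving this operation from user mode follows, using the FSCTL_SIS_COPYFILE control code and SI_COPYFILE structure declared in winioctl.h. Details such as which handle the control code must be issued on, and the exact semantics of the length fields (assumed here to be in bytes, including the terminator), should be verified against the SDK documentation; the layout shown is illustrative.

    /* Ask the SIS filter to turn src into a SIS link backed by dst.
       Length-field semantics are an assumption; verify against the SDK. */
    #include <windows.h>
    #include <winioctl.h>
    #include <string.h>
    #include <stdlib.h>

    BOOL SisCopyFile(HANDLE volume, const WCHAR *src, const WCHAR *dst)
    {
        DWORD srcBytes = (DWORD)((wcslen(src) + 1) * sizeof(WCHAR));
        DWORD dstBytes = (DWORD)((wcslen(dst) + 1) * sizeof(WCHAR));
        DWORD size = sizeof(SI_COPYFILE) + srcBytes + dstBytes;

        PSI_COPYFILE cf = (PSI_COPYFILE)malloc(size);
        if (!cf)
            return FALSE;

        cf->SourceFileNameLength      = srcBytes;
        cf->DestinationFileNameLength = dstBytes;
        cf->Flags = COPYFILE_SIS_LINK;   /* fail unless a SIS link results */
        memcpy(cf->FileNameBuffer, src, srcBytes);
        memcpy((BYTE *)cf->FileNameBuffer + srcBytes, dst, dstBytes);

        DWORD ret = 0;
        BOOL ok = DeviceIoControl(volume, FSCTL_SIS_COPYFILE, cf, size,
                                  NULL, 0, &ret, NULL);
        free(cf);
        return ok;
    }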

The second important IOCTL function implemented by the SIS driver is SIS_MERGE_FILES, which merges two files. This IOCTL function is protected and is called by the user-mode component of SIS, namely the SIS Groveler, which is explained later in this section.

Beyond the specific IOCTL functions, the SIS driver is responsible for implementing SIS links (which are similar to symbolic links): an application refers to the link file, but the driver transparently supplies the file data from the corresponding file in the SIS common store.

The SIS Groveler is responsible for scanning all files on the volume and detecting duplicate files. The Groveler uses the SIS driver functionality to move the duplicates that it detects into the SIS common store. The Groveler uses the NTFS change journal to detect changed files: once a full disk scan has been completed, the change journal allows the Groveler to limit its scanning to only those files that have changed.
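
To illustrate the mechanism (though not the Groveler's actual logic), the following sketch reads the NTFS change journal via the documented FSCTL_QUERY_USN_JOURNAL and FSCTL_READ_USN_JOURNAL control codes, printing the names of files that changed since a remembered update sequence number (USN).

    /* Read change journal records newer than 'since'. The volume handle is
       assumed to come from CreateFile on a path such as \\.\D:. */
    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    void scan_changes(HANDLE volume, USN since)
    {
        USN_JOURNAL_DATA jd;
        READ_USN_JOURNAL_DATA rd;
        BYTE buf[4096];
        DWORD got = 0;

        if (!DeviceIoControl(volume, FSCTL_QUERY_USN_JOURNAL, NULL, 0,
                             &jd, sizeof(jd), &got, NULL))
            return;

        rd.StartUsn          = since;
        rd.ReasonMask        = 0xFFFFFFFF;   /* any kind of change */
        rd.ReturnOnlyOnClose = FALSE;
        rd.Timeout           = 0;
        rd.BytesToWaitFor    = 0;            /* do not block waiting for data */
        rd.UsnJournalID      = jd.UsnJournalID;

        while (DeviceIoControl(volume, FSCTL_READ_USN_JOURNAL, &rd, sizeof(rd),
                               buf, sizeof(buf), &got, NULL) &&
               got > sizeof(USN)) {
            /* The output begins with the next USN, then packed records. */
            BYTE *p = buf + sizeof(USN);
            while (p < buf + got) {
                PUSN_RECORD rec = (PUSN_RECORD)p;
                wprintf(L"%.*s (reason 0x%08lx)\n",
                        (int)(rec->FileNameLength / sizeof(WCHAR)),
                        (PWCHAR)((PBYTE)rec + rec->FileNameOffset),
                        rec->Reason);
                p += rec->RecordLength;
            }
            rd.StartUsn = *(USN *)buf;       /* resume past what was read */
        }
    }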

SIS does not manage all volumes. When the SIS service starts, it scans all NTFS volumes to locate volumes that have a SIS Common Store folder and attaches itself only to volumes that have this folder. This is the folder containing the SIS common store, and it is created at the time of SIS installation.

When an application opens a file, it may really be opening just a SIS link file, and the actual data content of the file may be coming from the common file in the SIS store. Consider an example in which a .ini file is shared by, say, three departments: Engineering, Human Resources, and Accounting. There would be a single copy of the .ini file in the SIS common store and three link files, one corresponding to each department. Assume that the Engineering department decides to change a value in the .ini file. SIS ensures that the two other departments (Accounting and Human Resources) are not forced to also accept the change.

SIS ensures this departmental independence by doing a copy-on-close when the application writing to what it thinks is the Engineering department .ini file closes the file after doing the appropriate edits. The reason for copy-on-close rather than copy-on-write is that statistics show that an extremely high percentage of write operations in the relevant situations end up changing the whole file, rather than just limited portions of the file. Thus, copy-on-write would needlessly copy data from the existing file and then overwrite that freshly copied data. When the whole file is not freshly written, unchanged portions of the file are extracted from the existing SIS common store and added to the freshly written parts of the file.

The SIS implementation provides APIs for backup applications, to ensure that the different SIS links do not all end up as full-fledged files on the backup media. The idea is that only one copy of the SIS common store file data ends up on the backup media, and that on restore the appropriate link files and SIS common store data files are re-created as needed.

6.5.22.4 Hierarchical Storage Management

Hierarchical Storage Management (HSM) is described in more detail in Chapter 7. For now, suffice it to say that such applications can be built on top of the reparse point mechanism described earlier in this section; indeed, Microsoft's implementation of HSM is one such example. HSM migrates files from disk to other media and leaves behind stub files with a reparse point. When an application opens such a file, the reparse point mechanism can be invoked to seamlessly retrieve the data from the other media, as appropriate.
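
As a small illustration, a tree-walking tool can consult file attributes to avoid inadvertently recalling migrated data: FILE_ATTRIBUTE_OFFLINE marks data that is not immediately available, and FILE_ATTRIBUTE_REPARSE_POINT marks the stub. The path used below is arbitrary.

    /* Skip HSM stub files instead of triggering a high-latency recall. */
    #include <windows.h>
    #include <stdio.h>

    BOOL looks_migrated(const char *path)
    {
        DWORD attrs = GetFileAttributesA(path);
        return attrs != INVALID_FILE_ATTRIBUTES &&
               (attrs & FILE_ATTRIBUTE_OFFLINE) &&
               (attrs & FILE_ATTRIBUTE_REPARSE_POINT);
    }

    int main(void)
    {
        if (looks_migrated("D:\\archive\\report.doc"))
            printf("stub file: opening it may recall data from offline media\n");
        return 0;
    }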


   