6.6 SAN File Systems

Storage area networks allow administrators to maintain a pool of storage resources alongside a group of servers and to assign individual storage resources to a particular server. SANs still require that, at any given moment, only one server access a given storage resource; what they facilitate is easy reassignment of a storage resource from one server to another. To understand this better, consider Figure 6.14.

Figure 6.14. SAN Usage Scenario with a Local File System


Figure 6.14 shows a typical three-tiered SAN deployment. At the top are clients accessing servers over a LAN. The servers are connected to a Fibre Channel switch, as are several storage disks, which together constitute a pool of storage consisting of Disks D1 through D4. In Figure 6.14, Server 1 and Disks D1 and D3 are shaded to indicate that Server 1 is exclusively accessing Disks D1 and D3; Server 2 is exclusively accessing Disks D2 and D4.

The SAN simply facilitates relatively easy movement of a disk from one server to another. SANs do not facilitate true simultaneous sharing of storage devices; they simply make some storage resources appear to be direct-attached storage as far as upper layers of software, such as file systems (and above), are concerned. This is true whether the SAN is Fibre Channel based or IP storage based. [8]

[8] IP storage is discussed in detail in Chapter 8.

To allow a storage resource such as a volume to be truly simultaneously shared and accessed by different servers, one needs an enhanced file system, often referred to as a SAN file system. SAN file systems allow multiple servers to access the same storage device simultaneously while still providing for some files, or parts of files, to be accessed exclusively by a particular server process for some duration of time. Astute readers might argue that even network-attached storage allows files to be simultaneously shared, and they would be correct. The difference is that network-attached storage has a single server (the NAS server) acting as a gatekeeper, and all file operations (e.g., open, close, read, write, lock) are issued to that server.

The NAS server can easily become a bottleneck. Network file systems such as CIFS and NFS (described in Chapter 3) provide file system sharing at the file level for clients accessing servers via a network protocol such as TCP/IP. SAN file systems provide sharing of storage devices at the block level for clients accessing the storage device via a block mode protocol such as SCSI. With SAN file systems, each server runs what it believes is a file system on a local disk. In reality, multiple servers are operating under this illusion, and the SAN file system on each server correctly maintains file system state on the volume that all of them are simultaneously operating on.
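
To make this concrete, consider the following minimal Python sketch, which is hypothetical and not taken from any shipping product. Two "servers" allocate clusters from the same on-disk allocation bitmap; the lock stands in for the SAN file system's cross-machine synchronization. Without it, both servers could claim the same free cluster and corrupt each other's files.

    # Minimal sketch: two "servers" allocate clusters from one shared
    # volume bitmap. The Lock stands in for the SAN file system's
    # cross-machine synchronization; all names here are illustrative.
    import threading

    bitmap = [False] * 64           # False = free cluster, True = allocated
    bitmap_lock = threading.Lock()  # a real SAN FS must lock across machines

    def allocate_cluster():
        """Find and claim the first free cluster, atomically."""
        with bitmap_lock:           # the read-modify-write must be atomic
            for i, used in enumerate(bitmap):
                if not used:
                    bitmap[i] = True
                    return i
        raise RuntimeError("volume full")

    results = []
    def worker(name):
        for _ in range(10):
            results.append((name, allocate_cluster()))

    t1 = threading.Thread(target=worker, args=("server1",))
    t2 = threading.Thread(target=worker, args=("server2",))
    t1.start(); t2.start(); t1.join(); t2.join()

    clusters = [c for _, c in results]
    assert len(clusters) == len(set(clusters)), "double allocation!"

A real SAN file system must provide the same atomicity across machines that share no memory, which is precisely the role of the distributed lock manager discussed in Section 6.6.2.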

A diagram might help explain this. Figure 6.15 shows two scenarios. The left-hand side of the figure shows a network-attached storage disk being accessed by multiple servers via a network file system, and the right-hand side shows multiple servers accessing a single disk via a SAN file system. In the first case, each server uses its network file system (such as SMB or NFS) to send requests to the server on the NAS device. The NAS device thus constitutes a potential single point of failure, as well as a potential bottleneck. When a SAN file system is deployed, there is no such potential bottleneck or failure point: the storage disk can be accessed in a load-sharing fashion via both Servers 1 and 2, and if one of the servers fails, the disk data can still be accessed via the other server. Of course, the cost here is the added complexity and expense of the SAN file system.

Figure 6.15. SAN and NAS File System Usage Scenario


6.6.1 Advantages of SAN File Systems

The advantages of SAN file systems include the following:

  • SAN file systems provide a highly available solution. Multiple servers can access each volume, so the servers are not a single point of failure. In addition, it is relatively cheap to make a volume fault tolerant (i.e., by deploying the appropriate RAID solution), whereas a fault-tolerant cluster is more expensive to procure and a lot more expensive in terms of operational costs.

  • The solution provides high throughput, because a single server's I/O bus is typically the bottleneck for I/O requests; with multiple servers accessing the same volume, that constraint eases. All disks appear to be locally attached disks, and a locally attached disk solution requires fewer data copies than a NAS solution. For example, for a read issued by a client of a NAS device, data is typically copied from server buffers to TCP/IP buffers and, at the client end, from TCP/IP buffers to the application buffers.

  • Storage consolidation allows the user to avoid unnecessary duplication and synchronization of data. Without a SAN file system, two servers may not simultaneously access a single storage disk, so a server that cannot meet its required performance criteria under heavy load must be given its own private storage unit with identical volumes, duplicating the data.

  • Demands on management overhead of data storage are reduced, leading to lower total cost of ownership. One needs to manage only one instance of a file system rather than multiple instances.

  • SAN file systems provide for an extremely scalable solution in which one can easily add servers, storage, or more SAN devices (such as switches) as needs change.

  • Applications can choose the type of storage most appropriate to their needs: for example, RAID 0, RAID 1, or RAID 5.

  • The solution is truly scalable, resembling the computing equivalent of the LEGO brick. For more computing power, we simply drop in another server and configure it to access the existing shared disk.

6.6.2 Technical Challenges of SAN File Systems

One of the engineering feats in implementing SAN file systems is striking the right balance between concurrent access and serialization. Concurrent access to files and disks is required for a highly scalable system that allows multiple processes to access the same set of files simultaneously. Synchronization is required to ensure that the integrity of user data and file system metadata is maintained, even while multiple processes or users are simultaneously accessing files.

Note that this challenge of concurrent access and serialization exists even in non-SAN file systems such as NTFS. The difference is that the mechanisms needed to ensure proper serialization there are much simpler and are provided by the operating system; for example, the synchronization mechanisms that Windows provides, such as spinlocks and semaphores, are perfectly adequate for non-SAN file systems such as NTFS.

A complete description of the technology behind creating SAN file systems is beyond the scope of this book. Suffice it to say that the issues involved include the following:

  • A synchronization mechanism, often referred to as a distributed lock manager, is needed that can operate across multiple machines and tolerate network latency and reliability issues (a minimal lease-based sketch appears after this list).

  • Problems arise when machines crash while holding resources such as file locks.

  • Problems arise when the configuration is deliberately or inadvertently changed. For example, a network (TCP/IP or SAN) experiences some topology changes that render some of the machines inaccessible.

  • Deadlock detection capability is necessary. Deadlock occurs when one client holds resources that a second client is waiting for, while the second client simultaneously holds resources that the first client is waiting for (a cycle-detection sketch appears after this list).

  • Software-based RAID either cannot be used at all or requires a fair degree of added complexity. With software-based RAID, two levels of mapping are involved: first the file system maps file-relative I/O to volume- or partition-relative I/O; then the software RAID component (the Logical Disk Manager in Windows 2000) maps the volume-relative block I/O to physical disk-relative block I/O (a mapping sketch appears after this list). Further, to prevent data corruption, two levels of SAN locking are needed. The first is at the file system level, to ensure serialization between different Windows systems attempting to write overlapping data to the same file. In addition, because the software RAID component will attempt to update the parity data for the file, the software RAID components (Logical Disk Manager or equivalent) running on the different Windows systems must also implement a mutual SAN locking mechanism.

  • Differences exist between the operating systems and the file systems that the various clients are running. This is a complex problem by itself, and an area in which vendors, including NAS vendors, have expended considerable effort with a good degree of success. There are really several issues here, including the following:

    • Providing for some mapping between the different ways in which user and group accounts and permissions are tracked on different operating systems.

    • Providing for semantic differences in file open and locking behavior across operating systems and file systems.

    • Providing for differences between file naming conventions. Different file systems have different ideas about maximum file name lengths, file name case sensitivity, and valid characters within a file name.

    • Different operating systems support different timestamps. Whereas Windows NT supports three timestamps per file, UNIX file systems typically support only two. Even when the number of timestamps is identical, the units may differ (a timestamp-conversion sketch appears after this list).

    • The file systems on the heterogeneous systems can also have different sizes; for example, some are 32-bit file systems, and others are 64-bit file systems. All structures need appropriate mapping. In terms of implementation details, data structures must be mapped back and forth, keeping in mind that they may need to be padded to 8-bit, 16-bit, 32-bit, or 64-bit boundaries.
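
The lease-based sketch promised above: rather than granting locks forever, the lock manager grants time-limited leases, so a lock held by a crashed machine simply expires instead of blocking everyone indefinitely. This single-process stand-in is hypothetical; a real distributed lock manager runs an equivalent protocol over the network, and all names here are invented.

    import time

    class LeaseLockManager:
        """Grants time-limited leases; a crashed holder's lease expires."""
        def __init__(self, lease_seconds=5.0):
            self.lease_seconds = lease_seconds
            self.locks = {}   # resource -> (owner, lease expiry time)

        def acquire(self, resource, owner, now=None):
            now = time.monotonic() if now is None else now
            held = self.locks.get(resource)
            if held is not None and held[1] > now and held[0] != owner:
                return False                  # live lease held by another
            self.locks[resource] = (owner, now + self.lease_seconds)
            return True                       # granted (or renewed)

        def release(self, resource, owner):
            if self.locks.get(resource, (None, 0))[0] == owner:
                del self.locks[resource]

    mgr = LeaseLockManager(lease_seconds=5.0)
    assert mgr.acquire("cluster-bitmap", "server1", now=0.0)
    assert not mgr.acquire("cluster-bitmap", "server2", now=1.0)  # leased
    assert mgr.acquire("cluster-bitmap", "server2", now=6.0)      # expired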
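The cycle-detection sketch mentioned above: deadlock detection can be modeled as finding a cycle in a wait-for graph, where an edge from A to B means "A waits for a resource that B holds." The graph representation is illustrative, not taken from any particular product.

    def find_deadlock(waits_for):
        """Return a deadlocked cycle of clients, or None.

        waits_for: dict mapping client -> set of clients it waits on.
        """
        WHITE, GRAY, BLACK = 0, 1, 2       # unvisited / in progress / done
        color = {c: WHITE for c in waits_for}
        stack = []

        def visit(node):
            color[node] = GRAY
            stack.append(node)
            for nxt in waits_for.get(node, ()):
                if color.get(nxt, WHITE) == GRAY:
                    return stack[stack.index(nxt):]   # back edge: a cycle
                if color.get(nxt, WHITE) == WHITE:
                    cycle = visit(nxt)
                    if cycle:
                        return cycle
            stack.pop()
            color[node] = BLACK
            return None

        for client in waits_for:
            if color.get(client, WHITE) == WHITE:
                cycle = visit(client)
                if cycle:
                    return cycle
        return None

    # Each client holds a lock the other wants: classic deadlock.
    print(find_deadlock({"client1": {"client2"},
                         "client2": {"client1"}}))
    # -> ['client1', 'client2']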
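The mapping sketch referred to above shows the two levels as plain arithmetic. The extent list, the stripe layout, and all names are hypothetical.

    # Two-level mapping sketch: file offset -> volume block -> physical disk.
    # Level 1 (file system): file-relative offset to volume-relative block.
    # Level 2 (software RAID, here a RAID 0 style stripe set): volume block
    # to (disk number, disk-relative block).
    BLOCK_SIZE = 4096

    # Hypothetical extent list: (file_block_start, volume_block_start, count)
    extents = [(0, 1000, 8), (8, 5000, 4)]

    def file_to_volume_block(file_offset):
        fblock = file_offset // BLOCK_SIZE
        for fstart, vstart, count in extents:
            if fstart <= fblock < fstart + count:
                return vstart + (fblock - fstart)
        raise ValueError("offset beyond end of file")

    def volume_to_physical(vblock, ndisks=2, stripe_blocks=4):
        """RAID 0 style striping: which disk, which block on that disk."""
        stripe = vblock // stripe_blocks
        return (stripe % ndisks,                       # disk number
                (stripe // ndisks) * stripe_blocks + vblock % stripe_blocks)

    vb = file_to_volume_block(5 * BLOCK_SIZE)   # file block 5 -> volume 1005
    print(vb, volume_to_physical(vb))           # -> 1005 (1, 501)

Both layers must be kept consistent clusterwide, which is why the text above calls for SAN locking at both the file system and the software RAID level.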
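Finally, the timestamp-conversion sketch: Windows NT file times count 100-nanosecond intervals since January 1, 1601 (UTC), while UNIX times count seconds since January 1, 1970 (UTC); 11,644,473,600 seconds separate the two epochs.

    # Windows FILETIME: 100-ns ticks since 1601-01-01 (UTC).
    # UNIX time: seconds since 1970-01-01 (UTC).
    EPOCH_DIFF_SECONDS = 11644473600
    TICKS_PER_SECOND = 10_000_000   # 100-ns ticks per second

    def filetime_to_unix(filetime):
        return filetime / TICKS_PER_SECOND - EPOCH_DIFF_SECONDS

    def unix_to_filetime(unix_seconds):
        # Truncation loses only sub-100-ns precision.
        return int((unix_seconds + EPOCH_DIFF_SECONDS) * TICKS_PER_SECOND)

    print(filetime_to_unix(unix_to_filetime(0)))   # -> 0.0 (the 1970 epoch)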

At an extremely high level, SAN file systems may be designed in two ways:

  • A truly symmetric approach, in which every node on the SAN is a peer and the synchronization mechanism is truly distributed across all of the nodes. To date, a symmetric file system is not yet commercially available for the Windows platform.

  • An asymmetric approach, in which one particular node acts as a metadata server and a central synchronization point. This metadata server is responsible for managing all file system metadata (e.g., disk cluster allocation). The other servers implementing the SAN file system obtain metadata, such as disk cluster allocation information, disk target ID, and LUN ID, from this server and then do the actual user data I/O directly over the SAN. Several vendors, including ADIC and EMC (to name just a couple), ship commercially available products for the Windows NT platform based on the asymmetric approach.

The asymmetric approach to a SAN file system is illustrated in Figure 6.16:

Step 1. A client connects to a server and requests some data from a file using a protocol such as CIFS (explained in Chapter 3).

Step 2. The server contacts a metadata server and obtains information about the storage device on which the file resides, including particulars of the disk block on which the file resides.

Step 3. At this stage the server can accomplish all I/O directly, using the data it received from the metadata server.
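
A compressed sketch of this three-step flow follows. The class names, the layout table, and the message shapes are all invented for illustration; no vendor's wire protocol looks exactly like this.

    # Sketch of the asymmetric SAN file system flow in Figure 6.16.
    class MetadataServer:
        """Central authority for file-to-block mappings (step 2)."""
        def __init__(self):
            # file name -> (target ID, LUN, list of volume-relative blocks)
            self.layout = {"/data/report.doc": (5, 0, [1000, 1001, 1002])}

        def resolve(self, path):
            return self.layout[path]

    class SanDisk:
        """Stands in for direct block I/O over the SAN (step 3)."""
        def __init__(self):
            self.blocks = {1000: b"Hello ", 1001: b"SAN ", 1002: b"world"}

        def read_block(self, block):
            return self.blocks[block]

    def serve_client_read(path, mds, disk):
        # Step 1: a client request (e.g., via CIFS) arrives at this server.
        # Step 2: ask the metadata server where the file's blocks live.
        target, lun, blocks = mds.resolve(path)   # target/lun unused in toy
        # Step 3: do the bulk data I/O directly against the SAN disk,
        # bypassing the metadata server for the transfer itself.
        return b"".join(disk.read_block(b) for b in blocks)

    print(serve_client_read("/data/report.doc", MetadataServer(), SanDisk()))
    # -> b'Hello SAN world'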

Figure 6.16. SAN File System with Metadata Server


6.6.3 Commercially Available SAN Systems

Some vendors have implemented SAN file systems for the Windows NT platform using the asymmetric approach. Examples include EMC, with its Celerra HighRoad product line; Tivoli, with its SANergy product; and ADIC, with its StorNext product (formerly known as CentraVision). All of these products use a Windows server to implement the metadata server and support access to the metadata server by secondary Windows servers. Some of these products support a standby metadata server; some do not. In addition, some of these products support access to the metadata server by other servers (such as NetWare, UNIX, or Solaris servers), and some do not.

It is interesting to explore the details of how such functionality is implemented and the details of the execution, with respect to the Windows NT I/O stack.

Figure 6.17 shows the Windows NT network I/O stack, as well as the local storage (Storport and SCSI) I/O stack. The SAN file system filter driver (shaded in the figure) layers itself over the network file system in general and the CIFS redirector in particular. The filter driver intercepts file open, close, create, and delete requests and lets them flow along the regular network file system stack. The interception is simply to register a completion routine. For all files successfully opened, the filter driver then optionally obtains information about the exact disk track, sector, and blocks where the file data resides.

Figure 6.17. Windows NT SAN File System I/O Stack


This is done for all large files. Some implementations choose not to do this for small files, the underlying rationale being that the overhead of obtaining disk track or sector information for a small file is comparable to the actual read or write operation on those few sectors. Thereafter, file operations such as reads and writes (those that do not involve manipulation of file system metadata) are handled directly by block-level I/O between the server and the storage disk.
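
That size-based policy might look like the following sketch; the threshold, the parameters, and the stand-in callables are all hypothetical.

    # Sketch of the filter driver's policy: small files flow down the
    # normal CIFS stack; bulk I/O on large files goes directly to the SAN
    # disk using an extent map obtained once from the metadata server.
    SMALL_FILE_THRESHOLD = 64 * 1024   # hypothetical cutoff

    def read_file(path, size, cifs_read, fetch_extents, block_read):
        """Route a read either down the CIFS stack or directly to the SAN."""
        if size <= SMALL_FILE_THRESHOLD:
            # Fetching an extent map would cost about as much as the read
            # itself, so let the request flow down the CIFS stack.
            return cifs_read(path)
        extents = fetch_extents(path)          # ask the metadata server once
        return b"".join(block_read(e) for e in extents)  # direct block I/O

    # Toy stand-ins for the three I/O paths:
    data = read_file("bigfile.dat", 10 * 1024 * 1024,
                     cifs_read=lambda p: b"(via CIFS)",
                     fetch_extents=lambda p: [0, 1, 2],
                     block_read=lambda e: b"blk%d " % e)
    print(data)   # -> b'blk0 blk1 blk2 '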

The drawback of having a centralized metadata server is that this server can become a bottleneck, as well as a single point of failure. Some vendors provide capability in their products to have a standby metadata server take over in case of failure of the primary metadata server. On the other hand, the metadata server is the only server that caches metadata, so clusterwide I/O to read and write metadata is avoided.


   