Subsystem Architecture


Most subsystems share the following functional elements:

  • Controller hardware and logic Performs most of the storage functions in a subsystem, including the execution of storage commands received from host systems as well as subsystem-based storage applications

  • External storage network ports Connects to host systems as well as optional data network ports for management connections

  • Memory Used for caching storage data

  • Internal connection technology (including the bus or network technology) Used to connect network ports with controllers and memory, as well as the device interconnect technology used to connect storage devices

  • Storage devices Where data is eventually stored
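
As a rough illustration only, these functional elements can be modeled in a few lines of Python; the class and field names here (Subsystem, StorageDevice, and so on) are hypothetical and are not drawn from any particular product.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StorageDevice:
    device_id: str           # where data is eventually stored
    capacity_gb: int

@dataclass
class Subsystem:
    """Hypothetical model of the functional elements listed above."""
    controller: str                          # controller hardware and logic
    network_ports: List[str]                 # external storage network ports
    cache_mb: int                            # memory used for caching storage data
    interconnect: str                        # internal bus/network and device interconnect
    devices: List[StorageDevice] = field(default_factory=list)

subsystem = Subsystem(
    controller="dual-processor RAID controller",
    network_ports=["fc0", "fc1"],
    cache_mb=4096,
    interconnect="FC loop back end",
    devices=[StorageDevice(f"disk{i}", 146) for i in range(8)],
)
```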

Sometimes the term front end is used with storage subsystems to indicate functions that process data at or near the point it is received by the subsystem. For example, the front end of a subsystem might include a number of access control functions and possibly caching functions. The term back end is used to indicate functions relating to device operations. An example of a back end is a Small Computer System Interface (SCSI) bus controller.

Figure 5-1 shows the basic architectural elements of most storage subsystems.

Figure 5-1. Basic Architecture of a Storage Subsystem


Subsystem Controllers

The controller design of a subsystem determines most of its capabilities. Subsystem controllers can be implemented in many different ways, from relatively simple, inexpensive single-board designs to highly intricate switched architectures. Many types of processors are used, including those designed for real-time applications, such as the Intel i960 or StrongARM processors; system processors, such as the PowerPC and Pentium; and embedded processor cores that are integrated into custom designs, such as those from MIPS Technologies.

While there are many possible subsystem designs, most of them can be placed within three categories:

  • Integrated controller/adapters These are subsystem controllers implemented as adapter cards that manage the actions of smaller storage subsystems. These integrated controllers usually hold some fixed configuration of cache memory and are limited in scalability by the number of external network and internal storage ports.

  • System-based controllers These are characterized by specialized, standalone systems complete with processors, high-speed cache memory, and a system bus or internal loop network connecting the various subsystem elements. These subsystems typically have multiple processors to provide scalable performance and high availability.

  • Switching-based controllers These are the most recent innovation in subsystem design. These subsystems have integrated high-speed switching cores and backplanes instead of shared media buses or loop networks. Controller processors and logic are located within processing modules or "blades" that fit into rack slots.

NOTE

In this chapter, the term controller refers to the complete set of functions and processors that drive the subsystem. The word controller might seem a bit puny when referring to all the stuff that goes into a large enterprise subsystem, but there needs to be a way to discuss the functional role of the controller succinctly. Controller uses the perspective of an I/O operation, in which a host I/O controller (that is, an adapter) communicates with a storage subsystem controller.


Subsystem Internal Connectivity

Controller designs are necessarily closely integrated with the internal connectivity technologies used in a storage subsystem. The internal communications technologies in storage subsystems often include a system bus for connecting controller processors, ports, and memory, as well as a device interconnect technology such as SCSI or Fibre Channel (see Chapter 7, "Device Interconnect Technologies for Storage Networks").

Some of the internal connection technologies used in storage subsystems include the following:

  • System I/O buses such as Peripheral Component Interconnect (PCI), PCI Extended (PCI-X), and proprietary bus implementations System buses are normally used for front-end connections, although they can also be used on the back end to connect devices.

  • Direct attached storage (DAS) buses such as SCSI, Advanced Technology Attachment (ATA), Serial ATA (SATA), and Serial Attached SCSI (SAS) These are used for back-end communications within a subsystem.

  • Fibre Channel (FC) loop networks FC loop can be used for front-end connections between network ports and controllers as well as for connecting storage devices in the back end. FC loop technology can also be used to connect external expansion cabinets that expand the storage capacity of a subsystem.

  • Switched connections This can include almost any type of switchable communications technology, including FC, Ethernet, dense wavelength division multiplexing (DWDM), SATA, SAS, and proprietary switching technology. High-speed processors are used at the front and back ends to provide an interface between the switched technology and its adjacent element.

Whereas the controller design determines the functions and applications supported by a subsystem, the subsystem's internal connection technology determines its scalability, performance, and availability characteristics.

For instance, the address space of switched topologies tends to be far greater than that of buses and loops, which means it is easier to accommodate large numbers of devices in a subsystem. In addition, switched technologies tend to have much better bandwidth characteristics than shared media technologies, such as buses and loops.

Aggregating Devices in Subsystems

The primary characteristic of storage subsystems is their ability to aggregate multiple storage devices into a single manageable system. While there are many possible configurations and feature sets for storage subsystems, the aggregation of devices typically results in some or all of the following key advantages:

  • Data redundancy Data is written to multiple devices, allowing continuous operations should an individual device fail.

  • Capacity scaling Aggregating a large number of physical devices enables the creation of large virtual devices.

  • Integrated packaging Devices installed in a single enclosure or in linked, modular enclosures are more easily added, removed, and managed.

  • Hot swapping Devices can be removed and replaced without interrupting production operations.

  • Managed, shared power Devices receive power from conditioned, redundant power supplies.

  • Consolidated communications Devices are connected via a single communications system that manages storage traffic for all devices.

NOTE

The subsystem characteristics discussed in this chapter involve mainly disk drives. They are by far the most common type of device, and disk subsystems are by far the most common type of subsystem. But disk subsystems aren't the only kind; tape, optical, and solid-state subsystems also exist. This chapter could refer to either disk drives or storage devices, but either way the discussion can be misleading at times. If you interpret storage device to mean disk drive, you won't be far off.


High Data Availability with Redundant Subsystem Designs

A number of different problems can obstruct data availability, including device failures, connectivity failures, and power failures. While it is not necessary to have high availability storage to achieve high availability for data, most storage network architects prefer to increase their odds by using subsystems designed for high availability. In general, high availability is equated with redundancy for all components in the I/O path.

Redundant Storage Devices

One of the most common and powerful forms of redundancy is mirroring data on a pair of disk drives. Mirroring in a subsystem is a simple concept: the subsystem's controller duplicates every I/O operation it receives from a host system to a pair of internal disk drives. Figure 5-2 illustrates disk mirroring in a storage subsystem controller.

Figure 5-2. Disk Mirroring in a Storage Subsystem


Although mirroring is a simple concept, it has some interesting and unexpected angles that are discussed in greater detail in Chapter 8, "An Introduction to Data Redundancy and Mirroring."
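
As a minimal sketch of the duplication just described (the drive interface is hypothetical, not any vendor's firmware), a controller-level mirror simply issues every host write to both members of the pair and can read from either:

```python
class MirroredPair:
    """Minimal sketch of controller-level disk mirroring (hypothetical drive interface)."""

    def __init__(self, drive_a, drive_b):
        self.drives = (drive_a, drive_b)

    def write(self, lba, data):
        # The controller duplicates every host write to both internal drives.
        for drive in self.drives:
            drive.write(lba, data)

    def read(self, lba):
        # Either copy can satisfy the read; fall back to the second if the first fails.
        try:
            return self.drives[0].read(lba)
        except IOError:
            return self.drives[1].read(lba)
```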

Another form of storage redundancy commonly used in storage subsystems is based on parity RAID (redundant array of independent disks), the topic of Chapter 9, "Bigger, Faster, More Reliable Storage with RAID." In essence, the subsystem controller performs a function similar to disk mirroring, except that with parity RAID the operations are not duplicated. Instead, the operations are segmented to form multiple I/O operations, as shown in Figure 5-3.

Figure 5-3. Storage Subsystem Controller Performing RAID Operations


The file system or database that controls the placement of data on disk drives is unaware that disk mirroring or RAID is being used. In other words, the storing process used for device redundancy in a subsystem is transparent to the systems that use the subsystem. Mirroring and RAID used for redundancy are part of the larger subject area of storage virtualization, the topic of Chapter 12, "Storage Virtualization: The Power in Volume Management Software and SAN Virtualization Systems."
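
Along the lines of Figure 5-3, the sketch below segments a single host write into per-drive operations and computes a parity segment with XOR. The chunk size and drive count are arbitrary assumptions, and real parity RAID, covered in Chapter 9, involves much more than this:

```python
from functools import reduce

def segment_with_parity(data: bytes, data_drives: int, chunk: int = 4096):
    """Split one host write into per-drive segments plus an XOR parity segment per stripe."""
    segments = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    # Pad the final segment so every segment is the same length for the XOR.
    segments = [s.ljust(chunk, b"\x00") for s in segments]
    stripes = [segments[i:i + data_drives] for i in range(0, len(segments), data_drives)]
    operations = []
    for stripe in stripes:
        # Fill short stripes with zeroed segments, then compute parity across the stripe.
        stripe += [bytes(chunk)] * (data_drives - len(stripe))
        parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), stripe)
        operations.append((stripe, parity))   # one I/O per data drive, plus one parity I/O
    return operations
```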

Dual-Ported Devices and Dual Internal Networks or Buses

Traditionally, storage devices were designed with a single connecting port for communicating with host systems. However, some newer storage devices designed for storage networking applications, such as FC disk drives, have dual ports that provide redundant protection against failures of device connectors, controller connections, or controllers.

NOTE

The road maps for SATA and SAS disk drive technologies also include dual-ported products. Perhaps by the time you read this book, dual-ported SATA and SAS drives will be available. As one familiar with self-induced hypoxia from holding my breath waiting for technologies to appear, I find it more pleasant now to write about actual shipping technologies as opposed to those with high dew points.


If dual-ported devices are used in a subsystem, it is possible to have redundant network or bus connections between subsystem controllers and storage devices. The idea is similar to multi-pathing in a storage network (see Chapter 11, "Connection Redundancy in Storage Networks and Dynamic Multipathing"), but the redundancy occurs within the confines of a subsystem. If one of the connections fails, the other can be used, as shown in Figure 5-4.

Figure 5-4. Redundant Connections to a Dual-Ported Device


The presence of multiple networks or buses does not necessarily mean the storage subsystem provides connection redundancy. For example, a subsystem could have four independent parallel SCSI buses with multiple single-ported devices connected to each bus. Even though there are multiple buses in the subsystem, there is no connection redundancy because the devices have only a single port.
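
A hedged sketch of the failover idea: the controller keeps a list of internal paths to a dual-ported device and falls back to the alternate path when the preferred one fails. The path objects and their attributes are illustrative assumptions only.

```python
def send_io(paths, command):
    """Try the preferred internal path first; fall back to the redundant one."""
    last_error = None
    for path in paths:                 # e.g. [bus_a, bus_b] to a dual-ported drive
        if not path.healthy:
            continue
        try:
            return path.submit(command)
        except IOError as err:         # connection failure on this bus or loop
            path.healthy = False
            last_error = err
    raise RuntimeError("no working path to device") from last_error
```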

Redundancy in Storage Controllers and Processors

Another potential point of failure is the storage controller. There are several ways multiple storage controllers can be implemented to achieve redundancy, including the use of multiple modular controller cards. For example, each controller card could have its own integrated external network port, run its own copy of the controller logic, and connect to other controller cards over a system bus, as shown in Figure 5-5.

Figure 5-5. Redundant Controllers Implemented on Multiple Controller Cards


When you are using multiple controller modules, and one of them fails, all storage access must be routed through one of the other controller modules. Obviously, the redundant module must be able to access the devices where the data is stored. More importantly, the redundant controller must be able to assume storage work that is already in progress.

Another approach for controller redundancy is to use multiple processors within a single controller design. In this case, all processors are capable of performing storage I/O work, and if one of the processors fails, the other processors are assigned its tasks.

NOTE

Storage subsystem controllers can be made with custom silicon or general-purpose (off-the-shelf) components. Although custom silicon is designed to deliver faster performance than general-purpose components, the redundant data protection is exactly the same for both. In other words, the redundant protection offered by disk mirroring or RAID does not depend on the cost of the controller doing the work, as long as the controller is functioning properly.

There are many differences between controller implementations, such as how they work with cache memory and how they connect to storage devices. The intrinsic nature of data redundancy on storage media, however, is independent of the controller implementation and its cost.


Redundant Cache Memory

I/O write data held in a subsystem's cache memory is at risk of being lost before it is written to nonvolatile storage on disk. Therefore, subsystem designs might include mirrored cache memory so that a failure or loss of power to one of the caches will not cause data to be lost. The mechanisms that redundant caches use to ensure data integrity during failure scenarios are advanced topics that are beyond the scope of this book.

Battery backup systems and subsystem control logic are used to ensure that data is not lost from cache memory should a problem interfere with writing cached data to storage. If power is lost or if the connection between cache memory and storage media should break, the data in the redundant cache would be used to write data to disk.

Redundant Network Ports

In general, subsystem network ports are host bus adapters (HBAs) with modifications for the requirements of subsystem operations. The HBAs used for subsystems are sometimes called Fibre adapters (FAs). They can differ considerably from host HBAs in the number of ports on a single board, the amount of memory, and available processing power.

Just like HBAs in host systems, the HBAs/FAs in subsystems can fail too. To counteract this, subsystem designs enable host/subsystem communications to be shifted to an alternate HBA/FA.

Redundant Power and Cooling

Like most computer and network products designed for high availability, storage subsystems typically have redundant power supplies to maintain uninterrupted power delivery to storage devices and all other components in the subsystem. These power supplies usually are designed to share the power load during normal operations and have the ability to provide consistent power levels during and after the failure of the other power supply.

Should a power loss occur, battery backup power enables the subsystem to continue processing I/Os from host systems. The basic idea is to provide a window of coverage that allows the subsystem to process I/Os for a limited period of time and flush data from cache memory to nonvolatile disk storage.

It's never a good idea to cut off power to a storage subsystem through anything other than a planned orderly shutdown process. Therefore, it is recommended that subsystems without integrated battery backup power be connected to suitable uninterruptible power supply (UPS) systems.

Most storage subsystems are also designed with redundant fans that can provide the necessary airflow to maintain acceptable operating temperatures in a subsystem should one of the fans fail. Sometimes fan failures are not taken as seriously as disk drive failures, because no data is stored in a fan. However, devices that run hot are more likely to fail than those that run within the vendor's recommended specifications.

Capacity Scalability

Another important benefit of device aggregation is the increased scalability of block storage address spaces. By combining the storage address spaces of multiple devices, it is possible to create virtual devices that are many times larger than a single storage device and have the redundancy advantages discussed previously. Large subsystems can contain over 1000 drives with a total aggregate capacity larger than 150 TB. This gives administrators the ability to create multiple large virtual storage devices for a number of applications and systems while maintaining spare storage capacity that can be used where and when it is needed.

Capacity Scaling with Striping

The most common way to aggregate devices to achieve higher storage capacities uses a technique called striping. Striping writes and reads data across multiple storage devices in rapid succession. The storage capacities of the individual devices are aggregated to form a larger storage address space.

A simple analogy for data striping is a paint sprayer that sprays paint across multiple boards lined up in sequence. As the sprayer moves across the boards, a small amount of paint is deposited on each.

Unlike spray painting, the process of striping data is precise. Data striping segments data into regularly sized stripes, which are then transmitted to multiple storage devices in rapid succession.
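
The address arithmetic behind striping can be sketched as follows; the stripe size and drive count are assumptions for illustration, and real implementations track additional metadata:

```python
def map_striped_lba(lba: int, drives: int, stripe_blocks: int):
    """Map a virtual LBA to (drive index, LBA on that drive) for simple striping."""
    stripe_number = lba // stripe_blocks          # which stripe the block falls in
    offset_in_stripe = lba % stripe_blocks
    drive = stripe_number % drives                # stripes rotate across the drives
    drive_lba = (stripe_number // drives) * stripe_blocks + offset_in_stripe
    return drive, drive_lba

# Example: block 10 of a 4-drive virtual device with 4-block stripes
# lands on drive 2, block 2 of that drive.
print(map_striped_lba(10, drives=4, stripe_blocks=4))
```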

Similar to mirroring and RAID, which were discussed earlier in this chapter, disk subsystem controllers receive storage commands from a host system and then parcel them out to individual devices. The file system or database that is responsible for using the storage is not aware that the subsystem is aggregating data by striping it. Figure 5-6 shows four disk drives aggregated by striping to form a virtual device with a larger storage address space.

Figure 5-6. Four Devices Aggregated by Striping to Form a Larger Virtual Device


In practice, striping is almost always accomplished through RAID technology to achieve the protection of data redundancy.

Capacity Scaling with Concatenation

Another way to scale storage capacity by aggregating devices is called concatenation. In a nutshell, concatenation fills each device, more or less in sequence. In other words, data is not subdivided into small pieces and scattered across multiple devices, as it is with striping; instead it is written together on a single device. Concatenation is not nearly as commonly used as striped RAID because it lacks the redundancy and performance advantages of striped RAID.
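
For comparison with the striping sketch above, a minimal sketch of concatenation maps a virtual block address by walking a running capacity offset; the device sizes shown are arbitrary:

```python
def map_concatenated_lba(lba: int, device_blocks: list):
    """Map a virtual LBA to (device index, LBA on that device) for concatenation."""
    start = 0
    for index, size in enumerate(device_blocks):
        if lba < start + size:
            return index, lba - start       # the block lands wholly on one device
        start += size
    raise ValueError("LBA beyond the concatenated address space")

# Three devices of 1000, 2000, and 1500 blocks: LBA 2500 falls on device 1.
print(map_concatenated_lba(2500, [1000, 2000, 1500]))
```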

Performance Advantages of Subsystems

The electromechanical nature of storage devices limits their performance to orders of magnitude below that of microprocessors and memory devices, thereby creating I/O bottlenecks. Server systems and the clients that connect to them can be particularly affected. Subsystems address the performance limitations of storage devices through three different mechanisms:

  • Parallelism through overlapped I/Os on multiple devices using data striping and mirroring

  • Disk caching with high-speed memory

  • Write-back caches

Parallelism and Overlapped I/Os

Because disk drives are inherently much slower than processors and memory devices, increasing storage I/O performance depends primarily on the ability to perform storage work in parallel on multiple devices. Parallelism with storage depends on the ability to use overlapped I/Os on multiple devices.

Overlapping is a technique where multiple commands and data transfers are distributed on multiple devices simultaneously. Figure 5-3, which shows RAID operations between a subsystem controller and four disk drives, also shows the basic concept of overlapped I/Os. The amount of overlapping and parallelism achieved depends to a large degree on the type of device interconnect used in the subsystem. The topic of device interconnects is the subject of Chapter 7.

Overlapped I/Os are accomplished through two different approaches: striping and mirroring. Striping naturally provides the mechanism for parallel operations. The subsystem controller initiates buffer-to-buffer data transmissions with each drive, starting multiple I/O operations on multiple drives. Each drive performs the command and signals the controller when it is done or has data to transmit. The controller aggregates the responses from all drives before completing the operation with the host system. For read operations, this means assembling the data in order and transmitting it to the host. For write operations, it means acknowledging the successful completion of the host's write command.
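
A minimal sketch of overlapping using standard Python concurrency (the drive objects and their read_stripe method are hypothetical): the controller starts transfers on all member drives at once, then gathers and orders the results before completing the operation with the host.

```python
from concurrent.futures import ThreadPoolExecutor

def overlapped_read(drives, stripe_number):
    """Issue reads to all member drives at once and assemble the results in order."""
    with ThreadPoolExecutor(max_workers=len(drives)) as pool:
        futures = [pool.submit(d.read_stripe, stripe_number) for d in drives]
        # Each drive signals completion independently; the controller reassembles
        # the segments in drive order before responding to the host.
        return b"".join(f.result() for f in futures)
```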

Systems with requirements for high I/O rates often have storage configurations that stripe data over a large number of disk drives. Large database systems, in fact, might stripe data over hundreds of disk drives configured as several virtual devices.

Mirroring, which is illustrated in Figure 5-2, also provides parallelism for read operations. The subsystem controller can split read requests and send them to different drives. By distributing read requests on the two different drives in a mirror, read performance can be effectively doubled.

Mirroring does not provide parallelism for write I/Os because mirroring needs to create duplicate data on both disks in a mirror. Therefore, all writes must be written to both drives.
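
The read/write asymmetry can be sketched as a simple scheduling policy; alternating reads between the two copies is an assumption here, and actual controllers may pick a drive based on queue depth or head position:

```python
import itertools

class MirrorScheduler:
    """Alternate read requests across the two drives of a mirror."""

    def __init__(self, drive_a, drive_b):
        self.drives = (drive_a, drive_b)
        self._next = itertools.cycle(self.drives)

    def read(self, lba):
        return next(self._next).read(lba)    # distributing reads roughly doubles throughput

    def write(self, lba, data):
        for drive in self.drives:            # no write parallelism: both copies must be updated
            drive.write(lba, data)
```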

Caching and I/O Performance

Another common technology for improving storage I/O performance is disk caching. The basic idea of disk caching is simple: data that will probably be needed in the near future is stored in memory by the subsystem controller, where it can be retrieved quickly (within several nanoseconds), as opposed to taking several milliseconds to retrieve from disk. Whereas disk striping and mirroring cannot overcome the relatively lengthy delays caused by disk seeks and rotational latency, disk caching can overcome the delays, because it circumvents disk accesses altogether.

The general architecture for a caching controller includes a high-speed memory bus and RAM. An index is created in memory that maps to sections of the storage address space that the cache is associated with. When data from a specific storage location is loaded in the cache, the memory index is updated to indicate that the data is in cache.

All storage I/O requests are first processed by the caching process in the controller that checks the index. If the data is in cache, it is accessed there instead of on disk. The term cache hit refers to the occurrence of data being read from cache, and the term cache miss refers to the occurrence of data having to be read from a disk drive. Figure 5-7 illustrates the operation of a disk cache, including a cache hit and cache miss.

Figure 5-7. Comparing a Cache Hit and a Cache Miss


The effectiveness of disk caching depends to a large degree on the predictability of the I/O operations issued by the application that is reading and writing data. In general, caching is more effective for applications that read and write to certain blocks frequently. Applications that read and write large amounts of data sequentially, such as multimedia streaming and data warehousing, are not likely to benefit as much from disk caching.
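
A toy sketch of the hit/miss check described above, assuming a block-granular dictionary index; real caches use more elaborate index structures and replacement policies:

```python
class ReadCache:
    """Toy cache index: block address -> cached data."""

    def __init__(self, backing_disk):
        self.disk = backing_disk
        self.index = {}

    def read(self, block):
        if block in self.index:            # cache hit: served from memory
            return self.index[block]
        data = self.disk.read(block)       # cache miss: several milliseconds to reach disk
        self.index[block] = data           # update the index for future requests
        return data
```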

Write-Through and Write-Back Caches

While caches can provide valuable performance benefits, they might introduce a risk of losing data if power is lost or the subsystem fails before the data in cache is written to permanent storage on disk.

Write-through caches eliminate the risk by always writing data to disk before acknowledging completion of the I/O operation. The write-through cache might write data into cache memory, but it also writes it to disk. The term write-through is used because the data is written through the cache function all the way to the disk. The I/O path to disk is not altered by the caching function. Some subsystems place the cache in the data path between the controller and disk drives. In that case, their write operations are always of the write-through variety.

Write-back caches are optimized for performance. They acknowledge the completion of a write operation after the data is written to cache and before it is written to permanent storage on disk. The term write-back refers to the fact that data is first written to cache memory and later written back to disk.

Obviously, it is important to have battery backup power for cache memory if the write-back cache is being used; otherwise, data could easily be lost during a power outage. Some subsystems have integrated battery backup within the enclosure or controller, and others might rely on external UPS systems.
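
The two policies can be contrasted with a short sketch (interfaces are hypothetical; real implementations also mirror the cache and coordinate battery-protected flushes):

```python
class WriteThroughCache:
    """Acknowledge only after the data is on disk."""
    def __init__(self, disk):
        self.disk, self.cache = disk, {}

    def write(self, block, data):
        self.cache[block] = data
        self.disk.write(block, data)     # data reaches disk before the host sees "done"
        return "acknowledged"


class WriteBackCache:
    """Acknowledge as soon as the data is in cache; write it back later."""
    def __init__(self, disk):
        self.disk, self.cache, self.dirty = disk, {}, set()

    def write(self, block, data):
        self.cache[block] = data
        self.dirty.add(block)            # not yet on disk: battery protection matters here
        return "acknowledged"

    def flush(self):
        for block in list(self.dirty):   # e.g. triggered on battery power after an outage
            self.disk.write(block, self.cache[block])
        self.dirty.clear()
```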

NOTE

The memory used in disk subsystems for caching is some of the most expensive memory sold. Therefore, it's a good idea to understand whether or not the applications you are buying cache for will benefit from the cache.


Modular Enclosure Designs

Storage subsystems facilitate the management of large numbers of storage devices by incorporating modular designs for fast servicing and manageability of all components. Storage devices, especially high-performance devices, use a lot of power and create a lot of heat. Storage subsystem enclosures are engineered to provide an environment that meets the power and cooling requirements of a large number of devices.

Hot Swapping

Most storage subsystems today are designed to allow the servicing of redundant components. Hot swapping refers to the ability to remove and replace (hot swap) components without having to shut down the subsystem.

The technology that supports hot swapping includes the following:

  • Component monitoring and analysis to identify devices that might be operating near or past their limits or have failed altogether. This includes relatively unintelligent devices, such as cooling fans, and also includes more intelligent devices, such as disk drives.

  • Management interfaces for alerts and management, including various forms of administrator notification, logical indicators within management consoles, and physical indicators, such as lights or audio signals.

  • Logical and algorithm adjustments that are needed to maintain data integrity. For example, when disk drives fail, the subsystem controller has to adjust its RAID or mirroring operations to operate in degraded or reduced mode without the missing device. This topic is discussed in much more detail in Chapter 9.

  • Physical cabling and connectors that facilitate removing and inserting components. Electrical circuits and connection buses/networks need to be able to operate without interruption while components are removed and inserted.

NOTE

This matter of removing and inserting components transparently can be surprisingly convoluted and difficult. FC loop technology provides an example of how getting this wrong more or less sank a whole segment of the industry. Devices inserted or removed from FC loops force the devices on the loop to reestablish their prioritization and addresses. As it turned out, this process did not work nearly as well as expected and, in some instances, caused the complete failure of the SAN. This completely unacceptable situation drove the market toward switched topology networks (referred to as fabrics in the FC community). Today FC loops are mostly limited to internal subsystem connections in which the subsystem manufacturer can completely control the environment.


Not all subsystems support hot swapping. Instead, they might support the removal and replacement of components only if all I/O activities are stopped. This is known as warm swapping. It is obviously not quite as convenient as hot swapping, because it might force administrators to shut down applications to ensure that no I/O activities are taking place.

Hot Spares

In addition to device redundancy and hot swapping, many subsystems also provide hot-spare devices that are waiting and ready to step in when a production device fails.

Hot-spare devices are connected to the same power circuits and subsystem controllers as the other disk drives in the subsystem. They are powered on and identified as usable devices by the subsystem's controllers, but they are not used for storage until another device fails.

In the case of disk subsystems, the subsystem controller recognizes the disk failure and ceases operations with the failed disk. At that point, any mirrors and RAID arrays that were defined on the failed drive are operating in degraded or reduced mode. The controller then begins to activate the hot spare by recreating the same partitions on the hot spare that had been defined for the failed drive. After the partitions are recreated, they are ready to be repopulated with data from the other partitions on functioning drives.

The data repopulation process, also known as a data rebuild, can be started at any time. Most subsystems give administrators a choice between starting data rebuilds immediately when the hot spare is ready or at a later time chosen by the administrator. Chapter 9 discusses the data rebuild process in greater detail.
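
A hedged sketch of the sequence just described; the array, drive, and rebuild interfaces here are illustrative assumptions, not an actual subsystem API:

```python
def handle_drive_failure(array, failed_drive, hot_spares, start_rebuild_now=True):
    """Illustrative hot-spare activation flow for a redundant array."""
    array.remove(failed_drive)                 # cease operations with the failed disk
    array.mode = "degraded"                    # mirrors/RAID arrays run in reduced mode
    if not hot_spares:
        return array
    spare = hot_spares.pop(0)                  # already powered on and known to the controller
    spare.partitions = list(failed_drive.partitions)   # recreate the failed drive's layout
    array.add(spare)
    if start_rebuild_now:                      # or defer to an administrator-chosen time
        array.rebuild(onto=spare)
    return array
```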

Storage Enclosure Services

The storage industry has created a standard for management of the environmental characteristics of storage subsystems called SCSI Enclosure Services (SES). SES is a fairly large specification with many different parts. Because it is viewed primarily as an optional specification, vendors have implemented different aspects of this extensive standard. Although SES does not impact the interoperability of a storage subsystem in performing its primary functions, the subsystem's ability to be managed by third-party management software might be related to the extent to which it supports SES.

Modular Cabinet Components

Some storage subsystems are designed for modular configuration and scalability. This can include external connectors and cabling intended to expand the storage capacity of the subsystem with devices located in an expansion cabinet, as shown in Figure 5-8.

Figure 5-8. Single Subsystem Connected to an Expansion Cabinet to Increase Storage Capacity


A subsystem controller in an existing subsystem enclosure would typically manage the devices inside the expansion cabinet. However, it is certainly possible to add additional controllers in expansion cabinets to achieve performance scaling as well as high availability features. Such designs would be much more complicated than a simple expansion cabinet and would likely require clustering technologies and inter-controller traffic in addition to storage I/O traffic.


