Mirroring is Not a Silver Bullet


Remote disk mirroring provides a mechanism for instantaneous data recovery, according to advocates. A simplistic configuration entails the use of two storage platforms connected by a wide area network link and placed at some geographical distance from each other. In operation, should disk platform A in the production environment become compromised, then applications and end-users "fail over" to a backup disk platform B at a remote location, which contains a current copy of the data in platform A. Information processing continues unabated.

Part of the high cost of mirroring is that it typically entails more than simply the deployment of two identical arrays. Within each array, or at least inside the primary array, most vendors recommend the use of "mirror-splitting," [4] which, depending on how the strategy is implemented, can raise the price of an array to several times that of the nonmirrored configuration.

Mirror-splits are created by synchronizing the data on one set of disk drives inside the array with another set inside the same array (that is, creating a synchronous or symmetrical mirror pair), then periodically removing the mirrored set from service (i.e., "breaking it off") and substituting a second synchronized mirror set in its place (see Figure 9-10).

Figure 9-10. Mirror-splits and replication.


It is important not to oversimplify this process, which requires a bit of magic to do properly. At the block-device level, data mirroring involves synchronously copying changes made at one storage volume (source) to another volume (target). From the host or application perspective, no write is considered complete until the changes have been applied to all of the mirrors as well as the original. Mirrors may be created within a single storage device or, if the application's architecture allows, between physically separate devices. When a mirror target device is broken away or "split" from the original, the target device becomes a static, point-in-time (PIT) copy of the source.
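
To make the write path concrete, here is a minimal Python sketch of synchronous mirroring and a split. The Volume and MirroredVolume classes are hypothetical constructs invented for this illustration, not any vendor's API; real arrays implement this in firmware at the block level.

    # Toy model: a write is acknowledged only after it reaches the source
    # and every mirror target; a split freezes a point-in-time (PIT) copy.

    class Volume:
        """A toy block device: a dict mapping block number -> bytes."""
        def __init__(self):
            self.blocks = {}

        def write(self, block_no, data):
            self.blocks[block_no] = data

    class MirroredVolume:
        def __init__(self, source, targets):
            self.source = source
            self.targets = list(targets)

        def write(self, block_no, data):
            self.source.write(block_no, data)
            for target in self.targets:       # no acknowledgment until every
                target.write(block_no, data)  # mirror has the change
            return True                       # only now is the write "complete"

        def split(self, target):
            """Break a target away; it becomes a static PIT copy."""
            self.targets.remove(target)
            return target

    primary, mirror = Volume(), Volume()
    mv = MirroredVolume(primary, [mirror])
    mv.write(0, b"record v1")
    pit_copy = mv.split(mirror)
    mv.write(0, b"record v2")   # pit_copy still holds "record v1"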

This is where things get tricky. Depending on the steps taken to quiesce the application or file system at the moment before the mirror is split, the PIT copy will have a state of "coherency" relative to the application, the file system, or the block device level. So, integration of the mirror-splitting process with the application is key to determining the level of data consistency found in the resulting PIT copy. States of coherency range from "transactionally consistent," meaning that the resulting copy represents a PIT copy of all user transactions completed up to the moment of the split, to "crash consistent," meaning that the copy looks pretty much like what would exist if someone had simply pulled the plug on the application server. With crash-consistent PIT copies, some undetermined number of user transactions may be incomplete or lost.
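
The operational difference between the two states can be shown as a sequence. The sketch below, in the same toy Python style, shows one plausible quiesce-split-resume procedure; db.quiesce(), fs.freeze(), and the other hooks are hypothetical placeholders for whatever the actual database, file system, and array software expose.

    def consistent_split(db, fs, mv, target):
        """Quiesce top-down, split, then resume bottom-up."""
        db.quiesce()        # finish in-flight transactions, hold new ones
        try:
            fs.freeze()     # flush dirty buffers so on-disk state is coherent
            try:
                return mv.split(target)  # transactionally consistent PIT copy
            finally:
                fs.thaw()
        finally:
            db.resume()

    # Calling mv.split(target) without the quiesce steps yields only a
    # crash-consistent copy: whatever happened to be on disk at that instant.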

Performed properly, this process provides a safeguard against certain types of internal array failures and delivers instant access to the last version of the data saved at the time of the split. This process is usually replicated to some degree on each external mirror array.

The intention of implementing multiple mirror-splits is to speed the recovery effort and minimize the amount of real-time data lost in the event of an outage. A disk-based data protection strategy, using mirror-splits and replication, provides considerable improvement over tapes in terms of recovery speed.

Such solutions may be intriguing from a risk reduction standpoint, but the costs are enormous because an extra full set of disks is needed for each mirror-split, and additional disks may be required for local and remote replication, as well. In this strategy, for every terabyte of storage that is used to support a host application, nine or more terabytes of additional disk capacity are required to support mirror-splits (assuming a replication interval of every six hours) and split replication on other arrays.
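
One plausible accounting of that multiplier, assuming the four mirror-splits of Figure 9-10 are maintained both locally and on a remote replication array (the breakdown is an illustration, not a vendor formula):

    Per production terabyte:
       4 TB   local mirror-splits (4 splits x 1 TB)
       1 TB   replicated copy on the remote array
       4 TB   mirror-splits on the remote array
      -----
       9 TB   additional capacity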

In addition to these high costs, mirror-split and replication solutions also have the following major limitations:

  • Four mirror-splits taken six hours apart (see Figure 9-10) only provide 24 hours of online protection. Older data must still be retrieved from tape, which lengthens restore times.

  • The data in the mirror-splits can be hours old. A corruption event at 11:00 A.M. requires going back to the 6:00 A.M. mirror-split, possibly resulting in five hours of lost data. The only way to reduce this exposure is either to reduce the amount of online protection (e.g., four mirror-splits, one hour apart) or to increase the total number of online mirror-splits, which rapidly increases storage capacity requirements and costs (see the sketch following this list).

  • Restoration can still require hours to accomplish. While the mirror-split is available instantly, it must first be copied to primary disk before it can be used because it is vital that this mirror-split not be damaged. Additionally, the physical mirror does not necessarily align with the logical layout of files or databases (see crash consistency above), so often it is necessary to piece together data elements to effect a recovery.

  • If the chosen mirror-split contains data modified after a corruption event has occurred, then additional time may be required to identify errors and test other sources of accurate data.

  • Mirror-splitting technology typically locks the customer into a single vendor solution.
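
The trade-off among protection window, data loss exposure, and capacity described in the bullets above can be made explicit with a small illustration. This toy Python calculation assumes one full disk set per retained split and treats the split interval as the worst-case data loss; the figures are illustrative only.

    # Protection window vs. data loss vs. capacity for mirror-splits.
    # Assumes one full disk set per retained split (illustrative only).

    def split_economics(num_splits, interval_hours, primary_tb):
        window = num_splits * interval_hours   # oldest PIT copy still online
        worst_loss = interval_hours            # data written since last split
        extra_capacity = num_splits * primary_tb
        return window, worst_loss, extra_capacity

    for n, t in ((4, 6), (4, 1), (24, 1)):
        w, loss, tb = split_economics(n, t, primary_tb=1)
        print(f"{n} splits every {t} h: {w} h window, "
              f"up to {loss} h lost, {tb} TB extra per production TB")

    # 4 splits/6 h: 24 h window but up to 6 h of lost transactions.
    # 4 splits/1 h: exposure drops to 1 h, but the window shrinks to 4 h.
    # 24 splits/1 h: 24 h window and 1 h exposure, at 24 TB per production TB.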

New approaches and products, such as Time Addressable Storage (TAS) from Revivio, are helping to reduce the need for multiple mirror-splits and to optimize mirroring hardware requirements (see the discussion of TAS later in this chapter). However, mirroring in general has some additional potential drawbacks that should be considered.

  • Application Latency/Data Concurrency: All mirroring strategies introduce some latency into application operations, even when primary and mirror arrays are in close proximity to one another. Applications must be suspended (or their I/O cached and queued, which introduces another set of potential problems) while data is written to the primary, then the mirrored, array. This problem is exacerbated in typical disaster prevention mirroring topologies in which the mirror platform must be placed at a geographically remote location to prevent it from falling prey to the same disaster that impacts the local production array. Simple physics imposes a time delay on the transfer of data to the mirror array (signals cannot propagate faster than the speed of light, as summarized in Table 9-2), and this delay will result in either a slowing of application performance while mirrored writes are being made, or a lack of concurrency between the data on the primary and on the mirror (called a "delta" in the industry). The size of the delta, that is, the difference between data on the primary and mirror platforms, impacts the efficacy of mirroring as a data protection strategy (a back-of-the-envelope delay calculation follows this list). A number of hardware and software-based "multitargeting" products exist in the market today that might aid in offsetting latency introduced by mirroring, as discussed below.

  • Expense: Cost has always been a gating factor in mirror strategy adoption. While bandwidth costs are said to be dropping because of plentiful bandwidth in and around most major metropolitan areas, placing a primary or mirrored array in a location where network bandwidth is not so plentiful or cheap (corporate data centers are often placed in off-the-beaten-path locales to protect against certain disaster potentials associated with urban centers) is a cost multiplier in mirror operations. Moreover, to surmount application latency issues, many vendors of mirroring solutions promote the idea of "multi-hop mirroring" (as discussed previously) in which they endeavor to address application latency through the use of a second data replication operation. While this approach may provide a wonderful way for vendors to sell multiple copies of hardware platforms, it has the effect of adding significant cost to the solution as well as introducing more data deltas (delta 1 is the discrepancy at any given time between data sets in the local primary and mirror, and delta 2 is the discrepancy between the local and the remote mirrors). Costs need to be considered within the context of outage potentials and their cost to the organization in terms of lost revenue, lost customer confidence, and potential legal or regulatory penalties and fines.

    Table 9-2. Signal Velocity and Propagation Delay through Various Transparent Media and Copper Wire

    Material              Propagation velocity                Index of      Velocity of
                          (fraction of speed of light         refraction    signal (km/s)
                          in a vacuum)
    ---------------------------------------------------------------------------------
    Optical fiber         0.68                                1.46          205,000
    Flint glass           0.58                                1.71          175,000
    Water                 0.75                                1.33          226,000
    Diamond               0.41                                2.45          122,000
    Air                   0.99971                             1.00029       299,890
    Copper wire (Cat 5)   0.77                                N/A           231,000

  • Vendor lock-in: Software utilities provided by vendors of hardware arrays to support mirroring (e.g., SRDF from EMC, XRF from IBM, etc.) primarily support mirror operations only between two (or more) platforms from the same vendor. While third-party software-based mirrors have begun to appear in the market (i.e., various volume level mirroring products from "virtualization" vendors) that may support cross-platform mirrors, these solutions 1) may not be supported by all hardware vendors and expose consumers to hardware warranty issues, 2) may be resource intensive if installed on application servers, 3) may be difficult to administer and maintain, or 4) may require an investment in new storage topology (such as a Fibre Channel fabric) for which the IT manager may not be able to develop a business justification.
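
To put the Table 9-2 figures in application terms, the back-of-the-envelope Python sketch below estimates the propagation delay a synchronous mirrored write incurs over an optical fiber link. The distances and the single-round-trip-per-write assumption are illustrative; real replication protocols may need several round trips and add switching and disk service time on top.

    # Propagation delay for a synchronous mirrored write over optical fiber,
    # using the signal velocity from Table 9-2 (205,000 km/s).

    FIBER_KM_PER_S = 205_000

    def round_trip_ms(distance_km, round_trips=1):
        """Wire delay alone: out to the mirror and back, per write."""
        one_way_s = distance_km / FIBER_KM_PER_S
        return 2 * one_way_s * round_trips * 1000

    for km in (10, 100, 1000):
        print(f"{km:>5} km: {round_trip_ms(km):.3f} ms added per synchronous write")

    # Roughly 0.1 ms at 10 km, 1 ms at 100 km, and 10 ms at 1,000 km; every
    # host write waits on this delay, or the mirror falls behind and a delta
    # accumulates between the primary and the remote copy.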

It should be added that mirroring, like other data copying schemes, does not protect against all threats to data integrity. Erroneous data, whether created by a software glitch, user input error, virus program, or other source of data corruption, is replicated across mirrors with the same speed and alacrity as good data. In this regard, disk- and tape-based backup are both vulnerable, though one could argue that the selective restoration possible with tape insulates against some threats that mirroring does not.

Of course, as many successful recoveries enabled by mirroring solutions demonstrate, the disk-to-disk data protection strategy can be a powerful one. Properly applied and carefully implemented, such a strategy offers the capability for short "time-to-data" recovery of access to mission-critical data.


