8.3 Data Replication

Tape backup has survived simply because it is the most readily available means of providing a copy of data on a stable medium. Data on tape cartridges can be archived, transported off site, and stored indefinitely. But although tape preserves the integrity of stored data, it is far too slow for regular data access, and it requires a lengthy restore process if data must be copied back to spinning media.

Ideally, an enterprise should have a synchronized copy of data available and should be able to access that copy immediately if the primary storage fails. This is the goal of data replication, which uses disk mirroring algorithms to duplicate data from one disk array to another.

Software-based data replication may require that the host perform write operations to two separate disk targets. Although this solution may be more economical than others, it incurs more overhead on the host and requires multiple data paths from the host to the intended targets. Disk-based data replication, by contrast, is transparent to the host system and offloads all duplication tasks to the disk array itself. In this case, the disk arrays must be both targets and initiators: they receive data to be written while also managing write operations to the secondary storage.
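The host-side overhead of software-based replication can be illustrated with a minimal sketch. The class and names below are hypothetical, not a vendor API; plain dictionaries stand in for the two disk targets.

```python
# Sketch of software-based replication: the host itself issues every
# write to both a primary and a secondary target. Illustrative only.

class MirroredVolume:
    """Host-side mirror: each write goes to both targets before returning."""

    def __init__(self, primary, secondary):
        self.primary = primary      # dict standing in for a disk target
        self.secondary = secondary

    def write(self, block, data):
        # The host bears the overhead: two I/Os per logical write,
        # carried over two separate data paths.
        self.primary[block] = data
        self.secondary[block] = data

    def read(self, block):
        # Reads are normally served from the primary.
        return self.primary[block]

primary, secondary = {}, {}
vol = MirroredVolume(primary, secondary)
vol.write(0, b"payroll record")
assert vol.read(0) == b"payroll record"
```

Disk-based replication moves exactly this doubling of work out of the host and into the array firmware, which is why it is transparent to the host system.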

Each storage vendor has developed its own data replication solution. For example, EMC's Symmetrix Remote Data Facility (SRDF), Hitachi Data Systems' TrueCopy, Hewlett-Packard's (formerly Compaq's) Data Replication Manager, and XIOtech's REDI-SANlinks all provide a means to perform disk-to-disk data replication for disaster recovery and other business continuance applications. Data replication normally implies distance, with primary and secondary storage arrays separated by at least metropolitan-area distances. Consequently, data replication must define how data mirroring will be accomplished at the array level and how wide areas will be spanned.

As shown in Figure 8-4, a data replication configuration has primary and secondary storage arrays. In the case of EMC SRDF, an active-passive configuration requires data written to the primary array to be synchronously written to the secondary array. In the event of failure of the primary, the secondary can be accessed directly. For companies with multiple data centers or sites, however, an active-active configuration enables each site to serve as both primary for local access and secondary for another site. Regional centers can thus serve as mutual data replication sites for each other, ensuring that a readily accessible copy of each site's data is always available.
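The active-active arrangement can be sketched in a few lines. The site names and data structures below are illustrative assumptions, not drawn from any vendor's implementation: each site is primary for its own data and secondary for its partner's.

```python
# Sketch of active-active replication between two regional sites.
# Each site holds its own data (primary role) plus a replica of its
# partner's data (secondary role). Hypothetical names throughout.

class Site:
    def __init__(self, name):
        self.name = name
        self.local = {}     # data this site owns (primary role)
        self.replica = {}   # partner's data (secondary role)
        self.partner = None

    def write(self, block, data):
        self.local[block] = data
        # Mirror to the partner, which acts as this site's secondary.
        self.partner.replica[block] = data

east, west = Site("east"), Site("west")
east.partner, west.partner = west, east

east.write("E1", b"east data")
west.write("W1", b"west data")
# If east fails, its data remains readable at west, and vice versa.
assert west.replica["E1"] == b"east data"
assert east.replica["W1"] == b"west data"
```

An active-passive configuration is the degenerate case in which only one site ever plays the primary role.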

Figure 8-4. Active-passive (top) and active-active data replication


Data replication can be performed with synchronous or asynchronous updates between primary and secondary storage. In synchronous mode, the write operation is not final until both arrays have signaled write completion. This guarantees that an exact copy of the data exists on both arrays, although at a performance penalty: the primary storage must always wait until the secondary has finished before reporting a successful write to the host. In asynchronous mode, the primary array can buffer writes intended for the secondary and initiate them only during idle periods. This improves performance but may result in loss of a true copy if the write to the secondary subsequently fails. The primary array would then be forced to break the mirror to the secondary, and possibly track changes to the data until the secondary recovers.
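The two update modes can be contrasted in a short sketch. The classes below are hypothetical stand-ins for array firmware behavior; real arrays implement these semantics internally.

```python
# Sketch contrasting synchronous and asynchronous replication modes.
# Illustrative classes only, not a vendor interface.

class SyncReplica:
    """Synchronous: write completes only after both arrays acknowledge."""
    def __init__(self):
        self.primary, self.secondary = {}, {}

    def write(self, block, data):
        self.primary[block] = data
        self.secondary[block] = data   # host waits for this to finish
        return "complete"              # reported to the host only now

class AsyncReplica:
    """Asynchronous: primary acknowledges immediately, buffers the rest."""
    def __init__(self):
        self.primary, self.secondary = {}, {}
        self.pending = []              # writes not yet on the secondary

    def write(self, block, data):
        self.primary[block] = data
        self.pending.append((block, data))  # applied during idle periods
        return "complete"              # host sees completion early

    def flush(self):
        # If a buffered write fails here, the mirror must be broken
        # and subsequent changes tracked until the secondary recovers.
        while self.pending:
            block, data = self.pending.pop(0)
            self.secondary[block] = data
```

The asynchronous window between `write` and `flush` is exactly where a true copy can be lost if the secondary fails.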

Data replication can be performed within the local data center to provide a current and readily accessible copy of data, or it can be extended over distance to facilitate disaster recovery and business continuance scenarios. In both synchronous and asynchronous implementations, the stability of the link between primary and backup disk arrays is critical, as is the latency that naturally occurs over very long-haul links. If a primary array, for example, must wait some tens of milliseconds before a synchronous write is completed, the response time between the host and the primary will suffer accordingly. At some point, the latency might be too great for synchronous replication over very long distances. Exactly where the edges of this envelope lie can be determined only in practice.
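A rough calculation shows why distance bounds synchronous replication. The sketch assumes the commonly cited propagation delay of about 5 microseconds per kilometer one way in optical fiber and an illustrative local write time; real links add switch, gateway, and protocol overhead on top of this floor.

```python
# Back-of-the-envelope estimate of the synchronous write penalty over
# distance. The 5 us/km figure is fiber propagation delay only; all
# other parameters here are illustrative assumptions.

US_PER_KM_ONE_WAY = 5.0

def sync_write_penalty_ms(distance_km, local_write_ms=0.5):
    # A synchronous write cannot complete until the secondary's
    # acknowledgment returns, so the full round trip is added to
    # every single write.
    round_trip_ms = 2 * distance_km * US_PER_KM_ONE_WAY / 1000.0
    return local_write_ms + round_trip_ms

for km in (10, 100, 1000):
    print(f"{km:>5} km -> {sync_write_penalty_ms(km):.2f} ms per write")
```

Under these assumptions, propagation alone adds about 1 ms of round-trip delay per 100 km, which is why asynchronous mode is usually preferred once distances grow very large.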

Another practical caveat is posed by the wide area services and protocols used to link primary and secondary sites. Some data replication applications, for example, were written specifically for Fibre Channel SANs and expect to leverage Fibre Channel Protocol (FCP) in the disk-to-disk storage transaction. Converting from Fibre Channel to iSCSI for the wide area link and then back to Fibre Channel at the secondary site loses too much of the FCP layer mapping and so is unworkable. Simple FCIP tunneling may be suitable for limited point-to-point extension, but it typically requires a Fibre Channel switch at each end to serve as a front end to the storage arrays. The iFCP gateway protocol enables multipoint data replication while avoiding the need for additional Fibre Channel switches for storage attachment.



Designing Storage Area Networks: A Practical Reference for Implementing Fibre Channel and IP SANs (2nd Edition)
ISBN: 0321136500
Year: 2003
Pages: 171
Authors: Tom Clark
