Fundamentals of Remote Copy


This section presents an overview of the common principles used in remote copy applications. Like many things in storage, the concept seems very simple, but the implementation is much more complicated and must be done with the highest attention to detail.

Basic Architecture of Remote Copy Applications

Remote copy applications store data temporarily from write I/O operations and forward it to another storage subsystem at the same or another site. Specialized storage controllers are used to forward and receive the remote copy data. We will use the terms forwarding controller and receiving controller in this chapter to indicate the roles these controllers assume in the remote copy system.

A network of some type is used to forward the data from one storage location to another. This intermediary network can be virtually any type of network, such as TCP/IP/Ethernet, SONET, DWDM, or ATM. The remote copy application running in the remote copy controllers is responsible for conducting the data transfers across this network. This basic architecture is shown in Figure 10-1.

Figure 10-1. Basic Architecture of a Store-and-Forward Remote Copy Application
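To make the store-and-forward flow concrete, here is a minimal Python sketch of a hypothetical forwarding controller that acknowledges writes against primary storage, stages them temporarily, and forwards them to a receiving controller at the remote site. The class and method names are illustrative assumptions, not the API of any actual product.

from collections import deque

class ForwardingController:
    """Illustrative store-and-forward model: writes are acknowledged locally,
    staged in arrival order, and later forwarded to the receiving controller."""

    def __init__(self, receiver):
        self.receiver = receiver        # receiving controller at the remote site
        self.pending = deque()          # temporary store for unforwarded writes
        self.sequence = 0               # running count of writes, in write order

    def write(self, lba, data):
        # Commit to primary storage first (omitted here), then stage the copy.
        self.sequence += 1
        self.pending.append((self.sequence, lba, data))
        return "ack"                    # host I/O completes against primary storage

    def forward(self):
        # Drain the staging queue across the intermediary network.
        while self.pending:
            self.receiver.receive(*self.pending.popleft())

class ReceivingController:
    def __init__(self):
        self.secondary = {}             # stand-in for secondary storage

    def receive(self, seq, lba, data):
        self.secondary[lba] = data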


Remote Copy Sites and Storage Hierarchies

Remote copy implies a hierarchy among storage subsystems and sites. There are many ways businesses structure their remote copy storage hierarchies, built from these four generic building blocks:

  • Primary storage and sites

  • Secondary storage and sites

  • Tertiary storage and sites

  • Bunker storage and sites

Primary Storage and Sites

In short, applications read and write data on primary storage. The role of primary storage is to support the data access needs of the company's production data center. It's also useful to talk about primary storage sites, where equipment, data, and staff are centralized. Primary storage can be implemented on many different types of storage products, including SAN storage subsystems and network attached storage (NAS) storage servers.

Small- to medium-sized companies tend to have a single primary site, while larger companies may have several primary sites. Primary in this case does not indicate a priority among data centers, but instead indicates that a significant number of applications are using primary storage.

Secondary Storage and Sites

Secondary storage is the first line of defense against a disaster or failure that strikes primary storage or storage sites. Remote copy applications copy data from primary storage to secondary storage. A secondary storage subsystem could be at the primary site or at a secondary site. The redundancy can be local or at some distance.

If a disaster occurs at a primary site, one or more secondary sites may be used to continue processing. Replacement servers are usually located or are quickly available at a secondary site in order to resume normal application processing as soon as possible.

Secondary sites may be geographically near or far from primary sites, depending on business variables and corporate risk/business continuity strategies. For instance, a secondary site may be across campus, across town, across the state, or located in another state hundreds of miles away. Companies with multiple data processing sites often use them as both primary and secondary sites, providing disaster protection within the broad corporate computing system. For instance, a company with two primary sites A and B could use each site as a secondary site for the other; A would be the secondary site for B, and vice versa.

Tertiary Storage and Sites

Tertiary sites and tertiary storage provide additional redundancy options for business continuity in case both the primary and secondary storage and/or sites experience a disaster. Tertiary storage could be located at a secondary site (if secondary storage is at the primary site) or at a separate tertiary site. Remote copy applications can copy data from either primary sites or secondary sites to tertiary sites.

Tertiary sites are usually not stocked with the same level of equipment as secondary sites. For instance, there might not be as much computing equipment available, and there are usually different assumptions about how quickly different applications can resume operating. Tertiary sites are almost always a meaningful distance from primary sites in order to be outside the geographic range of a single catastrophic event that could ruin a primary site. For instance, a tertiary site is usually not within the same flood plain or seismic fault line as the primary site it supports.

Bunker Sites

Bunker sites are located in the local vicinity of primary sites, in facilities that are specifically established to support the remote copy mission. In other words, they typically do not have application or server systems, or adequate facilities for people to work. Their sole mission is to function as a stepping-stone in the redundancy hierarchy.

Bunker sites typically house secondary storage subsystems that are accessible to primary storage over high-speed links spanning short distances. This provides the best opportunity to capture data off-site, on disk with optimal data integrity (consistency). Applications at the primary site can run with a minimum of delay imposed by the remote copy application.

NOTE

There probably wouldn't be a need for tertiary sites and tertiary storage if there were not a need to put secondary sites, such as bunker sites, close to primary sites, where they could be affected by the same disaster that wipes out the primary site. So we end up having these interesting discussions where secondary storage could be either local or remote and where remote storage could be either secondary or tertiary. At this point the rationale might not make much sense to you, so you'll need to read further, to the section "Performance Implications of Remote Copy."


Objectives of Remote Copy Applications

The objectives of remote copy applications are simple to state but considerably more difficult to meet. Remote site storage and data should be

  • Immediately available for online use supporting ongoing business operations, including systems management functions like backup.

  • Consistent. In other words, it should have complete data integrity without errors injected by the remote copy process.

  • Capable of resuming normal operations at the primary site as quickly as possible.

Immediate Availability to Support Ongoing Operations

The word "immediate" has different meanings to different-sized businesses and organizations. Perhaps the words "practically immediate" would be more accurate. For many companies and transaction-processing applications, it may not be possible to have data copied to a geographically remote site in real time, meaning local and remote copies of data are not synchronized. The issue of data consistency is a primary element of how immediately available remote data may be. This topic is discussed later in this chapter in the section "Synchronous, Asynchronous, and Semisynchronous Operating Modes."

It is assumed that complete replacement data may need to be made operational following a disaster. This does not mean that the replacement data center will be a replica of the original data center and do all the data processing of the primary site. Instead, the remote data center will quickly be able to replace the functions of the highest-priority applications that were running at the primary site.

Remote storage equipment needs to have all the necessary connecting equipment readily available to establish connections to replacement systems. Servers that use secondary or tertiary storage might be at another nearby building location. If so, it is essential that connectivity between those servers and storage be available as quickly as possible.

Secondary sites and facilities also need to support the "normal" operations and systems management functions that are part of responsible systems management. There is very little that is "normal" about operations following a disaster, but it is important to maintain best practices for backup and recovery. Companies that have multiple levels in their remote copy hierarchies will probably want to continue running remote copy applications by copying data to another secondary or tertiary site.

Data Integrity, Consistency, and Atomicity

The term data integrity for storage means that stored data has not been altered in any way after it has been changed or created by the application that processes it. In other words, data in storage is what the application intended to write. As it turns out, this is more complicated than it appears at first.

One of the most intricate and challenging aspects of remote copy applications is their requirement to maintain write ordering, otherwise called data consistency. In short, data consistency refers to the relationship between related data values, whether they are application data values or values provided by the system. It is fairly common to have complex data structures that reference multiple data entries as part of a single high-level data object.

Applications and filing systems process their I/O operations in a precise, structured sequence that guarantees the order in which data is written. For instance, when an application updates data, it might store an internal reference to the data as located within a certain byte range, and shortly thereafter the file system may be asked to store the actual updated data. Both of these writes land on disk at different times. If something goes wrong with the process and one of them does not complete correctly, the two data values will not be synchronized; they will be inconsistent with each other. Not only will the data be wrong, but it is somewhat likely that other problems will arise, including abnormal application or system failure.

Where remote copy applications are concerned, it is necessary to preserve the write ordering that was executed on the primary storage subsystem on secondary and tertiary storage. In other words, any secondary or tertiary storage must have data written to it in exactly the same order that it was written on primary storage.

Local storage interconnects and SAN technologies have no problem with write ordering because SCSI protocol processes dictate the sequence of I/Os in these environments. With direct attached storage (DAS) or SAN storage, SCSI WRITE CDBs sent by initiators to logical units (LUs) are acknowledged by the LU after successful completion of the command. Applications wait to transmit subsequent I/O commands until their pending commands are acknowledged.

However, with remote copy applications, data is sent over an intermediary network that has its own set of protocols. It's important to realize that remote copy data transfers and communications are distinct from local storage transfers. Even though remote copy data might be forwarded in the same order as local writes, there is no guarantee that it will be received in the same order due to network congestion and error conditions.
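One common way for a remote copy application to compensate for out-of-order delivery is to tag each forwarded write with a sequence number and have the receiving controller buffer arrivals until they can be applied in the original order. The Python sketch below illustrates that approach under those assumptions; it is not a description of any specific vendor's implementation.

class InOrderReceiver:
    """Applies remotely copied writes strictly in primary write order,
    even if the network delivers them out of order."""

    def __init__(self, storage):
        self.storage = storage      # dict acting as secondary storage
        self.next_seq = 1           # next sequence number that is safe to apply
        self.held = {}              # out-of-order arrivals, keyed by sequence

    def receive(self, seq, lba, data):
        self.held[seq] = (lba, data)
        # Apply every contiguous write starting from next_seq; anything
        # beyond a gap stays buffered until the missing write arrives.
        while self.next_seq in self.held:
            lba, data = self.held.pop(self.next_seq)
            self.storage[lba] = data
            self.next_seq += 1

For example, if writes with sequence numbers 2 and 3 arrive before write 1, they are held in the buffer until write 1 arrives, and then all three are applied in the original order.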

Another example illustrates the nature of the errors that can occur. Assume a database system writes an update using the following hypothetical sequence:

Step 1.

Write a journal entry describing the update to be performed.

Step 2.

Perform the update.

Step 3.

Make a journal entry confirming the update occurred.

If Steps 1 and 3 are committed to disk in secondary or tertiary storage, and Step 2 is skipped due to a transmission error or delay of some sort, the actual data on disk is inconsistent with the database journal entries that indicate everything worked as planned. Besides databases, many other applications could have their data and metadata stored out of sync if write ordering is not followed properly by a remote copy application.
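As a hypothetical illustration of this three-step sequence, the Python sketch below models the journal and the database table as simple in-memory structures; the function and variable names are invented for the example.

def journaled_update(journal, table, key, new_value):
    journal.append(("intent", key, new_value))   # Step 1: describe the update
    table[key] = new_value                       # Step 2: perform the update
    journal.append(("done", key))                # Step 3: confirm the update

# If secondary storage receives Steps 1 and 3 but the write from Step 2 is
# lost or delayed, its journal claims the update completed while the table
# still holds the old value -- exactly the inconsistency described above.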

Write atomicity is another aspect of remote copy applications that can result in the loss of data integrity. Multiple I/O WRITE commands are often needed to complete a single application data-writing process. If they do not all complete, the stored data does not have integrity. Therefore, all writes need to complete, and they need to complete in order, for the data to have both integrity and consistency.
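One way a receiving controller might preserve write atomicity is to stage all the writes belonging to a single application-level update and apply them to secondary storage only when the forwarding side signals that the group is complete. The following Python sketch illustrates this idea; the class and its grouping scheme are assumptions for illustration, not features of any particular product.

class AtomicApplier:
    """Stages the writes of one application-level update and applies them
    to secondary storage only when the whole group has arrived."""

    def __init__(self, storage):
        self.storage = storage
        self.groups = {}                         # group_id -> list of (lba, data)

    def stage(self, group_id, lba, data):
        self.groups.setdefault(group_id, []).append((lba, data))

    def commit(self, group_id):
        # Called when the forwarding side signals the group is complete:
        # apply every staged write in the group, or nothing at all.
        writes = self.groups.pop(group_id, [])
        for lba, data in writes:
            self.storage[lba] = data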

NOTE

Many years ago I developed a mnemonic for remembering the relative importance of different storage assumptions, along the lines of a hierarchy of needs. My mnemonic, IRSAM, stands for

  • Integrity

  • Recoverability

  • Security

  • Availability

  • Manageability

Data integrity is like breathing. If data does not have integrity, systems will fail miserably. Storage that does not maintain integrity may as well just be a bit bucket in outer space. This applies to all forms of storage, including backup systems and remote copy applications.

Recoverability is more like drinking water. It's not quite as important as having good data, but if data is not recoverable, it may become lost in the great void of lost data.

Security is akin to food and is close in importance to recoverability, but not quite. If weak security allows threats that change or destroy data, the data is worthless. Then you have to be able to recover other copies of good, unharmed data.

Availability is like coffee. Availability is considered the top priority by many, but it applies only if the data exists. Data can be preserved while availability is reestablished. You want to be wired for availability, but there are much more severe conditions than a temporary loss of availability.

Manageability is the lowest priority in terms of fundamental needs and is analogous to painkillers. It is extremely important, but a lack of manageability does not guarantee failure. It simply suggests we may have a miserable existence without it.


The challenge in maintaining write ordering with remote copy operations is that the intermediary networks used to transport remote copy data may not be able to guarantee delivery, much less in-order delivery. Therefore, it is up to the remote copy application to maintain write order at secondary and tertiary storage.

Resuming Normal Primary Site Operations

In addition to making data accessible to replacement systems following a disaster, remote copy systems can also be used to recover data to the primary site to resume normal data center operations. In general, this involves forwarding new data created at the remote site back to the (original) local site.

Assuming the disaster is short-lived and does not result in loss of data or equipment, as during a major blackout, the remote site assumes data processing operations on an interim basis. All subsequent data updates that occur at the remote site are stored locally and logged so they can be copied back to the primary site later, when power is restored and equipment is made operational again. Then the remote copy function is reversed and the primary site is brought back in sync with the remote site. At some point, operations at the remote site are temporarily suspended and resumed again at the primary site, with remote copy operations reestablished as they were originally. Some companies with redundant data centers regularly switch their primary and remote sites to make sure everything would work as planned if an actual disaster should occur.
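The "store locally and log" step can be pictured as a simple change log kept at the remote site while it acts as primary, which is later replayed against the restored primary storage. The Python sketch below is a simplified illustration under that assumption; the class name and methods are hypothetical.

class FailbackLog:
    """Records writes made at the remote site during an outage so they can
    be replayed against the original primary site once it is restored."""

    def __init__(self):
        self.entries = []                     # ordered list of (lba, data)

    def record(self, lba, data):
        self.entries.append((lba, data))

    def resync(self, primary_storage):
        # Replay logged writes in order, bringing the primary back in sync
        # with the remote copy before operations switch back.
        for lba, data in self.entries:
            primary_storage[lba] = data
        self.entries.clear()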

Major, high-impact disasters with data and equipment loss that take days to recover from have much more complicated recovery scenarios and operations. Chapter 13 discusses some of the aspects of major disaster recovery.


