DISASTER RECOVERY VS. BUSINESS CONTINUITY

Hurricanes Katrina and Wilma were devastating blows to most organizations with offices in affected areas, and really taught us the tremendous value in business continuity and disaster recovery situations. Because we had deployed Citrix technology to centralize and secure our applications and data, we were able to continue business as usual. Our employees could log in from anywhere with a connection and have access to all their data and e-mails, allowing us to work through the aftermath.

Alan Kauffman, CIO, March of Dimes

Many organizations today have a disaster recovery (DR) plan in place, although very few have thought it out thoroughly, and even fewer document or test it on a consistent basis. Most DR plans for smaller organizations consist of a tape backup and rest on the assumption that anything further will cost more than the statistical chance of downtime justifies. The challenge is that although a tape backup does provide potential recovery, it does not provide business continuity (BC). A business continuity plan is an all-encompassing, documented plan of how an organization will return to productive activity within a predefined period of time. This includes not only IT services but also telecommunications, manufacturing, office equipment, and so on. It is important to understand that recovering from a disaster is a subset of business continuity. Although DR is the most important part of business continuity, just having the ability to recover mission-critical data (or never losing it in the first place) is not sufficient to return most organizations to even a minimum level of productivity. Additional concepts such as end-user access and off-site storage locations are critical for a full return to productivity. By the same token, though, without recovery of the data, access is a moot point. Most organizations today could not re-create electronic information such as accounting and e-mail data if computer records are lost or corrupted, or if recovery from tape backup fails (a significant statistical probability).

Business continuity planning should be broken into two phases:

  • Minor disasters that do not involve a major facility problem (database corruption, temporary power loss, server failures, virus outbreaks, and so on)

  • Major disasters that may require relocation (natural or geopolitical disasters, for example)

From these phases, documentation can be built to describe the risk mitigation procedures, as well as recovery procedures required to maintain business productivity.

When creating a business continuity plan, the following aspects should be considered:

  • What defines a minor and major disaster, and what are the critical points at which a BC plan will be enacted?

  • What applications, key business systems (including non-IT-based systems), and employees are defined as critical?

  • Where will employees be housed if their main location is unavailable?

  • What time period is acceptable for mission-critical systems to be down, and what is an acceptable time to enact the BC plan?

  • How will access to critical data, business systems, and applications be provided within the predefined time period following a disaster?

  • Who will be responsible for enacting and maintaining the BC plan?

From the preceding list, it is clear that BC planning focuses primarily on two objectives: recovery time and recovery point. Put simply, an organization must ask two questions: "How long can we be down?" and "What do we need to have available after that time?" When initiating a DR/BC study, many companies start out with the attitude that the entire IT infrastructure has to be continuously available, or at least recoverable, in a very short time window, such as four hours. Without on-demand access, though, few companies can afford this kind of high availability for the entire IT infrastructure. And even with on-demand access, an effort should be made to prioritize what must be recovered and how long it can take.

Recovery Time Objectives

When examining the disaster recovery needs of your organization, you will likely find differing service-level requirements for the different parts of your system. For example, it may be imperative that your billing and accounting system come back online within two hours in the event of a disaster. While inconvenient, it may still be acceptable for the manufacturing database to recover in 24 hours, and it may be acceptable for engineering data to come back online in two weeks (since it may be useless until new facilities are in place anyway). A key to a successful BC plan is knowing what your recovery time objectives are for the various pieces of your infrastructure. Short recovery times translate directly into high costs, due to the requirements of technology such as real-time data replication, redundant server farms, and high-bandwidth WAN links. Fortunately, with Citrix Presentation Server and Terminal Server, you don't have to hunt down PCs across the enterprise to recover their applications; all of your application servers will be located in the data center. We recommend using a tiered approach when applications and users must be restored. Figure 19-1 shows an example of one company's top recovery time objectives.

Figure 19-1: Recovery time objectives
Note 

A continuity plan requires an ongoing process of review, testing, and reassessment. Most organizations change significantly over the course of a year, so a two-year-old DR/BC plan is likely to be useless.
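The tiered approach described in this section can be sketched in a few lines of Python. This is only an illustration: the `System` class and the `recovery_order` and `missed_objectives` helpers are hypothetical names introduced here, and the three objectives are the example values from the text (two hours, 24 hours, two weeks), not a recommendation.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class System:
    """Hypothetical record pairing a system with its recovery time objective."""
    name: str
    rto: timedelta  # longest outage the business says it can tolerate


# Example objectives taken from the text above; real values come from the business.
SYSTEMS = [
    System("engineering data", timedelta(weeks=2)),
    System("billing and accounting", timedelta(hours=2)),
    System("manufacturing database", timedelta(hours=24)),
]


def recovery_order(systems):
    """Return the systems sorted so the tightest objective is restored first."""
    return sorted(systems, key=lambda s: s.rto)


def missed_objectives(systems, elapsed):
    """Return the names of systems whose objective has already passed."""
    return [s.name for s in systems if elapsed > s.rto]


if __name__ == "__main__":
    for s in recovery_order(SYSTEMS):
        print(f"{s.name}: restore within {s.rto}")
```

Sorting by objective gives the restore sequence, and `missed_objectives` shows which commitments have already been broken at a given point in an outage. Keeping the objectives in one table like this also makes the annual review easier, since a changed objective is a one-line edit.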

The On-Demand Access Solution to Business Continuity

A major theme throughout this book has been building robustness into on-demand access. Redundancy of the network, server, application, and data center has been discussed. We also made the assumption that on- and off-site tape backups are performed nightly. Most minor disasters can be mitigated by simply following the best practices in this book. It is impossible, though, to guarantee uptime for a single location, due to the large number of both internal and external risks. Additionally, the data center is not the only thing requiring redundancy: each employee also needs a workstation with access to the mission-critical applications and data.

An access approach to business continuity decouples the desktop from the workstation. A disaster may preclude employees from accessing their normal workstations, but with access to a browser, the employees can still securely access the virtualized applications. This access remains whether the "desktop" is running in the normal data center or at the disaster recovery site. If employees are prevented from entering their office due to a natural disaster, they can still continue working from another office, from home, or even from an Internet café.

Some of the more typical problems with a distributed environment that are solved with an on-demand access solution are listed here:

  • Foreseeable disasters often entail evacuation of large numbers of workers, thus leading to the need to have total flexibility for where knowledge-based workers work, what device they are working from, and when they work.

  • Even if the workers are not displaced, if the data center is displaced, it is highly unlikely in a distributed environment that users will still have sufficient bandwidth to access the data at a new location. In an on-demand access environment, the bandwidth requirements are much lower and more flexible (we show later in this chapter that Internet bandwidth from any source is sufficient if the on-demand access environment is built properly).

  • The availability of specific replacement PCs on a moment's notice cannot be guaranteed, thus making it difficult in a distributed environment to guarantee that users will have the necessary processing power to run their applications. In an on-demand access environment, a user's desktop CPU power and operating system environment are largely irrelevant, allowing the use of whatever hardware might be available.

  • The manpower required to quickly install and configure ten or more applications for hundreds or thousands of users is enormous in a distributed environment. In an on-demand access environment, the applications don't need to be installed or configured, as they are already on the server farm (or backup server farm).

With this clear advantage, many organizations today are embracing on-demand access as the only possible solution to IT business continuity.

On-Demand Access Business Continuity Design

Conceptually, there are two simple approaches to fulfilling immediate-resumption fail-over requirements in an on-demand access environment: fail-over of the data center and fail-over of the client environment. If both are in place, under major disaster circumstances, an organization will simply switch the data center to another location and then have users connect to the new data center from wherever they can get an Internet connection. Of course, the larger an organization is, and the more dispersed its users are, the more complex this task will be. Additionally, for small organizations, this solution may appear to be overkill, as the cost of the redundant data center may exceed the value of the data. Approaches to reducing the cost of business continuity in an on-demand access environment include the following:

  • Defining only a subset of users and applications that need access following a disaster, thus reducing the amount of redundant infrastructure.

  • Placing lower expectations when defining what is acceptable downtime, thus allowing the use of a cold backup rather than a hot backup.

  • Increasing the acceptable amount of data loss. For example, if a full day of data loss is acceptable, then the main and redundant data centers require less bandwidth than if all data must be current to within 30 minutes.
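The trade-off in the last point can be made concrete with some rough arithmetic. The sketch below assumes a deliberately simple model: the largest backlog of changed data must reach the redundant data center within the acceptable data-loss window, and protocol overhead, compression, and latency are ignored. The function name `required_mbps` and the 9 GB backlog figure are hypothetical and do not come from the text.

```python
def required_mbps(changed_bytes: float, window_seconds: float) -> float:
    """Link speed in megabits per second needed to ship changed_bytes in the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return changed_bytes * 8 / window_seconds / 1_000_000


# Hypothetical figure, not from the text: 9 GB of changes in the busiest period.
BURST_BYTES = 9_000_000_000

if __name__ == "__main__":
    for label, seconds in (("30 minutes", 30 * 60), ("24 hours", 24 * 3600)):
        print(f"{label}: {required_mbps(BURST_BYTES, seconds):.2f} Mbit/s")
```

Under these assumed numbers, a 30-minute window calls for a 40 Mbit/s link, while a 24-hour window needs less than 1 Mbit/s for the same backlog. Real replication traffic is burstier and carries overhead, so treat the result as a lower bound for discussion rather than a sizing figure.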

From this list, it is clear that prior to implementing an on-demand access business continuity plan, we must answer the questions from the first section of this chapter regarding how long we can be down, and who needs to have access. In order to provide guidance in this process, we will call upon our case study, CME Corporation, again.

The CME Business Continuity Plan

CME's infrastructure, as described in Part III of this book, is similar to that of many midsized and enterprise organizations. CME has multiple locations, a large number of mission-critical applications, and the perceived need for immediate recovery from data loss. In Chapter 17, we defined that CME will have one central data center to reduce complexity and cost, and allow for central management. Although we will define some IT resources and equipment at CME West in Seattle, CME West users will access their applications and data at CME Corp in Chicago, since that is where the live, up-to-the-minute data resides, and also because it is very costly to maintain the bandwidth required to mirror databases and files in real time between two geographically disparate locations.

The apparent downside to this approach, though, is that all of CME's eggs are in one basket at the CME Corporate headquarters data center. Should a natural, accidental, or geopolitical disaster occur on or near this site, all 3,000 users will lose access, potentially forever. To resolve this problem, CME has defined a remote backup site, CME West, as the hot backup site. In order to minimize costs, CME will only replicate a subset of the corporate data-center hardware to permit rapid recovery of mission-critical services and applications and allow managers to make an informed decision regarding permanent rebuilding of the entire corporate data center at the alternate site. In order to achieve this objective, CME has defined a prepositioned hot backup at CME West for initial reconstitution (8-24 hour survivable), which provides immediate access for a subset of users while the corporate staff is moved to CME West.

CME's IT staff have met with CME's executives and answered the questions posed earlier in this chapter. Table 19-1 shows the results.

Table 19-1: CME's Business Continuity Definitions

Business Continuity Question: What applications are defined as critical, and what is acceptable downtime for them?

CME's Answer: CME has determined that not all applications and users have the same requirement for access and availability in the case of a major disaster. Accordingly, CME has defined three tiers of availability:

  • Tier one requires application availability and user access within two hours, regardless of cause. The Tier-one applications include Microsoft Exchange e-mail and Microsoft Great Plains accounting software (including payroll, human resources, and accounts receivable/payable functions).

  • Tier two requires application availability and user access within 24 hours. Tier-two applications include the Oracle-based manufacturing system (including production schedules, bill of materials, supply chain information, inventory, and so on).

  • Tier three requires application availability and user access within two weeks and includes all remaining applications. Note that this timeline has been set at two weeks to allow for a temporary facility move.

Business Continuity Question: Who are the key personnel requiring access at each tier?

CME's Answer: Tier-one key personnel who require access include all top-level managers/directors, critical IT staff, and a limited number of predefined support staff (about 50 people total). It is important to note that some of these key users must be located at CME West to provide skill set redundancy in the case of a major disaster in Chicago.

The key personnel to whom access must be guaranteed grows in Tier two to include a larger set of personnel (about 500 people total) across all CME locations required to operate these key systems. These additional personnel include accountants, human resource managers, remaining IT staff, key manufacturing and development engineers, and lower-level managers.

Tier three includes all remaining personnel.

Business Continuity Question: What defines a major disaster, and what are the critical points at which a business continuity plan will be enacted?

CME's Answer: Note that the CME Corporation data center has internal redundancy, including redundant network core components, bandwidth, servers, HVAC, and power. Thus, the business continuity plan calls for a data center fail-over only in the event of a major disaster in which it is determined that more than eight hours of localized downtime will occur (this may be a guess or a well-known fact, depending on the type of disaster and available information).

Any event that will cause a minimum of eight hours of downtime at the Chicago data center will trigger the data center fail-over. Here are examples: a major server hardware or network infrastructure failure occurs, which, due to delays in getting replacement equipment, causes an outage at the data center for more than eight hours; a malicious ex-employee sabotages the infrastructure; a government organization confiscates servers and data due to illegal employee activity. Examples of less common disasters might include a severe snowstorm that renders major utilities offline or causes structural damage to the building; a train derailment at the nearby depot that forces evacuation due to a hazardous spill; and a localized geopolitical disaster that renders the facility unusable.

Business Continuity Question: How long is acceptable before enacting the business continuity plan, and who is responsible for enacting the BC plan?

CME's Answer: Once notification of a major outage has been issued, a decision will be made by the BC team (which consists of the CIO, CTO, CEO, CFO, and their support personnel) within one hour regarding whether to fail over to the CME West data center. Note, though, that this leaves only one hour to accomplish the actual fail-over of the data center within the specified two-hour window.

Business Continuity Question: How will access to critical data and applications be provided within the predefined time period following a disaster, and where will employees be housed if their corporate headquarters location is unavailable?

CME's Answer: Employees required for Tier-one and Tier-two continuity must have broadband or dial-up Internet connectivity from home and must complete the BC training and maintain the accompanying BC documentation at their residences. CME will provide the broadband connectivity and a thin client or corporate laptop for the 50 Tier-one designated employees. Tier-two employees will use existing employee-provided hardware and Internet connectivity to connect from their residences or other CME branches. These Internet connections will provide full access to the Tier-one and Tier-two applications. Tier three will utilize a makeshift facility if required, in addition to any home-based access.

With the business continuity requirements documented and defined, CME's IT group is now able to create the technical portion of the document to ensure that the requirements will be met.
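The fail-over trigger and the time budget from Table 19-1 can be expressed as a short Python sketch. The constant and function names are hypothetical, and the sketch assumes the two-hour Tier-one window is measured from the moment the outage notification is issued, which is how the table's "only one hour" remark reads.

```python
from datetime import timedelta

FAILOVER_THRESHOLD = timedelta(hours=8)  # expected outage that triggers fail-over
DECISION_DEADLINE = timedelta(hours=1)   # BC team must decide within this time
TIER_ONE_RTO = timedelta(hours=2)        # Tier-one availability target


def should_fail_over(expected_outage: timedelta) -> bool:
    """True when the estimated Chicago outage reaches the eight-hour threshold."""
    return expected_outage >= FAILOVER_THRESHOLD


def time_left_for_failover(decision_taken_after: timedelta) -> timedelta:
    """Time remaining inside the Tier-one window once the decision is made."""
    remaining = TIER_ONE_RTO - decision_taken_after
    # Never report a negative budget; zero means the window is already spent.
    return max(remaining, timedelta(0))


if __name__ == "__main__":
    print(should_fail_over(timedelta(hours=12)))
    print(time_left_for_failover(DECISION_DEADLINE))
```

Writing the rule down this way exposes how tight the plan is: if the BC team uses its full hour to decide, the technical fail-over to CME West has exactly one hour to complete, and any delay in the decision comes straight out of that budget.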



Citrix Access Suite 4 for Windows Server 2003: The Official Guide, Third Edition
ISBN: 0072262893
Year: 2004
Pages: 137
