10.6 Disaster recovery and business continuity

Disaster recovery (D/R) is the process that many organizations use as their high-availability solution. Backup and restore methods are designed and implemented so that critical data is stored securely and can be used in the event of a catastrophic failure within the environment. A catastrophic failure can range from the collapse of the supporting infrastructure (for example, after an earthquake) to a "routine" failure in a piece of equipment. Sites prepare elaborate schemes to move critical data to removable backup media (such as tapes), which are then stored in an off-site facility. In the event of a catastrophic failure, new equipment can be commissioned to replace the equipment affected by the failure. This might take the form of hot sites, where servers run mostly idle but contain current data for immediate recommissioning; warm sites, where servers are kept powered down but are available for quick commissioning to replace failed equipment; or cold sites, where servers may have to be procured, installed, and configured, and then have the critical data restored before commissioning.

10.6.1 Business continuity

Business continuity (B/C) provides for servers to remain operational in the event of a catastrophic failure. Duplicate equipment exists so that either the "primary" or the "fail-over" equipment is available to the end-user community. B/C solutions are designed so that end users are unaware of a catastrophic failure. These solutions often include application-based solutions, for example operating system solutions such as NT Wolfpack, Linux, or Beowulf, or hardware solutions such as HACMP or mirroring. Business continuity is of course a vital activity. However, before creating a business continuity plan, it is essential to consider the potential impacts of a disaster and to understand the underlying risks. These are the foundations upon which a sound business continuity plan should be built.

The following are some of the basic steps that any company should review as part of its disaster recovery and/or business continuity plans:

Develop and create a policy on disasters and business continuity
Building on the risk analysis, conduct a business impact analysis
Develop and create recovery strategies and action plans
Develop and create a contingency plan
Develop pilots and testing plans

Policy

As with any important process or program, you will need support from senior management, including a budget. When you build the policy, be sure to include the following items:

Service-level requirements, mean time between failures statistics, and mean time to recovery expectations
Roles and responsibilities: who, what, when, and where

Areas of impact. Example:

Hot sites for the data center
Fail-over sites for order processing
Vendor support in the event of a disaster
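
The mean time between failures (MTBF) and mean time to recovery (MTTR) figures called for in the policy can be combined into a single availability figure using the standard steady-state relationship, availability = MTBF / (MTBF + MTTR). The following Python sketch is illustrative only: the function names and the sample figures (2,000 hours between failures, 4 hours to recover) are assumptions chosen for the example, not values taken from any particular service-level agreement.

```python
HOURS_PER_YEAR = 8760.0  # 365 days; leap years ignored for simplicity


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return steady-state availability as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours_per_year(avail: float) -> float:
    """Return the expected hours of downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical figures for one service, used only to show the arithmetic.
sample_availability = availability(mtbf_hours=2000.0, mttr_hours=4.0)
sample_downtime = expected_downtime_hours_per_year(sample_availability)

print(f"Availability: {sample_availability:.4%}")
print(f"Expected downtime per year: {sample_downtime:.1f} hours")
```

With these assumed figures the service would be available about 99.8 percent of the time, which still amounts to roughly 17.5 hours of downtime per year. Running the same arithmetic against the service-level requirement shows quickly whether the recovery expectation in the policy is realistic.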

Business impact analysis

Business impact analysis (BIA) is an important part of any organization's business continuity plan; it includes an analysis process to reveal vulnerabilities, and a planning component to develop strategies for minimizing risk. One of the basic assumptions behind BIA is that every component of the organization relies on the continued functioning of every other component, but that some are more crucial than others and require a greater allocation of funds in the wake of a disaster. For example, a business may be able to continue more or less normally if the company bookstore has to close, but would come to a complete halt if the server room burned to the ground. The business impact analysis will identify costs linked to failures, such as:

Loss of messaging components (e-mail, instant messaging)
Inability to pay employees (payroll systems down)
Order systems down (transactions with customers and/or vendors cannot be completed)

Each company should create a business impact report. This document quantifies the importance of business components and suggests appropriate budgets for implementing the required measures.
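
One simple way to quantify the importance of business components is to estimate, for each one, the cost of an outage and then rank the components by that cost. The Python sketch below assumes a deliberately crude model (cost per hour multiplied by expected outage length); the class name, field names, and all dollar and hour figures are hypothetical and serve only to show how such a ranking could feed a business impact report.

```python
from dataclasses import dataclass


@dataclass
class BusinessComponent:
    name: str
    cost_per_hour: float          # estimated loss per hour of outage (hypothetical)
    expected_outage_hours: float  # estimated outage length in a disaster (hypothetical)

    @property
    def impact(self) -> float:
        """Estimated total loss for one outage of this component."""
        return self.cost_per_hour * self.expected_outage_hours


def rank_by_impact(components):
    """Return the components sorted from highest to lowest estimated impact."""
    return sorted(components, key=lambda c: c.impact, reverse=True)


# Illustrative entries matching the failure types listed above.
components = [
    BusinessComponent("messaging", cost_per_hour=500.0, expected_outage_hours=8.0),
    BusinessComponent("payroll", cost_per_hour=2000.0, expected_outage_hours=24.0),
    BusinessComponent("order systems", cost_per_hour=10000.0, expected_outage_hours=4.0),
]

for component in rank_by_impact(components):
    print(f"{component.name:15s} {component.impact:>10,.0f}")
```

A real analysis would also weigh effects that are harder to price, such as lost customer confidence, but even a rough ranking like this helps justify where the recovery budget should go first.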

Recovery strategies and action plans

As part of your planning process, you will need to determine and define your enterprise recovery strategies. These should include the following:

Roles and responsibilities

The size and definition of the disaster recovery team will vary with each enterprise. Large enterprises will need larger, more diverse teams. Earlier in this chapter we defined the basic team that you will need for incident handling. Overall, the same team can be used; the difference is in the implementation of the solution. The table above shows where the team determines the problem, the scope of the problem, and the solution. This is where the team determines the action needed to keep the business running, or to get the business running again as soon as possible.

It is important for an enterprise to develop the top five or ten scenarios that may impact the business. These "example problems" will help guide the plan and the cost needed to build the disaster recovery team. A Level 1 incident could be a simple virus, or an e-mail carrying one, that is not in the virus definition file. At first you may think that this hardly needs a team to jump in and save the day. However, the authors have seen large enterprises shut down because a simple macro virus invaded the corporate mail system. In some cases the situation was made worse by a single administrator trying to fix the problem while the virus was spreading from user to user; think of the vaudevillian trying to keep 20,000 plates spinning. It is not possible. This is where the incident response team comes to the rescue with processes and procedures (and experience). At the other extreme, a Level 5 incident could be a fire in the data center, in which case all is lost: fire, water, confusion, and in the end no data center. Again, the team will need to be called into action to execute the plan. In a worst-case scenario, the team should include the following members:

In most cases the disaster recovery team is part-time, like a volunteer fire department. In some companies, vendor specialists will also be on call for an emergency. Take the time up front to develop SLAs and contracts with your vendors so they can help you get back online quickly.

Also be sure to run system tests to see whether your team is ready for a disaster, and include the vendors in the exercise.
