Disaster recovery (D/R) is the process that many organizations rely on as their high-availability solution. Backup and restore methods are designed and implemented so that critical data is stored securely and can be used in the event of a catastrophic failure within the environment. Catastrophic failures range from the collapse of the supporting infrastructure (for example, in an earthquake) to a "routine" failure in a piece of equipment. Sites prepare elaborate schemes to move critical data to removable backup media (such as tapes), which are then stored in an off-site facility. In the event of a catastrophic failure, new equipment can be commissioned to replace the equipment affected by the failure. This might take the form of hot sites, where servers run mostly idle but contain current data for immediate recommissioning; warm sites, where servers sit powered down but can be commissioned quickly to replace failed equipment; or cold sites, where servers may have to be procured, installed, and configured, and then have the critical data restored before they are commissioned.
Business continuity (B/C) provides for servers to remain operational in the event of a catastrophic failure. Duplicate equipment exists so that either the "primary" or the "fail-over" equipment is available to the end-user community. B/C solutions are designed so that end users are unaware that a catastrophic failure has occurred. Many of these are application-based solutions, operating system solutions like NT Wolfpack, Linux, or Beowulf, or hardware solutions like HACMP or mirroring. Business continuity is of course a vital activity. However, before creating a business continuity plan, it is essential to consider the potential impacts of disaster and to understand the underlying risks. These are the foundations upon which a sound business continuity plan should be built.
The following describes some of the basic steps that any company should review as part of their disaster recovery and/or business continuity plans:
Policy: develop and create a policy on disasters and business continuity
Remember the risk analysis; now you need to conduct a business impact analysis
Develop and create recovery strategies and action plans
Develop and create a contingency plan
Develop pilots and testing plans
As with any important process or program, you will need support from senior management, and that support must include a budget. When you build the policy, be sure to include the following items:
Service-level requirements, mean time between failures (MTBF) statistics, and mean time to recovery (MTTR) expectations.
Roles and responsibilities: who, what, when, and where.
Areas of impact. Example:
Hot sites for the data center
Fail-over sites for order processing
Vendor support in the event of a disaster
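The mean time between failures and mean time to recovery figures in the policy can be combined into a steady-state availability estimate, which makes a service-level requirement easier to check. The following is a minimal sketch; the function names and any figures passed to them are illustrative assumptions, not values from a real system.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Average hours per year the service is expected to be down."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * HOURS_PER_YEAR


def meets_service_level(mtbf_hours: float, mttr_hours: float,
                        required_availability: float) -> bool:
    """True if the estimated availability satisfies the stated requirement."""
    return availability(mtbf_hours, mttr_hours) >= required_availability
```

For example, a system that fails on average every 999 hours and takes 1 hour to recover has an availability of 0.999, or roughly 8.76 hours of expected downtime per year.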
Business impact analysis (BIA) is an important part of any organization's business continuity plan; it includes an analysis process to reveal any vulnerability, and a planning component to develop strategies for minimizing risk. One of the basic assumptions behind BIA is that every component of the organization is reliant upon the continued functioning of every other component, but that some are more crucial than others and require a greater allocation of funds in the wake of a disaster. For example, a business may be able to continue more or less normally if the company bookstore has to close, but would come to a complete halt if the server room burns to the ground. The business impact analysis will identify costs linked to failures, like:
Loss of messaging components (e-mail, instant messaging, and so on)
Inability to pay employees (payroll systems down)
Order systems down (transactions with customers and/or vendors cannot be completed)
Each company should create a business impact report. This document quantifies the importance of business components and suggests the budgets needed to implement the required measures.
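One way to quantify the importance of business components is to estimate, for each one, the loss per hour of downtime and the length of a plausible outage, and then rank the components by the product of the two. The sketch below assumes that simple model; the class, the field names, and the proportional budget split are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Component:
    name: str
    loss_per_hour: float          # estimated cost of one hour of downtime
    expected_outage_hours: float  # estimated length of a plausible outage

    @property
    def exposure(self) -> float:
        """Estimated cost of one outage of this component."""
        return self.loss_per_hour * self.expected_outage_hours


def rank_by_exposure(components: Iterable[Component]) -> List[Component]:
    """Most costly component first."""
    return sorted(components, key=lambda c: c.exposure, reverse=True)


def allocate_budget(components: Iterable[Component],
                    total_budget: float) -> Dict[str, float]:
    """Split a budget in proportion to exposure (an assumed policy)."""
    items = list(components)
    total_exposure = sum(c.exposure for c in items)
    if total_exposure == 0:
        return {c.name: 0.0 for c in items}
    return {c.name: total_budget * c.exposure / total_exposure for c in items}
```

A proportional split is only a starting point for discussion; a real report would also weigh the cost of each protective measure.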
As part of your planning process, you will need to define your enterprise recovery strategies. These should include the following:
Roles and responsibilities
The size and definition of the disaster recovery team will vary with each enterprise; large enterprises will need larger, more diverse teams. Earlier in this chapter we defined the basic team that you will need for incident handling. Overall, the same team can be used; the difference is in the implementation of the solution. The table above shows where the team determines the problem, the scope of the problem, and the solution. This is where the team determines the action needed to keep the business running, or to get it running again as soon as possible.
It is important for an enterprise to develop the top five or ten scenarios that may impact the business. These "example problems" will help guide the plan and the cost of building the disaster recovery team. A Level 1 incident could be a simple virus or infected e-mail that is not in the virus definition file. At first you may think that this hardly needs a team to jump in and save the day, but the authors have seen large enterprises shut down by a simple macro virus invading the corporate mail system. In some cases it was made worse by a single administrator trying to fix the problem while the virus was spreading from user to user; think of the vaudevillian trying to keep 20,000 plates spinning. It is not possible, and this is where the incident response team comes to the rescue with processes and procedures (and experience). A Level 5 incident could be a fire in the data center, in which case all is lost: fire, water, confusion, and in the end no data center. Again the team will need to be called into action to execute the plan. In a worst-case scenario, the team should include the following members:
The incident response team as listed above
Damage assessment team
Alternate-site fail-over team
Operating system administration team
Network and telecommunication team
Emergency communication help desk (cell phones, pagers, home number access, and so on)
In most cases the disaster recovery team will be part-time, like a volunteer fire department. In some companies, vendor specialists will also be on call for an emergency. Take the time up front to develop SLAs and contracts with your vendors so they can help you get back on-line quickly.
Also be sure to run system 'tests' to see if your team is ready for a disaster. Include the vendors in the exercise.
System backups can be key in any type of incident and/or disaster. Enterprise data should be backed up based on a predefined schedule. Scheduled test restores should be conducted at least once a month; some companies will execute a restore daily to make sure the data from the day before is properly backed up.
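A scheduled test restore is only useful if the restored data is actually compared with what was backed up. One common way to do that is to compare checksums file by file. The sketch below assumes the original files are still available and unchanged since the backup was taken; the function names and the "missing"/"differs" report format are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: str, restored_dir: str) -> List[str]:
    """Return one line per source file that is missing or differs after restore."""
    source_root = Path(source_dir)
    restored_root = Path(restored_dir)
    problems = []
    for source_file in sorted(p for p in source_root.rglob("*") if p.is_file()):
        relative = source_file.relative_to(source_root)
        restored_file = restored_root / relative
        if not restored_file.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(source_file) != sha256_of(restored_file):
            problems.append(f"differs: {relative.as_posix()}")
    return problems
```

An empty result means every source file was found in the restored tree with identical contents.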
There are many different types of backup:
External storage media backups
Off-site backup or clustering
Hot site mirroring
External storage media backups: This is your old standard backup: full, incremental, or differential. Data is copied from the hard disk to another device: tape, floppy (not very efficient), CD, another hard disk, and more. Note: make sure your applications and your backup software are compatible; some applications may keep a file open and confuse the backup software.
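The three standard backup types differ only in which files they copy: a full backup copies everything, a differential backup copies what has changed since the last full backup, and an incremental backup copies what has changed since the last backup of any type. The sketch below illustrates that selection rule using modification times; the function name and its parameters are hypothetical and not taken from any backup product.

```python
from typing import Dict, List


def select_files(mtimes: Dict[str, float], kind: str,
                 last_full: float, last_backup: float) -> List[str]:
    """Pick the files a backup of the given kind would copy.

    mtimes maps each file path to its last-modified timestamp.
    last_full is the time of the most recent full backup.
    last_backup is the time of the most recent backup of any type.
    """
    if kind == "full":
        cutoff = None           # copy everything
    elif kind == "differential":
        cutoff = last_full      # changed since the last full backup
    elif kind == "incremental":
        cutoff = last_backup    # changed since the last backup of any type
    else:
        raise ValueError(f"unknown backup kind: {kind!r}")
    return sorted(path for path, mtime in mtimes.items()
                  if cutoff is None or mtime > cutoff)
```

The trade-off follows directly: incremental backups are the smallest, but a restore needs the full backup plus every incremental since; a differential restore needs only the full backup plus the latest differential.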
Hierarchical backup: This is basically the same as external storage media backup, with one big difference: the data is managed by a set of rules. One common rule is the age of the data. As the data ages (or is accessed less often), it is transferred to slower media. Example: Let's say you have daily transactions, and as each transaction is completed, you store it on the live disk. Eventually the disk will run out of space, so you move the data to another disk where you can still mine it, for example for a monthly report. This data is also put on a backup tape. As the data gets older, you may move it to optical storage media, which is larger but slower. A backup may also be made on a duplicate disk and moved to another site. The message here is to understand how much live data you need on your active systems and how much archive data you need for data mining. Then you will understand how often you need to archive data out of the live site.
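An age-based rule of this kind can be written down as a short table of thresholds. The sketch below assumes three tiers that mirror the example (live disk, a second disk used for reporting, and optical storage); the 30-day and 365-day thresholds, the tier names, and the function names are hypothetical.

```python
from typing import Dict, Sequence, Tuple

# (maximum age in days, tier), youngest tier first. Thresholds are assumptions.
DEFAULT_RULES: Sequence[Tuple[int, str]] = (
    (30, "live disk"),
    (365, "reporting disk"),
)
OLDEST_TIER = "optical archive"


def tier_for_age(age_days: int,
                 rules: Sequence[Tuple[int, str]] = DEFAULT_RULES,
                 oldest_tier: str = OLDEST_TIER) -> str:
    """Return the storage tier a record of the given age belongs on."""
    if age_days < 0:
        raise ValueError("age must not be negative")
    for max_age, tier in rules:
        if age_days <= max_age:
            return tier
    return oldest_tier


def plan_moves(record_ages: Dict[str, int],
               current_tiers: Dict[str, str]) -> Dict[str, str]:
    """Map each record that is on the wrong tier to the tier it should move to."""
    moves = {}
    for record, age_days in record_ages.items():
        target = tier_for_age(age_days)
        if current_tiers.get(record) != target:
            moves[record] = target
    return moves
```

Running `plan_moves` on a schedule answers the question the paragraph ends on: how much data is due to leave the live site, and how often.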
Journaling: There are many different types of journaling. One example is where an enterprise journals its transaction logs to another site. In the event of a disaster, a full backup is restored and the transaction logs are replayed against the restored data. In this case the amount of data lost is normally at or near zero.
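Recovery from a journal amounts to starting from the last full backup and reapplying, in order, every logged transaction the backup does not already contain. The sketch below assumes a very simple model in which each journal entry is a numbered delta against a keyed balance; the entry layout and the function name are hypothetical.

```python
from typing import Dict, Iterable, Tuple

JournalEntry = Tuple[int, str, float]  # (sequence number, key, delta)


def recover(backup_state: Dict[str, float], backup_seq: int,
            journal: Iterable[JournalEntry]) -> Tuple[Dict[str, float], int]:
    """Rebuild current state from a full backup plus the shipped journal.

    backup_seq is the sequence number of the last transaction already
    contained in the backup. Returns the recovered state and the sequence
    number of the last transaction applied.
    """
    state = dict(backup_state)  # never modify the backup itself
    last_applied = backup_seq
    for seq, key, delta in sorted(journal, key=lambda entry: entry[0]):
        if seq <= backup_seq:
            continue  # already reflected in the backup
        if seq != last_applied + 1:
            raise ValueError(f"journal gap: expected {last_applied + 1}, got {seq}")
        state[key] = state.get(key, 0) + delta
        last_applied = seq
    return state, last_applied
```

Refusing to continue past a gap is a deliberate choice in this sketch: any transaction that never reached the remote journal is exactly the data that would be lost.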
Off-site backup or clustering: Off-site backups are straightforward: take the data and move it off-site. (Duh!) Clustering is one form of that, in which the data is clustered live over the network (WAN) to another site. If a file is changed on the primary site, it is automatically duplicated on the clustered site. The difference between journaling and clustering is the amount of data: journaling moves just the delta of each transaction, while clustering moves complete files or volumes.
When selecting an offsite storage facility and vendor, consider the following:
Cost: this is always important.
Geographic area: the distance from the organization and the probability of the storage site being affected by the same disaster event as the organization
Accessibility: the length of time necessary to retrieve the data from storage and the storage facility's operating hours
Security: the security capabilities of the storage facility and employee confidentiality, which must meet the data's sensitivity and security requirements
Environment: the structural and environmental conditions of the storage facility (for example, temperature, humidity, fire prevention, and power management controls)
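When several vendors are being compared, the five criteria above can be turned into a simple weighted score. The sketch below is one possible way to do that and not a standard method; the criterion keys, the rating scale, and the weights are all assumptions to be agreed on by whoever makes the decision.

```python
from typing import Dict

CRITERIA = ("cost", "geographic_area", "accessibility", "security", "environment")


def weighted_score(ratings: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of a vendor's ratings (higher is better)."""
    missing = [c for c in CRITERIA if c not in ratings or c not in weights]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    total_weight = sum(weights[c] for c in CRITERIA)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(ratings[c] * weights[c] for c in CRITERIA) / total_weight


def best_vendor(vendors: Dict[str, Dict[str, float]],
                weights: Dict[str, float]) -> str:
    """Name of the vendor with the highest weighted score."""
    return max(vendors, key=lambda name: weighted_score(vendors[name], weights))
```

Giving security a higher weight than cost, for instance, keeps a cheap facility with weak controls from winning on price alone.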
Why is a fail-over site needed? The basic answer is that in the event of a primary site outage, the business can continue.
Overall, you would hope that a primary site outage would be rare. The determination that a primary site is at risk is based on the following criteria:
Business impact: What is the SLA for your business? Can your business be down for several days due to a fire, flood, or loss of facility? The answers will help you determine what type of fail-over site you need. Let's say you are a credit card company that has transactions every minute or every second. If you go down for a few minutes, you could be open to credit card fraud and, of course, lost business. What if you lose your building for several days? It's the same with an airline or a bank with several offices. In any of these examples, a company would lose business and possibly go out of business if it cannot move to a working site quickly. On the other end of the spectrum is the small-to-medium business. In the smallest case, the owner could keep a backup at a partner's house, purchase a computer from a local vendor, and be on-line in about 4 to 6 hours, with little to no impact on the business.
Location: Because of your business and other factors, you may have your business and data center in the middle of a hurricane and/or tornado zone, where you can lose your IT facility. You may also have a business in a part of town that has a high crime rate. In either case you can lose your computers and/or facilities. Take the time to understand how the location of your business exposes it to localized impacts. Check with your insurance company to see if it has any data for your area, for example whether you are in the 100-year flood plain.
Let's look at a few examples of backup sites:
Billy Jo-Bob's House: You won't find this one listed in any other books on disaster recovery. OK, this is not a well-known site type, but it is the simplest of recovery mechanisms. Keep a copy of a daily backup at someone's house, such as that of your business partner or a trusted employee. Remember to test your recovery capability at least once a month.
Hot Sites: This is the best of the best, and normally the most expensive. The hot site can serve not only as a hot site, but also as a load-balancing site, or even an overload site. It is normally built with the necessary system hardware, supporting infrastructure, and support personnel, and is staffed 24 hours a day, 7 days a week.
Mirrored Site: This is a redundant facility that contains duplicated data from the primary site. Some type of system clustering is normally used. Mirrored sites must be identical to the primary site in all technical respects. These sites provide the highest degree of availability because the data is processed and stored at the primary and alternate site simultaneously. These sites typically are designed, built, operated, and maintained by the organization. The mirrored site is the hot site plus all of the active data.
A Warm Site: This is a dedicated facility that contains some or all of the IT infrastructure needed to get back on-line. The warm site will typically include:
Network and telecommunications
Backup power systems
The site may need to be "enabled" before it can be used. This enabling process would include:
Backup tape loads
Network routing for business traffic
Movement of IT staff to the site for setup and business operations.
A Cold Site will consist of a facility with enough space and infrastructure to support some temporary IT operations. Typically you might have:
Spare computers, not configured
Power, but possibly no UPS system
Some telecommunication equipment
Basic environmental controls
You may not have active telecommunication circuits, but in place of that you may have SLAs from the local provider to get you service in a short period of time. Regardless of the type chosen, the alternate site must be able to support system operations as defined in the contingency plan.
In preparation for Year 2000, many organizations developed contingency plans for their mission-critical IT systems. However, these plans were designed for a specific event, program failure, and are not comprehensive enough. Companies must go beyond their information systems and develop comprehensive contingency plans for all mission-critical systems. After the tragic events of September 11, 2001, many organizations initiated a systemic review of their internal support systems. Companies that had some type of plan were able to get back on-line quickly, while others contacted their hardware and software vendors for help. Organizations must review their current plans and update them on a regular basis. Organizations must ensure they can rapidly provide a minimally acceptable level of critical services during an outage and/or a disaster. The following is an example of the steps needed in a contingency plan:
Establish organizational planning guidelines
Develop detailed contingency plans (specific steps)
Validate the plans
Communicate the plans
Once all of the plans are developed, they should then be tested. A pilot is a device that a company can use to test a basic assumption and to then determine if the plan and technology can provide the needed solution. There are basically two types of pilots:
The technical pilot will include all of the hardware and software components. The process pilot will include the actual steps to implement the solution. Pilots are normally divided into nonproduction pilots and production pilots. Nonproduction pilots are conducted in the lab. Production pilots are executed on production equipment but are normally scheduled for holidays and/or off-hours.