Some people seem to operate on the assumption that if they don't think about a disaster, one won't happen. This is similar to the idea that if you don't write a will, you'll never die—and just about as realistic. No system administrator should feel comfortable about the network's degree of preparedness without a clear disaster recovery plan that has been thoroughly tested. Even then, it's wise to always look for ways to improve the plan.
Planning for disaster or emergencies is not a single step, but an iterative, ongoing process. Systems are not mountains, but rivers, constantly moving and changing, and your disaster recovery plan needs to change as your environment changes. To put together a good disaster recovery plan, one you can bet your business on, you need to follow these steps:

- Identify the risks.
- Identify the resources available to address those risks.
- Develop the recovery plans and procedures.
- Test the plans and procedures.
- Iterate as your environment changes.
The first step in creating a disaster recovery plan is to identify the risks to your business and the costs associated with those risks. The risks vary from the simple deletion of a critical file to the total destruction of your place of business and its computers. To properly prepare for a disaster, you need to perform a realistic assessment of the risks, the potential costs and consequences of each disaster scenario, and the likelihood of any given disaster scenario.
Identifying risks is not a job for only one person. As with all of the tasks associated with a disaster recovery plan, all concerned parties must participate. There are two important reasons for this: you want to make sure that you have commitment and buy-in from the parties concerned, and you also want to make sure you don't miss anything important.
No matter how carefully and thoroughly you try to identify the risks, you'll miss at least one. You should always account for that missing risk by including an "unknown risk" item in your list. Treat it just like any other risk: identify the resources available to address it and develop countermeasures to take should it occur. The difference with this risk, of course, is that your resources and countermeasures are somewhat more generic, and you can't really test your response to the risk, because you don't yet know what it is.
Start by trying to list all of the possible ways that your system could fail. If you have a team of people responsible for supporting the network, solicit everyone's help in the process. The more people involved in the brainstorming, the more ideas you'll get and the more prevention and recovery procedures you can develop and practice.
Next, look at all of the ways that some external event could affect your system. The team of people responsible for identifying possible external problems is probably similar to a team looking at internal failures, but with some important differences. In a large industrial plant, for example, when you start to look at external failures and disasters, you'll want to involve the security and facilities groups, because they will need to understand your needs as well as provide input on how well the plant is protected from these disasters.
An important part of this risk assessment phase is to understand and quantify just how likely a particular risk is. If you're located in a flood plain, for example, you're much more likely to think flood insurance is a good investment.
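One simple way to quantify risks, assuming you can put even rough numbers on likelihood and cost, is to rank each scenario by its expected yearly loss (likelihood per year times cost per occurrence). The scenarios and figures in this sketch are hypothetical placeholders, not recommendations:

```python
# A minimal sketch of ranking disaster scenarios by expected yearly loss:
# occurrences per year times estimated cost per occurrence.
# All scenarios and dollar figures below are hypothetical examples.

scenarios = [
    # (scenario, occurrences per year, estimated cost per occurrence)
    ("Accidental deletion of a critical file", 12.0, 500),
    ("Server hardware failure", 0.5, 20_000),
    ("Flood destroys the server room", 0.01, 2_000_000),
]

def annualized_loss(per_year, cost):
    """Expected yearly loss for one scenario."""
    return per_year * cost

# Rank the scenarios so the biggest expected losses get planning priority.
ranked = sorted(scenarios, key=lambda s: annualized_loss(s[1], s[2]), reverse=True)
for name, per_year, cost in ranked:
    print(f"{name}: ${annualized_loss(per_year, cost):,.0f} per year")
```

Even with rough estimates, this kind of ranking helps justify where to spend your preparation budget: the rare flood may still outrank the everyday deleted file once cost is factored in.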
Once you've identified the risks to your network, you need to identify what your resources are to address those risks. These resources can be internal or external, people or systems, hardware or software.
When you're identifying the resources available to deal with a specific risk, be as complete as you can, but also be specific. Identifying everyone in the IT group as a resource to solve a crashed server might look good, but realistically only one or two key people are likely to actually rebuild the server. Make sure you identify those key people for each risk, as well as what more general secondary resources they have to call on. So, for example, the primary resources available to recover a crashed Microsoft Exchange server might consist of one or two staff members who can recover the failed hardware and another one or two staff members who can restore the software and database. General secondary resources would include everyone in the IT group as well as the hardware vendor and Microsoft Premier Support.
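A risk-to-resource map like the one described above can be kept as simple structured data, which also makes it easy to keep under version control. This sketch uses hypothetical risk names and placeholder staff names; the generic "unknown risk" entry mirrors the catchall item discussed earlier:

```python
# A sketch of a risk-to-resource map. Risk names, staff names, and vendors
# are hypothetical placeholders.
resources = {
    "exchange_server_crash": {
        "primary": ["hardware: staff member A", "software/database: staff member B"],
        "secondary": ["IT group", "hardware vendor", "Microsoft Premier Support"],
    },
    # The catchall entry is deliberately generic, since the risk is unknown.
    "unknown_risk": {
        "primary": ["on-call administrator"],
        "secondary": ["IT group"],
    },
}

def primary_resources(risk):
    """Return the specific people to call first; fall back to the generic entry."""
    return resources.get(risk, resources["unknown_risk"])["primary"]
```

Looking up an unlisted risk, such as `primary_resources("power_outage")`, falls through to the generic on-call entry rather than failing, which is exactly the behavior you want from an "unknown risk" item.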
An important step in identifying resources in your disaster recovery plan is to specify both the first-line responsibility and the back-end or supervisory responsibility. Make sure everyone knows who to go to when the problem is more than they can handle or when they need additional resources. Also, clearly define when they should do that. The best disaster recovery plans include clear, unambiguous escalation policies. This takes the burden off individuals to decide when and who to notify and makes it simply part of the procedure.
An old but relevant adage comes to mind when discussing disaster recovery scenarios: when you're up to your elbows in alligators, it's difficult to remember that your original objective was to drain the swamp. This is another way of saying that people lose track of what's important when they are overloaded by too many problems that require immediate attention. To ensure that your swamp is drained and your network gets back online, you need to take those carefully researched risks and resources and develop a disaster recovery plan. There are two important parts of any good disaster recovery plan:

- Standard operating procedures (SOPs) for recovering from the disaster scenarios you've identified
- Standard escalation procedures for problems that exceed, or fall outside, those SOPs
Making sure these procedures are in place and clearly understood by all before a disaster strikes puts you in a far better position to recover gracefully and with a minimum of lost productivity and data.
Emergencies bring out both the best and worst in people. If you're prepared for the emergency, you can be one of those who come out smelling like a rose, but if you're not prepared and let yourself get flustered or lose track of what you're trying to accomplish, you can make the whole situation worse than it needs to be.
Although no one is ever as prepared for a system emergency as they'd like to be, careful planning and preparation can give you an edge in recovering expeditiously and with a minimal loss of data. It is much easier to deal with the situation calmly when you know you've prepared for this problem and you have a well-organized, tested standard operating procedure to follow.
Because the very nature of emergencies is that you can't predict exactly which one is going to strike, you need to plan and prepare for as many possibilities as you can. The time to decide how to recover from a disaster is before the disaster happens, not in the middle of it when users are screaming and bosses are standing around looking serious and concerned.
Your risk assessment phase involved identifying as many possible disaster scenarios as you could, and in your resource assessment phase you identified the resources that are available and responsible for each of those risks. Now you need to write up SOPs for recovering the system from each of the scenarios. Even the most level-headed system administrator can get flustered when the system has crashed, users are calling every 10 seconds to see what the problem is, the boss is asking every 5 minutes when you'll have it fixed, and your server won't boot.
Reduce your stress and prevent mistakes by planning for disasters before they occur. Practice recovering from each of your disaster scenarios. Write down each of the steps, and work through questionable or unclear areas until you can identify exactly what it takes to recover from the problem. This is like a fire drill, and you should do it for the same reasons—not because a fire is inevitable, but because fires do happen, and the statistics demonstrate irrefutably that those who have prepared for a fire and practiced what to do in a fire are far more likely to survive it.
Your job as a system administrator is to prepare for disasters and practice what to do in those disasters, not because you expect the disaster, but because if you do have one, you want to be the hero, not the goat. After all, it isn't often that the system administrator gets to be a hero, so be ready when your time comes.
The first step in developing any SOP is to outline the overall steps you want to accomplish. Keep it general at this point—you're looking for the big picture here. Again, you want everyone to be involved in the process. What you're really trying to do is make sure you don't forget any critical steps, and that's much easier when you get the overall plan down first. There will be plenty of opportunity later to cover the specific details.
Once you have a broad, high-level outline for a given procedure, the people you identified as the actual resources during the resource assessment phase should start to flesh in the outline. You don't need every detail at this point, but you should get down to at least a level below the original outline. This will help you identify missing resources that are important to a timely resolution of the problem. Again, don't get too bogged down in the details at this point. You're not actually writing the SOP, just trying to make sure that you've identified all of its pieces.
When you feel confident that the outline is ready, get the larger group back together again. Go over the procedure and smooth out the rough edges, refining the outline and listening to make sure you haven't missed anything critical. When everyone agrees that the outline is complete, you're ready to add the final details to it.
The people who are responsible for each procedure should now work through all of the details of the disaster recovery plan and document the steps thoroughly. They should keep in mind that the people who actually perform the recovery might not be who they expect. It's great to have an SOP for recovering from a failed router, but if the only person who understands the procedure is the network engineer, and she's on vacation in Bora Bora that week, your disaster recovery plan has a big hole in it.
When you create the documentation, write down everything. What seems obvious to you now, while you're devising the procedure, will not seem at all obvious in six months or a year when you suddenly have to use it under stress.
Multiple Copies, Multiple Locations
It's tempting to centralize your SOPs into a single, easily accessible database. You should do that, making sure everyone understands how to use it. But you'll also want to have alternative locations and formats for your procedures. Not only do you not want to keep them in a single database, you also don't want to have only an electronic version. Always maintain hard copy versions as well. Every good server room should have a large binder, prominently visible and clearly identified, that contains all of the SOPs. Each responsible person should also have one or more copies of at least the procedures he or she is either a resource for or likely to become a resource for. We like to keep copies of all our procedures in several places so that we can get at them no matter what the source of the emergency or where we happen to be when one of our pagers goes off.
Once you have created the SOPs, your job has only begun. You need to keep them up to date and make sure that they don't become stale. It's no good having an SOP to recover your ISDN connection to a branch office when you ripped the ISDN line out a year ago and put in a DSL line with three times the bandwidth at half the cost.
You also need to make sure that all of your copies of an SOP are updated. Electronic ones should probably be stored in a replicated database. However, hard copy documents are notoriously tricky to maintain. One way to do so is to make yet another SOP that details who updates what SOPs and who gets fresh copies whenever a change is made. Then put a version control system into place and make sure everyone understands his or her role in the process.
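One lightweight way to audit whether all your copies match the master, assuming you keep electronic source text for each SOP, is to compare content fingerprints. The locations and contents in this sketch are placeholders:

```python
# A sketch of detecting stale SOP copies by hashing each copy's text against
# the master version. Locations and contents are hypothetical placeholders.
import hashlib

def digest(text):
    """Content fingerprint for one copy of an SOP."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def stale_copies(master_text, copies):
    """copies maps a location to that copy's text; returns out-of-date locations."""
    master = digest(master_text)
    return sorted(loc for loc, text in copies.items() if digest(text) != master)
```

Run against a master text and a dictionary of copy locations, anything returned is due for a reprint; hard copy binders, of course, still need a human to do the comparison, which is why the update process itself deserves its own SOP.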
No matter how carefully you've identified potential risks, and how detailed your procedures to recover from them, you're still likely to have situations you didn't anticipate. An important part of any disaster recovery plan is a standardized escalation procedure (SEP). Not only should each individual SOP have its own procedure-specific SEP, but you should also have an overall escalation procedure that covers everything you haven't thought of, because it is certain that you haven't thought of everything.
An escalation procedure has two functions—resource escalation and notification escalation. Both have the same purpose: to make sure that everyone who needs to know about the problem is up to date and involved as appropriate, and to keep the overall noise level down so that the work of resolving the problem can go forward as quickly as possible.
The resource escalation procedure details the resources that are available to the people who are trying to recover from the current disaster, so that they don't have to try to guess who (or what) the appropriate resource might be when they run into something they can't handle or something doesn't go as it is supposed to. This helps them stay calm and focused. They know that if they run into a problem, they aren't on their own, and they know exactly who to call when they do need help.
The notification escalation procedure details who is to be notified of serious problems. Even more important, it should provide specifics regarding when notification is to be made. If your print server crashes but comes right back up, you might want to send only a general message to the users of that particular server letting them know what happened. However, if your mail server has been down for more than half an hour, a lot of folks are going to be concerned. The SEP for that mail server should detail who needs to be notified if the server is unavailable for longer than some specified time, and it should probably detail what happens and who gets notified when it's still down some significant amount of time after that.
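A time-based notification escalation like the one just described can be written down as a simple table of thresholds. The downtime thresholds and audiences in this sketch are hypothetical; the real values come from your own SEP:

```python
# A sketch of a time-based notification escalation for a mail server outage.
# Thresholds and audiences are hypothetical placeholders.

ESCALATION = [
    # (minutes of downtime, who must be notified by that point)
    (30, "affected users"),
    (60, "help desk and department heads"),
    (120, "IT director"),
]

def notify_by_now(minutes_down):
    """Everyone who should already have been notified at this point in the outage."""
    return [who for threshold, who in ESCALATION if minutes_down >= threshold]
```

At 45 minutes of downtime, only the affected users should have heard from you; past the two-hour mark, all three audiences are on the list. Writing the thresholds down this way removes any judgment call about when a notification is "owed."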
This notification has two purposes: to make sure that the necessary resources are made available as required and to keep everyone informed and aware of the situation. If you let people know that you've had a server hardware failure and that the vendor has been called and will be on site within an hour, you'll dramatically cut down the number of phone calls, freeing you to do whatever you need to do to ensure that you're ready when the vendor arrives.
A disaster recovery plan is nice to have, but it really isn't worth a whole lot until it has actually been tested. Needless to say, the time to test the plan is at your convenience and under controlled conditions, rather than in the midst of an actual disaster. It's a nuisance to discover that your detailed disaster recovery plan has a fatal flaw in it when you're testing it under controlled conditions. It's a bit more than a nuisance to discover it when every second counts.
You won't be able to test all aspects of all disaster recovery plans. Few organizations have the resources to create fully realistic simulated natural disasters and test their response to each of them under controlled conditions. Nevertheless, there are things you can do to test your response plans. The details of how you test them depend on your environment, but they should include as realistic a test as feasible and should, as much as possible, cover all aspects of the response plan.
The other reason to test the disaster recovery plan is that it provides a valuable training ground. If you've identified primary and backup resources, as you should, chances are that the people you've identified as backup resources are not as skilled or knowledgeable in a particular area as the primary resource. Testing the procedures gives you a chance to train the backup resources at the same time.
You should also consider using the testing to cross-train people who are not necessarily in the primary response group. Not only will they get valuable training, but you'll also create a knowledgeable pool of people who might not be directly needed when the procedure has to be used for real, but who can act as key communicators with the rest of the community.
When you finish a particular disaster recovery plan, you may think your job is done, but in fact your work is just beginning. Standardizing a process is actually just the first step. You also need to improve it.
You should make a regular, scheduled practice of pulling out your disaster recovery plan with your group and making sure it's up to date. Use the occasion to actually look at it and see how you can improve on it. Take the opportunity to examine your environment. What's changed since you last looked at the plan? What servers have been retired, and what new ones have been added? What software is different? Are all of the people on your notification and escalation lists still working at the company, in the same roles? Are the phone numbers up to date?
Another way to iterate your disaster recovery plan is to use every disaster as a learning experience. Once the disaster or emergency is over, get everyone together as soon as possible to talk about what happened. Find out what they think worked and what didn't in the plan. Actively solicit suggestions for how the process could be improved. Then make the changes and test them. You'll not only improve your responsiveness to this particular type of disaster, but you'll improve your overall responsiveness by getting people involved in the process and enabling them to be part of the solution.