Planning for Disaster

Some people seem to operate on the assumption that if they don’t think about a disaster, one won’t happen. This is similar to the idea that if you don’t write a will, you’ll never die—and just about as realistic. No business owner or system administrator should feel comfortable about their degree of preparedness without a clear disaster recovery plan that has been thoroughly tested. Even then, you should continually look for ways to improve the plan. You also need to understand the limitations of any disaster recovery plan: none of them are perfect, and even the best disaster recovery plan needs to be constantly examined and adjusted or it quickly gets out of date.

Planning for disaster or emergencies is not a single step, but an iterative, ongoing process. Systems are not mountains, but rivers, constantly moving and changing, and your disaster recovery plan needs to change as your environment changes. To put together a good disaster recovery plan, one you can bet your business on, you need to follow these steps:

  1. Identify the risks.

  2. Identify the resources.

  3. Develop the responses.

  4. Test the responses.

  5. Iterate.

Identifying the Risks

The first step in creating a disaster recovery plan is to identify the risks to your business and the costs associated with those risks. The risks vary from the simple deletion of a critical file to the total destruction of your place of business and its computers. To properly prepare for a disaster, you need to perform a realistic assessment of the risks, the potential costs and consequences of each disaster scenario, the likelihood of any given disaster scenario, and the resources available to address the risks. Risks that seemed vanishingly remote a few years ago are now part of our everyday life.

Identifying risks is not a job for only one person. As with all the tasks associated with a disaster recovery plan, all concerned parties must participate. There are two important reasons for this: you want to make sure that you have commitment and buy-in from the parties concerned, and you also want to make sure you don’t miss anything important.

No matter how carefully and thoroughly you try to identify the risks, you’ll miss at least one. You should always account for that missing risk by including an “unknown risk” item in your list. Treat it just like any other risk: identify the resources available to address it and develop countermeasures to take should it occur. The difference with this risk, of course, is that your resources and countermeasures are somewhat more generic, and you can’t really test your response to the risk, because you don’t yet know what it is.

Start by trying to list all the possible ways that your system could fail. Solicit help from everyone with a stake in the process. The more people involved in the brainstorming, the more ideas you’ll get and the more prevention and recovery procedures you can develop and practice. Be careful at this stage in the process to not dismiss any idea or concern as trivial, unimportant, or unlikely.

Next, look at all the ways that some external event could affect your system. The team of people responsible for identifying possible external problems is probably similar to a team looking at internal failures, but with some important differences. For example, if your business is housed in a large commercial office building, you’ll want to involve that building’s security and facilities groups even though they aren’t employees of your business. They will not only have important input into the possible threats to the business, but also information on the resources and preventative measures already in place.

Once you’ve identified as many possible risks as you can, you then need to understand and quantify just how likely a particular risk is. If you’re located in a flood plain, for example, you’re much more likely to think flood insurance is a good investment.
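One common way to quantify risks, offered here only as a minimal sketch, is to multiply an estimated yearly likelihood by an estimated cost and rank the results. The risk names, probabilities, and costs below are illustrative assumptions, not figures from any real assessment, and the "expected annual loss" measure is one simple choice among several.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    annual_probability: float  # estimated chance of occurring in a given year (0.0 to 1.0)
    cost_if_occurs: float      # estimated cost of a single occurrence

    @property
    def expected_annual_loss(self) -> float:
        # Likelihood times cost: a rough yardstick for comparing risks.
        return self.annual_probability * self.cost_if_occurs


def rank_risks(risks):
    """Return risks sorted from highest to lowest expected annual loss."""
    return sorted(risks, key=lambda r: r.expected_annual_loss, reverse=True)


# Hypothetical estimates; replace with the numbers your own team agrees on.
RISKS = [
    Risk("Critical file deleted", 0.90, 500),
    Risk("Server disk failure", 0.10, 8_000),
    Risk("Office flood", 0.01, 250_000),
    Risk("Unknown risk", 0.05, 20_000),
]

for risk in rank_risks(RISKS):
    print(f"{risk.name:<24} {risk.expected_annual_loss:>10,.0f}")
```

A ranking like this is only as good as the estimates behind it, but it makes the discussion concrete: a rare, expensive event such as a flood can outrank a frequent, cheap one such as a deleted file.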

Identifying the Resources

Once you’ve identified the risks to your network, you need to identify the resources available to address those risks. These resources can be internal or external, people or systems, hardware or software.

When you’re identifying the resources available to deal with a specific risk, be as complete as you can, but also be specific. Identifying everyone in the Engineering group as a resource to solve a crashed server might look good, but realistically only one or two key people are likely to actually rebuild the server. Make sure you identify those key people for each risk, as well as what more general secondary resources they have to call on. So, for example, the primary resources available to recover a crashed server might consist of your hardware vendor to recover the failed hardware and your own IT person or primary system consultant to restore the software and database. General secondary resources could include Microsoft Premier Support.

An important step in identifying resources in your disaster recovery plan is to specify both the first-line responsibility and the back-end or supervisory responsibility. Make sure everyone knows who to go to when the problem is more than they can handle or when they need additional resources. Also, clearly define when they should do that. The best disaster recovery plans include clear, unambiguous escalation policies. This takes the burden off individuals to decide when and whom to notify and makes it simply part of the procedure.
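The resource list described above can be kept as a simple table and checked for holes. The following is a sketch under stated assumptions: the field names (`primary`, `secondary`, `escalate_to`) and the "Business owner" contact are hypothetical, chosen only to illustrate the idea of naming first-line, secondary, and supervisory resources for each risk.

```python
# Hypothetical resource map: one entry per identified risk.
RESOURCES = {
    "Crashed server": {
        "primary": ["Hardware vendor", "IT person"],
        "secondary": ["Microsoft Premier Support"],
        "escalate_to": "Business owner",
    },
    "Unknown risk": {
        "primary": ["IT person"],
        "secondary": [],     # nobody identified yet
        "escalate_to": "",   # no supervisory contact yet
    },
}


def find_gaps(resources):
    """Return (risk, problem) pairs for every entry with an empty field."""
    gaps = []
    for risk, entry in resources.items():
        for field in ("primary", "secondary", "escalate_to"):
            if not entry.get(field):
                gaps.append((risk, f"missing {field}"))
    return gaps


for risk, problem in find_gaps(RESOURCES):
    print(f"{risk}: {problem}")
```

Running a check like this whenever the plan changes is one way to catch a risk that has no named escalation contact before an emergency exposes it.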

Developing the Responses

An old but relevant adage comes to mind when discussing disaster recovery scenarios: when you’re up to your elbows in alligators, it’s difficult to remember that your original objective was to drain the swamp. This is another way of saying that people lose track of what’s important when they are overloaded by too many problems that require immediate attention. To ensure that your swamp is drained and your network gets back online, you need to take those carefully researched risks and resources and develop a disaster recovery plan. There are two important parts of any good disaster recovery plan:

  • Standard operating procedures (SOPs)

  • Standard escalation procedures (SEPs)

Making sure these procedures are in place and clearly understood by all before a disaster strikes puts you in a far better position to recover gracefully and with a minimum of lost productivity and data.

Standard Operating Procedures

Emergencies bring out both the best and worst in people. If you’re prepared for the emergency, you can be one of those who come out smelling like a rose, but if you’re not prepared and let yourself get flustered or lose track of what you’re trying to accomplish, you can make the whole situation worse than it needs to be.

Although no one is ever as prepared for a system emergency as they’d like to be, careful planning and preparation can give you an edge in recovering expeditiously and with a minimal loss of data. It is much easier to deal with the situation calmly when you know you’ve prepared for this problem and you have a well-organized, tested standard operating procedure (SOP) to follow.

Because the very nature of emergencies is that you can’t predict exactly which one is going to strike, you need to plan and prepare for as many possibilities as you can. The time to decide how to recover from a disaster is before the disaster happens, not in the middle of it when users are screaming and bosses are standing around looking serious and concerned.

Your risk assessment phase involved identifying as many possible disaster scenarios as you could, and in your resource assessment phase you identified the resources that are available and responsible for each of those risks. Now you need to write up SOPs for recovering the system from each of the scenarios. Even the most level-headed system administrator can get flustered when the system has crashed, users are calling every 10 seconds to see what the problem is, the boss is asking every 5 minutes when you’ll have it fixed, and your server won’t boot. And that’s the easy case compared to the mess that can be caused by an external disaster.

Reduce your stress and prevent mistakes by planning for disasters before they occur. Practice recovering from each of your disaster scenarios. Write down each of the steps, and work through questionable or unclear areas until you can identify exactly what it takes to recover from the problem. This is like a fire drill, and you should do it for the same reasons—not because a fire is inevitable, but because fires do happen, and the statistics demonstrate irrefutably that those who prepare for a fire and practice what to do in a fire are far more likely to survive it.

Your job as a system administrator is to prepare for disasters and practice what to do in those disasters, not because you expect the disaster, but because if you do have one, you want to be the hero, not the goat. After all, it isn’t often that the system administrator or IT consultant gets to be a hero, so be ready when your time comes.

The first step in developing any SOP is to outline the overall steps you want to accomplish. Keep it general at this point—you’re looking for the big picture here. Again, you want everyone to be involved in the process. What you’re really trying to do is make sure you don’t forget any critical steps, and that’s much easier when you get the overall plan down first. There will be plenty of opportunity later to cover the specific details.

Once you have a broad, high-level outline for a given procedure, the people you identified as the actual resources during the resource assessment phase should start to fill in the blanks of the outline. You don’t need every detail at this point, but you should get down to at least a level below the original outline. This will help you identify missing resources that are important to a timely resolution of the problem. Again, don’t get too bogged down in the details at this point. You’re not actually writing the SOP, just trying to make sure that you’ve identified all of its pieces.

When you feel confident that the outline is ready, get the larger group back together again. Go over the procedure and smooth out the rough edges, refining the outline and listening to make sure you haven’t missed anything critical. When everyone agrees that the outline is complete, you’re ready to add the final details to it.

The people who are responsible for each procedure should now work through all the details of the disaster recovery plan and document the steps thoroughly. They should keep in mind that the people who actually perform the recovery might not be who they expect. It’s great to have an SOP for recovering from a failed router, but if the only person who understands the procedure is the IT person, and she’s on vacation in Bora Bora that week, your disaster recovery plan has a big hole in it.
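The vacation problem can be checked mechanically if you record who is able to perform each procedure. This is a minimal sketch; the SOP names and people are made up for illustration, and a real list would come from your own resource assessment.

```python
# Hypothetical record of who can actually carry out each SOP.
SOP_TRAINED = {
    "Recover failed router": {"Alice"},
    "Restore server from backup": {"Alice", "Bob"},
}


def single_person_sops(trained):
    """Return SOPs that depend on exactly one person (or on nobody)."""
    return sorted(sop for sop, people in trained.items() if len(people) < 2)


def uncovered_sops(trained, away):
    """Return SOPs that nobody currently available can perform."""
    away = set(away)
    return sorted(sop for sop, people in trained.items() if not (people - away))


print(single_person_sops(SOP_TRAINED))
print(uncovered_sops(SOP_TRAINED, away={"Alice"}))
```

Any SOP reported by `single_person_sops` is a candidate for cross-training before the one person who understands it goes on vacation.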

When you create the documentation, write down everything. What seems obvious to you now, while you’re devising the procedure, will not seem at all obvious in six months or a year when you suddenly have to follow it under stress.

start sidebar
Real World

Multiple Copies, Multiple Locations

It’s tempting to centralize your SOPs into a single, easily accessible database. You should do that, making sure everyone understands how to use it. But you’ll also want to have alternative locations and formats for your procedures. Not only do you not want to keep the only copy in a single database, you also don’t want to have only an electronic version—how accessible is the SOP for recovering a failed server going to be when the server has failed? Always maintain hard-copy versions as well. The one thing you don’t want to do is create a single point of failure in your disaster recovery plan!

Every good server room should have a large binder, prominently visible and clearly identified, that contains all the SOPs. Each responsible person should also have one or more copies of at least the procedures he or she is either a resource for or likely to become a resource for. We like to keep copies of all our procedures in several places so that we can get at them no matter what the source of the emergency or where we happen to be when one of our pagers goes off.

end sidebar

Once you have created the SOPs, your job has only begun. You need to keep them up to date and make sure that they don’t become stale. It’s no good having an SOP to recover your ISDN connection to the Internet when you ripped the ISDN line out a year ago and put in a DSL line with three times the bandwidth at half the cost.

You also need to make sure that all your copies of an SOP are updated. Electronic ones should probably be stored in a database or in a folder on the Windows Small Business Server that is available off-line. However, hard-copy documents are notoriously tricky to maintain. A good method is to make yet another SOP that details who updates what SOPs, how often, and who gets fresh copies whenever a change is made. Then put a version control system into place and make sure everyone understands his or her role in the process. Build rewards into the system for timely and consistent updating of SOPs: if 10 or 20 percent of someone’s bonus depends on keeping those SOPs up to date and distributed, you can be sure the SOPs will be brought current at least as often as the review cycle comes around.
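The "SOP about SOPs" can be backed by a small review schedule that flags stale procedures. The sketch below assumes a hypothetical list of SOPs with an owner, a last-reviewed date, and a review interval in days; none of these names or dates come from the text.

```python
from datetime import date, timedelta

# Hypothetical schedule: (SOP name, owner, last reviewed, review interval in days).
SOP_REVIEWS = [
    ("Recover failed server", "Alice", date(2004, 1, 15), 90),
    ("Restore Internet connection", "Bob", date(2003, 6, 1), 180),
]


def overdue_sops(reviews, today):
    """Return (name, owner, days_overdue) for every SOP past its review date."""
    overdue = []
    for name, owner, last_reviewed, interval_days in reviews:
        due = last_reviewed + timedelta(days=interval_days)
        if today > due:
            overdue.append((name, owner, (today - due).days))
    return overdue


for name, owner, days in overdue_sops(SOP_REVIEWS, today=date(2004, 3, 1)):
    print(f"{name} (owner: {owner}) is {days} days overdue for review")
```

Passing `today` explicitly, rather than reading the clock inside the function, keeps the check easy to test and to rerun for any date.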

Standard Escalation Procedures

No matter how carefully you’ve identified potential risks, and how detailed your procedures to recover from them are, you’re still likely to have situations you didn’t anticipate. An important part of any disaster recovery plan is a standardized escalation procedure. Not only should each individual SOP have its own procedure-specific SEP, but you should also have an overall escalation procedure that covers everything you haven’t thought of—because it’s certain you haven’t thought of everything.

An escalation procedure has two functions—resource escalation and notification escalation. Both have the same purpose: to make sure that everyone who needs to know about the problem is up to date and involved as appropriate, and to keep the overall noise level down so that the work of resolving the problem can go forward as quickly as possible. The resource escalation procedure details the resources that are available to the people who are trying to recover from the current disaster so that these people don’t have to try to guess who (or what) the appropriate resource might be when they run into something they can’t handle or something doesn’t go as it is supposed to. This procedure helps them stay calm and focused. They know that if they run into a problem, they aren’t on their own, and they know exactly who to call when they do need help.

The notification escalation procedure details who is to be notified of serious problems. Even more important, it should provide specifics regarding when notification is to be made. If a particular print queue crashes but comes right back up, you might want to send a general message only to the users of that particular printer letting them know what happened. However, if your e-mail has been down for more than half an hour, a lot of folks are going to be concerned. The SEP for e-mail should detail who needs to be notified when the server is unavailable for longer than some specified time, and it should probably detail what happens and who gets notified when it’s still down some significant amount of time after that.

This notification has two purposes: to make sure that the necessary resources are made available as required, and to keep everyone informed and aware of the situation. If you let people know that you’ve had a server hardware failure and that the vendor has been called and will be on site within an hour, you’ll cut down the number of phone calls dramatically, freeing you to do whatever you need to do to ensure that you’re ready when the vendor arrives.
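A notification escalation procedure of this kind boils down to a list of time thresholds and contacts. The following sketch assumes a hypothetical SEP for the e-mail server; the thresholds and the contact names are placeholders, since the right values depend entirely on your business.

```python
# Hypothetical SEP for the e-mail server: (minutes of downtime, who to notify).
EMAIL_SEP = [
    (0, "IT person"),
    (30, "All e-mail users"),
    (120, "Business owner"),
]


def who_to_notify(sep, minutes_down):
    """Return everyone who should have been notified after this much downtime."""
    return [contact for threshold, contact in sep if minutes_down >= threshold]


print(who_to_notify(EMAIL_SEP, 45))
```

Writing the thresholds down in one place, whatever the format, takes the decision about when and whom to notify out of the hands of the person busy fixing the problem.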

Testing the Responses

A disaster recovery plan is nice to have, but it really isn’t worth a whole lot until it has actually been tested. Needless to say, the time to test the plan is at your convenience and under controlled conditions, rather than in the midst of an actual disaster. It’s a nuisance to discover that your detailed disaster recovery plan has a fatal flaw in it when you’re testing it under controlled conditions. It’s a bit more than a nuisance to discover it when every second counts.

You won’t be able to test everything in your disaster recovery plans. Even most large organizations don’t have the resources to create fully realistic simulated natural disasters and test their response to each of them under controlled conditions, and even fewer small businesses have those kinds of resources. Nevertheless, there are things you can do to test your response plans. The details of how you test them depend on your environment, but they should include as realistic a test as feasible and should, as much as possible, cover all aspects of the response plan.

Another reason to test the disaster recovery plan is that it provides a valuable training ground. If you’ve identified primary and backup resources, as you should, chances are that the people you’ve identified as backup resources are not as skilled or knowledgeable in a particular area as the primary resource. Testing the procedures gives you a chance to train the backup resources at the same time.

You should also consider using the testing to cross-train people who are not necessarily in the primary response group. Not only will they get valuable training, but you’ll also create a knowledgeable pool of people who might not be directly needed when the procedure has to be used for real, but who can act as key communicators with the rest of the community.

Iterating

When you finish a particular disaster recovery plan, you might think your job is done, but in fact it has just begun. Standardizing a process is actually just the first step. You need to continually look for ways to improve it.

You should make a regular, scheduled practice of pulling out your disaster recovery plan with those responsible and making sure it’s up to date. Use the occasion to actually look at it and see how you can improve on it. Take the opportunity to examine your environment. What’s changed since you last looked at the plan? What equipment has been retired, and what has been added? What software is different? Are all the people on your notification and escalation lists still working at the company in the same roles? Are the phone numbers, including home phone numbers, up to date?

Another important way to iterate your disaster recovery plan is to use every disaster as a learning experience. Once the disaster or emergency is over, get everyone together as soon as possible to talk about what happened. Find out what they think worked and what didn’t in the plan. Actively solicit suggestions for how the process could be improved. Then make the changes and test them. You’ll not only improve your responsiveness to this particular type of disaster, but you’ll improve your overall responsiveness by getting people involved in the process and enabling them to be part of the solution.



Microsoft Windows Small Business Server 2003 Administrator's Companion
ISBN: 0735620202
Year: 2004
Pages: 224
