Recovering from Disasters


From a hardware and software perspective, we have already talked about at least 90 percent of the disaster recovery puzzle. If you're an Exchange system administrator and you've protected your servers with redundant hardware, especially interserver redundant hardware, or you can restore any crashed Exchange server under your management, you've pretty much got it made. If you also have to worry about Windows Server 2003 and you can bring a domain controller or stand-alone server back from the dead with VSS or Automated System Recovery (ASR) backups, or even with clunkier, more traditional streaming backups, you're in a good place, too.
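
As a concrete example, on a Windows Server 2003 domain controller the ASR set itself is created with ntbackup's Automated System Recovery Wizard, but the System State backup that protects Active Directory can be scripted from the command line. The job name and backup file path below are placeholders; treat this as a minimal sketch, not a complete backup routine:

  ntbackup backup systemstate /J "DC1 System State" /F "E:\Backups\dc1-systemstate.bkf"

Running a job like this on a schedule, and moving the resulting .bkf file off the server, is what makes bringing a domain controller back from the dead a realistic proposition.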

All too often the term disaster recovery is misused in our industry. It's used to mean something as simple as restoring a database. In the big picture, this is not a disaster for most organizations. In this section, we want to talk more about real disasters and disaster recovery.

Disaster recovery adds another dimension to the reliability and availability picture. You have to deal with simultaneous multisystem unavailability, up to and including the sudden disappearance of all or a major part of your server, storage, workstation, and networking systems. The cause of such a disaster can be anything from a terrorist attack to an earthquake, from a building fire to a major power outage or lightning strike.

Disaster recovery isn't fun to think about. There are so many variables, including the potential for astronomical costs, that it's easy to either go bonkers or avoid even thinking about the whole thing. The best way to calm yourself and your boss when disaster recovery rears its ugly head is by building and living by a set of best-possible, cost-realistic strategies that specify what you'll do to avoid disasters and the actions you'll take if disaster strikes.

Disaster Recovery Strategies

Shopping for a disaster recovery solution is a lot like buying a new car. Car salespeople will frequently urge you not to worry about price and just test-drive the car you really want. Before you know it, you have spent a lot of time looking at your ideal car even though you know you can't afford it.

Well, developing disaster recovery strategies can be like buying a car at a dealership. You call in a company that specializes in disaster recovery and before you know it, you've got a proposal for a multimillion-dollar solution. The solution, by the way, is usually quite impressive. If only you could afford it.

The first thing you need to consider when developing a disaster recovery strategy is what your organization does and how a disaster might affect what it does. If e-mail and related Exchange services are central to your organization's operation and bottom line, then you need a very aggressive disaster recovery strategy. If your organization could do without e-mail for a few days, then a less aggressive strategy should be acceptable.

In building your disaster recovery strategy, don't be driven by unrealistic assessments of the importance of e-mail. And don't take a back seat in discussions about the role of e-mail in your organization. You live with Exchange. You know what users are doing with e-mail, and you hear user complaints when your Exchange system isn't available. Your goal must be to drive e-mail disaster recovery deliberation toward a solution that you are comfortable with, the checkbooks, egos, or misperceptions of your bosses notwithstanding. As strategies are considered, you need to make sure your management clearly understands the limits of each. This is not just to protect yourself, but to set realistic management expectations from the get-go.

Piggybacking on Non-E-Mail Disaster Recovery Strategies

Unless e-mail is all your organization does, there should be a disaster recovery strategy for other IT functionality. Adding e-mail to an existing strategy can be a relatively inexpensive option. But don't piggyback if you know the non-e-mail strategy won't work for e-mail. We have been in situations where e-mail was more important than other IT functions and in situations where it was less important. Management loved it when we told them that e-mail required a less aggressive disaster recovery strategy. They hated it when we pressed for a more aggressive (more expensive) strategy for e-mail.


We are going to discuss five disaster recovery strategies, from the fanciest and most costly to the more mundane and reasonably priced. Remember that most of these strategies can be implemented in house or by a third party. Don't write off outsourcing for disaster recovery. For some organizations, it is a good, cost-effective option.

Here are the disaster recovery strategies that we will cover in this section:

  • Offsite replication of an entire system, such as clustered continuous replication (CCR) solutions

  • Offsite replication of servers, workstations, disk storage, backup hardware, software, and data

  • Onsite replication of an entire system, such as CCR solutions

  • Onsite replication of servers, disk storage, backup hardware, software, and data

  • Onsite presence of spare server, disk storage, and backup hardware

For many organizations, a combination of these strategies makes the most sense. Disasters come in all flavors and intensities. Sometimes they require the aggressive solution of offsite full-system replication. Sometimes a less aggressive strategy is all that's required. The key is to understand the various disaster recovery strategies and pick the ones that best serve your organization.

Warning 

Keep in mind as you read through the discussion of disaster recovery strategies that a strategy is not a plan. Once you've selected the strategy or strategies that work for your organization, you should develop a written plan that provides specifics. You need to specify your strategy in detail and provide step-by-step, up-to-date instructions for recovering after a disaster. You also need clear and up-to-date documentation for your hardware systems and the software running on them. And, once you've completed your disaster recovery plan, make sure paper and electronic copies are available off site. The best-laid plans have no value if you can't find a copy when you need it.

Barry Gerber on Offsite Replication of an Entire System

I live in Los Angeles. Any disaster recovery strategy I develop for my LaLaLand clients has to take into account the possibility of earthquake-related collapsing buildings and fractured WAN infrastructure. For those clients who need to operate without missing a beat and who can afford it, offsite replication of their entire system, including up-to-the-minute replication of data, is the right answer.

The idea is that the minute a production system takes a major hit, the offsite system becomes the production system. Appropriate IT and other staff go to the offsite location and begin doing their thing. While the transition is never going to be totally transparent, with networking switchovers and the loss of last-minute data to deal with, a total offsite strategy can get an organization up and running quickly.

One addendum to this strategy is to actually use the disaster recovery site to conduct the organization's business. Staff at each site performs a portion of the organization's IT and other business tasks. When disaster strikes, required personnel are already at the disaster recovery site and able to keep the organization running until reinforcements arrive.

As you can imagine, this sort of disaster recovery strategy is very, very expensive. It's for banks and other financial institutions, really big hospitals, and other corporate giants who both need this sort of quick recovery capability and can afford to put it in place.

None of my clients has placed their system in one of those bunkers built into a mountain in Colorado that you might have read about or seen in the movies. However, they have implemented less aggressive strategies where a replicated system is set up in a nearby structure and data is kept up-to-date, though not up to the minute, using tape backups. Often the offsite location is in a single-story building, which is less likely to be seriously damaged in an earthquake. They still have to worry about potential damage to and loss of WAN infrastructure, but it's quite okay if these folks can be back up within a day or so, rather than within minutes or hours of a disaster. So this strategy is fine for them.


Offsite Replication of Servers, Workstations, Disk Storage, Backup Hardware, Networks, Related Software, and Data

The major difference between this strategy and offsite replication of an entire system is that you don't replicate your entire production system off site. You replicate just enough of the system to get your organization back up and running in a reasonable time. In this disaster recovery scenario, you replicate hardware, operating systems, and application software as required. However, you don't necessarily replicate data, being content to recover data from backups shortly after a disaster strikes. You also don't necessarily replicate WAN links.

If you need to replicate data or even your entire disk storage system, consider the SAN systems that we discussed earlier in this chapter. Using capabilities built into many SAN systems, often in combination with the Windows Server 2003 cluster service, you can replicate the data on one SAN to another SAN. Such replication is fairly quick and well suited to disaster recovery strategies where data needs to be readily available after a disaster strikes.

Instead of going with a third-party product, you may choose to implement a Windows Server 2003 cluster and an Exchange Server 2007 clustered continuous replication solution. CCR does not require the same expense for storage and third-party replication tools, since the ability to replicate an Exchange database is built into the product.
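
If you go the CCR route, the Exchange Management Shell gives you a quick way to confirm that the passive copy is keeping up, and a controlled way to move the clustered mailbox server between nodes. The server and node names below (EXCMS01 and NODE2) are placeholders; treat this as a minimal sketch of a health check and a planned switchover, not a complete operational procedure:

  Get-StorageGroupCopyStatus -Server EXCMS01 |
      Format-List Name,SummaryCopyStatus,CopyQueueLength,ReplayQueueLength

  Move-ClusteredMailboxServer -Identity EXCMS01 -TargetMachine NODE2 -MoveComment "Planned switchover"

A healthy copy shows a Healthy status and low copy and replay queue lengths; checking those numbers regularly tells you roughly how much data you would stand to lose or replay if you had to activate the passive node in a hurry.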

This disaster recovery strategy works if your organization can stand up to a few days of downtime. You and other IT staff need to be ready to scramble to get things running, but you don't have the staff expense and other costs associated with trying to build a full mirror of your production system.

Onsite Replication of an Entire System

This strategy is the same as offsite replication of an entire system, except your replicated disaster recovery system exists in close physical proximity to your production system. This is a pretty fancy strategy, especially if you also have an offsite replica of your entire system. However, if you need to get up and running quickly after a major system failure, onsite full-system replication might be the only answer.

Windows Server 2003 cluster services can play a major role here and in the next two strategies. Because your system is on site, you can use the very high-speed server-to-server, server-to-storage, and server-to-network links that make clustering such a great server and storage replication solution. Clustering won't solve all of your replication problems, but it takes care of major components in the replication equation.
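
If you're relying on Windows Server 2003 clustering here, the cluster.exe command-line tool gives you a quick read on cluster health that you can fold into routine checks. The cluster name below is a placeholder; a minimal sketch:

  cluster /cluster:EXCLUSTER1 node
  cluster /cluster:EXCLUSTER1 group

The first command lists each node and whether it is Up, Down, or Paused; the second lists the resource groups and the node that currently owns each one, which is exactly what you want to glance at before and after any planned failover.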

Onsite Replication of Servers, Disk Storage, Backup Hardware, Networks, Related Software, and Data

As we are sure you've gathered, this strategy is an onsite version of the second disaster recovery strategy we discussed. It can provide the tools you need to meet the operating requirements of your organization. As we noted in the preceding section, Windows Server 2003 cluster services and clustered continuous replication can make this strategy much easier to implement.

Onsite Presence of Spare Server, Disk Storage, Backup and Network Hardware, Software, and Data

Under this strategy, you have spares at hand, but they're not kept up-to-date by replication. Rather, you activate the spares when a disaster requires it.

Like so much of our discussion of disaster recovery strategies, this one brings to mind our earlier discussions of server recovery in nondisaster situations. We hope, as we come to the end of this relatively brief treatment of disaster recovery strategies, that you begin to synthesize the content of this chapter into a coherent view of the Exchange Server 2007 reliability and availability continuum.

Barry Gerber: The Tao of Disaster Recovery

A detailed discussion of actual disaster recovery operations is beyond the scope of this book. This whole chapter and the specific disaster recovery strategies that we've discussed provide detail and hints as to the how of disaster recovery. Your disaster recovery plan will provide the specific operational steps to be taken when a disaster occurs.

What I really need to talk about here is what might be called the Tao of disaster recovery. Taoism is a way of life that associates every aspect of existence with a kind of overarching spirituality. It mixes the right and left sides of the brain and, in so doing, can bring calm and understanding to even the most stressful experience.

I participated in disaster recovery operations after the September 11, 2001, tragedy in New York City. I wasn't on site and I didn't work for the biggies in the World Trade Center, but I was involved in a number of phone conversations with IT types in two buildings damaged but not destroyed by the airplane crashes. Most of what I talked about involved Exchange server recovery.

I'm a hands-on visual type, so I was especially nervous as I tried to provide help in a voice-only situation. I'm not a Taoist, but I've had enough exposure to the philosophy to know that going bonkers wasn't going to help. So I slapped myself in the face and began breathing in a consciously slow and regular manner before taking the first phone call.

It helped. I was relatively calm until I began talking to a bunch of people who had hours before seen two massive buildings collapse and kill thousands and who were worried about their own personal safety. Understandably, these folks were in a much worse state than I. My first suggestion to them was that they take a few minutes or even a few hours to relax - after, of course, clearing it with their bosses.

My clients agreed to try and called me back in 15 minutes to tell me that they had the go-ahead to wait for an hour. I strongly urged them to do anything but IT work during that hour. Given the mess that portion of New York was in, there wasn't a lot they could do. So they decided to see if they could help others in that hour.

Almost two hours later, my clients called back. It turned out that venturing out to help others made it very clear to them how lucky they were to be alive and still able to do their jobs. In spite of what they'd seen, my clients seemed calm and relaxed about the task ahead of us.

We took the recovery process in steps. After they got their power generator going, we tried to start up their Exchange server, which had been pelted by a major portion of the ceiling and a bunch of heavy chairs from the floor above. Fortunately, they were able to shut off power to the server before its UPS had run out of battery power. Unfortunately, the server did not come back. Not only was their Exchange server dead, so were their two Windows 2000 domain controllers.

Not being major players in trade and finance, these folks didn't have any offsite disaster recovery setup. They also had no real onsite setup. Fortunately, they had backups that were stored both on site and off site. And, they had two standby servers in a closet that more or less survived the disaster. The servers both worked but didn't have current software on them.

So we set up a replacement Windows 2000 domain controller and recovered a backup of Active Directory to it. Then we set up a Windows 2000 server to support the Exchange server. At this point, I suggested we stop for 20 minutes and just talk about what was going on. I actually had a better view of things from Los Angeles by TV than they had in the still-smoky and dusty environment where they were working. This brief respite helped all of us relax, and we were able to recover the Exchange server fairly quickly.

The next day, employees were able to get some work done using internal e-mail. It took more than a week to get some sort of Internet connection running. It wasn't until several weeks later that they had their 1.5Mbps Internet connection back in place.

It took about four hours, including relaxation breaks, to get the job done. If we had pushed ourselves, I estimate it would have taken maybe 10 hours with all the mistakes we'd have made and had to correct. While these folks had a written plan for the recovery, they didn't have an easy-to-use checklist, which would have made things easier. They have one now.

The moral of this story is quite simple: Disasters are stressful. Don't try to recover from one when you're at your most stressed. And you can often make your job easier by involving someone who doesn't have the same emotional and job-related connection to your organization as you do. Don't call me. I'm disaster-recoveried out. However, you should try to get someone else involved in your recovery efforts, whether it's other Windows/Exchange system managers in your area or Microsoft or third-party consultants.




