Disaster recovery is the process of rebuilding your system to a known, working state after it has been compromised or otherwise rendered inoperable. This means being capable of dealing with everything from a failing network connection to a massive catastrophe. Many businesses involved in the World Trade Center disaster showed their resiliency by having critical services back online within hours of the tragedy.
Unfortunately, for many people, especially small organizations or businesses and educational institutions, disaster recovery is nothing more than a catchphrase for something that could cost thousands of dollars to implement and might never be needed. It is difficult to convince these underfunded and understaffed organizations that a failure, seemingly insignificant at the moment, could cost them more in the long run. The lack of a stable and secure Internet presence, for example, could easily cost a Web hosting service its clients, reputation, and business.
To understand disaster recovery, you must understand that a properly developed recovery plan addresses more than simply what to do if your computer is "hacked." Specifically, there must be provisions for the following:
Hardware Failure. We will all die sometime (in the distant, distant future). Computers are not an exception.
Software Failure or Compromise. Hard disk corruption or a hacker compromise places your system and network at risk.
Infrastructure Failure. For Internet/intranet-based services, a computer that cannot communicate is useless.
Catastrophic Events. Be prepared for fires, earthquakes, floods, UFO invasions, and other unforeseen events.
To get a better idea of what each of these provisions covers, let's take a more in-depth look at each of these topics.
Disaster planning and recovery involves the steps you take if a problem occurs. Although not discussed here, it is assumed that you are already familiar with precautionary measures such as running Disk Utility to repair simple disk problems, or with packages such as Norton Utilities and Drive 10.
This chapter's focus is providing information on what can go wrong and how to plan for it.
Hardware failure is the one certainty of all computer systems. If you have a computer, especially one that runs constantly, it will eventually fail. This is an unavoidable fact. Although many computer components (power supplies, for example) can easily be swapped for replacement units, you should be prepared at any given moment to replace any piece of mission-critical hardware you own. If your budget doesn't allow for this, you cannot have an effective recovery plan, and you shouldn't be running mission-critical services.
SCARY, ISN'T IT?
A Web development and hosting service (that shall remain nameless) once asked me to give an analysis of their server environment and offer suggestions for improvement. Upon touring the facility, I found that the three primary servers were running on entirely different hardware platforms, had no backup systems, and were even being used for development (while hosting commercial Web sites). I bluntly suggested creating at least a semblance of redundancy for their commercial users. I was assured that they would.
A few months later, I again toured the server space and saw the exact same setup with no changes. They had decided that the $3,000 to $4,000 necessary to create a redundant Web server and backup system was more than they wanted to spend. Wondering how they could keep customers with this sort of setup, I read the contract for their hosting services, expecting to see exceptional prices. Instead, I found promises of "Nightly and Weekly Site and Database Backups." When I asked how they were providing this service, they just laughed and said they'd deal with it when they had to.
I left and, shortly thereafter, so did their customers.
Although less demanding on the pocketbook than hardware failure, software corruption (be it unintentional or the result of malicious activity) is no less stressful for those who must deal with it. A machine that has been compromised poses a number of difficulties for the administrator.
First and foremost is data security: What has been compromised? The operating system? Applications? Can the existing data be saved? Because a still functional (albeit "cracked") machine is likely to have more recent files and data on it than a backup, the difficult determination of whether to "trust" a compromised machine must be made.
It is very rare that you can consider a compromised computer to be "trustable." Only when the exploit can be traced back to an individual service that does not provide a means of gaining shell access should you consider keeping the current Mac OS X installation without reformatting and reinstalling.
For example, there are quite a few CGI and PHP scripts that, when improperly installed, open themselves to exploit. These exploits often manifest themselves in the form of Web site defacing, but are well logged and contained within the directories with invalid permissions. The Web site would need to be restored, but the attack was verifiably contained to a specific area and wouldn't necessarily require you to reinstall the entire system.
On the other hand, if your Web site became defaced and there are no obvious entry points and no log of the incident, the entire server should be considered "untrustable" and reworked appropriately.
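One part of that assessment, checking for the overly permissive directories and files that such scripts tend to exploit, can be done mechanically. The following Python sketch is only an illustration under stated assumptions: the function name `find_world_writable` is invented for this example, and "invalid permissions" is interpreted narrowly as "writable by any user on the system." It is not a substitute for reading the logs.

```python
import os
import stat


def find_world_writable(root):
    """Return a sorted list of paths under root that any user may write to.

    Hypothetical helper for illustration; "invalid permissions" is assumed
    here to mean the world-writable bit is set.
    """
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode
            except OSError:
                # The entry vanished or is unreadable; skip it.
                continue
            if stat.S_ISLNK(mode):
                # Symbolic links carry no meaningful permissions of their own.
                continue
            if mode & stat.S_IWOTH:
                hits.append(path)
    return sorted(hits)
```

Running this against the Web server's document directory before and after an incident gives a quick list of places worth comparing against the logs. An empty result does not prove that a machine is trustworthy.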
If a cracked machine is deemed "untrustable," as is often the case, how can it be restored to its original state? Backups are great, but do they include all the configuration files and settings necessary to make it "as it was"? Mac OS X is deceiving in that it puts a "Mac-like" face on the system. If you've backed up the /Library/WebServer directory, you have everything necessary to restore a Web server, right? Wrong. There is the hidden /etc/httpd directory with all your Web server settings, there are the individual users' Sites directories, and there are any additional Apache modules that have been installed (/usr/libexec/httpd/).
To successfully restore a computer, you must be able to account for everything that makes the system do its job. Every application, script, directory, crontab entry, and /etc file needed for operation must be documented, understood, and easily replaced.
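One way to keep such an inventory from drifting out of date is to write it down as data and let a script check it. The Python sketch below is a minimal illustration, not a complete backup tool. The `WEB_SERVER_MANIFEST` list is a hypothetical manifest built from the Web server paths mentioned above (per-user Sites directories, crontab entries, and other /etc files would have to be added for a real system), and the function names are invented for this example.

```python
import os
import tarfile

# Hypothetical manifest for the Web server example; extend it with every
# path your own services depend on.
WEB_SERVER_MANIFEST = [
    "/Library/WebServer",   # site content
    "/etc/httpd",           # Web server settings
    "/usr/libexec/httpd",   # additional Apache modules
]


def split_manifest(paths):
    """Return (present, missing) lists for the given manifest paths."""
    present = [p for p in paths if os.path.exists(p)]
    missing = [p for p in paths if not os.path.exists(p)]
    return present, missing


def archive_manifest(paths, archive_path):
    """Archive every manifest path that exists and report the ones that don't."""
    present, missing = split_manifest(paths)
    with tarfile.open(archive_path, "w:gz") as tar:
        for p in present:
            tar.add(p)  # directories are added recursively
    return present, missing
```

A call such as `archive_manifest(WEB_SERVER_MANIFEST, "/Volumes/Backup/web.tar.gz")` (the destination is an assumed example path) would produce a single archive and a list of anything the manifest names that no longer exists. A non-empty list of missing paths is itself a sign that the documentation and the machine have diverged.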
A working computer is of little use if it cannot communicate with other computers, serve your network, or otherwise fulfill its obligations as a netizen. A failure of your computing infrastructure is often worse than failure of a single computer. Network failures, for example, can be difficult to diagnose and are likely to strike multiple computers simultaneously. The good news is that in many cases your ISP handles network setup, monitoring, and maintenance. Of course, your network is only one part of your computing infrastructure; a power failure can have equally disruptive results, and may even cause equipment failure.
SCARY, ISN'T IT? PART II
I enjoy using anecdotes to drive home a point. Not only do they show the book concepts in a real-world light, but they may also help open the eyes of those who haven't dealt directly with these issues (and their causes) before.
Infrastructure failures are usually the result of physical failure in wiring or supporting hardware. Sometimes, however, these failures are caused by people. For example, another "unnamed" small company that performed a variety of Internet services for clients decided that it wanted to rewire the phone system in the building. To save money, the company owners hired their friends to do the job. Everything went well until the electricians decided to strip out the old wiring as a favor. What they didn't realize was that one of the wires they stripped out was the T1 line for the building.
To make matters worse, the line was cut flush with the conduit where it entered the building. The conduit itself was sealed in the building's concrete foundation. A split second with a pair of wire snips effectively shut the business down for four or five days, until the provider could completely restring the wiring.
Your computers are important, but without the necessary infrastructure, they're useless.
Your hardware, software, and infrastructure can fail and, with some planning, be restored. But what if they fail simultaneously and irreversibly? This is a catastrophic failure.
Catastrophic failures can be the result of natural disasters, vandalism, or other uncontrollable events. In the case of catastrophic failure, there is no resurrection of your existing equipment. It is effectively gone, and you will eventually have to face the choice of shutting down your operations or picking up the pieces and moving on. If you have customers depending on your systems, they will likely take their business elsewhere unless given a feasible timeline within which you will redeploy their services. Administrators with a proper disaster recovery plan can often shift their entire operation to backup facilities in a matter of hours.
So, now that you're convinced that you never want to turn on a computer again, let alone deal with the responsibility of keeping it online, let's take a look at how you can develop a plan to make disaster recovery manageable and quite possible.