The simple definition of the availability of a system is the amount of time a system is up in a given year. In the past few years, the availability of a company's mission-critical systems has been a major focus point. Availability needs to be given the proper attention, in addition to performance and security, which have always been paramount in the mind of anyone who interacts with the systems. End users and management are concerned when the system is down; they do not usually spend time worrying about availability or allocating additional money for availability when the system is up and running. Availability is not just a SQL Server issue, either: it is applicable to every aspect of a computing environment.
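The definition above can be turned into simple arithmetic. The following is a minimal sketch, assuming availability is expressed as the percentage of a non-leap year during which the system is up; the function names and the 8,760-hour year are illustrative assumptions, not something prescribed by this chapter.

```python
# Assumption: one year is 365 days (8,760 hours); leap years are ignored.
HOURS_PER_YEAR = 365 * 24


def availability_percent(downtime_hours: float,
                         period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the percentage of the period during which the system was up."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return 100.0 * (period_hours - downtime_hours) / period_hours


def allowed_downtime_hours(target_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return how many hours of downtime a target availability permits."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must be between 0 and 100")
    return period_hours * (1 - target_percent / 100.0)
```

Under these assumptions, a target of 99.9 percent leaves a little under nine hours of downtime for the whole year, which is a useful number to have in hand when availability budgets are discussed.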
There are two main aspects to consider when thinking about high availability: prevention and disaster recovery, both of which have many more facets than just technology. Prevention is anything (including people, processes, and technology) that, when put in place, is designed to reduce the risk of a catastrophic occurrence. The harsh reality is that, despite planning, failures can occur. A highly available system is one that can potentially mask the effects of the failure and maintain availability to such a degree that end users are not affected by the failure.
Disaster recovery is exactly what it claims to be: a catastrophic event happened, and it must be dealt with. Achieving high availability must take into account both pieces of the proverbial puzzle to make the complete picture. High availability is not just the sigh of relief when the system comes back up. If you are a fighter pilot, would you rather be a daredevil and have fast planes with no safety precautions or fast planes with ejection seats? During World War I, it was thought of as a sign of weakness to carry a parachute, but obviously this attitude has changed. Taking precautions to save systems, or human lives, should never be viewed as extraneous.
The basic tenets of prevention are deceptively simple:
Deploy redundant systems in which one or more servers act as standbys for the primary servers.
Reduce and try to eliminate single points of failure.
Both tenets provide a solution to only one aspect of the technology part of the equation: the server itself. What about the network connecting to the server? What about an application's ability to handle the switch to a secondary server during an outage? These are just two examples of possible problem areas; there will be more for your environment that you need to prepare for. The bottom line, though, is that planning, implementation, and administration of systems are equally important aspects of preventing an availability problem.
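To make the question about an application's ability to handle a switch to a secondary server concrete, here is a minimal sketch of client-side failover. It assumes the application holds an ordered list of server names and a connection function that raises ConnectionError on failure; `connect_with_failover`, `demo_connect`, and the server names are hypothetical and do not represent any SQL Server or driver API.

```python
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def connect_with_failover(servers: Sequence[str],
                          connect: Callable[[str], T],
                          attempts_per_server: int = 2) -> Tuple[str, T]:
    """Try each server in order and return (server, connection) on success.

    `connect` is a caller-supplied function (an assumption of this sketch)
    that returns a connection or raises ConnectionError.
    """
    last_error: Optional[ConnectionError] = None
    for server in servers:
        for _ in range(attempts_per_server):
            try:
                return server, connect(server)
            except ConnectionError as exc:
                # Remember the failure, then retry or move to the next server.
                last_error = exc
    raise ConnectionError(f"no server reachable: {list(servers)}") from last_error


def demo_connect(server: str) -> str:
    """Hypothetical stand-in for a real driver call: the primary is 'down'."""
    if server == "primary":
        raise ConnectionError("primary is unavailable")
    return f"connection to {server}"
```

Calling `connect_with_failover(["primary", "standby"], demo_connect)` in this sketch falls through to the standby after the primary fails twice. A real application would also need to consider timeouts, in-flight transactions, and how quickly the standby becomes usable, none of which this sketch addresses.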
The more redundancy you have, the better off you are. However, there is a conundrum. No system is completely infallible in and of itself, but adding too many levels of redundancy might make an availability solution so complex that it becomes unmanageable.
The designers of the Titanic thought she was unsinkable, yet the ship went to the bottom of the ocean on her maiden voyage. For protection, the Titanic had 16 watertight compartments that could be sealed with 12 watertight doors. In theory, this would have kept the ship afloat despite flooding in three or four compartments. Unfortunately, the watertight compartments themselves extended only to a certain level above the water line. When the iceberg struck the Titanic, the damage it caused was, in reality, only six noncontiguous small gashes totaling about 12 square feet. But they made it possible for water to overflow from one compartment into another (think of the way an ice cube tray fills with water) and eventually caused the ship to sink.
What turned out to be a fatal flaw was addressed to the best of the designers' ability at the time, but the unknown always lurks in the shadows. You must do your best when you are seeking to prevent a catastrophe.
Technology is only one part of the equation. Environmental and human aspects also contributed to the Titanic's fate. She was caught in a set of circumstances (an ice field that affected most ships in the North Atlantic at that exact time, a lookout who saw the iceberg too late, no other ships close enough to assist after the Titanic struck the iceberg, and so on) for which no amount of technology could compensate.
Similarly, you might have two exact copies of your production database, but what happens if they are located in two data centers in the same state and are affected by an earthquake simultaneously? Do not overthink your availability solutions; there is only so much you can plan for.
When the iceberg struck, the Titanic was unprepared. There had been no safety drills to test an emergency plan (if one existed), not enough lifeboats for all passengers, and so on. Survivor accounts of the chaos during the last few hours reinforce the importance of proper planning in ensuring that more good than harm occurs during a crisis. Consider that each database, or even the entire system, might need to be rescued without having to be salvaged from the bottom of the ocean. Rash decisions about the solution could have dire consequences. The only way to execute a rescue or salvage operation is to have the right people in place with a complete, well-tested plan to direct them.
When the crisis has passed, a postmortem should be done to determine the lessons learned and to prevent such an outage from occurring in the future. Chapter 12, "Disaster Recovery Techniques for Microsoft SQL Server," includes an in-depth discussion on disaster recovery and the plans needed.
Keep in mind that when it comes to disaster recovery, there can be another extreme. You can be proactive to the point of being negative. The person who is constantly worried about everything might be considered obsessed. There is a happy balance somewhere. If the crew of the Titanic had made each passenger walk around with a life preserver, a ration kit, and some sunblock, would that have been practical? No. Being prepared does not equate to paranoia. Assume that there is always something that could happen to your systems, anticipate and plan to the best of your ability, but go about your day knowing that you cannot control every factor.
If something happens (and something will happen at some point during your career), ensure that you will not go down with the ship. Whether you are the captain or just one of the passengers or crew, a disaster situation might make or break your career with your employer. Not being able to answer a question such as the one posed in the beginning of the chapter, "When will it be up?", because, for example, the plan was never tested or the backups themselves were never tested and timed, is generally considered unacceptable. Keep in mind that fallout might still occur when the dust settles, but as long as you were prepared and handled the situation properly, all should be fine.