Perception is reality, whether we like it or not. Because of this, it is important to make a distinction between the types of unavailability:
Anything that makes the system and
its resulting solution unavailable to the people who are using it.
There are varying degrees of total unavailability (complete system
failure, site down, and data gone, all of which are described in
more detail in Chapter 13, along with ways to mitigate them), and
this is more often than not the catastrophic or
When a system appears to be
functioning properly from an administrative standpoint, but is
unavailable to end users or customers. This can happen if only one
aspect of the solution is down, for example, if the database
administrators (DBAs) know SQL Server is up and running, but the
front-end Web servers fielding the initial
Each person in the availability equation sees a system or the entire solution differently. To mitigate issues caused by varied perceptions, service level agreements (SLAs) must be negotiated with all parties involved in the solution, including hosting companies, end users, contractors, administrators, and so on, to ensure that uptime and performance needs are met. Each SLA might state a different availability goal, but the overall availability is based on every component, not just one. Chapter 2, "The Basics of Achieving High Availability," delves further into SLAs.
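The point that overall availability depends on every component can be illustrated with a little arithmetic. The following is a minimal sketch, not a method taken from this book: it assumes the components form a serial chain (the solution is up only when all of them are up) and that they fail independently, in which case the overall availability is the product of the individual availabilities. The component names and percentages are hypothetical.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365  # non-leap year, for rough estimates only


def composite_availability(component_availabilities):
    """Overall availability of components that must ALL be up.

    Assumes a serial dependency chain and independent failures,
    so the individual availabilities multiply.
    """
    values = list(component_availabilities)
    if not values:
        raise ValueError("at least one component is required")
    for value in values:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"availability must be between 0 and 1, got {value}")
    return prod(values)


def annual_downtime_hours(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical figures, for illustration only.
components = {
    "web_servers": 0.999,
    "sql_server": 0.9995,
    "network": 0.9999,
}

overall = composite_availability(components.values())
print(f"Overall availability: {overall:.4%}")
print(f"Expected downtime: {annual_downtime_hours(overall):.1f} hours per year")
```

With these illustrative numbers, the overall figure comes out at about 99.84 percent, which is lower than the availability of any single component. This is why an SLA that covers only one tier says little about what end users actually experience.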
Take a mental poll of your current workplace or companies you have worked for in the past: how many applications and solutions were designed with availability in mind from inception? The reality is that in most (not all) development, test, and production environments, availability is an afterthought. In the minds of some application developers and management, availability is solely Information Technology's (IT's) concern, meaning it is not a design issue, just an end-of-line implementation issue, right? Wrong! Availability is not just a technology problem: it encompasses people, process, and technology, as well as end-to-end designs of applications, infrastructure, and systems. Chapter 2 covers the basics of high availability in more detail, from basic infrastructure to change control.
There are two approaches for assessing your environment for availability:
The application or solution is already in place.
Availability needs to be retrofitted to the environment. In the
evaluation process of adding availability for the first time, or
enhancing existing availability
The application or solution has not yet been implemented. This more than likely means starting from the beginning, with a clean slate. In the evaluation of a completely new solution, planning is every bit as important as, if not more important than, the final step of implementation. Planning ensures fewer problems over the solution's entire life cycle. It is much easier to get something right from the start than it is to take an existing solution and redesign it to be something else. Remember that scope creep is a problem for any planning, design, and implementation process, not just availability ones!
As noted earlier, Chapter 2 covers the basics of high availability in more detail, but it is very important to remember that these two approaches are not the same. Keep the differences in mind as you go through each chapter in this book, as well as when you start to plan, design, and implement highly available solutions in your environment.
In assessing availability, it is important to keep in the
forefront of everyone's mind that applications
directly impact the availability of systems. One could even argue
that in spite of massive amounts of other software and hardware
redundancy, in the end the application might
No one likes the cost aspect of assessing availability and, to
a large degree, it is the toughest aspect to handle. The question
is not so much the cost of availability, but the cost of not having
availability and, subsequently, the actual cost of downtime. Money
is always a factor, as achieving availability is not cheap.
Consider the following example: A high-volume e-commerce Web
site generates, on average, $10,000 in sales per
Analyzing the downtime further, when the
When the company went to call support, they realized that they
never renewed their Microsoft Premier Support agreement because
they did not really use support services in the past, so they felt
it was not needed. Instead of getting the quick response
Perhaps even more important than the tangibles are the
intangibles: how many existing customers went to the Web site to
buy something, could not access it, and did not come back because
they knew they could get the item from another e-tailer that was up
and running? Those users might be enticed by discount
For this example, there are a few conclusions that should be highlighted as part of a postmortem to prevent the problem from happening again:
A proper disaster recovery plan must be devised and implemented to avoid additional unnecessary downtime.
A clear chain of command must be established to reduce confusion and duplicate efforts.
Always have a support contract with appropriate
Support personnel need certain information sooner rather than later.
Make sure customers are welcomed back with
Chapter 12 walks you through putting together a complete disaster recovery plan. Because the downtime cost at least $60,000 in this case (again, this is the only concrete number), would a $10,000 support contract have paid for itself? Most likely it would have. If the technical problem was found to be an issue of configuration, was it completely thought out during the planning process, or was the system rushed into production? Correcting these kinds of issues can save money and shorten the time to resolution.
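The support contract question can be worked through with simple arithmetic. The sketch below is illustrative rather than a formula from this book. It uses the two figures quoted in the example ($10,000 in average sales per period and a $10,000 support contract) and treats the outage as six periods long, which is inferred only by dividing the $60,000 loss by the sales rate. The unit of time and the fraction of downtime a contract would have avoided are assumptions, and the function names are hypothetical.

```python
def downtime_cost(sales_per_period, periods_down, other_costs=0.0):
    """Tangible cost of an outage: lost sales plus any other known costs."""
    if sales_per_period < 0 or periods_down < 0 or other_costs < 0:
        raise ValueError("inputs must be non-negative")
    return sales_per_period * periods_down + other_costs


def break_even_periods(contract_cost, sales_per_period):
    """Periods of downtime a contract must avoid in order to pay for itself."""
    if sales_per_period <= 0:
        raise ValueError("sales_per_period must be positive")
    return contract_cost / sales_per_period


def net_benefit(contract_cost, cost_without_contract, cost_with_contract):
    """Money saved by having the contract, after paying for it."""
    return (cost_without_contract - cost_with_contract) - contract_cost


SALES_PER_PERIOD = 10_000  # average sales figure from the example
CONTRACT_COST = 10_000     # support contract figure from the example
PERIODS_DOWN = 6           # inferred: 60,000 / 10,000; unit of time assumed

cost_without = downtime_cost(SALES_PER_PERIOD, PERIODS_DOWN)

# ASSUMPTION: faster support response would have halved the outage.
cost_with = downtime_cost(SALES_PER_PERIOD, PERIODS_DOWN / 2)

print(f"Tangible cost of the outage: ${cost_without:,.0f}")
print(f"Break-even: {break_even_periods(CONTRACT_COST, SALES_PER_PERIOD):.1f} period(s) avoided")
print(f"Net benefit of the contract: ${net_benefit(CONTRACT_COST, cost_without, cost_with):,.0f}")
```

Under these assumptions, the contract pays for itself as soon as it shortens an outage by a single period. Intangible costs, such as customers who never return, are not modeled here and would only strengthen the case.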
Everything from improper staffing to too many
Process: Is there a plan? Are there normal company standards and processes that will impede the availability and possibly disaster recovery solutions?
Was enough money invested in all aspects of the
solution (people, process, design, software, hardware, support, and
so on)? Will cutting costs or restraining
Are the goals set for coming back online
unreasonable? Are maintenance
During the planning phase, you must take into account barriers to availability and mitigate any risks associated with them.