Take a mental poll of your current workplace or companies you have worked for in the past: how many applications and solutions were designed with availability in mind from inception? The reality is that in most (though not all) development, test, and production environments, availability is an afterthought. In the minds of some application developers and management, availability is solely Information Technology's (IT's) concern, meaning it is not a design issue, just an end-of-line implementation issue, right? Wrong! Availability is not just a technology problem: it encompasses people, process, and technology, as well as end-to-end designs of applications, infrastructure, and systems. Chapter 2 covers the basics of high availability in more detail, from basic infrastructure to change control.
There are two approaches for assessing your environment for availability:
The application or solution is already in place. Availability needs to be retrofitted to the environment. In the process of adding availability for the first time, or enhancing existing availability methods, new hardware might need to be purchased, new maintenance and other processes might be added to the administrators' daily tasks, the application itself might need to be patched or redesigned, and more. Some of this might be the result of outgrowing current hardware capacity or scalability, causing a perceived unavailability problem.
The application or solution has not yet been implemented. This more than likely means starting from the beginning, with a clean slate. In the evaluation of a completely new solution, planning is every bit as important as the final step of implementation, if not more so. Planning ensures fewer problems over the solution's entire life cycle. It is much easier to get something right from the start than it is to take an existing solution and redesign it to be something else. Remember that scope creep is a problem for any planning, design, and implementation process, not just availability ones!
As noted earlier, Chapter 2 covers the basics of high availability in more detail, but it is very important to remember that these two approaches are not the same. Keep this in mind as you go through each chapter in this book, as well as when you start to plan, design, and implement highly available solutions in your environment.
In assessing availability, it is important to keep in the forefront of everyone's mind that applications directly impact the availability of systems. One could even argue that in spite of massive amounts of other software and hardware redundancy, in the end the application might prove to be the weak point in the availability chain if it is not designed to handle such things as a server name change, should one be required during disaster recovery. The application drives the requirements for how each production server, its related hardware, and third-party software, such as database platforms and operating systems, will be selected and assembled; the infrastructure should not dictate the application. Designing the perfect infrastructure does no good if the one component accessing it cannot properly utilize the new multimillion-dollar infrastructure. This is also the reason that retrofitting high availability into an already existing environment is much more challenging than starting from scratch. Sometimes, it can be akin to putting a square peg into a round hole. A bad application means low availability.
No one likes the cost aspect of assessing availability and, to a large degree, it is the toughest aspect to handle. The question is not so much the cost of availability, but the cost of not having availability and, subsequently, the actual cost of downtime. Money is always a factor, as achieving availability is not cheap. Realistically speaking, achieving five nines of availability on a limited budget is just about impossible.
Consider the following example: A high-volume e-commerce Web site generates, on average, $10,000 in sales per hour. For a day, that equates to $240,000, and weekly, $1,680,000. On the busiest day, the Web site encountered an unexpected availability problem around noon. When all was said and done, the Web site was down for six hours, which is the equivalent of $60,000 in sales. That was the surface cost of downtime.
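The arithmetic in this example can be reproduced with a short Python sketch. It is illustrative only: the function names are hypothetical, sales are assumed to be spread evenly across every hour, and a year is assumed to be 365 days. Only the $10,000 hourly rate and the six-hour outage come from the example itself.

```python
# Average sales per hour in dollars, taken from the example.
HOURLY_SALES = 10_000


def downtime_cost(hours_down, hourly_sales=HOURLY_SALES):
    """Surface cost of an outage: lost sales only.

    Assumes sales are spread evenly across every hour, which is a
    simplification (the example's outage hit the busiest day).
    """
    return hours_down * hourly_sales


def allowed_downtime_hours(availability_pct, hours_per_year=365 * 24):
    """Hours of downtime per year permitted by an availability target.

    Assumes a 365-day year and counts every hour as in-service time.
    """
    return (1 - availability_pct / 100) * hours_per_year


daily_sales = downtime_cost(24)        # one full day of sales
weekly_sales = downtime_cost(24 * 7)   # one full week of sales
outage_cost = downtime_cost(6)         # the six-hour outage

print(f"Daily sales:  ${daily_sales:,}")
print(f"Weekly sales: ${weekly_sales:,}")
print(f"Outage cost:  ${outage_cost:,}")

# Downtime budget, and what it would cost in lost sales, at common targets.
for target in (99.0, 99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(target)
    print(f"{target}% allows {hours * 60:,.1f} minutes per year, "
          f"about ${downtime_cost(hours):,.0f} in lost sales")
```

Under these assumptions, the six-hour outage alone exceeds the downtime that a five nines target permits in an entire year by a wide margin. The figures cover lost sales only; they do not capture the less visible costs of an outage.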
Analyzing the downtime further, when the outage happened, no plan was in place, so an hour of that time was spent gathering the team together. This also meant that whoever was called in to work on the problem had to stop working on whatever they were doing. Because no plan was in place, there was no clear chain of command, resulting in many people attempting the same thing, further tying up system resources. This impacted other schedules and mission-critical timelines, a cost that must also be measured.
When the company went to call support, they realized that they had never renewed their Microsoft Premier Support agreement because they had not really used support services in the past, so they felt it was not needed. Instead of getting the quick-response engineers, they were left waiting in the queue with other pay-per-incident customers. Also hindering a quick time to resolution, certain information was either not known about the systems or could not be gathered because of stringent processes. It took a committee a half hour to decide to allow the information to be gathered. The technical issue was eventually solved, but due to the delays caused by the lack of a support contract and the information problem, it took longer than it should have. These were tangible costs and direct problems.
Perhaps even more important than the tangibles are the intangibles: how many existing customers went to the Web site to buy something, could not access it, and did not come back because they knew they could get the item from another e-tailer that was up and running? Those users might be enticed by discount coupons to return, but perhaps not. An even bigger intangible was the loss of prospective customers who visited the site during the downtime, saw it was down, and never visited again. That might factor into the $10,000 hourly rate, but can you put a price on future business? Customer loyalty is directly associated with a brand name. A bad experience generates negative word of mouth. The more people who spread the message that the site is unreliable, the greater the potential for losing current and future business.
For this example, there are a few conclusions that should be highlighted as part of a postmortem to prevent the problem from happening again:
A proper disaster recovery plan must be devised and implemented to avoid additional unnecessary downtime.
A clear chain of command must be established to reduce confusion and duplicate efforts.
Always have a support contract with appropriate vendors that provide the level of support and response needed.
Support personnel need certain information sooner rather than later.
Make sure customers are welcomed back with open arms after a significant outage.
Chapter 12 walks you through putting together a complete disaster recovery plan. Because the downtime cost at least $60,000 in this case (again, this is the only concrete number), would a $10,000 support contract have paid for itself? Most likely it would have. If the technical problem was found to be an issue of configuration, was it completely thought out during the planning process, or was the system rushed into production? Correcting these kinds of issues can save money and time to resolution.
The preceding example shows that the lack of a proper disaster recovery plan, the absence of a support contract, and a possibly missed configuration point are potential barriers to availability, or roadblocks to achieving the level of availability required. These barriers include, but are not limited to, the following:
People: Everything from improper staffing to too many chiefs milling around and giving orders during downtime.
Process: Is there a plan? Are there normal company standards and processes that will impede the availability and possibly the disaster recovery solutions?
Budget: Was enough money invested in all aspects of the solution (people, process, design, software, hardware, support, and so on)? Will cutting costs or restraining budgets cause more downtime in the end? Obviously, not every solution is given carte blanche in terms of a budget, but an availability solution must fit both the budget and the availability requirements. The two cannot be in direct conflict; otherwise, there will not be a successful implementation.
Time: Are the goals set for coming back online unreasonable? Are maintenance windows so short that maintenance is not being done, which could cause availability problems down the road?
During the planning phase, you must take into account barriers to availability and mitigate any risks associated with them.