24.1 Why Are We Interested in High Availability?


Let's get to the heart of the matter right away. How much will it cost not to adopt a "High Availability" IT infrastructure? In 1996, Dataquest Perspective performed a study of the cost of downtime. For businesses such as financial brokerages, the cost of a single hour of downtime is approximately $6.45 million. Yep, you heard it correctly, and those figures are from 1996. Depending on the business sector involved, you could be losing millions of dollars every hour. See Figure 24-1 for an idea of the cost per hour of downtime for key industry sectors.

Figure 24-1. The cost of downtime.
graphics/24fig01.jpg

It might be a good idea to talk to the financial controller of your organization and put these figures to him; he might suddenly be very interested in High Availability. Thinking of a typical "Business Continuity Support" contract with Hewlett Packard, you are still looking at a 4-hour call-to-fix turnaround time. That means HP has committed to fix a hardware problem within 4 hours of your reporting it (obviously, speak to your local HP representative for the details of your own support terms). That is still many millions of dollars in lost revenue for a single hardware problem. What causes this downtime? Figure 24-2 shows the causes of downtime.
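Before going further, it is worth translating availability percentages into concrete downtime hours and lost revenue. The short sketch below is a hypothetical illustration: the $6.45 million per hour figure is the 1996 Dataquest number quoted above, while the availability levels themselves are assumed for the example.

```python
# Hypothetical illustration: annual downtime and cost at assumed availability levels.
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year

def downtime_cost(availability, cost_per_hour):
    """Return (downtime hours per year, cost per year) for an availability fraction."""
    downtime_hours = (1.0 - availability) * HOURS_PER_YEAR
    return downtime_hours, downtime_hours * cost_per_hour

# $6.45M/hour is the 1996 Dataquest figure for financial brokerages cited above.
for availability in (0.99, 0.999, 0.9999):
    hours, cost = downtime_cost(availability, 6.45e6)
    print(f"{availability:.2%} uptime -> {hours:7.2f} h/year down, ${cost / 1e6:,.1f}M/year")
```

Even at 99.99 percent uptime, roughly 53 minutes of downtime per year, the assumed hourly cost still translates into millions of dollars, which is why "acceptable" uptime has to be negotiated rather than assumed.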

When we discuss High Availability, we are talking about maximizing uptime. This is quite different from 100 percent uptime. To achieve 100 percent uptime, you would be looking at Fault Tolerant systems. Hewlett Packard currently does not offer fault tolerant systems for HP-UX. HP does offer a range of fault-tolerant servers known as NonStop servers, which run their own NonStop Kernel operating system; see http://www.hp.com/go/nonstop for more details. The cost of a truly end-to-end Fault Tolerant solution makes it prohibitive for most customers. We are looking at achieving a compromise situation where we need to sit down with our customers and plan what is an "acceptable" level of uptime. Built in to that is planned downtime, possibly detailed in a formal Service Level Agreement (SLA). Planned downtime is time when we know in advance that we will have to perform some level of routine maintenance. This may be due to failings in the software application not allowing online backups. It may be that we foresee some operating system changes/upgrades that will require the system to be rebooted. If we can plan these in advance, then everyone is "happy." Table 24-1 highlights some reasons for planned and unplanned downtime:

Table 24-1. Reasons for Downtime

Reasons for Planned Downtime:

  - Reconfigure the kernel
  - Apply software patches
  - Upgrade the operating system
  - Upgrade hardware
  - Perform full system backups
  - Perform database maintenance

Reasons for Unplanned Downtime:

  - Hardware failures which are Single Points of Failure
  - Operating system failures (system crash)
  - Application failures
  - Loss of power
  - Loss of data center (natural disasters)
  - Human error


Figure 24-2. Causes of downtime.

Source: Gartner Group, October 1999

graphics/24fig02.jpg

High Availability as a design principle needs to be considered at every level of the organization. This includes everything from training your operators, to ensure that human error plays as small a part as possible, to ensuring that your twin Fibre Channel links to your remote site do not get installed in the same underground conduit. High Availability becomes a "mantra" for good working practices as well as simply a solution to hardware failures. Can we do this alone? The simple answer is no. There are too many factors that are not directly under our control. Take, for example, "application failures" listed in the table above. What can we do to rectify this situation? Nothing, aside from choosing organizations that we feel are "partners" in our mantra of "High Availability." Hewlett Packard, along with other hardware and software vendors, is beginning to work together to offer us solutions that encompass this whole philosophy; companies like Oracle, SAP, Cisco Systems, and BEA, to name but a few, go toward achieving two of the three pillars of High Availability:

  1. Technology Infrastructure: We eliminate SPOFs either by making them fault tolerant or by adding redundancy to our design. Table 24-2 details some common SPOFs and how we can deal with them.

    Table 24-2. Common Single Points of Failure (each entry lists the SPOF, followed by how to provide redundancy)

    SPU Failure: Provide another SPU that is capable of running the application. You will need to consider performance impacts if the redundant SPU is used for other activities.

    Disk Failure (both operating system and data disks): Implement some form of data/disk mirroring. If mirroring is performed on the same site, you will have to consider the impact of losing the entire data center.

    Network Failure: Provide multiple network paths between clients and the applications. This will require that both clients and servers support dynamic routing protocols such as RIP and OSPF.

    Interface Card Failure: Most interface cards can be "backed up" by a redundant interface card, be it for connectivity to a network or to a device such as a disk drive (usually called Alternate Link, Auto Path, or Multi-Path). Another aspect of an interface card failure is the ability to replace the failed card without shutting down the operating system (see the OLA/R discussion in Chapter 4, "Advanced Peripherals Configuration").

    Operating System Crash: This can be thought of in the same vein as an SPU failure; however, there are other considerations here as well. We have started to see the emergence of Dynamically Loadable Kernel Modules (DLKM) in HP-UX 11i. This will alleviate one of the major issues with providing a Highly Available operating system: the need to reboot the OS when kernel-level patches are installed. With DLKM, we will start to see that necessity diminish. We will be able to unload, patch, and reload a kernel module without a reboot. (All devices using the driver will be temporarily unavailable, but the timescales involved can be cut to seconds as opposed to minutes for a reboot.)

    Application Failure: Some applications can be run concurrently on more than one SPU. This is highly application-dependent, but it does allow for redundancy to be built in to our design in that if a program fails on one node, users will still get a "response" from the other nodes still running. This does not take into account the consistency of data updates, which, again, is highly application-dependent when running in concurrent mode.

    Loss of Power: Utilize more than one source of power, e.g., diesel generators. If you are using more than one main power source, are the additional power sources supplied by the same provider? If so, isn't your power provider now an SPOF?

    Loss of Data Center: Natural disasters cannot be predicted, but we can protect against the interruption to service. We can employ sophisticated disk technologies (HP XP disk array with Continuous Access, EMC Symmetrix with SRDF) to replicate data to a remote site. Some applications can supply online data replication to a remote site at the application level if your disk technology does not support such a solution. We will require a full complement of ancillary technologies to continue to offer connectivity to our customers: redundant SPU(s), redundant LAN/telecoms, and redundancy/training in staffing to support this. This type of outage is commonly considered under the context of a Disaster Recovery Plan.

    Human Error: We will have to consider humans as an SPOF because human error is seen to be the cause of 40 percent of outages. "Redundancy" in this context must include training, product support, and managed IT processes to ensure that we minimize the likelihood of human error causing an outage.


    A fundamental part of our technology infrastructure will also include intelligent diagnostics that can monitor our key components and resources. This, allied with comprehensive, easy-to-use management tools, can monitor, diagnose, and alert us to any current or impending problems.
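The value of removing an SPOF through redundancy can be quantified with basic probability. As a sketch (the availability figures below are assumptions for illustration, not numbers from this chapter): components that are all required multiply their availabilities together, while redundant components fail only if every one of them fails at once.

```python
def series(*availabilities):
    """Availability of components that are all required; each one is an SPOF."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities):
    """Availability when any one of the redundant components is enough."""
    all_fail = 1.0
    for a in availabilities:
        all_fail *= (1.0 - a)
    return 1.0 - all_fail

# Assumed availabilities, for illustration only.
spu, disk = 0.999, 0.995
print(f"SPU + single disk in series: {series(spu, disk):.6f}")
print(f"SPU + mirrored disk pair   : {series(spu, parallel(disk, disk)):.6f}")
```

The mirrored pair makes the disk subsystem far more available than either disk alone, so the overall figure is dominated by the remaining SPOF (the SPU), which is exactly why the table above attacks every single point of failure rather than just one.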

  2. Support Partnerships: We cannot build our High Availability solution on our own. We need to select competent and experienced software and hardware vendors that are willing to work with us and offer us the support that allows us to offer our customers the service levels they demand. We are not thinking only of support when we experience a problem; we are also thinking of support to plan, install, set up, and manage our solution. They will probably charge us for the privilege, but at least we know that the option is available.

  3. IT Processes: The capabilities of our High Availability design will be seriously compromised if everyone involved is not fully committed to the process. Being committed is not simply "blind faith" but an involved and comprehensive understanding of the requirements placed on departments, teams, and individuals within an organization. This is achieved by close cooperation of everyone involved, backed up by detailed and ongoing training. Without this commitment to a "quality" solution, it will ultimately fail to achieve its true potential; this can be backed up by the Gartner Group study highlighting that 40 percent of outages are caused by human error. It may be impossible to completely remove human error from the equation, but remember that it is unlikely we will achieve 100 percent uptime; we are aiming for maximum uptime.
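The "intelligent diagnostics" mentioned under the technology-infrastructure pillar can be pictured as a simple polling loop: run a set of named health checks and alert on any failure. This is a hypothetical sketch only; the check names and thresholds are invented, and a real HP-UX environment would rely on dedicated monitoring frameworks rather than a script like this.

```python
# Hypothetical sketch of a resource monitor with alerting; not a real HP tool.
import shutil

def root_fs_has_space(min_free_fraction=0.10):
    """Check that the root filesystem has at least 10% free space (assumed threshold)."""
    usage = shutil.disk_usage("/")
    return usage.free / usage.total >= min_free_fraction

def run_checks(checks):
    """Run each named check callable; return the names of the checks that failed."""
    return [name for name, check in checks.items() if not check()]

failed = run_checks({"root filesystem free space": root_fs_has_space})
for name in failed:
    # In production this would page an operator or raise an SNMP trap.
    print(f"ALERT: {name} check failed")
```

The point of the sketch is the shape, not the checks: monitoring is only useful for High Availability when a failed check reliably reaches a human or a failover mechanism before the users notice.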



HP-UX CSE Official Study Guide and Desk Reference