Costs of Availability, Performance, and Scalability


Availability, performance, and scalability come at a cost. Although you'd probably prefer to design and manage a WebSphere-based application platform that can scale on demand, deliver consistent subsecond transactions, and provide 99.999-percent availability, the budget and justification associated with those types of requirements are restricted to the likes of NASA!

Table 2-5 shows the average cost per hour of downtime for several industries. The data comes from a Dataquest report published in September 1996. Of course, the average cost per hour of downtime is more than likely higher in 2003 than it was in 1996; nevertheless, Table 2-5 provides a useful indication of these costs.

Table 2-5: Costs of Downtime

Industry       | Business Operation              | Average Downtime Cost per Hour (in U.S. Dollars)
Financial      | Brokerage operations            | $6.45 million
Financial      | Credit card/sales authorization | $2.6 million
Media          | Pay-per-view TV                 | $150,000
Retail         | Home shopping (TV)              | $113,000
Retail         | Home catalog sales              | $90,000
Transportation | Airline reservations            | $89,500
Media          | Telephone ticket sales          | $69,000
Transportation | Package shipping                | $28,000
Financial      | ATM fees                        | $14,500

If, for example, you're operating a brokerage-based WebSphere implementation and you need to ensure greater than 99.95-percent availability, your business justification is straightforward: without a highly available WebSphere-based system, downtime costs approximately $6.45 million per hour. In most cases (and obviously it depends on your sizing requirements), $6.45 million will buy you a number of high-performance, high-end servers from one of the leading server manufacturers!

In other words, if the downtime costs are that large, the business case for spending money proactively on a highly available system is easily justified. You can also graph the cost of availability against the availability percentage itself: as availability approaches 100 percent, the cost of providing that guarantee climbs almost exponentially.

For the most part, providing availability guarantees of 99 percent is fairly straightforward; most single-server implementations will support this level of availability as measured over a period of a year. Measuring the availability of a server or WebSphere implementation comes down to measuring the WebSphere availability index (see the "Availability: What Is It?" section) for each component within your environment.

These components include the following:

  • Disks

  • Memory

  • Network controllers (interface cards)

  • Motherboards and main boards

  • Other controllers, such as Input/Output (I/O), Small Computer System Interface (SCSI), and so on

  • Power supplies (internal)

  • Environmental systems (air conditioning, main power)

  • System cooling

  • CPUs

  • Software

The overall value of the WebSphere availability index is the mean of the availability figures for all of those components. Typically, a single system will only guarantee between 99-percent and 99.9-percent availability; to exceed that range, you need to go with multiple servers (I'll discuss that in Chapter 5).
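To make that averaging concrete, here's a minimal sketch (my own illustration; the class name and the per-component figures are hypothetical, not from the book) that computes the index as the simple mean of per-component availability percentages:

// Minimal sketch: the availability index as the simple mean of per-component
// availability percentages. All figures below are hypothetical examples.
public class AvailabilityIndex {

    static double availabilityIndex(double[] componentAvailabilityPercentages) {
        double sum = 0.0;
        for (double availability : componentAvailabilityPercentages) {
            sum += availability;
        }
        return sum / componentAvailabilityPercentages.length;
    }

    public static void main(String[] args) {
        // Disks, memory, network cards, motherboard, other controllers,
        // power supplies, environmental systems, cooling, CPUs, software
        double[] components = {
            99.95, 99.99, 99.90, 99.99, 99.90,
            99.80, 99.50, 99.90, 99.99, 99.00
        };
        System.out.println("WebSphere availability index: "
                + availabilityIndex(components) + " percent");
    }
}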

Calculating the Cost of Downtime

As you'll see in this and the following section, calculating the cost of downtime is relatively simple if your requirements are simple. Generally speaking, calculating downtime cost is as hard or as simple as you want it to be, depending on your input factors. If downtime causes a great deal of intangible impact (for example, lost goodwill), calculating the cost will be harder.

Table 2-6 lists some cost impact factors that can help determine downtime cost.

Table 2-6: Example Downtime Cost Impact Factors

Impact Factor            | Relative Cost Impact                         | Example
Impact to sales          | Variable ($1 million to $6 million per hour) | An outage halts Amazon.com online book sales.
Impact to services       | Variable                                     | Customers are unable to watch a Webcast of a sporting event.
Impact to goodwill       | Intangible                                   | A hot new product goes on sale and the platform fails, leaving many upset customers.
Impact to staff          | Variable                                     | Long hours during system instability can hurt staff morale.
Impact to infrastructure | Variable ($100 to $1 million)                | Constant power cycling, thrashing, and overloading of disks shortens their life spans.

Using Table 2-6 as a guide, along with additional site-specific impact costs, your formula for calculating the cost of downtime can take one of several forms. First, if your WebSphere implementation is specific to sales or financial transactions, you can multiply the average transaction value by the transaction rate per hour and use the result as a per-hour impact cost.

For example, let's say you're operating an online bookstore and selling about 5,000 books per hour at an average cost of $50 per book.

You could do an easy calculation as follows:

 Books sold per hour × Average book cost = Outage cost per hour

For this example, the calculation is as follows:

 5,000 × $50 = $250,000 per hour

In this case, your downtime cost is $250,000 per hour.
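Expressed as a quick sketch in Java (the class and variable names are mine; the figures are the example's):

// Sketch of the hourly outage cost calculation from the bookstore example.
public class HourlyOutageCost {
    public static void main(String[] args) {
        int booksSoldPerHour = 5000;
        double averageBookCost = 50.0;   // U.S. dollars
        double outageCostPerHour = booksSoldPerHour * averageBookCost;
        System.out.println("Outage cost per hour: $" + outageCostPerHour);  // prints $250000.0
    }
}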

On top of this, add other factors such as loss of goodwill: customers who are so annoyed by the outage that they buy the book elsewhere and possibly never come back. You could apply a churn (goodwill) percentage to the equation of, say, 0.5 percent of all registered customers lost per hour of outage.

If you then knew each customer would purchase a single book per month and you had one million customers, you'd know that your annual sales will decrease by $3 million. For example:

 Customer base × Outage churn rate percentage = Set A
 Average number of books per customer per year × Average book cost = Set B
 Set A × Set B = Sales lost through lost goodwill

The calculations are as follows:

 1,000,000 × 0.5% = 5,000 customers (Set A)
 12 × $50 = $600 (Set B)
 5,000 × $600 = $3,000,000 in sales lost through lost goodwill
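Here's the same goodwill calculation as a quick sketch (again, class and variable names are mine; the figures are the example's):

// Sketch of the lost-goodwill calculation: customers lost to churn multiplied
// by the annual sales each of those customers would have generated.
public class GoodwillLoss {
    public static void main(String[] args) {
        int customerBase = 1000000;
        double churnRatePerOutageHour = 0.005;     // 0.5 percent
        int booksPerCustomerPerYear = 12;          // one book per month
        double averageBookCost = 50.0;

        double customersLost = customerBase * churnRatePerOutageHour;              // Set A: 5,000
        double annualSalesPerCustomer = booksPerCustomerPerYear * averageBookCost;  // Set B: $600
        double salesLostThroughGoodwill = customersLost * annualSalesPerCustomer;   // $3,000,000

        System.out.println("Sales lost through lost goodwill: $" + salesLostThroughGoodwill);  // prints $3000000.0
    }
}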

So, as you can see, outages are expensive. Although these are fictitious figures, they represent the types of costs your company can incur. Your estimates and calculations may be simpler or more complex, but at the end of the day you need to research all your business and technology cost impacts. Then calculate your hourly downtime cost based on a one-hour outage during both peak and average utilization periods.

Understanding Your Availability Needs

What is it you need for your organization? Are you running a mission-critical environment that needs no less than 99.99-percent availability? Or does your organization just need a service that can be batch based and can be down several hours a day without impacting the bottom line?

In fact, understanding your availability needs is one of the easiest parts of designing a highly available WebSphere environment. First, you need to determine the financial impact of the WebSphere platform's downtime to your organization. (The previous section explained some example calculations.) Second, you need to determine how much downtime (outage cost) your company can withstand.

Using the example from the previous section, the total hourly outage cost, including both loss of sales and loss of goodwill, came to roughly $3.25 million. Although this is a high figure, it illustrates the point well. If this were your company and you were asked to calculate and understand the impact of outages, you'd need to consider this hourly outage cost. You'd also need to factor in the mean time to recover (MTTR), which may make the minimum average downtime two hours rather than one. Either way, this is a big figure.

The decision you now need to make is what it'll cost to ensure that your WebSphere implementation doesn't go down! Chances are, with an hourly outage cost of $3.25 million, you'd want to aim for an availability percentage greater than 99.99 percent. As you saw in Table 2-2, 99.99-percent availability allows only a small fraction of downtime. Furthermore, as highlighted in Table 2-1, it requires your WebSphere topological architecture to conform to level 5 of the business availability index.

Note  

Incidentally, the example calls for a split-site configuration and the highest level of platform redundancy: redundant servers, split WebSphere domains and cells, and clustered data services.

Therefore, you can now understand your availability needs using these two tables (Tables 2-1 and 2-2). If your downtime costs are large on a per-hour basis, the only way to minimize your downtime, according to Table 2-2, is to ensure greater than a certain level of availability. At this point, you need to "sign up" for the business availability index level that's best suited (according to Table 2-1) to your WebSphere availability percentage requirements.
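One way to put rough numbers on this trade-off is to convert an availability percentage into hours of downtime per year and multiply by your hourly outage cost. The following sketch is my own illustration (not from the book) and reuses the $3.25 million hourly figure from the earlier example:

// Rough illustration: annual downtime cost implied by various availability targets.
public class AnnualDowntimeCost {
    public static void main(String[] args) {
        double hourlyOutageCost = 3250000.0;                 // from the earlier example
        double hoursPerYear = 365.0 * 24.0;
        double[] availabilityTargets = {99.0, 99.9, 99.99, 99.999};

        for (double availability : availabilityTargets) {
            double downtimeHoursPerYear = (1.0 - availability / 100.0) * hoursPerYear;
            double annualDowntimeCost = downtimeHoursPerYear * hourlyOutageCost;
            System.out.println(availability + "% availability -> "
                    + downtimeHoursPerYear + " hours of downtime per year -> $"
                    + Math.round(annualDowntimeCost));
        }
    }
}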

In summary, the business availability index provides a guide for meeting your service availability agreements.

Understanding Your Scalability Needs

As you saw earlier in the chapter, scalability is closely associated with availability. More often than not, by virtue of having a highly available WebSphere environment, you'll also have a highly scalable one.

Scalability, as you've seen, refers to how well (in terms of cost and effort) and how easily (through a modular and extensible application design) your application platform can grow onto existing or additional hardware, or absorb infrastructure software changes.

A common error when scaling application platforms is purchasing too much infrastructure, that is, "overscaling" and "overspecifying" the topological configuration. Overscaling a platform, although great for running extra processes and reducing your capacity management workload, can hide problems and make developers lazy when implementing their application code.

Furthermore, I've seen implementations where a system was so overspecified that it was hiding all sorts of fundamental application code issues, such as memory leaks and poorly performing business logic. The extra capacity insulated the problems from view, and as you know, the longer a problem or bug goes unnoticed through the development life cycle, the more expensive it gets. With this particular application already in production, the cost to fix it would have been large.

It's important to weigh purchasing hardware and infrastructure for scalability against purchasing it for availability. My recommendation is to get availability before investing in scalability. My reasons are simple:

  • First, as I said before, scalability tends to inherently come with availability. Two servers provide a more scalable solution, off the cuff, than a single server (ignoring technicalities, two times more scalable in fact!).

  • Second, a highly scalable single-server solution isn't much good if it doesn't insulate you from single-server outages! A highly available server farm insulates you from the external and internal events that could bring down a platform that isn't highly available.

  • Third, in most cases, you can deploy another server (thus providing more redundancy/availability) and add another line or two to your load-balancing mechanism to point to the new server, and you instantly have additional scalability.

At the end of the day, I could list many reasons why it's more advantageous to spend funds on high availability rather than scalability. There may in fact be legitimate reasons to purchase additional hardware to scale your application. For example, you may already have three WebSphere application servers, and your existing J2EE-based deployed applications require more memory. The solution is to purchase more memory in each of the three WebSphere application servers because this is a vertical scaling requirement (more memory per JVM process). In this case, there's little benefit from purchasing an additional server over that of purchasing additional memory.

In summary, use these tips when considering your scalability and availability requirements side by side:

  • If you want to ensure scalability, implement high availability first (additional servers, additional interfaces, additional disks, and so on). A system isn't scalable if it's nonredundant.

  • If your application JVMs require more memory, scale up vertically.

  • If your deployed applications require more processing capacity (CPUs), scale vertically first, horizontally second.

  • Don't overscale your environment; if you have cash to spend, invest it in high-availability purchases before you scale a nonredundant system.

  • Don't overspecify your environment to insulate developers from badly written or badly performing code; fix the root cause.

start sidebar
Ten Rules to Live By

The following are ten items I always live by when using WebSphere. These availability tips are ones that, if followed, should point you in the right direction for availability success:

  • Mirror your disks.

  • Use redundant network interfaces on your servers.

  • Use redundant network switches, hubs, and routers.

  • Consider systems capable of hot-swappable disks, power supplies, memory, CPUs, and peripheral cards.

  • Horizontally scale your environment with multiple physical servers.

  • Vertically scale your environment to obtain more scalability of your application components (for example, more memory means more buffer space and headroom).

  • Cluster your platform services such as databases and Lightweight Directory Access Protocol (LDAP) servers.

  • Consider disaster recovery and split site deployment for critical systems.

  • Split your application environment into two or more WebSphere administrative domains or cells.

  • Partition your application and platform environment (for example, utility services and customer-facing services) to compartmentalize differing workloads.

end sidebar
 


