9.2 Reliability


Fortress reliability refers to how much I can trust the fortress to be there for me when I need it. There are two types of reliability. There is what I have previously termed pseudoreliability, which means that the fortress appears to be up and running when in reality it is off sucking down doppio macchiatos at Starbucks. I discussed this topic in the chapter on asynchronous drawbridges. The other type of reliability could be described as true reliability.

True reliability means that the fortress is honest to goodness up and running when I ask it to do something. No excuses. No doppio macchiatos. Some people call this availability.

There are two approaches to building true reliability. One is to build the fortress on expensive machines that never go down. The other is to build the fortress on clusters of inexpensive machines. We achieve cluster reliability by having enough machines in the cluster so that even if some of them go down, workflow can continue.

Reliability is somewhat related to scalability. Remember, there are two approaches to achieving scalability: scale-up or scale-out. These two choices parallel the choices you make to achieve reliability. The theory goes that scale-up (buying a big, expensive machine) also buys you a more reliable machine. Supporting evidence is mostly anecdotal, but the scale-up approach is certainly widely trumpeted by makers of big, expensive machines.

Where possible, I favor the scale-out approach to both scalability and reliability over the scale-up approach. Scale-out, in my experience, gives better results at a lower cost. In certain limited situations, however, scale-up is the only realistic alternative. First, let me briefly explain why scale-out is such an attractive approach to reliability.

In general, small, cheap machines (typically Windows-based machines) are cheaper than big, expensive machines (typically mainframe-type machines). This seems reasonable. What is not so obvious is that this is true even when one takes into account the differences in processing power between the small, cheap machine and the big, expensive machine. Most benchmarks show about a fivefold unit-of-work price differential between the small, cheap Windows-type machines and the big, expensive mainframes. This means that if it costs one dollar for a fortress to process a request on a Windows platform, it will probably cost about five dollars to process that same request on a mainframe platform.

The mainframe vendors always argue that the extra cost is worthwhile for two reasons. First, mainframes can process more work requests than Windows machines can, an argument that is valid only for architectures that cannot take advantage of a cluster approach. Second, mainframes are more reliable. This argument is more complicated, so let's take a closer look.

Let's consider some representative numbers. Suppose that it costs 10,000 dollars to build and maintain a fortress on a Windows platform that can process 1,000 requests per minute. If we're using a cluster architecture, it is safe to assume that we can build and maintain a fortress that can process 10,000 requests per minute on the Windows platform for about 100,000 dollars. Ten times the workload for ten times the cost. This is standard cluster scale-out analysis.

Now let's assume that a mainframe-based fortress that can process 10,000 requests per minute will cost about 500,000 dollars, since, as I mentioned, the cost per request is typically five times higher on mainframe systems. What do you get for your extra 400,000 dollars? According to the mainframe vendors, substantially higher reliability. But the math doesn't hold up here.
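
To make this arithmetic concrete, here is a minimal Python sketch of the cost model. All of the figures (the 10,000-dollar machine, the 1,000 requests per minute, the fivefold multiplier) are the illustrative numbers from this discussion, not measured prices, and the function names are mine.

```python
# Illustrative cost model; all figures are the assumed numbers from the text.
MACHINE_COST = 10_000          # dollars per Windows machine
REQUESTS_PER_MACHINE = 1_000   # requests per minute one machine can handle
MAINFRAME_MULTIPLIER = 5       # assumed per-request cost ratio, mainframe vs. Windows

def cluster_cost(requests_per_minute: int) -> int:
    """Cost of a Windows cluster sized for the workload (linear scale-out)."""
    machines = -(-requests_per_minute // REQUESTS_PER_MACHINE)  # ceiling division
    return machines * MACHINE_COST

def mainframe_cost(requests_per_minute: int) -> int:
    """Cost of a mainframe handling the same workload at 5x the unit-of-work price."""
    return MAINFRAME_MULTIPLIER * cluster_cost(requests_per_minute)

print(cluster_cost(10_000))    # 100000 dollars
print(mainframe_cost(10_000))  # 500000 dollars
```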

For the sake of argument we'll assume that mainframe computers are 10,000 times more reliable than Windows computers. This is highly unlikely, but I'm using extreme numbers to give the mainframes every possible benefit of the doubt. I'll assume, further, that the mainframe computer fails one day out of 100,000, giving it a 0.00001 chance of failure, or a 99.999 percent reliability rating. If the Windows machine is 1/10,000 as reliable, then it is predicted to fail one day out of 10, giving it a 0.1 chance of failure, or a 90 percent reliability rating.

Let's say we have a cluster of M machines, each with a probability P of failure. What are the chances that F of those M machines will fail at the same time?

Choose one machine in the cluster. The chances of its failing on any given day are P. The chances of its next-door neighbor failing on that day are also P, because they are the same basic machines. The chances of both machines failing on the same day are P × P, or P^2, because the two events are probabilistically independent. The chances of F machines failing are thus P^F. If P is equal to 0.1 (our highly pessimistic initial assumption), we can calculate how many failures (F) would match the reliability of the mainframe, which had a P of 0.00001. In this case F must be 5, because 0.1^5 = 0.00001.
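
A few lines of Python make the independence argument concrete. The variable and function names are mine; the 0.1 and 0.00001 failure probabilities are the deliberately extreme assumptions above.

```python
P_WINDOWS = 0.1        # assumed daily failure probability of one Windows machine
P_MAINFRAME = 0.00001  # assumed daily failure probability of the mainframe

def prob_all_fail(p: float, f: int) -> float:
    """Probability that f particular independent machines all fail the same day."""
    return p ** f

# How many simultaneous Windows failures match the mainframe's failure rate?
for f in range(1, 7):
    print(f, prob_all_fail(P_WINDOWS, f))
# At f = 5, 0.1**5 is about 0.00001, the mainframe's assumed failure probability.
```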

In other words, if we make the cluster 5 machines bigger than is necessary to handle the workload requirements, we get about the same reliability as we have with the large mainframe. We need 10 machines to handle the workload (on the basis of initial assumptions) and 5 more to achieve mainframe reliability.

If the machines cost 10,000 dollars each, our highly reliable (by mainframe standards) cluster will cost 150,000 dollars. Of that, 100,000 dollars is the cost to process workload and 50,000 dollars is the cost for the needed reliability. To process the same workload with the same reliability on the mainframe, we will have to spend 500,000 dollars, more than three times as much. And for a mere 20,000 dollars more (two additional machines, bringing F to 7), we can get 100 times more reliability than the mainframe can offer, because 0.1^7 = 0.0000001.
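
Continuing the sketch from before, the cost comparison and the cheap extra reliability look like this in Python; again, every figure is the illustrative number from the text.

```python
MACHINE_COST = 10_000     # dollars per Windows machine (assumed)
WORKLOAD_MACHINES = 10    # machines needed to process 10,000 requests per minute
SPARES = 5                # extra machines for mainframe-class reliability
MAINFRAME_COST = 500_000  # assumed mainframe price for the same workload

cluster_cost = (WORKLOAD_MACHINES + SPARES) * MACHINE_COST
print(cluster_cost)                   # 150000 dollars
print(MAINFRAME_COST / cluster_cost)  # ~3.33: the mainframe costs over 3x as much

# Two more 10,000-dollar machines push F from 5 to 7:
print(0.1 ** 7)  # ~1e-07, 100 times more reliable than the mainframe's 0.00001
```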

All of this assumes that the mainframes are 10,000 times more reliable than the Windows-platform machines. In fact, however, if there is any difference in reliability, I suspect it is much closer to 10 times than 10,000 times.

Those of you very familiar with probability theory will realize that I have simplified the equations slightly, but the overall numbers are close to accurate.
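
For readers who want the exact figure, the precise version of the question is a binomial tail: the cluster falls short of its workload only when more machines fail on the same day than there are spares. A minimal sketch under the same assumed numbers (math.comb requires Python 3.8 or later):

```python
from math import comb

P = 0.1       # assumed daily failure probability per machine
M = 15        # machines in the cluster: 10 for workload plus 5 spares
SPARES = 5

# Exact probability that more than SPARES machines fail on the same day,
# leaving fewer than the 10 machines needed to carry the workload.
p_short = sum(comb(M, k) * P**k * (1 - P)**(M - k) for k in range(SPARES + 1, M + 1))
print(p_short)  # ~0.0022: larger than the simplified P**F estimate, but still far
                # better than a single machine's 0.1, and each additional cheap
                # spare machine shrinks it further.
```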

This whole argument assumes that a cluster architecture is actually possible. There is one area in which clustering is very difficult: the data strongbox. Data strongboxes are not amenable to running on clusters (at least, not clusters in the sense that I am using the term here). I discussed this issue in Chapter 6 (Asynchronous Drawbridges).

For most fortresses, the difficulty in clustering the data strongbox is not a problem. We generally solve this problem by dedicating one machine to the data strongbox. This machine we make as big as is necessary to handle the expected data workload, a technique I described in Chapter 6 as build-big. The rest of the fortress we design with a cluster architecture. In other words, we scale up (or build big) the data strongbox, and we scale out everything else in the fortress.

Very few fortresses need particularly large data strongboxes. The data strongbox of a presentation fortress, for example, needs to hold little more than the state of the browsers it is currently processing. By database standards, this isn't a whole heck of a lot of data.


