Clustering Overview

Although 10g AS supports clustering of both mid-tier application servers and WC servers (called cache clusters; see Chapter 19), the primary focus of this chapter is application server clusters. Generic web application server clustering involves having two or more mid-tier instances transparently servicing incoming requests. The idea behind clustering is that two or more servers working in unison can do more work, and if one member fails, the others will pick up the extra work. Therefore, clustering provides these benefits:

  • Increased, more scalable performance

  • Fault tolerance leading to higher availability

You'll look at each of these benefits before examining the 10g AS specifics.

Scalable Performance

For large systems with hundreds or thousands of simultaneous connections, or systems in which long-running processing takes place, a saturation point exists for even the largest server. At some point, even the largest server with multiple CPUs and gigabytes of memory will not be enough to service requests. That may happen during a critical peak period or under the sustained workload of a popular site; in either case, one server cannot support the system. It may be possible to add more CPUs and memory to the single server, a practice called vertical scaling, whereby the single machine is made more powerful. This may work for a while, but the unavoidable limitation of vertical scaling is that sooner or later no machine will be big enough.

The alternative to vertical scaling is horizontal scaling. This involves having multiple servers, each roughly equal in size and power, process incoming requests. A network load balancer sits in front of the machines and distributes requests to the individual nodes, typically using an algorithm that routes each request to the least busy server (sketched in the example that follows). If you need more power, you can add another server horizontally. Alternatively, the nodes themselves could be upgraded vertically in a mixed approach that uses both more and larger servers. As you'll see in a later section, horizontal scaling does not necessarily imply a hardware cluster.
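
To make the least-busy idea concrete, here's a minimal sketch of that routing decision in Java. This is not the algorithm any particular load balancer or 10g AS component uses; the Node class, hostnames, and method names are illustrative assumptions.

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    // A toy "least busy" dispatcher. The class, hostnames, and methods are
    // illustrative assumptions, not part of any load balancer or 10g AS API.
    public class LeastBusyBalancer {

        static class Node {
            final String host;
            final AtomicInteger activeRequests = new AtomicInteger();
            Node(String host) { this.host = host; }
        }

        private final List<Node> nodes;

        LeastBusyBalancer(List<Node> nodes) { this.nodes = nodes; }

        // Route each incoming request to the node with the fewest active requests.
        Node pickNode() {
            Node best = nodes.get(0);
            for (Node n : nodes) {
                if (n.activeRequests.get() < best.activeRequests.get()) {
                    best = n;
                }
            }
            best.activeRequests.incrementAndGet();  // request now in flight on this node
            return best;
        }

        // Called when a node finishes servicing its request.
        void release(Node n) { n.activeRequests.decrementAndGet(); }

        public static void main(String[] args) {
            LeastBusyBalancer lb = new LeastBusyBalancer(Arrays.asList(
                    new Node("app1.example.com"), new Node("app2.example.com")));
            Node n = lb.pickNode();
            System.out.println("Dispatching request to " + n.host);
            lb.release(n);
        }
    }

Production load balancers layer health checks, weighting, and session stickiness on top of this basic comparison, but the core routing decision is this simple.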

You'll need to choose between vertical and horizontal scaling. Many people prefer having one large server host their application(s) because it's conceptually simple and there are fewer components to manage. Indeed, most small- to medium-sized systems can get away with vertical scaling.

However, with the release of Oracle's Grid computing initiative, horizontal scaling is now the trend, as some envision a grid of small, cheap two- to four-CPU nodes working together to share the workload. The idea is that if more processing power is needed, you simply plug in more nodes until enough processing power is available. This concept isn't new, but time will tell how widely it's implemented.

High Availability

High availability (HA) is the end result of having a system that's normally accessible for processing 95 percent of the time or higher (even 98 or 99 percent). A machine and system can be "highly available" if there are no hardware, software, or application problems; if the administrators are competent; and with a little luck. This is generally "good enough" for most systems because true HA is far more complex and expensive to implement.
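
To put those percentages in perspective, the arithmetic is simple (a year has 8,760 hours):

    95 percent availability: 0.05 x 8,760 = 438 hours of downtime per year (about 18 days)
    99 percent availability: 0.01 x 8,760 = 87.6 hours of downtime per year (about 3.7 days)

Even the "good enough" numbers permit days of downtime per year, which is why true HA is a different class of problem.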

True HA is a product of having fault-tolerant systems with automatic failover at every level. This means more than just the servers: it includes web application servers, database servers, telecom equipment, power supplies, and so on.

Fault Tolerance

Fault tolerance means that if a failure occurs, the system is still accessible. Specifically, if one component fails there's a backup for it. System architects seek to identify and eliminate any single point of failure in their designs so there's no Achilles' heel that can render a system unusable.

A vertically scaled system is inherently a single point of failure. Although the server may run effectively for months without any problems, sooner or later it will either crash or need to be taken down for patching or maintenance. At that point, the drawbacks of having a single point of failure become evident.

Horizontally scaled systems by definition aren't themselves single points of failure because any machine can fail and the others will pick up its work. Of course there may be other components in the system that are single points of failure such as the database server or network load balancer, but these components can be constructed redundantly, too.

Tip 

Don't overlook basic infrastructure and environmental issues when implementing HA systems. We've seen numerous instances in which critical, million-dollar systems have crashed because leaky roofs and water pipes flooded computer rooms, or because servers overheated when air conditioning systems failed. In retrospect these failures often look like silly mistakes, but we assure you that no one finds them funny or innocent when they occur.

Automatic Failover

Depending on how a system is implemented, the redundant components may not necessarily be configured to start up automatically in the event of a failure. This is most common with database clusters: unless the databases are set up in a Real Application Clusters (RAC) configuration, only one instance of each database runs at a time.

If Server A fails in a cluster and all the databases were on Server A, they must be restarted on Server B in order to be accessible. This can be either a manual operation or an automatic one, but automatic failover isn't necessarily easy to implement. The same concept applies to any redundant device in a system: at what point is it triggered to start up and take over processing? The obvious goal is for this to happen automatically. However, can it happen transparently, without the user ever noticing, and will active transactions be preserved or rolled back? These are the items that need to be addressed when discussing automatic failover for a component. The sketch that follows shows the basic mechanics.
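
Here's a minimal sketch of the detect-then-failover loop, assuming a simple TCP health check. The hostnames, port, and the startOnStandby() method are hypothetical placeholders, not part of any Oracle clusterware API.

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    // A toy failover monitor: poll the primary, and after several consecutive
    // missed health checks, trigger startup on the standby. The hostnames and
    // startOnStandby() are hypothetical placeholders, not an Oracle API.
    public class FailoverMonitor {

        private static final int MAX_MISSES = 3;        // misses before declaring failure
        private static final int POLL_MS = 5_000;       // check every five seconds
        private static final int TIMEOUT_MS = 2_000;    // per-check connect timeout

        static boolean isAlive(String host, int port) {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(host, port), TIMEOUT_MS);
                return true;                            // TCP connect succeeded
            } catch (IOException e) {
                return false;                           // unreachable or refused
            }
        }

        static void startOnStandby() {
            // Placeholder: a real monitor would mount the shared storage,
            // start the database or application on Server B, and repoint clients.
            System.out.println("Primary down; starting service on standby node");
        }

        public static void main(String[] args) throws InterruptedException {
            int misses = 0;
            while (misses < MAX_MISSES) {
                misses = isAlive("server-a.example.com", 1521) ? 0 : misses + 1;
                Thread.sleep(POLL_MS);
            }
            startOnStandby();
        }
    }

Note that even this toy version must answer the questions raised above: how many missed checks constitute a failure, and what happens to in-flight transactions when startOnStandby() runs.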

TRUE COST OF HIGH AVAILABILITY

We cannot begin to count the number of times we've heard a project manager declare "We want high, 7x24 availability" with no real idea what that meant on the systems side or what it would cost to achieve. It's perfectly reasonable to want a highly available system, but if you need true 7x24 access, you're going to have to pay for it. Specifically, you'll need redundant servers, network equipment, high-availability software, a good architectural design, disaster recovery, and the skilled people to make it happen. All of this carries a financial premium and needs to be designed in from the very beginning of a project, not added as a whim a few weeks before implementation.

Keep in mind that you need to retain high-quality people to design and administer these complex systems. Many times we've seen organizations pay big money for an HA cluster, yet the administrators on site lacked the skills to keep it running; their "HA cluster" ended up with more downtime than a simple single-node system would have had. In most cases these were good people, but they lacked the training, experience, and mindset for these types of systems.


Clustering Definitions

We've discussed clustering as a means of having multiple servers working together, but clustering comes in several forms, as described here:

  • Hardware cluster. Multiple nodes "tied" together sharing a resource such as a disk array. Each member of the cluster has access to that resource and, depending on the configuration, may access it concurrently, with that concurrent access managed through special software. If one node fails, the other nodes continue to run, and applications that were running on the failed node may automatically start up (that is, fail over) on the surviving nodes and continue processing.

  • Database cluster. Database clusters reside on hardware clusters. The databases either run on one node exclusively and fail over to the other nodes in the event of a failure, or they run on multiple nodes simultaneously in a RAC configuration.

  • Web application server cluster. These are multiple servers scaled horizontally and fronted by a network load balancer or WC to form an application server cluster. These nodes don't necessarily exist as a hardware cluster, physically tied to each other and sharing resources. The individual nodes may be coupled by nothing more than a load balancer dispatching HTTP requests to them independently of one another; they may be loosely coupled by software processes such as Oracle Process Manager and Notification (OPMN) and Distributed Configuration Management (DCM) for 10g AS clusters; or they may exist as true hardware clusters, depending on the configuration.

It's important to understand the previous terms when discussing clustering because IT tends to throw the term "cluster" around loosely and it means different things to different people. Most system administrators initially think in terms of hardware clusters, while DBAs think of RAC database clusters, and web administrators think of small- to medium-sized servers behind a load balancer. When talking about 10g AS clusters, you'll be referring to the web application server clustering definition of having multiple, small- to medium-sized independent servers fronted by WC and "clustered" by OPMN, DCM, and the Oracle Infrastructure installation.

In the next section, you'll see how all these characteristics of clustering have been implemented within 10g AS.


