WebSphere High Availability and Failover Fundamentals


Because of WebSphere's logical and modular design, just about every key server feature can be clustered and configured for failover. This allows you to implement highly available configurations that maximize uptime for your end users.

Essentially, what we mean by high availability is designing and building a WebSphere-based platform in such a way that it eliminates single points of processing and failure. This involves not only adding servers to gain redundancy, but also ensuring that aspects of the environment, such as data and application software, are implemented and developed with a modular, distributable model in mind.

Consider the following example for a moment. Suppose you have a WebSphere-based application that relies on a data store of some sort (e.g., an RDBMS). Within this application, a single component manages data concurrency (such as an ID generator or an incrementor). You'll have difficulty deploying that application into a distributed or highly available environment, because the question immediately arises: how can other cloned application instances make updates to a centralized data store without corrupting the data? As the sketch below suggests, pushing that responsibility down to the database itself is one way out.
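One way to sidestep a single in-memory generator, sketched here under the assumption of plain JDBC access to the shared RDBMS, is to let the database assign keys at insert time. The ORDERS table, its CUSTOMER_NAME column, and the OrderDao class are purely hypothetical; the point is that any number of cloned application instances can insert rows without coordinating a shared counter.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class OrderDao {

    // Insert a row and let the database generate the primary key,
    // so no single in-JVM "incrementor" component is required.
    public long insertOrder(Connection con, String customer) throws SQLException {
        PreparedStatement ps = con.prepareStatement(
                "INSERT INTO ORDERS (CUSTOMER_NAME) VALUES (?)",
                Statement.RETURN_GENERATED_KEYS);
        try {
            ps.setString(1, customer);
            ps.executeUpdate();
            ResultSet keys = ps.getGeneratedKeys();
            if (keys.next()) {
                return keys.getLong(1);   // database-assigned ID, safe across clones
            }
            throw new SQLException("No generated key returned");
        } finally {
            ps.close();
        }
    }
}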

Another aspect of high availability, and something I touched on in Chapter 3, is the question of what is really high availability in terms of your organization. Do you really require 24/7/365 availability or would something closer to 20/5 be satisfactory? To achieve 24/7/365, you'll need to spend extra money and effort in ensuring you've removed all (within reason) single points of processing and failure and, most likely, you'll need to distribute your processing site over two or more locations.

You also need to consider the difference between operational availability and service availability. In most cases, when we talk about high availability, we're really talking about service availability: the availability of the application and services for end users. However, you can't have service availability without operational availability. On the other hand, you can have operational availability but not necessarily service availability.

Consider a scenario in which you have 15 application servers, a highly available database server tier, clustered Web servers, and actively redundant network infrastructure. You may have a data integrity issue that has caused your application or your service to become unavailable, yet your operational platform is still fully online. This is the quandary of application availability. Where do you draw the line to ensure that you have a suitable amount of availability at a cost that isn't prohibitive? Again, consider the discussions in Chapters 2 and 3 relating to estimating your downtime and uptime costs. If your total downtime costs are only $100 per hour, then purchasing an additional $20,000 server isn't cost justifiable (unless you have very deep pockets!).

The following sections summarize what forms of availability there are within a WebSphere environment.

WebSphere High Availability

WebSphere platform high availability, as opposed to application availability (i.e., operational availability versus service availability), can be achieved via one or more of the following methods:

  • Clustering (software, hardware, WebSphere)

  • Disaster recovery

  • Failover

Each of these concepts is supported by WebSphere, and in the next several sections, you'll explore each of these in more detail. Each method description is followed by an implementation explanation.

Clustering

Clustering is a form of availability configuration in which key components are duplicated to some degree with the aim of reducing single points of failure. Although you can use clustering to achieve performance improvements by having more components process more requests, clustering is essentially a high availability concept.

As you'll see, clustering comes in three forms in the context of WebSphere: software clustering, hardware clustering, and WebSphere internal clustering.

Software Clustering

Software clustering boils down to the implementation of what I term proprietary clustering. Proprietary clustering is usually facilitated via product- or application-specific clustering components that allow you to span out a particular service to one or more physical servers.

Some examples of software clustering implementations are as follows:

  • HTTP server clustering

  • Lightweight Directory Access Protocol (LDAP) server clustering

  • Database server clustering

The actual engine or software that drives the clustering is typically part of the product you purchase. For example, Oracle provides a "clustering" layer as part of the Oracle Parallel Server (OPS) product. Crystal Enterprise, a product that provides Web reporting capabilities, comes with its own internal clustering software.

This isn't to say that you can't use a hardware clustering solution instead of, or in conjunction with, any of the proprietary software clustering solutions. In fact, in many cases, a combination of the two will further improve your potential availability.

Hardware Clustering

Hardware clustering is, in most cases, the more costly of the three WebSphere environment clustering solutions. However, it typically provides you with the highest form of service availability. The reason for this is that hardware clustering, as its name implies, uses redundant components (e.g., servers, networks, etc.) to provide clustering solutions.

Hardware clustering is also the most complex of the three WebSphere environment clustering technologies because it involves many parts and complex software that I tend to refer to as clusterware.

Clusterware is third-party software that you use to cluster your components. Clustering in this form usually involves more than one server, and the clusterware sits "on top" of the physical environment and overlooks key processes (e.g., databases, Web servers, etc.) and key hardware components (e.g., network devices, hard disks, storage groups, and the servers themselves).

Some of the more well-known and WebSphere-supported clusterware solutions are as follows:

  • Microsoft Windows Internal Clustering (Windows NT, 2000, XP, and 2003 Server editions)

  • Veritas Cluster Server (most operating systems, including Solaris, Linux, AIX, HP-UX, and Windows)

  • Sun Microsystems SunCluster (Sun Solaris systems only)

  • High Availability Cluster Multiprocessing (HACMP, IBM AIX systems only)

  • MC/ServiceGuard (HP-UX systems only)

  • Compaq/HP TruCluster (Digital/Compaq/HP Tru64 Unix systems only)

Each of these clustering solutions provides varying levels of functionality and reliability that allow you to cluster servers running on their respective operating systems.

Essentially, the architecture of a hardware clustered solution is that the clusterware is configured to own and manage key aspects of the environment. As mentioned earlier, this includes peripheral hardware components as well as software processes (daemons and server processes).

There are two primary forms of hardware clustering:

  • Active-standby clustering

  • Active-active clustering

In active-standby clustering, typically two servers are configured in a cluster, and one server is the master until the clusterware determines that the second server should take over. The reason for the takeover may be a faulty disk, a dead daemon process, or some other critical component failure (e.g., a CPU panic).

Figure 7-1 shows a before shot of a basic active-standby cluster configuration, and Figure 7-2 shows the same environment after an event triggered the cluster to failover from server A to server B.

Figure 7-1: An active-standby cluster configuration before failover
Figure 7-2: An active-standby cluster configuration after failover

In Figure 7-2, some form of failure took place on server A. The clusterware sits on both servers (or all servers) in a clustered configuration, and each node continuously informs the remote server of its status. This "heartbeat" can be as simple as a network ping, or something more complex such as periodically running a small application on each remote host to determine its state.
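To make the heartbeat idea concrete, here is a minimal, hypothetical sketch of a status checker; it is not the implementation any particular clusterware product uses. It assumes the peer node exposes a TCP service (on an assumed port 7000) and treats several consecutive failed connections as a trigger for failover logic.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class HeartbeatMonitor {
    private static final int CONSECUTIVE_FAILURES_ALLOWED = 3;

    public static void main(String[] args) throws InterruptedException {
        String peerHost = "serverB";   // hypothetical peer node
        int peerPort = 7000;           // assumed heartbeat/service port
        int failures = 0;

        while (true) {
            if (peerIsAlive(peerHost, peerPort, 2000)) {
                failures = 0;
            } else if (++failures >= CONSECUTIVE_FAILURES_ALLOWED) {
                // In real clusterware, this is where resource groups, virtual
                // IP addresses, and managed processes would be taken over.
                System.out.println("Peer appears dead; initiating takeover");
                break;
            }
            Thread.sleep(5000);        // heartbeat interval
        }
    }

    private static boolean peerIsAlive(String host, int port, int timeoutMs) {
        try {
            Socket s = new Socket();
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            s.close();
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}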

In this case, server A failed (let's say a database process crashed) and server B, via one of its continual status requests, learned that a process was dead (or missing) on server A. This would have triggered server A to shut down all its managed processes and fail over to server B; alternatively, server B may have detected the missing process and told the clusterware on server A to shut down, with server B subsequently taking over as master.

Because there is a failover concept involved, there is a period in which the cluster can't service requests. This period can be anything from a second or two through to several minutes or more, depending on the cluster complexity and clusterware configuration.

The second type of hardware clustering solution is the active-active cluster. This is a far more complex clustered environment, but it provides greater availability and usually more performance as well (you can use all nodes in the active-active cluster to process requests rather than have one lie dormant for lengthy periods). Active-active clusters are more complex because of the data and process management issues surrounding concurrent access.

Consider for a moment a database on a single node. When you make updates to a database, there is a whole layer of data locking, I/O, and process management that ensures there is a high level of transaction integrity. Then consider placing two or more nodes into an active-active database cluster and what level of control and management would be required to facilitate that same update, but to a database that is located on multiple physical servers. This is the degree to which hardware clusters in an active-active configuration need to be carefully designed and implemented.

The following two diagrams show an active-active clustered database configuration. Figure 7-3 shows an operational active-active cluster servicing requests. You'll notice that there are three databases operating over the two physical servers, each operating with a single instance on each node. Figure 7-4 shows the same environment, with the difference being that server A has failed in some respect, and server B has taken over all operational load of the environment.

Figure 7-3: An active-active cluster configuration before failover
Figure 7-4: An active-active cluster configuration after failover

In an active-active configuration there is little or no pause during a failover scenario; the surviving cluster services continue servicing requests from users and other applications. Therefore, unlike with active-standby clusters, the availability of an active-active configuration is typically higher.

The software driving your cluster will affect how you communicate to your cluster. Your interface to the cluster may be as simple as a hardware load-balancing switch, or it may be software driven.

Caution  

Be wary of committing yourself and your active-active cluster to being able to handle 100 percent of the overall cluster load when half of your cluster is down. It's possible to build an active-active cluster in which half of the cluster can handle the entire load, but only if each node is deliberately sized with that spare capacity in mind.

WebSphere Clustering

WebSphere itself also provides several forms of clustering, all of which could be considered the same as software clustering. The reason I've split discussion of this clustering out into its own section is that this is where you'll specifically focus your attention in this chapter (and besides, WebSphere clustering and high availability is what this book is all about!).

So what then is WebSphere clustering? In essence, WebSphere clustering relates to the clustering of components operating within the application server, usually among a number of servers. The nomenclature does differ slightly between WebSphere version 4 and version 5 with respect to clustering, but essentially the concept is the same. WebSphere clustering provides the ability to distribute load and services over one or many physical nodes operating WebSphere.

As I expressed earlier in the chapter, it's important to remember that not all application components can be clustered. Think of situations (hopefully not too common!) in which you have had singleton objects in your application or some form of socket-based I/O happening on the backend. There are ways around this, but your application will quickly become cumbersome and overly complicated.

Disaster Recovery

Disaster recovery is a widely adopted business imperative that involves having a secondary site or tertiary sites (or more) and locating mirrored hot or cold standby infrastructure there in a ready operational state. It's primarily used to prepare for disaster-level events such as earthquakes, floods, and other natural disasters; power and environmental outages; and, unfortunately more common nowadays, terrorist or war-based events.

Disaster recovery is an expensive undertaking by any organization, regardless of size. Correctly done, it typically involves duplicating all processes and infrastructure at one or more other sites, and either having the sites operate simultaneously (and gaining geographical load-balancing as a positive side effect) or operating the disaster recovery site in a hot or cold standby configuration. True disaster recovery requires locating your data centers in different cities to avoid power grid contention or naturally occurring events (e.g., earthquakes, inclement weather, etc.).

In years gone by, especially during the Cold War era, disaster recovery was used by computing sites to guard against nuclear attacks on cities. In today's world, in which terrorism is a major threat, this may become a new motive for opting for disaster recovery, depending on the criticality of your data.

In recent years, simultaneously operating disaster recovery sites have become a commonly implemented solution. These solutions are primarily driven by the notion that if you have a second or nth site operating, why not harness some of that computing power to increase performance and scalability?

Implementing a disaster recovery solution for a Web site that is fairly simplistic (one that may not have a vast amount of dynamic presentation or complex backend interconnections, such as legacy and/or EIS systems, databases, etc.) is fairly straightforward. There are few moving parts, so to speak. If, however, you attempt to run a simultaneous solution among multiple sites and have legacy and backend services required at all sites, then your complexity very quickly increases. For example, how would you perform synchronous writes to multiple clustered database solutions over vast distances without affecting performance? It's possible to do so, and there are solutions available on the market from vendors such as Veritas and EMC, but there is a cost.

A "hot" disaster recovery site is slightly different from a simultaneous or split-site implementation. In the truest sense, a hot disaster recovery site is typically what comes to mind when people think of disaster recovery. A hot site operates by having the same hardware located at an alternate site. Should the primary site fail, for whatever reason, the secondary site immediately kicks in. The "hotness" comes from constant synchronization of data, performed in an asynchronous or batched manner. That is, the data being replicated isn't necessarily written bit for bit on the remote hot site for each write I/O occurring on the primary site. Although sites that are close to one another (within 50 kilometers of each other) can get away with tighter forms of data synchronization, sites that are distributed over a few hundred to a few thousand kilometers may suffer performance issues with write updates and synchronization from the primary to the backup site due to latency.

Figure 7-5 shows a dual-site topology in which site A is the primary site and site B is the secondary site.

Figure 7-5: A disaster recovery hot site topology

In Figure 7-5, the example active-active database cluster environment is distributed over two sites, Seattle and Washington, D.C. With the increased distance between the two servers comes additional latency, to the tune of roughly 45 milliseconds. On a basic IP network this isn't anything to be too worried about, but consider the performance hit if your database nodes were constantly trying to synchronize cache, distribute queries, and synchronize I/O to a shared storage array.

In summary, a hot disaster recovery site is a sound solution for having a secondary, ready-to-operate operational environment. However, you need to consider how you're going to get the data synchronized between the two sites and at what frequency. How much data loss can you sustain before your application is unusable: 1 second, 5 seconds, 15 seconds, or 2 minutes? For example, if you batch-synchronize every 30 seconds and your peak write rate is 200 transactions per second, your worst-case exposure is roughly 6,000 transactions. That is, you may be able to get by with a hot disaster recovery site such as the one in Figure 7-5 if you can tolerate a delay in data synchronization between the two sites. This form of disaster recovery solution may be a cheaper and more viable option if your recovery service level agreements allow for delayed data synchronization.

Cold disaster recovery sites are typically implementations in which a secondary site sits almost dormant until the need arises to fail over to it in a disaster. In some cases, the cold site works by synchronizing data and application configurations from archives or backups only at the time they're needed. Other approaches include deploying all application changes and configuration to both sites at each "software release" and then synchronizing data at the time of a disaster.

In summary, cold disaster recovery sites are far less costly to maintain than hot disaster recovery sites, but there is a greater time period between restoration of data and application availability. This period may be anywhere from 1 hour to 72 hours or more.

Failover Techniques with WebSphere

WebSphere is able to take advantage of all the previously mentioned clustering and site implementation architectures. Figure 7-6 shows a complex WebSphere environment that has incorporated many different implementations of clustering and high availability facilities.

Figure 7-6: A complex WebSphere high availability environment

Figure 7-6 includes the following levels of high availability:

  • Database clustering

  • WebSphere domains (WebSphere version 4)/cells (WebSphere version 5)

  • Cloned WebSphere application servers

  • Vertically scaled environment (multiple JVMs per WebSphere server)

  • Horizontally scaled environment (multiple physical WebSphere servers)

  • Redundant, load-balanced frontend HTTP servers

  • Dual redundant load balancers

In this example, it would be possible to make this configuration into a split site (simultaneous processing centers) or a hot disaster recovery configuration due to the split domain/cell configuration. The concept of a split domain or cell configuration is almost like the environment in Figure 7-6: two separated application environments glued together.

This type of configuration, which I refer to as a configuration split down the spine, helps to prevent cascade failures within your application code from bringing down the entire environment. In the case of a configuration split down the spine, if the domain A or cell A applications "rupture" and cause a cascade failure on WebSphere application servers A and B, the failure shouldn't affect the second domain or cell, because the environment is split down the lines of the administration boundaries.

Let's look at the different components in more detail and put into context how each tier and key component of a WebSphere environment can be implemented in such a way that it provides a high degree of failover or clustered capabilities (and hence, high availability).

Web Server Tier: HTTP Servers

The Web server tier (the HTTP servers that transfer HTTP/HTTPS-based requests to and from the WebSphere application servers) is one of the more straightforward components of a WebSphere topology to implement in a highly available form.

As you saw in Chapter 6, distributing Web servers gives you high availability of the servers that pass requests through to the backend WebSphere application servers via the WebSphere HTTP plug-in that sits on each HTTP server. By having multiple HTTP servers, you gain a high level of redundancy, the ability to service more requests, and an increased number of possible paths through to your backend application servers.

Recall that you'll need some form of load balancer in front of your HTTP servers. This can be a hardware appliance from a company such as Cisco or a software product such as the WebSphere Edge Server. Both of these products, and the many more available on the market, will distribute requests to the appropriate Web server based on your desired configuration (i.e., least loaded, load balanced, round-robin, random selection, etc.). This provides the HTTP server availability, and the WebSphere HTTP plug-in will then distribute the user requests to any of the configured backend WebSphere application servers.
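To illustrate the idea behind these distribution policies (and not the actual algorithm used by any particular load balancer or by the WebSphere plug-in), here is a minimal round-robin selector over a list of hypothetical backend Web server addresses.

import java.util.Arrays;
import java.util.List;

public class RoundRobinSelector {
    private final List<String> backends;
    private int next = 0;

    public RoundRobinSelector(List<String> backends) {
        this.backends = backends;
    }

    // Hand out backends in strict rotation; real products layer health
    // checks and weighting (least loaded, random, etc.) on top of this.
    public synchronized String nextBackend() {
        String target = backends.get(next);
        next = (next + 1) % backends.size();
        return target;
    }

    public static void main(String[] args) {
        RoundRobinSelector selector = new RoundRobinSelector(
                Arrays.asList("http://web1:80", "http://web2:80")); // hypothetical hosts
        for (int i = 0; i < 4; i++) {
            System.out.println("Route request to " + selector.nextBackend());
        }
    }
}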

Web Container

In most cases, the next component that will service requests in the environment is the Web container, which is part of the WebSphere application server. The WebSphere HTTP plug-in passes requests to the appropriate application server clone on one of the configured backend WebSphere application servers. Once the request has made its way to the WebSphere application server, the targeted Web container will service the requests.

In the event that the application server the particular Web container is operating under fails, the next request made by the end user will be directed to the next appropriate application server clone, and the session details will be, if configured, retrieved from the user session database.

Therefore, for the Web container to be highly available, the following components need to be free of single points of failure:

  • Physical application server

  • Database server(s)

  • Network infrastructure

Once a user request has hit a particular application server (Web container), subsequent requests from that user will, in most cases, stay with that application server. Whether this matters depends on your application architecture (e.g., do you need sessions?). WebSphere clustering will handle failover if the originally targeted application server (Web container) fails.
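For that failover to work with database-persisted or distributed sessions, everything your code places in the HttpSession generally needs to be serializable so it can be written out and rebuilt on another clone. A minimal sketch follows; the CartServlet and ShoppingCart classes are hypothetical, and only the standard servlet API is assumed.

import java.io.IOException;
import java.io.Serializable;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

public class CartServlet extends HttpServlet {

    // Session attributes must be serializable for the session manager to
    // persist them (e.g., to the session database) and restore them on
    // another application server clone after a failure.
    public static class ShoppingCart implements Serializable {
        private int itemCount;
        public void addItem() { itemCount++; }
        public int getItemCount() { return itemCount; }
    }

    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        HttpSession session = req.getSession(true);
        ShoppingCart cart = (ShoppingCart) session.getAttribute("cart");
        if (cart == null) {
            cart = new ShoppingCart();
        }
        cart.addItem();
        session.setAttribute("cart", cart); // re-set so the change is picked up for persistence
        resp.getWriter().println("Items in cart: " + cart.getItemCount());
    }
}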

EJB Container

The EJB container, not unlike the Web container, receives its requests from somewhere in the WebSphere environment. This may be a Java object within the Web container, such as a servlet or JSP, or it may be another EJB somewhere else in the environment, either local or remote.

The EJB container is less bound to servicing its own originating request. For valid reasons, your application code, specifically EJBs operating under the EJB container, may call other EJBs that happen to be on remote servers. This may be accomplished through forceful remote calling (e.g., calling services on other remote systems) or through workload management (WLM).

When EJBs are participating in WLM, their reference is maintained in the global JNDI tree. Therefore, when a client Java object makes a lookup to obtain the reference to a specific EJB, that client may be calling an EJB on a remote server rather than the local one. This is configurable in how you set up local versus remote calls, and the setup is somewhat similar to that of the Web container.
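As a hedged illustration of such a lookup (the JNDI name and the CustomerHome interface are hypothetical, and the exact binding depends on how the EJB is deployed in your cell), an EJB 2.x client typically does something like the following, with WLM deciding which clustered container actually services the resulting calls.

import javax.ejb.EJBHome;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import javax.rmi.PortableRemoteObject;

// Hypothetical remote home interface for a Customer EJB.
interface CustomerHome extends EJBHome {
}

public class CustomerLookup {

    public CustomerHome lookupCustomerHome() throws NamingException {
        // With WLM-enabled EJBs, the reference obtained here may resolve to an
        // EJB container on a remote application server rather than the local one.
        Context ctx = new InitialContext();
        Object ref = ctx.lookup("ejb/com/example/CustomerHome"); // hypothetical JNDI name
        return (CustomerHome) PortableRemoteObject.narrow(ref, CustomerHome.class);
    }
}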

Given the nature of EJBs, there are, however, many more options available for how you may want to distribute their services. This provides a high degree of control of your EJB workload at a very granular level. The different levels of WLM availability that you can configure your EJBs with are as follows:

  • Random selection

  • Round-robin

  • Random prefer local

  • Round-robin prefer local

It's important to note that these WLM settings are overridden by EJB server affinity settings, which are as follows:

  • Process affinity: All requests from a Web container Java object are serviced by the EJB container operating in the same application server. Requests are never routed outside the application server.

  • Transaction affinity: All requests from a Web container-based Java object, for example, are serviced by the referenced JNDI EJB context. All subsequent requests in the same client transaction are bound to that EJB container.

I discuss these settings in more detail in the section titled "EJB Container Failover and High Availability."

Database Server Tier

As you've seen, the database server is an immensely important component of most WebSphere application server topologies. Because of the database's importance, a number of clustering and failover configurations are available for the database server tier.

Here's an overview of the database availability options, which I discuss in more detail later in the chapter:

  • Database server clustering (active-active or active-standby)

  • Stand-alone database instances (no replication)

  • Database replication

  • Hot-standby databases (nonclustered)

Remember, the key to database availability is to first prevent the database "services" from becoming unavailable at all, through an active-active cluster or a highly distributed database environment (e.g., hot replication to many nodes), and failing that, to minimize failover times. There's no use in having a hot-standby database or an active-standby database topology if your failover time exceeds your thread pool or JDBC connection timeout values!
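In WebSphere itself, those timeouts are normally set on the data source and connection pool through the administrative tools; the point is simply that the client-side wait must outlast the cluster's failover window. As a rough, hypothetical illustration of that relationship (not WebSphere's pooling implementation), the following sketch retries a connection request for a bounded period so that a database failover shorter than that period never surfaces to the caller.

import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

public class ResilientConnector {

    // Keep retrying for up to maxWaitMs; a database failover that completes
    // within this window never surfaces as an error to the application.
    public Connection getConnectionWithRetry(DataSource ds, long maxWaitMs)
            throws SQLException, InterruptedException {
        long deadline = System.currentTimeMillis() + maxWaitMs;
        SQLException lastFailure = null;
        while (System.currentTimeMillis() < deadline) {
            try {
                return ds.getConnection();
            } catch (SQLException e) {
                lastFailure = e;       // database may be mid-failover; wait and retry
                Thread.sleep(2000);
            }
        }
        throw lastFailure != null ? lastFailure
                : new SQLException("No connection within " + maxWaitMs + " ms");
    }
}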

Administrative Functions

Administrative functionality is more prevalent in WebSphere version 4 because of the Administration Repository database. There are capabilities within the administrative services that act similarly to those of the standard EJB WLM functions. For the WebSphere version 5 platform, distributing the Deployment Manager functionality ensures that master-update and replication functions for cell configuration are available at all times.

It's important to note that stand-alone Java applications may not be privy to the WebSphere administration services. In this case, it's important that, for bootstrapping, your developers provide a full list (such as in a properties file) of all WebSphere administration nodes. This gives the Java client a failsafe mechanism: it can rotate through the different WebSphere application servers until it finds one that's available, as sketched below.
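A hedged sketch of that rotation follows. The properties file name, the bootstrap.hosts property, the host list, and the port are hypothetical; the initial context factory class shown is the one typically used by WebSphere naming clients, but verify the exact class name and bootstrap port against your WebSphere version.

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Hashtable;
import java.util.Properties;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

public class BootstrapLocator {

    // Reads bootstrap.hosts=server1:2809,server2:2809 from a properties file
    // (file name and property key are hypothetical) and tries each node in turn.
    public InitialContext connect(String propertiesFile) throws IOException {
        Properties props = new Properties();
        props.load(new FileInputStream(propertiesFile));
        String[] hosts = props.getProperty("bootstrap.hosts").split(",");

        for (String host : hosts) {
            Hashtable<String, String> env = new Hashtable<String, String>();
            env.put(Context.INITIAL_CONTEXT_FACTORY,
                    "com.ibm.websphere.naming.WsnInitialContextFactory");
            env.put(Context.PROVIDER_URL, "iiop://" + host.trim());
            try {
                return new InitialContext(env);   // first responsive node wins
            } catch (NamingException e) {
                // Node unavailable; fall through and try the next one.
            }
        }
        throw new IOException("No WebSphere bootstrap node available");
    }
}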

Consider using Java Application Clients (JACs). These components, although similar to stand-alone Java clients, provide the ability to hook into the WebSphere core services from outside WebSphere itself (i.e., not within a standard application server container). I discuss the implementation of JACs and Java clients and how to maximize their availability later in this chapter.

In any case, failover support for JACs, administration services, and Java clients is all provided within the bounds of the WebSphere application server architecture.

Network Components

Without the network, your WebSphere platform is broken. The network infrastructure is essential to the operational capabilities of your WebSphere environment. It's therefore imperative that you use redundant network paths and network infrastructure for critical WebSphere application environments.

Always use at least two paths between your WebSphere application servers to other tiers, and consider segmenting your networks to help prevent DoS attacks. The preferable approach is to have WebSphere and database servers communicate with one another via a private network, different from that of the "customer traffic" network. This will help prevent bottlenecks and impacts when your applications are moving large amounts of data around the network, in between nodes.

I cover network availability in more detail later in this chapter.

Recap of Topologies Suitable for High Availability

In this section, you examined a number of features that WebSphere supports to aid in building highly available and redundant application server environments. Two key components of a WebSphere application server are the EJB container and the Web container. To obtain or promote high availability for your applications, these two components need to be clusterable.

As I discussed, WebSphere clustering through server groups, domains, cells, and other features allows for increased availability of your WebSphere services. Essentially, this all boils down to being able to support both vertical and horizontal clustering and load distribution, or WLM.

Using vertical clustering capabilities such as server groups, clones, or multiple application servers (depending on your version of WebSphere), you can ensure that your application environment is insulated from the single-JVM performance issues or memory constraints that might otherwise cause poor performance or outages in a single JVM or application server environment.

Horizontal clustering will ensure that a single server outage won't cause a complete failure of your application environment, by having multiple physical WebSphere servers operating redundant instances of your application JVMs.

The (near) ultimate in high availability assurance is to use both horizontal and vertical clustering. You can further extend this, as I discussed in Chapter 5, by splitting your environment into separate administrative domains or cells (WebSphere version 4 and version 5, respectively) to ensure that there isn't cross-contamination of incorrect data or cascade corruption of configuration or JNDI tree data.



