EJB Container Failover and High Availability


The EJB container and the Web container are arguably the two most important application components of the WebSphere environment. In this section, you're going to look at promoting high availability and failover capabilities within your application server's EJB container.

EJB Container Overview

The EJB container sits within the same JVM space as the Web container within your WebSphere application server. The EJB container is responsible for managing the runtime and operational life cycle of your application's EJB components. The way in which you ensure that the EJB container is highly available and resilient in failover isn't too different from the approach for the Web container; in fact, many of the settings and issues are the same, simply transposed to an EJB world.

Like many components in WebSphere version 5, the version 5 EJB container differs in many ways from its version 4 counterpart. Both versions 4 and 5 provide a location service daemon (LSD) that adds a level of indirection to EJB home lookups. The LSD is used for non-WebSphere-based object request brokers (ORBs) as well as other proprietary interfaces, and it can also be used directly by other WebSphere components, in which case the ORB doesn't have the WLM flags set for the period of the transaction.

Workload Management

Workload management of EJBs is an inherent feature of WebSphere's EJB container. Since version 4 of WebSphere, no application changes have been required to take advantage of EJB WLM.

The way this works is that the ORB that provides object brokerage services in the container is configured with a WLM plug-in specific to the EJB container. When clients perform lookups to obtain references to EJBs in the EJB container, the ORB identifies, via a WLM flag, that a particular EJB is part of a WLM server group. If a subsequent request for that EJB can't be serviced because the initial EJB container has become unavailable, the ORB obtains a reference to an EJB in another clone via a list it maintains. The ORB can also load balance, or round-robin, requests between the other available cloned EJBs.
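To ground this, the following is a minimal sketch of the kind of lookup an EJB client performs. The JNDI name used here is hypothetical, and in practice you'd narrow to your bean's own home interface; the WLM routing described above happens transparently beneath these calls.

    import javax.ejb.EJBHome;
    import javax.naming.Context;
    import javax.naming.InitialContext;
    import javax.rmi.PortableRemoteObject;

    public class LookupSketch {
        public static void main(String[] args) throws Exception {
            // In a WLM-enabled cell, the ORB behind this context is aware
            // of the available EJB container clones.
            Context ctx = new InitialContext();

            // Hypothetical JNDI name; the WLM plug-in in the ORB decides
            // which container clone actually services the reference.
            Object ref = ctx.lookup("ejb/CustomerHome");
            EJBHome home = (EJBHome) PortableRemoteObject.narrow(ref, EJBHome.class);

            // From here, create() and business-method calls are routed
            // (and, on failure, rerouted) by the WLM-aware ORB.
        }
    }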

Just like with the Web container, you can implement different routing methods for the EJB container to better support your type of WebSphere environment. The four primary forms of EJB request routing are as follows:

  • Weighted round-robin (default)

  • Prefer local

  • In process

  • Affinity based

Let's look at each in more detail.

Weighted Round-Robin

The weighted round-robin routing algorithm used within WebSphere version 5's EJB WLM routing is quite similar to the weighted round-robin method used by the Web container and the plug-in during Web-based WLM.

In essence, weighted round-robin WLM allows you to set routing weights on requests to EJBs in both local and remote EJB containers. For example, say you have two servers, each with two EJB containers, all operating in a WLM EJB container cluster. The server B EJB containers are configured with weighted round-robin values of 5, while server A is configured, via the ORB configuration and properties settings, with a value of 10. Because server A carries the heavier weighting, requests favor sourcing EJB objects from the local server rather than going out on the wire to access EJB objects on server B.

Prefer Local

The prefer local algorithm allows your EJB clients to access, as a preference, WLM EJB services local to the node the client is operating from. If there are no EJB resources available in any local server groups on the local node, the EJB client, via the WLM-aware ORB, will attempt to gain access to EJB resources on remote nodes participating in WLM.

The weighted round-robin algorithm is used to select desired EJB clones on the local server if they're available. As with most of the EJB and Web container routing algorithms, the specific EJB clones participating as part of a server group still decrement their weighting until they reach 0, at which point they reset to their initial weighting.
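The decrement-and-reset behavior is easy to picture in code. The following is an illustrative sketch only, not WebSphere's actual WLM plug-in (which is internal to the ORB): weights count down per request and reset once every clone reaches zero.

    import java.util.LinkedHashMap;
    import java.util.Map;

    /** Illustrative weighted round-robin selector; a sketch of the
     *  decrement-and-reset behavior, not WebSphere's implementation. */
    class WeightedSelector {
        private final Map<String, Integer> configured = new LinkedHashMap<String, Integer>();
        private final Map<String, Integer> remaining = new LinkedHashMap<String, Integer>();

        void addClone(String name, int weight) {
            if (weight <= 0) throw new IllegalArgumentException("weight must be > 0");
            configured.put(name, weight);
            remaining.put(name, weight);
        }

        /** Pick the clone with the most remaining weight and decrement it. */
        String next() {
            if (configured.isEmpty()) throw new IllegalStateException("no clones");
            String best = null;
            int bestWeight = 0;
            for (Map.Entry<String, Integer> e : remaining.entrySet()) {
                if (e.getValue() > bestWeight) {
                    best = e.getKey();
                    bestWeight = e.getValue();
                }
            }
            if (best == null) {               // every clone has reached 0:
                remaining.putAll(configured); // reset to the initial weights
                return next();
            }
            remaining.put(best, bestWeight - 1);
            return best;
        }
    }

With clones weighted 10 (local) and 5 (remote), roughly two out of every three selections land on the local clone before the weights reset.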

In Process

In-process routing works by ensuring that the request for an EJB is routed to an in-process EJB clone or an EJB clone local to the application server.

This is the least resource-intensive algorithm of all the options available, because requests from client to EJB stay within the same application server space. Therefore, no network or RMI calls are required to remote server groups or other JVMs.

If the EJB clone is down within the specific application server, weighted round-robin is used and the WLM-aware ORB is queried for the next best weighted EJB container clone.

Affinity Based

There are two types of affinity-based routing within EJB-based server groups. Essentially, affinity routing, as its name implies, allows you to route based on one of two affinity strengths.

The first affinity type is known as applied affinity. This affinity type is typically used for EJB clients requesting IORs via the LSD. During the course of the transaction between the EJB client and the EJB itself, the linkage and routing between the two is maintained: all subsequent requests to the EJB follow the routing defined when the transaction was initially set up. This model uses the weighted set approach, and each request decrements the weighted value by 1 until it reaches 0. However, because the idea behind this routing algorithm is to maintain the linkage between the EJB client and the chosen EJB clone, the EJB client will continue to direct queries to that EJB clone even while it has a weighted value of 0.

The second type of affinity-based routing is called transactional affinity. This form of routing follows container transactions: all requests for EJB methods and interfaces are pinned to the EJB clone chosen at the beginning of the transaction until the transaction completes. You need to be cognizant of transactional affinity and be sure that your application code has been developed with the ability to "clean up" after a failure in communicating with a remote EJB clone.

EJB Container Failover

Like Web container failover, when we talk about container failover scenarios, we're not necessarily talking about the failover technique of the container itself. Instead, we (and WebSphere) focus on how to get the next client request to the next working or available container instance, fast. As you'd expect, WebSphere versions 4 and 5 operate slightly differently when it comes to EJB container failover.

WebSphere version 4 works simply by having the client ORB maintain a list of workload-managed EJB homes across multiple WebSphere application server clones. In a failover scenario, the Java client experiences a timeout accessing the remote EJB and reattempts the context lookup. With the MaxCommFailures setting at its default of 0 (meaning that as soon as there is any communication failure, an exception is thrown to the ORB), the ORB reroutes the EJB request to the next available EJB container clone. Figure 7-13 shows how the WebSphere version 4 model works.

Figure 7-13: EJB container failover in WebSphere version 4

If the ORB goes through all available EJB container clones (clone 1 and clone 2 in Figure 7-13) in the list and they're all marked unavailable, the ORB will rerequest an updated list of available EJB container clones from the administration server. This may loop until a clone becomes available, or until your EJB clients and ORB are presented with a CORBA.NO_IMPLEMENT exception. I discuss this further shortly.
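To illustrate the client side of this, here's a minimal, hedged sketch of a version 4-style Java client that reattempts the lookup when communication fails. The JNDI name and retry count are illustrative; the actual rerouting between clones is done by the ORB itself.

    import javax.naming.Context;
    import javax.naming.InitialContext;
    import javax.naming.NamingException;

    public class RetryLookupSketch {
        /** Reattempt a JNDI lookup a few times before giving up. */
        static Object lookupWithRetry(String jndiName, int attempts) throws NamingException {
            if (attempts < 1) throw new IllegalArgumentException("attempts must be >= 1");
            NamingException last = null;
            for (int i = 0; i < attempts; i++) {
                try {
                    Context ctx = new InitialContext();
                    return ctx.lookup(jndiName);
                } catch (NamingException e) {
                    // Typically wraps the underlying CORBA communication
                    // failure; on retry, the WLM-aware ORB works through
                    // its list of remaining EJB container clones.
                    last = e;
                }
            }
            throw last;
        }
    }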

In the case of WebSphere version 5, the failover scenario and the events that follow it are a little different. There are two possible courses of action: one applies if you're running native WebSphere-based application code under a WebSphere application server (as opposed to using legacy CORBA code), and the other if you're running native WebSphere-based application code via any other non-IBM EJB implementation. If, however, you're using non-WebSphere applications (such as CORBA or non-IBM J2EE implementations) and you need to access local WebSphere-based EJBs, WebSphere version 5 has support for the LSD.

The LSD is kind of like a route map that maintains information about the specific application servers exposing EJB containers (and EJB homes). With the LSD approach, clients (Java, C++, or otherwise) obtain an IOR to the LSD as opposed to the specific EJB that they're after. The LSD then provides an IOR to the best available EJB container clone for the client request. This sort of connection is deemed non-server-group-aware and, as such, the request isn't directly aware of load balancing and WebSphere-specific WLM policies active within the containers. Because non-WLM-aware clients have only a limited ability to participate automatically in load balancing, there are limitations in what is available in terms of failover for your application code, although it's still somewhat seamless!

For clients that are not WLM- or server group-aware, if a request is made to an already obtained EJB IOR reference but that EJB container or application server has failed or crashed, the client receives an error (a commFailure) and goes back to the LSD to rerequest another active EJB container clone to interface with.

If the client is said to be server group-aware, it's more than likely a WebSphere-based application client. In this case, the client may still go via the LSD to obtain the initial IOR. Within the return response, the IOR of the first available EJB container clone is provided to the client. If the client is, in fact, WLM-aware, it now has a list of available server groups and EJB container clones to communicate with directly, and subsequent calls are made using the list of available server groups and EJB container clones passed to it by the LSD.

Handling Failovers

When errors occur with your EJB communications, your EJB application components will react in much the same way as the plug-in does when various events occur. To be more specific, Table 7-1 explores five different scenarios within a WebSphere environment and labels their "response time to problem." This is an important table for both Web components and EJB components.

EJB components follow the same rules of time-outs and delays when it comes to failures with the server groups and/or servers. Essentially, there are five key messages that will be thrown to your EJB clients:

  • CORBA.COMM_FAILURE

  • CORBA.NO_RESPONSE

  • CORBA.COMPLETE_STATUS

  • CORBA.INTERNAL

  • CORBA.NO_IMPLEMENT

Those familiar with CORBA will understand these messages well; essentially, they're the basic communications status messages that the ORB and EJB clients handle. These are also the hooks that your EJB clients use to fail over, change server clones, or retry during the course of the failures depicted in Table 7-1.

COMM_FAILURE is just that: a CORBA communications failure of some sort. If your EJB client is making an initial request to the ORB or LSD to obtain an IOR for an EJB home and is thrown a CORBA.COMM_FAILURE exception, the WebSphere ORB will determine what state the transaction is in. From this, the ORB will determine that it is an initial lookup request and simply resend the request to the LSD to locate another server clone to direct the EJB client request to.

If the exception is thrown after the initial context has been found, the WLM components of the ORB will handle the CORBA.COMM_FAILURE exception. The ORB will then mark the problem EJB container clone as bad and resend the request to another operating EJB container or server group (based on your routing algorithms).

NO_RESPONSE is an exception typically thrown when a server is down or something is internally wrong with the remote ORB (wherever, or from whatever vendor, that ORB may be). The handling of a NO_RESPONSE exception is much the same as for COMM_FAILURE, and the same process of redirection to another working EJB container clone will be the resulting action by the ORB.

COMPLETE_STATUS is a secondary exception message that accompanies a COMM_FAILURE or NO_RESPONSE exception thrown to the EJB client and/or WebSphere ORB. There are three basic state messages that will accompany the COMPLETE_STATUS exception:

  • COMPLETED_YES

  • COMPLETED_NO

  • COMPLETED_MAYBE

If a COMM_FAILURE message is received by the EJB client and/or ORB after an initial context lookup has been made (and the active list of server clones is present with the EJB client), and COMPLETE_STATUS is COMPLETED_YES , then everything continues on. In this case, there was some form of error somewhere in the transaction pipeline, but the transaction completed successfully.

If COMPLETE_STATUS is COMPLETED_NO , then the request is rerequested via another available EJB container clone.

If COMPLETE_STATUS is COMPLETED_MAYBE, then nothing automatic can occur. The EJB client or client code must handle this exception itself and include logic to restart the request from the beginning of the transaction.

In the COMPLETED_MAYBE state, the CORBA ORB isn't sure whether the transaction completed cleanly. Therefore, it doesn't deem it safe to continue on with the transaction (consider double payments as a reason this makes good sense).
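In Java client code, this status travels on the CORBA system exception itself, in its completed field. Here's a minimal sketch of handling it; the doWork callback and the single blind retry for COMPLETED_NO are illustrative choices, not prescribed WebSphere behavior.

    import org.omg.CORBA.CompletionStatus;
    import org.omg.CORBA.SystemException;

    public class CompletionSketch {
        static void runTransaction(Runnable doWork) {
            try {
                doWork.run();  // invoke the remote EJB(s)
            } catch (SystemException se) {
                if (se.completed == CompletionStatus.COMPLETED_YES) {
                    // The work finished despite the error; continue on.
                } else if (se.completed == CompletionStatus.COMPLETED_NO) {
                    // Safe to resend; the ORB/WLM layer picks another clone.
                    doWork.run();
                } else {
                    // COMPLETED_MAYBE: ambiguous outcome (think double
                    // payments). Only application logic can decide how to
                    // recover, so escalate rather than retry blindly.
                    throw se;
                }
            }
        }
    }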

The INTERNAL exception is thrown when none of the remote ORBs are available. In this scenario, the total environment is down and a full restart is required.

The final exception type is the NO_IMPLEMENT exception. This exception is thrown by the ORB when all attempts to make requests to the EJB container clones in the once-accurate server group list have failed because every clone is unavailable. The ORB can be tuned for this scenario using the RequestRetriesCount and RequestRetriesDelay values.

There are no golden rules for these settings, but you should set them so they aren't so low that your environment doesn't retry enough times to ride over transient problems, yet not so high that the components retry continually.
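As a hedged sketch, values like these are typically supplied as ORB system properties on the client JVM. The com.ibm.CORBA prefix is the conventional one for WebSphere ORB properties, but treat the exact keys and the values shown as assumptions to verify against your WebSphere release's documentation.

    public class OrbTuningSketch {
        public static void main(String[] args) {
            // Illustrative values: retry up to 5 times, 1000 ms apart.
            // Property keys assume the com.ibm.CORBA prefix; confirm them
            // for your WebSphere version before relying on this.
            System.setProperty("com.ibm.CORBA.RequestRetriesCount", "5");
            System.setProperty("com.ibm.CORBA.RequestRetriesDelay", "1000");
            // ... then create the ORB/InitialContext as usual.
        }
    }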

At some point, your environment may be simply broken and will require a restart. I've discussed ways to mitigate these single points of failure in the environment through split WebSphere domains and cells, multiple servers, distributed environments, and so forth. Using these design considerations, you need to model the optimum settings that suit your environment.

Workload Management and EJBs

This chapter has mainly focused on the ORB aspects of EJB container failover and support. Now you'll go over some key considerations that need to be understood with regard to EJB clients. In this section, you'll look at how failover affects the different forms of EJB caching.

Note  

You'll explore more aspects of EJB performance and availability best practice development in Chapter 10.

Chapter 9 looks at ways that EJBs can be cached by the EJB container. Caching can greatly improve performance by using design principles, such as EJB pools, that provide a construct for EJBs to stay activated until they're called on by EJB clients rather than starting up and shutting down at the beginning and end of each transaction.

Without wanting to repeat Chapter 9's discussion, there are essentially three states your EJB caching settings can be in, referred to as caching options A, B, and C. If you don't want to use any specific EJB caching, option C is the default setting and is automatically enabled when you deploy your EJBs via the Application Assembly Tool (AAT).

Note  

I discuss in Chapter 9 that once you've set up the EJB caching in AAT prior to deploying your EJB-JARs, you can modify and change the EJB caching settings in the IBM extensions deployment descriptor.

To recap how EJBs are cached, how they operate, and how they fit into a failover scenario, consider the fact that during a single transaction, WLM ensures that the client is directed to the EJB container that was involved at the beginning of the transaction. If your EJB is an entity EJB maintaining a persistence layer between your application logic and the database, you want to be sure that entity EJBs are managed correctly in the event of a failover situation.

However, once the transaction is completed, and potentially a new one commences, the request can go to any available WLM-based EJB container clone. In this case, the entity EJB can be reloaded at will and no harm is caused to data integrity or your persistence layer. Specific to entity EJBs, as long as you use EJB caching option C or B, then entity EJBs can happily (and safely) participate in WLM.

Option A EJB caching is what I call a full cache. Once the EJB is loaded (via ejbLoad()), it stays loaded the whole time. This means that other EJB clients can access the EJB construct and break the model. In Chapter 9, I discuss the fact that option A caching isn't recommended in a clustered environment, because WebSphere requires that no other changes to the persistence layer (the database) take place during a transaction unless they're managed via a single EJB container.

Therefore, EJB caching for entity EJBs can only be done via option B and option C caching. That said, it's possible to workload-manage just the home interfaces of entity EJBs under option A. This provides a middle ground for WLM of entity EJBs and provides some limited support for failover. Note that under option A the entity EJB itself isn't clusterable, just the home interface.

When you consider stateful and stateless EJBs, stateful EJBs cause headaches when you try to give them the ability to fail over. WebSphere doesn't support this, although I've seen cases in which sites have tried to implement stateful EJB failover capabilities. The rule here is to use stateless EJBs as much as you can. Due to their nature, it's easy to have them fail over to other servers and clones without having to worry about maintaining transient state.
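For reference, here's a minimal EJB 2.x-style stateless session bean skeleton (the class and method names are illustrative). Because it keeps no conversational state between calls, any clone in the cluster can service any invocation, which is exactly what makes failover painless.

    import javax.ejb.SessionBean;
    import javax.ejb.SessionContext;

    /** Illustrative stateless session bean: no fields carry per-client
     *  state between calls, so any clone can serve any request. */
    public class QuoteServiceBean implements SessionBean {
        public double quote(String symbol) {
            // Computed purely from the request (and back-end data),
            // never from earlier calls by the same client.
            return symbol.length() * 10.0;  // placeholder logic
        }

        // Standard EJB 2.x life-cycle callbacks.
        public void ejbCreate() {}
        public void ejbRemove() {}
        public void ejbActivate() {}
        public void ejbPassivate() {}
        public void setSessionContext(SessionContext ctx) {}
    }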

Table 7-2 summarizes the EJBs available in WebSphere versions 4 and 5, and notes which ones are clusterable and can partake in WLM environments.

Table 7-2: Summary of Clusterable EJBs in WebSphere 4 and 5

EJB                    Type                                  WebSphere 4   WebSphere 5   Clusterable
Session bean           Stateless                             Y             Y             Y
Session bean           Stateful                              Y             Y             N
Session bean           Home interface                        Y             Y             Y
Entity bean            Container managed (option A)          Y             Y             N
Entity bean            Bean managed (option A)               Y             Y             N
Entity bean            Home interface (option A)             Y             Y             Y
Entity bean            Home interface (options B and C)      Y             Y             Y
Entity bean            Container managed (options B and C)   Y             Y             Y
Entity bean            Bean managed (options B and C)        Y             Y             Y
Message-driven beans   -                                     N             Y             Y

As you can see from Table 7-2, despite the inherent complexity of EJBs, there is a fair degree of cluster support for them in WebSphere. The notable exceptions are stateful session beans and the full failover of entity EJBs under option A. With option A, you wouldn't want clustering anyway, because it doesn't properly support clustering due to restrictions on the data integrity of the persistence layer mechanism.

Option A, as discussed in Chapter 9, is one I recommend everyone stay away from. Option B is the number one choice for performance of EJB components. More on that in Chapter 10.



