Deployment Strategies


In Chapter 12, we discussed performance tuning and testing strategies. In this section, we will examine a number of deployment strategies you can use to meet your requirement for a secure, around-the-clock accessible, high-performance, reliable system in the presence of unpredictable usage and changing market conditions. We will focus primarily on two areas: selecting the number and size of machines for running the application server and designing your WebLogic Server clusters to meet your availability requirements.

When deploying highly available, high-performance systems, we recommend that you follow the guidelines shown here to allow your system to adapt to the ever-changing needs and complexities of enterprise computing:

  • Choose solutions that are highly available and manageable.

  • Choose systems that offer performance regardless of load and can scale to meet new requirements.

  • Make sure that data is available and protected from corruption.

  • Look at availability from the user's perspective. Understand that data is only one component of availability and all layers of your system must be available and resilient to failures. In most enterprise systems, achieving high availability will mandate providing redundancy at all layers of the system to avoid single points of failure.

By combining these guidelines with the system's business and technical requirements, you can deploy a system that meets your current and future requirements. In the sections that follow, we will discuss best practices for selecting and designing a robust deployment environment using these guidelines. Before we jump into the strategies, let's think about how to evaluate the different strategies to come to some conclusion about what works best for your particular situation.

Evaluating Deployment Strategies

As with most architectural decisions, the selection of an appropriate production deployment environment involves trade-offs. Business and technical requirements must be understood in order to select the appropriate deployment environment. When trying to determine the appropriate deployment strategy, we recommend the following steps:

  1. Map the business requirements into a technical architecture that allows the system to meet these requirements.

  2. Using this technical architecture and the application s additional technical requirements, develop the criteria that your deployment architecture must meet.

  3. Assemble a cross-functional team to explore the wide range of possible deployment architectures and narrow them down to a few that best meet the deployment architecture criteria.

  4. Wherever possible, reuse existing deployment architectures, or pieces of them, to jump-start your selection criteria.

  5. Use Proof of Concept evaluations to verify that the deployment architecture you've selected can meet the most difficult business and technical requirements.

First and foremost, the deployment architecture must meet the requirements of the business now and in the future. Once you understand the business requirements, you can map them to the technical architecture required to support the business. For example, you may have a business requirement to provide 99.5 percent availability, where failing to meet this level would violate the service-level agreements (SLAs) imposed on the system. This business requirement maps directly to a technical requirement for high availability that requires software, hardware, and network redundancy, as well as failover capabilities.
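
To make a requirement like this concrete, it helps to translate the availability percentage into permitted downtime. A quick back-of-the-envelope calculation for the 99.5 percent example (the arithmetic is ours, not part of any particular SLA):

    \text{allowed downtime per year} = (1 - 0.995) \times 365 \times 24\ \text{hours} \approx 43.8\ \text{hours}

or roughly 3.6 hours per month. Whether that budget must cover planned maintenance as well as unplanned outages is itself a business decision worth settling before you design the deployment architecture.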

The application itself will have additional technical requirements defined by application user groups, operations, security, and any other group that interacts with or supports the system. By combining all of these requirements, you can develop criteria for the deployment architecture and apply weights to these criteria depending on the importance to your business. Common criteria include performance, manageability, scalability, flexibility, cost, security, administrative complexity, and maintenance. You can evaluate candidate deployment strategies developed in the next step against this weighted matrix of criteria to determine their appropriateness for your business and application.
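
One simple way to formalize this weighted matrix is as a weighted sum per candidate architecture (the notation here is ours; the actual weights and scores come from your requirements):

    S_c = \sum_i w_i \, s_{c,i}

where w_i is the weight assigned to criterion i (performance, cost, security, and so on) and s_{c,i} is candidate c's score against that criterion. The candidate with the highest S_c is the strongest fit on paper, subject to the Proof of Concept validation discussed next.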

Next, work with a group of interdisciplinary architects or technical personnel to select a few candidate deployment architectures that are likely to meet the business and technical requirements. Depending on the complexity of the requirements, scope of the deployment, and the group's experience with similar systems, you can evaluate each candidate either on paper or by doing a Proof of Concept (POC). One common practice is to select the best paper option and then use a POC to prove that the chosen architecture meets your requirements. It may sometimes be possible to combine this effort with preproduction functionality and performance testing of the application.

Your job is easier if an enterprise deployment environment is already in place and available for testing, requiring only a validation that the existing environment can meet the demands of the new system. Often this approach will not be appropriate because hardware, monitoring, and failover solutions are either not in place or have not yet been chosen. In this case, you should identify and develop an end-to-end slice (or portion) of the application that touches every layer of the system to use in testing the various candidate deployment architectures. This slice of the application should include the most challenging parts of the application and test the most challenging and/or strict operational requirements. You can then compare these results with the requirements selection matrix.

By performing POC tests and mapping the results against the requirements matrix, you can choose the best deployment architecture with a high level of certainty that it will meet all requirements. Unfortunately, it is not always possible to run your system in the best deployment environment. In many cases, you will have to make trade-offs. For example, you may have to deploy a more tightly coupled system than you would like in order to meet your users' performance requirements. Or you may not be able to use the best availability strategy due to cost considerations. These decisions and trade-offs are best made once you clearly understand the requirements of your business and you are able to differentiate these from other selection criteria.

Best Practice  

Evaluate deployment strategies by identifying and prioritizing business and technical requirements for the system, then mapping these requirements against candidate deployment strategies. Use Proof of Concept tests to validate new or unproven designs.

Now, let's look at a number of key strategies that you should consider when designing and selecting the best, or at least the appropriate, deployment architecture for your application.

Server Deployment Strategies

The first deployment strategies to consider are the size and type of server hardware to use in your environment, as well as the way to deploy your WebLogic Server applications on this hardware.

Determining the JVM to Processor Ratio

One of the most frequently asked questions is how many instances of WebLogic Server to run on a particular piece of hardware; the next is whether it is better to use a few large SMP machines or a larger number of smaller machines. In an ideal world, applications would scale linearly as you add CPUs to the machine so that a single JVM would use all available CPUs and provide maximum performance. Unfortunately, in the real world, many factors can contribute to the nonlinear scalability of a Java application, including things such as I/O bottlenecks, garbage collection, cache-memory latency, and thread synchronization.

Garbage collection is of particular interest because it can have a dramatic effect on the application. Many older JVM implementations do not support parallel or concurrent garbage collection, so the negative effect of garbage collection on performance grows considerably as we add CPUs on multiprocessor servers. For example, an application that spends 10 percent of its time performing single-threaded, stop-the-world garbage collection will lose 75 percent of its throughput on a 32-processor machine, according to testing performed by Sun Microsystems (http://java.sun.com/docs/hotspot/gc/). Even on a smaller machine with only five CPUs, this same application will lose approximately 20 percent of its throughput.
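
The arithmetic behind these numbers is essentially Amdahl's law. If a fraction f of the work is serialized (here, single-threaded garbage collection), then the fraction of ideal, linear throughput achievable on N processors is bounded by

    \frac{S(N)}{N} = \frac{1}{N f + (1 - f)}

With f = 0.10 and N = 32, this gives 1 / 4.1, or about 24 percent of the machine's potential throughput, which matches the roughly 75 percent loss quoted above.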

Most Java 2 SDK 1.4 JVMs have options to enable parallel and/or concurrent garbage collection, though these are not typically the default settings. These options can significantly reduce the effect of garbage collection on JVM scalability across processors. Other factors, though, may still prevent you from achieving the level of scalability you need across a large number of processors.

Determining the ideal JVM-to-CPU ratio for a given application is an iterative process that is ideally done when stress testing the application for acceptance testing or capacity planning. On a multiprocessor machine, you should start by taking all CPUs offline except one and then tune the system until you achieve maximum throughput for that application on one CPU. This testing will provide the throughput information for one CPU to use as a baseline for determining linear scalability. From there, bring another CPU online and repeat the process. Continue this process until you cannot fully utilize the CPUs on the node or the linear scalability falls below an acceptable point. Remember, you will need to make sure that you have sufficient load to drive the number of CPUs available and that you watch for bottlenecks in other parts of the system. The goal is to determine the optimal number of CPUs for a single WebLogic Server instance. If you fail to achieve acceptable scalability during this testing, explore the possibility of running multiple instances of WebLogic Server on a machine.
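
One way to quantify the acceptable point is to track a scaling-efficiency ratio as you bring CPUs online (the 0.8 threshold below is only an illustrative cutoff, not a rule):

    E(n) = \frac{T(n)}{n \times T(1)}

where T(n) is the measured throughput with n CPUs online and T(1) is your single-CPU baseline. E(n) = 1.0 is perfectly linear scaling; if E(n) drops below, say, 0.8 at some CPU count, that count is a reasonable candidate for the JVM-to-CPU ratio, and additional CPUs are usually better served by another WebLogic Server instance.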

Vertical Scaling

Scaling an application by simply adding more CPUs to a machine is often referred to as vertical scaling. Application server vendors have borrowed this term and have expanded it to include scaling an application by adding both processors and application server instances to a machine. A WebLogic Server instance consists of an application server running in its own Java Virtual Machine (JVM), so vertical scaling also implies multiple JVMs on the same machine. Vertical scaling can lead to better utilization of the server hardware and increased application throughput. You should balance this increase in utilization and throughput against the added configuration, maintenance, and monitoring overhead associated with running multiple server instances.

Running multiple instances of WebLogic Server on the same machine can also help minimize the effect of nonparallelized, stop-the-world garbage collection. Because multiple JVM instances will typically not all run garbage collection at exactly the same time, you will almost always have at least one JVM available to schedule application-related work on other processors while another JVM performs its single-threaded, stop-the-world garbage collection. We recommend using the parallel and/or concurrent garbage collectors available with the new Sun HotSpot 1.4.1 and BEA JRockit JVMs.

Best Practice  

Use parallel and/or concurrent garbage collectors to limit the scalability effect that garbage collection has on the application. Even then, you may want to explore the performance benefits of running multiple WebLogic Server instances on larger SMP machines. Before formalizing multiple instances per machine as your deployment strategy, make sure that you understand the effect this will have on configuration, maintenance, and monitoring so that you can make an informed decision.

Horizontal Scaling

Scaling an application by simply adding more machines to your environment is often referred to as horizontal scaling. Typically, horizontal scaling is more specifically associated with the practice of employing multiple, relatively small server machines (generally four CPUs or fewer) in a production environment. In this scenario, each machine usually hosts a single instance of WebLogic Server and the application itself. Through the use of WebLogic Server clustering and/or external load balancers, this approach allows WebLogic Server-based applications to span several machines yet still present a single system view to the end users. You can use this strategy not only to increase scalability but also to improve the failover characteristics of your application. It also provides you with the flexibility of adding more machines on demand to handle increasing throughput requirements.

In many cases, you may want to combine horizontal and vertical scaling techniques to use multiple machines, each running multiple instances of WebLogic Server. This can make it easier to achieve both high CPU utilization and good failover and flexibility characteristics.

Best Practice  

Horizontal scaling gives you some failover and flexibility that you normally cannot get with only vertical scaling. Depending on your hardware, you might also want to consider combining the two techniques to increase CPU utilization. Whether this makes sense will depend on your hardware, application, and JVM.

Now, let's move on to look at single-site deployment strategies.

Single-Site Deployment Strategies

The next set of strategies relates to the deployment of scalable and highly available systems in a single site or data center. We will concentrate on clusters that reside in the application server layer of our system, but we will consider other layers where appropriate.

Two different scenarios will be discussed in the sections that follow. These scenarios reflect different sets of availability requirements:

  • A simple WebLogic Cluster representing basic availability requirements

  • A complex WebLogic Cluster representing more demanding availability requirements

Many other deployment strategies are possible, of course, each offering varying degrees of availability. It is important that you have a firm understanding of your requirements before choosing a strategy from among the many options available. You must also consider your cost structure and current enterprise standards.

For the purpose of this discussion we will make the following assumptions:

  1. Local clusters are defined as a grouping of two or more servers residing in the same site or data center with WebLogic Server acting as the middleware.

  2. Software, hardware, or a combination of software and hardware are utilized to achieve a high level of availability.

  3. The minimum configuration involves at least two instances of WebLogic Server running in a cluster with each instance residing on a different server. This configuration allows protection against failure at both the node level and the WebLogic Server instance level and is a good, basic configuration for discussing local clusters.

  4. Load balancers are used to provide message distribution and failover of requests to the cluster of WebLogic Server instances. In all cases, you could configure Web servers as proxy servers to perform the same functionality.

  5. Components resident in other layers of the system are redundant and provide high availability.

  6. We will discuss only symmetric hardware configurations, which are also called active-active configurations. Asymmetric, or active-passive, configurations are also a viable solution. This type of deployment, though, is not typical at the application server level due to its higher cost caused by the use of passive servers.

Simple WebLogic Server Clusters

First, we will consider a simple WebLogic Server cluster, which provides a basic level of high availability. Figure 14.1 shows a cluster that offers a simple, cost-effective, and highly available deployment architecture.

This type of configuration is commonly used under the following situations:

  • A flexible and cost-effective solution is desirable.

  • There are no disk-sharing requirements across the WebLogic Server cluster.

  • Local data storage does not require high availability.

  • The applications do not use, or participate in, XA transactions because the transaction logs will typically require high availability and failover.

  • The applications do not use file-based JMS persistent messages because the JMS message stores will typically require high availability and failover.

Figure 14.1 depicts an active-active cluster running under normal conditions. Both instances of WebLogic Server are used during normal operation, with load balancing at the connection level provided by load-balancer hardware located between the clients and the WebLogic Server cluster.

Figure 14.1:  Simple cluster before failure.

Figure 14.2 shows the same cluster after a failure of either the WebLogic Server instance or the machine on which it is running.

Figure 14.2:  Simple cluster after failure.

The load balancer is now providing failover at the connection level, while WebLogic Server clustering software provides a single homogenous system view across both WebLogic Server instances. The key features that WebLogic Server clustering provides are these:

  • Failover and load balancing of JNDI, RMI, EJB (stateless session beans, stateful session beans with in-memory state replication, and entity beans), and JMS

  • HttpSession replication

  • Cluster-wide replication of the JNDI naming service

  • Cluster membership discovery and cluster health monitoring

When a failure takes place on either the WebLogic Server instance or the hosting node itself, the load balancer quickly notices that the WebLogic Server listen port is no longer responding to requests. The load balancer then removes that failed server from its list of healthy servers and begins routing all requests to a different, healthy WebLogic Server instance. This failure detection can be achieved in a more intelligent manner by having the load balancer periodically check the health of the WebLogic Server instance using the weblogic.Admin PING command. Once the WebLogic Server instance is back up and listening on the appropriate port, the load balancer will discover this fact and will once again start distributing requests to that WebLogic Server instance. Refer to Chapter 11 for more detailed information on WebLogic Server clustering.
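
To make the connection-level check concrete, the following is a minimal, illustrative sketch of the kind of listen-port probe a load balancer or health-monitoring script performs. This is not BEA-provided code, and the host name, default port, and timeout are assumptions you would replace with your own values; a successful TCP connect only tells you the server socket is accepting connections, whereas the weblogic.Admin PING approach mentioned above verifies that the server can actually service a request.

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    // Minimal listen-port probe: exits 0 if the port accepts a TCP connection,
    // 1 otherwise. Host, port, and timeout values are illustrative assumptions.
    public class ListenPortCheck {
        public static void main(String[] args) {
            String host = (args.length > 0) ? args[0] : "localhost";
            int port = (args.length > 1) ? Integer.parseInt(args[1]) : 7001;
            int timeoutMillis = 2000;

            boolean accepting = false;
            Socket socket = new Socket();
            try {
                // A successful connect means the listen port is responding.
                socket.connect(new InetSocketAddress(host, port), timeoutMillis);
                accepting = true;
            } catch (IOException e) {
                // Connection refused or timed out: treat the instance as unavailable.
                System.err.println("Listen port check failed: " + e.getMessage());
            } finally {
                try {
                    socket.close();
                } catch (IOException ignored) {
                    // Nothing useful to do if close fails.
                }
            }
            System.exit(accepting ? 0 : 1);
        }
    }

A monitoring script could run a class like this (or the weblogic.Admin PING command) on a schedule and act on the exit code.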

This simple clustering configuration is a form of horizontal scaling, discussed earlier, in which additional nodes are added to the environment to increase processing capability. When horizontal scaling is used in conjunction with WebLogic Server clustering, it offers a cost-effective method to achieve both flexibility and availability. Servers can be added to the cluster dynamically; once the new WebLogic Server is added to the distribution list of the load balancer, traffic will begin routing to the new instance. If a server fails, the remaining servers will take over the load for the failed server until it can be restarted, thus allowing better utilization of hardware than an active-passive configuration.

Tip  

A simple WebLogic Server cluster is an appropriate strategy for a single-site installation, providing good scalability and failover characteristics.

Complex WebLogic Server Clusters

The second scenario we will consider has more demanding availability requirements, such as the following:

  • The system must support global transactions between local and distributed resources such as JMS destinations and databases.

  • JMS messages are persisted to the file system and must support failover and be highly available.

  • Failover of both the node and any WebLogic Server instance-specific functionality, such as JMS destinations and JTA transaction recovery, must take place transparently.

  • Distributed transactions must be recoverable and restarted in case of node or WebLogic Server failure.

Figure 14.3 presents one possible solution to these more demanding requirements. This solution would utilize the following components:

  • WebLogic Server instances running in a standard WebLogic Server cluster.

  • Redundant load balancers (not shown in the figure), which provide load balancing and failover of requests at the connection level.

  • WebLogic JMS using multiple distributed destinations, providing high availability to both JMS producers and consumers.

  • Veritas Cluster Server (VCS) or an equivalent product to provide transparent failover across nodes in the hardware cluster. VCS will manage and control both hardware and software resources, bringing resources online and taking them back offline when necessary.

  • Storage area networks (SANs) to provide highly available shared disks for JMS queue storage.

  • Veritas or another highly available file system for high performance and flexible volume management.

Figure 14.3:  Complex cluster before failure.

Should one of the WebLogic Server instances fail, the VCS system will automatically migrate the instance to the other hardware, as depicted in Figure 14.4.

Figure 14.4:  Complex cluster after failure.

As noted previously, Veritas Cluster Servers are being used to monitor and control applications running in the configuration, and these clusters respond to a variety of hardware and software faults. Because VCS will be managing and controlling the WebLogic Server cluster you will need to produce various scripts and determine what type of health monitoring is required. This discussion will concentrate on scripts and monitoring of WebLogic Server only, although VCS is actually monitoring and controlling other resources such as IP addresses, disks, and network-interface cards. See the Veritas Cluster Server documentation for a complete description of these activities. Minimally, you will need to develop the following scripts:

Start scripts.    Scripts that start the administrative server as well as all managed servers running in the WebLogic Server cluster.

Stop scripts.    Scripts used to shut down WebLogic Server administrative and managed servers.

Forced stop scripts.    Scripts that shut down WebLogic Server instances that are not responding to administrative shutdown commands.

Health monitoring scripts.    Scripts used to determine the health of various subsystems in WebLogic Server.

VCS will use an agent to monitor and control the WebLogic Server resources. This agent will start the servers, stop the servers, and fail over the servers after a node failure. You will need to determine the appropriate response when a failure is detected in any monitored resources. We recommend a tiered availability approach concentrated on keeping the active server as available as possible and failing over the cluster only when it cannot be restarted. This approach relies primarily on WebLogic Server's clustering infrastructure and fails over only after a hard failure of a disk, node, or non-redundant device.

You will also need to determine which failures should be handled automatically and which should only be reported so that manual action can be taken. In this scenario, the VCS agent will either perform the appropriate action itself, page the appropriate person, or send the alert to an enterprise monitoring console that will either take action or pass the alert on to the appropriate personnel.

With our example scenario, no JTA or JMS migration is required during failover. The instances running on a failed VCS node will be migrated by VCS and then brought back online on the targeted node. The instance will come up and start processing just as it would when restarted locally. We should note that this is only one possible approach. We could just as easily have VCS migrate the JMS servers and JTA recovery service from the failed WebLogic Server instance to the other instance running on the other node.

Best Practice  

A complex WebLogic Server cluster will cost more than a simple cluster and require additional configuration and testing, but it is appropriate if your installation requires higher levels of availability.

Multiple Site Deployment Strategies

Multiple site deployment strategies are often discussed in the context of a continuous business paradigm, combining high-availability solutions with advanced disaster recovery techniques. The ultimate goal is to be able to manage both planned and unplanned outages with minimal disruption. These strategies allow continuous availability during failures as well as software and hardware migration without affecting availability. A complete discussion of this topic is beyond the scope of this chapter, so we will limit our discussion to key concepts and examine some configuration options.

Even though local clusters, in which all of the nodes and storage subsystems are in a single data center, offer good protection against smaller disasters such as single node failures or disk crashes, they do not protect against major disasters that could destroy or damage the entire facility. To protect against these kinds of failures you need to make sure that the cluster components are geographically dispersed. While most local clusters are designed around a shared disk-storage architecture where storage resources are physically connected to all nodes via SCSI or Fibre Channel, multi-site clusters usually rely on some type of replicated data architecture.

Designing Multiple-Site WebLogic Clusters

Including WebLogic Server applications in the design of a multi-site cluster is fairly straightforward as long as you ensure that the associated data is properly replicated to all data centers. It becomes more complicated when file-based JMS is used in a distributed transaction environment with multiple resources involved in a two-phase commit transaction (2PC) due to the exactly once nature of these services.

Additional design considerations covered in this section include the following:

  • Active-passive or active-active cluster design

  • HttpSession state management and replication

  • Transaction collocation requirements

  • Data replication

Cluster Design Options

It is possible to use both active-active and active-passive clusters with WebLogic Server applications. We recommend that you follow the same design that you used for your data-replication solution. For example, if the data replication between the two data centers is bidirectional, then an active-active design of your WebLogic Server applications may be desirable. If data replication is unidirectional, however, it may force you to stick with an active-passive design.

Session Replication

Managing and replicating HttpSession state is a major consideration for most WebLogic Server applications and has been discussed at length in earlier chapters. It tends to be less important in the design of the overall multiple site cluster, however, because the loss of HttpSession data in the event of a data-center loss is often acceptable to the business. If the loss of HttpSession data is not acceptable, the data can be stored in the database using WebLogic Server s JDBC-based session persistence mechanism, allowing the session data to be replicated like any other data in the database.

As with all disaster-recovery and high-availability planning, we recommend that you begin by first examining business requirements and then applying the proper deployment strategy that will meet the requirements. All data is not created equal, and it is likely that only a portion of application data is critical to the basic operation of the application.

As described in previous chapters, the most popular form of session persistence is in-memory replication. WebLogic Server uses a primary-secondary replication scheme in which the server first used to process a user's request is designated the primary server for that user and will create the primary copy of the HttpSession object. At the end of that first request and before the response is returned to the user, the primary server will create a secondary copy of the HttpSession data on a secondary server in the cluster. Typically, the primary server for a particular session receives all future requests for that session. If the primary server fails, the first request following the failure will be routed to another server in the cluster. This behavior puts several restrictions on your cluster design:

  • You must ensure that subsequent user requests always come back to the same data center where the primary server is located.

  • A typical WebLogic Server cluster does not span across data centers. The secondary session must therefore be created on a server in the same data center as the primary server. This also means that in the event of data center failure, data stored in the HttpSession object for that particular session would be lost.
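
Whichever topology you choose, in-memory replication places a couple of demands on application code that are worth illustrating. The sketch below is our own hypothetical servlet (the Cart class and the "cart" attribute name are made up for the example): session attributes must be serializable so they can be copied to the secondary server, and an attribute should be re-set with setAttribute() after it is modified so that the container knows to replicate the change.

    import java.io.IOException;
    import java.io.Serializable;
    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.HttpSession;

    // Hypothetical servlet showing session usage that cooperates with
    // in-memory replication: Serializable attributes, re-set after mutation.
    public class CartServlet extends HttpServlet {

        // Session attributes must be Serializable so they can be copied
        // to the secondary server.
        public static class Cart implements Serializable {
            private int itemCount;
            public void addItem() { itemCount++; }
            public int getItemCount() { return itemCount; }
        }

        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws ServletException, IOException {
            HttpSession session = request.getSession(true);

            Cart cart = (Cart) session.getAttribute("cart");
            if (cart == null) {
                cart = new Cart();
            }
            cart.addItem();

            // Re-setting the attribute tells the container the session changed,
            // so the updated value is replicated to the secondary copy.
            session.setAttribute("cart", cart);

            response.setContentType("text/plain");
            response.getWriter().println("Items in cart: " + cart.getItemCount());
        }
    }

If you instead choose JDBC-based session persistence to get cross-site failover, the same serializability rule applies because the attributes are serialized into the database.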

It is possible to use in-memory replication across sites. The primary issue that you run into is not the replication itself so much as it is the multicast traffic needed for JNDI replication and cluster membership and monitoring. Because multicast is a requirement, you need complete control over the link between the sites; most Internet and ISP routers are not configured to forward multicast packets. Additionally, we have found that if your connection between the data centers has a tendency to lose packets or the latency is over a few hundred milliseconds, this can cause problems with WebLogic Server's clustering mechanisms.

Best Practice  

Prefer architectures that do not require using WebLogic Server clusters that span data centers. If you do need to support HttpSession failover between data centers, consider using JDBC persistence and your existing data replication technology, which will work between clusters. If you need to use in-memory replication, you need to make sure that your connection supports multicast traffic, is not prone to packet loss, and has very low latency.

Transaction Collocation Requirements

Your multiple-site design should also consider that the application may use certain WebLogic Server services, such as JMS servers and JTA transaction recovery services, which are designed with the assumption that there is only one active instance of the service running in a cluster at any given time. You need to be able to migrate the data associated with these services, data that is usually critical for the normal operation of the applications. Additionally, to take advantage of WebLogic Server's transaction collocation optimization, it is also desirable that all such operations from a specific user be directed to the same data center.

Data Replication

Finally, to provide data center failover capabilities, your design needs to ensure that all critical data is replicated to the secondary data center where the services will be restored in the event of primary data center failure. To determine the data-replication requirements, start with the following items:

  • Domain configuration data stored in the domain root directory

  • JTA transaction logs, usually located in the server directory (for example, for a server named myserver, the transaction logs reside in the myserver subdirectory underneath the domain's root directory)

  • JMS persistent messages, which can be stored in an RDBMS or in the file system, if applicable

  • Data associated with the application business logic, usually stored in the RDBMS systems

You need to identify the items that must be available at the secondary data center to restart your application and recover all critical data, messages, and transactions.

Implementing Clusters That Span Multiple Sites

Implementing a multiple site WebLogic Server cluster requires reliable, high-speed networking technologies to support the cluster-wide communication used to monitor cluster members and replicate HttpSession contents, JNDI naming service information, and other application-level data. This is usually done in a campus cluster where great distances do not separate the cluster nodes.

A number of new technologies, including Dense Wavelength Division Multiplexing (DWDM) and long-haul Gigabit Interface Converters (GBICs), provide support for Fibre Channel communication at distances up to 100 kilometers. This has become attractive to many customers because it uses standard Fibre Channel components and relatively inexpensive modular storage. Additionally, the development of DWDM technology allows cluster architects to use dark fiber (high-speed communication links provided by common carriers) to extend the distances formerly subject to the limits of regular Ethernet links.

Many of these cluster techniques require that the cluster components be on the same subnet, although the subnet can have separately routed redundant links. If your company has this type of architecture, building a WebLogic Server cluster spanning multiple data centers is relatively easy, as it normally does not require any special configuration. Some of these clusters, though, can use more sophisticated data replication techniques, and the cluster architecture can span multiple subnets connected by WANs.

Figure 14.5 illustrates a possible configuration for a single WebLogic Server cluster that spans both sites.

Figure 14.5:  WebLogic cluster spanning multiple sites.

As we mentioned previously, WebLogic Server clusters depend on multicast communication for cluster-wide JNDI change notifications and heartbeat messages. In order for this configuration to work, you must ensure that multicast messages are reliably transmitted to all server instances in the cluster. Configure all routers between the sites to propagate multicast messages. Recognize that most network administrators do not allow UDP packets across different subnets, so you may need to use one subnet for the entire cluster unless you control the network configuration. Low network latency is also critical for cluster health; if your network latency is much more than 300 milliseconds, then you need to look at other configurations or at ways to reduce the latency. While the cluster may work with higher latencies, our experience has shown that higher latencies tend to be less reliable, especially during peak loads or failures. You must also configure the cluster's Multicast TTL value high enough to keep routers from dropping multicast packets before they reach their final destination, as discussed in Chapter 11.

Note that the local load balancers at each site are configured to route requests only to the servers in that site. Therefore, it is important to configure global load balancers to use a sticky routing algorithm so that requests associated with a particular user's session stick to the same site, except in the event of a site failure. This will keep the intersite traffic to a minimum but still allow global HttpSession failover. Also note that the local load balancers in this configuration can be replaced with a farm of Web servers configured to use one of the WebLogic Server proxy plug-ins. This approach, however, currently has one drawback.

With the hardware load balancer, the available servers to route to are preconfigured and the load balancer keeps track of which ones are up dynamically. This preconfiguration allows you to limit the visibility of the load balancer to the cluster members in the local data center. With the Web server plug-ins, the server list is also preconfigured, but the plug-ins normally update their cluster membership list with data returned by the WebLogic Server cluster. This causes a problem because now the plug-in will try to load balance requests across both data centers instead of just the local one, putting more load and stress on the network channel between the data centers. You can turn off the plug-in's dynamic configuration update feature by setting DynamicServerList to OFF in the plug-in configuration; however, this means that the plug-in cannot react to server failures as quickly or elegantly. We expect that BEA will address this limitation in an upcoming release, so please check your release notes for details.

To support cross-site HttpSession failover properly, you will need to use replication groups to ensure that the HttpSession object's primary and secondary copies are in different data centers. Clearly, cross-site session replication adds some latency, but it does provide seamless failover of user sessions in the event of a site failure. All you then need to worry about is the proper replication of application data in the database, JTA transaction logs, and so on. For simplicity, we did not include JMS servers in this design because the exactly-once nature makes it more difficult and the appropriate architecture is very dependent on your application's use of JMS and your requirements.

This multiple-site cluster design using a single WebLogic Server domain may be a good choice if your primary and secondary sites have good network connectivity and you can properly route multicast packets between the sites. Again, we need to reemphasize the importance of high-speed, reliable, low-latency network connectivity between the sites for this architecture to work successfully. As the distance between the data centers grows, it becomes more difficult and more costly to achieve this type of connectivity. For continental clusters or other situations where you simply cannot meet the recommended guidelines for intersite connectivity, we strongly recommend that you consider using separate clusters in each data center, the topic of the next section.

Best Practice  

Consider using a cluster that spans multiple sites if your sites have good network connectivity and you need the seamless failover of session information. Remember that your network between the sites must support multicast traffic if you want to use this architecture.

Implementing One Cluster per Site

The previous section described a multiple-site WebLogic Server cluster that used high-speed networking to achieve a single cluster across multiple sites. A single cluster is not always possible or desirable, however, and this section will explore one alternative.

In this alternative design, each site is configured with an independent WebLogic Server cluster. By defining an individual cluster for each site, you immediately eliminate all of the WebLogic Server-specific intersite communication requirements. Of course, the application may have its own intersite communication requirements, which will almost certainly include data replication. If it meets your requirements, we believe this multiple cluster design provides a simpler, more flexible architecture while still taking advantage of WebLogic Server clustering features locally in each site. Figure 14.6 illustrates this alternative multiple-site, multiple-cluster design.

Figure 14.6:  WebLogic cluster per site.

When a client first requests the URL for a Web application, the global load balancer will route the request to one of the data centers. The local load balancer will then route the request to one of the available servers in the WebLogic Server cluster at that location, and a user session will be created in a primary and secondary server in that cluster. The global and local load balancers remember where they sent the last request for a particular user session and will always attempt to route all subsequent requests from that user to the same data center and server. To accomplish this behavior, you will need to configure the global load balancer using a static persist policy and the local load balancer using a sticky load-balancing algorithm, topics discussed in detail later in this chapter.

If the primary server fails, the global load balancer routes the request to the same data center, but the local load balancer will sense that the primary server has failed and route the request to a different server in the cluster. Note that the hardware load balancer does not know the identity of the secondary server containing the replicated session data for this request; it simply picks another server in the cluster and routes the request. There are then two possibilities:

  • The request may have been routed to the server instance that was holding the secondary copy of this session. In that case, the secondary server is promoted to primary and a new server instance is chosen to hold the secondary copy of the session.

  • The request may have been received by a server instance that has no knowledge of the required user session. The server will inspect the session ID to determine the locations for the primary and secondary copies of the user session. After sensing that the primary is unavailable, it will call out to the secondary server and request a copy of the session data. This new server then becomes the primary server for this session.

In both cases, the server chosen by the local load balancer has become the new primary server for subsequent requests. The load balancer will remember the new primary location and route all subsequent requests there.

This failover behavior at the local cluster level is available in this multiple-cluster configuration and in the previous single-cluster configuration. The difference is how well the design handles the failure of an entire data center. In the multiple-cluster case, the loss of all servers at one site causes a loss of HttpSession data because the data was not replicated at the other site, only on other servers at the same site. You can, however, use JDBC-based session persistence to allow session failover between data centers. Despite this limitation with in-memory replication, the advantages of independent operation and support for most data-replication techniques make this design a strong candidate architecture for multisite configurations.

Best Practice  

The multiple-site, multiple-cluster architecture is a very good candidate architecture for applications requiring high availability and good disaster-recovery characteristics.

As we mentioned earlier, applications that use persistent JMS messages and/or JTA distributed transactions complicate this model. For example, you may need to bring up the JMS server from a failed WebLogic Server instance on another instance, or even another site. While there are many different ways to use JMS and JTA distributed transactions, the common theme that is usually present is that you need to bring up the JMS server and/or JTA recovery service from the failed node to process messages and/or do recovery of the in-flight transactions.

For intrasite failures, typical strategies include either migrating the service to another WebLogic Server instance in the cluster or having another machine bring up the failed instance. For complete site failures, you typically need to have the ability to bring up the entire WebLogic Server cluster at the other site so that it can drain any messages in the JMS persistent stores and recover any in-flight transactions. This is relatively easy to set up provided that you do not need the failed cluster to interact with your users; configuring it to interact with your users is possible, just more difficult.

Unfortunately, space prevents us from going into detailed discussions of the different scenarios. It's time to move on to talk about load balancers and how they work with WebLogic Server.



