IBM's Experience Implementing High-Availability WebSphere Servers

Here are details on IBM's experience in implementing two HACMP WAS Cluster projects at our server farms. The first case study is for an e-Commerce Application and is based on WAS V3.5 without load balancing. The second case study is based on WAS V4 with load balancing.

Case 1: An e-Commerce Application with HA but Without Load Balancing

To satisfy the high-availability requirements associated with an Internet-facing e-commerce application (which we'll refer to as "ECAPP" here), we implemented an HACMP cluster of two AIX servers in a hot-standby configuration, with WAS 3.5.2 installed on both servers. The WAS setup included the configuration of a DB2 database for WAS configuration data only. The application data that resided on the cluster servers was kept in a shared logical volume group on the AIX filesystem (most of the application data was immediately pushed to internal data processing servers). The WAS configuration database was not shared because of certain difficulties, which we discuss below.

The following diagram (Figure 14-5) shows the configuration of the HACMP cluster for the ECAPP implementation.

Let's clarify some items in the diagram. Both networks shown are "public" networks from HACMP's perspective (no HACMP cluster configuration data flows on either network). For each network, there is a "service" IP address defined, which provides external access to the cluster. The service IP remains the same regardless of which server is active. Each server must also have a unique IP address on which the operating system can be accessed; this is referred to as the "boot" IP address. When HACMP is started on the active server, it reconfigures the service adapter for each network, replacing the boot IP with the service IP.

The hardware configuration for the ECAPP server cluster consists of the following:

ECAPP Servers

Qty. 2, AIX 4.3.3, HACMP


  • RS/6000 7025 Model F50

  • 1x16-bit PCI SCSI SE adapter with cable to 6 bays

  • 2x9.1GB Ultra SCSI HDD for operating system

  • 2x18.2GB Ultra SCSI HDD for logging

  • 4x10/100Mbps Ethernet adapters

  • 8x256MB SDRAM DIMMS (2GB total)

  • 2x2 way RS64 II 332MHz processor card

  • 1xSCSI Hot Swap 6 Pack

  • 1xRS-232 serial cable

External Drive Array

Qty. 1


  • 7133-D40 Advanced SSA Subsystem

  • 6x18.2GB 10K Advanced HDD

  • 1x50/60Hz AC 300 VDC Power Supply

  • 12x2.5m Advanced SSA Cable

Note the use of four separate Ethernet adapters (NICs) per server. We chose to make both the Internet-facing and internal network interfaces redundant on each server and to configure HACMP adapter failover for each NIC pair. The RS-232 serial cable is used for the HACMP serial network (the node-to-node "heartbeat"). The external SSA subsystem provides the shared storage among the clustered servers.

The HACMP cluster is configured in "hot-standby" mode; that is, one server is always active in the cluster while the other server sits idle but can automatically be made active should the primary server fail. The automatic promotion of the idle server to the active role is referred to as node "failover" or "fallover." On failover, the idle server is dynamically reconfigured to assume the identity and resources of the primary server. Specifically, the following resources are "moved":

  • The service IP network address/adapter binding

  • The shared resources (in our case, the SSA-backed mount points)

In addition, on failover, the WAS/DB2 configuration is started on the idle (now active) server and takes over the ECAPP processing.

The fact that the TCP/IP network addresses and hostnames change on failover complicates the WAS configuration in our ECAPP cluster. This is because WAS ties itself to the server's hostname on a normal install; that is, WAS uses the server's hostname as its node name and writes this host/node name to its configuration database. Thus, even if WAS were installed on the primary server as a shared resource, when WAS was started on the secondary server after a failover, the hostname of the secondary server would not match the node name of the WAS installation. For this reason, we chose to install WAS separately on each cluster server. Since our WAS database is used only for configuration data, we chose to create separate database instances on each server as well.

Another complication arises because of the changing hostname-to-IP-address associations on failover (both node and adapter). For various reasons, it is beneficial to have a fixed association between a server's TCP hostname and its IP address. This is important not only for server applications such as WAS, but also for system management functions such as backup and monitoring. One example of the difficulty caused by HACMP dynamically changing the adapter-to-IP associations (and thus the hostname-to-IP association): when WAS is started on the failover node, its node name (hostname) resolves to a different IP address (the common service IP) than the one it was installed with (the server's boot IP), which results in error messages from the intra-WAS communication protocols.

To overcome this changing adapter/IP address complication, we defined a third IP address for each network interface point. This third IP address is distinct from the boot IP and standby IP addresses, although it is placed within the same subnet. The third IP is established as an alias on whichever adapter is active for the network connection. HACMP pre- and post-event scripts were written to configure the third IP as an alias for the service IP (using the AIX ifconfig command with the alias option). Thus, whenever HACMP is started or an adapter failover is performed, the third IP address is associated with the active (service) adapter and is therefore maintained as a fixed IP address for that network connection. (Note that the third IP address is not made known outside of the server itself; external access to the server is still via the service IP address.)
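
As a quick sanity check of this arrangement, a few lines of Java can confirm that the node's hostname continues to resolve to the fixed alias address rather than to whichever boot or service IP happens to be active. The hostname and address below are hypothetical, not the actual values used in the ECAPP cluster:

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    // Minimal sketch: verify that the local hostname resolves to the fixed alias IP.
    public class AliasCheck {
        public static void main(String[] args) throws UnknownHostException {
            String nodeName = "ecapp-node1";        // hostname WAS recorded as its node name (assumed)
            String fixedAliasIp = "192.168.10.30";  // the third (alias) IP for this connection (assumed)

            InetAddress resolved = InetAddress.getByName(nodeName);
            System.out.println(nodeName + " resolves to " + resolved.getHostAddress());

            if (!fixedAliasIp.equals(resolved.getHostAddress())) {
                // If this prints, the hostname is still bound to a boot or service IP,
                // which is the situation that produced the intra-WAS protocol errors.
                System.out.println("WARNING: hostname does not resolve to the fixed alias IP");
            }
        }
    }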

We chose to configure the servers in a "rotating" type of HACMP node failover. This means that when the failed server reenters the cluster, it assumes the idle server role. We chose this approach to avoid the overhead of falling back to the restored server, as would occur in a "cascading" type of failover. It also allows the entire failover process to be completely automatic.

Case 2: WAS V4 Web Servers with Both HA and Load Balancing

This section discusses an example of a high-availability WAS V4 implementation with load balancing, based on a system IBM designed and implemented at one of our server farms. This WAS system supports a large Siebel (CRM) application. The implementation uses WAS V4.0, HACMP in active/active mode, and Network Dispatchers at the front end. The design concepts for this system are typical of the large-scale, high-availability WAS implementations that IBM has recently been building for customers at our server farms. Most of IBM's early WAS high-availability implementations are based on WAS V3.5x and use AIX HACMP in active/standby mode. The example in this section is based on the following requirements and WAS high-availability design concepts, which are typical of our recent WAS V4.0 high-availability designs.

General WAS high-availability requirements at IBM server farms:

  • High availability: 99.9 percent availability (about eight hours of unscheduled downtime per year)

  • Performance: high performance, capacity on demand

  • Scalability: a technology platform that scales for CPU, memory, and disk

  • Reliability: internal redundancy, fault tolerance, and automatic failover

Technology

The servers are IBM pSeries nodes with high-availability, scalability, reliability, and performance features. The servers typically used for WAS V4.0 are pSeries model 660 midrange enterprise servers (6H0, 6H1, 6M1) with four to eight processors (450MHz to 750MHz), expandable from 32GB to 64GB of memory (RAM), with internal or external disk storage and redundant internal system components.

HACMP clustering and WAS clustering together provide workload management and failover capabilities, giving the dual benefits of high availability and workload management. HACMP provides high availability from a hardware perspective, and WAS V4 clustering provides software failover. These technologies work in conjunction to provide fault tolerance and high availability.

The environment can start out small and grow to the full capacity of the servers. It is very scalable with HACMP in an active/active configuration, since both servers in a cluster can provide continuous processing and failover. The theory behind high availability is to reduce single points of failure where possible and build redundancy into the infrastructure to mitigate risk. The redundancy is automatic and seamless throughout the infrastructure to minimize the impact on performance.

For the enterprise systems, HACMP (High Availability Cluster Multi-Processing) is used to help build highly available systems at the hardware level. On the software side, WebSphere clustering is used to build a highly available WAS environment in either active/active or active/standby mode. By coupling HACMP and WAS clustering, you can combine the benefits of both hardware and software failover to build a robust and highly available node.

While HACMP and WAS clustering each provide decent HA solutions on their own, the two technologies need to be combined to fully mitigate the risk of a failure. Using both HACMP and WAS clustering within a WAS node, the two technologies work together to allow continuous processing. A shared-disk configuration provides redundant access to application data and log files to ensure that all messages are processed. In an active/active configuration, also known as a mutual takeover configuration, multiple nodes can run workload simultaneously.

Recommendations on the Best Ways to Achieve WAS High Availability

High availability, including failover and recovery in WebSphere, is a very broad subject, and there are a lot of options. This section provides some advice, tips, and best practices specific to using WAS (including V4 or V5) to provide high availability. Generally, it is best to first consider using WebSphere's built-in failover capability. Second, consider application code best practices for failure-related exception handling and recovery. Third, consider using other IBM and third-party products in conjunction with WebSphere, such as HACMP, MSCS, and Network Dispatcher. Here are some details:

  • WebSphere's built-in failover capability is achieved using the Workload Management (WLM) facility at the Web module, EJB module, and administrative server levels. WAS V4 with "server groups" or WAS V5 with "clusters" provides a WebSphere administrator with the ability to create any number of application server instances, or clones, within the server group or cluster. These clones can all reside on a single node or can be distributed across multiple nodes in the WebSphere domain. Clones can be administered as a single unit by manipulating the server group or cluster object. WebSphere clones can share application workload and provide failover support. If one of the clones fails, work can continue to be handled by the other clones in the server group or cluster, if they are still available.

  • Application code best practices for failover exception handling and recovery mean that the application code should be able to handle JDBC exceptions and roll over to a "healthy" database server. This is achieved by ensuring that the application code is "failover-ready." It is important to remember that even in a WebSphere environment that employs high-availability mechanisms such as HACMP, MSCS, or Network Dispatcher, there is still a disruption in service while the database is switched from a failed server to an available server. Details on how best to code the application to recognize that the database is not responding to requests, and to reconnect to a database server once it has recovered from a failure, are given in the IBM Redbook listed later in the chapter. (A minimal sketch of this kind of retry logic appears after this list.)

  • In addition to WebSphere's built-in failover and load balancing capabilities and application code best practices for failover exception handling and recovery, there are a variety of IBM and third-party products that can be used in conjunction with WebSphere. These products include the OS clustering software discussed earlier in this chapter, such as AIX HACMP and Microsoft OS Clustering (MSCS). Most of the OS clustering products can be used in either an active/passive (hot standby) or an active/active configuration. OS clustering (also called hardware clustering) provides a cluster manager process that periodically polls (checks the "heartbeat" of) the other software processes in the cluster to determine whether the software, and the hardware it is running on, is still active. If a "heartbeat" is not detected, the cluster manager moves the software process to another server in the cluster. (A simplified sketch of this heartbeat-polling logic appears after this list.) The movement of the process from one machine to another is not instantaneous but is usually accomplished within a minute or two (thus the term "high availability" rather than the ideal "continuous" availability). OS clustering techniques rely on a disk array that is shared by all servers in the cluster; hence, OS clustering is not applicable for failing over between two sites over the wide area network.

  • IP Sprayers such as Network Dispatcher (WebSphere Edge Server) can also provide high availability and load balancing, without the need for shared disk arrays. So with IP Sprayers, you can have failover across the wide area network. IP Sprayers such as Network Dispatcher are also needed for Web sites that require massive scalability, since they provide load distribution and failover for incoming HTTP requests.
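
To make the second bullet concrete, here is a minimal sketch of "failover-ready" JDBC access under stated assumptions: the JDBC URL, table name, credentials, and retry limits are illustrative only, and the DB2 JDBC driver is assumed to be registered. In a real WAS application the connection would normally come from a WebSphere DataSource obtained via JNDI, and the code could also test for WebSphere-specific exceptions (such as StaleConnectionException) instead of plain SQLException:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    // Minimal sketch of retry-and-reconnect logic for a database that may fail over.
    public class FailoverReadyQuery {
        private static final String JDBC_URL = "jdbc:db2://dbserver:50000/ECAPPDB"; // assumed
        private static final int MAX_ATTEMPTS = 3;
        private static final long RETRY_DELAY_MS = 5000;

        public static int countOrders() throws SQLException, InterruptedException {
            SQLException lastFailure = null;
            for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
                Connection conn = null;
                try {
                    // DriverManager keeps the sketch self-contained; WebSphere code
                    // would look up a DataSource instead.
                    conn = DriverManager.getConnection(JDBC_URL, "user", "password");
                    Statement stmt = conn.createStatement();
                    ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM ORDERS");
                    rs.next();
                    return rs.getInt(1);
                } catch (SQLException e) {
                    // Assume the connection is stale (for example, the database node
                    // failed over); wait for the standby to take over, then retry.
                    lastFailure = e;
                    Thread.sleep(RETRY_DELAY_MS);
                } finally {
                    if (conn != null) {
                        try { conn.close(); } catch (SQLException ignore) { }
                    }
                }
            }
            throw lastFailure; // all retries exhausted; surface the failure to the caller
        }
    }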
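
The heartbeat polling described in the third bullet can also be illustrated with a deliberately simplified sketch. Real cluster managers such as HACMP or MSCS use their own protocols over dedicated networks (for example, the RS-232 serial link in Case 1); the peer hostname, port, and thresholds below are assumptions chosen only to show the miss-counting logic:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    // Conceptual sketch: poll a peer at a fixed interval and declare it down after
    // several consecutive misses. Not how HACMP is actually implemented.
    public class HeartbeatMonitor {
        private static final String PEER_HOST = "standby-node"; // assumed peer hostname
        private static final int PEER_PORT = 12345;             // assumed heartbeat port
        private static final int MISS_THRESHOLD = 3;
        private static final long INTERVAL_MS = 10000;

        public static void main(String[] args) throws InterruptedException {
            int misses = 0;
            while (true) {
                if (peerAlive()) {
                    misses = 0;
                } else if (++misses >= MISS_THRESHOLD) {
                    // A real cluster manager would now acquire the shared disks,
                    // move the service IP address, and restart the application.
                    System.out.println("Peer declared down; initiating takeover");
                    break;
                }
                Thread.sleep(INTERVAL_MS);
            }
        }

        private static boolean peerAlive() {
            try {
                Socket s = new Socket();
                s.connect(new InetSocketAddress(PEER_HOST, PEER_PORT), 2000); // 2-second timeout
                s.close();
                return true;
            } catch (IOException e) {
                return false;
            }
        }
    }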

These options for WebSphere high availability are shown in Figure 14-6. Notice that the diagram includes Network Dispatchers, WAS clusters (V4 server groups or V5 clusters), HACMP options, etc. A good document to reference for all of these options is the IBM Redbook, "IBM WebSphere V4.0 Advanced Edition: Scalability and Availability," located at http://www.redbooks.ibm.com, document SG24-6192, May 2002.

Figure 14-6. General WebSphere availability options.

The basic ideas behind the different WAS load balancing and failover options are also given in the following table. The type of failover and load balancing you need depends on your high-availability requirements. If you only need failover without load balancing, an HACMP implementation with an active/passive configuration is sufficient. IP Sprayers such as Network Dispatcher are required when you need very high scalability for your Web servers or failover between two sites.

Table 14-2. WAS Load Balancing and Failover Options

High-Availability Requirement: Failover only; active/passive (i.e., no need for high volume)
  • WAS V3.5x: OS-clustered servers in active/passive mode using HACMP
  • WAS V4: WAS "server groups" to allow software load balancing, with HACMP for administrative server failover
  • WAS V5: WAS "clusters" for load balancing, with HACMP for administrative server failover

High-Availability Requirement: Failover and load balancing, single site
  • WAS V3.5x: WAS with HACMP active/active or Network Dispatcher (WebSphere Edge Server)
  • WAS V4: WAS "server groups" with HACMP active/active or Network Dispatcher
  • WAS V5: WAS "clusters" with HACMP active/active or Network Dispatcher

High-Availability Requirement: Failover and load balancing between two or more sites
  • WAS V3.5x: Network Dispatcher
  • WAS V4: Network Dispatcher
  • WAS V5: Network Dispatcher



