1.4 Overview of clustering technologies

< Day Day Up >

In this section we give an overview of clustering technologies with respect to high availability. A cluster is a group of loosely coupled machines networked together, sharing disk resources. While clusters can be used for more than just their high availability benefits (like cluster multi-processing), in this document we are only concerned with illustrating the high availability benefits; consult your IBM service provider for information about how to take advantage of the other benefits of clusters for IBM Tivoli Workload Scheduler.

Clusters provide a highly available environment for mission-critical applications. For example, a cluster could run a database server program which services client applications on other systems. Clients send queries to the server program, which responds to their requests by accessing a database stored on a shared external disk. A cluster takes measures to ensure that the applications remain available to client processes even if a component in a cluster fails. To ensure availability, in case of a component failure, a cluster moves the application (along with resources that ensure access to the application) to another node in the cluster.

1.4.1 High availability versus fault tolerance

It is important for you to understand that we are detailing how to install IBM Tivoli Workload Scheduler in a highly available, but not a fault-tolerant, configuration.

Fault tolerance relies on specialized hardware to detect a hardware fault and instantaneously switch to a redundant hardware component (whether the failed component is a processor, memory board, power supply, I/O subsystem, or storage subsystem). Although this cut-over is apparently seamless and offers non-stop service, a high premium is paid in both hardware cost and performance because the redundant components do no processing. More importantly, the fault-tolerant model does not address software failures, by far the most common reason for downtime.

High availability views availability not as a series of replicated physical components, but rather as a set of system-wide, shared resources that cooperate to guarantee essential services. High availability combines software with industry-standard hardware to minimize downtime by quickly restoring essential services when a system, component, or application fails. While not instantaneous, services are restored rapidly, often in less than a minute.

The difference between fault tolerance and high availability, then, is this: a fault-tolerant environment has no service interruption, while a highly available environment has a minimal service interruption. Many sites are willing to absorb a small amount of downtime with high availability rather than pay the much higher cost of providing fault tolerance. Additionally, in most highly available configurations, the backup processors are available for use during normal operation.

High availability systems are an excellent solution for applications that can withstand a short interruption should a failure occur, but which must be restored quickly. Some industries have applications so time-critical that they cannot withstand even a few seconds of downtime. Many other industries, however, can withstand small periods of time when their database is unavailable. For those industries, HACMP can provide the necessary continuity of service without total redundancy.

Figure 1-4 shows the costs and benefits of availability technologies.

Figure 1-4: Cost and benefits of availability technologies

As you can see, availability is not an all-or-nothing proposition. Think of availability as a continuum. Reliable hardware and software provide the base level of availability. Advanced features such as RAID devices provide an enhanced level of availability. High availability software provides near-continuous access to data and applications. Fault-tolerant systems ensure the constant availability of the entire system, but at a higher cost.

1.4.2 Server versus job availability

You should also be aware of the difference between availability of the server and availability of the jobs the server runs. This redbook shows how to implement a highly available server. Ensuring the availability of the jobs is addressed on a job-by-job basis.

For example, Figure 1-5 shows a production day with four job streams, labeled A, B, C and D. In this example, a failure incident occurs in between job stream B and D, during a period of the production day when no other job streams are running.

Figure 1-5: Example disaster recovery incident where no job recovery is required

Because no jobs or job streams are running at the moment of the failure, making IBM Tivoli Workload Scheduler itself highly available is sufficient to bring back scheduling services. No recovery of interrupted jobs is required.

Now suppose that job streams B and D must complete before a database change is committed. If the failure happened during job stream D as in Figure 1-6 on page 11, then before IBM Tivoli Workload Scheduler is restarted on a new server, the database needs to be rolled back so that when job stream B is restarted, it will not corrupt the database.

Figure 1-6: Example disaster recovery incident where job recovery not related to IBM Tivoli Workload Scheduler is required

This points out some important observations about high availability with IBM Tivoli Workload Scheduler.

It is your responsibility to ensure that the application-specific business logic of your application is preserved across a disaster incident.

For example, IBM Tivoli Workload Scheduler cannot know that a database needs to be rolled back before a job stream is restarted as part of a high availability recovery.
Knowing what job streams and jobs to restart after IBM Tivoli Workload Scheduler falls over to a backup server is dependent upon the specific business logic of your production plan.

In fact, it is critical to the success of a recovery effort that the precise state of the production day at the moment of failure is communicated to the team performing the recovery.

Let's look at Figure 1-7 on page 12, which illustrates an even more complex situation: multiple job streams are interrupted, each requiring its own, separate recovery activity.

Figure 1-7: Example disaster recovery incident requiring multiple, different job recovery actions

The recovery actions for job stream A in this example are different from the recovery actions for job stream B. In fact, depending upon the specifics of what your jobs and job streams run, the recovery action for a job stream that are required after a disaster incident could be different depending upon what jobs in a job stream finished before the failure.

The scenario this redbook is most directly applicable towards is restarting an IBM Tivoli Workload Scheduler Master Domain Manager server on a highly available cluster where no job streams other than FINAL are executed. The contents of this redbook can also be applied to Master Domain Manager, Domain Manager, and Fault Tolerant Agent servers that run job streams requiring specific recovery actions as part of a high availability recovery. But implementing these scenarios requires simultaneous implementation of high availability for the individual jobs. The exact details of such implementations are specific to your jobs, and cannot be generalized in a "cookbook" manner.

If high availability at the job level is an important criteria, your IBM service provider can help you to implement it.

1.4.3 Standby versus takeover configurations

There are two basic types of cluster configurations:

Standby	This is the traditional redundant hardware configuration. One or more standby nodes are set aside idling, waiting for a primary server in the cluster to fail. This is also known as hot standby.
Takeover	In this configuration, all cluster nodes process part of the cluster's workload. No nodes are set aside as standby nodes. When a primary node fails, one of the other nodes assumes the workload of the failed node in addition to its existing primary workload. This is also known as mutual takeover.

Typically, implementations of both configurations will involve shared resources. Disks or mass storage like a Storage Area Network (SAN) are most frequently configured as a shared resource.

Figure 1-8 shows a standby configuration in normal operation, where Node A is the primary node, and Node B is the standby node and currently idling. While Node B has a connection the shared mass storage resource, it is not active during normal operation.

Figure 1-8: Standby configuration in normal operation

After Node A falls over to Node B, the connection to the mass storage resource from Node B will be activated, and because Node A is unavailable, its connection to the mass storage resource is inactive. This is shown in Figure 1-9 on page 14.

Figure 1-9: Standby configuration in fallover operation

By contrast, a takeover configuration of this environment accesses the shared disk resource at the same time. For IBM Tivoli Workload Scheduler high availability configurations, this usually means that the shared disk resource has separate, logical filesystem volumes, each accessed by a different node. This is illustrated by Figure 1-10 on page 15.

Figure 1-10: Takeover configuration in normal operation

During normal operation of this two-node highly available cluster in a takeover configuration, the filesystem Node A FS is accessed by App 1 on Node A, while the filesystem Node B FS is accessed by App 2 on Node B. If either node fails, the other node will take on the workload of the failed node. For example, if Node A fails, App 1 is restarted on Node B, and Node B opens a connection to filesystem Node A FS. This fallover scenario is illustrated by Figure 1-11 on page 16.

Figure 1-11: Takeover configuration in fallover operation

Takeover configurations are more efficient with hardware resources than standby configurations because there are no idle nodes. Performance can degrade after a node failure, however, because the overall load on the remaining nodes increases.

In this redbook we will be showing how to configure IBM Tivoli Workload Scheduler for takeover high availability.

1.4.4 IBM HACMP

The IBM tool for building UNIX-based, mission-critical computing platforms is the HACMP software. The HACMP software ensures that critical resources, such as applications, are available for processing. HACMP has two major components: high availability (HA) and cluster multi-processing (CMP). In this document we focus upon the HA component.

The primary reason to create HACMP Clusters is to provide a highly available environment for mission-critical applications. For example, an HACMP Cluster could run a database server program that services client applications. The clients send queries to the server program, which responds to their requests by accessing a database stored on a shared external disk.

In an HACMP Cluster, to ensure the availability of these applications, the applications are put under HACMP control. HACMP takes measures to ensure that the applications remain available to client processes even if a component in a cluster fails. To ensure availability, in case of a component failure, HACMP moves the application (along with resources that ensure access to the application) to another node in the cluster.

Benefits

HACMP helps you with each of the following:

The HACMP planning process and documentation include tips and advice on the best practices for installing and maintaining a highly available HACMP Cluster.
Once the cluster is operational, HACMP provides the automated monitoring and recovery for all the resources on which the application depends.
HACMP provides a full set of tools for maintaining the cluster, while keeping the application available to clients.

HACMP lets you:

Set up an HACMP environment using online planning worksheets to simplify initial planning and setup.
Ensure high availability of applications by eliminating single points of failure in an HACMP environment.
Leverage high availability features available in AIX.
Manage how a cluster handles component failures.
Secure cluster communications.
Set up fast disk takeover for volume groups managed by the Logical Volume Manager (LVM).
Manage event processing for an HACMP environment.
Monitor HACMP components and diagnose problems that may occur.

For a general overview of all HACMP features, see the IBM Web site:

http://www-1.ibm.com/servers/aix/products/ibmsw/high_avail_network/hacmp.html

Enhancing availability with the AIX software

HACMP takes advantage of the features in AIX, which is the high-performance UNIX operating system.

AIX Version 5.1 adds new functionality to further improve security and system availability. This includes improved availability of mirrored data and enhancements to Workload Manager that help solve problems of mixed workloads by dynamically providing resource availability to critical applications. Used with the IBM IBM pSeries®, HACMP can provide both horizontal and vertical scalability, without downtime.

The AIX operating system provides numerous features designed to increase system availability by lessening the impact of both planned (data backup, system administration) and unplanned (hardware or software failure) downtime. These features include:

Journaled File System and Enhanced Journaled File System
Disk mirroring
Process control
Error notification

The IBM HACMP software provides a low-cost commercial computing environment that ensures that mission-critical applications can recover quickly from hardware and software failures. The HACMP software is a high availability system that ensures that critical resources are available for processing. High availability combines custom software with industry-standard hardware to minimize downtime by quickly restoring services when a system, component, or application fails. While not instantaneous, the restoration of service is rapid, usually 30 to 300 seconds.

Physical components of an HACMP Cluster

HACMP provides a highly available environment by identifying a set of resources essential to uninterrupted processing, and by defining a protocol that nodes use to collaborate to ensure that these resources are available. HACMP extends the clustering model by defining relationships among cooperating processors where one processor provides the service offered by a peer, should the peer be unable to do so.

An HACMP Cluster is made up of the following physical components:

Nodes
Shared external disk devices
Networks
Network interfaces
Clients

The HACMP software allows you to combine physical components into a wide range of cluster configurations, providing you with flexibility in building a cluster that meets your processing requirements. Figure 1-12 on page 19 shows one example of an HACMP Cluster. Other HACMP Clusters could look very different, depending on the number of processors, the choice of networking and disk technologies, and so on.

Figure 1-12: Example HACMP Cluster

Nodes

Nodes form the core of an HACMP Cluster. A node is a processor that runs both AIX and the HACMP software. The HACMP software supports pSeries uniprocessor and symmetric multiprocessor (SMP) systems, and the Scalable POWERParallel processor (SP) systems as cluster nodes. To the HACMP software, an SMP system looks just like a uniprocessor. SMP systems provide a cost-effective way to increase cluster throughput. Each node in the cluster can be a large SMP machine, extending an HACMP Cluster far beyond the limits of a single system and allowing thousands of clients to connect to a single database.

In an HACMP Cluster, up to 32 RS/6000® or pSeries stand-alone systems, pSeries divided into LPARS, SP nodes, or a combination of these cooperate to provide a set of services or resources to other entities. Clustering these servers to back up critical applications is a cost-effective high availability option. A business can use more of its computing power, while ensuring that its critical applications resume running after a short interruption caused by a hardware or software failure.

In an HACMP Cluster, each node is identified by a unique name. A node may own a set of resources (disks, volume groups, filesystems, networks, network addresses, and applications). Typically, a node runs a server or a "back-end" application that accesses data on the shared external disks.

The HACMP software supports from 2 to 32 nodes in a cluster, depending on the disk technology used for the shared external disks. A node in an HACMP Cluster has several layers of software components.

Shared external disk devices

Each node must have access to one or more shared external disk devices. A shared external disk device is a disk physically connected to multiple nodes. The shared disk stores mission-critical data, typically mirrored or RAID-configured for data redundancy. A node in an HACMP Cluster must also have internal disks that store the operating system and application binaries, but these disks are not shared.

Depending on the type of disk used, the HACMP software supports two types of access to shared external disk devices: non-concurrent access, and concurrent access.

In non-concurrent access environments, only one connection is active at any given time, and the node with the active connection owns the disk. When a node fails, disk takeover occurs when the node that currently owns the disk leaves the cluster and a surviving node assumes ownership of the shared disk. This is what we show in this redbook.
In concurrent access environments, the shared disks are actively connected to more than one node simultaneously. Therefore, when a node fails, disk takeover is not required. We do not show this here because concurrent access does not support the use of the Journaled File System (JFS), and JFS is required to use either IBM Tivoli Workload Scheduler or IBM Tivoli Management Framework.

Networks

As an independent, layered component of AIX, the HACMP software is designed to work with any TCP/IP-based network. Nodes in an HACMP Cluster use the network to allow clients to access the cluster nodes, enable cluster nodes to exchange heartbeat messages and, in concurrent access environments, serialize access to data. The HACMP software has been tested with Ethernet, Token-Ring, ATM, and other networks.

The HACMP software defines two types of communication networks, characterized by whether these networks use communication interfaces based on the TCP/IP subsystem (TCP/IP-based), or communication devices based on non-TCP/IP subsystems (device-based).

Clients

A client is a processor that can access the nodes in a cluster over a local area network. Clients each run a front-end or client application that queries the server application running on the cluster node.

The HACMP software provides a highly available environment for critical data and applications on cluster nodes. Note that the HACMP software does not make the clients themselves highly available. AIX clients can use the Client Information (Clinfo) services to receive notice of cluster events. Clinfo provides an API that displays cluster status information. The /usr/es/sbin/cluster/clstat utility, a Clinfo client shipped with the HACMP software, provides information about all cluster service interfaces.

The clients for IBM Tivoli Workload Scheduler and IBM Tivoli Management Framework are the Job Scheduling Console and the Tivoli Desktop applications, respectively. These clients do not support the Clinfo API, but feedback that the cluster server is not available is immediately provided within these clients.

1.4.5 Microsoft Cluster Service

Microsoft Cluster Service (MSCS) provides three primary services:

Availability	Continue providing a service even during hardware or software failure. This redbook focuses upon leveraging this feature of MSCS.
Scalability	Enable additional components to be configured as system load increases.
Simplification	Manage groups of systems and their applications as a single system.

MSCS is a built-in feature of Windows NT/2000 Server Enterprise Edition. It is software that supports the connection of two servers into a cluster for higher availability and easier manageability of data and applications. MSCS can automatically detect and recover from server or application failures. It can be used to move server workload to balance utilization and to provide for planned maintenance without downtime.

MSCS uses software heartbeats to detect failed applications or servers. In the event of a server failure, it employs a shared nothing clustering architecture that automatically transfers ownership of resources (such as disk drives and IP addresses) from a failed server to a surviving server. It then restarts the failed server's workload on the surviving server. All of this, from detection to restart, typically takes under a minute. If an individual application fails (but the server does not), MSCS will try to restart the application on the same server. If that fails, it moves the application's resources and restarts it on the other server.

MSCS does not require any special software on client computers; so, the user experience during failover depends on the nature of the client side of their client-server application. Client reconnection is often transparent because MSCS restarts the application using the same IP address.

If a client is using stateless connections (such as a browser connection), then it would be unaware of a failover if it occurred between server requests. If a failure occurs when a client is connected to the failed resources, then the client will receive whatever standard notification is provided by the client side of the application in use.

For a client side application that has statefull connections to the server, a new logon is typically required following a server failure.

No manual intervention is required when a server comes back online following a failure. As an example, when a server that is running Microsoft Cluster Server (server A) boots, it starts the MSCS service automatically. MSCS in turn checks the interconnects to find the other server in its cluster (server B). If server A finds server B, then server A rejoins the cluster and server B updates it with current cluster information. Server A can then initiate a failback, moving back failed-over workload from server B to server A.

Microsoft Cluster Service concepts

Microsoft provides an overview of MSCS in a white paper that is available at:

http://www.microsoft.com/ntserver/ProductInfo/Enterprise/clustering/ClustArchit.asp

The key concepts of MSCS are covered in this section.

Shared nothing

Microsoft Cluster employs a shared nothing architecture in which each server owns its own disk resources (that is, they share nothing at any point in time). In the event of a server failure, a shared nothing cluster has software that can transfer ownership of a disk from one server to another.

Cluster Services

Cluster Services is the collection of software on each node that manages all cluster-specific activity.

Resource

A resource is the canonical item managed by the Cluster Service. A resource may include physical hardware devices (such as disk drives and network cards), or logical items (such as logical disk volumes, TCP/IP addresses, entire applications, and databases).

Group

A group is a collection of resources to be managed as a single unit. A group contains all of the elements needed to run a specific application and for client systems to connect to the service provided by the application. Groups allow an administrator to combine resources into larger logical units and manage them as a unit. Operations performed on a group affect all resources within that group.

Fallback

Fallback (also referred as failback) is the ability to automatically rebalance the workload in a cluster when a failed server comes back online. This is a standard feature of MSCS. For example, say server A has crashed, and its workload failed over to server B. When server A reboots, it finds server B and rejoins the cluster. It then checks to see if any of the Cluster Group running on server B would prefer to be running in server A. If so, it automatically moves those groups from server B to server A. Fallback properties include information such as which group can fallback, which server is preferred, and during what hours the time is right for a fallback. These properties can all be set from the cluster administration console.

Quorum Disk

A Quorum Disk is a disk spindle that MSCS uses to determine whether another server is up or down.

When a cluster member is booted, it searches whether the cluster software is already running in the network:

If it is running, the cluster member joins the cluster.
If it is not running, the booting member establishes the cluster in the network.

A problem may occur if two cluster members are restarting at the same time, thus trying to form their own clusters. This potential problem is solved by the Quorum Disk concept. This is a resource that can be owned by one server at a time and for which servers negotiate for ownership. The member who has the Quorum Disk creates the cluster. If the member that has the Quorum Disk fails, the resource is reallocated to another member, which in turn, creates the cluster.

Negotiating for the quorum drive allows MSCS to avoid split-brain situations where both servers are active and think the other server is down.

Load balancing

Load balancing is the ability to move work from a very busy server to a less busy server.

Virtual server

A virtual server is the logical equivalent of a file or application server. There is no physical component in the MSCS that is a virtual server. A resource is associated with a virtual server. At any point in time, different virtual servers can be owned by different cluster members. The virtual server entity can also be moved from one cluster member to another in the event of a system failure.

< Day Day Up >