| < Day Day Up > |
|
Specifying the appropriate level of high availability for IBM Tivoli Workload Scheduler often depends upon how much reliability needs to be built into the environment, balanced against the cost of solution. High availability is a spectrum of options, driven by what kind of failures you want IBM Tivoli Workload Scheduler to survive. These options lead to innumerable permutations of high availability configurations and scenarios. Our goal in this redbook is to demonstrate enough of the principles in configuring IBM Tivoli Workload Scheduler and IBM Tivoli Management Framework to be highly available in a specific, non-trivial scenario such that you can use the principles to implement other configurations.
IBM Tivoli Workload Scheduler provides a degree of high availability through its Backup Domain Manager feature, which can also be implemented as a Backup Master Domain Manager. This works by duplicating the changes to the production plan from a Domain Manager to a Backup Domain Manager. When a failure is detected, a switchmgr command is issued to all workstations in the Domain Manager server's domain, causing these workstations to recognize the Backup Domain Manager.
However, properly implementing a Backup Domain Manager is difficult. Custom scripts have to be developed to implement sensing a failure, transferring the scheduling objects database, and starting the switchmgr command. The code for sensing a failure is by itself a significant effort. Possible failures to code for include network adapter failure, disk I/O adapter failure, network communications failure, and so on.
If any jobs are run on the Domain Manager, the difficulty of implementing a Backup Domain Manager becomes even more obvious. In this case, the custom scripts also have to convert the jobs to run on the Backup Domain Manager, for instance by changing all references to the workstation name of the Domain Manager to the workstation name of the Backup Domain Manager, and changing references to the hostname of the Domain Manager to the hostname of the Backup Domain Manager.
Then even more custom scripts have to be developed to migrate scheduling object definitions back to the Domain Manager, because once the failure has been addressed, the entire process has to be reversed. The effort required can be more than the cost of acquiring a high availability product, which addresses many of the coding issues that surround detecting hardware failures. The Total Cost of Ownership of maintaining the custom scripts also has to be taken into account, especially if jobs are run on the Domain Manager. All the nuances of ensuring that the same resources that jobs expect on the Domain Manager are met on the Backup Domain Manager have to be coded into the scripts, then documented and maintained over time, presenting a constant drain on internal programming resources.
High availability products like IBM HACMP and Microsoft Cluster Service provide a well-documented, widely-supported means of expressing the required resources for jobs that run on a Domain Manager. This makes it easy to add computational resources (for example, disk volumes) that jobs require into the high availability infrastructure, and keep it easily identified and documented.
Software failures like a critical IBM Tivoli Workload Scheduler process crashing are addressed by both the Backup Domain Manager feature and IBM Tivoli Workload Scheduler configured for high availability. In both configurations, recovery at the job level is often necessary to resume the production day.
Implementing high availability for Fault Tolerant Agents cannot be accomplished using the Backup Domain Manager feature. Providing hardware high availability for a Fault Tolerant Agent server can be accomplished through custom scripting, but using a high availability solution is strongly recommended.
Table 1-1 on page 26 illustrates the comparative advantages of using a high availability solution versus the Backup Domain Manager feature to deliver a highly available IBM Tivoli Workload Scheduler configuration.
Solution | Hardware | Software | FTA | Cost |
---|---|---|---|---|
HA | ü | ü | ü | TCO: $$ |
BMDM | ü | Initially: $ TCO: $$ |
When identifying the level of high availability for IBM Tivoli Workload Scheduler, potential hardware failures you want to plan for can affect the kind of hardware used for the high availability solution. In this section, we address some of the hardware failures you may want to consider when planning for high availability for IBM Tivoli Workload Scheduler.
Site failure occurs when an entire computer room or data center becomes unavailable. Mitigating this failure involves geographically separate nodes in a high availability cluster. Products like IBM High Availability Geographic Cluster system (HAGEO) deliver a solution for geographic high availability. Consult your IBM service provider for help with implementing geographic high availability.
Server failure occurs when a node in a high availability cluster fails. The minimum response to mitigate this failure mode is to make backup node available. However, you might also want to consider providing more than one backup node if the workstation you are making highly available is important enough to warrant redundant backup nodes. In this redbook we show how to implement a two-node cluster, but additional nodes are an extension to a two-node configuration. Consult your IBM service provider for help with implementing multiple-node configurations.
Network failures occur when either the network itself (through a component like a router or switch), or network adapters on the server, fail. This type of failure is often addressed with redundant network paths in the former case, and redundant network adapters in the latter case.
Disk failure occurs when a shared disk in a high availability cluster fails. Mitigating this failure mode often involves a Redundant Array of Independent Disks (RAID) array. However, even a RAID can catastrophically fail if two or more disk drives fail at the same time, if a power supply fails, or a backup power supply fails at the same time as a primary power supply. Planning for these catastrophic failures usually involves creating one or more mirrors of the RAID array, sometimes even on separate array hardware. Products like the IBM TotalStorage® Enterprise Storage Server® (ESS) and TotalStorage 7133 Serial Disk System can address these kinds of advanced disk availability requirements.
These are only the most common hardware failures to plan for. Other failures may also be considered while planning for high availability.
In summary, for all but the simplest configuration of IBM Tivoli Workload Scheduler and IBM Tivoli Management Framework, using a high availability solution to deliver high availability services is the recommended approach to satisfy high availability requirements. Identifying the kinds of hardware and software failures you want your IBM Tivoli Workload Scheduler installation to address with high availability is a key part of creating an appropriate high availability solution.
| < Day Day Up > |
|