Availability management aims at responding to anomalous events fast enough that the damage is contained and repaired before the SLA criteria are missed. Typically, humans take at least a few seconds to observe, analyze, and react to a situation.[24] With the speed of modern processors, a lot of damage can occur in a few seconds.
Automation refers to a program suite that performs the "observe, analyze, and react" processes at machine speeds. At the heart of the automation tool is the automation engine that gathers events, correlates information from various sources, and manages the system state changes with scripts. The key value of an automation tool is the speed at which a new system state (for example, the database manager is down) is recognized and at which it causes the predefined transition policies[25] to be executed. At the time of the writing of this book, automation tools cannot provide recovery itself. Instead, they provide a framework where you can plug in your recovery (state change) processes, typically in the form of a program script.
Linux on the mainframe has an ideal environment for managing availability because a number of automation tools exist. For example:
A number of automation tools are also available from system management vendors. See Chapter 25, "Systems Management Tools," for some other options. |