Future Trends in Fault Management


More recovery and repair capabilities are being built into servers, and server providers are working to reduce the failure rates of individual hardware components. Currently, on many servers, the loss of a processor causes the system to fail. In the future, the capability to detect a problem, de-allocate the processor, and continue operations with reduced performance will be standard on multiprocessor systems. The problem will still be reported, so that a repair can be made at a later time. You will also be able to add and remove all the key server components, such as CPU, memory, and I/O, without bringing down the system.

Because systems will no longer fail in these situations, your ability to obtain individual component failure events will become more important than ever. Otherwise, you may not realize that something has gone wrong until it's too late.

Sun and Hewlett-Packard both have high availability cluster products that they continue to enhance. High availability products provide automatic recovery from some problems and give you more time to repair the original faults. Today, with its Mission Critical Server Suites, HP provides a 99.95-percent uptime guarantee. This equates to less than five hours of downtime each year. By the end of 2000, HP intends to provide a 99.999-percent uptime guarantee, as part of its "5 nines: 5 minutes" High Availability (HA) program. HP plans to accomplish this goal by partnering with key application providers and by improving fault detection and recovery in the HP-UX operating system. The application, database, and network will also be included in the HA guarantee.
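
The relationship between an uptime percentage and the downtime it allows is simple arithmetic. The following minimal Python sketch shows the calculation; the function name is hypothetical, and it assumes a 365-day year, which is an assumption of this example rather than a term of any vendor's guarantee.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Return the yearly downtime, in hours, that an uptime guarantee allows."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    for pct in (99.95, 99.999):
        hours = downtime_hours_per_year(pct)
        print(f"{pct}% uptime allows {hours:.2f} hours "
              f"({hours * 60:.1f} minutes) of downtime per year")
```

Under that assumption, 99.95 percent works out to about 4.4 hours per year, and 99.999 percent to a little over 5 minutes, which is where the "5 minutes" in the program name comes from.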

You can find more information on HP's intentions for HA on its Web site at http://www.hp.com/go/ha.

Both Sun and HP realize the importance of having complete system fault instrumentation. Sun is adding new monitoring agents to send events to SyMON, and HP is adding more EMS monitors. Both Sun and HP provide developer's kits to make it easier for third parties to add monitoring components. As system instrumentation becomes more extensive, vendors will extend their monitoring capabilities into application software, such as databases and Enterprise Resource Planning (ERP) applications.

System vendors are realizing that they need to provide monitors for more than just failure events. For a complete fault management solution, system operators need improved monitors to detect the following:

  • Configuration changes

  • Additional security intrusion conditions

  • Thresholds being exceeded for system resource usage

Thresholds are important because they let you detect trends and give you more time to react before real problems occur. Both Sun and HP are planning more predictive capabilities.
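
To illustrate how a threshold combined with a trend can buy reaction time, the sketch below fits a least-squares line through regularly spaced resource usage samples and estimates how long it will be until a threshold is reached. This is only one simple way to do it, not the method used by SyMON or EMS; the function name, the sampling interval, and the sample values are all assumptions made for the example.

```python
from typing import Optional, Sequence


def hours_until_threshold(samples: Sequence[float],
                          threshold: float,
                          interval_hours: float = 1.0) -> Optional[float]:
    """Estimate the hours until usage reaches `threshold`.

    `samples` are usage readings taken every `interval_hours`, oldest first.
    Returns 0.0 if the latest reading already meets the threshold, and None
    if there are too few samples or usage is flat or falling.
    """
    if not samples:
        return None
    if samples[-1] >= threshold:
        return 0.0
    n = len(samples)
    if n < 2:
        return None
    # Least-squares line through the points (0, y0), (1, y1), ...
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    slope = num / den
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    # Sample index at which the fitted line crosses the threshold.
    crossing = (threshold - intercept) / slope
    return max(0.0, (crossing - (n - 1)) * interval_hours)


if __name__ == "__main__":
    disk_usage = [70.0, 72.0, 74.0, 76.0]  # hypothetical hourly percentages
    print(hours_until_threshold(disk_usage, threshold=90.0))
```

A monitor built this way could raise a warning event while the estimate is still comfortably large, instead of waiting for the resource to be exhausted.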

Today, a variety of notification methods are available to report events to a management station. New methods will need to be added to support emerging standards, such as the Desktop Management Interface (DMI).

Troubleshooting will become easier, because the increased granularity of events will make it easier to locate the root cause of problems. However, without corresponding improvements in event filtering and correlation capabilities, the increasing number of events is still likely to overwhelm system operators. Correlation will become more common, so that operators see only those events that require them to take some action.
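
As a rough illustration of filtering and correlation, the sketch below drops low-severity events and collapses repeated events from the same host and component into a single reported event with a repeat count. The `Event` fields, the severity scale, and the 60-second window are hypothetical choices for this example; real event formats and correlation rules differ by product.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass
class Event:
    timestamp: float  # seconds since some reference point
    host: str
    component: str
    severity: int     # hypothetical scale: 0 = info ... 3 = critical
    count: int = 1    # number of raw events folded into this one


def filter_and_correlate(events: Iterable[Event],
                         min_severity: int = 2,
                         window: float = 60.0) -> List[Event]:
    """Return the events an operator should see.

    Events below `min_severity` are filtered out. An event that arrives
    within `window` seconds of the previous event for the same host and
    component is folded into the event already reported for that pair.
    """
    reported: List[Event] = []
    current: Dict[Tuple[str, str], Event] = {}
    last_seen: Dict[Tuple[str, str], float] = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        if ev.severity < min_severity:
            continue  # filtering: not worth an operator's attention
        key = (ev.host, ev.component)
        if key in current and ev.timestamp - last_seen[key] <= window:
            # correlation: treat it as a repeat of the same problem
            current[key].count += 1
            current[key].severity = max(current[key].severity, ev.severity)
        else:
            copy = Event(ev.timestamp, ev.host, ev.component, ev.severity)
            current[key] = copy
            reported.append(copy)
        last_seen[key] = ev.timestamp
    return reported


if __name__ == "__main__":
    raw = [
        Event(0, "hostA", "disk0", 3),
        Event(10, "hostA", "disk0", 3),
        Event(20, "hostA", "cpu1", 1),
        Event(30, "hostB", "disk0", 2),
        Event(200, "hostA", "disk0", 2),
    ]
    for ev in filter_and_correlate(raw):
        print(ev)
```

Keying on host and component is the simplest possible rule. Production correlation engines typically also relate events across components, for example by suppressing the symptoms that follow from a known root cause.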

More extensive help information will be included with events in the future. Currently, operators are given event messages without many suggestions on how to react. HP is looking to provide additional ways to correct problems.

Both Sun and HP want to provide more autonomous recovery capabilities on managed systems. Sun refers to these capabilities as "intelligent, autonomous agents"; HP refers to them as "self-healing systems." Regardless of the name, the concept is essentially the same: The local agents will try to take the appropriate recovery action without involving the system operator. Actions may be based on policies that are predefined by the system administrator.
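
The idea of policy-driven recovery can be sketched as a table that maps event types to recovery actions, with anything not covered by a policy escalated to the operator. Everything in the following Python example is hypothetical: the event type names, the action functions (which only return a description instead of touching a real system), and the escalation message are placeholders, not part of any Sun or HP agent.

```python
from typing import Callable, Dict

RecoveryAction = Callable[[str], str]


def deallocate_processor(target: str) -> str:
    # Placeholder: a real agent would ask the operating system to do this.
    return f"deallocated {target}"


def restart_service(target: str) -> str:
    # Placeholder: a real agent would restart the named service.
    return f"restarted {target}"


# Policies predefined by the system administrator (hypothetical event types).
POLICIES: Dict[str, RecoveryAction] = {
    "cpu_error": deallocate_processor,
    "service_down": restart_service,
}


def handle_event(event_type: str, target: str,
                 policies: Dict[str, RecoveryAction] = POLICIES) -> str:
    """Try the recovery action defined for the event, or escalate.

    Returns a short description of what was done, so that the problem
    can still be reported even when recovery succeeds.
    """
    action = policies.get(event_type)
    if action is None:
        return f"escalated {event_type} on {target} to operator: no policy"
    try:
        return action(target)
    except Exception as exc:
        return f"escalated {event_type} on {target} to operator: {exc}"


if __name__ == "__main__":
    print(handle_event("cpu_error", "cpu3"))
    print(handle_event("fan_failure", "fan1"))
```

Keeping the policies in data rather than in code lets an administrator change what the agent does without modifying the agent itself.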

Online capabilities (mentioned earlier in the chapter) will lead to faster recovery times. A system operator will not have to wait for the next planned downtime period to make repairs to the server. The operator will also be able to increase capacity more frequently.

Resource management and fault management will become more tightly integrated over the next few years. Both Sun and HP are reviewing ways to integrate their resource management tools with their event management facilities. Sun plans to integrate its Resource Manager with the SyMON product. Sun can already dynamically reconfigure a protection domain and reallocate system resources in response to performance-related events. HP has announced plans to integrate its Process Resource Manager with event management, enabling the Process Resource Manager to meet service-level objectives by reallocating resources in response to system bottlenecks.

These are just some of the new capabilities that will be coming in the next few years. Fault management is considered strategic by many vendors, and is viewed as one important way to differentiate product offerings.



UNIX Fault Management: A Guide for System Administrators
ISBN: 013026525X
Year: 1999
Pages: 90
