Avoiding System Problems


To avoid system problems related to misconfigurations, you need to have appropriate product documentation and business policies in place. The administrators making the changes should have access to caveats and a history log of past changes. Changes should be logged, and a revision control system should be used so that you can quickly revert to an earlier configuration.
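
As an illustration of the logging and revert idea, the following is a minimal sketch of a change history for a single configuration file. The `ConfigHistory` and `Revision` names, and the sample setting in the usage note, are hypothetical and only for illustration. In practice an established revision control system would normally fill this role.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Revision:
    number: int
    author: str
    note: str
    content: str
    timestamp: datetime


@dataclass
class ConfigHistory:
    """In-memory change log for one configuration file (hypothetical helper)."""
    revisions: list = field(default_factory=list)

    def commit(self, content, author, note):
        # Every change is stored with who made it and why.
        rev = Revision(len(self.revisions) + 1, author, note, content,
                       datetime.now(timezone.utc))
        self.revisions.append(rev)
        return rev.number

    def current(self):
        # Content of the most recent revision.
        return self.revisions[-1].content

    def revert(self, number, author):
        # A revert is recorded as a new revision, so the log stays complete.
        if not 1 <= number <= len(self.revisions):
            raise ValueError(f"no such revision: {number}")
        old = self.revisions[number - 1]
        return self.commit(old.content, author, f"revert to revision {number}")

    def log(self):
        # One line per change, oldest first.
        return [f"r{r.number} {r.author}: {r.note}" for r in self.revisions]
```

Committing a working setting, then a bad one, and calling `revert(1, ...)` restores the first content as revision 3 while keeping the bad change visible in `log()`.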

System components will fail, but you can reduce consequential problems by investing in high availability or resiliency products and features. In the 1997 D. H. Brown survey of high availability providers, Hewlett-Packard was rated above average in its ability to detect and recover from failures. HP-UX provides dynamic memory resiliency, dynamic processor resiliency, and dynamically loadable kernel modules. Single-bit CPU cache errors can be corrected automatically. Memory Error-Correcting Code (ECC) and checksums reduce memory problems, but don't eliminate the need to monitor the memory subsystem. HP supports error thresholds for memory and disks, and its Memory Page Deallocation feature enables dynamic memory deselection for failing memory locations.

As vendors improve the resiliency of their operating systems, CPU failures become less likely to cause a system to fail. In some cases, if diagnostic tools detect a problem with a CPU, the processor can be deallocated while the operating system continues to run. For example, this can be done if the rate of corrected single-bit CPU cache errors exceeds a predefined threshold. The processor can also be deallocated if a problem is found in the self-test during boot.
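
The threshold idea can be sketched as a sliding-window counter. This is only an illustration of rate-based flagging, not how HP-UX or its diagnostic tools implement it. The class name, the limit of errors, and the window length are all assumptions chosen for the example.

```python
from collections import deque


class CorrectedErrorMonitor:
    """Flags a CPU when corrected errors exceed a limit within a time window.

    Hypothetical helper: max_errors and window_seconds are illustrative
    values, not vendor defaults.
    """

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._events = {}  # cpu id -> deque of error timestamps (seconds)

    def record(self, cpu_id, timestamp):
        """Record one corrected error; return True if the CPU should be flagged."""
        events = self._events.setdefault(cpu_id, deque())
        events.append(timestamp)
        # Drop errors that have aged out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.max_errors
```

With `max_errors=3` and `window_seconds=60`, a fourth error on the same CPU within one minute returns `True`, while the same four errors spread over several minutes do not. A real system would then act on the flag, for example by deallocating the processor.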

For companies with HP support contracts, HP Predictive Support can be used to detect trends that might lead to system problems. An engineer can then be sent to the customer site to make repairs before a problem becomes serious.

You also must back up your data regularly to prepare for any problems. The backups should be tested regularly to ensure that they are working properly.
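
One way to test a backup is to restore it to a scratch location and compare it with the original data. The sketch below assumes the backup has already been restored to a directory. The function names are hypothetical, and SHA-256 is simply one reasonable choice of checksum.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_restore(source_dir, restored_dir):
    """List relative paths that are missing or differ in the restored tree."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = restored_dir / rel
        # A file fails the check if it is absent or its contents changed.
        if not dst.is_file() or file_digest(src) != file_digest(dst):
            problems.append(rel.as_posix())
    return problems
```

An empty list means every source file was found with identical contents in the restored copy. The check only makes sense if the source has not changed since the backup was taken, and it does not detect extra files in the restored tree.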

You should also try to avoid performance and resource management problems by closely monitoring how your system is being used. Techniques for accomplishing this are described in the previous section.



UNIX Fault Management: A Guide for System Administrators
ISBN: 013026525X
Year: 1999
Pages: 90
