Identifying Important Disk Monitoring Categories | UNIX Fault Management: A Guide for System Administrators


As already mentioned, redundancy is the best way to protect against initial failures. The key to high availability is to always maintain redundancy so that a disk fault does not adversely affect system and data availability. Sometimes, however, this isn't enough. If the resources aren't performing normally, availability can be affected by degraded performance, too.

Regardless of whether you have your disks configured for high availability, monitoring can help protect against failures. Redundancy protects against an initial hardware failure, but you still need monitoring to guard against a second failure and against other failures that may affect availability.

The events and measures that you can use to maintain availability fall into four categories of system monitoring:

Disk configuration: Includes monitoring for the addition or removal of disk resources, as well as ensuring that the proper software and firmware versions are installed. Tracking configuration changes can help when you are troubleshooting problems later. Ensuring that a device is using the correct version of firmware can also eliminate problems.
Fault management: Includes monitoring errors or events that occur on the hardware itself, that is, on the physical disk devices. This category covers device errors, such as media errors, read errors, seek-time errors, and bad blocks, as well as component failures, such as FRU failures, controller failures, RAID failures, and fan failures.
Resource management: Includes monitoring the resources that provide storage space on disks. Filesystem space, mount points, logical volumes, mirrored copies, and physical volume links are all important disk resources to monitor. At the filesystem level, you must ensure that sufficient filesystem space is available for applications to run successfully. All filesystems must be mounted and available to their users. Filesystems are often configured using logical volumes. When data is mirrored, it is the logical volumes that are mirrored. Monitoring data mirrors can be critical to maintaining redundancy and increasing availability. Physical volumes and the links to those volumes are considered disk resources.
Performance management: Includes monitoring disk performance metrics, such as disk read and write throughput rates. Swap rates are also important, because they can be used to identify resource contention, which can affect availability. Disk performance resources can be monitored to ensure that I/O traffic is balanced across the available disk resources.
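
As a rough illustration of the resource management category, the following Python sketch checks whether a path is a mount point and whether its filesystem has crossed a space threshold. It is not one of the tools this guide covers. The function names, the 90 percent default threshold, and the list of paths are assumptions chosen for the example, and the percentage it reports may differ slightly from what df shows, because df accounts for blocks reserved for the superuser.

```python
import os
import shutil


def percent_used(total_bytes, used_bytes):
    """Return used space as a percentage of total (0.0 if total is not positive)."""
    if total_bytes <= 0:
        return 0.0
    return 100.0 * used_bytes / total_bytes


def check_filesystem(path, threshold_pct=90.0):
    """Report mount status and space usage for one filesystem path.

    threshold_pct is a hypothetical alert level, not a recommended value.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.total, usage.used)
    return {
        "path": path,
        "mounted": os.path.ismount(path),
        "percent_used": pct,
        "over_threshold": pct >= threshold_pct,
    }


if __name__ == "__main__":
    # Hypothetical list of filesystems to watch; adjust for the local system.
    for fs in ["/"]:
        status = check_filesystem(fs)
        flag = "WARNING" if status["over_threshold"] else "ok"
        print(f"{status['path']}: {status['percent_used']:.1f}% used, "
              f"mounted={status['mounted']} [{flag}]")
```

A script like this could be run periodically from cron, but it covers only filesystem space and mount points. Mirror state, physical volume links, and hardware faults require the platform-specific tools.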

The following sections list various tools that you can use to monitor the disk resources listed in the preceding categories. These tools include basic commands, system instrumentation available through standard interfaces, event monitors, diagnostic tools, and performance monitors.
