Hack 78. Avoid Catastrophic Disk Failure

Access your hard drive's built-in diagnostics using Linux utilities to predict and prevent disaster.

Nobody wants to walk in after a power failure only to realize that, in addition to everything else, because of a dead hard drive they now have to rebuild entire servers and grab backed-up data from tape. Of course, the best way to avoid this situation is to be alerted when something is amiss with your SCSI or ATA hard drive, before it finally fails. Ideally the alert would come straight from the hard drive itself, but until we're able to plug an RJ-45 directly into a hard drive we'll have to settle for the next best thing, which is the drive's built-in diagnostics. For several years now, ATA and SCSI drives have supported a standard mechanism for disk diagnostics called "Self Monitoring, Analysis, and Reporting Technology" (SMART), aimed at predicting hard drive failures. It wasn't long before Linux had utilities to poll hard drives for this vital information.

The smartmontools project (http://smartmontools.sourceforge.net) produces a SMART monitoring daemon called smartd and a command-line utility called smartctl, which can do most things on demand that the daemon does in the background periodically. With these tools, along with standard Linux filesystem utilities such as debugfs and tune2fs, there aren't many hard drive issues you can't fix.

But before you can repair anything or transform yourself into a seemingly superpowered hard-drive hero with powers on loan from the realm of the supernatural, you have to know what's going on with your drives, and you need to be alerted to changes in the status of the health of your drives.

First, you should probably get to know your drives a bit, which smartctl can help out with. If you know that there are three drives in use on the system, but you're not sure which one the system is labeling /dev/hda, run the following command:

 # smartctl -i /dev/hda

This will tell you the model and capacity information for that drive. This is also very helpful in figuring out which vendor you'll need to call for a replacement drive if you bought the drive yourself. Once you know what's what, you can move on to bigger tasks.
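With several drives in one box, it can help to condense that output to one line per device. The following is a minimal sketch, not something smartmontools ships: drive_summary is a hypothetical helper name, and it assumes the ATA-style "Device Model:" and "Serial Number:" labels that smartctl prints for the drive examined in this hack. Other drive types may label these fields differently.

```shell
# Hypothetical helper: condense `smartctl -i` output (read from stdin)
# into a single line of the form "MODEL (serial SERIAL)".
drive_summary() {
    awk -F': *' '
        /^Device Model:/  { model = $2 }
        /^Serial Number:/ { serial = $2 }
        END { printf "%s (serial %s)\n", model, serial }
    '
}

# Example use, as root (the device names are assumptions; adjust them):
#   for dev in /dev/hda /dev/hdb /dev/hdc; do
#       printf '%s: ' "$dev"
#       smartctl -i "$dev" | drive_summary
#   done
```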

Typically, before I even set up the smartd daemon to do long-term, continuous monitoring of a drive, I first run a check from the command line (using the smartctl command) to make sure I'm not wasting time setting up monitoring on a disk that already has issues. Try running a command like the following to ask the drive about its overall health:

 # smartctl -H /dev/hda
 smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
 Home page is http://smartmontools.sourceforge.net/
 === START OF READ SMART DATA SECTION ===
 SMART overall-health self-assessment test result: PASSED

Well, this is good news: the drive says it's in good shape. However, there really wasn't much to look at there. Let's get a more detailed view of things using the -a, or "all," flag. This gives us lots of output, so let's go over it in pieces. Here's the first bit:

 # smartctl -a /dev/hda
 smartctl version 5.33 [i386-redhat-linux-gnu] Copyright (C) 2002-4 Bruce Allen
 Home page is http://smartmontools.sourceforge.net/
 === START OF INFORMATION SECTION ===
 Device Model:     WDC WD307AA
 Serial Number:    WD-WMA111283666
 Firmware Version: 05.05B05
 User Capacity:    30,758,289,408 bytes
 Device is:        In smartctl database [for details use: -P show]
 ATA Version is:   4
 ATA Standard is:  Exact ATA specification draft version not indicated
 Local Time is:    Mon Sep 5 17:48:09 2005 EDT
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled

This is the exact same output that smartctl -i would've shown you earlier. It tells you the model, the firmware version, the capacity, and which version of the ATA standard is implemented with this drive. Useful, but not really a measure of health per se. Let's keep looking:

 === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED

This is the same output that smartctl -H showed earlier. Glad we passed, but if we just barely made it, that's not passing to a discriminating administrator. More!

 General SMART Values:
 Offline data collection status:  (0x05) Offline data collection activity
                                         was aborted by an interrupting command
                                         from host.
                                         Auto Offline Data Collection: Disabled.
 Self-test execution status:      ( 113) The previous self-test completed having
                                         the read element of the test failed.

These are the drive's general SMART status values. We can see here that automatic offline data collection is disabled, which means the drive won't run "offline" tests on its own (these run automatically when the disk would otherwise be idle). We can enable it using the command smartctl -o on /dev/hda, but this may not be what you want, so let's hold off on that for now. The self-test execution status shows that a read operation failed during the last self-test, so we'll keep that in mind as we continue looking at the data:

 Total time to complete Offline data collection: (2352) seconds.
 Offline data collection capabilities:    (0x1b) SMART execute Offline immediate.
                                                 Auto Offline data collection on/off
                                                 support.
                                                 Suspend Offline collection upon new
                                                 command.
                                                 Offline surface scan supported.
                                                 Self-test supported.
                                                 No Conveyance Self-test supported.
                                                 No Selective Self-test supported.
 SMART capabilities:                    (0x0003) Saves SMART data before entering
                                                 power-saving mode.
                                                 Supports SMART auto save timer.
 Error logging capability:                (0x01) Error logging supported.
                                                 No General Purpose Logging support.

This output is just a list of the general SMART-related capabilities of the drive, which is good to know, especially for older drives that might not have all of the features you would otherwise assume to be present. Capabilities and feature support in the drives loosely follow the version of the ATA standard in place when the drive was made, so it's not safe to assume that an ATA-4 drive will support the same feature set as an ATA-5 or later drive.

Let's continue on our tour of the output:

 Short self-test routine recommended polling time:    (  2) minutes.
 Extended self-test routine recommended polling time: ( 42) minutes.

When you tell this drive to do a short self-test, it'll tell you to wait two minutes for the results. A long test will take 42 minutes. If this drive were new enough to support other self-test types (besides just "short" and "extended"), there would be lines for those as well. Here's the next section of output:
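Starting a test yourself takes one command, smartctl -t short (or -t long), and the results land in the self-test log, which smartctl -l selftest prints. As a rough sketch, the two steps can be wrapped in a small function. The function name run_short_test and the SMARTCTL override variable are my own inventions for illustration, and the default wait simply reuses the two-minute polling time this particular drive reports.

```shell
# Sketch: start a short self-test, wait, then print the self-test log.
# SMARTCTL may be overridden, for example to point at a specific binary.
SMARTCTL="${SMARTCTL:-smartctl}"

run_short_test() {
    dev="$1"
    wait_minutes="${2:-2}"            # this drive recommends 2 minutes

    "$SMARTCTL" -t short "$dev"       # ask the drive to begin the test
    sleep $((wait_minutes * 60))      # the test runs inside the drive itself
    "$SMARTCTL" -l selftest "$dev"    # print the self-test log
}

# Example, as root:  run_short_test /dev/hda 2
```

The rest of the smartctl -a output picks up with the attribute table: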

 SMART Attributes Data Structure revision number: 16
 Vendor Specific SMART Attributes with Thresholds:
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x000b   200   200   051    Pre-fail  Always       -       0
   3 Spin_Up_Time            0x0006   101   091   000    Old_age   Always       -       2550
   4 Start_Stop_Count        0x0012   100   100   040    Old_age   Always       -       793
   5 Reallocated_Sector_Ct   0x0012   198   198   112    Old_age   Always       -       8
   9 Power_On_Hours          0x0012   082   082   000    Old_age   Always       -       13209
  10 Spin_Retry_Count        0x0013   100   100   051    Pre-fail  Always       -       0
  11 Calibration_Retry_Count 0x0013   100   100   051    Pre-fail  Always       -       0
  12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       578
 196 Reallocated_Event_Count 0x0012   196   196   000    Old_age   Always       -       4
 197 Current_Pending_Sector  0x0012   199   199   000    Old_age   Always       -       10
 198 Offline_Uncorrectable   0x0012   199   198   000    Old_age   Always       -       10
 199 UDMA_CRC_Error_Count    0x000a   200   253   000    Old_age   Always       -       0
 200 Multi_Zone_Error_Rate   0x0009   200   198   051    Pre-fail  Offline      -       0

Details on how to read this chart, in gory-enough detail, are in the smartctl manpage. The most immediate values to concern yourself with are the ones labeled Pre-fail. On those lines, the sign that immediate action is needed is the number in the VALUE column dropping to or below the number in the THRESH column. Continuing on:
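The Pre-fail rule is easy to check mechanically. Here is a minimal sketch that reads the attribute table on stdin and prints any Pre-fail attribute whose VALUE is at or below THRESH. The function name prefail_at_threshold is hypothetical, and the script assumes the ten-column layout shown above (ID in column 1, VALUE in column 4, THRESH in column 6, TYPE in column 7). Other smartctl versions may arrange things differently.

```shell
# Hypothetical helper: read `smartctl -a` (or `smartctl -A`) output on stdin
# and print every Pre-fail attribute whose VALUE is at or below THRESH.
# Exit status is 1 if any such attribute was found, 0 otherwise.
prefail_at_threshold() {
    awk '
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ && $7 == "Pre-fail" && ($4 + 0) <= ($6 + 0) {
            printf "%s %s: VALUE %d <= THRESH %d\n", $1, $2, $4, $6
            bad = 1
        }
        END { exit bad + 0 }
    '
}

# Example, as root:
#   smartctl -A /dev/hda | prefail_at_threshold || echo "replace this drive"
```

For the drive examined here it prints nothing, since every Pre-fail VALUE is well above its THRESH. The smartctl -a output finishes with the error and self-test logs: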

 SMART Error Log Version: 1
 No Errors Logged
 SMART Self-test log structure revision number 1
 Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
 # 1  Extended offline    Completed: read failure        10%        97         57559262
 # 2  Extended offline    Aborted by host                50%        97         -
 # 3  Short offline       Completed without error        00%        97         -
 Device does not support Selective Self Tests/Logging

This output is the log output from the last three tests. The numbering of the tests is actually the reverse of what you might think: the one at the top of the list, labeled as #1, is actually the most recent test. In that test we can see that there was a read error, and the LBA address of the first failure is posted (57559262). If you want to see how you can associate that test with an actual file, Bruce Allen has posted a wonderful HOWTO for this at http://smartmontools.sourceforge.net/BadBlockHowTo.txt.
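As I recall it, the first step in that HOWTO is to turn the drive-wide LBA into a block number inside the affected filesystem: subtract the partition's starting sector, multiply by the 512-byte sector size, and divide by the filesystem block size, discarding any remainder. The sketch below does just that arithmetic. The function name is hypothetical, and the starting sector (63) and block size (4096) in the example are made-up values, not facts about this drive. On a real system you would read them from fdisk -lu and tune2fs -l.

```shell
# Convert a drive LBA (counted in 512-byte sectors) to a filesystem block.
#   $1 = LBA of the failing sector, from the self-test log
#   $2 = first sector of the partition that contains it
#   $3 = filesystem block size in bytes
lba_to_fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

# Hypothetical example: partition starts at sector 63, 4096-byte blocks.
#   lba_to_fs_block 57559262 63 4096
```

Treat the HOWTO itself as the authority for what to do with the resulting block number.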

Now that you've seen what smartctl can find out for us, let's figure out how to get smartd configured to automate the monitoring process and let us know if danger is imminent.

Fortunately, putting together a basic configuration takes mere seconds, and more complex configurations don't take a great deal of time to put together, either. The smartd process gets its configuration from /etc/smartd.conf on most systems, and for a small system (or a ton of small systems that you don't want to generate copious amounts of mail), a line similar to the following will get you the bare essentials:

 /dev/hda -H -m jonesy@linuxlaboratory.org

This will do a (very) simple health status check on the drive, and email me only if it fails. If a health status check fails, it means the drive could very well fail in the next 24 hours, so have an extra drive handy!
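If you also want the same pass/fail answer in your own scripts, smartctl reports it through its exit status, which the smartctl manpage describes as a bit mask. The following sketch decodes only the bits relevant to a -H check. The function name health_verdict is hypothetical, and the bit meanings are as I understand them from the manpage, so check the "RETURN VALUES" section for your version before relying on this.

```shell
# Decode the exit status of `smartctl -H DEVICE` into a short verdict.
# Only bits 0-3 are examined; the remaining bits are ignored here.
health_verdict() {
    status="$1"
    if [ $(( status & 8 )) -ne 0 ]; then
        echo "FAILING"    # bit 3: the drive itself reports that it is failing
    elif [ $(( status & 7 )) -ne 0 ]; then
        echo "UNKNOWN"    # bits 0-2: bad command line, open failed, or a SMART command failed
    else
        echo "PASSED"     # overall health check passed
    fi
}

# Example, as root:
#   smartctl -H /dev/hda >/dev/null; health_verdict $?
```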

There are more sophisticated setups as well that can alert you to changes in the status that don't necessarily mean certain death. Let's look at a more complex configuration line:

 /dev/hda -l selftest -l error -I 9 -m jonesy@linuxlaboratory.org -s L/../../7/02

This one will look for changes in the self-test and error logs for the device, run a long self-test every Sunday between 2 and 3 A.M., and send me messages about any attribute except for ID 9, the Power_On_Hours attribute, which I don't care about for the purposes of determining whether a disk is bad (you can check the smartctl -a output to determine an attribute's ID). The -I option is often used with attribute numbers 194 or 231, which usually hold the drive's temperature. It would be bad to get messages about the constantly changing temperature of the drive!
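The argument to -s is the least obvious part of that line. According to the smartd.conf manpage, it is a regular expression that smartd compares against a string of the form T/MM/DD/d/HH, where T is the test type (L for long, S for short), MM and DD are the month and day, d is the day of the week (1 for Monday through 7 for Sunday), and HH is the hour. A quick way to check what a pattern will match is to try it with grep. This is only a sketch: schedule_matches is a hypothetical helper, and matching the whole string with grep -x is my approximation of how smartd applies the pattern.

```shell
# Hypothetical helper: would a smartd -s pattern fire at a given moment?
#   $1 = pattern as written in smartd.conf, e.g. 'L/../../7/02'
#   $2 = moment in T/MM/DD/d/HH form,       e.g. 'L/09/04/7/02'
# Returns 0 (success) on a match, nonzero otherwise.
schedule_matches() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# Build the "moment" string for a long test at the current time:
#   now="L/$(date +%m/%d/%u/%H)"
#   schedule_matches 'L/../../7/02' "$now" && echo "a long test would start now"
```

With the pattern above, L/09/04/7/02 (a Sunday at 2 A.M.) matches, while the same hour on a Monday, L/09/05/1/02, does not.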

Once you have your configuration file in order, the only thing left to do is start the service. Inevitably, you'll get more mail than you'd like during the first few runs, but as time goes on (and you read more of the huge manpage) you'll learn to get what you want from smartd. For me, just the peace of mind is worth the hours I've spent getting a working configuration. When you're able to avert certain catastrophe for a client or yourself, I'm sure you'll say the same.