8.1. I Can't Boot Because the Partition Is Corrupt
There are a number of reasons why partitions
become corrupt. You may have lost power. Minor electrical surges
can affect what is written to a drive. As hard drives wear out,
bad blocks can corrupt your data.
Yes, hard drive specifications suggest that the
mean time between failures is several hundred thousand hours, which
corresponds to several decades. But that's just an average, under
ideal conditions. If all hard drives were that reliable, RAID would
not be quite so popular.
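To see where "several decades" comes from, divide the rated hours by the number of hours in a year. The sketch below does that arithmetic in the shell; the 300000-hour rating is an illustrative assumption, not a figure from any particular drive, and mtbf_years is a name made up for this example.

```shell
# Convert an MTBF rating in hours to whole years of continuous operation.
# mtbf_years is a hypothetical helper; 300000 is an assumed, illustrative rating.
mtbf_years() {
    hours=$1
    # 24 hours a day, 365 days a year; integer division drops the fraction.
    echo $(( hours / (24 * 365) ))
}

mtbf_years 300000
```

With the assumed rating, this prints 34, since 300000 / 8760 is a little over 34 years.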
If your hard drive is failing, you may not be
able to fix the problem. The best that you can do is minimize the
corruption until you can create a backup. We'll show you how to
back up data from a failing hard drive in the next annoyance.
One reason for the popularity of the Reiser
filesystem is its sensitivity to hard drive corruption. If you find
corruption on your reiserfs-formatted filesystems, you'll
probably have a bit more time to save your data.
8.1.1. Symptoms of Corruption
In this chapter, we'll describe two categories
of filesystem corruption. The first, whose symptoms are described
in the following annoyance, occurs when a hard drive wears out. The
second is the occasional glitch that you can recover from while
preserving the data on your disk. The temporary glitch is most
commonly associated with a power failure. For example, once when I
tripped over a cord, I lost power on my desktop computer. The next
time I booted that computer, I saw the following message:
*** An error occurred during the filesystem check.
*** Dropping you to a shell; the system will reboot
*** when you leave the shell.
Give root password for maintenance
This problem is most commonly associated with
filesystems that do not include a journal, such as ext2. Whenever
there's corruption, there's a risk that Linux won't be able to find
some of your files. Journaling filesystems record pending changes in
a journal, so that interrupted writes can be replayed or discarded
after a crash. But journaling is not a guarantee. I've had this
error even on a journaled ext3 filesystem.
8.1.2. Basic Checks with fsck
Whenever there is corruption, the first Linux
command you should use is fsck.
Ideally, you can apply this command alone to a specific, unmounted
partition. For example, I managed to clean one partition with this
simple fsck command:
# fsck /dev/hda6
fsck 1.35 (28-Feb-2004)
e2fsck 1.35 (28-Feb-2004)
/: recovering journal
Cleaning orphaned inode 16915 (uid=1000, gid=0, mode=0140600, size=0)
Cleaning orphaned inode 16914 (uid=1000, gid=0, mode=0140600, size=0)
Cleaning orphaned inode 16909 (uid=1000, gid=0, mode=0140600, size=0)
Cleaning orphaned inode 302828 (uid=0, gid=0, mode=020600, size=0)
/: clean, 165245/525888 files, 694569/1050241 blocks
Do not run fsck on a mounted partition. If you can't unmount the
desired partition, run fsck from a rescue CD such as Knoppix.
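One way to guard against that mistake is to look for the device in /proc/mounts before you run fsck. The function below is a minimal sketch of that check; is_mounted and its optional second argument are names made up for this example, and on some systems the root filesystem is listed under a different name (such as /dev/root), so a "not mounted" answer is a hint rather than proof.

```shell
# Succeed (exit status 0) if the given device appears in a mount table.
# is_mounted is a hypothetical helper. The optional second argument lets you
# point it at another file; by default it reads /proc/mounts.
is_mounted() {
    device=$1
    mounts=${2:-/proc/mounts}
    # The first field of each line in /proc/mounts is the mounted device.
    awk -v dev="$device" '$1 == dev { found = 1 } END { exit !found }' "$mounts"
}
```

You might then run something like `is_mounted /dev/hda6 && echo "unmount /dev/hda6 first"` before starting a check.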
On most Linux systems, fsck works on a variety of filesystem formats.
Try entering ls /sbin/fsck*. You
should find a variety of commands, such as:
/sbin/fsck /sbin/fsck.ext3 /sbin/fsck.msdos /sbin/fsck.xfs
/sbin/fsck.cramfs /sbin/fsck.jfs /sbin/fsck.reiserfs
Thus, fsck is a
frontend for all the filesystem-specific commands on your system.
The proper utility is chosen automatically by fsck
based on the type of the filesystem you run it on.
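To make the naming convention concrete, here is a rough sketch that maps a filesystem type to the helper it would expect to find. The fsck_helper function and its optional directory argument are assumptions made for illustration; the real frontend works out the filesystem type itself and may search more than one directory.

```shell
# Print the filesystem-specific checker that matches a filesystem type.
# fsck_helper is a hypothetical helper; the optional second argument names
# the directory to search and defaults to /sbin.
fsck_helper() {
    fstype=$1
    dir=${2:-/sbin}
    helper="$dir/fsck.$fstype"
    if [ -x "$helper" ]; then
        echo "$helper"
    else
        echo "no fsck helper for $fstype in $dir" >&2
        return 1
    fi
}
```

For instance, `fsck_helper ext3` prints /sbin/fsck.ext3 on a system that has that file. If you'd rather ask fsck itself, the -N switch tells it to show what it would do without actually running the check.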
8.1.3. Finding Bad Blocks
If your system still has bad blocks, it may be
the first sign of an impending failure. Hard drives can include
hundreds of thousands of blocks. If one goes bad, that may not be
the end of the world. But it may be a symptom of other problems.
Many Linux gurus believe that this is the time to get a new hard
drive.
If you're still not sure, the badblocks
command can help you determine if
your hard drive is in trouble. For example, the following command
writes the ID number associated with each bad block to the
blockbad file:
# badblocks -v /dev/hda7 -o blockbad
Checking for bad blocks (read-only test): 697008/ 1050241
Make sure the target partition is unmounted
before running the badblocks command.
The previous fsck command probably fixed any errors on that
filesystem, and you can continue using Linux normally. The
following output is evidence that the repair was completely
successful:
0 bad blocks
When bad blocks remain, you should rerun fsck
with more severe options,
described in the next section.
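Because badblocks writes one block number per line to the file named with -o, you can also count the entries in that file to decide whether another fsck run is needed. The following is a minimal sketch; count_bad_blocks and report_bad_blocks are names invented for this example.

```shell
# Count the block numbers recorded in a file written by "badblocks -o".
# count_bad_blocks is a hypothetical helper; it counts non-empty lines.
count_bad_blocks() {
    awk 'NF { n++ } END { print n + 0 }' "$1"
}

# Summarize the result and suggest the next step (hypothetical helper).
report_bad_blocks() {
    count=$(count_bad_blocks "$1")
    if [ "$count" -eq 0 ]; then
        echo "no bad blocks recorded"
    else
        echo "$count bad blocks recorded; rerun fsck"
    fi
}
```

After the badblocks command shown earlier, you could run `report_bad_blocks blockbad` in the same directory.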
If you need to keep the hard drive working until
a new one arrives, back it up as soon as possible. We show you how
to do this with a partially corrupt partition in the next
annoyance. But until that new hard drive arrives, there are things
you can do to keep your current hard drive going.
8.1.4. Fixing Bad Blocks
The fsck
command can help you check, mark, and fix bad blocks, and can help
preserve the health of your filesystems. For that reason, current
distributions force a periodic fsck
on each filesystem formatted in the popular ext2 and ext3
formats. You can do your own fsck
maintenance with the switches shown in
Table 8-1; some of these switches are not documented on the
fsck manpage.
Table 8-1. fsck command switches
| Switch | Description |
| -b | Specifies a different superblock, which you can find on ext2/ext3 systems with the dumpe2fs command |
| -c | Calls the badblocks command with the existing superblock size |
| -f | Forces a check even if the filesystem appears clean (ext2/ext3); on MS-DOS filesystems, salvages unused chains to files |
| -v | Sets verbose mode |
| -y | Specifies a default answer of "yes"; otherwise, fsck interactively asks if you want to mark bad blocks |
SUSE formats its partitions by default as
ReiserFS filesystems. This filesystem is considered so reliable
that SUSE doesn't force a periodic fsck on such partitions.
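The -b switch needs the location of a backup superblock. On ext2/ext3 filesystems, dumpe2fs lists these on lines that contain "Backup superblock at". The filter below is a small sketch that pulls the block numbers out of that listing; backup_superblocks is a name made up for this example, and the numbers you see depend on your filesystem.

```shell
# Read dumpe2fs output on standard input and print the block number of
# each backup superblock, one per line (hypothetical helper).
backup_superblocks() {
    # Matching lines look like:
    #   Backup superblock at 32768, Group descriptors at 32769-32769
    awk '/Backup superblock at/ { sub(/,$/, "", $4); print $4 }'
}
```

You could run `dumpe2fs /dev/hda5 | backup_superblocks` and then pass one of the listed numbers to fsck with -b.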
The following command checks for bad blocks on a partition and
marks any that it finds. If you're fortunate, each fsck
"pass" over your partition proceeds without
incident. The following is sample output from a run on a good
partition:
# fsck -cyfv /dev/hda5
fsck 1.35 (28-Feb-2004)
e2fsck 1.35 (28-Feb-2004)
Checking for bad blocks (read-only test): done
Pass 1: Checking inodes, blocks and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
. . .
However, I had problems with a different
partition. In the middle of this process, the test seemed to stop.
I was tempted to interrupt the command by pressing Ctrl-C, but
progress continued after a few minutes. As you can see here, the
test turned up problems:
Duplicate blocks found.... invoking duplicate block passes
Pass 1B: Rescan for duplicate/bad blocks
Duplicate/bad block(s) in inode 1448: 13568
Pass 1C Scan directories for inodes with dup blocks.
Error reading block 697043 (Attempt to read block from filesystem resulted
in a short read). Ignore error?
yes
Force rewrite?
yes
....
Pass 1D: Reconciling duplicate blocks
(There are 4 inodes containing duplicate/bad blocks)
File <The journal inode> (inode #8, mod time Fri Nov 12 08:43:05 2005)
has 10 duplicate block(s), shared with 1 file(s):
<The bad blocks inode> (inode #1, mod time Fri Jan 7 12:11:24 2006)
Clone duplicate/bad blocks?
yes
Error reading block 4049 (Attempt to read block from filesystem resulted in short read).
Ignore error?
yes
Force rewrite?
yes
The check continued, revealing hundreds of
errors. But the most important error is near the beginning of the
output. As you can see, there is corruption even in the journal. Any
pointers from the journal to other files are thus suspect.
After your bad blocks are marked, Linux knows to
avoid reading data from those locations. The time is right for a
backup. If standard techniques described in Chapter 2 don't work,
see the next annoyance.