Need help with corrupted Linux raid and filesystems

From: Jarle Aase (jgaa_at_jgaa.com)
Date: 11/18/04


Date: Thu, 18 Nov 2004 01:24:48 +0100

Hi,

My mailserver has had some hardware problems over the last few months.
The machine was totally reinstalled about 6 weeks ago, and has appeared to
be OK, until last week when the 3COM NIC started to fail every 1 - 2 days.
I shut down the machine today and replaced the NIC. When I tried to boot,
Linux found no valid raid partitions.

From the boot screen:
   md: Autodetecting RAID arrays
   md: invalid raid superblock magic on hda1
   md: hda1 has invalid sb, not importing!
   md: could not import hda1!
   md: invalid raid superblock magic on hda3
   .... (repeats itself for each partition)
   ...
   Kernel Panic: VFS: Unable to mount root fs on hda2
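
The "invalid raid superblock magic" lines refer to the md superblock that
the kernel expects on each member partition. As a rough aid for checking
this by hand, here is a minimal Python sketch. It assumes the arrays use the
0.90 superblock format (the one written by raidtools), where the superblock
sits in the last 64 KiB-aligned 64 KiB block of the partition and starts
with the magic 0xa92b4efc in host byte order (little-endian on x86). The
function names are made up for illustration, and the code only reads.

```python
import struct

MD_SB_MAGIC = 0xA92B4EFC        # magic number of a 0.90-format md superblock
MD_RESERVED_BYTES = 64 * 1024   # size and alignment of the reserved area


def md090_superblock_offset(device_size_bytes):
    """Byte offset where a 0.90 md superblock would sit on a device of this size."""
    return (device_size_bytes & ~(MD_RESERVED_BYTES - 1)) - MD_RESERVED_BYTES


def has_md090_magic(path):
    """Return True if the expected superblock location starts with the md magic.

    Assumption: the superblock was written on a little-endian machine (x86).
    The device or image file is opened read-only.
    """
    with open(path, "rb") as dev:
        dev.seek(0, 2)                      # seek to the end to learn the size
        size = dev.tell()
        if size < MD_RESERVED_BYTES:
            return False                    # too small to hold a superblock
        dev.seek(md090_superblock_offset(size))
        raw = dev.read(4)
    return len(raw) == 4 and struct.unpack("<I", raw)[0] == MD_SB_MAGIC
```

Calling has_md090_magic() on a member partition, or on an image of it made
with dd, returns False when the magic is missing at that location, which is
the condition the kernel messages above describe.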

The disks are partitioned as:
  /dev/hd?1 --> md0 - 100mb /boot (ext2)
  /dev/hd?2 --> md1 - 2gb swap
  /dev/hd?3 --> md2 - 8gb / (ext3)
  /dev/hd?4 --> md3 - 67gb /var/lib/courier/spool (xfs)

The machine is an (old) P4 1.6 GHz based PC/server with 1,256 GB RAM, 2 x
Western Digital 80 GB ATA disks in Linux raid 1 (mirror), Linux kernel 2.6.6
or 2.6.8 (I'm not sure about the exact version, as I'm unable to access the
disks) with lilo as boot-loader. Linux is configured for 4 CPUs and
hyperthreading (to be fast if I boot it on my dual CPU workstation). The
Linux distribution is Debian "woody" with some packages from backports.org
to handle kernel 2.6.*

I unmounted the disks, and placed one disk as a second disk in my Linux
workstation. I edited /etc/raidtab on the workstation's disk to fit the raid
on the second disk, and ran "mkraid" on each of the failed raid devices.
The command did not complain. Then I tried to mount the raid devices (as I
have done successfully a few times before when Linux raid has failed).
Mount did not recognize the file system type on any of the newly added raid
devices. I tried to run "fsck.ext3" and "xfs_repair" in read-only mode.
Both complained. I then ran "fsck.ext3 -y /dev/md2" and
"xfs_repair /dev/md3". xfs_repair complained about a bad superblock, but
found a second superblock. It then suggested mounting the disk to replay
the log. Mount failed this time too. I ran "xfs_repair -L /dev/md3",
and this time it found _lots_ of errors. When I mounted the device, it
turned out that all remaining files on that file system were in /lost+found.
None had their original file name. The ext3 filesystem on /dev/md2 was also
destroyed, and the remaining files are in that filesystem's lost+found, most
of them without their original names.

If I run "fsck.ext3 -n /dev/md0", fsck complains about a missing superblock,
and then starts to list lots of problems on the filesystem (like open
files, bad mode ...). The /boot filesystem that resides there should not
have any open files (at least not for writing), so the errors indicate
that something is very wrong indeed.

When I shut down the machine, I checked /var/log/kern.log - and all that
caught my attention was problems with the network card. Aside from the
network problem, the machine appeared to be fine. No error messages were
printed to the console - as I'm used to if Linux detects serious problems
with the disks.

I made a copy of the second disk, and the copy appears to have the same
problems as the first one. (I tried to boot from the second disk, but when
that failed with the same errors as the first one, I decided to leave the
disk unaltered, and made a copy to a third disk that was used for the
recovery attempts).

Using "less -f" on a non-altered disk I can see that /dev/hd?1 (boot/ext2)
contains binary data. /dev/hd?3 (ext3/root filesystem) begins with a large
block of zeros, then a large block of 0xff, then more zeros and then binary
data. /dev/hd?4 (xfs/mailspool) begins with a small block of zeros, then
binary data that looks like the second block on another machine with the
xfs filesystem, and then binary data and text from email messages on the
filesystem.
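
A less eyeball-dependent way to do the same inspection is to look for the
filesystem magic numbers directly. The Python sketch below is only an
illustration (identify_filesystem is a made-up name). It relies on two
on-disk facts: an ext2/ext3 superblock starts at byte 1024 and stores the
16-bit little-endian magic 0xEF53 at offset 56 within it, and an XFS primary
superblock starts at byte 0 with the four bytes "XFSB". It assumes the
filesystem starts at the beginning of the partition, which holds for a
raid 1 member with the superblock stored at the end of the device.

```python
import struct

EXT_MAGIC = 0xEF53          # ext2/ext3 s_magic
EXT_MAGIC_OFFSET = 1024 + 56  # superblock starts at 1024, s_magic at +56
XFS_MAGIC = b"XFSB"         # first four bytes of an XFS primary superblock


def identify_filesystem(path):
    """Return "xfs", "ext2/ext3", or None based on superblock magic numbers.

    The device or image file is opened read-only and only the first
    2 KiB are read.
    """
    with open(path, "rb") as dev:
        head = dev.read(2048)
    if head[:4] == XFS_MAGIC:
        return "xfs"
    if len(head) >= EXT_MAGIC_OFFSET + 2:
        (magic,) = struct.unpack_from("<H", head, EXT_MAGIC_OFFSET)
        if magic == EXT_MAGIC:
            return "ext2/ext3"
    return None
```

A result of None for a partition that should hold xfs would be consistent
with the zeros seen at the start of /dev/hd?4, since an intact primary XFS
superblock begins with "XFSB" rather than zeros. For ext2/ext3 the first
1024 bytes are not part of the superblock, so leading zeros alone say
nothing about whether the magic is still present.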

At first glance, it seems unlikely that all three filesystems on the disk(s)
should be corrupted to this degree. The machine was rebooted about 30 hours
before, and no problems (except for the NIC) surfaced until this last boot.
Also - the partition table seems OK, and lilo works up to the point where
the root file system cannot be mounted. If the filesystem were corrupted
with random data all over (as I first suspected) - lilo or the Linux image
should have aborted with an illegal instruction.

I strongly suspect (and certainly hope!) that there is some obvious error in
my recovery attempts so far. Please reply if you have a clue about how to
get the raid and filesystems back. I hope to avoid reinstalling and
restoring from backup, as the faulty NIC has caused the backup (to another
machine) for the last few days to fail. Also - my email is not working at
the moment - so please reply to this thread.

Jarle

-- 
Jarle Aase                http://www.jgaa.com
mailto:jgaa@jgaa.com
<<< no need to argue - just kill'em all! >>>

