Need help with corrupted Linux raid and filesystems
From: Jarle Aase (jgaa_at_jgaa.com)
Date: 11/18/04
- Next message: essteeaenn_at_worldbadminton.com: "Re: Need help with corrupted Linux raid and filesystems"
- Previous message: Chris F.A. Johnson: "Re: C/C++ code beautifier"
- Next in thread: essteeaenn_at_worldbadminton.com: "Re: Need help with corrupted Linux raid and filesystems"
- Reply: essteeaenn_at_worldbadminton.com: "Re: Need help with corrupted Linux raid and filesystems"
Date: Thu, 18 Nov 2004 01:24:48 +0100
Hi,
My mailserver has had some problems with its hardware over the last few months.
The machine was totally reinstalled about 6 weeks ago, and has appeared to
be OK, until last week when the 3COM NIC started to fail every 1 - 2 days.
I shut down the machine today and replaced the NIC. When I tried to boot,
Linux found no valid raid partitions.
From the boot screen:
md: Autodetecting RAID arrays
md: invalid raid superblock magic on hda1
md: hda1 has invalid sb, not importing!
md: could not import hda1!
md: invalid raid superblock magic on hda3
.... (repeats itself for each partition)
...
Kernel Panic: VFS: Unable to mount root fs on hda2
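The "invalid raid superblock magic" message can be checked by hand from another machine. The following is a minimal sketch, assuming the arrays use the old 0.90 md superblock format (stored in the last 64 KiB-aligned 64 KiB block of each member partition) and a little-endian x86 host. The function names are hypothetical and not part of any md tool; the code only reads from the device.

```python
import io
import struct

MD_SB_MAGIC = 0xA92B4EFC        # magic number of a 0.90 md superblock
MD_RESERVED_BYTES = 64 * 1024   # the superblock sits in the last aligned 64 KiB


def md090_superblock_offset(device_size: int) -> int:
    """Byte offset where a 0.90 superblock is expected on a member of this size."""
    aligned = device_size & ~(MD_RESERVED_BYTES - 1)
    return aligned - MD_RESERVED_BYTES


def size_of(dev) -> int:
    """Size in bytes of an open file, image, or block device."""
    return dev.seek(0, io.SEEK_END)


def has_md090_magic(dev, device_size: int) -> bool:
    """True if the 4 bytes at the expected superblock offset match the md magic."""
    dev.seek(md090_superblock_offset(device_size))
    raw = dev.read(4)
    if len(raw) < 4:
        return False
    # Assumption: little-endian host, since 0.90 superblocks are host-endian.
    (magic,) = struct.unpack("<I", raw)
    return magic == MD_SB_MAGIC
```

To use it, open a member partition (or an image of one) with open(path, "rb"), pass the handle to size_of(), and then call has_md090_magic() with the handle and that size. A False result on every partition would match the boot messages above.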
The disks are partitioned as:
/dev/hd?1 --> md0 - 100 MB  /boot (ext2)
/dev/hd?2 --> md1 - 2 GB    swap
/dev/hd?3 --> md2 - 8 GB    / (ext3)
/dev/hd?4 --> md3 - 67 GB   /var/lib/courier/spool (xfs)
The machine is an (old) P4 1.6 GHz based PC/server with 1.256 GB RAM, 2 x
Western Digital 80 GB ATA disks in Linux raid 1 (mirror), and Linux kernel
2.6.6 or 2.6.8 (I'm not sure about the exact version, as I'm unable to
access the disks), with lilo as the boot loader. The kernel is configured
for 4 CPUs and hyperthreading (so that it is fast if I boot it on my
dual-CPU workstation). The Linux distribution is Debian "woody" with some
packages from backports.org to handle kernel 2.6.*.
I removed the disks from the server, and installed one of them as a second
disk in my Linux workstation. I edited /etc/raidtab on the workstation's
disk to match the raid on the second disk, and ran "mkraid" on each of the
failed raid devices.
The command did not complain. Then I tried to mount the raid devices (as I
have done successfully before a few times when Linux raid has failed).
Mount did not recognize the file system type on any of the newly added raid
devices. I tried to run "fsck.ext3" and "xfs_repair" in read-only mode.
Both complained. I then ran "fsck.ext3 -y /dev/md2" and
"xfs_repair /dev/md3". xfs_repair complained about a bad superblock, but
found a second superblock. It then suggested mounting the filesystem to
replay the log. The mount failed this time as well. I ran
"xfs_repair -L /dev/md3", and this time it found _lots_ of errors. When I
mounted the device, it turned out that all remaining files on that
filesystem were in /lost+found. None had their original file names. The
ext3 filesystem on /dev/md2 was also destroyed, with the remaining files in
that filesystem's lost+found, most of them without their original names.
If I run "fsck.ext3 -n /dev/md0", fsck complains about a missing superblock,
and then starts to list lots of problems on the filesystem (like open
files, bad mode ...). The /boot filesystem that resides there should not
have any open files (at least not for writing), so the errors indicate
that something is very wrong indeed.
When I shut down the machine, I checked /var/log/kern.log - and all that
caught my attention was problems with the network card. Aside from the
network problem, the machine appeared to be fine. No error messages were
printed to the console, as I would expect when Linux detects serious
problems with the disks.
I made a copy of the second disk, and the copy appears to have the same
problems as the first one. (I tried to boot from the second disk, but when
that failed with the same errors as the first one, I decided to leave the
disk unaltered, and made a copy to a third disk that was used for the
recovery attempts).
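Working only on a copy, as described above, is easier to trust if the copy is verified. The following is a minimal sketch of a chunked copy that returns a SHA-256 digest of the data read, so the result can be compared against a later re-read of the copy. The function names and the 1 MiB chunk size are assumptions for illustration; the sketch does not handle read errors from a failing disk.

```python
import hashlib

CHUNK = 1024 * 1024  # read 1 MiB at a time (arbitrary choice)


def copy_image(src_path, dst_path, chunk=CHUNK):
    """Copy src to dst in chunks; return (bytes copied, SHA-256 of the data read)."""
    digest = hashlib.sha256()
    total = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(chunk)
            if not block:
                break
            dst.write(block)
            digest.update(block)
            total += len(block)
    return total, digest.hexdigest()


def sha256_of(path, chunk=CHUNK):
    """SHA-256 hex digest of a file or device, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()
```

If the digest returned by copy_image() equals sha256_of() on the destination, the copy matches what was read from the source, and repair tools can be pointed at the copy while the original stays untouched.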
Using "less -f" on an unaltered disk, I can see that /dev/hd?1 (boot/ext2)
contains binary data. /dev/hd?3 (ext3/root filesystem) begins with a large
block of zeros, then a large block of 0xff bytes, then more zeros, and then
binary data. /dev/hd?4 (xfs/mailspool) begins with a small block of zeros,
then binary data that looks like the second block on another machine with
an xfs filesystem, and then binary data and text from email messages on the
filesystem.
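The manual inspection with "less -f" can be made a little more systematic. The following is a minimal sketch that looks for the standard on-disk magic numbers (0xEF53 at byte 1080 for ext2/ext3, "XFSB" at byte 0 for xfs) and summarizes the start of a partition as runs of all-zero, all-0xff, and other blocks. The function names, the 4096-byte block size, and the 1024-block limit are assumptions for illustration; the code only reads.

```python
EXT_MAGIC_OFFSET = 1024 + 56   # s_magic field inside the ext2/ext3 superblock
EXT_MAGIC = b"\x53\xef"        # 0xEF53 stored little-endian
XFS_MAGIC = b"XFSB"            # first four bytes of an xfs filesystem


def probe_signature(dev) -> str:
    """Return 'xfs', 'ext2/ext3', or 'unknown' based on on-disk magic numbers."""
    dev.seek(0)
    head = dev.read(EXT_MAGIC_OFFSET + 2)
    if head[:4] == XFS_MAGIC:
        return "xfs"
    if head[EXT_MAGIC_OFFSET:EXT_MAGIC_OFFSET + 2] == EXT_MAGIC:
        return "ext2/ext3"
    return "unknown"


def classify(block: bytes) -> str:
    """Label a block as all zeros, all 0xff bytes, or other data."""
    if block == b"\x00" * len(block):
        return "zero"
    if block == b"\xff" * len(block):
        return "0xff"
    return "data"


def summarize_runs(dev, block_size=4096, max_blocks=1024):
    """Return [(kind, start_offset, n_blocks), ...] for the start of the device."""
    dev.seek(0)
    runs = []
    for i in range(max_blocks):
        block = dev.read(block_size)
        if not block:
            break
        kind = classify(block)
        if runs and runs[-1][0] == kind:
            runs[-1][2] += 1          # extend the current run
        else:
            runs.append([kind, i * block_size, 1])
    return [tuple(r) for r in runs]
```

Run against a raw partition or an image opened with open(path, "rb"), probe_signature() shows whether the primary superblock magic is still in place, and summarize_runs() gives the same zero / 0xff / data picture described above in a form that is easy to compare between disks.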
At first glance, it seems unlikely that all three filesystems on the disk(s)
would be corrupted to this degree. The machine was rebooted about 30 hours
before, and no problems (except for the NIC) surfaced until this last boot.
Also - the partition table seems OK, and booting via lilo works until the
root file system cannot be mounted. If the filesystems were corrupted with
random data all over (as I first suspected), lilo or the Linux image should
have aborted with an illegal instruction.
I strongly suspect (and certainly hope!) that there is some obvious error in
my recovery attempts so far. Please reply if you have a clue about how to
get the raid and filesystems back. I hope to avoid reinstalling and
restoring from backup, as the faulty NIC has caused the backup (to another
machine) for the last few days to fail. Also - my email is not working at
the moment - so please reply to this thread.
Jarle
-- Jarle Aase http://www.jgaa.com mailto:jgaa@jgaa.com <<< no need to argue - just kill'em all! >>>