Re: mounting LVM partitions fails after etch upgrade



dear all,

a while back I posted to this list because my file systems on LVM over
RAID1 would not mount cleanly anymore after the upgrade from sarge to
etch. this weekend I had time to poke around in the data on both
disks, and found out what was wrong.

as it turns out, for almost a year *no* data at all was written to
one of the disks!! that didn't stop mdadm from happily reporting that
everything with the array was in perfect order, though. I rebooted the
system a few times during this period, and not even when assembling
the array did it complain about anything.

after the upgrade of mdadm, it seems that the s/w raid started using
both disks again and, by writing data to the 'old' disk, corrupted
some of the out-of-date data there. I'm glad I didn't try to fix this
with fsck; it probably would have completely toasted the data on both
disks.

how can such a catastrophic failure of a raid array happen, and worse,
go completely unnoticed? I don't think it's a config issue, it
perfectly mirrored all data before that point. both disks are
physically perfect, not a single bad block.
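
One way to cross-check a mirror that claims to be healthy is to compare the
per-member event counters that `mdadm --examine` prints; members of an in-sync
array should normally carry the same value. The sketch below is only an
illustration, not what was run on this system: the helper names `events_of`
and `same_events` and the device names in the usage comment are made up, and
it assumes the examine output contains a line of the form `Events : <value>`.

```shell
#!/bin/sh
# Print the Events counter found in `mdadm --examine` output read from stdin.
# Assumes a line shaped like "         Events : 0.4712" (value format varies
# with the superblock version).
events_of() {
    awk -F' : ' '/^[ \t]*Events[ \t]*:/ { print $2; exit }'
}

# Succeed only if both counters are non-empty and identical.
same_events() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Usage (needs root; /dev/sda2 and /dev/sdb2 are placeholder member names):
#   a=$(mdadm --examine /dev/sda2 | events_of)
#   b=$(mdadm --examine /dev/sdb2 | events_of)
#   same_events "$a" "$b" || echo "members disagree: $a vs $b"
```

Running something like this from cron would at least flag a member whose
counter has stopped advancing, independent of the array's own summary status.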

cheers,
- Dave.

On 5/6/07, Douglas Allan Tutty <dtutty@xxxxxxxxxxxxx> wrote:
On Sun, May 06, 2007 at 03:25:02PM +0200, David Fuchs wrote:
> I have just upgraded my sarge system to etch, following exactly the upgrade
> instructions at http://www.us.debian.org/releases/etch/i386/release-notes/.
>
> now my system does not boot correctly anymore... I'm using RAID1 with two
> disks, / is on md0 and all other mounts (/home/, /var, /usr etc) are on md1
> using LVM.
>
> the first problem is that during boot, only md0 gets started. I can get
> around this by specifying break=mount on the kernel boot line and manually
> starting md1, but what do I need to change, and where, so that md1 gets
> started at this point as well?
>
> after manually starting md1 and continuing to boot, I get errors like
>
> Inode 184326 has illegal block(s)
> /var: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY (i.e. without the -a or -o
> options)
>
> ... same for all other partitions on that volume group
>
> fsck died with exit status 4
> A log is being saved in /var/log/fsck/checkfs if that location is
> writable.(it is not)
>
> at this point I get dropped to a maintenance shell. when I select to
> continue the boot process:

What happens if instead of forcing a boot you do what it says: run fsck
without the -a or -o options?

>
> EXT3-fs warning: mounting fs with errors. running e2fsck is recommended
> EXT3 FS on dm-4, internal journal
> EXT3-FS: mounted filesystem with ordered data mode.
> ... same for all mounts (same for dm-3, dm-2, dm-1, dm-0)
>
> EXT3-fs error (device dm-1) in ext3_reserve_inode_write: Journal has aborted
> EXT3-fs error (device dm-1) in ext3_orphan_write: Journal has aborted
> EXT3-fs error (device dm-1) in ext3_orphan_del: Journal has aborted
> EXT3-fs error (device dm-1) in ext3_truncate_write: Journal has aborted
> ext3_abort called.
> EXT3-fs error (device dm-1): ext3_journal_start_sb: Detected aborted
> journal
> Remounting filesystem read-only
>
> and finally I get tons of these:
>
> dm-0: rw-9, want=6447188432, limit=10485760
> attempt to access beyond end of device
>
> the system then stops for a long time (~5 minutes) at "starting syslog
> service" but eventually the login prompt comes up, and I can log in, see all
> my data, and even (to my surprise) write to the partitions on md1...
>
...which probably corrupts the fs even more.
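
To put the dm-0 numbers quoted above in perspective, here is a small
arithmetic sketch. It assumes that `want` and `limit` in that kernel message
are counted in 512-byte sectors; the helper name `sectors_to_gib` is made up
for illustration.

```shell
#!/bin/sh
# Convert a count of 512-byte sectors to whole GiB (integer division).
# 1 GiB = 1073741824 bytes = 2097152 sectors of 512 bytes.
sectors_to_gib() {
    echo $(( $1 / 2097152 ))
}

sectors_to_gib 10485760      # limit reported for dm-0
sectors_to_gib 6447188432    # block the filesystem asked for
```

Under that assumption the volume is 5 GiB while the filesystem is asking for
a block roughly 3 TiB in, which points at damaged metadata rather than a
slightly mis-sized volume.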

> what the hell is going on here? thanks a lot in advance for any help!
>
What is going on is that you started with a simple booting error that
has propagated into filesystem errors. Those errors are compounded by
forcing a mount of a filesystem with errors. Remember that the system
that starts LVM and raid itself exists on the disks....

What you need is a shell with the root fs either totally unmounted or
mounted ro. Does booting single-user work? What about telling the
kernel init=/bin/sh? From there, you can check the status of the mds
with:

# /sbin/mdadm -D /dev/md0
# /sbin/mdadm -D /dev/md1
...

check the status of the logical volumes:
# /sbin/lvdisplay [lvname]

and then check the filesystems with:

# /sbin/e2fsck -f -c -c /dev/...


Only once you get the filesystems fully functional should you attempt to
boot further.
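
When judging whether a filesystem is "fully functional", the exit status of
fsck/e2fsck is worth reading, since it is a sum of bit flags as documented in
fsck(8) (the "exit status 4" quoted above means errors were left
uncorrected). A minimal sketch of a decoder follows; the function name
`describe_fsck_status` is made up for illustration.

```shell
#!/bin/sh
# Decode an fsck/e2fsck exit status into its documented bit flags.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    out=""
    # Each entry is "<bit>:<meaning>", following the table in fsck(8).
    for pair in \
        "1:errors corrected" \
        "2:reboot recommended" \
        "4:errors left uncorrected" \
        "8:operational error" \
        "16:usage or syntax error" \
        "32:cancelled by user" \
        "128:shared library error"
    do
        bit=${pair%%:*}
        text=${pair#*:}
        if [ $(( status & bit )) -ne 0 ]; then
            out="${out:+$out, }$text"
        fi
    done
    echo "$out"
}

# Usage (the device path is deliberately left out):
#   /sbin/e2fsck -f /dev/...; describe_fsck_status $?
```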

Doug.


--
To UNSUBSCRIBE, email to debian-user-REQUEST@xxxxxxxxxxxxxxxx
with a subject of "unsubscribe". Trouble? Contact listmaster@xxxxxxxxxxxxxxxx






