Odd crash, now unable to access hard disks at boot time
From: Chris Metzler (cmetzler_at_speakeasy.snip-me.net)
Date: 09/29/03
Date: Mon, 29 Sep 2003 14:10:18 -0400
Hi. I'm hoping someone here can give me some advice as to how to
proceed with a bizarre problem I'm having with my Linux box. The
basic problem: after almost a year of working properly, the system
suddenly seems unable to boot off of either hard drive on IDE0/1, even
though the drives seem to be working properly when accessed in Linux
after booting.
CONFIGURATION
-------------
- A7V333-RAID w/ Athlon XP 2000, 2x Corsair 512MB CAS2 DDR333/PC2700 RAM.
- IDE0: Lite-On LTR-48125W CD-RW, WD 1200JB hard drive
- IDE1: Lite-On XJ-HD165H DVD/CD-ROM, WD 1200JB hard drive
- IDE2 (Promise): WD 800JB hard drive, WD 1200JB hard drive
- IDE3 (Promise): WD 1200JB hard drive
- Teac (?) floppy drive
- Matrox Millennium G550 video card with Viewsonic P95f+ monitor
- Creative Soundblaster Live! 5.1 sound card
- D-Link DFE-530TX+ NIC
- HP LJ1200 attached through parallel port.
This motherboard has four IDE channels -- two usual ones (IDE0/1)
and two additional ones driven through an onboard Promise RAID
controller (IDE2/3). The Promise controller is being used solely
to provide two additional IDE channels; its RAID functionality is
not being used at all.
The OS (Debian Sarge/Sid) is located on the WD 800JB (80 GB) drive,
freeing the four 120 GB drives (one on each IDE channel) to be used as
a pair of RAID1 arrays, using Linux kernel software RAID.
The BIOS for the motherboard only allows you to choose boot devices
off of either the two normal IDE channels, OR the two Promise
channels. To be able to boot off CD when desired, the machine has to
be configured to boot from IDE0/1 rather than the Promise channels,
since the on-board Promise controller only allows disks to be
connected. Putting the four disks to be RAIDed each on their own
channel fills up IDE0/1 (given that the two CD drives *must* be on
IDE0/1); this meant putting the 80GB drive containing the OS on IDE2.
To allow the OS to reside on a disk on the Promise channels while
still booting off IDE0/1, the bootloader GRUB is installed into the
MBR of the first 120 GB hard disk (IDE0 slave) which points to the OS
on the 80 GB drive. The BIOS is configured to attempt to boot off
of, in order:
1. the CD-RW (IDE0 master);
2. floppy;
3. the first 120 GB hard disk (IDE0 slave).
When it attempts to boot off the hard disk, it loads and executes GRUB
out of that disk's MBR; GRUB, in turn, loads the OS off the 80 GB
drive on IDE2.
I've had this system configured as described since early last
November, working flawlessly under heavy load for a workstation,
including numerous cold boots (after e.g. shutting down for a
thunderstorm) and some warm boots (after e.g. Linux kernel
recompiles). Not one crash or glitch of any sort, ever.
THE PROBLEM
-----------
This past Friday, I experienced a bizarre crash. While I was
sitting at and using the machine, the screen suddenly went dark
and the monitor announced it was switching to "power off" mode
in 5 seconds: the video signal had shut off, as cleanly as if
the system case power was turned off. However, all the fans
were still going, including the CPU fan. The machine would not
reset; a power cycle was necessary. When the power was cycled,
the machine would not boot: no beeps of any sort indicating
the POST had succeeded or failed, no video signal, nothing. Fans
spun, and the case IDE light came on as if some sort of disk
access was taking place, but nothing else.
A tech from the place I got the mobo/processor informed me that if a
component of the POST fails hard, the system might die before the mobo
gives information as to what part of the POST failed. So, I
disassembled the machine and started adding components back one by
one, expecting that sooner or later I'd produce a crash and could
suspect the last thing I'd reconnected. With just processor and
memory, I got the machine to complain that it had no video card. With
the video card in place and monitor connected, I watched it start the
POST, run through the memory, correctly find no IDE devices, attempt
to boot and fail (because nothing was in the CD-RW or floppy drives,
and no hard drives were attached). I then re-attached drives, then
replaced cards, etc.
The result: the machine is now completely rebuilt, and I was never able
to reproduce the crash. I would be willing to chalk it up to some
connection that became loose, subsequently fixed in re-connecting
(and thus re-seating) everything. Except, there's one lasting
problem: I'm no longer able to boot from the hard drives on IDE0
or IDE1.
The POST finds the drives OK, they're listed in the opening
screen after memory is counted; and entering the BIOS setup screen
shows all four devices on IDE0/1 just fine. So it's not as if
it doesn't see the drives at all. But when it comes time to
read from the MBR, trouble. If the boot sequence is left as
above, and there are no bootable media in the CD or floppy drives,
then the system hangs after trying the floppy drive, with the
floppy drive light on; a hard reset is necessary. If the boot
order is changed so that the hard drive comes first, then the
system hangs immediately upon starting to try to boot, and again
a hard reset is necessary. No error messages of any sort -- just
a hang.
However, booting off the CD or a floppy is possible. In fact, I
can load GRUB off a floppy, and use *that* to access the kernel
and file systems on the 80 GB drive (IDE2 master) without any problem.
Doing so works just like things would normally work when I boot this
machine, except the first step of the boot process goes through the
floppy rather than the MBR of the IDE0 slave like it used to.
At this point, I might normally guess that there's some sort of
problem with the drive. But once the OS is booted, I'm able to mount and
mess with all disks, *including the disk in question*. I've done
hours of tests on it, filling it up and doing compares and so forth.
It works fine through the OS; there's only a problem when attempting
to access the MBR through the BIOS at boot-time.
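For anyone wanting to repeat a fill-and-compare test, here is a minimal
Python sketch of the general idea. It is not the exact procedure used
above; the function names, the seed, and the block size are hypothetical
choices for illustration. It writes a deterministic pattern to a file on
the disk under test, then regenerates the pattern and compares it block
by block.

```python
import hashlib
import os


def pattern_block(index, size, seed=b"disk-test"):
    """Return a deterministic pseudo-random block for a given block index."""
    # SHAKE-256 can emit an arbitrary number of bytes, so one call fills a block.
    return hashlib.shake_256(seed + index.to_bytes(8, "little")).digest(size)


def write_pattern(path, total_blocks, block_size=1 << 20):
    """Fill the file at `path` with `total_blocks` pattern blocks."""
    with open(path, "wb") as f:
        for i in range(total_blocks):
            f.write(pattern_block(i, block_size))
        f.flush()
        os.fsync(f.fileno())  # push the data to the device before verifying


def verify_pattern(path, total_blocks, block_size=1 << 20):
    """Re-read the file and return the indices of blocks that do not match."""
    bad = []
    with open(path, "rb") as f:
        for i in range(total_blocks):
            if f.read(block_size) != pattern_block(i, block_size):
                bad.append(i)
    return bad
```

One caveat: unless the file is much larger than RAM (or the file system
is unmounted and remounted between the two steps), the read-back may be
served from the page cache rather than from the platters.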
Another hint that it's not a problem with the drive itself is that I
installed GRUB into the MBR of the *other* hard drive on IDE0/1, the
IDE1 slave, and changed the BIOS setting to boot off that drive
instead of the IDE0 slave. Exactly the same thing happened -- system
hang when it came time to boot off that disk.
The fact that there are no error messages at all made me suspicious that
maybe the MBR simply got munged somehow. So I re-installed GRUB into
the MBR, more than once. No effect.
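One way to check whether the MBR is munged, as seen from Linux, is to
read the first sector of the disk and look at the standard fields: 446
bytes of boot code, four 16-byte partition entries, and the 0x55 0xAA
signature in the last two bytes. The Python sketch below is only an
illustration; the function names are hypothetical, and the device path
in the usage note is an assumption about how the IDE0 slave is named.

```python
import struct

SECTOR_SIZE = 512
BOOT_SIGNATURE = b"\x55\xaa"   # bytes 510-511 of a valid classic MBR
PARTITION_TABLE_OFFSET = 446   # boot code occupies bytes 0-445


def read_first_sector(path):
    """Read the first 512 bytes of a block device or disk image."""
    with open(path, "rb") as f:
        return f.read(SECTOR_SIZE)


def inspect_mbr(sector):
    """Summarize the classic MBR fields found in a 512-byte sector."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need a full 512-byte sector")
    partitions = []
    for i in range(4):
        start = PARTITION_TABLE_OFFSET + 16 * i
        entry = sector[start:start + 16]
        status, ptype = entry[0], entry[4]
        # Bytes 8-15 of each entry: starting LBA and sector count, little-endian.
        lba_start, num_sectors = struct.unpack_from("<II", entry, 8)
        if ptype != 0:  # type 0 marks an unused slot
            partitions.append({
                "slot": i,
                "bootable": status == 0x80,
                "type": ptype,
                "lba_start": lba_start,
                "num_sectors": num_sectors,
            })
    return {
        "signature_ok": sector[510:512] == BOOT_SIGNATURE,
        "boot_code_present": any(sector[:PARTITION_TABLE_OFFSET]),
        "partitions": partitions,
    }
```

Run as root, something like inspect_mbr(read_first_sector("/dev/hdb"))
would report on the IDE0 slave, assuming that is its device name. A sane
result only shows what Linux's own driver reads; it says nothing about
whether the BIOS can read the same sector at boot time.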
It is as if the BIOS can no longer read from the drives, or at least
can no longer read from their MBRs. Since Linux uses its own routines
for disk access, rather than BIOS routines, it would make sense that
I'd be able to use the disks OK once Linux booted. Yet it's hard for
me to imagine some sort of corruption of the BIOS that would cause the
BIOS to behave absolutely perfectly *except* that it can't read the
hard drives *and* wouldn't generate an error of any sort.
ADDITIONAL
----------
While trying to understand the initial failure, I came upon this
thread from the Asus motherboard newsgroup:
This is perhaps a plausible explanation for the original failure;
perhaps the difference between my not being able to POST, and then
POSTing OK, was not my pulling everything out and re-assembling
it, but rather eventually cycling power on the *monitor*. But that
still doesn't explain my current inability to boot off my hard
drives. Booting off of floppies sucks, so I'd really appreciate
any suggestions/ideas/advice.
Thanks very much.
-c