bad blocks on raid5 cause filesystem failure
From: alazarev (alazarev_at_itg.uiuc.edu)
Date: 09/21/05
- Next message: Dances With Crows: "Re: low level i/o programming"
- Previous message: phulden: "low level i/o programming"
- Next in thread: Rick Moen: "Re: bad blocks on raid5 cause filesystem failure"
- Reply: Rick Moen: "Re: bad blocks on raid5 cause filesystem failure"
- Reply: John-Paul Stewart: "Re: bad blocks on raid5 cause filesystem failure"
- Reply: kermit: "Re: bad blocks on raid5 cause filesystem failure"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: 20 Sep 2005 15:21:31 -0700
We use a popular consumer RAID enclosure device. It's 16 SATA drives,
with a built in RAID controller, hot swap everything, attaches to the
host via SCSI. We've been pretty happy with it up until a few weeks
ago. It is set up in RAID 5. Nothing unusual about the setup. Host is
RHEL4-AS 64bit, filesystem is ext3.
About a month ago, we saw some bad blocks on a drive, 5 of them in a
row. We ignored them; we've seen that before and it's never been a problem. A
few weeks later, we got 4-5 more bad blocks. Did nothing. A few weeks
later, disaster, we got about 10 bad blocks in a row, and the last one
took out the filesystem. The host unmounted the filesystem
automatically.
The host logs showed that the disc containing the filesystem had a
failure:
Sep 7 01:29:52 zeus kernel: attempt to access beyond end of device
Sep 7 01:29:52 zeus kernel: sdb1: rw=1, want=8072683984, limit=2927171457
This error repeated about a hundred times until the journal died just
seconds later, and the system remounted the filesystem read-only:
Sep 7 01:29:53 zeus kernel: Aborting journal on device sdb1.
Sep 7 01:29:53 zeus kernel: journal commit I/O error
Sep 7 01:29:53 zeus kernel: ext3_abort called.
Sep 7 01:29:53 zeus kernel: EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journal
Sep 7 01:29:53 zeus kernel: Remounting filesystem read-only
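To put the "attempt to access beyond end of device" line in perspective, here is a rough Python sketch that pulls the want and limit values out of that log line and shows how far past the end of sdb1 the request was. It assumes the two values are counted in 512-byte sectors; the helper name parse_beyond_end is made up for illustration.

```python
import re

# Assumption: want/limit in the kernel message are 512-byte sector counts.
SECTOR_BYTES = 512

# The sdb1 line from the host log, joined onto one line.
LINE = "sdb1: rw=1, want=8072683984, limit=2927171457"


def parse_beyond_end(line):
    """Return (device, want, limit) from a 'want=..., limit=...' kernel line."""
    m = re.search(r"(\w+): rw=\d+, want=(\d+), limit=(\d+)", line)
    if m is None:
        raise ValueError("not a want/limit line")
    return m.group(1), int(m.group(2)), int(m.group(3))


device, want, limit = parse_beyond_end(LINE)
overshoot = want - limit  # sectors past the end of the device

print(device, "size in bytes (if sectors are 512 B):", limit * SECTOR_BYTES)
print("requested sector is", overshoot, "sectors past the end")
print("want/limit ratio: %.2f" % (want / limit))
```

The requested sector is well over twice the size of the device, so the write was not just slightly off the end.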
What angered us most was, there was our RAID system, sitting there
running, not even recognizing that it caused a massive host filesystem
failure. The RAID did see bad blocks, but it never marked a drive as
bad, never sensed any failure whatsoever.
Now, my question is, isn't this problem exactly what RAID is supposed
to protect against? How could a RAID controller botch this up?
We contacted the manufacturer, and they claimed the only way to protect
against this is to enable something called "Clone and Replace", which
is: "clone the data from the failing drive to a free disk and then the
free disk will be brought into the array without a rebuild". However, I
don't see how that would have made a difference in our case: since the
RAID never saw a failed disc, it wouldn't have started using that clone
in the first place.
As a result, I'm forced to start looking for alternative vendors, who
can provide a RAID controller that can protect against bad blocks on
drives. If that means just a few bad blocks force a rebuild, fine,
we'd rather spend money on replacing bad block drives than lose our
filesystem again.
Before I start talking to other vendors, is this bad block issue a
problem for all RAID controllers? Shouldn't any decent RAID controller
be able to rebuild from parity whenever it senses a bad drive, or a bad
block? As soon as the RAID sensed a bad block, it should have known
there was data on it, failed the drive, moved to degraded mode, and
started reading from parity, so that the host would never know about a
drive failure. How could an enterprise-level RAID controller fail to do
this?
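To be clear about what I mean by "reading from parity", here is a minimal Python sketch of the generic RAID 5 idea: parity is the XOR of the data blocks in a stripe, so any one missing block can be recomputed from the rest. The 3-disk stripe and 4-byte blocks are made up for illustration, and this says nothing about how our particular controller is implemented.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe: three data blocks of 4 bytes each (real stripes are far larger).
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data)

# Pretend the block on the second disk came back as a bad block.
# XOR of the surviving data blocks and the parity block gives it back.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print("rebuilt block matches original:", rebuilt == data[1])
```

This only works when exactly one block in the stripe is unreadable, which is the situation I would have expected the controller to handle.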
Any help or advice is appreciated; please feel free to email me too.
Thanks,
Alex