bad blocks on raid5 cause filesystem failure

From: alazarev (alazarev_at_itg.uiuc.edu)
Date: 09/21/05


Date: 20 Sep 2005 15:21:31 -0700

We use a popular consumer RAID enclosure device. It has 16 SATA drives
and a built-in RAID controller, everything is hot-swappable, and it
attaches to the host via SCSI. We've been pretty happy with it up until
a few weeks ago. It is set up in RAID 5. Nothing unusual about the
setup. Host is RHEL4-AS 64bit, filesystem is ext3.

About a month ago, we saw some bad blocks on a drive, 5 of them in a
row. We ignored them; we've seen this before and it has never been a
problem. A few weeks later, we got 4-5 more bad blocks and again did
nothing. A few weeks after that, disaster: we got about 10 bad blocks
in a row, and the last one took out the filesystem. The host unmounted
the filesystem automatically.

The host logs showed that the disc containing the filesystem had a
failure:

Sep 7 01:29:52 zeus kernel: attempt to access beyond end of device
Sep 7 01:29:52 zeus kernel: sdb1: rw=1, want=8072683984, limit=2927171457

This error repeated about a hundred times until the journal died just
seconds later, and the filesystem was remounted read-only:

Sep 7 01:29:53 zeus kernel: Aborting journal on device sdb1.
Sep 7 01:29:53 zeus kernel: journal commit I/O error
Sep 7 01:29:53 zeus kernel: ext3_abort called.
Sep 7 01:29:53 zeus kernel: EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journal
Sep 7 01:29:53 zeus kernel: Remounting filesystem read-only
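
To put numbers on the "attempt to access beyond end of device" message
in the first log excerpt, here is a minimal Python sketch that pulls
the want and limit values out of the sdb1 line and reports how far past
the end of the device the request was. It assumes both values are
counts of 512-byte sectors, which is my reading of the message and not
something the log itself states. The function name and the sample
string are only for illustration.

```python
import re

# Assumption: want/limit are expressed in 512-byte sectors.
SECTOR_BYTES = 512

# Matches e.g. "sdb1: rw=1, want=8072683984, limit=2927171457",
# allowing the line wrap before "limit=" seen in the posted log.
LINE = re.compile(
    r"(?P<dev>\w+): rw=(?P<rw>\d+), want=(?P<want>\d+),\s*limit=(?P<limit>\d+)"
)

def parse_beyond_end(text):
    """Return the fields of a 'beyond end of device' detail line, or None."""
    m = LINE.search(text)
    if m is None:
        return None
    want = int(m.group("want"))
    limit = int(m.group("limit"))
    return {
        "device": m.group("dev"),
        "rw": int(m.group("rw")),
        "want": want,
        "limit": limit,
        # How many sectors past the end of the device the request reached.
        "overshoot_sectors": want - limit,
        "ratio": want / limit,
    }

sample = ("Sep 7 01:29:52 zeus kernel: sdb1: rw=1, want=8072683984,\n"
          "limit=2927171457")
info = parse_beyond_end(sample)
print(info["device"], "requested sector", info["want"],
      "but the device ends at", info["limit"])
print("device size in bytes (under the sector assumption):",
      info["limit"] * SECTOR_BYTES)
```

If the sector assumption holds, the request was aimed at well over
twice the size of the device, which is not a location any healthy
filesystem on sdb1 would ask for.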

What angered us most was that our RAID system sat there running, not
even recognizing that it had caused a massive host filesystem failure.
The RAID did see bad blocks, but it never marked a drive as bad and
never sensed any failure whatsoever.

Now, my question is, isn't this problem exactly what RAID is supposed
to protect against? How could a RAID controller botch this up?

We contacted the manufacturer, and they claimed the only way to protect
against this is to enable something called "Clone and Replace", which
is: "clone the data from the failing drive to a free disk and then the
free disk will be brought into the array without a rebuild". However, I
don't see how that would have made a difference in our case: since the
RAID never saw a failed disc, it wouldn't have started using that clone
in the first place.

As a result, I'm forced to start looking for alternative vendors, who
can provide a RAID controller that can protect against bad blocks on
drives. If that means just a few bad blocks force a rebuild, fine,
we'd rather spend money on replacing bad block drives than lose our
filesystem again.

Before I start talking to other vendors, is this bad block issue a
problem for all RAID controllers? Shouldn't any decent RAID controller
be able to rebuild from parity whenever it senses a bad drive, or a bad
block? As soon as the RAID sensed a bad block, it should have known
there was data on it, then failed the drive, moved to degraded mode,
and started reading from parity, so that the host would never know
about a drive failure. How could an enterprise level RAID controller
fail to do this?
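
For reference, this is the behaviour I have in mind when I say
"rebuild from parity". The Python sketch below is a toy model of RAID 5
style XOR parity on a single stripe, not the firmware of our enclosure
or of any real controller. The block contents and function names are
made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def reconstruct(stripe, missing):
    """Rebuild the block at index `missing` from all the other blocks.

    `stripe` holds the data blocks plus the parity block; XOR of every
    surviving block yields the missing one.
    """
    survivors = [b for i, b in enumerate(stripe) if i != missing]
    return xor_blocks(survivors)

# Hypothetical stripe: three data blocks and one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)
stripe = data + [parity]

# Pretend the drive holding block 1 reported an unreadable sector.
recovered = reconstruct(stripe, 1)
print(recovered == data[1])
```

Note that this only helps if the controller knows which block is bad.
Parity alone can rebuild one missing block per stripe, but it has to be
told, or has to detect, which one to rebuild.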

Any help or advice is appreciated, please feel free to email me too.

Thanks,

Alex


