Re: bad blocks on raid5 cause filesystem failure
From: Michael (mhyman_at_yahoo.com)
Date: 09/30/05
- In reply to: alazarev: "Re: bad blocks on raid5 cause filesystem failure"
Date: Thu, 29 Sep 2005 23:21:48 -0700
In article <1127329313.446678.325740@g47g2000cwa.googlegroups.com>,
alazarev@itg.uiuc.edu says...
> Thanks for the informative post. I've got a few questions though.
>
> 1) Do you have a link to the report that you read which describes the
> probability of double fault? Sounds like an interesting read for me.
>
> 2) Correct me if I'm wrong, but if two blocks on a drive, happen to
> fail at the same time, before rebuild can finish parity on the first,
> then you will have a problem, unless you have double parity? Fine, but
> then what about 3 bad blocks in a row. At some point, the RAID
> controller should, like you say, stop all host IO and report the drive
> failed, and then rebuild the drive from parity. How many bad blocks in
> a row should cause this drive failure, three or more, right? Since we
> saw about 10 bad block failures all with the same time stamp, double
> parity would not have helped us at all. The only thing that would have
> helped us is a RAID controller that would stop IO to the host. Instead,
> our RAID still provided "fake access" for the host and thus the fs
> failure. Sound legit to you? Any idea what functionality this is
> called, so I know to avoid it when shopping around for new RAID? I
> suppose SCSI provides much better reliability in this respect. Too bad,
> we are already in the SATA hole. Too much data to afford moving it to
> SCSI.
>
> 3) Double parity is also called RAID 6, right? Does RAID 6 provide
> double parity at the block level? Or only at the drive level?
>
> Thanks,
>
> Alex
>
>
Having even a single bad block, when that block happens to be in the
super-block area or the VTOC, will cause exactly the FS error you are
seeing:
Sep 7 01:29:52 zeus kernel: attempt to access beyond end of device
Sep 7 01:29:52 zeus kernel: sdb1: rw=1, want=8072683984, limit=2927171457
The FS can't determine the bounds of the partition because the partition
map is damaged. FSCK will keep sequentially scanning the disk until it
hits the end of the disk, if I remember correctly.
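To make the want/limit numbers in that kernel message concrete, here is a
rough Python sketch that pulls them out of the log line and shows how far
past the end of the device the request landed. The function name and the
512-byte sector size are my assumptions for illustration, not something the
log itself tells you.

```python
import re

# Matches the second line of the kernel message quoted above, e.g.
# "sdb1: rw=1, want=8072683984, limit=2927171457".
# \s* before "limit" tolerates the line having been wrapped by a mail client.
PATTERN = re.compile(
    r"(?P<dev>\w+): rw=(?P<rw>\d+), want=(?P<want>\d+),\s*limit=(?P<limit>\d+)"
)

SECTOR_BYTES = 512  # assumption: want/limit are counted in 512-byte sectors


def parse_beyond_end(line):
    """Return the device, want, limit and overshoot from one log line.

    Returns None when the line does not contain a want/limit pair.
    (Hypothetical helper, written for this post.)
    """
    match = PATTERN.search(line)
    if match is None:
        return None
    want = int(match.group("want"))
    limit = int(match.group("limit"))
    return {
        "device": match.group("dev"),
        "rw": int(match.group("rw")),
        "want": want,
        "limit": limit,
        "overshoot_sectors": want - limit,
        "device_bytes": limit * SECTOR_BYTES,
    }


if __name__ == "__main__":
    line = "Sep 7 01:29:52 zeus kernel: sdb1: rw=1, want=8072683984, limit=2927171457"
    print(parse_beyond_end(line))
```

A request that far past the limit is not an off-by-one at the end of the
partition. It means whatever computed the block address was working from
garbage metadata.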
You are 100% right though about the RAID array not working properly. I
hate to say it, but this is an area where Linux has some maturing to do.
The kernel, the FS and the RAID array are not tightly coupled and
therefore allow for these things to slip through the cracks. The RAID
array saw the bad block, but did not reallocate a good block and add the
bad one to the bad-block list, from what you were describing. Then the FS never
knew there was a problem until it couldn't read critical areas of the
disk and at that point you are screwed.
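For what it's worth, here is a minimal Python sketch of what single-parity
(RAID 5 style) recovery of one bad block looks like in principle. It is only
an illustration of the XOR idea, not how your controller or any particular
RAID implementation does it, and the function names are made up. It also
shows why a second unreadable block in the same stripe cannot be rebuilt
from single parity.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe):
    """Rebuild the one unreadable block (None) in a stripe.

    `stripe` holds the data blocks plus the parity block for one stripe.
    Raises ValueError unless exactly one block is unreadable, because
    single parity can only recover one missing block per stripe.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError(f"expected exactly one unreadable block, got {len(missing)}")
    survivors = [block for block in stripe if block is not None]
    repaired = list(stripe)
    # XOR of all surviving blocks (data and parity) equals the missing block.
    # Assumption: a real array would then write this to a spare sector and
    # record the bad one, which is the step that seems to have been skipped.
    repaired[missing[0]] = xor_blocks(survivors)
    return repaired


if __name__ == "__main__":
    d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xff\x00"
    parity = xor_blocks([d0, d1, d2])
    print(rebuild_missing([d0, None, d2, parity])[1] == d1)
```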
What could you have done at the point when you discovered the problems?
Maybe copied the data to another FS, if you even had that much space.
You mention a sequence of events leading up to the big failure, but a
single failed block in the wrong place could have killed it; it just
took a bit longer in your case.
I hope you do research arrays and then post your results, because it is a
big issue, at least as I see it.
Regards...Michael