Re: Wondering about raid5

From: P.T. Breuer (ptb_at_oboe.it.uc3m.es)
Date: 04/13/04


Date: Tue, 13 Apr 2004 21:33:57 +0200

Kasper Dupont <kasperd@daimi.au.dk> wrote:
> "P.T. Breuer" wrote:
> >
> > Kasper Dupont <kasperd@daimi.au.dk> wrote:
> > > I'm wondering how raid5 handles a power failure in
> > > the middle of a write.
> >
> > It doesn't do anything - it's dead.
> >
> > > To update a data block both
> > > the data block and the corresponding parity block
> > > need to be written.
> >
> > Don't worry about it - the parity block will be rewritten next time the
> > raid is started and the whole thing is checked over.
>
> Uhm, is it going to recompute every single parity block
> on the entire raid?

Under normal circumstances, yes.

> That would probably take about 45
> minutes on my system.

It does it in the background.

> Does that mean the system will
> use so much time to boot? Or is it happening in the
> background?

The latter.
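
For illustration, here is a minimal sketch of what such a parity resync
amounts to: walk every stripe, recompute the parity as the XOR of the data
blocks, and rewrite it where it disagrees. The stripe layout (a dict with
"data" and "parity" keys) and the names xor_blocks and resync are
assumptions made for this example; they are not the kernel md driver's
actual structures, and the rotation of parity across disks is ignored.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def resync(stripes):
    """Recompute parity for every stripe; return how many were stale.

    Each stripe is a dict {"data": [bytes, ...], "parity": bytes}.
    This layout is hypothetical and exists only for the sketch.
    """
    fixed = 0
    for stripe in stripes:
        good = xor_blocks(stripe["data"])
        if stripe["parity"] != good:
            # The data blocks are believed; the parity is overwritten.
            stripe["parity"] = good
            fixed += 1
    return fixed


stripes = [
    {"data": [b"\x01\x02", b"\x10\x20"], "parity": b"\x11\x22"},  # consistent
    {"data": [b"\x0f\x0f", b"\xf0\xf0"], "parity": b"\x00\x00"},  # stale parity
]
stale = resync(stripes)
```

Note that the resync trusts the data blocks and overwrites the parity, which
is only safe when every disk is present.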

> > If, however, at
> > that time you are a disk short, then the parity will be believed and
> > your data will be recreated from it.
>
> If the lost disk was not one of the two being written
> it would result in incorrect data being recreated.

Yes - I didn't bother with that detail, you get the idea!

> That actually means corruption of a sector where no
> write was going on.

Well, the data will be recreated wrongly for the missing disk, at the
corresponding sector, yes. That will transform itself into an actual
corruption when a replacement disk is brought in.
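
The scenario can be worked through on a three-disk stripe (two data blocks
plus parity). This is a minimal sketch with made-up block contents; the
variable names are assumptions for illustration only. The data block d0 is
rewritten, power fails before the parity is updated, and the array comes
back without the disk holding d1, which was never being written.

```python
def xor(a, b):
    """XOR two equal-length byte blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


# Consistent stripe before the write.
d0_old = b"AAAA"
d1 = b"BBBB"                      # this block is NOT being written
parity_old = xor(d0_old, d1)

# Power fails after the new d0 reaches its disk but before the parity does.
d0_new = b"CCCC"
parity_on_disk = parity_old       # stale: still describes d0_old

# Case 1: the array comes up a disk short, missing the disk that holds d1.
# The parity is believed and d1 is recreated from it.
d1_rebuilt = xor(d0_new, parity_on_disk)
corrupted = d1_rebuilt != d1

# Case 2: all disks come up. The parity is recomputed from the data,
# and d1 itself is read directly from its own disk, so it is intact.
parity_resynced = xor(d0_new, d1)
d1_after_resync = xor(d0_new, parity_resynced)
```

In case 1 the rebuilt d1 differs from the original even though nothing was
writing to it; in case 2 the stripe is simply made consistent again.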

> > So if data is written before parity, then you win provided you come
> > up with all disks, and you lose if you come up a disk short (hey, you
> > lose the last write, so what ...).
>
> If I just lost the last write I wouldn't worry. What
> I'm worried about is incorrect data being seen in other
> sectors in the same stripe.

It will happen. Silent corruption, silent data creep - those are all
terms known in the data storage business. To fix it you will need to
have extra redundancy :-).

> > If parity is written before data, then you lose if you come up with
> > all disks again (the parity will be overwritten using old data),
> > and you win if you come up a disk short.
>
> The order of the writes cannot make a difference. I don't
> care what is in the last sector written (or the last few
> sectors even). A journaling filesystem could take care of

Journalling does not do any magic either - it cannot guarantee
that something is written after it leaves the journal cache, it
can only send it out.

> that. I worry about the rest of the stripe.

There are raid systems that reread the written data after writing it,
just to be sure, but generally you will also be liable to data
corruption at the disk level. For journalling too - if you tell the
disk to write 45 and it really writes 44, nobody will know until next
time ...
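
A minimal sketch of the reread-after-write idea, assuming a toy in-memory
disk. FlakyDisk and verified_write are hypothetical names invented for this
example, not part of any real raid implementation. The disk is told to
write 45 and silently stores 44; only the read-back notices.

```python
class FlakyDisk:
    """Hypothetical in-memory disk that silently corrupts chosen sectors."""

    def __init__(self, bad_sectors=()):
        self.sectors = {}
        self.bad_sectors = set(bad_sectors)

    def write(self, sector, value):
        # A bad sector stores value - 1 but reports no error.
        if sector in self.bad_sectors:
            value = value - 1
        self.sectors[sector] = value

    def read(self, sector):
        return self.sectors[sector]


def verified_write(disk, sector, value):
    """Write, then read the sector back; True only if the data stuck."""
    disk.write(sector, value)
    return disk.read(sector) == value


disk = FlakyDisk(bad_sectors={7})
ok_good = verified_write(disk, 3, 45)   # healthy sector
ok_bad = verified_write(disk, 7, 45)    # sector that stores 44 instead
```

On real hardware the read-back would also have to bypass any caches so that
it actually hits the platter, which this sketch does not model.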

> > > Assuming a powerfailure
> > > happening, it is possible that only one of the two
> > > writes is completed, but the other isn't.
> >
> > Yep - but I wouldn't worry. Much worse things are possible. Disk
> > manufacturers claim that writes sent are always committed if power is
> > lost, because there is enough power in the capacitors to write the
> > pending requests before spindown. But if you believe that, you also
> > believe many other good things about disks ;-).
>
> Even if the disk would have enough power to complete a
> write, you still wouldn't prevent one disk from completing
> a write, where the corresponding write on another disk had
> not started yet.

Possibly. There's much worse - I don't believe write order is preserved
in any sense through the kernel raid layers, so journalling file
systems would corrupt on their own, even working perfectly (I may be
wrong, but I have examined the code and not seen anything that maintains
ordering - remember that requests to the raid device are marshalled and
copied to slave devices before being acked, but this is at an
individual level, and there is nothing to say that the requests cannot
arrive at the slaves out of order).

> > > When power is restored there will be an inconsistency.
> >
> > Yep. Or not.
>
> If power was lost at the wrong time, there will be an
> inconsistency. But of course if the raid is marked dirty
> and an unclean shutdown results in a recalculation of all
> parity sectors, the inconsistency would be fixed. So what
> does the system do? Does it not trust any parity sectors until
> recalculation has completed? That means any read/write
> would have to be done the hard way. Or does it keep track
> of how far it has recalculated and just avoid parities
> above that?

If you break a raid system before parity recalculation is complete, it
is very nicely broken indeed. At that point what to do is largely up to
you, the admin.

> > > Does anyone here know how raid5 (and raid1) prevents
> > > this from causing data loss?
> >
> > Yes - everyone does. It doesn't. RAID isn't magic. It only prevents
> > certain sorts of data loss, not all sorts.
>
> Of course I know that. What I'm talking about is only
> those errors that could potentially happen with a raid
> system, that wouldn't have happened without.

Well, you mean raid creating a confusion, because it has two sources of
data? Yes, you can now corrupt the corroborative data as well as the
data. But that's only 50% more danger over the danger of data corruption
alone (3 disks), and you can now expect to lose a whole disk without data
loss if you don't hit that one unlucky moment. The unlucky moment is
50% more likely, but it was only a 1000:1 chance anyway! And the risk
of losing one of two disks is something like 50% per year!

Peter


