Re: Wondering about raid5

From: P.T. Breuer (ptb_at_oboe.it.uc3m.es)
Date: 04/14/04


Date: Tue, 13 Apr 2004 23:40:19 GMT

Kasper Dupont <kasperd@daimi.au.dk> wrote:
> "P.T. Breuer" wrote:
> >
> > It will happen. Silent corruption, silent data creep - those are all
> > terms known in the data storage business. To fix it you will need to
> > have extra redundancy :-).
>
> Actually I have considered writing a program to
> periodically compute checksums of every single
> file on my filesystems. If a checksum then
> changes without the timestamp changing you know
> there is reason to worry.

That's not a bad idea. I compute the md5sums every day on all machines
on the local net, and compare them, finding the majority and minority
group votes, and replace the latter with the former. But comparing
against the file modification date is a neat idea on a single machine.
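
A minimal sketch of the single-machine check, assuming a snapshot that
maps each path to its (mtime, md5) pair is kept from the previous run.
The function names here are made up for illustration, not an existing
tool.

```python
import hashlib
import os

def file_digest(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def snapshot(root):
    """Map each regular file under root to (mtime in ns, digest)."""
    result = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                result[path] = (os.stat(path).st_mtime_ns, file_digest(path))
    return result

def suspicious(old, new):
    """Files whose content changed while the timestamp did not."""
    return sorted(
        path for path, (mtime, digest) in new.items()
        if path in old and old[path][0] == mtime and old[path][1] != digest
    )
```

Run snapshot() from cron, keep the previous result, and report whatever
suspicious(previous, current) returns. A file modified between the stat
and the read can show up wrongly, so anything reported deserves a second
look before it is "repaired".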

> Of course all security here relies on the
> assumption, that if a write is sent to the disk
> and the power stays on for another few seconds,
> then the data will be written. If you lose

There are raid systems that do not make that assumption, although I
don't recall what they do ... something like off-system journalling
and posterior checking combined.

> power right after a sector from the journal is
> written, it will be written again at next bootup.

Uh no - I don't really see the point of a journal most times. The trick
is to make fileops atomic. One can't do that, so the next best thing is
to make a single point at which a fileop becomes "complete", and
consider it not done until then and done after then. The journal merely
serves to hold enough data to allow the fileop to be set up (and for the
op to be undone again if never completed in time ...). It's the
atomicity that is the objective and if you are careful you can do that
without a journal for metaops. When it comes to actual data, you are
out of luck anyway because the journal can never be big enough to hold
a full data move (old data as well as new) for bigger transfers.

Anyway, if the data gets lost on the way to the journal, you are still
sorely out of luck.
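
One common way to get a single completion point without a journal is to
write a temporary file and rename it over the target. This is only a
sketch of that general technique, assuming a POSIX filesystem where
rename within one directory is atomic; atomic_write is a made-up name.

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace path with data; readers see the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # push the data out before committing
        os.replace(tmp, path)      # the single point of completion
    except BaseException:
        # Never committed: undo by removing the temporary file.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
    # Flush the directory so the rename itself survives a power cut
    # (assumption: POSIX, where a directory can be opened and fsynced).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Before the os.replace() call the op counts as not done, after it as
done. None of this helps if the disk lies about having written the data.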

> > > that. I worry about the rest of the stripe.
> >
> > There are raid systems that reread the written data after writing it,
> > just to be sure, but generally you will also be liable to data
> > corruption at the disk level. For journalling too - if you tell the
> > disk to write 45 and it really writes 44, nobody will know until next
> > time ...
>
> But those are not the problems I'm considering.

OK.

> The possibility of differences between the bytes
> written in a sector and the bytes later read
> back from that sector exists, but then you might
> as well have the same corruption going on in RAM.

RAM has parity or ECC bits to check it (where fitted), and disks have ECC too. The problem is
that writes are async in unix (usually) so even if the disk errors
and tells you about it your application won't know about it ... and
the same holds at a lower level. Writes ARE async. The disk buffers
them and swears blind that it's done them, but it hasn't.

> What I'm worrying about is perfectly functional
> hardware, but damage happening because of the
> nonatomicity of writes to the different disks.
> But from the explanations I got it seems there
> is a dirty bit to take care of that, so I guess

There isn't a dirty bit (in the sense of a bit in a set of flags or on
a bitmap - unless you use my or Paul Clements' newer bitmapped raid
drivers, which do do that). There is a synchronous volley of writes to
all the parts of the mirror from the raid device, on receipt of a write,
and if any secondary write to any part fails, the whole original write
will be errored, and the mirror component involved will go offline.

And then the admin fun starts ...
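
A toy model of the behaviour just described, for illustration only. It
is not the kernel md code, and the Component and Mirror classes are
hypothetical stand-ins.

```python
class Component:
    """Hypothetical stand-in for one disk of the mirror."""
    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail
        self.online = True
        self.blocks = {}

    def write(self, sector, data):
        if self.fail:
            raise OSError(f"{self.name}: write failed")
        self.blocks[sector] = data

class Mirror:
    """Fan each write out to every online component."""
    def __init__(self, components):
        self.components = list(components)

    def write(self, sector, data):
        failed = []
        for c in self.components:
            if not c.online:
                continue
            try:
                c.write(sector, data)
            except OSError:
                c.online = False          # component drops out of the mirror
                failed.append(c.name)
        if failed:
            # The original write is errored back to the caller.
            raise OSError("write failed on " + ", ".join(failed))
```

Note that the surviving component has already taken the write by the
time the caller sees the error, which is part of the admin fun.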

> I have no need to worry.

Well, you do.

> > Possibly. There's much worse - I don't believe write order is preserved
> > in any sense through the kernel raid layers, so journalling file
> > systems would corrupt on their own, even working perfectly (I may be
> > wrong, but I have examined the code and not seen anything that maintains
> > ordering - remember that requests to the raid device are marshalled and
> > copied to slave devices before being acked, but this is at an
> > individual level, and there is nothing to say that the requests cannot
> > arrive at the slaves out of order).
>
> There has also been talk about reordering happening
> in the disks themselves. How about hardware RAID

Yes.

> boxes, how do they handle this stuff?

I do not believe they do, but I may be wrong.

> > If you break a raid system before parity recalculation is complete, it
> > is very nicely broken indeed. At that point what to do is largely up to
> > you, the admin.
>
> What do you mean by breaking?

Marking a disk faulty, or taking the whole thing down.

> I'm just considering
> normal use of the system within the first hour of
> restoring happening after a powerfailure.

You are vulnerable in that period (I haven't thought about it, it
just doesn't bear thinking about! If you care to imagine and vocalise a
badness that can result, please go ahead ...)

> > Well, you mean raid creating a confusion, because it has two sources of
> > data? Yes, you can now corrupt the corroborative data as well as the
> > data.
>
> I'm not talking about corruption of data on the
> physical disks. I'm only talking about corruption
> of logical sectors happening without being caused
> by corruption of a physical sector.

Well, yes, you can distinguish. I meant to embrace the corruption that
accrues through a data or parity write to a component being missed.
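
A worked example of a missed parity write on a three-disk raid5 stripe,
with made-up two-byte blocks. It shows a logical sector (d1) coming back
wrong although no physical sector of it was ever damaged.

```python
from functools import reduce

def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Consistent stripe: two data blocks plus their parity.
d0, d1 = b"\x2d\x00", b"\x07\x07"
parity = xor_blocks(d0, d1)

# d0 is rewritten, but the matching parity write is missed.
d0_new = b"\x2c\x00"
stale_parity = parity

# Later the disk holding d1 dies and d1 is rebuilt from what is left.
rebuilt_d1 = xor_blocks(d0_new, stale_parity)

print(rebuilt_d1 == d1)  # False: d1 was never written, yet it is now wrong
```

As long as all three disks are alive nobody notices, because d1 is read
directly. The damage only appears once the stale parity is actually used.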

> > But that's only 50% more danger over the danger of data corruption
> > alone (3 disks), and you can now expect to lose a whole disk without data
> > loss if you don't hit that one unlucky moment.
>
> Yes I know.
>
> > The unlucky moment is
> > 50% more likely, but it was only a 1000:1 chance anyway!
>
> The situation I described is not unlikely. But it
> seems to have been taken care of. Inefficient, but
> shouldn't happen too often.
>
> If you lose power in the middle of a sequence of small
> writes, the situation I describe would happen with
> a 50% chance.

But you are only in the middle of a sequence of small writes about one
moment in 1000, on average (this is not true in some situations, which
have me worried - such as the data farms for the experimental data
coming off the CERN detectors, where I am involved ...).

> > And the risk
> > of losing one of two disks is something like 50% per year!
>
> I don't think it is *that* high, but it is too high
> to ignore.

Yes, it would probably be more like 5-10% per year. But it could
certainly be as high as 20% (that would make the expected lifetime 5
years).
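
The arithmetic behind those figures, as a rough sketch. It assumes disks
fail independently and at a constant rate per year, which real disks do
not quite do; both function names are made up.

```python
def any_failure(p_disk, n_disks):
    """Chance per year that at least one of n_disks fails."""
    return 1 - (1 - p_disk) ** n_disks

def expected_lifetime_years(p_per_year):
    """Mean years to failure at a constant yearly failure chance."""
    return 1 / p_per_year

for p in (0.05, 0.10, 0.20):
    print(f"{p:.0%} per disk: one of two fails with chance "
          f"{any_failure(p, 2):.1%}, expected lifetime "
          f"{expected_lifetime_years(p):.0f} years")
```

With 10% per disk per year the chance of losing one of two disks is
19% per year, and a 20% yearly failure chance is the one that
corresponds to an expected lifetime of 5 years.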

Peter


