Re: [PATCH 0/4] (RESEND) ext3[34] barrier changes

On Fri, 16 May 2008 20:20:30 -0400 Theodore Tso <tytso@xxxxxxx> wrote:

On Fri, May 16, 2008 at 11:53:04PM +0100, Jamie Lokier wrote:
If you just want to test the block I/O layer and drive itself, don't
use the filesystem, but write a program which just access the block
device, continuously writing with/without barriers every so often, and
after power cycle read back to see what was and wasn't written.

Well, I think it is worth testing through the filesystem, different
journaling mechanisms will probably react^wcorrupt in different ways.

I agree, but intentional tests on the block device will show the
drives characteristcs on power failure much sooner and more
consistently. Then you can concentrate on the worst drivers :-)

I suspect the real reason why we get away with it so much with ext3 is
that the journal is usually contiguous on disk, hence, when you write
to the journal, it's highly unlikely that commit block will be written
and the blocks before the commit block have not.

yup. Plus with a commit only happening once per few seconds, the time
window for a corrupting power outage is really really small, in
relative terms. All these improbabilities multiply.

Journal wrapping could cause the commit block to get written before its
data blocks. We could improve that by changing the journal layout a bit:
wrap the entire commit rather than just its tail. Such a change might
be intrusive though.

The other possible scenario is when the fs thinks the metadata blocks
have been checkpointed, so it reuses the journal space, only they
weren't really checkpointed. But that would require that we'd gone
through so much IO that the metadata probably _has_ been checkpointed.

In addition, because
we are doing physical block journalling, it repairs a huge amount of
damage during the journal replay. So as we are writing the journal,
the disk drive sees a large contiguous write stream, followed by
singleton writes where the disk blocks end up on disk. The most
important reason, though, is that the blocks which are dirty don't get
flushed out to disk right away!

They don't have to, since they are in the journal, and the journal
replay will write the correct data to disk. Before the journal
commit, the buffer heads are pinned, so they can't be written out.
After the journal commit, the buffer heads may be written out, but
they don't get written out right away; the kernel will only write them
out when the periodic buffer cache flush takes place, *or* if the
journal need to wrap, at which point if there are pending writes to an
old commit that haven't been superceded by another journal commit, the
jbd layer has to force them out. But the point is this is done in an
extremely lazy fashion.

As a result, it's very tough to create a situation where a hard drive
will reorder write requests aggressively enough that we would see a
potential problem.

I suspect if we want to demonstrate the problem, we would need to do a
number of things:

* create a highly fragmented journal inode, forcing the jbd
layer to seek all over the disk while writing out the
journal blocks during a commit
* make the journal small, forcing the journal to wrap very often
* run with journal=data mode, to put maximal stress on the journal
* make the workload one which creates and deletes large number
of files scattered all over the directory hierarchy, so that
we limit the number of blocks which are rewritten,
* forcibly crash the system while subjecting the ext3
filesystem to this torture test

Given that most ext3 filesystems have their journal created at mkfs
time, so it is contiguous, and the journal is generally nice and
large, in practice I suspect it's relatively difficult (I didn't say
impossible) for us to trigger corruption given how far away we are
from the worst case scenario described above.

what he said.

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at