Re: 2.6.19 file content corruption on ext3



On Tue, 2006-12-19 at 16:23 -0800, Linus Torvalds wrote:

On Wed, 20 Dec 2006, Peter Zijlstra wrote:
On Mon, 2006-12-18 at 12:14 -0800, Linus Torvalds wrote:
OR:

- page_mkclean_one() is simply buggy.

GOLD!

Ok. I was looking at that, and I wondered..

However, if that works, then I _think_ the correct sequence is the
following..

The rule should be:
- we flush the tlb _after_ we have cleared it, but _before_ we insert the
new entry.

But I dunno. These things are damn subtle. Does this patch fix it for you?

I will try, but I had a look around the different architectures
implementation of ptep_clear_flush_dirty() and saw that not all do the
actual flush. So if we go down this road perhaps we should introduce
another per arch function that does the potential flush. like
flush_tlb_on_clear_dirty() or something like that.

Then we could write:

entry = ptep_get_and_clear(mm, address, ptep)
flush_tlb_on_clear_dirty(vma, address);
entry = pte_mkclean(entry);
entry = pte_wrprotect(entry);
set_pte_at(mm, address, ptep, entry);

I actually suspect we should do this as an arch-specific macro, and
totally replace the current "ptep_clear_flush_dirty()" with one that does
"ptep_clear_flush_dirty_and_set_wp()".

Because what I'd _really_ prefer to do on x86 (and probably on most other
sane architectures) is to do

- atomically replace the pte with the EXACT SAME ONE, but one that
has the writable bit clear.

bit_clear(_PAGE_BIT_RW, &(ptep)->pte_low);

- flush the TLB, making sure that all CPU's will no longer write to it:

flush_tlb_page(vma, address);

- finally, just fetch-and-clear the dirty bit (and since it's no longer
writable, nobody should be settign it any more)

ret = bit_clear(__PAGE_BIT_DIRTY, &(ptep)->pte_low);

and now we should be all done.

Hmm, should we not flush after clearing the dirty bit? That is, why does
ptep_clear_flush_dirty() need a flush after clearing that bit? does it
leak through in the tlb copy?

Also, what is this page_test_and_clear_dirty() business, that seems to
be exclusively s390 btw. However they do seem to need this.

But the "ptep_get_and_clear() + flush_tlb_page()" sequence should
hopefully also work.

Yeah, probably, not optimally so on some archs that don't actually need
the flush though. And as above, I wonder about s390.

(added our s390 friends to the CC list)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • Re: [parisc-linux] Re: [PATCH 3/9] mm: parisc pte atomicity
    ... >> would stop, thinking it had done a good flush, when actually it hadn't. ... > (which is our private flag saying we're preserving the pte entry for the ... from which point on it assumes the pte is valid. ... and they're interruption handlers (meaning we don't ...
    (Linux-Kernel)
  • Re: [parisc-linux] Re: [PATCH 3/9] mm: parisc pte atomicity
    ... >> The change does slightly worry me in that it alters the behaviour of ... there's no pte, we can't do an effective flush, so instead of relying on ... > from which point on it assumes the pte is valid. ... > a swap entry or pte_file entry does not return NULL there, ...
    (Linux-Kernel)
  • Re: [Announce] 2.6.29-rt2
    ... A linux PTE was changed and the corresponding hash table entry ... neesd to be flushed. ... This function will either perform the flush ...
    (Linux-Kernel)
  • Re: [Announce] 2.6.29-rt2
    ... A linux PTE was changed and the corresponding hash table entry ... neesd to be flushed. ... This function will either perform the flush ...
    (Linux-Kernel)
  • Re: [patch -mm] i386: use pte_update_defer in ptep_test_and_clear_{dirty,young}
    ... David's patch is actually incorrect. ... the write protected pte which maps a dirty page gets broken in two ways; ...
    (Linux-Kernel)