Re: [PATCH 1/2]: Fix BUG in cancel_dirty_pages on XFS



On Thu, Jan 25, 2007 at 03:25:29PM +1100, Nick Piggin wrote:
David Chinner wrote:
On Thu, Jan 25, 2007 at 01:01:09PM +1100, Nick Piggin wrote:

David Chinner wrote:

No. The only thing that will happen here is that the direct read
will see _none_ of the write because the mmap write occurred during
the DIO read to a different set of pages in memory. There is no
"some" or "all" case here.

But if the buffers get partially or completely written back in the
meantime, then the DIO read could see that.


Only if you can dirty them and flush them to disk while the direct
read is waiting in the I/O queue (remember, the direct read flushes
dirty cached data before being issued). Given that we don't lock the
inode in the buffered I/O *writeback* path, we have to stop pages being
dirtied in the page cache up front so we don't have mmap writeback
over the top of the direct read.

However unlikely it may be, that is what I'm talking about in my "some"
or "all" cases. Note that I'm not talking about a specific implementation
(eg. XFS I guess avoids "some"), but just the possible scenarios.

Unfortunately, behaviour is different for different filesystems. I
can only answer for XFS, which is different to most of the other
filesystems in both locking and the way it treats the page cache.
IOWs, if you want to talk about details, then have to talk about
specific implementations because.....


Hence we have to prevent mmap for dirtying the same file offset we
are doing direct reads on until the direct read has been issued.

i.e. we need a barrier.

So you need to eliminate the "some" case? Because of course "none" and
"all" are unavoidable.

"all" is avoidable, too, once you've kicked the pages out of
the page cache - you just have to block the buffered read triggered
by refaulting the page will cause until the direct I/O completes.

IOWs, at a single point in time we have 2 different views
of the one file which are both apparently valid and that is what
we are trying to avoid. We have a coherency problem here which is
solved by forcing the mmap write to reread the data off disk....

I don't see why the mmap write needs to reread data off disk. The
data on disk won't get changed by the DIO read.


No, but the data _in memory_ will, and now when the direct read
completes it will data that is different to what is in the page
cache. For direct I/O we define the correct data to be what is on
disk, not what is in memory, so any time we bypass what is in
memory, we need to ensure that we prevent the data being changed
again in memory before we issue the disk I/O.

But when you drop your locks, before the direct IO read returns, some
guy can mmap and dirty the pagecache anyway.
By the time the read returns, the data is stale.

Only if we leave the page in the page cache. If we toss the page,
the time it takes to do the I/O for the page fault is enough for
the direct I/o to complete. Sure it's not an absolute guarantee,
but if you want an absolute guarantee:

This obviously must be synchronised in
userspace.

As I said earlier.....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • Re: speed of int vs bool for large matrix
    ... >>as much data that will fit the Level 1 cache. ... I am not speaking about file I/O but RAM I/O. ... Memory access go through the bus at 300 MHZ, ...
    (comp.lang.c)
  • Re: sharing memory map between processes (same parent)
    ... | 'warm the cache' by getting the inode into memory. ... It might make a difference if the memory remained mapped to some process ... How about we take a look at the approximate steps that mmap needs to ... So, whether the mapping is transferred to a second process, or the second ...
    (comp.unix.programmer)
  • Re: Caching control
    ... |> | invalidate/unmap them in order to discard the data from memory. ... |> writing out to disk. ... | easy to discard as clean disk cache. ... stating that a specific amount of RAM can be used only for I/O ...
    (comp.os.linux.development.system)
  • Re: Intel and AMD RDMA implementation
    ... The I/O controller uses the HyperTransport or QuickPath link ... to talk to the memory controller on the CPU that owns the memory. ... They can be directed to memory ranges that are never cached - which I suppose is cache coherent in a manner. ... They can be directed to memory that can be cached, but for which the I/O DMA transactions are not snooped, and so it is not coherent. ...
    (comp.arch)
  • Re: mmap() sendfile()
    ... > the file in memory until munmapis called. ... I do not understand why mmap() is called with MAP_PRIVATE ... > for use by sendfile(). ... > sendfilehas the same cache effect ?! ...
    (freebsd-hackers)