Re: 2.6.22-rc2: known regressions with patches



Michal Piotrowski wrote:
File systems

Subject : 2.6.21-git10/11: files getting truncated on xfs
References : http://lkml.org/lkml/2007/5/9/410
Submitter : Jeremy Fitzhardinge <jeremy@xxxxxxxx>
Handled-By : David Chinner <dgc@xxxxxxx>
Patch : http://lkml.org/lkml/2007/5/12/93
Status : patch available


I'm satisfied the patch fixes the problem for me. Can we have some
movement to get it into at least -mm? This is a real, serious
data-corrupting bug. Ideally we should get it into -rc asap.

I've put the version I'm using below, but I haven't seen a properly
changelogged and signed-off version.

Thanks,
J

From: David Chinner <dgc@xxxxxxx>

On Sat, May 12, 2007 at 01:23:27PM +0200, Jan Engelhardt wrote:

On May 10 2007 14:54, Jeremy Fitzhardinge wrote:
What CPU architecture is this happening on? Not i686 with PAE by
any chance?

Yes. Why?

I have a bug report where NFS files are corrupted only with PAE clients.
Corruption is at the end of the (newly untarred) files. Doesn't happen
without PAE.

Hm, suggestive, but I'm not convinced. Two differences to this situation:

1. Immediately after the clone ("untar"), the contents are completely
OK; it's only after a umount/mount cycle to problems appear

And if you do a "sync" rather than umount/mount?

I doubt it will matter - I don't think we are marking the inode dirty at
the right point.

The change that was at fault modifies the way we update the file
size on the inode. We added an in-memory copy of the file size to
the in-memory copy of the disk inode's file size that we already
keep. We now only update the disk inode's (in memory copy) file size
on I/O completion. Because the generic code writes the inode out
before waiting for I/O to complete, the old file size gets written
out instead of the new one.

If the write was to extending the file into an existing block there
would be no delalloc transaction to redirty the inode (happens on
log I/O completion). Hence when the I/O completes and the file size
gets updated to the in-core disk inode (which is marked dirty), the
linux inode remains clean. As a result, a sync will never flush the
inode to get the updated file size to disk.

What I don't understand is that on unmount dirty xfs inodes get
written out. Clearly this is not happening - either there's a hole
in the writeback logic (unlikely - it was unchanged) or we've missed
some case where we need to update the filesize and mark the inode
dirty.

Hmmmm - if the write was just a short append to the file, then the
block that was written to should already be mapped. Then we'll just
look up the extent by doing a BMAPI_READ lookup, set the type to
IOMAP_READ and add the block to ioend we are building.

The type IOMAP_READ determines the I/O completion behaviour - in this case
it is xfs_end_bio_read(), which fails to update the file size....

Bingo.

A patch for you to try, Jeremy. I've just started a test run on it...

Cheers,

Dave.
---
fs/xfs/linux-2.6/xfs_aops.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)

===================================================================
--- 2.6.x-xfs-new.orig/fs/xfs/linux-2.6/xfs_aops.c
+++ 2.6.x-xfs-new/fs/xfs/linux-2.6/xfs_aops.c
@@ -973,8 +973,9 @@ xfs_page_state_convert(

bh = head = page_buffers(page);
offset = page_offset(page);
- flags = -1;
- type = IOMAP_READ;
+ iomap_valid = 0;
+ flags = BMAPI_READ;
+ type = IOMAP_NEW;

/* TODO: cleanup count and page_dirty */

@@ -1004,14 +1005,14 @@ xfs_page_state_convert(
*
* Third case, an unmapped buffer was found, and we are
* in a path where we need to write the whole page out.
- */
+ */
if (buffer_unwritten(bh) || buffer_delay(bh) ||
((buffer_uptodate(bh) || PageUptodate(page)) &&
!buffer_mapped(bh) && (unmapped || startio))) {
- /*
+ /*
* Make sure we don't use a read-only iomap
*/
- if (flags == BMAPI_READ)
+ if (flags == BMAPI_READ)
iomap_valid = 0;

if (buffer_unwritten(bh)) {
@@ -1060,7 +1061,7 @@ xfs_page_state_convert(
* That means it must already have extents allocated
* underneath it. Map the extent by reading it.
*/
- if (!iomap_valid || type != IOMAP_READ) {
+ if (!iomap_valid || flags != BMAPI_READ) {
flags = BMAPI_READ;
size = xfs_probe_cluster(inode, page, bh,
head, 1);
@@ -1071,7 +1072,15 @@ xfs_page_state_convert(
iomap_valid = xfs_iomap_valid(&iomap, offset);
}

- type = IOMAP_READ;
+ /*
+ * We set the type to IOMAP_NEW in case we are doing a
+ * small write at EOF that is extending the file but
+ * without needing an allocation. We need to update the
+ * file size on I/O completion in this case so it is
+ * the same case as having just allocated a new extent
+ * that we are writing into for the first time.
+ */
+ type = IOMAP_NEW;
if (!test_and_set_bit(BH_Lock, &bh->b_state)) {
ASSERT(buffer_mapped(bh));
if (iomap_valid)


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • Re: [Ext2-devel] [PATCH 1/2] ext2/3: Support 2^32-1 blocks(Kernel)
    ... This has partially been addressed by Takashi's patch for fs-blocksize ... extent index block or another inode to allow storing larger EAs). ... At the same time, if we reserve too much space, it hurts EAs fitting ... think that the inode timestamps even warrant that protection. ...
    (Linux-Kernel)
  • Re: sysfs reclaim crash
    ... I don't have the pointer to d_inode ... to get the inode number. ... Though I was not able to recreate this race without the patch, ... sysfs_d_iputis invoked in dentry reclaim path under memory pressure. ...
    (Linux-Kernel)
  • [patch 3/4] Refactor do_syslog interface
    ... This patch breaks out the read operations in do_sysloginto their ... there were declarations of do_syslog and a ... static int kmsg_open(struct inode * inode, ... static int kmsg_release(struct inode * inode, struct file * file) ...
    (Linux-Kernel)
  • Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.
    ... I have rebased this patch to 2.6.22-rc1 so that it can be added to the ... ext4 patch queue. ... Currently the maximum number of subdirectories is capped ... inode link count to 1 and no longer counts subdirectories. ...
    (Linux-Kernel)
  • Re: 2.6.21-git10/11: files getting truncated on xfs? or maybe an nlink problem?
    ... I doubt it will matter - I don't think we are marking the inode dirty at ... We now only update the disk inode's file size ... log I/O completion). ...
    (Linux-Kernel)