Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.
- From: Andrew Morton <akpm@xxxxxxxx>
- Date: Fri, 12 May 2006 23:59:34 -0700
Neil Brown <neilb@xxxxxxx> wrote:
On Friday May 12, akpm@xxxxxxxx wrote:
NeilBrown <neilb@xxxxxxx> wrote:
If md is asked to store a bitmap in a file, it tries to hold onto the
page cache pages for that file, manipulate them directly, and call a
cocktail of operations to write the file out. I don't believe this is
a supportable approach.
erk. I think it's better than...
This patch changes the approach to use the same approach as swap files.
i.e. bmap is used to enumerate all the block address of parts of the file
and we write directly to those blocks of the device.
That's going in at a much lower level. Even swapfiles don't assume
buffer_heads.
I'm not "assuming" buffer_heads. I'm creating buffer heads and using
them for my own purposes. These are my pages and my buffer heads.
None of them belong to the filesystem.
Right, so it's incoherent with pagecache and userspace can no longer
usefully read this file.
The buffer_heads are simply a convenient data-structure to record the
several block addresses for each page. I could have equally created
an array storing all the addresses, and built the required bios by
hand at write time. But buffer_heads did most of the work for me, so
I used them.
OK.
Yes, it is a lower level, but
1/ I am certain that there will be no kmalloc problems and
2/ Because it is exactly the level used by swapfile, I know that it
is sufficiently 'well defined' that no-one is going to break it.
It would be nicer of course to actually use the mm/page_io.c code. That
would involve implementing swap_bmap() and reimplementing the
get_swap_bio() stuff in terms of a_ops->bmap().
But the swap code can afford to skip blockruns which aren't page-sized and
it uses that capability nicely. You cannot do that.
All this (and a set_fs(KERNEL_DS), ug) looks like a step backwards to me.
Operating at the pagecache a_ops level looked better, and more
filesystem-independent.
If you really want filesystem independence, you need to use vfs_read
and vfs_write to read/write the file.
yup.
I have a patch which did that,
but decided that the possibility of kmalloc failure at awkward times
would make that not suitable.
submit_bh() can and will allocate memory, although most decent device
drivers should be OK.
There are tricks we can do with writepage. If the backing filesystem uses
buffer_heads and if you hold a ref on the page then we know that there
won't be any buffer_head allocations nor any disk reads in the writepage
path. It'll go direct into bio_alloc+submit_bio, just as you're doing now.
IOW: no gain.
So I now use vfs_read to read in the file (just like knfsd) and
bmap/submit_bh to write out the file (just like swapfile).
I don't think a_ops really provides an interface that I can use, partly
because, as I said in a previous email, it isn't really a public
interface to a filesystem.
It's publicer than bmap+submit_bh!
I haven't looked at this patch at all closely yet. Do I really need to?
I assume you are asking that because you hope I will retract the
patch.
Was kinda hoping that would be the outcome. It's rather gruesome, using
set_fs()+vfs_read() on one side and submit_bh() on the other.
Are you sure the handling at EOF for a non-multiple-of-PAGE_SIZE file
is OK?
The loss of pagecache coherency seems sad. I assume there's never a
requirement for userspace to read this file.
invalidate_inode_pages() is best-effort. If someone else has the page
locked or if the page is mmapped then the attempt to take down the
pagecache will fail. That's relatively-OK, because it'll just lead to
userspace seeing wrong stuff, and we've already conceded that.
But if the pagecache is dirty then invalidate_inode_pages() will leave it
dirty and the VM will later write it back, corrupting your bitmap file.
You should get i_writecount, fsync the file and then run
invalidate_inode_pages().
Or not run invalidate_inode_pages() - it doesn't gain anything and will
just reduce the observeability of bugs. Better off leaving the pagecache
there all the time so that any rarely-occurring bugs become all-the-time
bugs.
You might as well use kernel_read() too, if you insist on begin oddball ;)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
- Follow-Ups:
- References:
- [PATCH 000 of 8] md/bitmap: Introduction - rework management of bitmap files.
- From: NeilBrown
- [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.
- From: NeilBrown
- Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.
- From: Andrew Morton
- Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.
- From: Neil Brown
- [PATCH 000 of 8] md/bitmap: Introduction - rework management of bitmap files.
- Prev by Date: Re: [PATCH 002 of 8] md/bitmap: Remove bitmap writeback daemon.
- Next by Date: intel mac mini: USB keyboard is disfunctional after several hours of passive operation until disconnect/reconnect
- Previous by thread: Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.
- Next by thread: Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.
- Index(es):