Re: [RFC] fsblock



Warning ahead: I've only briefly skipped over the pages so the comments
in the mail are very highlevel.

On Sun, Jun 24, 2007 at 03:45:28AM +0200, Nick Piggin wrote:
fsblock is a rewrite of the "buffer layer" (ding dong the witch is
dead), which I have been working on, on and off and is now at the stage
where some of the basics are working-ish. This email is going to be
long...

Firstly, what is the buffer layer? The buffer layer isn't really a
buffer layer as in the buffer cache of unix: the block device cache
is unified with the pagecache (in terms of the pagecache, a blkdev
file is just like any other, but with a 1:1 mapping between offset
and block).

There are filesystem APIs to access the block device, but these go
through the block device pagecache as well. These don't exactly
define the buffer layer either.

The buffer layer is a layer between the pagecache and the block
device for block based filesystems. It keeps a translation between
logical offset and physical block number, as well as meta
information such as locks, dirtyness, and IO status of each block.
This information is tracked via the buffer_head structure.


Traditional unix buffer cache is always physical block indexed and
used for all data/metadata/blockdevice node access. There's been
a lot of variants of schemes where data or some data is in a separate
inode,logial block indexed scheme. Most modern OSes including Linux
now always do the inode,logial block index with some noop substitute
for the metadata and block device node variants of operation.

Now what you replace is a really crappy hybrid of a traditional
unix buffercache implemented ontop of the pagecache for the block
device node (for metadata) and a lot of abuse of the same data
structure as used in the buffercache for keeping metainformation
about the actual data mapping.

Why rewrite the buffer layer? Lots of people have had a desire to
completely rip out the buffer layer, but we can't do that[*] because
it does actually serve a useful purpose. Why the bad rap? Because
the code is old and crufty, and buffer_head is an awful name. It must
be among the oldest code in the core fs/vm, and the main reason is
because of the inertia of so many and such complex filesystems.

Actually most of the code is no older than 10 years. Just compare
fs/buffer.c in 2.2 and 2.6. buffer_head is a perfectly fine name
for one of it's uses in the traditional buffercache.

I also thing there is little to no reason to get rid of that use:
This buffercache is what most linux block-based filesystems (except
xfs and jfs most notably) are written to, and it fits them very nicely.

What I'd really like to see is to get rid of the abuse of struct buffer_head
in the data path, and the sometimes to intimate coupling of the buffer cache
with page cache internals.

- Data / metadata separation. I have a struct fsblock and a struct
fsblock_meta, so we could put more stuff into the usually less used
fsblock_meta without bloating it up too much. After a few tricks, these
are no longer any different in my code, and dirty up the typing quite
a lot (and I'm aware it still has some warnings, thanks). So if not
useful this could be taken out.


That's what I mean. And from a quick glimpse at your code they're still
far too deeply coupled in fsblock. Really, we don't really want to share
anything between the buffer cache and data mapping operations - they are
so deeply different that this sharing is what creates the enormous complexity
we have to deal with.

- No deadlocks (hopefully). The buffer layer is technically deadlocky by
design, because it can require memory allocations at page writeout-time.
It also has one path that cannot tolerate memory allocation failures.
No such problems for fsblock, which keeps fsblock metadata around for as
long as a page is dirty (this still has problems vs get_user_pages, but
that's going to require an audit of all get_user_pages sites. Phew).

The whole concept of delayed allocation requires page allocations at
writeout time, as do various network protocols or even storage drivers.

- In line with the above item, filesystem block allocation is performed
before a page is dirtied. In the buffer layer, mmap writes can dirty a
page with no backing blocks which is a problem if the filesystem is
ENOSPC (patches exist for buffer.c for this).

Not really something that is the block layers fault but rather the lazyness
of the filesystem maintainers.

- Large block support. I can mount and run an 8K block size minix3 fs on
my 4K page system and it didn't require anything special in the fs. We
can go up to about 32MB blocks now, and gigabyte+ blocks would only
require one more bit in the fsblock flags. fsblock_superpage blocks
are > PAGE_CACHE_SIZE, midpage ==, and subpage <.

Core pagecache code is pretty creaky with respect to this. I think it is
mostly race free, but it requires stupid unlocking and relocking hacks
because the vm usually passes single locked pages to the fs layers, and we
need to lock all pages of a block in offset ascending order. This could be
avoided by doing locking on only the first page of a block for locking in
the fsblock layer, but that's a bit scary too. Probably better would be to
move towards offset,length rather than page based fs APIs where everything
can be batched up nicely and this sort of non-trivial locking can be more
optimal.

See now why people like large order page cache so much :)

Large block memory access via filesystem uses vmap, but it will go back
to kmap if the access doesn't cross a page. Filesystems really should do
this because vmap is slow as anything. I've implemented a vmap cache
which basically wouldn't work on 32-bit systems (because of limited vmap
space) for performance testing (and yes it sometimes tries to unmap in
interrupt context, I know, I'm using loop). We could possibly do a self
limiting cache, but I'd rather build some helpers to hide the raw multi
page access for things like bitmap scanning and bit setting etc. and
avoid too much vmaps.

And this is a complete pain in the ass. XFS uses vmap in it's metadata buffer
cache due to requirements carrier over from IRIX (in fact that's why I implemented
vmap in it's current form). This works okay most of them time, but there are
a lot of scenarios where you run out of vmalloc space as you mention. What's
also nasy is that you can't call vunmap from irq context, and vunmap beeing
rather bad for system peformance due to the tlb flushing overhead.


So as the closing comment I'd say I'd rather keep buffer_heads for metadata
for now and try to decouple the data path from it. Your fsblock patches
are a very nice start for this, but I'd rather skip the intermediate step
towards the extent based API Dave has been outlining. Having deal with the
I/O path of a high performance filesystem for a while per-page or sub-page
structures are a real pain to deal with and I'd really prefer to have data
structures for as much as possible blocks with the same state.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • Re: [RFC] fsblock
    ... buffer layer as in the buffer cache of unix: ... is unified with the pagecache (in terms of the pagecache, ... I thought that the buffer layer is essentially a method to index to sub ... aligned writewould require an RMW in the filesystem. ...
    (Linux-Kernel)
  • Re: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.
    ... Controlling which buffers are dirty and when. ... keep the buffer cache in the page cache for the block device we have ... Like I said, when a live filesystem is mounted on a bdev, ...
    (Linux-Kernel)
  • Re: Question regarding concurrent accesses through block device and fs
    ... the page cache is used. ... page cache of the block device is also sometimes called buffer cache, ... Basically what can happen is that a filesystem will have perhaps allocated ... buffer head around, but we also have this new pagecache and buffer head for ...
    (Linux-Kernel)
  • Re: Cisco 7204VXR Interface ignored packet
    ... Input queue drops (ignored packets) are usually due to buffers not being ... The buffer defaults are good 98% of the ... 256 max cache size, 256 in cache ... 22496866 hits in cache, 0 misses in cache ...
    (comp.dcom.sys.cisco)
  • Re: Cisco 7204VXR Interface ignored packet
    ... The buffer defaults are good 98% of the ... i have a problem with ignore packets in the interface. ... 256 max cache size, 256 in cache ... 22496866 hits in cache, 0 misses in cache ...
    (comp.dcom.sys.cisco)