Re: If not readdir() then what?



On Mon, Apr 16, 2007 at 04:18:53PM +1000, Neil Brown wrote:
On Thursday April 12, tytso@xxxxxxx wrote:

Unfortunately, in the NFS case if there are hash collisions, under the
wrong circumstances the NFS client could loop forever trying to
readdir() a directory stream.

I think that can be easily avoided by the filesystem by simply
ensuring that the cookies returned in a request are all *after* the
'llseek' cookie. There still maybe some risk of missing or extra
entries in the directory if the filesystem cannot perform a precise
cookie->position mapping, but an infinite loop should not be a problem.

It is true that you can replace the infinite loop problem by
guaranteeing that certain filenames are simply always missing.
Neither seemed desirable to me, but it is true that if we have to err
in one direction or another, simply skipping files in the case of a
hash collision is probably a better result.

You still haven't convinced me that the limited size is a fundamental
problem.
In the case of ext3, I have suggested (various things including)
using 56bits of the hash and an 8 bit sequence number through any
hash chain that may eventuate.
The seq number could not be stable across reboots with the current
on-disk data, but could be made stable for almost all real usage with
a little bit of caching (I imagine using upper 4 bits to give a
sequence number while reading from disk, and lower 4 bits to give a
sequence number when adding new entries).

The challenge is making it be stable across inserts/deletes, never
mind reboots. And it's not a "little bit of cacheing"; in order to be
correct we would have to cache *forever*, since at least in theory an
NFS client could hold on to a cookie for an arbitrarily long period of
time (weeks, months, years, decades), yes?

Supposing I were to try my hand at making the modifications I
suggested to ext3 so that per page/hashvalue rbtrees were cached in
the pagecache and were used to provide a stable-as-possible telldir
cookie.

Well, the first problem you'll run into is that ext3 doesn't store
directories in the page cache. That's because the journaling layer
fs/jbd operates on a buffer cache basis (i.e., on a per filesystem
block level).

Secondly, storing the rbtree in the page cache doesn't help, because
the whole point of the rbtree is to have a stable pointer that doesn't
change across a b-tree node split. Whether you use a page cache or
the buffer cache, it doesn't change the fact that when you do a b-tree
node split, half the directory entries *go* *someplace* *else*. Put
another way, Linus's standard reasons for wanting to store the
directory in the page cache so we can have better readahead don't
apply when you're using htree, because with htree we're never reading
entries in page cache order.

So you wouldn't be able to store the rbtree in the page cache, since
the doing so would defeat the whole purpose of the having the rbtree
in the first place. The rbtree needs to be per directory stream,
*not* per directory block.

You could store a mapping between each directory entry and a sequence
number. That could work, modulo the need to store these entries
*forever*. But now you take a performance hit since every time you
create or delete a directory entry, you'll need to keep the cache up
to date. And since we don't store the hash in the on-disk format
(since you can always calculate it from the filename), you'll need to
store the hash as well in the cache information. So that means
storing a 64-bit hash plus the sequence counter information --- and if
we assume 32-bit alignment requirements for data structures, it means
the simplest way to do this would be to hang a data structure off of
each directory block which consumes 12 bytes of data for every
directory entry, plus either (a) extra space for pointers so you can
easily insert and delete entries as directory entries are created or
deleted, or (b) resigning yourself to use memmove() to insert and
delete entries in the linear array.

You're welcome to try, but I suspect it won't take long before you'll
see why I'm asserting that a directory fd cache in nfsd is *way* less
work. :-)

- Ted
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • Re: If not readdir() then what?
    ... All we have to do is cache the open files. ... We keep these in an LRU list and a hash table. ... hash LRU chain, but rather a number of LRU chains. ... Have a rbtree for storing directory entries. ...
    (Linux-Kernel)
  • soft cache
    ... on of solutions to this problem is weak hash tables -- they do not prevent ... it gives no guarantee how long entries stay in cache. ... HT just points to entry in LRU singly linked list, ...
    (comp.lang.lisp)
  • [VulnWatch] Algorithmic Complexity Attacks and the Linux Networking Code
    ... The Linux networking code makes extensive use of hash tables to ... The Linux Routing Cache ... Our attack is targeted at a host and uses packets with carefully ...
    (VulnWatch)
  • Re: Rows returned are out of sync with the request.
    ... ('driver_connection' from cache) ... <- STORE= 1 at SQLRelay.pm line 145 ... However I suspect the signal was not delivered until the execute completed. ... Can $sth ever be already defined here? ...
    (perl.dbi.users)
  • problem using both Memoize::Expire and DB_File
    ... Subroutine 'test0' is memoized via a simple cache, 'test1' uses a disk-backed cache, 'test2' uses an expiring cache, and 'test3' uses both. ... } my %test0_hash; memoize 'test0', ...
    (comp.lang.perl.misc)