NFS negative dentry caching problem in 2.4.21, and patch

From: Tor Lillqvist (tml_at_iki.fi)
Date: 10/27/04

  • Next message: Andrew Morton: "Re: 2.6.10-rc1-mm1"
    To: linux-kernel@vger.kernel.org
    Date:	Wed, 27 Oct 2004 12:22:35 +0300 (EEST)
    
    

    We have had a support request open for over a month at Red Hat for
    this, and I even came up with a suggested patch, but Red Hat support
    seems to be ignoring us. No comments from there. So what do people
    here think? Yes, I realize that most gurus probably are knee-deep in
    2.6 and couldn't care less about 2.4.21... but anyway, could you
    please have a look?

    Below is a description of the problem and a suggested patch. Note that
    I have tested the patch on Red Hat's 2.4.21-20.EL only, but I don't
    see why the same issue wouldn't occur also on a stock 2.4.21 (if
    somebody is running such), or later. This is copy-pasted from the Red
    Hat support request, so when it says "RHEL3", think "Linux 2.4.21".

    Thanks in advance for any comments.

    --tml

      We have a big problem with how the NFS client code in RHEL3
      works. The problem led to a huge unexpected increase in disk usage
      for us after we moved an application from Solaris to RHEL3. We have
      now been forced to move it partially back to Solaris servers.

      The problem occurs when using a distributed application (ClearCase)
      where applications on several client machines (RHEL or Solaris) talk
      to each other using RPC and access the same files that are located on
      an NFS server. It can easily be reproduced without having ClearCase,
      though.

      The scenario is as follows: Let's say we have two NFS client machines,
      A and B. A runs RHEL3, B runs any Unix. They both mount a file system
      from an NFS server (doesn't matter what kind, we have reproduced the
      problem with EMC Celerra, NetApp and Solaris NFS servers).

      What happens is this:

      0) Let F be a filename on the NFS file system. Initially this file
      does not exist.

      1) The application on the RHEL3 machine A does a stat() on F. The NFS
      client in the kernel sends a LOOKUP request to the NFS server, which
      obviously returns failure. The stat() fails with ENOENT. OK so far.

      2) Immediately afterwards (a few seconds max), the application on
      machine B creates the file F. No problems so far.

      3) When B is done with F, a few seconds later the application on
      machine A does an unlink() on F. Because of the negative dentry
      caching in the Linux kernel, it doesn't even bother to send an NFS
      REMOVE request to the NFS server, as (it thinks) it knows for sure the
      file doesn't exist. It lets the unlink() fail with ENOENT. But the
      file definitely exists.

      The application now thinks that the file F doesn't exist any longer,
      and loses track of it. This means ever increasing disk usage as the
      above scenario happens all the time when we run ClearCase builds for
      our (large) software.

      After we moved our view servers from Solaris to RHEL3, the disk usage
      of our ClearCase view storage doubled in a few months from 150 GB to
      close to 300 GB! This was a mystery to us until we found that the
      view storage file system was full of stranded files that weren't
      supposed to be left there, and that the application didn't know of and
      thus couldn't clean out itself.

      (In case you wonder why the applications work like that, well, that's
      how ClearCase works. A is a view server, B is a build server where
      clearmake jobs (compilations) are run, and the file F is a
      view-private file created and removed during the clearmake run. The
      view server first checks if F exists, then B actually creates it,
      writes to it, then when it isn't needed any longer the view server is
      supposed to remove it.)

      It is very easy to demonstrate the problem without ClearCase: Just
      mount a file system from a NFS server on two RHEL3 boxes. Let's call
      the file system /data, and the machines A and B:

      on machine A: ls /data/foobar
      ls: /data/foobar: No such file or directory

      on machine B: touch /data/foobar
      OK.

      on machine A: rm /data/foobar
      rm: cannot lstat `foobar': No such file or directory

      But the file *does* exist.

      A workaround is to mount the file system with acdirmin=0 and
      acdirmax=0. Then the nfs_neg_need_reval() function in fs/nfs/dir.c
      always returns true, meaning the nfs code never trusts negative
      dentries, and always does a fresh LOOKUP. But this then affects all
      system calls, not just unlink(). And it hurts NFS performance a
      lot. Why the heck would it need to do a LOOKUP at all when unlinking a
      file is beyond me...

      It seems to me that the correct thing for the kernel to do would be to
      *always* just send a NFS REMOVE request for unlink() of NFS
      files. (Well, as long as the parent directory path is OK.) The nfs
      client code should never trust that it thinks is knows the file
      doesn't exist. It shouldn't care whether the file exists or not. Just
      tell the NFS server to remove it. At least, this should be a mount
      option. [Alas, that's not what my patch then does, see below.]

      It seems that not even the latest 2.6 kernel does this correctly. It
      does pass around intents in nameidata structs, but there doesn't seem
      to be any LOOKUP_* flag for unlink intent, just for open, create and
      access. At least that's what I understand from looking at the code, I
      haven't actually run any 2.6 kernel yet.

    Then I did some debugging and came up with a patch:

      The below patch should be pretty self-evident, but anyway:

      I add a new "LOOKUP" flag in include/linux/fs.h, LOOKUP_UNLINK.

      I rename lookup_hash() in fs/namei.c to lookup_hash_with_intent(), and
      add a new "flags" parameter to it. This parameter is passed on to the
      cached_lookup () function instead of always passing a zero. Of course
      I still need to provide a lookup_hash() with identical prototype as
      the old one in order not to break the kernel API, so that is now then
      just a one-liner that calls lookup_hash_with_intent() with zero
      intent.

      The only place where I change a call to lookup_hash() to call
      lookup_hash_with_intent() instead is in sys_unlink(), and here
      LOOKUP_UNLINK is passed as the "flags" parameter.

      In fs/nfs/dir.c, I add a flags parameter to the nfs_neg_need_reval()
      function. If flags has LOOKUP_UNLINK set, the function always returns
      1. The call to nfs_neg_needs_reval() from nfs_lookup_revalidate() now
      passes its flags parameter on to nfs_neg_need_reval().

      Simple, but works. I have no idea whether anything like this is
      necessary for 2.6. There isn't any LOOKUP_UNLINK intent flag there,
      although they have added LOOKUP_OPEN, LOOKUP_CREATE and LOOKUP_ACCESS
      flag bits. But the code is different anyway so it's hard to say
      whether the same issue arises there without adding debug printouts and
      tracing what happens, and I don't have time for that right now.

    Here's the patch, against a stock 2.4.21:

    diff -ru linux-2.4.21.orig/fs/namei.c linux-2.4.21/fs/namei.c
    --- linux-2.4.21.orig/fs/namei.c 2003-06-13 17:51:37.000000000 +0300
    +++ linux-2.4.21/fs/namei.c 2004-10-27 11:54:51.000000000 +0300
    @@ -767,7 +767,7 @@
      * needs parent already locked. Doesn't follow mounts.
      * SMP-safe.
      */
    -struct dentry * lookup_hash(struct qstr *name, struct dentry * base)
    +struct dentry * lookup_hash_with_intent(struct qstr *name, struct dentry * base, int flags)
     {
             struct dentry * dentry;
             struct inode *inode;
    @@ -790,7 +790,7 @@
                             goto out;
             }
     
    - dentry = cached_lookup(base, name, 0);
    + dentry = cached_lookup(base, name, flags);
             if (!dentry) {
                     struct dentry *new = d_alloc(base, name);
                     dentry = ERR_PTR(-ENOMEM);
    @@ -808,6 +808,11 @@
             return dentry;
     }
     
    +struct dentry * lookup_hash(struct qstr *name, struct dentry * base)
    +{
    + return lookup_hash_with_intent (name, base, 0);
    +}
    +
     /* SMP-safe */
     struct dentry * lookup_one_len(const char * name, struct dentry * base, int len)
     {
    @@ -1511,7 +1516,7 @@
             if (nd.last_type != LAST_NORM)
                     goto exit1;
             down(&nd.dentry->d_inode->i_sem);
    - dentry = lookup_hash(&nd.last, nd.dentry);
    + dentry = lookup_hash_with_intent(&nd.last, nd.dentry, LOOKUP_UNLINK);
             error = PTR_ERR(dentry);
             if (!IS_ERR(dentry)) {
                     /* Why not before? Because we want correct error value */
    diff -ru linux-2.4.21.orig/fs/nfs/dir.c linux-2.4.21/fs/nfs/dir.c
    --- linux-2.4.21.orig/fs/nfs/dir.c 2002-11-29 01:53:15.000000000 +0200
    +++ linux-2.4.21/fs/nfs/dir.c 2004-10-27 11:54:51.000000000 +0300
    @@ -450,13 +450,19 @@
      * We judge how long we want to trust negative
      * dentries by looking at the parent inode mtime.
      *
    - * If parent mtime has changed, we revalidate, else we wait for a
    - * period corresponding to the parent's attribute cache timeout value.
    + * If parent mtime has changed, we revalidate.
    + *
    + * Else, if we are doing this as part of an unlink(), we revalidate.
    + *
    + * Else, we wait for a period corresponding to the parent's attribute
    + * cache timeout value.
      */
    -static inline int nfs_neg_need_reval(struct inode *dir, struct dentry *dentry)
    +static inline int nfs_neg_need_reval(struct inode *dir, struct dentry *dentry, int flags)
     {
             if (!nfs_check_verifier(dir, dentry))
                     return 1;
    + if (flags & LOOKUP_UNLINK)
    + return 1;
             return time_after(jiffies, dentry->d_time + NFS_ATTRTIMEO(dir));
     }
     
    @@ -484,7 +490,7 @@
             inode = dentry->d_inode;
     
             if (!inode) {
    - if (nfs_neg_need_reval(dir, dentry))
    + if (nfs_neg_need_reval(dir, dentry, flags))
                             goto out_bad;
                     goto out_valid;
             }
    diff -ru linux-2.4.21.orig/include/linux/fs.h linux-2.4.21/include/linux/fs.h
    --- linux-2.4.21.orig/include/linux/fs.h 2003-06-13 17:51:38.000000000 +0300
    +++ linux-2.4.21/include/linux/fs.h 2004-10-27 11:55:43.000000000 +0300
    @@ -1333,6 +1333,9 @@
     #define LOOKUP_POSITIVE (8)
     #define LOOKUP_PARENT (16)
     #define LOOKUP_NOALT (32)
    +
    +#define LOOKUP_UNLINK (0x800)
    +
     /*
      * Type of the last component on LOOKUP_PARENT
      */

    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/


  • Next message: Andrew Morton: "Re: 2.6.10-rc1-mm1"

    Relevant Pages

    • [GIT] NFS client updates for 2.6.22...
      ... commit a19b89cad51b6f0da8f4bafdfdcfb10264cbcdea ... should remove any cached ACL so that any subsequent request for the ACL ... NFS: Fix directory caching problem - with test case and patch. ...
      (Linux-Kernel)
    • Re: NFS Still broken in 2.6.x?
      ... I'm running into a bit of trouble with NFS on 2.6. ... The patch below should fix this problem. ... Subject: Readahead algorithm problems again... ... 32k max, 4k request ...
      (Linux-Kernel)
    • Re: Linux 2.4.33-rc2
      ... a file on the NFS file system did block. ... Assuming you mean something like the patch below? ... Got this with unpatched -rc2, tosh is NFS server, niner is client: ... No 'server not responding messages' logged. ...
      (Linux-Kernel)
    • Re: [PATCH][BUGFIX][RFC] fix soft lock up at NFS mount by making limitation of dentry_unused
      ... - kernbench-0.24 on local ext3 and nfs ... but I found one performance regression in my patch at following case. ... (elapsed time: second(s)) ... Average Optimal load -j 96 Run: ...
      (Linux-Kernel)
    • Re: nfsclient process stucks in nfsaio
      ... Some ddb & kgdb ... "NFS problems, locking up" is hightly related to my situation. ... generate lots of traffic to nfs server (write to nfs, ... Client mount options: rw,nosuid,bg,intr,nodev. ...
      (freebsd-stable)