Re: [PATCH/RFC] A method for clearing out page cache

From: Andrew Morton (akpm_at_osdl.org)
Date: 02/21/05

    Date:	Mon, 21 Feb 2005 14:41:08 -0800
    To: Ray Bryant <raybry@sgi.com>
    
    

    Ray Bryant <raybry@sgi.com> wrote:
    >
    > Andrew Morton wrote:
    > > Martin Hicks <mort@wildopensource.com> wrote:
    > >
    > >>This patch introduces a new sysctl for NUMA systems that tries to drop
    > >> as much of the page cache as possible from a set of nodes. The
    > >> motivation for this patch is for setting up High Performance Computing
    > >> jobs, where initial memory placement is very important to overall
    > >> performance.
    > >
    > >
    > > - Using a write to /proc for this seems a bit hacky. Why not simply add
    > > a new system call for it?
    > >
    >
    > We did it this way because it was easier to get it into SLES9.
    > But there is no particular reason that we couldn't use a system call.
    > It's just that we figured adding system calls is hard.

    aarggh. This is why you should target kernel.org kernels first. Now we
    risk ending up with poor old SUSE carrying an obsolete interface and
    application developers having to cater for both interfaces.

    > > If it does, then userspace could arrange for that concurrency by
    > > starting a number of processes to perform the toss, each with a different
    > > nodemask.
    > >
    >
    > That works fine as well if we can get a system call number assigned and
    > avoids the hackiness of both /proc and the kernel threads.
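Morton's alternative to in-kernel threads is one userspace process per nodemask. A minimal sketch of that idea follows, assuming a hypothetical `toss_pagecache_on_nodes()` as a stand-in for whichever toss interface is eventually merged (no such function exists; the stub here only checks that the mask is non-empty):

```c
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* HYPOTHETICAL placeholder for the proposed per-node toss call.
 * Returns 0 on success, -1 if the mask names no node at all. */
static int toss_pagecache_on_nodes(unsigned long nodemask)
{
    return nodemask != 0 ? 0 : -1;
}

/* Fork one child per nodemask so the tosses run concurrently, then reap
 * them. Returns the number of children that failed, or -1 if fork()
 * itself failed (children already started are still reaped). */
static int parallel_toss(const unsigned long *masks, int nmasks)
{
    pid_t pids[64];
    int failures = 0;

    assert(nmasks >= 0 && nmasks <= 64);
    for (int i = 0; i < nmasks; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            for (int j = 0; j < i; j++)
                waitpid(pids[j], NULL, 0);
            return -1;
        }
        if (pid == 0) /* child: do one toss, report via exit status */
            _exit(toss_pagecache_on_nodes(masks[i]) == 0 ? 0 : 1);
        pids[i] = pid;
    }

    for (int i = 0; i < nmasks; i++) {
        int status;
        if (waitpid(pids[i], &status, 0) < 0 ||
            !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failures++;
    }
    return failures;
}
```

Using separate processes keeps the kernel side single-threaded and lets the administrator choose the degree of parallelism simply by how the nodes are split across masks.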

    syscall numbers are per-arch, so there is no central number to be
    assigned for this one; we can surely have this ready for 2.6.12. Simply
    include i386 and ia64 in the initial patch and other architectures will
    catch up pretty quickly. (It would be nice to generate patches for the
    arch maintainers, however.)
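Because syscall numbers are per-arch, userspace has to cope with architectures that have not wired the call up yet. A minimal sketch of a libc-style wrapper, assuming a hypothetical `__NR_free_node_memory` constant that `<sys/syscall.h>` would provide only on architectures carrying the patch; `syscall(2)` itself is the standard glibc entry point:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Wrapper for the proposed sys_free_node_memory(). __NR_free_node_memory
 * is HYPOTHETICAL: it only exists where an arch has added the syscall. */
static long free_node_memory(long node_id, long pages_to_make_free,
                             long what_to_free)
{
#ifdef __NR_free_node_memory
    return syscall(__NR_free_node_memory, node_id, pages_to_make_free,
                   what_to_free);
#else
    /* This architecture has not caught up yet. */
    (void)node_id;
    (void)pages_to_make_free;
    (void)what_to_free;
    errno = ENOSYS;
    return -1;
#endif
}
```

Callers can treat `-1` with `errno == ENOSYS` as "not available on this kernel or architecture", which is the same answer an unpatched kernel gives for an unknown syscall number.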

    > > - Dropping "as much pagecache as possible" might be a bit crude. I
    > > wonder if we should pass in some additional parameter which specifies how
    > > much of the node's pagecache should be removed.
    > >
    > > Or, better, specify how much free memory we will actually require on
    > > this node. The syscall terminates when it determines that enough
    > > pagecache has been removed.
    >
    > Our thoughts exactly. This is clearly a "big hammer" and we want to
    > make a lighter hammer to free up a certain number of pages. Indeed,
    > we would like to have these calls occur automatically from __alloc_pages()
    > when we try to allocate local storage and find that there isn't any.
    > For our workloads, we want to free up unmapped, clean pagecache, if that
    > is what is keeping us from allocating a local page. Not all workloads
    > want that, however, so we would probably use a sysctl() to enable/disable
    > this.
    >
    > However, the first step is to do this manually from user space.
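The automatic variant Ray describes, where a failed local allocation first drops clean, unmapped pagecache on that node, might behave like the following user-space simulation. It is a sketch under stated assumptions: `struct sim_node`, the `auto_reclaim_enabled` flag standing in for the enabling sysctl, and both functions are hypothetical and unrelated to the real `__alloc_pages()`.

```c
#include <assert.h>
#include <stdbool.h>

/* HYPOTHETICAL per-node bookkeeping for the simulation. */
struct sim_node {
    long free_pages;     /* pages on the free lists */
    long clean_unmapped; /* clean, unmapped pagecache: cheap to drop */
};

/* Stands in for the sysctl that would enable/disable the behaviour. */
static bool auto_reclaim_enabled = true;

/* Drop clean unmapped pagecache until `want` pages are free or nothing
 * reclaimable is left. Returns the number of pages reclaimed. */
static long reclaim_clean_pagecache(struct sim_node *n, long want)
{
    long needed = want - n->free_pages;

    if (needed <= 0)
        return 0;
    if (needed > n->clean_unmapped)
        needed = n->clean_unmapped;
    n->clean_unmapped -= needed;
    n->free_pages += needed;
    return needed;
}

/* Try to satisfy `npages` on the local node, reclaiming first if the
 * toggle is on; otherwise fall back to any remote node with room.
 * Returns the index of the node used, or -1 if none could satisfy it. */
static int alloc_pages_sim(struct sim_node *nodes, int nnodes, int local,
                           long npages)
{
    assert(local >= 0 && local < nnodes && npages > 0);

    if (nodes[local].free_pages < npages && auto_reclaim_enabled)
        reclaim_clean_pagecache(&nodes[local], npages);

    if (nodes[local].free_pages >= npages) {
        nodes[local].free_pages -= npages;
        return local;
    }
    for (int i = 0; i < nnodes; i++) {
        if (i != local && nodes[i].free_pages >= npages) {
            nodes[i].free_pages -= npages;
            return i;
        }
    }
    return -1;
}
```

Reclaiming only the shortfall, rather than everything on the node, is the "lighter hammer"; with the toggle off, the request spills to a remote node as it would today.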

    Yup. The thing is, lots of people want this feature for various reasons.
    Not just numerical-computing-users-on-NUMA. We should get it right for
    them too.

    Especially kernel developers, who have various nasty userspace tools which
    will manually reclaim pagecache. But non-kernel-developers will use it
    too, when they think the VM is screwing them over ;)

    I think Solaris used to have such a tool - /usr/etc/chill, although I
    don't know if it had kernel support.
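For a single file, such userspace tools can already use a standard interface: write back dirty data, then advise the kernel with `posix_fadvise(POSIX_FADV_DONTNEED)`. A minimal sketch (the helper name `drop_file_cache()` is illustrative). Note that this is only advice, it works per file rather than per node, and pages that are mapped or otherwise in use may stay cached:

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Ask the kernel to drop the cached pages of one file.
 * Returns 0 on success, -1 if open/fsync failed, or the error number
 * returned by posix_fadvise(). */
static int drop_file_cache(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Dirty pages cannot simply be dropped, so write them back first. */
    if (fsync(fd) != 0) {
        close(fd);
        return -1;
    }

    /* offset 0 with len 0 means "through the end of the file".
     * posix_fadvise() returns an error number instead of setting errno. */
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return rc;
}
```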

    > >
    > > - To make the syscall more general, we should be able to reclaim mapped
    > > pagecache and anonymous memory as well.
    > >
    > >
    > > So what it comes down to is
    > >
    > > sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)
    > >
    > > where `what_to_free' consists of a bunch of bitflags (unmapped pagecache,
    > > mapped pagecache, anonymous memory, slab, ...).
    >
    > Do we have to implement all of those or just allow for the possibility of that
    > being implemented in the future? E.g., in our case we'd just implement the
    > bit that says "unmapped pagecache".

    Well... please take a look at what's involved. It should just be a matter
    of sprinkling a few tests such as

    + if (sc->mode & SC_RECLAIM_SLAB) {
            ...
    + }

    into the existing code. If things turn nasty then we can take another look
    at it.
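Putting the proposed `sys_free_node_memory(node_id, pages_to_make_free, what_to_free)` together with the sprinkled `sc->mode` tests, a user-space simulation of the semantics might look like this. It assumes, as suggested earlier in the thread, that the call stops once `pages_to_make_free` pages have been freed. Only `SC_RECLAIM_SLAB` and the `sc->mode & ...` test come from the thread; the other flag names and values, the simulated structs, and `free_node_memory_sim()` are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* what_to_free bits (values and most names HYPOTHETICAL). */
#define SC_RECLAIM_UNMAPPED 0x1L /* unmapped pagecache */
#define SC_RECLAIM_MAPPED   0x2L /* mapped pagecache */
#define SC_RECLAIM_ANON     0x4L /* anonymous memory */
#define SC_RECLAIM_SLAB     0x8L /* slab */
#define SC_RECLAIM_ALL (SC_RECLAIM_UNMAPPED | SC_RECLAIM_MAPPED | \
                        SC_RECLAIM_ANON | SC_RECLAIM_SLAB)

/* Simulated counts of reclaimable pages per category on one node. */
struct node_sim {
    long unmapped, mapped, anon, slab;
};

/* Loosely modelled on the `sc` in the patch fragment. */
struct scan_control_sim {
    long mode;          /* what_to_free */
    long nr_to_reclaim; /* pages_to_make_free */
    long nr_reclaimed;
};

/* Reclaim from one category, but never past the caller's target. */
static void reclaim_from(long *pool, struct scan_control_sim *sc)
{
    long want = sc->nr_to_reclaim - sc->nr_reclaimed;
    long got = want < *pool ? want : *pool;

    *pool -= got;
    sc->nr_reclaimed += got;
}

/* Returns the number of pages freed, or -EINVAL on bad arguments.
 * Categories are tried in the order listed in the thread (an assumption);
 * once the target is met, later passes reclaim nothing. */
static long free_node_memory_sim(struct node_sim *nodes, int nnodes,
                                 long node_id, long pages_to_make_free,
                                 long what_to_free)
{
    struct scan_control_sim sc = { what_to_free, pages_to_make_free, 0 };
    struct node_sim *node;

    if (nodes == NULL || node_id < 0 || node_id >= nnodes ||
        pages_to_make_free < 0 || (what_to_free & ~SC_RECLAIM_ALL) != 0)
        return -EINVAL;
    node = &nodes[node_id];

    if (sc.mode & SC_RECLAIM_UNMAPPED)
        reclaim_from(&node->unmapped, &sc);
    if (sc.mode & SC_RECLAIM_MAPPED)
        reclaim_from(&node->mapped, &sc);
    if (sc.mode & SC_RECLAIM_ANON)
        reclaim_from(&node->anon, &sc);
    if (sc.mode & SC_RECLAIM_SLAB)
        reclaim_from(&node->slab, &sc);

    assert(sc.nr_reclaimed <= sc.nr_to_reclaim);
    return sc.nr_reclaimed;
}
```

Implementing only the unmapped-pagecache bit, as Ray proposes, would mean rejecting the other bits with `-EINVAL` until their passes exist; this sketch accepts all four so the flag layout can be exercised.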

