Re: Andrea VM changes

From: Andrea Arcangeli (andrea_at_suse.de)
Date: 08/31/03

    Date:	Sun, 31 Aug 2003 01:57:14 +0200
    To: Marcelo Tosatti <marcelo@parcelfarce.linux.theplanet.co.uk>
    
    

    On Sat, Aug 30, 2003 at 08:30:36PM -0300, Marcelo Tosatti wrote:
    >
    >
    > On Sun, 31 Aug 2003, Andrea Arcangeli wrote:
    >
    > > On Sat, Aug 30, 2003 at 04:21:02PM -0300, Marcelo Tosatti wrote:
    > > > y
    > > >
    > > > On Sat, 30 Aug 2003, Marcelo Tosatti wrote:
    > > >
    > > > > >
    > > > > > Indeed, you are right.
    > > > > >
    > > > > > I'll start looking at them Monday. I'll keep you in touch. Thanks.
    > > > >
    > > > > Andrea,
    > > > >
    > > > > Would you mind to explain me 05_vm_06_swap_out-3 ?
    > > > >
    > > > > I see you change shrink_cache, try_to_free_pages_zone, etc.
    > > > >
    > > > > Can you please give me a detailed explanation of the changes there?
    > > > >
    > > > > I appreciate very much.
    > > > >
    > > > > I'll keep looking at other patches for now.
    > > >
    > > > 05_vm_09_misc_junk-3 removes the PF_MEMDIE and you also seem to remove the
    > > > OOM killer. Is that right? Why?
    > >
    > > because the oom killer is a DoS on servers: on a database setup with 2G
    > > free and, say, all tasks 2.7G large, it'll start killing all the
    > > thousand database tasks instead of the 2G netscape task that hit a
    > > userspace bug and started allocating ram in a loop, and that will
    > > make no progress since no physical ram will be released. There's no need
    > > for an oom killer to keep the system stable with my vm, and the current
    > > probabilistic oom killer in the page fault handler
    >
    > So tasks get killed in case of page allocation failure?

    yes.

    When alloc_pages returns NULL during page fault handling we just
    call do_exit. With 2.2-aa we were even smarter: we also checked whether
    the task had iopl privileges (something that at the moment we can do
    only in the page fault handler, btw), so we could trust the task and
    just send it a SIGTERM a few times, instead of immediately doing a
    do_exit(SIGKILL). That way we wouldn't screw up the graphics card, for
    example (killing an iopl task isn't always safe). But I never forward
    ported this very nice feature to 2.4.

    If alloc_pages returns NULL in any other case, it's up to the caller to
    return -ENOMEM to userspace as the return value of the syscall.
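
The policy described above can be sketched as a small userspace model. This is only an illustration, not kernel code: `on_alloc_result`, `struct task_model`, the enum names, the `trust_iopl` switch and the retry count of 3 (standing in for "a few times") are all hypothetical names and values chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tags for where the allocation was attempted. */
enum alloc_context { CTX_PAGE_FAULT, CTX_SYSCALL };

/* What the caller does after alloc_pages() returns. */
enum oom_action {
    ACT_NONE,           /* allocation succeeded, nothing to do */
    ACT_KILL_CURRENT,   /* page fault path: do_exit(SIGKILL) on the faulting task */
    ACT_SIGTERM,        /* 2.2-aa style: politely ask a trusted iopl task to exit */
    ACT_RETURN_ENOMEM   /* any other path: hand -ENOMEM back to userspace */
};

/* Minimal stand-in for the bits of task state the policy looks at. */
struct task_model {
    bool has_iopl;      /* task holds iopl privileges */
    int sigterm_sent;   /* SIGTERMs already delivered to it */
};

#define MAX_SIGTERM_TRIES 3 /* assumption: "a few times" */

/*
 * Decide what to do given the result of a page allocation.
 * page == NULL models alloc_pages() failing.
 * trust_iopl == true models the 2.2-aa behaviour; false models 2.4,
 * where the iopl check was never forward ported.
 */
enum oom_action on_alloc_result(struct task_model *t, enum alloc_context ctx,
                                const void *page, bool trust_iopl)
{
    assert(t != NULL);

    if (page != NULL)
        return ACT_NONE;

    /* Outside the page fault handler the failure is just an error code. */
    if (ctx != CTX_PAGE_FAULT)
        return ACT_RETURN_ENOMEM;

    /* Trusted iopl task: try SIGTERM a few times before giving up on it. */
    if (trust_iopl && t->has_iopl && t->sigterm_sent < MAX_SIGTERM_TRIES) {
        t->sigterm_sent++;
        return ACT_SIGTERM;
    }

    return ACT_KILL_CURRENT;
}
```

In this model the faulting task itself is the one that gets terminated, so no separate victim-selection pass is involved.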

    > > kills the right task most of the time (unlike the stock oom killer, which
    > > works well only on desktops or developer machines). So it does a
    > > much better job and it doesn't risk DoSing the box due to oom.
    >
    > Mind to explain me in more detail the OOM killing mechanism?

    The current logic depends on alloc_pages returning NULL.

    And whether alloc_pages returns NULL depends on how the
    swapping/cache-shrinking goes.

    The current code in mainline, instead, is even prone to OOM deadlocks in
    the VM: not only can the oom killer make a wrong, DoS-able selection of
    the task on servers, it can even fail to detect an oom condition.
    Another thing that can easily fool the current oom killer is mlocked
    ram: the oom killer is fooled by the fact there's still some swap free,
    so it never kicks in and the box deadlocks. This can't happen with my
    tree since I don't trust the unreliable statistical information we have
    from the kernel: we simply have no way to (efficiently) calculate the
    number of freeable pages at any given time, and as such the only
    reasonable thing we can do is to try to swap/shrink a number of times
    and eventually give up (that is like inefficiently counting the number
    of freeable pages a few times).
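
The contrast drawn above, between trusting a statistic such as "swap is still free" and simply trying to reclaim a bounded number of times, can be sketched in a toy userspace model. Everything here is an assumption made for illustration: `struct mem_model`, `shrink_pass`, `alloc_page_model`, `swap_heuristic_says_oom`, and the constants are hypothetical and are not taken from either the mainline or the -aa source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy view of memory, counted in pages. */
struct mem_model {
    size_t free_pages;
    size_t reclaimable_pages; /* cache or swappable pages a pass can free */
    size_t mlocked_pages;     /* pinned pages, never freeable */
    size_t swap_free_pages;
};

#define MAX_RECLAIM_PASSES 6  /* assumption: "a number of times" */
#define PAGES_PER_PASS     32 /* assumption: work done by one pass */

/* One swap/shrink pass: frees at most PAGES_PER_PASS reclaimable pages. */
static size_t shrink_pass(struct mem_model *m)
{
    size_t n = m->reclaimable_pages < PAGES_PER_PASS
                   ? m->reclaimable_pages : PAGES_PER_PASS;
    m->reclaimable_pages -= n;
    m->free_pages += n;
    return n;
}

/*
 * Bounded-retry allocation: keep shrinking until a page is free or the
 * pass budget is spent. Returning false models alloc_pages() returning
 * NULL, which is the only OOM signal this approach relies on.
 */
bool alloc_page_model(struct mem_model *m)
{
    int passes = 0;

    assert(m != NULL);
    while (m->free_pages == 0) {
        if (passes++ == MAX_RECLAIM_PASSES)
            return false; /* give up: treat as out of memory */
        shrink_pass(m);
    }
    m->free_pages--;
    return true;
}

/*
 * Simplified stand-in for a statistics-based check: it only declares
 * OOM once swap is exhausted, so mlocked ram with swap still free
 * never trips it.
 */
bool swap_heuristic_says_oom(const struct mem_model *m)
{
    return m->free_pages == 0 && m->swap_free_pages == 0;
}
```

With all ram mlocked and some swap still free, the heuristic in this model never reports OOM, while the bounded-retry allocator fails after six fruitless passes and lets the caller react.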

    > > Another DoS generated by the oom killer is that it'll try forever to
    > > kill an UNINTERRUPTIBLE task hanging on an nfs server that is down, so it
    > > hangs the whole box for an unlimited time.
    > >
    > > I've an algorithm that will work, and that will provide very good
    > > guarantees to kill the "best" task to make the machine usable again,
    > > with the needed protection against the security DoSes, but it's in
    > > no way similar to the current oom killer.
    >
    > My concern is about how this oom killer works.

    This oom killer may make a worse selection of the task to kill on
    desktops (the usual ssh now has a chance to be killed), but it fixes
    the oom deadlocks and it won't do stupid things on servers should a
    netscape or whatever other app hit a userspace bug. So I have to prefer
    it until I write a reliable oom killing algorithm that won't fall into
    DoS-able corner cases so easily. mlock/nfs/database are the three most
    common examples of where current mainline can fail; btw, lowmem
    shortage is another very common DoS that the oom killer will never
    notice. My tree doesn't deadlock during lowmem shortage on the 64G
    boxes [or at least not technically; in practice it may look like a
    kernel deadlock even though syscalls return -ENOMEM ;) ].

    Andrea

