Re: [RFC][0/3] Virtual address space control for cgroups (v2)
- From: Balbir Singh <balbir@xxxxxxxxxxxxxxxxxx>
- Date: Thu, 27 Mar 2008 13:34:56 +0530
Paul Menage wrote:
On Wed, Mar 26, 2008 at 11:49 AM, Balbir Singh
<balbir@xxxxxxxxxxxxxxxxxx> wrote:
The changelog in each patchset documents what has changed in version 2.
The most important one being that virtual address space accounting is
now a config option.
Reviews, Comments?
I'm still of the strong opinion that this belongs in a separate
subsystem. (So some of these arguments will appear familiar, but are
generally because they were unaddressed previously).
I thought I addressed some of those by adding a separate config option. You
could enable just the address space control by leaving memory.limit_in_bytes at
the maximum value it is set to at the moment.
The basic philosophy of cgroups is that one size does not fit all
(either all users, or all task groups), hence the ability to
pick'n'mix subsystems in a hierarchy, and have multiple different
hierarchies. So users who want physical memory isolation but not
virtual address isolation shouldn't have to pay the cost (multiple
atomic operations on a shared structure) on every mmap/munmap or other
address space change.
Yes, I agree with the overhead philosophy. I suspect that users will enable
both. I am not against making it a separate controller. I am still hopeful of
getting the mm->owner approach working.
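
To make the cost under discussion concrete, here is a minimal userspace sketch
(not the code from the patches) of charging a shared per-group counter on every
address space change. The names va_counter, va_charge and va_uncharge are
hypothetical, and C11 atomics stand in for whatever synchronisation the real
counter uses.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-group counter, shared by every task in the group. */
struct va_counter {
    atomic_size_t usage;   /* bytes of address space currently charged */
    size_t limit;          /* maximum bytes the group may map */
};

/*
 * Charge `bytes` against the group. Returns false if the limit would be
 * exceeded, in which case the caller would fail the mmap() with ENOMEM.
 */
static bool va_charge(struct va_counter *c, size_t bytes)
{
    size_t old = atomic_load(&c->usage);

    do {
        /* Written this way so the comparison cannot overflow. */
        if (bytes > c->limit || old > c->limit - bytes)
            return false;
        /* On failure, `old` is reloaded and the limit is re-checked. */
    } while (!atomic_compare_exchange_weak(&c->usage, &old, old + bytes));

    return true;
}

/* Undo a charge, e.g. on munmap(). */
static void va_uncharge(struct va_counter *c, size_t bytes)
{
    atomic_fetch_sub(&c->usage, bytes);
}
```

Every task in the group hits the same counter on each mmap/munmap, which is
the shared-structure cost that groups not using VA limits would rather avoid.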
Trying to account/control physical memory or swap usage via virtual
address space limits is IMO a hopeless task. Taking Google's
production clusters and the virtual server systems that I worked on in
my previous job as real-life examples that I've encountered, there's
far too much variety of application behaviour (including Java apps
that have massive sparse heaps, jobs with lots of forked children
sharing pages but not address spaces with their parents, and multiple
serving processes mapping large shared data repositories from SHM
segments) that saying VA = RAM + swap is going to break lots of jobs.
But pushing up the VA limit massively makes it useless for the purpose
of preventing excessive swapping. If you want to prevent excessive
swap space usage without breaking a large class of apps, you need to
limit swap space, not virtual address space.
Additionally, you suggested that VA limits provide a "soft-landing".
But I think that the number of applications that will do much other
than abort() if mmap() returns ENOMEM is extremely small - I'd be
interested to hear if you know of any.
What happens if swap is completely disabled? Should the task running be OOM
killed in the container? How does the application get to know that it is
reaching its limit? I suspect the system administrator will consider
vm.overcommit_ratio while setting up virtual address space limits and real page
usage limits. As far as applications failing gracefully is concerned, my opinion is:
1. Let's not let bad applications dictate how we design our features.
2. Autonomic computing is forcing applications to see what resources
they do have access to.
3. Swapping is expensive, so most application developers I have spoken to at
conferences recently say that they can manage their own memory, provided
they are given sufficient hints from the OS. An mmap() failure, for example, can
force the application to free memory it is not currently using or trigger the
garbage collector in a managed environment.
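
As a rough illustration of point 3, an application could treat ENOMEM from
mmap() as a hint to release memory and retry. This is only a sketch under
assumptions: map_with_reclaim, reclaim_fn, drop_spare and the spare buffer are
hypothetical names, and a real application would plug in its own cache or
garbage collector.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS with strict -std= settings */
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical callback: frees something, returns nonzero if it did. */
typedef int (*reclaim_fn)(void);

/* Hypothetical cache mapping the application can live without. */
static void *spare;
static size_t spare_len;

/* Example reclaim callback: unmap the spare buffer if there is one. */
static int drop_spare(void)
{
    if (spare == NULL)
        return 0;
    munmap(spare, spare_len);
    spare = NULL;
    spare_len = 0;
    return 1;
}

/*
 * Map `len` bytes of anonymous memory. On ENOMEM, ask the application to
 * give memory back and retry; give up once nothing more can be reclaimed.
 */
static void *map_with_reclaim(size_t len, reclaim_fn reclaim)
{
    for (;;) {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p != MAP_FAILED)
            return p;
        if (errno != ENOMEM || reclaim == NULL || !reclaim())
            return NULL;
    }
}
```

A caller would use map_with_reclaim(len, drop_spare) wherever it currently
calls mmap() directly and aborts on failure.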
I'm not going to argue that there are no good reasons for VA limits,
but I think my arguments above will apply in enough cases that VA
limits won't be used in the majority of cases that are using the
memory controller, let alone all machines running kernels with the
memory controller configured (e.g. distro kernels). Hence it should be
possible to use the memory controller without paying the full overhead
for the virtual address space limits.
Yes, the overhead part is a compelling reason to split out the controllers. But
then again, we have a config option to disable the overhead.
And in cases that do want to use VA limits, can you be 100% sure that
they're going to want to use the same groupings as the memory
controller? I'm not sure that I can come up with a realistic example
of why you'd want to have VA limits and memory limits in different
hierarchies (maybe tracking memory leaks in subgroups of a job and
using physical memory control for the job as a whole?), but any such
example would work for free if they were two separate subsystems.
The only real technical argument against having them in separate
subsystems is that there needs to be an extra pointer from mm_struct
to a va_limit subsystem object if they're separate, since the VA
limits can no longer use mm->mem_cgroup. This is basically 8 bytes of
overhead per process (not per-thread) which is minimal, and even that
could go away if we were to implement the mm->owner concept.
Yes, the mm->owner patches would help split the controller out more easily. Let
me see if I can get another revision of that working and measure the overhead of
finding the next mm->owner.
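
For reference, the per-process cost Paul mentions is one extra pointer in
mm_struct. The structs below are simplified stand-ins, not the kernel's
definitions; mm_shared, mm_split, va_cgroup and split_overhead are
hypothetical names used only to show where the 8 bytes (on a 64-bit machine)
come from.

```c
#include <stddef.h>

struct mem_cgroup;   /* opaque stand-in for the memory controller state */
struct va_cgroup;    /* hypothetical state for a separate VA subsystem */

/* Combined controller: VA limits reuse the existing mem_cgroup pointer. */
struct mm_shared {
    struct mem_cgroup *mem_cgroup;
};

/* Separate subsystems: one additional pointer per address space. */
struct mm_split {
    struct mem_cgroup *mem_cgroup;
    struct va_cgroup *va_cgroup;
};

/* Extra bytes per mm when the controllers are split. */
static size_t split_overhead(void)
{
    return sizeof(struct mm_split) - sizeof(struct mm_shared);
}
```

With mm->owner the second pointer would not be needed, since the subsystem
state could be reached through the owning task instead.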
Paul
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL