Re: [BUG?] 2.6.25-rc[23]-mm1 cgroup list corruption under load with VM Scalability patches
- From: Lee Schermerhorn <Lee.Schermerhorn@xxxxxx>
- Date: Tue, 18 Mar 2008 14:10:11 -0400
On Wed, 2008-03-05 at 13:09 -0800, Paul Menage wrote:
> On Wed, Mar 5, 2008 at 11:37 AM, Lee Schermerhorn
> <Lee.Schermerhorn@xxxxxx> wrote:
> > list_del corruption in cgroup_exit() on 16 cpu, 32GB ia64 NUMA platform.
> >
> > I've been seeing this for a while now, but we've had known problems
> > [page leaks, ...] with the VM scalability series. Now the system
> > appears to be running very well with these patches under stress loads
> > that would hang it or cause OOM kill of tests with plenty of swap space
> > left. Eventually, [after 40-45 minutes], I hit a list corruption in
> > cgroup_exit().
> >
> > I can't say for sure that our patches aren't causing this, but I've been
> > unable to keep the system up long enough under the stress load w/o the
> > splitlru+noreclaim patches to hit the problem.
> >
> > I looked in the mailing lists and found one other thread related to
> > cgroup list corruption:
> >
> > http://marc.info/?l=linux-kernel&m=119263666823236&w=4
> >
> > Paul looked into this and couldn't see anywhere that the lists are
> > manipulated w/o holding the css set lock. I concur. I did find one
> > possible race in enabling the task cg_lists [see patch below], but this
> > did not solve the problem. And I did not hit the printk in the patch.
>
> No, that's not a (malign) race - cgroup_enable_task_cg_lists() is
> idempotent. In the case that you see, every thread seen in the
> do_each_thread() loop will already have a non-empty cg_list field, so
> it will be a no-op. So adding the additional check isn't wrong but
> it's not needed.
>
> I'll look again at the code to try to figure out where the problem is.
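
For readers following the idempotency argument above, here is a minimal
userspace sketch of why a second pass over already-linked tasks is a no-op.
It is not the kernel code: toy_task, toy_css_set, toy_enable_task_cg_lists()
and the list helpers are hypothetical stand-ins, and locking is deliberately
left out. The only point it illustrates is that an "add only if cg_list is
empty" check makes repeated calls harmless.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled loosely on list_head. */
struct list_head { struct list_head *next, *prev; };

static void init_list_head(struct list_head *h) { h->next = h; h->prev = h; }

static bool list_is_empty(const struct list_head *h) { return h->next == h; }

static void list_add_head(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

static size_t list_count(const struct list_head *head)
{
	size_t n = 0;
	for (const struct list_head *p = head->next; p != head; p = p->next)
		n++;
	return n;
}

/* Hypothetical stand-ins for css_set and task_struct (assumptions). */
struct toy_css_set { struct list_head tasks; };
struct toy_task { struct list_head cg_list; struct toy_css_set *cgroups; };

/*
 * Link every task that is not yet on its css_set's task list.
 * Returns how many tasks were linked by this call, so a repeated
 * call over the same tasks returns 0 and changes nothing.
 */
static size_t toy_enable_task_cg_lists(struct toy_task *tasks, size_t n)
{
	size_t linked = 0;

	for (size_t i = 0; i < n; i++) {
		assert(tasks[i].cgroups != NULL);
		if (list_is_empty(&tasks[i].cg_list)) {
			list_add_head(&tasks[i].cg_list, &tasks[i].cgroups->tasks);
			linked++;
		}
	}
	return linked;
}
```

Calling toy_enable_task_cg_lists() twice on the same array links each task
on the first call and nothing on the second, which matches the "no-op"
behaviour described in the quoted reply.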
Paul:
Just wanted to let you know that I did manage to hit this list
corruption--same stack trace: cgroup_exit() from do_exit() ...--on
25-rc3-mm1 WITHOUT any of the vm scalability [split-lru/noreclaim-mlock]
patches applied. This occurred ~9 minutes into a fairly heavy 'usex'
load on my 16 cpu ia64 platform.
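
As background on what a "list_del corruption" report means, the sketch below
shows the kind of consistency check a debug list_del performs before
unlinking: both neighbours must still point back at the entry being removed.
This is a simplified userspace illustration, not the kernel implementation;
checked_list_del() and the other helper names are hypothetical, and the stale
pointers left in the removed entry stand in for the poison values the kernel
writes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled loosely on list_head. */
struct list_head { struct list_head *next, *prev; };

static void init_list_head(struct list_head *h) { h->next = h; h->prev = h; }

static void list_add_tail_entry(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/*
 * Unlink entry only if its neighbours still point back at it.
 * Returns true on a clean removal. Returns false, leaving the list
 * untouched, when the links disagree, for example after the entry
 * has already been removed by another path.
 */
static bool checked_list_del(struct list_head *entry)
{
	assert(entry != NULL);

	if (entry->prev->next != entry || entry->next->prev != entry)
		return false;	/* corruption: neighbours do not agree */

	entry->next->prev = entry->prev;
	entry->prev->next = entry->next;
	/* entry->next and entry->prev are intentionally left stale here. */
	return true;
}
```

In this model a second checked_list_del() on the same entry fails the check,
which is one way (though not the only one) that a missed lock or a double
removal would surface as a corruption report.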
An x86_64 version [w/ prebuilt binaries of the tools used] of the stress
load is available here:
http://free.linux.hp.com/~lts/Temp/
There's a README there describing the contents of the tarball. I
haven't tried this load on an x86_64 recently, so I don't know if it
will trigger the problem there.
Lee