Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and SIGPROF.
- From: Frank Mayhar <fmayhar@xxxxxxxxxx>
- Date: Tue, 04 Mar 2008 11:52:56 -0800
I put this on the patch, but I'm emailing it as well.
On Mon, 2008-03-03 at 23:00 -0800, Roland McGrath wrote:
> Thanks for the detailed explanation and for bringing this to my attention.
You're quite welcome.
> This is a problem we knew about when I first implemented posix-cpu-timers
> and process-wide SIGPROF/SIGVTALRM. I'm a little surprised it took this
> long to become a problem in practice. I originally expected to have to
> revisit it sooner than this, but I certainly haven't thought about it for
> quite some time. I'd guess that HZ=1000 becoming common is what did it.
Well, the iron is getting bigger, too, so it's beginning to be feasible
to run _lots_ of threads.
> The obvious implementation for the process-wide clocks is to have the
> tick interrupt increment shared utime/stime/sched_time fields in
> signal_struct as well as the private task_struct fields. The all-threads
> totals accumulate in the signal_struct fields, which would be atomic_t.
> It's then trivial for the timer expiry checks to compare against those
> totals.
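
For illustration only, the scheme described above might look roughly like the following userspace sketch. It assumes C11 atomics as a stand-in for atomic_t; the struct and function names (shared_times, thread_times, account_tick, and so on) are hypothetical and are not the kernel's. It relies on the usual interval-timer semantics: the SIGVTALRM clock counts user time only, while the SIGPROF clock counts user plus system time.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared totals kept in signal_struct. */
struct shared_times {
	atomic_uint_fast64_t utime;	/* user ticks, all threads */
	atomic_uint_fast64_t stime;	/* system ticks, all threads */
};

/* Hypothetical stand-in for the private task_struct fields. */
struct thread_times {
	uint64_t utime;
	uint64_t stime;
};

static void shared_times_init(struct shared_times *s)
{
	atomic_init(&s->utime, 0);
	atomic_init(&s->stime, 0);
}

/* One tick for the running thread: bump its private field and the shared total. */
static void account_tick(struct thread_times *t, struct shared_times *s,
			 bool user_mode)
{
	assert(t != NULL && s != NULL);
	if (user_mode) {
		t->utime += 1;
		atomic_fetch_add_explicit(&s->utime, 1, memory_order_relaxed);
	} else {
		t->stime += 1;
		atomic_fetch_add_explicit(&s->stime, 1, memory_order_relaxed);
	}
}

/* SIGVTALRM-style check: user time only, no walk over the threads. */
static bool virt_expired(struct shared_times *s, uint64_t expires)
{
	return atomic_load_explicit(&s->utime, memory_order_relaxed) >= expires;
}

/* SIGPROF-style check: user plus system time. */
static bool prof_expired(struct shared_times *s, uint64_t expires)
{
	uint64_t total = atomic_load_explicit(&s->utime, memory_order_relaxed)
		       + atomic_load_explicit(&s->stime, memory_order_relaxed);
	return total >= expires;
}
```

The expiry check is constant-time regardless of thread count, but every tick on every CPU running a thread of the process touches the same two words.
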
> The concern I had about this was multiple CPUs competing for the
> signal_struct fields. (That is, several CPUs all running threads in the
> same process.) If the ticks on each CPU are even close to synchronized,
> then every single time all those CPUs will do an atomic_add on the same
> word. I'm not any kind of expert on SMP and cache effects, but I know
> this is bad. However bad it is, it's that bad all the time, and however
> few threads there are (down to 2), it's that bad for that many CPUs.
> The implementation we have instead is obviously dismal for large numbers
> of threads. I always figured we'd replace that with something based on
> more sophisticated thinking about the CPU-clash issue.
> I don't entirely follow your description of your patch. It sounds like it
> should be two patches, though. The second of those patches (workqueue)
> sounds like it could be an appropriate generic cleanup, or like it could
> be a complication that might be unnecessary if we get a really good
> solution to the main issue.
> The first patch I'm not sure whether I understand what you said or not.
> Can you elaborate? Or just post the unfinished patch as illustration,
> marking it as not for submission until you've finished.
My first patch did essentially what you outlined above, incrementing
shared utime/stime/sched_time fields, except that they were in the
task_struct of the group leader rather than in the signal_struct. It's
not clear to me exactly how the signal_struct is shared, whether it is
shared among all threads or if each has its own version.
So each timer routine had something like:

	/* If we're part of a thread group, add our time to the leader. */
	if (p->group_leader != NULL)
		p->group_leader->threads_sched_time += tmp;

and check_process_timers() had

	/* Times for the whole thread group are held by the group leader. */
	utime = cputime_add(utime, tsk->group_leader->threads_utime);
	stime = cputime_add(stime, tsk->group_leader->threads_stime);
	sched_time += tsk->group_leader->threads_sched_time;
Of course, this alone is insufficient. It speeds things up a tiny bit
but not nearly enough.
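
As a rough model of what the first patch changes, the sketch below contrasts summing the group's time by walking every thread with reading a running total held by the group leader. It is a simplified userspace illustration, not kernel code: struct task, next_thread and the function names are hypothetical, and the plain `+=` on the leader's field ignores the locking or atomics a real implementation would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for task_struct. */
struct task {
	struct task *group_leader;	/* the leader points at itself */
	struct task *next_thread;	/* NULL-terminated list of threads */
	uint64_t sched_time;		/* this thread's own time */
	uint64_t threads_sched_time;	/* leader only: total for the group */
};

/* Per-tick accounting: update the private field and fold into the leader. */
static void account_sched_time(struct task *p, uint64_t delta)
{
	assert(p != NULL);
	p->sched_time += delta;
	if (p->group_leader != NULL)
		p->group_leader->threads_sched_time += delta; /* unsynchronized here */
}

/* Summing by walking the group: cost grows with the number of threads. */
static uint64_t group_time_by_walk(const struct task *leader)
{
	uint64_t sum = 0;

	for (const struct task *t = leader; t != NULL; t = t->next_thread)
		sum += t->sched_time;
	return sum;
}

/* Reading the leader's running total: constant cost. */
static uint64_t group_time_from_leader(const struct task *tsk)
{
	return tsk->group_leader->threads_sched_time;
}
```

Both functions return the same total; only the second avoids touching every thread, which covers the summing step but not the rest of the per-tick work.
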
The other issue has to do with the rest of the processing in
run_posix_cpu_timers(), walking the timer lists and walking the whole
thread group (again) to rebalance expiry times. My second patch moved
all that work to a workqueue, but only if there were more than 100
threads in the process. This basically papered over the problem by
moving the processing out of interrupt and into a kernel thread. It's
still insufficient, though, because it takes just as long and will get
backed up just as badly on large numbers of threads. This was made
clear in a test I ran yesterday where I generated some 200,000 threads.
The work queue was unreasonably large, as you might expect.
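
A back-of-the-envelope model shows why deferring the work does not reduce it. The sketch below assumes that each tick on each CPU running one of the process's threads triggers one walk over the whole thread group; the function names and the CPU count in the usage note are hypothetical, while the threshold of 100 threads, HZ=1000 and the 200,000-thread test come from the discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Threshold from the second patch: defer only above this many threads. */
#define DEFER_THRESHOLD 100

/* Would this thread group's timer work go to the workqueue? */
static bool should_defer(uint64_t nthreads)
{
	return nthreads > DEFER_THRESHOLD;
}

/*
 * Thread visits per second under the stated model: one full walk of the
 * group per tick per busy CPU. Moving the walk to a kernel thread changes
 * where this work runs, not how much of it there is.
 */
static uint64_t thread_visits_per_second(uint64_t hz, uint64_t busy_cpus,
					 uint64_t nthreads)
{
	assert(hz > 0);
	return hz * busy_cpus * nthreads;
}
```

For example, `thread_visits_per_second(1000, 8, 200000)` gives 1,600,000,000 visits per second for an assumed 8 busy CPUs, which is consistent with the queue backing up in the 200,000-thread test.
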
I am looking for a way to do everything that needs to be done in fewer
operations, but unfortunately I'm not familiar enough with the
SIGPROF/SIGVTALRM semantics or with the details of the Linux
implementation to know where it is safe to consolidate things.
--
Frank Mayhar <fmayhar@xxxxxxxxxx>
Google, Inc.