Re: Processes spinning forever, apparently in lock_timer_base()?



On Fri, 21 Sep 2007 16:58:15 +0100 (BST) Hugh Dickins
<hugh@xxxxxxxxxxx> wrote:

But once I look harder at it, I wonder what would have kept
2.6.18 to 2.6.23 safe from the same issue: per-cpu deltas from
the global vm stats too low to get synched back to global, yet
adding up to something which misleads balance_dirty_pages into
an indefinite loop e.g. total nr_writeback actually 0, but
appearing more than dirty_thresh in the global approximation.

This could only happen when: dirty_thresh < nr_cpus * per_cpu_max_delta

Looking at the 2.6.18-2.6.23 code, I'm uncertain what to try instead.
There is a refresh_vm_stats function which we could call (then retest
the break condition) just before resorting to congestion_wait. But
the big NUMA people might get very upset with me calling that too
often: causing a thundering herd of bouncing cachelines which that
was all designed to avoid. And it's not obvious to me what condition
to test for dirty_thresh "too low".

That could be modeled on the error limit I have. For this particular
case that would end up looking like:

nr_online_cpus * pcp->stat_threshold.

I believe Peter gave all this quite a lot of thought when he was
making the rc6-mm1 changes, and I'd rather defer to him for a
suggestion of what best to do in earlier releases. Or maybe he'll
just point out how this couldn't have been a problem before.

As outlined above, and I don't think we'll ever have such a low
dirty_limit. But who knows :-)

Or there is is Richard's patch, which I haven't considered, but
Andrew was not quite satisfied with it - partly because he'd like
to understand how the situation could come about first, perhaps
we have now got an explanation.

I'm with Andrew on this, that is, quite puzzled on how all this arises.

Testing those writeback-fix-* patches might help rule out (or point to)
a mis-function of pdflush.

The theory that one task will spin in balance_dirty_pages() on a
bdi that does not actually have many dirty pages, doesn't sound
plausible because eventually the total dirty count (well, actually
dirty+unstable+writeback) should subside again. This theory can
cause crappy latencies, but should not 'hang' the machine.

(The original bug report was indeed on SMP, but I haven't seen
anyone say that's a necessary condition for the hang: it would
be if this is the issue. And Richard writes at one point of the
system only responding to AltSysRq: that would be surprising for
this issue, though it's possible that a task in balance_dirty_pages
is holding an i_mutex that everybody else comes to need.)

Are we actually holding i_mutex on paths that lead into
balance_dirty_pages? that does (from my admittedly limited knowledge of
the vfs) sound like trouble, since we'd need it to complete writeback.

All quite puzzling.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • Re: Would Jesus have packed a gun?
    ... but that idea doesn't sound right. ... Guns, I think, are ... for people who don't have real power. ... holding the Truth, not from holding office for example. ...
    (talk.politics.guns)
  • Will Pay for Immediate Professional Help on Fedora Core 4
    ... The distro kernel did not detect the soundcard at all ... I tried installing the sound drivers for NVIDIA but no joy under ... The first boot of the new kernel ran fine. ... Subsequent boots of the system hang. ...
    (linux.redhat)
  • Re: Change merge document based on condition
    ... > sound like it, but I guess I'm confused as to where table B would hang out ... Peter Jamieson ...
    (microsoft.public.word.mailmerge.fields)
  • Not sure about this . Need help
    ... just hang at there and only the sound come out (sometimes the sound ... Chip Type: GeForce MX 440 ... Monitor: Default Monitor ...
    (microsoft.public.windowsxp.help_and_support)
  • OE6 hangs at end of receiving mail
    ... program freezing at the end of email receiving. ... secs to 1 minute and nothing else can be done within OE until the hang ... I turned off the 'Play sound when new messages arrive' Tools> options> ...
    (microsoft.public.windows.inetexplorer.ie6_outlookexpress)