Re: CPU hotplug: system hang on CPU hot remove during `pfmon --system-wide'



On Wed, May 30, 2007 at 10:03:01AM -0700, Linus Torvalds wrote:
and that's where all the problems started - sleepers needing to take that mutex
recursively (which we did/do not support).

foo() takes cpu_bitmask_lock and calls
foo_bar() which also needs cpu_bitmask_lock

What is a solution to that?

The solution is to not have crap locking in the first place.

I wish it was that simple wrt cpu hotplug lock :)

It was not just cpufreq which tripped on this locking mess. There was
flush_workqueue and also more recently preempt version of rcu.

flush_workqueue()
mutex_lock(&hotplug_lock); /* we can sleep below */
for_each_online_cpu()
flush_cpu_workqueue();
mutex_unlock(&hotplug_lock);

First problem with this snippet was that a workqueue function for which
we are waiting to be flushed may want to take the same hotplug_lock,
which can deadlock (see http://lkml.org/lkml/2006/12/6/352 for
description of this).

Second problem was : keventd is running flush_workqueue() and one of the
work functions calls flush_workqueue() again which is an attempt to take
hotplug_lock() recursively.

Partly the situation is complex because of the intertwining of
so many subsystem around this one lock. In all cases, encoding two version of
a function, one which takes the hotplug_lock and the other which doesnt take,
is probably an option. IMHO that would just make the source too ugly and
difficult to audit for cpu hotplug safety.

Best would be to go for a lock which supports recursion (and which
avoids other problems inherent with the old mutex based lock) or as
Andrew insisted with the freezer.

Or, if you absolutely have to, support recursive locking. But the bassic
rule should always really be that the need for recursive locking is really
a sign of a locking bug in the first place.

So what's so wrong with just admitting that the locking was crap, and
doing it properly? The _proper_ locking doesn't need recursive locks, it
just takes the lock at the outer layers, and then does *not* take it
internally, because the called routines are called with the lock held and
know they are safe.

We have been able to do that in _every_ single kernel subsystem. What's so
special about cpufreq? NOTHING. Except that the code is sometimes horribly
bad.

--
Regards,
vatsa
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • Re: mtx_lock_recurse/mtx_unlock_recurse functions (proof-of-concept).
    ... > acquire lock recursively, even if lock itself isn't recursable. ... > I also added a KASSERTto protect against mixed locking modes, ... conditionally acquire the socket buffer lock in the KQueue filter because ... Use recursive mutexes so that this recursion is permitted. ...
    (freebsd-arch)
  • Re: deadlocks
    ... >deadlock bug that involved recursive entry into a function that acquired ... Someone had move a lock release to make it out-of-order (as I ... >>But I really don't see any harm in releasing in any order you like. ... but this was obscured by recursion and general code ...
    (comp.lang.java.programmer)
  • Re: rwlocks, correctness over speed.
    ... Only variant to allowread recursion and ... ... Most common cases of writer starvation could be solved by keeping a per thread count of shared acquired rwlocks. ... If a rwlock is currently locked as shared/read AND a thread is blocked on it to lock it exclusively/write - then new shared/read locks ...
    (freebsd-arch)
  • RE: [PATCH -rt] Preemption problem in kernel RT Patch
    ... Something tries to do a spin_lock() on a lock, ... code encapsulates the current packet in a new packet and calls ip_output ... dev_queue_xmit before the check for recursion. ... every time I put in debug to see the routing ...
    (Linux-Kernel)
  • Re: using clustered index to optimize inserts ...
    ... I will try to explain locking in terms of Sybase docs... ... Allpages Locking: Allpages locking locks both data pages and index ... the data page is locked with an exclusive lock. ... Clustered Index: The datarows will be arranged as per the clustered ...
    (comp.databases.sybase)