Re: [BUG] long freezes on thinkpad t60




* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

On Thu, 21 Jun 2007, Ingo Molnar wrote:

what worries me a bit though is that my patch that made spinlocks
equally agressive to that loop didnt solve the hangs!

Your parch kept doing "spin_trylock()", didn't it?

yeah - it changed spin_lock()'s assembly to do a "LOCK BTRL", which is a
trylock which tries to dirty the cacheline. There was a "REP NOP" after
it and a loop back to the "LOCK BTRL".

That's a read-modify-write thing, and keeps bouncing the cacheline
back and forth, and together with the fact that even *after* you get
the spinlock the "wait_for_inactive()" would actually end up looping
back, releasing it, and re-getting it.

So the problem was that "wait_for_inactive()" kept the lock (because
it actually *got* it), and looped over getting it, and because it was
an exclusive cacheline ownership, that implies that somebody else is
not getting it, and is kept from ever getting it.

ok, it's not completely clear where exactly the other core was spinning,
but i took it from Miklos' observations that the other core was hanging
in the _very same_ task_rq_lock() - which is a true spinlock as well
that acquires it. So on one core the spin_lock() was starving, on
another one it was always succeeding.

So trying to use "trylock" doesn't help. It still has all the same bad
sides - it still gets the lock (getting the lock wasn't the problem:
_holding_ the lock was the problem), and it still kept the cache line
for the lock on one core.

so the problem was not the trylock based spin_lock() itself (no matter
how it's structured in the assembly), the problem was actually modifying
the lock and re-modifying it again and again in a very tight
high-frequency loop, and hence not giving it to the other core?

The only way to avoid lock contention is to avoid any exclusive use at
all.

yeah - i'm not at all arguing in favor of the BTRL patch i did: i always
liked the 'nicer' inner loop of spinlocks, which could btw also easily
use MONITOR/MWAIT. (my patch is also quite close to what we did in
spinlocks many years ago, so it's more of a step backwards than real
progress.)

So it seems the problem was that if a core kept _truly_ modifying a
cacheline via atomics in a high enough frequency, it could artificially
starve the other core. (which would keep waiting for the cacheline to be
released one day, and which kept the first core from ever making any
progress) To me that looks like a real problem on the hardware side -
shouldnt cacheline ownership be arbitrated a bit better than that?

Up to the point where some external event (perhaps a periodic SMM
related to thermal management) broke the deadlock/livelock scenario?

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • Re: [BUG] long freezes on thinkpad t60
    ... With this patch I can't reproduce the problem, ... The problem is, that the open-coded loop there is totally fine, ... cacheline accesses that causes cacheline starvation?) ... interrupt on one core tries to acquire the runqueue-lock but does not ...
    (Linux-Kernel)
  • Re: SE Headphones Amp
    ... RDH4 recommends quoted primary Henries should be measured ... wiggly loop than yours, but the principle remains the same: ... Not easy to see with the depicted core material which is so ... at zero B. A common misconception is that dB/dH falls to ...
    (rec.audio.tubes)
  • Re: atomic operation problem
    ... I can not fully understand how to use this 'lock' field. ... it for the whole duration of the instruction. ... Increment the value. ... or core wants to increment the value at the same time. ...
    (comp.arch.embedded)
  • Re: SE Headphones Amp
    ... I googled "minor BH loop". ... sections of the sigma at the zero voltage crossing point, ... the inductance will look a bit like ... Not easy to see with the depicted core material which is ...
    (rec.audio.tubes)
  • Re: Passive reradiating antenna
    ... > antenna simply has better efficiency than a standard loop of the ... but when a ferrite core is placed inside a ... ferrite core material. ... increases approximately in proportion to the permeability of the core. ...
    (rec.radio.amateur.antenna)