Re: [PATCH RFC 1/3] Add a trigger API for efficient non-blocking waiting



Andrew Morton wrote:
On Wed, 20 Aug 2008 11:42:27 -0700
Jeremy Fitzhardinge <jeremy@xxxxxxxx> wrote:


Andrew Morton wrote:

On Sat, 16 Aug 2008 09:34:13 -0700 Jeremy Fitzhardinge <jeremy@xxxxxxxx> wrote:



There are various places in the kernel which wish to wait for a
condition to come true while in a non-blocking context. Existing
examples of this are stop_machine() and smp_call_function_mask().
(No doubt there are other instances of this pattern in the tree.)

Thus far, the only way to achieve this is by spinning with a
cpu_relax() loop. This is fine if the condition becomes true very
quickly, but it is not ideal:

- There's little opportunity to put the CPUs into a low-power state.
cpu_relax() may do this to some extent, but if the wait is
relatively long, then we can probably do better.


If this change saves a significant amount of power then we should fix
the offending callsites.


Fix them how? In general we're talking about contexts where we can't
block, and where the wait time is limited by some property of the
platform, such as IPI time or interrupt latency (though doing a
cross-cpu call of a long-running function would be something we could fix).


ah, OK, I'd failed to note that you had identified two specific culprits.

Are either of these operations executed frequently enough for there to
be significant energy savings here?


The energy savings are more gravy, and not really my focus. Arjan tells
me that monitor/mwait are unusably slow in current implementations
anyway. My interest is in the virtual machine case, where bad
interactions with the vcpu scheduler can cause things to spin for 30
milliseconds or more (sometimes much more) in cases that would take only
microseconds when running native.

The s390 people have reported similar things, so this is definitely not
Xen or x86 specific.

- In a virtual environment, spinning virtual CPUs just waste CPU
resources, and may steal CPU time from vCPUs which need it to make
progress. The trigger API allows the vCPUs to give up their CPU
entirely. The s390 people observed a problem with stop_machine
taking a very long time (seconds) when there are more vcpus than
available cpus.


If this change saves a significant amount of virtual-cpu-time then we
should fix the offending callsites.


This case isn't particularly about saving vcpu time, but making timely
progress. stop_machine() gets all the cpus into a spinloop, where they
spin waiting for an event to tell them to go to their next state-machine
state. By definition this can't be a blocking operation (since the
whole point is that they're high priority threads that prevent anything
else from running). But in the virtual case, the fact that they're all
spinning means that the underlying hypervisor has no idea who's just
spinning, and who's trying to do some work needed to make overall
progress, so the whole thing gets bogged down.
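
A rough userspace sketch of the pattern described above, for illustration only. This is not the kernel's stop_machine() code: pthreads stand in for cpus, relax() stands in for cpu_relax(), and every name here (cur_state, acks, set_state, run_state_machine) is made up for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NWORKERS 4

enum { SM_PREPARE, SM_RUN, SM_EXIT };

static atomic_int cur_state = SM_PREPARE;  /* published by the controller */
static atomic_int acks;                    /* workers that saw cur_state  */

/* Stand-in for cpu_relax(); in this sketch it does nothing at all. */
static inline void relax(void) { }

static void *worker(void *arg)
{
    int seen = SM_PREPARE;

    (void)arg;
    while (seen != SM_EXIT) {
        int cur;

        /* Pure spinning: an outside scheduler cannot tell this thread
         * apart from one that is doing useful work. */
        while ((cur = atomic_load(&cur_state)) == seen)
            relax();
        seen = cur;
        atomic_fetch_add(&acks, 1);
    }
    return NULL;
}

/* Controller: publish the next state, then spin until everyone acks. */
static void set_state(int next)
{
    atomic_store(&acks, 0);
    atomic_store(&cur_state, next);
    while (atomic_load(&acks) != NWORKERS)
        relax();
}

/* Runs the whole sequence; returns the number of acks for the last state. */
int run_state_machine(void)
{
    pthread_t threads[NWORKERS];
    int i;

    atomic_store(&cur_state, SM_PREPARE);
    for (i = 0; i < NWORKERS; i++) {
        int rc = pthread_create(&threads[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    set_state(SM_RUN);
    set_state(SM_EXIT);
    for (i = 0; i < NWORKERS; i++)
        pthread_join(threads[i], NULL);
    return atomic_load(&acks);
}
```

With more workers than available cpus, each state change here only completes once every spinning thread has been scheduled, which is the same shape of problem as the vcpu case.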


hm. I'm surprised that stop_machine() is executed frequently enough
for you to care. What's causing it?


The big user is module load/unload, which have been observed to take
multiple seconds in stop_machine with some pathological overload
conditions. It's a pretty major hiccup if you hit it. (It's not
something that you'd deliberately set up except for testing, but it means
that something which might otherwise be a brief transient overload could
turn into a very brittle state with wildly varying performance
characteristics.)

Also Xen suspend/migrate uses stop_machine, and that's actually fairly
latency-sensitive. A live migrate can only have a few tens of
milliseconds of downtime for the virtual machine, so having
stop_machine() with latencies of a similar or longer scale is noticeable.

Now perhaps we could solve stop_machine by modifying the scheduler in
some way, where you can block the run queue so that you sit in the idle
loop even though there are runnable processes waiting. But even then,
stop_machine requires that interrupts be disabled, which means that we're
pretty much limited to spinning.


If stop_machine() is the _only_ problematic callsite and we reasonably
expect that no new ones will pop up then sure, a
stop_machine()-specific fix might be appropriate.

Otherwise, sure, we'd need to look at something more general.


Well smp_call_function() does a spin wait, waiting for the other cpu(s)
to finish running the function. If it's a long-running function, then
that spinning could be arbitrarily long - not that it's a good idea to
call something long-running in interrupt context like that, but you
could see it as a quality of implementation issue.

And again, in a virtual environment, all that spinning competes with
cpus trying to do real work, so even a "short" spin could be arbitrarily
long if it's preventing the event it is waiting for from occurring.

I'm pretty sure there are other places in the kernel which can make use
of a more general facility. There are ~300 non-arch uses of cpu_relax()
in ~100 files, which are all (roughly) waiting for something to become
true. Some are polling on hardware state, and some are waiting for
states set by uncooperative subsystems, but I'd be surprised if a
significant number couldn't be converted to use a higher-level
trigger/spinpletion mechanism.
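
To make "higher-level mechanism" a little more concrete, here is a hedged userspace sketch of what a conversion could look like. It is not the API from the patch: struct trigger, trigger_reset(), trigger_wait(), trigger_kick() and SPIN_LIMIT are hypothetical names chosen for this example, and sched_yield() merely stands in for whatever "give up the cpu" operation a platform or hypervisor might offer.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define SPIN_LIMIT 1000  /* arbitrary: spins before giving up the cpu */

/* Hypothetical trigger: a counter that is bumped on every kick. */
struct trigger { atomic_uint seq; };

/* Snapshot the counter *before* testing the condition, so that a kick
 * arriving between the test and the wait is not lost. */
static unsigned trigger_reset(struct trigger *t)
{
    return atomic_load(&t->seq);
}

/* Wait until someone has kicked since the snapshot was taken. */
static void trigger_wait(struct trigger *t, unsigned snap)
{
    int spins;

    for (spins = 0; atomic_load(&t->seq) == snap; spins++) {
        if (spins >= SPIN_LIMIT) {
            sched_yield();  /* stand-in for a platform-specific yield */
            spins = 0;
        }
    }
}

static void trigger_kick(struct trigger *t)
{
    atomic_fetch_add(&t->seq, 1);
}

static struct trigger trig;
static atomic_int result;

static void *setter(void *arg)
{
    (void)arg;
    atomic_store(&result, 42);  /* make the condition true ...   */
    trigger_kick(&trig);        /* ... then wake up any waiters  */
    return NULL;
}

/* Waits for the setter thread's result without an open-coded spin loop. */
int demo_trigger(void)
{
    pthread_t th;
    int rc = pthread_create(&th, NULL, setter, NULL);

    assert(rc == 0);
    (void)rc;
    for (;;) {
        unsigned snap = trigger_reset(&trig);

        if (atomic_load(&result) != 0)
            break;
        trigger_wait(&trig, snap);
    }
    pthread_join(th, NULL);
    return atomic_load(&result);
}
```

The point of the shape is that the caller only states the condition; how to wait (spin, low-power wait, yield to a hypervisor) is decided in one place behind trigger_wait().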

And the fact that there are so many existing instances in the kernel
suggests that new ones will appear, and they could be encouraged to use
a high-level mechanism from the outset.

J