Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifier interface



Davide Libenzi wrote:
On Wed, 17 Jun 2009, Gregory Haskins wrote:


I am not clear on what you are asking here. So in case this was a
sincere question on how things work, here is a highlight of the typical
flow of a packet that ingresses on a physical adapter and routes to KVM
via vbus.

a) interrupt from eth to host wakes the cpu out of idle, enters
interrupt-context.
b) ISR runs, disables interrupts on the eth, schedules softirq-net-rx
c) ISR completes, kernel checks softirq state before IRET, runs pending
softirq-net-rx in interrupt context to NAPI-poll the ethernet
d) packet is pulled out of eth into a layer-2 bridge, and switched to
the appropriate switch-port (which happens to be a venet-tap (soon to be
virtio-net based) device. The packet is queued to the tap as an xmit
request, and the tap's tx-thread is awoken.
e) the xmit call returns, the napi-poll completes, and the
softirq-net-rx terminates. The kernel does an IRET to exit interrupt
context.
f) the scheduler runs and sees the tap's tx-thread is ready to run,
schedules it in.
g) the tx-thread awakens, dequeues the posted skb, copies it to the
virtio-ring, and finally raises an interrupt on irqfd with eventfd_signal().


Heh, Gregory, this isn't a job interview. You didn't have to actually
detail everything ;) Glad you did though, so we've something to talk
later.


:)




At this point, all of the data has been posted to the virtio-ring in
shared memory between the host and guest. All that is left is to inject
the interrupt so the guest knows to process the ring. We call the
eventfd_signal() from kthread context. However, the callback to inject
the interrupt is invoked with the wqh spinlock held so we are forced to
defer the interrupt injection to a workqueue so the kvm->lock can be
safely acquired. This adds additional latency (~7us) in a path that is
only a handful of microseconds to being with. I can post LTTV
screenshots, if it would be helpful to visualize this.


So, what you're trying to say, is that the extra wakeup required by your
schedule_work() processing, makes actually a difference in the time it
takes to go from a) to g),

Yes

where there are at least two other kernel
thread wakeups?

Actually there is only one (the tx-thread) aside from the eventfd
imposed workqueue one. Incidentally, I would love to get rid of the
other thread too, so I am not just picking on eventfd ;). The other is
a lot harder since it has to update the virtio-ring and may need to page
in guest memory to do so.


Don't think in terms of microbenchs,

I'm not. This goes directly to end-application visible performance.
These types of bottlenecks directly affect the IOPS rate in quantifiable
ways of the subsystem in question (and once I re-stablize the vbus tree
against the latest kvm/*fd code, I will post some numbers). This is
particularly true for request-response type operations where stack
latency is the key gating factor. Consider things like high-performance
shared-nothing clusters to a ramdisk (yes, these exist and have
relevance). The overhead of the subsystem can be very low, and is
usually gated by the transaction throughput, which is gated by the IOPS
rate of the medium. A 7us bump when we are only talking about dozens of
microseconds overall is a huge percentage. Other examples might be
high-resolution clock sync (ala ptpd) or FSI trading applications.

The ultimate goal here is to get something like a KVM guest on par with
baremetal to allow the convergence of the HPC space with the
data-center/cloud trend of utilizing VMs as the compute fabric.
Baremetal doesn't have a similar context switch in its stack, and these
little microsecond bumps here and there are the reason why we are not
quite at baremetal levels today with KVM. Therefore, I am working
through trying to eliminate them.

To flip it around on you: try telling a group like the netdev guys that
they should put extra context switches into the stack because they don't
really matter. Be sure to wear extra thick asbestos. ;)

think in how much are those 7us (are
they? really? this is a sync, on-cpu, kernel thread, wake up) are
impacting the whole path.
Everything looks worthwhile in microbenches.





Right, understood, and note that this is precisely why I said it would
oops. What I was specifically trying to highlight is that its not like
this change imposes new requirements on the existing callbacks. I also
wanted to highlight that its not really eventfd's concern what the
callback tries to do, per se (if it wants to sleep it can try, it just
wont work). Any reasonable wakeup callback in existence would already
assume it cannot sleep, and they wouldn't even try to begin with.

On the other hand, what I am introducing here (specifically to eventfd
callbacks, not wait-queues [*]) is the possibility of removing this
restriction under the proper circumstances. It would only be apparent
of this change iff the callback in question tried to test for this (e.g.
checking preemptible()) state. It is thus opt-in, and existing code
does not break.


The interface is just ugly IMO.

Well, everyone is of course entitled to an opinion, but with all due
respect I think you are looking at this backwards. ;)

You have eventfd_signal() that can sleep,
or not, depending on the registered ->signal() function implementations.
This is pretty bad, a lot worse than the theoretical us spent in the
schedule_work() processing.


I wouldn't describe it that way. Whether eventfd_signal() sleeps or not
even under your control as it is. Really what we have is an interface
(eventfd_signal()) that must support being invoked from atomic-context.
As an interface designer, you should never be saying "can this sleep",
but rather "what contexts is it legal to call this interface". You
have already spec'd that eventfd_signal() is atomic-safe, and I am not
proposing to change that. It is, and always will be (unless we screw
up) atomic safe.

Consider kmalloc(GFP_KERNEL), and kmalloc(GFP_ATOMIC). The former says
you must call this from process context, and the latter says you may
call it from atomic context. If you think about it, this is technically
orthogonal to whether it can sleep or not, even though people have
become accustomed to associating atomic == cant-sleep, because today
that is true (at least in mainline). As a counter example, in -rt,
atomic context *is* process context, and kmalloc(GFP_ATOMIC) can in fact
sleep. But all the code still works unmodified because -rt is still
ensuring that the interface contract is met: it works in atomic-context
(its just that atomic-context is redefined ;)

So, all that aside: I restate that eventfd should *not* care about what
its clients do, as long as they meet the requirements of the interface.
In this case, the requirement is that they need to work from atomic
context (and with my proposal, they still will). Its really a question
of should eventfd artificially create an atomic context in some kind of
attempt to enforce that? The answer, IMO, is that it shouldn't if it
doesn't have to, and there are already community accepted patterns (e.g.
RCU) for accomplishing that.

One could also make a secondary argument that holding a spinlock across
a critical section of arbitrary complexity is probably not ideal. Its
fine for quick wake_ups as you originally designed eventfd for.
However, now that we are exploring the generalization of the callback
interface beyond simple wake-ups, this should probably be re-evaluated
independent of my current requirements for low-latency.

The fact is that eventfd is a really neat general signaling idiom.
However, its currently geared towards "signaling = wakeup". As we have
proven with this KVM *fd effort, this is not necessarily accurate to
describe all use cases, nor is it optimal. I'd like to address that.
An alternative, of course, is that we make a private anon-fd solution
within KVM. However, it will be so similar to eventfd so it just seems
wasteful if it can be avoided.

Kind Regards,
-Greg

Attachment: signature.asc
Description: OpenPGP digital signature



Relevant Pages

  • Re: implementing roles in OOP......
    ... > announcement but the Y method name makes no sense in that context. ... > priori knowledge of the structure there is no way for the client to know ... I can imagine an interface that would allow adding Composites ... The example given is exactly the sort of thing that does happen, ...
    (comp.object)
  • Re: LinuxPPS & spinlocks
    ... I meant that if you have to lock between user context and interrupt ... Ok, I see, but how you can get your PPS source data struct starting ...
    (Linux-Kernel)
  • TRANSFER arbitrary data ? (long)
    ... subroutine to minimize a function of n variables, ... end interface ... that to an integer array and back. ... subroutine minimize(Func, x, context) ...
    (comp.lang.fortran)
  • Re: GRU driver feedback
    ... "atomic pte lookup" which looks up the pte atomically using similar ... Existence of the mm is guaranteed thru an indirect path. ... struct cannot go away until the GRU context that caused the interrupt ... When the GRU hardware sends an interrupt, ...
    (Linux-Kernel)
  • Re: LinuxPPS & spinlocks
    ... does nothing (if we are already in an interrupt handler). ... context, but doing like this I will disable interrupts ... protect syscalls from each others and from pps_register/unregister, ... Note that *data* is protected by a lock, ...
    (Linux-Kernel)