RE: [PATCHv5 3/3] vhost_net: a kernel-level virtio server



Michael,
We are very interested in your patch and would like to try it.
I have collected your 3 patches on the kernel side and 4 patches on the qemu side.
The patches are listed here:

PATCHv5-1-3-mm-export-use_mm-unuse_mm-to-modules.patch
PATCHv5-2-3-mm-reduce-atomic-use-on-use_mm-fast-path.patch
PATCHv5-3-3-vhost_net-a-kernel-level-virtio-server.patch

PATCHv3-1-4-qemu-kvm-move-virtio-pci[1].o-to-near-pci.o.patch
PATCHv3-2-4-virtio-move-features-to-an-inline-function.patch
PATCHv3-3-4-qemu-kvm-vhost-net-implementation.patch
PATCHv3-4-4-qemu-kvm-add-compat-eventfd.patch

I applied the kernel patches on v2.6.31-rc4 and the qemu patches on the latest kvm qemu.
But it seems some additional patches are needed on current qemu, at least the irqfd
and ioeventfd patches. I cannot create a kvm guest with "-net nic,model=virtio,vhost=vethX".

Could you kindly advise us on the exact list of patches needed to make it work?
Thanks a lot. :-)

Thanks
Xiaohui
-----Original Message-----
From: kvm-owner@xxxxxxxxxxxxxxx [mailto:kvm-owner@xxxxxxxxxxxxxxx] On Behalf Of Michael S. Tsirkin
Sent: Wednesday, September 09, 2009 4:14 AM
To: Ira W. Snyder
Cc: netdev@xxxxxxxxxxxxxxx; virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx; kvm@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; mingo@xxxxxxx; linux-mm@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; hpa@xxxxxxxxx; gregory.haskins@xxxxxxxxx; Rusty Russell; s.hetze@xxxxxxxxxxxx
Subject: Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server

On Tue, Sep 08, 2009 at 10:20:35AM -0700, Ira W. Snyder wrote:
On Mon, Sep 07, 2009 at 01:15:37PM +0300, Michael S. Tsirkin wrote:
On Thu, Sep 03, 2009 at 11:39:45AM -0700, Ira W. Snyder wrote:
On Thu, Aug 27, 2009 at 07:07:50PM +0300, Michael S. Tsirkin wrote:
What it is: vhost net is a character device that can be used to reduce
the number of system calls involved in virtio networking.
Existing virtio net code is used in the guest without modification.

There's similarity with vringfd, with some differences and reduced scope
- uses eventfd for signalling
- structures can be moved around in memory at any time (good for migration)
- support memory table and not just an offset (needed for kvm)

Common virtio-related code has been put in a separate file, vhost.c, and
can be made into a separate module if/when more backends appear. I used
Rusty's lguest.c as the source for developing this part: this supplied
me with witty comments I wouldn't be able to write myself.

What it is not: vhost net is not a bus, and not a generic new system
call. No assumptions are made on how guest performs hypercalls.
Userspace hypervisors are supported as well as kvm.

How it works: Basically, we connect virtio frontend (configured by
userspace) to a backend. The backend could be a network device, or a
tun-like device. In this version I only support raw socket as a backend,
which can be bound to e.g. SR IOV, or to macvlan device. Backend is
also configured by userspace, including vlan/mac etc.

Status:
This works for me, and I haven't seen any crashes.
I have done some light benchmarking (with v4). Compared to userspace, I
see improved latency (as I save up to 4 system calls per packet) but not
bandwidth/CPU (as TSO and interrupt mitigation are not supported). For
ping benchmark (where there's no TSO) throughput is also improved.

Features that I plan to look at in the future:
- tap support
- TSO
- interrupt mitigation
- zero copy


Hello Michael,

I've started looking at vhost with the intention of using it over PCI to
connect physical machines together.

The part that I am struggling with the most is figuring out which parts
of the rings are in the host's memory, and which parts are in the
guest's memory.

All rings are in guest's memory, to match existing virtio code.

Ok, this makes sense.

vhost
assumes that the memory space of the hypervisor userspace process covers
the whole of guest memory.

Is this necessary? Why?

Because with virtio, the ring can give us arbitrary guest addresses. If
the guest was limited to using a subset of addresses, the hypervisor would
only have to map these.

The assumption seems very wrong when you're
doing data transport between two physical systems via PCI.
I know vhost has not been designed for this specific situation, but it
is good to be looking toward other possible uses.

And there's a translation table.
Ring addresses are userspace addresses, they do not undergo translation.
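To illustrate the translation table mentioned above, here is a minimal sketch, assuming the table is a list of regions that each map a guest-physical range onto a userspace address in the hypervisor process. The names `Region` and `translate` and the example addresses are hypothetical and are not taken from the patch. The lookup would apply to guest addresses found inside the ring, not to the ring addresses themselves, which are already userspace addresses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Region:
    guest_phys_addr: int  # start of the range in guest-physical space
    size: int             # length of the range in bytes
    userspace_addr: int   # where that range is mapped in the hypervisor process


def translate(table: List[Region], guest_addr: int, length: int = 1) -> Optional[int]:
    """Map a guest address to a userspace address.

    Returns None when the `length`-byte access is not fully covered
    by a single region of the table.
    """
    for region in table:
        offset = guest_addr - region.guest_phys_addr
        if offset >= 0 and offset + length <= region.size:
            return region.userspace_addr + offset
    return None


# Hypothetical layout: two regions with a hole between them.
table = [
    Region(0x0000_0000, 0x000A_0000, 0x7F00_0000_0000),
    Region(0x0010_0000, 0x4000_0000, 0x7F00_0010_0000),
]
```

With a table like this, an address that falls in the hole between the two regions yields `None`, which is the case a single fixed offset could not express.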

If I understand everything correctly, the rings are all userspace
addresses, which means that they can be moved around in physical memory,
and get pushed out to swap.

Unless they are locked, yes.

AFAIK, this is impossible to handle when
connecting two physical systems, you'd need the rings available in IO
memory (PCI memory), so you can ioreadXX() them instead. To the best of
my knowledge, I shouldn't be using copy_to_user() on an __iomem address.
Also, having them migrate around in memory would be a bad thing.

Also, I'm having trouble figuring out how the packet contents are
actually copied from one system to the other. Could you point this out
for me?

The code in net/packet/af_packet.c does it when vhost calls sendmsg.


Ok. The sendmsg() implementation uses memcpy_fromiovec(). Is it possible
to make this use a DMA engine instead?

Maybe.

I know this was suggested in an earlier thread.

Yes, it might even give some performance benefit with e.g. I/OAT.

Is there somewhere I can find the userspace code (kvm, qemu, lguest,
etc.) needed for interacting with the vhost misc device so I can
get a better idea of how userspace is supposed to work?

Look in the archives for kvm@xxxxxxxxxxxxxxx. The subject is qemu-kvm: vhost net.

(Features
negotiation, etc.)


That's not yet implemented as there are no features yet. I'm working on
tap support, which will add a feature bit. Overall, qemu does an ioctl
to query supported features, and then acks them with another ioctl. I'm
also trying to avoid duplicating functionality available elsewhere. So
to check e.g. TSO support, you'd just look at the underlying
hardware device you are binding to.
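The query-then-ack flow described above can be modelled as a bitmask intersection. This is only a toy sketch: the `Backend` class, the `negotiate` helper and the feature bit values are hypothetical stand-ins, not the ioctl interface or the bit assignments of the actual patch.

```python
# Hypothetical feature bits, for illustration only (not the patch's values).
FEATURE_TAP = 1 << 0
FEATURE_TSO = 1 << 1
FEATURE_MRG_RXBUF = 1 << 2


class Backend:
    """Toy stand-in for the kernel side of the two-step negotiation."""

    def __init__(self, supported: int) -> None:
        self.supported = supported
        self.acked = 0

    def get_features(self) -> int:
        # Step 1: userspace queries what the backend supports.
        return self.supported

    def ack_features(self, features: int) -> None:
        # Step 2: userspace acks a subset; anything not offered is rejected.
        if features & ~self.supported:
            raise ValueError("acked a feature the backend did not offer")
        self.acked = features


def negotiate(backend: Backend, wanted: int) -> int:
    """Ack the intersection of what userspace wants and what is offered."""
    agreed = wanted & backend.get_features()
    backend.ack_features(agreed)
    return agreed


backend = Backend(supported=FEATURE_TAP | FEATURE_MRG_RXBUF)
agreed = negotiate(backend, wanted=FEATURE_TSO | FEATURE_MRG_RXBUF)
```

Here userspace asks for two features, the backend offers only one of them, and only that one ends up acked.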


Ok. Do you have plans to support the VIRTIO_NET_F_MRG_RXBUF feature in
the future? I found that this made an enormous improvement in throughput
on my virtio-net <-> virtio-net system. Perhaps it isn't needed with
vhost-net.

Yes, I'm working on it.

Thanks for replying,
Ira
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html