Re: [PATCH 0/3] have pooled sunrpc services make more intelligent allocations




On Tue, 2008-06-03 at 13:42 -0400, Jeff Layton wrote:
On Tue, 03 Jun 2008 11:53:42 -0500
Tom Tucker <tom@xxxxxxxxxxxxxxxxxxxxx> wrote:

Jeff:

This brings up an interesting issue with the RDMA transport and
RDMA_READ. RDMA_READ is submitted as part of fetching an RPC from the
client (e.g. NFS_WRITE). The xpo_recvfrom function doesn't block waiting
for the RDMA_READ to complete, but rather queues the RPC for subsequent
processing when the I/O completes and returns 0.
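
The non-blocking receive described above can be sketched in plain user-space C. This is only a toy model under stated assumptions: toy_xprt, toy_req, toy_recvfrom and toy_read_done are hypothetical names, not the svcrdma structures or the real xpo_recvfrom signature. It shows the receive call returning 0 while the read is outstanding and handing back the data on a later call, possibly from a different CPU than the one that posted the read.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the kernel's svcrdma types. */
enum toy_state { TOY_IDLE, TOY_READ_PENDING, TOY_READY };

struct toy_req {
    enum toy_state state;
    int submit_cpu;  /* CPU that posted the RDMA_READ */
    size_t len;      /* payload length, valid once the read completes */
};

struct toy_xprt {
    struct toy_req req;
};

/*
 * Post the read and return at once. A return of 0 means "no RPC to
 * process yet"; a later call returns the length after completion.
 */
static int toy_recvfrom(struct toy_xprt *x, int cpu, size_t want)
{
    assert(x != NULL);

    if (x->req.state == TOY_READY) {
        /* Read finished earlier: hand the RPC to whichever CPU asks. */
        x->req.state = TOY_IDLE;
        return (int)x->req.len;
    }
    if (x->req.state == TOY_IDLE) {
        /* Start the read and remember where it was submitted. */
        x->req.state = TOY_READ_PENDING;
        x->req.submit_cpu = cpu;
        x->req.len = want;
    }
    return 0; /* still waiting on the I/O */
}

/* I/O completion: mark the request ready so the transport can be requeued. */
static void toy_read_done(struct toy_xprt *x)
{
    assert(x != NULL);

    if (x->req.state == TOY_READ_PENDING)
        x->req.state = TOY_READY;
}
```

Note that nothing in this model ties the second toy_recvfrom call to submit_cpu, which is the affinity gap raised in the next paragraph.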

I can use these new services to allocate CPU local pages for this I/O.
So far, so good. However, when the I/O completes, and the transport is
rescheduled for subsequent RPC completion processing, the pool/CPU that
is elected doesn't have any affinity for the CPU on which the I/O was
initially submitted. I think this means that the svc_process/reply steps
may occur on a CPU far away from the memory in which the data resides.

Am I making sense here? If so, any thoughts on what could/should be
done?

Thanks,
Tom


I confess I didn't think hard about the RDMA case here (and haven't
been paying as much attention as I probably should to the design of
it). So take my thoughts with a large chunk of salt...

On a NUMA box, the pages have to live _somewhere_ and some CPUs will be
closer to them than others. If we're concerned about making sure that
the post-RDMA_READ processing is done on a CPU close to the memory,
then we don't have much choice but to try to make sure that this
processing is only done on CPUs that are close to that memory.

Assuming that this post-processing is done by nfsd, I suppose we'd need
to tag the post-RDMA_READ RPC with a poolid or something and make sure
that only nfsds running on CPUs close to the memory pick it up. Perhaps
there could be a per-pool queue for these RPCs or something...
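
A minimal sketch of the per-pool queue idea, assuming one pool per NUMA node. Everything here (TOY_NPOOLS, toy_rpc, toy_enqueue, toy_dequeue) is a hypothetical user-space illustration, not existing sunrpc code, and it leaves out the locking a real implementation would need.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NPOOLS 4 /* assumption: one pool per NUMA node */

/* A deferred RPC tagged with the pool whose memory holds its data. */
struct toy_rpc {
    int poolid;
    struct toy_rpc *next;
};

/* One FIFO per pool, so ordering is kept within a pool. */
struct toy_pool_queue {
    struct toy_rpc *head;
    struct toy_rpc *tail;
};

static struct toy_pool_queue toy_queues[TOY_NPOOLS];

/* Called when the RDMA_READ completes: queue on the tagged pool. */
static void toy_enqueue(struct toy_rpc *rpc)
{
    struct toy_pool_queue *q;

    assert(rpc != NULL);
    assert(rpc->poolid >= 0 && rpc->poolid < TOY_NPOOLS);

    q = &toy_queues[rpc->poolid];
    rpc->next = NULL;
    if (q->tail)
        q->tail->next = rpc;
    else
        q->head = rpc;
    q->tail = rpc;
}

/* Called by a worker thread: it only sees RPCs tagged for its own pool. */
static struct toy_rpc *toy_dequeue(int poolid)
{
    struct toy_pool_queue *q;
    struct toy_rpc *rpc;

    assert(poolid >= 0 && poolid < TOY_NPOOLS);

    q = &toy_queues[poolid];
    rpc = q->head;
    if (!rpc)
        return NULL;
    q->head = rpc->next;
    if (!q->head)
        q->tail = NULL;
    rpc->next = NULL;
    return rpc;
}
```

A worker in pool 0 calling toy_dequeue(0) gets NULL even while pool 1 has work queued, so in this sketch an idle thread never picks up a remote RPC.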

Either way, the big question is whether that will be a net win or loss
for throughput, i.e. are we better off waiting for the right nfsd to
become available or allowing the first nfsd that becomes available to
make the crosscalls needed to do the RPC? It's hard to say...

Not only that, but it would lead to more disorder in the RPC processing
which might kill write-behind.


In the near term, I doubt this patchset will harm the RDMA case.

Agreed.

After
all, the distribution of memory allocations is pretty lumpy now. On
a NUMA box with RDMA you're probably doing a lot of crosscalls with
the current code.

Probably no worse than the sockets transport, since the skbufs aren't
necessarily allocated on the CPU calling svc_recv.







