Re: [RFC/PATCH] Optimize zone allocator synchronization
- From: Nick Piggin <nickpiggin@xxxxxxxxxxxx>
- Date: Wed, 7 Nov 2007 16:31:59 +1100
On Wednesday 07 November 2007 17:19, Andrew Morton wrote:
On Tue, 06 Nov 2007 05:08:07 -0500 Chris Snook <csnook@xxxxxxxxxx> wrote:
Don Porter wrote:
From: Donald E. Porter <porterde@xxxxxxxxxxxxx>
In the bulk page allocation/free routines in mm/page_alloc.c, the zone
lock is held across all iterations. For certain parallel workloads, I
have found that releasing and reacquiring the lock for each iteration
yields better performance, especially at higher CPU counts. For
instance, kernel compilation is sped up by 5% on an 8 CPU test
machine. In most cases, there is no significant effect on performance
(although the effect tends to be slightly positive). This seems quite
reasonable for the very small scope of the change.
My intuition is that this patch prevents smaller requests from waiting
on larger ones. While grabbing and releasing the lock within the loop
adds a few instructions, it can lower the latency for a particular
thread's allocation which is often on the thread's critical path.
Lowering the average latency for allocation can increase system
throughput.
More detailed information, including data from the tests I ran to
validate this change are available at
http://www.cs.utexas.edu/~porterde/kernel-patch.html .
Thanks in advance for your consideration and feedback.
I did see this initial post, and didn't quite know what to make of it.
I'll admit it is slightly unexpected :) Always good to research ideas
against common convention, though.
I don't know whether your reasoning is correct though: unless there is
a significant number of higher order allocations (which there should
not be, AFAIKS), all allocators will go through the per CPU lists which
batch the same number of objects on and off, so there is no such thing
as smaller or larger requests.
And there are a number of regressions as well in your tests. It would be
nice to get some more detailed profile numbers (preferably with an
upstream kernel) to try to work out what is going faster.
It's funny, Dave Miller and I were just talking about the possible
reappearance of zone->lock contention with massively multi core and
multi threaded CPUs. I think the right way to fix this in the long run
if it turns into a real problem, is something like having a lock per
MAX_ORDER block, and having CPUs prefer to allocate from different
blocks. Anti-frag makes this pretty interesting to implement, but it
will be possible.
That's an interesting insight. My intuition is that Nick Piggin's
recently-posted ticket spinlocks patches[1] will reduce the need for this
patch, though it may be useful to have both. Can you benchmark again
with only ticket spinlocks, and with ticket spinlocks + this patch?
You'll probably want to use 2.6.24-rc1 as your baseline, due to the x86
architecture merge.
The patch as-is would hurt low cpu-count workloads, and single-threaded
workloads: it is simply taking that lock a lot more times. This will be
particuarly noticable on things like older P4 machines which have
peculiarly expensive locked operations.
It's not even restricted to P4s -- another big cost is going to be the
cacheline pingpong. Actually it might be worth trying another test run
with zone->lock put into its own cacheline (as it stands, when the lock
gets contended, spinners will just sit there pushing useful fields out
of the holder's memory -- ticket locks will do better here, but they
still write to the lock once, then sit there loading it).
A test to run would be, on ext2:
time (dd if=/dev/zero of=foo bs=16k count=2048 ; rm foo)
(might need to increase /proc/sys/vm/dirty* to avoid any writeback)
I wonder if we can do something like:
if (lock_is_contended(lock)) {
spin_unlock(lock);
spin_lock(lock); /* To the back of the queue */
}
(in conjunction with the ticket locks) so that we only do the expensive
buslocked operation when we actually have a need to do so.
(The above should be wrapped in some new spinlock interface function which
is probably a no-op on architectures which cannot implement it usefully)
We have the need_lockbreak stuff. Of course, that's often pretty useless
with regular spinlocks (when you consider that my tests show that a single
CPU can be allowed to retake the same lock several million times in a row
despite contention)...
Anyway, yeah we could do that. But I think we do actually want to batch
up allocations on a given CPU in the multithreaded case as well, rather
than interleave them. There are some benefits avoiding cacheline bouncing.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
- Follow-Ups:
- Re: [RFC/PATCH] Optimize zone allocator synchronization
- From: Don Porter
- Re: [RFC/PATCH] Optimize zone allocator synchronization
- References:
- [RFC/PATCH] Optimize zone allocator synchronization
- From: Don Porter
- Re: [RFC/PATCH] Optimize zone allocator synchronization
- From: Chris Snook
- Re: [RFC/PATCH] Optimize zone allocator synchronization
- From: Andrew Morton
- [RFC/PATCH] Optimize zone allocator synchronization
- Prev by Date: [PATCH -libata] nv_hardreset: update dangling reference to bugzilla entry
- Next by Date: Re: [2.6 patch] always export sysctl_{r,w}mem_max
- Previous by thread: Re: [RFC/PATCH] Optimize zone allocator synchronization
- Next by thread: Re: [RFC/PATCH] Optimize zone allocator synchronization
- Index(es):
Relevant Pages
|