Re: [PATCH 0/3] 64-bit futexes: Intro



On Mon, Jun 02, 2008 at 01:22:57PM -0700, Linus Torvalds wrote:

[ I added Nick and DavidH to the Cc, since they have at least historically
shown interest in locking algorithms ]

On Mon, 2 Jun 2008, Ingo Molnar wrote:

i suspect _any_ abstract locking functionality around a data structure
can be implemented via atomic control over just a single user-space bit.

Well, theory is theory, and practice is different.

That's *especially* true when it comes to locking, which is just so
tightly coupled to the exact details of which atomics are fast on a
particular architecture.

Also, different people will want to see different performance profiles.

For example, it makes a *huge* difference whether you are strictly fair or
not. I can almost guarantee that a 100% fair implementation may be really
"nice" from a theoretical standpoint, but suck really badly in practice,
because if you want best performance, then you want the lock to have a
very strong CPU affinity - and the faster you can do your lock ops, the
more of a unfairness and CPU affinity they get!

And rwlocks in particular are actually much more interesting in this
respect, because they not only have that CPU affinity fairness, but also
the reader-vs-writer fairness. You optimize for one load, and it may give
you almost perfect performance, but at the cost of sucking at another
load.

For example, some loads are almost entirely read-read locks, with only
very occasional write locks for some uncommon config change thing. Do you
want to optimize for that? Maybe. And yes, you can make those uncontended
read-read locks go really quickly, but then (especially if you continue to
let reads through even when writers want to contend), that can slow down
writers a *lot*, to the point of starvation.

Different CPU's will also show different patterns.

Anyway, I was busy most of the weekend, but I've now coded up a partial
actual example implementation. Its' probably buggy. Uli is very correct in
saying that it's easy to screw up futex'es, but even in the absense of
futexes, it's just easy to screw up any threaded logic.

But if anybody wants to look at a rough draft that at least limps along
_partially_, there's

http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git;a=summary

for you.

And I'll freely admit that
(a) the above thing is pretty hacked up. No guarantees as to correctness
or anything else.
(b) I looked _mainly_ at the "all readers" case, and considered that the
primary goal. Which explains why it does really well on that
particular load.
(c) However, I refuse to starve writers. In fact, my thing will always
prefer a writer to any new readers, on the assumption that the sanest
rwlock situation is the "tons of readers, occasional writer".
(d) it does ok for me on the write-write contention case too, but the
random mixed "all threads do 5% writelocks" test-case seems to suck.

As an exmaple: the current glibc code I have seems to be uniformly and
boringly middle-of-the-road. It really doesn't seem to be horrible at all.
That said, the above thing _is_ 2-3x faster for me on that read-read case
(from single thread up to 20 threads - but just tested on one particular
machine), so if that is what you want to aim for, it's certainly easy to
beat.

But for the write case, while I can easily beat it for a low-thread count
with little contention, my example thing above benchmarks about equal for
bigger thread counts (caveat: again - I've only tested on one machine,
which is a single-socket dual-core thing. Caveat emptor).

And the pthreads implementation actually beats my hacked-up one at least
for the 5% random writelocks case. It's not beating it by a huge amount,
but it's not in the noise level either (maybe 15%). And I bet the pthreads
implementation is a hell of a lot better tested ;)

Had a bit of a look though this, seems pretty OK to me. Obviously with
rwlocks it *is* impossible to do non-atomic unlocks, so I can't see
a way to significantly improve your code there.

What you *could* maybe do, to slightly speed up the reader fastpath, at
the expense of the writer fastpath, is to also have the active writer add
4 to the count too, so your unlock can start with a lock xadd -4, count
in order to get the write-intent on the cacheline straight up.

Though that's a pretty tiny "optmisation", and not going to be an amazing
win, even when it does save a state transition on the cacheline...

I'd be more interested to know why this code can't be evolved into a full
rwlock implementation? This is a rather standard (though neat) looking rwlock
-- so my question is what can the patented 64-bit futex locks do that this
can't, or what can they do faster?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • [RFC PATCH 2/5] Priority Sifting Reader-Writer Lock Documentation
    ... +The main design goal is to lessen the rwlock impact on the irq and softirq ... +The writer fast path uses a single bit to indicate that a fast path writer is ... +other writer nor reader in their critical section. ... +The writer slow path first sets the UC_SLOW_WRITER bit and increments the WS ...
    (Linux-Kernel)
  • Re: [PATCH] netfilter: use per-cpu spinlock and RCU (v5)
    ... update but still removes the expensive rwlock in earlier versions. ... Locking is done per-cpu, the fast path locks on the current cpu ... to avoid OOM for their setups. ...
    (Linux-Kernel)
  • Re: kernel performance issue - spins on readers/writer locks
    ... Use lockstat(1M) to gather statistics on kernel locks, ... experiencing slowdown during this time). ... They allow multiple readers or one writer to a resource. ...
    (comp.unix.solaris)
  • Re: low-end persistence strategies?
    ... The writer makes the locks, deletes the oldest data file and renames ... Rename the temporary file takes advantage of the fact that a rename ...
    (comp.lang.python)
  • Re: [PATCH 0/3] 64-bit futexes: Intro
    ... What you *could* maybe do, to slightly speed up the reader fastpath, at ... the expense of the writer fastpath, is to also have the active writer add ... In fact, in the second version of my locks, I didn't do futex operations ... consider things like timeouts etc. Timeouts are "hard" to handle because ...
    (Linux-Kernel)