Re: bizarre timing results

From: Gav (gav_at_cs.york.ac.uk)
Date: 04/04/05


Date: Mon, 04 Apr 2005 20:46:30 +0100

right. i have some definative answers to my questions now, and for those who
are interested, i'll share them here.

PART 1 - the variants

in order to figure out what was going on in my project's data transfer
mechanism, i made several variants of a program that dopes the same amount
of "real work".

the real work is nothing more than reading from and writing to a portion of
a buffer in memory. throughout all variants the buffer is the same size and
programatically the same object, so there can be no bias there.

each variant is given a letter which can be provided to the test app
(speed.cpp) in order to execute the app with that variant of the solution.

variant 1: "c" coupled

this is the original version, as used by my project currently. it uses
condition variables in order to synchronise the reading and writing, making
sure that only data which has previously been written is read. this tightly
couples the two threads, and are thus rendered symbiotic - imagine two
runners tied at the torso. the length of the rope would be analogous to the
size of the buffer.

variant 2: "u" uncoupled

this is another version originally put forward. the two threads are no
longer symbiotic - they are independent, though because they still utilise
the same buffer for their operations the actual "end product" (i.e. what is
read) is undefined and useless. for a better comparison the mutexes are
kept in, but are essentially useless since they do not protect the only
shared data, the buffer. we don't protect the buffer since the coupled
variant doesn't; the coupled variant doesn't have to of course, since the
state of the coupling makes sure that the same portion of the buffer isn't
being dually accessed anyway.

in our anallogy the two runners' rope has been cut, but they still run
simultaneously.

variant 3: "s" sequential

this variant functions as the control of the experiment and works by simply
running the writer function and then the reader function in turn. no
concurrency happens and no threads are used. for a better comparison the
mutexes are kept in, but are, of course, useless.

in our analogy the two runners run in relay. they do not run at the same
time.

variant 4: "f" fast_coupled

this variant delivers the same end-semantics as our coupled variant. but
whereas coupled signals immediately upon a quantum of work being completed,
fast_coupled only signals when it can no longer do any more work, thereby
minimising calls to pthread_cond_[wait|signal].

however this is (sadly) not a great solution, since it will always utilise
all of the buffer before calling pthread_cond_wait. if this were the only
problem it wouldn't be too bad, because it waits until it has utilised all
of the buffer before calling pthread_cond_signal also, all other threads
attempting to use the buffer will be stuck in their own pthread_cond_wait
loops, even if the os switches context to them legitimately.

in our analogy, one runner runs as far as he can away from the other, then
once the rope is taught, he signals the other runner to run, and vice
versa.

variant 5: "v" very fast coupled

this variant has the same end-semantics as before, however it completely
disperses with use of the condition variables.

instead yield() is used once no more data can be processed (i.e. the buffer
is utilised to its fullest). this means that we have no guarantee that the
while() loop will not be called more often than it should be (whereas with
wait conditions we can be reasonably sure that we will only be woken when
the sate has changed such that we can continue with execution).

however with this variant we allow the os to preempt us and switch context
such that the other worker thread may utilise the buffer since no
conditions are used.

it also means that the buffer isn't always used to its fullest, a good
thing, since the buffer size could be large for other reasons and adversely
affect the latency between reading and writing corresponding quanta.

in the analogy, both runners run as far and as fast as they can, but if one
gets so far away from the other such that the rope is taught, he stops a
while.

PART 2 - the experiments

i did three experiments to check the performance of each implementation. as
the experiment machine is a uniprocessor core, and as there is no external
i/o necessary we expect the threadless to be the fastest implementation
throughout.

the first experiment was to test a fairly average-case scenario. the buffer
is not oversize, but plentiful, and enough cycles of data were done in
order to force a considerable amount of context switches in the coupled
variants.

the second experiment was to test best-case performance. the buffer was
extended and cycles reduced until only one context switch minimum was
required for the data to be processed fully by the coupled threads. we
expect a good coupled implementation to perform around as well as the basic
uncoupled variant. in our previous analogy the rope was lengthened to the
length of the entire run.

the third experiment was to test worst-case perforance. the cycles were
increased somewhat, and the buffer reduced so as to force maximum coupling
between the threads, and thus a very high minimum of context switches
neccessary. in our previous analogy the rope was shortened to a bare
minimum of a couple of steps.

all were carried out on a pentium-m based system with 512m ram and running
at a speed of 1700mhz.

the a vanilla linux 2.6.11.4 kernel was used, glibc 2.3.4 (with nptl) and
gcc 3.4.3-20050110.

PART 3 - the results

here's a quick reference of the implementations:

                 coupled uncoupled sequential f-coupled v-f-coupled
cond. vars yes no no yes no
threads yes yes no yes yes
cycles/signal 1 n/a n/a * n/a

(*) as many as possible, before it is forced to wait.

Experiment A:

20000000 cycles were done, each cycle reading/writing 16 floats from a
buffer 320000 floats large.

all coupled variants would have to context switch _at_least_ 63 times.

        coupled uncoupled sequential f-coupled v-f-coupled
real 2m29.004s 0m12.529s 0m12.597s 0m12.434s 0m12.617s
user 0m33.940s 0m12.034s 0m12.053s 0m11.766s 0m12.096s
sys 1m50.002s 0m0.012s 0m0.013s 0m0.040s 0m0.031s

Experiment B:

200000 cycles were done, each cycle reading/writing 16 floats from a buffer
3200000 floats large.

all coupled variants would have to context switch at least only once.

        coupled uncoupled sequential f-coupled v-f-coupled
real 0m1.448s 0m0.131s 0m0.132s 0m0.129s 0m0.132s
user 0m0.304s 0m0.112s 0m0.109s 0m0.109s 0m0.107s
sys 0m1.084s 0m0.014s 0m0.018s 0m0.017s 0m0.019s

Experiment C:

2000000 cycles were done, each cycle reading/writing 16 floats from a buffer
32 floats large.

all coupled variants would have to context switch _at_least_ 1000000 times.

        coupled uncoupled sequential f-coupled v-f-coupled
real 0m14.327s 0m1.072s 0m1.069s 0m10.705s 0m2.793s
user 0m3.237s 0m1.030s 0m1.038s 0m2.694s 0m1.603s
sys 0m10.619s 0m0.003s 0m0.001s 0m7.630s 0m1.095s

PART 4 - the conclusion

experiments a and b simply show us that using pthreads as they are typically
suggested to be used in the (perhaps naive) "coupled" variant renders the
worst performance generally. all the other variants provide roughtly the
same performance. experiment c is far more telling, and really splits the
wheat from the chaff:

in the current implementation of condition variables the pthread_cond_signal
call is not to be called lightly. it apparently forces a context switch
immediately, since comparing f-coupled and coupled in experiment c shows
that even if we force a context switch after two cycles the implementation
that forces utilisation of both cycles before calling pthread_cond_signal
performs far better than the implementation which calls pthread_cond_signal
after one cycle, *even though* pthread_cond_wait is called just as often.

typically one might expect that only pthread_cond_wait would force a context
switch, whereas pthread_cond_signal one might expect would allow the
current context to continue.

the current implementation might therefore be considered broken, since it
would be the thread holding the current context rather than the kernel
scheduler that will determine when some other symbiotic thread might then
go on. in our implementation "f-coupled", we allow the other thread to
continue (i.e. we pthread_cond_signal()) only when we have achieved our
maximum amount to be precessed. any preemption before then is futile, since
the other thread will still be pthread_cond_wait()ing on us.

pthread_cond_signal() is also apparently very slow anyway. comparing our
v-f-coupled variant to f-coupled in experiment c, we see even though they
both do exactly the same processing, and (due to the triviality of that
processing and the small size of the buffer) likely have roughly the same
number of context switches, the variant which utilises sched_yield() rather
than condition variables to force switching between the threads is far
faster.

this shouldn't be: one of the great things about using (multiple) condition
variables is that threads which are now probably executable can be
precisely tracked and switched-to without having to wake potentially dead
threads. yield() is a far more shotgun-like approach, and doesn't let the
program provide any semantics about which threads should be woken, yet it
far outperforms the pthreads implementation.

in our implementation that uses condition variables, f-coupled, we use two
condition variables; because of this the pthread implementation (and
thereby the scheduler) should be able to deduce which of the two threads
should be switched to next without having to do any costly idling or
switching outside the process. if a thread signals "lessused" and then some
time later waits on "moreused", we are actually doing some of the
schedulers job for it: we are actually saying dont ever run me again until
some other thread signals "moreused", and by the way i think that the other
thread, the one waiting on "lessused" might well be runnable now. it should
certainly not perform as badly as yield(), which forces an arbitrary switch
potentially outside the process.

the code is included as an attachment - it can be compiled with:

g++ speed.cpp -o speed -lpthread

comments welcome.

cheers,

gav



Relevant Pages

  • Problems with IWbemPathKeyList::GetKey
    ... inconsistency in the documentation for this function with respect to ... buffer pointers to get the buffer sizes). ... It requires a VARIANT for this argument. ... ULONG uKeyIx, ...
    (microsoft.public.win32.programmer.wmi)
  • Problems with IWbemPathKeyList::GetKey
    ... inconsistency in the documentation for this function with respect to ... buffer pointers to get the buffer sizes). ... It requires a VARIANT for this argument. ... ULONG uKeyIx, ...
    (microsoft.public.windowsxp.wmi)
  • Performance of the Streams Read and Write
    ... Variant 1 of a buffered I/O: ... just some garbage in the buffer ... for i in First..Last loop ...
    (comp.lang.ada)
  • Re: user arrays and VARIANT *
    ... VARIANT is explicitly documented in the MSDN; just type VARIANT to the index, ... Look up the documentation of the Get method. ... there appears to be no way to avoid a buffer overrun from what I see in the ... two dimensions array. ...
    (microsoft.public.vc.mfc)
  • [PATCH] perf_counter: dont count scheduler ticks as context switches
    ... calls perf_swcounter_event to record a context switch event. ... variant doesn't record a context switch event, and takes a struct ... This adds the new variant rather than ...
    (Linux-Kernel)