Re: Next patches for the 2.6.25 queue



On Fri, Dec 14, 2007 at 11:43:35AM -0500, Mathieu Desnoyers wrote:
* Adrian Bunk (bunk@xxxxxxxxxx) wrote:
On Thu, Dec 13, 2007 at 09:46:42AM -0500, Mathieu Desnoyers wrote:
Hi Andrew,

I would like to post my next patches in a way that would make it as
easy for you and the community to review them. Currently, the patches
that have really settled down are :

* For 2.6.25
...
- Immediate Values
- Redux version, asked by Rusty
...

I might have missed it:

Are there any real numbers (opposed to estimates and microbenchmarks)
available how much performance we actually gain in which situations?

It might be some workload with markers using Immediate Values or
something like that, but it should be something where the kernel
runs measurably faster with Immediate Values than without.

Currently I'm somewhere between "your Immediate Values are just an
academic code obfuscation without any gain in practice" and "janitors
should convert all drivers to use Immediate Values", and I'd like to
form an opinion based on in which situations the kernel runs faster by
how many percent.

That's also based on observation like e.g. that __read_mostly should
improve the performance, but I've already seen situations in the kernel
where it forced gcc to emit code that was obviously both bigger and
slower than without the __read_mostly [1], and that's part of why I'm
sceptical of all optimizations below the C level unless proven
otherwise.


Hi Adrian,

Hi Mathieu,

Yes, I had numbers that were presented in the patch headers, but I
re-ran some tests to have a clearer picture. Actually, what makes this
difficult to benchmark is the measurement error caused by the system's
"background noise" (interrupts, softirqs, kernel threads...). Note that
we are measuring cache effects and, therefore, any program which does
the same operation many times in a loop will benefit from space and time
locality and won't trigger many cache misses after the first loop.

So, here is what I have done to get a significant difference between the
with and without immediate values :

I ran, in userspace, a program that does random memory access
(3 times, in a 10MB array) between each getppid() syscall, everything
wrapped in a loop, repeated 1000 times (enough so the results are
reproduceable between runs). Tests were done on a 3GHz Pentium 4 with
2GB of ram with Linux 2.6.24-rc5.

gcc version?

I instrumented getppid() with 40 markers, so the impact of memory reads
won't be burried in the "background noise". Since each markers is using
a 24 bytes structure (8 bytes aligned), and are next to each other in
memory, we will cause (depending on the alignment of structures in the
cache lines) :

L1 cache lines : 64 bytes
L2 cache lines : 128 bytes

8-9 memory reads (L2 cache misses)
15-16 L2 accesses (L1 cache misses)

for each getppid() syscall.

The result is as expected :

Number of cycles for getppid

* Without memory pressure : 1470 cycles
* With memory pressure (std. dev. calculated on 3 groups of 1000 loops on
compiled out case : 416.54 cycles)
* 40 markers without immediate values : 14938 cycles
* 40 markers with immediate values : 12795 cycles
* Markers compiled out : 12427 cycles

for a 14% speedup reached by using immediate values of data reads.
There seems to be no significant difference between compiling out the
markers and using immediate values to disable them.

OK, it's good to see that there are situations where we can measure the
benefits.

Note that since the markers are located in the same cache lines, those
40 markers are the equivalent to have about 8 markers _not_ on the same
cache lines (in real life, that's very likely to be the case).

So, the conditions to have a speedup here :

- A significant amount of cache lines must be saved.
- They must be read from memory often.

So, we will likely see a real-life impact in situations such as :
instrumenting spinlocks; whenever they would be taken/released many
times in a system call made by an application doing random memory access
(a hash-based search engine would be a good example, a database would
also be a suitable workload), we should be able to measure the impact.
However, this is hard to reproduce/measure, so this is why I created a
synthetic workload simulating this behavior.

So I would really suggest using the immediate values for applications
such as :
- code markup (the markers)
- dynamically enable what would have otherwise been selected in
menuconfig (such as profiling, scheduler/timer statistics for
powertop...)

where the goal is to have _zero_ measurable impact on performance on any
workload.

Are the effects of your immediate values on gcc optimizations really
well enough researched for being able to do this claim of
"zero measurable impact on performance on any workload"?

It should be quite easy to produce an artifical example where it has a
huge influence on performance, and proving that this will never ever
happen in practice sounds like a hard task.

Mathieu

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • Re: Next patches for the 2.6.25 queue
    ... It might be some workload with markers using Immediate Values or ... locality and won't trigger many cache misses after the first loop. ... I instrumented getppidwith 40 markers, so the impact of memory reads ...
    (Linux-Kernel)
  • Re: [patch 2/8] Immediate Values - Architecture Independent Code
    ... which use a load immediate to remove a data cache hit. ... Immediate values refer to the memory address of a previously declared integer. ... Random walk L1 and L2 trashing surrounding a getppid() call: ...
    (Linux-Kernel)
  • Re: [patch 10/10] Scheduler profiling - Use immediate values
    ... We could characterize this in memory to L1 cache transfers per ... bandwidth required by the kernel will go down. ... regarding whether we want static markers at all in the kernel. ...
    (Linux-Kernel)
  • Re: free - The Memory is still not returned back to OS
    ... If the 'free' does not return the memory back to the OS, ... It is not uncommon for there to be a "cache" of heap ... the immediate return of the freed memory to the OS and others allow ... Both are perfectly valid techniques, but are appropriate for different sorts of systems. ...
    (comp.arch.embedded)
  • [patch 7/8] Immediate Values - Documentation
    ... +This document introduces Immediate Values and their use. ... +instruction stream based test: +7.8394 cycles ... If we put 2 markers for irq entry/exit, ... +load based test on a system with high memory pressure. ...
    (Linux-Kernel)