Re: [patch 2/8] Immediate Values - Architecture Independent Code



On Sun, Aug 12, 2007 at 11:07:04AM -0400, Mathieu Desnoyers wrote:
Immediate values are used as read mostly variables that are rarely updated. They
use code patching to modify the values inscribed in the instruction stream. It
provides a way to save precious cache lines that would otherwise have to be used
by these variables.

There is a generic _immediate_read() version, which uses standard global
variables, and optimized per architecture immediate_read() implementations,
which use a load immediate to remove a data cache hit. When the immediate values
functionnality is disabled in the kernel, it falls back to global variables.

It adds a new rodata section "__immediate" to place the pointers to the enable
value. Immediate values activation functions sits in kernel/immediate.c.

Immediate values refer to the memory address of a previously declared integer.
This integer holds the information about the state of the immediate values
associated, and must be accessed through the API found in linux/immediate.h.

At module load time, each immediate value is checked to see if it must be
enabled. It would be the case if the variable they refer to is exported from
another module and already enabled.

In the early stages of start_kernel(), the immediate values are updated to
reflect the state of the variable they refer to.

* Why should this be merged *

It improves performances on heavy memory I/O workloads.

An interesting result shows the potential this infrastructure has by
showing the slowdown a simple system call such as getppid() suffers when it is
used under heavy user-space cache trashing:

Random walk L1 and L2 trashing surrounding a getppid() call:
(note: in this test, do_syscal_trace was taken at each system call, see
Documentation/immediate.txt in these patches for details)
- No memory pressure : getppid() takes 1573 cycles
- With memory pressure : getppid() takes 15589 cycles

We therefore have a slowdown of 10 times just to get the kernel variables from
memory. Another test on the same architecture (Intel P4) measured the memory
latency to be 559 cycles. Therefore, each cache line removed from the hot path
would improve the syscall time of 3.5% in these conditions.

I still think this is bad idea:
1) already existing CPU erratas against popular models. As I learned
yesterday -- SILENT reboot (often) or hang (rare).
2) new type pretending to be standard and idiomatic -- immediate_t
3) new almost-C-keyword -- immediate_if. What does it mean? You'll never
guess just by looking at it.
4) numbers! it's very entertaining to look at numbers with 4 digits
after comma and without any attempts to do error estimates.
5) examples!

DEFINE_IMMEDIATE_TYPE(struct task_struct*, immediate_task_struct_ptr_t);
immediate_task_struct_ptr_t myptr;

this is close to hungarian notation, and threats were made to use
immediate stuff everywhere.
6) hundreds lines of new code playing with dynamic modifying


For what you ask?


For 1 (one) cacheline saving in schedule() which
unlike getpid/getppid is non-trivial function, so all rosy numbers about
speed savings aren't really applicable. And nobody measured how big it is
on macro benchmark.


So, this stuff should be put into CoolHacks case and put aside.
It just not worth it!




A side note, folks at KS can discuss barrier for merging speed improvements.
My feelings are around 1%.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • Re: Next patches for the 2.6.25 queue
    ... It might be some workload with markers using Immediate Values or ... locality and won't trigger many cache misses after the first loop. ... I instrumented getppidwith 40 markers, so the impact of memory reads ...
    (Linux-Kernel)
  • Re: Next patches for the 2.6.25 queue
    ... It might be some workload with markers using Immediate Values or ... locality and won't trigger many cache misses after the first loop. ... I ran, in userspace, a program that does random memory access ... Since each markers is using ...
    (Linux-Kernel)
  • Re: free - The Memory is still not returned back to OS
    ... If the 'free' does not return the memory back to the OS, ... It is not uncommon for there to be a "cache" of heap ... the immediate return of the freed memory to the OS and others allow ... Both are perfectly valid techniques, but are appropriate for different sorts of systems. ...
    (comp.arch.embedded)
  • Re: Cached memory never gets released
    ... Stock linux 2.4.26 kernel. ... Due to flash bug 3M of memory gets lost due to font memory getting lost ... The output of "free" cache number steadily grows. ... longer to exhaust all of system memory with the cache. ...
    (Linux-Kernel)
  • Re: Problem: Creating a raw binary string
    ... > While its true that a 64-bit cpu will move twice the data per instruction it ... > Memory bus width plays an important role here and unless it too is widened / ... You are forgetting the two levels of cache in the processor. ... The memory chips are addressed in Row col fashion. ...
    (alt.comp.lang.borland-delphi)