Re: [PATCH RFC/RFB] x86_64, i386: interrupt dispatch changes



Sorry to reply so late on this slightly off-topic rant...

On Wednesday 05 November 2008 21:26, Ingo Molnar wrote:
* Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:
On Tue, Nov 04, 2008 at 09:44:00PM +0100, Ingo Molnar wrote:

It's only an issue on ancient CPUs that export all their LOCKed
cycles to the bus. Pentium and older or so. The PPro got it right
already.

??? LOCK slowness is not because of the bus. And I know you know
that Ingo, so I don't know why you wrote that bogosity above.

.. of course the historic LOCK slowness was all due to the system bus:
very old CPUs exported a LOCK signal to the system bus for every
LOCK-prefix access (implicit and explicit) and that made it _really_
expensive. (hundreds of cycles)

... on reasonably modern CPUs the LOCK-ed access has been abstracted
away to within the CPU, and the cost of LOCK-ed access is rather low
(think 10-20 cycles - of course only if there's no cache miss cost)
(That's obviously the case with the GDT, which is both per CPU and well
cached.)

A locked instruction is, AFAIR, about 50 cycles on Core2. I think it is
a bit lower on K8. On Nehalem, which has optimisations for these,
I have heard it is still about 20-25 cycles, although I don't have
one, so I don't actually know.

These (on my Core2) don't seem to pipeline at all with other
instructions either. So on my Core2, a locked instruction is worth
maybe 150-200 regular pipelined, superscalar instructions.
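
For anyone who wants to check numbers like these on their own box, here
is a rough userspace sketch. It is not from this thread: the function
names are made up, it assumes a C11 compiler where atomic_fetch_add()
typically becomes a LOCK-prefixed instruction on x86, and it only times
an uncontended, cache-hot counter with clock(), so treat any result as a
ballpark figure rather than a cycle-accurate measurement.

```c
#include <assert.h>
#include <stdatomic.h>
#include <time.h>

/*
 * Time `iters` plain increments. `volatile` is only there to stop the
 * compiler from collapsing the loop; it adds no LOCK prefix.
 * Returns elapsed CPU time in seconds.
 */
static double time_plain(volatile long *counter, long iters)
{
    clock_t start = clock();
    for (long i = 0; i < iters; i++)
        (*counter)++;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/*
 * Time `iters` atomic increments. On x86 compilers typically emit a
 * LOCK-prefixed add/xadd for this (an assumption about codegen, not a
 * guarantee of the C standard).
 * Returns elapsed CPU time in seconds.
 */
static double time_locked(atomic_long *counter, long iters)
{
    clock_t start = clock();
    for (long i = 0; i < iters; i++)
        atomic_fetch_add_explicit(counter, 1, memory_order_seq_cst);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling both with the same large iteration count (say 100 million) and
dividing the two times gives a crude locked/unlocked cost ratio for one
particular CPU and compiler.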

There is another big reason why lock instructions are expensive:
they have to prevent subsequent loads from executing ahead of any
previous stores that have not yet become visible. In theory the loads
could be speculated to some extent, but no matter what happens, the
program-visible state can't be committed until the stores are.
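
To make that ordering requirement concrete, here is a minimal sketch of
the classic "store my flag, then load the other side's flag" handshake.
It is only an illustration under assumptions: flag[], try_enter() and
leave() are hypothetical names, and it relies on C11 seq_cst atomics,
for which x86 compilers typically emit a locked exchange for the store
side.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-party flags: flag[0] for one CPU/thread, flag[1] for the other. */
static atomic_int flag[2];

/*
 * Announce interest, then look at the other side.
 * The exchange is a locked read-modify-write, so the store to
 * flag[self] has to be visible before the following load of
 * flag[other] is allowed to complete. With a plain store and a plain
 * load, the load could pass the store and both sides could see 0.
 */
static bool try_enter(int self)
{
    int other = 1 - self;

    atomic_exchange_explicit(&flag[self], 1, memory_order_seq_cst);
    return atomic_load_explicit(&flag[other], memory_order_seq_cst) == 0;
}

/* Withdraw interest so the other side can get in. */
static void leave(int self)
{
    atomic_store_explicit(&flag[self], 0, memory_order_release);
}
```

If two threads call try_enter(0) and try_enter(1) concurrently, at most
one of them can get `true` back; that guarantee is exactly what the
store-to-load ordering buys, and what the hardware has to pay for.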

I heard from an Intel hardware engineer that Nehalem has some
really fancy logic in it to make locked instructions "free", that
was nacked for earlier CPUs because it was too costly. So obviously
it is taking a fair whack of transistors or power for them to do it.
And even then it is far from free: it still seems to be one or two
orders of magnitude more expensive than a regular instruction.


on _really_ modern CPUs LOCK can be as cheap as just a few cycles - so

Oh, maybe I'm mistaken about Nehalem then? How many is "just a few"?
If it is 25 non-pipelined cycles, then that's still 100 instructions
if it is a 4 issue machine.
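
The arithmetic behind that estimate, spelled out as a tiny sketch. The
helper name is made up, and it assumes the worst case where nothing
else issues while the locked instruction is in flight.

```c
#include <assert.h>

/*
 * Issue slots given up while the pipeline is stalled:
 * stall cycles multiplied by the machine's issue width.
 */
static unsigned lost_issue_slots(unsigned stall_cycles, unsigned issue_width)
{
    assert(issue_width > 0);
    return stall_cycles * issue_width;
}
```

lost_issue_slots(25, 4) gives the 100 instructions above; the same
formula with 50 cycles and an issue width of 3 to 4 gives the 150-200
range quoted for Core2.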


low that we can stop bothering about it in the future. There's no
fundamental physical reason why the LOCK prefix (implicit or explicit)
should be expensive.

Even if they could make it free on the software side, it is obviously
expensive on the hardware side. Not bothering about it is a copout.
The atomic instruction speedups in Nehalem are cool, but what would
have been even cooler is if Intel had decided *not* to spend resources
making this cheaper because they found Linux has so few locked
instructions :)

Even if somehow the x86 ISA didn't have the implicit memory ordering
requirement in the lock instruction, I think it's obviously a special
case path that doesn't fit in with a load/store uarch (whether they
implement it in uops with an ll/sc-like thing or whatnot, it's going to
need special logic).

IMO, we shouldn't stop bothering about the LOCK prefix in the foreseeable
future.



