Re: nasm over gas?

From: Jamie Lokier (jamie_at_shareable.org)
Date: 09/08/03

  • Next message: Stephen Hemminger: "Re: [PATCH] Re: today's futex changes"
    Date:	Mon, 8 Sep 2003 17:58:31 +0100
    To: "Ihar 'Philips' Filipau" <filia@softhome.net>
    
    

    Ihar 'Philips' Filipau wrote:
    > Jamie Lokier wrote:
    > >Ihar 'Philips' Filipau wrote:
    > >
    > >> It will depend on arch CPU only in case if you have unlimited i$ size.
    > >> Servers with 8MB of cache - yes it is faster.
    > >> Celeron with 128k of cache - +4bytes == higher probability of i$ miss
    > >>== lower performance.
    > >
    > >Higher probability != optimal performance.
    > >
    > >It depends on your execution context. If it's part of a tight loop
    > >which is executed often, then saving a cycle in the loop gains more
    > >performance than saving icache, even on a 128k Celeron.
    > >
    >
    > You think as system-programmer.
    > Every bit of i$ waste - hit user space applications too.
    > 128k of $ - is for every app.
    >
    > If you gained one cycle by polluting one more cache line - do not
    > forget that this cache line probably contained some info, which was able
    > to avoid cache miss for another application. So you gained cycle here -
    > and lost it immediately in another app. Not good.

    Usually the whole L1 cache is flushed between appliation context
    switches anyway, so the cost of a miss is borne by the application
    which causes it.

    And that _still_ doesn't change the truth of my statement. One cycle
    saved in a loop which executes 1000 times is worth more than an L1
    i-cache miss, always.

    > If you can improve performance by NOT polluting cache - it would be
    > another story :-)))

    Yes, of course that is better when it is possible.

    Modern OOO CPUs are subtle beasts. Like I said, I added a single
    "nop" (one-byte instruction) to a tight graphics loop once, and the
    loop went significantly faster. I could not explain it, except that I
    know the Pentium Pro instruction decode stage has many quirks.

    > But still I never saw this kind of thing being used in kernel.
    > Instead of writing normal asm we have something like i386/mmx.c. And
    > i386/checksum.S not the best sample of asm in kernel too. Sad.

    Those were written before GAS had a macro facility. I agree with you,
    it should be used more in the kernel.

    The two examples you gave have been carefully tuned on particular
    CPUs, by trial and error. Changing the instruction order makes a big
    difference to their performance.

    -- Jamie
    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/


  • Next message: Stephen Hemminger: "Re: [PATCH] Re: today's futex changes"

    Relevant Pages

    • Re: I want to learn forth but...
      ... on each pass of the loop. ... deeper in the execution stack. ... faster than even a register bank of any real size. ...
      (comp.lang.forth)
    • Re: Reducing build-times for large projects.
      ... If the line before it executed, then the very next inline code has ... if your architecture does speculative execution and fetch. ... > THE FUNCTION THAT IS ALREADY IN CACHE! ... Loop unrolling "works" because you get to do N iterations worth of the ...
      (comp.lang.cpp)
    • Re: Running a background task
      ... background task that runs more or less forever, while normal execution ... entry On_String (String: in SideAB) ... It's generally not a good idea to reuse the identifier "String", since it hides the predefined type String, which will cause confusing error msgs if you ever try to use that type. ... end loop; ...
      (comp.lang.ada)
    • Re: Python indentation
      ... >>indentation aimlessly wandering out, then in, then out again, then in at ... > It is as much a single flow of execution as an if elif else. ... >>assembly languages have a loop structure quite so ugly as that. ... it's confusing and misleading too. ...
      (comp.lang.python)
    • Re: I want to learn forth but...
      ... on each pass of the loop. ... deeper in the execution stack. ... memory, and the part I quoted gave another loop approach (cyclic ...
      (comp.lang.forth)