Re: [PATCH] [2.6] [2/2] hlist: remove IFs from hlist functions

From: Andi Kleen (ak_at_suse.de)
Date: 02/14/04

  • Next message: Tim Vandermeersch: "[PATCH] Make Krups javastation work with 2.6.2"
    Date:	Sat, 14 Feb 2004 19:59:49 +0100
    To: Alex Pankratov <ap@swapped.cc>
    
    

    On Wed, 11 Feb 2004 08:48:44 -0800
    Alex Pankratov <ap@swapped.cc> wrote:

    >
    > Andi Kleen wrote:
    >
    > > Alex Pankratov <ap@swapped.cc> writes:
    > >
    > >>
    > >>No, because its 'pprev' field *is* getting modified.
    > >
    > > I didn't notice this before, sorry. But this could end up
    > > being a scalability problem on big SMP systems. Even though
    > > the cache line of this is never read it will bounce all the
    > > time between all CPUs using hlists and add considerably
    > > latency and cross node traffic. Remember Linux is supposed
    > > to run well on 128 CPU machines now.
    >
    > That's a bit above my head. How does this potential latency
    > compare to the speed up due to not having CMPs ? My cycle
    > counting skills are a bit dusty :)

    A full cache miss is extremly costly on a modern Gigahertz+ CPU because
    memory and busses are far slower than the CPU core. As a rule of
    thumb 1000+ cycles. An CMP is extremly cheap (a few cycles at worst),
    the only thing that could be more expensive is an mispredicted conditional
    jump triggered by the CMP. But even that would be at best a few tens of cycles.
    If everything is mispredicted which should be common it's extremly fast
    (a few cycles only)

    In addition on a big multiprocessor machine bouncing cache lines between
    CPUs is costly because it adds load to the interconnect.

    One way to avoid the possible mispredicted jump would be to reorganize the
    code that the compiler can use CMOV. The issue is that on x86 CMOV
    cannot do a conditional store to memory, so it has to use something like

            unsigned long *target = realtarget;
            unsigned long dummy;
            if (condfalse)
                    target = &dummy; /* <---- can be converted to CMOV */
            *target = dummy;
                            
    (gcc should do that on its own, but it doesn't). I'm not really sure
    it's practical to do that for the CMP here and if it's even worth it.

    Most likely the hlist loops are dominated by cache misses in walking the
    nodes anyways.

    >
    > >
    > > Maybe you can make it UP only, but I'm still not sure it's
    > > worth it.
    > >
    >
    > Sorry, I didn't the 'UP' part.

    Uniprocessor = !CONFIG_SMP with an #ifdef.

    -Andi
    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/


  • Next message: Tim Vandermeersch: "[PATCH] Make Krups javastation work with 2.6.2"

    Relevant Pages

    • Re: Application threading [was Re: Dual-Core CPU??]
      ... Having said that the cache now has far more work to do, ... ops and some spare cycles for fetch ahead and maybe some HW links too. ... is shared between opcode fetch and others chores. ... In the barrel engine I am working on, the practical form has inner & ...
      (comp.sys.sun.hardware)
    • Re: AMD64 assembly code optimization
      ... that's why I wouldn't have any prefetch in short loops. ... replacing the wide-spread memory references to access a small array ... My processor has 64K L1 data cache and that in takes 3 cycles to fetch ...
      (comp.lang.asm.x86)
    • Re: number of machine instructions - for algorithm measuring
      ... i don't have to do it as stated by the exercise. ... Compilers typically understand the ... > in cache the access might take just one clock cycle. ... It can take hundreds or even thousands of cycles on newer ...
      (comp.lang.c)
    • Re: number of machine instructions - for algorithm measuring
      ... Mark F. Haigh wrote: ... Compilers typically understand the ... >> in cache the access might take just one clock cycle. ... It can take hundreds or even thousands of cycles on newer ...
      (comp.lang.c)
    • Re: Nocona [Intel 64-bit cpu timing]
      ... The P4 is purely to push clock at the expense of execute resources. ... data would fit in the L1 cache of most processors. ... effectively I get the same bandwidth, that is one read from memory ... 133M cycles per second. ...
      (sci.crypt)