RE: Scaling noise

From: Robert White (rwhite_at_casabyte.com)
Date: 09/05/03

  • Next message: Dave Hansen: "Re: [RFC] NR_CPUS=8 on a 32 cpu box"
    To: "'William Lee Irwin III'" <wli@holomorphy.com>, "'Larry McVoy'" <lm@work.bitmover.com>, "'Brown, Len'" <len.brown@intel.com>, "'Giuliano Pochini'" <pochini@shiny.it>, "'Larry McVoy'" <lm@bitmover.com>, <linux-kernel@vger.kernel.org>
    Date:	Thu, 4 Sep 2003 18:34:25 -0700
    
    

    Not to throw a flag on the play but...

    Larry asks why penalize low end systems by making the kernel
    many-cpu-friendly. The implicit postulate in his question is that the
    current design path is "unfair" to the single and very-small-N N-way systems
    in favor of the larger-N and very-large-N niche user base.
     
    Lots of discussions then ensue about high end scalability and the
    performance penalties of doing memory barriers and atomic actions for large
    numbers of CPUs.

    I'm not sure I get it. The large-end dynamics don't seem to apply to the
    question, in support nor rebut. Is there any concrete evidence to support
    the inference?

    What really is the impact of high-end scalability issues on the low end
    machines?

    It *seems* that the high-end diminishing returns due to having any one CPU's
    cache invalidated by N companions is clearly decoupled from the
    uni-processor model because there are no "other" CPUs to cause the bulk of
    such invalidations when there is only one processor.

    It *seems* that in a single-kernel architecture, as soon as you reach "more
    than one CPU" what "must be shared" ...er... must be shared.

    It *seems* that what is unique to each CPU (the instances CPU private data
    structure) are wholly unlikely to be faulted out because of the actions of
    other CPUs. (that *is* part of why they are in separate spaces, right?)

    It *seems* that the high-end degradation of cache performance as N increases
    is coupled singularly to the increase in N and the follow-on multi-state
    competition for resources. (That is, it is the real presence of the 64 CPUs
    and not the code to accommodate them is the cause of the invalidations. If,
    on my 2-way box I set the MAX_CPU_COUNT to 8 or 64, the only difference in
    runtime is the unused memory for the 6 or 62 per-cpu-data structures. The
    cache-invalidation and memory-barrier cost is bounded by the two real CPUs
    and not the 62 empty slots.)

    It *seems* that any attempt to make a large N system cache friendly would
    involve preventing as many invalidations as possible.

    And finally, it *seems* that any large-N design, e.g. one that keeps cache
    invalidation to a minimum, would, by definition, directly *benefit* the
    small N systems because they would naturally also have less invalidation.

    That is, "change a pointer, flush a cache" is true for all N greater than 1,
    yes? So if I share between 2 or 1000, it is the actions of the 2 or 1000 at
    runtime that exact the cost. Any design to better accommodate "up to 1000"
    will naturally tend to better accommodate "up to 4".

    So...

    What/where, if any, are the examples of something being written (or some
    technique being propounded) in the current kernel design that "penalizes" a
    2-by or 4-by box in the name of making a 64-by or 128-by machine perform
    better?

    Clearly keeping more private data private is "better" and "worse" for its
    respective reasons in a memory-footprint for cache-separation tradeoff, but
    that is true for all N >= 2.

    Is there some concrete example(s) of SMP code that is wrought overlarge? I
    mean all values of N between 1 and 255 fit in one byte (duh 8-) but cache
    invalidations happen in more than two bytes, so having some MAX_CPU_COUNT
    bounded at 65535 is only one byte more expansive and no-bytes more expensive
    in the cache-consistency-conflict space. At runtime, any loop that iterates
    across the number of active CPUs will be clamped at the actual number, and
    not the theoretical max.

    I suspect that the original question is specious and generally presumes
    facts not in evidence. Of course, just because *I* cannot immediately
    conceive of any useful optimization for a 128-way machine that is inherently
    detrimental to a 2-way or 4-way box, doesn't mean that no such optimization
    exists.

    Someone enlighten me please.

    Rob White

    -----Original Message-----
    From: linux-kernel-owner@vger.kernel.org
    [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of William Lee Irwin
    III
    Sent: Wednesday, September 03, 2003 11:16 AM
    To: Larry McVoy; Brown, Len; Giuliano Pochini; Larry McVoy;
    linux-kernel@vger.kernel.org
    Subject: Re: Scaling noise

    At some point in the past, I wrote:
    >> The lines of reasoning presented against tightly coupled systems are
    >> grossly flawed.

    On Wed, Sep 03, 2003 at 11:05:47AM -0700, Larry McVoy wrote:
    > [etc].
    > Only problem with your statements is that IBM has already implemented all
    > of the required features in VM. And multiple Linux instances are running
    > on it today, with shared disks underneath so they don't replicate all the
    > stuff that doesn't need to be replicated, and they have shared memory
    > across instances.

    Independent operating system instances running under a hypervisor don't
    qualify as a cache-coherent cluster that I can tell; it's merely dynamic
    partitioning, which is great, but nothing to do with clustering or SMP.

    -- wli
    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/

    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/


  • Next message: Dave Hansen: "Re: [RFC] NR_CPUS=8 on a 32 cpu box"

    Relevant Pages

    • Re: Automatic parallelization - was Re: LISP Object Oriented?
      ... the cache coherence system. ... that package to main memory and the design of that memory system, ... ramifications far beyond the design of the package alone. ... Now, the situation changes as you add CPUs, but the hit each CPU takes ...
      (comp.lang.lisp)
    • Re: [patch 0/6] mm: alloc_percpu and bigrefs
      ... David S. Miller a écrit: ... with no per-cpu or per-node additional memory but got no comment. ... It's important to place mostly read parts together, so that a cache lines can ... bus trafic between CPUS. ...
      (Linux-Kernel)
    • Re: [PATCH 1/24] make atomic_read() behave consistently on alpha
      ... If you need to guarantee that the value is written to memory at a particular time in your execution sequence, you either have to read it from memory to force the compiler to store it first (and a volatile cast in atomic_read will suffice for this) or you have to use LOCK_PREFIX instructions which will invalidate remote cache lines containing the same variable. ... Communicating both with interrupt handler and with other CPUs. ...
      (Linux-Kernel)
    • Re: Concurency in the functional world
      ... you're saying the message is some big data structure consed out ... global memory ... ... So we have a global memory, which n CPUs act on. ... cache to speed access to the global memory. ...
      (comp.lang.functional)
    • Re: [PATCH 1/24] make atomic_read() behave consistently on alpha
      ... CPU can hold that data in cache as long as it wants before it writes ... it to memory. ... not-yet-written volatile value. ... Communicating both with interrupt handler and with other CPUs. ...
      (Linux-Kernel)