Re: [Lse-tech] [patch] sched-domain cleanups, sched-2.6.5-rc2-mm2-A3

From: Erich Focht (efocht_at_hpce.nec.com)
Date: 04/01/04

  • Next message: Len Brown: "Re: ACPI SCI IOAPIC bug (Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered)"
    To: Nick Piggin <nickpiggin@yahoo.com.au>
    Date:	Thu, 1 Apr 2004 00:23:05 +0200
    
    

    On Wednesday 31 March 2004 04:08, Nick Piggin wrote:
    > >>I'm with Martin here, we are just about to merge all this
    > >>sched-domains stuff. So we should at least wait until after
    > >>that. And of course, *nothing* gets changed without at least
    > >>one benchmark that shows it improves something. So far
    > >>nobody has come up to the plate with that.
    > >
    > > I thought you're talking the whole time about STREAM. That is THE
    > > benchmark which shows you an impact of balancing at fork. At it is a
    > > VERY relevant benchmark. Though you shouldn't run it on historical
    > > machines like NUMAQ, no compute center in the western world will buy
    > > NUMAQs for high performance... Andy typically runs STREAM on all CPUs
    > > of a machine. Try on N/2 and N/4 and so on, you'll see the impact.
    >
    > Well yeah, but the immediate problem was that sched-domains was
    > *much* worse than 2.6's numasched, neither of which balance on
    > fork/clone. I didn't want to obscure the issue by implementing
    > balance on fork/clone until we worked out exactly the problem.

    I had the feeling that solving the performance issue reported by Andi
    would ease the integration into the baseline...

    > >>The point is that we have never had this before, and nobody
    > >>(until now) has been asking for it. And there are as yet no
    > >
    > > ?? Sorry, I'm having balance at fork since 2001 in the NEC IA64 NUMA
    > > kernels and users use it intensively with OpenMP. Advertised it a lot,
    > > asked for it, atlked about it at the last OLS. Only IA64 was
    > > considered rare big iron. I understand that the issue gets hotter if
    > > the problem hurts on AMD64...
    >
    > Sorry I hadn't realised. I guess because you are happy with
    > your own stuff you don't make too much noise about it on the
    > list lately. I apologise.

    The usual excuse: busy with other stuff...

    > I wonder though, why don't you just teach OpenMP to use
    > affinities as well? Surely that is better than relying on the
    > behaviour of the scheduler, even if it does balance on clone.

    You mean in the compiler? I don't think this is a good idea, that way you
    loose flexibility in resource overcomitment. And performance when overselling
    the machine's CPUs.

    > > Again: I'm talking about having the choice. The user decides. Nothing
    > > protects you against user stupidity, but if they just have the choice
    > > of poor automatic initial scheduling, it's not enough. And: having the
    > > fork/clone initial balancing policy means: you don't need to make your
    > > code complicated and unportable by playing with setaffinity (which is
    > > just plainly unusable when you share the machine with other users).
    >
    > If you do it by hand, you know exactly what is going to happen,
    > and you can turn off the balance-on-clone flags and you don't
    > incur the hit of pulling in remote cachelines from every CPU at
    > clone time to do balancing. Surely an HPC application wouldn't
    > mind doing that? (I guess they probably don't call clone a lot
    > though).

    OpenMP is implemented with clone. MPI parallel applications just exec,
    they're fine. IMO the static affinity/cpumask handling should be done
    externally by some resource manager which has a good overview on the
    long-term load of the machine. It's a different issue, nothing for the
    scheduler. I wouldn't leave it to the program, too unflexible and
    unportable across machines and OSes.

    > > There are companies mainly selling NUMA machines for HPC (SGI?), so
    > > this is not a niche market. Clusters of big NUMA machines are not
    > > unusual, and they're typically not used for databases but for HPC
    > > apps. Unfortunately proprietary UNIX is still considered to have
    > > better features than Linux for such configurations.
    >
    > Well, SGI should be doing tests soon and tuning the scheduler
    > to their liking. Hopefully others will too, so we'll see what
    > happens.

    Maybe they are happy with their stuff, too. They have the cpumemsets
    and some external affinity control, AFAIK.

    > >>Let's just make sure we don't change defaults without any
    > >>reason...
    > >
    > > No reason? Aaarghh... >;-)
    >
    > Sorry I mean evidence. I'm sure with a properly tuned
    > implementation, you could get really good speedups in lots
    > of places... I just want to *see* them. All I have seen so
    > far is Andi getting a bit better performance on something
    > where he can get *much* better performance by making a
    > trivial tweak instead.

    I get the feeling that Andi's simple OpenMP job is already complex
    enough to lead to wrong initial scheduling with the current aproach.
    I suppose the reason are the 1-2 helper threads which are started
    together with the worker threads (depending on the used compiler).
    On small machines (and 4 cpus is small) they significantly disturb
    the initial task distribution. For example with the Intel compiler
    and 4 worker threads you get 6 tasks. The helper tasks are typically
    runnable when the code starts so you get (in order of creation)
       CPU Task Role
        1 1 worker
        2 2 helper
        3 3 helper
        4 4 worker
        1-4 5 worker
        1-4 6 worker

    So the difficulty is to find out which task will do real work and
    which task is just spoiling the statistics. I think...

    Regards,
    Erich

    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/


  • Next message: Len Brown: "Re: ACPI SCI IOAPIC bug (Re: Fixes for nforce2 hard lockup, apic, io-apic, udma133 covered)"

    Relevant Pages

    • Re: PII vs PIII
      ... >> What CPUs were in those machines and what did you use them for Lane? ... >> breaking the pins off CPUs, a move always seen as being irreversible. ... I managed to break off the wrong pin. ...
      (comp.os.linux.hardware)
    • Re: When a computer start
      ... it's reasonable for non-PC machines to have either a gated register ... or a tiny ROM or even a CMOS-RAMon top of the memory range. ... At least AMD64 work the same way as 32-bit CPUs on RESET. ...
      (alt.lang.asm)
    • Re: Need help to reanimate an AT&T AT&T 3B1/7300 UNIX-PC
      ... There was only one Sun that I can recall. ... Ultra-60s (dual 450 MHz CPUs) as my desktop machine. ... which are 64-bit machines native -- with the ability to run ... That is part of my first unix box -- a Cosmos CMS/UNX running on an 8 ...
      (comp.sys.3b1)
    • Re: a dozen cpus on a chip
      ... This will make it a lot harder to say how many CPUs are in a chip. ... SMPs cost much more than loosely-coupled machines, but there are good commercial reasons to use them. ... Keeping the illusion of symmetric shared memory really simplifies the programming model--a hugely important issue that non-programmers usually have no idea about. ... Ever since Danny Hillis & Co. back in the 80s, people have been pushing one sort or another of massively parallel machine. ...
      (sci.electronics.design)
    • Re: Parallel Common-Lisp with at least 64 processors?
      ... FORK-MAPCAR automatically distributes across all the CPUs. ... For launching on multiple machines connected across the InterNet, ... pre-distributed to the various hosts, ... any given time and distributing tasks via this new protocol we've ...
      (comp.lang.lisp)