Re: [OOPS] 2.6.9-rc4, dual Opteron, NUMA, 8GB

From: Andi Kleen (ak_at_suse.de)
Date: 10/14/04

  • Next message: Maciej W. Rozycki: "Re: 2.6.9-rc4 No local APIC present or hardware disabled"
    Date:	Thu, 14 Oct 2004 20:40:18 +0200
    To: Brad Fitzpatrick <brad@danga.com>
    
    

    On Thu, Oct 14, 2004 at 11:29:52AM -0700, Brad Fitzpatrick wrote:
    > On Wed, 13 Oct 2004, Brad Fitzpatrick wrote:
    >
    > ...
    > >
    > > It turns out that NUMA is the culprit and LVM has no effect on any
    > > configuration. The machine has 6 memory slots. If I have 4 sticks of
    > > 2GB in the machine, it makes zone 0 w/ 8GB and zone 1 disabled. If I
    > > add two more 2GB sticks (total 12GB, filling all possible 6 slots),
    > > then I have 8GB in zone 0 and 4GB in zone 1, and then a mke2fs on a
    > > 280GB /dev/sdb1 works fine.
    > >
    > > In conclusion:
    > >
    > > NUMA + 12 GB -> works
    > > NUMA + 8 GB -> OOPS
    > > no numa + 8 GB -> works
    > >
    > > And I suspect that without CONFIG_SMP, 8GB will also work, by virtue
    > > of there only being one NUMA zone. I haven't tested that yet, but
    > > Randy referred me to these posts, where somebody with the same problem
    > > confirms that disabling NUMA and SMP fixes his problem:
    > >
    > > http://marc.theaimsgroup.com/?l=linux-kernel&m=109328505204081&w=2
    > > http://marc.theaimsgroup.com/?l=linux-kernel&m=109330259511819&w=2
    > >
    >
    > Replying to myself, in hopes it'll benefit others.
    >
    > It also works with NUMA and 8GB if I balance the memory between the two
    > processors, 2 sticks of 2GB on each.
    >
    > Then the zones look like:
    >
    > Scanning NUMA topology in Northbridge 24
    > Number of nodes 2 (10010)
    > Node 0 MemBase 0000000000000000 Limit 00000000ffffffff
    > Node 1 MemBase 0000000100000000 Limit 00000001ffffffff
    > node 1 shift 24 addr 100000000 conflict 0
    > node 1 shift 25 addr 1fe000000 conflict 0
    > Using node hash shift of 26
    > Bootmem setup node 0 0000000000000000-00000000ffffffff
    > Bootmem setup node 1 0000000100000000-00000001ffffffff
    > No mptable found.
    > On node 0 totalpages: 1048575
    > DMA zone: 4096 pages, LIFO batch:1
    > Normal zone: 1044479 pages, LIFO batch:16
    > HighMem zone: 0 pages, LIFO batch:1
    > On node 1 totalpages: 1048575
    > DMA zone: 0 pages, LIFO batch:1
    > Normal zone: 1048575 pages, LIFO batch:16
    > HighMem zone: 0 pages, LIFO batch:1
    >
    >
    > The eServer 325 manual in one place says you must add memory in
    > pairs, and in other places seems to suggest (but not say it's required?)
    > that you add memory in 1 & 2, then 5 & 6 (with SMP), and then 3 & 4.
    >
    > It would've been nice if Linux either flat-out halted on boot saying,
    > "This is stupid, move your memory around" or ran in a degraded memory
    > configuration reliably without crashing. The scary part is that the BIOS
    > was cool with the configuration and Linux was /mostly/ cool with it, most
    > of the time, until certain usage patterns (like mke2fs) made it crash.

    Linux has no such limitations in the NUMA code, so it wouldn't
    make sense to test for it. Most likely it was just an hardware problem -
    specific memory configurations don't work correctly in your motherboard
    and in specific memory access patterns triggered by mkfs, possible including
    bus master IO from the RAID controller, it corrupted data.

    It's not possible in Linux to test for all the limitations all the systems
    it could run on may have in their memory configuration. Linux just assumes that it can use the memory it gets passed. I would talk to your BIOS vendor about
    better sanity checking or perhaps your motherboard vendor about not
    adding such restrictions

    > Andi, I can still get you information, but for now I'll mark this one down
    > as "user error". Perhals you'll have an idea on how to make Linux more
    > robust if another user winds up with this same issue, as Christopher
    > Swingley and I did.

    There is no way to make Linux robust with unreliable memory subsystems,
    sorry. It would be like trying to make a human more robust
    with an unreliable O2 supply. Memory just has to work.

    There are tools like memtest86 that can be used to diagnose such
    things in a controlled environment. But in your case it probably
    involved PCI traffic, and that is hard to simulate with standard tools.

    -Andi
    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/


  • Next message: Maciej W. Rozycki: "Re: 2.6.9-rc4 No local APIC present or hardware disabled"

    Relevant Pages

    • Migrate pages from a ccNUMA node to another
      ... We are left with Non Uniform Memory Architectures. ... You can make use of the forthcoming NUMA APIs to set up your NUMA environment: ... (e.g. it is a reference benchmark) ... Page migration tries to help you out in these situations. ...
      (Linux-Kernel)
    • Re: linux server slow.
      ... I have booted Linux with "linux numa=off". ... linux partition is able to see all the memory allocated for it. ... Since we have only 4 CPU's we donot need the numa. ...
      (RedHat)
    • Re: NUMA API - wish list
      ... > resources from some NUMA domains to others, ... all available memory bandwidth and the best average memory latency. ... require to go all the way to a full workload manager ... But NUMA knowledge is purely for optimization. ...
      (Linux-Kernel)
    • Re: [rfc] balance-on-fork NUMA placement
      ... course will leave the memory behind...), but it does give a bit more ... We certainly used to copy all page tables on fork. ... the NUMA scheduler, and every time we decide it's a bad idea. ... cycle slower, not faster, so exec is generally a much better time to do ...
      (Linux-Kernel)
    • Re: PG_zero
      ... problem twice as much memory as I do to get the same boost. ... > Remember it's not global at all on the larger systems, since they're NUMA. ... higher latency of the buddy allocator, this is why the per-cpu lists ... The buffering is really a "refill", ...
      (Linux-Kernel)