AMD64 Northbridge errors

From: Marcelino Mata (mmata_at_multimatic.com)
Date: 11/29/05

    Date: Mon, 28 Nov 2005 18:27:49 -0500
    To: "General Red Hat Linux discussion list" <redhat-list@redhat.com>
    
    

    Running RHEL 3.0 x86_64 U6 (2.4.21-37.ELsmp)

    I have searched and logged calls with both HP and Red Hat support, but
    have turned up nothing. HP says I have memory problems; Red Hat says
    it's a known non-critical error.

    I am not sure if I am chasing the correct problem, but all six of my
    AMD64 HP XW9300 workstations (based on the Tyan Thunder K8WE?), each
    with 4-16 GB of RAM and two Opteron CPUs, get the following errors:

    Nov 10 17:18:46 node4 kernel: CPU 0: Silent Northbridge MCE
    Nov 10 17:18:46 node4 kernel: Northbridge status 94044100:ac080a13
    Nov 10 17:18:46 node4 kernel: Error chipkill ecc error
    Nov 10 17:18:46 node4 kernel: ECC error syndrome ac08
    Nov 10 17:18:46 node4 kernel: bus error local node response, request didn't time out
    Nov 10 17:18:46 node4 kernel: generic read
    Nov 10 17:18:46 node4 kernel: memory access, level generic
    Nov 10 17:18:46 node4 kernel: link number 0
    Nov 10 17:18:46 node4 kernel: dram scrub error
    Nov 10 17:18:46 node4 kernel: corrected ecc error
    Nov 10 17:18:46 node4 kernel: previous error lost
    Nov 10 17:18:46 node4 kernel: NB error address 000000000126dd40
     
     
    Nov 14 19:14:16 node4 kernel: CPU 0: Silent Northbridge MCE
    Nov 14 19:14:16 node4 kernel: Northbridge status a6000001:0005001b
    Nov 14 19:14:16 node4 kernel: Error gart error
    Nov 14 19:14:16 node4 kernel: GART TLB error generic level generic
    Nov 14 19:14:16 node4 kernel: err cpu1
    Nov 14 19:14:16 node4 kernel: processor context corrupt
    Nov 14 19:14:16 node4 kernel: error uncorrected
    Nov 14 19:14:16 node4 kernel: previous error lost
    Nov 14 19:14:16 node4 kernel: NB error address 00000000dffe0038
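
    The "Northbridge status" words in the two samples above can be split
    into fields by hand. What follows is a minimal sketch, not the kernel's
    own decoder. The bit positions used (error code in the low 16 bits of
    the second word, extended error code in bits 16-19, ECC syndrome taken
    from both words) are an assumption that happens to reproduce the
    "ECC error syndrome ac08" line above. The function name and the lookup
    table are hypothetical, and the table only covers the two error types
    that actually appear in these logs.

```python
# Extended error codes seen in the two log samples above. This table is
# deliberately incomplete: it only covers what the logs themselves show.
OBSERVED_EXT_CODES = {0x8: "chipkill ecc error", 0x5: "gart error"}


def split_nb_status(text):
    """Split a 'HHHHHHHH:LLLLLLLL' status string into fields.

    The bit layout is an assumption (see the note above), checked only
    against the two samples in this post.
    """
    high_s, low_s = text.strip().split(":")
    high, low = int(high_s, 16), int(low_s, 16)
    ext = (low >> 16) & 0xF  # assumed: extended error code, bits 16-19
    return {
        "high": high,
        "low": low,
        "error_code": low & 0xFFFF,  # assumed: low 16 bits
        "ext_error_code": ext,
        "ext_error_name": OBSERVED_EXT_CODES.get(ext, "unknown"),
        # Assumed: syndrome bits 15-8 from the low word, bits 7-0 from the
        # high word. Only meaningful for ECC errors.
        "syndrome": (((low >> 24) & 0xFF) << 8) | ((high >> 15) & 0xFF),
    }
```

    Called on "94044100:ac080a13", this yields extended code 8 and syndrome
    0xac08; on "a6000001:0005001b" it yields extended code 5, which lines up
    with the "gart error" text in the second sample.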

    Five of the computers have each logged between 1 and 30 of these error
    messages in the past 3 weeks. One computer has logged over 30,000.
    Most of the messages come from computers with more than 4 GB of RAM,
    but I have also seen them on computers with only 4 GB.
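
    Per-machine counts like these can be collected with a short script.
    This is a rough sketch that assumes the classic syslog layout shown in
    the samples above (month, day, time, host, "kernel:", message) and
    treats each "Northbridge MCE" header line plus the following "Error ..."
    line as one event. The function name and the regular expression are
    illustrative and have not been tested against other kernel versions.

```python
import re
from collections import Counter

# Matches syslog lines such as:
#   "Nov 10 17:18:46 node4 kernel: CPU 0: Silent Northbridge MCE"
LINE_RE = re.compile(r"^\s*\w{3}\s+\d+\s+[\d:]{8}\s+(\S+)\s+kernel:\s+(.*)$")


def tally_nb_errors(lines):
    """Count Northbridge MCE events per (host, error type)."""
    counts = Counter()
    pending_host = None  # host whose MCE header still awaits its "Error ..." line
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not a kernel syslog line in the assumed layout
        host, msg = m.groups()
        if "Northbridge MCE" in msg:
            pending_host = host
        elif pending_host == host and msg.startswith("Error "):
            counts[(host, msg[len("Error "):].strip())] += 1
            pending_host = None
    return counts
```

    For example, `tally_nb_errors(open("/var/log/messages"))` returns a
    Counter keyed by (host, error type), which makes it easy to see whether
    one machine is dominated by ECC errors or by GART errors. The log path
    is an assumption; adjust it to wherever the kernel messages are kept.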

    The main reason I am focusing on these messages is that the computers
    have crashed numerous times since being put online. The computer with
    30K instances of the error message has crashed about 1-2 times per week.
    I am running the latest BIOS.

    I cannot turn on diskdump since the machines have Nvidia SATA
    controllers (not supported by diskdump), and netdump has not produced
    anything: no data was written during the kernel crash (network driver
    went down?).

    Has anyone else seen these messages, or does anyone have an idea how to
    identify the problem? Could my crashes be due to Northbridge errors, or
    am I barking up the wrong tree?

    Marcelino

    Reference Information below

    lspci information
    -----------------

    00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
    00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
    00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
    00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
    00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
    00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio Controller (rev a2)
    00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
    00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
    00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
    00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
    00:0a.0 Ethernet controller: nVidia Corporation CK804 Ethernet Controller (rev a3)
    00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
    00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
    00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
    00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
    00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
    00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
    00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
    00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
    00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
    05:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
    0a:00.0 VGA compatible controller: nVidia Corporation NV41GL [Quadro FX 1400] (rev a2)
    40:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
    40:01.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
    40:02.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
    40:02.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
    61:06.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
    61:06.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
    61:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5782 Gigabit Ethernet (rev 03)
    80:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
    80:01.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)

    lsmod
    -----
    Module           Size  Used by    Tainted: P
    nfs             95984   7 (autoclean)
    audit          127208   2 (autoclean)
    nfsd            86096   8 (autoclean)
    lockd           60528   1 (autoclean) [nfs nfsd]
    sunrpc          91944   1 (autoclean) [nfs nfsd lockd]
    netconsole      19208   0 (unused)
    autofs4         16912   2 (autoclean)
    tg3             69936   1
    nvnet           71168   1
    sg              37880   0 (autoclean)
    sr_mod          17676   0 (autoclean)
    ide-scsi        12832   0
    ide-cd          34408   0
    cdrom           33096   0 [sr_mod ide-cd]
    keybdev          3104   0 (unused)
    mousedev         6728   0 (unused)
    hid             21992   0 (unused)
    input            7520   0 [keybdev mousedev hid]
    ehci-hcd        21200   0 (unused)
    usb-ohci        22864   0 (unused)
    usbcore         85152   1 [hid ehci-hcd usb-ohci]
    ext3            87856   2
    jbd             57088   2 [ext3]
    raid0            4368   1
    sata_nv          5116   5
    libata          49352   0 [sata_nv]
    mptscsih        43792   0 (unused)
    mptbase         50472   3 [mptscsih]
    diskdumplib      6548   0 [mptscsih mptbase]
    sd_mod          14964  10
    scsi_mod       130124   6 [sg sr_mod ide-scsi sata_nv libata mptscsih sd_mod]

    
