Re: AMD64 Northbridge errors

From: Allen Smith (lazlor_at_bigboy.lotaris.org)
Date: 11/29/05

    To: redhat-list@redhat.com
    Date: Mon, 28 Nov 2005 15:38:05 -0800
    
    

    On Monday 28 November 2005 03:27 pm, Marcelino Mata wrote:
    >
    > Running RHEL 3.0 x86_64 U6 (2.4.21-37.ELsmp)
    >
    > I have searched and logged calls with HP and Red Hat support, and have
    > turned up nothing. HP says I have memory problems; Red Hat says it's a
    > known non-critical error.
    >
    > I am not sure if I am chasing the correct problem, but all six of my
    > AMD64 HP XW9300 machines (based on the Tyan Thunder K8WE?), each with
    > 4-16 GB of RAM and two Opteron CPUs, log the following errors:
    >
    > Nov 10 17:18:46 node4 kernel: CPU 0: Silent Northbridge MCE
    > Nov 10 17:18:46 node4 kernel: Northbridge status 94044100:ac080a13
    > Nov 10 17:18:46 node4 kernel: Error chipkill ecc error
    > Nov 10 17:18:46 node4 kernel: ECC error syndrome ac08
    > Nov 10 17:18:46 node4 kernel: bus error local node response, request didn't time out
    > Nov 10 17:18:46 node4 kernel: generic read
    > Nov 10 17:18:46 node4 kernel: memory access, level generic
    > Nov 10 17:18:46 node4 kernel: link number 0
    > Nov 10 17:18:46 node4 kernel: dram scrub error
    > Nov 10 17:18:46 node4 kernel: corrected ecc error
    > Nov 10 17:18:46 node4 kernel: previous error lost
    > Nov 10 17:18:46 node4 kernel: NB error address 000000000126dd40
    >
    >
    > Nov 14 19:14:16 node4 kernel: CPU 0: Silent Northbridge MCE
    > Nov 14 19:14:16 node4 kernel: Northbridge status a6000001:0005001b
    > Nov 14 19:14:16 node4 kernel: Error gart error
    > Nov 14 19:14:16 node4 kernel: GART TLB error generic level generic
    > Nov 14 19:14:16 node4 kernel: err cpu1
    > Nov 14 19:14:16 node4 kernel: processor context corrupt
    > Nov 14 19:14:16 node4 kernel: error uncorrected
    > Nov 14 19:14:16 node4 kernel: previous error lost
    > Nov 14 19:14:16 node4 kernel: NB error address 00000000dffe0038
    >
    > Five of the computers have logged between 1 and 30 of these error
    > messages in the past 3 weeks. One computer has logged over 30,000. Most
    > of the messages come from computers with more than 4 GB of RAM, but I
    > have also seen them on computers with only 4 GB.
    >
    > The main reason I am focusing on these messages is that the computers
    > have crashed numerous times since being put online. The computer with
    > 30K instances of the error message has crashed about 1-2 times per week.
    > I am running the latest BIOS.
    >
    > I cannot turn on diskdump because the machines have Nvidia SATA
    > controllers (not supported by diskdump), and netdump has not produced
    > anything because no data was written during the kernel crash (network
    > driver went down?).
    >
    > Has anyone else seen these messages, or have any idea how to identify the
    > problem? Could my crashes be due to Northbridge errors, or am I barking
    > up the wrong tree?
    >
    > Marcelino
    >
    > Reference Information below
    >
    > lspci information
    > -----------------
    >
    > 00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
    > 00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
    > 00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
    > 00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
    > 00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
    > 00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio Controller (rev a2)
    > 00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
    > 00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
    > 00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
    > 00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
    > 00:0a.0 Ethernet controller: nVidia Corporation CK804 Ethernet Controller (rev a3)
    > 00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
    > 00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
    > 00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
    > 00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
    > 00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
    > 00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
    > 00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
    > 00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
    > 00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
    > 05:05.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
    > 0a:00.0 VGA compatible controller: nVidia Corporation NV41GL [Quadro FX 1400] (rev a2)
    > 40:01.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
    > 40:01.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
    > 40:02.0 PCI bridge: Advanced Micro Devices [AMD] AMD-8131 PCI-X Bridge (rev 12)
    > 40:02.1 PIC: Advanced Micro Devices [AMD] AMD-8131 PCI-X IOAPIC (rev 01)
    > 61:06.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
    > 61:06.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 07)
    > 61:09.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5782 Gigabit Ethernet (rev 03)
    > 80:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
    > 80:01.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
    >
    > lsmod
    > -----
    > Module           Size  Used by    Tainted: P
    > nfs             95984   7  (autoclean)
    > audit          127208   2  (autoclean)
    > nfsd            86096   8  (autoclean)
    > lockd           60528   1  (autoclean) [nfs nfsd]
    > sunrpc          91944   1  (autoclean) [nfs nfsd lockd]
    > netconsole      19208   0  (unused)
    > autofs4         16912   2  (autoclean)
    > tg3             69936   1
    > nvnet           71168   1
    > sg              37880   0  (autoclean)
    > sr_mod          17676   0  (autoclean)
    > ide-scsi        12832   0
    > ide-cd          34408   0
    > cdrom           33096   0  [sr_mod ide-cd]
    > keybdev          3104   0  (unused)
    > mousedev         6728   0  (unused)
    > hid             21992   0  (unused)
    > input            7520   0  [keybdev mousedev hid]
    > ehci-hcd        21200   0  (unused)
    > usb-ohci        22864   0  (unused)
    > usbcore         85152   1  [hid ehci-hcd usb-ohci]
    > ext3            87856   2
    > jbd             57088   2  [ext3]
    > raid0            4368   1
    > sata_nv          5116   5
    > libata          49352   0  [sata_nv]
    > mptscsih        43792   0  (unused)
    > mptbase         50472   3  [mptscsih]
    > diskdumplib      6548   0  [mptscsih mptbase]
    > sd_mod          14964  10
    > scsi_mod       130124   6  [sg sr_mod ide-scsi sata_nv libata mptscsih sd_mod]
    >
    > --
    > redhat-list mailing list
    > unsubscribe mailto:redhat-list-request@redhat.com?subject=unsubscribe
    > https://www.redhat.com/mailman/listinfo/redhat-list
    >

    I have seen this on 3 similar setups. Swapping out the memory resolved it on 2 of them. On the third we had to do a complete swap (memory, motherboard, power supply, and CPU) to make the errors go away.
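
    To decide which machine to start swapping parts on, it may help to tally the events per host and per error type first. Below is a minimal Python sketch that does this, assuming the log lines look exactly like the ones quoted above (syslog timestamp, hostname, "kernel:", then the message). The names LINE_RE and summarize_mce are made up for this example; they are not part of any existing tool.

```python
import re
from collections import Counter

# Assumed line layout, taken from the quoted logs, e.g.
# "Nov 10 17:18:46 node4 kernel: CPU 0: Silent Northbridge MCE"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+kernel:\s+(?P<msg>.*)$"
)


def summarize_mce(lines):
    """Count Northbridge MCE events per (host, error type).

    An event starts at a "Silent Northbridge MCE" line. Its type is taken
    from the next "Error ..." line for the same host, for example
    "chipkill ecc error" or "gart error". Events with no such line are
    counted under "unknown".
    """
    counts = Counter()
    pending = set()  # hosts whose current event has no "Error ..." line yet
    for raw in lines:
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # not a kernel syslog line in the assumed layout
        host, msg = match.group("host"), match.group("msg")
        if "Silent Northbridge MCE" in msg:
            if host in pending:
                counts[(host, "unknown")] += 1
            pending.add(host)
        elif msg.startswith("Error ") and host in pending:
            counts[(host, msg[len("Error "):])] += 1
            pending.discard(host)
    for host in pending:
        counts[(host, "unknown")] += 1
    return counts
```

    Something like summarize_mce(open("/var/log/messages")) would then give one count per host and error type, which makes it easier to see whether the machine with 30,000 messages is logging mostly ECC errors or mostly GART errors. The log path is also an assumption; adjust it to wherever your kernel messages end up.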

