MCE on an NForce4 board (again)



Hi there list,

this is a repost from the x86-64.org discuss list. I think it could be
relevant here too, if not please excuse me and forget this message. I am not
subscribed to the LKML, but I can follow the thread on usenet, so no need to
CC. Feel free to do it if you think it's relevant!

Since a couple of days I have been experiencing weird lockups on my AMD64
machine running Debian Sid.
The computer is composed like this:

Asus A8N-E (NForce4 socket 939)
Athlon64 X2 3800+
2x512MB Corsair TwinX
2x250GB Sata Western Digital
2xDVD/DVDRW on PATA channels

None of the components has *ever* been overclocked and the whole is one year
old.

I have logged on in single mode and tried to find a way to reproduce the
crash. It has been enough to do a stupid script like

while true; do
tar -xjvf linux-2.6.20.3.tar.bz2 && rm -fr linux-2.6.20.3
done

and regularly, after one or two cycles, the system would lockup with this
error:

CPU 0: Machine Check Exception: 4 Bank 4: b200000000070f0f
TSC 185ef6d81ca

It also gave me the hint to use mcelog, so I rebooted and installed it. The
decoded error read:

CPU 0 4 northbridge TSC 185ef6d81ca
Northbridge Watchdog error
bit57 = processor context corrupt
bit61 = error uncorrected
bus error 'generic participation, request timed out
generic error mem transaction
generic access, level generic'
STATUS b200000000070f0f MCGSTATUS 4

I googled and found a thread on the x86-64.org discuss list [1] that blamed
the RAM for the
problem. So I have started doing some tests:

1) If I use only 512MB of my RAM, alternatively, I don't get the error.
2) Memtest+ has been running for 10hrs and no errors have been detected.
I'll have it running for the day just to be sure.

Additionally, right before the MCE, I could read another error:

ata3: CPB flags CMD err, flags=0x11

Googling this brought up some threads in the LKML about the sata_nv driver
(the ADMA bit, I think).

Since I am running a 2.6.20 kernel (a testing version from the Debian kernel
team) I have tried booting older kernels but looks like anything I have
tried does not work (but this is not your problem).

So I have booted a livecd (I had around an Ubuntu with a 2.6.12 kernel) and
with my great surprise the machine worked without problems.

Now, do you think it would be even slightly possible that some regression in
the sata_nv module could trigger an MCE? If not, would you put your money on
the RAM, or could the motherboard be blamed? I hope it's not the CPU ;)

Thanks for your help, I hope the information I gave you is enough. Please
feel free to suggest any more tests and diagnostics!

[1] https://www.x86-64.org/pipermail/discuss/2005-March/005902.html
--
Email.it, the professional e-mail, gratis per te: http://www.email.it/f

Sponsor:
Finalmente il design sposa il riscaldamento?vieni a scoprire i prodotti
Tubor
Clicca qui: http://adv.email.it/cgi-bin/foclick.cgi?mid=6164&d=20070323


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • Re: Newbie Setting up xserver
    ... Could be, but now that I know that low memory is causing the problem, I will ... The kernel gobbles up the rest of the physical memory huh. ... Actually my RAM is split into two ... >it could be that some of the processes related to setting the timezone ...
    (comp.os.linux.setup)
  • Re: PCMCIA: unable to map memory
    ... MB RAM. ... Linux conflicts here with the pcmcia-controller. ... - A patch for the kernel (2.2. ... BadRAM by Rick van Rein (and it's successor BadMEM ...
    (comp.os.linux.portable)
  • Re: [PATCH] reserve RAM below PHYSICAL_START
    ... The "reserved RAM" can be mapped by virtualization software with to ... You're right but the relocatable kernel only works if you relocate it ... only works up to address near 1G, we can't reserve more than 1G with ...
    (Linux-Kernel)
  • Re: [BUG] 2.6.26-rc1 lost half the RAM on UltraSPARC 5
    ... Top of RAM: 0x17f46000, Total RAM: 0xff40000 ... Memory hole size: 128MB ... Kernel: Using 1 locked TLB entries for main kernel image. ... Movable zone start PFN for each node ...
    (Linux-Kernel)
  • Re: After executing MDCreateThread data abort occurs
    ... This is almost always RAM configuration, memory map, or cache problems. ... target.The code under Startup and KernelStart routines are running fine ... So the Kernel is not letting this address 0xc2000000 to be programmed. ...
    (microsoft.public.windowsce.embedded)