Re: Need help with crash message.



Jean-David Beyer wrote:
The Natural Philosopher wrote:
Jean-David Beyer wrote:
David W. Hodgins wrote:
On Sat, 25 Oct 2008 09:02:19 -0400, Jean-David Beyer
<jeandavid8@xxxxxxxxxxx> wrote:

My machine crashed this morning. Before crashing, it seems to have
logged messages such as:

Oct 23 11:14:47 trillian kernel: EDAC MC0: CE
page 0x12ec2d, offset 0x0, grain 4096, syndrome 0x1042, row 4, channel
0, label "": e7xxx CE

Oct 25 03:35:44 trillian kernel: EDAC MC0: UE
page 0x3c8d3, offset 0x0, grain 4096, row 0, labels ":": e7xxx UE

I've never seen this before, but according to
/usr/src/linux/Documentation/edac.txt, the CE entries are memory
Correctable Errors, while UE are Uncorrectable Errors. Looks like you
have ecc memory modules, and errors are being detected.
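
For anyone wanting to pull these entries out of a syslog file, here is a minimal sketch. It assumes the line layout matches the two messages quoted above; other kernel versions or EDAC drivers may format things differently. The names `EDAC_RE` and `parse_edac` are made up for illustration.

```python
import re

# Pattern inferred only from the two messages quoted in this thread.
# DOTALL lets it cope with a line that was wrapped by a mail client.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<kind>CE|UE)\s+"
    r"page (?P<page>0x[0-9a-fA-F]+), offset (?P<offset>0x[0-9a-fA-F]+), "
    r"grain (?P<grain>\d+),.*?row (?P<row>\d+)",
    re.DOTALL,
)

def parse_edac(line):
    """Return a dict of fields from one EDAC log line, or None if no match."""
    m = EDAC_RE.search(line)
    if m is None:
        return None
    return {
        "mc": int(m.group("mc")),          # memory controller index
        "kind": m.group("kind"),           # "CE" (correctable) or "UE" (uncorrectable)
        "page": int(m.group("page"), 16),
        "offset": int(m.group("offset"), 16),
        "grain": int(m.group("grain")),
        "row": int(m.group("row")),
    }
```

Counting the results per `row` and `kind` would show whether the errors cluster on one row or are spread around.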

I do have ECC memory modules.

Try reseating the memory modules, and then running memtest. You'll
probably have to replace the bad ram, so you may want to run memtest with
one module at a time, to figure out which is bad.

I would have to run two modules at a time. Some of them have run for 4.5 years with no trouble and some for about 2 years also with no trouble.

Thank you. I have eight 1-GByte memory modules, and I must run them in pairs. IIRC, memtest-86 tries to identify which modules are bad.


That sounds like the bunny.

I ran it for a full pass (all 8 modules): took about 8 hours, and all OK.
It then ran for another 5 1/2 modules OK, and I stopped it because I must run the system normally again tonight.

Remember, it may not actually be the memory. It could be something else on the bus corrupting ACCESS to the chips.

I suppose so, but nothing else is on the path between the memory modules and the MCH. I wonder which bus. I have the usual two IDE busses and four PCI-X busses (not sockets, busses). But the memory goes straight to the MCH chip and on to the processors. Everything else goes to the ICH3 chip or to two P64H2 chips (that drive the four PCI-X busses).

MCH?

Have you any 3rd party cards in there at all, though?

I admit straws are being clutched at here. I am thinking of a time when a slow VIDEO CAPTURE card would, provided that exactly the right memory address was being DMA'ed into from a FLOPPY DISK, corrupt two bytes of it, for example.

However, it's no consolation knowing what caused it if, e.g., the answer is 'new motherboard'.


What worries me is that the pages shown above are completely different.
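
To put rough numbers on how far apart those two pages are, here is a small sketch. It assumes the EDAC "page" value is a physical page frame number and that pages are 4096 bytes (matching the "grain 4096" in the log); `page_to_address` is a made-up helper name. It says nothing about which module holds which address, since that mapping depends on the chipset.

```python
PAGE_SIZE = 4096  # assumed: 4 KiB pages, matching "grain 4096" in the log

def page_to_address(page):
    """Physical byte address of the start of a page frame (assumed mapping)."""
    return page * PAGE_SIZE

# The two (kind, page, row) triples come from the messages quoted above.
for kind, page, row in (("CE", 0x12ec2d, 4), ("UE", 0x3c8d3, 0)):
    addr = page_to_address(page)
    print(f"{kind}: page {page:#x} row {row} -> address {addr:#x} "
          f"({addr / 2**30:.2f} GiB)")
```

Under those assumptions the correctable error sits above the 4 GiB mark and the uncorrectable one below 1 GiB, which fits the different row numbers reported in the log.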

