Re: weird mdadm crash



On Thu, Mar 08, 2007 at 08:30:57AM -0800, michael wrote:
Hello,

Have an etch box that does nothing but rsync data with another.
About every other day or so, the box will completely freeze.
Everything, screen blank, no keyboard, and the hard drive light
is on solid.
I can hard reboot it and it comes up, and there is nothing in the logs that
suggests anything.
The root system is an mdadm raid 5 array, and every time I reboot
it from a crash, the array is always degraded. It auto rebuilds itself,
and away it goes again. A few days later, it will lock up.

I have no idea where to start looking for problems. I'm pretty sure it's gotta
be hardware, but not sure where to look first.
Any suggestions would be great!


AIUI, the order of most-likely to least-likely failure is:
power supply, hard drives, memory, other stuff.

Power supplies are hard to test without equipment, unless you know
you've got sensors set up properly. But it's still worth a shot: set
up lm-sensors and look at your voltages. If they're more than +/- 5%
from spec then start with a new power supply. Hard drives should
generally leave some kind of logs right before they go down, and with
raid, you shouldn't see a lock-up, unless you're sharing controllers,
maybe. If the drives are SMART enabled, then check that out. I think
memory errors are pretty much impossible to diagnose through any
method other than swapping sticks in a systematic way.
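
To make the +/- 5% rule of thumb concrete, here is a rough Python sketch
of the comparison. The rail names, the nominal values (standard ATX
+3.3V, +5V and +12V) and the sample readings are my own assumptions for
illustration; they are not output from your box, and you would type in
whatever your sensors actually report.

```python
# Nominal ATX rail voltages (assumed labels; adjust to match your sensor names).
NOMINAL_RAILS = {"+3.3V": 3.3, "+5V": 5.0, "+12V": 12.0}
TOLERANCE = 0.05  # the +/- 5% rule of thumb from the text above


def out_of_spec(readings, nominal=NOMINAL_RAILS, tolerance=TOLERANCE):
    """Return {rail: (measured, deviation)} for rails outside the tolerance.

    deviation is a signed fraction of the nominal voltage, e.g. -0.06 for 6% low.
    Rails with no known nominal value are skipped.
    """
    bad = {}
    for rail, measured in readings.items():
        expected = nominal.get(rail)
        if expected is None:
            continue
        deviation = (measured - expected) / expected
        if abs(deviation) > tolerance:
            bad[rail] = (measured, deviation)
    return bad


if __name__ == "__main__":
    # Hypothetical readings, typed in by hand from the sensor output.
    sample = {"+3.3V": 3.31, "+5V": 4.70, "+12V": 12.10}
    for rail, (measured, deviation) in out_of_spec(sample).items():
        print(f"{rail}: {measured:.2f} V ({deviation:+.1%} from nominal)")
```

A single reading taken while idle doesn't prove much either way. It's
worth checking again while the rsync is running, since a marginal
supply tends to sag under load.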

good luck

A
