Re: how to automatically reset linux server on hard disk failure?
- From: The Natural Philosopher <tnp@xxxxxxxxxxxxxxx>
- Date: Wed, 01 Jul 2009 20:24:29 +0100
Frank Langner wrote:
What you actually need is a custom modified kernel, with a disk driver that executes a full authority JMP REBOOT if it detects failure.
Kernels are always resident in RAM IIRC. And device drivers that take interrupts HAVE to be.
That is actually quite a simple way out of the problem, if that is what you want.
Sounds good. But how do I get such a driver? I cannot write it on my own. :-) I'm hoping that I might not be the first one asking for such functionality.
Oh, I think you are. The rest of us found it easier to diagnose and fix disk drive problems than to write custom code to kludge round them. ;-)
I have done that sort of thing occasionally, very badly, to help in identifying problems when working very near hardware, but I certainly wouldn't recommend it as an approach in any user environment.
Ah. In which case rebooting is not what you want; you want to shut it down completely. However, a machine that loses its disk drive is not, in my opinion, a machine that is configured correctly. I have never had such a thing other than with gross hardware faults, which rebooting didn't fix, or gross software bugs in the driver (RAID controller) that eventually caused us to send the whole unit back and get a refund.
That's not the background of my problem. The server is fine and runs well. But I want it to be prepared for the worst case, which means the customer is running the server until both hard disks of its RAID1 are defective. In this case the other cluster node has to migrate all resources to itself, but right now the failed node is still alive enough to make the cluster think that everything is ok.
Please let us not argue about the fact that no reasonable IT department or service personnel would let it come so far. Agreed. But reality has shown us more than once that some people do in fact wait until the system is completely broken before taking action. Therefore I want to harden the cluster for such circumstances.
OK. Now that the context is clear, it seems more reasonable.
I am no cluster expert. But I would have thought that perhaps you should be doing something like :-
First of all, use a third, non-RAID disk for swap. That solves the problem of not being able to page stuff back in if the main 'disk' array goes bad.
Then write a watchdog timer. And incorporate the relevant bits from reboot.c
http://www.gelato.unsw.edu.au/lxr/source/arch/i386/kernel/reboot.c
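For anyone who would rather stay in user space than lift code out of reboot.c, here is a rough sketch of the same idea. It assumes a Linux kernel built with magic SysRq support and a process running as root. `force_restart` and its parameters are names made up for this example. Writing `b` to /proc/sysrq-trigger reboots at once without syncing or unmounting, and `o` powers the machine off, which avoids having to load a reboot binary from a disk that may no longer respond.

```python
SYSRQ_TRIGGER = "/proc/sysrq-trigger"  # standard Linux path; writing needs root


def force_restart(action="b", trigger_path=SYSRQ_TRIGGER):
    """Ask the kernel to reboot ("b") or power off ("o") immediately.

    No sync and no unmount takes place, so this is only meant for the
    case where the disks are already gone. `trigger_path` is a
    hypothetical parameter that exists so the function can be tried
    against an ordinary file first.
    """
    if action not in ("b", "o"):
        raise ValueError("action must be 'b' (reboot) or 'o' (power off)")
    with open(trigger_path, "w") as trigger:
        trigger.write(action)
```

Pointing `trigger_path` at a scratch file is a safe way to try it out before letting it loose on the real thing.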
It's a long time since I wrote a daemon, but it goes something like this:
Fork a process and exit the main program. That puts the daemon in the background. I can't remember what happens to stdout and stderr.
Set up an infinite loop that sleeps for long periods and occasionally wakes up.
Try to read or write the raw disk (so as to evade RAM caches).
On failure, execute the shutdown code. Or shut down whatever is needed to sort out the cluster dynamics (Ethernet?).
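The steps above could be sketched roughly as follows. This is only an illustration, assuming Linux and Python 3: `daemonize`, `disk_is_readable`, `watch`, `BLOCK_SIZE` and the `direct` and `max_checks` parameters are all invented for the example. The read uses `O_DIRECT` to bypass the page cache, which needs an aligned buffer, so an anonymous `mmap` is used for that. The standard descriptors are inherited across the fork, so the sketch redirects them to /dev/null.

```python
import mmap
import os
import time

BLOCK_SIZE = 4096  # assumption: a multiple of the device's logical sector size


def daemonize():
    """Fork into the background and detach from the terminal."""
    if os.fork() > 0:
        os._exit(0)  # the parent exits, the child carries on as the daemon
    os.setsid()  # new session, so no controlling terminal
    # stdin, stdout and stderr survive the fork; point them at /dev/null.
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):
        os.dup2(devnull, fd)
    if devnull > 2:
        os.close(devnull)


def disk_is_readable(path, direct=True):
    """Return True if data can be read from `path`, False on any OS error."""
    flags = os.O_RDONLY
    if direct:
        flags |= os.O_DIRECT  # Linux-only: bypass the page cache
    try:
        fd = os.open(path, flags)
    except OSError:
        return False
    try:
        # An anonymous mmap is page-aligned, which O_DIRECT requires.
        with mmap.mmap(-1, BLOCK_SIZE) as buf:
            return os.readv(fd, [buf]) > 0
    except OSError:
        return False
    finally:
        os.close(fd)


def watch(path, on_failure, interval=60.0, max_checks=None, direct=True):
    """Probe `path` every `interval` seconds; call `on_failure` once it fails.

    Returns False after a failure, or True if `max_checks` probes all passed.
    With max_checks=None the loop runs until a probe fails.
    """
    checks = 0
    while max_checks is None or checks < max_checks:
        if not disk_is_readable(path, direct=direct):
            on_failure()
            return False
        checks += 1
        time.sleep(interval)
    return True
```

A daemon would call `daemonize()` and then something like `watch("/dev/md0", on_failure=...)`, where /dev/md0 is just a placeholder for the RAID device. One caveat: a dying disk can make the read hang instead of returning an error, so a real watchdog would also want some kind of timeout around the probe.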
This ain't a simple script, but it's not an impossible program to write.
Provided you have a separate swap disk, it will run even if the main disk goes down. Or maybe use mlockall, as John said, to keep the little watchdog in memory.
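As a sketch of the mlockall route, something like the following pins the process in RAM so it never has to be paged back in from a dead disk. It calls the C library's `mlockall` through `ctypes`. The flag values are the ones used on x86 and ARM Linux and differ on a few other architectures, so treat them as an assumption. `lock_memory` is a made-up name. The call normally needs root or a large enough RLIMIT_MEMLOCK.

```python
import ctypes
import os

# Assumed values: correct for x86 and ARM Linux, different on some other arches.
MCL_CURRENT = 1  # lock pages that are mapped right now
MCL_FUTURE = 2   # also lock pages mapped later


def lock_memory():
    """Lock all current and future pages of this process into RAM.

    Raises OSError if the kernel refuses, for example because of
    insufficient privilege or a low RLIMIT_MEMLOCK.
    """
    libc = ctypes.CDLL(None, use_errno=True)  # symbols of the running process, libc included
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

Memory locks are not inherited by a child created with fork, so the daemon would have to call this after it has gone into the background, not before.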