Re: How fault tolerant can Linux be?



Andrew Gideon <c172driver1@xxxxxxxxxx> said:
> It's all caused me to wonder about how fault tolerant a Linux box may be
> built. At least the higher end SANs, for example, provide for no single
> point of failure. This includes dual controllers as well as power
> supplies, network ports, etc.
>
> Can a Linux box be built that would survive the death of a CPU or a
> motherboard? Obviously, it would involve having redundant hardware; that's
> not really my question. I'm asking more about whether Linux can exploit
> that hardware redundancy to keep functioning.

Death of CPU: there are very few machines where the motherboard is
designed to handle a dead CPU -- and on the other hand, a CPU can die in
so many ways that in many cases it could be hard to tell a CPU failure
apart from some other fault. But even with the most suitable hardware,
I think Linux couldn't handle this. No evidence, just a gut feeling.

Death of power supply: I suspect PSUs are pretty much a hardware issue,
so it all depends on whether the hardware is built to allow hot-swapping
PSUs.

Death of motherboard: pretty much "death of machine", then. Solvable
with clustering software; several packages are available, e.g. SteelEye
LifeKeeper.

Death of disk: Linux should handle this (HW or SW RAID), but much depends
on the bus and bus adapter. SCSI and FC should be better at recovering
from device faults than (parallel) ATA. I don't know enough to say how
SATA fares.
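
For the software RAID case, a degraded array can be spotted from the
kernel's status file. The following is a minimal Python sketch, assuming
that "SW RAID" means the Linux md driver, which reports array state in
/proc/mdstat. The function name degraded_arrays and the sample text are
made up for illustration; the sample is not output from a real machine.

```python
import re

# Illustrative text in the /proc/mdstat layout; not taken from a real box.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      8385856 blocks [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that have at least one missing member.

    The md driver prints one flag per member: 'U' for a working device,
    '_' for a missing or failed one, e.g. [UU] or [U_].
    """
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            current = header.group(1)
            continue
        status = re.search(r"\[([U_]+)\]", line)
        if current and status:
            if "_" in status.group(1):
                degraded.append(current)
            current = None  # only the first status line belongs to the array
    return degraded


print(degraded_arrays(SAMPLE_MDSTAT))
```

On a real machine the same function could be fed the contents of
/proc/mdstat instead of the sample text, for example from a cron job
that sends mail when the returned list is not empty.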

Death of disk adapter: Linux should handle this, with FC at least. It
could be possible with SCSI as well (and there seems to be some support
for hotplug PCI in recent kernels, so if the hardware supports PCI device
changes on the fly, recent Linux kernels might be good enough to
support it, too).

The above goes for network adapters as well; there seems to be
some support for network link failover, which should handle the
case of a dead network adapter, and it could even be possible to
hot-swap the failed adapter in some circumstances.

> I suppose that an alternative to this is a cluster (ie. like that built out
> of GFS). But I'm wondering what can be done w/in a single "box".

My gut feeling is that PSU failures and HDD failures are the most
common; the latter is rather easily defeated with a good external
SCSI or FC enclosure with hot-swap support. The former requires
either pricey hardware or clustering.

Lately, I've started to think that the reliability of a simple machine
is good enough to make just clustering two simple boxes the best
alternative. After all, if you're serious, you won't get enough
reliability out of a single box (I've seen downtime due to hardware
issues on a Sun E10k box, which is pretty much failsafe - but not
completely), so you have to cluster. But then, if you cluster two
almost-unbreakable boxes, it'll be very seldom that your cluster pays
you back anything - but still you can't get rid of the cluster, as
a single machine is too big a risk. So, better to save money and just
build the cluster on two simple boxes, and rely on the cluster for
machine HA. Disk HA is handled with an external disk enclosure
connected to all servers in your cluster.
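
To put rough numbers on this argument, here is a minimal Python sketch
comparing single boxes with a two-box cluster. It assumes that node
failures are independent and that failover is instant and always works,
which is optimistic. The availability figures are made-up inputs, not
measurements, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def cluster_availability(node_availability, nodes=2):
    """Availability of a cluster that is up while at least one node is up.

    Assumes independent node failures and instant, flawless failover.
    """
    return 1.0 - (1.0 - node_availability) ** nodes


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical inputs, chosen only to illustrate the comparison.
SIMPLE_BOX = 0.999    # a plain server
FANCY_BOX = 0.9999    # an expensive, highly redundant server

cases = (
    ("one simple box", SIMPLE_BOX),
    ("one fancy box", FANCY_BOX),
    ("two simple boxes", cluster_availability(SIMPLE_BOX)),
    ("two fancy boxes", cluster_availability(FANCY_BOX)),
)
for label, availability in cases:
    hours = downtime_hours_per_year(availability)
    print(f"{label:17s} {hours:10.5f} hours down per year")
```

Under these assumptions two simple boxes already come out ahead of one
fancy box, and clustering the fancy boxes only shaves off downtime that
was tiny to begin with.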

This may sound heretical, but check your risks and requirements; current
machines (as far as the hardware is concerned) can easily run a year
without failure. When something fails, it is a disk or a PSU. You'll
have to prepare for disk failure anyway with some redundancy scheme
(RAID1, RAID5). For the PSU, with money you can find hardware with
hot-swappable PSUs. As for other failures, they're rare - and clustering
helps with them just as it does with PSU failures. So, when
the failure comes (say, a PSU failure), can your environment sustain the
cluster switchover time? Considering that you won't see one of these
every year?
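
The switchover question comes down to simple arithmetic, sketched below
in Python. All three inputs are assumptions picked for illustration (one
hardware failure every two years, a five minute switchover, an eight hour
repair for a lone box); substitute figures from your own environment.

```python
def yearly_downtime_minutes(failures_per_year, minutes_down_per_failure):
    """Expected downtime per year: failure rate times outage length."""
    return failures_per_year * minutes_down_per_failure


# Hypothetical inputs, not measurements.
FAILURES_PER_YEAR = 0.5    # one hardware failure every two years
SWITCHOVER_MINUTES = 5     # outage while the cluster fails over
REPAIR_MINUTES = 8 * 60    # outage while a lone box is repaired

clustered = yearly_downtime_minutes(FAILURES_PER_YEAR, SWITCHOVER_MINUTES)
standalone = yearly_downtime_minutes(FAILURES_PER_YEAR, REPAIR_MINUTES)

print(f"cluster of simple boxes: {clustered:.1f} minutes down per year")
print(f"single box, no cluster:  {standalone:.1f} minutes down per year")
```

If the clustered figure fits within what the service can tolerate,
paying extra to make each individual box harder to break buys very
little.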
--
Wolf a.k.a. Juha Laiho Espoo, Finland
(GC 3.0) GIT d- s+: a C++ ULSH++++$ P++@ L+++ E- W+$@ N++ !K w !O !M V
PS(+) PE Y+ PGP(+) t- 5 !X R !tv b+ !DI D G e+ h---- r+++ y++++
"...cancel my subscription to the resurrection!" (Jim Morrison)