Re: How fault tolerant can Linux be?



Juha Laiho wrote:

Andrew Gideon <c172driver1@xxxxxxxxxx> said:
It's all caused me to wonder about how fault tolerant a Linux box may be
built. At least the higher end SANs, for example, provide for no single
point of failure. This includes dual controllers as well as power
supplies, network ports, etc.

Can a Linux box be built that would survive the death of a CPU or a
motherboard? Obviously, it would involve having redundant hardware;
that's not really my question. I'm asking more about whether Linux can
exploit that hardware redundancy to keep functioning.

Death of CPU: there are pretty few machines where the motherboard is
designed to handle a dead CPU -- and on the other hand, a CPU can die in
so many ways that in many cases it could be hard to distinguish. But
even with the most suitable hardware, I think Linux couldn't handle this.
No evidence, just a gut feeling.

CPU failure is rare, and machines that support CPU hot-swap are also rare,
but Linux does support CPU hot swap on some models, notably the IBM
System/390. I've never heard of an x86 machine with hot-swappable
processors, but that doesn't mean that they aren't out there.

Death of power supply: I suspect PSUs are pretty much a hardware issue,
so it all depends on whether the hardware is built to enable hot-swapping
PSUs.

The hot-swap enclosure and associated circuits are also a possible point of
failure--this should be mostly or entirely passive though so failure should
be relatively unlikely. As far as the actual hot-swap goes, you can hot
swap power supplies on an MS-DOS machine without it missing a beat. The OS
has no involvement in that process other than possibly logging the
change. This is really no different in nature than a laptop going from
charger to internal battery and back.

Death of motherboard: pretty much "death of machine", then. Solvable
by clustering software; several are available - e.g. SteelEye
LifeKeeper.

Depends on your definition of "motherboard". If you're talking about a
glorified single-board computer, which most PCs are, then obviously there's
no possibility of recovering from this. If you're talking about
passive-backplane machines then there might be--backplane failure would
still bring the machine down but anything likely to cause failure of a
passive backplane (i.e. a circuit board with nothing but copper traces and
sockets) is likely to be of such a catastrophic nature that there is no
reasonable protection against it.

Death of disk: should handle (HW or SW RAID), but much depends on bus
and bus adapter. SCSI and FC should be better in recovering from device
faults than (parallel) ATA. Don't know enough to say how SATA fares.
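
For the software RAID case, a degraded array shows up in /proc/mdstat
as a member count such as [2/1] with a status map such as [U_]. The
following is a minimal sketch of a check for that, assuming the usual
mdstat layout; the function name and the sample text are illustrative
only, and a real monitor would more likely rely on mdadm's own monitor
mode.

```python
import re

# Matches the "[expected/active] [UU_]" status on an array's detail line.
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return {array_name: (expected, active)} for arrays missing members."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\S*)\s*:", line)
        if header:
            # Start of a new array stanza, e.g. "md0 : active raid1 ..."
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded[current] = (expected, active)
            current = None  # ignore resync/bitmap lines that follow
    return degraded


# Hypothetical sample in the style of /proc/mdstat, not real output.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sdb2[1](F) sda2[0]
      8385856 blocks [2/1] [U_]

unused devices: <none>
"""

if __name__ == "__main__":
    for name, (expected, active) in degraded_arrays(SAMPLE).items():
        print(f"{name}: {active} of {expected} members active")
```

On a live system the same function could be fed the contents of
/proc/mdstat instead of the sample string.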

The SATA standard requires hot-swap support; however, some of the early
SATA chips did not fully implement the standard.

One thing to be aware of with SCSI--there are failure modes that can bring
down the entire SCSI bus--a disk failing in such a way as to short a data
line to ground, for example, will do this, and on the rare occasions that it
happens, all other disks on that cable or backplane segment are down until
that disk is identified and removed.

Death of disk adapter: should handle, with FC at least. Could be
possible with SCSI as well (and there seems to be some support for
hotplug PCI in recent kernels, so if the hardware supports PCI device
changes on the fly, recent Linux kernels might be good enough to
support it, too).

You need to have hot-plug PCI support on both the motherboard and the
daughter card for this to work. It's available for a price, and there is
Linux support for HP/Compaq's flavor of it.

The above goes for network adapters as well; there seems to be
some support for network link failover, which should handle the
case of dead network adapter, and it could even be possible to
hot-swap the failed adapter in some circumstances.
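
One way to get link failover on Linux is the bonding driver, which
reports per-interface link state in /proc/net/bonding/<bond>. As a
rough sketch, assuming that file's usual "Slave Interface:" and
"MII Status:" lines, the following picks out members whose link is
down. The function names and the sample text are made up for
illustration.

```python
def slave_link_states(bonding_text):
    """Map each member interface name to its reported MII status."""
    states = {}
    current = None
    for line in bonding_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            # The bond-level "MII Status" appears before any member
            # stanza, so it is skipped while current is still None.
            states[current] = value
    return states


def failed_slaves(bonding_text):
    """Return the sorted names of members whose link is not 'up'."""
    states = slave_link_states(bonding_text)
    return sorted(name for name, status in states.items() if status != "up")


# Hypothetical sample in the style of /proc/net/bonding/bond0.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up

Slave Interface: eth0
MII Status: up
Link Failure Count: 0

Slave Interface: eth1
MII Status: down
Link Failure Count: 1
"""

if __name__ == "__main__":
    print("failed members:", failed_slaves(SAMPLE))
```

In active-backup mode the bond keeps working on the surviving
interface, so a check like this only tells you that redundancy has
been lost and the dead adapter needs attention.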

I suppose that an alternative to this is a cluster (i.e. like that built
out of GFS). But I'm wondering what can be done w/in a single "box".

My gut feeling is that PSU failures and HDD failures are the most
common; the latter is rather easily defeated with good external
SCSI or FC enclosure with hot-swap support. The former requires
either pricey hardware or clustering.

The big problem with PSU failures is that they don't necessarily fail
entirely. Often they go off-regulation, i.e. they continue to work but no
longer provide in-spec voltages, which can cause a variety of other
problems. To be sure of catching these you need continuous monitoring of
power supply voltages--on a PC-based machine this would usually be handled
by a process running in the background using the motherboard's onboard
monitoring capabilities. On other machines (including some x86 machines
purpose-designed as servers) there might be an additional processor
dedicated to this purpose. If the PSU is off-regulation then all other
redundancy in components that rely on that PSU for power is in vain, as
data will still be corrupted. This can be circumvented to some extent
using a system with batteries or supercapacitors inline between the power
supply and the system, but then you add additional points of failure.
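
A background voltage check of the kind described could be sketched as
follows. It assumes the Linux hwmon sysfs layout, where each
inN_input file under a /sys/class/hwmon/hwmonN directory holds a
reading in millivolts and an optional inN_label file names the rail.
The nominal values and the 5% tolerance are assumptions for
illustration, and many boards need per-chip scaling (as lm-sensors
configuration provides) before raw readings are meaningful.

```python
from pathlib import Path


def read_millivolts(hwmon_dir):
    """Read every inN_input file in one hwmon directory (millivolts)."""
    readings = {}
    for path in sorted(Path(hwmon_dir).glob("in*_input")):
        label_path = path.with_name(path.name.replace("_input", "_label"))
        # Fall back to the file name when the driver provides no label.
        if label_path.exists():
            name = label_path.read_text().strip()
        else:
            name = path.name
        readings[name] = int(path.read_text().strip())
    return readings


def out_of_spec(readings_mv, nominal_mv, tolerance=0.05):
    """Return rails deviating from nominal by more than the tolerance."""
    bad = {}
    for rail, nominal in nominal_mv.items():
        if rail not in readings_mv:
            continue
        measured = readings_mv[rail]
        if abs(measured - nominal) > tolerance * abs(nominal):
            bad[rail] = measured
    return bad


# Assumed nominal rails in millivolts; the labels are hypothetical and
# depend on what the sensor driver reports.
NOMINAL_MV = {"+12V": 12000, "+5V": 5000, "+3.3V": 3300}

if __name__ == "__main__":
    # Illustrative readings rather than values from real hardware.
    sample = {"+12V": 12010, "+5V": 4600, "+3.3V": 3310}
    for rail, mv in out_of_spec(sample, NOMINAL_MV).items():
        print(f"{rail} off-regulation: {mv} mV")
```

A real monitor would call read_millivolts() on each hwmon directory
at intervals and raise an alert, rather than print, when
out_of_spec() returns anything.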

Lately, I've started to think that the reliability of a simple machine
is good enough to make just clustering two simple boxes the best
alternative. After all, if you're serious, you won't get enough
reliability out of a single box (I've seen downtime due to hardware
issues on a Sun E10k box, which is pretty much failsafe - but not
completely), so you have to cluster. But then, if you cluster two
almost-unbreakable boxes, it'll be very seldom that your cluster pays
you back anything - but still you can't get rid of the cluster, as
a single machine is too big a risk. So, rather save money and just
build the cluster on two simple boxes, and rely on the cluster for
machine HA. Disk HA is handled with external disk enclosure
connected to all servers in your cluster.
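
The arithmetic behind this argument can be sketched roughly as
follows, assuming the nodes fail independently, either node can carry
the service alone, and switchover time and the shared disk enclosure
are ignored. The availability figures are invented for illustration,
not measurements.

```python
HOURS_PER_YEAR = 24 * 365


def cluster_availability(node_availability, nodes=2):
    """Availability of N independent nodes where any one suffices."""
    return 1.0 - (1.0 - node_availability) ** nodes


def expected_downtime_hours(availability):
    """Expected hours per year during which the service is down."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical per-node availabilities: a simple box and a
    # high-end, nearly unbreakable one.
    for label, a in (("simple box", 0.99), ("high-end box", 0.9999)):
        single = expected_downtime_hours(a)
        paired = expected_downtime_hours(cluster_availability(a))
        print(f"{label}: alone {single:.2f} h/yr, "
              f"two clustered {paired:.4f} h/yr")
```

Under these assumptions a pair of simple boxes already pushes the
expected hardware downtime well below that of one simple box, which
is the point of the argument: past that, switchover time and shared
components are likely to dominate rather than per-node hardware
quality.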

This may sound heretical, but check your risks and requirements; current
machines (as far as the hardware is concerned) can easily run a year
without failure. When something fails, it is a disk or a PSU. You'll
have to prepare for disk failure anyway with some redundancy scheme
(RAID1, RAID5). For the PSU, with money you can find hardware with
hot-swappable PSUs.

As for other failures, they're rare - and clustering
helps with them just as it does with PSU failures. So, when
the failure comes (say, a PSU failure), can your environment sustain the
cluster switchover time, considering that you won't see these every
year?

--
--John
to email, dial "usenet" and validate
(was jclarke at eye bee em dot net)