Re: How fault tolerant can Linux be?



J. Clarke wrote:

Juha Laiho wrote:

Andrew Gideon <c172driver1@xxxxxxxxxx> said:
It's all caused me to wonder about how fault tolerant a Linux box may be
built. At least the higher end SANs, for example, provide for no single
point of failure. This includes dual controllers as well as power
supplies, network ports, etc.

Can a Linux box be built that would survive the death of a CPU or a
motherboard? Obviously, it would involve having redundant hardware; that's
not really my question. I'm asking more about whether Linux can exploit
that hardware redundancy to keep functioning.

Death of CPU: there are pretty few machines where the motherboard is
designed to handle a dead CPU -- and on the other hand, a CPU can die in
so many ways that in many cases it could be hard to distinguish. But
even with the most suitable hardware, I think Linux couldn't handle this.
No evidence, just a gut feeling.

CPU failure is rare, and machines that support CPU hot-swap are also rare,
but Linux does support CPU hot swap on some models, notably the IBM
System/390. I've never heard of an x86 machine with hot-swappable
processors, but that doesn't mean that they aren't out there.
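
As a rough illustration of the software side of this: on kernels built with
CPU hotplug support, Linux lists its CPUs under /sys/devices/system/cpu, and a
logical CPU can typically be taken offline by writing 0 to its cpuN/online
file as root. The Python sketch below only reads those lists. The helper
names are made up for this example, and whether a given board tolerates
physical removal of a processor is a separate hardware question.

```python
from pathlib import Path

CPU_SYSFS = Path("/sys/devices/system/cpu")  # standard sysfs location on Linux


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,5' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file means "no CPUs in this set"
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_cpu_set(name, root=CPU_SYSFS):
    """Read 'possible', 'present', 'online' or 'offline'; None if unreadable."""
    try:
        return parse_cpu_list((Path(root) / name).read_text())
    except OSError:
        return None  # not Linux, or the file is not exposed on this kernel


if __name__ == "__main__":
    for name in ("possible", "present", "online", "offline"):
        print(name, read_cpu_set(name))
```

Comparing "present" with "online" shows which CPUs the kernel knows about but
is not currently scheduling on.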


Take a look at the PRIMEQUEST built by Fujitsu. It uses the same basic
design as high-end SPARC systems: it consists of several CPU/memory boards
and I/O boards, it can be divided on board boundaries into separate
partitions ("domains" is the more common term for Sun hardware), and
individual boards can be added and removed online (dynamic
reconfiguration). Adding a CPU board online is basically the same as
hot-plugging the CPUs and memory on that board into a partition.

Of course, the question is what precisely is meant by "survive the death
of a single CPU". Consider the Solaris design: if the CPU was executing in
user mode, there is a chance that the CPU gets deactivated online, the
process that was running on it gets killed, and the system keeps running
without interruption - _except_ for that process. If that process is the
main service, something needs to watch it and possibly restart it. If the
CPU was executing in kernel mode, the system usually panics and reboots,
the CPU gets deactivated, and the system comes up in degraded mode.
Something similar should happen for a memory failure or the failure of a
whole board, assuming the partition consists of several boards. So it is
not actually "fault tolerant", because you *do* have a service
interruption. "Fault tolerant" implies that the system can continue without
any interruption whatsoever. I have seen such a system (you could pull out
a CPU at any time), but I doubt that any existing system provides that
level of redundancy.
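
The "something needs to watch it and possibly restart" part can be sketched
as a tiny supervisor. This is only an illustration in Python, not how Solaris
or any particular cluster product does it; the function name and the restart
limit are assumptions made for the example.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, delay=0.0):
    """Run cmd, restarting it after abnormal exits.

    Returns (last exit code, number of restarts performed).
    """
    restarts = 0
    while True:
        code = subprocess.call(cmd)
        # 0 is a clean exit; on POSIX a negative code means the process
        # was killed by a signal, e.g. after its CPU was deactivated.
        if code == 0 or restarts >= max_restarts:
            return code, restarts
        restarts += 1
        time.sleep(delay)  # optional back-off before the next attempt


if __name__ == "__main__":
    # Stand-in "service" that always fails, to exercise the restart path.
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    print(supervise(failing, max_restarts=2))
```

The service is still interrupted between the kill and the restart, which is
exactly the distinction drawn above.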

[...]
Death of motherboard: pretty much "death of machine", then. Solvable
by clustering software; several are available - f.ex. Steeleye
Lifekeeper.

Depends on your definition of "motherboard". If you're talking about a
glorified single-board computer, which most PCs are, then there's no
possibility of recovering from this obviously. If you're talking
passive-backplane machines then there might be--backplane failure would
still bring the machine down but anything likely to cause failure of a
passive backplane (i.e. a circuit board with nothing but copper traces and
sockets) is likely to be of such a catastrophic nature that there is no
reasonable protection against it.


See above. In addition to surviving a CPU board failure, PRIMEQUEST provides
a redundant interconnect, so it should survive the death of a single
interconnect circuit. I do not know whether it does so online or only
across a reboot.

[...]

I suppose that an alternative to this is a cluster (ie. like that built out
of GFS). But I'm wondering what can be done w/in a single "box".


Actually, any fault-tolerant system of the kind you mention (LAN or FC
switches) is internally built as a cluster; it just has much shorter
detection and failover times. So you could replace one with a Linux
cluster, but you could hardly achieve the same density (try building a
full-fledged cluster with 32 FC ports in a 1U box using standard hardware).
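
To make "detection time" concrete, here is a minimal heartbeat-timeout
detector in Python. It is a generic sketch, not the mechanism of any specific
switch or cluster package; the class name, node names and timeout are
hypothetical. The timeout is the knob that bounds how quickly a dead member
is noticed.

```python
import time


class HeartbeatMonitor:
    """Declare a node failed when no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}         # node name -> time of last heartbeat

    def beat(self, node):
        """Record a heartbeat from `node`."""
        self.last_seen[node] = self.clock()

    def failed_nodes(self):
        """Return the nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout=0.2)
    monitor.beat("node-a")
    monitor.beat("node-b")
    time.sleep(0.3)
    monitor.beat("node-b")          # only node-b is still alive
    print(monitor.failed_nodes())
```

A shorter timeout detects failures sooner but is more likely to misjudge a
slow node as dead, which is part of why tightly integrated hardware can fail
over faster than a cluster of general-purpose boxes.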

[...]

Lately, I've started to think that the reliability of a simple machine
is good enough to make just clustering two simple boxes the best
alternative. After all, if you're serious, you won't get enough
reliability out of a single box (I've seen downtime due to hardware
issues on a Sun E10k box, which is pretty much failsafe - but not
completely), so you have to cluster. But then, if you cluster two
almost-unbreakable boxes, it'll be very seldom that your cluster pays
you back anything

What matters is not how much you get back while everything is OK - it is how
much you *lose* when the system is down. That can be as high as several
billion, which far outweighs the price of acquiring and setting up a cluster.

- but still you can't get rid of the cluster, as
a single machine is a too big risk. So, rather save money and just
build the cluster on two simple boxes, and rely on the cluster for
machine HA.

Exactly. The important observation is that a cluster is usually a Highly
Available but *not* a Fault Tolerant system. You do have downtime; the
cluster just helps you keep that downtime reasonably short and - what is
equally or probably more important - automates the task of failure
detection and recovery.
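
A back-of-the-envelope way to see the difference, in Python. The model is
deliberately simplistic and every number in it is an assumption chosen for
illustration: node failures are treated as independent, and each failover
costs a fixed number of minutes during which the service is unavailable.

```python
HOURS_PER_YEAR = 24 * 365


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for an availability in 0..1."""
    return (1.0 - availability) * HOURS_PER_YEAR


def failover_pair_availability(node_availability, failovers_per_year,
                               failover_minutes):
    """Rough availability of a two-node failover cluster."""
    # The service is down if both nodes are down at once
    # (assumes independent failures).
    both_down = (1.0 - node_availability) ** 2
    # It is also down while each failover is in progress.
    failover_fraction = (failovers_per_year * (failover_minutes / 60.0)
                         / HOURS_PER_YEAR)
    return 1.0 - both_down - failover_fraction


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one simple box
    pair = failover_pair_availability(single, failovers_per_year=2,
                                      failover_minutes=5)
    print("single box:      %.1f h/year down" % downtime_hours_per_year(single))
    print("two-box cluster: %.2f h/year down" % downtime_hours_per_year(pair))
```

As long as the failover time is not zero, the cluster's downtime never reaches
zero either: that is high availability, not fault tolerance.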

=arvi=