Re: IBM pSeries Self-Healing Technology.

From: Jean-David Beyer (jdbeyer_at_exit109.com)
Date: 03/29/04


Date: Sun, 28 Mar 2004 18:17:33 -0500

Mike Cox wrote:
> Jean-David Beyer <jdbeyer@exit109.com> wrote in message
>
>>I have just completed a box with 2 XEON 3.06GHz processors, 4GBytes RAM,
>>and 4 Ultra/320 SCSI 10,000rpm hard drives (and another 7200 rpm EIDE
>>hard drive) mainly to run IBM DB2 on. I could put a RAID controller card
>>in, but have not done so. I am waiting for DB2 V8.2 to come out, since
>>that should be soon, and I do not want to install twice. I can run DB2
>>V6.1 on my old machine in the meantime.
>>
>>The motherboard permits up to 16GBytes RAM. I am running Red Hat
>>Enterprise Linux 3 ES which allows up to 8GBytes (as supplied), and any
>>single process can use up to (almost) 4GBytes, so it might make sense
>>for me to get 4 more memory modules if that becomes a problem. But all
>>that can be done right now with Intel *86 type hardware if you want to.
>>
>>The motherboard is SuperMicro X5DP8-G2.
>>http://www.supermicro.com/products/motherboard/Xeon/E7501/X5DP8-G2.cfm
>
>
> What happens when one processor goes down? Are you in deep crap?

I am sure it depends on just how it goes down. If it just stops
executing instructions, I suppose it is up to the OS to stop dispatching
processes to it. I do not know whether Linux does that or not.

But more likely, it would go down by executing wrong instructions, using
wrong registers, wrong addresses, etc. I do not know how that would be
detected with current-day chips.

In the old days, they ran three computers with synchronized clocks with
single processors in each, and had a box compare the three results at
certain times and if all three did not agree, they knew there was
trouble. If 2 out of 3 agreed, they complained about the third. If none
agreed, they were in trouble anyway.
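
As a minimal sketch of that 2-out-of-3 comparison (not the logic of any
particular voter box; the function name vote and the idea of comparing
whole results for exact equality are assumptions made for illustration):

```python
from collections import Counter


def vote(results):
    """Majority vote over the outputs of three redundant machines.

    Returns (agreed_value, suspects). agreed_value is None when no two
    machines agree; suspects lists the indices of the machines that
    disagreed with the majority (all three if nobody agrees).
    """
    if len(results) != 3:
        raise ValueError("expected exactly three results")
    # Hypothetical assumption: results are hashable and compared exactly.
    value, count = Counter(results).most_common(1)[0]
    if count == 3:
        return value, []                      # all three agree
    if count == 2:
        suspects = [i for i, r in enumerate(results) if r != value]
        return value, suspects                # complain about the third
    return None, [0, 1, 2]                    # no agreement: trouble anyway
```

For example, vote([42, 7, 42]) accepts 42 and flags machine 1, while
vote([1, 2, 3]) returns no value and flags all three.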

The big problem was the box that did the comparisons: if it was too
complex, it would be the weak link in the system. If it was too dumb, it
would not deal with uncontrollable variations in the processing of the
machines.
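
One way to picture the "too dumb" end of that trade-off: a comparator
that insists on exact equality rejects results that differ only by
harmless variation. The sketch below is a hypothetical illustration, not
a description of any real comparator; it assumes the results are numbers
and that a tolerance tol (an invented parameter) is meaningful for them.

```python
def vote_with_tolerance(results, tol):
    """2-out-of-3 vote where two results agree if they differ by <= tol.

    Returns (value, suspects). value is the mean of the machines that
    agree with at least one other machine, or None if no pair agrees.
    """
    if len(results) != 3:
        raise ValueError("expected exactly three results")
    pairs = [(0, 1), (0, 2), (1, 2)]
    agreeing = [(i, j) for i, j in pairs
                if abs(results[i] - results[j]) <= tol]
    if not agreeing:
        return None, [0, 1, 2]
    # Any machine that matches at least one other is treated as healthy.
    # Note that agreement within a tolerance is not transitive, so all
    # three can count as healthy even if the two extremes differ by
    # more than tol.
    members = sorted({k for pair in agreeing for k in pair})
    suspects = [k for k in range(3) if k not in members]
    value = sum(results[k] for k in members) / len(members)
    return value, suspects
```

Even this small amount of extra logic shows the other side of the
problem: each rule added to handle variation is more comparator that can
itself be wrong.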

Where I worked, they built some machines like that for what I would have
thought to be critical (not usually life-safety) situations, but the
tendency was to use each processor for something different; i.e.,
management would pay lip-service to fail-safe, but were too cheap to pay
for three machines to get the work of one.

> On
> my pSeries, it would be business as usual, because my pSeries would
> shut down the bad CPU and put all the work on the other one. If it
> detects a malfunctioning component, the service processor will alert
> you and will use the redundant one.

How does it detect a bad processor? Does some other processor do it? How
do you know which processor is bad when two disagree?

> If a ram stick is bad, the
> service processor will use only the good ones, and alert you.

Well, if a ram stick makes a mistake, the whole thing can be down before
anything has a chance to stop it.

> If
> one power supply goes down, the other one will happily take on the
> work.
>
> That is what x86 needs. An architecture of reliability and
> self-healing.

Well, I do not see any evidence of that in the x86 architecture. The
processors, for example, are far too complex to be amenable to the
proper analysis of knowing failure modes well enough to prepare for
them. Maybe a real RISC architecture could do it, but even the so-called
RISC processors have gotten pretty complex. And the rest of the *86
architecture is pretty complex, too. My machine has a E7501 chip set
that requires a massive heat sink on the MCH chip and heat sinks on the
P64H2 chips as well. I do not suppose they do that because those chips
are so simple.

> x86 has the speed but now it needs the reliability and
> fail-over capabilities. Until then, AIX and pSeries will do the real
> work, and x86/Linux will still be on non-mission critical stuff like
> software development where you have a back-up of everything in CVS
> (and customers don't get mad if you have to reboot or you lose their
> data for that day).

Since I started running Linux about 6 years ago, I have had to reboot
very seldom, other than replacing kernels. On this machine I have
rebooted several times, like today, because I am just configuring up
networking among three machines, where the two others are on separate
networks, and I have to get the forwarding working as well through the
firewall. I bungled it so badly at one point that the router machine kept
trying to send to a machine that would never listen (wrong IP address)
and the only way I could get it to quit trying was to reboot. I am sure
there was a better way, but I could not find it. In any case, the
machine kept running, but it was cluttering the log files.

-- 
   .~.  Jean-David Beyer           Registered Linux User 85642.
   /V\                             Registered Machine   241939.
  /( )\ Shrewsbury, New Jersey     http://counter.li.org
  ^^-^^ 18:00:00 up 4:18, 2 users, load average: 4.16, 4.18, 4.01

