Hardware reliability (was: Re: Linux community software-update-anarchy polemic)

From: Anonymous Coward (acoward_at_mail.ru)
Date: 03/12/04


Date: 12 Mar 2004 05:15:06 -0800


"Shmuel (Seymour J.) Metz" <spamtrap@library.lspace.org.invalid> wrote
>>Why?
>
>Because power failures are not the only types of outages.
See below*.

>>In this case, I was making clear my definition
>>of "local" as meaning that the connection is presumed reliable,
>
>You can presume what you want, but your hardware is not bound by your
>presumption.
True, but ultimately some presumptions must be made; we mere mortals
can't make contingency plans for all possible problems. I also sit
here typing this with the presumption that all of the air molecules in
Earth's atmosphere aren't about to simultaneously travel away from
Earth long enough to make us all explode in a vacuum, but the laws of
physics don't necessarily prohibit that from happening, and the air
molecules aren't bound by my presumption.

>>Other local things in computer systems are commonly presumed
>>reliable (that is, their reliability is axiomatic):
>
>Not in circles where reliability matters.
Yes, they are. Even in circles where reliability matters. When
reliability matters, things are generally made more redundant, but
ultimately something has to be presumed reliable. As an extreme case,
on the space shuttle, there are entire single computers which aren't
presumed reliable, but their majority vote is presumed reliable.
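
To make the majority-vote idea concrete, here is a minimal sketch. It is only an illustration: the function name and the three-replica setup are my own assumptions, and it is not a description of how the shuttle's actual software works. The point is that no individual output is trusted, but the voter itself is.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value reported by a strict majority of replicas.

    Raises ValueError when no value has a strict majority, because the
    voter then has nothing left that it can presume reliable.
    """
    if not outputs:
        raise ValueError("no replica outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no strict majority among replica outputs")
    return value

# Three hypothetical replicas; one has failed and reports garbage.
print(majority_vote([42, 42, 17]))
```

Note that the voter has no redundancy of its own in this sketch; that is exactly the "something has to be presumed reliable" part.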

>>for example, the cpu<->localmemory connection; in
>>contemporary systems, the possibility of the severing of that
>>connection is not a consideration in software design.
>
>It most certainly is.
Not in any systems that I'm aware of. I'd be very interested if you'd
point out such a system. For clarification, I'm referring to the point
of view of software stored in the local memory and running on the cpu,
not from the point of view of an external system. For example, in my
space shuttle example above, a program running on an individual
computer C doesn't consider the possibility that its cpu will get
disconnected from its local memory, but an external system does
consider the possibility of C failing. A program running on a
particular cpu-localmemory pair simply says, "my cpu will not get
disconnected from my local memory, and if it does, I will just die and
let somebody else replace me".
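
A rough sketch of that division of labor, with hypothetical names throughout (ReplicaFailed, run_with_replacement, and the two toy replicas are invented for illustration): the program on each replica makes no attempt to survive its own failure, and only the external supervisor plans for it.

```python
class ReplicaFailed(Exception):
    """Stands in for a replica dying; the replica itself never handles this."""

def run_with_replacement(replicas, task):
    """External view: try each replica in turn, discarding any that fail."""
    for replica in replicas:
        try:
            return replica(task)
        except ReplicaFailed:
            continue  # the failed replica is replaced, not repaired
    raise RuntimeError("all replicas failed")

def broken_replica(task):
    # Simulates a cpu losing its local memory: the program just dies.
    raise ReplicaFailed("cpu disconnected from local memory")

def healthy_replica(task):
    # Internal view: written as if the cpu<->memory connection cannot fail.
    return task * 2

print(run_with_replacement([broken_replica, healthy_replica], 21))
```

Nothing inside healthy_replica or broken_replica checks for its own hardware failing; all of the contingency logic lives in the supervisor.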

>>but asserting that the RAM<->disk connection is reliable isn't
>>unreasonable, especially in systems that already assert the
>>reliability of the cpu<->RAM connection, and even moreso in
>>systems that assert the reliability of the RAM itself without any
>>ECC or even parity checking.
>
>That's called Russian Roulette.
Living in our universe is Russian Roulette.

>CPUs fail. DRAM fails. A failing adapter can hang the entire system.
*You're arguing using the premise that the surface of a disk is a more
permanent data storage place than DRAM. I explained that adding a
battery backup to the system eliminates the possibility of power
failure as a justification for that premise. Now you're saying that
the possibility of failure of particular hardware components is a
justification. One such component you cite is DRAM itself, but that's
not valid, because disks can fail as easily as DRAM (and in fact in
some cases more easily, e.g. if the system gets dropped). Other
components which you cite include the adapter (presumably disk to bus
adapter) and the CPU. But those don't impact DRAM; in the case that
those components' failure could generate spurious logic signals which
would be interpreted by DRAM as "erase thyself", those spurious
signals could just as easily be interpreted by the disk as "erase
thyself". In the case that those components' failure could generate
lethal power spikes (such as if the system gets struck by lightning),
those power spikes could fry DRAM, but they could fry the disk too.
Therefore I presume that what you really mean isn't that the
possibility of failure of system components (other than the storage
components themselves, i.e. DRAM and the disk) is itself justification
for your premise, but that in the event of such a failure, the
replacement of those components (or equivalently, the movement of the
storage components to another, functional machine) necessitates the
removal of all power, including battery backup power, from the system,
and that the removal of such power would erase DRAM but not the disk.
However, if this is your argument, you fail to take into account the
fact that battery backup power could be supplied to the DRAM
independently of the rest of the system. DRAM+battery can be a
legitimate first-class storage device just like a disk. Yes, they would
have different failure modes, but the failure modes of the disk would
not be a subset of the failure modes of DRAM+battery (nor vice versa;
I'm not claiming that DRAM+battery would be a more permanent storage
device than a disk).

BTW here also I apologize for my long delay; see my other message
posted at the same time as this one.


