Re: Hot plug vs. reliability

From: Bill Davidsen (davidsen_at_tmr.com)
Date: 05/27/04

  • Next message: Bill Davidsen: "Re: Hot plug vs. reliability"
    To: linux-kernel@vger.kernel.org
    Date:	Thu, 27 May 2004 10:54:53 -0400
    To: root@chaos.analogic.com
    
    

    Richard B. Johnson wrote:
    > On Thu, 27 May 2004, Zoltan Menyhart wrote:
    >
    >
    >>I've got some questions about how hot plugging can (or cannot)
    >>ensure reliability:
    >>
    >>When we produce machines, we run burn-in, stress, validation, and
    >>similar tests. In addition, every time a machine is switched
    >>on, a power-on self test is executed.
    >
    >
    > The POST routine only verifies that some hardware "works" at the
    > instant it's tested. It has nothing to do with reliability.
    >
    >
    >>When we hot plug (add, remove, swap) a component that has never been
    >>seen, how can we make sure that the modified machine achieves the
    >>same MTBF as the original machine had, without passing any of the
    >>tests I mentioned above?
    >>
    >
    >
    > If you want a highly-reliable machine of any type, the components
    > are normally burned-in to catch "infant mortality" problems. If
    > you "hot-plug" a component, that component should have undergone
    > the same kind of burn-in if you wish to maintain some degree
    > of reliability. Again a POST routine does not assure anything.
    > And, in fact, it's normally just initialization. If you look
    > at the stupid, ludicrous, "testing" done in the early IBM/PC
    > BIOS, you will understand that it was just some junk that
    > some committee decided had to be done, like moving values
    > around between CPU registers -- If the CPU didn't work, it
    > couldn't test itself -- if the CPU did work, it couldn't
    > test itself, etc... Just crap.
    >
    > Now, memory testing has some validity because you generally
    > need to access it once to get all the bits into a "known"
    > state where the charge-pump (refresh) will keep it. However,
    > I doubt that much bad memory has actually been detected during
    > POST. It's much later, when programs or the kernel crash,
    > that bad memory is detected.
    >
    > [SNIPPED...]
    >
    > So your concern that POST hasn't been run when you hot-plug
    > a component isn't a problem. You cannot "test-in" reliability.
    > You need to design it in, test it to make sure it's been
    > built like it was designed, then burn it in to solve the
    > infant mortality problem.

    If reliability is your goal, testing at plug time is necessary but not
    sufficient. It avoids kernel failures caused by trying to use devices
    which are dysfunctional (the kernel copes far better with a device that
    is non-functional than with one that is broken). And some of the better
    drivers are far more robust at init time than in normal operation, not
    a bad thing at all. The init code can function as POST if it's written
    to do so.

    Testing is a part of the reliability chain; as you note, it isn't a
    substitute for all the other parts.
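
    To make the "init code as POST" idea concrete, here is a rough
    userspace sketch, not code from any real driver. The device model
    (struct fake_dev, its scratch register, the stuck_bits fault field)
    and the function names dev_selftest() and dev_probe() are all
    hypothetical, invented for illustration. The point is only that the
    probe path refuses to register a device that fails a simple
    write/read-back check, so the rest of the system never sees a
    half-working part.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device model: a single scratch register that should
 * read back exactly what was written to it. */
struct fake_dev {
    uint32_t scratch;     /* last value latched by the device */
    uint32_t stuck_bits;  /* bits stuck at 1, simulating a faulty part */
    bool     registered;  /* set only when probe succeeds */
};

/* Simulated register write: stuck bits corrupt the stored value. */
static void dev_write(struct fake_dev *d, uint32_t v)
{
    d->scratch = v | d->stuck_bits;
}

/* Simulated register read. */
static uint32_t dev_read(const struct fake_dev *d)
{
    return d->scratch;
}

/* Plug-time self-test: write a few bit patterns and read them back.
 * Returns 0 if every pattern survives, -1 on the first mismatch. */
static int dev_selftest(struct fake_dev *d)
{
    static const uint32_t patterns[] = {
        0x00000000u, 0xFFFFFFFFu, 0xAAAAAAAAu, 0x55555555u
    };

    for (size_t i = 0; i < sizeof patterns / sizeof patterns[0]; i++) {
        dev_write(d, patterns[i]);
        if (dev_read(d) != patterns[i])
            return -1;
    }
    return 0;
}

/* Probe: run the self-test first and register the device only if it
 * passes. Returns 0 on success, -1 if the device was rejected. */
static int dev_probe(struct fake_dev *d)
{
    d->registered = false;
    if (dev_selftest(d) != 0)
        return -1;
    d->registered = true;
    return 0;
}
```

    A caller would invoke dev_probe() when the device is plugged in and
    treat a nonzero return as "leave this device alone". Passing such a
    check says nothing about burn-in or long-term reliability; it only
    filters out parts that are already visibly broken at plug time.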

    -- 
        -bill davidsen (davidsen@tmr.com)
    "The secret to procrastination is to put things off until the
      last possible moment - but no longer"  -me
    
