Re: Hot plug vs. reliability

From: Russ Anderson (rja_at_sgi.com)
Date: 05/27/04

  • Next message: Jeff Garzik: "Re: idebus setup problem (2.6.7-rc1)"
    To: Zoltan.Menyhart@bull.net
    Date:	Thu, 27 May 2004 11:06:18 -0500 (CDT)
    
    

    Zoltan Menyhart wrote:
    >
    > We cannot remove safely failing memory / CPUs. In most of the cases
    > it is too late.

    This is a key point. To get the most value out of hot plug (as
    a reliability feature) the system must be able to detect and
    "ride through" component failures. Conversely, if the system
    crashes on the first component failure, the ability to hot remove
    the broken component has little value.

    For example, memory hot-plug has the most value if the system can
    "ride through" an uncorrectable memory error, isolate the bad memory
    (i.e., not reuse the page with the bad DIMM cells), shoot the
    application that hit the uncorrectable error (or better yet, have some
    checkpoint/restart mechanism to avoid killing the application),
    migrate data off the physical DIMM (etc.) to get the system to the
    point where the bad DIMM can be physically replaced, and then
    re-integrate the new memory.
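
    The sequence above (isolate, shoot, migrate, replace, re-integrate)
    can be sketched as a toy model. This is only an illustration in
    Python, not kernel code: the Memory class, the function names, and
    the four-pages-per-DIMM layout are all hypothetical, and real
    page migration is far more involved.

```python
from dataclasses import dataclass, field

PAGES_PER_DIMM = 4  # tiny on purpose; illustrative only


@dataclass
class Memory:
    total_pages: int
    owner: dict = field(default_factory=dict)   # page -> process name (in use)
    offline: set = field(default_factory=set)   # pages that must not be used
    killed: list = field(default_factory=list)  # processes shot after an error


def dimm_of(page):
    """Map a page number to the (hypothetical) DIMM that backs it."""
    return page // PAGES_PER_DIMM


def free_page_off_dimm(mem, dimm):
    """Return a usable free page that is not on `dimm`, or None."""
    for page in range(mem.total_pages):
        if (dimm_of(page) != dimm and page not in mem.owner
                and page not in mem.offline):
            return page
    return None


def handle_uncorrectable(mem, bad_page):
    """Ride through an uncorrectable error; True if the DIMM is drained."""
    dimm = dimm_of(bad_page)
    # 1. Isolate the bad page so it is never reused.
    victim = mem.owner.pop(bad_page, None)
    mem.offline.add(bad_page)
    # 2. Shoot the application that hit the error (no checkpoint/restart here).
    if victim is not None:
        mem.killed.append(victim)
        for page in [p for p, o in mem.owner.items() if o == victim]:
            del mem.owner[page]
    # 3. Migrate everything else off the physical DIMM.
    for page in [p for p in mem.owner if dimm_of(p) == dimm]:
        target = free_page_off_dimm(mem, dimm)
        if target is None:
            return False  # nowhere to move the data; DIMM cannot be pulled
        mem.owner[target] = mem.owner.pop(page)
    # 4. Take the whole DIMM offline; it can now be physically replaced.
    first = dimm * PAGES_PER_DIMM
    mem.offline.update(range(first, first + PAGES_PER_DIMM))
    return True


def reintegrate(mem, dimm):
    """After the swap, return the new DIMM's pages to service."""
    first = dimm * PAGES_PER_DIMM
    mem.offline.difference_update(range(first, first + PAGES_PER_DIMM))
```

    Note that if step 1 is not possible (the system crashes on the
    error), none of the later steps ever run.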

    My point is that a key part of the whole hot plug story is the
    ability to detect and ride through the initial errors that would
    prompt someone to want to replace the component. Without that
    part, the considerable effort spent on the rest of the pieces has
    significantly less value.

    > We (in the OS) can see some corrected CPU, memory, I/O
    > and platform errors. Yet the OS has not got and should not have the
    > knowledge when a component is "enough bad". I think it is the firmware
    > that has all the information about the details of the HW events.
    > Do you know of some firmware services which can say something like:
    > "hey, remove the component X otherwise your MTBF will drop by 95 %..." ?

    The difficulty with predictive analysis is determining the exact
    indicator of a potential failure. Many times the first indication
    is a fatal error that crashes the system (which is why error recovery
    to "ride through" failures is so important). Other errors, such
    as memory single-bit errors, may (or may not) increase the probability
    of failure, but do they increase the probability enough to warrant
    a service action? (Service actions have costs, too.)
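
    The trade-off can be written down as a rough sketch. Everything
    here is an assumption for illustration: the class and function
    names are hypothetical, and the failure probability is supplied
    by the caller precisely because mapping an observed single-bit
    count to a probability is the hard, unsolved part.

```python
from collections import deque


class SingleBitTracker:
    """Count corrected single-bit errors seen within a sliding time window."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # timestamps of corrected errors, oldest first

    def record(self, now):
        self.events.append(now)

    def count(self, now):
        # Drop events that have aged out of the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events)


def should_service(p_failure, failure_cost, service_cost):
    """Service only if the expected loss from a failure exceeds the
    cost of the service action itself.

    p_failure is the caller's estimate; this sketch does not derive it.
    """
    return p_failure * failure_cost > service_cost
```

    With a failure cost of 100000 and a service cost of 500, an
    estimated failure probability of 0.01 justifies the service action,
    while 0.001 does not.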

    A technical difficulty with predictive analysis is that each component
    has different failure characteristics, and those characteristics
    can change with specific technologies. For example, smaller die
    technologies can increase the soft failure rates. And by the
    time the long-term failure characteristics are fully understood,
    the technology is obsolete. :-(

    -- 
    Russ Anderson, OS RAS/Partitioning Project Lead  
    SGI - Silicon Graphics Inc          rja@sgi.com
    
