Re: Hot plug vs. reliability

From: Matthias Fouquet-Lapar (mfl_at_kernel.paris.sgi.com)
Date: 05/27/04

    To: Zoltan.Menyhart@bull.net
    Date:	Thu, 27 May 2004 17:02:14 +0200 (CEST)
    
    

    > I agree, in this case there is no loss of MTBF.
    > Yet let's call this activity as run time re-partitioning of the machine.
    > (Most people - me too - consider hot plugging as physically plugging
    > things in / out.)

    You're right, it's confusing, and I made the same assumption you did:
    physically moving parts. (And I worked on a system a couple of years
    back where we actually had hotswap :-))

    > But the new comers are tested in a different environment, with
    > different tolerance range. I just simply do not trust :-)

    Not really. It's up to the vendor and at least here at SGI we have pretty
    tight rules and tolerances.

    > I do not think the timing / the delays are auto adjusting. You select
    > a component X to work next to the component Y because you know that
    > X in "here" and Y in "there" in the tolerance range...

    They do (impedance matching). One example is the SRAMs used for CPUs with
    external caches. I've learned a lot about that :-)). You also
    have stuff like auto-learning for echo-clock timings etc., but this is really
    very platform- and CPU-specific.

    > I think the OS has to be platform independent. How can a platform independent
    > OS know if <n> errors of this / that type requires what intervention ?
    > We'll have the same binary of the OS (+ drivers) for a small desk top or
    > for a 32 CPU "main frame". Only the firmware is different...

    An OS is never platform independent; there is always a machine-dependent layer.
    I'm not really concerned about the total number of errors in a system,
    regardless of whether we have one, 32 or 512 CPUs. If we see a component
    starting to fail, it should be isolated in order to avoid a catastrophic failure.
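
    To make the point concrete, here is a minimal sketch of counting errors
    per component rather than per system. Everything in it is an assumption
    made for illustration: the class name, the component labels, the threshold
    and the time window are hypothetical and do not describe any real
    firmware or kernel interface.

```python
from collections import defaultdict, deque


class ComponentErrorTracker:
    """Track corrected errors per component over a sliding time window.

    Hypothetical illustration: threshold and window are made-up policy knobs.
    """

    def __init__(self, threshold=3, window_seconds=60.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # component -> recent error timestamps
        self.isolated = set()              # components flagged for isolation

    def record_error(self, component, timestamp):
        """Record one error; return True if the component should be isolated."""
        events = self._events[component]
        events.append(timestamp)
        # Forget errors that are older than the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        # The decision depends only on this component's recent history,
        # not on the total error count of the machine.
        if len(events) >= self.threshold:
            self.isolated.add(component)
        return component in self.isolated
```

    With these assumed defaults, three errors on the same component within
    60 seconds flag it, while the same three errors spread over three
    different components flag nothing, whatever the size of the machine.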

    > Most of our clients just do not want to touch their 10 year old rubbish
    > Fortran programs. If I get a hint of danger (today it does not come from the FW)
    > I could take a check point and call for service intervention...

    That's a well-known problem (although I think 20 years or more is more
    likely ...).
    I think, however, that there are new applications coming up using large or
    ultra-scale systems where more fault tolerance can be designed in at the OS,
    library or even user level.

    Amicalement

    Matthias Fouquet-Lapar Core Platform Software mfl@sgi.com VNET 521-8213
    Principal Engineer Silicon Graphics Home Office (+33) 1 3047 4127


