Re: [PATCH] NMI watchdog config option (was: Re: [PATCH] NMI lockup and AltSysRq-P dumping calltraces on _all_ cpus via NMI IPI)

From: Maciej W. Rozycki (macro_at_linux-mips.org)
Date: 05/17/05

    Date:	Tue, 17 May 2005 18:04:05 +0100 (BST)
    To: Linus Torvalds <torvalds@osdl.org>
    
    

    On Tue, 17 May 2005, Linus Torvalds wrote:

    > > Mostly or perhaps even exclusively due to BIOS bugs -- you know, that
    > > piece of hidden firmware that runs in the SMM under our feet and fiddles
    > > randomly with hardware we can do nothing about.
    >
    > I'd love to just blame the BIOS, but we've definitely had our own share of
    > bugs too. NMI makes all the fast system call etc stuff much more
    > "exciting", and we've several times had code that does actually disable
    > interrupts for a long time - which may be exceedingly impolite, but then
    > the NMI watchdog makes it a fatal error rather than something that is just
    > a nuisance.

     Well, this is actually not a problem with the watchdog itself. And I'd
    rather say it's doing us a favour (and a great job) finding these buggy
    bits of code that keep interrupts off CPUs. ;-)
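
A minimal sketch of how such a watchdog can catch code that keeps
interrupts off, assuming a per-CPU timer interrupt counter sampled from
the NMI handler. The struct, the function name and the threshold are
hypothetical and chosen for illustration only; this is not the kernel's
actual implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU watchdog state; all names are illustrative. */
struct nmi_watchdog {
    uint64_t last_irq_count;  /* timer interrupt count seen at the previous NMI */
    unsigned stalled_ticks;   /* consecutive NMIs that saw no progress */
    unsigned threshold;       /* stalled NMIs tolerated before reporting */
};

/*
 * Called from the (simulated) NMI handler with the CPU's running count of
 * timer interrupts. Returns true once the count has not moved for
 * `threshold` consecutive NMIs, i.e. ordinary interrupts appear to have
 * been held off for too long.
 */
bool nmi_watchdog_tick(struct nmi_watchdog *w, uint64_t irq_count)
{
    if (irq_count != w->last_irq_count) {
        /* The timer interrupt got through, so the CPU is making progress. */
        w->last_irq_count = irq_count;
        w->stalled_ticks = 0;
        return false;
    }
    w->stalled_ticks++;
    return w->stalled_ticks >= w->threshold;
}
```

Because the NMI arrives even while ordinary interrupts are masked, the
check keeps running in exactly the situation it is meant to detect.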

     Otherwise NMIs should be completely transparent. Well, yeah, that's the
    theory -- for this to be the case we'd have to use a task gate, which is
    rather time consuming, and using an interrupt gate instead means we need
    to take some explicit care elsewhere.

     OTOH, we can always get an NMI from the chipset in response e.g. to a bus
    error of some kind (unfortunately it's often impossible to reroute these
    errors to a more useful interrupt, like an MCE), so we need to be prepared
    for one at any time. But these errors are expected to be rare, so it's
    hard to test their effects, unlike those of the watchdog.

    > Of course, our own bugs we can fix (and hopefully we have done so - many
    > people _do_ obviously use the NMI watchdog as-is), so yes, in that sense
    > BIOS (and hardware) bugs end up being a special case.

     The problem with the SMM as currently used by BIOSes is unfortunately the
    design, not any particular implementation.

      Maciej

