Re: Real-life pci errors (Was: Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

From: Benjamin Herrenschmidt (benh_at_kernel.crashing.org)
Date: 03/19/05

  • Next message: Andrew Morton: "Re: [2.6.11-mm3] umount: Scheduling while atomic"
    To: Linas Vepstas <linas@austin.ibm.com>
    Date:	Sat, 19 Mar 2005 12:24:07 +1100
    
    

    On Fri, 2005-03-18 at 18:35 -0600, Linas Vepstas wrote:
    > On Sat, Mar 19, 2005 at 10:13:02AM +1100, Benjamin Herrenschmidt was heard to remark:
    > >
    > > Additionally, in "real life", very few errors are cause by known errata.
    > > If the drivers know about the errata, they usually already work around
    > > them. Afaik, most of the errors are caused by transcient conditions on
    > > the bus or the device, like a bit beeing flipped, or thermal
    > > conditions...
    >
    >
    > Heh. Let me describe "real life" a bit more accurately.
    >
    > We've been running with pci error detection enabled here for the last
    > two years. Based on this experience, the ballpark figures are:
    >
    > 90% of all detected errors were device driver bugs coupled to
    > pci card hardware errata

    Well, this have been in-lab testing to fight driver bugs/errata on early
    rlease kernels, I'm talking about the context of a released solution
    with stable drivers/hw.

    > 9% poorly seated pci cards (remove/reseat will make problem go away)
    >
    > 1% transient/other.

    Ok.

    > We've seen *EVERY* and I mean *EVERY* device driver that we've put
    > under stress tests (e.g. peak i/o rates for > 72 hours, e.g.
    > massive tcp/nfs traffic, massive disk i/o traffic, etc), *EVERY*
    > driver tripped on an EEH error detect that was traced back to
    > a device driver bug. Not to blame the drivers, a lot of these
    > were related to pci card hardware/foirmware bugs. For example,
    > I think grepping for "split completion" and "NAPI" in the
    > patches/errata for e100 and e1000 for the last year will reveal
    > some of the stuff that was found. As far as I know,
    > for every bug found, a patch made it into mainline.

    Yah, those are a pain. But then, it isn't the context described by
    Nguyen where the driver "knows" about the errata and how to recover.
    It's the context of a bug where the driver does not know what's going on
    and/or doesn't have the proper workaround. My point was more that there
    are very few cases where a driver will have to do recovery of PCI error
    in known cases where it actually expect an error to happen.

    > As a rule, it seems that finding these device driver bugs was
    > very hard; we had some people work on these for months, and in
    > the case of the e1000, we managed to get Intel engineers to fly
    > out here and stare at PCI bus traces for a few days. (Thanks Intel!)
    > Ditto for Emulex. For ipr, we had inhouse people.
    >
    > So overall, PCI error detection did have the expected effect
    > (protecting the kernel from corruption, e.g. due to DMA's going
    > to wild addresses), but I don't think anybody expected that the
    > vast majority would be software/hardware bugs, instead of transient
    > effects.
    >
    > What's ironic in all of this is that by adding error recovery,
    > device driver bugs will be able to hide more effectively ...
    > if there's a pci bus error due to a driver bug, the pci card
    > will get rebooted, the kernel will burp for 3 seconds, and
    > things will keep going, and most sysadmins won't notice or
    > won't care.

    Yes, but it will be logged at least, so we'll spot a lot of these during
    our tests.

    Ben.

    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/


  • Next message: Andrew Morton: "Re: [2.6.11-mm3] umount: Scheduling while atomic"

    Relevant Pages