Re: pci error recovery procedure
- From: "Zhang, Yanmin" <yanmin_zhang@xxxxxxxxxxxxxxx>
- Date: Thu, 07 Sep 2006 11:18:56 +0800
On Thu, 2006-09-07 at 04:39, Linas Vepstas wrote:
On Wed, Sep 06, 2006 at 10:04:31AM +0800, Zhang, Yanmin wrote:
On Wed, 2006-09-06 at 03:17, Linas Vepstas wrote:
On Mon, Sep 04, 2006 at 01:47:30PM +0800, Zhang, Yanmin wrote:
Again, consider the multi-function cards. On pSeries, I can only enable
DMA on a per-slot basis, not a per-function basis. So if one driver
enables DMA before some other driver has reset appropriately, everything
breaks.
Does 'reset' here mean hardware slot reset?
I should have said: If one driver of a multi-function card enables DMA before
another driver has stabilized its hardware, then everything breaks.
What's another driver's hardware? A function of the previous multi-function
card? Or a function of another device?
Yes. Either. Both. Doesn't matter. Enabling DMA is "granular" at a
different size scale than pci functions, and possibly even pci devices
or slots, depending on the architecture. Before DMA can be enabled,
*all* affected device drivers have to approve, and have to be ready
for it.
If we enabled both DMA and MMIO at the same time, there are many cases
where the card will immediately trap again -- for example, if it's
DMA'ing to some crazy address. Thus, typically, one wants DMA disabled
until after the card reset. Without the mmio_enabled() reset, there
is no way of doing this.
Did you assume the card reset is executed by callback mmio_enabled?
I am assuming that, when a driver receives the mmio_enabled() callback,
it will perform some sort of register i/o. For example, I am currently
planning to modify the e1000 driver to do the following:
-- The error_occurred() callback returns PCI_ERS_RESULT_CAN_RECOVER
-- The arch enables mmio, and then calls the mmio_enabled() callback.
-- The mmio_enabled() callback in the driver takes a full dump of all
of the registers on the card. It then returns PCI_ERS_RESULT_NEED_RESET
Such dumps are random data and might be useless. The error recovery procedures
are meant to handle pci hardware errors, not device driver bugs.
-- The arch performs the full electrical #RST of device. Recovery from
this point proceeds as before.
The steps are exquisite. Scenario:
The e1000 NIC and another device (maybe a function) are on the same bus. The
error_detected of the second device returns PCI_ERS_RESULT_NEED_RESET, so although
error_detected of e1000 returns PCI_ERS_RESULT_CAN_RECOVER, the slot will
be reset immediately, then error recovery will go to call slot_reset callback
directly. The mmio_enabled is not called.
My scenario above is just to show that things easily get out of control when the
steps are complicated.
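The scenario above can be modelled in a few lines of plain C. This is only a userspace sketch under stated assumptions: the enum, the struct and both functions are hypothetical names made up for illustration and are not the kernel's definitions, and the model assumes, as in the e1000 plan quoted earlier, that a slot reset follows mmio_enabled anyway.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only. The names echo the result codes discussed in this
 * thread; they are not the kernel's definitions. */
enum model_result { MODEL_CAN_RECOVER, MODEL_NEED_RESET };

struct model_dev {                 /* hypothetical per-function bookkeeping */
    enum model_result detected;    /* what its error_detected returns */
    int mmio_enabled_calls;        /* times mmio_enabled was invoked */
    int slot_reset_calls;          /* times slot_reset was invoked */
};

/* One NEED_RESET vote from any driver overrides CAN_RECOVER from the rest. */
static enum model_result merge_votes(const struct model_dev *devs, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (devs[i].detected == MODEL_NEED_RESET)
            return MODEL_NEED_RESET;
    return MODEL_CAN_RECOVER;
}

/* Walk one slot through recovery; returns 1 if mmio_enabled was reached. */
static int recover_slot(struct model_dev *devs, size_t n)
{
    size_t i;
    int reached_mmio = 0;

    assert(devs != NULL && n > 0);

    if (merge_votes(devs, n) == MODEL_CAN_RECOVER) {
        /* MMIO is enabled while DMA stays off; a driver could dump its
         * registers here. Following the e1000 plan, the drivers are
         * assumed to ask for a reset afterwards. */
        for (i = 0; i < n; i++)
            devs[i].mmio_enabled_calls++;
        reached_mmio = 1;
    }

    /* In this model the slot reset follows on both paths. */
    for (i = 0; i < n; i++)
        devs[i].slot_reset_calls++;

    return reached_mmio;
}
```

With one device voting MODEL_CAN_RECOVER and a second one on the same bus voting MODEL_NEED_RESET, recover_slot() returns 0 and neither mmio_enabled counter moves, which is the skipped step the scenario describes.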
Again, consider the multi-function cards. On pSeries, I can only enable
DMA on a per-slot basis, not a per-function basis. So if one driver
enables DMA before some other driver has reset appropriately, everything
breaks.
What does 'I' above stand for? The platform error recovery procedure
Yes. The pSeries platform error recovery procedure can only enable DMA
on a per-slot basis.
I guess it means platform, that is,
only platform enables DMA for the whole slot.
Yes.
But why does the last sentence
become driver enables DMA?
In your proposal, you were suggesting that MMIO and DMA be enabled with
one and the same routine, and I was attempting to explain why that can't
work.
Thanks for your explanations. My point is that if the driver could enable DMA,
it could do so in the new error_resume. The driver should do more checking before
enabling DMA.
Your scenario only exists when:
1) Only the platform can enable DMA, and it enables it per-slot instead of per-function.
2) And at least one device doesn't want a hard slot reset to recover, while
all other impacted devices also don't want a hard slot reset; because if one device
wants a hard reset, mmio_enabled of all impacted drivers won't be called.
3) And at least one device's DMA is crazy.
With my new API, I just need to break one of the conditions above. My requirement is:
If a device uses DMA and its driver either is not sure whether DMA is pending or is
sure that it is, its error_detected should return PCI_ERS_RESULT_NEED_RESET.
Otherwise, error_detected is allowed to return whatever it likes.
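The requirement above amounts to a small decision rule, sketched here in plain C. Everything in the block is an assumption made for illustration: the enums, the three-way DMA state and the function name are hypothetical, and no driver is claimed to look like this.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the result codes named in this thread. */
enum req_result { REQ_CAN_RECOVER, REQ_NEED_RESET };

/* Hypothetical view of what the driver knows about in-flight DMA. */
enum req_dma_state { REQ_DMA_KNOWN_IDLE, REQ_DMA_PENDING, REQ_DMA_UNKNOWN };

/*
 * Decide what error_detected reports under the proposed rule: a device
 * that uses DMA must ask for a reset unless its driver is certain that
 * no DMA is pending. In every other case the driver's own preference
 * is returned unchanged.
 */
static enum req_result pick_detected_result(bool uses_dma,
                                            enum req_dma_state dma,
                                            enum req_result preferred)
{
    if (uses_dma && dma != REQ_DMA_KNOWN_IDLE)
        return REQ_NEED_RESET;
    return preferred;
}
```

A driver whose device never uses DMA, or which knows its DMA is idle, keeps whatever answer it preferred; only the uncertain and pending cases are forced to a reset.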
Could driver enable DMA for a function?
No, not on pSeries hardware.
It's not fair to such other platforms, although I have no such example now.
If mmio_enabled is not used currently, I think we could delete it first. Later on,
if a platform really needs it, we could add it, so we could keep the code simple.
It would be very difficult to add it later. And it would be especially
silly, given that someone would find this discussion in the mailing list
archives.
You insist on keeping mmio_enabled, which is not used currently, but if there is
a new platform that uses more fine-grained steps to recover pci/pci-e, would
you say 'it would be very difficult' and refuse to add new callbacks?
Yes.
It doesn't prevent software from merging some steps. And, we want
to implement pci/pci-e error recovery for more platforms instead of just
pSeries.
Yes. The API was designed so that it could be supported on every
current and future platform we could think of. You haven't yet
claimed that "pci-e can't be supported".
The current error handler infrastructure could support pci-e, but I want a better
solution that helps driver developers add error handlers more easily. My
starting point is the driver developer. If they are not willing to add error handlers,
it's impossible for you and me to do so for all drivers.
Based on what
I understand, changing the API wouldn't make the implementation
any easier. (It is very easy to call a callback, and then
examine its return value.
It's not easy. Just like the scenario above, mmio_enabled might be skipped when
coordinating 2 or more devices.
Checking the current e100/e1000/ipr error handlers, they look ugly.
Removing a few callbacks does not
materially simplify the recovery mechanism. Managing these
callbacks is *not* the hard part of implementing this thing.)
The above comments are purely from the error recovery design point of view. They give
no consideration to driver developers.
BTW, most of the discussion is about whether mmio_enabled should be deleted. As for
merging slot_reset and resume, my reason is that there is no platform-specific
operation between calling slot_reset and resume.
Yanmin