Re: PATCH/RFC: [kdump] fix APIC shutdown sequence



On Mon, Aug 06, 2007 at 05:08:05PM +0200, Martin Wilck wrote:
PATCH/RFC: [kdump] fix APIC shutdown sequence

This patch fixes a problem that we have encountered
with kdump under high I/O load on some machines.
The machines showing the errors have an Intel ICH7
chip set with a 6702PXH PCI Express-to-PCI Bridge
(8086:032c) containing an IO-APIC.


I quickly went through the problem description and the
patch. I think the problem is not yet fully understood,
and we are already trying to apply a patch. We need to
study the problem a little more before settling on
a solution.

The bug symptom is that certain controllers connected
to the 6702PXH bridge wouldn't receive any IRQs in the
kdump kernel. In the error case (which is about 20% of
all cases) the IRR bit of the IO-APIC pin for that
controller is always set after the start of the kdump
kernel, indicating an IRQ in progress. We haven't found
a way to recover from this situation once it has
occurred, except for a system reset.

The error is caused by IRQs arriving while the APIC
subsystem is deactivated in machine_crash_shutdown().

Apparently, the IO-APIC gets stuck if it sends an IRQ
message to a Local APIC and never receives an EOI for that
message. This can have several possible reasons:


We need to zero in on one precise cause to solve the issue.
Speculation will not help.

1. If, under SMP, the IO-APIC logical destination field is
set by the IRQ balancing code to one of the "other"
CPUs (i.e. not the crashing_cpu), and an IRQ arrives
on the respective pin after that CPU has shut down
its local APIC (but before the IO-APIC pin is masked)
the IRQ message can't be delivered.

Points 1 and 2 seem to be the same.


2. The crashing CPU itself disables its local APIC
before the IO-APIC, leaving a short time window
where the IOAPIC can receive IRQs, but not
deliver them.


I doubt that this is the issue. The Intel IOAPIC (82093AA)
documentation says that the IRR bit of the IOAPIC is set only if
the destination CPU has accepted the interrupt. So if we have disabled
the LAPIC, it will not accept the interrupt, and the IRR bit of the IOAPIC
should not be set.
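
To make the argument concrete, here is a toy model of one level-triggered IOAPIC pin. It is not kernel code; the toy_* names are made up for illustration, and the behaviour is only my reading of the 82093AA description above: the IRR bit is set when the destination accepts the message and cleared by an EOI with the matching vector.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model, not kernel code: one level-triggered IOAPIC pin. */
struct toy_pin {
    uint8_t vector;      /* vector programmed for this pin */
    bool    masked;      /* mask bit of the pin */
    bool    remote_irr;  /* set on LAPIC accept, cleared by matching EOI */
};

/* Toy local APIC: only tracks whether it is enabled. */
struct toy_lapic {
    bool enabled;
};

/* Device asserts the pin. Returns true if the message was accepted. */
bool toy_raise(struct toy_pin *pin, const struct toy_lapic *dest)
{
    if (pin->masked || pin->remote_irr)
        return false;           /* nothing is sent */
    if (!dest->enabled)
        return false;           /* not accepted: IRR bit stays clear */
    pin->remote_irr = true;     /* accepted: pin now waits for an EOI */
    return true;
}

/* EOI from a local APIC carrying the given vector. */
void toy_eoi(struct toy_pin *pin, uint8_t vector)
{
    if (pin->vector == vector)
        pin->remote_irr = false;
}
```

In this model, raising the pin towards a disabled LAPIC leaves remote_irr clear, which is why I do not expect case 2 alone to leave the bit stuck.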

3. An IRQ is received and delivered to a local APIC, but
no CPU ever executes the IRQ handler and therefore no
EOI is sent.


We do issue EOIs for all pending interrupts in the second
kernel. Look at setup_local_APIC(). When the second kernel boots, it
checks whether any interrupts are pending (ISR bit set). If so,
it issues an extra EOI. This should also clear the
IRR bit of the IOAPIC.
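
The cleanup described above can be sketched roughly as follows. This is a simplified user-space model and not the code in setup_local_APIC(): the in-service register is an assumed array of 8 x 32 bits, and toy_ack() and toy_drain_isr() are hypothetical stand-ins for the register read and the EOI write.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_ISR_WORDS 8   /* 8 x 32 bits = 256 vectors */

/* Toy model of a local APIC's in-service register (ISR). */
struct toy_lapic_isr {
    uint32_t isr[TOY_ISR_WORDS];
};

/* Model of an EOI: clears the highest in-service vector.
 * Returns that vector, or -1 if nothing was in service. */
int toy_ack(struct toy_lapic_isr *apic)
{
    for (int i = TOY_ISR_WORDS - 1; i >= 0; i--) {
        for (int j = 31; j >= 0; j--) {
            if (apic->isr[i] & (UINT32_C(1) << j)) {
                apic->isr[i] &= ~(UINT32_C(1) << j);
                return i * 32 + j;
            }
        }
    }
    return -1;
}

/* Issue one EOI per in-service bit found; returns the number of EOIs. */
int toy_drain_isr(struct toy_lapic_isr *apic)
{
    int eois = 0;

    for (int i = TOY_ISR_WORDS - 1; i >= 0; i--) {
        uint32_t value = apic->isr[i];  /* snapshot, like a register read */

        for (int j = 31; j >= 0; j--) {
            if (value & (UINT32_C(1) << j)) {
                toy_ack(apic);
                eois++;
            }
        }
    }
    return eois;
}
```

Note that the loop only ever looks at the ISR of the CPU it runs on.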


After a lot of failed attempts, I have come up with the
following patch, which fixes the problem.

The patch first masks all IO-APIC pins to avoid a situation
where the IO-APIC can receive, but not deliver, the IRQs.
Moreover, it enables interrupts for a short period before
eventually starting the kdump kernel, so that EOIs can be
sent to the APICs as necessary.

Notes:
a) Simply calling disable_IO_APIC() early doesn't
work, probably because that also clears the IRQ vector
information, so that arriving EOI messages can't be
associated with pins by the IO-APIC.

The disable_IO_APIC() code does not clear the vector information
in the routing table. It just masks the interrupt. So even if
an EOI is issued later in the second kernel, it should clear the
IRR bit at the IOAPIC.
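
The difference between the two behaviours under discussion can be shown on a single redirection table entry. The bit positions below follow the 82093AA layout (vector in bits 7:0, Remote IRR in bit 14, trigger mode in bit 15, mask in bit 16); the RTE_* macros and rte_* helpers are illustrative names of mine, not kernel identifiers. The sketch takes no position on which of the two a given kernel version performs; that is best checked in the source.

```c
#include <assert.h>
#include <stdint.h>

/* Selected fields of an IOAPIC redirection table entry (82093AA layout). */
#define RTE_VECTOR_MASK  UINT64_C(0xff)        /* bits 7:0 */
#define RTE_REMOTE_IRR   (UINT64_C(1) << 14)   /* read-only in hardware */
#define RTE_TRIGGER_LVL  (UINT64_C(1) << 15)
#define RTE_MASKED       (UINT64_C(1) << 16)

/* Mask the pin but keep every other field, including the vector. */
uint64_t rte_mask_only(uint64_t rte)
{
    return rte | RTE_MASKED;
}

/* Replace the entry with a masked, otherwise zeroed one (vector lost). */
uint64_t rte_clear(void)
{
    return RTE_MASKED;
}

/* Extract the vector field. */
uint8_t rte_vector(uint64_t rte)
{
    return (uint8_t)(rte & RTE_VECTOR_MASK);
}
```

With rte_mask_only() a later EOI for the original vector can still be matched to the pin; with rte_clear() the vector field reads 0 afterwards.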

b) We have tried patches that avoid re-enabling interrupts,
but so far without success. Re-enabling IRQs is of course
dangerous while dumping, and I'd rather find a way to avoid it.
c) There are indications that besides the EOI, it's also
necessary that the PCI IRQ pin is deasserted at least for
a short time. That usually requires that the driver IRQ
handler is called and tells the FW that the IRQ was received.
Whether or not this is a requirement hasn't been finally
clarified yet.

I doubt this. There are situations where there is no device
driver for a device and the device keeps asserting its interrupt (frequently
observed in the case of kdump). The kernel still keeps receiving
the interrupt without a driver telling the device to lower the interrupt
line.

d) The problem is only seen with the IO-APIC in the 6702PXH
PCI bridge, which is the system's secondary IO-APIC. On the
system's main IO-APIC, we see other IRQs (timer etc) arrive
and never get an EOI, but we see no errors.

The patch below is against 2.6.23-rc1. The problem was
originally analyzed and the patch developed against the
Red Hat EL5 kernel (2.6.18-8.el5). I verified that the
problem still occurs with 2.6.23-rc1, and that the patch
below fixes the problem.


I can imagine one possibility. There might be pending interrupts
on a non-crashing CPU. When the second kernel boots, we initialize only
one CPU and issue EOIs for pending interrupts only on that CPU. So
if an interrupt is pending on another CPU, the IRR bit for that interrupt
on the IOAPIC will remain set, and one would not get further interrupts from
that device.
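
As a rough illustration of this theory (a toy model with made-up toy_* names, assuming two CPUs and a single level-triggered pin; it is not a description of what the hardware was observed to do):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_CPUS 2

/* Toy system: one IOAPIC pin and, per CPU, the ISR bit for its vector. */
struct toy_system {
    uint8_t pin_vector;
    bool    remote_irr;               /* IRR bit of the pin */
    bool    in_service[TOY_NR_CPUS];  /* ISR bit for pin_vector per CPU */
};

/* The interrupt is accepted by 'cpu' but its handler never finishes. */
void toy_deliver(struct toy_system *s, int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    s->in_service[cpu] = true;
    s->remote_irr = true;
}

/* Boot-time cleanup on one CPU: EOI only what that CPU has in service. */
void toy_boot_cleanup(struct toy_system *s, int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    if (s->in_service[cpu]) {
        s->in_service[cpu] = false;
        s->remote_irr = false;        /* the EOI reaches the IOAPIC */
    }
}

/* The pin delivers nothing further while its IRR bit is set. */
bool toy_pin_stuck(const struct toy_system *s)
{
    return s->remote_irr;
}
```

If the interrupt was delivered to CPU 1 and only CPU 0 runs the cleanup, toy_pin_stuck() stays true. If it was delivered to CPU 0, the cleanup releases the pin.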

- Can you please check whether the same problem reproduces with a
single processor (maxcpus=1)?

- Can you please print the local APIC (print_local_APIC) and
IOAPIC registers (print_IO_APIC) and verify the above theory?

Thanks
Vivek