Re: APIC error on SMP machine

From: James Cleverdon (jamesclv_at_us.ibm.com)
Date: 10/01/03

    To: Chris Rankin <rankincj@yahoo.com>, linux-kernel <linux-kernel@vger.kernel.org>
    Date:	Tue, 30 Sep 2003 18:52:47 -0700
    
    

    On Tuesday 30 September 2003 2:42 pm, Chris Rankin wrote:
    > Linux-2.4.22-SMP, 1 GB RAM, devfs, gcc-3.2.3.
    >
    > Hi,
    >
    > Today, my dual PIII (Coppermine) refused to boot, and wrote a large number
    > of these messages to the serial console instead:
    >
    > APIC error on CPU1: 04(04)
    > APIC error on CPU1: 04(04)
    > APIC error on CPU1: 04(04)
    > APIC error on CPU1: 04(04)
    > APIC error on CPU1: 04(04)
    > APIC error on CPU1: 04(04)
    > APIC error on CPU1: 04(04)
    > APIC error on CPU1: 04(04)
    > APIC error on CPU1: 04(04)
    > APIC error on CPU1: 04(04)
    > APIC error on CPU1: 04(04)
    > APIC error on CPU1: 04(04)
    > APIC error on CPU1: 04(04)
    > APIC error on CPU1: 04(04)
    > APIC error on CPU1: 04(04)
    >
    > Can anyone tell me what these might mean, please? The kernel source implies
    > that it's a "Send accept error", but this doesn't help me in an "Ah, I can
    > fix that!" sense.
    >
    > Does this APIC error just mean that the CPU is unhappy in this slot, and is
    > refusing to listen to the motherboard? Or is the motherboard refusing to
    > listen to the CPU?

    Neither. An APIC send accept error means that an interrupt was sent but
    the target did not accept it. In this case, the target is a CPU, either
    your other CPU or the same one (a CPU can send itself an interrupt).
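
    To make the error code concrete, here is a minimal Python sketch that
    decodes the value shown in those log lines. The bit names follow Intel's
    documented layout of the local APIC Error Status Register (ESR) on
    P6-family CPUs. The function names are made up for illustration, and
    treating the two hex fields as two successive ESR reads is an assumption
    about the log format, not something stated in this thread.

```python
import re

# ESR bit positions as documented for P6-family local APICs.
# Bit 4 is reserved on these CPUs, so it is left out.
ESR_BITS = {
    0: "Send checksum error",
    1: "Receive checksum error",
    2: "Send accept error",
    3: "Receive accept error",
    5: "Send illegal vector",
    6: "Received illegal vector",
    7: "Illegal register address",
}

# Matches lines such as "APIC error on CPU1: 04(04)".
LINE_RE = re.compile(
    r"APIC error on CPU(\d+): ([0-9a-fA-F]{2})\(([0-9a-fA-F]{2})\)"
)


def decode_esr(value):
    """Return the names of all error bits set in an ESR value."""
    return [name for bit, name in sorted(ESR_BITS.items())
            if value & (1 << bit)]


def parse_apic_error(line):
    """Parse one log line into (cpu, first_errors, second_errors).

    Returns None if the line is not an APIC error message.
    Assumption: both hex fields are ESR values.
    """
    match = LINE_RE.search(line)
    if match is None:
        return None
    cpu = int(match.group(1))
    first = int(match.group(2), 16)
    second = int(match.group(3), 16)
    return cpu, decode_esr(first), decode_esr(second)


if __name__ == "__main__":
    print(parse_apic_error("APIC error on CPU1: 04(04)"))
```

    Only bit 2 is set in 0x04, which matches the "Send accept error" reading
    from the kernel source.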

    While there are several reasons why this can happen, the most common ones are:

    1) The target CPU is "full". The local APIC on P54Cs through P3s only has two
    interrupt latches per interrupt "level", which is the high nibble of the IRQ
    vector number. So, if a CPU had already latched interrupt vectors 0x30 and
    0x3A, it would have to reject any other 0x3X vector that was sent until it
    could service one of the two latched vectors.
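
    The latch limit can be illustrated with a toy model. This is only a
    sketch of the behaviour described above (two latches per level, level =
    high nibble of the vector); the class and method names are hypothetical,
    and real hardware tracks this state in its own registers rather than in
    per-level lists.

```python
class LocalApicModel:
    """Toy model of a local APIC with two latches per vector level."""

    LATCHES_PER_LEVEL = 2

    def __init__(self):
        # Maps level (high nibble of the vector) to the latched vectors.
        self.latched = {}

    @staticmethod
    def level(vector):
        """Return the level of a vector, i.e. its high nibble."""
        return (vector >> 4) & 0xF

    def deliver(self, vector):
        """Try to latch a vector. Returns False if the level is full,
        which is the case that would show up as a send accept error
        on the sending side."""
        slots = self.latched.setdefault(self.level(vector), [])
        if len(slots) >= self.LATCHES_PER_LEVEL:
            return False
        slots.append(vector)
        return True

    def service(self, vector):
        """Mark a latched vector as serviced, freeing its latch."""
        slots = self.latched.get(self.level(vector), [])
        if vector in slots:
            slots.remove(vector)


if __name__ == "__main__":
    apic = LocalApicModel()
    for vec in (0x30, 0x3A, 0x35, 0x41):
        print(hex(vec), "accepted" if apic.deliver(vec) else "rejected")
```

    With 0x30 and 0x3A latched, 0x35 is rejected while 0x41 (a different
    level) is still accepted. Once 0x30 is serviced, 0x35 can be latched.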

    You can force this to happen by manually binding too many IRQs that happen to
    be on the same "level" to one CPU, then causing a lot of interrupt traffic on
    those devices.
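
    Manual binding is normally done by writing a hexadecimal CPU bitmask to
    /proc/irq/<N>/smp_affinity. A minimal sketch, assuming your kernel
    exposes that file; the helper names are hypothetical, and bind_irq needs
    root and touches the live system, so treat it with care.

```python
def cpu_mask_hex(cpus):
    """Build the hex bitmask for a list of CPU numbers (CPU0 = bit 0)."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    return format(mask, "x")


def smp_affinity_path(irq):
    """Return the procfs path that controls the affinity of one IRQ."""
    return "/proc/irq/%d/smp_affinity" % irq


def bind_irq(irq, cpus):
    """Bind an IRQ to the given CPUs. Requires root privileges."""
    with open(smp_affinity_path(irq), "w") as handle:
        handle.write(cpu_mask_hex(cpus) + "\n")


if __name__ == "__main__":
    # Show what would be written to bind IRQ 19 to CPU1, without writing.
    print(smp_affinity_path(19), "<-", cpu_mask_hex([1]))
```

    Binding several busy IRQs whose vectors share a high nibble to the same
    CPU in this way is what sets up the overflow described above.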

    In order to avoid this problem, Linux spreads the IRQs among as many vector
    levels as possible. Still, the vector assignment is done before any devices
    have requested interrupts. You may get unlucky and have 3 devices on one
    level.
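
    The spreading can be sketched as follows. This is a simplified
    illustration and not the kernel's actual allocator: the stride of 8 and
    the constants FIRST_DEVICE_VECTOR, LAST_DEVICE_VECTOR and SYSCALL_VECTOR
    are assumptions chosen for the example. A stride of 8 puts two vectors
    on each level per pass, so the first pass never exceeds the two-latch
    limit, but a second pass does.

```python
from collections import Counter

# Assumed values for this sketch only.
FIRST_DEVICE_VECTOR = 0x31
LAST_DEVICE_VECTOR = 0xEE
SYSCALL_VECTOR = 0x80
STEP = 8


def assign_vectors(count):
    """Hand out `count` vectors, stepping by STEP and wrapping around
    with an increasing offset once the usable range is exhausted."""
    vectors = []
    offset = 0
    current = FIRST_DEVICE_VECTOR
    while len(vectors) < count:
        if current > LAST_DEVICE_VECTOR:
            offset += 1
            if offset >= STEP:
                raise ValueError("out of device vectors")
            current = FIRST_DEVICE_VECTOR + offset
        if current != SYSCALL_VECTOR:
            vectors.append(current)
        current += STEP
    return vectors


def vectors_per_level(vectors):
    """Count how many vectors share each level (high nibble)."""
    return Counter((vec >> 4) & 0xF for vec in vectors)


if __name__ == "__main__":
    for n in (24, 25):
        worst = max(vectors_per_level(assign_vectors(n)).values())
        print(n, "vectors -> at most", worst, "per level")
```

    Under these assumptions the first 24 vectors fit two to a level, and the
    25th becomes the third vector on level 3. Which devices end up sharing a
    level depends on the order in which vectors were handed out, not on how
    busy the devices are, which is where the bad luck comes in.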

    2) The interrupt cannot be delivered because something is wrong with it. This
    can happen if the kernel screws up and picks "clustered" APIC mode on a
    "flat" system or vice versa. A dual P3 system should be flat. Check your
    dmesg log to make sure it was properly detected. (This seldom happens unless
    you're doing interrupt development work in Linux.)

    3) Maybe the other CPU is broken and physically cannot accept the interrupt.
    Do any previous kernels boot?

    > Background:
    > This machine has been misbehaving for a while. I thought I had worked
    > around the problem by underclocking the FSB from 133 MHz to 100 MHz, but
    > that now looks like it was just a "reprieve". I have tried running "nosmp",
    > "pci=noacpi" and "noapic pci=noacpi" without success, and have resorted to
    > yanking the CPU out of this slot entirely. (I suspect that the CPU is fine,
    > however.) I have also restored the FSB to 133 MHz, so I am currently
    > running the SMP kernel on a single 933 MHz PIII.
    >
    > Cheers,
    > Chris
    >
    > -

    -- 
    James Cleverdon
    IBM xSeries Linux Solutions
    {jamesclv(Unix, preferred), cleverdj(Notes)} at us dot ibm dot comm
    
