Re: Oops in 2.6.19.1



On Sunday 31 December 2006 16:27, Adrian Bunk wrote:
On Sat, Dec 30, 2006 at 04:59:35PM +0000, Alistair John Strachan wrote:
On Thursday 28 December 2006 04:14, Alistair John Strachan wrote:
On Thursday 28 December 2006 04:02, Alistair John Strachan wrote:
On Thursday 28 December 2006 02:41, Zhang, Yanmin wrote:
[snip]

Here's a current decompilation of vmlinux/pipe_poll() from the
running kernel; the addresses have changed slightly. There's no
xchg there either:

Could you reproduce the bug with the new kernel, so we can get the
exact address and instruction of the bug?

It crashed again, but this time with no output (machine locked
solid). To be honest, the disassembly looks right (it's like Chuck
said, it's jumping back halfway through an instruction):

c0156f5f: 3b 87 68 01 00 00 cmp 0x168(%edi),%eax

So c0156f60 is 87 68 01 00 00..
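
The overlap can be checked by hand. The following is a minimal Python
sketch, not part of the original message, that decodes the six bytes
quoted above from offset 0 and from offset 1. It assumes 32-bit mode and
handles only the two opcodes involved (0x3b, CMP r32, r/m32, and 0x87,
XCHG r/m32, r32) with a base register plus displacement and no SIB byte.
The helper name decode() and the AT&T-style output format are
illustrative choices, not taken from any tool.

```python
# Illustrative sketch: decode the quoted bytes at two different offsets.
# Assumptions: 32-bit mode, no prefixes, no SIB byte, and only the two
# opcodes needed here (0x3b = CMP r32, r/m32; 0x87 = XCHG r/m32, r32).

REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

# Bytes of the instruction at c0156f5f, as quoted in the message.
CODE = bytes([0x3B, 0x87, 0x68, 0x01, 0x00, 0x00])


def decode(code, offset):
    """Return (text, length) for the instruction starting at `offset`."""
    opcode = code[offset]
    modrm = code[offset + 1]
    mod = modrm >> 6          # addressing mode
    reg = (modrm >> 3) & 7    # register operand
    rm = modrm & 7            # base register of the memory operand

    # mod 1 = 8-bit displacement, mod 2 = 32-bit displacement.
    # rm == 4 would mean a SIB byte follows, which this sketch skips.
    if mod not in (1, 2) or rm == 4:
        raise ValueError("addressing form not covered by this sketch")
    disp_len = 1 if mod == 1 else 4

    raw = code[offset + 2 : offset + 2 + disp_len]
    if len(raw) != disp_len:
        raise ValueError("not enough bytes for the displacement")
    disp = int.from_bytes(raw, "little", signed=True)
    mem = f"{disp:#x}(%{REGS[rm]})"

    if opcode == 0x3B:
        text = f"cmp {mem},%{REGS[reg]}"
    elif opcode == 0x87:
        text = f"xchg %{REGS[reg]},{mem}"
    else:
        raise ValueError(f"opcode {opcode:#x} not covered by this sketch")
    return text, 2 + disp_len


if __name__ == "__main__":
    print("offset 0:", decode(CODE, 0))  # the intended instruction
    print("offset 1:", decode(CODE, 1))  # entered one byte late
```

Under those assumptions, starting one byte late turns the tail of the
cmp into an xchg, which would be consistent with an xchg appearing at
the faulting address even though the disassembly of pipe_poll()
contains none.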

This is with the GCC recompile, so it's not a distro problem. It
could still either be GCC 4.x, or a 2.6.19.1 specific bug, but it's
serious. 2.6.19 with GCC 3.4.3 is 100% stable.

Looks like a similar crash here:

http://ubuntuforums.org/showthread.php?p=1803389

I've eliminated 2.6.19.1 as the culprit, and also tried toggling
"optimize for size" and various debug options. 2.6.19 compiled with GCC
4.1.1 on a Via Nehemiah C3-2 seems to crash in pipe_poll reliably,
within approximately 12 hours.

The machine passes 6 hours of Prime95 (a CPU stability tester), four
memtest86 passes, and there are no heat problems.

I have compiled GCC 3.4.6 and compiled 2.6.19 with an identical config
using this compiler (but the same binutils), and will report back if it
crashes. My bet is that it won't, however.

There are occasional reports of problems with kernels compiled with
gcc 4.1 that vanish when using older versions of gcc.

AFAIK, until now no one has ever debugged whether that's a gcc bug,
gcc exposing a kernel bug or gcc exposing a hardware bug.

Comparing your report and [1], it seems that if these are the same
problem, it's not a hardware bug but a gcc or kernel bug.

This bug specifically indicates some kind of miscompilation in a driver,
causing boot time hangs. My problem is quite different, and more subtle. The
crash happens in the same place every time, which does suggest determinism
(even with various options toggled on and off, and a 300K smaller kernel
image), but it takes 8-12 hours to manifest and only happens with GCC 4.1.1.

Unless we can start narrowing this down, it would be a mammoth task to seek
out either the kernel or GCC change that first exhibited this bug, due to the
non-immediate reproducibility of the bug, the lack of clues, and this
machine's role as a stable, high-availability server.

(If I had another Epia M10000 or another computer I could reproduce the bug
on, I would be only too happy to boot as many kernels as required to fix it;
however I cannot spare this machine).

--
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.


