Oops in do_page_fault

chase.venters_at_clientec.com
Date: 10/25/05

    Date:	Tue, 25 Oct 2005 15:00:24 -0400 (EDT)
    To: linux-kernel@vger.kernel.org
    
    

    Greetings,

    Please forgive me in advance for the length of this description - I don't
    want to leave out any important details.

    About two weeks ago I came home from work to find that the fan on my XFX
    GeForce 6800GT PCI-E had failed. My computer, which was previously playing
    music, was simply playing the last buffer-sized frame of audio repeatedly
    as if it were a skipping CD. I cursed, got frustrated, and ordered a
    7800GT to replace it.

    While I was waiting, I took advantage of my Asus P5GDC-V Deluxe's onboard
    Intel 915 graphics to get by in X. This was stable for a few days (prior
    to the graphics card failure, this system was stable for a year).
    Eventually, though, I was ripping a CD while listening to music and doing
    some other minor things when I noticed that the system started crashing in
    a very odd way.

    I managed to run dmesg from a remote shell that was open, and saw some
    really strange traces that I didn't manage to save. Pretty soon I realized that
    every process that tried to touch the disk would go into a
    TASK_UNINTERRUPTIBLE sleep and freeze. The system took about 30 seconds to
    crap out - amusingly enough, my music continued to play until the song was
    over; then that died too.

    Writing it off to a possible bug in the video driver, I rebooted and
    noticed ReiserFS doing lots of cleanup. I continued on my way until the
    system crashed again an hour later. Stability gradually grew much worse -
    I went from a year of stability, to days, to hours, to minutes...
    Eventually, I decided the best option would be to leave it alone until
    replacement parts arrived.

    In the meantime I ran memtest86 exhaustively to verify that my value RAM
    wasn't on the fritz.

    The replacement card arrived, and annoyed by what seemed to be excessive
    corruption on my partition, I used a LiveCD to set one disk in the RAID10
    to faulty, removed it, made an ext2 partition on it, moved my data to it
    (which thankfully fit), rebuilt the ReiserFS partition on the RAID, moved
    all the data back over, and resynced the disk.

    The system seemed to work perfectly for days. I was happy to have fixed
    the problem. Then, though, I noticed that my brand new fresh partition was
    kicking up very similar errors (I think I remember seeing something about
    vs-7000 nesting filesystem, as well as complaints about free space
    calculations). It took 5 minutes before the system froze during an "emerge
    traceroute". This time, the behavior got bad really fast. I could reliably
    reproduce the behavior by running "emerge traceroute" (the last thing I
    ever saw before death was portage checking /usr/share/doc). Twice it
    froze, and twice it actually *immediately* rebooted without even a
    visible panic.

    I replaced the motherboard with an identical motherboard, upgraded to a
    better cooler (CPU is a 540J Prescott 3.2GHz), and went from a 380 watt to
    a 500 watt dual 12v-rail supply. Strangely enough, after these changes, I
    can now reproduce the crash reliably, but I'm getting (depending on the
    kernel version I boot) different but consistent behavior each time.

    In 2.6.13, I get an Oops (translated by hand, sorry for inexact formatting):

    Oops 0000 #1
    PREEMPT SMP
    (list of modules linked in includes some alsa modules, nvidia.ko and
    sk98lin.ko)

    CPU 0
    EIP 0060:[<c01182c3>] Tainted: P VLI
    EFLAGS: 00010086 (2.6.13)
    EIP is at do_page_fault+0xa3/0x5db
    eax: f5e50000 ebx: 0000000b ecx: 0000000d edx: 0000000d
    esi: 0000000e edi: c0567451 ebp: 00000000 esp: f5e5a10c

    ds: 007b es: 007b ss: 0068

    2.6.13 will oops reproducibly as above upon completion of "emerge
    traceroute". Every 2.6.13 oops happens at do_page_fault+0xa3/0x5db.
    ebx and ecx are also observed to be constant.

    Oddly enough, the second Oops I got on 2.6.13 reported a CPU # 2949119.
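
As a reading aid, the fields in the hand-copied oops above can be decoded mechanically. The sketch below assumes the standard i386 layouts for the page-fault error code and the EFLAGS register; the helper names are hypothetical and exist only for this illustration. It is a rough decoder, not a diagnosis.

```python
# Decode a few fields from the 2.6.13 oops quoted above.
# Assumption: standard i386 bit layouts. Helper names are hypothetical.

def decode_page_fault_error(code):
    """Decode the low three bits of an i386 page-fault error code."""
    return {
        "cause": "protection violation" if code & 0x1 else "page not present",
        "access": "write" if code & 0x2 else "read",
        "mode": "user" if code & 0x4 else "kernel",
    }

# Architecturally defined EFLAGS bits of interest (bit 1 is reserved).
EFLAGS_BITS = {
    0: "CF", 2: "PF", 4: "AF", 6: "ZF", 7: "SF",
    8: "TF", 9: "IF", 10: "DF", 11: "OF", 16: "RF", 17: "VM",
}

def decode_eflags(value):
    """Return the names of the flags set in an EFLAGS value, low bit first."""
    return [name for bit, name in sorted(EFLAGS_BITS.items())
            if value & (1 << bit)]

if __name__ == "__main__":
    print(decode_page_fault_error(0x0000))  # "Oops 0000"
    print(decode_eflags(0x00010086))        # "EFLAGS: 00010086"
    print(hex(2949119))                     # odd CPU number from the second oops
```

Read this way, "Oops 0000" is a kernel-mode read of a page that was not present, and IF is clear in EFLAGS, so interrupts were disabled when the fault hit. The odd CPU number is 0x2cffff in hex, which is not a plausible CPU index and may simply mean the value was read from corrupted memory.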

    I also tested "emerge traceroute" on the same partition by booting
    2.6.11.7 and 2.6.12.4. Neither of these kernels produced an Oops or
    panic; both simply froze.

    My next step will be to try replacing the CPU (though I would really
    appreciate any comments as to whether I'm still likely looking at a
    hardware problem). I ordered a replacement CPU and got sent a 478 instead
    of a 775, so it looks like I'm going to have to go pick one up locally.

    I tried to rebuild my kernel with SysRQ and a serial console to be of
    better help; unfortunately, I can't seem to do enough IO before crashing
    to succeed.

    Thanks,
    Chase Venters

