Re: 2.4.22-pre lockups (now decoded oops for pre10)

From: Marcelo Tosatti (marcelo_at_conectiva.com.br)
Date: 08/07/03

    Date:	Thu, 7 Aug 2003 09:45:36 -0300 (BRT)
    To: Stephan von Krawczynski <skraw@ithnet.com>
    
    

    On Thu, 7 Aug 2003, Stephan von Krawczynski wrote:

    > On Wed, 6 Aug 2003 15:15:39 -0300 (BRT)
    > Marcelo Tosatti <marcelo@conectiva.com.br> wrote:
    >
    > > Stephan,
    > >
    > > I'm pretty worried about this problem.
    > >
    > > Your oopses seem to be the result of some kind of memory corruption. On
    > > the other oopses we could see the kernel oopsing on
    > > remove_page_from_hash_queue due to corrupted pointers (as Willy pointed
    > > out).
    > >
    > > Can you please try to crash your box again with
    > >
    > > CONFIG_DEBUG_SLAB=y
    > >
    > > Again, thanks a lot for your reports.
    >
    > Ok, I have two things.
    > First, another oops. I upgraded the system to rc1 yesterday and it did not
    > survive a single day. Here's the decoded oops; the box was "clean", meaning no
    > weird modules or the like:
    >
    >
    > ksymoops 2.4.8 on i686 2.4.22-rc1. Options used
    > -V (default)
    > -k /proc/ksyms (default)
    > -l /proc/modules (default)
    > -o /lib/modules/2.4.22-rc1/ (default)
    > -m /boot/System.map-2.4.22-rc1 (default)
    >
    > Warning: You did not tell me where to find symbol information. I will
    > assume that the log matches the kernel and modules that are running
    > right now and I'll use the default options above for symbol resolution.
    > If the current kernel and/or modules do not match the log, you can get
    > more accurate output by telling me the kernel version and where to find
    > map, modules, ksyms etc. ksymoops -h explains the options.
    >
    > Unable to handle kernel NULL pointer dereference at virtual address 00000004
    > c0145060
    > *pde = 00000000
    > Oops: 0002
    > CPU: 1
    > EIP: 0010:[<c0145060>] Not tainted
    > Using defaults from ksymoops -t elf32-i386 -a i386
    > EFLAGS: 00010283
    > eax: 00000000 ebx: c822feb4 ecx: c822fe60 edx: e07e7780
    > esi: 00000000 edi: e07e7780 ebp: f59bfe3c esp: f59bfe2c
    > ds: 0018 es: 0018 ss: 0018
    > Process nfsd (pid: 1737, stackpage=f59bf000)
    > Stack: f0cce7a0 00000001 f59bfe38 c822fe60 f0cce7f4 eec54ef4 00000000 e07e7760
    > f59be000 f59bfea8 c0183ef5 e07e7780 e07e77cc c02ed880 e07e7760 f8c84fc8
    > f59bfea8 dfe6c960 00000000 e07e7760 dfe6c960 00000000 f59c6e04 f59bfea8
    > Call Trace: [<c0183ef5>] [<f8c84fc8>] [<f8c856f1>] [<f8c8cee4>] [<f8c8e295>]
    > [<f8c923f4>] [<f8c80699>] [<f8c65938>] [<f8c923f4>] [<f8c91a38>] [<f8c91a58>]
    > [<f8c80411>] [<c010592e>] [<f8c80210>]
    > Code: 89 50 04 c7 41 54 00 00 00 00 c7 43 04 00 00 00 00 8b 44 24
    >
    >
    > >>EIP; c0145060 <fsync_buffers_list+50/1b0> <=====
    >
    > >>ebx; c822feb4 <_end+7e84c94/3852ee40>
    > >>ecx; c822fe60 <_end+7e84c40/3852ee40>
    > >>edx; e07e7780 <_end+2043c560/3852ee40>
    > >>edi; e07e7780 <_end+2043c560/3852ee40>
    > >>ebp; f59bfe3c <_end+35614c1c/3852ee40>
    > >>esp; f59bfe2c <_end+35614c0c/3852ee40>
    >
    > Trace; c0183ef5 <reiserfs_sync_file+65/d0>
    > Trace; f8c84fc8 <[nfsd]nfsd_sync+78/d0>
    > Trace; f8c856f1 <[nfsd]nfsd_commit+a1/b0>
    > Trace; f8c8cee4 <[nfsd]nfsd3_proc_commit+94/130>
    > Trace; f8c8e295 <[nfsd]nfs3svc_decode_commitargs+35/e0>
    > Trace; f8c923f4 <[nfsd]nfsd_procedures3+2f4/320>
    > Trace; f8c80699 <[nfsd]nfsd_dispatch+119/21d>
    > Trace; f8c65938 <[sunrpc]svc_process+4d8/570>
    > Trace; f8c923f4 <[nfsd]nfsd_procedures3+2f4/320>
    > Trace; f8c91a38 <[nfsd]nfsd_version3+0/10>
    > Trace; f8c91a58 <[nfsd]nfsd_program+0/28>
    > Trace; f8c80411 <[nfsd]nfsd+201/370>
    > Trace; c010592e <arch_kernel_thread+2e/40>
    > Trace; f8c80210 <[nfsd]nfsd+0/370>
    >
    > Code; c0145060 <fsync_buffers_list+50/1b0>
    > 00000000 <_EIP>:
    > Code; c0145060 <fsync_buffers_list+50/1b0> <=====
    > 0: 89 50 04 mov %edx,0x4(%eax) <=====
    > Code; c0145063 <fsync_buffers_list+53/1b0>
    > 3: c7 41 54 00 00 00 00 movl $0x0,0x54(%ecx)
    > Code; c014506a <fsync_buffers_list+5a/1b0>
    > a: c7 43 04 00 00 00 00 movl $0x0,0x4(%ebx)
    > Code; c0145071 <fsync_buffers_list+61/1b0>
    > 11: 8b 44 24 00 mov 0x0(%esp,1),%eax
    >
    >
    > 1 warning issued. Results may not be reliable.
    >
    >
    > As you can see reiserfs seems involved. Regarding reiserfs and my last postings
    > I can assure you that all reiserfs partitions were checked via reiserfsck right
    > before installation of rc1 - as Oleg advised - and found:
    > "Comparing bitmaps.. vpf-10640: The on-disk and the correct bitmaps differs"
    > I was told to use --fix-fixable option which I did and it indeed fixed the
    > problem. Trying reiserfsck after that found no errors any more. So I see no
    > chance that corrupt data on the media (through former crashes) is responsible
    > for this one. Hint: spelling in reiserfsck should be checked ;-)

    It might be a problem in reiserfs. You're getting oopses in different
    places with different stack traces, which is weird.

    I'll take a closer look at this oops now.

    > Second, I am re-installing the box with CONFIG_DEBUG_SLAB=y right now. Please tell
    > me if I should perform special steps (SYSRQ or the like) after the next crash
    > happens, or if the decoded oops will be sufficient.

    The decoded oops should be sufficient.


