Re: Errors and later panics in 2.6.0-test11.

From: Jens Axboe (axboe_at_suse.de)
Date: 12/04/03

  • Next message: Con Kolivas: "Re: 2.4.23-ck1"
    Date:	Thu, 4 Dec 2003 15:07:38 +0100
    To: Linus Torvalds <torvalds@osdl.org>
    
    

    On Thu, Dec 04 2003, Jens Axboe wrote:
    > On Wed, Dec 03 2003, Linus Torvalds wrote:
    > >
    > >
    > > On Wed, 3 Dec 2003, David Martínez Moreno wrote:
    > > >
    > > > I've just rebooted about six hours ago, and it's giving panics elsewhere:
    > > >
    > > > [...]
    > > > Ending XFS recovery on filesystem: md0 (dev: md0)
    > > > b44: eth0: Link is down.
    > > > b44: eth0: Link is up at 100 Mbps, full duplex.
    > > > b44: eth0: Flow control is on for TX and on for RX.
    > > > eth0: no IPv6 routers present
    > > > Unable to handle kernel paging request at virtual address 00100104
    > >
    > > That's the LIST_POISON stuff: 00100100 is the "bad list pointer". Somebody
    > > tried to remove a page twice.
    > >
    > > Doesn't mean a lot - if your "struct page" got corrupted, anything can
    > > happen. Quite possibly it's a double free.
    > >
    > > > I can rebuild the Debian mirror for not using the RAID and using the SATA
    > > > disks separately, but will be tomorrow, it's a lot of space to move, and I
    > > > need remote intervention.
    > > >
    > > > Anyway I'd love to know before doing if it will be useful, looking at what
    > > > Jens has said just ten minutes ago about RAIDs 0/5. Will it help to you? Say
    > > > so and I'll go for it.
    > >
    > > It might be more useful to leave it as RAID0, if you're willing to try out
    > > patches to try to debug this. The slab-debugging thing I sent out earlier
    > > is one such patch (but may well cause out-of-memory problems under load),
    > > and possibly the atomic-decrement checker patch (appended). And maybe Jens
    > > and Neil can come up with something..
    >
    > I can reproduce on raid5 with linear dm on top (using XFS). I need to
    > kill the slab and memory debugging, I've put some bio debugging in there
    > instead (the memory debugging interferes with it). It's definitely a bio
    > use after free case, clone_endio() ends up with a freed bio.
    >
    > Program received signal SIGEMT, Emulation trap.
    > 0xc02dd454 in handle_stripe (sh=0xc17cf630) at drivers/md/raid5.c:1009
    > 1009 wbi = wbi2;
    > (gdb) bt
    > #0 0xc02dd454 in handle_stripe (sh=0xc17cf630) at drivers/md/raid5.c:1009
    > #1 0xc02de31f in raid5d (mddev=0xdfd2d200) at drivers/md/raid5.c:1436
    > #2 0xc02e675a in md_thread (arg=0xdffdc1a0) at drivers/md/md.c:2692
    > #3 0xc010752d in kernel_thread_helper () at arch/i386/kernel/process.c:226
    >
    > wbi (dev->written) has already been freed by someone else.
    >
    > My puny 512MB test box cannot use your slab-debug patch :). The
    > atomic-checker didn't catch anything.

    I can just as easily reproduce with ext2, so I don't think XFS has
    anything to do with it. This is still raid5 with dm linear on top.

    (gdb) bt
    #0 0x0b10dead in ?? ()
    #1 0xc0170e25 in bio_endio (bio=0xda37fae0, bytes_done=0, error=0)
        at fs/bio.c:689
    #2 0xc02e8a0d in clone_endio (bio=0xda38c120, done=7168, error=0)
        at drivers/md/dm.c:266
    #3 0xc02dd78c in handle_stripe (sh=0xdfc18de0) at drivers/md/raid5.c:1209
    #4 0xc02de38f in raid5d (mddev=0xdfd62200) at drivers/md/raid5.c:1437
    #5 0xc02e67da in md_thread (arg=0xdfd58260) at drivers/md/md.c:2692
    #6 0xc010752d in kernel_thread_helper () at arch/i386/kernel/process.c:226
    (gdb) p *(struct bio *) 0xda37fae0
    $1 = {bi_sector = 42974, bi_next = 0x0, bi_bdev = 0xb10dead, bi_flags =
    1041,
      bi_rw = 1, bi_vcnt = 3, bi_idx = 0, bi_phys_segments = 0,
      bi_hw_segments = 0, bi_size = 0, bi_max_vecs = 0, bi_io_vec =
    0xda357ce0,
      bi_end_io = 0xb10dead, bi_cnt = {counter = 0}, bi_private = 0xb10dead,
      bi_destructor = 0xc0170050 <bio_destructor>, free_eip = 0xc02e8a26}

    EIP is bad, bio debug magic 0x0b10dead. bi_flags has the uptodate bit
    set, the clone bit, and the 10th bit (debug dead bit). The bio itself
    was freed by 0xc02e8a26, dm.c:clone_endio().

    -- 
    Jens Axboe
    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at  http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at  http://www.tux.org/lkml/
    

  • Next message: Con Kolivas: "Re: 2.4.23-ck1"

    Relevant Pages

    • Re: Errors and later panics in 2.6.0-test11.
      ... instead (the memory debugging interferes with it). ... It's definitely a bio ... clone_endioends up with a freed bio. ... send the line "unsubscribe linux-kernel" in ...
      (Linux-Kernel)
    • Re: Errors and later panics in 2.6.0-test11.
      ... On Thu, Dec 04 2003, Jens Axboe wrote: ... >> kill the slab and memory debugging, I've put some bio debugging in there ... > EIP is bad, bio debug magic 0x0b10dead. ... send the line "unsubscribe linux-kernel" in ...
      (Linux-Kernel)
    • Re: [PATCH 1/2] Avoid bio_endio recursion
      ... On Wed, 25 Jun 2008, Jens Axboe wrote: ... bi_end_io function wanting to do a segment loop on the bio. ... don't think you need to disable interrupts, ... Disabling interrupts is needed. ...
      (Linux-Kernel)
    • Re: [PATCH 1/2] Avoid bio_endio recursion
      ... On Tue, 24 Jun 2008, Jens Axboe wrote: ... bi_end_io function wanting to do a segment loop on the bio. ... Disabling interrupts is needed. ...
      (Linux-Kernel)
    • Re: disk IO directly from PCI memory to block device sectors
      ... On Mon, Sep 29 2008, Jens Axboe wrote: ... disk IO directly from PCI memory to block device ... The splice approaches look very interesting...thanks... ... Apart from fixing the bugs in it, it's also more clever in using the bio ...
      (Linux-Kernel)