[PATCH] NULL pointer deref in tcp_do_twkill_work()

olof_at_austin.ibm.com
Date: 02/27/04

  • Next message: Andrew Morton: "Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)"
    Date:	Fri, 27 Feb 2004 16:09:21 -0600 (CST)
    To: davem@redhat.com
    
    
    

    This bug has been encountered before (ref http://bugme.osdl.org/show_bug.cgi?id=1627),
    but that time it went away by itself. We're now seeing it again with
    another heavy network load on 2.6.3.

    Recap:

    We're crashing in tcp_do_twkill_work():

            tw_for_each_inmate(tw, node, safe,
                               &tcp_tw_death_row[slot]) {
                    __tw_del_dead_node(tw);
                    spin_unlock(&tw_death_lock);
                    tcp_timewait_kill(tw);
                    tcp_tw_put(tw);
                    killed++;
                    spin_lock(&tw_death_lock);
                    if (killed > quota) {
                            ret = 1;
                            break;
                    }
            }

    The crash is in __tw_del_dead_node, where tw is taken off of the
    tw_death_node list. At the time of the crash, the pprev list pointer is
    NULL, which indicates that tw is no longer on the list.

    I'm suspicious of the fact that the tw_death_lock is released, and since
    it's released, the "safe" inmate could be descheduled on another CPU. Once
    the lock is reaquired and the loop is redone, the new tw (old safe) is
    now already taken off the list so we go astray.

    Shouldn't the loop always restart from the beginning instead of using the
    (not thread-)"safe" list iteration? Since it keeps taking the entries off
    the list, the behaviour should be identical but it wouldn't be racy.

    The alternative is to not drop the lock, but I'm guessing we need to do
    that to avoid deadlocks. I don't know the TCP code well enough to tell for
    sure.

    Proposed patch is attached. We've given it a bit of runtime here but the
    crashes happen randomly so it'll take a while to gain full confidence that
    it's fixed. But the above reasoning around the patch seems to make sense.

    Thanks,

    Olof

    Olof Johansson Office: 4E002/905
    Linux on Power Development IBM Systems Group
    Email: olof@austin.ibm.com Phone: 512-838-9858
    All opinions are my own and not those of IBM

    
    

    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/



  • Next message: Andrew Morton: "Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)"