Re: PROBLEM: 2.4 oops: proc_pid_stat()



Thank you for these thoughts, Willy. Because the affected machine is a
server in another state, I first tried going ahead and upgrading to 2.6.17.3.
The machine has now been up for 91 days and counting; in the year since
the change that brought on the segfaults, I do not believe it had ever
run that long. So it seems that going from 2.4.32 to 2.6.17.3 either fixed
or happened to sidestep the issue. If I had another machine with the
problem I would be happy to investigate further but, unfortunately, this
machine needs to stay up.

I just wanted to let others who may have similar problems know, and to
thank you and Grant Coady for your feedback!


On Sun, Sep 17, 2006 at 07:50:32AM +0200, Willy Tarreau wrote:
On Sat, Sep 16, 2006 at 04:24:02PM -0700, Chris Frost wrote:
[1.] One line summary of the problem:
2.4.32 proc_pid_stat() repeatedly segfaults.

[2.] Full description of the problem/report:
2.4.32 kernel, after being up for a few days to a few weeks, repeatedly
segfaults in proc_pid_stat(), triggered by w, ps, and other programs.

[3.] Keywords (i.e., modules, networking, kernel):
kernel proc_pid_stat()

[4.] Kernel version (from /proc/version):
$ cat /proc/version
Linux version 2.4.32 (root@tiger) (gcc version 3.3.5 (Debian 1:3.3.5-13)) #1 Wed Dec 21 10:57:37 CST 2005

[5.] Output of Oops.. message (if applicable) with symbolic information
resolved (see Documentation/oops-tracing.txt)
See attached oops.w.log (triggered by w) and oops.ps.log (triggered by ps).

[6.] A small shell script or example program which triggers the
problem (if possible)
Once it starts oopsing, "w" and "ps aux" trigger an oops every time.

[7.] Environment
i386 Debian sarge on a network server in a closet.


<snip hardware information from original email>


[7.7.] Other information that might be relevant to the problem
(please look in /proc and include all information that you
think to be relevant):
ls -lR /proc does not trigger an oops.

[X.] Other notes, patches, fixes, workarounds:
I am not familiar with this aspect of Linux, but 2.4.33.3's proc_pid_stat()
appears to be identical. It also appears there have been several bug fixes
in 2.6 (including race condition corrections); perhaps there are issues
fixed in 2.6 whose patches have not been backported to 2.4?

The code in 2.6 is quite different. Your problem here does not seem related
to locking, because you can repeat it at will. I rather think that one of
your tasks has gone bad. I looked at the oops and compared it with the code.
In your case, the task->sig pointer equals 0x170, which is clearly wrong.
I suspect it's a kernel thread which goes mad, because user tasks should
not be able to write anything there.

When this problem happens, could you try to identify the faulty task?
Something like this should help:

# cd /proc
# for i in [0-9]*; do
    echo "Trying pid $i..."
    if ! cat "$i/stat" > /dev/null; then
      echo "!!! BAD PID : $i !!!"
    fi
  done

If it is a kernel thread (low pid), you will not be able to find
which one until you reboot, because the only other entry which
might return its name is "status" which also uses collect_sigign_sigcatch().
Otherwise, it is easy to find the full command line of the process:

# echo $(tr '\000' ' ' < /proc/$BADPID/cmdline)
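
The two steps above (probe every stat file, then look up the command line
of any PID that fails) could be combined roughly as follows. This is only a
sketch: the function name scan_proc and its optional directory argument are
hypothetical and not from the thread, and a process that exits between the
directory listing and the read will be reported as a false positive.

```shell
#!/bin/sh
# Sketch only: scan_proc and its optional argument are hypothetical names.
# Reports every numeric entry under the given proc-like directory whose
# stat file cannot be read, with its command line when one is available.
scan_proc() {
    proc_root="${1:-/proc}"
    for dir in "$proc_root"/[0-9]*; do
        [ -d "$dir" ] || continue
        pid="${dir##*/}"
        # cat runs as a child process, so this shell survives if the
        # reader is killed while reading stat.
        if ! cat "$dir/stat" > /dev/null 2>&1; then
            # cmdline is NUL-separated and is empty for kernel threads.
            cmdline=$( { tr '\000' ' ' < "$dir/cmdline"; } 2>/dev/null )
            echo "BAD PID $pid: ${cmdline:-<no cmdline, possibly a kernel thread>}"
        fi
    done
}
```

Running scan_proc with no argument scans /proc. On an affected kernel each
failed read would presumably still log an oops, just as with the loop above.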

The machine in question became unstable about a year ago, when an additional
hard drive and RAM were added and the kernel was upgraded from 2.2.19 (with
ext3) to 2.4.31. Therefore, there may certainly be a hardware issue playing
a role here. However, since the problem is completely reliable once it starts
within a given boot and occurs, given time, across reboots, it seems likely
that a software bug may be involved.

To be honest, I'm skeptical. This is the first report of such an easily
reproducible problem. Since you added RAM to your system, I would strongly
suggest running memtest on it for a full night. Random bit flips might
turn a null into non-null, causing some unexpected code paths to be taken.

--
Chris Frost | <http://www.frostnet.net/chris/>
-------------+----------------------------------
Public PGP Key:
Email chris@xxxxxxxxxxxx with the subject "retrieve pgp key"
or visit <http://www.frostnet.net/chris/about/pgp_key.phtml>


