Diagnosing occasional random reboots



A server which has been running steadily for years has started rebooting by itself. To the best of my knowledge, nothing has changed. It is a dual-processor PIII running Debian stable.

It is tucked away in the loft and usually has no monitor attached, so tracking this down is difficult. However, even if I brought it into a more convenient area, short of sitting staring at the screen waiting for a crash or reboot, I'm not sure it would help much.

I've tried rebuilding a newer kernel from backports.org and trimming it down as much as possible. There is nothing useful in syslog. A typical series of reboots, as reported by last, looks like:

dougie pts/0 tbird2xp:0.0 Tue Oct 31 17:15 still logged in
runlevel (to lvl 2) 2.6.17 Tue Oct 31 17:12 - 17:21 (00:08)
reboot system boot 2.6.17 Tue Oct 31 17:12 (00:08)
dougie pts/0 tbird2xp:0.0 Tue Oct 31 17:09 - crash (00:02)
runlevel (to lvl 2) 2.6.17 Tue Oct 31 16:59 - 17:12 (00:12)
reboot system boot 2.6.17 Tue Oct 31 16:59 (00:21)
dougie pts/0 tbird2xp:0.0 Tue Oct 31 16:05 - crash (00:54)
runlevel (to lvl 2) 2.6.17 Tue Oct 31 15:16 - 16:59 (01:43)
reboot system boot 2.6.17 Tue Oct 31 15:16 (02:04)
date new time Sun Oct 29 07:11
date old time Sun Oct 29 07:12
root pts/3 kitchens Sun Oct 29 07:11 - crash (2+08:04)
dougie pts/2 kitchens Sat Oct 28 20:29 - crash (2+19:46)
dougie pts/1 kitchens Sat Oct 28 11:37 - 16:04 (1+05:27)
dougie pts/0 tbird2xp:0.0 Fri Oct 27 13:16 - crash (4+03:00)


And the syslog shows nothing notable around the time of each reboot. Usually there are just lines from postfix as it processes the mail queue, then:

Oct 31 17:12:22 nick syslogd 1.4.1#17: restart (remote reception).
Oct 31 17:12:22 nick kernel: klogd 1.4.1#17, log source = /proc/kmsg started.
Oct 31 17:12:23 nick kernel: Inspecting /boot/System.map-2.6.17
Oct 31 17:12:23 nick kernel: Loaded 21314 symbols from /boot/System.map-2.6.17.
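
To check this systematically rather than by eye, something like the following could pull out the last line logged before every syslogd restart, along with how long the log had been silent. This is only a rough sketch in Python: the function names are made up for illustration, it assumes the classic "Mon DD HH:MM:SS" syslog timestamp seen above, and the year has to be supplied because syslog does not record it.

```python
from datetime import datetime


def parse_ts(line, year):
    """Parse the leading 'Oct 31 17:12:22' style timestamp of a syslog line."""
    # The first 15 characters hold the timestamp; the year is not logged,
    # so the caller has to supply it (an assumption of this sketch).
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def last_lines_before_restart(lines, year):
    """Return (previous_line, restart_line, gap_seconds) for each syslogd restart."""
    results = []
    prev = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # A fresh boot shows up as a 'syslogd ...: restart' line.
        if "syslogd" in line and "restart" in line and prev is not None:
            gap = (parse_ts(line, year) - parse_ts(prev, year)).total_seconds()
            results.append((prev, line, gap))
        prev = line
    return results
```

Passing it the lines of the syslog file (for example from open("/var/log/syslog"), a path that may differ between systems) would give one tuple per boot. If the previous line is always routine and the gap is always several minutes, that would be consistent with the machine dying without the kernel getting a chance to log anything, though it would not prove it.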

I'm not sure how to go about tracking this down. My searching of the archives shows that these symptoms could describe a faulty physical component, such as memory or the PSU. So my next step is probably going to be swapping the PSU and running memtest. One thing about the reboots is that they often appear to come in clusters. For example, on the morning of Oct 24 it looks like it was bouncing off and on for several hours, from about 6AM to 11AM:

# last reboot
reboot system boot 2.6.8 Wed Oct 25 05:03 (06:50)
reboot system boot 2.6.8 Wed Oct 25 04:31 (07:22)
reboot system boot 2.6.8 Tue Oct 24 11:09 (1+00:44)
reboot system boot 2.6.8 Tue Oct 24 10:59 (00:06)
reboot system boot 2.6.8 Tue Oct 24 09:52 (01:01)
reboot system boot 2.6.8 Tue Oct 24 09:50 (01:03)
reboot system boot 2.6.8 Tue Oct 24 09:49 (01:05)
reboot system boot 2.6.8 Tue Oct 24 09:37 (01:17)
reboot system boot 2.6.8 Tue Oct 24 09:05 (01:49)
reboot system boot 2.6.8 Tue Oct 24 08:53 (02:00)
reboot system boot 2.6.8 Tue Oct 24 08:51 (02:03)
reboot system boot 2.6.8 Tue Oct 24 07:28 (03:26)
reboot system boot 2.6.8 Tue Oct 24 07:26 (03:27)
reboot system boot 2.6.8 Tue Oct 24 07:24 (03:29)
reboot system boot 2.6.8 Tue Oct 24 07:01 (03:52)
reboot system boot 2.6.8 Tue Oct 24 06:18 (04:36)

I'm a bit stumped on how to solve this and would appreciate any thoughts on strategy.

Dougie




