Generate NMI to crash a hung system...




Hello,

We have a number of Intel ProLiant servers running Red Hat Enterprise
Linux 3. From time to time, for no apparent reason, these boxes just
hang. We can't SSH to them, but they still respond to ping. We connect
to the iLO and attempt to log in to the console, but after we enter the
username and password it hangs again and never gives us a command
prompt.

In an effort to try and figure out why this is happening, we need to
force a hung box in this state to perform a crash dump so we can send
it off to whoever for some analysis.

So, I set up a netdump-server and set up the server which crashes as a
client. Tested it and it all worked OK. I changed the kernel parameter
kernel.unknown_nmi_panic from 0 to 1 so that when I sent it an NMI it
should die on its arse and give me a nice dump...
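
For anyone wanting to double-check that the setting really took on a
given box, here is a minimal sketch in Python. It assumes the sysctl is
exposed at the usual procfs path /proc/sys/kernel/unknown_nmi_panic; the
function names are made up for illustration, and it only reads the
value, it does not change anything.

```python
from pathlib import Path

# Assumed procfs location of the kernel.unknown_nmi_panic sysctl.
NMI_PANIC_PATH = Path("/proc/sys/kernel/unknown_nmi_panic")


def unknown_nmi_panic_enabled(path=NMI_PANIC_PATH):
    """Return True if the sysctl file at `path` holds a non-zero integer."""
    return int(Path(path).read_text().strip()) != 0


def report(path=NMI_PANIC_PATH):
    """Describe the current setting; return a message instead of raising."""
    try:
        enabled = unknown_nmi_panic_enabled(path)
    except (OSError, ValueError) as exc:
        return f"could not read {path}: {exc}"
    state = "will panic" if enabled else "will NOT panic"
    return f"{path}: kernel {state} on an unknown NMI"


if __name__ == "__main__":
    print(report())
```

Running it on both a test box and the problem box would at least show
whether the two are configured the same way.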

However, the box hung this morning. So I logged onto the iLO and
generated an NMI, but all it did was dump a couple of log lines onto the
netdump server; no actual crash dump was produced.

Have I missed anything out here? I did this on another couple of RHEL 3
test boxes and got a lovely big vmcore file of about 4 GB on my
netdump server, but I'm getting nothing on the server I actually WANT a
crash dump from.

Thanks in advance - Lee


--
big_sid


