Re: Generate NMI to crash a hung system...



big_sid wrote:
Hello,

We have a number of Intel ProLiant servers running RedHat Enterprise
Linux 3. From time to time, and for no apparent reason these boxes just
hang. We can't SSH to them, but they still respond to a ping. We connect
to the iLO and attempt to login to the console, but after putting in the
username and password again it just hangs and won't actually give a
command prompt.

In an effort to try and figure out why this is happening, we need to
force a hung box in this state to perform a crash dump so we can send
it off to whoever for some analysis.

So, I set up a netdump-server and set up the server which crashes as a
client. Tested it and it all worked OK. I changed the kernel parameter
kernel.unknown_nmi_panic from 0 to 1 so that when I sent it an NMI it should
die on its arse and give me a nice dump...

However, the box went this morning. So I logged onto the iLO, generated
an NMI, but all it did was dump a couple of log lines onto the netdump
server, no actual crash dump was produced.

Have I missed anything out here? I did this on another couple of RHEL3
test boxes and got a lovely big vmcore file of about 4 gig on my
netdump server, but I'm getting nothing on the server I actually WANT a
crashdump from.

Thanks in advance - Lee


If it has crashed that badly, it may well no longer be able to dump anything to a file system.

In really bad cases we used to use a hardware emulator in place of the actual processor chip, though that is a very expensive option.

I suspect it has run out of resources somewhere, or is locked in a processor loop. The fact that ping still gets a response shows that kernel interrupts are at least being serviced. My guess is that it can't fork a process, though, because it has run out of something.

I've seen this behaviour on older UNIX boxes with process limits on them. Usually you get, or used to get, an 'err: fork failed: too many processes' or some such.

Not much help to you, but maybe will help get the brain started on a useful track.

We used to leave a root login running on the console. Sometimes a ps or top would work and show us what was clogging it up.

You could do worse than to write a cron script that periodically dumps system state to a log, so you can see what was happening PRIOR to the crash.
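
Something along these lines might do as a starting point for that cron script. It is only a rough sketch, not a tested tool: the log path is made up, the fields it records (process count, load average, free memory and swap) are just my guess at what would be useful, and it assumes a reasonably recent Python, so the older interpreter shipped with RHEL 3 would need it adapted.

```python
import os
import time


def read_text(path):
    """Return the contents of a file, or an empty string if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


def count_processes(proc_dir="/proc"):
    """Count numeric entries under /proc (one per process); -1 if unreadable."""
    try:
        return sum(1 for name in os.listdir(proc_dir) if name.isdigit())
    except OSError:
        return -1


def parse_meminfo(text):
    """Map each 'Name: value ...' line of /proc/meminfo to its first value token."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        tokens = rest.split()
        if sep and tokens:
            fields[key.strip()] = tokens[0]
    return fields


def snapshot_line(proc_dir="/proc"):
    """Build one timestamped log line describing the current system state."""
    mem = parse_meminfo(read_text(os.path.join(proc_dir, "meminfo")))
    load = read_text(os.path.join(proc_dir, "loadavg")).strip()
    return "%s procs=%d load=[%s] MemFree=%s SwapFree=%s" % (
        time.strftime("%Y-%m-%d %H:%M:%S"),
        count_processes(proc_dir),
        load,
        mem.get("MemFree", "?"),
        mem.get("SwapFree", "?"),
    )


if __name__ == "__main__":
    # Hypothetical log location; pick somewhere that survives a reboot.
    with open("/var/log/hang-snapshot.log", "a") as log:
        log.write(snapshot_line() + "\n")
```

Run it from cron every minute or so. If the process count climbs steadily, or free memory and swap drain away in the entries just before the box stops logging, that points at which resource it ran out of.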




