Re: 500+ processes spawned on my server in a "S" status by cron job and unresponsive ssh service
- From: Rahul <nospam@xxxxxxxxxxxxxx>
- Date: Wed, 24 Dec 2008 15:21:33 +0000 (UTC)
Florian Diesch <diesch@xxxxxxxxxxxxx> wrote in
news:dao926-7up.ln1@xxxxxxxxxxxxxxxxxxxxx:
> Most likely ssh is waiting to time out.
> For reliable remote monitoring I'd use a specialized application like
> nagios
Thanks Florian. That is exactly what I am doing now. The scripts are from
the days before I knew Nagios. I still have not ported to Nagios
entirely. Reason: Nagios seems excellent at monitoring "public" remote
services (ssh, dhcp, ping, smtp etc.) but I found it very messy (or
impossible? maybe I just haven't figured out a good way yet!) to see some
of the private stuff on remote servers that needs monitoring, e.g. my
pbs_mom, disk usage, etc. That's the monitoring I am still doing via an
ssh + remote command injection route.
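To make the ssh + remote command route concrete, here is a minimal sketch of one such private check: parsing "df -P" output and flagging nearly full filesystems. The helper name parse_df and the 90% threshold are invented for this example. It runs df locally; on the cluster the same lines would presumably come from something like "ssh -x -n $node df -P".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original scripts): given the lines
# of `df -P` output, return a hash of mount point => percent used.
sub parse_df {
    my (@lines) = @_;
    my %usage;
    for my $line (@lines) {
        # POSIX layout: Filesystem 1024-blocks Used Available Capacity Mounted-on
        # The header and any line without an "NN%" field are skipped.
        my ($pct, $mount) = $line =~ /\s(\d+)%\s+(\/.*?)\s*$/ or next;
        $usage{$mount} = $pct;
    }
    return %usage;
}

# Local demo; the 90% threshold is an arbitrary assumption.
my %usage = parse_df(`df -P`);
for my $mount (sort keys %usage) {
    my $flag = $usage{$mount} >= 90 ? "  <-- nearly full" : "";
    print "$mount: $usage{$mount}%$flag\n";
}
```

Keeping the parsing on the monitoring host means the remote side only needs a stock df, which is the main attraction of this route.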
> They don't require CPU time but use system resources, like memory.
Can I impose a user-specific maximum on the amount of time a process can
remain in this "S" state? Are there legitimate processes that'd remain in
the "S" state for a long time (order of days)?
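For context: "S" is interruptible sleep, and most daemons (sshd or cron waiting for work, for instance) sit in it between events, so days in "S" is normal by itself. As far as I can tell there is no per-user limit on time spent sleeping; "ulimit -t" counts CPU time only, which a sleeping process does not accumulate. A rough way to at least spot old sleepers is to filter ps output, sketched below. The helper names etime_to_seconds and long_sleepers and the one-day threshold are made up for this example, and it assumes a ps that accepts "-eo pid,stat,etime,args" (procps on Linux does).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert a ps "etime" value ([[dd-]hh:]mm:ss) to seconds.
# Returns undef if the string does not look like an etime.
sub etime_to_seconds {
    my ($etime) = @_;
    my ($d, $h, $m, $s) = $etime =~ /^(?:(?:(\d+)-)?(\d+):)?(\d+):(\d+)$/
        or return;
    return (($d // 0) * 24 + ($h // 0)) * 3600 + $m * 60 + $s;
}

# Hypothetical helper: from lines of `ps -eo pid,stat,etime,args`, return
# the processes in state S that have existed for at least $min_seconds.
sub long_sleepers {
    my ($min_seconds, @lines) = @_;
    my @found;
    for my $line (@lines) {
        my ($pid, $stat, $etime, $args) =
            $line =~ /^\s*(\d+)\s+(\S+)\s+(\S+)\s+(.*)$/
            or next;                      # also skips the header line
        next unless $stat =~ /^S/;
        my $age = etime_to_seconds($etime);
        next unless defined $age && $age >= $min_seconds;
        push @found, { pid => $pid, age => $age, args => $args };
    }
    return @found;
}

# Demo: list sleeping processes older than one day.
my @ps = `ps -eo pid,stat,etime,args`;
for my $p (long_sleepers(24 * 3600, @ps)) {
    printf "%6d  %8ds  %s\n", $p->{pid}, $p->{age}, $p->{args};
}
```

Filtering the args column for "ssh" would narrow this down to the leftover monitoring calls, which could then be killed from a cron job.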
Finally, I did have a timeout implemented as a sort of Perl hack
wrapper around my ssh block (I couldn't find a native ssh option that'd
do that).
It seems to work well, in the sense that the loop moves on past servers
that don't respond to ssh. The problem seems to be that these ssh calls
don't get killed but go into an "S" state. Maybe a Perl guru knows what
my net-borrowed snippet does wrong! Here it is:
#!/usr/bin/perl -w
# Usage: rcom <command>
# Execute <command> on each node in the cluster.
use strict;

my $rsh      = "ssh -x -n";
my $end_node = 256;
my @nodelist = map { sprintf("star%02u", $_) } 1 .. $end_node;
my $timeout  = 20;    # seconds

foreach my $node (@nodelist) {
    print "Executing on node $node\n";
    eval {
        local $SIG{ALRM} =
            sub { die "Sorry, timed out. Please try again\n" };
        alarm $timeout;
        # some operation that might take a long time to complete
        # (system() returns 0 when the command succeeds)
        if (system("$rsh $node '@ARGV'") == 0) {
            print "Executed on node $node\n";
        } else {
            print "Failed on $node\n";
        }
        alarm 0;
    };
}
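A possible explanation, offered as a sketch rather than a verified diagnosis: system() forks a child to run ssh, and the alarm only interrupts the parent. The die leaves the eval, the loop moves on, and the ssh child is never signalled, so it keeps sleeping on the network. One way around that is to fork and exec explicitly, so the parent knows the child's pid and can kill it on timeout. The helper name run_with_timeout is invented here, and the demo uses a local "sleep" as a stand-in for the ssh call. OpenSSH also has a ConnectTimeout option (ssh -o ConnectTimeout=20), though it only covers establishing the connection, not a remote command that hangs afterwards.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(:sys_wait_h);

# Hypothetical helper (not from the original script): run @cmd with a time
# limit and make sure the child is gone before returning.
# Returns (wait_status, timed_out_flag).
sub run_with_timeout {
    my ($timeout, @cmd) = @_;

    my $pid = fork();
    die "fork failed: $!\n" unless defined $pid;
    if ($pid == 0) {
        # Child: become the command directly, without a shell.
        exec { $cmd[0] } @cmd or POSIX::_exit(127);
    }

    my $timed_out = 0;
    eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm $timeout;
        waitpid($pid, 0);              # normal case: child finishes in time
        alarm 0;
        1;
    } or do {
        alarm 0;
        die $@ unless $@ eq "timeout\n";
        $timed_out = 1;
        kill 'TERM', $pid;             # ask the child to stop
        my $reaped = 0;
        for (1 .. 10) {                # give it about a second to exit
            if (waitpid($pid, WNOHANG) == $pid) { $reaped = 1; last }
            select(undef, undef, undef, 0.1);
        }
        unless ($reaped) {             # still alive: force it and reap
            kill 'KILL', $pid;
            waitpid($pid, 0);
        }
    };
    return ($?, $timed_out);
}

# Demo: "sleep 5" stands in for "ssh -x -n $node <command>".
my ($status, $timed_out) = run_with_timeout(1, 'sleep', '5');
print $timed_out ? "timed out, child killed\n"
                 : "finished with status $status\n";
```

In the loop above, the system() call would become something like run_with_timeout($timeout, 'ssh', '-x', '-n', $node, "@ARGV"), with the timed-out flag deciding between the "Executed" and "Failed" messages.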
--
Rahul