system meltdown with High IO, high load average
From: lostgeeks (lostgeeks_at_yahoo.com)
Date: 01/17/05
Date: 17 Jan 2005 09:35:00 -0800
We run a fairly large email system on Red Hat Linux, on the
2.4.21-20.ELsmp kernel, with 4GB of memory, attached to an IBM
TotalStorage DS4400 storage array via a QLogic QLA2312 Fibre Channel
host bus adaptor. We have about 700-800GB of data on the storage array,
and we use LVM/ext2.
We're seeing problems with high I/O wait (await) times even though
service times stay low.
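
For readers less familiar with the "iostat -x" columns: await is the average time a request spends queued plus being serviced, and svctm is the average service time alone, so their difference roughly approximates time spent waiting in the queue. A minimal Python sketch of that arithmetic follows; the function name and the sample numbers are illustrative assumptions, not measurements from this system.

```python
def queue_wait_ms(await_ms: float, svctm_ms: float) -> float:
    """Approximate per-request queue time from iostat -x columns.

    await_ms: average time queued plus serviced (the "await" column).
    svctm_ms: average service time alone (the "svctm" column).
    Clamped at zero because both inputs are rounded averages.
    """
    return max(await_ms - svctm_ms, 0.0)


if __name__ == "__main__":
    # Hypothetical values chosen only to illustrate "high await, low svctm".
    print(queue_wait_ms(await_ms=850.0, svctm_ms=4.0))
```

A large result alongside a small svctm would match the pattern described here: the array answers each request quickly, but requests sit in the queue for a long time first.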
When the problem occurs, the load average climbs to between 400 and 950.
The system remains very responsive; there appears to be plenty of idle
CPU. However, access to the disk array becomes very slow (e.g. if you
log in to the server, everything is fast until you run "ls -l" on a
large directory or try to view a file of more than trivial size on the
disk array). System metrics (from "iostat") show that the time to
complete I/O to the disk array remains low, but requests for I/O to the
disk array wait in queues for a long time. We see many imap and pop
processes all waiting on I/O, and eventually some of them appear to be
hung (the processes remain in "D" state -- uninterruptible sleep). With
email packages constantly making connections for new imap and pop
processes, and with existing processes not terminating as quickly, the
total number of processes climbs rapidly.
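
On Linux the load average counts tasks in uninterruptible sleep as well as runnable ones, which is consistent with a very high load alongside idle CPU. One way to watch the pile-up is to tally process states from /proc; the following is a minimal Python sketch, and the helper names parse_state and count_states are made up for illustration.

```python
import os
from collections import Counter


def parse_state(stat_line: str) -> str:
    """Return the one-letter state from a /proc/<pid>/stat line.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the LAST ')' rather than on whitespace.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(proc_root: str = "/proc") -> Counter:
    """Tally process states for every numeric directory under proc_root."""
    counts = Counter()
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return counts  # no procfs here (e.g. not running on Linux)
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                counts[parse_state(f.read())] += 1
        except (OSError, IndexError):
            continue  # process exited mid-scan or the line was malformed
    return counts


if __name__ == "__main__":
    states = count_states()
    print("uninterruptible (D):", states.get("D", 0))
    print("runnable (R):", states.get("R", 0))
```

Run periodically (for example from cron) and logged with a timestamp, the D count can be lined up against iostat samples taken over the same interval.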
Load averages tend to become very spiky -- jumping from an average of
12 to 200.
Post-mortem analysis shows no correlation between the problem and either
IOPS (I/O operations per second) or data sectors written per second.
However, the problem appears at least loosely correlated with periods of
sustained higher data sectors read from the disk array.
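
A loose correlation like this can be put on a number by computing a Pearson coefficient between two time-aligned series, such as sectors read per second and load average. The sketch below is one way to do that in plain Python; it is not the analysis actually performed here, and the sample series are invented purely to show the call.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical, made-up samples taken at the same timestamps.
    sectors_read_per_sec = [1200, 1500, 9800, 11000, 1300]
    load_average = [12, 15, 200, 400, 14]
    print(round(pearson(sectors_read_per_sec, load_average), 3))
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 indicate none; with spiky data it is worth repeating the calculation on the series shifted by a sample or two, since queue build-up may lag the read burst.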
Later, we added an additional 1GB of memory to the machine (for a total
of 5GB of system memory). We had no further problems that day, but the
problems returned the next day. Further analysis shows that after memory
was added and the server rebooted, the machine's I/O transactions per
second climbed without displaying the symptoms; however, at the same
time the number of sectors read per second was lower than before. So it
could mean the mix of work coincidentally changed after the reboot, but
this remains unclear.
Has anyone seen similar symptoms? We'd appreciate any pointers,
suggestions, comments or perhaps even questions. Very much
appreciated.
Thank you so very much,
- Derek