system meltdown with High IO, high load average

From: lostgeeks (lostgeeks_at_yahoo.com)
Date: 01/17/05


Date: 17 Jan 2005 09:35:00 -0800

We run a fairly large email system on Red Hat Linux, on the
2.4.21-20.ELsmp kernel with 4GB of memory, attached to an IBM
TotalStorage DS4400 storage array via a QLogic QLA2312 fibre channel
host bus adapter. We have about 700-800GB of data on the storage array,
and we use LVM/ext2.

We're seeing problems with high I/O wait times (await) even though
device service times stay low.
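
For readers less familiar with the "iostat -x" columns: await covers both
the time a request spends queued and the time the device spends servicing
it, while svctm covers only the servicing part. A minimal sketch of the
arithmetic follows; the function name and the sample numbers are
illustrative assumptions, not measurements from this system.

```python
def queue_wait_ms(await_ms, svctm_ms):
    """Estimate how long a request sat in the queue, in milliseconds.

    await_ms: the "await" column from iostat -x (queue time + service time)
    svctm_ms: the "svctm" column from iostat -x (service time only)
    """
    # Clamp at zero because the two columns are averages and can
    # disagree slightly on short sampling intervals.
    return max(await_ms - svctm_ms, 0.0)


if __name__ == "__main__":
    # Hypothetical values: a large await with a small svctm means that
    # nearly all of the latency is queueing, not the array itself.
    print(queue_wait_ms(250.0, 5.0))
```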

When the problem occurs, the load average climbs to between 400 and 950.
The system itself remains very responsive, and there appears to be
plenty of idle CPU. However, access to the disk array becomes very slow
(e.g. if you log in to the server, everything is fast until you run
"ls -l" on a large directory or try to view a file of more than trivial
size on the disk array). System metrics (from "iostat") show that the
time to complete I/O to the disk array remains low, but requests for I/O
to the disk array wait in queues for a long time. We see many imap and
pop processes all waiting on I/O, and eventually some of them appear to
hang (the processes remain in "D" state -- uninterruptible sleep). With
email packages constantly making connections for new imap and pop
processes, and with existing processes not terminating as quickly, the
total number of processes climbs rapidly.
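
On Linux the load average counts processes in uninterruptible sleep as
well as runnable ones, which is how the load can reach the hundreds while
the CPU stays idle. A minimal sketch for counting process states from
/proc is below; the function names are hypothetical, and it assumes the
usual /proc/<pid>/stat layout where the state letter follows the
parenthesised command name.

```python
import glob


def state_from_stat(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so locate the last ')' and skip the space after it.
    return stat_line[stat_line.rfind(")") + 2]


def count_states(proc_root="/proc"):
    """Count processes by state letter (e.g. 'D', 'R', 'S')."""
    counts = {}
    for path in glob.glob(proc_root + "/[0-9]*/stat"):
        try:
            with open(path, errors="replace") as f:
                state = state_from_stat(f.read())
        except (OSError, IndexError):
            continue  # the process exited while we were scanning
        counts[state] = counts.get(state, 0) + 1
    return counts


if __name__ == "__main__":
    counts = count_states()
    print("D (uninterruptible sleep):", counts.get("D", 0))
    print("R (runnable):", counts.get("R", 0))
```

Sampling this every few seconds alongside iostat would show whether the
D-state count tracks the load average during an incident.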

Load averages tend to become very spiky -- jumping from an average of
12 up to 200.

Post-mortem analysis shows no correlation between the problem and either
IOPS (I/O operations per second) or data sectors written per second.
However, the problem appears at least loosely correlated with periods of
sustained higher data sectors read from the disk array.
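
One way to put a number on "loosely correlated" is a Pearson coefficient
over per-interval samples. This is only a sketch of that calculation, not
the analysis described above: it assumes the iostat logs have been
exported into two equal-length lists (for example sectors read per second
and await), and the function name is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sample lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with >= 2 samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_x * var_y)
```

Values near +1 or -1 indicate a strong linear relationship and values
near 0 indicate none. Running it once per metric (reads, writes, IOPS)
against await would make the three comparisons directly comparable.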

Later, we added 1GB of memory to the machine (for a total of 5GB of
system memory). We had no further problems that day, but the problems
returned the next day. Further analysis shows that after the memory was
added and the server rebooted, the machine's I/O transactions per second
climbed without displaying the symptoms; at the same time, however, the
number of sectors read per second was lower than before. So it could be
that the mix of work coincidentally changed after the reboot, but this
remains unclear.

Has anyone seen similar symptoms? We'd appreciate any pointers,
suggestions, comments or perhaps even questions.

Thank you so very much,
- Derek


