Multi-core CPUs & our present Fault Finding capability



Firstly, I hope this is the correct mailing group to post on.
Apologies in advance if I have mailed to the wrong group.
Please advise.

--------------------

On one site I visit, they are running a 2.6.22.17-0.1 SMP x86_64 GNU/Linux
kernel from an openSUSE 10.3 distribution.

The system CPU is the new AMD Phenom(tm) 9600 Quad-Core Processor on a
GA-MA790FX-DQ6 (rev. 1.0) AMD 790FX Chipset motherboard, with 8GB of
memory.

I have noticed that at times (i.e. intermittently over a 96/128 hour
period) one (or worst case two) cores will be 100% hung, usually
because of some missed event or at best some failing/hard-waiting loop.
At these times the system keyboard is effectively dead, or at the very least
unresponsive (my typical test of toggling Caps Lock doesn't lead to any
LED status change). All other processes are running correctly; there is no
detectable memory leak and no visibly failing device. The
failing processes/tasks are locked solid, so gdb doesn't succeed in attaching
to them as a means of finding the failing party.

One can SSH into the failing system without trouble, but a "kill -9" doesn't
remove the suspect task(s); at best they are just zombie'd, according to "top",
which as a matter of course is always running on the console.
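
For anyone wanting to check the same thing on their own box, here is a minimal sketch of how the state of a suspect task can be read directly from /proc rather than from "top". The helper name task_state is hypothetical, made up for this example. It only prints the "State:" field of /proc/<pid>/status, which should at least tell a zombie (Z) apart from a task stuck in uninterruptible sleep (D).

```shell
#!/bin/sh
# Sketch only: task_state is a hypothetical helper name for this example.
# It prints the state of one task as the kernel reports it, e.g.
# "S (sleeping)", "D (disk sleep)" or "Z (zombie)".
task_state() {
    pid="$1"
    status_file="/proc/$pid/status"

    # No readable status file means the task does not exist (any more).
    if [ ! -r "$status_file" ]; then
        echo "no such task: $pid" >&2
        return 1
    fi

    # The line of interest looks like "State:<TAB>S (sleeping)";
    # strip the label and print the remainder.
    awk '/^State:/ { sub(/^State:[ \t]*/, ""); print; exit }' "$status_file"
}
```

Usage would be something like "task_state 1234" for each PID that refuses to die. A task in state D does not act on SIGKILL until it leaves that state, whereas a zombie has already exited and is only waiting for its parent to reap it.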

Via SSH one can 'echo "t" > /proc/sysrq-trigger', which dumps individual task
backtraces to syslog as you would expect, but I have so far failed to force any
kernel core dump at all (yes friends, ulimit is set correctly), no matter
what is tried.
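
To make the SysRq step above repeatable, a small wrapper along the following lines could be used. This is a sketch under stated assumptions: the function name sysrq_send and the SYSRQ_TRIGGER variable are hypothetical, introduced only for this example. The default path is the one quoted above, and the variable can be pointed at an ordinary file to rehearse the call without touching the kernel.

```shell
#!/bin/sh
# Sketch only: sysrq_send and SYSRQ_TRIGGER are hypothetical names.
# The default target is the trigger file used in the post; override
# SYSRQ_TRIGGER with a scratch file to do a harmless dry run.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

sysrq_send() {
    key="$1"

    # Accept exactly one character, e.g. "t" for the task backtrace dump.
    case "$key" in
        ?) ;;
        *) echo "usage: sysrq_send <single-key>" >&2; return 2 ;;
    esac

    # Writing to the real trigger file normally requires root.
    if [ ! -w "$SYSRQ_TRIGGER" ]; then
        echo "cannot write to $SYSRQ_TRIGGER (need root?)" >&2
        return 1
    fi

    # Write the key without a trailing newline.
    printf '%s' "$key" > "$SYSRQ_TRIGGER"
}
```

With this loaded, "sysrq_send t" as root is equivalent to the echo command above. Which other keys are available depends on the kernel version and configuration, so check the SysRq documentation shipped with the running kernel before sending anything else.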

So, now that you've got the situation, I hope you can see why I am interested in:

A> Developing a way to explicitly "reset" an individual CPU core, without
resorting to a complete power down. A built-in kernel "feature" like this
would be an excellent system management device to have in one's toolbox.

B> Establishing a kernel mechanism/call that would 100% guarantee a core
dump, irrespective of other kernel/system considerations.

Your comments and suggestions on how to implement and build such would be
most appreciated.

Thanks.

==============================
grahame ?aT? wild possum ?com?
==============================