user-space hangs - followup



I had previously posted regarding a problem that I'm seeing here. The
characteristics of the problem are that all kernel threads and interrupts
appear to continue working, but all user-space processes appear to hang.
Symptoms of the "hang" are that network traffic which passes THROUGH the
box continues to pass, and pings to the box succeed, but any TCP/UDP
traffic to or from the box will fail during the hang. Another quirk of
the situation is that the hang is not permanent, but will resolve itself
after a period of time (anywhere from 15 minutes to 8 hours).

Gil Hammond (GH) had suggested (on this newsgroup) that it sounded as
though all user processes were getting stacked up behind a lock/mutex or
other resource in the kernel; I've been trying to pursue this line of
reasoning further to research this problem (which is still occurring).

Another engineer that I spoke with elsewhere observed that it didn't
sound like a simple mutex deadlock-type issue, because a deadlock like
that should never recover. He observed that it sounded more like a
counter that was wrapping or otherwise going out of range. Based on that
theory, I've been looking back through my driver again - a good exercise
anyway, since I've been working on this driver for eight years now, and
there are some parts that I don't remember much anymore!!

One thing I'm finding is that we have code such as the following in many
parts of the driver:

    timeout_jiffies = jiffies + jiffies_from_ms(timeout_msec);
    while ((rc = test_some_other_condition(device)) &&
           (jiffies < timeout_jiffies)) {
        schedule();
    }
    // Did we succeed in getting the MUTEX flag?
    if (rc != 0) {
        return -1;
    }
    return 0;

I now note that this is actually vulnerable to counter-wrapping; for
example, if timeout_jiffies gets set to (MAX_JIFFIES - 5), and
test_some_other_condition() takes more than 5 jiffies, jiffies would
wrap to a small value, the test (jiffies < timeout_jiffies) would stay
true, and we'd wait a long while for the timeout to occur. (Mind you, we
have an 850 MHz Celeron in our machines, so wrapping should still only
take about 8.4 minutes.) Anyway, my question at this point is this:

Although the counter could wrap and leave me in this loop for a while,
I'm still calling schedule(), as I should. Now this will allow at least
kernel-space tasks to proceed, but what about user-space processes?
Would they end up blocked by this loop, or will they still get
scheduled?
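
For reference, the usual way to make such a deadline test survive a wrap
is to compare the signed difference of the two counter values, which is
what the kernel's time_before() macro in <linux/jiffies.h> does. Below is
a minimal user-space sketch of that idea, not the driver's code: tick_t,
naive_before(), wrap_safe_before() and demo_wrap_check() are hypothetical
names made up for illustration, and the counter is assumed to be 32 bits
wide.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's tick counter; assumed 32 bits wide here so
 * the wrap is easy to demonstrate. */
typedef uint32_t tick_t;

/* Naive test, as in the loop above: gives the wrong answer once the
 * counter has wrapped past the deadline. */
static int naive_before(tick_t now, tick_t deadline)
{
    return now < deadline;
}

/* Wrap-safe test, same idea as time_before(): the unsigned subtraction
 * wraps modulo 2^32, and reading the result as a signed value gives the
 * right ordering as long as the two values are less than 2^31 ticks
 * apart. */
static int wrap_safe_before(tick_t now, tick_t deadline)
{
    return (int32_t)(tick_t)(now - deadline) < 0;
}

/* Hypothetical scenario: the deadline sits 5 ticks short of the wrap
 * point, and the counter has since wrapped around to 10. */
static void demo_wrap_check(void)
{
    tick_t deadline = UINT32_MAX - 5u;
    tick_t now = 10u;

    assert(naive_before(now, deadline) == 1);     /* still "waiting": the bug */
    assert(wrap_safe_before(now, deadline) == 0); /* deadline has passed */
}
```

In the driver loop this would presumably amount to replacing
(jiffies < timeout_jiffies) with time_before(jiffies, timeout_jiffies).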

Dan Miller
