Obscure mutex problem
- From: David Given <dg@xxxxxxxxxxx>
- Date: Wed, 05 Sep 2007 00:43:35 +0100
I'm the developer of an SMTP greylisting daemon, spey. It has a number of
happy users and generally works quite well.
Unfortunately, it has a really strange bug that I'm at a complete loss as to
how to deal with. This is a fairly major issue: on some machines it stalls on
startup and hangs.
The problem is that on all the machines I have access to, it runs fine... and
since I can't duplicate the problem, I can't debug it! All I have are
anecdotal reports from users. I've managed to persuade some to run customised
versions and provide assorted debug logs, but there doesn't seem to be any
common factor to allow me to figure out why it works (reliably) on some
machines and fails (reliably) on others.
The way the program works is as follows: there are multiple threads of
execution, where normally only one is running at a time. This is controlled by
a single mutex that's normally locked. All other threads block on this mutex.
During I/O, or any other function that ought to take place in the background,
the mutex is released, so allowing another thread to run. (The reason for the
odd design is that it's a coroutine implementation.) The mutex is a perfectly
ordinary standard mutex that's initialised from the main thread before any
child threads are created. All the work happens in the child threads; the main
thread just sleeps.
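
To make the design concrete, here is a minimal sketch of the scheme described above. It is not the actual spey code: big_lock, in_background(), threadlet() and run_threadlets() are made-up names, and usleep() stands in for real I/O. Every threadlet holds the single mutex while it runs and drops it only around a blocking operation.

```cpp
#include <cassert>
#include <pthread.h>
#include <unistd.h>

// Hypothetical names throughout; this only illustrates the scheme.
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static int shared_counter = 0;   // only touched while big_lock is held

// Run a blocking operation with the lock released, so that another
// threadlet can run in the meantime, then take the lock back.
template <typename F>
static void in_background(F blocking_op)
{
    pthread_mutex_unlock(&big_lock);
    blocking_op();
    pthread_mutex_lock(&big_lock);
}

static void* threadlet(void*)
{
    pthread_mutex_lock(&big_lock);       // block until it is safe to run
    for (int i = 0; i < 1000; i++) {
        shared_counter++;                // safe: we hold the lock
        if (i % 100 == 0)
            in_background([] { usleep(100); });  // stand-in for I/O
    }
    pthread_mutex_unlock(&big_lock);
    return nullptr;
}

// Starts n threadlets (at most 16), waits for them, returns the counter.
int run_threadlets(int n)
{
    pthread_t threads[16];
    if (n > 16) n = 16;
    shared_counter = 0;

    pthread_mutex_lock(&big_lock);       // hold the lock while creating
    for (int i = 0; i < n; i++) {
        int rc = pthread_create(&threads[i], nullptr, threadlet, nullptr);
        assert(rc == 0);
        (void)rc;
    }
    pthread_mutex_unlock(&big_lock);     // now let the threadlets run

    for (int i = 0; i < n; i++)
        pthread_join(threads[i], nullptr);
    return shared_counter;
}
```

Since only the lock holder ever touches shared_counter, run_threadlets(4) should return 4000 no matter how the threadlets interleave.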
The issue is this:
The first thing child threads do when started is to block on the mutex,
causing them to suspend until it's safe for them to run. On some machines, the
threads never wake up.
Interesting things that I've observed include:
- at least one user reported that if the daemon was allowed to go into the
background, it would fail; but if it was told not to daemonise itself, it
would run fine.
- in studying the strace output, I notice that there doesn't seem to be a call
to futex() corresponding to the initial pthread_mutex_lock() from the main thread.
- when initially writing the code, I discovered that I had to set
PTHREAD_MUTEX_RECURSIVE to make the mutexes work at all... but then later
found this was no longer necessary. No doubt this is due to another change I
made, but I still don't understand it.
- rewriting the mutex initialisation code to use a statically initialised
mutex with PTHREAD_MUTEX_INITIALIZER, rather than initialising the mutex with
code, didn't help.
- my test machines include an ARM-based NSLU2 with linuxthreads (one kernel
process per thread) and an i386-based PC with NPTL (real kernel threads). It
works fine on both of these.
If anyone's interested in browsing the code, it's here:
http://spey.cvs.sourceforge.net/spey/spey/src/Threadlet.cc?view=markup
The main program calls Threadlet::initialise() to create and initialise the
mutex, which is then locked. The main program creates a thread that attempts
to lock the mutex, and so blocks. It then calls Threadlet::halt(), which
releases the mutex, allowing the new child thread to run, and then just does
for(;;) pause(). The child thread then acts as the socket master and creates
new (different) child threads for each incoming connection... except it
doesn't get that far. When the failure occurs, the child thread never wakes up.
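
Stripped of everything else, the startup sequence above looks roughly like the following sketch. Again this is not the real Threadlet.cc: initialise_lock(), release_lock(), socket_master() and run_startup_sketch() are stand-in names, the daemonising step is left out, and the main thread joins the child instead of looping on pause() so that the program terminates.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical stand-ins for the real startup code.
static pthread_mutex_t startup_lock;
static bool child_ran = false;

// Counterpart of the initialise step: create the mutex, then lock it.
static void initialise_lock()
{
    int rc = pthread_mutex_init(&startup_lock, nullptr);  // default attributes
    assert(rc == 0);
    (void)rc;
    pthread_mutex_lock(&startup_lock);
}

// Counterpart of the halt step, minus the endless pause() loop.
static void release_lock()
{
    pthread_mutex_unlock(&startup_lock);
}

// The child blocks here until the main thread lets go of the mutex.
static void* socket_master(void*)
{
    pthread_mutex_lock(&startup_lock);
    child_ran = true;                    // the real code would accept connections
    pthread_mutex_unlock(&startup_lock);
    return nullptr;
}

// Returns true if the child thread woke up and ran.
bool run_startup_sketch()
{
    child_ran = false;
    initialise_lock();                   // mutex created and held by main

    pthread_t child;
    if (pthread_create(&child, nullptr, socket_master, nullptr) != 0) {
        release_lock();
        pthread_mutex_destroy(&startup_lock);
        return false;
    }

    release_lock();                      // child may now acquire the mutex
    pthread_join(child, nullptr);        // the real main thread pauses forever
    pthread_mutex_destroy(&startup_lock);
    return child_ran;
}
```

On a machine showing the failure, the equivalent of run_startup_sketch() would presumably never return, because the child would stay blocked in pthread_mutex_lock().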
Does this issue look familiar to anyone? Any suggestions as to what I can try?
Any possible lines of attack? I'm completely stuck on this one...
--
┌── dg@cowlark.com ─── http://www.cowlark.com ───────────────────
│
│ "There does not now, nor will there ever, exist a programming language in
│ which it is the least bit hard to write bad programs." --- Flon's Axiom