Re: Obscure mutex problem



David Given wrote:
I'm the developer of an SMTP greylisting daemon, spey. It has a number of
happy users and generally works quite well.

[snip]

Interesting things that I've observed include:

- at least one user reported that if the daemon was allowed to go into the
background, it would fail; but if it was told not to daemonise itself, it
would run fine.

- in studying the strace output, I notice that there doesn't seem to be a call
to futex() corresponding to the initial pthread_mutex_lock() from the main thread.

- when initially writing the code, I discovered that I had to set
PTHREAD_MUTEX_RECURSIVE to make the mutexes work at all... but then later
found this was no longer necessary. No doubt this is due to another change I
made, but I still don't understand it.

- rewriting the mutex initialisation code to use a statically initialised
mutex with PTHREAD_MUTEX_INITIALIZER, rather than initialising the mutex with
code, didn't help.

- my test machines include an ARM-based NSLU2 with linuxthreads (one kernel
process per thread) and an i386-based PC with NPTL (real kernel threads). It
works fine on both of these.


If anyone's interested in browsing the code, it's here:

http://spey.cvs.sourceforge.net/spey/spey/src/Threadlet.cc?view=markup

The main program calls Threadlet::initialise() to create and initialise the
mutex, which is then locked. The main program creates a thread that attempts
to lock the mutex, and so blocks. It then calls Threadlet::halt(), which
releases the mutex, allowing the new child thread to run, and then just does
for(;;) pause(). The child thread then acts as the socket master and creates
new (different) child threads for each incoming connection... except it
doesn't get that far. When the failure occurs, the child thread never wakes up.


Does this issue look familiar to anyone? Any suggestions as to what I can try?
Any possible lines of attack? I'm completely stuck on this one...

I downloaded 0.4.1 from SourceForge but could not build it (I have no access to a Linux box). Some observations:

The fact you had to make the mutex recursive means that the same thread is taking the lock more than once. Since you didn't expect this, it suggests to me that the code is behaving in a way you did not design it to.

Your classes have virtual destructors defined, but no copy constructors. Inadvertent copying can wreak havoc with mutexes. It's a good idea to hide the copy constructor and assignment operator if your object tracks the state of a thread or contains mutexes. The rule of three has been good to me:

http://en.wikipedia.org/wiki/Rule_of_three_%28C%2B%2B_programming%29

BTW, if the mutex is unlocked when pthread_mutex_lock() is called, the lock is taken entirely in user space; no futex() system call is made, so nothing shows up in strace. (That is why it is called a futex: "fast userspace mutex".)

Good luck.
John