Re: Obscure mutex problem



David Given <dg@xxxxxxxxxxx> writes:

[...]

- when initially writing the code, I discovered that I had to set
PTHREAD_MUTEX_RECURSIVE to make the mutexes work at all... but then later
found this was no longer necessary. No doubt this is due to another change I
made, but I still don't understand it.

A 'recursive mutex' is one that may be locked multiple times by the
same thread (and must then be unlocked as often as it had been
locked). At least NPTL mutexes of type PTHREAD_MUTEX_ERRORCHECK
implement deadlock detection, meaning *if* you always check the
pthread_mutex_lock return value, you get an error of EDEADLK when
trying to relock a mutex the calling thread already holds. A mutex of
the default type gives no such guarantee and will typically just block
forever.
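
To make the difference concrete, here is a minimal sketch in C. The
helper name relock_result is made up for this illustration; it locks a
mutex of a given type twice from the same thread and returns what the
second pthread_mutex_lock call reported. It assumes a platform that
provides the PTHREAD_MUTEX_ERRORCHECK and PTHREAD_MUTEX_RECURSIVE types
(glibc with NPTL does). It is deliberately not meant to be called with
the default type, since that would just hang.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper: lock a mutex of the given type twice from the
 * calling thread and return the result of the second lock attempt. */
int relock_result(int type)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t mutex;
    int rc, second;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, type);
    assert(rc == 0);
    rc = pthread_mutex_init(&mutex, &attr);
    assert(rc == 0);
    pthread_mutexattr_destroy(&attr);

    rc = pthread_mutex_lock(&mutex);
    assert(rc == 0);

    /* The relock under test: always look at the return value. */
    second = pthread_mutex_lock(&mutex);

    /* Unlock once per successful lock, then clean up. */
    if (second == 0)
        pthread_mutex_unlock(&mutex);
    pthread_mutex_unlock(&mutex);
    pthread_mutex_destroy(&mutex);

    return second;
}
```

Called with PTHREAD_MUTEX_ERRORCHECK this should return EDEADLK, and
with PTHREAD_MUTEX_RECURSIVE it should return 0, because the second
lock merely increments the lock count.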

- rewriting the mutex initialisation code to use a statically initialised
mutex with PTHREAD_MUTEX_INITIALIZER, rather than initialising the mutex with
code, didn't help.

Why would it?

- my test machines include an ARM-based NSLU2 with linuxthreads (one kernel
process per thread) and an i386-based PC with NPTL (real kernel
threads).

Please don't perpetuate stupid myths about Linux threading
support. The Linux kernel knows nothing about 'processes' or
'threads'. It uses an abstraction called 'task' (struct task_struct)
which represents a 'locus of control' (SUS term) and contains pointers
to various resource structures, like an mm (address space) or a file
descriptor table. These resource structures are shareable among
different tasks. In UNIX(*), a 'process' is defined to be a set of
certain resources and n threads of control, n >= 1, using them. This
maps naturally onto the Linux kernel tasks, where all 'threads'
belonging to the same 'process' share this specific set of resource
structures among them, with the 'traditional' single-threaded process
being just the special case of a task not sharing resource structures
with any other task.

This is how pthread support for Linux has always been implemented, and
the NPTL changes regarding this were fairly cosmetic (like
reimplementing getpid to return not the still-existing task id but a
newly introduced thread group id, which is usually identical to the
former but is copied to the new struct task_struct whenever an
existing task makes a call that creates a new POSIX thread).
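
The task id / thread group id distinction can be observed from user
space. The following is a rough sketch in C, assuming Linux with NPTL;
the struct ids type and the function names are invented for this
example. It uses syscall(SYS_gettid) to fetch the per-task id, because
older glibc versions ship no gettid wrapper.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical record of what one thread sees. */
struct ids {
    pid_t pid;  /* thread group id, as returned by getpid() */
    pid_t tid;  /* id of the individual kernel task */
};

static void *record_ids(void *arg)
{
    struct ids *out = arg;

    out->pid = getpid();
    out->tid = (pid_t)syscall(SYS_gettid);
    return NULL;
}

/* Start a thread, let it record its ids, and wait for it.
 * Returns 0 on success or a pthread error number. */
int ids_in_new_thread(struct ids *out)
{
    pthread_t thread;
    int rc;

    rc = pthread_create(&thread, NULL, record_ids, out);
    if (rc != 0)
        return rc;
    return pthread_join(thread, NULL);
}
```

Under NPTL the new thread should report the same pid as the thread
that created it but a different tid, while for the initial thread of a
process both values are equal.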

[...]

Any possible lines of attack? I'm completely stuck on this one...

Something which may help: Request that one of the affected users
enables core dumps (ulimit -c unlimited) before starting the daemon
and kills it with a SIGQUIT after it has locked up (SIGQUIT should
never be handled by an application for exactly this reason). This
should result in a core file which can serve as an additional input
to gdb, allowing for post-mortem inspection of the state of the
process at the time it got the signal (eg getting stack backtraces
for all threads or examining the state of objects existing at the
time of the signal). Specifically, you will be able to determine
which thread (if any) has the mutex locked by looking at the owner
member of the corresponding structure.
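
As an illustration of what that owner member holds, here is a small
sketch in C. It assumes glibc's NPTL layout, where pthread_mutex_t
exposes an internal __data.__owner field containing the kernel task id
of the locking thread (0 when unlocked). This field is an
implementation detail, not part of any documented API, and other C
libraries lay the structure out differently. The function name
mutex_owner_tid is made up for this example.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: return the task id glibc recorded as the owner
 * of the mutex, or 0 if it is not locked. Relies on a glibc-internal
 * field (assumption: glibc with NPTL); this is the same value gdb
 * would show for the __data.__owner member in a core file. */
int mutex_owner_tid(const pthread_mutex_t *mutex)
{
    return mutex->__data.__owner;
}
```

In a live process, the value returned for a mutex locked by the
calling thread should match syscall(SYS_gettid). In a core file, the
same number can be compared against the LWP ids gdb lists for the
threads.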