Re: ipc/msg.c "cleanup" breaks fakeroot on Alpha



On Sun, Nov 05, 2006 at 05:02:39PM +0100, Manfred Spraul wrote:
Linus Torvalds wrote:

[ Removed the kernel mailing list ]


[kernel mailing list added back]

As far as I can tell, people hold one or the other, but not both, and
happily do strange things to "r_msg". The code seems to _know_ that it is
racy, since in addition to the volatile, it does things like:

...
msr->r_msg = NULL;
wake_up_process(msr->r_tsk);
smp_mb();
msr->r_msg = ERR_PTR(res);
...

and that memory barrier again doesn't really seem to make a whole lot of
sense.

msr is a msg_receiver structure. The structure is stored on the stack of
msr->r_tsk.
The smp_mb() guarantees that the wake_up_process is complete before
ERR_PTR(res) is stored into msr->r_msg.

OK, so matching code in sys_msgrcv() spins on the r_tsk pointer
until it becomes non-NULL. Can't say that I understand why the
first assignment to msr->r_msg cannot simply be ERR_PTR(res), but
I am probably missing something basic. But your explanation below
clears that up, thank you!

I also don't understand why the code in sys_msgrcv() doesn't have
to remap the msqid, similar to the way it is done in sys_semtimedop().

Is there some reference counter that I am failing to see? There is
the ipc_rcu_getref() on the send side, and something similar seems
like it would work on the read side -- though with additional cache
thrashing, so probably cheaper to remap the id.

So, what am I missing here? How does a msgrcv() racing with an rmid()
avoid taking a lock on a message queue that just got freed? (The
ipc_lock_by_ptr() in "Lockless receive, part 3".) My concern is the
following sequence of steps:

o expunge_all() invokes wake_up_process() and sets r_msg.

o sys_msgrcv() is awakened, but for whatever reason does
not actually start executing (e.g., lots of other busy
processes at higher priority).

o expunge_all() returns to freeque(), which runs through the
rest of its processing, finally calling ipc_rcu_putref().

o ipc_rcu_putref() invokes call_rcu() to free the message
queue after a grace period.

o ipc_immediate_free() is invoked at the end of a grace
period, freeing the message queue.

o sys_msgrcv() finally gets a chance to run, and does an
rcu_read_lock() -- but too late!!!

o sys_msgrcv() does ipc_lock_by_ptr() on a message-queue
data structure that has already been freed. Things
degrade rapidly from here...

Or is there some subtlety that prevents this?

If this problem is real, either an ipc_rcu_getref() before the
msg_unlock() before the schedule() and an ipc_rcu_putref() after
the rcu_read_lock() following the schedule(), both in sys_msgrcv()
on the one hand... Or remap the msqid after the rcu_read_lock()
following the schedule on the other.

Thanx, Paul

But I don't know. It clearly _tries_ to do some smart locking, I just
don't see what the rules are.



The codes tries to to a lockless receive:
- the mutex is only required to create/destroy queues.
- normal queue operations are protected by msg_lock(msqid), which is a
spinlock. One spinlock for each queue.
- if a receiving thread doesn't find a message, then it adds a
msg_receiver structure into msq->q_receivers. This linked list is stored
in the message queue structure and protected by msg_lock(msqid).
- if a sending thread finds a msg_receiver structure, then it removes
the structure from the msq->q_receivers linked list, places the message
into msr->r_msg and wakes up the receiver. All operations happen under
msg_lock(msqid)
- the receiver notices that there is a message in it's msr->r_msg
structure and copies it to user space, without acquiring msg_lock(msqid).

ipc/sem.c uses the same idea, I added a longer block with documentation
(around line 150 in ipc/sem.c)

I'm only aware of one tricky race:
- the sender calls wake_up_process().
- as soon as the receiver finds something in r_msg, it can return to
user space. user space can call exit(3). do_exit destroys the task
structure.
- wake_up_process will cause an oops if it's called after do_exit().
This race happened on s390. The solution is this block:

msr->r_msg = NULL;
wake_up_process(msr->r_tsk);
smp_mb();
msr->r_msg = ERR_PTR(res);

Initially, r_msg is ERR_PTR(-EAGAIN). The sender first sets it to NULL
("message pending"), then calls wake_up_process(), then a memory
barrier, then the final value.

Back to the bug report:
"volatile" shouln't be necessary. The critical part is this loop:

msg = (struct msg_msg*)msr_d.r_msg;
while (msg == NULL) {
cpu_relax();
msg = (struct msg_msg *)msr_d.r_msg;
}
And cpu_relax is a barrier().
On i386, removing the "volatile" has no effect, the .o remains identical.
Falk, could you compare the .o files with/without volatile? Are there
any differences?

The oops was caused by try_to_wake_up, called by expunge_all.
I.e.:
- either the msq->q_receivers linked list got corrupted
- or the target thread was destroyed before wake_up_process completed.
Theoretically, both things are impossible:
- msq->q_receivers is protected by msq_lock()
- the target thread task_struct is guaranteed to remain in scope due to
the "msg == NULL" loop.

I'll try to reproduce the oops on i386 - but I don't see a bug right now.

--
Manfred
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • Re: ipc/msg.c "cleanup" breaks fakeroot on Alpha
    ... The code seems to _know_ that it is racy, since in addition to the volatile, it does things like: ... One spinlock for each queue. ... - if a sending thread finds a msg_receiver structure, then it removes the structure from the msq->q_receivers linked list, places the message into msr->r_msg and wakes up the receiver. ... or the target thread was destroyed before wake_up_process completed. ...
    (Linux-Kernel)
  • netstat issue on Tru64. Kernel bug?
    ... I have noticed that many people were faced to the "netstat hangs" ... If a process writes in a message queue in such way that it overflows ... bash$ overflowQ ...
    (comp.unix.tru64)
  • Re: Need explaination of BeginInvoke timing
    ... it just queues the delegate invocation on the regular message queue for the control's owning thread. ... Actually, while you don't post enough code to be certain, it appears that the event handlers are just handlers for the regular Control events Enter, MouseUp, and MouseDown. ... The code is setting the flag, queuing a delegate that resets the flag which will be executed after all of the messages that are already in the queue, and then the remaining events are processed, including the MouseDown event and the delegate invocation that was queued via BeginInvoke. ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: Event handling mechanism in Windows
    ... The IRP that originated the request `points back` to the requestor thread. ... and the APC adds a message to the message queue of the foreground thread ...
    (microsoft.public.win32.programmer.kernel)
  • Re: Dumb RTOS Question
    ... shows up in the queue. ... I thought a message queue would work or have ... then suspend for a safe amount of time before looping back to check them ... I just have not figured out this kooky RTOS stuff. ...
    (comp.arch.embedded)