Q: wait queue weirdness in A/D device driver - 2.4.18 or 20, RH 8.0
From: Rex Andrew (randrew_at_BOGUS.apl.washington.edu)
Date: 11/16/03
Date: Sun, 16 Nov 2003 12:33:37 -0800
Hi all,
I've sort of dead ended after hunting for solutions for a week. You
probably know that burned-out feeling. Guess I need some wisdom.
The context: I'm trying to write a simple double buffering module for an
old ISA A/D card. The data capture occurs over DMA, so I set up 2 DMA
channels (the card supports this) and two DMA buffers. The idea is that
the driver interrupt handler will copy the filled DMA buffer to some
"master buffer" at interrupt time (and flip some flags) and wake up any
reading process. The read() function, upon entry, checks for available
data (via flags); if there is none, it puts itself to sleep. When it is
reawakened, it should be because the int handler has now placed data into
the master buffer. It then copies the data to user space and exits.
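For readers who want the flag logic spelled out, here is a minimal single-threaded user-space sketch of the scheme just described. It is not the driver code: the names (dma_buf, master_buf, filling, data_ready, on_dma_complete, try_read) and the buffer size are all made up for illustration, and the sleep/wake step is reduced to a return value.

```c
/* User-space model of the double-buffering idea; all names are hypothetical. */
#include <assert.h>
#include <string.h>

#define BUF_LEN 8                       /* assumed buffer size, illustration only */

static unsigned char dma_buf[2][BUF_LEN];   /* the two DMA buffers */
static unsigned char master_buf[BUF_LEN];   /* the "master buffer" */
static int filling;                     /* index of the buffer the card is filling */
static int data_ready;                  /* flag checked by the reader */

/* Models the interrupt handler: copy the filled buffer, flip, flag the data. */
void on_dma_complete(void)
{
    memcpy(master_buf, dma_buf[filling], BUF_LEN);
    filling ^= 1;                       /* the card now fills the other buffer */
    data_ready = 1;
    /* a real handler would wake the sleeping reader here */
}

/* Models read(): returns bytes copied, or 0 where the driver would sleep. */
int try_read(unsigned char *dst, int len)
{
    if (!data_ready)
        return 0;                       /* no data yet: the reader would sleep */
    if (len > BUF_LEN)
        len = BUF_LEN;
    memcpy(dst, master_buf, len);
    data_ready = 0;                     /* data consumed */
    return len;
}
```

Calling try_read() before on_dma_complete() returns 0, which corresponds to the point where the real read() would go to sleep; after on_dma_complete() it returns the copied byte count.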
OK, all this seems simple. In fact, I had it all working under 2.2.18 on a
DOS 486. Now I'm on a Pentium 4 machine running 2.4.18 or .20 and it's
failing. (I have upgraded the syntax to the "new 2.4" lingo as necessary.)
I have tried various combinations of compilers and kernels (Ack!) and now
I'm using gcc 2.95.3 for both kernel and module (due to a suggestion on
the net.) No change. This is RH 8.0, uniprocessor system. The problem
occurs with either the distro kernel or one I build myself.
I *think* I've traced the problem down. Here's a synopsis. NOTE: the terms
"**X**" indicate where I have placed printk's where I print out the
address of the waitqueue &Q, and the contents of the next and prev
pointers.
As a preliminary, let me introduce a simple test module which is like
Rubini's sleepy read and sleepy write example. This module is really
simple:
/*--------- tiny_module ----------- */
static DECLARE_WAIT_QUEUE_HEAD(Q);
/* ...... */
ssize_t read( blah )
{
    **3**
    sleep_on_interruptible(&Q);
    **4**
}
ssize_t write( blah )
{
    **5**
    wake_up_interruptible(&Q);
    **6**
}
/*------------------------------------ */
The idea here is that one user process tries to read the device, and the
driver puts the reader to sleep. Then another user process tries to write
the device, and the driver wakes up the reader, and both processes finish.
This works fine, and when it does, I observe the following expected
results: (A and B stand for addresses)
point  meaning              &Q  Q.task_list.next  prev  comment
-----  -------------------  --  ----------------  ----  --------------------------------------------
**3**  before read sleep    A   A                 A     proper initialization of wait_queue_head_t
**5**  at write entry       A   B                 B     B is the sleeping process on queue
**4**  read after wake up   A   A                 A     B has been popped by wake up, queue is empty
**6**  write after wake up  A   A                 A     ditto
I've expanded the macros and understand mostly what's going on.
sleep_on_interruptible (spelled interruptible_sleep_on() in the 2.4
source) has done an add_wait_queue() call, and that puts B into the queue.
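To make the table concrete, here is a minimal user-space sketch of the circular doubly-linked list that sits behind a wait queue head. The helpers are re-implemented here with names borrowed from list.h; replay_tiny_module() is a made-up function that simply walks through the probe points, and who actually unlinks the sleeper is glossed over.

```c
/* User-space model of the circular list behind a wait queue head.
 * Helper names mirror <linux/list.h>, but this is an illustration,
 * not kernel code. */
#include <assert.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* An empty list points at itself in both directions. */
void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Link a new entry right after the head. */
void list_add(struct list_head *entry, struct list_head *head)
{
    entry->next = head->next;
    entry->prev = head;
    head->next->prev = entry;
    head->next = entry;
}

/* Unlink an entry from the list it is on. */
void list_del(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
}

/* The emptiness test only looks at the next pointer. */
int list_empty(const struct list_head *head)
{
    return head->next == head;
}

/* Hypothetical walk through the probe points; returns 1 if every row matches. */
int replay_tiny_module(void)
{
    struct list_head q, sleeper;        /* q plays A, sleeper plays B */

    init_list_head(&q);
    if (!(q.next == &q && q.prev == &q))            /* **3** */
        return 0;

    list_add(&sleeper, &q);                         /* reader goes to sleep */
    if (!(q.next == &sleeper && q.prev == &sleeper) || list_empty(&q))
        return 0;                                   /* **5** */

    list_del(&sleeper);                             /* sleeper leaves the queue */
    return q.next == &q && q.prev == &q && list_empty(&q);  /* **4**, **6** */
}
```

With a single entry on the list, next and prev of the head both point at that entry, which is why the **5** row shows B in both columns.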
Now here's a rendition of my module:
/************ BIG_MODULE ************/
static DECLARE_WAIT_QUEUE_HEAD(Q);
/* ...... */
void handler( )
{
    printk(KERN_DEBUG "ISR\n");
    /* flip some bits */
    **1**
    wake_up_interruptible(&Q);
    **2**
}
/* ...... */
ssize_t read( blah )
{
    **3**
    sleep_on_interruptible(&Q);
    **4**
}
/********************************************/
The code blows up at the wake_up_interruptible() call in the interrupt
handler and sends the CPU into an oops, a kernel panic, and a lockup of
the console and the network connection. (Something is still alive inside,
though -- I can still see diagnostics from the interrupt handler at the
right interrupt rate, but now on the console instead of in dmesg!)
Diagnosis like that above reveals:
point  meaning            &Q  Q.task_list.next  prev  comment
-----  -----------------  --  ----------------  ----  ------------------------------------------
**3**  before read sleep  A   A                 A     proper initialization of wait_queue_head_t
**1**  at ISR entry       A   A (!!!!!)         B     B is the sleeping process on queue
**2**  CAN'T GET HERE
It appears that the add_wait_queue in the sleep_on_interruptible() macro
puts the entry on the tail of the queue (or something like that.) When the
queue in this configuration hits the wake_up_.. in the handler, it causes
bad things.
A simple test-and-branch in the handler like this:
    if (!list_empty(&Q))
        wake_up_interruptible(&Q);
    else
        printk(KERN_DEBUG "queue is empty");
will always take the else branch because the test condition expands to
Q.next==&Q. This condition is false for tiny_module once the reader process
has been put on the queue, but for BIG_MODULE -- for some reason -- the
condition is always true, even when there is an entry on the queue. (Of
course, this is not the desired functionality: the user space read()
call hangs.)
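As a user-space illustration of why that test takes the else branch, the sketch below rebuilds the pointer values printed at **1** (next = A, prev = B) and runs the same emptiness test on them. The struct and list_empty() mirror list.h; list_is_consistent() and make_observed_state() are made-up helpers, and B's own next/prev values are an assumption since I never printed them.

```c
/* User-space illustration only; struct and list_empty() mirror <linux/list.h>. */
#include <assert.h>

struct list_head {
    struct list_head *next, *prev;
};

/* The emptiness test: only the next pointer is consulted. */
int list_empty(const struct list_head *head)
{
    return head->next == head;
}

/* A healthy circular list is empty in both directions or in neither. */
int list_is_consistent(const struct list_head *head)
{
    return (head->next == head) == (head->prev == head);
}

/* Recreate the pointers seen at **1**: next = A (the head), prev = B. */
void make_observed_state(struct list_head *q, struct list_head *sleeper)
{
    q->next = q;            /* A */
    q->prev = sleeper;      /* B */
    /* B's own pointers were not printed; these values are assumed. */
    sleeper->next = q;
    sleeper->prev = q;
}
```

In that state list_empty() reports an empty queue while prev still points at B, so list_is_consistent() returns 0. This only restates the symptom; it says nothing about how the head got into that state.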
OK, I can't believe the kernel code is buggy, or nothing anywhere would
work. So therefore there must be some problem with accessing the local
wait queue. I've snooped into the macro expansions in sched.h, list.h and
wait.h, and *so far* I have been unable to get the queue to look right,
regardless of race-protected macros, dead-lock-protected macros, and
anything else. (like wait_event() etc.)
My interrupts are at most 5/second, so I do not suspect a re-entrancy
problem. I do not do a lot of spinlocking (yet!) because (a) I'm a
spinlock newbie, and (b) I don't understand why it would be a concurrency
problem (but it could...)
Putting the wake_up in a tasklet does not help. But I can successfully
make a tasklet queue and schedule a tasklet from the interrupt handler, so
I grasp the idea. :-D
I'm lucky because I know the exact problem, and it's repeatable every
time.
Please advise. Things I've considered:
1) The wait_queue_head_t has a wq_lock_t in it -- would it reveal
something interesting?
2) Do I have to disable IRQs in the read() before I add_wait_queue()?
No code example does this, tho.
3) The permissions on the /dev file are rwxrwxrwx. Does this make any
difference?
4) Should I activate CONFIG_DEBUG_WAITQ (where would I do this?) to make
the wait_queue structures contain debug info? (And then what?)
5) Is it correct that the spinlock codes expand to nothing on uniprocessor
machines? Then worrying about spinlocks would be meaningless unless I
rebuilt for a multiprocessor machine.
6) Figure out if I can get access to some internals with the SysReq key,
and dump something out.
This is an industrial motherboard with both ISA and PCI slots. (Should be
irrelevant, but.....)
I guess I'll post this and keep snooping around. Thanks for any help.
--rex