Re: in 2.6.23-rc3-git7 in do_cciss_intr



On Wed, Nov 19 2008, Miller, Mike (OS Dev) wrote:


-----Original Message-----
From: Randy Dunlap [mailto:randy.dunlap@xxxxxxxxxx]
Sent: Wednesday, November 19, 2008 11:23 AM
To: Miller, Mike (OS Dev)
Cc: Jens Axboe; scsi; James Bottomley; lkml; akpm
Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr

Miller, Mike (OS Dev) wrote:

-----Original Message-----
From: Jens Axboe [mailto:jens.axboe@xxxxxxxxxx]
Sent: Wednesday, November 19, 2008 2:52 AM
To: Randy Dunlap
Cc: scsi; Miller, Mike (OS Dev); James Bottomley; lkml; akpm
Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr

On Tue, Nov 18 2008, Randy Dunlap wrote:
Randy Dunlap wrote:
Randy Dunlap wrote:
Miller, Mike (OS Dev) wrote:
-----Original Message-----
From: Randy Dunlap [mailto:randy.dunlap@xxxxxxxxxx]
Sent: Thursday, September 25, 2008 3:40 PM
To: scsi
Cc: Jens Axboe; Miller, Mike (OS Dev); James Bottomley; lkml; akpm
Subject: Re: in 2.6.23-rc3-git7 in do_cciss_intr

On Thu, 25 Sep 2008 13:33:07 -0700 Randy Dunlap wrote:

Jens Axboe wrote:
On Thu, Sep 04 2008, Miller, Mike (OS Dev) wrote:
0x3bb2 <do_cciss_intr+1649>: mov 0x2(%r8),%dx
0x3bb7 <do_cciss_intr+1654>: test %dx,%dx
0x3bba <do_cciss_intr+1657>: je 0x3f0e <do_cciss_intr+2509>
$ addr2line -e cciss.o -f do_cciss_intr+0x627
SA5_fifo_full

/home/rdunlap/linsrc/linux-2.6.27-rc3-git7/drivers/block/cciss.h:206
OK ... that's confusing. It seems to be saying that ctrlr_info_t *
was NULL. However, I can't see a way of getting into the fifo_full
callback from do_cciss_intr ... especially not with a NULL host.

James
That is weird. Even if we could get there, fifo_full doesn't do
anything but wait for a bit.

Hi,

This just happened again. This time it's on 2.6.27-rc5-git3.
~Randy
Thanks Randy. I think. :)

I'll try to recreate in my lab.
This looks somewhat strange, mostly like 'c' is NULL and it's
oopsing in removeQ (I don't think Randy's analysis is correct in
assuming it's 'h' and it's in fifo_full). Given that 'c' cannot be
NULL, it's c->prev or c->next that are NULL.
This BUG: has happened (now) 5 times today. Higher frequency than
usual for some reason.

I enabled CCISS_DEBUG and added one printk in removeQ(). On the
first call
s/first/second/
to removeQ(), both c->next and c->prev are NULL.

Here's the kernel log output from cciss:
I added a printk() in addQ() as well. Here's the new output:

HP CISS Driver (v 3.6.20)
ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 54
cciss 0000:42:08.0: PCI INT A -> Link[LNKA] -> GSI 54 (level, high) -> IRQ 54
command = 147
irq = 36
board_id = 3211103c
cciss 0000:42:08.0: irq 87 for MSI/MSI-X
address 0 = fdf80000
cfg base address = 10
cfg base address index = 0
cfg offset = 400
Controller Configuration information
------------------------------------
Signature = CISS
Spec Number = 1
Transport methods supported = 0x6
Transport methods active = 0x3
Requested transport Method = 0x0
Coalesce Interrupt Delay = 0x0
Coalesce Interrupt Count = 0x1
Max outstanding commands = 0x256
Bus Types = 0x200000
Server Name =
Heartbeat Counter = 0x1672


Trying to put board into Simple mode
I counter got to 1 0
Controller Configuration information
------------------------------------
Signature = CISS
Spec Number = 1
Transport methods supported = 0x6
Transport methods active = 0x3
Requested transport Method = 0x0
Coalesce Interrupt Delay = 0x0
Coalesce Interrupt Count = 0x1
Max outstanding commands = 0x256
Bus Types = 0x200000
Server Name =
Heartbeat Counter = 0x1672


cciss0: <0x3238> at PCI 0000:42:08.0 IRQ 87 using DAC
cciss: intr_pending 8
cciss: addQ: Qptr=ffff88027e0100b8, c=ffff88007f83e000
cciss: removeQ: Qptr=ffff88027e0100b8, c=ffff88007f83e000, next=ffff88007f83e000, prev=ffff88007f83e000
Sending 7f83e000 - down to controller
cciss: addQ: Qptr=ffff88027e0100c0, c=ffff88007f83e000
cciss: intr_pending 8
cciss: Read 4 back from board
cciss: removeQ: Qptr=ffff88027e0100c0, c=ffff88007f840000, next=0000000000000000, prev=0000000000000000
BUG: unable to handle kernel NULL pointer dereference at 0000000000000248
Randy, can you post the debug patch you used? The above goes boom
when it attempts to remove a command that isn't on the list; the Qptr
in the last example should be empty, hence the oops. So I'd be
interested in seeing which removeQ() call this is. I'm assuming it's
this bit in do_cciss_intr():

        ...
                while (c->busaddr != a) {
                        c = c->next;
                        if (c == h->cmpQ)
                                break;
                }
        }
        /*
         * If we've found the command, take it off the
         * completion Q and free it
         */
        if (c->busaddr == a) {
                removeQ(&h->cmpQ, c);
                if (c->cmd_type == CMD_RWREQ) {
                        complete_command(h, c, 0);
        ...

If so, which part of the c lookup are you hitting: the one that does

        c = h->cmd_pool + a2;

or the c->busaddr check that is shown above?
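
The two lookup paths Jens is asking about can be sketched in isolation. This is a simplified user-space illustration, not the driver's code: struct cmd, lookup_by_index() and lookup_by_busaddr() are hypothetical names, and only the expression c = h->cmd_pool + a2 and the c->busaddr walk come from the thread. The point is that the index path hands back a pool entry without ever consulting the queue, so nothing guarantees the entry is linked on cmpQ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down stand-in for the driver's command struct. */
struct cmd {
    uint32_t busaddr;         /* address the controller reports on completion */
    struct cmd *next, *prev;  /* circular doubly linked queue links */
};

/*
 * Path 1: direct index into the command pool (c = pool + idx).
 * The bounds check is an addition for this sketch. Note that the
 * returned command may have next == prev == NULL, i.e. it may not
 * be on any queue at all.
 */
struct cmd *lookup_by_index(struct cmd *pool, size_t nr_cmds, size_t idx)
{
    return idx < nr_cmds ? &pool[idx] : NULL;
}

/*
 * Path 2: walk the circular completion queue comparing busaddr.
 * Unlike the quoted loop, this returns NULL when nothing matches,
 * so the caller cannot mistake the queue head for a hit.
 */
struct cmd *lookup_by_busaddr(struct cmd *head, uint32_t a)
{
    struct cmd *c = head;

    if (c == NULL)
        return NULL;
    do {
        if (c->busaddr == a)
            return c;
        c = c->next;
    } while (c != head);
    return NULL;
}
```

A command found through the second path is on the queue by construction; one found through the first path is only known to exist in the pool.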

--
Randy,
I still can't reproduce this bug. I have your config file
on a BL465c w/e200i. Just to confirm, you only see this at
init time, correct?

Yes, only at init time.

Please post your debug patch as Jens requested.

Done (separately).

I need to back up a bit. Yesterday these BUGs happened consistently,
so I wondered why. Then I recalled that for debugging another
bug/problem, I had changed the test system's normal boot kernel from
2.6.25 to 2.6.18-8. The test system is used to build and then boot
the new kernel *via kexec*, so it's quite possible (or certain) that
something in the kexec world has been fixed since 2.6.18. I don't
recall seeing this problem lately when using 2.6.25 to kexec/boot the
new test kernel, so I'm quite willing to drop the bug for now and then
re-open it if I see the problem again. OK??

Ahhhh, the kexec piece was missing. Now I don't feel quite so
clueless. I'm OK with dropping the bug for now. Jens, James?

Yeah, kexec is definitely a clue. My guess is that we got some sort of
leftover completion. Regardless of the status of this particular bug,
I think it would be a good idea to add some checks for when we attempt
to remove a command from a queue it isn't currently on.

--
Jens Axboe
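
One way the check Jens suggests could look is sketched below. It is a minimal user-space illustration, not a patch against cciss: struct cmd, add_q() and remove_q() are hypothetical names, and the convention that an unqueued command has NULL links is an assumption taken from the debug output above (next=0, prev=0 on the command that oopsed).

```c
#include <stddef.h>

/* Hypothetical stand-in for a queued command; only the links matter here. */
struct cmd {
    struct cmd *next, *prev;  /* both NULL while the command is unqueued */
};

/* Append c to the circular doubly linked queue whose head pointer is *q. */
void add_q(struct cmd **q, struct cmd *c)
{
    if (*q == NULL) {
        *q = c;
        c->next = c->prev = c;
    } else {
        c->prev = (*q)->prev;
        c->next = *q;
        (*q)->prev->next = c;
        (*q)->prev = c;
    }
}

/*
 * Remove c from the queue. Returns 0 on success and -1 if c is not
 * linked anywhere or the queue is empty, instead of dereferencing
 * NULL links. Clearing the links afterwards lets a second removal
 * of the same command be caught as well.
 */
int remove_q(struct cmd **q, struct cmd *c)
{
    if (c == NULL || *q == NULL || c->next == NULL || c->prev == NULL)
        return -1;

    if (c->next == c) {
        if (*q != c)
            return -1;  /* sole member of some other queue */
        *q = NULL;
    } else {
        if (*q == c)
            *q = c->next;
        c->prev->next = c->next;
        c->next->prev = c->prev;
    }
    c->next = c->prev = NULL;
    return 0;
}
```

The check is cheap but incomplete: it detects a command that is on no queue, not one that is linked on a different multi-element queue. A caller in interrupt context would presumably log the failure and skip the completion rather than continue with it.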



