"Oops" in 2.6.9 SCSI w/ usb-storage & multi-LUN

From: Stephen Warren (SWarren_at_nvidia.com)
Date: 12/21/04

  • Next message: Alan Cox: "Re: Cyrix 6x86 Comma Bug 2.6"
    Date:	Mon, 20 Dec 2004 17:23:44 -0800
    To: <linux-kernel@vger.kernel.org>
    
    
    

    We have found a reproducable kernel problem (looks like an oops, but
    isn't...) in 2.6.9 somewhere related to the usb-storage driver, the SCSI
    stack, and in particular multi-LUN devices. We have the SCSI multi-LUN
    kernel configuration option enabled.

    The basic scenario is that our application monitors for USB hotplug
    notifications, and whenever a device is plugged, it will open that
    device, and whenever the device is unplugged, it will close the
    previously opened file descriptor. The monitoring is done by listening
    to hotplug events on a netlink socket. We've back-ported the netlink
    hotplug code from 2.6.10-rc3 to our 2.6.9 kernel.

    The problem is, that when we unplug a device and go on to close the
    filehandle, we get a kernel oops-like message. Our test application
    then hangs and can't be killed. This seems to be guaranteed to happen
    with multi-LUN devices (e.g. multi-format memory "stick" readers) - in
    this case, we open e.g. 4 devices on plug, and close those 4 devices on
    unplug. The second close call hangs our application, and the oops-like
    message is generated a few seconds after.

    This seems to be somewhat timing sensitive. We have a test app written
    in C++ that does exhibit the problem very reliably on two machines. The
    simplified non-C++ version doesn't show the problem!

    Both of these test applications are attached. Also attached is a patch
    against 2.6.9 that adds netlink support to it, so you can try these test
    apps under 2.6.9.

    I've tried both test application under 2.6.10-rc3, and everything seems
    to work fine on that kernel. I see there were a lot of USB changes in
    2.6.10.

    Does anyone know what change from 2.6.10 fixed this specific issue. Is
    it something that's easy to isolate and back-port to 2.6.9?

    Here are relevant portions from /var/log/messages

    Dec 20 11:50:56 localhost kernel: usb 1-1: new high speed USB device
    using address 84
    Dec 20 11:50:56 localhost kernel: scsi80 : SCSI emulation for USB Mass
    Storage devices
    Dec 20 11:50:56 localhost kernel: Vendor: Generic Model: STORAGE
    DEVICE Rev: 9139
    Dec 20 11:50:56 localhost kernel: Type: Direct-Access
    ANSI SCSI revision: 02
    Dec 20 11:50:56 localhost kernel: Attached scsi removable disk sda at
    scsi80, channel 0, id 0, lun 0
    Dec 20 11:50:56 localhost kernel: Vendor: Generic Model: STORAGE
    DEVICE Rev: 9139
    Dec 20 11:50:56 localhost kernel: Type: Direct-Access
    ANSI SCSI revision: 02
    Dec 20 11:50:56 localhost kernel: Attached scsi removable disk sdb at
    scsi80, channel 0, id 0, lun 1
    Dec 20 11:50:56 localhost kernel: Device not ready. Make sure there is
    a disc in the drive.
    Dec 20 11:50:56 localhost kernel: Vendor: Generic Model: STORAGE
    DEVICE Rev: 9139
    Dec 20 11:50:56 localhost kernel: Type: Direct-Access
    ANSI SCSI revision: 02
    Dec 20 11:50:56 localhost kernel: Attached scsi removable disk sdc at
    scsi80, channel 0, id 0, lun 2
    Dec 20 11:50:56 localhost kernel: Vendor: Generic Model: STORAGE
    DEVICE Rev: 9139
    Dec 20 11:50:56 localhost kernel: Type: Direct-Access
    ANSI SCSI revision: 02
    Dec 20 11:50:56 localhost kernel: Attached scsi removable disk sdd at
    scsi80, channel 0, id 0, lun 3
    Dec 20 11:50:56 localhost kernel: Device not ready. Make sure there is
    a disc in the drive.
    Dec 20 11:50:56 localhost last message repeated 2 times
    Dec 20 11:50:57 localhost scsi.agent[30093]: disk at
    /devices/pci0000:00/0000:00:02.2/usb1/1-1/1-1:1.0/host80/80:0:0:0
    Dec 20 11:50:57 localhost scsi.agent[30111]: disk at
    /devices/pci0000:00/0000:00:02.2/usb1/1-1/1-1:1.0/host80/80:0:0:1
    Dec 20 11:50:57 localhost scsi.agent[30129]: disk at
    /devices/pci0000:00/0000:00:02.2/usb1/1-1/1-1:1.0/host80/80:0:0:2
    Dec 20 11:50:57 localhost scsi.agent[30147]: disk at
    /devices/pci0000:00/0000:00:02.2/usb1/1-1/1-1:1.0/host80/80:0:0:3
    Dec 20 11:51:08 localhost kernel: usb 1-1: USB disconnect, address 84
    Dec 20 11:51:18 localhost kernel: usb-storage: Error in bus_reset:
    invalid state 1028932437
    Dec 20 11:51:18 localhost kernel: scsi: Device offlined - not ready
    after error recovery: host 80 channel 0 id 0 lun 0
    Dec 20 11:51:18 localhost kernel: 80:0:0:0: Illegal state transition
    deleted->offline
    Dec 20 11:51:18 localhost kernel: Badness in scsi_device_set_state at
    /home/swarren/p4_wa/swarren-linux-alt/embedded/dvd/new_kernel-2.6/linux-
    nvidia/drivers/scsi/scsi_lib.c:1688
    Dec 20 11:51:18 localhost kernel: [<c03c6cf0>]
    scsi_device_set_state+0xc4/0x112
    Dec 20 11:51:18 localhost kernel: [<c03c4b8b>]
    scsi_eh_offline_sdevs+0x64/0x80
    Dec 20 11:51:18 localhost kernel: [<c03c4ffa>]
    scsi_unjam_host+0xc5/0xce
    Dec 20 11:51:18 localhost kernel: [<c03c50b5>]
    scsi_error_handler+0xb2/0xda
    Dec 20 11:51:18 localhost kernel: [<c03c5003>]
    scsi_error_handler+0x0/0xda
    Dec 20 11:51:18 localhost kernel: [<c0103d39>]
    kernel_thread_helper+0x5/0xb

    -- 
    Stephen Warren, Software Engineer, NVIDIA, Fort Collins, CO
    swarren@nvidia.com        http://www.nvidia.com/
    swarren@wwwdotorg.org     http://www.wwwdotorg.org/pgp.html
    
    
    
    

    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at http://www.tux.org/lkml/





  • Next message: Alan Cox: "Re: Cyrix 6x86 Comma Bug 2.6"

    Relevant Pages

    • Abstuerze nach Festplattentausch: Festplatte defekt?
      ... Platte etwas an die SD-RAM Module gekommen. ... Nun, koennte die Festplatte vielleicht einen Defekt haben, so dass das ... Der Kernel lief so auf der alten Platte mehrere Monate fehlerfrei. ... Aug 6 16:30:56 localhost kernel: Unable to handle kernel NULL pointer ...
      (de.comp.os.unix.linux.hardware)
    • Problem with cpufreq on my laptop (athlon 64 3700+)
      ... May 7 19:33:48 localhost cpufreq: Déchargement des modules cpufreq: ... May 7 19:33:48 localhost kernel: Unable to handle kernel NULL pointer ... May 7 19:33:48 localhost kernel: printing eip: ...
      (comp.os.linux.portable)
    • invalid opcode, recursive fault, 2.6.17
      ... note it occurred immediately following a syslogd restart. ... I am running a stock FC5 kernel on an AMD AthlonXP 1500+ ... Aug 6 18:09:16 localhost kernel: EIP is at 0xca05768d ... f6612e98 esp: f6612c28 ...
      (Linux-Kernel)
    • kernel oops 2.6.8 SMP
      ... I am not on the kernel mailing list, I would appreciate CC of replies. ... OOPS, sometimes fatal, sometimes not. ... Sep 30 13:53:20 localhost kernel: Unable to handle kernel NULL pointer ... Sep 30 13:53:20 localhost kernel: printing eip: ...
      (Linux-Kernel)
    • [PATCH 18-rc2] Fix typos in /Documentation : N-P
      ... Again, if you're not gonna do synchronization with disk drives (dang, ... -the kernel. ... There are two options specific to PSX driver portion. ... The driver uses the settings from the EEPROM set in the SCSI BIOS ...
      (Linux-Kernel)