Re: Random scsi disk disappearing



On Thu, Aug 17, 2006 at 02:55:58PM +0400, Michael Tokarev wrote:
From time to time, an scsi disk just disappears from
the bus, without any [error] messages whatsoever.
The only relevant stuff in dmesg is logging from md
(softraid) layer, about "error updating superblock"
and later "giving up and removing the disk from the
array" - not even error number.

When I try to access such a disk (/dev/sdX device),
I get a "No such device or address" error back.

It's still listed in /sys/block and /proc/scsi/scsi,
but any access to the device gives this error.

But the disk is here, I know it is. Deleting it from
kernel:

echo y > /sys/block/sdX/device/delete

and adding it back:

echo scsi add-single-device x y z > /proc/scsi/scsi

works just fine, linux finds "new" scsi device and it
happily works again.
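
The remove-and-re-add sequence above can be sketched as a dry run that only prints the two commands. The function name, the example device sdb and the address 0 0 1 0 are made up for illustration. The sketch also assumes the four-number form (host, channel, id, lun) that the /proc/scsi/scsi interface is documented to take, where the text above writes the placeholder "x y z".

```shell
#!/bin/sh
# Dry run: print the commands instead of executing them, so nothing
# under /sys or /proc is touched.
# Arguments: block device name, then host, channel, id, lun
# (assumed to match the H:C:T:L address of the device).
readd_cmds() {
    dev=$1; host=$2; chan=$3; id=$4; lun=$5
    # Step 1: drop the stale device from the kernel.
    echo "echo y > /sys/block/${dev}/device/delete"
    # Step 2: ask the SCSI layer to probe the same address again.
    echo "echo \"scsi add-single-device ${host} ${chan} ${id} ${lun}\" > /proc/scsi/scsi"
}

# Hypothetical example values.
readd_cmds sdb 0 0 1 0
```

Printing first makes it easy to check the address before running anything as root against a live array member.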

This happens on a lot of different machines, with different
disk drives (ok, most of them are from Seagate, but not
all). I can't say for sure that it happens on different
scsi controllers - at least the majority of them are Adaptecs,
using the aic7xxx or aic79xx driver.

I suspected the disks were too hot - nope, according to
smartctl, the temperature is far from bad (typically about
25..35 Celsius, and it does not change much).
Bad cables, bad power supply, bad anything else? Not sure
either, at least I can't guess more: the machines are
really different, some have good, under-loaded power supplies
(and server chassis/motherboards/all the stuff), some have less
good ones - makes no difference. And the disappearing is
really sporadic: it does not depend on current load or
time of day (eg, during nights, there's no one on site
so no one to touch cables etc), ... Well, I just can't think
of any reason, at all.

But one thing bothers me most: there's NO LOGGING from scsi
layer. None, zero, not at all.

Has anyone else seen something similar? Any pointers on how
to debug the issue?

I'd recommend turning on scsi logging; it might give you a clue about
which bit of scanning is failing to work properly.

Try booting with scsi_mod.scsi_logging_level=448 (I think I have that
number right; 7 shifted left by 6) and then you can compare failing and
non-failing runs and see if there's any difference.
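
The arithmetic behind 448 can be checked with a short shell sketch. It assumes the layout used in the kernel's scsi_logging.h, where each logging facility gets a 3-bit field and the scan field starts at bit 6. The helper name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: place a 0-7 verbosity level into a 3-bit field
# that starts at the given bit position.
scsi_log_field() {
    level=$1
    shift_bits=$2
    echo $(( (level & 7) << shift_bits ))
}

# Scan logging at maximum verbosity: level 7 in the field assumed
# to start at bit 6.
scan_level=$(scsi_log_field 7 6)

# Print the boot parameter; note there are no spaces around "=".
echo "scsi_mod.scsi_logging_level=${scan_level}"
```

This prints scsi_mod.scsi_logging_level=448. On kernels built with SCSI logging support, the same value can usually be written to /proc/sys/dev/scsi/logging_level at run time, which avoids a reboot on a machine that is already showing the problem.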


