Re: Random scsi disk disappearing
- From: Matthew Wilcox <matthew@xxxxxx>
- Date: Thu, 17 Aug 2006 05:35:37 -0600
On Thu, Aug 17, 2006 at 02:55:58PM +0400, Michael Tokarev wrote:
From time to time, a SCSI disk just disappears from
the bus, without any [error] messages whatsoever.
The only relevant stuff in dmesg is logging from md
(softraid) layer, about "error updating superblock"
and later "giving up and removing the disk from the
array" - not even error number.
When I try to access such a disk (/dev/sdX device),
I get a "No such device or address" error back.
It's still listed in /sys/block and /proc/scsi/scsi,
but any access to the device gives this error.
But the disk is here, I know it is. Deleting it from
kernel:
echo y > /sys/block/sdX/device/delete
and adding it back:
echo scsi add-single-device x y z > /proc/scsi/scsi
works just fine, linux finds "new" scsi device and it
happily works again.
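The delete/re-add workaround above can be wrapped in two small shell functions. This is only a sketch: the function names scsi_delete and scsi_add and the SYSFS_ROOT / PROC_SCSI variables are hypothetical, introduced so the commands can be dry-run against scratch files. It also assumes add-single-device is given four numbers (host, channel, id, lun); the "x y z" above is treated as shorthand for them.

```shell
# Hypothetical helpers, not part of any existing tool.
# SYSFS_ROOT and PROC_SCSI are assumptions added for dry runs; the defaults
# point at the real kernel interfaces used in the commands above.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}
PROC_SCSI=${PROC_SCSI:-/proc/scsi/scsi}

# Remove a SCSI disk from the kernel's view.
# $1 = block device name, e.g. sdX
scsi_delete() {
    echo y > "$SYSFS_ROOT/block/$1/device/delete"
}

# Ask the SCSI layer to probe a single device again.
# $1 = host, $2 = channel, $3 = id, $4 = lun (assumed argument order)
scsi_add() {
    echo "scsi add-single-device $1 $2 $3 $4" > "$PROC_SCSI"
}

# Example (as root, on the real system; the numbers are placeholders):
#   scsi_delete sdX
#   scsi_add 0 0 1 0
```

Nothing runs until the functions are called, so the file can be sourced safely; the host/channel/id/lun values for a given disk can be read from /proc/scsi/scsi before deleting it.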
This happens on a lot of different machines, with different
disk drives (ok, most of them are from Seagate, but not
all). I can't say for sure that it happens on different
SCSI controllers - at least the majority of them are Adaptecs,
using the aic7xxx or aic79xx driver.
I suspected the disks were too hot - nope, according to
smartctl, the temperature is far from bad (typically about
25..35 Celsius, and the temperature is not changing much).
Bad cables, bad power supply, bad anything else? Not sure
either, at least I can't guess more: the machines are
really different, some have good, under-loaded power supplies
(and server chassis/motherboards/all the stuff), some have less
good ones - makes no difference. And the thing is, the
disappearances are really sporadic and don't depend on current
load or time of day (e.g., during nights there's no one on site,
so no one to touch cables etc). Well, I just can't think
of any reason at all.
But one thing bothers me most: there's NO LOGGING from scsi
layer. None, zero, not at all.
Has anyone else seen something similar? Any pointers on how
to debug the issue?
I'd recommend turning on scsi logging; it might give you a clue about
which bit of scanning is failing to work properly.
Try booting with scsi_mod.scsi_logging_level=448 (I think I have that
number right; 7 shifted left by 6) and then you can compare failing and
non-failing runs and see if there's any difference.
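The arithmetic behind that number can be checked with a few lines of shell. This is a sketch under the assumption stated above, that the scan facility's 3-bit level field starts at bit 6; the function name scsi_log_mask is made up for illustration.

```shell
# Build a logging mask from a 3-bit level (0..7) and a bit offset.
# Assumption: the scan facility's field starts at bit 6, per the
# "7 shifted left by 6" remark; other offsets are not covered here.
scsi_log_mask() {
    level=$1
    shift_bits=$2
    echo $(( (level & 7) << shift_bits ))
}

scan_mask=$(scsi_log_mask 7 6)

# Print the boot parameter in the form the kernel command line expects
# (no spaces around '=').
echo "scsi_mod.scsi_logging_level=${scan_mask}"
```

On kernels built with SCSI logging support, the same value can usually also be written to /proc/sys/dev/scsi/logging_level at runtime, which avoids a reboot between the failing and non-failing runs.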