[opensuse] md RAID5 problem, SATA hard reset I/O error, ICH9R, driver bug in openSUSE 11.2?



Hello! Has anyone else seen this problem? Could it be a bug in 11.2?

Description:
-----------------------------------------------------------------------------------------

I installed 11.2 on my server last weekend, moving from 11.0 (a fresh install). For many years I've been running RAID5 over three partitions on three HDDs using md and ext3, without a hitch. I've also had other RAID configurations running in parallel on the same HDDs and on others, also for years without any problems. When installing 11.2 I also changed the partitioning and created a new, otherwise identical RAID5 setup. However, when copying the data back from my backup, the array suddenly broke down with 2 of the drives marked faulty and removed!

I retried a number of times to understand what was happening. I tried both the default and the Xen kernel (both x86_64) and always got the same error. When testing, I saw that first one drive would get an exception and a "hard reset" and be removed from the array, and it was always the last raid device (2). This happened after copying several GB of data (usually around 8 GB). If I then stopped and re-added the "faulty" drive, it would get back in sync and I could continue copying another several GB until it got the hard reset again. If I didn't stop to re-add it, the second-to-last raid device (1) would also get a hard reset and be removed as faulty, which leaves only one of the three members and breaks the 3-disk RAID5.

After trying this for a while I instead created a RAID1 setup using the same 3 partitions and ext4. I've had no problems at all with this setup, filling it to 90% (about 80 GB) and running it for 5 days. Worth mentioning is that I've never had any problems with these physical HDDs, which are now around 4-5 years old. I'm also running root and other partitions and a few VMs using LVM and Xen from the very same HDDs. I checked SMART, including a long self-test, without finding any errors. I've been running exactly the same hardware for a few years without any problems.

When googling I saw a similar problem reported, but with nvidia (sata_nv) and some change introduced in the 2.6.26 kernel or thereabouts. I don't know if this issue is related. In this case it's Intel ICH9R.
-----------------------------------------------------------------------------------------
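To make the failure sequence in the description concrete, here is a minimal sketch in Python. It is only a model of the RAID5 redundancy rule (one missing member is survivable, two are not), not of md itself. The function name `raid5_state` and the event list are made up for illustration, and the resync time after a re-add is ignored.

```python
def raid5_state(total_devices, working_devices):
    """Classify a RAID5 array by how many member devices are still working."""
    missing = total_devices - working_devices
    if missing == 0:
        return "clean"
    if missing == 1:
        return "degraded"  # data is still recoverable from parity
    return "failed"  # two or more members gone: parity cannot rebuild the data


# Replay the sequence from the description on a 3-device array.
# Device numbers follow the post: 2 is the last raid device, 1 the second-to-last.
working = {0, 1, 2}
history = []
for event, device in [("fail", 2), ("re-add", 2), ("fail", 2), ("fail", 1)]:
    if event == "fail":
        working.discard(device)
    else:
        working.add(device)  # assumes the resync after the re-add completed
    history.append((event, device, raid5_state(3, len(working))))

for event, device, state in history:
    print(f"{event:6} device {device}: {state}")
```

The last step is the case where device 1 drops out before device 2 has been re-added, which is the point where the array can no longer continue.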


Product: openSUSE 11.2 (final)

uname -a:
Linux serv 2.6.31.8-0.1-xen #1 SMP 2009-12-15 23:55:40 +0100 x86_64 x86_64 x86_64 GNU/Linux

dmesg:
-----------------------------------------------------------------------------------------

[ 2.992003] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 2.997047] ata3.00: ATA-7: SAMSUNG SP2004C, VM100-33, max UDMA7
[ 2.997049] ata3.00: 390721968 sectors, multi 0: LBA48 NCQ (depth 31/32)
[ 3.002142] ata3.00: configured for UDMA/133
...
[15431.871709] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[15431.871723] ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[15431.871724] res 40/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[15431.871733] ata3.00: status: { DRDY }
[15431.871739] ata3: hard resetting link
[15432.355696] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[15432.365819] ata3.00: configured for UDMA/133
[15432.365825] ata3.00: device reported invalid CHS sector 0
[15432.365831] end_request: I/O error, dev sdc, sector 185181205
[15432.365836] md: super_written gets error=-5, uptodate=0
[15432.365840] raid5: Disk failure on sdc5, disabling device.
[15432.365841] raid5: Operation continuing on 2 devices.
[15432.365853] ata3: EH complete
[15432.443698] RAID5 conf printout:
[15432.443706] --- rd:3 wd:2
[15432.443711] disk 0, o:1, dev:sda5
[15432.443715] disk 1, o:1, dev:sdb5
[15432.443718] disk 2, o:0, dev:sdc5
[15432.463691] RAID5 conf printout:
[15432.463696] --- rd:3 wd:2
[15432.463700] disk 0, o:1, dev:sda5
[15432.463703] disk 1, o:1, dev:sdb5
-----------------------------------------------------------------------------------------
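For anyone comparing logs, here is a rough Python sketch that pulls the key fields out of the dmesg excerpt above. The log lines are copied from this post. The opcode table holds only the single opcode seen here (0xea is FLUSH CACHE EXT in the ATA command set), and reading rd/wd as "raid devices / working devices" is my interpretation, though it matches the "Operation continuing on 2 devices" line. The function name `summarize` is just for illustration.

```python
import re

# Lines copied from the dmesg excerpt in this post.
LOG = """\
[15431.871709] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[15431.871723] ata3.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[15431.871724] res 40/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[15432.365831] end_request: I/O error, dev sdc, sector 185181205
[15432.443706] --- rd:3 wd:2
"""

# Only the opcode that appears in this log; not a complete ATA command table.
ATA_OPCODES = {0xEA: "FLUSH CACHE EXT"}


def summarize(log):
    """Extract the failed command, error class, I/O error and array state."""
    info = {}
    m = re.search(r"(ata\d+\.\d+): cmd ([0-9a-f]{2})/", log)
    if m:
        opcode = int(m.group(2), 16)
        info["port"] = m.group(1)
        info["command"] = ATA_OPCODES.get(opcode, f"unknown (0x{opcode:02x})")
    # Matches only the "res" line, where the mask is followed by "(reason)".
    m = re.search(r"Emask 0x[0-9a-f]+ \((\w+)\)", log)
    if m:
        info["error"] = m.group(1)
    m = re.search(r"I/O error, dev (\w+), sector (\d+)", log)
    if m:
        info["device"] = m.group(1)
        info["sector"] = int(m.group(2))
    m = re.search(r"rd:(\d+) wd:(\d+)", log)
    if m:
        info["raid_devices"] = int(m.group(1))
        info["working_devices"] = int(m.group(2))
    return info


summary = summarize(LOG)
for key, value in summary.items():
    print(f"{key}: {value}")
```

So the command that timed out on ata3.00 was a cache flush, not a regular read or write, and the array went from 3 to 2 working devices right after it.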


sudo /usr/sbin/smartctl -a /dev/sdc: OK
-----------------------------------------------------------------------------------------

Model Family: SAMSUNG SpinPoint P120 series
Device Model: SAMSUNG SP2004C
Serial Number: S07GJ10Y946521
Firmware Version: VM100-33
User Capacity: 200,049,647,616 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 4a
Local Time is: Fri Jan 29 07:59:58 2010 CET
-----------------------------------------------------------------------------------------
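As a quick sanity check on the numbers, the sector count from dmesg and the capacity from smartctl can be compared with a few lines of Python. This assumes 512-byte logical sectors, which is not stated anywhere in the output above.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors

sectors = 390721968               # from "ata3.00: 390721968 sectors"
reported_bytes = 200_049_647_616  # from "User Capacity" in smartctl
failed_sector = 185181205         # from "end_request: I/O error, dev sdc"

computed_bytes = sectors * SECTOR_BYTES
capacity_matches = computed_bytes == reported_bytes
sector_in_range = 0 <= failed_sector < sectors
position = failed_sector / sectors  # relative position on the whole disk

print("capacity matches:", capacity_matches)
print("failed sector within disk:", sector_in_range)
print(f"relative position on disk: {position:.1%}")
```

The two capacity figures agree, and the sector in the I/O error lies inside the addressable range of the disk, so the error message itself does not point at an out-of-range request.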


lspci -nn:
-----------------------------------------------------------------------------------------

00:1f.2 SATA controller [0106]: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller [8086:2922] (rev 02)
-----------------------------------------------------------------------------------------


Mobo: Gigabyte GA-P35C-DS3R
