Software RAID 5 SATA array crashed

From: Adar Dembo (adembo_at_gmail.com)
Date: 09/18/05

    Date: Sat, 17 Sep 2005 17:59:33 -0700
    To: debian-user@lists.debian.org
    
    

    Back in October I set up a software RAID 5 array using MD. I used five
    300 GB SATA-II drives on two Promise TX4 SATAII controllers (the
    new ones with NCQ). One controller is connected to two drives, and the
    other to three.

    A few days ago, after moving to a new house, I set up the server
    containing the array and tried to connect to it. I couldn't reach the
    server over the local network, so I hooked up a keyboard and monitor to
    see what was up. The kernel hadn't even finished booting. Right as md
    was loaded by the kernel (it is built in, not a module), there was a
    call stack and a kernel error, "IRQ 193: nobody cared!" or something
    similar. After that came repeated messages about SCSI commands failing
    on, I believe, three of my drives.

    Rebooting the machine didn't make the behavior go away. I powered it off
    and reseated all of the SATA connectors. This time, when booting up, I
    made progress. Here is what syslog said upon autodetecting the MD array:

    Sep 11 23:46:57 localhost kernel: md: Autodetecting RAID arrays.
    Sep 11 23:46:57 localhost kernel: md: autorun ...
    Sep 11 23:46:57 localhost kernel: md: considering sdf1 ...
    Sep 11 23:46:57 localhost kernel: md: adding sdf1 ...
    Sep 11 23:46:57 localhost kernel: md: adding sde1 ...
    Sep 11 23:46:57 localhost kernel: md: adding sdd1 ...
    Sep 11 23:46:57 localhost kernel: md: adding sdc1 ...
    Sep 11 23:46:57 localhost kernel: md: adding sdb1 ...
    Sep 11 23:46:57 localhost kernel: md: created md0
    Sep 11 23:46:57 localhost kernel: md: bind<sdb1>
    Sep 11 23:46:57 localhost kernel: md: bind<sdc1>
    Sep 11 23:46:57 localhost kernel: md: bind<sdd1>
    Sep 11 23:46:57 localhost kernel: md: bind<sde1>
    Sep 11 23:46:57 localhost kernel: md: bind<sdf1>
    Sep 11 23:46:57 localhost kernel: md: running: <sdf1><sde1><sdd1><sdc1><sdb1>
    Sep 11 23:46:57 localhost kernel: md: kicking non-fresh sdc1 from array!
    Sep 11 23:46:57 localhost kernel: md: unbind<sdc1>
    Sep 11 23:46:57 localhost kernel: md: export_rdev(sdc1)
    Sep 11 23:46:57 localhost kernel: md: md0: raid array is not clean -- starting background reconstruction
    Sep 11 23:46:57 localhost kernel: raid5: device sdf1 operational as raid disk 4
    Sep 11 23:46:57 localhost kernel: raid5: device sde1 operational as raid disk 3
    Sep 11 23:46:57 localhost kernel: raid5: device sdd1 operational as raid disk 2
    Sep 11 23:46:57 localhost kernel: raid5: device sdb1 operational as raid disk 0
    Sep 11 23:46:57 localhost kernel: raid5: cannot start dirty degraded array for md0
    Sep 11 23:46:57 localhost kernel: RAID5 conf printout:
    Sep 11 23:46:57 localhost kernel: --- rd:5 wd:4 fd:1
    Sep 11 23:46:57 localhost kernel: disk 0, o:1, dev:sdb1
    Sep 11 23:46:57 localhost kernel: disk 2, o:1, dev:sdd1
    Sep 11 23:46:57 localhost kernel: disk 3, o:1, dev:sde1
    Sep 11 23:46:57 localhost kernel: disk 4, o:1, dev:sdf1
    Sep 11 23:46:57 localhost kernel: raid5: failed to run raid set md0
    Sep 11 23:46:57 localhost kernel: md: pers->run() failed ...
    Sep 11 23:46:57 localhost kernel: md: do_md_run() returned -22
    Sep 11 23:46:57 localhost kernel: md: md0 stopped.
    Sep 11 23:46:57 localhost kernel: md: unbind<sdf1>
    Sep 11 23:46:57 localhost kernel: md: export_rdev(sdf1)
    Sep 11 23:46:57 localhost kernel: md: unbind<sde1>
    Sep 11 23:46:57 localhost kernel: md: export_rdev(sde1)
    Sep 11 23:46:57 localhost kernel: md: unbind<sdd1>
    Sep 11 23:46:57 localhost kernel: md: export_rdev(sdd1)
    Sep 11 23:46:57 localhost kernel: md: unbind<sdb1>
    Sep 11 23:46:57 localhost kernel: md: export_rdev(sdb1)
    Sep 11 23:46:57 localhost kernel: md: ... autorun DONE.

    Note the message about sdc1 being non-fresh. Also note that the array is
    both DIRTY and DEGRADED: degraded (I'm guessing) because sdc1 was kicked
    out and counted as failed, and dirty because the machine was powered off
    while it was throwing errors, so the array was never flushed properly.
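As a rough illustration of how the two conditions can be read off the log, here is a minimal Python sketch. It assumes that the fields in the "--- rd:5 wd:4 fd:1" line stand for raid disks, working disks, and failed disks, and that the "raid array is not clean" message marks a dirty array. The function names are made up for this example.

```python
import re

# Assumption: rd = raid disks, wd = working disks, fd = failed disks.
SUMMARY = re.compile(r"rd:(\d+) wd:(\d+) fd:(\d+)")


def parse_conf_summary(line):
    """Parse a '--- rd:N wd:N fd:N' line; return None if the line has no summary."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    rd, wd, fd = (int(group) for group in match.groups())
    return {
        "raid_disks": rd,
        "working_disks": wd,
        "failed_disks": fd,
        # Fewer working members than configured members means degraded.
        "degraded": wd < rd,
    }


def is_dirty(lines):
    """True if any log line carries md's 'not clean' message."""
    return any("raid array is not clean" in line for line in lines)


if __name__ == "__main__":
    print(parse_conf_summary("Sep 11 23:46:57 localhost kernel: --- rd:5 wd:4 fd:1"))
```

For the first boot above, both checks come out true, which matches the "cannot start dirty degraded array" refusal.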

    I played around with mdadm but could never get the array to start. All
    of the superblocks were intact, including sdc's. Finally, I ran "mdrun",
    which managed to start the array. Here is the logging associated with
    this command:

    Sep 12 01:04:37 localhost kernel: md: md0 stopped.
    Sep 12 01:04:37 localhost kernel: md: bind<sdc>
    Sep 12 01:04:37 localhost kernel: md: bind<sdd>
    Sep 12 01:04:37 localhost kernel: md: bind<sdf>
    Sep 12 01:04:37 localhost kernel: md: bind<sde>
    Sep 12 01:04:37 localhost kernel: md: bind<sdb>
    Sep 12 01:04:37 localhost kernel: md: md0: raid array is not clean -- starting background reconstruction
    Sep 12 01:04:37 localhost kernel: raid5: device sdb operational as raid disk 0
    Sep 12 01:04:37 localhost kernel: raid5: device sde operational as raid disk 4
    Sep 12 01:04:37 localhost kernel: raid5: device sdf operational as raid disk 3
    Sep 12 01:04:37 localhost kernel: raid5: device sdd operational as raid disk 2
    Sep 12 01:04:37 localhost kernel: raid5: device sdc operational as raid disk 1
    Sep 12 01:04:37 localhost kernel: raid5: allocated 5248kB for md0
    Sep 12 01:04:37 localhost kernel: raid5: raid level 5 set md0 active with 5 out of 5 devices, algorithm 2
    Sep 12 01:04:37 localhost kernel: RAID5 conf printout:
    Sep 12 01:04:37 localhost kernel: --- rd:5 wd:5 fd:0
    Sep 12 01:04:37 localhost kernel: disk 0, o:1, dev:sdb
    Sep 12 01:04:37 localhost kernel: disk 1, o:1, dev:sdc
    Sep 12 01:04:37 localhost kernel: disk 2, o:1, dev:sdd
    Sep 12 01:04:37 localhost kernel: disk 3, o:1, dev:sdf
    Sep 12 01:04:37 localhost kernel: disk 4, o:1, dev:sde
    Sep 12 01:04:37 localhost kernel: .<6>md: syncing RAID array md0
    Sep 12 01:04:37 localhost kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
    Sep 12 01:04:37 localhost kernel: md: using maximum available idle IO bandwith (but not more than 200000 KB/sec) for reconstruction.
    Sep 12 01:04:37 localhost kernel: md: using 128k window, over a total of 293057280 blocks.
    Sep 12 01:04:37 localhost kernel: md: md1 stopped.
    Sep 12 01:04:37 localhost last message repeated 4 times
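The two "RAID5 conf printout" sections can be compared mechanically. The Python sketch below is only an illustration: the helper names are invented, the input lines are copied from the two logs above, and the partition-stripping rule assumes sdX-style device names. On those lines it flags that the second run bound whole disks (sdb) where the first bound partitions (sdb1), and that slots 3 and 4 point at different disks in the two runs. The logs alone do not show whether either difference matters.

```python
import re

SLOT = re.compile(r"disk (\d+), o:\d+, dev:(\w+)")

# Conf printout lines copied from the autodetect run and the mdrun run.
FIRST_RUN = [
    "disk 0, o:1, dev:sdb1",
    "disk 2, o:1, dev:sdd1",
    "disk 3, o:1, dev:sde1",
    "disk 4, o:1, dev:sdf1",
]
SECOND_RUN = [
    "disk 0, o:1, dev:sdb",
    "disk 1, o:1, dev:sdc",
    "disk 2, o:1, dev:sdd",
    "disk 3, o:1, dev:sdf",
    "disk 4, o:1, dev:sde",
]


def slot_map(lines):
    """Return {slot number: device name} from conf printout lines."""
    result = {}
    for line in lines:
        match = SLOT.search(line)
        if match:
            result[int(match.group(1))] = match.group(2)
    return result


def base_disk(dev):
    """Strip a trailing partition number: 'sdb1' -> 'sdb' (sdX-style names only)."""
    return dev.rstrip("0123456789")


def compare(before, after):
    """List (slot, before, after, reason) for every slot that differs."""
    diffs = []
    for slot in sorted(set(before) | set(after)):
        b, a = before.get(slot), after.get(slot)
        if b is None or a is None:
            diffs.append((slot, b, a, "absent from one run"))
        elif base_disk(b) != base_disk(a):
            diffs.append((slot, b, a, "different disk"))
        elif b != a:
            diffs.append((slot, b, a, "partition vs whole disk"))
    return diffs


if __name__ == "__main__":
    for diff in compare(slot_map(FIRST_RUN), slot_map(SECOND_RUN)):
        print(diff)
```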

    So it seems to me that mdrun forced the array to start. Since it began
    "syncing" rather than reconstructing, it must have assumed sdc had not
    failed and used all five drives to recompute the parity information
    (sync: rebuild parity; reconstruct: rebuild a drive).
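To make the sync/reconstruct distinction concrete, here is a toy Python model of one RAID 5 stripe. It is a sketch, not md's actual code: real md rotates parity across the members from stripe to stripe, and the function names here are invented. It shows why a resync that trusts a stale member is worrying: the new parity is made consistent with the stale data, so the old contents can no longer be recovered from parity.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def resync_parity(data_blocks):
    """'Sync': trust every data block and recompute the parity block."""
    return xor_blocks(data_blocks)


def reconstruct_block(surviving_blocks):
    """'Reconstruct': rebuild one missing block from all the others plus parity."""
    return xor_blocks(surviving_blocks)


if __name__ == "__main__":
    data = [b"\x01\x02", b"\x10\x20", b"\xff\x00", b"\x0f\xf0"]
    parity = resync_parity(data)

    # With good parity, a lost member can be rebuilt from the rest.
    rebuilt = reconstruct_block([data[0], data[2], data[3], parity])
    print(rebuilt == data[1])

    # If member 1 is stale and a sync trusts it anyway, the new parity
    # agrees with the stale contents, not the original ones.
    stale = list(data)
    stale[1] = b"\x00\x00"
    bad_parity = resync_parity(stale)
    rebuilt = reconstruct_block([stale[0], stale[2], stale[3], bad_parity])
    print(rebuilt == data[1])
```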

    During the resync, and even after it finished, I could not access the XFS
    filesystem. Neither xfs_repair nor xfs_check could find a valid XFS
    superblock. I let xfs_repair scan the entire device and it did not
    find a single XFS superblock. However, piping /dev/md0 into strings does
    yield some filenames that I recognize from the array.
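An XFS superblock begins with the four magic bytes "XFSB", so a read-only scan for that signature is one way to see where a superblock might have ended up. The Python sketch below is a minimal illustration under stated assumptions: the function name and parameters are invented, it only checks sector-aligned offsets, and a hit is merely a candidate, not proof of a valid superblock. Pointing it at the individual member devices rather than only /dev/md0 might show on which device, and at what offset, a superblock survives.

```python
XFS_MAGIC = b"XFSB"  # first four bytes of an XFS superblock


def find_xfs_magic(path, sector_size=512, max_bytes=None, chunk_sectors=2048):
    """Return byte offsets of sector-aligned 'XFSB' signatures in a file or device.

    The path is opened read-only. max_bytes limits how much is scanned;
    None means scan to the end.
    """
    hits = []
    offset = 0
    chunk_size = sector_size * chunk_sectors
    with open(path, "rb") as dev:
        while max_bytes is None or offset < max_bytes:
            want = chunk_size if max_bytes is None else min(chunk_size, max_bytes - offset)
            chunk = dev.read(want)
            if not chunk:
                break
            # Only look at the start of each sector within this chunk.
            for i in range(0, len(chunk), sector_size):
                if chunk[i:i + len(XFS_MAGIC)] == XFS_MAGIC:
                    hits.append(offset + i)
            offset += len(chunk)
    return hits


if __name__ == "__main__":
    import sys

    # Usage (as root, read-only): python scan.py /dev/md0
    for hit in find_xfs_magic(sys.argv[1]):
        print(hit)
```

Scanning a full 300 GB member this way is slow in Python; passing max_bytes to look at the first few hundred megabytes of each device is a reasonable first pass.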

    So now I've got this array, and I still don't know what malfunctioned.
    In addition, I have a bad filesystem which I don't want to give up on,
    because I'd be losing a ton of data. Anyone have any suggestions?

    -Adar

    PS: I'm not subscribed to debian-user, so please include me in the replies.
