mylex/LSILOGIC's DAC960 trouble

From: Ricardo Manuel Oliveira (rmo_at_eurotux.com)
Date: 12/12/03

    To: linux-kernel@vger.kernel.org
    Date:	Fri, 12 Dec 2003 10:50:06 +0000
    
    
    

    Hi,

     I've been having trouble with one of my DAC960 RAID arrays, which was in a
    co-located server. It started out working perfectly, and then all of a
    sudden (within a 15-day period) our server simply died until rebooted.
    The log analysis never gave us a clue about what the problem was, until
    I got to the datacenter myself to have a look.

     A couple of Shift+PageUps later, I saw a bunch of lines from the DAC960
    driver code telling me the disks were dead (logs below). Strangest
    thing, the disk enclosure lights did not indicate disk failure (they're
    SEAGATE SCA U160 drives), and after a quick'n'dirty reboot everything
    was back where it should be, running perfectly.

     The RAID is in our lab now, and it has been running a stress test for
    about 7 days. At last, we can see the problem reproduced. Here are some
    status reports (some of which are quite strange - the status is still
    OK):

    zbr:~ # cat /proc/rd/status
    OK
    zbr:~ # cat /proc/rd/c0/current_status
    ***** DAC960 RAID Driver Version 2.4.20aa1 of 4 December 2002 *****
    Copyright 1998-2001 by Leonard N. Zubkoff <lnz@dandelion.com>
    Configuring Mylex AcceleRAID 160 PCI RAID Controller
      Firmware Version: 6.00-11, Channels: 1, Memory Size: 16MB
      PCI Bus: 2, Device: 12, Function: 1, I/O Address: Unassigned
      PCI Address: 0xEF000000 mapped at 0xD4856000, IRQ Channel: 5
      Controller Queue Depth: 512, Maximum Blocks per Command: 2048
      Driver Queue Depth: 511, Scatter/Gather Limit: 128 of 257 Segments
      Physical Devices:
        0:0 Vendor: SEAGATE Model: ST318406LC Revision: 0109
             Wide Synchronous at 20 MB/sec
             Serial Number: 3FE0VWHM0000223270XS
             Disk Status: Online, 35807232 blocks
             Errors - Parity: 127, Soft: 0, Hard: 0, Misc: 0
                      Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
        0:1 Vendor: SEAGATE Model: ST318406LC Revision: 0109
             Wide Synchronous at 160 MB/sec
             Serial Number: 3FE0W65T000022329AXF
             Disk Status: Online, 35807232 blocks
             Errors - Parity: 7, Soft: 0, Hard: 0, Misc: 0
                      Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
        0:2 Vendor: SEAGATE Model: ST318406LC Revision: 010A
             Wide Synchronous at 160 MB/sec
             Serial Number: 3FE205PF000023078YLA
             Disk Status: Online, 35807232 blocks
             Errors - Parity: 13, Soft: 0, Hard: 0, Misc: 0
                      Timeouts: 0, Retries: 0, Aborts: 0, Predicted: 0
        0:7 Vendor: MYLEX Model: AcceleRAID 160 Revision: 0600
             Wide Synchronous at 160 MB/sec
             Serial Number:
      Logical Drives:
        /dev/rd/c0d0: RAID-5, Online, 71598080 blocks
                      Logical Device Initialized, BIOS Geometry: 255/63
                      Stripe Size: 64KB, Segment Size: 8KB
                      Read Cache Disabled, Write Cache Disabled
      No Rebuild or Consistency Check in Progress
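
As a rough illustration, the per-device transfer rate and parity counters in a status dump like the one above can be pulled out with a short Python script. This is only a sketch, assuming the text layout shown here; the function name `summarize` and the regular expressions are illustrative and are not part of the DAC960 driver.

```python
import re

# Sample trimmed from the status output above; on a live system the text
# would be read from /proc/rd/c0/current_status instead.
SAMPLE = """\
    0:0 Vendor: SEAGATE Model: ST318406LC Revision: 0109
         Wide Synchronous at 20 MB/sec
         Errors - Parity: 127, Soft: 0, Hard: 0, Misc: 0
    0:1 Vendor: SEAGATE Model: ST318406LC Revision: 0109
         Wide Synchronous at 160 MB/sec
         Errors - Parity: 7, Soft: 0, Hard: 0, Misc: 0
"""

# Hypothetical patterns matched against the layout shown in this post.
DEVICE_RE = re.compile(r"^\s*(\d+:\d+) Vendor:")
SPEED_RE = re.compile(r"Synchronous at (\d+) MB/sec")
PARITY_RE = re.compile(r"Errors - Parity: (\d+)")


def summarize(status_text):
    """Return {device: {"speed_mb": int or None, "parity": int or None}}."""
    devices = {}
    current = None
    for line in status_text.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            # A "channel:target Vendor:" line starts a new device section.
            current = match.group(1)
            devices[current] = {"speed_mb": None, "parity": None}
            continue
        if current is None:
            continue
        match = SPEED_RE.search(line)
        if match:
            devices[current]["speed_mb"] = int(match.group(1))
        match = PARITY_RE.search(line)
        if match:
            devices[current]["parity"] = int(match.group(1))
    return devices


if __name__ == "__main__":
    for device, info in summarize(SAMPLE).items():
        print(device, info)
```

Run against the full dump, a summary of this kind puts the negotiated rate next to the parity count for each device, which makes it easy to see that 0:0 is reported at 20 MB/sec while the other drives are at 160 MB/sec.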

    Logs:

    Dec 11 20:30:29 zbr kernel: DAC960#0: Physical Device 0:2 Sense Data Received
    Dec 11 20:30:29 zbr kernel: DAC960#0: Physical Device 0:2 Request Sense: Sense Key = B, ASC = 48, ASCQ = 00
    Dec 11 20:30:29 zbr kernel: DAC960#0: Physical Device 0:2 Request Sense: Information = 00000000 00000000
    Dec 11 20:30:30 zbr kernel: DAC960#0: Physical Device 0:1 Sense Data Received
    Dec 11 20:30:30 zbr kernel: DAC960#0: Physical Device 0:1 Request Sense: Sense Key = B, ASC = 48, ASCQ = 00
    Dec 11 20:30:30 zbr kernel: DAC960#0: Physical Device 0:1 Request Sense: Information = 00000000 00000000
    Dec 11 20:30:30 zbr kernel: DAC960#0: Physical Device 0:1 Sense Data Received
    Dec 11 20:30:30 zbr kernel: DAC960#0: Physical Device 0:1 Request Sense: Sense Key = B, ASC = 48, ASCQ = 00
    Dec 11 20:30:30 zbr kernel: DAC960#0: Physical Device 0:1 Request Sense: Information = 00000000 00000000
    Dec 11 20:30:30 zbr kernel: DAC960#0: Physical Device 0:0 Sense Data Received
    Dec 11 20:30:31 zbr kernel: DAC960#0: Physical Device 0:0 Request Sense: Sense Key = B, ASC = 48, ASCQ = 00
    Dec 11 20:30:31 zbr kernel: DAC960#0: Physical Device 0:0 Request Sense: Information = 00000000 00000000
    Dec 11 20:30:31 zbr kernel: DAC960#0: Physical Device 0:2 Sense Data Received
    Dec 11 20:30:32 zbr kernel: DAC960#0: Physical Device 0:2 Request Sense: Sense Key = B, ASC = 48, ASCQ = 00
    ....
    ....
    Dec 11 20:31:04 zbr kernel: DAC960#0: Physical Device 0:0 Sense Data Received
    Dec 11 20:31:04 zbr kernel: DAC960#0: Physical Device 0:0 Request Sense: Sense Key = B, ASC = 47, ASCQ = 00
    Dec 11 20:31:04 zbr kernel: DAC960#0: Physical Device 0:0 Request Sense: Information = 00000000 00000000
    Dec 11 20:31:04 zbr kernel: DAC960#0: Physical Device 0:0 Errors: Parity = 43, Soft = 0, Hard = 0, Misc = 0
    Dec 11 20:31:04 zbr kernel: DAC960#0: Physical Device 0:0 Errors: Timeouts = 0, Retries = 0, Aborts = 0, Predicted = 0
    Dec 11 20:31:04 zbr kernel: DAC960#0: Physical Device 0:0 Sense Data Received
    Dec 11 20:31:04 zbr kernel: DAC960#0: Physical Device 0:0 Request Sense: Sense Key = B, ASC = 47, ASCQ = 00
    .....
    .....
    Dec 11 20:31:30 zbr kernel: DAC960#0: Physical Device 0:0 Request Sense: Information = 00000000 00000000
    Dec 11 20:31:30 zbr kernel: DAC960#0: Physical Device 0:0 Errors: Parity = 127, Soft = 0, Hard = 0, Misc = 0
    Dec 11 20:31:30 zbr kernel: DAC960#0: Physical Device 0:0 Errors: Timeouts = 0, Retries = 0, Aborts = 0, Predicted = 0
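
For readers who want to decode the sense data in logs like these, a minimal Python sketch follows. It assumes the driver prints the sense key, ASC and ASCQ in hexadecimal, and it maps only the codes that appear above, using the names from the standard SCSI tables: sense key 0xB is ABORTED COMMAND, ASC/ASCQ 47h/00h is SCSI PARITY ERROR, and 48h/00h is INITIATOR DETECTED ERROR MESSAGE RECEIVED. The function name `decode` and the regular expression are illustrative.

```python
import re
from collections import Counter

# Only the codes seen in this log are listed; names follow the SCSI tables.
SENSE_KEYS = {0x0B: "ABORTED COMMAND"}
ASC_ASCQ = {
    (0x47, 0x00): "SCSI PARITY ERROR",
    (0x48, 0x00): "INITIATOR DETECTED ERROR MESSAGE RECEIVED",
}

# \s+ lets the pattern match both mail-wrapped and unwrapped log lines.
SENSE_RE = re.compile(
    r"Physical Device (\d+:\d+) Request Sense:\s+"
    r"Sense Key = ([0-9A-Fa-f]+), ASC = ([0-9A-Fa-f]+), ASCQ = ([0-9A-Fa-f]+)"
)

# Two entries copied from the log above, one wrapped and one unwrapped.
SAMPLE = """\
Dec 11 20:30:29 zbr kernel: DAC960#0: Physical Device 0:2 Request Sense:
Sense Key = B, ASC = 48, ASCQ = 00
Dec 11 20:31:04 zbr kernel: DAC960#0: Physical Device 0:0 Request Sense: Sense Key = B, ASC = 47, ASCQ = 00
"""


def decode(log_text):
    """Count (device, sense key name, ASC/ASCQ name) occurrences in a log."""
    counts = Counter()
    for device, key, asc, ascq in SENSE_RE.findall(log_text):
        # Values are assumed to be hexadecimal, as "Sense Key = B" suggests.
        key_name = SENSE_KEYS.get(int(key, 16), "unknown sense key")
        code_name = ASC_ASCQ.get((int(asc, 16), int(ascq, 16)), "unknown ASC/ASCQ")
        counts[(device, key_name, code_name)] += 1
    return counts


if __name__ == "__main__":
    for (device, key_name, code_name), n in decode(SAMPLE).items():
        print(f"{device}: {key_name} / {code_name} x{n}")
```

Both codes describe errors detected on the SCSI bus during transfer rather than media errors on a drive, which matches the Parity counters being the only ones that increase.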

     The disks are still mounted, but an ls simply hangs:

    zbr:~ # mount
    /dev/hda3 on / type ext2 (rw)
    proc on /proc type proc (rw)
    devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
    /dev/hda1 on /boot type ext2 (rw)
    /dev/hda2 on /home type ext2 (rw)
    shmfs on /dev/shm type shm (rw)
    none on /proc/bus/usb type usbdevfs (rw)
    /dev/rd/c0d0p3 on /mnt/linux type ext3 (rw)
    /dev/rd/c0d0p1 on /mnt/linux/boot type ext3 (rw)
    /dev/rd/c0d0p5 on /mnt/linux/tmp type ext3 (rw)
    /dev/rd/c0d0p6 on /mnt/linux/var type ext3 (rw)
    /dev/rd/c0d0p7 on /mnt/linux/servicos type ext3 (rw)
    zbr:~ # cd /mnt/linux/
    zbr:/mnt/linux # ls
    (no output, simply hangs)

     We've checked the cables and the Mylex card itself, and everything seems
    to be in order. I could bet a reasonable amount of money that after a
    reboot everything will be just fine.

     Kernels tested:
    stock 2.4.18
    RH's 2.4.20

     Any help is greatly appreciated.

     Thanks in advance,
      Ricardo Oliveira

    -- 
    Ricardo Manuel Oliveira <rmo:eurotux.com>
    Eurotux Informática, SA
    
    



