RE: lpfc RAID1 device panics when one device goes away

From: Hamilton Andrew (Andrew.Hamilton_at_afccc.af.mil)
Date: 01/30/04

    To: redhat-list@redhat.com
    Date: Fri, 30 Jan 2004 10:27:43 -0500
    
    

    Are we talking about a failure of one of the HBAs or a failure of a drive?
    I thought we were talking about the HBA failing, which is far different
    from a drive failing.

    I agree with the LUNs part. That is exactly the way I see it as well.
    However, in your case you have 2 connections to the 2 LUNs. In a locally
    attached setup you typically have one SCSI connection to the raid array,
    not two. Granted, I would think it wouldn't work any differently from the
    raid point of view whether you had 1 SCSI connection or 2. But if you were
    using two connections to run your array and one of them went down, how
    would the system know how to handle a lost SCSI card? It panics, in my
    experience. I know there are hardware raid solutions that will fail over
    if one of the raid controllers fails. I also know that there are software
    solutions, but I think you have to have an external software piece to do
    it. The kernel/OS isn't going to know by default how to "fail over" the
    connection. If you had a drive fail, that's different. The software raid
    knows how to handle a drive failure; that is fairly standard. If you have
    a backup, it just moves to that. But I think handling a SCSI card failure
    would be very configuration dependent and would, under normal
    circumstances, cause a panic unless you had something intervening to
    catch those kinds of failures.

    I have 1 internal SCSI controller, 1 SCSI card, and 1 HBA. All of them act
    as SCSI controllers. I'm also running software raid on the local drives and
    have a raid on the SAN. If I had a SCSI card failure and the SCSI card and
    the local SCSI controller were talking to the local raid, how would the
    machine react? I wouldn't even know how to tell it to fail over to the SAN
    raid or use the other SCSI card to talk to the raid and ignore the failure
    without some sort of intervening software.

    Drew
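
The two approaches argued over in this thread can be put side by side in a toy model. This is only a sketch for illustration, not how md, DMP, or PowerPath are implemented; the class names, method names, and the path labels "hba0"/"hba1" are all hypothetical.

```python
class Mirror:
    """Toy RAID1: the array stays usable while any member is healthy."""

    def __init__(self, members):
        self.healthy = {m: True for m in members}

    def fail(self, member):
        # A lost member is only marked failed; the array itself survives.
        self.healthy[member] = False

    def usable(self):
        return any(self.healthy.values())


class MultipathLun:
    """Toy path failover: one LUN reachable over several HBA paths."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.down = set()

    def path_down(self, path):
        self.down.add(path)

    def pick_path(self):
        # Return the first path that is still up, in configured order.
        for path in self.paths:
            if path not in self.down:
                return path
        raise IOError("all paths to the LUN are down")


# Mirror of two LUNs, one per path: losing a path looks like losing a member.
md10 = Mirror(["sdc", "sde"])
md10.fail("sde")
print(md10.usable())

# Failover layer: one LUN, two paths; I/O moves to the surviving path.
lun = MultipathLun(["hba0", "hba1"])
lun.path_down("hba1")
print(lun.pick_path())
```

In the first model the mirror carries on with one member; in the second the LUN never disappears as long as one path is up. Which of these the real driver stack delivers on a link-down event is exactly what is in question here.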

    -----Original Message-----
    From: Bruen, Mark [mailto:mbruen@trilegiant.com]
    Sent: Friday, January 30, 2004 9:52 AM
    To: redhat-list@redhat.com
    Subject: Re: lpfc RAID1 device panics when one device goes away

    Actually I view the configuration as identical to having two locally
    attached SCSI disks which are mirrored via software RAID1. The only
    difference being the two "drives" (LUNs) are located on a storage array
    on a SAN. As far as the OS is concerned the two LUNs are just two
    separate SCSI drives. I'm speculating that the lpfc driver either does
    not handle this case, or requires tuning parameters to be set, so that it
    returns the failed-path information back up to the SCSI driver in a
    manner which won't cause a panic.
            -Mark
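
The "dev 08:40" in the console output quoted further down is a major:minor pair in hex. A minimal sketch of decoding it, assuming the classic Linux SCSI disk numbering (major 8, 16 minors per disk); the function name is made up for illustration.

```python
import string

SD_MAJOR = 8          # first SCSI disk major (sda..sdp)
MINORS_PER_DISK = 16  # minor 0 is the whole disk, 1..15 are partitions


def decode_sd_dev(dev):
    """Turn a hex 'major:minor' string such as '08:40' into an sd name."""
    major_hex, minor_hex = dev.split(":")
    major, minor = int(major_hex, 16), int(minor_hex, 16)
    if major != SD_MAJOR:
        raise ValueError("not a device on SCSI disk major 8: " + dev)
    disk, part = divmod(minor, MINORS_PER_DISK)
    name = "sd" + string.ascii_lowercase[disk]
    return name + (str(part) if part else "")


print(decode_sd_dev("08:40"))  # sde
```

Minor 0x40 is 64, which is the fifth disk with partition 0 (the whole disk). That agrees with the "Disk failure on sde" line that raid1 prints.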

    Hamilton Andrew wrote:
    > Mark,
    >
    > I may be wrong here and maybe someone out there knows better, but I
    > don't think this will work without PowerPath. That allows your OS to
    > treat both your HBAs as one, and it load balances across the two
    > HBAs. Without that you have two independent connections to two LUNs,
    > and that is what is causing the panic. You need something that will
    > treat both your connections as one connection. Even if both your HBAs
    > can talk to both LUNs, the OS is not going to fail over to the one that
    > is working without some sort of go-between, and the kernel does not know
    > it can talk to both LUNs via either HBA. It just knows that it had 2
    > connections to the raid and one of them is gone, so the raid is no longer
    > available. At least that is the way it would seem to work to me.
    >
    > My 2 cents. Let me know if you find out something different though.
    >
    > Drew
    >
    > -----Original Message-----
    > From: Bruen, Mark [mailto:mbruen@trilegiant.com]
    > Sent: Friday, January 30, 2004 8:54 AM
    > To: redhat-list@redhat.com
    > Subject: Re: lpfc RAID1 device panics when one device goes away
    >
    >
    > No, it worked once, but then on the next test it panicked again. I'll
    > keep looking.
    > -Mark
    >
    > Hamilton Andrew wrote:
    > > Did that fix it? I have an EMC CX600 configured much the same way, but
    > > I'm using RHEL 2.1AS instead of 3.0. I'm sure there are a ton of
    > > differences between the two releases.
    > >
    > > -----Original Message-----
    > > From: Bruen, Mark [mailto:mbruen@trilegiant.com]
    > > Sent: Wednesday, January 28, 2004 7:09 PM
    > > To: redhat-list@redhat.com
    > > Subject: Re: lpfc RAID1 device panics when one device goes away
    > >
    > >
    > > I think I have fixed this by changing the partition type of each LUN's
    > > (disk) partition to "fd" (Linux raid auto).
    > >
    > > Bruen, Mark wrote:
    > > > That will be the config once Veritas and/or EMC support HBA path
    > > > failover on RedHat AS 3.0. Veritas will support it with DMP in version 4
    > > > due in Q2/04, EMC has not committed to a date yet with PowerPath. In the
    > > > interim I'm trying to provide path failover using software RAID1 of two
    > > > hardware RAID5 LUNs one on each path (two switches connected to two
    > > > storage processors connected to two HBAs per server).
    > > > -Mark
    > > >
    > > > Hamilton Andrew wrote:
    > > >
    > > >> What's your SAN? Why don't you configure your raid1 on the SAN and
    > > >> let it publish that raid group as 1 LUN? Are you using any kind of
    > > >> fibre switch between your cards and your SAN?
    > > >>
    > > >> Drew
    > > >>
    > > >> -----Original Message-----
    > > >> From: Bruen, Mark [mailto:mbruen@trilegiant.com]
    > > >> Sent: Wednesday, January 28, 2004 3:28 PM
    > > >> To: redhat-list@redhat.com
    > > >> Subject: lpfc RAID1 device panics when one device goes away
    > > >>
    > > >>
    > > >> I'm running RedHat AS 3.0 kernel 2.4.21-4.ELsmp on a Dell 1750 with 2
    > > >> Emulex LP9002DC-E HBAs. I've configured a RAID1 device called /dev/md10
    > > >> from 2 SAN based LUNs /dev/sdc and /dev/sde. Everything works fine until
    > > >> I disable one of the HBA paths to the disk. Here's the console output:
    > > >> [root@reacher root]# !lpfc1:1306:LKe:Link Down Event received Data: x2 x2 x0 x20
    > > >> I/O error: dev 08:40, sector 69792
    > > >> raid1: Disk failure on sde, disabling device.
    > > >> Operation continuing on 1 devices
    > > >> md10: vno@ pspar2e! d?i@
    > > >> s@kq tAo rec@oqnAst`rIu/Oc
    > > >> t AaqArra@qyA!@
    > > >> -v-@ cpont
    > > >> inI/uOinhgr oihn de_g_r_a_m@vqA@`@ 70288
    > > >> I/O error: dev 08`I/O sector 70536
    > > >> I/O error: dev 08:40, sector 70784
    > > >> I/O error: dev 08:40, sector 71032
    > > >> I/O error: dev 08:40, sector 71280
    > > >> I/O error@qA@v@p2!?@
    > > >> AqA@qA`I/O
    > > >> BqA@qA@v@p I/Oh 7h____mv@`dev 08:40, sector 72024
    > > >> `I/Oerror: dev 08:40, sector 72272
    > > >> I/O error: dev 08:40, sector 72520
    > > >> I/O error: dev 08:40, sector 72768
    > > >> I/O error: dev 08:40, sector 73@qA@v@p2!?@
    > > >> BqA@qA`I/O
    > > >> CqA@qA@v@p
    > > >> I/Ohdeh____mv@`2
    > > >> I/O error: dev 08:40, `I/Oor 73760
    > > >> I/O error: dev 08:40, sector 74008
    > > >> I/O error: dev 08:40, sector 74256
    > > >> I/O error: dev 08:40, sector 74504
    > > >> I/O error: dev@qA@v@p2!?@
    > > >> CqA@qA`I/O
    > > >> DqA@qA@v@p I/Oh0
    > > >> h____mv@`8:40, sector 75248
    > > >> I/O e`I/O: dev 08:40, sector 75496
    > > >> I/O error: dev 08:40, sector 75744
    > > >> I/O error: dev 08:40, sector 75992
    > > >> I/O error: dev 08:40, sector 76240
    > > >> <@qA@v@p2!?@
    > > >> DqA@qA`I/O
    > > >> EqA@qA@v@p I/Oh8:h____mv@` I/O error: dev 08:40, secto`I/O984
    > > >> I/O error: dev 08:40, sector 77232
    > > >> I/O error: dev 08:40, sector 77480
    > > >> I/O error: dev 08:40, sector 77728
    > > >> I/O error: dev 08:4@qA@v@p2!?@
    > > >> EqA@qA`I/O
    > > >> FqA@qA@v@p I/Oh
    > > >> Ih____mv@`
    > > >> sector 78352
    > > >> I/O error:`I/O 08:40, sector 78600
    > > >> I/O error: dev 08:40, sector 78848
    > > >> I/O error: dev 08:40, sector 79096
    > > >> I/O error: dev 08:40, sector 79344
    > > >> I/@qA@v@p2!?@
    > > >> FqA@qA`I/O
    > > >> GqA@qA@v@p I/Oh sh____mv@`error: dev 08:40, sector
    > > >> 800`I/O4> I/O error: dev 08:40, sector 80336
    > > >> I/O error: dev 08:40, sector 80584
    > > >> I/O error: dev 08:40, sector 80832
    > > >> I/O error: dev 08:40, se@qA@v@p2!?@
    > > >> GqA@qA`I/O
    > > >> HqA@qA@v@p
    > > >> I/Oherh____mv@`or 81576
    > > >> I/O error: dev `I/O0, sector 81824
    > > >> I/O error: dev 08:40, sector 82072
    > > >> I/O error: dev 08:40, sector 82320
    > > >> I/O error: dev 08:40, sector 82568
    > > >> I/O err@qA@v@p2!?@
    > > >> HqA@qA`I/O
    > > >> IqA@qA@v@p I/Ohorh____mv@`: dev 08:40, sector 83312
    > > >> <4`I/OO error: dev 08:40, sector 83560
    > > >> I/O error: dev 08:40, sector 83808
    > > >> I/O error: dev 08:40, sector 84056
    > > >> Unable to handle kernel paging request at virtual address a0fb8488
    > > >> printing eip:
    > > >> c011f694
    > > >> *pde = 00000000
    > > >> Oops: 0000
    > > >> lp parport autofs tg3 floppy microcode keybdev mousedev hid input usb-ohci
    > > >> usbcore ext3 jbd raid1 raid0 lpfcdd mptscsih mptbase sd_mod scsi_mod
    > > >> CPU: -1041286984
    > > >> EIP: 0060:[<c011f694>] Not tainted
    > > >> EFLAGS: 00010087
    > > >>
    > > >> EIP is at do_page_fault [kernel] 0x54 (2.4.21-4.ELsmp)
    > > >> eax: f55ac544 ebx: f55ac544 ecx: a0fb8488 edx: e0b3c000
    > > >> esi: c1ef4000 edi: c011f640 ebp: 000000f0 esp: c1ef40c0
    > > >> ds: 0068 es: 0068 ss: 0068
    > > >> Process Dmu (pid: 0, stackpage=c1ef3000)
    > > >> Stack: 00000000 00000002 022c1008 c1eeee4c c1eff274 00000000 00000000 a0fb8488
    > > >>        c17c4520 f58903f4 00000000 c1efd764 c1eee5fc f7fe53c4 00030001 00000000
    > > >>        00000002 022c100c c1efd780 c1eeba44 00000000 00000000 00000003 c1b968ec
    > > >> Call Trace: [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4178)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef419c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef41b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4278)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef429c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef42b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4378)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef439c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef43b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4478)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef449c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef44b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4578)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef459c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef45b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4678)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef469c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef46b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4778)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef479c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef47b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4878)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef489c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef48b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4978)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef499c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef49b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4a78)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4a9c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef4ab4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4b78)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4b9c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef4bb4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4c78)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4c9c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef4cb4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4d78)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4d9c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef4db4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4e78)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4e9c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef4eb4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4f78)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef4f9c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef4fb4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5078)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef509c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef50b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5178)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef519c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef51b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5278)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef529c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef52b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5378)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef539c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef53b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5478)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef549c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef54b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5578)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef559c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef55b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5678)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef569c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef56b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5778)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef579c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef57b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5878)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef589c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef58b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5978)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef599c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef59b4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5a78)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5a9c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef5ab4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5b78)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5b9c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef5bb4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5c78)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5c9c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef5cb4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5d78)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5d9c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef5db4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5e78)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5e9c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef5eb4)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5f78)
    > > >> [<c011f640>] do_page_fault [kernel] 0x0 (0xc1ef5f9c)
    > > >> [<c011f694>] do_page_fault [kernel] 0x54 (0xc1ef5fb4)
    > > >>
    > > >> Code: 8b 82 88 c4 47 c0 8b ba 84 c4 47 c0 01 f8 85 c0 0f 85 46 01
    > > >>
    > > >> Kernel panic: Fatal exception
    > > >>
    > > >> Any Ideas?
    > > >> Thanks.
    > > >> -Mark
    > > >>
    > > >>
    > > >> --
    > > >> redhat-list mailing list
    > > >> unsubscribe mailto:redhat-list-request@redhat.com?subject=unsubscribe
    > > >> https://www.redhat.com/mailman/listinfo/redhat-list

    -- 
    redhat-list mailing list
    unsubscribe mailto:redhat-list-request@redhat.com?subject=unsubscribe
    https://www.redhat.com/mailman/listinfo/redhat-list
    
