Re: Opinions on new Fedora Core 2 install with LVM 2 and snapshots?

From: Bryan J. Smith (b.j.smith_at_ieee.org)
Date: 07/27/04

    To: fedora-list@redhat.com
    Date: Mon, 26 Jul 2004 18:41:52 -0400
    
    

    [ Thanx again for your help! ]

    Bill Rugolsky Jr. wrote:
    > Well, that is also what we are doing. We need on-site and off-site
    > backup of our NetApp filer, and can do it with a Linux system for $2K
    > apiece.

    Exactly. It's my old employer, which I now consult for, but which doesn't
    have the money for more than 10 engineers these days. The NetApp is where
    95% of the Linux and Solaris clients mount. We actually didn't shove
    out $8K more for the SMB service on the NetApp, because we like it to be
    largely NFS.

    The Linux server is where 5% of the UNIX client traffic goes, so NFS
    access is minimal. It largely serves small Windows client SMB access.
    In fact, we NFS mount the NetApp and do SMB on behalf of it (because
    SMB use is so limited -- again, we saved $8K because the engineers use
    it so little).

    > P4 2.8GHz, 1GB RAM, dual SATA 250GB in MD RAID1.

    This will be a dual-P3, ServerWorks chipset, 1GB RAM, 1000Base-SX
    Gigabit NIC (storage and NIC on different PCI channels), quad-SATA in
    hardware RAID-5 (possibly RAID-0+1 instead, not sure yet). The new
    3Ware Escalade 9000 series products have 128MB (up to 1GB) of SDRAM in
    addition to on-board SRAM, so I'm not too worried about the RAID-5
    performance.

    > My personal system is a dual Opteron 246, 4GB RAM

    Sounds like what I want to build. I'm waiting for the IWill board that
    attaches the nVidia "CK8-04 Pro" (nForce4?) 24-channel PCI-Express +
    Low Pin Count (LPC) HyperTransport tunnel to one CPU, and then an AMD8131
    PCI-X 1.0 HyperTransport tunnel to the other.

    Talk about I/O options and no bottlenecks!
    AnandTech had a picture of it here:
    http://images.anandtech.com/reviews/motherboards/fall2004preview/iwill2a64.jpg
    It should be sub-$500, probably sub-$400. Half of that is really
    because of the AMD8131 plus PCI-X traces. The "CK8-04 Pro" (nForce4?)
    will be nVidia's next "commodity" chipset for Socket-939/940 mainboards.
    So there's really no added cost there (which should be sub-$200 on its
    own in a mainboard, without the AMD8131 plus PCI-X traces).

    > 4x200GB SATA

    3Ware controller or software RAID?

    At $325 for a 4-channel 3Ware Escalade 9500S-4LP, I really see no reason
    not to put one in when you have 4 drives. You get a powerful 64-bit
    ASIC, plus SRAM for 0 wait state transfers/queuing, plus 128MB
    (expandable up to 1GB) of NVRAM-backed SDRAM for buffered I/O (RAID-5
    writes as well as general read buffer).

    Heck, it's probably worth it to upgrade to 512MB or 1GB on the 3Ware
    Escalade 9000 _instead_ of a separate PCI NVRAM board, just from a
    performance standpoint. The buffer is NVRAM-backed right there on the card.

    > with each drive split into 3 equal partitions, for playing around with
    > various MD configs. I'm looking at tuning the whole NFS I/O path on the
    > latter,

    Let me know what you find. I'm _avoiding_ MD. I'm _only_ interested in
    using LVM for snapshots. I'd rather let the 3Ware 64-bit ASIC do all the
    queuing, sector remapping and RAID -- including SRAM for the 0 wait state
    operations and SDRAM for the buffered I/O (especially RAID-5 writes).

    People buy layer 3 network switches for performance, instead of using a
    Linux PC as a router. A 3Ware card is the same thing versus using software
    RAID in my opinion. Only for RAID-0 does it matter little (see my 2004
    April article in Sys Admin magazine for more on RAID efficiency by level).

    And 3Ware does a great job of releasing GPL drivers. Since all the "brains"
    of the RAID are on the card, the drivers are simple block code. And unlike
    the traditional intelligent ATA/SCSI RAID route that uses microcontrollers
    and multi-wait-state SDRAM, the 3Ware 9000 series has both SRAM and SDRAM
    on-board for the best of all worlds. That fits ATA especially well, since
    ATA is non-blocking I/O for just about everything except RAID-5 writes
    (which the 9000 series now has SDRAM to buffer, unlike the 5000-8000
    series before it).

    > I want to experiment with various configs first, e.g., filesystem LV on
    > one RAID1 PV, journal and/or snapshot LV on the other PV.

    I just want snapshots, that's it. I try to sell companies on a 1GB PCI
    NVRAM board for a full data Ext3 journal, but most don't go for it.
    The 3Ware Escalade's on-board buffer probably removes the issue now.

    > RAID6 also.

    What is RAID-6? Is that 2 parity stripes so you can lose 2 disks?
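
    If that reading of RAID-6 is right (two parity blocks per stripe, so any
    two disks can fail), the capacity trade-off against RAID-5 and RAID-0+1
    is simple arithmetic. A rough sketch, assuming equal-size drives and the
    textbook definition of each level; the function name is made up for
    illustration, and the 4x200GB figures come from the config quoted above:

```python
def raid_summary(level, drives, size_gb):
    """Return (usable_gb, guaranteed_failures_survived) for equal-size drives.

    Textbook definitions only; real controllers may reserve extra space.
    """
    if level == "0":
        return drives * size_gb, 0            # striping, no redundancy
    if level == "1":
        return size_gb, drives - 1            # every drive is a full mirror
    if level == "0+1":
        if drives < 4 or drives % 2:
            raise ValueError("RAID-0+1 needs an even number of drives, 4 or more")
        return (drives // 2) * size_gb, 1     # a second failure *may* be fatal
    if level == "5":
        if drives < 3:
            raise ValueError("RAID-5 needs at least 3 drives")
        return (drives - 1) * size_gb, 1      # one drive's worth of parity
    if level == "6":
        if drives < 4:
            raise ValueError("RAID-6 needs at least 4 drives")
        return (drives - 2) * size_gb, 2      # two drives' worth of parity
    raise ValueError("unknown RAID level: " + level)


if __name__ == "__main__":
    for level in ("0", "0+1", "5", "6"):
        usable, survives = raid_summary(level, drives=4, size_gb=200)
        print("RAID-%-4s usable=%4dGB  survives %d failure(s)" % (level, usable, survives))
```

    With four 200GB drives that works out to 600GB usable for RAID-5 versus
    400GB for either RAID-6 or RAID-0+1, so on 4 drives RAID-6 costs the same
    capacity as RAID-0+1 but is guaranteed to survive any two failures.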

    > Once GFS clustering stabilizes on 2.6, I suppose I'll start over with
    > a cluster config ...

    Oh definitely with you there! Will start doing iSCSI liberally once
    GFS is in stock Fedora.

    > That should be fine.
    > I've been working from Arjan's Fedora test kernels, dropping the 4G/4G
    > and turning off highmem completely. I've also added kexec and a few
    > other goodies.
    > Arjan has been tracking the BitKeeper snapshots pretty closely.

    I looked inside the Red Hat SRPM for the 2.6.7-1.494 kernel and it's
    2.6.8-RC1 based. I think I'll just stick with that.

    One thing that I always fear is having the kernel and user-space tools
    "out-of-sync." So sticking with built RPMS (even if I have to rpmbuild
    them from SRPMS myself) typically tells me what user-space dependencies
    there are.

    I've seen people who choose Mandrake and ReiserFS and run into that
    repeatedly. ReiserFS works fine until an off-line fsck is required --
    then bam! There goes their data because of the off-line tools, not so
    much because of ReiserFS itself. I know, Ext3 and XFS don't change internal
    structures like ReiserFS does, but I am still always shy with kernels.

    > Well, of course, we want to get them fixed, and bug reports are useful. :-)

    I have a very similar setup here at my house, dual-CPU, older 3Ware card,
    etc... I'm going to use that as a "test server" for this setup first.

    But for my company, I need production-quality. So if it's got bugs, I ain't
    going to enable snapshots until it's ready. I'm glad to hear LVM2 volumes
    on their own are fine.

    > FWIW, several commercial appliances apparently use XFS.

    Its large-file performance is why it always found a home in video servers.

    I also adopted XFS early on (2001 February), because it handled all sorts
    of things on 2.4 that Ext3 struggled with for a while -- Quotas, ACLs, etc...

    > I feel no compelling need to abandon Ext3; in my experience, the
    > filesystem and tools are extraordinarily robust,

    Of course. I trust Ext3 as well, because its on-disk structures haven't
    changed since Ext2 in the mid-'90s.

    But that's also true of XFS; it has remained unchanged since the mid-'90s
    as well. They ported it directly over from Irix. I trust it as well. NFS
    support was also excellent.

    I'm shy to even try IBM's JFS, because it comes from OS/2 and not AIX.
    JFS really lacked a _lot_ of traditional UNIX capabilities in its first
    releases on Linux, unlike XFS.

    I wondered why until the whole "Project Monterey" falling out happened
    with SCO and IBM, eventually resulting in the lawsuit that has gone
    well beyond it. So it made sense once I looked at it from that angle.

    > and performance has always been adequate for my purposes.

    I didn't use XFS for performance. I used it for features and maturity.
    I'd rather stick with Ext3 because that's what Red Hat supports.

    But if people feel XFS is better for LVM2, then I'd use it. I didn't
    see SGI releasing XFS for Fedora releases (unlike RHL before it), so I
    figured there was little reason to go that direction. But I had to
    ask.

    > If you want to do hardcore testing, you need to choose one of the
    > several methods to switch off writes to the device at the block
    > layer, and then loop randomly wrecking and recovering the filesystem
    > and looking for corruption. (See Andrew Morton's test tools in Jeff
    > Garzik's gkernel.sourceforge.net repository.)

    I might do that on my personal workstation, but not on my home server,
    let alone a client's server.

    I just want a reliable volume manager. I'll enable snapshots when
    everyone feels they are production-quality. I was hoping LVM2 w/device
    mapper was there now -- as long as that's _all_ I'm using it for,
    snapshots and _nothing_ else.

    I'll leave the redundancy features to an underlying, intelligent
    storage controller. It makes life simpler for a sysadmin.

    > I like the 3ware controllers, but until their meta-data is supported
    > by dmraid or the like, I'll pass.

    Why? Is there some bonus for dmraid? I rather like the fact that the
    OS has no idea what is underneath.

    > Because every kernel has bugs, and hardware can be flakey.
    > Corruption can occur irrespective of journaling.

    Really? I haven't run into this with Ext3 or XFS (other than the one
    XFS 1.0 bug that took out my /var on one system). Is LVM2 flaky?

    I'd rather not run it if so.

    > Well, here's the theory: when doing synchronous NFS commits, full
    > data journaling only requires a sequential write to the journal;

    Correct. The logic is rather simplistic.

    > the data gets written back to the filesystem asynchronously. If it
    > is on a separate spindle or in NVRAM, it is decoupled from both the
    > read traffic and the asynchronous writeback. With NFS, the latency
    > of write acknowledgements typically affects throughput, so improving
    > one improves the other.

    Correct. And I totally agree with you on a preference for using an
    NVRAM board for Linux NFS servers.

    But I'll probably just do NFS async with Ext3 ordered-writes on
    systems where I don't have an NVRAM board. Most clients have balked
    at the idea of adding another $1K to the system cost.

    > I haven't done much experimenting, but over the years folks have
    > posted mixed results on ext3-users and nfs mail lists with various
    > combinations of data journal mode and internal, external, or
    > NVRAM journals.

    I've never seen improved performance. But I _do_ like the "peace of
    mind" of knowing that I'm doing NFS v3 sync, instead of async.

    But if there is not any issue with LVM2+DM+Snapshots, then I'll just
    use NFS async with Ext3 ordered-writes.

    > None that I'm aware of, but I know that you've been lurking on the
    > nfs and ext3-users list for years -- search the archives. ;-p

    Yeah, I need to do that. Maybe I can help "beef up" the Linux NFS
    HOWTO with some info.

    So if you have _any_ info, I'd be willing to document it into one
    guide.

    > Seriously, there are quite a few performance discussions and tuning
    > suggestions over the years involving Neil Brown, Tom McNeal, Chuck
    > Lever and others mostly on the NFS side of things, Andrew Morton,
    > Stephen Tweedie, and Andreas Dilger mostly on the Ext3/VM side.

    I really want to write an expanded HOWTO on how to build a production
    Linux NFS server with LVM2, Snapshots and Ext3 (possibly XFS as well).

    I started one back in the late 2.2 days with the Brown+Trond NFS
    patches, but then Seth updated the HOWTO and I just forgot about it.

    In all

    > You should measure the difference between NFS async and sync operation.
    > If things are working correctly, 2.6 sync should not be too shabby.

    With an NVRAM board, I don't doubt it. But without one, I think I'll
    stick with NFS async.

    Although the 128MB (up to 1GB) of NVRAM-backed SDRAM buffer on the 3Ware
    Escalade 9000 series is sure to help (not to mention the 2-4MB of SRAM for
    queuing, damn I love 3Ware's "storage switch" ASIC approach).
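
    A crude way to put numbers on the sync-versus-async question is to time
    the same writes with and without forcing each one to stable storage.
    The sketch below is only a local proxy: it uses fsync() after every
    write to imitate what a synchronous commit costs, and to compare real
    exports one would point it at a directory on the NFS client mount and
    re-run it after toggling the export between sync and async. The function
    names, block size and write count are arbitrary choices for illustration.

```python
import os
import tempfile
import time


def time_writes(path, count=200, size=4096, sync_each=False):
    """Write `count` blocks of `size` bytes; fsync after each one if sync_each."""
    block = b"\0" * size
    start = time.perf_counter()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            os.write(fd, block)
            if sync_each:
                os.fsync(fd)  # wait until this block reaches stable storage
        os.fsync(fd)          # final flush so both runs end with data on disk
    finally:
        os.close(fd)
    return time.perf_counter() - start


def compare(directory):
    """Return elapsed seconds for buffered and per-write-sync runs."""
    results = {}
    for label, sync_each in (("buffered", False), ("sync", True)):
        path = os.path.join(directory, "write_probe_" + label + ".dat")
        try:
            results[label] = time_writes(path, sync_each=sync_each)
        finally:
            if os.path.exists(path):
                os.remove(path)  # leave the target directory clean
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        for label, seconds in compare(scratch).items():
            print("%-9s %.4fs" % (label, seconds))
```

    The gap between the two numbers is roughly the latency an NVRAM journal
    or a controller write buffer is supposed to hide, so it is worth running
    on the actual array rather than trusting figures from another box.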

    > As for CIFS, I have no clue.

    I'm not too worried about SMB. SMB access is rather limited.

    The engineers use NFS, 95% of which goes to the NetApp, and then Rsync.

    > Now, I need to go take my own advice, when I find a few free hours ...

    Hey, if you have any notes, I'm more than willing to put them into
    a HOWTO. Thanx dude!

    -- Bryan

    P.S. Do I need to do anything beyond loading a LVM2 kernel with "device
    mapper" to use "pvcreate" to do snapshots?
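
    For reference, the usual LVM2 sequence as commonly documented: pvcreate
    only labels the device, and the snapshot itself comes from "lvcreate -s"
    against an existing logical volume (which needs the device-mapper
    snapshot target available in the kernel). A dry-run sketch that only
    prints the commands and runs nothing; the device path, volume names and
    sizes are made-up placeholders:

```python
import shlex


def snapshot_plan(device, vg, lv, lv_size, snap_name, snap_size):
    """Return the LVM2 commands (as argv lists) from bare device to snapshot."""
    origin = "/dev/%s/%s" % (vg, lv)
    return [
        ["pvcreate", device],                        # label the device as a PV
        ["vgcreate", vg, device],                    # build a volume group on it
        ["lvcreate", "-L", lv_size, "-n", lv, vg],   # the origin volume
        # The snapshot: -s marks it as one, -L is the copy-on-write space
        ["lvcreate", "-s", "-L", snap_size, "-n", snap_name, origin],
    ]


if __name__ == "__main__":
    # Hypothetical names: the array shows up as /dev/sdb1, the VG is "vg0"
    plan = snapshot_plan("/dev/sdb1", "vg0", "export", "100G", "export_snap", "10G")
    for argv in plan:
        print(" ".join(shlex.quote(arg) for arg in argv))
```

    The snapshot size is the space set aside for changed blocks, not a full
    copy, so the volume group has to be left with free extents for it.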

    -- 
         Linux Enthusiasts call me anti-Linux.
       Windows Enthusiasts call me anti-Microsoft.
     They both must be correct because I have over a
    decade of experience with both in mission critical
    environments, resulting in a bigotry dedicated to
     mitigating risk and focusing on technologies ...
               not products or vendors
    --------------------------------------------------
    Bryan J. Smith, E.I.            b.j.smith@ieee.org
    -- 
    fedora-list mailing list
    fedora-list@redhat.com
    To unsubscribe: http://www.redhat.com/mailman/listinfo/fedora-list
    
