Re: difference between striping using mdadm and LVM



Aragorn wrote:
On Wednesday 20 January 2010 15:48 in comp.os.linux.misc, somebody
identifying as David Brown wrote...
Aragorn wrote:
On Wednesday 20 January 2010 14:07 in comp.os.linux.misc, somebody
identifying as David Brown wrote...

<snip to save a little space>

And assuming the data is important, the OP must also think about
backup solutions. But that's worth its own thread.
Ahh, but that is the First Rule in the Bible of any sysadmin: "Thou
shalt make backups, and lots of them too!" :p
The zeroth rule, which is often forgotten (until you learn the hard
way!), is "thou shalt make a plan for restoring from backups, test
that plan, document that plan, and find a way to ensure that all
backups are tested and restorable in this way". /Then/ you can start
making your actual backups!

Well, so far I've always used the tested and tried approach of tar'ing
in conjunction with bzip2. Can't get any cleaner than that. ;-)


rsync copying is even cleaner - the backup copy is directly accessible. And when combined with hard-linked copies (as rsnapshot does, for example), you also get snapshots.
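As a rough sketch (the paths here are just examples), rsnapshot's hard-link trick can be done with plain rsync and --link-dest:

    # Each run makes a new dated snapshot; files unchanged since the
    # previous snapshot are hard-linked, so they take almost no extra space.
    SRC=/home
    DST=/backup/snapshots
    TODAY=$(date +%Y-%m-%d)
    rsync -a --delete --link-dest="$DST/latest" "$SRC/" "$DST/$TODAY/"
    rm -f "$DST/latest" && ln -s "$TODAY" "$DST/latest"

Run that from cron and you get a row of browsable, directly restorable snapshots for little more than the cost of one full copy.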

Of course, .tar.bz2 is good too - /if/ you have it automated so that it is actually done (or you are one of those rare people who can regularly follow a manual procedure). It also needs to be saved in a safe and reliable place - many people have had regular backups written to tape, only to find later that the tapes were unreadable. And of course a second copy needs to be kept in a different place, ideally at a different site.
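If it helps, the "automated" part can be as small as a single cron entry - this is only an illustration, with made-up paths and schedule:

    # /etc/cron.d/nightly-backup (hypothetical): at 03:00 every night,
    # write a dated .tar.bz2 of /home and /etc to a separate backup disk.
    # (The % characters must be escaped in crontab lines.)
    0 3 * * *  root  tar -cjf /backup/home-$(date +\%Y-\%m-\%d).tar.bz2 /home /etc

Copying the resulting file off-site is then a separate (and equally important) step.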

I know I'm preaching to the choir here, as you said before - but there may be others in the congregation.

And the second rule is "thou shalt make backups of your backups",
followed by "thou shalt have backups of critical hardware". (That's
another bonus of software raid - if your hardware raid card dies, you
may have to replace it with exactly the same type of card to get your
raid working again - with mdadm raid, you can use any PC.)

Well, considering that my Big Machine has drained my piggy bank for
about 17'000 Euros worth of hardware, having a duplicate machine is not
really an option. The piggy bank's on a diet now. :-)


You don't need a duplicate machine - you just need duplicates of any parts that are important, specific, and may not always be easily available. There is no need to buy a new machine, but as soon as your particular choice of hardware raid card starts going out of fashion, buy a spare. Better still, buy a spare /now/, before the manufacturer updates the firmware in new versions of the card and they become incompatible with your raid drives. Of course, you can always restore from backup in an emergency if the worst happens.
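That portability point about software raid is worth stressing: an mdadm array carries its metadata on the disks themselves, so moving it to any spare Linux box is normally no more than (device names are only an example):

    # Plug the drives into any Linux machine and let mdadm find them
    # by their on-disk superblocks:
    mdadm --assemble --scan
    # or name the members explicitly:
    mdadm --assemble /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdd1

No particular controller, slot order or BIOS is needed - just working SATA/SAS ports.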

I do on the other hand still have a slightly older dual Xeon machine
with 4 GB of RAM and a U320 SCSI RAID 1 (with two 73 GB disks), which
I will be setting up as an emergency replacement server, and to store
additional backups on - I store my other backups on Iomega REV disks.

Another thing which must not be overlooked is that the CPU or ASIC on
a hardware RAID controller is typically a RISC chip, and so comparing
clock speeds would not really give an accurate impression of its
performance versus a mainboard processor chip. For instance, a MIPS
or Alpha processor running at 800 MHz still outperforms most (single
core) 2+ GHz processors.
As already mentioned, "hardware" raid is often done now with a general
purpose processor rather than an ASIC - and MIPS is a particularly
popular core for the job.

I'm not sure about the one on my SAS RAID adapter, but I think it's an
Intel RISC processor. It's not a MIPS or an Alpha, that much I am
certain of.


Intel haven't made RISC processors for many years (discounting the Itanium, which is an unlikely choice for a raid processor). They used to have StrongARMs, and long, long ago they had a few other designs, but I'm pretty certain you don't have an Intel RISC processor on the card. It also will not be an Alpha - they have not been made for years either (they were very nice chips until DEC, and later Compaq and HP, screwed them up, with plenty of encouragement from Intel). Realistic cores include MIPS in many flavours, PPC, and, for more recent designs, perhaps an ARM of some kind. If the heavy lifting is being done by ASIC logic rather than the processor core, there is a wider choice of possible cores.

But while you get a lot more work out of an 800 MHz RISC chip for a given
price, size or power than you do with an x86, you don't get more for a
given clock rate. Parity calculations are really just a big stream
of "xor"s, and a modern x86 will chew through these as fast as memory
bandwidth allows. Internally, x86 instructions are mostly decoded into
wide RISC-style micro-operations, so a decently written parity
function will be as efficient per clock on an x86 as it is on a MIPS.
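(Concretely, for a stripe with three data blocks the parity is just

    P = D1 xor D2 xor D3

and if the disk holding D2 dies, its block comes back with the same operation:

    D2 = P xor D1 xor D3

so the whole job is a handful of xors per stripe, limited by how fast the blocks can be streamed through the CPU.)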

True.

I have never heard of a distinction between a "hot spare" that is
spinning, and a "standby spare" that is not spinning.
This is quite a common distinction, mind you. There is even a
"live spare" solution, but to my knowledge this is specific to
Adaptec - they call it RAID 5E.

In a "live spare" scenario, the spare disk is not used as such but
is part of the live array, and both data and parity blocks are
being written to it, but with the distinction that each disk in the
array will also have empty blocks for the total capacity of a
standard spare disk. These empty blocks are thus distributed
across all disks in the array and are used for array reconstruction
in the event of a disk failure.
Is there any real advantage of such a setup compared to using raid 6
(in which case, the "empty" blocks are second parity blocks)? There
would be a slightly greater write overhead (especially for small
writes), but that would not be seen by the host if there is enough
cache on the controller.
Well, the advantage of this set-up is that you don't need to replace
a failing disk, since there is already sufficient disk space left
blank on all disks in the array, and so the array can rebuild itself
into that spare space. This is of course all nice in
theory, but in practice one would eventually replace the disk anyway.
The same is true of raid6 - if one disk dies, the degraded raid6 is
very similar to raid5 until you replace the disk.

And I still don't see any significant advantage of spreading the
holes around the drives rather than having them all on the one drive
(i.e., a normal hot spare). The rebuild still has to do as many reads
and writes, and takes as long. The rebuild writes will be spread over
all the disks rather than just on the one disk, but I can't see any
advantage in that.

Well, the idea is simply to give the spare disk some exercise, i.e. to
use it as part of the live array while still offering the extra
redundancy of a spare. So in the event of a failure, the array can be
fully rebuilt without the need to replace the broken drive, rather than
staying in degraded mode until the broken drive is replaced.


The array will be in degraded mode while the rebuild is being done, just as if it were raid5 with a hot spare - and it will be equally slow during the rebuild. So no points there.

In fact, according to Wikipedia, the controller will "compact" the degraded raid set into a normal raid5, and when you replace the broken drive it will "uncompact" it into the raid 5E arrangement again. The "compact" and "uncompact" operations take much longer than a standard raid5 rebuild.

So all you get here is a marginal increase in the parallelisation of multiple simultaneous small reads, which you could get anyway with raid6 rather than raid5 with a spare.
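To put the comparison in concrete terms (device names invented), the two mdadm equivalents would be something like:

    # RAID 5 over four disks plus one hot spare:
    mdadm --create /dev/md0 --level=5 --raid-devices=4 \
          --spare-devices=1 /dev/sd[bcdef]
    # RAID 6 over the same five disks - identical usable capacity,
    # but the "spare" space now holds a second parity block:
    mdadm --create /dev/md0 --level=6 --raid-devices=5 /dev/sd[bcdef]

Same five drives, same three disks' worth of usable space, but the raid6 set survives any two failures rather than one failure plus a rebuild window.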

I suppose read performance, especially for many parallel small reads,
will be slightly higher than for a normal hot spare, since you have
more disks with active data and therefore higher chances of
parallelising these accesses. But you get the same advantage with
raid6.

Yes, but RAID 6 would be slower for small writes, and if one of the
drives fails, the array stays in degraded mode (since it considers
itself to be a RAID 6, not a RAID 5E).


Degraded raid5 and raid6 have varying speeds, depending on whether the data you access is available directly or must be calculated from the rest of the stripe and the parity. The same applies to a degraded raid 5E with a broken drive.

You are right that small writes to raid 6 would be slower than to a raid 5E.

It looks like we agree on most things here - we just had a little
difference on the areas we wrote about (specific information for the
OP, or more general RAID discussions), and a few small differences
in terminology.
Well, you've made me reconsider my usage of RAID 5, though. I am now
contemplating using two RAID 10 arrays instead of two RAID 5
arrays, since each of the arrays has four disks. They are two quite
different arrays, though. They're connected to the same RAID
controller, but the first array consists of 147 GB 15k Hitachi SAS
disks and the second array consists of 1 TB 7.2k Western Digital
RAID Edition SATA-2 disks on a hotswap backplane.

I had always considered RAID 5 to be the best trade-off, considering
the loss of disk space involved versus the retail price of the hard
disks - especially the SAS disks - but considering that the SAS array
will be used to house the main systems in a virtualized set-up (on
Xen) and will probably endure the most small and random writes, RAID
10 might actually be a better solution. The cost of the lost
disk space on the SATA-2 disks is smaller, since that type of disk is
far less expensive than SAS.
I gather that raid 10 (hardware or software) is now often considered a
better choice - raid 5 is often viewed as unreliable due to the risk
of a second failure during rebuilds, which are increasingly
time-consuming with larger disks. Where practical, I think
mdadm "far" raid 10 is the optimal choice if you are happy with losing
50% of your disk space - it is faster than other redundant setups in
many situations, and has a great deal of flexibility.
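As a sketch (device names invented), a four-disk "far" layout is created with:

    # raid10, two copies of each block, "far" layout: reads stripe
    # across all four spindles at close to raid0 speed.
    mdadm --create /dev/md0 --level=10 --layout=f2 \
          --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde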

Well, 50% is the minimum storage capacity one loses when using any kind
of mirroring, be it RAID 1, RAID 10, RAID 0+1, RAID 50 or whatever.

If you want more redundancy, you can use double mirrors (three copies of
the data) for 33% usable disk space and still have full speed.
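(Purely as an illustration with made-up devices - three copies over six disks, so a third of the raw space is usable:

    mdadm --create /dev/md1 --level=10 --layout=f3 \
          --raid-devices=6 /dev/sd[bcdefg]

and reads still get the full striped speed of the far layout.)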

Yes, but that's a set-up which, due to understandable financial
considerations, would be reserved only for the corporate world. Many
people already consider me certifiably insane for having spent that
much money - 17'000 Euro, as I wrote higher up - on a privately owned
computer system. But then again, for the intended purposes, I need
fast and reliable hardware and a lot of horsepower. :-)


I'm curious - what is the intended purpose? I think I would have a hard job spending more than about three or four thousand Euros on a single system.

In the case of the OP, on the other hand, 45 SAS disks of 300 GB each
and three SAS RAID storage enclosures don't seem like quite an
affordable buy either, so I take it he intends to use it for a business.


It also does not strike me as a high value-for-money system - I can't help feeling that this is way more bandwidth than you could actually make use of in the rest of the system, so it would be better to have fewer, larger drives and fewer layers, to reduce the latencies. Spend the cash saved on even more RAM :-)

45 disks at a throughput of say 75 MBps each gives about 3.3 GBps - say 3 GBps, since some are hot spares. Ultimately, being a server, this is going to be pumped out on Ethernet links. That's a lot of bandwidth - it would effectively saturate three 10 Gbit links.

I have absolutely no real-world experience with these sorts of systems, and could therefore be totally wrong, but my gut feeling is that the theoretical numbers will not scale with so many drives - something like fifteen 1 TB SATA drives would be similar in speed in practice.

That, or he's a maniac like me. :p

If you have the chance, it would be very nice to try out some
different arrangements and see which is fastest in reality, not just
in theory!

Ahh, but whole books have been written about such tests, and it still
always boils down to "What are you planning to do with it?" For
instance, a database server has different needs from a mailserver,
which in turn has different needs from a fileserver or a workstation,
etc. ;-)


It would still be fun!

The other option is to go for a file system that handles multiple
disks and redundancy directly - ZFS is the best known, with btrfs the
experimental choice on Linux.

I don't think Btrfs is considered stable enough yet. ZFS is of
course a great choice, but its CDDL licence is incompatible with the
GPL, so it cannot be linked into the Linux kernel. If there is a
"filesystem in userspace" implementation of it, then it would of
course be possible to legally use ZFS on a GNU/Linux system.


There /is/ a "filesystem in userspace" implementation of ZFS (using fuse). But it is not feature complete, and not particularly fast.

btrfs is still a risk, and is still missing some features (such as elegant handling of low free space...), but the potential is there.

I have been looking into NexentaOS (i.e. GNU/kOpenSolaris) for a while,
which uses ZFS, albeit that ZFS was not my reason for being interested
in the project. I was more interested in the fact that it supports
both Solaris Zones - of which the Linux equivalents are OpenVZ and
VServer - and running paravirtualized on top of Xen.

Doing that with OpenVZ requires the use of a 2.6.27 kernel which is
still considered unstable by the OpenVZ developers, and doing that with
VServer is as good as impossible, since they're still using a 2.6.16
kernel, and you can't apply the (now obsolete) Xen patches to that
because those are for 2.6.18. And thus, running VServer in a Xen
virtual machine would require that you run it via hardware
virtualization rather than paravirtualized.

The big problem with NexentaOS however is that it's based on Ubuntu and
that it uses binary .deb packages, whereas I would rather have a Gentoo
approach, where you can build the whole thing from sources without
having to go "the LFS way".


Why is it always so hard to get /everything/ you want when building a system :-(

Oh well, I've put the whole thing off for the weekend, so I still have
plenty of time to think things over. ;-)

See, this is one of the advantages of Usenet. People get to share
not only knowledge but also differing views and strategies, and in
the end, everyone will have gleaned something useful. ;-)
Absolutely - that's also why it's good to have a general discussion
every now and again, rather than just answering a poster's questions.
Good questions (such as in this thread) inspire an exchange of
information for many people's benefit (I've learned things here too).

Maybe we should invite some politicians over to Usenet. Then *they*
might possibly learn something about the real world as well. :p
