Re: [opensuse] Raid5/LVM2/XFS alignment



On Jan 30, 2008 5:41 PM, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
Thanks Neil,

I did not think about the on-disk cache when I set the count. You
should also run sync after calling dd and include that in your timing.
There may even be a way from user space to tell the drives to flush
their caches. hdparm?
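
A minimal sketch of the timing advice above, assuming GNU dd (for conv=fsync) and a bash-like shell. The function name timed_write and the scratch path are made up for illustration. conv=fsync makes dd flush its output file before it exits, so the flush is counted in the elapsed time; whether the drive's own write cache is emptied as well depends on the kernel and drive settings.

```shell
# Hypothetical helper: write COUNT blocks of size BS to TARGET and
# print the elapsed whole seconds, with the flush included in the timing.
timed_write() {
    local target=$1 bs=$2 count=$3
    local start end
    start=$(date +%s)
    # conv=fsync (GNU dd) calls fsync() on the output file before dd exits.
    dd if=/dev/zero of="$target" bs="$bs" count="$count" conv=fsync 2>/dev/null
    # Flush anything else still sitting in the page cache.
    sync
    end=$(date +%s)
    echo "$((end - start))"
}

# Example (assumed scratch path): 4k blocks, about 4 MB in total.
# timed_write /tmp/big-file 4k 1000
```

Timing the whole function from outside with the shell's time keyword would work just as well; the point is only that the flush happens inside the measured interval.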

Isn't that usually done when shutting down? I'd imagine it could cause
a lot of trouble when a PC shuts down with some data still in the drive
cache. I don't know much about these things, but to find a program that
forces the drive caches to be flushed I'd look in the shutdown
scripts.


As to the block size, dd will invoke a kernel write call for each
block. In theory the kernel can coalesce those into bigger blocks, so
there is no easy way to say what is actually being sent to the disk.
But I don't think the kernel should be breaking down individual
writes.

So as I put in another e-mail, if dd is called with bs = 3x chunksize
(for a 4-disk raid5), and the writes are stripe aligned, then the
kernel has the ability to fully optimize the parity calculation. And
parity calculation is by far the biggest performance issue related to
raid5.
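
The arithmetic behind "bs = 3x chunksize" can be sketched as follows. The function name full_stripe_kib is made up for illustration, and the 64k chunk size is the figure assumed elsewhere in this thread; the real value for an array is shown in /proc/mdstat or by mdadm --detail. The sketch only computes a block size. It does not check that the writes actually start on a stripe boundary.

```shell
# Data carried by one full stripe, in KiB: in each stripe every disk but
# one holds data and the remaining chunk holds parity.
full_stripe_kib() {
    local disks=$1 chunk_kib=$2
    echo $(( (disks - 1) * chunk_kib ))
}

# 4-disk raid5 with 64k chunks (assumed) -> block size for dd
bs="$(full_stripe_kib 4 64)k"
echo "dd if=/dev/zero of=big-file bs=$bs"
# prints: dd if=/dev/zero of=big-file bs=192k
```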

Greg





On Jan 30, 2008 6:12 AM, Neil <hok.krat@xxxxxxxxx> wrote:

On 1/28/08, Greg Freemyer <greg.freemyer@xxxxxxxxx> wrote:
On Jan 28, 2008 11:25 AM, Ciro Iriarte <cyruspy@xxxxxxxxx> wrote:
Hi, does anybody have some notes about tuning md raid5, LVM and XFS?
I'm getting 20 MB/s with dd and I think it can be improved. I'll add
config parameters as soon as I get home. I'm using md raid5 on a
motherboard with an nvidia SATA controller, 4x500GB Samsung SATA2 disks
and LVM with OpenSUSE 10.3@x86_64.

Regards,
Ciro
--

I have not done any raid5 perf. testing: 20 MB/sec seems pretty bad,
but not outrageous I suppose. I can get about 4-5 GB/min from new SATA
drives. So about 75 MB/sec from a single raw drive (ie. dd
if=/dev/zero of=/dev/sdb bs=4k)

You don't say how you're invoking dd. The default bs is only 512 bytes
I think, and that is totally inefficient with the Linux kernel.

I typically use 4k which maps to what the kernel uses. ie. dd
if=/dev/zero of=big-file bs=4k count=1000 should give you a simple but
meaningful test.

I think the default chunk size (stride) is 64k per drive, so if you're
writing 3x 64K at a time, you may get perfect alignment and avoid the
overhead of having to recalculate the parity all the time.

As another data point, I would bump that up to 30x 64K and see if you
continue to get speed improvements.

So tell us the write speed for
bs=512
bs=4k
bs=192k
bs=1920k

And the read speeds for the same. ie. dd if=big-file of=/dev/null bs=4k, etc.
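
The test matrix above could be generated along these lines. This is a dry run that only prints the dd commands; the function name dd_plan and the total of 1920 MiB are assumptions, chosen so that every block size divides the total evenly and all runs move the same amount of data. conv=fsync assumes GNU dd.

```shell
# Total data per run, in KiB (assumed figure: 1920 MiB).
total_kib=$((1920 * 1024))

# Print (not run) one write test and one read test per block size.
dd_plan() {
    local bs_bytes count
    for bs_bytes in 512 4096 $((192 * 1024)) $((1920 * 1024)); do
        count=$(( total_kib * 1024 / bs_bytes ))
        echo "dd if=/dev/zero of=big-file bs=$bs_bytes count=$count conv=fsync"
        echo "dd if=big-file of=/dev/null bs=$bs_bytes"
    done
}

dd_plan
```

Reading back a file that was just written may be served from RAM rather than from the disks, so for the read tests the file should be larger than memory or the page cache should be emptied first.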

I would expect the write speed to go up with each increase in bs, but
the read speed to be more or less constant. Then you need to figure
out what sort of real world block sizes you're going to be using. Once
you have a bs, or collection of bs sizes that match your needs, then
you can start tuning your stack.

Greg
--
Greg Freemyer
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
--
To unsubscribe, e-mail: opensuse+unsubscribe@xxxxxxxxxxxx
For additional commands, e-mail: opensuse+help@xxxxxxxxxxxx


Isn't there a minimum amount of data that has to be transferred to get
a reliable reading? Something like (4 disks - 1 redundant disk) *
16 MB of disk cache, which would result in a minimum of 48 MB sent in
this case.
I believe this because all data sent to a disk is first cached (as far
as I know) in the disk cache (most high volume drives have 16 MB of
it). Or does dd circumvent that?
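
As a rough check of that rule of thumb, the sketch below compares the proposed minimum with the test suggested earlier in the thread (bs=4k count=1000). The function name min_test_mib is made up for illustration, and the 16 MB cache per drive is the figure assumed in the message, not a measured value.

```shell
# Rule-of-thumb lower bound: enough data to overflow the write cache on
# every data-carrying disk.
min_test_mib() {
    local disks=$1 cache_mib=$2
    echo $(( (disks - 1) * cache_mib ))
}

min_mib=$(min_test_mib 4 16)            # 48 MiB
suggested_kib=$(( 4 * 1000 ))           # bs=4k count=1000 -> 4000 KiB
# Number of 4k blocks needed to write at least the minimum.
count=$(( min_mib * 1024 / 4 ))         # 12288 blocks

echo "suggested test writes ${suggested_kib} KiB, minimum is $((min_mib * 1024)) KiB"
echo "dd if=/dev/zero of=big-file bs=4k count=$count"
```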

Also I believe the dd block size should be at least (4 disks -
1 redundant disk) * 256k array chunk size = 768k.
If you use a block size smaller than 256k the increased speed will not
be visible: the system is still writing to 1 disk at a time!
E.g. with bs=4k there are 64 blocks in each 256k chunk.
What does dd do? It writes 64 blocks to disk 1, continues to write 64
blocks to disk 2, continues to write 64 blocks to disk 3, and so
forth (this is without paying any mind to the redundancy).
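
The simplified picture above can be written down as a small mapping from block number to disk. This is only a model of the description in the message: parity is ignored, so it is really a raid0 layout, whereas md raid5 rotates the parity chunk across the disks. The function name disk_for_block is made up for illustration.

```shell
# Which data disk (1-based) a given dd block lands on, if chunks are
# dealt out to the disks in turn and parity is ignored.
disk_for_block() {
    local block=$1 bs_kib=$2 chunk_kib=$3 disks=$4
    local blocks_per_chunk=$(( chunk_kib / bs_kib ))
    echo $(( block / blocks_per_chunk % disks + 1 ))
}

# bs=4k, 256k chunks, 3 data disks: 64 blocks per chunk
for b in 0 63 64 128 192; do
    echo "block $b -> disk $(disk_for_block "$b" 4 256 3)"
done
# prints:
# block 0 -> disk 1
# block 63 -> disk 1
# block 64 -> disk 2
# block 128 -> disk 3
# block 192 -> disk 1
```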

Correct me if I am wrong, but that's the way I used to test my old raid0 array.

Neil

--
There are two kinds of people:
1. People who start their arrays with 1.
1. People who start their arrays with 0.






