Re: [PATCH] Clustering indirect blocks in Ext3
- From: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
- Date: Thu, 15 Nov 2007 23:02:19 -0800
On Thu, 15 Nov 2007 21:02:46 -0800 "Abhishek Rai" <abhishekrai@xxxxxxxxxx> wrote:
(This patch was previously posted on linux-ext4 where Andreas Dilger
offered some valuable comments on it).
This patch modifies the block allocation strategy in ext3 in order to
improve fsck performance. This was initially sent out as a patch for
ext2, but given the lack of ongoing development on ext2, I have
crossported it to ext3 instead. Slow fsck is not a serious problem on
ext3 due to journaling, but once in a while users do need to run full
fsck on their ext3 file systems. This can be due to several reasons:
(1) bad disk, bad crash, etc, (2) bug in jbd/ext3, and (3) every few
reboots, it's good to run fsck anyway. This patch will help reduce
full fsck time for ext3. I've seen 50-65% reduction in fsck time when
using this patch on a near-full file system. With some fsck
optimizations, this figure becomes 80%.
Most of Ext3 metadata is clustered on disk. For example, Ext3
partitions the block space into block groups and stores the metadata
for each block group (inode table, block bitmap, inode bitmap) at the
beginning of the block group. Clustering related metadata together not
only helps ext3 I/O performance by keeping data and related metadata
close together, but also helps fsck since it is able to find all the
metadata in one place. However, indirect blocks are an exception.
Indirect blocks are allocated on-demand and are spread out along with
the data. This layout enables good I/O performance due to the close
proximity between an indirect block and its data blocks but it makes
things difficult for fsck which must now rotate almost the entire disk
in order to read all indirect blocks. In fact, our measurements have
indicated that for most ext3 disks on which fsck takes a long time,
80% of the time is spent reading indirect blocks. So speeding up
indirect block read accesses in fsck can significantly improve fsck
times.
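To give a rough sense of how many indirect blocks fsck has to visit, here is a minimal sketch that counts them for a single file under the classic ext2/ext3 block map (12 direct pointers in the inode, then single, double and triple indirection, with 32-bit block numbers). The 4KB block size and the assumption that the file has no holes are illustrative choices, not something stated in the patch description, and the function name is hypothetical.

```python
DIRECT_BLOCKS = 12   # direct block pointers stored in the inode
POINTER_BYTES = 4    # ext3 block numbers are 32-bit


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)


def indirect_block_count(file_size_bytes, block_size=4096):
    """Count the indirect blocks needed to map a non-sparse file.

    Assumes every data block of the file is allocated (no holes).
    """
    ptrs = block_size // POINTER_BYTES            # pointers per indirect block
    remaining = max(0, ceil_div(file_size_bytes, block_size) - DIRECT_BLOCKS)
    count = 0

    # Single indirect: one block mapping up to `ptrs` data blocks.
    if remaining > 0:
        count += 1
        remaining -= min(remaining, ptrs)

    # Double indirect: one root block plus one leaf per `ptrs` data blocks.
    if remaining > 0:
        covered = min(remaining, ptrs * ptrs)
        count += 1 + ceil_div(covered, ptrs)
        remaining -= covered

    # Triple indirect: one root, a middle level, and the leaf level.
    if remaining > 0:
        covered = min(remaining, ptrs ** 3)
        leaves = ceil_div(covered, ptrs)
        middles = ceil_div(leaves, ptrs)
        count += 1 + middles + leaves
        remaining -= covered

    if remaining > 0:
        raise ValueError("file too large for the indirect block map")
    return count


if __name__ == "__main__":
    print(indirect_block_count(10 * 2**30))   # 10GB file, 4KB blocks
```

Under these assumptions a 10GB file needs 2564 indirect blocks, roughly one for every 1024 data blocks, which is why they end up scattered across the whole disk when they are allocated next to the data.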
One solution to this problem implemented in this patch is to cluster
indirect blocks together on a per group basis, similar to how inodes
and bitmaps are clustered.
So we have a section of blocks around the middle of the blockgroup which
are used for indirect blocks.
Presumably it starts around 50% of the way into the blockgroup?
How do you decide its size?
What happens when it fills up but we still have room for more data blocks
in that blockgroup?
Can this reserved area cause disk space wastage (all data blocks used,
metacluster area not yet full)?
The file data block allocator now needs to avoid allocating blocks from
inside this reserved area. How is this implemented? It is awfully similar
to the existing reservations code - does it utilise that code?
Indirect block clusters (metaclusters) help
fsck performance by enabling fsck to fetch all indirect blocks by
reading from a few locations on the disk instead of rotating through
the entire disk. Unfortunately, a naive clustering scheme for indirect
blocks can hurt I/O performance, as it separates out indirect blocks
and corresponding direct blocks on the disk. So an I/O to a direct
block whose indirect block is not in the page cache now needs to incur
a longer seek+rotational delay in moving the disk head from the
indirect block to the direct block.
So our goal then is to implement metaclustering without having any
impact (<0.1%) on I/O performance. Fortunately, the current ext3 I/O
algorithm is not the most efficient, so improving it can camouflage the
performance hit we suffer due to metaclustering. In fact,
metaclustering automatically enables one such optimization. When doing
sequential read from a file and reading an indirect block for it, we
readahead several indirect blocks of the file from the same
metacluster. Moreover, when possible we do this asynchronously. This
reduces the seek+rotational latency associated with seeking between
data and indirect blocks during a (long) sequential read.
There is one more design choice that affects the performance of this
patch: the location and number of metaclusters per block group. Currently
we have one metacluster per block group and it is located at the
center of the block group. We adopted this scheme after evaluating
three possible locations of metaclusters: beginning, middle, and end
of block group. We did not evaluate configurations with >1 metacluster
per block group. In our experiments, the middle configuration did not
cause any performance degradation for sequential and random reads.
Whereas putting the metacluster at the beginning of the block group
yields best performance for sequential reads (write performance is
unaffected by this change), putting it in the middle helps random
reads. Since the "middle path" maintains status quo, we adopted that
in our change.
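The "one metacluster at the center of each block group" layout can be sketched as simple block arithmetic. This is only an illustration: the description above does not say how large the metacluster is, so `mc_blocks` below is a hypothetical parameter, as are the function names. The default of 32768 blocks per group is what a 4KB block size gives (one bitmap block covers 8 * 4096 blocks).

```python
def metacluster_range(group, blocks_per_group=32768, mc_blocks=256,
                      first_data_block=0):
    """Return (start, end) of a metacluster centered in a block group.

    `end` is exclusive. `mc_blocks` is a hypothetical size chosen for
    illustration; the patch description does not state the real value.
    """
    if not 0 < mc_blocks <= blocks_per_group:
        raise ValueError("metacluster must fit inside the block group")
    group_start = first_data_block + group * blocks_per_group
    offset = (blocks_per_group - mc_blocks) // 2   # center the region
    start = group_start + offset
    return start, start + mc_blocks


def in_metacluster(block, **kwargs):
    """True if `block` falls inside its group's metacluster region."""
    blocks_per_group = kwargs.get("blocks_per_group", 32768)
    first_data_block = kwargs.get("first_data_block", 0)
    group = (block - first_data_block) // blocks_per_group
    start, end = metacluster_range(group, **kwargs)
    return start <= block < end


if __name__ == "__main__":
    print(metacluster_range(0))   # (16256, 16512)
    print(metacluster_range(1))   # (49024, 49280)
```

A data block allocator following this layout would skip any candidate block for which `in_metacluster` is true, while the indirect block allocator would search only inside that range.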
Performance evaluation results:
Setup:
RAM: 8GB
Disk: 400GB disk.
CPU: Dual core hyperthreaded
All measurements were taken 10 times or more until standard deviation
was <2%. The machine was rebooted and the file system freshly formatted
between runs, and we made sure that nothing else was running on the
machine at the time of the test.
Notation:
- 'vanilla': regular ext3 without any changes
- 'mc': metaclustering ext3
Benchmark 1: Sequential write to a 10GB file followed by 'sync'
1. vanilla:
Total: 3m9.0s
User: 0.08s
System: 23s-48s (very high variance)
hm, system time variance is weird. You might have found an ext3 bug (or a
cpu time accounting bug).
Execution profiling would tell, I guess.
2. mc:
Total: 3m6.1s
User: 0.08s
System: 48.1s
Benchmark 2: Sequential read from a 10GB file.
Description: the file is created using the same type of ext3 (mc or vanilla)
1. vanilla:
Total: 3m14.5s
User: 0.04s
System: 13.4s
2. mc:
Total: 3m14.5s
User: 0.04s
System: 13.3s
Benchmark 3: Random read from a 300GB file
Description: read in 512-byte chunks, 5MB in total
1. vanilla:
Total: 3m56.4s
User: ~0
System: 0.6s
2. mc:
Total: 3m51.4s
User: ~0
System: 0.8s
Benchmark 4: Random read from a 300GB file
Description: read in 512KB chunks, 1% of the file size in total
1. vanilla:
Total: 4m46.3s
User: ~0
System: 3.9s
2. mc:
Total: 4m44.4s
User: ~0
System: 3.9s
Benchmark 5: fsck
Description: Prepare a newly formatted 400GB disk as follows: create
200 files of 0.5GB each, 100 files of 1GB each, 40 files of 2.5GB each,
and 10 files of 10GB each. fsck command line: fsck -f -n
1. vanilla:
Total: 12m18.1s
User: 15.9s
System: 18.3s
2. mc:
Total: 4m47.0s
User: 16.0s
System: 17.1s
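As a quick check of how the Benchmark 5 totals relate to the 50-65% figure quoted earlier, the reduction can be worked out from the two wall-clock times. The helper names below are hypothetical and the parser only handles the "NmS.Ss" format used in this message.

```python
def to_seconds(timestr):
    """Convert a time such as '12m18.1s' to seconds."""
    minutes, rest = timestr.split("m")
    return int(minutes) * 60 + float(rest.rstrip("s"))


def reduction_percent(before, after):
    """Percentage reduction in elapsed time going from before to after."""
    b, a = to_seconds(before), to_seconds(after)
    return (b - a) / b * 100.0


if __name__ == "__main__":
    # Benchmark 5 totals: vanilla vs. mc
    print(f"{reduction_percent('12m18.1s', '4m47.0s'):.1f}%")   # 61.1%
```

So on this particular file set the full fsck time drops by about 61%, inside the 50-65% range reported for near-full file systems.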
They're large files. It would be interesting to see what the numbers are
for more and smaller files.
Benchmark 6: kernbench (this was done on an 8cpu machine with 32GB RAM)
1. vanilla:
Elapsed: 35.60
User: 228.79
System: 21.10
2. mc:
Elapsed: 35.12
User: 228.47
System: 21.08
Note:
1. This patch does not affect ext3 on-disk layout compatibility in any
way. Existing disks continue to work with new code, and disks modified
by new code continue to work with existing machines. In contrast, the
extents patch will also probably solve this problem but it breaks on-disk
compatibility.
2. Metaclustering is a mount time option (-o metacluster). This option
only affects the write path: when it is specified, indirect
blocks are allocated in clusters; when it is not, they are
allocated alongside data blocks. The read path is unaffected by the
option; read behavior depends on the data layout on disk. If a read
discovers metaclusters on disk it will do prefetching, otherwise it
will not.
3. e2fsck speedup with metaclustering varies from disk
to disk with most benefit coming from disks which have a large number
of indirect blocks. For disks which have few indirect blocks, fsck
usually doesn't take too long anyway and hence it's OK not to deliver
a huge speedup there. But in all cases, metaclustering doesn't cause
any degradation in IO performance as seen in the benchmarks above.
Less speedup, for more-and-smaller files, it appears.
An important question is: how does it stand up over time? Simply laying
files out a single time on a fresh fs is the easy case. But what happens
if that disk has been in continuous create/delete/truncate/append usage for
six months?
[implementation]
We can get onto that later ;)