Re: [00/41] Large Blocksize Support V7 (adds memmap support)
- From: mel@xxxxxxxxx (Mel Gorman)
- Date: Mon, 17 Sep 2007 11:13:07 +0100
On (16/09/07 23:31), Andrea Arcangeli didst pronounce:
On Sun, Sep 16, 2007 at 09:54:18PM +0100, Mel Gorman wrote:
The 16MB is the size of a hugepage, the size of interest as far as I am
concerned. Your idea makes sense for large block support, but much less
for huge pages because you are incurring a cost in the general case for
something that may not be used.
Sorry for the misunderstanding, I totally agree!
Great. It's clear that we had different use cases in mind when we were
poking holes in each approach.
There is nothing to say that both can't be done. Raise the size of
order-0 for large block support and continue trying to group the block
to make hugepage allocations likely succeed during the lifetime of the
system.
Sure, I completely agree.
At the risk of repeating, your approach will be adding a new and
significant dimension to the internal fragmentation problem where a
kernel allocation may fail because the larger order-0 pages are all
being pinned by userspace pages.
This is exactly correct, some memory will be wasted. It'll reach 0
free memory more quickly depending on which kind of applications are
being run.
I look forward to seeing how you deal with it. When/if you get to trying
to move pages out of slabs, I suggest you take a look at the Memory
Compaction patches or the memory unplug patches for simple examples of
how to use page migration.
It improves the probability of hugepage allocations working because the
blocks with slab pages can be targeted and cleared if necessary.
Agreed.
That suggestion was aimed at the large block support more than
hugepages. It helps large blocks because we'll be allocating and freeing
at more or less the same size. It certainly is easier to set
slub_min_order to the same size as what is needed for large blocks in
the system than introducing the necessary mechanics to allocate
pagetable pages and userspace pages from slab.
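
As a rough illustration of the sizing mentioned above, the sketch below computes the page order needed to cover a given block size, rounding up in the same spirit as the kernel's get_order(). The 4K base page and 64K block size are assumptions chosen for the example, and order_for() is a hypothetical helper, not a kernel interface.

```python
def order_for(block_size, page_size=4096):
    """Smallest order such that (page_size << order) >= block_size."""
    if block_size <= 0 or page_size <= 0:
        raise ValueError("sizes must be positive")
    pages = -(-block_size // page_size)   # ceiling division
    return (pages - 1).bit_length()       # round up to a power of two


# Assumed example: 64K filesystem blocks on a 4K base page size.
order = order_for(64 * 1024)
boot_option = f"slub_min_order={order}"
print(boot_option)
```

The resulting string is the form the option would take on the kernel command line; whether that value suits a given system depends on the block size actually in use.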
Allocating userpages from slab in 4k chunks with a 64k PAGE_SIZE is
really complex indeed. I'm not planning for that in the short term but
it remains a possibility to make the kernel more generic. Perhaps it
could be worth it...
Perhaps.
Allocating ptes from slab is fairly simple but I think it would be
better to allocate ptes in PAGE_SIZE (64k) chunks and preallocate the
nearby ptes in the per-task local pagetable tree, to reduce the number
of locks taken and not to enter the slab at all for that.
It runs the risk of pinning up to 60K of data per task that is unusable for
any other purpose. On average, it'll be more like 32K but worth keeping
in mind.
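
The pinning figures above can be checked with a little arithmetic. This is a minimal sketch assuming a 64K PAGE_SIZE chunk holding 4K page-table pages (both sizes taken from the discussion) and, as a further assumption, that a task is equally likely to use anywhere from 1 to 16 of those 4K slots. pinned_kib() is a hypothetical helper for illustration only.

```python
CHUNK_KIB = 64      # one PAGE_SIZE chunk, as discussed in the thread
PTE_PAGE_KIB = 4    # assumed size of one page-table page
SLOTS = CHUNK_KIB // PTE_PAGE_KIB


def pinned_kib(used_slots):
    """KiB of the chunk left unusable when only used_slots 4K slots hold ptes."""
    if not 1 <= used_slots <= SLOTS:
        raise ValueError("used_slots must be between 1 and SLOTS")
    return CHUNK_KIB - used_slots * PTE_PAGE_KIB


worst_case = pinned_kib(1)
# Assumption: every occupancy from 1 to SLOTS is equally likely.
mean_case = sum(pinned_kib(n) for n in range(1, SLOTS + 1)) / SLOTS
print(worst_case, mean_case)
```

Under that uniform assumption the mean comes out at roughly half the chunk, which is in the same region as the estimate given above; a different usage distribution would shift it.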
In fact we
could allocate the 4 levels (or anyway more than one level) in one
single alloc_pages(0) and track the leftovers in the mm (or similar).
I'm not sure what you are getting at here. I think it would make more
sense if you said "when you read /proc/buddyinfo, you know the order-0
pages are really free for use with large blocks" with your approach.
I'm unsure who reads /proc/buddyinfo (that can change a lot and that
is not very significant information if the vm can defrag well inside
the reclaim code),
I read it although as you say, it's difficult to know what will happen if
you try and reclaim memory. It's why there is also a /proc/pagetypeinfo so
one can see the number of movable blocks that exist. That leads to better
guessing. In -mm, you can also see the number of mixed blocks but that will
not be available in mainline.
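
For readers who have not looked at /proc/buddyinfo, here is a minimal sketch of how its per-order free counts can be read and summed. The sample text is made up for illustration, and parse_buddyinfo() and free_pages_at_or_above() are hypothetical helper names; the sketch assumes the usual "Node N, zone NAME count0 count1 ..." line layout.

```python
SAMPLE = """\
Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
Node 0, zone   Normal    100     50      4      0      0      0      0      0      0      0      0
"""


def parse_buddyinfo(text):
    """Map (node, zone) to a list of free-block counts indexed by order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue  # skip anything that does not look like a zone line
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(n) for n in parts[4:]]
    return zones


def free_pages_at_or_above(counts, order):
    """Base pages held in free blocks of at least the given order."""
    return sum(count << o for o, count in enumerate(counts) if o >= order)


zones = parse_buddyinfo(SAMPLE)
normal = zones[(0, "Normal")]
print(free_pages_at_or_above(normal, 0), free_pages_at_or_above(normal, 2))
```

On a Linux system the contents of /proc/buddyinfo can be passed in place of SAMPLE. As noted above, these counts say nothing about how much more could be freed by reclaim, so they are only a snapshot.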
but it's not much different and it's more about
knowing the real meaning of /proc/meminfo, freeable (unmapped) cache,
anon ram, and free memory.
The idea is that to succeed an mmap over a large xfs file with
mlockall being invoked, those largepages must become available or
it'll be oom despite there still being 512M free... I'm quite sure
admins will get confused if they get the oom killer invoked with lots of
ram still free.
The overcommit feature will also break, just to make an example (so
much for overcommit 2 guaranteeing -ENOMEM retvals instead of oom
killage ;).
All this aside, there is nothing mutually exclusive with what you are proposing
and what grouping pages by mobility does. Your stuff can exist even if grouping
pages by mobility is in place. In fact, it'll help us by giving an important
comparison point as grouping pages by mobility can be trivially disabled with
a one-liner for the purposes of testing. If your approach is brought to being
a general solution that also helps hugepage allocation, then we can revisit
grouping pages by mobility by comparing kernels with it enabled and without.
Yes, I totally agree. It sounds worthwhile to have a good defrag logic
in the VM. Even allocating the kernel stack in today's kernels should be
able to benefit from your work. It's just comparing a fork() failure
(something that will happen with ulimit -n too and that apps must be
able to deal with) with an I/O failure that worries me a bit.
I agree here with the IO failure. It's why I've been saying that Nick's
approach is what was needed to make large blocks a fully supported
feature. Your approach may work out as well. While those are in development,
Christoph's approach will let us know how regularly these failures actually
occur so we'll know in advance how often the slower paths in both Nick's
and your approach will be exercised.
I'm
quite sure a db failing I/O will not recover too nicely. If fork fails
that's most certainly ok... at worst a new client won't be able to
connect and he can retry later. Plus order 1 isn't really a big deal,
you know the probability of success decreases exponentially with the
order.
--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab