Re: [00/17] Large Blocksize Support V3
- From: Andy Whitcroft <apw@xxxxxxxxxxxx>
- Date: Thu, 26 Apr 2007 13:37:42 +0100
Nick Piggin wrote:
Christoph Lameter wrote:
On Thu, 26 Apr 2007, Nick Piggin wrote:
mapping through the radix tree. You just need to change the way the
filesystem looks up pages.
You didn't think any of the criticisms of higher order page cache size
were valid?
They are all known points that have been discussed to death.
I missed the part where you showed that it was a better solution than
the alternatives.
What are the exact requirements you are trying to address?
Block size > page cache size.
But what do you mean with it? A block is no longer a contiguous
section of memory. So you have redefined the term.
I don't understand what you mean at all. A block has always been a
contiguous area of disk.
Let's take Nick's definition of a block as a disk based unit for the
moment. That does not change the key contention here: even hardware
specifically designed to handle 4k pages handles larger contiguous
areas more efficiently. David Chinner gives us figures showing major
overall throughput improvements from (I assume) shorter scatter gather
lists and better tag utilisation. I am loath to say we can just blame
the hardware vendors for poor design.
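To make the scatter gather point concrete, here is a rough sketch of the
arithmetic. It is not taken from David's figures. It assumes one sg entry
per physically contiguous chunk and that no neighbouring chunks happen to
merge, which is the worst case. The function name sg_entries_needed is
made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Worst-case number of scatter gather entries for an I/O of io_bytes
 * when memory is handed out in physically contiguous chunks of
 * chunk_bytes. Assumption: no two neighbouring chunks are contiguous,
 * so each chunk costs exactly one entry.
 */
size_t sg_entries_needed(size_t io_bytes, size_t chunk_bytes)
{
    assert(chunk_bytes > 0);
    /* Round up: a partially filled chunk still needs its own entry. */
    return (io_bytes + chunk_bytes - 1) / chunk_bytes;
}
```

Under these assumptions a 1MB I/O needs 256 entries with 4k chunks and
16 entries with 64k chunks, so a controller with a short sg list can
cover a much larger I/O when the chunks are bigger.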
You guys have a couple of problems. Firstly, you need to have ia64
filesystems accessible to x86_64. And secondly, you have these
controllers without enough sg entries for nice sized IOs.
This is not SGI specific, sorry.
I sympathise, and higher order pagecache might solve these in a way, but
I don't think it is the right way to go, mainly because of the
fragmentation issues.
And you don't care about Mel's work on that level?
I actually don't like it too much because it can't provide a robust
solution. What do you do on systems with small memories, or those that
eventually do get fragmented?
Actually, I don't know why people are so excited about being able to
use higher order allocations (I would rather be more excited about
never having to use them). But for those few places that really need
it, I'd rather see them use a virtually mapped kernel with proper
defragmentation rather than putting hacks all through the core code.
Virtually mapping the kernel was considered pretty seriously around the
time SPARSEMEM was being developed. However, that leads to a
non-constant relation for converting kernel virtual addresses to
physical ones which leads to significant complexity, not to mention
runtime overhead.
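A minimal sketch of what "non-constant relation" means here, under
stated assumptions. With a linear map the conversion is one subtraction.
With a virtually mapped kernel each page can sit in any frame, so the
conversion needs a lookup. All names, the base address and the frame
table below are hypothetical, not the kernel's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

#define EX_PAGE_SHIFT  12                       /* 4k pages */
#define EX_PAGE_SIZE   ((uint64_t)1 << EX_PAGE_SHIFT)
#define EX_PAGE_OFFSET ((uint64_t)0xC0000000)   /* hypothetical kernel base */
#define EX_NR_PAGES    4

/* Linear map: physical address is a constant offset from virtual. */
uint64_t linear_virt_to_phys(uint64_t vaddr)
{
    return vaddr - EX_PAGE_OFFSET;
}

/* Hypothetical mapping: which physical frame backs each virtual page. */
static const uint64_t ex_frame_of_vpage[EX_NR_PAGES] = { 7, 2, 9, 4 };

/* Virtually mapped: every conversion needs a table lookup. */
uint64_t mapped_virt_to_phys(uint64_t vaddr)
{
    uint64_t off = vaddr - EX_PAGE_OFFSET;
    uint64_t vpage = off >> EX_PAGE_SHIFT;

    assert(vpage < EX_NR_PAGES);
    return (ex_frame_of_vpage[vpage] << EX_PAGE_SHIFT) |
           (off & (EX_PAGE_SIZE - 1));
}
```

The second form costs a memory access per conversion, and the table has
to be kept consistent whenever a page is moved. That is the complexity
and runtime overhead referred to above.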
As a solution to the problem of supplying large pages from the allocator
it seems somewhat unsatisfactory. If no significant other changes are
made in support of large allocations, the process of defragmenting
becomes very expensive, requiring a stop_machine style hiatus while the
physical copy and replace occurs for any kernel backed memory.
To put it a different way, even with such a full defragmentation scheme
available, some sort of avoidance scheme would be highly desirable to
avoid using the very expensive defragmentation underlying it.
Increasing PAGE_SIZE, support for block size > page cache size, and
getting io controllers matched to a 4K page size IMO would be some good
ways to solve these problems. I know they are probably harder...
No, this has been tried before and does not work. Why should we lose
the capability to work with 4k pages just because there is some data
that has to be thrown around in quantity? I'd like to have flexibility
here.
Is that a big problem? Really? You use 16K pages on your IPF systems,
don't you?
To my knowledge, moving to a higher base page size has its advantages in
TLB reach, but brings with it some pretty serious downsides, especially
in caching small files: internal fragmentation in the page cache
significantly affects system performance. So much so that development
is ongoing to see if supporting sub-base-page objects in the buffer
cache could be beneficial.
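The internal fragmentation cost for small files can be sketched as
simple arithmetic. This is an illustration only. It assumes a file
occupies whole pages in the page cache and nothing else shares them,
and the function name page_cache_slack is made up.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes of page cache allocated but not holding file data, assuming
 * the file is cached in whole pages of page_bytes each.
 */
size_t page_cache_slack(size_t file_bytes, size_t page_bytes)
{
    size_t pages;

    assert(page_bytes > 0);
    /* Round up to whole pages, then subtract the useful bytes. */
    pages = (file_bytes + page_bytes - 1) / page_bytes;
    return pages * page_bytes - file_bytes;
}
```

For a 1000 byte file this gives 3096 wasted bytes with 4k pages, 15384
with 16k pages and 64536 with 64k pages. With many small files cached,
that slack is memory unavailable for anything else.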
The fragmentation problem is solvable and we already have a solution
in mm. So I do not really see a problem there?
I don't think that it is solved, and I think the heuristics that are
there would be put under more stress if they become widely used. And
it isn't only about whether we can get the page or not, but also about
the cost. Look up Linus's arguments about page colouring, which are
similar and I also think are pretty valid.
-apw