Re: [00/17] Large Blocksize Support V3



David Chinner <dgc@xxxxxxx> writes:

On Thu, Apr 26, 2007 at 03:37:28PM +1000, Nick Piggin wrote:
I think starting with the assumption that we _want_ to use higher order
allocations, and then creating all this complexity around that is not a
good one, and if we start introducing things that _require_ significant
higher order allocations to function then it is a nasty thing for
robustness.

From my POV, we started with the problem of how to provide atomic
access to a multi-page block in the page cache. For example, we want
to lock the filesystem block and prevent any updates to it, so we
need to lock all the pages in it. And then when we write them back,
they all need to change state at the same time, and they all need to
have their radix tree tags changed at the same time, the problem of
mapping them to disk, getting writeback to do block aligned and
sized writeback chunks, and so on.

Ok. That is a reasonable problem and worth solving.

I suspect the easiest way to go is to have something in the code
that points all of the locking activities at the first page in the
block. Like large pages do but they don't have to be physically
contiguous.

I think that would be even less code then what you are proposing.

And then there's the problem that most hardware is limited to 128
s/g entries and that means 128 non-contiguous pages in memory is the
maximum I/O size we can issue to these devices. We have RAID arrays
that go twice as fast if we can send them 1MB I/Os instead of 512k
I/Os and that means we need contiguous pages to be handled to the
devices....

Ok. Now why are high end hardware manufacturers building crippled
hardware? Or is there only an 8bit field in SCSI for describing
scatter gather entries? Although I would think this would be
move of a controller ranter than a drive issue.

All of these things require some aggregating structure to
co-ordinate. In times gone by on other OSs, this has been done with
a buffer cache, but Linux got rid of that long ago and we don't want
to reintroduce one. We can't use buffer heads - they can only point
to one page. So what do we do?

For I/O we have the BIO which can point to multiple pages just fine.

Buffer heads are irrelevant. The question is how do we get to
the page cache from the BIO and from the BIO to the page cache.

That's where compound pages are so nice - they solve all of these
problems with only a very small amount of code perturbation and they
don't change any algorithms or fundamental design of the OS at all.

The change the fundamental fragmentation avoidance algorithm of the
OS. Use only one size of page. That is a huge problem.

Yes we do relax that rule but only on things that we don't care
about much and don't mind failing.

We don't have to rewrite filesystems to support this. We don't have
to redesign the VM to support this. We don't have to do very much
work to the block layer and drivers to make this work.

This changes the fundamental algorithm for avoiding fragmentation
problems in the VM. Don't have multiple page sizes.

That rule is relaxed allowing us to use larger pages in special
cases but this is not a special case. This is the common case.

That is a huge concern.

FWIW, if you want 32 bit machines to support larger than 16TB
devices, you need high order page indexing in the page cache....

Huh? Only if you are doing raw device accesses. We are limited
to 16TB files.

That would make the choice of using larger order pages (essentially
increasing PAGE_SIZE) something that can be investigated in parallel.

I agree that hardware inefficiencies should be handled by increasing
PAGE_SIZE (not making PAGE_CACHE_SIZE > PAGE_SIZE) at the arch level.

And what do we do for arches that can't do multiple page sizes, only
only have a limited and mostly useless set of page sizes to choose
from?

You have HW_PAGE_SIZE != PAGE_SIZE. That is you hide the fact from
the bulk of the kernel struct page manges 2 or more real hardware pages.
But you expose it to the handful of places that actually care.

Partly this is a path you are starting down in your patches, with
larger page cache support.


Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • Re: Possible ways of dealing with OOM conditions.
    ... There is more to networking that skbs only, what about route cache, ... jumbo-frames (e1000 requires such allocations for 9k frames ... Well, if you have such hardware its not rare at all, But yeah that ... adds to the fragmentation issues on the page-allocator level. ...
    (Linux-Kernel)
  • Re: Improving performance of full table scans
    ... Increasing the buffer cache might help, but there is still a risk of the table aging out of the cache. ... Lack of sufficient physical RAM will be cause to procure more hardware. ... Another thing to consider is to spread your I/O among multiple disk spindles, ...
    (comp.databases.oracle.server)
  • Re: [00/17] Large Blocksize Support V3
    ... Support for larger buffers than page cache pages. ... the buffer has to become the primary interface for the filesystem ... I/O needs to be issued based on buffers, ... The CPU hardware works well with 4k pages, ...
    (Linux-Kernel)
  • Re: Possible ways of dealing with OOM conditions.
    ... There is more to networking that skbs only, what about route cache, ... With power-of-two allocation SLAB wastes 500 bytes for each 1500 MTU ... Well, if you have such hardware its not rare at all, But yeah that ... adds to the fragmentation issues on the page-allocator level. ...
    (Linux-Kernel)
  • Re: Caching control
    ... |> | invalidate/unmap them in order to discard the data from memory. ... |> writing out to disk. ... | easy to discard as clean disk cache. ... stating that a specific amount of RAM can be used only for I/O ...
    (comp.os.linux.development.system)

Loading