Re: [00/17] Large Blocksize Support V3



Andy Whitcroft wrote:
Nick Piggin wrote:


I don't understand what you mean at all. A block has always been a
contiguous area of disk.


Lets take Nick's definition of block being a disk based unit for the
moment. That does not change the key contention here, that even with
hardware specifically designed to handle 4k pages that hardware handles
larger contigious areas more efficiently. David Chinner gives us
figures showing major overall throughput improvements from (I assume)
shorter scatter gather lists and better tag utilisation. I am loath to
say we can just blame the hardware vendors for poor design.

So their controllers get double the throughput when going from 512K
(128x4K pages) to 2MB (128x16K pages) requests. Do you really think
it is to do with command processing overhead?


Actually, I don't know why people are so excited about being able to
use higher order allocations (I would rather be more excited about
never having to use them). But for those few places that really need
it, I'd rather see them use a virtually mapped kernel with proper
defragmentation rather than putting hacks all through the core code.


Virtually mapping the kernel was considered pretty seriously around the
time SPARSEMEM was being developed. However, that leads to a
non-constant relation for converting kernel virtual addresses to
physical ones which leads to significant complexity, not to mention
runtime overhead.

Yeah, a page table walk (or better, a TLB hit). And yeah it will cost
a bit of performance, it always does.


As a solution to the problem of supplying large pages from the allocator
it seems somewhat unsatisfactory. If no significant other changes are
made in support of large allocations, the process of defragmenting
becomes very expensive. Requiring a stop_machine style hiatus while the
physical copy and replace occurs for any kernel backed memory.

That would be a stupid thing to do though. All you need to do (after
you keep DMA away) is to unmap the pages.


To put it a different way, even with such a full defragmentation scheme
available some sort of avoidance scheme would be highly desirable to
avoid using the very expensive deframentation underlying it.

Maybe. That doesn't change the fact that avoidance isn't a complete
solution by itself.


Is that a big problem? Really? You use 16K pages on your IPF systems,
don't you?


To my knowledge, moving to a higher base page size has its advantages in
TLB reach, but brings with it some pretty serious downsides. Especially
in caching small files. Internal fragmentation in the page cache
significantly affecting system performance. So much so that development
is ongoing to see if supporting sub-base-page objects in the buffer
cache could be beneficial.

I think 16K would be pretty reasonable (ia64 tends to use it). I guess
powerpc went to 64k either because that's what databases want or because
their TLB refills are too slow, so the internal fragmentation bites
them a lot harder.

But that was more of a side comment, because I still think io controllers
should be easily capable of operation on 4K pages. Graphics cards are,
aren't they?

--
SUSE Labs, Novell Inc.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • Re: Flames over -- Re: Which is simpler?
    ... different from testing whether it has changed -- it could have changed while the system was asleep, with the result that hardware is indeed there but it's not the _same_ hardware. ... It is currently my belief that the kernel probes the hardware the same way after a cold boot, resume from s-t-r, and resume from s-t-d. ... This is why you can suspend to disk, and when you resume, your open files are still valid; the disk is found to be still there, so it continues to be accessible. ... the kernel queries the USB controller to see if a disconnect occurred while the system was asleep. ...
    (Linux-Kernel)
  • Re: My boss want to kill debian, please help !
    ... The *hardware* issue with e100 that I know off ... there's something else I should point out: the debian *stable* kernel ... currently sane. ... "One disk to rule them all, ...
    (Debian-User)
  • Re: SBS Server crash Unexpected - Error ID 6008
    ... If so, start in Safe Mode, delete all page files and reboot, then recreate your pagefile and reboot one more time ... Hard ware vendor has check hardware log, Raid Controller Log, but no hardware ... Loading Kernel Symbols ... in paging file or disk controller error. ...
    (microsoft.public.windows.server.sbs)
  • Re: [00/17] Large Blocksize Support V3
    ... contiguous area of disk. ... Lets take Nick's definition of block being a disk based unit for the ... hardware specifically designed to handle 4k pages that hardware handles ... No - it has to do with things like the RAID controller caching behaviour, ...
    (Linux-Kernel)
  • Re: "Device 08:11 not ready" - error log
    ... > kernel: Device 08:11 not ready. ... > I need to know for which hardware, i am not able to find it in the ... is for all the partition on the disk. ...
    (RedHat)