Re: Caching control

On Mar 20, 10:51 am, phil-news-nos...@xxxxxxxx wrote:

On Thu, 12 Mar 2009 14:20:54 -0700 (PDT) David Schwartz <dav...@xxxxxxxxxxxxxx> wrote:

| Right, but those pages are mapped into memory. It would have to
| invalidate/unmap them in order to discard the data from memory. If
| swappiness is set very low, it's not supposed to discard mappings just
| to increase disk cache.

Invalidating and unmapping is still cheap.  It has to be done for
modified pages, too.  But unmodified pages don't have the cost of
writing out to disk.

That's why swappiness defaults to a high value. Clean mappings are as
easy to discard as clean disk cache. In principle, a system with a
unified memory architecture shouldn't care whether disk data in cache
is mapped or not.

However, there are some pathological cases where preventing the system
from unmapping pages in response to unmapped writes is helpful. This
is why Linux provides the 'swappiness' tunable.

By default, Linux operates in the pure, technically right mode.

| I think you're missing the thrust of my analysis. The increasing disk
| cache can only be a problem for one of two reasons:
| 1) It's pushing mappings out of memory.
| 2) It's pushing other things out of disk cache.
| The I/O will run at the same speed regardless of how big the disk
| cache is, full speed.

Maybe not.  At least in earlier kernels this was not true.  There was so
much CPU time spent figuring out what to remove from cache, that there
was a point where increasing RAM actually _reduced_ the I/O rate.  I
don't see that anymore.  But I did many versions ago.

That's just a bug. Sure, you might need to do crazy things to
work around a bug. But if the problem turns out to be due to a bug, the
first thing we should try to do is fix it. (If that fails, then we can
search for workarounds.)

| If the problem is pushing mappings out, swappiness is the right fix. If
| the problem is pushing other things out of disk cache, a smaller disk
| cache will make things worse.

What exactly is the swappiness value related to?  What are its units?

Swappiness controls how likely the system is to discard a mapped page
rather than an unmapped page. Its default value is 100, which means
that mappings are treated the same as non-mapped disk cache. This is
"technically correct". There's no reason a page read from a disk
file that's 'mmap'ped has any more right to stay in memory than one
read with 'read'. If you turn it all the way to zero, a mapped page
will almost never be discarded to make more space to fit unmapped

0 is dangerous as it can allow large mappings to make normal
read/write I/O slow to a crawl. But very small values should not be
pathological in most cases.
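
For anyone who wants to experiment with this, here is a minimal Python
sketch for inspecting and changing the tunable. It assumes the usual
procfs location, /proc/sys/vm/swappiness; the helper names are my own
and not part of any library, and writing the real file needs root.

```python
from pathlib import Path

# Usual procfs location of the tunable on Linux.
SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def read_swappiness(path=SWAPPINESS_PATH):
    """Return the current swappiness setting as an integer."""
    return int(Path(path).read_text().strip())


def write_swappiness(value, path=SWAPPINESS_PATH):
    """Set swappiness; needs root when pointed at the real procfs file."""
    if value < 0:
        raise ValueError("swappiness cannot be negative")
    Path(path).write_text(f"{value}\n")
```

Calling read_swappiness() with no arguments reports the live setting.
A change made this way lasts only until reboot.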

What would be clear, although not necessarily optimal, would be a
reserve, stating that a specific amount of RAM can be used only for I/O
cache, or write cache, or read cache, or mappings, etc.  When the
utilization is at or below the reserve, pages in the reserve class
cannot be taken at all.  The sum of all reserves, plus other fixed RAM
usage, obviously must be less than RAM.
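
To make the reserve proposal concrete, here is a toy Python model of
it. This is only a sketch of the rule as stated above, counting pages
per class; the class name, method names, and numbers are hypothetical
and nothing here corresponds to an existing kernel interface.

```python
class ReservePool:
    """Toy model of per-class RAM reserves (page counts, not a kernel API)."""

    def __init__(self, total_pages, reserves, fixed_pages=0):
        # The sum of all reserves plus fixed usage must be less than RAM.
        if sum(reserves.values()) + fixed_pages >= total_pages:
            raise ValueError("reserves plus fixed usage must be less than RAM")
        self.total_pages = total_pages
        self.fixed_pages = fixed_pages
        self.reserves = dict(reserves)
        self.usage = {name: 0 for name in reserves}

    def free_pages(self):
        """Pages not held by any class or by fixed usage."""
        return self.total_pages - self.fixed_pages - sum(self.usage.values())

    def reclaimable(self, name):
        """Pages that may be taken from a class: only those above its reserve."""
        return max(0, self.usage[name] - self.reserves[name])

    def allocate(self, name, pages):
        """Give `pages` to class `name`, reclaiming from other classes if needed."""
        shortfall = pages - self.free_pages()
        if shortfall > 0:
            others = [n for n in self.usage if n != name]
            if sum(self.reclaimable(n) for n in others) < shortfall:
                return False  # everything else is at or below its reserve
            for n in others:
                take = min(self.reclaimable(n), shortfall)
                self.usage[n] -= take
                shortfall -= take
        self.usage[name] += pages
        return True
```

A class that has grown past its reserve can be squeezed back down to
it, but no further; once every other class is at or below its reserve,
allocate() simply fails.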

If you make this small, it does no good. If you make it large, it does
harm. This is not done because it's almost never what people really

| There is no scenario I can think of where shrinking the disk cache is
| the right fix.

When the gains by a larger disk cache are less than the losses by
smaller space for other things, then I do see that as a case where a
smaller disk cache is appropriate.

What "other things"? You mean mappings? If you mean mappings, the fix
is lowering swappiness, which gives mappings priority. Do you mean
memory allocated for kernel structures? They already have priority
over the disk cache. If you mean something else, do tell.

When disk cache (for writing) is considered all by itself, it should
have a performance curve that approaches leveling out.  Just where that
happens depends on the randomness of the I/O requests.  Sequential I/O
would level out very fast (i.e. a steep initial rise in performance).
Very random I/O should perform worst, with a slower rise and a longer
leveling out.

When the (write) disk cache is considered with respect to its impact on
other things, then you have a balancing act.  If there are only two
things to address, then you weight the curves by importance, find the
intersection, and that's your optimal point.  When there are three or
more things to address (and in reality there are many), then there is
usually no one point optimal for everything, but there will generally be
a range of points that can at least be worked with.

What "other things" are you talking about? If you mean mapped pages,
you have 'swappiness' for that.

If you mean unmapped clean pages, the problem is very different and no
change in the cache size will help. (Unless you separate clean and
dirty data, but then you'll wind up pushing the problem elsewhere.)

|> And there is no need to wait 40 to 120 seconds before starting the writes, as
|> an article/post by Ted Ts'o I saw last night suggested ext4 was doing.
| If that's happening, it's likely a bug. I agree, the writes should
| start as soon as enough of them are buffered. That should not take
| more than a second under any scenario I can imagine. (One exception
| might be lazy allocation, but even then, 40 seconds seems completely
| unreasonable to me.)

There are people apparently claiming that deferring things like that is
the way to go.  It's hard to say.  If file space allocation is deferred,
then an allocation made (committed) later on when it is known where a
write has to occur can "piggyback" on the drive positioning, and result
in what may be related pages being located near each other (e.g. they
may be read back together in the future).  But I'm still a believer that
writes should always proceed immediately.  What I would do instead of
deferred allocation, is "re-do allocation".  At a later time, if related
data has to be written somewhere else, AND if the data for this write is
still intact in RAM, do the allocation over (free the other space), and
do both writes near each other together.  This would have to be weighed
against how busy the disk is, since the repeated write would slow down a
very busy disk.

That's a very expensive proposition. But allocating 16MB chunks will
get you about as much performance as there is to get. And you should
be able to collect 16MB of data in no more than a second or so,

BTW, I'm definitely NOT going to migrate to ext4 for quite a while.  I
have posted elsewhere that it appears that POSIX itself is broken with
respect to ext4 being able to claim POSIX compliance while being able to
lose data by not synchronizing a file allocation with its renaming.  The
order of operations _should_ be guaranteed for _related_ data.
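
For reference, the application-side way to get that ordering is to
force it explicitly: write a temporary file, fsync it, and only then
rename it over the original. Below is a minimal Python sketch, assuming
a POSIX system; the function name and the ".tmp" suffix are
hypothetical choices of mine, and this is a workaround, not a statement
of what ext4 or POSIX guarantees.

```python
import os


def replace_file_durably(path, data):
    """Replace `path` with `data` (bytes), syncing before and after the rename."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical temporary name, same directory

    # 1. Write the new contents to a temporary file and force them to disk.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)
    finally:
        os.close(fd)

    # 2. Only now rename the temporary file over the old one.
    os.replace(tmp_path, path)

    # 3. Sync the directory so the rename itself reaches the disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The cost is an extra synchronous flush per file update, which is
exactly the kind of forced ordering a deferred-allocation filesystem
is trying to avoid.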

POSIX itself is broken in quite a few ways, unfortunately. Don't ever
get me started on POSIX directory reading functions.