Re: Caching control



On Fri, 20 Mar 2009 11:59:01 -0700 (PDT) David Schwartz <davids@xxxxxxxxxxxxx> wrote:
| On Mar 20, 10:51 am, phil-news-nos...@xxxxxxxx wrote:
|
|> On Thu, 12 Mar 2009 14:20:54 -0700 (PDT) David Schwartz <dav...@xxxxxxxxxxxxx> wrote:
|
|> | Right, but those pages are mapped into memory. It would have to
|> | invalidate/unmap them in order to discard the data from memory. If
|> | swappiness is set very low, it's not supposed to discard mappings just
|> | to increase disk cache.
|
|> Invalidating and unmapping is still cheap.  It has to be done for
|> modified pages, too.  But unmodified pages don't have the cost of
|> writing out to disk.
|
| That's why swappiness defaults to a high value. Clean mappings are as
| easy to discard as clean disk cache. In principle, a system with a
| unified memory architecture shouldn't care whether disk data in cache
| is mapped or not.
|
| However, there are some pathological cases where preventing the system
| from unmapping pages in response to unmapped writes is helpful. This
| is why Linux provides the 'swappiness' tunable.
|
| By default, Linux operates in the pure, technically right mode.

There should be separate settings that influence the swappiness of
unmodified pages that can be stolen without any writing, and modified
pages that have to be written out to the swap space in order to steal
their RAM slot. Given that there is only one setting, I'd like to know
(and without having to trace it all through the source code) just how
this setting influences BOTH of these types of swapouts. In particular,
how is it balanced between them? Is there a weighting factor? I'd like
to know just how the numbers actually influence things.
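
For reference, the one setting under discussion can at least be
inspected without reading kernel source. A minimal sketch, assuming a
Linux system that exposes the tunable at /proc/sys/vm/swappiness; the
helper name read_swappiness is only for illustration. It reports the
number and nothing more: it says nothing about how the kernel balances
clean pages against dirty ones, which is the open question here.

```python
from pathlib import Path

# Standard procfs location of the tunable on Linux.
SWAPPINESS_PATH = "/proc/sys/vm/swappiness"


def read_swappiness(path=SWAPPINESS_PATH):
    """Return the current vm.swappiness value as an integer."""
    return int(Path(path).read_text().strip())


if __name__ == "__main__":
    print("vm.swappiness =", read_swappiness())
```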


|> | If the problem is pushing mappings out, swappiness is the right fix. If
|> | the problem is pushing other things out of disk cache, a smaller disk
|> | cache will make things worse.
|
|> What exactly is the swappiness value related to?  What are its units?
|
| Swappiness controls how likely the system is to discard a mapped page
| rather than an unmapped page. Its default value is 100, which means
| that mappings are treated the same as non-mapped disk cache. This is
| "technically correct". There's no reason a page read through a disk
| file that's 'mmap'ped has any more right to stay in memory than one
| read with 'read'. If you turn it all the way to zero, a mapped page
| will almost never be discarded to make more space to fit unmapped
| pages.

When you refer to "non-mapped disk cache", are you referring to pages
that were read in from files (or raw disk devices), or pages that are
written by processes destined to be written to the file (or disk device)?


| 0 is dangerous as it can allow large mappings to make normal read/
| write I/O slow to a crawl. But very small values should not be
| pathological in most cases.

So is this value just a statistical thing?

I'm trying to relate this to just how much RAM will get used for the
various classes of usage. And I do classify things differently than
just mapped or unmapped.

It seems in one respect I should use maximum (100?) swappiness, whereas
in another respect, I should use a low value (20?). So that tells me I
need something more than just this one setting.


|> What would be clear, although not necessarily optimal, would be a
|> reserve, stating that a specific amount of RAM can be used only for I/O
|> cache, or write cache, or read cache, or mappings, etc.  When the
|> utilization is at or below the reserve, pages in the reserve class
|> cannot be taken at all.  The sum of all reserves, plus other fixed RAM
|> usage, obviously must be less than RAM.
|
| If you make this small, it does no good. If you make it large, it does
| harm. This is not done because it's almost never what people really
| want.

Ideally the sum of all reserves (counting things like kernel code as a
reserve, too) should be less than 50% of all of RAM, preferably as low
as 25%. Then the remainder can be used dynamically for whatever is
needed. Even then, there should also be some kind of time based change
impedance. For example, if 80% happens to be used for one class of use,
and a big demand begins for a different class, it shouldn't suddenly make
a big change. It should have a "soft" reserve that slowly changes with
the demand. The rate of that change should be tunable, with well chosen
defaults for common configurations.
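
The "soft" reserve with change impedance can be sketched numerically.
This is only an illustration of the idea, not anything the kernel does:
exponential smoothing is an assumed choice of impedance, and the names
step_reserve, simulate and rate are hypothetical. Each tick the reserve
closes a fixed fraction of the gap between its current share of RAM and
the share being demanded, so a sudden demand shifts memory gradually.

```python
def step_reserve(current, demand, rate):
    """Move one class's soft reserve one tick toward its demand.

    current, demand: fractions of RAM in [0, 1]
    rate: fraction of the remaining gap closed per tick (0 < rate <= 1);
          this is the tunable "rate of change".
    """
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    return current + rate * (demand - current)


def simulate(current, demand, rate, ticks):
    """Return the reserve's value at tick 0 and after each of `ticks` steps."""
    history = [current]
    for _ in range(ticks):
        current = step_reserve(current, demand, rate)
        history.append(current)
    return history


if __name__ == "__main__":
    # A class holding 80% of RAM whose fair share drops to 20%.
    for tick, share in enumerate(simulate(0.80, 0.20, rate=0.1, ticks=5)):
        print(tick, round(share, 3))
```

A small rate gives a stiff reserve that gives up memory slowly; a rate
of 1 reproduces the abrupt change argued against above.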


|> | There is no scenario I can think of where shrinking the disk cache is
|> | the right fix.
|
|> When the gains by a larger disk cache are less than the losses by
|> smaller space for other things, then I do see that as a case where a
|> smaller disk cache is appropriate.
|
| What "other things"? You mean mappings? If you mean mappings, the fix
| is dropping swappiness, which gives mappings priority. Do you mean
| memory allocated for kernel structures? They already have priority
| over the disk cache. If you mean something else, do tell.

By "other things" I mean any other kind of demand that could be forced
to do more I/O if the disk cache demand pushes on it. Since Linux does
not have a swapped kernel, the kernel code is fixed in RAM, and that
class of usage won't be relevant to this issue. If any class of usage,
when pressed on by disk write caching, would have to do its own I/O,
then it could be taking I/O bandwidth (especially severe if this
includes head seek time) away from the disk writes, slowing down the
overall useful data rate.

At some point of building up a cache of dirty write data, the cache
should not grow any more, and the write() should be blocked (or, if my
suggestion of allowing non-blocking I/O for file and disk writes were
enabled, should fail with EAGAIN or the legacy EWOULDBLOCK on an
attempt to write to a descriptor with non-blocking enabled). This point
should be enough to allow some _reasonable_ level of write order
optimization to maximize physical device throughput, while specifically
_avoiding_ forcing other classes of memory usage to do any I/O that is
in excess of the performance gains on this writing.
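
The semantics being asked for already exist for pipes and sockets, so
they can be shown there. A minimal sketch, assuming a Unix system: the
pipe stands in for the small finite write cache, and the helper name
fill_until_eagain is hypothetical. It must not be pointed at a regular
file or disk device, because there the kernel does not honor
non-blocking mode for writes and the loop would never see EAGAIN.

```python
import errno
import os


def fill_until_eagain(fd, chunk=b"x" * 4096):
    """Write to a non-blocking pipe or socket until the kernel refuses more.

    Returns the number of bytes accepted before EAGAIN/EWOULDBLOCK.
    """
    os.set_blocking(fd, False)
    total = 0
    while True:
        try:
            total += os.write(fd, chunk)
        except BlockingIOError as exc:
            # The buffer is full. On Linux EAGAIN and EWOULDBLOCK are
            # the same value; anything else is unexpected here.
            if exc.errno not in (errno.EAGAIN, errno.EWOULDBLOCK):
                raise
            return total


if __name__ == "__main__":
    r, w = os.pipe()
    print("pipe accepted", fill_until_eagain(w), "bytes before EAGAIN")
    os.close(r)
    os.close(w)
```

With the same behavior on a disk descriptor, one process could keep
several devices fed from a single poll loop instead of needing one
blocked writer per outstanding I/O.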


|> When the (write) disk cache is considered with respect to its impact on
|> other things, then you have a balancing act.  If there are only two
|> things to address, then you weight the curves by importance, find the
|> intersection, and that's your optimal point.  When there are three or
|> more things to address (and in reality there are many), then there is
|> usually no one point optimal for everything, but there will generally be
|> a range of points that can at least be worked with.
|
| What "other things" are you talking about? If you mean mapped pages,
| you have 'swappiness' for that.

I'm not being specific. I'm being general. I mean anything that can
incur an I/O bandwidth usage that slows down the writing.

When I use O_DIRECT to write to disk, I avoid the issue of competing I/O
because these writes are not cached. Thus other memory uses won't slow
down the writing. The catch with O_DIRECT is, when the physical I/O is
done, there's nothing ready for the driver to immediately start on the
device again. So a round trip back to the process has to be done to get
the next data. O_DIRECT implies synchronous I/O.

The ideal scenario for what I am trying to do would be a SMALL FINITE
cache. In most cases I'm writing sequentially, so there is no gain to
a large cache. It's bulk write that won't be read later, so I have no
gain to leaving it in RAM for something to read soon. What I need is for
there to be JUST ENOUGH queued to immediately keep the device busy
BEFORE making the round trip back to the process to get more data that
it is writing. In theory, I should need no more write cache than the
size that can be collected together for big I/O to the disk, times two
(to allow for data to be ready for the driver to immediately start when
the previous I/O is complete).

I'm putting a program together now to run 2 or more parallel processes
doing writes (using pwrite() calls to a descriptor with O_DIRECT). The
first should have its data going to disk immediately, while the process,
blocked on its pwrite() call, waits until that I/O is done. The
second would soon do its pwrite() call, which also blocks, but this data
sits waiting for the disk to become ready (when the first write
physically gets done). Hopefully this is queued all the way down to the
driver, so the instant the interrupt gets handled in the device driver,
it can start the second chunk of I/O right then, keeping the device busy
to the max. Two such writer processes should do the job. I'm making
this program tunable so I can evaluate if maybe 3 or 4 might work better
on systems that are also busy doing other things (e.g. to help ensure the
write queue always has something for the driver to write).
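
The shape of that program, as a rough sketch rather than the real
thing. Assumptions: Linux, a target that accepts O_DIRECT, and a chunk
size of 1 MiB; the names writer, run_writers and CHUNK are made up for
the example. Each forked writer takes every Nth chunk, so with two
writers one pwrite() is normally in flight while the other is queued
behind it. The buffer comes from an anonymous mmap because O_DIRECT
needs the buffer, offset and length aligned.

```python
import mmap
import os

CHUNK = 1 << 20  # bytes per pwrite(); a multiple of the page size (assumed)


def writer(path, index, nwriters, nchunks, direct=True):
    """Write every nwriters-th chunk of the file, starting at chunk `index`."""
    flags = os.O_WRONLY | (os.O_DIRECT if direct else 0)
    fd = os.open(path, flags)
    buf = mmap.mmap(-1, CHUNK)  # anonymous mapping: page-aligned buffer
    buf.write(bytes([ord("A") + index]) * CHUNK)  # recognizable fill per writer
    try:
        for k in range(index, nchunks, nwriters):
            # Blocks until this chunk is accepted (with O_DIRECT, written).
            if os.pwrite(fd, buf, k * CHUNK) != CHUNK:
                raise OSError("short write")
    finally:
        buf.close()
        os.close(fd)


def run_writers(path, nwriters=2, nchunks=8, direct=True):
    """Fork nwriters processes; return True if all of them exited cleanly."""
    os.close(os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644))
    pids = []
    for index in range(nwriters):
        pid = os.fork()
        if pid == 0:  # child: do the writes, then exit without cleanup
            status = 1
            try:
                writer(path, index, nwriters, nchunks, direct)
                status = 0
            finally:
                os._exit(status)
        pids.append(pid)
    ok = True
    for pid in pids:
        _, st = os.waitpid(pid, 0)
        ok = ok and os.WIFEXITED(st) and os.WEXITSTATUS(st) == 0
    return ok
```

The nwriters parameter is the tunable: run_writers(path, nwriters=2)
is the base case, and 3 or 4 can be tried on a busier system. Passing
direct=False falls back to ordinary cached writes, which is useful on
filesystems that reject O_DIRECT.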

I would rather have had a SMALL FINITE write cache WITH non-blocking I/O
on the descriptor actually work for disk, so I could do it all in ONE
process. But even without the non-blocking I/O, a manageable cache would
be useful. This would mean an ability to specify per-open-descriptor,
a ceiling on the write cache size. When the cached data hits that limit,
then it should behave in a way that it would if otherwise there was no
place to put the data the process is passing through via write(). But
instead of that condition being one that impacts the whole system, it
would be set small (1M or so) so there is minimal impact on the system.
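
Lacking such a per-descriptor ceiling, something like it can be
approximated from user space. This is a workaround sketch, not the
kernel feature described above: it flushes with fdatasync() whenever
about `limit` bytes are unsynced, then hints with posix_fadvise() that
the flushed range will not be read back. The name bounded_write and the
1 MiB default are assumptions, and the writer stalls during each flush,
which the real feature would avoid.

```python
import os


def bounded_write(fd, chunks, limit=1 << 20):
    """Write chunks to fd, flushing whenever about `limit` bytes are unsynced.

    Returns the number of flushes performed.
    """
    start = os.lseek(fd, 0, os.SEEK_CUR)  # start of the unflushed region
    pending = 0                           # bytes written since the last flush
    flushes = 0
    for chunk in chunks:
        view = memoryview(chunk)
        while view:                       # handle short writes
            n = os.write(fd, view)
            view = view[n:]
            pending += n
        if pending >= limit:
            os.fdatasync(fd)              # push the dirty data to the device
            # Bulk data that won't be read back: let the kernel drop it.
            os.posix_fadvise(fd, start, pending, os.POSIX_FADV_DONTNEED)
            start += pending
            pending = 0
            flushes += 1
    if pending:
        os.fdatasync(fd)
        flushes += 1
    return flushes
```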

Clearly, if the system has to do other I/O for legitimate reasons, and that
I/O is to a controller channel, bus path, or physical device, that uses
bandwidth the bulk writing would also be using, there is an unavoidable
competition for it. But I most certainly do not want the bulk writing to
EVER cause other processes to swap out or otherwise be forced to do I/O
that they otherwise would not do (assuming other processes are reasonable
and not currently trying to take all the system resources, either).

My new computer has 6 SATA ports. If I put 6 drives in them, I should be
able to run 6 bulk writers, one for each, all at full speed and run the
drives at their very top speed, without impacting each other, and without
flooding the cache (using 24M of RAM for cache, 4M per disk, would not be
a flood).


| If you mean unmapped clean pages, the problem is very different and no
| change in the cache size will help. (Unless you separate clean and
| dirty data, but then you'll wind up pushing the problem elsewhere.)

I don't know what you mean by "unmapped clean pages". Is that just
residual, such as what was written by a process, then written to the
disk, then is now still sitting there in RAM just in case something
happens to read that disk block and can just use this data as is?


|> |> And there is no need to wait 40 to 120 seconds before starting the writes, as
|> |> an article/post by Ted Ts'o I saw last night suggested ext4 was doing.
|> |
|> | If that's happening, it's likely a bug. I agree, the writes should
|> | start as soon as enough of them are buffered. That should not take
|> | more than a second under any scenario I can imagine. (One exception
|> | might be lazy allocation, but even then, 40 seconds seems completely
|> | unreasonable to me.)
|
|> There are people apparently claiming that deferring things like that is
|> the way to go.  It's hard to say.  If file space allocation is deferred,
|> then an allocation made (committed) later on when it is known where a
|> write has to occur can "piggy back" on the drive positioning, and result
|> in what may be related pages being located near each other (e.g. they
|> may be read back together in the future).  But I'm still a believer that
|> writes should always proceed immediately.  What I would do instead of
|> deferred allocation, is "re-do allocation".  At a later time, if related
|> data has to be written somewhere else, AND if the data for this write is
|> still intact in RAM, do the allocation over (free the other space), and
|> do both writes near each other together.  This would have to be weighted
|> against how busy the disk is, since the repeated write would slow down a
|> very busy disk.
|
| That's a very expensive proposition. But allocating 16MB chunks will
| get you about as much performance as there is to get. And you should
| be able to collect 16MB of data in no more than a second or so,
| typically.

The bug of the day in ext4 (and POSIX) is that writing data to even a small
file (won't come close to 16M), then "twisting" the directory references to
put the new file in place of the old (the usual hardlink the file to an old
name, then rename the new file to take the previous place, leaving the old
file with only an old name link), loses data (because the new file did not
have its _data_ synchronized). It is argued that the process should do an
fsync() on the file before closing it and doing the link twisting. What I
argue is that this should be the default, and a program that does not need
for a file to be immediately synchronized should be able to specify that by
some means (an open() or fcntl() flag). Correctness should be paramount by
default, and risky performance should be an available option. And yes, I
would use the performance options often. But I would expect exactly correct
operation by default, in particular because something to the effect of this
is being done at a layer not near syscalls (e.g. by shell scripts).
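
For concreteness, the sequence being argued over, with the fsync() in
the place its advocates say it belongs. A minimal sketch, assuming a
POSIX system where a directory can be opened and fsync()ed (true on
Linux); the helper name replace_file and the ".new"/".old" suffixes are
illustrative only.

```python
import os


def replace_file(path, data, backup_suffix=".old"):
    """Replace path's contents with data, keeping the old file under a backup name."""
    tmp = path + ".new"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                       # handle short writes
            view = view[os.write(fd, view):]
        os.fsync(fd)                      # data is on disk before any name changes
    finally:
        os.close(fd)

    backup = path + backup_suffix
    if os.path.exists(path):
        if os.path.exists(backup):
            os.unlink(backup)
        os.link(path, backup)             # old contents keep a name
    os.rename(tmp, path)                  # new file atomically takes the old name

    # Sync the directory too, so the link and rename themselves are durable.
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

Remove the os.fsync(fd) line and this is what most programs and shell
scripts actually do, which is the case where the rename can reach disk
before the data does.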

I define correctness as achieving the same outcome as if every operation were
done completely synchronously, within the scope of view of related processes
(which would likely be those in the same process group). That should be the
default. Then performance options could include flags and other features to
allow the program to specify what it does not need to have happen.

That's my personal philosophy: correctness first, by default, absolutely,
and performance options always available.


|> BTW, I'm definitely NOT going to migrate to ext4 for quite a while.  I
|> have posted elsewhere that it appears that POSIX itself is broken with
|> respect to ext4 being able to claim POSIX compliance while being able to
|> lose data by not synchronizing a file allocation with its renaming.  The
|> order of operations _should_ be guaranteed for _related_ data.
|
| POSIX itself is broken in quite a few ways, unfortunately. Don't ever
| get me started on POSIX directory reading functions.

If you were to start a thread here, or a blog on the web, about POSIX being
broken, I would be interested and want to read it (and maybe reply).

--
|WARNING: Due to extreme spam, googlegroups.com is blocked. Due to ignorance |
| by the abuse department, bellsouth.net is blocked. If you post to |
| Usenet from these places, find another Usenet provider ASAP. |
| Phil Howard KA9WGN (email for humans: first name in lower case at ipal.net) |