madvise/fadvise: why doesn't it work?

From: Peter Boncz (boncz_at_cwi.nl)
Date: 10/30/04


Date: 30 Oct 2004 13:29:33 -0700

Dear list,

I have problems making madvise/posix_fadvise stop drowning my system
with pages I don't want anymore (linux 2.6.8 on amd64).

What I want: I use mmap() to sequentially write out a *large* file.
That is:
1- create file
2- seek (bigger than RAM, e.g. 20GB), you get a file with a big hole
3- mmap (MAP_SHARED)
4- write into the memory region

I do have sufficient disk for that 20GB.
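
For readers who want to reproduce this, steps 1-4 above can be sketched roughly as follows. This is only a minimal illustration, not the actual program: the helper names map_sparse_file() and fill_region() are made up for the example, and the hole is created by seeking to the last byte and writing a single byte there.

```c
#define _FILE_OFFSET_BITS 64   /* allow offsets beyond 2GB on 32-bit builds */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper covering steps 1-3: create `path`, extend it to `len`
 * bytes as a sparse file, and map it shared and writable.
 * Returns the mapping, or NULL on error; the open descriptor goes to *fd_out. */
static char *map_sparse_file(const char *path, size_t len, int *fd_out)
{
    assert(len > 0 && fd_out != NULL);

    /* step 1: create the file */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return NULL;

    /* step 2: seek to the last byte and write it, leaving a hole before it */
    if (lseek(fd, (off_t)(len - 1), SEEK_SET) == (off_t)-1 ||
        write(fd, "", 1) != 1) {
        close(fd);
        return NULL;
    }

    /* step 3: map the whole file MAP_SHARED so stores reach the file */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return NULL;
    }
    *fd_out = fd;
    return p;
}

/* Hypothetical helper for step 4: write sequentially into the mapped region. */
static void fill_region(char *base, size_t len, int byte)
{
    memset(base, byte, len);
}
```

With a length larger than RAM (20GB in my case) the fill in step 4 is where the trouble starts; with a small length it simply works.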

When I have written as much memory as I have physical RAM (in this
particular case 8GB), write-out performance (speed, bandwidth)
degrades significantly, apparently due to swapping.

My first action was to insert
3a- madvise(MADV_SEQUENTIAL)

On some systems this means that *after* pages have been
touched, they are put on the swapout list. But this does not seem to
be happening in Linux; pages just accumulate in memory.

So I inserted:
4a- msync(MS_SYNC)

This is done regularly, every 100MB or so. It surely writes the pages.
But the pages are not dropped. Linux still gets dizzy at 8GB written out.
Then I also inserted

4b- madvise(MADV_DONTNEED)

over the region written so far, also periodically every 100MB. But
it has no effect.

Finally, I thought I had it when I discovered the new posix_fadvise.

4c- posix_fadvise(POSIX_FADV_DONTNEED)

Also each 100MB on the region written out so far. But it does not help.
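
To make the combination of 4a, 4b and 4c concrete, here is a rough sketch of the periodic flush. It is an illustration under assumptions, not the original code: the names flush_and_drop() and write_in_chunks() are hypothetical, the calls are applied only to the chunk just finished rather than to everything written so far, and both the chunk size and the total size are assumed to be multiples of the page size.

```c
#define _GNU_SOURCE            /* exposes madvise() and posix_fadvise() in glibc */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: flush one finished chunk of the mapping and hint that
 * it is no longer needed. `off` and `len` must be multiples of the page size.
 * Returns 0 on success, -1 if any of the three calls fails. */
static int flush_and_drop(int fd, char *base, size_t off, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    assert(off % page == 0 && len % page == 0);

    if (msync(base + off, len, MS_SYNC) != 0)             /* 4a: write back  */
        return -1;
    if (madvise(base + off, len, MADV_DONTNEED) != 0)     /* 4b: drop mapping */
        return -1;
    /* 4c: posix_fadvise() returns an error number directly, not -1/errno */
    if (posix_fadvise(fd, (off_t)off, (off_t)len, POSIX_FADV_DONTNEED) != 0)
        return -1;
    return 0;
}

/* Hypothetical driver: fill the mapping chunk by chunk, flushing after each. */
static int write_in_chunks(int fd, char *base, size_t total, size_t chunk)
{
    for (size_t off = 0; off < total; off += chunk) {
        size_t n = total - off < chunk ? total - off : chunk;
        memset(base + off, 0x42, n);                      /* step 4: write   */
        if (flush_and_drop(fd, base, off, n) != 0)
            return -1;
    }
    return 0;
}
```

In the real run the chunk would be around 100MB. All three calls are hints or flushes; none of them is documented as a guarantee that the kernel frees the pages at once.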

Then I started reading kernel code.

The comment in the code of mm/fadvise.c says that I should use
msync(MS_INVALIDATE) to do what I want (which is not what the msync
man page says). But hey, I tried it:

4d- msync(MS_INVALIDATE)

Without success again. Top keeps showing an ever increasing amount of
RES (RSS) memory. I can't stop it! Again, when RSS reaches my physical
RAM size, the system gets very depressed (and occasionally *panics*,
yes! kernel says it can't alloc).
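
Besides watching top, residency can also be checked from inside the program with mincore(). The sketch below is only an illustration; count_resident_pages() is a hypothetical helper name. One caveat, as far as I understand it: for a file-backed mapping on Linux, mincore() reports whether a page is in memory (page cache), which is not necessarily the same number that top shows as RES.

```c
#define _GNU_SOURCE            /* exposes mincore() in glibc */
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: count how many pages of [base, base+len) are resident.
 * `base` must be page-aligned. Returns the count, or -1 on error. */
static long count_resident_pages(void *base, size_t len)
{
    assert(len > 0);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;

    unsigned char *vec = malloc(npages);   /* one status byte per page */
    if (vec == NULL)
        return -1;
    if (mincore(base, len, vec) != 0) {
        free(vec);
        return -1;
    }

    long resident = 0;
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;            /* low bit set means resident */
    free(vec);
    return resident;
}
```

Calling this on the written region before and after each madvise/fadvise attempt shows whether the advice changed anything.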

Can any of you gurus tell me what's wrong (with me)?

thanks a lot in advance,

Peter Boncz
CWI, The Netherlands

ps1: please don't ask why I do not simply write with write(); the only
answer I have is that I can't.

ps2: I see in the fadvise.c code that SEQUENTIAL only doubles the
prefetch size, from 128KB to 256KB. On current hardware, it is not
uncommon to see disk subsystems provide peak sequential performance
only with buffer sizes of 2MB. 256KB is highly suboptimal and very
conservative. If an application takes the trouble to give fadvise, you
can risk a bit more, I think.
