madvise/fadvise: why doesn't it work?
From: Peter Boncz (boncz_at_cwi.nl)
Date: 10/30/04
Date: 30 Oct 2004 13:29:33 -0700
Dear list,
I have problems making madvise/posix_fadvise stop drowning my system
in pages I don't want anymore (Linux 2.6.8 on amd64).
What I want: I use mmap() to sequentially write out a *large* file.
That is:
1- create file
2- seek (bigger than RAM, e.g. 20GB), you get a file with a big hole
3- mmap (MAP_SHARED)
4- write into the memory region
I do have sufficient disk for that 20GB.
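
For concreteness, steps 1-3 above might look like the following C sketch. The helper name map_sparse_file is hypothetical, and using lseek() plus a one-byte write() to create the hole is an assumption; the post does not say exactly how the file was extended.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from the post): create a sparse file of `size`
 * bytes and map it MAP_SHARED. Returns the mapping, or MAP_FAILED on error.
 * On success *fd_out receives the open file descriptor. */
static void *map_sparse_file(const char *path, off_t size, int *fd_out)
{
    assert(size > 0);

    /* step 1: create the file */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return MAP_FAILED;

    /* step 2: seek to the last byte and write one byte there,
     * which leaves a hole in front of it */
    if (lseek(fd, size - 1, SEEK_SET) == (off_t)-1 || write(fd, "", 1) != 1) {
        close(fd);
        return MAP_FAILED;
    }

    /* step 3: map the whole file shared and writable */
    void *p = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return MAP_FAILED;
    }

    *fd_out = fd;
    return p;
}
```

Step 4 is then ordinary stores into the returned region, from the first byte to the last.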
When I have written as much memory as I have physical RAM (in this
particular case 8GB), write-out performance (speed, bandwidth)
degrades significantly, apparently due to swapping.
My first action was to insert
3a- madvise(MADV_SEQUENTIAL)
On some systems this means that pages are put on the swap-out list
*after* they have been touched. But this does not seem to be happening
in Linux; pages just accumulate in memory.
So I inserted:
4a- msync(MS_SYNC)
This is done regularly, every 100MB or so. It surely writes the pages.
But the pages are not dropped. Linux still gets dizzy at 8GB written out.
Then I also inserted
4b- madvise(MADV_DONTNEED)
over the region written up to then, also periodically every 100MB. But
it has no effect.
Finally I thought I got it when I discovered the new posix_fadvise.
4c- posix_fadvise(POSIX_FADV_DONTNEED)
Also each 100MB on the region written out so far. But it does not help.
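
Put together, the periodic calls 4a-4c could be sketched in C roughly as below. The helper names flush_and_drop and write_in_chunks are hypothetical, the memset() merely stands in for the real writes, and the chunk size is a parameter (the post uses about 100MB). This only shows the call sequence; as described above, it did not keep the pages from accumulating on 2.6.8.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: flush the first `written` bytes of the mapping at
 * `base` (backed by `fd`) and advise the kernel that they are no longer
 * needed. `base` must be page-aligned, as any mmap() result is.
 * Returns 0 on success, -1 if any call failed. */
static int flush_and_drop(int fd, char *base, off_t written)
{
    assert(written >= 0);
    if (written == 0)
        return 0;
    size_t len = (size_t)written;

    if (msync(base, len, MS_SYNC) != 0)              /* 4a */
        return -1;
    if (madvise(base, len, MADV_DONTNEED) != 0)      /* 4b */
        return -1;
    /* 4c: posix_fadvise returns an error number directly, 0 on success */
    if (posix_fadvise(fd, 0, written, POSIX_FADV_DONTNEED) != 0)
        return -1;
    return 0;
}

/* Hypothetical driver: fill the mapping sequentially and call
 * flush_and_drop() on everything written so far after each chunk. */
static int write_in_chunks(int fd, char *base, off_t size, off_t chunk)
{
    assert(chunk > 0);
    for (off_t off = 0; off < size; off += chunk) {
        off_t n = (size - off < chunk) ? size - off : chunk;
        memset(base + off, 'x', (size_t)n);  /* stand-in for real writes */
        if (flush_and_drop(fd, base, off + n) != 0)
            return -1;
    }
    return 0;
}
```

Because the mapping is MAP_SHARED and file-backed, the data written before MADV_DONTNEED can still be read back afterwards; touching those addresses again simply faults the pages back in from the file.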
Then I started reading kernel code.
The comment in the code of mm/fadvise.c says that I should use
msync(MS_INVALIDATE) to do what I want (which is not what the msync
man page says). But hey, I tried it:
4d- msync(MS_INVALIDATE)
Without success again. Top keeps showing an ever-increasing amount of
RES (RSS) memory. I can't stop it! Again, when RSS reaches my physical
RAM size, the system gets very depressed (and occasionally *panics*,
yes! The kernel says it can't alloc).
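
As a cross-check on the RES column in top, one could also count how many pages of the mapped region are resident using mincore(). The sketch below is an illustration only; the helper name resident_pages is hypothetical. Note that for a file-backed mapping mincore() reports page-cache residency, which is related to, but not the same number as, the process RSS.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: count the pages of [addr, addr + len) that are
 * currently resident in RAM. `addr` must be page-aligned.
 * Returns the page count, or -1 on error. */
static long resident_pages(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    if (len == 0)
        return 0;

    /* mincore() fills one byte per page of the range */
    size_t npages = (len + (size_t)page - 1) / (size_t)page;
    unsigned char *vec = malloc(npages);
    if (vec == NULL)
        return -1;

    if (mincore(addr, len, vec) != 0) {
        free(vec);
        return -1;
    }

    long count = 0;
    for (size_t i = 0; i < npages; i++)
        count += vec[i] & 1;   /* low bit set: page is resident */

    free(vec);
    return count;
}
```

Calling this on the written region before and after each of the periodic calls would show whether any of them actually releases pages.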
Can any of you gurus tell me what's wrong (with me)?
thanks a lot in advance,
Peter Boncz
CWI, The Netherlands
ps1: please don't ask why I do not just write with write(), because
I have no answer other than that I can't.
ps2: I see in the fadvise.c code that SEQUENTIAL only doubles the
prefetch size, from 128KB to 256KB. On current hardware, it is not
uncommon to see disk subsystems provide peak sequential performance
only with buffer sizes of 2MB. 256KB is highly suboptimal and very
conservative. If an application takes the trouble to give fadvise, you
can risk a bit more, I think.