Re: [PATCH 0/8] zcache: page cache compression support




----- caiqian@xxxxxxxxxx wrote:

----- "Nitin Gupta" <ngupta@xxxxxxxxxx> wrote:

Frequently accessed filesystem data is stored in memory to reduce access to
(much) slower backing disks. Under memory pressure, these pages are freed and,
when needed again, they have to be read from disk again. When the combined
working set of all running applications exceeds the amount of physical RAM, we
get extreme slowdown, as reading a page from disk can take time on the order
of milliseconds.

Memory compression increases effective memory size and allows more pages to
stay in RAM. Since de/compressing memory pages is several orders of magnitude
faster than disk I/O, this can provide significant performance gains for many
workloads. Also, with multi-cores becoming common, the benefit of reduced
disk I/O should easily outweigh the problem of increased CPU usage.

It is implemented as a "backend" for cleancache_ops [1], which provides
callbacks for events such as when a page is to be removed from the page cache
and when it is required again. We use them to implement a 'second chance'
cache for these evicted page cache pages by compressing and storing them in
memory itself.
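
For illustration, the shape of such a backend is roughly the following. The
callback names and signatures here are assumptions made for the sketch, not
the actual cleancache interface or the zcache_drv.c code:

#include <linux/mm.h>
#include <linux/types.h>

/*
 * Hypothetical callbacks, for illustration only; the real cleancache_ops
 * hooks and their exact signatures may differ.
 */
static void example_put_page(int pool_id, ino_t inode_no, pgoff_t index,
			     struct page *page)
{
	/*
	 * Called when a clean page is evicted from the page cache:
	 * compress it and store the result keyed by
	 * <pool_id, inode_no, index>.
	 */
}

static int example_get_page(int pool_id, ino_t inode_no, pgoff_t index,
			    struct page *page)
{
	/*
	 * Called on a page-cache miss: if a compressed copy exists,
	 * decompress it into 'page' and return 0, avoiding a disk read.
	 */
	return -1;	/* not found; fall back to reading from disk */
}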

We only keep pages that compress to PAGE_SIZE/2 or less. Compressed chunks are
stored using the xvmalloc memory allocator, which is already being used by the
zram driver for the same purpose. Zero-filled pages are detected, and no
memory is allocated for them.
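
For illustration, the store-side decision could look roughly like this. The
helper names are hypothetical, not the driver's actual code; it assumes the
kernel's lzo1x_1_compress() interface used later in this series:

#include <linux/lzo.h>
#include <linux/mm.h>
#include <linux/types.h>

static bool page_is_zero_filled(const void *mem)
{
	const unsigned long *p = mem;
	unsigned int i;

	for (i = 0; i < PAGE_SIZE / sizeof(*p); i++)
		if (p[i])
			return false;
	return true;
}

/*
 * Returns the compressed length to store, 0 for a zero-filled page
 * (no memory is allocated for it), or -1 if the page is not worth keeping.
 */
static int example_try_compress(const unsigned char *src, unsigned char *dst,
				void *wrkmem)
{
	size_t clen = 2 * PAGE_SIZE;	/* dst must cover the LZO worst case */

	if (page_is_zero_filled(src))
		return 0;

	if (lzo1x_1_compress(src, PAGE_SIZE, dst, &clen, wrkmem) != LZO_E_OK)
		return -1;

	return clen <= PAGE_SIZE / 2 ? (int)clen : -1;
}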

A separate "pool" is created for each mount instance of a cleancache-aware
filesystem. Each incoming page is identified with <pool_id, inode_no, index>,
where inode_no identifies the file within the filesystem corresponding to
pool_id, and index is the offset of the page within this inode. Within a pool,
inodes are maintained in an rb-tree, and each of its nodes points to a
separate radix-tree which maintains the list of pages within that inode.
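
For illustration, the indexing can be pictured like this. The struct and field
names are made up for the sketch and are not taken from zcache_drv.h:

#include <linux/radix-tree.h>
#include <linux/rbtree.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct example_inode_node {
	struct rb_node rb_node;		/* linked into the pool's rb-tree */
	unsigned long inode_no;		/* key within the pool */
	struct radix_tree_root pages;	/* index -> compressed chunk */
};

struct example_pool {
	struct rb_root inode_tree;	/* all inodes seen for this mount */
	spinlock_t lock;
	u64 memlimit;			/* manually tunable via sysfs */
};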

While compression reduces disk I/O, it also reduces the space available for
normal (uncompressed) page cache. This can result in more frequent page cache
reclaim and thus higher CPU overhead. Thus, it's important to maintain a good
hit rate for the compressed cache, or the increased CPU overhead can nullify
any other benefits. This requires adaptive (compressed) cache resizing and
page replacement policies that can maintain an optimal cache size and quickly
reclaim unused compressed chunks. This work is yet to be done. However, in the
current state, it allows manually resizing the cache size using the (per-pool)
sysfs node 'memlimit', which in turn frees any excess pages *sigh* randomly.
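
For illustration, resizing from userspace amounts to writing a new byte limit
to that node. The path below is only a guess for the sketch; the real node is
described in the ABI document added by the last patch in this series:

#include <stdio.h>

int main(void)
{
	/* Hypothetical per-pool path; check sysfs-kernel-mm-zcache. */
	FILE *f = fopen("/sys/kernel/mm/zcache/pool0/memlimit", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fprintf(f, "%lu\n", 64UL << 20);	/* limit pool to 64 MB */
	fclose(f);
	return 0;
}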

Finally, it uses percpu stats and compression buffers to allow better
performance on multi-cores. Still, there are known bottlenecks, like the
single xvmalloc mempool per zcache pool, and a few others. I will work on
these when I start with profiling.
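
For illustration, the per-cpu compression buffers could look roughly like the
following. The names are hypothetical, and a real driver would likely allocate
these dynamically rather than as static per-cpu arrays:

#include <linux/lzo.h>
#include <linux/mm.h>
#include <linux/percpu.h>

struct example_cpu_buf {
	u8 out[2 * PAGE_SIZE];		/* worst-case compressed output */
	u8 wrkmem[LZO1X_MEM_COMPRESS];	/* LZO scratch space */
};

static DEFINE_PER_CPU(struct example_cpu_buf, example_bufs);

static struct example_cpu_buf *example_get_buf(void)
{
	/* Disables preemption; pair with put_cpu_var(example_bufs). */
	return &get_cpu_var(example_bufs);
}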

* Performance numbers:
- Tested using iozone filesystem benchmark
- 4 CPUs, 1G RAM
- Read performance gain: ~2.5X
- Random read performance gain: ~3X
- In general, performance gains for every kind of I/O

Test details with graphs can be found here:
http://code.google.com/p/compcache/wiki/zcacheIOzone

If I can get some help with testing, it would be interesting to find out its
effect on more real-life workloads. In particular, I'm interested in finding
out its effect in the KVM virtualization case, where it can potentially allow
running more VMs per host for a given amount of RAM. With zcache enabled, VMs
can be assigned a much smaller amount of memory, since the host can now hold
the bulk of the page-cache pages, allowing VMs to maintain a similar level of
performance while a greater number of them can be hosted.

* How to test:
All patches are against 2.6.35-rc5:

- First, apply all prerequisite patches here:
http://compcache.googlecode.com/hg/sub-projects/zcache_base_patches

- Then apply this patch series; also uploaded here:
http://compcache.googlecode.com/hg/sub-projects/zcache_patches


Nitin Gupta (8):
Allow sharing xvmalloc for zram and zcache
Basic zcache functionality
Create sysfs nodes and export basic statistics
Shrink zcache based on memlimit
Eliminate zero-filled pages
Compress pages using LZO
Use xvmalloc to store compressed chunks
Document sysfs entries

 Documentation/ABI/testing/sysfs-kernel-mm-zcache |   53 +
 drivers/staging/Makefile                         |    2 +
 drivers/staging/zram/Kconfig                     |   22 +
 drivers/staging/zram/Makefile                    |    5 +-
 drivers/staging/zram/xvmalloc.c                  |    8 +
 drivers/staging/zram/zcache_drv.c                | 1312 ++++++++++++++++++++++
 drivers/staging/zram/zcache_drv.h                |   90 ++
 7 files changed, 1491 insertions(+), 1 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-zcache
 create mode 100644 drivers/staging/zram/zcache_drv.c
 create mode 100644 drivers/staging/zram/zcache_drv.h
I tested these patches on top of the Linus tree at commit
d0c6f6258478e1dba532bf7c28e2cd6e1047d3a4, and the OOM killer was triggered
even though there still appeared to be plenty of swap left.

# free -m
             total       used       free     shared    buffers     cached
Mem:           852        379        473          0          3         15
-/+ buffers/cache:        359        492
Swap:         2015         14       2001

# ./usemem 1024
0: Mallocing 32 megabytes
1: Mallocing 32 megabytes
2: Mallocing 32 megabytes
3: Mallocing 32 megabytes
4: Mallocing 32 megabytes
5: Mallocing 32 megabytes
6: Mallocing 32 megabytes
7: Mallocing 32 megabytes
8: Mallocing 32 megabytes
9: Mallocing 32 megabytes
10: Mallocing 32 megabytes
11: Mallocing 32 megabytes
12: Mallocing 32 megabytes
13: Mallocing 32 megabytes
14: Mallocing 32 megabytes
15: Mallocing 32 megabytes
Connection to 192.168.122.193 closed.

usemem invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
usemem cpuset=/ mems_allowed=0
Pid: 1829, comm: usemem Not tainted 2.6.35-rc5+ #5
Call Trace:
[<ffffffff814e10cb>] ? _raw_spin_unlock+0x2b/0x40
[<ffffffff81108520>] dump_header+0x70/0x190
[<ffffffff811086c1>] oom_kill_process+0x81/0x180
[<ffffffff81108c08>] __out_of_memory+0x58/0xd0
[<ffffffff81108ddc>] ? out_of_memory+0x15c/0x1f0
[<ffffffff81108d8f>] out_of_memory+0x10f/0x1f0
[<ffffffff8110cc7f>] __alloc_pages_nodemask+0x7af/0x7c0
[<ffffffff81140a69>] alloc_page_vma+0x89/0x140
[<ffffffff81125f76>] handle_mm_fault+0x6d6/0x990
[<ffffffff814e10cb>] ? _raw_spin_unlock+0x2b/0x40
[<ffffffff81121afd>] ? follow_page+0x19d/0x350
[<ffffffff8112639c>] __get_user_pages+0x16c/0x480
[<ffffffff810127c9>] ? sched_clock+0x9/0x10
[<ffffffff811276ef>] __mlock_vma_pages_range+0xef/0x1f0
[<ffffffff81127f01>] mlock_vma_pages_range+0x91/0xa0
[<ffffffff8112ad57>] mmap_region+0x307/0x5b0
[<ffffffff8112b354>] do_mmap_pgoff+0x354/0x3a0
[<ffffffff8112b3fc>] ? sys_mmap_pgoff+0x5c/0x200
[<ffffffff8112b41a>] sys_mmap_pgoff+0x7a/0x200
[<ffffffff814e02f2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff8100fa09>] sys_mmap+0x29/0x30
[<ffffffff8100b032>] system_call_fastpath+0x16/0x1b
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 140
CPU 1: hi: 186, btch: 31 usd: 47
active_anon:128 inactive_anon:140 isolated_anon:0
active_file:0 inactive_file:9 isolated_file:0
unevictable:126855 dirty:0 writeback:125 unstable:0
free:1996 slab_reclaimable:4445 slab_unreclaimable:23646
mapped:923 shmem:7 pagetables:778 bounce:0
Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB
inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:11896kB isolated(anon):0kB isolated(file):0kB
present:15756kB mlocked:11896kB dirty:0kB writeback:0kB mapped:0kB
shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB
pagetables:24kB unstable:0kB bounce:0kB writeback_tmp:0kB
pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 994 994 994
Node 0 DMA32 free:3952kB min:4000kB low:5000kB high:6000kB
active_anon:512kB inactive_anon:560kB active_file:0kB
inactive_file:36kB unevictable:495524kB isolated(anon):0kB
isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB
writeback:500kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB
slab_unreclaimable:94584kB kernel_stack:1296kB pagetables:3088kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1726
all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB 0*512kB
1*1024kB 1*2048kB 0*4096kB = 4032kB
Node 0 DMA32: 476*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
0*512kB 0*1024kB 1*2048kB 0*4096kB = 3952kB
1146 total pagecache pages
215 pages in swap cache
Swap cache stats: add 19633, delete 19418, find 941/1333
Free swap = 2051080kB
Total swap = 2064380kB
262138 pages RAM
43914 pages reserved
4832 pages shared
155665 pages non-shared
Out of memory: kill process 1727 (console-kit-dae) score 1027939 or a
child
Killed process 1727 (console-kit-dae) vsz:4111756kB, anon-rss:0kB,
file-rss:600kB
console-kit-dae invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
console-kit-dae cpuset=/ mems_allowed=0
Pid: 1752, comm: console-kit-dae Not tainted 2.6.35-rc5+ #5
Call Trace:
[<ffffffff814e10cb>] ? _raw_spin_unlock+0x2b/0x40
[<ffffffff81108520>] dump_header+0x70/0x190
[<ffffffff811086c1>] oom_kill_process+0x81/0x180
[<ffffffff81108c08>] __out_of_memory+0x58/0xd0
[<ffffffff81108ddc>] ? out_of_memory+0x15c/0x1f0
[<ffffffff81108d8f>] out_of_memory+0x10f/0x1f0
[<ffffffff8110cc7f>] __alloc_pages_nodemask+0x7af/0x7c0
[<ffffffff8114522e>] kmem_getpages+0x6e/0x180
[<ffffffff81147d79>] fallback_alloc+0x1c9/0x2b0
[<ffffffff81147602>] ? cache_grow+0x4b2/0x520
[<ffffffff81147a5b>] ____cache_alloc_node+0xab/0x200
[<ffffffff810d55d5>] ? taskstats_exit+0x305/0x3b0
[<ffffffff8114862b>] kmem_cache_alloc+0x1fb/0x290
[<ffffffff810d55d5>] taskstats_exit+0x305/0x3b0
[<ffffffff81063a4b>] do_exit+0x12b/0x890
[<ffffffff810924fd>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff8108641f>] ? cpu_clock+0x6f/0x80
[<ffffffff81095cbd>] ? lock_release_holdtime+0x3d/0x190
[<ffffffff814e1010>] ? _raw_spin_unlock_irq+0x30/0x40
[<ffffffff8106420e>] do_group_exit+0x5e/0xd0
[<ffffffff81075b54>] get_signal_to_deliver+0x2d4/0x490
[<ffffffff811ea6ad>] ? inode_has_perm+0x7d/0xf0
[<ffffffff8100a2e5>] do_signal+0x75/0x7b0
[<ffffffff81169d2d>] ? vfs_ioctl+0x3d/0xf0
[<ffffffff8116a394>] ? do_vfs_ioctl+0x84/0x570
[<ffffffff8100aa85>] do_notify_resume+0x65/0x80
[<ffffffff814e02f2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff8100b381>] int_signal+0x12/0x17
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 151
CPU 1: hi: 186, btch: 31 usd: 61
active_anon:128 inactive_anon:165 isolated_anon:0
active_file:0 inactive_file:9 isolated_file:0
unevictable:126855 dirty:0 writeback:25 unstable:0
free:1965 slab_reclaimable:4445 slab_unreclaimable:23646
mapped:923 shmem:7 pagetables:778 bounce:0
Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB
inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:11896kB isolated(anon):0kB isolated(file):0kB
present:15756kB mlocked:11896kB dirty:0kB writeback:0kB mapped:0kB
shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB
pagetables:24kB unstable:0kB bounce:0kB writeback_tmp:0kB
pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 994 994 994
Node 0 DMA32 free:3828kB min:4000kB low:5000kB high:6000kB
active_anon:512kB inactive_anon:660kB active_file:0kB
inactive_file:36kB unevictable:495524kB isolated(anon):0kB
isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB
writeback:100kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB
slab_unreclaimable:94584kB kernel_stack:1296kB pagetables:3088kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1726
all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB 0*512kB
1*1024kB 1*2048kB 0*4096kB = 4032kB
Node 0 DMA32: 445*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
0*512kB 0*1024kB 1*2048kB 0*4096kB = 3828kB
1146 total pagecache pages
230 pages in swap cache
Swap cache stats: add 19649, delete 19419, find 942/1336
Free swap = 2051084kB
Total swap = 2064380kB
262138 pages RAM
43914 pages reserved
4818 pages shared
155685 pages non-shared
Out of memory: kill process 1806 (sshd) score 9474 or a child
Killed process 1810 (bash) vsz:108384kB, anon-rss:0kB, file-rss:656kB
console-kit-dae invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=0
console-kit-dae cpuset=/ mems_allowed=0
Pid: 1752, comm: console-kit-dae Not tainted 2.6.35-rc5+ #5
Call Trace:
[<ffffffff814e10cb>] ? _raw_spin_unlock+0x2b/0x40
[<ffffffff81108520>] dump_header+0x70/0x190
[<ffffffff811086c1>] oom_kill_process+0x81/0x180
[<ffffffff81108c08>] __out_of_memory+0x58/0xd0
[<ffffffff81108ddc>] ? out_of_memory+0x15c/0x1f0
[<ffffffff81108d8f>] out_of_memory+0x10f/0x1f0
[<ffffffff8110cc7f>] __alloc_pages_nodemask+0x7af/0x7c0
[<ffffffff8114522e>] kmem_getpages+0x6e/0x180
[<ffffffff81147d79>] fallback_alloc+0x1c9/0x2b0
[<ffffffff81147602>] ? cache_grow+0x4b2/0x520
[<ffffffff81147a5b>] ____cache_alloc_node+0xab/0x200
[<ffffffff810d55d5>] ? taskstats_exit+0x305/0x3b0
[<ffffffff8114862b>] kmem_cache_alloc+0x1fb/0x290
[<ffffffff810d55d5>] taskstats_exit+0x305/0x3b0
[<ffffffff81063a4b>] do_exit+0x12b/0x890
[<ffffffff810924fd>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff8108641f>] ? cpu_clock+0x6f/0x80
[<ffffffff81095cbd>] ? lock_release_holdtime+0x3d/0x190
[<ffffffff814e1010>] ? _raw_spin_unlock_irq+0x30/0x40
[<ffffffff8106420e>] do_group_exit+0x5e/0xd0
[<ffffffff81075b54>] get_signal_to_deliver+0x2d4/0x490
[<ffffffff811ea6ad>] ? inode_has_perm+0x7d/0xf0
[<ffffffff8100a2e5>] do_signal+0x75/0x7b0
[<ffffffff81169d2d>] ? vfs_ioctl+0x3d/0xf0
[<ffffffff8116a394>] ? do_vfs_ioctl+0x84/0x570
[<ffffffff8100aa85>] do_notify_resume+0x65/0x80
[<ffffffff814e02f2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[<ffffffff8100b381>] int_signal+0x12/0x17
Mem-Info:
Node 0 DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
Node 0 DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 119
CPU 1: hi: 186, btch: 31 usd: 73
active_anon:50 inactive_anon:175 isolated_anon:0
active_file:0 inactive_file:9 isolated_file:0
unevictable:126855 dirty:0 writeback:25 unstable:0
free:1996 slab_reclaimable:4445 slab_unreclaimable:23663
mapped:923 shmem:7 pagetables:778 bounce:0
Node 0 DMA free:4032kB min:60kB low:72kB high:88kB active_anon:0kB
inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:11896kB isolated(anon):0kB isolated(file):0kB
present:15756kB mlocked:11896kB dirty:0kB writeback:0kB mapped:0kB
shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB
pagetables:24kB unstable:0kB bounce:0kB writeback_tmp:0kB
pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 994 994 994
Node 0 DMA32 free:3952kB min:4000kB low:5000kB high:6000kB
active_anon:200kB inactive_anon:700kB active_file:0kB
inactive_file:36kB unevictable:495524kB isolated(anon):0kB
isolated(file):0kB present:1018060kB mlocked:495524kB dirty:0kB
writeback:100kB mapped:3692kB shmem:28kB slab_reclaimable:17780kB
slab_unreclaimable:94652kB kernel_stack:1296kB pagetables:3088kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1536
all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
Node 0 DMA: 0*4kB 2*8kB 1*16kB 1*32kB 2*64kB 2*128kB 2*256kB 0*512kB
1*1024kB 1*2048kB 0*4096kB = 4032kB
Node 0 DMA32: 470*4kB 3*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB
0*512kB 0*1024kB 1*2048kB 0*4096kB = 3952kB
1146 total pagecache pages
221 pages in swap cache
Swap cache stats: add 19848, delete 19627, find 970/1386
Free swap = 2051428kB
Total swap = 2064380kB
262138 pages RAM
43914 pages reserved
4669 pages shared
155659 pages non-shared
Out of memory: kill process 1829 (usemem) score 8253 or a child
Killed process 1829 (usemem) vsz:528224kB, anon-rss:502468kB,
file-rss:376kB

# cat usemem.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define CHUNKS 32

int
main(int argc, char *argv[])
{
	unsigned long mb;
	char *buf[CHUNKS];
	int i;

	/* Lock all current and future mappings into RAM. */
	mlockall(MCL_FUTURE);

	if (argc < 2) {
		fprintf(stderr, "usage: usemem megabytes\n");
		exit(1);
	}
	mb = strtoul(argv[1], NULL, 0);

	/* Allocate the requested amount in CHUNKS pieces. */
	for (i = 0; i < CHUNKS; i++) {
		fprintf(stderr, "%d: Mallocing %lu megabytes\n", i, mb/CHUNKS);
		buf[i] = malloc(mb/CHUNKS * 1024L * 1024L);
		if (!buf[i]) {
			fprintf(stderr, "malloc failure\n");
			exit(1);
		}
	}

	/* Touch every page so the memory is actually faulted in. */
	for (i = 0; i < CHUNKS; i++) {
		fprintf(stderr, "%d: Zeroing %lu megabytes at %p\n",
			i, mb/CHUNKS, buf[i]);
		memset(buf[i], 0, mb/CHUNKS * 1024L * 1024L);
	}

	exit(0);
}

In case it is relevant, this was tested inside a KVM guest. The host was RHEL6 with THP enabled.