Re: Anticipatory prefaulting in the page fault handler V1

From: Adam Litke (agl_at_us.ibm.com)
Date: 12/14/04

  • Next message: Adam Denenberg: "bind() udp behavior 2.6.8.1"
    To: Christoph Lameter <clameter@sgi.com>
    Date:	Tue, 14 Dec 2004 09:28:16 -0600
    
    

    What benchmark are you using to generate the following results? I'd
    like to run this on some of my hardware and see how the results compare.

    On Wed, 2004-12-08 at 11:24, Christoph Lameter wrote:
    > Standard Kernel on a 512 Cpu machine allocating 32GB with an increasing
    > number of threads (and thus increasing parallellism of page faults):
    >
    > Gb Rep Threads User System Wall flt/cpu/s fault/wsec
    > 32 3 1 1.416s 138.165s 139.050s 45073.831 45097.498
    > 32 3 2 1.397s 148.523s 78.044s 41965.149 80201.646
    > 32 3 4 1.390s 152.618s 44.044s 40851.258 141545.239
    > 32 3 8 1.500s 374.008s 53.001s 16754.519 118671.950
    > 32 3 16 1.415s 1051.759s 73.094s 5973.803 85087.358
    > 32 3 32 1.867s 3400.417s 117.003s 1849.186 53754.928
    > 32 3 64 5.361s 11633.040s 197.034s 540.577 31881.112
    > 32 3 128 23.387s 39386.390s 332.055s 159.642 18918.599
    > 32 3 256 15.409s 20031.450s 168.095s 313.837 37237.918
    > 32 3 512 18.720s 10338.511s 86.047s 607.446 72752.686
    >
    > Patched kernel:
    >
    > Gb Rep Threads User System Wall flt/cpu/s fault/wsec
    > 32 3 1 1.098s 138.544s 139.063s 45053.657 45057.920
    > 32 3 2 1.022s 127.770s 67.086s 48849.350 92707.085
    > 32 3 4 0.995s 119.666s 37.045s 52141.503 167955.292
    > 32 3 8 0.928s 87.400s 18.034s 71227.407 342934.242
    > 32 3 16 1.067s 72.943s 11.035s 85007.293 553989.377
    > 32 3 32 1.248s 133.753s 10.038s 46602.680 606062.151
    > 32 3 64 5.557s 438.634s 13.093s 14163.802 451418.617
    > 32 3 128 17.860s 1496.797s 19.048s 4153.714 322808.509
    > 32 3 256 13.382s 766.063s 10.016s 8071.695 618816.838
    > 32 3 512 17.067s 369.106s 5.041s 16291.764 1161285.521
    >
    > These number are roughly equal to what can be accomplished with the
    > page fault scalability patches.
    >
    > Kernel patches with both the page fault scalability patches and
    > prefaulting:
    >
    > Gb Rep Threads User System Wall flt/cpu/s fault/wsec
    > 32 10 1 4.103s 456.384s 460.046s 45541.992 45544.369
    > 32 10 2 4.005s 415.119s 221.095s 50036.407 94484.174
    > 32 10 4 3.855s 371.317s 111.076s 55898.259 187635.724
    > 32 10 8 3.902s 308.673s 67.094s 67092.476 308634.397
    > 32 10 16 4.011s 224.213s 37.016s 91889.781 564241.062
    > 32 10 32 5.483s 209.391s 27.046s 97598.647 763495.417
    > 32 10 64 19.166s 219.925s 26.030s 87713.212 797286.395
    > 32 10 128 53.482s 342.342s 27.024s 52981.744 769687.791
    > 32 10 256 67.334s 180.321s 15.036s 84679.911 1364614.334
    > 32 10 512 66.516s 93.098s 9.015s131387.893 2291548.865
    >
    > The fault rate doubles when both patches are applied.
    >
    > And on the high end (512 processors allocating 256G) (No numbers
    > for regular kernels because they are extremely slow, also no
    > number for a low number of threads. Also very slow)
    >
    > With prefaulting:
    >
    > Gb Rep Threads User System Wall flt/cpu/s fault/wsec
    > 256 3 4 8.241s 1414.348s 449.016s 35380.301 112056.239
    > 256 3 8 8.306s 1300.982s 247.025s 38441.977 203559.271
    > 256 3 16 8.368s 1223.853s 154.089s 40846.272 324940.924
    > 256 3 32 8.536s 1284.041s 110.097s 38938.970 453556.624
    > 256 3 64 13.025s 3587.203s 110.010s 13980.123 457131.492
    > 256 3 128 25.025s 11460.700s 145.071s 4382.104 345404.909
    > 256 3 256 26.150s 6061.649s 75.086s 8267.625 663414.482
    > 256 3 512 20.637s 3037.097s 38.062s 16460.435 1302993.019
    >
    > Page fault scalability patch and prefaulting. Max prefault order
    > increased to 5 (max preallocation of 32 pages):
    >
    > Gb Rep Threads User System Wall flt/cpu/s fault/wsec
    > 256 10 8 33.571s 4516.293s 863.021s 36874.099 194356.930
    > 256 10 16 33.103s 3737.688s 461.028s 44492.553 363704.484
    > 256 10 32 35.094s 3436.561s 321.080s 48326.262 521352.840
    > 256 10 64 46.675s 2899.997s 245.020s 56936.124 684214.256
    > 256 10 128 85.493s 2890.198s 203.008s 56380.890 826122.524
    > 256 10 256 74.299s 1374.973s 99.088s115762.963 1679630.272
    > 256 10 512 62.760s 706.559s 53.027s218078.311 3149273.714
    >
    > We are getting into an almost linear scalability in the high end with
    > both patches and end up with a fault rate > 3 mio faults per second.
    >
    > The one thing that takes up a lot of time is still be the zeroing
    > of pages in the page fault handler. There is a another
    > set of patches that I am working on which will prezero pages
    > and led to another an increase in performance by a factor of 2-4
    > (if prezeroed pages are available which may not always be the case).
    > Maybe we can reach 10 mio fault /sec that way.
    >
    > Patch against 2.6.10-rc3-bk3:
    >
    > Index: linux-2.6.9/include/linux/sched.h
    > ===================================================================
    > --- linux-2.6.9.orig/include/linux/sched.h 2004-12-01 10:37:31.000000000 -0800
    > +++ linux-2.6.9/include/linux/sched.h 2004-12-01 10:38:15.000000000 -0800
    > @@ -537,6 +537,8 @@
    > #endif
    >
    > struct list_head tasks;
    > + unsigned long anon_fault_next_addr; /* Predicted sequential fault address */
    > + int anon_fault_order; /* Last order of allocation on fault */
    > /*
    > * ptrace_list/ptrace_children forms the list of my children
    > * that were stolen by a ptracer.
    > Index: linux-2.6.9/mm/memory.c
    > ===================================================================
    > --- linux-2.6.9.orig/mm/memory.c 2004-12-01 10:38:11.000000000 -0800
    > +++ linux-2.6.9/mm/memory.c 2004-12-01 10:45:01.000000000 -0800
    > @@ -55,6 +55,7 @@
    >
    > #include <linux/swapops.h>
    > #include <linux/elf.h>
    > +#include <linux/pagevec.h>
    >
    > #ifndef CONFIG_DISCONTIGMEM
    > /* use the per-pgdat data instead for discontigmem - mbligh */
    > @@ -1432,8 +1433,106 @@
    > unsigned long addr)
    > {
    > pte_t entry;
    > - struct page * page = ZERO_PAGE(addr);
    > + struct page * page;
    > +
    > + addr &= PAGE_MASK;
    > +
    > + if (current->anon_fault_next_addr == addr) {
    > + unsigned long end_addr;
    > + int order = current->anon_fault_order;
    > +
    > + /* Sequence of page faults detected. Perform preallocation of pages */
    >
    > + /* The order of preallocations increases with each successful prediction */
    > + order++;
    > +
    > + if ((1 << order) < PAGEVEC_SIZE)
    > + end_addr = addr + (1 << (order + PAGE_SHIFT));
    > + else
    > + end_addr = addr + PAGEVEC_SIZE * PAGE_SIZE;
    > +
    > + if (end_addr > vma->vm_end)
    > + end_addr = vma->vm_end;
    > + if ((addr & PMD_MASK) != (end_addr & PMD_MASK))
    > + end_addr &= PMD_MASK;
    > +
    > + current->anon_fault_next_addr = end_addr;
    > + current->anon_fault_order = order;
    > +
    > + if (write_access) {
    > +
    > + struct pagevec pv;
    > + unsigned long a;
    > + struct page **p;
    > +
    > + pte_unmap(page_table);
    > + spin_unlock(&mm->page_table_lock);
    > +
    > + pagevec_init(&pv, 0);
    > +
    > + if (unlikely(anon_vma_prepare(vma)))
    > + return VM_FAULT_OOM;
    > +
    > + /* Allocate the necessary pages */
    > + for(a = addr;a < end_addr ; a += PAGE_SIZE) {
    > + struct page *p = alloc_page_vma(GFP_HIGHUSER, vma, a);
    > +
    > + if (p) {
    > + clear_user_highpage(p, a);
    > + pagevec_add(&pv,p);
    > + } else
    > + break;
    > + }
    > + end_addr = a;
    > +
    > + spin_lock(&mm->page_table_lock);
    > +
    > + for(p = pv.pages; addr < end_addr; addr += PAGE_SIZE, p++) {
    > +
    > + page_table = pte_offset_map(pmd, addr);
    > + if (!pte_none(*page_table)) {
    > + /* Someone else got there first */
    > + page_cache_release(*p);
    > + pte_unmap(page_table);
    > + continue;
    > + }
    > +
    > + entry = maybe_mkwrite(pte_mkdirty(mk_pte(*p,
    > + vma->vm_page_prot)),
    > + vma);
    > +
    > + mm->rss++;
    > + lru_cache_add_active(*p);
    > + mark_page_accessed(*p);
    > + page_add_anon_rmap(*p, vma, addr);
    > +
    > + set_pte(page_table, entry);
    > + pte_unmap(page_table);
    > +
    > + /* No need to invalidate - it was non-present before */
    > + update_mmu_cache(vma, addr, entry);
    > + }
    > + } else {
    > + /* Read */
    > + for(;addr < end_addr; addr += PAGE_SIZE) {
    > + page_table = pte_offset_map(pmd, addr);
    > + entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
    > + set_pte(page_table, entry);
    > + pte_unmap(page_table);
    > +
    > + /* No need to invalidate - it was non-present before */
    > + update_mmu_cache(vma, addr, entry);
    > +
    > + };
    > + }
    > + spin_unlock(&mm->page_table_lock);
    > + return VM_FAULT_MINOR;
    > + }
    > +
    > + current->anon_fault_next_addr = addr + PAGE_SIZE;
    > + current->anon_fault_order = 0;
    > +
    > + page = ZERO_PAGE(addr);
    > /* Read-only mapping of ZERO_PAGE. */
    > entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), vma->vm_page_prot));
    >
    > -
    > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    > the body of a message to majordomo@vger.kernel.org
    > More majordomo info at http://vger.kernel.org/majordomo-info.html
    > Please read the FAQ at http://www.tux.org/lkml/

    -- 
    Adam Litke - (agl at us.ibm.com)
    IBM Linux Technology Center
    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at  http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at  http://www.tux.org/lkml/
    

  • Next message: Adam Denenberg: "bind() udp behavior 2.6.8.1"