hugepage: Serialize hugepage allocation and instantiation



This patch applies on top of
hugepage-small-fixes-to-hugepage-clear-copy-path.patch in -mm.

Currently, no lock or mutex is held between allocating a hugepage and
inserting it into the pagetables / page cache. When we do go to
insert the page into pagetables or page cache, we recheck and may free
the newly allocated hugepage. However, since the number of hugepages
in the system is strictly limited, and it's usualy to want to use all
of them, this can still lead to spurious allocation failures.

For example, suppose two processes are both mapping (MAP_SHARED) the
same hugepage file, large enough to consume the entire available
hugepage pool. If they race instantiating the last page in the
mapping, they will both attempt to allocate the last available
hugepage. One will fail, of course, returning OOM from the fault and
thus causing the process to be killed, despite the fact that the
entire mapping can, in fact, be instantiated.

The patch below fixes this race by the simple method of adding a
(sleeping) mutex to serialize the hugepage fault path between
allocation and insertion into pagetables and/or page cache. It would
be possible to avoid the serialization by catching the allocation
failures, waiting on some condition, then rechecking to see if someone
else has instantiated the page for us. Given the likely frequency of
hugepage instantiations, it seems very doubtful it's worth the extra
complexity.

This patch causes no regression on the libhugetlbfs testsuite, and one
test, which can trigger this race now passes where it previously
failed.

Actually, the test still sometimes fails, though less often and only
as a shmat() failure, rather processes getting OOM killed by the VM.
The dodgy heuristic tests in fs/hugetlbfs/inode.c for whether there's
enough hugepage space aren't protected by the new mutex, and would be
ugly to do so, so there's still a race there. Another patch to
replace those tests with something saner for this reason as well as
others coming...

Signed-off-by: David Gibson <dwg@xxxxxxxxxxx>

Index: working-2.6/mm/hugetlb.c
===================================================================
--- working-2.6.orig/mm/hugetlb.c 2006-02-28 16:12:06.000000000 +1100
+++ working-2.6/mm/hugetlb.c 2006-02-28 16:20:36.000000000 +1100
@@ -13,6 +13,7 @@
#include <linux/pagemap.h>
#include <linux/mempolicy.h>
#include <linux/cpuset.h>
+#include <linux/mutex.h>

#include <asm/page.h>
#include <asm/pgtable.h>
@@ -54,6 +55,13 @@ static void copy_huge_page(struct page *
*/
static DEFINE_SPINLOCK(hugetlb_lock);

+/*
+ * Serializes hugepage allocation and instantiation, so that we don't
+ * get spurious allocation failures if two CPUs race to instantiate
+ * the same page in the page cache.
+ */
+static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+
static void enqueue_huge_page(struct page *page)
{
int nid = page_to_nid(page);
@@ -512,9 +520,13 @@ int hugetlb_fault(struct mm_struct *mm,
if (!ptep)
return VM_FAULT_OOM;

+ mutex_lock(&hugetlb_instantiation_mutex);
entry = *ptep;
- if (pte_none(entry))
- return hugetlb_no_page(mm, vma, address, ptep, write_access);
+ if (pte_none(entry)) {
+ ret = hugetlb_no_page(mm, vma, address, ptep, write_access);
+ mutex_unlock(&hugetlb_instantiation_mutex);
+ return ret;
+ }

ret = VM_FAULT_MINOR;

@@ -524,6 +536,7 @@ int hugetlb_fault(struct mm_struct *mm,
if (write_access && !pte_write(entry))
ret = hugetlb_cow(mm, vma, address, ptep, entry);
spin_unlock(&mm->page_table_lock);
+ mutex_unlock(&hugetlb_instantiation_mutex);

return ret;
}

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • Re: [patch -mmotm] mm: invoke oom killer for __GFP_NOFAIL
    ... Then it will show the page allocation failure warning and will return. ... Sure, if it selects a task that will free a lot of memory, which is it's ... if the huge page was allocated from the static hugepage pool then ... does not necessarily help applications requiring more memory unless they ...
    (Linux-Kernel)
  • Re: [patch -mmotm] mm: invoke oom killer for __GFP_NOFAIL
    ... Then it will show the page allocation failure warning and will return. ... Sure, if it selects a task that will free a lot of memory, which is it's ... if the huge page was allocated from the static hugepage pool then ... But this patch increases the probability of innocent task killing. ...
    (Linux-Kernel)
  • Re: [PATCH 0/4] Reclaim page capture v3
    ... These are the numbers for the improvement of hugepage allocation success? ... then processes will start queueing up and not be allowed to jump the ... required for reclaim (only dirty pages, and only when backed by filesystems ...
    (Linux-Kernel)
  • Re: [PATCH 20/20] Get rid of the concept of hot/cold page freeing
    ... With the patch applied, we are still using hot/cold information on the ... allocation side so I'm somewhat surprised the patch even makes much of a ... an active process getting cache hot pages. ... We switched to doing non-temporal stores in copy_from_user, ...
    (Linux-Kernel)
  • Re: [PATCH 2.6.9-rc2-mm1 0/2] mm: memory policy for page cache allocation
    ... >>allocation of page cache pages on NUMA machines. ... and for other workloads you can specify the default policy ... Since there is (with this patch) a separate policy to control ...
    (Linux-Kernel)