Re: [patch 1/2] x86_64 page fault NMI-safe

On Thu, Jul 15, 2010 at 07:51:55AM -0700, Linus Torvalds wrote:
On Thu, Jul 15, 2010 at 7:11 AM, Frederic Weisbecker <fweisbec@xxxxxxxxx> wrote:
On Wed, Jul 14, 2010 at 03:56:43PM -0700, Linus Torvalds wrote:
You can:

 - make sure that you only ever use _one_ single top-level entry for
all vmalloc issues, and can make sure that all processes are created
with that static entry filled in. This is optimal, but it just doesn't
work on all architectures (eg on 32-bit x86, it would limit the
vmalloc space to 4MB in non-PAE, whatever)

But then, even if we ensure that, don't we also need to fill the lower-level
entries for the new mapping?

Yes, but now they are all mapped by the one *shared* top-level entry.

Think about it.

[ Time passes ]

End result: if you can map the whole vmalloc area with a single
top-level entry that is shared by all processes, and can then just
fill in the lower levels when doing actual allocations, it means that
all processes will automatically get the entries added, and do not
need any fixups.

In other words, the page tables will be automatically correct and
filled in for everybody - without having to traverse any lists,
without any extra locking, and without any races. So this is efficient
and simple, and never needs any faulting to fill in page tables later

(Side note: "single top-level entry" could equally well be "multiple
preallocated entries covering the whole region": the important part is
not really the "single entry", but the "preallocated and filled into
every page directory from the start" part)

Right, I got it. Thanks for these explanations.

Also, why is this a worry for vmalloc but not for kmalloc? Don't we also
risk adding a new memory mapping for memory newly allocated with kmalloc?

No. The kmalloc space is all in the 1:1 kernel mapping, and is always
mapped. Even with PAGEALLOC_DEBUG, it's always mapped at the top
level, and even if a particular page is unmapped/remapped for
debugging, it is done so in the shared kernel page tables (which ends
up being the above trivial case - there is just a single set of page
directory entries that are shared by everybody).


 - at vmalloc time, when adding a new page directory entry, walk all
the tens of thousands of existing page tables under a lock that
guarantees that we don't add any new ones (ie it will lock out fork())
and add the required pgd entry to them.

 - or just take the fault and do the "fill the page tables" on demand.

Quite frankly, most of the time it's probably better to make that last
choice (unless your hardware makes it easy to make the first choice,
which is obviously simplest for everybody). It makes it _much_ cheaper
to do vmalloc. It also avoids that nasty latency issue. And it's just
simpler too, and has no interesting locking issues with how/when you
expose the page tables in fork() etc.

So the only downside is that you do end up taking a fault in the
(rare) case where you have a newly created task that didn't get an
even newer vmalloc entry.

But then how did the previous tasks get this new mapping? You said
we don't walk through every process's page tables for vmalloc.

We always add the mapping to the "init_mm" page tables when it is
created (just a single mm), and when fork creates a new page table, it
will always copy the kernel mapping parts from the old one. So the
_common_ case is that all normal mappings are already set up in page
tables, including newly created page tables.

The uncommon case is when there is a new page table created _and_ a
new vmalloc mapping, and the race that happens between those events.
When that new page table is then later used (and it can be _much_
later, of course: we're now talking process scheduling, so IO delays
etc are relevant), it won't necessarily have the page table entries
for vmalloc stuff that was created since the page tables were created.
So we fill _those_ in dynamically.

So the only new page tables that can race this way are the top-level ones,
right? Otherwise there would be no race, since the top-level entries are
shared, and updates inside the lower-level pages propagate naturally, if
I understood you correctly.

So, if only newly added top-level entries can trigger this lazy mapping
update, I wonder why I hit this fault every time with
my patches.

I allocated 8192 bytes per CPU on an x86-32 system that has only 2 GB.
I doubt there was a top-level page table update at that point with
such a small amount of memory being allocated. But it still faults once on

I have trouble visualizing the race and the problem here.

But vmalloc mappings should be reasonably rare, and the actual "fill
them in" cases are much rarer still (since we fill them in page
directory entries at a time: so even if you make a lot of vmalloc()
calls, we only _fill_ at most once per page directory entry, which is
usually a pretty big chunk). On 32-bit x86, for example, we'd fill
once every 4MB (or 2MB if PAE), and you cannot have a lot of vmalloc
mappings that large (since the VM space is limited).

So the cost of filling things in is basically zero, because it happens
so seldom. And by _allowing_ things to be done lazily, we avoid all
the locking costs, and all the costs of traversing every single
possible mm (and there can be many many thousands of those).


I would understand this race if we walked every process's page tables to
add the new mapping, but missed one newly forked task because we didn't
lock (or only used RCU).

.. and how do you keep track of which tasks you missed? And no, it's
not just the new tasks - you have old tasks that have their page
tables built up too, but need to be updated. They may never need the
mapping since they may be sleeping and never using the driver or data
structures that created it (in fact, that's a common case), so filling
them would be pointless. But if we don't do the lazy fill, we'd have
to fill them all, because WE DO NOT KNOW.


So the parts of the problem I don't understand are:

- why don't we have this problem with kmalloc() ?

Hopefully clarified.


- did I understand the race that makes the fault necessary correctly,
 ie: we walk the tasklist locklessly, add the new mapping if
 not present, but we might miss a recently forked task, and
 the fault will fix that up.

But the _fundamental_ issue is that we do not want to walk the
tasklist (or the mm_list) AT ALL. It's a f*cking waste of time. It's a
long list, and nobody cares. In many cases it won't be needed.

The lazy algorithm is _better_. It's way simpler (we take nested
faults all the time in the kernel, and it's a particularly _easy_ page
fault to handle, with no IO and no locking needed), and it does less
work. It really boils down to that.

Yeah, agreed. But I'm still confused about when exactly we need to fault
(doubts I have detailed in my question above).

So it's not the lazy page table fill that is the problem. Never has
been. We've been doing the lazy fill for a long time, and it was
simple and useful way back when.

The problem has always been NMI, and nothing else. NMI's are nasty,
and the x86 NMI blocking is insane and crazy.

Which is why I'm so adamant that this should be fixed in the NMI code,
and we should _not_ talk about trying to screw up other, totally
unrelated, code. The lazy fill really was never the problem.

Yeah agreed.

Thanks for your explanations!

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx