Re: [PATCH 08 of 11] anon-vma-rwsem



We are pursuing Linus' suggestion currently. This discussion is
completely unrelated to that work.

On Thu, May 15, 2008 at 09:57:47AM +0200, Nick Piggin wrote:
> I'm not sure if you're thinking about what I'm thinking of. With the
> scheme I'm imagining, all you will need is some way to raise an IPI-like
> interrupt on the target domain. The IPI target will have a driver to
> handle the interrupt, which will determine the mm and virtual addresses
> which are to be invalidated, and will then tear down those page tables
> and issue hardware TLB flushes within its domain. On the Linux side,
> I don't see why this can't be done.

We would need to deposit the payload into a central location to do the
invalidate, correct? That central location would either need to be
indexed by physical cpuid (65536 possible currently, UV will push that
up much higher) or some sort of global id which is difficult because
remote partitions can reboot giving you a different view of the machine
and running partitions would need to be updated. Alternatively, that
central location would need to be protected by a global lock or atomic
type operation, but a majority of the machine does not have coherent
access to other partitions so they would need to use uncached operations.
Essentially, take away from this paragraph that it is going to be really
slow or really large.
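
To put a rough number on the "really large" option, here is a minimal
sketch in C of a table with one slot per possible sending cpuid. The
slot layout (struct inval_slot and its fields) and the helper name are
assumptions made up for illustration, not anything in xpmem; only the
65536 figure comes from the paragraph above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* From the text: 65536 physical cpuids are possible today. */
#define MAX_PHYS_CPUID 65536u

/* Hypothetical payload: one pending invalidate per sending CPU. */
struct inval_slot {
    uint64_t mm_id;  /* hypothetical handle naming the target mm */
    uint64_t start;  /* first virtual address to invalidate */
    uint64_t end;    /* one past the last virtual address */
    uint64_t done;   /* completion word polled by the sender */
};

/* Bytes needed for a table with one slot per possible sender cpuid. */
static size_t inval_table_bytes(size_t nr_cpuids)
{
    assert(nr_cpuids > 0);
    return nr_cpuids * sizeof(struct inval_slot);
}
```

With this assumed 32-byte slot, inval_table_bytes(MAX_PHYS_CPUID) comes
to 2 MiB of uncached memory on each partition that can receive
invalidates, and it grows linearly as the cpuid space grows.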

Then we need to deposit the information needed to do the invalidate.

Lastly, we would need to interrupt. Unfortunately, here we have a
thundering herd. There could be up to 16256 processors interrupting the
same processor. That will be a lot of work. It will need to look up the
mm (without grabbing any sleeping locks in either xpmem or the kernel)
and do the tlb invalidates.

Unfortunately, the sending side is not free to continue (in most cases)
until it knows that the invalidate is completed. So it will need to spin
waiting for a completion signal, which could be as simple as an uncached
word. But how will it handle the possible failure of the other partition?
How will it detect that failure and recover? A timeout value could be
difficult to gauge because the other side may be off doing a considerable
amount of work and may just be backed up.
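
The sender-side wait described above might look roughly like the
following sketch. It is plain C11 with an atomic word standing in for
the uncached completion word; the function name and the spin budget are
assumptions for illustration only. Note that a false return cannot
distinguish a dead partition from one that is merely backed up, which
is exactly the problem with picking a timeout.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Spin until the receiver stores a nonzero value in *done, or until the
 * (hypothetical) spin budget runs out. Returns true on completion.
 */
static bool wait_for_completion_word(_Atomic uint32_t *done, uint64_t max_spins)
{
    assert(done != 0);
    for (uint64_t i = 0; i < max_spins; i++) {
        if (atomic_load_explicit(done, memory_order_acquire) != 0)
            return true;
    }
    /* Budget exhausted: the remote side may be dead or merely slow. */
    return false;
}
```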

> Sure, you obviously would need to rework your code because it's been
> written with the assumption that it can sleep.

It is an assumption based upon some of the kernel functions we call
doing things like grabbing mutexes or rw_sems. That pushes back to us.
I think the kernel's locking is perfectly reasonable. The problem we run
into is we are trying to get from one context in one kernel to a different
context in another and the in-between piece needs to be sleepable.

> What is XPMEM exactly anyway? I'd assumed it is a Linux driver.

XPMEM allows one process to make a portion of its virtual address range
directly addressable by another process with the appropriate access.
The other process can be on other partitions. As long as Numa-link
allows access to the memory, we can make it available. Userland has an
advantage in that the kernel entrance/exit code contains memory errors
so we can contain hardware failures (in most cases) to only needing to
terminate a user program and not lose the partition. The kernel enjoys
no such fault containment, so it cannot safely reference that memory
directly.


Thanks,
Robin


