Re: [PATCH] drivers/base: export gpl (un)register_memory_notifier




Dave Hansen <haveblue@xxxxxxxxxx> wrote on 14.02.2008 18:12:43:

On Thu, 2008-02-14 at 09:46 +0100, Christoph Raisch wrote:
Dave Hansen <haveblue@xxxxxxxxxx> wrote on 13.02.2008 18:05:00:
On Wed, 2008-02-13 at 16:17 +0100, Jan-Bernd Themann wrote:
Constraints imposed by HW / FW:
- eHEA has its own MMU
- eHEA Memory Regions (MRs) are used by the eHEA MMU to translate virtual
  addresses to absolute addresses (like DMA mapped memory on a PCI bus)
- The number of MRs is limited (not enough to have one MR per packet)

Are there enough to have one per 16MB section?

Unfortunately this won't work. This was one of our first ideas we tossed out,
but the number of MRs will not be sufficient.

Can you give a ballpark of how many there are to work with? 10? 100? 1000?

It depends on the HMC configuration, but in the worst case the upper limit is
in the two-digit range.

But, I'm really not convinced that you can actually keep this map
yourselves. It's not as simple as you think. What happens if you get
on an LPAR with two sections, one 256MB@0x0 and another
16MB@0x1000000000000000? That's quite possible. I think your vmalloc'd
array will eat all of memory.

I'm glad you mention this part. There are many algorithms out there to
handle this problem: hashes, trees, ... All of these trade speed for a
smaller memory footprint.
We based the table decision on the existing implementations of the
architecture.
Do you see such a case coming along for the next generation POWER systems?

Dude. It exists *TODAY*. Go take a machine, add tens of gigabytes of
memory to it. Then, remove all of the sections of memory in the middle.
You'll be left with a very sparse memory configuration that we *DO*
handle today in the core VM. We handle it quite well, actually.

The hypervisor does not shrink memory from the top down. It pulls
things out of the middle and shuffles things around. In fact, a NUMA
node's memory isn't even contiguous.

Your code will OOM the machine in this case. I consider the ehea driver
buggy in this regard.
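
To put rough numbers on the sparse layout from the example above (256MB@0x0
plus 16MB@0x1000000000000000), here is a minimal userspace sketch. It is not
ehea code: the helper names are hypothetical, and the 16MB section size is
taken from the question earlier in the thread. It compares how many entries a
flat table indexed by section number needs with how many sections actually
hold memory.

```c
#include <stdint.h>

/* Assumption: 16MB sections, as discussed earlier in the thread. */
#define SECTION_SHIFT 24
#define SECTION_SIZE  (1ULL << SECTION_SHIFT)

/*
 * Entries needed by a flat table indexed by section number when the
 * highest populated block starts at highest_start and is size bytes long.
 * The table has to reach from section 0 to the last populated section.
 */
static uint64_t flat_table_entries(uint64_t highest_start, uint64_t size)
{
    uint64_t end = highest_start + size;    /* one past the last byte */

    return (end + SECTION_SIZE - 1) >> SECTION_SHIFT;
}

/* Number of sections that actually contain memory. */
static uint64_t populated_sections(const uint64_t *sizes, int n)
{
    uint64_t total = 0;

    for (int i = 0; i < n; i++)
        total += (sizes[i] + SECTION_SIZE - 1) >> SECTION_SHIFT;
    return total;
}
```

For the two blocks in the example, flat_table_entries(0x1000000000000000,
16MB) gives 2^36 + 1 entries, while populated_sections() over {256MB, 16MB}
gives 17. Assuming 8 bytes per entry, the flat table would need roughly
512GB to describe 272MB of memory.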

Your comment indicates that the upper limit for memory to be set on HMC
does not influence the upper limit of the partition physical address space.
So our base assumption we discussed internally is wrong here.
(conclusion see below)

I would guess these drastic changes would also require changes in base kernel.

No, we actually solved those a couple years ago.

Will you provide a generic mapping system with a contiguous virtual address
space like the ehea_bmap we can query? This would need to be a "stable" part
of the implementation, including translation functions from kernel to
nextgen_ehea_generic_bmap like virt_to_abs.

Yes, that's a real possibility, especially if some other users for it
come forward. We could definitely add something like that to the
generic code. But, you'll have to be convincing that what we have now
is insufficient.

Does this requirement:
"- MRs cover a contiguous virtual memory block (no holes)"
come from the hardware?

yes

Is that *EACH* MR? Or all MRs?

each

Where does EHEA_BUSMAP_START come from? Is that defined in the
hardware? Have you checked to ensure that no other users might want a
chunk of memory in that area?

EHEA_BUSMAP_START is a value which has to match between the wqe
virtual addresses and the MR used in them.
Fortunately there's a simple answer on that one. Each MR has its own address
space, so there's no need to check.
A HEA MR actually has exactly the same attributes as an InfiniBand MR with
this hardware.
Send/receive processing is pretty much comparable to an InfiniBand UD queue.

Can you query the existing MRs?

no

Not change them in place, but can you query their contents?

no

That's why we have SPARSEMEM_EXTREME and SPARSEMEM_VMEMMAP implemented
in the core, so that we can deal with these kinds of problems, once and
*NOT* in every single little driver out there.

Functions to use while building ehea_bmap + MRs:
- Use either the functions that are used by the memory hotplug system as
  well, that means using the section defines + functions
  (section_nr_to_pfn, pfn_valid)

Basically, you can't use anything related to sections outside of the
core code. You can use things like pfn_valid(), or you can create new
interfaces that are properly abstracted.

We picked sections instead of PFNs because this keeps the ehea_bmap in a
reasonable range on the existing systems.
But if you provide an abstract method handling exactly the problem we mention,
we'll be happy to use that and dump our private implementation.

One thing you can guarantee today is that things are contiguous up to
MAX_ORDER_NR_PAGES. That's a symbol that is unlikely to change and is
much more appropriate than using sparsemem. We could also give you a
nice new #define like MINIMUM_CONTIGUOUS_PAGES or something. I think
that's what you really want.
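
A rough userspace sketch of what walking memory in such chunks could look
like, assuming each aligned chunk is either fully present or fully absent.
Everything here is a stand-in: CHUNK_PAGES plays the role of
MAX_ORDER_NR_PAGES (the value 1024 is made up), pfn_present() plays the role
of pfn_valid(), and struct pfn_range is a hypothetical description of the
populated memory.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for MAX_ORDER_NR_PAGES; the value is an assumption. */
#define CHUNK_PAGES 1024UL

/* Hypothetical description of one populated range: [start_pfn, end_pfn). */
struct pfn_range {
    unsigned long start_pfn;
    unsigned long end_pfn;
};

/* Stand-in for pfn_valid(): is this pfn inside any populated range? */
static bool pfn_present(const struct pfn_range *mem, size_t n,
                        unsigned long pfn)
{
    for (size_t i = 0; i < n; i++)
        if (pfn >= mem[i].start_pfn && pfn < mem[i].end_pfn)
            return true;
    return false;
}

/*
 * Count the populated chunks below max_pfn. Only the first pfn of each
 * chunk is probed, which relies on the contiguity guarantee above.
 */
static unsigned long count_present_chunks(const struct pfn_range *mem,
                                          size_t n, unsigned long max_pfn)
{
    unsigned long count = 0;

    for (unsigned long pfn = 0; pfn < max_pfn; pfn += CHUNK_PAGES)
        if (pfn_present(mem, n, pfn))
            count++;
    return count;
}
```

A driver-side map built this way would be sized by the number of populated
chunks rather than by the highest address, though a linear walk like this
one is only practical when max_pfn itself is modest.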

That's definitely the right direction.

From this mail thread I would conclude:
memory space can have holes, and drivers shouldn't make any assumptions about
when, where, and how.

A translation from kernel to ehea_bmap space should be fast and predictable
(ruling out hashes).
If a driver doesn't know anything else about the mapping structure,
the normal solution in the kernel for this type of problem is a multi-level
lookup table like pgd->pud->pmd->pte.
This doesn't sound like the right thing to implement in a device driver.
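
For illustration only, a minimal userspace sketch of such a multi-level
table, mapping sparse section numbers to dense, contiguous indices. All
names (struct bmap, bmap_add, bmap_lookup) are hypothetical and not taken
from the ehea driver. The four levels of 10 bits are an assumption chosen
so that 40-bit section numbers (64-bit addresses, 16MB sections) fit.

```c
#include <stdint.h>
#include <stdlib.h>

#define LEVELS       4
#define LEVEL_BITS   10
#define LEVEL_SIZE   (1u << LEVEL_BITS)
#define BMAP_INVALID UINT64_MAX

struct node;

/* Interior slots hold a child pointer; leaf slots hold index + 1 (0 = empty). */
union slot {
    struct node *child;
    uint64_t idx_plus_one;
};

struct node {
    union slot slot[LEVEL_SIZE];
};

struct bmap {
    struct node root;
    uint64_t next_idx;      /* number of sections mapped so far */
};

/* Slot number of a section at a given level (level 0 is the root). */
static unsigned slot_of(uint64_t section, int level)
{
    int shift = (LEVELS - 1 - level) * LEVEL_BITS;

    return (unsigned)((section >> shift) & (LEVEL_SIZE - 1));
}

/* Give a section the next free dense index; returns 0, or -1 on failure. */
static int bmap_add(struct bmap *b, uint64_t section)
{
    struct node *n = &b->root;

    if (section >> (LEVELS * LEVEL_BITS))
        return -1;                      /* section number out of range */
    for (int level = 0; level < LEVELS - 1; level++) {
        union slot *s = &n->slot[slot_of(section, level)];

        if (!s->child) {
            s->child = calloc(1, sizeof(struct node));
            if (!s->child)
                return -1;
        }
        n = s->child;
    }
    union slot *leaf = &n->slot[slot_of(section, LEVELS - 1)];
    if (!leaf->idx_plus_one)
        leaf->idx_plus_one = ++b->next_idx;
    return 0;
}

/* Translate a section number to its dense index, or BMAP_INVALID. */
static uint64_t bmap_lookup(const struct bmap *b, uint64_t section)
{
    const struct node *n = &b->root;

    if (section >> (LEVELS * LEVEL_BITS))
        return BMAP_INVALID;
    for (int level = 0; level < LEVELS - 1; level++) {
        n = n->slot[slot_of(section, level)].child;
        if (!n)
            return BMAP_INVALID;
    }
    uint64_t v = n->slot[slot_of(section, LEVELS - 1)].idx_plus_one;
    return v ? v - 1 : BMAP_INVALID;
}
```

A lookup costs a fixed four array accesses regardless of how sparse the
layout is, and nodes are only allocated along paths that lead to populated
sections. The struct bmap must start out zeroed (for example via calloc).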

We didn't see from the existing code that such a mapping to a contiguous
space already exists.
Maybe we've missed it.

If the mapping is less random, the translation gets much simpler.
MAX_ORDER_NR_PAGES helps here. Is there more like that?


Gruss / Regards
Christoph Raisch + Jan-Bernd Themann



