Re: [PATCH] reserve RAM below PHYSICAL_START

Hi Andi,

On Mon, Mar 03, 2008 at 01:17:46PM +0100, Andi Kleen wrote:
> Andrea Arcangeli <andrea@xxxxxxxxxxxx> writes:
>
> > Hello,
> >
> > this patch prevents Linux from using the RAM below
> > PHYSICAL_START.
> >
> > The "reserved RAM" can be mapped by virtualization software to
> > create a 1:1 mapping between guest physical (bus) address and host
> > physical (bus) address.
>
> Wouldn't it be easier if your virtualization software just marked
> that area reserved or unmapped in its e820 map?
>
> Or if you don't want that you can get the same result with mem=...
> arguments (e.g. commonly used by crash dumping)

Would all bootloaders and OSes be capable of booting with a virtualized
e820 map that marks everything below 256M as reserved (a host needs
at least 256M of RAM to avoid swapping if somebody tries to log in to
KDE)? How would real mode DMA run at all when the host is booted with
mem=256M? I didn't verify it in practice, but before starting this I
assumed that if it really works it would be mostly by luck... not
ideal for a virtualization solution that aims to be generic.

The only bits that won't be generic will be the page at address zero
and the two trampoline pages; besides those 3 pages, all other RAM
below 1M will be marked as available RAM in the virtualized e820
map. And hopefully nobody does DMA to those 3 pages marked reserved in
the virtualized e820 map (the two trampoline pages can be moved just
before phys address 640k with a fully orthogonal patch to greatly
decrease the risk of bootloader issues; I'm deferring that patch until
I have tested some bootloader/OS combinations with the ~0x6000 address).
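
To make the low-memory layout concrete, here is a rough Python sketch
of how such a virtualized e820 map below 640k could be assembled. The
4k page size, the 640k RAM limit, the exact trampoline addresses
0x6000/0x7000 and the helper name build_low_e820 are all assumptions
for illustration, not what the patch or any hypervisor actually does.

```python
PAGE_SIZE = 0x1000        # assumption: 4k pages
LOW_RAM_END = 0xA0000     # assumption: conventional RAM ends at 640k


def build_low_e820(reserved_pages, ram_end=LOW_RAM_END):
    """Return (start, end, type) entries covering [0, ram_end).

    reserved_pages is an iterable of page-aligned addresses. Each one
    becomes a one-page "reserved" range, everything else is "usable".
    Adjacent pages of the same type are merged into a single entry.
    """
    reserved = set(reserved_pages)
    entries = []
    for addr in range(0, ram_end, PAGE_SIZE):
        kind = "reserved" if addr in reserved else "usable"
        if entries and entries[-1][2] == kind:
            # Same type as the previous page: extend that entry.
            start = entries[-1][0]
            entries[-1] = (start, addr + PAGE_SIZE, kind)
        else:
            entries.append((addr, addr + PAGE_SIZE, kind))
    return entries


# Page zero plus two trampoline pages at the (assumed) ~0x6000 address.
LOW_MAP = build_low_e820([0x0, 0x6000, 0x7000])

for start, end, kind in LOW_MAP:
    print("%08x-%08x %s" % (start, end - 1, kind))
```

With these inputs only 3 pages end up reserved and the rest of the
range stays usable, which is the property that matters for real mode
DMA in the guest.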

> Even if that was all not possible for some reason, having a CONFIG
> for this would seem unfortunate to me -- I don't think users really
> want specially compiled kernels for specific hypervisors. With
> paravirt Linux is trying to get away from that. Some runtime setup
> method would be much better.

You're right, but the relocatable kernel only works if you relocate it
at very low addresses (see MODULES_VADDR/KERNEL_IMAGE_SIZE). I fixed
that for the compile-time approach I have taken, but fixing that for
the relocatable kernel, so the kernel can relocate itself to address
900M physical before jumping to long mode, requires many more changes,
including moving all memparse/strtoul/vsprintf to arch/x86/boot to
compile it in 32bit so the kernel command line can be parsed in 32bit
non-paging mode to extract the relocation address, before jumping to
paging long mode.
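
For readers who don't have memparse in mind, the parsing step amounts
to roughly the following Python sketch. It only loosely models the
kernel's memparse() (decimal or 0x-prefixed hex plus a K/M/G suffix,
no octal), and the option name "reloc_phys=" is purely hypothetical:
no such parameter exists in the patch or in mainline.

```python
import re

_SHIFT = {"k": 10, "m": 20, "g": 30}
_NUMBER = re.compile(r"(0[xX][0-9a-fA-F]+|[0-9]+)([kKmMgG]?)")


def memparse(text):
    """Parse a size such as "900M" or "0x1000".

    Returns (value, rest), where rest is the unparsed remainder of
    the string. Returns (0, text) if text does not start with a number.
    """
    m = _NUMBER.match(text)
    if m is None:
        return 0, text
    number, suffix = m.group(1), m.group(2)
    base = 16 if number[:2].lower() == "0x" else 10
    value = int(number, base)
    if suffix:
        value <<= _SHIFT[suffix.lower()]
    return value, text[m.end():]


def find_reloc_address(cmdline, option="reloc_phys="):
    """Return the address given by the (hypothetical) option, or None."""
    for word in cmdline.split():
        if word.startswith(option):
            value, _rest = memparse(word[len(option):])
            return value
    return None


print(hex(find_reloc_address("ro root=/dev/sda1 reloc_phys=900M quiet")))
```

The point is not that this is hard in itself, but that the equivalent
C code would have to be built for and run in the 32bit non-paging
environment.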

My compile time approach doesn't slow down the kernel module
allocation, and it remains a small and relatively simple change to the
e820 map code. Hopefully KVM pci-passthrough without VT-d is done in
standard setups so the compile time approach will not be a big
limitation. So from a mainline kernel point of view, given this is
only needed in the short term because currently sold CPUs lack VT-d,
the smaller the change to allow pci-passthrough, the better. The
relocatable approach would be a much bigger change. Also note this
only works up to an address near 1G; we can't reserve more than 1G with
this (extending over 1G requires even more changes). But an 800-900M
guest with pci-passthrough is surely enough right now (extending this to
2G is very easy with an incremental patch, extending over 2G is not
easy).

And if you're right and we'll later find everybody needs
pci-passthrough on every new system without recompiling the host
kernel, we can always switch to a relocatable kernel without changing
the userland API at all (/proc/iomem will show "reserved RAM" and
"reserved RAM failed" the same way as today, kvm userland won't notice
the difference). So I wouldn't worry so much about this being a
compile time thing to start with, given this avoids polluting the
kernel for a short-term matter.
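
As an illustration of what userland would depend on, here is a minimal
Python sketch that picks the "reserved RAM" and "reserved RAM failed"
entries out of /proc/iomem style text. The helper name and the SAMPLE
contents are made up for the example (they are not output from a real
machine); only the two labels come from the patch.

```python
import re

# One /proc/iomem line: optional indentation, "start-end : label" in hex.
_LINE = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")
_LABELS = ("reserved RAM", "reserved RAM failed")


def reserved_ram_regions(iomem_text):
    """Return (start, end, label) for each "reserved RAM*" line.

    The end address is inclusive, as printed in /proc/iomem.
    """
    regions = []
    for line in iomem_text.splitlines():
        m = _LINE.match(line)
        if m is None:
            continue
        label = m.group(3).strip()
        if label in _LABELS:
            regions.append((int(m.group(1), 16), int(m.group(2), 16), label))
    return regions


# Invented sample input, only to show the shape of the data.
SAMPLE = """\
00000000-00000fff : reserved
00001000-0fffffff : reserved RAM
10000000-3fffffff : System RAM
  10200000-104fffff : Kernel code
"""

for start, end, label in reserved_ram_regions(SAMPLE):
    print("%08x-%08x : %s" % (start, end, label))
```

Since the tool only matches on the labels, it would not care whether
the region was reserved at compile time or by a relocatable kernel at
boot.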

In fact the only thing I'd worry about _right_now_ is the fact there's
no API in /proc/iomem to mark "reserved RAM" regions as
"busy". However given you also need to be root to map from /dev/mem I
don't think it's a big deal.

Thanks for the comments.