Re: Allocating kernel memory

From: George Nelson (
Date: 05/15/04

Date: 14 May 2004 15:30:28 -0700

Kasper Dupont <> wrote in message news:<>...

> The evidence you have provided so far indicates a
> user mode implementation would be the best choice.
> Until you provide some more information, I will
> have to assume that is the case.

I can assure you this is not an option.

> The kernel has a disk cache, which is not restricted
> by the address space limitation. On a machine with 8GB
> of physical RAM more than 7GB can be used by the disk
> cache.
> Your cache should use the same principles to use the
> desired amount of physical RAM for cache.

Thanks - I'll look into this but does it still apply to systems with
less than 8GB? My development system has 2GB, most of which I'd like to
be able to use as a cache.

> > The fact that I
> > was unaware that there is an (artificial) limit on the amount of
> > memory available within the kernel does not imply I do not understand
> > the system, rather it shows that I did not see the need to know the
> > design details of a kernel system in order to develop my code.
> It clearly shows you did not understand all the kernel
> design details relevant to your code.

Having worked on a variety of OS's at kernel level, with a theoretical
and practical background in OS's, I have been led by experience to
assume that the kernel can allocate as much memory as it likes, so I did
not consider this a relevant design detail - obviously I was wrong.

> > So you are saying that the kernel does not map all of physical memory
> > into its address space!
> Yes. The kernel did map all physical memory back in the
> days when physical memory was small enough to fit in
> the address space. But memory has become larger, and
> the address space remained unchanged.
> > Without looking at the code details, I find
> > this hard to believe.

I understand that at any point in time only 4GB of memory can be
addressed. That is not the same as being mapped. The kernel must map
all of memory in order to manage it, notwithstanding that it must deal
with it in 4GB chunks. My bad choice of words this time.

> I'm not 100% sure what function it is you need to
> use here. But I think it is get_free_pages. Clearly
> kmalloc is not an option as it allocates only from
> the memory permanently mapped in kernel address
> space.
> get_free_pages takes a flags argument, some of the
> flags will tell which zone to allocate from. On x86
> Linux the normal sizes of the zones are:

I do in fact use get_free_pages to allocate I/O buffers. I need to
look closer at the allocation flags and how they are used.
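
To put numbers on the zone question for my 2GB box, here is a rough
user-space sketch. The zone list was cut from the quote above, so the
boundaries used here (16MB for DMA, 896MB for the top of the permanently
mapped region) are an assumption: they are the commonly cited defaults
for 32-bit x86 Linux, not something taken from this thread. The names
split_zones, DMA_LIMIT_MB and LOWMEM_LIMIT_MB are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed default boundaries on 32-bit x86 Linux (not from this thread). */
#define DMA_LIMIT_MB    16u   /* assumed top of the DMA zone */
#define LOWMEM_LIMIT_MB 896u  /* assumed top of permanently mapped memory */

struct zone_sizes_mb {
    uint32_t dma;      /* below DMA_LIMIT_MB */
    uint32_t normal;   /* DMA_LIMIT_MB up to LOWMEM_LIMIT_MB */
    uint32_t highmem;  /* everything above LOWMEM_LIMIT_MB */
};

/* Split a machine's physical RAM (given in MB) across the three zones. */
static struct zone_sizes_mb split_zones(uint32_t ram_mb)
{
    struct zone_sizes_mb z = {0, 0, 0};
    uint32_t low = ram_mb < LOWMEM_LIMIT_MB ? ram_mb : LOWMEM_LIMIT_MB;

    z.dma = low < DMA_LIMIT_MB ? low : DMA_LIMIT_MB;
    z.normal = low - z.dma;
    z.highmem = ram_mb - low;

    /* Sanity check: every megabyte lands in exactly one zone. */
    assert(z.dma + z.normal + z.highmem == ram_mb);
    return z;
}
```

Under those assumed boundaries, split_zones(2048) gives 16MB, 880MB and
1152MB, so more than half of a 2GB machine would sit outside the
permanently mapped region, and split_zones(8192) leaves 7296MB in high
memory.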

> >
> > I may not understand the architecture in detail but I do know that
> > each process can have a 4GB virtual address space (I checked this in
> > the Intel architecture manual).
> Yes, but those 4GB are split into user and kernel
> space. (The kernel space is shared between all
> processes).
> > According to the book The Linux Kernel
> > (admittedly covering 2.2 kernel and the only 2.4 changes mentioned is
> > support for the PAE to handle up to 64GB of physical memory), 3GB of
> > the user address space is accessible by user processes or the kernel
> > while the remaining 1GB is accessible to the kernel only.

> Those 3GB being talked about is the user space.
> The 1GB is the kernel space. AFAIR high memory
> support was first in mainstream kernels starting
> with 2.4. In 2.2 you could not use more physical
> memory than you could fit into kernel address
> space. (But there were bigmem patches for 2.2)

Yes, I appreciate this, but that 3GB, as I understand it, is also part
of the kernel address space. With this split, Linux is not very
scalable (640MB of address space is required to map a 64GB memory
system - ooops, there goes most of the kernel address space). In
reality no user process needs its code area mapped into the kernel, and
few processes require large parts of their data area mapped into the
kernel (unless using memory-mapped I/O). It seems to me that a better
compromise would have been to give the kernel a larger address space
and take the performance hit needed to map one or two user pages to
transfer data in/out of the kernel.
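
For anyone who wants to check the 640MB figure, a back-of-the-envelope
sketch follows. It assumes 4KB pages and about 40 bytes of kernel
bookkeeping per physical page. The 40-byte value is my assumption,
chosen because it reproduces the 640MB number; the real per-page cost
depends on the kernel version and configuration. The function name
bookkeeping_mb is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Megabytes of per-page bookkeeping needed to describe ram_bytes of
 * physical memory, given the page size and the assumed descriptor size.
 */
static uint64_t bookkeeping_mb(uint64_t ram_bytes,
                               uint64_t page_bytes,
                               uint64_t desc_bytes)
{
    assert(page_bytes > 0);                 /* avoid dividing by zero */
    uint64_t pages = ram_bytes / page_bytes;
    return pages * desc_bytes / (1024ull * 1024ull);
}
```

With 64GB of RAM, 4096-byte pages and the assumed 40-byte descriptor,
this comes to 16777216 pages and 640MB, which is well over half of a
1GB kernel address space. The same assumptions give only 20MB for a 2GB
machine.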

> > There may be
> > some performance benefits from the scheme that is used in Linux
> > although these escape me for the moment.
> If you want to use the same linear addresses for
> both user and kernel space, you will need to
> replace the page tables when switching between
> user mode and kernel mode. Switching page tables
> is expensive (Linux does all it can to avoid
> those switches whenever they are not strictly
> necessary). And kernel code not having access to
> user space will make copying of data between
> user and kernel space more expensive.

Other than flushing of the TLB's I do not see what is expensive in
switching page tables. Surely all that is needed is to point CR3 at
the new page table?
The TLB flush problem would be solved with separate user mode and
kernel mode TLB's. This can be achieved by mapping kernel address
space using 4MB pages and user space with 4KB pages. I think this
would also provide other benefits within the kernel (not least of
which is that less real memory would be needed for the kernel page
tables).
When access to user space is needed the appropriate user page or pages
could be mapped into a window in the kernel address space reserved for
this purpose.
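
To illustrate the page table saving, here is a small sketch of the
arithmetic. It assumes classic non-PAE 32-bit x86 paging: 4-byte
entries, 1024 entries per table, so one 4KB page table maps 4MB, while
a 4MB page is described by a single page directory entry with no second
level table at all. With PAE enabled the entry size and large page size
differ, so these numbers would not apply. The function names are made
up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_BYTES         4u     /* entry size, non-PAE x86 (assumed mode) */
#define ENTRIES_PER_TABLE 1024u  /* entries in one page table */
#define PAGES_PER_MB      256u   /* 4KB pages in one megabyte */

/* Bytes of second-level page tables needed to map region_mb with 4KB pages. */
static uint32_t page_table_bytes_4k(uint32_t region_mb)
{
    assert(region_mb <= 4096u);  /* cannot exceed a 32-bit address space */
    uint32_t pages = region_mb * PAGES_PER_MB;
    uint32_t tables = (pages + ENTRIES_PER_TABLE - 1u) / ENTRIES_PER_TABLE;
    return tables * ENTRIES_PER_TABLE * PTE_BYTES;
}

/* Page directory entries used to map region_mb with 4MB pages. */
static uint32_t pde_count_4m(uint32_t region_mb)
{
    assert(region_mb <= 4096u);
    return (region_mb + 3u) / 4u;
}
```

For a 1GB kernel region, page_table_bytes_4k(1024) is 1048576 bytes
(256 page tables), whereas pde_count_4m(1024) is 256 entries in a page
directory that has to exist anyway.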

If performance is the sole design criterion for kernel development then
why not turn off paging altogether? Oh but wait, paging is there to
resolve a memory fragmentation problem and the performance hit is
acceptable as a result. I think the same is true here - the limit on
the kernel address space has a serious impact on its future
scalability so use the features of the architecture to resolve that
problem, accept the performance hit but try to minimise it by all
means. The existing solution sucks and only helps when any user
process is transferring data in/out of the kernel - as soon as a task
switch takes place the TLB's are flushed anyway. As I said above,
other solutions are available which minimise the performance hit but
do not restrict the kernel unduly.

> Tell me about it. I recently had to replace a motherboard.
> But since it was impossible to find a new motherboard that
> would fit with my case, power supply, CPU, RAM, GFX board,
> all of those had to be replaced as well.

Been there, done that, got the T-shirt - lol.

> The FS and GS registers were introduced already with the
> 386. The only new things happening since then was support
> for 4MB pages and the addition of PAE. And now AMD64 has
> eliminated all the restrictions we have been discussing.
> I haven't used any AMD64 system yet, but AFAIK it should
> give old 32 bit programs a full 4GB of user space while
> the kernel can get even more (assuming the kernel code is
> 64 bit). And it doesn't have the performance problems of
> PAE.

Guess I'll have to start looking at the AMD architecture manual - my
own system has an AMD CPU, but unfortunately I am stuck with Intel for
this project (at least for the time being).