Re: Allocating kernel memory
From: Kasper Dupont (kasperd_at_daimi.au.dk)
Date: 05/15/04
- Next message: Kasper Dupont: "Re: address space problem on dynamically linked libs"
- Previous message: Rolf Magnus: "Re: address space problem on dynamically linked libs"
- In reply to: George Nelson: "Re: Allocating kernel memory"
- Next in thread: George Nelson: "Re: Allocating kernel memory"
- Reply: George Nelson: "Re: Allocating kernel memory"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Sat, 15 May 2004 16:04:27 +0200
George Nelson wrote:
>
> Kasper Dupont <kasperd@daimi.au.dk> wrote in message news:<40A0F958.186250C5@daimi.au.dk>...
>
> > The evidence you have provided so far indicates a
> > user mode implementation would be the best choice.
> > Until you provide some more information, I will
> > have to assume that is the case.
> >
>
> I can assure you this is not an option.
I don't see any reason it shouldn't be.
>
> > The kernel has a disk cache, which is not restricted
> > by the address space limitation. On a machine with 8GB
> > of physical RAM more than 7GB can be used by the disk
> > cache.
> >
> > Your cache should use the same principles to use the
> > desired amount of physical RAM for cache.
> >
>
> Thanks - I'll look into this but does it still apply to systems with
> less than 8GB? My development system has 2GB most of which I'd like to
> be able to use as a cache.
AFAIK there is no difference in the API no matter how
much RAM you have. The same principle should apply to
any amount of RAM as long as it is larger than the
directly addressable size. Actually the API for accessing
high memory should even work for low memory, it just
means a few of the calls are basically no-ops. There is
a difference in the page table format between systems
with support for more than 4GB and systems without
support for more than 4GB. But I don't see any place
the API exposes that difference.
>
> Having worked on a variety of OS's at kernel level with a theoretical
> and practical background in OS's, experience has led me to assume that
> the kernel can allocate as much memory as it likes so I did not
> consider this as a relevant design detail - obviously I was wrong.
Allocating the memory is easy, using it is more
difficult, because it needs to be mapped before you
can address it. In some configurations it isn't
possible to map all memory at the same time.
>
> > > So you are saying that the kernel does not map all of physical memory
> > > into its address space!
> >
> > Yes. The kernel did map all physical memory back in the
> > days when physical memory was small enough to fit in
> > the address space. But memory has become larger, and
> > the address space remained unchanged.
> >
> > > Without looking at the code details, I find
> > > this hard to believe.
> >
>
> I understand that at any point in time only 4GB of memory can be
> addressed. That is not the same as being mapped.
Actually it is. You have to map the memory before it
can be addressed.
> The kernel must map
> all of memory in order to manage it
Actually not.
> notwithstanding that it must deal
> with it in 4GB chunks.
Being able to address 4GB at one time doesn't mean
it has to be done in 4GB chunks. Rather you can map
a million 4KB chunks from all over the physical RAM.
> This time my bad choice of words.
Indeed.
>
> > I'm not 100% sure what function it is you need to
> > use here. But I think it is get_free_pages. Clearly
> > kmalloc is not an option as it allocates only from
> > the memory permanently mapped in kernel address
> > space.
> >
> > get_free_pages takes a flags argument, some of the
> > flags will tell which zone to allocate from. On x86
> > Linux the normal sizes of the zones are:
> >
>
> I do in fact use get_free_pages to allocate I/O buffers. I need to
> look closer at the allocation flags and how they are used.
Actually I think you need to call alloc_pages directly
instead of get_free_pages. Let's take a look at how
get_free_pages is implemented:
unsigned long __get_free_pages(unsigned int gfp_mask, unsigned int order)
{
        struct page * page;

        page = alloc_pages(gfp_mask, order);
        if (!page)
                return 0;
        return (unsigned long) page_address(page);
}
But if you give flags that would result in a high
memory page, it will not have an address. So I guess
get_free_pages would return 0 in that case. In other
words it would leak a high memory page and appear
not to have allocated any. The comment on
page_address also says that it must never be called
on a highmem page.
/*
 * Permanent address of a page. Obviously must never be
 * called on a highmem page.
 */
#if defined(CONFIG_HIGHMEM) || defined(WANT_PAGE_VIRTUAL)
#define page_address(page) ((page)->virtual)
But you should also notice that the arguments for
alloc_pages are exactly the same as those for
get_free_pages. The difference is the return value.
alloc_pages will return a pointer to the page struct
for the first allocated page. The page structs are
in an array, so finding the other pages shouldn't be
any problem. If you allocate multiple pages this way,
they will be physically contiguous. But don't expect
to be able to map them to contiguous virtual addresses.
I also found one example showing how to map the page
into kernel address space:
int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size)
{
        char *kaddr;
        unsigned long left, count = desc->count;

        if (size > count)
                size = count;
        kaddr = kmap(page);
        left = __copy_to_user(desc->buf, kaddr + offset, size);
        kunmap(page);
Assuming the above code is correct, we see that
kmap cannot fail. (Obviously that means kmap may
sleep, and you must keep that in mind.) Not many
pages can be mapped into the address space at a
time. I don't know exactly how many, but I think a
few thousand. So each process shouldn't try to map
many at a time, and should only keep them mapped
for a short time.
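The map/use/unmap discipline described above can be tried
outside the kernel with a toy model. This is only a sketch:
every toy_* name, the slot count and the page count are made
up for illustration, and none of it is the kernel's own
implementation. The one deliberate difference is that
toy_kmap returns NULL when all slots are busy, where the
real kmap would sleep.

```c
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_NR_PAGES  8   /* "physical" pages in the toy machine */
#define TOY_NR_SLOTS  2   /* mapping slots, far fewer than pages */

/* Hypothetical stand-in for struct page: one descriptor per page. */
struct toy_page {
    unsigned char *vaddr;     /* NULL while the page is not mapped */
    unsigned int map_count;   /* number of outstanding toy_kmap calls */
};

static unsigned char toy_ram[TOY_NR_PAGES][TOY_PAGE_SIZE];
static struct toy_page toy_mem_map[TOY_NR_PAGES];
static struct toy_page *toy_slots[TOY_NR_SLOTS];

/* Map a page into a free slot and return its address.
 * Returns NULL when every slot is busy (assumption of this
 * model; the real kmap would sleep instead of failing). */
static unsigned char *toy_kmap(struct toy_page *page)
{
    if (page->map_count) {            /* already mapped: share it */
        page->map_count++;
        return page->vaddr;
    }
    for (size_t i = 0; i < TOY_NR_SLOTS; i++) {
        if (!toy_slots[i]) {
            toy_slots[i] = page;
            page->vaddr = toy_ram[page - toy_mem_map];
            page->map_count = 1;
            return page->vaddr;
        }
    }
    return NULL;
}

/* Drop one mapping reference; free the slot on the last one. */
static void toy_kunmap(struct toy_page *page)
{
    if (page->map_count == 0 || --page->map_count > 0)
        return;
    for (size_t i = 0; i < TOY_NR_SLOTS; i++) {
        if (toy_slots[i] == page) {
            toy_slots[i] = NULL;
            break;
        }
    }
    page->vaddr = NULL;
}

/* Copy data out of a page: map it, copy, unmap right away. */
static int toy_read_page(struct toy_page *page, size_t offset,
                         void *dst, size_t size)
{
    unsigned char *kaddr;

    if (offset > TOY_PAGE_SIZE || size > TOY_PAGE_SIZE - offset)
        return -1;
    kaddr = toy_kmap(page);
    if (!kaddr)
        return -1;
    memcpy(dst, kaddr + offset, size);
    toy_kunmap(page);
    return 0;
}
```

With two slots and eight pages, a third simultaneous
toy_kmap fails until one of the earlier mappings is released,
which is why each mapping should be held only briefly.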
>
> Yes I appreciate this but that 3GB, as I understand it, is also part
> of the kernel address space. With this split, Linux is not very
> scalable (640MB of address space is required to map a 64GB memory
> system - ooops and there goes most of the kernel address space).
That appears to be an improvement compared to the
result I got last time I did the calculations.
64GB of memory is 16777216 pages. Requiring a 68
byte page struct to represent each page means a
total of 1140850688 bytes. That is 1088MB.
If the page struct has been reduced to 40 bytes
the 640MB would be correct. I agree it is still
a lot compared to the 1GB of address space.
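The arithmetic can be checked with a few lines of C. This
is just a sketch of the calculation; the function name and
the PAGE_SIZE_BYTES constant are made up for the example,
and 4KB pages are assumed.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096ULL   /* assumed 4KB pages */

/* Bytes of page structs needed to describe ram_bytes of RAM
 * when each page struct takes page_struct_bytes. */
static uint64_t mem_map_bytes(uint64_t ram_bytes,
                              uint64_t page_struct_bytes)
{
    return (ram_bytes / PAGE_SIZE_BYTES) * page_struct_bytes;
}
```

For 64GB (64ULL << 30 bytes) this gives 1140850688 bytes,
or 1088MB, with a 68 byte page struct, and 671088640 bytes,
or 640MB, with a 40 byte page struct.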
But who is to blame? Using 1% of your RAM for
management data seems fair to me. And I know a
lot of people complain about the 3GB for user
space being too restrictive. And the cost of
changing page tables on each user/kernel switch
is too high.
So is Intel to blame for this problematic design?
Or do you think the kernel developers could have
come up with some better design within the
restrictions of the CPU design? That would have
to be either a smaller page struct, and I'm not
sure how realistic that is, or a way to place parts
of this management data in high memory, which
means a more complex design and probably a
performance hit.
> In
> reality no user process needs its code area mapped into the kernel and
> few processes require large parts of its data area mapped into the
> kernel (unless using memory-mapped I/O). It seems to me that a better
> compromise giving the kernel a larger address space and taking the
> performance hit needed to map one or two user pages to transfer data
> in/out of the kernel would have been a better approach.
It is not only the code to copy data between user
and kernel, that would be slowed down. Changing
page tables would slow down switching between user
and kernel, and that is needed very often.
Besides we don't have the kind of granularity you
suggest. You can't decide for each page if it is
going to be in kernel or user space, let alone
parts of a page.
>
> Other than flushing of the TLB's I do not see what is expensive in
> switching page tables. Surely all that is needed is to point CR3 at
> the new page table?
That sounds plausible.
> The TLB flush problem would be solved with separate user mode and
> kernel mode TLB's. This can be achieved by mapping kernel address
> space using 4MB pages and user space with 4KB pages.
Can you point out a reference explaining why this
should be possible? AFAIK the page size is
indicated by a bit in each entry of the page
directory pointed to by CR3, so you can have 4MB
pages and 4KB pages in the same address space.
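To make the point about mixing page sizes concrete, here is
a small sketch of how a 32 bit x86 page directory entry
(without PAE) is read. The helper names are made up; the
bit positions are the Present bit (bit 0) and the PS bit
(bit 7), and the PS bit only has this meaning when 4MB page
support is enabled in the CPU.

```c
#include <stdbool.h>
#include <stdint.h>

#define PDE_PRESENT   (1u << 0)  /* entry is valid */
#define PDE_PAGE_SIZE (1u << 7)  /* PS bit: entry maps a 4MB page */

/* Index into the page directory: top 10 bits of the address. */
static uint32_t pde_index(uint32_t vaddr)
{
    return vaddr >> 22;
}

/* True if this entry maps a 4MB page directly instead of
 * pointing to a page table of 4KB pages. */
static bool pde_is_4mb(uint32_t pde)
{
    return (pde & PDE_PRESENT) && (pde & PDE_PAGE_SIZE);
}

/* Offset inside the page the address falls in. The page size
 * is decided per directory entry, not per address space. */
static uint32_t page_offset(uint32_t pde, uint32_t vaddr)
{
    return pde_is_4mb(pde) ? (vaddr & 0x003FFFFFu)
                           : (vaddr & 0x00000FFFu);
}
```

Since the choice is made per directory entry, one entry can
map a 4MB page while its neighbour points to a table of 4KB
pages. With a 3GB user space, kernel addresses start at
0xC0000000, which is directory entry 768.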
> I think this
> would also provide other benefits within the kernel (not least of
> which less real memory would be needed for the kernel page tables).
AFAIK the kernel is already mapped with 4MB pages
(on architectures supporting it). Besides even
user space can sometimes be mapped with 4MB pages.
>
> If performance is the sole design criteria for kernel development then
> why not turn off paging altogether? Oh but wait, paging is there to
> resolve a memory fragmentation problem and the performance hit is
> acceptable as a result.
If you think paging is just a tool to solve the
fragmentation problem, you obviously don't know
enough about how a kernel works. Paging is also
about security and the possibility to implement
memory mapped files, virtual memory, shared memory,
CoW, etc.
The performance hit caused by paging shouldn't be
much. If the page is in the TLB the access should be
as fast as it would have been if paging was not
enabled. So TLB misses are the only cost.
If you consider all the costs involved in a kernel
without paging, it would probably end up being
more expensive.
1. fork would have to copy the entire address
space at once.
2. You can't swap out single pages. You will have
to swap out entire processes. And a process
will require more available physical memory
before it can be scheduled, so you would need
more swap than otherwise.
3. You can't do on demand loading. To start a
program every single page of the executable and
all libraries need to be loaded.
4. Memory mapped files would require some ugly and
inefficient hacks to recognize dirty sectors.
> I think the same is true here - the limit on
> the kernel address space has a serious impact on its future
> scalability
I haven't seen any documentation for that claim. Just
because you can't write code under those restrictions
doesn't mean there is a problem. It just means you
are not as good a coder as those who wrote the rest
of the system. If you can't live with the restrictions,
then don't write kernel code. There is a simple
solution to your problem. Write user mode code.
Besides, when the problem can be solved by switching to
a more appropriate architecture, it is not a scalability
problem of the kernel but a scalability problem
of the architecture. With PAE the difference between
total physical memory and addressable memory can be so
huge, that it can really become a problem. Had you said
PAE doesn't scale, I would have to agree with you.
BTW. I have even read about some mainframe systems where
it was decided not to make more than 4GB of RAM available
to the software, because they wanted to keep using the
well tested 32 bit software. Any additional RAM could
only be used for disk caching.
> so use the features of the architecture to resolve that
> problem, accept the performance hit but try to minimise it by all
> means.
Why should we accept all x86 Linux systems with 1GB or
more memory to be slowed down, just because you want a
little more kernel address space, which nobody else seems
to have any use for?
> The existing solution sucks and only helps when any user
> process is transferring data in/out of the kernel
That actually happens a lot. But no, you are not even
right about that. Every kernel/user switch would be
slowed down by a need to switch page tables. That means
you'd have to pay the price on every single exception,
system call, interrupt, and so on.
> - as soon as a task
> switch takes place the TLB's are flushed anyway.
Right, but task switches are rare. The kernel works
hard to make them as rare as possible.
>
> Guess I'll have to start looking at the AMD architecture manual - my
> own system has an AMD CPU; unfortunately I am stuck with Intel for this
> project (at least for the time being).
I will also buy an AMD64 system in a few months. It has
a lot of the features I don't want to live without.
--
Kasper Dupont -- who spends too much time on usenet.
For sending spam use abuse@mk.lir.dk and kasperd@mk.lir.dk
I'd rather be a hammer than a nail.