Re: Allocating kernel memory
From: George Nelson (grn_at_freeuk.com)
Date: 05/16/04
- Next message: Kasper Dupont: "Re: Allocating kernel memory"
- Previous message: Kasper Dupont: "Re: How to correctly detect a device"
- In reply to: Kasper Dupont: "Re: Allocating kernel memory"
- Next in thread: Kasper Dupont: "Re: Allocating kernel memory"
- Reply: Kasper Dupont: "Re: Allocating kernel memory"
Date: 15 May 2004 20:30:19 -0700
Kasper Dupont <kasperd@daimi.au.dk> wrote in message news:<40A6236B.4DCBF5EC@daimi.au.dk>...
> George Nelson wrote:
> >
> > Kasper Dupont <kasperd@daimi.au.dk> wrote in message news:<40A0F958.186250C5@daimi.au.dk>...
> >
> > > The evidence you have provided so far indicates a
> > > user mode implementation would be the best choice.
> > > Until you provide some more informations, I will
> > > have to assume that is the case.
> > >
> >
> > I can assure you this is not an option.
>
> I don't see any reason it shouldn't be.
>
Maybe you can't but I assure you I can. The particular development I
am doing has to be in kernel space - no other option.
> >
> > > I'm not 100% sure what function it is you need to
> > > use here. But I think it is get_free_pages. Clearly
> > > kmalloc is not an option as it allocates only from
> > > the memory permanently mapped in kernel address
> > > space.
> > >
> > > get_free_pages takes a flags argument, some of the
> > > flags will tell which zone to allocate from. On x86
> > > Linux the normal sizes of the zones are:
> > >
Thanks for this information. I'll have a look at this.
> >
> > Yes I appreciate this but that 3GB, as I understand it, is also part
> > of the kernel address space. With this split, Linux is not very
> > scalable (640MB of address space is required to map a 64GB memory
> > system - ooops and there goes most of the kernel address space).
>
> That appears to be an improvement compared to the
> result I got last time I did the calculations.
> 64GB of memory is 16777216 pages. Requiring a 68
> byte page struct to represent each page means a
> total of 1140850688 bytes. That is 1088MB.
>
> If the page struct have been reduced to 40 bytes
> the 640MB would be correct. I agree it is still
> a lot compared to the 1GB of address space.
>
> But who is to blame? Using 1% of your RAM for
> management data seems fair to me. And I know a
> lot of people complain about the 3GB for user
> space being too restrictive. And the cost of
> changing page tables on each user/kernel switch
> is too high.
>
I never complained about the real memory that was used but pointed out
that a large portion of the kernel virtual address space is used up.
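
For anyone wanting to check the figures quoted above, here is a minimal sketch of the arithmetic. It assumes 4KB pages and one page struct per physical page; the function name mem_map_bytes is made up for illustration and is not a kernel symbol.

```c
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096ULL   /* assumption: 4KB pages */

/*
 * Bytes of kernel virtual address space needed to hold one page struct
 * for every physical page of a machine with ram_bytes of memory.
 * (Hypothetical helper, for illustration only.)
 */
uint64_t mem_map_bytes(uint64_t ram_bytes, uint64_t page_struct_size)
{
    uint64_t pages = ram_bytes / PAGE_SIZE_BYTES;
    return pages * page_struct_size;
}

/* Same figure expressed in whole megabytes. */
uint64_t mem_map_megabytes(uint64_t ram_bytes, uint64_t page_struct_size)
{
    return mem_map_bytes(ram_bytes, page_struct_size) >> 20;
}
```

Calling mem_map_megabytes(64ULL << 30, 68) reproduces the 1088MB figure and mem_map_megabytes(64ULL << 30, 40) the 640MB one, so both numbers in the thread are consistent with their assumed page struct sizes. Either way it is a large slice of a 1GB kernel address space.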
> So is Intel to blame for this problematic design?
> Or do you think the kernel developers could have
> come up with some better design within the
> restrictions of the CPU design? That would have
> to be either a smaller page struct, which I'm not
> sure how realistic it is. Or a way to place parts
> of this management data in high memory, which
> mean a more complex design and probably a
> performance hit.
>
Every OS developer faces problems with an architecture in one way or
another (try writing a Unix kernel for a PDP-11 with a 64KB address
space). My point is that I feel the Linux solution has not employed
the features of the architecture in the best possible way. I believe
large memory systems will become more and more common and 32-bit CPUs
will still be around for some time.
> > In
> > reality no user process needs its code area mapped into the kernel and
> > few processes require large parts of its data area mapped into the
> > kernel (unless using memory-mapped I/O). It seems to me that a better
> > compromise giving the kernel a larger address space and taking the
> > performance hit needed to map one or two user pages to transfer data
> > in/out of the kernel would have been a better approach.
>
> It is not only the code to copy data between user
> and kernel, that would be slowed down. Changing
> page tables would slow down switching between user
> and kernel, and that is needed very often.
>
> Besides we don't have the kind of granularity you
> suggest. You can't decide for each page if it is
> going to be in kernel or user space, let alone
> parts of a page.
>
> >
> > Other than flushing of the TLB's I do not see what is expensive in
> > switching page tables. Surely all that is needed is to point CR3 at
> > the new page table?
>
> That sounds plausible.
>
> > The TLB flush problem would be solved with separate user mode and
> > kernel mode TLB's. This can be achieved by mapping kernel address
> > space using 4MB pages and user space with 4KB pages.
>
> Can you point out a reference explaining why this
> should be possible? AFAIK the two page sizes are
> indicated by a bit in the page pointed to by CR3.
> You can have 4MB pages and 4KB pages in the same
> address space.
>
Wrong I'm afraid. The only bits in CR3 relate to the L1 and L2 caching
behaviour. One bit is used to disable it and the other determines
whether the cache will operate as write-through or write-back (Intel
Architecture Vol 3, p2-16). The page granularity is set by the PSE bit
of CR4 (same volume, p3-18). Also read section 3.7.3, Mixing 4KB and
4MB pages. Section 3-11 also describes the other features
available to help minimise TLB flushing. I cannot locate the reference
about flushing of 4KB TLBs as opposed to 4MB TLBs but since there is
no need to modify CR3 during a system call no flush would take place
anyway, all that is required is to change from the user space segment
selector to the kernel space segment selector (which I believe is what
happens anyway in the current kernel?). The portion of kernel virtual
address space used to map user space can be mapped as 4KB pages and
the relevant entries flushed in the TLB. Most system calls transfer
very small amounts of data (with I/O the obvious exception), so only one
or two page table entries would need to be changed and frequently the
same virtual addresses in user space will be used in consecutive
system calls. A little extra code can be added to check this if it
provides some performance improvement.
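
To make the bit positions under discussion concrete, here is a rough sketch of how a 32-bit linear address is split under the plain (non-PAE) two-level paging layout. The bit positions are the ones given in the Intel manual: PSE is bit 4 of CR4, PWT and PCD are bits 3 and 4 of CR3, and the page size (PS) flag is bit 7 of a page-directory entry. The struct and function names are invented for illustration. A 64GB machine would be running with PAE, where the layout differs, so treat this only as the simple case.

```c
#include <stdint.h>

#define CR3_PWT  (1u << 3)   /* CR3: page-level write-through        */
#define CR3_PCD  (1u << 4)   /* CR3: page-level cache disable        */
#define CR4_PSE  (1u << 4)   /* CR4: page size extensions enabled    */
#define PDE_PS   (1u << 7)   /* page-directory entry maps a 4MB page */

/* Hypothetical result type for this sketch. */
struct decoded_addr {
    uint32_t dir_index;    /* index into the page directory (bits 31..22) */
    uint32_t table_index;  /* index into the page table, 4KB pages only   */
    uint32_t offset;       /* byte offset within the 4KB or 4MB page      */
    int      is_4mb;       /* non-zero if the entry maps a 4MB page       */
};

/*
 * Split a linear address the way the MMU would, given the current CR4
 * value and the page-directory entry covering that address. A 4MB page
 * is used only when CR4.PSE is set AND the entry has PS set; otherwise
 * the entry points at a page table of 4KB pages.
 */
struct decoded_addr decode_linear(uint32_t addr, uint32_t cr4, uint32_t pde)
{
    struct decoded_addr d;

    d.dir_index = addr >> 22;
    d.is_4mb = (cr4 & CR4_PSE) && (pde & PDE_PS);
    if (d.is_4mb) {
        d.table_index = 0;               /* no second-level table walk */
        d.offset = addr & 0x3FFFFFu;     /* low 22 bits                */
    } else {
        d.table_index = (addr >> 12) & 0x3FFu;
        d.offset = addr & 0xFFFu;        /* low 12 bits                */
    }
    return d;
}
```

Because the choice is made per directory entry once PSE is enabled, 4MB and 4KB mappings can sit side by side in the same address space, which is what section 3.7.3 covers.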
> >
> > If performance is the sole design criteria for kernel development then
> > why not turn off paging altogether? Oh but wait, paging is there to
> > resolve a memory fragmentation problem and the performance hit is
> > acceptable as a result.
>
> If you think paging is just a tool to solve the
> fragmentation problem, you obviously don't know
> enough about how a kernel works. Paging is also
> about security and the possibility to implement
> memory mapped file, virtual memory, shared memory,
> CoW, etc.
>
I'm afraid you are wrong. Paging was introduced primarily to resolve a
memory fragmentation problem. I do not disagree with the other
benefits that have been made possible from paging but its basic intent
was to resolve the issue in early OS's of allocating space in a real
address memory system. Treating memory in pages and having H/W
assist to translate a virtual address to a physical address resolved
the memory fragmentation problem, which had meant either a roll-out,
roll-in of processes or a memory move when holes appeared in the
physical memory space. It was quickly seen that paging allowed
programs to have a larger virtual address space than available real
memory, the possibility of shared memory and a few additional safety
features (e.g. not mapping the page at virtual address 0).
> The performance hit cause by paging shouldn't be
> much. If the page is in TLB the access should be
> as fast as it would have been if paging was not
> enabled. So TLB misses is the only cost.
>
No. It still takes some time to perform the TLB lookup even though
that is significantly less than accessing the page table in memory.
Disable paging and that will remove this. Every instruction and data
access will incur the TLB lookup penalty, small though it is. The fact
that TLB's are there is simply because paging, despite its benefits,
would be unacceptably slow without this hardware assist. But a program
will ALWAYS run faster in real memory mode compared to virtual address
mode.
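
The two positions can be put into one rough cost model. This is only a sketch: the function name is invented, and every number fed into it is an illustrative assumption rather than a measurement of any real CPU.

```c
/*
 * Average cost of one memory access with paging enabled, in nanoseconds.
 *
 *   mem_ns        - cost of the access itself (what real mode would pay)
 *   tlb_lookup_ns - extra cost of consulting the TLB on every access
 *   walk_ns       - extra cost of a page table walk on a TLB miss
 *   tlb_hit_ratio - fraction of accesses that hit in the TLB (0.0 to 1.0)
 *
 * Hypothetical model for illustration only.
 */
double avg_access_ns(double mem_ns, double tlb_lookup_ns,
                     double walk_ns, double tlb_hit_ratio)
{
    double miss_ratio = 1.0 - tlb_hit_ratio;

    return mem_ns + tlb_lookup_ns + miss_ratio * walk_ns;
}
```

With tlb_lookup_ns set to zero the only overhead left is the miss term, which is the position in the quoted text. With any non-zero tlb_lookup_ns the result stays above mem_ns even at a 100% hit ratio, which is the point I am making.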
> If you consider all the costs involved in a kernel
> without paging, it would probably end up being
> more expensive.
>
> 1. fork would have to copy the entire address
> space at once.
> 2. You can't swap out single pages. You will have
> to swap out entire processes. And a process
> will require more available physical memory
> before it can be scheduled, so you would need
> more swap than otherwise.
> 3. You can't do on demand loading. To start a
> program every single page of the executable and
> all libraries need to be loaded.
> 4. Memory mapped files would require some ugly and
> inefficient hacks to recognize dirty sectors.
>
I totally agree. But these are additional benefits that accrue from
paging and do not change the fact that paging was first introduced
to resolve the memory fragmentation problem (your point 2). The other
points are OS features that were added much later to OS's (even the
original Unix system on a paged system performed a copy of the entire
address space and this was true of ATT up till I believe System V
where they adopted the BSD virtual fork approach - which only appeared
first I think in BSD4.1). It was not until 32-bit architectures became
common that demand paging was introduced. It was then that the
possibility of having a larger virtual address space than available
physical memory became a real possibility. The evolution of operating
system features has always been driven by the management of scarce
resources and performance considerations. If paging does not invoke a
performance hit why is it that Cray supercomputers are real memory
systems? Simply because despite the above benefits from paging, a real
memory system will always run faster. Large memory and disk caches
address all of the benefits you mention. Don't get me wrong - I am not
suggesting that we should throw paging away but making the point that
paging does invoke a performance penalty that cannot be avoided - only
minimised. But neither would I say that paging will always be
necessary in an OS.
> > I think the same is true here - the limit on
> > the kernel address space has a serious impact on its future
> > scalability
>
> I haven't seen any documentation for that claim. Just
> because you can't write code under those restrictions
> doesn't mean there is a problem. It just means you
> are not as good a coder as those who wrote the rest
> of the system. If you can't live with the restrictions,
> then don't write kernel code. There is a simple
> solution to your problem. Write user mode code.
>
Excuse me but I have never said that I couldn't write code under these
restrictions. On the contrary I have stated on a number of occasions
that my code runs perfectly happily under these restrictions. In fact
under normal operating conditions I suspect that this limit will not
pose a performance problem. My original query was to find out if there
was a limit and why as I was trying to obtain 'best' performance data.
I can get more than sufficient memory so that write performance is not
affected but more memory would significantly improve read performance
as it would give me a larger cache and improve the cache hit ratio.
All I have done is express surprise at the 1GB address space limit.
And for your information I know I am as good a coder, if not better,
than those who wrote the system. I have never knocked the people who
coded the system but queried one design decision. I can see benefits
to a larger kernel virtual address space particularly in relation to
large memory systems. I believe it can be achieved without sacrificing
much in the way of performance. If it is never implemented then it
will make no difference to me or my code. I will simply look for other
ways I can improve performance but know that this would be the biggest
single performance enhancing factor.
> Besides, when the problem can be solved by switching to
> a more appropriate architecture, it is not a scalability
> problem of the kernel, then it is a scalability problem
> of the architecture. With PAE the difference between
> total physical memory and addressable memory can be so
> huge, that it can really become a problem. Had you said
> PAE doesn't scale, I would have to agree with you.
>
But 32-bit architectures will be around for a long time yet and real
memory will increase. Commercial users will want to get the best out
of their current investment and are more likely to add memory or disk to
an existing (32-bit) system than replace it. And changing
architectures is only a temporary measure. Sure a 64-bit architecture
looks like it resolves all the scalability issues now but the original
PC thought 640KB would be more than sufficient memory. When 32-bit
systems came out real memory was measured in terms of MBs rather than
now where GB systems are fairly common. OS developers used virtual
memory as a means to allow large programs to run in small memory
systems not because it was ideal but simply because the cost of real
memory was prohibitive and it was an acceptable compromise. Now we are
at the stage where the opposite is the case, real memory systems
larger than the address space are economically viable. I see no reason
why this trend will change, and similar problems will arise at some
point in the future for 64-bit systems.
As for switching to a more appropriate architecture I do not have that
luxury. I have the platform on which I must develop my code and that
is an Intel 32-bit system. Until my company decides otherwise I can do
little to change this other than make representation for change.
> BTW. I have even read about some mainframe systems were
> it was decided not to make more than 4GB of RAM available
> to the software, because they wanted to keep using the
> well tested 32 bit software. Any additional RAM could
> only be used for disk caching.
>
> > so use the features of the architecture to resolve that
> > problem, accept the performance hit but try to minimise it by all
> > means.
>
> Why should we accept all x86 Linux systems with 1GB or
> more memory to be slowed down, just because you want a
> litle more kernel address space, which nobody else seems
> to have any use for?
>
A simpler, more elegant kernel perhaps? With a 1GB address space on a
64GB memory system more than likely used as a server, the demands
placed on kernel memory resources will be great (large
number of user processes each with their kernel resident data
structures, internal structures to handle memory management, I/O,
buffer caches and so on). This in turn will cause the kernel to spend
more time managing memory reducing performance and limiting the
scalability of the system in terms of the number of users it can
support and the I/O throughput. But neither am I the only person to
think that the address space split is an issue. A number of articles
have addressed this self-same issue. Also, if there is no use for a
kernel with a larger address space then why are there kernel versions
with a 2:2 split and I believe there is also a patch for a 4GB kernel
address space? I guess I am not alone.
> > The existing solution sucks and only helps when any user
> > process is transferring data in/out of the kernel
>
> That actually happens a lot. But no, you are not even
> right about that. Every kernel/user switch would be
> slowed down by a need to switch page tables. That means
> you'd have to pay the price on every single exception,
> system call, interrupt, and so on.
>
How long does it take to load a segment selector? That is all that is
needed (see Vol 3, section 3.10, p3-36). One segment selector selects
the kernel page table the other selects the user process page table.
Two separate 4GB address spaces, little overhead, no TLB flushing and
the only performance impact is mapping the relevant portion of user
space into the kernel. I suspect that even this could be avoided by
modifying the copy_from_user/copy_to_user routines to use the appropriate
segment selector registers but would need to read further. Even if
this is not possible, setting up a kernel address space mapping would
not be a significant overhead - a few page table entries at most. All
in all a solution that makes best use of the limitations (and
features) of the architecture.
> > - as soon as a task
> > switch takes place the TLB's are flushed anyway.
>
> Right, but they are rare. The kernel does a great work
> to make them as rare as by any means possible.
>
Only every time a process is scheduled to run. Not sure what the
process time quantum is under Linux (about 1/5th of a second I think).
So at least every quantum interval there is likely to be a TLB flush
in a multi-process environment (since this will cause a process switch
and a change in page tables for the new process).
I think we may need to agree to differ though as to the Linux design.
I have always accepted that the design decisions taken were made for
good reasons but that does not stop me from considering alternative
solutions which in my opinion provide more advantages than
disadvantages. To satisfy my curiosity, I would implement my ideas
myself if I had the time.