Re: Allocating kernel memory
From: George Nelson (grn_at_freeuk.com)
Date: 05/18/04
- Next message: kindsol: "HELP: gettimeofday() jumps backwards"
- Previous message: Frank Worsley: "MF / DTMF tone detection with modem"
- In reply to: Kasper Dupont: "Re: Allocating kernel memory"
- Next in thread: Gil Hamilton: "Re: Allocating kernel memory"
- Reply: Gil Hamilton: "Re: Allocating kernel memory"
- Reply: Kasper Dupont: "Re: Allocating kernel memory"
Date: 17 May 2004 16:34:34 -0700
Kasper Dupont <kasperd@daimi.au.dk> wrote in message news:<40A7177E.6A47EF53@daimi.au.dk>...
Kasper,
I'd like to start by apologising unreservedly. I confess I did not
read the Intel architecture spec closely enough although I still wish
I could find the reference in there that mentioned the different
treatment of the 4KB and 4MB TLBs. I made the wrong assumption that a
paged H/W spec would properly support kernel and user modes of
operation in a paged environment which was why I expressed my initial
surprise at the 1GB kernel limit. There are good reasons, I am sure
you would agree, that the kernel or a user process should be able to
use its maximal address space.
However, I still would like to address a few points you made in your
last post.
> As long as it can be done within the same clockcycle
> it doesn't matter how long time it takes, it would
> still not affect the end result in any way. I don't
> have any measurements, but I don't believe the CPU
> designers would allow a TLB lookup to slow down the
> memory access. The memory bandwidth really should
> be the bottleneck.
>
I agree that a clock cycle seems reasonable but, like you, I do
not know enough about how the TLB works. But that is still an extra
clock cycle per memory reference. On the clock that gates the key to the
TLB, a real memory address would be winging its way down the address
bus, so for every instruction there is an extra clock cycle and the
same for any data accessed by the instruction. So probably somewhere
on average 2 clocks extra/instruction. It may be 2 clocks for 4KB
pages as both a directory entry and a page entry must be looked up in
the TLB (but maybe this can happen in parallel).
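For reference, the directory entry and page entry mentioned above come from the way a 32-bit linear address is split under non-PAE IA-32 paging. The following is a minimal sketch of that split only; the function names are made up for illustration, and it says nothing about how many clocks the TLB actually spends on a lookup.

```python
# Toy illustration only: split a 32-bit linear address the way non-PAE
# IA-32 paging does. The function names are hypothetical.

def split_4kb(addr):
    """Return (directory index, table index, offset) for a 4KB page."""
    return (addr >> 22) & 0x3FF, (addr >> 12) & 0x3FF, addr & 0xFFF

def split_4mb(addr):
    """Return (directory index, offset) for a 4MB page."""
    return (addr >> 22) & 0x3FF, addr & 0x3FFFFF

addr = 0xC0101234  # arbitrary example address in the top 1GB
print(split_4kb(addr))  # two indices: directory entry, then page table entry
print(split_4mb(addr))  # one index: the directory entry maps the page itself
```

With a 4KB page two table entries are involved in the translation, with a 4MB page only one, which is the difference the paragraph above is guessing at.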
Also, paging as a necessary feature in H/W is very quickly becoming
debatable, like some other features of the modern OS. If you consider
the history of OS development and the reasons why we have things like
paging and multi-processing it was to maximise the use of scarce (and
expensive) resources to users of those systems. In today's climate
almost everyone has their own PC so is multi-processing really
required? Memory is cheap and plentiful so do the past benefits of
paging still hold true? Consider paging more closely. It provides a
way to handle memory allocation relatively cleanly but at a cost in
performance to set it up and maintain it. But if you are the only user
of a system with a large amount of memory why go to the trouble at
all? Well, OK you can demand page and don't have to load everything at
once but what advantage does that really give other than faster
process startup? On a single user system I would personally rather
take the hit and have all my code loaded at process startup than
suffer the performance hit measured in wall clock time caused by
demand paging. Demand paging is only an advantage when there is
competition for memory. On 32-bit architectures it is economically
viable to configure the system memory to fill its address space - so
there goes the virtual memory benefit (which costs in performance too).
Basically, the point I am making is that OS features are not written
in stone and what we are used to today may radically change in the not
too distant future. Perhaps it is time to start to question which
features are really needed in an OS rather than continue developing OSes
with features that may be redundant on the systems available today and
in the future.
> Comparing a CPU without paging support and CPU with
> paging support is definitely not the same as comparing
> an OS with and without paging support running on a CPU
> that does have the support.
>
> > All I have done is express surprise at the 1GB address space limit.
>
> The reasons for that design decision should have been
> made clear by now.
>
I wasn't making this comparison. I was using the Cray system as an
example of H/W designed for performance. The point I was making was
that if you take any paged architecture system and turn off paging,
your program will run faster.
> >
> > But 32-bit architectures will be around for a long time yet and real
> > memory will increase. Commercial users will want to get the best out
> > of their current investment and are more likely to add memory, disk to
> > an existing (32-bit system) than replace them.
>
> Sure, you can add more memory to your systems. But the
> problem really only applies to PAE systems, where you
> can add a lot more than 4GB of RAM. How many of them
> are there "out there"? Those 8GB systems I could find
> were using more than 50% of the zone 1 memory for stuff
> that could have been on zone 2. So you shouldn't
> experience any problem on a 16GB system either.
>
I disagree. PAE is not the problem on the Intel architecture - it is
simply a way to extend the life of an architecture relatively cheaply
while designing and developing a 64-bit architecture. The problem I
see with the Intel architecture is that the H/W has taken the decision
over TLB flushing rather than leaving that in the hands of the
software designer. The silicon on the chip with a fairly minor
modification could allow 4GB address spaces for anyone. In the
simplest solution the addition of two TLB flush control bits in one of
the control registers would allow my original suggestion of mapping
the kernel with 4MB pages and user space with 4KB pages. One bit, when
set would cause the 4KB TLBs to be flushed, the other the 4MB TLBs.
Thus transitions from user to kernel space and back could be carried
out without a flush. The 4KB bit would be set only when scheduling a
new process. Replacement of entries would still be under control of
the H/W as it is now. Ideally separate TLBs for kernel mode and user
mode would be better but that would require more silicon (although I
note that hyperthreaded systems flush the TLB entries according to the
logical processor that executes the page table change - i.e. only
entries belonging to that processor are flushed so the ability exists
to have entries keyed by processor ID and could probably be extended
to a privilege level key).
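To make the two-bit suggestion concrete, here is a toy model of it. This is only a sketch of the proposal: the FLUSH_4KB and FLUSH_4MB bits are hypothetical and do not exist in any real IA-32 control register, and the class and its methods are invented for illustration.

```python
# Toy model of the *proposed* flush control bits. These bits are
# hypothetical; no real IA-32 control register defines them.
FLUSH_4KB = 0x1  # hypothetical: flush TLB entries that map 4KB pages
FLUSH_4MB = 0x2  # hypothetical: flush TLB entries that map 4MB pages

class SizeTaggedTLB:
    def __init__(self):
        # (page size, virtual page number) -> physical frame number
        self.entries = {}

    def insert(self, page_size, vpn, pfn):
        self.entries[(page_size, vpn)] = pfn

    def flush(self, control_bits):
        """Drop only the classes of entry selected by the control bits."""
        drop = set()
        if control_bits & FLUSH_4KB:
            drop.add("4KB")
        if control_bits & FLUSH_4MB:
            drop.add("4MB")
        self.entries = {k: v for k, v in self.entries.items()
                        if k[0] not in drop}

tlb = SizeTaggedTLB()
tlb.insert("4MB", 0x300, 0x000)     # kernel mapped with a 4MB page
tlb.insert("4KB", 0x08048, 0x1234)  # user process mapped with a 4KB page
tlb.flush(FLUSH_4KB)                # scheduling a new process
print(sorted(tlb.entries))          # only the 4MB kernel entry is left
```

In this model a user/kernel transition would simply not call flush() at all, and only a process switch would set the 4KB bit.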
> With PAE the simple elegant design is not possible. There
> is no way you can put 64GB of RAM into a 4GB address space.
> A new design was required. And since we had a design that
> would allow the kernel address space to be smaller than
> the physical RAM, the 1GB extra that was in some cases
> taken for the kernel could be reclaimed for user space.
>
Agreed - with a 4GB address space you can't address more than 4GB of
memory at a time but that doesn't make it a bad thing to be able to
support a larger real memory. As I said above the problem is really
the fact that the H/W makes the decision about flushing and for the
reasons you have stated, the kernel virtual address space is limited.
This is what causes the problems in handling larger memory
configurations.
> You have 2GB of physical RAM. There is one hell of a
> difference between 2GB and 64GB. Nobody in their
> right minds would use the 4:4 patch for such a small
> system.
Why not? If I am developing code where essentially I am running almost
exclusively in kernel mode (as in my case I am) then I would not
suffer from the page table flush problem as the only user processes
that would run would be infrequent and purely for management purposes
so the TLB flush performance problem would not be an issue. I think
there is a market for a 4:4 system for vendors wishing to use Linux as
the base for say a router or a node in a SAN. I'm not sure if there
are any attempts to minimise the TLB flush problem here but even with
the existing architecture the effect of flushing can be minimised with
a little work, making use of the Global page bit. I think the TLBs have
64 entries for 4KB pages and 32 if they are mapping 4MB pages.
Allocate some portion of these as permanent entries (for the kernel
all of the kernel code in the 4MB instruction TLB and say half the
entries mapping commonly accessed kernel data and of course the kernel
mode stack page). User space involves more work as the only obvious
page(s) to flag as global is the user stack. However some entries
could be reserved in the instruction and data (4KB) TLBs to hold
global code and data pages as these pages are demand paged in. The
page when mapped can have its global bit set and a replacement
strategy used when the number of global pages reaches the
predetermined limit. I don't think overall such a scheme would incur a
significant additional penalty over what is incurred to handle the
initial page fault and hopefully locality of reference would minimise
the effect of the TLBs being flushed during user/kernel transitions.
When a new user process is scheduled the INVLPG instruction can be
used to invalidate the global pages.
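As a rough sketch of the scheme just described, the toy model below treats a CR3 reload as dropping every entry that does not have the Global bit set, and INVLPG as dropping one entry whether it is global or not. It assumes global pages are enabled (CR4.PGE set); the class, the switch_process() helper and the page numbers are all invented for illustration, and the cap on the number of global user pages is left out.

```python
PTE_GLOBAL = 1 << 8  # G bit in an IA-32 page table entry

class GlobalBitTLB:
    """Toy TLB; assumes global pages are enabled (CR4.PGE set)."""

    def __init__(self):
        self.entries = {}  # virtual page number -> PTE flag bits

    def insert(self, vpn, flags=0):
        self.entries[vpn] = flags

    def reload_cr3(self):
        """Model a write to CR3: non-global entries are dropped."""
        self.entries = {v: f for v, f in self.entries.items()
                        if f & PTE_GLOBAL}

    def invlpg(self, vpn):
        """Model INVLPG: drop the entry for one page, global or not."""
        self.entries.pop(vpn, None)

def switch_process(tlb, user_global_vpns):
    """Hypothetical helper: evict the old process's global user pages."""
    for vpn in user_global_vpns:
        tlb.invlpg(vpn)
    tlb.reload_cr3()

tlb = GlobalBitTLB()
tlb.insert(0xC0100, PTE_GLOBAL)  # kernel page, marked global
tlb.insert(0xBFFFF, PTE_GLOBAL)  # user stack page, marked global
tlb.insert(0x08048)              # ordinary user page

tlb.reload_cr3()                 # a flush within the same process
print(sorted(hex(v) for v in tlb.entries))  # both global entries survive

switch_process(tlb, [0xBFFFF])   # a new user process is scheduled
print(sorted(hex(v) for v in tlb.entries))  # only the kernel entry is left
```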