Re: Direct Linux syscalls

From: Beth (BethStone21_at_hotmail.NOSPICEDHAM.com)
Date: 12/23/04


Date: Thu, 23 Dec 2004 02:14:44 GMT

Kasper Dupont wrote:
> Beth wrote:
> > the mechanisms are
> > really dedicated to hardware IRQs primarily but Intel added "software
> > interrupts" to re-use its "run-time relocation" stuff for BIOS / OS
> > functions)...
>
> OK, that explains why the instruction isn't really optimized.

Yeah, the x86 architecture suffers from a number of quite horrible "hacks",
which suggests Intel weren't seeing much of a future in it at the design
stage...the most infamous being the "real-mode addressing" used before
protected mode was added...addresses are calculated as: Address =
(Segment * 16) + Offset...which leads to the rather horrible situation that
0000:0400h, 0040:0000h and 0030:0100h all point to the _SAME_ memory
address, despite being numerically dissimilar...
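To make the aliasing concrete, here's a small C sketch of that calculation (the function name real_mode_linear is just for illustration). The mask models the 8086's 20 address lines, where the sum wraps at 1 MB:

```c
#include <stdint.h>

/* Real-mode address translation: linear = (segment * 16) + offset.
   The 8086 only had 20 address lines, so the sum wraps at 1 MB; the
   mask models that wrap-around (as on later CPUs with the A20 line off). */
static uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFFu;
}
```

All three pairs above come out the same: real_mode_linear(0x0000, 0x0400), real_mode_linear(0x0040, 0x0000) and real_mode_linear(0x0030, 0x0100) each give linear address 0x400.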

There's no way a self-respecting engineer would have introduced this "hack"
if they had actually known at the time that this chip would have a
_30 year_ lifespan...and become the architecture used in 90+% of machines
world-wide...but they didn't know, of course, so one cannot blame them;
still, the original architectural design is odd and hardly ideal...

Plus, the x86 started out as a simple 16-bit microprocessor...no MMU, no
"protections"...these were added with "protected mode" but, well, this was
a later addition...the point being that until you have such "protected
mode" operations added to the CPU, it would have no such thing as "user"
and "supervisor" modes to begin with, anyway, for an instruction like
"sysenter" to even make sense...

So, yeah, as noted, it's not really an "OS calling mechanism" as such, but
more of a "piggyback" on the interrupt system (needed for hardware IRQs and
such)...the interrupt system already had to deal with "run-time relocation"
(by not calling an address directly but calling it _indirectly_ through a
table of addresses...an "Interrupt Vector Table" (IVT)...later renamed
"Interrupt Descriptor Table" (IDT) to distinguish its protected mode
equivalent, as addressing works completely differently in protected mode
and required a different format table)...and then CPU exceptions also
re-used this same system (in a sense, the CPU exception being more or less
a form of "hardware IRQ" sent from the CPU to itself...an "internal" IRQ
;)...so, it made sense to also add on the "INT" instruction to allow
software to also generate "interrupts" on request and its mechanisms could
be "re-used" to also serve for BIOS / OS functions (which could benefit
from being indexed via a "table of addresses" so that newer versions of the
OS could relocate its routines elsewhere, just change the addresses in the
table but programs calling via the table don't need to be recompiled)...
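A minimal C model of that "table of addresses" idea, assuming nothing about real BIOS/DOS internals: the names os_service_v1, os_service_v2, set_vector and software_int are all hypothetical, and vector 0x21 is only borrowed from DOS for flavour. The program only ever knows the vector number, so the OS can swap in a routine at a new address without the caller changing:

```c
#include <stddef.h>

/* Hypothetical model of an interrupt vector table: callers only know a
   vector number, never the address of the routine behind it. */
typedef int (*service_fn)(int arg);

#define NUM_VECTORS 256
static service_fn vector_table[NUM_VECTORS];   /* zero-initialised: no handlers yet */

/* Two versions of the same (made-up) OS service living at different addresses. */
static int os_service_v1(int arg) { return arg + 1; }
static int os_service_v2(int arg) { return arg + 100; }

/* OS side: installing or relocating a routine only rewrites one table entry. */
static void set_vector(unsigned char vector, service_fn fn)
{
    vector_table[vector] = fn;
}

/* Program side: the moral equivalent of "INT n", an indirect call through
   the table, so the program never needs recompiling when the OS moves
   its routines around. */
static int software_int(unsigned char vector, int arg)
{
    service_fn handler = vector_table[vector];
    return handler != NULL ? handler(arg) : -1;   /* -1: nothing installed */
}
```

After set_vector(0x21, os_service_v1), software_int(0x21, 5) returns 6; after set_vector(0x21, os_service_v2), the very same call returns 105.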

> > > With some kernel versions, you can just call
> > > some high address (0xFFFFF000 I think), where
> > > the kernel will have placed an appropriate
> > > trapping instruction for your configuration.
> >
> > Another way again?
>
> Actually not. It just call some code that will
> use the right one of the two possibilities. The
> code on the called address is also executing as
> user mode code.

Yeah, from the description given elsewhere, I think I get the idea...the
application just calls into the page, which the kernel has filled with either
"int 80h" or "sysenter", depending on what the system is configured to
use...thus, it's "transparent" to the application which one is in
use...kind of clever but also kind of wasteful...
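A rough C simulation of that arrangement, with the trap instructions replaced by ordinary functions that just report which path was taken (all names here are hypothetical; on real i386 Linux, glibc finds the kernel-provided entry point through the AT_SYSINFO auxiliary-vector entry). The "kernel" picks the entry method once at boot and the application code never changes:

```c
#include <stdbool.h>

/* Stand-ins for the two ways of entering the kernel. Real code would
   execute "int $0x80" or "sysenter"; these only report the path taken. */
enum entry_method { ENTRY_INT80 = 1, ENTRY_SYSENTER = 2 };

static int enter_via_int80(int nr)    { (void)nr; return ENTRY_INT80; }
static int enter_via_sysenter(int nr) { (void)nr; return ENTRY_SYSENTER; }

/* The fixed "page" the application calls; the kernel fills it in at boot. */
static int (*syscall_page)(int nr);

static void kernel_boot(bool cpu_has_sysenter)
{
    syscall_page = cpu_has_sysenter ? enter_via_sysenter : enter_via_int80;
}

/* Application code: identical whichever method the kernel chose. */
static int app_make_syscall(int nr)
{
    return syscall_page(nr);
}
```

The "wasteful" part shows up as the extra call: every syscall goes through syscall_page before reaching the real trap.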

> > To be honest, I was always rather surprised that Linux used "INT",
> > anyway...this is usually considered to be "not recommended" in a
> > multi-tasking system and not the speediest way to do it (though, it might
> > be the _easiest_ way to do it...so perhaps Linus was simply thinking of how
> > to get it done quickly, not what would necessarily be the "leanest and
> > meanest" performance-wise)...
>
> What other options were there when Linux were originally designed for the
> 386?

There are a few options available, in fact...but the most sensible to my
mind would have been a plain, simple "CALL" instruction...the protected
mode architecture (which actually came in with the '286 originally but the
'386 made it 32-bit and, well, actually useful :) also allows a change of
"privilege levels" and such with a "call gate", as well as through
interrupts...

Indeed, the most obvious alternative option works much the same way that
shared libraries do...a program calls the OS indirectly through a
"table", which can be constructed by the OS loader...

As a simple example, when a process loads, a "jump table" of addresses of
all the syscalls could be provided to the application...say, just as an
example, the EAX register holds the address of the start of this "jump
table"...the application can then save this address away in a variable and
use instructions of the form "CALL [ TableAddress + (syscallnumber * 4)]"
(once the table address is in a register, '386 addressing modes are
powerful enough that this is a single machine instruction, such as
"CALL [EBX + ECX*4]") to index into that "jump table" and use it to call the
OS system functions...the OS loader itself constructs the table - so it can
fill out the table with the addresses dynamically and it can happily change
addresses from version to version - and the application is simply compiled
to make calls via the table...
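Here's a minimal C sketch of that scheme, assuming made-up syscall numbers and functions (os_getpid, os_add, loader_build_table and so on are all hypothetical). Instead of arriving in EAX, the table address is simply passed to program_entry; indexing g_syscalls[nr] is the C spelling of CALL [table + nr*4], since C scales the index by the pointer size automatically:

```c
#include <stddef.h>

/* Every entry uses one uniform (hypothetical) signature. */
typedef long (*syscall_fn)(long a, long b, long c);

/* "OS" routines: their addresses can change from version to version. */
static long os_getpid(long a, long b, long c) { (void)a; (void)b; (void)c; return 42; }
static long os_add(long a, long b, long c)    { return a + b + c; }

enum { NR_GETPID = 0, NR_ADD = 1, NUM_SYSCALLS = 2 };

static syscall_fn os_table[NUM_SYSCALLS];

/* The OS loader fills in the table and hands its address to the program. */
static syscall_fn *loader_build_table(void)
{
    os_table[NR_GETPID] = os_getpid;
    os_table[NR_ADD]    = os_add;
    return os_table;
}

/* The application saves the table address away in a variable... */
static syscall_fn *g_syscalls;

static void program_entry(syscall_fn *table_from_loader)
{
    g_syscalls = table_from_loader;
}

/* ...and every system call is one indirect call through the table. */
static long do_syscall(int nr, long a, long b, long c)
{
    return g_syscalls[nr](a, b, c);
}
```

After program_entry(loader_build_table()), do_syscall(NR_ADD, 1, 2, 3) returns 6 and do_syscall(NR_GETPID, 0, 0, 0) returns 42; the application was compiled only against the numbers, never the addresses.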

The other point about using "CALL" is that it can be set up to make a
simple call to other user mode code or it can be set up with a "call gate"
in order to trigger a user -> kernel transition...and this would be
"transparent" to the calling application (as long as the application
always makes far calls through the table, the difference lies in the
segment descriptor tables, not in the actual instructions an application
uses)...

This also tends to make sense from a "micro-kernel" architecture
point-of-view (you still use the exact same "indirect CALL" instruction,
whether you're calling other user mode code or actually calling into kernel
space...thus, the system functions need not all be in kernel space...you
could even conceivably move a "monolithic" kernel slowly towards a
"micro-kernel" - reduce the kernel but still provide all the same functions
that were available when "monolithic" with user mode equivalents - and do
it quite "transparently")...
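A toy C model of that "transparency", under loudly stated assumptions: the needs_gate flag stands in for a call-gate descriptor, the transitions counter stands in for the ring 3 -> ring 0 switch the CPU would perform, and every name is hypothetical. The caller uses the same call_service whether the entry points into the "kernel" or at a user-mode server:

```c
/* Hypothetical table entry: a target plus whether reaching it needs a
   privilege transition (a call gate) or is a plain user-mode call. */
typedef int (*service_fn)(int);

struct table_entry {
    service_fn target;
    int needs_gate;   /* 1: kernel service behind a gate, 0: user-mode server */
};

static int transitions;   /* counts simulated user -> kernel switches */

/* The same service, first inside the "kernel", later moved out to a server. */
static int fs_lookup_kernel(int x) { return x * 2; }
static int fs_lookup_server(int x) { return x * 2; }

static struct table_entry services[1];

/* What the hardware would do on the call: honour the gate if there is one.
   The calling code is identical either way. */
static int call_service(int idx, int arg)
{
    const struct table_entry *e = &services[idx];
    if (e->needs_gate)
        transitions++;   /* stand-in for the privilege-level change */
    return e->target(arg);
}
```

Moving the service from kernel to user mode only rewrites services[0]; call_service(0, 21) still returns 42, the only difference being that no transition is counted.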

This also makes much more sense from the UNIX / C side of things, in
that these "CALL" instructions could directly be C convention calls...as
Linux itself is written in C, the current process is quite bizarre from a
"big picture" point of view...call into "libc", which loads the parameters
from the stack into registers and executes "int 80h", whose handler takes
the parameters out of the registers and puts them back on the stack in
order to call internal C functions...it's moving the parameters around all
over the place and making calls to calls and such...quite a bit of
"overhead" attached (and, no, it's not that this "overhead" is unacceptably
slow or large but it's just not really necessary...why do something all
the time that there's no actual need to be doing?)...
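A simplified C model of the two paths, not real kernel code: kernel_write, trap_handler and the two libc_write_* wrappers are hypothetical, and struct regs stands in for the CPU registers. It does follow the real i386 convention of syscall number in EAX and the first three arguments in EBX, ECX and EDX, with 4 being write on i386 and -38 being -ENOSYS:

```c
/* Stand-in for the CPU registers at the moment of the trap. */
struct regs { long eax, ebx, ecx, edx; };

/* Kernel-internal C function (hypothetical stand-in for sys_write). */
static long kernel_write(long fd, long buf, long count)
{
    (void)fd; (void)buf;
    return count;   /* pretend everything was written */
}

/* Trap handler: unpacks registers back into ordinary C arguments. */
static long trap_handler(const struct regs *r)
{
    switch (r->eax) {
    case 4:  return kernel_write(r->ebx, r->ecx, r->edx);   /* 4 = write on i386 */
    default: return -38;                                    /* -ENOSYS */
    }
}

/* Path 1: the libc stub packs its C arguments into "registers" and traps. */
static long libc_write_via_trap(long fd, long buf, long count)
{
    struct regs r = { .eax = 4, .ebx = fd, .ecx = buf, .edx = count };
    return trap_handler(&r);   /* stands in for "int $0x80" */
}

/* Path 2: the alternative, one C-convention call through a table. */
static long (*const kernel_table[5])(long, long, long) = { [4] = kernel_write };

static long libc_write_direct(long fd, long buf, long count)
{
    return kernel_table[4](fd, buf, count);
}
```

Both paths return the same result; the difference is the pack/unpack shuffle through struct regs that path 1 performs on every call.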

The '386 supported an "indirect CALL" (with a "call gate", if you needed to
switch privilege levels during that call to jump into "kernel
space"...quite "transparent" to the caller that this is happening too) and
could have used a "jump table" kind of call from the beginning...indeed,
being NOT greatly dissimilar from the mechanism used for "shared libraries"
(the difference being that the OS loader constructs the "jump table"
dynamically, according to entries in the executable header about what
libraries and functions need to be loaded and "imported" into this "jump
table"), this could also have been made more "generic" and re-used for that
too (so that, so to speak, every process automatically has the kernel
loaded "as if" it were a shared library when it starts)...

This system was certainly possible at the time because Microsoft were
already using it for Windows (and though Windows has a great many faults,
this is one place where they appear to have got it right...well, they're
using the right _mechanism_, anyway...BUT, as is typical of Microsoft, it's
rather wasted and spoilt with other strange "stdcall" conventions and an
insistence on 500 system APIs when 3 would do the job just as well)...

Indeed, I've had to look into this for a project and, ideally, I'd use a
CALL via a "jump table" with parameters in registers (where "libc" can provide C
convention "parameters via the stack" wrapper functionality around this for
"portability"...but where performance is more important than "portability",
an application can ignore this and use "parameters in registers"
directly)...this gives reasonably efficient performance while not actually
compromising flexibility, "portability" concerns (indeed, if you're not
fussed to support any lower-level "parameters in registers" interface, you
could make these calls actual C convention calls directly and then they are
called exactly like "libc" functions), "transparency" for "transitions"
between user and kernel space and so on and so forth...this method would
generally meet most requirements simultaneously...as close to a "one size
fits all" solution as is possible...it would arguably be _simpler_
and has the potential to be "re-used" for shared libraries too (the OS loader does
already create such a "jump table" for shared libraries...merely "allocate"
the first entries in the table to the system calls, "as if" these system
calls were from an implied "kernel shared library"...because these are
"implied", there's no need to list them in an executable header or
anything...to account for "future expansion", perhaps it's best to have one
"jump table" per "shared library" so that if the number of system calls
goes up, this doesn't "clash" with anything)...
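A minimal C sketch of the "one jump table per library" layout, with the kernel implicitly occupying library slot 0 and a libc-style C wrapper on top of the low-level entry. Everything here (library indices, function names, the two kernel "versions") is hypothetical; in the register-based interface, lib, fn, a and b would travel in registers:

```c
#include <stddef.h>

typedef long (*fn_t)(long, long);

/* Library slot 0 is the implied "kernel shared library": no header entry needed. */
enum { LIB_KERNEL = 0, LIB_MATH = 1, NUM_LIBS = 2 };

static long k_add(long a, long b) { return a + b; }
static long k_sub(long a, long b) { return a - b; }
static long m_mul(long a, long b) { return a * b; }

static fn_t kernel_v1[] = { k_add };
static fn_t kernel_v2[] = { k_add, k_sub };   /* newer kernel: more syscalls */
static fn_t math_lib[]  = { m_mul };

static fn_t *tables[NUM_LIBS];   /* one jump table per library */

/* The loader wires up the tables when the process starts. */
static void loader_setup(fn_t *kernel_table)
{
    tables[LIB_KERNEL] = kernel_table;
    tables[LIB_MATH]   = math_lib;
}

/* Low-level entry: library index, function index and arguments. */
static long call_entry(int lib, int fn, long a, long b)
{
    return tables[lib][fn](a, b);
}

/* libc-style C wrapper for "portability". */
static long libc_add(long a, long b)
{
    return call_entry(LIB_KERNEL, 0, a, b);
}
```

Switching from kernel_v1 to kernel_v2 adds a syscall, but call_entry(LIB_MATH, 0, 6, 7) still returns 42 because the math library's table never shifts. (Calling index 1 of the kernel is only valid once kernel_v2 is installed.)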

Linus, though, didn't have "shared libraries" to begin with...and, well,
probably cared more about just getting it up and running than about
"performance" at first...now with Intel introducing "sysenter", the
question might be academic, as the dedicated instruction no doubt performs
better than any other means and should now be preferred...but, for the
original '386, there certainly were other options..."int" isn't the only
way...I think, simply, "int" was the most obvious and simplest way when
Linus first started (and familiar from DOS, which did it this way too) and once
he decided on that way, you have to keep with it on "compatibility"
grounds...not that he did anything "wrong", so to speak, but there were
other possibilities available which could have performed slightly better
(and even given the kernel itself a "libc" style calling mechanism right at
the "core", feeding directly into the C functions of the kernel itself, as
the OS was written mostly in C :)...

Beth :)