Re: [patch 00/13] devtmpfs patches
- From: Kay Sievers <kay.sievers@xxxxxxxx>
- Date: Mon, 11 May 2009 18:28:20 +0200
On Mon, May 11, 2009 at 17:53, Alan Cox <alan@xxxxxxxxxxxxxxxxxxx> wrote:
But he does not use an initramfs, and distros insist on doing that. And
that basically means you need to prepare /dev two times, and also prep
Once. You may want to move a few bits later. You only need null,
zero and console to get started. That's three fixed device nodes.
And random, rtc, tty for a custom console, and whatnot, in the
non-trivial case. Not to mention non-x86 boxes.
If we have stable block numbers you might need more than one extra if you
have to search for a UUID/label and it moved from where you cached
it. Without stable block numbers you can't cache the node but must create
lots of nodes to go looking. Do I understand that bit right ?
Look at drivers/block/, you need all of the names then, to get it
booting from a non-sd node.
However you still only create it once as you have zero, console and null
on the initrd already and do
mkdir final-dev
mount tmpfs
create them in final-dev
mount root
move final-dev
Tell me if I'm going astray here as I want to clearly understand the
problem.
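
The sequence listed above can be written out as a dry-run plan. This is only a sketch: the staging path, root device and new-root mount point are hypothetical names chosen for illustration, and nothing is executed. The major/minor numbers are the standard Linux ones for null, zero and console.

```python
# Dry-run sketch: build the commands for the "create /dev once" sequence.
# Paths and the root device are hypothetical; nothing is executed here.

# (name, type, major, minor) for the three fixed nodes named in the thread.
FIXED_NODES = [
    ("null", "c", 1, 3),
    ("zero", "c", 1, 5),
    ("console", "c", 5, 1),
]


def plan_final_dev(staging="/final-dev", root_dev="/dev/sda1", new_root="/sysroot"):
    """Return the ordered list of commands (argv lists) for the sequence."""
    cmds = [
        ["mkdir", "-p", staging],                    # mkdir final-dev
        ["mount", "-t", "tmpfs", "tmpfs", staging],  # mount tmpfs
    ]
    for name, kind, major, minor in FIXED_NODES:     # create them in final-dev
        cmds.append(
            ["mknod", "-m", "600", f"{staging}/{name}", kind, str(major), str(minor)]
        )
    cmds.append(["mount", root_dev, new_root])       # mount root
    cmds.append(["mount", "--move", staging, f"{new_root}/dev"])  # move final-dev
    return cmds
```

Printing `" ".join(argv)` for each entry of `plan_final_dev()` gives the shell commands in order. The plan assumes the root device node already exists when "mount root" runs, which is exactly the point questioned in the reply that follows.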
Maybe your root disk shows up after the "create them in final-dev"?
Initramfs logic works by just waiting for the device node udev creates
asynchronously. When it's there, we go ahead. To make sure you don't
miss it, you have to start udev before you copy the nodes over.
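
A minimal sketch of the "wait for the node to show up" step, assuming simple polling with a timeout. The function name and the polling approach are illustrative assumptions, not how any particular initramfs implements it.

```python
import os
import time


def wait_for_node(path, timeout=10.0, interval=0.05):
    """Poll until `path` exists; return True if it appeared before the timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

The waiter only helps if whatever creates the node (udev in this discussion) is already running before the wait starts, otherwise the node never appears and the call times out.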
Another data point: On a fairly typical PC on a single CPU we can do over
30,000 mknods per second on tmpfs. I've just benched it. So you can
create those block nodes very fast indeed.
On a 1 second budget I can create 3000 device nodes (which should cover
most user systems quite adequately) and have 0.9 seconds left to do other
work.
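
The figures quoted above can be checked roughly with a sketch like the following. It is not the benchmark used in the thread: it creates regular files through `os.mknod` so that it runs without root, and it uses the default temporary directory unless a tmpfs path (typically `/dev/shm` on Linux) is passed in. Both function names are made up for this example.

```python
import os
import stat
import tempfile
import time


def measure_mknod_rate(count=3000, directory=None):
    """Create `count` nodes with os.mknod and return nodes created per second."""
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        start = time.perf_counter()
        for i in range(count):
            # S_IFREG needs no privileges; real device nodes would use
            # S_IFCHR/S_IFBLK plus os.makedev(major, minor) and need root.
            os.mknod(os.path.join(tmp, f"node{i}"), 0o600 | stat.S_IFREG)
        elapsed = time.perf_counter() - start
    return count / elapsed


def remaining_budget(nodes, rate_per_second, budget_seconds=1.0):
    """Time left in the budget after creating `nodes` at the given rate."""
    return budget_seconds - nodes / rate_per_second
```

With the quoted rate, `remaining_budget(3000, 30000)` is 1.0 - 3000/30000, which is about 0.9 seconds, matching the arithmetic in the mail.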
Sure. But that does not solve the problem of missing device nodes or
the requirement of shipping all possible combinations.
Device spaces have user controlled naming rules, user controlled
permissions, user controlled labelling and the like. That is policy, and
the administering of that is management.
I see. But that does not change at all. It's just that you can also
bring up the box without the complex management we need to do today.
If you have an environment using any of those features then not having
that management is not a win - it's a bug.
Bugs happen, it's a reality. We don't needlessly make it harder to
work around a bug. We have many things to make the kernel
self-contained. With your argument, we should remove all partition
scanning from the kernel too.
That was one of the things that killed devfs eventually, and it's not a
problem your proposal or devfs solved.
Oh, that old devfs was killed for many good reasons, sure. The biggest
reason alone to kill it, was the dumb new naming scheme, which broke
The "naming scheme" ? It was not the naming scheme but the inability to
make it do stuff the way users wanted. If the naming scheme had been
trivially configurable then the distro would simply have shipped a
different naming scheme.
Yeah, but it did not even create the current names by default. So it
was the main reason distros did not use it.
As mentioned, we create 12,000 files in sysfs, now we just add 210 and
setfacl -m u:alan:r /sys/devices/virtual/dmi/id/bios_vendor
setfacl: /sys/devices/virtual/dmi/id/bios_vendor: Operation not supported
Sysfs doesn't even support per user ACLs which means it's not much use for
tty devices or a lot of other things where you want to give access to a
piece of hardware to groups of users or use SELinux to control root more
tightly.
We add the 210 to a separate tmpfs which is the subject of this mail,
and that supports ACLs just fine. We don't add any device nodes to
sysfs.
decouple the kernel initial bootup from a complex userspace
dependency, all for the sake of robustness, that is also faster and
very flexible.
It isn't flexible. You can't set the naming policy, you can't set the
permissions, you can't control the labelling. It might be a convenient
way to implement a very specific narrow set up.
The kernel _is_ the naming policy already, claiming anything different
is just a lie. If you go and rename /sys/block/sda in the kernel, no
current udev system will provide a /dev/sda node anymore. It has been
that way since forever.
Udev still has the last say, and can overwrite the kernel policy,
nothing will change, but that does not happen today, and will not
happen in the future for 98% of the devices.
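
To illustrate where the kernel-provided names come from, here is a sketch that reads them from a sysfs-style tree. It relies on the `dev` attribute under `/sys/block/<name>/`, which holds `MAJOR:MINOR`; the function name is an assumption made for this example, and it is not udev's implementation.

```python
import os


def kernel_block_names(sys_block="/sys/block"):
    """Map each kernel block device name to its (major, minor) pair."""
    nodes = {}
    for name in sorted(os.listdir(sys_block)):
        dev_file = os.path.join(sys_block, name, "dev")
        try:
            with open(dev_file) as f:
                major, minor = f.read().strip().split(":")
        except (OSError, ValueError):
            continue  # entry without a readable "dev" attribute
        nodes[name] = (int(major), int(minor))
    return nodes
```

Each key of the returned mapping is the name a default setup would use under /dev, so renaming the sysfs entry in the kernel changes the node name as well.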
No, that problem is solved by exporting all of it in sysfs already
today. But that does not provide any of the robustness and reliability
gains the kernel-provided nodes do.
What is robust and reliable about having another set of nodes that an
existing distro won't know about and existing tools don't know about that
has permissions and labels that bypass the security as configured by the
system administrator ????
Which "other" set? There is only one set of names, that is the kernel
provided name. There is no bypass anywhere.
5. Make the new big block numbers stable
Might be nice to have, but we still can't include all of the possible
block driver names and nodes in initramfs. Distros just cannot manage
that, and don't do it today.
Even if we have to create a lot of nodes it shouldn't be slow - mknod
syscalls on tmpfs are as we've just established - quite acceptably quick.
Yes I think stable numbers would be smart.
Just grep in drivers/block/ and estimate how many nodes you will need
to provide. General purpose distros don't do that today, and don't
want to go back to the time they needed to do that.
Mine does too. But general purpose systems have different problems to solve.
I'm of the opinion your system isn't general purpose - it's Kay purpose.
If it can become truly general purpose and replace or improve udev with
something far better then great but can it ?
I don't understand this question. What do you mean?
What problem?
The problem I've been pointing out all along - security, naming,
permissions, persistency.
Naming happens in the kernel for udev systems since forever.
Permissions happen in udev, and we keep that. All kernel created
nodes are 0600 root:root. If a device exists in the kernel, we will
see its node, if it goes away the node goes away, just like sysfs, and
just like we do with udev in /dev today.
Let me know what specifically needs to be fixed, I'll do it right
away, I wrote and maintain most of it, so I should be pretty quick to
act here. I work on it almost every day, and I mostly don't find it
non-funny. :)
So if you maintain it why is it so slow ? (that isn't an accusation of
incompetence btw I want to understand the bottlenecks) - what percentage
is CPU wait, what is I/O wait, wtf are we doing with all that wall time
and serialized probing ? You've still not provided any useful data on
timings. If you had four or five pet programmers and were told "fix udev"
what would you direct them to sort out ? The numbers you've posted
contain no breakdown. Yes it's faster than the old system for your
specific case but there is no "why" in the data.
It isn't slow. It's just that bootstrapping/re-constructing something
later can obviously never be faster than doing it when the device is
created.
I don't know of any obvious fixes to udev, otherwise I would have
implemented them.
There isn't any reason it should magically go faster in kernel. We don't
run the CPU at a different speed in kernel and syscalls are cheap.
Yes, we are in the context of the device, and create the node on top
of many other things we do. At the time userspace runs, we need to
recover all that information, which is not as robust, and not as
cheap. The recover/bootstrap point is a hard blocking point for other
stuff that can run at the same time otherwise.
but it does actually get us something featureful
and useful that does what people want.
Actually, many people asked for more robustness and less complexity to
bring up a box, not for more special hacks in udev, initramfs, the
boot scripts. That's what we try to solve here, and what we did, from
my perspective.
"from my perspective" - bingo...
Sure, what else can I say, I have only my one, just like you have yours.
Which is the devfs problem - it's easy to solve a problem for one
perspective or one user only. But we'd have an awful lot of devfs clones
in the kernel if we kept doing that.
So I'd like
- my device file system to do SELinux and ACLs (and Tomoyo and ...)
- ability to set labels and security contexts and permissions
- device nodes in one place only
- ability to use security models which take stuff away from root (so
chmodding the sysfs node 000 doesn't cut the mustard)
- a guarantee I can't race the policy application and node creation on
hotplug. In other words the creator sets up its security contexts and
the like then does the node create.
You can do all that just like you do today, no change at all.
Putting device nodes into sysfs can't do most if any of that
Nobody talked about that.
Putting the data to create those initial device nodes into sysfs *can*
make it customisable this way. It also means your initrd can be more
robust because the device creation logic is very very simple.
sh < /sys/initial-device-list
And you still need to cope with the races, and bring up the event
listener before that. This is less reliable and always slower than the
kernel provided nodes, besides that your /sys/initial-device-list will
be the same amount of code we need for the node creation right away,
without any of the other benefits, and will require another
special-case tool we don't use today.
might be slightly extreme but you need little more when not using fancy
feature sets. We've just established by benchmarking that the mknod paths
are fast enough.
It's a question of API and layering
If you put the devices into sysfs I get burger and fries the way you like
If you put the list of devices into sysfs I get to decide how I want it.
Come on, nobody puts nodes in sysfs. Where did you get that idea from?
We have enough fixed nodes to run a recovery shell in the initrd or boot
with init=/bin/sh so the recovery argument doesn't seem to hold water.
Unless you have a box that does not work anymore, then it's the most
important thing you can have.
The performance for reading one sysfs file (even without sysfs
optimisation) and writing 3000 device nodes to disk is more than
acceptable so if you don't mind I'd prefer my burger with extra onions ;)
Sure, if I can have a beer too. :)
Thanks,
Kay