Re: [RFC] cpuset: remove sched domain hooks from cpusets



Hi Paul,

This mail seems to be as good as any to reply to, so here goes

On Fri, Oct 20, 2006 at 12:00:05PM -0700, Paul Jackson wrote:
The patch I posted previously should (modulo bugs) only do partitioning
in the top-most cpuset. I still need clarification from Paul as to why
this is unacceptable, though.

That patch partitioned on the children of the top cpuset, not the
top cpuset itself.

There is only one top cpuset - and that covers the entire system.

Consider the following example:

/dev/cpuset cpu_exclusive=1, cpus=0-7, task A
/dev/cpuset/a cpu_exclusive=1, cpus=0-3, task B
/dev/cpuset/b cpu_exclusive=1, cpus=4-7, task C

We have three cpusets - the top cpuset and two children, 'a' and 'b'.

We have three tasks, A, B and C. Task A is running in the top cpuset,
with access to all 8 cpus on the system. Tasks B and C are each in
a child cpuset, with access to just 4 cpus.

By your patch, the cpu_exclusive cpusets 'a' and 'b' partition the
sched domains in two halves, each covering 4 of the systems 8 cpus.
(That, or I'm still a sched domain idiot - quite possible.)

As a result, task A is screwed. If it happens to be on any of cpus
0-3 when the above is set up and the sched domains become partitioned,
it will never be considered for load balancing on any of cpus 4-7.
Or vice versa, if it is on any of cpus 4-7, it has no chance of
subsequently running on cpus 0-3.

ok I see the issue here, although the above has been the case all along.
I think the main issue here is that most of the users dont have to
do more than one level of partitioning (having to partitioning a system
with not more than 16 - 32 cpus, mostly less) and it is fairly easy to keep
track of exclusive cpusets and task placements and this is not such
a big problem at all. However I can see that with 1024 cpus this is not
trivial anymore to remember all of the partitioning especially if the
partioning is more than 2 levels deep and that its gets unwieldy

So I propose the following changes to cpusets

1. Have a new flag that takes care of sched domains. (say sched_domain)
Although I still think that we can still tag sched domains at the
back of exclusive cpusets, I think it best to separate the two
and maybe even add a separate CONFIG option for this. This way we
can keep any complexity arising out of this, such as hotplug/sched
domains all under the config.
2. The main change is that we dont allow tasks to be added to a cpuset
if it has child cpusets that also have the sched_domain flag turned on
(Maybe return a EINVAL if the user tries to do that)

Clearly one issue remains, tasks that are already running at the top cpuset.
Unless these are manually moved down to the correct cpuset heirarchy they
will continue to have the problem as before. I still dont have a simple
enough solution for this at the moment other than to document this at the
moment. But I still think on smaller systems this should be fairly easy task
for the administrator if they really know what they are doing. And the
fact that we have a separate flag to indicate the sched domain partitioning
should make it harder for them to shoot themselves in the foot.
Maybe there are other better ways to resolve this ?

One point I would argue against is to completely decouple cpusets and
sched domains. We do need a way to partition sched domains and doing
it along the lines of cpusets seems to be the most logical. This is
also much simpler in terms of additional lines of code needed to support
this feature. (as compared to adding a whole new API just to do this)

-Dinakar
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • [PATCH] cgroups: consolidate cgroup documents
    ... accounting/limiting the resources which processes in a cgroup can ... you to associate a set of CPUs and a set of memory nodes with the ... +the resources within a tasks current cpuset. ...
    (Linux-Kernel)
  • Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
    ... * This is perhaps the first non-trivial cpuset patch to come ... set of CPUs isolated in child cpusets, ... This same issue, in a strange way, comes up on the memory side, ... to partition the system, along cpuset boundaries. ...
    (Linux-Kernel)
  • Re: [Lse-tech] Re: [RFC PATCH] Dynamic sched domains aka Isolated cpusets
    ... make use of these CPUs ... An exclusive cpuset ensures that the cpus it has are disjoint from ... cpuset from its parent, so that this set of cpus are truly ... To isolate a cpuset x, x has to be an exclusive cpuset and its ...
    (Linux-Kernel)
  • Re: [PATCH] ia64 cpuset + build_sched_domains() mangles structures
    ... > cpuset functionality seems to work fine on a NUMA ppc64 box. ... The "dynamic sched domains" functionality has recently been merged into ... remainder of the non-isolated CPUs. ... +#ifdef CONFIG_NUMA ...
    (Linux-Kernel)
  • [PATCH v2] cpuset sched_load_balance flag
    ... When enabled in a cpuset it tells the kernel ... balancing on the CPUs in that cpuset, ... (non-overlapping sets of CPUs over which load balancing is ... +domains such that it only load balances within each sched domain. ...
    (Linux-Kernel)