Re: [PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)



Paul Jackson wrote:
So far as I can tell, the logic and design is ok.

I found some of the comments, function names and code factoring
to be confusing or incomplete. And I suspect that the rebuilding
of sched domains in the case that sched_power_savings_store()
is called in kernel/sched.c, on systems using cpusets, is not
needed or desirable (I could easily be wrong here - beware!).
Actually we do need it. See below.

I'm attaching a patch that has the changes that I would like
to suggest for your consideration. I have only recompiled this
patch, for one configuration - x86_64 with CPUSETS. I am hoping,
Max, that you can complete the testing.

The patch below applies to 2.6.27-rc1, plus the first patch,
"origin.patch" in Andrew's 2.6.27-rc1-mm1 quilt patch stack,
plus your (Max's) latest:
[PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)

Here's a description of most of what I noticed:

1) Unrelated spacing tweak, changing:
LIST_HEAD(q); /* queue of cpusets to be scanned*/
to:
LIST_HEAD(q); /* queue of cpusets to be scanned */
Check.

2) The phrasing:
Must be called with cgroup_lock held.
seems imprecise to me; "cgroup_lock" is the name of a wrapper
function, not of a lock. The underlying lock is cgroup_mutex,
which is still mentioned by name in various kernel/cpuset.c
comments, even though cgroup_mutex is static in kernel/cgroup.c
So I fiddled with the wording of this phrasing.
Check.

3) You removed the paragraph:
* ... May take callback_mutex during
* call due to the kfifo_alloc() and kmalloc() calls. May nest
* a call to the get_online_cpus()/put_online_cpus() pair.
* Must not be called holding callback_mutex, because we must not
* call get_online_cpus() while holding callback_mutex. Elsewhere
* the kernel nests callback_mutex inside get_online_cpus() calls.
* So the reverse nesting would risk an ABBA deadlock.

But you didn't replace it with an updated locking description.
I expanded and tweaked various locking comments.
I think it it's covered by the top level comment that starts with
/*
* There are two global mutexes guarding cpuset structures. The first
* is the main control groups cgroup_mutex, accessed via
* cgroup_lock()/cgroup_unlock().
....

But I'm ok with your suggested version.

4) The alignment of the right side of consecutive assignment statements,
as in:
ndoms = 0;
doms = NULL;
dattr = NULL;
csa = NULL;
or
*domains = doms;
*attributes = dattr;
is not usually done in kernel code. I suggest obeying convention,
and not aligning these here either.
Check.

5) The rebuilding of sched domains in the sched_power_savings_store()
routine in kernel/sched.c struck me as inappropriate for systems
that were managing sched domains using cpusets. So I made that
sched.c rebuild only apply if CONFIG_CPUSETS was not configured,
which in turn enabled me to remove rebuild_sched_domains() from
being externally visible outside kernel/cpuset.c

I don't know if this change is correct, however.
Actually it is appropriate, and there is one more user of the
arch_reinit_sched_domains() which is S390 topology updates.
Those things (mc_power and topology updates) have to update domain flags based
on the mc/smt power and current topology settings.
This is done in the
__rebuild_sched_domains()
...
SD_INIT(sd, ALLNODES);
...
SD_INIT(sd, MC);
...

SD_INIT(sd,X) uses one of SD initializers defined in the include/linux/topology.h
For example SD_CPU_INIT() includes BALANCE_FOR_PKG_POWER which expands to

#define BALANCE_FOR_PKG_POWER \
((sched_mc_power_savings || sched_smt_power_savings) ? \
SD_POWERSAVINGS_BALANCE : 0)

Yes it's kind convoluted :). Anyway, the point is that we need to rebuild the
domains when those settings change. We could probably write a simpler version
that just iterates existing domains and updates the flags. Maybe some other dat :)

6) The renaming of rebuild_sched_domains() to generate_sched_domains()
was only partial, and along with the added similarly named routines
seemed confusing to me. Also, that rename of rebuild_sched_domains()
was only partially accomplished, not being done in some comments
and in one printk kernel warning.

I reverted that rename.
Actually it was not much of a rename. The only rename was
rebuild_sched_domains() -> async_rebuild_sched_domains() and yes looks like I
missed one or two comment.
The other stuff was basically factoring out the function that generates the
domain masks/attrs from the function that actually tells the scheduler to
rebuild them.
I'd be ok with some other name for generate_sched_domains() if you think it's
confusing.

btw With your current proposal we have
rebuild_sched_domains() - just generates domain masks/attrs
async_rebuild_sched_domains() - generates domain masks/attrs and actually
rebuilds the domains

That I think is more confusing. So I guess my preference would be to have the
generation part separate. And like I explained above we do need to be able to
call rebuild_sched_domains() from the scheduler code (at least at this point).

I also reduced by one the number of functions needed to handle
the asynchronous invocation of this rebuild, essentially collapsing
your routines rebuild_sched_domains() and rebuild_domains_callback()
into a single routine, named rebuild_sched_domains_thread().

Thanks to the above change (5), the naming and structure of these
routines was no longer visible outside kernel/cpuset.c, making
this collapsing of two functions into one easier.
Yes if we did not have to call rebuild_sched_domains() from
arch_reinit_sched_domains() I would've done something similar.

Sounds like you're ok with the general approach and as I mentioned we do need
externally usable rebuild_sched_domains(). So how about this. I'll update
style/comments based on your suggestions but keep the generate_sched_domains()
and partition_sched_domains() split. Or if you have a better name for
generate_sched_domains() we'll use that instead.

Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/



Relevant Pages

  • Re: unable to send/receive mail "verify and rebuild the Entourage database"
    ... It's less confusing. ... the Entourage database. ... I'd prefer not a total reinstall of the program, ...
    (microsoft.public.mac.office.entourage)
  • Re: F5 is running an older version of my project
    ... There are some rare cases, when build doesn't build all what is needed. ... Then a rebuild works. ... as it makes debugging quite confusing! ... Perhaps this is just an uncommon quirk... ...
    (microsoft.public.dotnet.languages.csharp)
  • F5 is running an older version of my project
    ... However, when I build or rebuild the solution, it builds flawlessly... ... Is there a way to clear the older versions (other than deleting the ... as it makes debugging quite confusing! ...
    (microsoft.public.dotnet.languages.csharp)
  • lostandfoundconfig
    ... template in the CN=LostandFoundConfig. ... very confusing. ... CN=LostandFoundConfig container or just certain items to ... rebuild the address book? ...
    (microsoft.public.exchange2000.admin)
  • Re: lots of _RBLDAxxxx after table rebuilt
    ... The rebuild utility does not rename the original file until the ... rename that rebuild does after it rebuilds the file. ... Just do like the rebuild utility would do and delete the original ...
    (comp.databases.btrieve)