[PATCH] sched: smpnice work around for active_load_balance()
- From: Peter Williams <pwil3058@xxxxxxxxxxxxxx>
- Date: Tue, 28 Mar 2006 17:00:50 +1100
Problem:
It is undesirable for HT/MC packages to have more than one of their CPUs
busy if there are other packages that have all of their CPUs idle. This
involves moving the only running task (i.e. the one actually on the CPU)
off on to another CPU and is achieved by using active_load_balance() and
relying on the fact that (when it starts) the queue's migration thread
will preempt the sole running task and (therefore) make it movable. The
migration thread then moves it to an idle package.
Unfortunately, the mechanism for setting the run queues active_balance
flag is buried deep inside load_balance() and relies heavily on
find_busiest_group() and find_busiest_queue() reporting success even if
the busiest queue has only one task running. This requirement is not
currently met.
Solution:
The long term solution to this problem is provide an alternative
mechanism for triggering active load balancing for run queues that need it.
However, in the meantime, this patch modifies find_busiest_group() so
that (when idle != NEWLY_IDLE) it prefers groups with at least one CPU with more than one running task over those with only one and modifies find_busiest_queue() so that (when idle != NEWLY_IDLE) it prefers queues with more than one running task over those with only one. This means that the "busiest" queue found in load_balance() will only have one or less runnable tasks if there were no queues with more than one runnable task. When called in load_balance_newidle() they will have the existing functionality. These measure will prevent high load weight tasks that have a CPU to themselves from suppressing load balancing between other queues.
NB This patch should be backed out when a proper fix for the problems inherit in HT and MC systems is implemented.
Signed-off-by: Peter Williams <pwil3058@xxxxxxxxxxxxxx>
Peter
--
Peter Williams pwil3058@xxxxxxxxxxxxxx
"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce
Index: MM-2.6.X/kernel/sched.c
===================================================================
--- MM-2.6.X.orig/kernel/sched.c 2006-03-27 17:02:53.000000000 +1100
+++ MM-2.6.X/kernel/sched.c 2006-03-28 16:57:58.000000000 +1100
@@ -2098,6 +2098,7 @@ find_busiest_group(struct sched_domain *
unsigned long max_pull;
unsigned long busiest_load_per_task, busiest_nr_running;
unsigned long this_load_per_task, this_nr_running;
+ unsigned int busiest_has_loaded_cpus = idle == NEWLY_IDLE;
int load_idx;
max_load = this_load = total_load = total_pwr = 0;
@@ -2152,7 +2153,15 @@ find_busiest_group(struct sched_domain *
this = group;
this_nr_running = sum_nr_running;
this_load_per_task = sum_weighted_load;
- } else if (avg_load > max_load && nr_loaded_cpus) {
+ } else if (nr_loaded_cpus) {
+ if (avg_load > max_load || !busiest_has_loaded_cpus) {
+ max_load = avg_load;
+ busiest = group;
+ busiest_nr_running = sum_nr_running;
+ busiest_load_per_task = sum_weighted_load;
+ busiest_has_loaded_cpus = 1;
+ }
+ } else if (!busiest_has_loaded_cpus && avg_load < max_load) {
max_load = avg_load;
busiest = group;
busiest_nr_running = sum_nr_running;
@@ -2161,7 +2170,7 @@ find_busiest_group(struct sched_domain *
group = group->next;
} while (group != sd->groups);
- if (!busiest || this_load >= max_load)
+ if (!busiest || this_load >= max_load || busiest_nr_running == 0)
goto out_balanced;
avg_load = (SCHED_LOAD_SCALE * total_load) / total_pwr;
@@ -2265,12 +2274,19 @@ static runqueue_t *find_busiest_queue(st
{
unsigned long max_load = 0;
runqueue_t *busiest = NULL, *rqi;
+ unsigned int busiest_is_loaded = idle == NEWLY_IDLE;
int i;
for_each_cpu_mask(i, group->cpumask) {
rqi = cpu_rq(i);
- if (rqi->raw_weighted_load > max_load && rqi->nr_running > 1) {
+ if (rqi->nr_running > 1) {
+ if (rqi->raw_weighted_load > max_load || !busiest_is_loaded) {
+ max_load = rqi->raw_weighted_load;
+ busiest = rqi;
+ busiest_is_loaded = 1;
+ }
+ } else if (!busiest_is_loaded && rqi->raw_weighted_load > max_load) {
max_load = rqi->raw_weighted_load;
busiest = rqi;
}
- Follow-Ups:
- Re: [PATCH] sched: smpnice work around for active_load_balance()
- From: Siddha, Suresh B
- Re: [PATCH] sched: smpnice work around for active_load_balance()
- Prev by Date: Please pull from '85xx' branch of powerpc
- Next by Date: Suspend2-2.2.2 for 2.6.16.
- Previous by thread: Fix unlock_buffer() to work the same way as bit_unlock()
- Next by thread: Re: [PATCH] sched: smpnice work around for active_load_balance()
- Index(es):
Relevant Pages
- Re: [RFC] (How to) Let idle CPUs sleep
... turns out that if we restrict the amount of time idle cpus are ... cpu sleeps.
... * local timer ticks. ... +int idle_balance_retry ... (Linux-Kernel) - Too many timer interrupts in NO_HZ
... CPU in an SMP server. ... However I am interested to see CPU idle times
for couple minutes on ... There are way too many timer interrupts even though the CPUs
have ... entered tickless idle loop. ... (Linux-Kernel) - Re: Relationship between load average and CPU busy or CPU idle
... > Is there some kind of relationship between the load average figure and CPU
... Because Unix expects disk I/O to finish really soon ... idle time is
computed by looking at what the processors are ... (comp.unix.solaris) - Re: "REPORT: sd-0.46 vs cfs-v6 vs mainline 2.6.21-rc7 Beryl + Video + Audio"
... a 'smoother' desktop on SD. ... rate and in better CPU utilization? ...
The idle usage for cfs and SD goes down to nearly 0. ... (Linux-Kernel) - Re: Narrowed down bottlneck to disk, how can I find out which FS is being hammered? 100% utilization
... But wait time *is* idle time also. ... performing useful work. ...
Only if you consider a CPU doing nothing to be utilized. ... idle, *but* there is an outstanding
I/O request, then the time is logged ... (comp.unix.solaris)