[patch] Re: PostgreSQL pgbench performance regression in 2.6.23+




On Mon, 2008-05-26 at 20:28 -0400, Greg Smith wrote:

I did again get useful results here with the stock 2.6.26.git kernel and
default parameters using Peter's small patch to adjust se.waker.

What I found most interesting was how the results changed when I set
/proc/sys/kernel/sched_features = 0, without doing anything with batch
mode. The default for that is 1101111111=895. What I then did was run
through setting each of those bits off one by one to see which feature(s)
were getting in the way here. The two that mattered a lot were 895-32=863
(no SCHED_FEAT_SYNC_WAKEUPS ) and 895-2=893 (no
SCHED_FEAT_WAKEUP_PREEMPT). Combining those two but keeping the rest of
the features on (895-32-2=861) actually gave the best result I've ever
seen here, better than with all the features disabled. Tossing out all
the tests I did that didn't show anything useful, here's my table of the
interesting results.

Clients .22.19 .26.git waker f=0 f=893 f=863 f=861
1 7660 11043 11041 9214 11204 9232 9433
2 17798 11452 16306 16916 11165 16686 16097
3 29612 13231 18476 24202 11348 26210 26906
4 25584 13053 17942 26639 11331 25094 25679
6 25295 12263 18472 28918 11761 30525 33297
8 24344 11748 19109 32730 12190 31775 35912
10 23963 11612 19537 31688 12331 29644 36215
15 23026 11414 19518 33209 13050 28651 36452
20 22549 11332 19029 32583 13544 25776 35707
30 22074 10743 18884 32447 14191 21772 33501
40 21495 10406 18609 31704 11017 20600 32743
50 20051 10534 17478 29483 14683 19949 31047
60 18690 9816 17467 28614 14817 18681 29576

Note that compared to earlier test runs, I replaced the 5 client case with
a 60 client one to get more data on the top end. I also wouldn't pay too
much attention to the single client case; that one really bounces around a
lot on most of the kernel revs, even with me doing 5 runs and using the
median.

Care to give the below a whirl? If fixes the over-enthusiastic affinity
bug in a less restrictive way. It doesn't attempt to addresss the needs
of any particular load though, that needs more thought (tricky issue).

With default features, I get the below.

2.6.26-smp x86_64
1 10121.600913
2 14360.229517
3 17048.770371
4 18748.777814
5 22086.493358
6 24913.416187
8 27976.026783
10 29346.503261
15 29157.239431
20 28392.257204
30 26590.199787
40 24422.481578
50 23305.981434

(I can get a bit more by disabling HR_TICK along with a dinky patchlet
to reduce overhead when it's disabled. Bottom line is that the bug is
fixed though, maximizing performance is separate issue imho)

Prevent short-running wakers of short-running threads from overloading a single
cpu via wakeup affinity, and wire up disconnected debug option.

Signed-off-by: Mike Galbraith <efault@xxxxxx>

kernel/sched_fair.c | 25 ++++++++++++++-----------
1 files changed, 14 insertions(+), 11 deletions(-)

Index: linux-2.6.26.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.26.git.orig/kernel/sched_fair.c
+++ linux-2.6.26.git/kernel/sched_fair.c
@@ -1057,16 +1057,27 @@ wake_affine(struct rq *rq, struct sched_
struct task_struct *curr = this_rq->curr;
unsigned long tl = this_load;
unsigned long tl_per_task;
+ int bad_imbalance;

- if (!(this_sd->flags & SD_WAKE_AFFINE))
+ if (!(this_sd->flags & SD_WAKE_AFFINE) || !sched_feat(AFFINE_WAKEUPS))
return 0;

/*
+ * If sync wakeup then subtract the (maximum possible)
+ * effect of the currently running task from the load
+ * of the current CPU:
+ */
+ if (sync && tl)
+ tl -= curr->se.load.weight;
+
+ bad_imbalance = 100*(tl + p->se.load.weight) > imbalance*load;
+
+ /*
* If the currently running task will sleep within
* a reasonable amount of time then attract this newly
* woken task:
*/
- if (sync && curr->sched_class == &fair_sched_class) {
+ if (sync && !bad_imbalance && curr->sched_class == &fair_sched_class)
{
if (curr->se.avg_overlap < sysctl_sched_migration_cost &&
p->se.avg_overlap < sysctl_sched_migration_cost)
return 1;
@@ -1075,16 +1086,8 @@ wake_affine(struct rq *rq, struct sched_
schedstat_inc(p, se.nr_wakeups_affine_attempts);
tl_per_task = cpu_avg_load_per_task(this_cpu);

- /*
- * If sync wakeup then subtract the (maximum possible)
- * effect of the currently running task from the load
- * of the current CPU:
- */
- if (sync)
- tl -= current->se.load.weight;
-
if ((tl <= load && tl + target_load(prev_cpu, idx) <= tl_per_task) ||
- 100*(tl + p->se.load.weight) <= imbalance*load) {
+ !bad_imbalance) {
/*
* This domain has SD_WAKE_AFFINE and
* p is cache cold in this domain, and


Some still open questions after this long investigation that I'd like to
know the answers to are:

1) Why are my 2.6.26.git results so dramatically worse than the ones Mike
posted? I'm not sure what was different about his test setup here. The
2.6.22 results are pretty similar, and the fully tuned ones as well, so
the big difference on that column bugs me.

Because those were git _with_ Peter's patch. I was looking at the
difference between a non-broken 2.6.26 with and without minimizing
preemption, to match the way I tested 2.6.22.

2) Mike suggested a patch to 2.6.25 in this thread that backports the
feature for disabling SCHED_FEAT_SYNC_WAKEUPS. Would it be reasonable to
push that into 2.6.25.5? It's clearly quite useful for this load and
therefore possibly others.

If the patch above flies, imho, that should be folded into the backport
to stable. The heart of the patch is a bugfix, so definitely stable
material. Whether to enable the debugging/tweaking knobs as well is a
different question. (I would do it, but I ain't the maintainer;)

3) Peter's se.waker patch is a big step forward on this workload without
any tuning, closing a significant amount of the gap between the default
setup and what I get with the two troublesome features turned off
altogether. What issues might there be with pushing that into the stock
{2.6.25|2.6.26} kernel?

(it doesn't fully address the 1:N needs, that needs much more thought)

4) What known workloads are there that suffer if SCHED_FEAT_SYNC_WAKEUPS
and SCHED_FEAT_WAKEUP_PREEMPT are disabled? I'd think that any attempt to
tune those code paths would need my case for "works better when
SYNC/PREEMPT wakeups disabled" as well as a case that works worse to
balance modifications against.

I suspect very many, certainly any load where latency is of major
importance. Peak performance of the mysql+oltp benchmark for sure is
injured with your settings. As a really good demonstration of how
important wakeup preemption can be, try running the attached testcase,
which was distilled from a real world load and posted to lkml a few
years ago, without wakeup preemption and nailed to one cpu.

5) Once (4) has identified some tests cases, what else might be done to
make the default behavior better without killing the situations it's
intended for?

See patch :)

-Mike
#include <stdlib.h>
#include <stdio.h>
#include <signal.h>
#include <unistd.h>

#include <sys/types.h>
#include <sys/wait.h>

volatile unsigned long loop = 10000000;

void
handler (int n)
{
if (loop > 0)
--loop;
}

static int
child (void)
{
pid_t ppid = getppid ();

sleep (1);
while (1)
kill (ppid, SIGUSR1);
return 0;
}

int
main (int argc, char **argv)
{
pid_t child_pid;
int r;

loop = argc > 1 ? strtoul (argv[1], NULL, 10) : 10000000;
printf ("expecting to receive %lu signals\n", loop);

if ((child_pid = fork ()) == 0)
exit (child ());

signal (SIGUSR1, handler);
while (loop)
sleep (1);
r = kill (child_pid, SIGTERM);
waitpid (child_pid, NULL, 0);
return 0;
}


Relevant Pages

  • Re: PostgreSQL pgbench performance regression in 2.6.23+
    ... Also, that requires being intrusive into people's setup scripts, which bothers me a lot more than doing a bit of kernel tuning at system startup. ... I did again get useful results here with the stock 2.6.26.git kernel and default parameters using Peter's small patch to adjust se.waker. ... Combining those two but keeping the rest of the features on actually gave the best result I've ever seen here, better than with all the features disabled. ... Mike suggested a patch to 2.6.25 in this thread that backports the feature for disabling SCHED_FEAT_SYNC_WAKEUPS. ...
    (Linux-Kernel)
  • Re: [2.6 patch] remove securebits
    ... Attached is what I consider only an RFC patch. ... In the absence of filesystem capability support, ... int (int option, unsigned long arg2, ... int security_task_prctl(int option, unsigned long arg2, unsigned long arg3, ...
    (Linux-Kernel)
  • Re: [PATCH] Updated: Reworked Cell OProfile: SPU mutex lock fix
    ... arch-independent pieces of the OProfile kernel driver that this patch ... kernel driver patches. ... into the kernel buffer without holding the buffer_mutex lock. ... int spu_sync_start; ...
    (Linux-Kernel)
  • [PATCH] Altix I/O code cleanup
    ... This is my first patch for this - more to come ... -static unsigned int boot_options = OPTION_NONE; ... -} CONFIGCLASS; ... -} MEMORYTYPE; ...
    (Linux-Kernel)
  • [PATCH 3/4] new timeofday x86-64 arch specific changes (v. B1)
    ... It applies on top of my timeofday-core_B1 patch. ... static int __init acpi_parse_hpet ... * Driver to use the Power Management Timer available in some ... -static void cpufreq_delayed_get; ...
    (Linux-Kernel)