Re: [RFC] (How to) Let idle CPUs sleep

From: Srivatsa Vaddagiri (vatsa_at_in.ibm.com)
Date: 06/30/05

  • Next message: Richard B. Johnson: "Re: reiser4 vs politics: linux misses out again"
    Date:	Thu, 30 Jun 2005 18:17:11 +0530
    To: Nick Piggin <nickpiggin@yahoo.com.au>
    
    

    [ This is a continuation of the thread at : http://lkml.org/lkml/2005/5/7/98 ]

    On Sun, May 08, 2005 at 02:14:23PM +1000, Nick Piggin wrote:
    > Rusty Russell wrote:
    > >On Sat, 2005-05-07 at 23:57 +0530, Srivatsa Vaddagiri wrote:
    > >
    > >>Two solutions have been proposed so far:
    > >>
    > >> A. As per Nick's suggestion, impose a max limit (say some 100 ms or
    > >> say a second, Nick?) on how long a idle CPU can avoid taking
    >
    > Yeah probably something around that order of magnitude. I suspect
    > there will fast be a point where either you'll get other timers
    > going off more frequently, and / or you simply get very quickly
    > diminishing returns on the amount of power saving gained from
    > increasing the period.

    I did some measurements on the average interval that a idle CPU is allowed to
    sleep before it wakes up. I also have some numbers on Xen comparing performance
    between the two approaches (100ms limit vs no limit on sleep interval).

    First, the avg sleep interval.

    Numbers were obtained on this configuration:

            Machine : 8-way Intel (P3) box
            Distro : RHEL3
            Kernel : 2.6.12

    Each time a CPU became idle, next_timer_interrupt was called to find the next
    nearest timer (sleep_time_req) and the CPU slept till either the time elapsed
    or it was interrupted by a interrupt (external/device, reschedule, invalidate
    or call_function). When any of these happen, the actual amount of time slept
    in jiffies (sleep_time_real) was measured. In order to measure what effect
    internal system events (reschedule/invalidate/call-function IPI) alone have on
    avg sleep interval, I redirected all external interrupts to CPU 0.

    Also the numbers were obtained at init runlevel 3 with system mostly idle.
    I basically logged in and ran the test to get the numbers. Apart from
    this test, the only other workloads active were the system daemons that are
    initiated at runlevel 3. The test cleared various per-cpu counters (which were
    used to capture sleep_time_real numbers), slept for 2 minutes and read the
    counters at the end of it.

    Below are the numbers I obtained from the test:

    CPU# # of Mean Std Dev Max Min
           samples

    1: 113 1088.690 761.352 2001.000 2.000
    2: 76 1633.197 640.230 2002.000 94.000
    3: 80 1552.300 682.050 2003.000 1.000
    4: 71 1735.606 531.744 2004.000 165.000
    5: 2525 48.669 6.391 50.000 2.000
    6: 74 1652.905 601.908 2006.000 97.000
    7: 1783 68.915 33.495 100.000 0.000

    [ CPU0 is excluded because its avg sleep interval is just 1 jiffie -
      The external jiffie timer keeps waking it every jiffy ]

    The mean above represents sleep_time_real, i.e the real time a CPU
    was allowed to sleep before something woke it. The best mean
    time was for CPU4 (1735.606 jiffies or ~1.7 sec). Also note that
    the maximum time it actually could sleep was ~2 sec.

    Digging further revealed that this max time was restricted by
    various timers kernel uses. Mostly it was found to be because of
    the slab allocator reap timer (it requests a timer every ~2sec on
    every CPU) and machine_check timer (MCE_RATE in arch/i386/kernel/cpu/mcheck/
    non-fatal.c ).

    Just to see the effect of increasing the timeouts of these timers,
    I changed MCE_RATE to 60*HZ (from 15*HZ), REAPTIMEOUT_CPUC to 20*HZ (from 2*HZ),
    and REAPTIMEOUT_LIST3 to 40*HZ (from 4*HZ). With this change and reruning
    my test under the same conditions, the new numbers that were obtained for
    sleep_time_real are:

    CPU# # of Mean Std Dev Max Min
           samples

    1: 24 5229.833 7341.283 20001.000 4.000
    2: 2520 49.797 2.745 50.000 3.000
    3: 8 15002.250 8575.149 20003.000 1127.000
    4: 8 15002.875 8557.532 20004.000 1156.000
    5: 8 15003.750 8537.711 20005.000 1189.000
    6: 996 125.923 1209.668 30000.000 2.000
    7: 134 940.269 176.835 1060.000 310.000

    Note that now the best average sleep_time_real is now 15002.875 (~15 sec)
    for CPU4 and max was 20004.000 (~20 sec).

    Now I am not saying that we can blindly increase these timeouts
    (REAPTIMEOUT_CPUC/REAPTIMEOUT_LIST3/MCE_RATE). But I think that we can
    make these timers idle-cpu aware. For example, if the slab timer notices
    that it is running on a idle CPU and the numbers of slabs it is actually
    reaping have come down very much, then it can request for longer timeouts
    instead of every ~2sec the way it does now.

    It is probably good for virtualization/power-management env that we identify
    such short timeouts requested in kernel and make them idle-cpu aware to
    the extend possible. In that case, restricting max sleep time to 100ms, IMO,
    is not going to be efficient.

    In order to see the benefit of sleeping longer (and rather than waking
    every 100ms), I ran some benchmarks on Xen. Basically I measured
    the throughput of a CPU-intensive benchmark (which finds first 30 random prime
    numbers). Time taken to find the 30 prime numbers was used as a benchmark
    to compare the two approaches.

    Configuration used was:

            Machine : 4-way Intel Xeon box (HT disabled in BIOS)
            Distro : RHEL4
            Kernel : Xen-unstable kernel (based on 2.6.11) for both
                              Domain 0 and guest domains
            
    Xen-unstable was used since it provides SMP guest support. Additional 4-way
    domains were started using the ttylinux distro
    (http://easynews.dl.sourceforge.net/sourceforge/xen/ttylinux-xen.bz2).

    The prime-number benchmark was run in one of the domains while the other
    domains were idle. The throughput of the benchmark was measued once with
    a total of 30 4-way domains started and another time with 100 4-way domains
    started. In each case, to ensure that virtual-cpu/physical-cpu ration remains
    same, I pinned each virtual CPU of each domain to corresponding physical CPU.
    For ex: virtual CPU 0 of domains 1-30 was pinned physical CPU 0, virtual CPU 1
    of domains 1-30 was pinned to physical CPU 1 and so on.

    Case A : No-limit-on-sleep-interval

                                    Avg of 10 runs Std Dev

    30 domains started 28.55 sec 0.009
    100 domains started 28.835 sec 0.011

    Note that in going from 30->100 domains, the performance has deteriorated by
    only 0.9%.

    Case B : 100ms limit on sleep interval. Since by default Xen guest domain kernel
             follows case A, it was modified to restrict sleep interval to 100ms

                                              Avg of 10 runs Std Dev

    30 domains started 28.867 0.010
    100 domains started 29.993 0.009

    Note that in going from 30->100 domains, there is a deterioration of 3.9%

    Comparing the two approaches,

                            No limit 100 ms limit Difference

            30 domains 28.55 28.867 1.1%
            100 domains 28.835 29.993 4%

    The difference is growing with increasing virtualization. I have heard that
    the number of partitions can be much more (>300?) in some high end zSeries
    machines.

    So those are the numbers. To conclude:

            - Currently, in 2.6.12 kernel, the avg real sleep time obtained
              is ~1.7 sec
            - We can (and IMO we must) look at the short timers in use by the
              kernel and potentially increase this avg real sleep time
              further (atleast for idle CPUs). This mean waking up every 100 ms
              makes it inefficient.
            - Even if we decide to go for the simple approach, we should increase
              the sleep-time limit to few seconds. Don't know what that means
              for load-balancing.

    -- 
    Thanks and Regards,
    Srivatsa Vaddagiri,
    Linux Technology Center,
    IBM Software Labs,
    Bangalore, INDIA - 560017
    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.kernel.org
    More majordomo info at  http://vger.kernel.org/majordomo-info.html
    Please read the FAQ at  http://www.tux.org/lkml/
    

  • Next message: Richard B. Johnson: "Re: reiser4 vs politics: linux misses out again"

    Relevant Pages

    • Re: [RFC] (How to) Let idle CPUs sleep
      ... turns out that if we restrict the amount of time idle cpus are ... cpu sleeps. ... * local timer ticks. ... +int idle_balance_retry ...
      (Linux-Kernel)
    • Re: [patch 6/6] x86: add c1e aware idle function
      ... This excludes those machines from high ... To work nicely with C1E enabled machines we use a separate idle ... This allows us to do timer broadcasting ... Does the boot CPU ...
      (Linux-Kernel)
    • Re: 2.6.25-rc5-git6: Reported regressions from 2.6.24
      ... With NO_HZ enabled CPU1 is in a long idle sleep. ... boot process there is probably no timer pending on CPU1, ... wheel on the idle CPU when a timer gets added from some other CPU. ...
      (Linux-Kernel)
    • Re: [PATCH] sky2: Use deferrable timer for watchdog
      ... deferrable timer - if the machine is 100% IDLE there is no one needing ... CPU - that is going to cause an IPI, reschedule, TLB invalidation etc. ... many machines are idle waiting for network data. ... go IDLE if there are deferrable timers. ...
      (Linux-Kernel)
    • Re: 2.6.16-rc6-mm2
      ... CPU: Trace cache: 12K uops, ... Calibrating delay using timer specific routine.. ... # ACPI Support ...
      (Linux-Kernel)