SD_SHARE_CPUPOWER breaks scheduler fairness

From: Steve Rotolo (steve.rotolo_at_ccur.com)
Date: 05/31/05

    To: linux-kernel@vger.kernel.org
    Date:	Tue, 31 May 2005 13:46:48 -0400
    
    

    The SD_SHARE_CPUPOWER flag in SMT scheduling domains (hyperthreaded
    systems) can starve SCHED_OTHER tasks and even hang the system. A
    long-running (or runaway) SCHED_FIFO task causes SCHED_OTHER tasks to
    get stuck on the sibling cpu's runqueue without any chance to run. The
    sibling cpu simply stays idle with tasks on its runqueue for as long as
    the SCHED_FIFO task runs on the other sibling cpu. The culprit is
    dependent_sleeper() in sched.c.

    I guess SD_SHARE_CPUPOWER is supposed to cause the scheduler to
    prohibit non-real-time tasks from running on a cpu while a real-time
    task is running on the sibling cpu. The problem is that SCHED_OTHER
    tasks are not migrated to a different runqueue and essentially get stuck
    on a dead runqueue until either the SCHED_FIFO task yields or the
    load-balancer moves them. Unfortunately, the load-balancer will never
    migrate a task if the runqueue lengths are not sufficiently out of
    balance. Even more unfortunate, the load-balancer will actually move
    tasks *to* the dead runqueue if it is less busy. And still worse, since
    SD_WAKE_IDLE is also set in the scheduling domain, the dead cpu will
    actually attract waking tasks to it because it is idle! The cpu becomes
    a sort of black hole, sucking in innocent tasks so they can no longer
    run.
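
    The "black hole" effect can be sketched with another toy model. It
    assumes a wake placement rule in the spirit of SD_WAKE_IDLE (prefer any
    idle cpu) and is not the kernel's actual wakeup path. The toy_* names
    and the `blocked` field are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_rq {
    bool cpu_idle;  /* nothing is executing on this cpu */
    bool blocked;   /* sibling runs SCHED_FIFO, so SCHED_OTHER may not run here */
    int  nr_queued; /* runnable tasks waiting on this runqueue */
};

/* Assumed placement rule: pick the first idle cpu, else stay on `prev`. */
static size_t toy_wake_cpu(const struct toy_rq *rq, size_t n, size_t prev)
{
    assert(n > 0 && prev < n);
    for (size_t i = 0; i < n; i++)
        if (rq[i].cpu_idle)
            return i;
    return prev;
}

/* Place a waking task; returns true only if it starts running right away. */
static bool toy_wake_task(struct toy_rq *rq, size_t n, size_t prev)
{
    size_t cpu = toy_wake_cpu(rq, n, prev);
    if (rq[cpu].cpu_idle && !rq[cpu].blocked) {
        rq[cpu].cpu_idle = false; /* the task takes the cpu */
        return true;
    }
    rq[cpu].nr_queued++; /* queued; on a blocked cpu it just sits there */
    return false;
}
```

    In this model a blocked cpu never stops being idle, so every wakeup is
    steered to it and its queue only grows.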

    The worst-case scenario is when there are N spinning SCHED_FIFO tasks on
    an N-way hyperthreaded system (2N virtual cpus). This hangs the system,
    since nothing else can run on any of the virtual cpus. If you turn off
    the SD_SHARE_CPUPOWER flag, the system stays fully functional until you
    have N*2 spinners hogging all the virtual cpus.
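
    The arithmetic behind the N versus N*2 figures, as a small sketch. It
    assumes the worst-case placement (each spinner lands on a different
    core before any core gets a second one) and two virtual cpus per core.
    toy_usable_cpus() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Number of virtual cpus still able to run SCHED_OTHER tasks.
 * With the flag set, each spinner costs its own cpu plus the blocked sibling;
 * with the flag clear, each spinner costs only the cpu it occupies.
 */
static int toy_usable_cpus(int n_cores, int spinners, bool share_cpupower)
{
    assert(n_cores > 0 && spinners >= 0);
    int logical = 2 * n_cores;
    int lost = share_cpupower ? 2 * spinners : spinners;
    return lost >= logical ? 0 : logical - lost;
}
```

    For a 4-core box, 4 spinners leave 0 usable cpus with the flag set and
    4 with it clear; it takes 8 spinners to reach 0 with the flag clear.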

    I get the same behavior from 2.6.9 to 2.6.12-rc5. So is this a bug or a
    feature?

    -- 
    Steve Rotolo
    Concurrent Computer Corporation
    
