1. 09 Feb, 2014 1 commit
  2. 13 Jan, 2014 8 commits
    • Daniel Lezcano's avatar
      sched: Reduce trigger_load_balance() parameters · 7caff66f
      Daniel Lezcano authored
      
      
      The cpu information is already stored in the struct rq, so no need to pass it
      as parameter to the trigger_load_balance function.
      
      Cc: linaro-kernel@lists.linaro.org
      Cc: preeti.lkml@gmail.com
      Cc: mingo@redhat.com
      Cc: peterz@infradead.org
      Signed-off-by: default avatarDaniel Lezcano <daniel.lezcano@linaro.org>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1389008085-9069-2-git-send-email-daniel.lezcano@linaro.org
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      7caff66f
    • Peter Zijlstra's avatar
      sched/deadline: Remove the sysctl_sched_dl knobs · 1724813d
      Peter Zijlstra authored
      
      
      Remove the deadline specific sysctls for now. The problem with them is
      that the interaction with the exisiting rt knobs is nearly impossible
      to get right.
      
      The current (as per before this patch) situation is that the rt and dl
      bandwidth is completely separate and we enforce rt+dl < 100%. This is
      undesirable because this means that the rt default of 95% leaves us
      hardly any room, even though dl tasks are saver than rt tasks.
      
      Another proposed solution was (a discarted patch) to have the dl
      bandwidth be a fraction of the rt bandwidth. This is highly
      confusing imo.
      
      Furthermore neither proposal is consistent with the situation we
      actually want; which is rt tasks ran from a dl server. In which case
      the rt bandwidth is a direct subset of dl.
      
      So whichever way we go, the introduction of dl controls at this point
      is painful. Therefore remove them and instead share the rt budget.
      
      This means that for now the rt knobs are used for dl admission control
      and the dl runtime is accounted against the rt runtime. I realise that
      this isn't entirely desirable either; but whatever we do we appear to
      need to change the interface later, so better have a small interface
      for now.
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-zpyqbqds1r0vyxtxza1e7rdc@git.kernel.org
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      1724813d
    • Juri Lelli's avatar
      sched/deadline: speed up SCHED_DEADLINE pushes with a push-heap · 6bfd6d72
      Juri Lelli authored
      Data from tests confirmed that the original active load balancing
      logic didn't scale neither in the number of CPU nor in the number of
      tasks (as sched_rt does).
      
      Here we provide a global data structure to keep track of deadlines
      of the running tasks in the system. The structure is composed by
      a bitmask showing the free CPUs and a max-heap, needed when the system
      is heavily loaded.
      
      The implementation and concurrent access scheme are kept simple by
      design. However, our measurements show that we can compete with sched_rt
      on large multi-CPUs machines [1].
      
      Only the push path is addressed, the extension to use this structure
      also for pull decisions is straightforward. However, we are currently
      evaluating different (in order to decrease/avoid contention) data
      structures to solve possibly both problems. We are also going to re-run
      tests considering recent changes inside cpupri [2].
      
       [1] http://retis.sssup.it/~jlelli/papers/Ospert11Lelli.pdf
       [2] http://www.spinics.net/lists/linux-rt-users/msg06778.html
      
      Signed-off-by: default avatarJuri Lelli <juri.lelli@gmail.com>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-14-git-send-email-juri.lelli@gmail.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      6bfd6d72
    • Dario Faggioli's avatar
      sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks · 332ac17e
      Dario Faggioli authored
      
      
      In order of deadline scheduling to be effective and useful, it is
      important that some method of having the allocation of the available
      CPU bandwidth to tasks and task groups under control.
      This is usually called "admission control" and if it is not performed
      at all, no guarantee can be given on the actual scheduling of the
      -deadline tasks.
      
      Since when RT-throttling has been introduced each task group have a
      bandwidth associated to itself, calculated as a certain amount of
      runtime over a period. Moreover, to make it possible to manipulate
      such bandwidth, readable/writable controls have been added to both
      procfs (for system wide settings) and cgroupfs (for per-group
      settings).
      
      Therefore, the same interface is being used for controlling the
      bandwidth distrubution to -deadline tasks and task groups, i.e.,
      new controls but with similar names, equivalent meaning and with
      the same usage paradigm are added.
      
      However, more discussion is needed in order to figure out how
      we want to manage SCHED_DEADLINE bandwidth at the task group level.
      Therefore, this patch adds a less sophisticated, but actually
      very sensible, mechanism to ensure that a certain utilization
      cap is not overcome per each root_domain (the single rq for !SMP
      configurations).
      
      Another main difference between deadline bandwidth management and
      RT-throttling is that -deadline tasks have bandwidth on their own
      (while -rt ones doesn't!), and thus we don't need an higher level
      throttling mechanism to enforce the desired bandwidth.
      
      This patch, therefore:
      
       - adds system wide deadline bandwidth management by means of:
          * /proc/sys/kernel/sched_dl_runtime_us,
          * /proc/sys/kernel/sched_dl_period_us,
         that determine (i.e., runtime / period) the total bandwidth
         available on each CPU of each root_domain for -deadline tasks;
      
       - couples the RT and deadline bandwidth management, i.e., enforces
         that the sum of how much bandwidth is being devoted to -rt
         -deadline tasks to stay below 100%.
      
      This means that, for a root_domain comprising M CPUs, -deadline tasks
      can be created until the sum of their bandwidths stay below:
      
          M * (sched_dl_runtime_us / sched_dl_period_us)
      
      It is also possible to disable this bandwidth management logic, and
      be thus free of oversubscribing the system up to any arbitrary level.
      Signed-off-by: default avatarDario Faggioli <raistlin@linux.it>
      Signed-off-by: default avatarJuri Lelli <juri.lelli@gmail.com>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      332ac17e
    • Dario Faggioli's avatar
      sched/deadline: Add SCHED_DEADLINE inheritance logic · 2d3d891d
      Dario Faggioli authored
      
      
      Some method to deal with rt-mutexes and make sched_dl interact with
      the current PI-coded is needed, raising all but trivial issues, that
      needs (according to us) to be solved with some restructuring of
      the pi-code (i.e., going toward a proxy execution-ish implementation).
      
      This is under development, in the meanwhile, as a temporary solution,
      what this commits does is:
      
       - ensure a pi-lock owner with waiters is never throttled down. Instead,
         when it runs out of runtime, it immediately gets replenished and it's
         deadline is postponed;
      
       - the scheduling parameters (relative deadline and default runtime)
         used for that replenishments --during the whole period it holds the
         pi-lock-- are the ones of the waiting task with earliest deadline.
      
      Acting this way, we provide some kind of boosting to the lock-owner,
      still by using the existing (actually, slightly modified by the previous
      commit) pi-architecture.
      
      We would stress the fact that this is only a surely needed, all but
      clean solution to the problem. In the end it's only a way to re-start
      discussion within the community. So, as always, comments, ideas, rants,
      etc.. are welcome! :-)
      Signed-off-by: default avatarDario Faggioli <raistlin@linux.it>
      Signed-off-by: default avatarJuri Lelli <juri.lelli@gmail.com>
      [ Added !RT_MUTEXES build fix. ]
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      2d3d891d
    • Juri Lelli's avatar
      sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logic · 1baca4ce
      Juri Lelli authored
      
      
      Introduces data structures relevant for implementing dynamic
      migration of -deadline tasks and the logic for checking if
      runqueues are overloaded with -deadline tasks and for choosing
      where a task should migrate, when it is the case.
      
      Adds also dynamic migrations to SCHED_DEADLINE, so that tasks can
      be moved among CPUs when necessary. It is also possible to bind a
      task to a (set of) CPU(s), thus restricting its capability of
      migrating, or forbidding migrations at all.
      
      The very same approach used in sched_rt is utilised:
       - -deadline tasks are kept into CPU-specific runqueues,
       - -deadline tasks are migrated among runqueues to achieve the
         following:
          * on an M-CPU system the M earliest deadline ready tasks
            are always running;
          * affinity/cpusets settings of all the -deadline tasks is
            always respected.
      
      Therefore, this very special form of "load balancing" is done with
      an active method, i.e., the scheduler pushes or pulls tasks between
      runqueues when they are woken up and/or (de)scheduled.
      IOW, every time a preemption occurs, the descheduled task might be sent
      to some other CPU (depending on its deadline) to continue executing
      (push). On the other hand, every time a CPU becomes idle, it might pull
      the second earliest deadline ready task from some other CPU.
      
      To enforce this, a pull operation is always attempted before taking any
      scheduling decision (pre_schedule()), as well as a push one after each
      scheduling decision (post_schedule()). In addition, when a task arrives
      or wakes up, the best CPU where to resume it is selected taking into
      account its affinity mask, the system topology, but also its deadline.
      E.g., from the scheduling point of view, the best CPU where to wake
      up (and also where to push) a task is the one which is running the task
      with the latest deadline among the M executing ones.
      
      In order to facilitate these decisions, per-runqueue "caching" of the
      deadlines of the currently running and of the first ready task is used.
      Queued but not running tasks are also parked in another rb-tree to
      speed-up pushes.
      Signed-off-by: default avatarJuri Lelli <juri.lelli@gmail.com>
      Signed-off-by: default avatarDario Faggioli <raistlin@linux.it>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-5-git-send-email-juri.lelli@gmail.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      1baca4ce
    • Dario Faggioli's avatar
      sched/deadline: Add SCHED_DEADLINE structures & implementation · aab03e05
      Dario Faggioli authored
      
      
      Introduces the data structures, constants and symbols needed for
      SCHED_DEADLINE implementation.
      
      Core data structure of SCHED_DEADLINE are defined, along with their
      initializers. Hooks for checking if a task belong to the new policy
      are also added where they are needed.
      
      Adds a scheduling class, in sched/dl.c and a new policy called
      SCHED_DEADLINE. It is an implementation of the Earliest Deadline
      First (EDF) scheduling algorithm, augmented with a mechanism (called
      Constant Bandwidth Server, CBS) that makes it possible to isolate
      the behaviour of tasks between each other.
      
      The typical -deadline task will be made up of a computation phase
      (instance) which is activated on a periodic or sporadic fashion. The
      expected (maximum) duration of such computation is called the task's
      runtime; the time interval by which each instance need to be completed
      is called the task's relative deadline. The task's absolute deadline
      is dynamically calculated as the time instant a task (better, an
      instance) activates plus the relative deadline.
      
      The EDF algorithms selects the task with the smallest absolute
      deadline as the one to be executed first, while the CBS ensures each
      task to run for at most its runtime every (relative) deadline
      length time interval, avoiding any interference between different
      tasks (bandwidth isolation).
      Thanks to this feature, also tasks that do not strictly comply with
      the computational model sketched above can effectively use the new
      policy.
      
      To summarize, this patch:
       - introduces the data structures, constants and symbols needed;
       - implements the core logic of the scheduling algorithm in the new
         scheduling class file;
       - provides all the glue code between the new scheduling class and
         the core scheduler and refines the interactions between sched/dl
         and the other existing scheduling classes.
      Signed-off-by: default avatarDario Faggioli <raistlin@linux.it>
      Signed-off-by: default avatarMichael Trimarchi <michael@amarulasolutions.com>
      Signed-off-by: default avatarFabio Checconi <fchecconi@gmail.com>
      Signed-off-by: default avatarJuri Lelli <juri.lelli@gmail.com>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      aab03e05
    • Dario Faggioli's avatar
      sched: Add new scheduler syscalls to support an extended scheduling parameters ABI · d50dde5a
      Dario Faggioli authored
      
      
      Add the syscalls needed for supporting scheduling algorithms
      with extended scheduling parameters (e.g., SCHED_DEADLINE).
      
      In general, it makes possible to specify a periodic/sporadic task,
      that executes for a given amount of runtime at each instance, and is
      scheduled according to the urgency of their own timing constraints,
      i.e.:
      
       - a (maximum/typical) instance execution time,
       - a minimum interval between consecutive instances,
       - a time constraint by which each instance must be completed.
      
      Thus, both the data structure that holds the scheduling parameters of
      the tasks and the system calls dealing with it must be extended.
      Unfortunately, modifying the existing struct sched_param would break
      the ABI and result in potentially serious compatibility issues with
      legacy binaries.
      
      For these reasons, this patch:
      
       - defines the new struct sched_attr, containing all the fields
         that are necessary for specifying a task in the computational
         model described above;
      
       - defines and implements the new scheduling related syscalls that
         manipulate it, i.e., sched_setattr() and sched_getattr().
      
      Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
      proof of concept and for developing and testing purposes. Making them
      available on other architectures is straightforward.
      
      Since no "user" for these new parameters is introduced in this patch,
      the implementation of the new system calls is just identical to their
      already existing counterpart. Future patches that implement scheduling
      policies able to exploit the new data structure must also take care of
      modifying the sched_*attr() calls accordingly with their own purposes.
      Signed-off-by: default avatarDario Faggioli <raistlin@linux.it>
      [ Rewrote to use sched_attr. ]
      Signed-off-by: default avatarJuri Lelli <juri.lelli@gmail.com>
      [ Removed sched_setscheduler2() for now. ]
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      d50dde5a
  3. 27 Nov, 2013 1 commit
  4. 06 Nov, 2013 1 commit
    • Preeti U Murthy's avatar
      sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus · 37dc6b50
      Preeti U Murthy authored
      
      
      nr_busy_cpus parameter is used by nohz_kick_needed() to find out the
      number of busy cpus in a sched domain which has SD_SHARE_PKG_RESOURCES
      flag set.  Therefore instead of updating nr_busy_cpus at every level
      of sched domain, since it is irrelevant, we can update this parameter
      only at the parent domain of the sd which has this flag set. Introduce
      a per-cpu parameter sd_busy which represents this parent domain.
      
      In nohz_kick_needed() we directly query the nr_busy_cpus parameter
      associated with the groups of sd_busy.
      
      By associating sd_busy with the highest domain which has
      SD_SHARE_PKG_RESOURCES flag set, we cover all lower level domains
      which could have this flag set and trigger nohz_idle_balancing if any
      of the levels have more than one busy cpu.
      
      sd_busy is irrelevant for asymmetric load balancing. However sd_asym
      has been introduced to represent the highest sched domain which has
      SD_ASYM_PACKING flag set so that it can be queried directly when
      required.
      
      While we are at it, we might as well change the nohz_idle parameter to
      be updated at the sd_busy domain level alone and not the base domain
      level of a CPU.  This will unify the concept of busy cpus at just one
      level of sched domain where it is currently used.
      
      Signed-off-by: Preeti U Murthy<preeti@linux.vnet.ibm.com>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: svaidy@linux.vnet.ibm.com
      Cc: vincent.guittot@linaro.org
      Cc: bitbucket@online.de
      Cc: benh@kernel.crashing.org
      Cc: anton@samba.org
      Cc: Morten.Rasmussen@arm.com
      Cc: pjt@google.com
      Cc: peterz@infradead.org
      Cc: mikey@neuling.org
      Link: http://lkml.kernel.org/r/20131030031252.23426.4417.stgit@preeti.in.ibm.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      37dc6b50
  5. 29 Oct, 2013 1 commit
  6. 16 Oct, 2013 1 commit
    • Peter Zijlstra's avatar
      sched: Fix race in migrate_swap_stop() · 74602315
      Peter Zijlstra authored
      
      
      There is a subtle race in migrate_swap, when task P, on CPU A, decides to swap
      places with task T, on CPU B.
      
      Task P:
        - call migrate_swap
      Task T:
        - go to sleep, removing itself from the runqueue
      Task P:
        - double lock the runqueues on CPU A & B
      Task T:
        - get woken up, place itself on the runqueue of CPU C
      Task P:
        - see that task T is on a runqueue, and pretend to remove it
          from the runqueue on CPU B
      
      Now CPUs B & C both have corrupted scheduler data structures.
      
      This patch fixes it, by holding the pi_lock for both of the tasks
      involved in the migrate swap. This prevents task T from waking up,
      and placing itself onto another runqueue, until after migrate_swap
      has released all locks.
      
      This means that, when migrate_swap checks, task T will be either
      on the runqueue where it was originally seen, or not on any
      runqueue at all. Migrate_swap deals correctly with of those cases.
      Tested-by: default avatarJoe Mario <jmario@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Cc: hannes@cmpxchg.org
      Cc: aarcange@redhat.com
      Cc: srikar@linux.vnet.ibm.com
      Cc: tglx@linutronix.de
      Cc: hpa@zytor.com
      Link: http://lkml.kernel.org/r/20131010181722.GO13848@laptop.programming.kicks-ass.net
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      74602315
  7. 09 Oct, 2013 7 commits
  8. 20 Sep, 2013 1 commit
  9. 12 Sep, 2013 1 commit
    • Peter Zijlstra's avatar
      sched/fair: Rewrite group_imb trigger · 6263322c
      Peter Zijlstra authored
      
      
      Change the group_imb detection from the old 'load-spike' detector to
      an actual imbalance detector. We set it from the lower domain balance
      pass when it fails to create a balance in the presence of task
      affinities.
      
      The advantage is that this should no longer generate the false
      positive group_imb conditions generated by transient load spikes from
      the normal balancing/bulk-wakeup etc. behaviour.
      
      While I haven't actually observed those they could happen.
      
      I'm not entirely happy with this patch; it somehow feels a little
      fragile.
      
      Nor does it solve the biggest issue I have with the group_imb code; it
      it still a fragile construct in that once we 'fixed' the imbalance
      we'll not detect the group_imb again and could end up re-creating it.
      
      That said, this patch does seem to preserve behaviour for the
      described degenerate case. In particular on my 2*6*2 wsm-ep:
      
        taskset -c 3-11 bash -c 'for ((i=0;i<9;i++)) do while :; do :; done & done'
      
      ends up with 9 spinners, each on their own CPU; whereas if you disable
      the group_imb code that typically doesn't happen (you'll get one pair
      sharing a CPU most of the time).
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-36fpbgl39dv4u51b6yz2ypz5@git.kernel.org
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      6263322c
  10. 09 Aug, 2013 1 commit
    • Tejun Heo's avatar
      cgroup: s/cgroup_subsys_state/cgroup_css/ s/task_subsys_state/task_css/ · 8af01f56
      Tejun Heo authored
      
      
      The names of the two struct cgroup_subsys_state accessors -
      cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
      The former clashes with the type name and the latter doesn't even
      indicate it's somehow related to cgroup.
      
      We're about to revamp large portion of cgroup API, so, let's rename
      them so that they're less awkward.  Most per-controller usages of the
      accessors are localized in accessor wrappers and given the amount of
      scheduled changes, this isn't gonna add any noticeable headache.
      
      Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
      to task_css().  This patch is pure rename.
      Signed-off-by: default avatarTejun Heo <tj@kernel.org>
      Acked-by: default avatarLi Zefan <lizefan@huawei.com>
      8af01f56
  11. 23 Jul, 2013 2 commits
    • Peter Zijlstra's avatar
      sched: Micro-optimize the smart wake-affine logic · 7d9ffa89
      Peter Zijlstra authored
      
      
      Smart wake-affine is using node-size as the factor currently, but the overhead
      of the mask operation is high.
      
      Thus, this patch introduce the 'sd_llc_size' percpu variable, which will record
      the highest cache-share domain size, and make it to be the new factor, in order
      to reduce the overhead and make it more reasonable.
      Tested-by: default avatarDavidlohr Bueso <davidlohr.bueso@hp.com>
      Tested-by: default avatarMichael Wang <wangyun@linux.vnet.ibm.com>
      Signed-off-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Acked-by: default avatarMichael Wang <wangyun@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Link: http://lkml.kernel.org/r/51D5008E.6030102@linux.vnet.ibm.com
      
      
      [ Tidied up the changelog. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      7d9ffa89
    • Vladimir Davydov's avatar
      sched: Move h_load calculation to task_h_load() · 68520796
      Vladimir Davydov authored
      The bad thing about update_h_load(), which computes hierarchical load
      factor for task groups, is that it is called for each task group in the
      system before every load balancer run, and since rebalance can be
      triggered very often, this function can eat really a lot of cpu time if
      there are many cpu cgroups in the system.
      
      Although the situation was improved significantly by commit a35b6466
      
      
      ('sched, cgroup: Reduce rq->lock hold times for large cgroup
      hierarchies'), the problem still can arise under some kinds of loads,
      e.g. when cpus are switching from idle to busy and back very frequently.
      
      For instance, when I start 1000 of processes that wake up every
      millisecond on my 8 cpus host, 'top' and 'perf top' show:
      
      Cpu(s): 17.8%us, 24.3%sy,  0.0%ni, 57.9%id,  0.0%wa,  0.0%hi,  0.0%si
      Events: 243K cycles
        7.57%  [kernel]               [k] __schedule
        7.08%  [kernel]               [k] timerqueue_add
        6.13%  libc-2.12.so           [.] usleep
      
      Then if I create 10000 *idle* cpu cgroups (no processes in them), cpu
      usage increases significantly although the 'wakers' are still executing
      in the root cpu cgroup:
      
      Cpu(s): 19.1%us, 48.7%sy,  0.0%ni, 31.6%id,  0.0%wa,  0.0%hi,  0.7%si
      Events: 230K cycles
       24.56%  [kernel]            [k] tg_load_down
        5.76%  [kernel]            [k] __schedule
      
      This happens because this particular kind of load triggers 'new idle'
      rebalance very frequently, which requires calling update_h_load(),
      which, in turn, calls tg_load_down() for every *idle* cpu cgroup even
      though it is absolutely useless, because idle cpu cgroups have no tasks
      to pull.
      
      This patch tries to improve the situation by making h_load calculation
      proceed only when h_load is really necessary. To achieve this, it
      substitutes update_h_load() with update_cfs_rq_h_load(), which computes
      h_load only for a given cfs_rq and all its ascendants, and makes the
      load balancer call this function whenever it considers if a task should
      be pulled, i.e. it moves h_load calculations directly to task_h_load().
      For h_load of the same cfs_rq not to be updated multiple times (in case
      several tasks in the same cgroup are considered during the same balance
      run), the patch keeps the time of the last h_load update for each cfs_rq
      and breaks calculation when it finds h_load to be uptodate.
      
      The benefit of it is that h_load is computed only for those cfs_rq's,
      which really need it, in particular all idle task groups are skipped.
      Although this, in fact, moves h_load calculation under rq lock, it
      should not affect latency much, because the amount of work done under rq
      lock while trying to pull tasks is limited by sched_nr_migrate.
      
      After the patch applied with the setup described above (1000 wakers in
      the root cgroup and 10000 idle cgroups), I get:
      
      Cpu(s): 16.9%us, 24.8%sy,  0.0%ni, 58.4%id,  0.0%wa,  0.0%hi,  0.0%si
      Events: 242K cycles
        7.57%  [kernel]                  [k] __schedule
        6.70%  [kernel]                  [k] timerqueue_add
        5.93%  libc-2.12.so              [.] usleep
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1373896159-1278-1-git-send-email-vdavydov@parallels.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      68520796
  12. 27 Jun, 2013 7 commits
  13. 19 Jun, 2013 1 commit
  14. 28 May, 2013 2 commits
  15. 07 May, 2013 2 commits
  16. 04 May, 2013 1 commit
    • Frederic Weisbecker's avatar
      sched: Keep at least 1 tick per second for active dynticks tasks · 265f22a9
      Frederic Weisbecker authored
      
      
      The scheduler doesn't yet fully support environments
      with a single task running without a periodic tick.
      
      In order to ensure we still maintain the duties of scheduler_tick(),
      keep at least 1 tick per second.
      
      This makes sure that we keep the progression of various scheduler
      accounting and background maintainance even with a very low granularity.
      Examples include cpu load, sched average, CFS entity vruntime,
      avenrun and events such as load balancing, amongst other details
      handled in sched_class::task_tick().
      
      This limitation will be removed in the future once we get
      these individual items to work in full dynticks CPUs.
      Suggested-by: default avatarIngo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      265f22a9
  17. 26 Apr, 2013 1 commit
    • Vincent Guittot's avatar
      sched: Fix init NOHZ_IDLE flag · 25f55d9d
      Vincent Guittot authored
      
      
      On my SMP platform which is made of 5 cores in 2 clusters, I
      have the nr_busy_cpu field of sched_group_power struct that is
      not null when the platform is fully idle - which makes the
      scheduler unhappy.
      
      The root cause is:
      
      During the boot sequence, some CPUs reach the idle loop and set
      their NOHZ_IDLE flag while waiting for others CPUs to boot. But
      the nr_busy_cpus field is initialized later with the assumption
      that all CPUs are in the busy state whereas some CPUs have
      already set their NOHZ_IDLE flag.
      
      More generally, the NOHZ_IDLE flag must be initialized when new
      sched_domains are created in order to ensure that NOHZ_IDLE and
      nr_busy_cpus are aligned.
      
      This condition can be ensured by adding a synchronize_rcu()
      between the destruction of old sched_domains and the creation of
      new ones so the NOHZ_IDLE flag will not be updated with old
      sched_domain once it has been initialized. But this solution
      introduces a additionnal latency in the rebuild sequence that is
      called during cpu hotplug.
      
      As suggested by Frederic Weisbecker, another solution is to have
      the same rcu lifecycle for both NOHZ_IDLE and sched_domain
      struct. A new nohz_idle field is added to sched_domain so both
      status and sched_domain will share the same RCU lifecycle and
      will be always synchronized. In addition, there is no more need
      to protect nohz_idle against concurrent access as it is only
      modified by 2 exclusive functions called by local cpu.
      
      This solution has been prefered to the creation of a new struct
      with an extra pointer indirection for sched_domain.
      
      The synchronization is done at the cost of :
      
       - An additional indirection and a rcu_dereference for accessing nohz_idle.
       - We use only the nohz_idle field of the top sched_domain.
      Signed-off-by: default avatarVincent Guittot <vincent.guittot@linaro.org>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: linaro-kernel@lists.linaro.org
      Cc: peterz@infradead.org
      Cc: fweisbec@gmail.com
      Cc: pjt@google.com
      Cc: rostedt@goodmis.org
      Cc: efault@gmx.de
      Link: http://lkml.kernel.org/r/1366729142-14662-1-git-send-email-vincent.guittot@linaro.org
      
      
      [ Fixed !NO_HZ build bug. ]
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      25f55d9d
  18. 22 Apr, 2013 1 commit
    • Frederic Weisbecker's avatar
      sched: Kick full dynticks CPU that have more than one task enqueued. · 9f3660c2
      Frederic Weisbecker authored
      
      
      Kick the tick on full dynticks CPUs when they get more
      than one task running on their queue. This makes sure that
      local fairness is maintained by the tick on the destination.
      
      This is done regardless of these tasks' class. We should
      be able to be more clever in the future depending on these. eg:
      a CPU that runs a SCHED_FIFO task doesn't need to maintain
      fairness against local pending tasks of the fair class.
      
      But keep things simple for now.
      Signed-off-by: default avatarFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      9f3660c2