1. 22 Feb, 2013 1 commit
    • smp: make smp_call_function_many() use logic similar to smp_call_function_single() · 9a46ad6d
      Shaohua Li authored
      I'm testing a swapout workload on a two-socket Xeon machine.  The workload
      has 10 threads, and each thread sequentially accesses a separate memory
      region.  TLB flush overhead is very high in this workload.  For each page,
      page reclaim needs to move it from the active LRU list and then unmap it.
      Both steps need a TLB flush.  Since this is a multithreaded workload, TLB
      flushes happen on 10 CPUs.  On x86, TLB flush uses the generic
      smp_call_function code, so this workload stresses smp_call_function_many
      heavily.
      Without patch, perf shows:
      +  24.49%  [k] generic_smp_call_function_interrupt
      -  21.72%  [k] _raw_spin_lock
         - _raw_spin_lock
            + 79.80% __page_check_address
            + 6.42% generic_smp_call_function_interrupt
            + 3.31% get_swap_page
            + 2.37% free_pcppages_bulk
            + 1.75% handle_pte_fault
            + 1.54% put_super
            + 1.41% grab_super_passive
            + 1.36% __swap_duplicate
            + 0.68% blk_flush_plug_list
            + 0.62% swap_info_get
      +   6.55%  [k] flush_tlb_func
      +   6.46%  [k] smp_call_function_many
      +   5.09%  [k] call_function_interrupt
      +   4.75%  [k] default_send_IPI_mask_sequence_phys
      +   2.18%  [k] find_next_bit
      swapout throughput is around 1300M/s.
      With the patch, perf shows:
      -  27.23%  [k] _raw_spin_lock
         - _raw_spin_lock
            + 80.53% __page_check_address
            + 8.39% generic_smp_call_function_single_interrupt
            + 2.44% get_swap_page
            + 1.76% free_pcppages_bulk
            + 1.40% handle_pte_fault
            + 1.15% __swap_duplicate
            + 1.05% put_super
            + 0.98% grab_super_passive
            + 0.86% blk_flush_plug_list
            + 0.57% swap_info_get
      +   8.25%  [k] default_send_IPI_mask_sequence_phys
      +   7.55%  [k] call_function_interrupt
      +   7.47%  [k] smp_call_function_many
      +   7.25%  [k] flush_tlb_func
      +   3.81%  [k] _raw_spin_lock_irqsave
      +   3.78%  [k] generic_smp_call_function_single_interrupt
      swapout throughput is around 1400M/s.  So there is around a 7%
      improvement, and total cpu utilization doesn't change.
      Without the patch, cfd_data is shared by all CPUs.
      generic_smp_call_function_interrupt reads and writes cfd_data several
      times, which creates a lot of cache ping-pong.  With the patch, the data
      becomes per-cpu and the ping-pong is avoided.  The perf data also shows
      that this does not cause contention on the call_single_queue lock.
      Next step is to remove generic_smp_call_function_interrupt() from arch
      code.
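The per-cpu arrangement described in this message can be pictured with a small user-space model. This is a minimal sketch under stated assumptions, not the kernel code: `NCPUS`, `csd_slots`, `call_function_many_sketch()`, `single_interrupt_sketch()` and `count_call()` are hypothetical names, CPU masks are plain 64-bit integers, and neither IPIs nor locking are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NCPUS 8 /* assumption: small fixed CPU count for the sketch */

typedef void (*call_func_t)(void *info);

/* One request slot per (sender, target) pair, so senders never share a slot. */
struct csd {
	call_func_t func;
	void *info;
	struct csd *next;
};

static struct csd csd_slots[NCPUS][NCPUS];   /* [sender][target] */
static struct csd *call_single_queue[NCPUS]; /* one list head per target */

/*
 * Sender side: queue one private csd on each target's own list.
 * A slot must not be reused until its target has drained it; the kernel
 * enforces that with a per-csd lock, which this sketch does not model.
 */
static int call_function_many_sketch(int sender, uint64_t mask,
				     call_func_t func, void *info)
{
	int cpu, queued = 0;

	assert(sender >= 0 && sender < NCPUS);
	for (cpu = 0; cpu < NCPUS; cpu++) {
		struct csd *c;

		if (cpu == sender || !(mask & (UINT64_C(1) << cpu)))
			continue;
		c = &csd_slots[sender][cpu];
		c->func = func;
		c->info = info;
		c->next = call_single_queue[cpu];
		call_single_queue[cpu] = c;
		queued++; /* real code would raise an IPI for this target */
	}
	return queued;
}

/* Target side: drain only this CPU's list; nothing shared is written. */
static int single_interrupt_sketch(int cpu)
{
	struct csd *c = call_single_queue[cpu];
	int handled = 0;

	call_single_queue[cpu] = NULL;
	while (c) {
		struct csd *next = c->next;

		c->func(c->info);
		handled++;
		c = next;
	}
	return handled;
}

/* Example callback: counts how often it was invoked. */
static void count_call(void *info)
{
	(*(int *)info)++;
}
```

Each target only ever touches its own list head and the slots queued on it, which is the property the message credits for removing the cache ping-pong on the shared data.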
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 28 Jan, 2013 1 commit
    • smp: Fix SMP function call empty cpu mask race · f44310b9
      Wang YanQing authored
      I get the following warning every day with v3.7, once or
      twice a day:
        [ 2235.186027] WARNING: at /mnt/sda7/kernel/linux/arch/x86/kernel/apic/ipi.c:109 default_send_IPI_mask_logical+0x2f/0xb8()
      As explained by Linus as well:
       | Once we've done the "list_add_rcu()" to add it to the
       | queue, we can have (another) IPI to the target CPU that can
       | now see it and clear the mask.
       | So by the time we get to actually send the IPI, the mask might
       | have been cleared by another IPI.
      This patch also fixes a system hang problem, if the data->cpumask
      gets cleared after passing this point:
              if (WARN_ONCE(!mask, "empty IPI mask"))
      then the problem in commit 83d349f3 ("x86: don't send an IPI to
      the empty set of CPU's") will happen again.
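One way to picture the race Linus describes is that the mask used for sending the IPI must not be the same object that targets are allowed to clear. The following is a user-space sketch only: the struct, its field names and both functions are assumptions made for illustration, and C11 atomics stand in for the kernel's cpumask operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

struct call_data_sketch {
	_Atomic uint64_t cpumask; /* shared: each target clears its own bit */
	uint64_t cpumask_ipi;     /* private snapshot, read only by the sender */
};

/*
 * Sender: record the targets, take a private copy, then (in real code)
 * make the entry visible on the queue. The returned snapshot is what the
 * IPI is sent to, so later clearing of the shared mask cannot empty it.
 */
static uint64_t publish_and_snapshot(struct call_data_sketch *d,
				     uint64_t targets)
{
	atomic_store(&d->cpumask, targets);
	d->cpumask_ipi = targets;
	/* ... the entry would be added to the call queue here ... */
	return d->cpumask_ipi;
}

/* Target: a handler that already saw the entry clears its bit. */
static void target_clears_bit(struct call_data_sketch *d, int cpu)
{
	assert(cpu >= 0 && cpu < 64);
	atomic_fetch_and(&d->cpumask, ~(UINT64_C(1) << cpu));
}
```

Even if every target clears its bit before the sender gets around to sending, the snapshot still names the original targets, so the send path never sees an empty mask.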
      Signed-off-by: Wang YanQing <udknight@gmail.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Jan Beulich <jbeulich@suse.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: peterz@infradead.org
      Cc: mina86@mina86.org
      Cc: srivatsa.bhat@linux.vnet.ibm.com
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/r/20130126075357.GA3205@udknight
      [ Tidied up the changelog and the comment in the code. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 05 Jun, 2012 1 commit
  4. 08 May, 2012 1 commit
  5. 03 May, 2012 1 commit
  6. 29 Mar, 2012 2 commits
    • smp: add func to IPI cpus based on parameter func · b3a7e98e
      Gilad Ben-Yossef authored
      Add the on_each_cpu_cond() function that wraps on_each_cpu_mask() and
      calculates the cpumask of cpus to IPI by calling a function supplied as a
      parameter in order to determine whether to IPI each specific cpu.
      The function works around allocation failure of the cpumask variable
      with CONFIG_CPUMASK_OFFSTACK=y by iterating over the cpus and sending
      an IPI to one cpu at a time via smp_call_function_single().
      The function is useful since it separates the specific code that
      decides in each case whether to IPI a specific cpu for a specific request
      from the common boilerplate code of creating the mask, handling
      failures etc.
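The two paths described here (build a mask and send once, or fall back to one request per cpu when the mask cannot be allocated) can be sketched in user space as follows. Everything in this block is hypothetical: `on_each_cpu_cond_sketch()`, `struct cond_result`, the `mask_alloc_ok` flag that simulates allocation success, and the example callbacks. Calling `func` directly stands in for delivering an IPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPUS 8 /* assumption: small fixed CPU count for the sketch */

typedef bool (*cond_func_t)(int cpu, void *info);
typedef void (*call_func_t)(void *info);

struct cond_result {
	int calls; /* how many times func ran */
	int sends; /* how many separate IPI requests were issued */
};

static struct cond_result on_each_cpu_cond_sketch(cond_func_t cond,
						  call_func_t func, void *info,
						  bool mask_alloc_ok)
{
	struct cond_result r = { 0, 0 };
	uint64_t mask = 0;
	int cpu;

	assert(cond && func);
	if (mask_alloc_ok) {
		/* Normal path: collect the cpus first, then one masked request. */
		for (cpu = 0; cpu < NCPUS; cpu++)
			if (cond(cpu, info))
				mask |= UINT64_C(1) << cpu;
		for (cpu = 0; cpu < NCPUS; cpu++)
			if (mask & (UINT64_C(1) << cpu)) {
				func(info);
				r.calls++;
			}
		r.sends = mask ? 1 : 0;
	} else {
		/* Fallback path: no mask available, one request per cpu. */
		for (cpu = 0; cpu < NCPUS; cpu++)
			if (cond(cpu, info)) {
				func(info);
				r.calls++;
				r.sends++;
			}
	}
	return r;
}

/* Example predicate and callback used to exercise the sketch. */
static bool cpu_is_odd(int cpu, void *info)
{
	(void)info;
	return cpu & 1;
}

static void count_call(void *info)
{
	(*(int *)info)++;
}
```

Both paths run the function on the same set of cpus; only the number of requests differs, which is why the fallback is acceptable as a rare slow path.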
      [akpm@linux-foundation.org: s/gfpflags/gfp_flags/]
      [akpm@linux-foundation.org: avoid double-evaluation of `info' (per Michal), parenthesise evaluation of `cond_func']
      [akpm@linux-foundation.org: s/CPU/CPUs, use all 80 cols in comment]
      Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Sasha Levin <levinsasha928@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.org>
      Cc: Kosaki Motohiro <kosaki.motohiro@gmail.com>
      Cc: Milton Miller <miltonm@bga.com>
      Reviewed-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • smp: introduce a generic on_each_cpu_mask() function · 3fc498f1
      Gilad Ben-Yossef authored
      We have lots of infrastructure in place to partition multi-core systems
      such that we have a group of CPUs that are dedicated to a specific task:
      cgroups, scheduler and interrupt affinity, and the isolcpus= boot parameter.
      Still, kernel code will at times interrupt all CPUs in the system via IPIs
      for various needs.  These IPIs are useful and cannot be avoided
      altogether, but in certain cases it is possible to interrupt only specific
      CPUs that have useful work to do and not the entire system.
      This patch set, inspired by discussions with Peter Zijlstra and Frederic
      Weisbecker when testing the nohz task patch set, is a first stab at trying
      to explore doing this by locating the places where such global IPI calls
      are being made and turning the global IPI into an IPI for a specific group
      of CPUs.  The purpose of the patch set is to get feedback if this is the
      right way to go for dealing with this issue and indeed, if the issue is
      even worth dealing with at all.  Based on the feedback from this patch set
      I plan to offer further patches that address similar issue in other code
      This patch creates an on_each_cpu_mask() and on_each_cpu_cond()
      infrastructure API (the former derived from existing arch specific
      versions in Tile and Arm) and uses them to turn several global IPI
      invocation to per CPU group invocations.
      Core kernel:
      on_each_cpu_mask() calls a function on processors specified by cpumask,
      which may or may not include the local processor.
      You must not call this function with disabled interrupts or from a
      hardware interrupt handler or from a bottom half handler.
      Note that the generic version is a little different than the Arm one:
      1. It has the mask as first parameter
      2. It calls the function on the calling CPU with interrupts disabled,
         but this should be OK since the function is called on the other CPUs
         with interrupts disabled anyway.
      The API is the same as the tile private one, but the generic version
      also calls the function with interrupts disabled in the UP case.
      This is OK since the function is called on the other CPUs
      with interrupts disabled.
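The calling convention described here (remote cpus in the mask first, then the local cpu if it is in the mask, always with interrupts off while the function runs) can be modelled in user space. This is a sketch under loud assumptions: `irqs_off_model` is a plain flag standing in for the interrupt state, and `on_each_cpu_mask_sketch()`, `run_with_irqs_off()`, `struct irq_probe` and `probe_func()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPUS 8 /* assumption: small fixed CPU count for the sketch */

typedef void (*call_func_t)(void *info);

static bool irqs_off_model; /* stand-in for the interrupt-disable state */

struct irq_probe {
	int calls;
	int calls_with_irqs_on;
};

/* Example callback: records whether it ever ran with interrupts enabled. */
static void probe_func(void *info)
{
	struct irq_probe *p = info;

	p->calls++;
	if (!irqs_off_model)
		p->calls_with_irqs_on++;
}

static void run_with_irqs_off(call_func_t func, void *info)
{
	bool saved = irqs_off_model;

	irqs_off_model = true;
	func(info);
	irqs_off_model = saved;
}

static int on_each_cpu_mask_sketch(uint64_t mask, int this_cpu,
				   call_func_t func, void *info)
{
	int cpu, calls = 0;

	/* The caller itself must have interrupts enabled. */
	assert(!irqs_off_model);
	for (cpu = 0; cpu < NCPUS; cpu++) {
		if (cpu == this_cpu || !(mask & (UINT64_C(1) << cpu)))
			continue;
		run_with_irqs_off(func, info); /* remote: IPI handler context */
		calls++;
	}
	/* The local cpu is only called when the mask includes it. */
	if (mask & (UINT64_C(1) << this_cpu)) {
		run_with_irqs_off(func, info);
		calls++;
	}
	return calls;
}
```

Because the callback sees the same interrupt state on every cpu, it does not need a special case for the local invocation.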
      Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Acked-by: Chris Metcalf <cmetcalf@tilera.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Sasha Levin <levinsasha928@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.org>
      Cc: Kosaki Motohiro <kosaki.motohiro@gmail.com>
      Cc: Milton Miller <miltonm@bga.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 31 Oct, 2011 1 commit
    • kernel: Map most files to use export.h instead of module.h · 9984de1a
      Paul Gortmaker authored
      The changed files were only including linux/module.h for the
      EXPORT_SYMBOL infrastructure, and nothing else.  Revector them
      onto the isolated export header for faster compile times.
      Nothing to see here but a whole lot of instances of:
        -#include <linux/module.h>
        +#include <linux/export.h>
      This commit is only changing the kernel dir; next targets
      will probably be mm, fs, the arch dirs, etc.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  8. 17 Jun, 2011 1 commit
  9. 23 Mar, 2011 1 commit
  10. 17 Mar, 2011 4 commits
    • smp_call_function_interrupt: use typedef and %pf · c8def554
      Milton Miller authored
      Use the newly added smp_call_func_t in smp_call_function_interrupt for
      the func variable, and make the comment above the WARN more assertive
      and explicit.  Also, func is a function pointer and does not need an
      offset, so use %pf not %pS.
      Signed-off-by: Milton Miller <miltonm@bga.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • smp_call_function_many: handle concurrent clearing of mask · 723aae25
      Milton Miller authored
      Mike Galbraith reported finding a lockup ("perma-spin bug") where the
      cpumask passed to smp_call_function_many was cleared by other cpu(s)
      while a cpu was preparing its call_data block, resulting in no cpu to
      clear the last ref and unlock the block.
      Having cpus clear their bit asynchronously could be useful on a mask of
      cpus that might have a translation context, or cpus that need a push to
      complete an rcu window.
      Instead of adding a BUG_ON and requiring yet another cpumask copy, just
      detect the race and handle it.
      Note: arch_send_call_function_ipi_mask must still handle an empty
      cpumask because the data block is globally visible before that arch
      callback is made.  And (obviously) there are no guarantees as to which
      cpus are notified if the mask is changed during the call; only cpus that
      were online and had their mask bit set during the whole call are
      guaranteed to be called.
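"Detect the race and handle it" can be read as: count the targets from the private copy of the mask, and back out if that count is zero so nobody is left waiting for a last reference that will never be dropped. The following user-space sketch illustrates that reading; `struct call_block`, `weight64()` and `prepare_call_block()` are hypothetical names, and a boolean stands in for the block's lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct call_block {
	uint64_t cpumask; /* private copy of the targets */
	int refs;         /* number of targets still to run the function */
	bool locked;      /* stand-in for the block's lock flag */
};

/* Population count, written out to stay portable. */
static int weight64(uint64_t m)
{
	int n = 0;

	while (m) {
		m &= m - 1;
		n++;
	}
	return n;
}

/*
 * caller_mask_now is whatever the caller's mask holds at copy time; other
 * cpus may already have cleared bits in it. Returns false when no target
 * is left, in which case the block is unlocked again instead of being
 * queued with nobody to release it.
 */
static bool prepare_call_block(struct call_block *b, uint64_t caller_mask_now,
			       uint64_t online, int this_cpu)
{
	assert(!b->locked);
	b->locked = true;
	b->cpumask = caller_mask_now & online & ~(UINT64_C(1) << this_cpu);
	b->refs = weight64(b->cpumask);
	if (b->refs == 0) {
		b->locked = false;
		return false;
	}
	return true;
}
```

The key point is that the reference count is derived from the copy, not from the caller's mask, so the two can never disagree.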
      Reported-by: Mike Galbraith <efault@gmx.de>
      Reported-by: Jan Beulich <JBeulich@novell.com>
      Acked-by: Jan Beulich <jbeulich@novell.com>
      Cc: stable@kernel.org
      Signed-off-by: Milton Miller <miltonm@bga.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • call_function_many: add missing ordering · 45a57919
      Milton Miller authored
      Paul McKenney's review pointed out two problems with the barriers in the
      2.6.38 update to the smp call function many code.
      First, a barrier that would force the func and info members of data to
      be visible before their consumption in the interrupt handler was
      missing.  This can be solved by adding a smp_wmb between setting the
      func and info members and setting the cpumask; this will pair
      with the existing and required smp_rmb ordering the cpumask read before
      the read of refs.  This placement avoids the need for a second smp_rmb in
      the interrupt handler which would be executed on each of the N cpus
      executing the call request.  (I was thinking this barrier was present
      but it was not).
      Second, the previous write to refs (establishing the zero that the
      interrupt handler was testing from all cpus) was performed by a third
      party cpu.  This would invoke transitivity which, as a recent or
      concurrent addition to memory-barriers.txt now explicitly states, would
      require a full smp_mb().
      However, we know the cpumask will only be set by one cpu (the data
      owner) and any previous iteration of the mask would have been cleared by
      the reading cpu.  By redundantly writing refs to 0 on the owning cpu before
      the smp_wmb, the write to refs will follow the same path as the writes
      that set the cpumask, which in turn allows us to keep the barrier in the
      interrupt handler a smp_rmb instead of promoting it to a smp_mb (which
      will be executed by N cpus for each of the possible M elements on the
      list).
      I moved and expanded the comment about our (ab)use of the rcu list
      primitives for the concurrent walk earlier into this function.  I
      considered moving the first two paragraphs to the queue list head and
      lock, but felt it would have been too disconnected from the code.
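The ordering argued for in this message can be written down in portable C11 as a rough analogue. This is a sketch, not the kernel code: C11 release and acquire fences are used as stand-ins for smp_wmb() and smp_rmb() (they are not exact equivalents), and `struct call_data_sketch`, `owner_publish()`, `handler_consume()` and `count_call()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef void (*call_func_t)(void *info);

struct call_data_sketch {
	call_func_t func;
	void *info;
	atomic_int refs;
	_Atomic uint64_t cpumask;
};

/* Owner side: all writes come from the one cpu that owns the block. */
static void owner_publish(struct call_data_sketch *d, call_func_t func,
			  void *info, uint64_t mask, int nrefs)
{
	/* Redundant zeroing by the owner, so readers order against it too. */
	atomic_store_explicit(&d->refs, 0, memory_order_relaxed);
	d->func = func;
	d->info = info;
	/* Stand-in for smp_wmb(): func/info/refs=0 before the mask. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&d->cpumask, mask, memory_order_relaxed);
	/* In the kernel, refs is only set once the list lock is held. */
	atomic_store_explicit(&d->refs, nrefs, memory_order_release);
}

/* Handler side: mask first, then the barrier, then refs. */
static bool handler_consume(struct call_data_sketch *d, int cpu)
{
	uint64_t bit;

	assert(cpu >= 0 && cpu < 64);
	bit = UINT64_C(1) << cpu;
	if (!(atomic_load_explicit(&d->cpumask, memory_order_relaxed) & bit))
		return false;
	/* Stand-in for smp_rmb(): mask read before refs, func and info. */
	atomic_thread_fence(memory_order_acquire);
	if (atomic_load_explicit(&d->refs, memory_order_relaxed) == 0)
		return false;
	d->func(d->info);
	atomic_fetch_and(&d->cpumask, ~bit);
	atomic_fetch_sub(&d->refs, 1);
	return true;
}

/* Example callback: counts how often it was invoked. */
static void count_call(void *info)
{
	(*(int *)info)++;
}
```

The single release fence on the owner pairs with the single acquire fence in the handler, which mirrors the message's goal of keeping the per-interrupt cost to one read barrier.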
      Cc: Paul McKinney <paulmck@linux.vnet.ibm.com>
      Cc: stable@kernel.org (2.6.32 and later)
      Signed-off-by: Milton Miller <miltonm@bga.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • call_function_many: fix list delete vs add race · e6cd1e07
      Milton Miller authored
      Peter pointed out there was nothing preventing the list_del_rcu in
      smp_call_function_interrupt from running before the list_add_rcu in
      Fix this by not setting refs until we have gotten the lock for the list.
      Take advantage of the wmb in list_add_rcu to save an explicit additional
      I tried to force this race with a udelay before the lock & list_add and
      by mixing all 64 online cpus with just 3 random cpus in the mask, but
      was unsuccessful.  Still, inspection shows a valid race, and the fix is
      an extension of the existing protection window in the current code.
      Cc: stable@kernel.org (v2.6.32 and later)
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Milton Miller <miltonm@bga.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 21 Jan, 2011 2 commits
    • kernel/smp.c: consolidate writes in smp_call_function_interrupt() · 225c8e01
      Milton Miller authored
      We have to test the cpu mask in the interrupt handler before checking the
      refs, otherwise we can start to follow an entry before it's deleted and
      find it partially initialized for the next trip.  Presently we also clear
      the cpumask bit before executing the called function, which implies
      getting write access to the line.  After the function is called we then
      decrement refs, and if they go to zero we then unlock the structure.
      However, this implies getting write access to the call function data
      both before and after the function is called.  If we can assert that no
      smp_call_function execution function is allowed to enable interrupts, then
      we can move both writes to after the function is called, hopefully allowing
      both writes with one cache line bounce.
      On a 256 thread system with a kernel compiled for 1024 threads, the time
      to execute the testcase in the "smp_call_function_many race" changelog was
      reduced by about 30-40ms out of about 545 ms.
      I decided to keep this as WARN because it's now a buggy function, even
      though the stack trace is of no value -- a simple printk would give us the
      information needed.
      Raw data:
      Without patch:
        ipi_test startup took 1219366ns complete 539819014ns total 541038380ns
        ipi_test startup took 1695754ns complete 543439872ns total 545135626ns
        ipi_test startup took 7513568ns complete 539606362ns total 547119930ns
        ipi_test startup took 13304064ns complete 533898562ns total 547202626ns
        ipi_test startup took 8668192ns complete 544264074ns total 552932266ns
        ipi_test startup took 4977626ns complete 548862684ns total 553840310ns
        ipi_test startup took 2144486ns complete 541292318ns total 543436804ns
        ipi_test startup took 21245824ns complete 530280180ns total 551526004ns
      With patch:
        ipi_test startup took 5961748ns complete 500859628ns total 506821376ns
        ipi_test startup took 8975996ns complete 495098924ns total 504074920ns
        ipi_test startup took 19797750ns complete 492204740ns total 512002490ns
        ipi_test startup took 14824796ns complete 487495878ns total 502320674ns
        ipi_test startup took 11514882ns complete 494439372ns total 505954254ns
        ipi_test startup took 8288084ns complete 502570774ns total 510858858ns
        ipi_test startup took 6789954ns complete 493388112ns total 500178066ns
      	#include <linux/module.h>
      	#include <linux/init.h>
      	#include <linux/sched.h> /* sched clock */
      	#include <linux/smp.h>
      	#include <linux/workqueue.h>
      	#define ITERATIONS 100

      	/* Empty handler: only the IPI round trip is being measured. */
      	static void do_nothing_ipi(void *dummy)
      	{
      	}

      	/* Work item run on each cpu: IPI all other cpus and wait, repeatedly. */
      	static void do_ipis(struct work_struct *dummy)
      	{
      		int i;

      		for (i = 0; i < ITERATIONS; i++)
      			smp_call_function(do_nothing_ipi, NULL, 1);
      		printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
      	}

      	static struct work_struct work[NR_CPUS];

      	static int __init testcase_init(void)
      	{
      		int cpu;
      		u64 start, started, done;

      		start = local_clock();
      		for_each_online_cpu(cpu) {
      			INIT_WORK(&work[cpu], do_ipis);
      			schedule_work_on(cpu, &work[cpu]);
      		}
      		started = local_clock();
      		/* Wait for every work item so "complete" covers the whole run. */
      		for_each_online_cpu(cpu)
      			flush_work(&work[cpu]);
      		done = local_clock();
      		pr_info("ipi_test startup took %lldns complete %lldns total %lldns\n",
      			started-start, done-started, done-start);
      		return 0;
      	}

      	static void __exit testcase_exit(void)
      	{
      	}

      	module_init(testcase_init);
      	module_exit(testcase_exit);
      	MODULE_LICENSE("GPL");
      	MODULE_AUTHOR("Anton Blanchard");
      Signed-off-by: Milton Miller <miltonm@bga.com>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/smp.c: fix smp_call_function_many() SMP race · 6dc19899
      Anton Blanchard authored
      I noticed a failure where we hit the following WARN_ON in
                      if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
                      refs = atomic_dec_return(&data->refs);
                      WARN_ON(refs < 0);      <-------------------------
      We atomically tested and cleared our bit in the cpumask, and yet the
      number of cpus left (ie refs) was 0.  How can this be?
      It turns out commit 54fdade1
      ("generic-ipi: make struct call_function_data lockless") is at fault.  It
      removes locking from smp_call_function_many and in doing so creates a
      rather complicated race.
      The problem comes about because:
       - The smp_call_function_many interrupt handler walks call_function.queue
         without any locking.
       - We reuse a percpu data structure in smp_call_function_many.
       - We do not wait for any RCU grace period before starting the next
      Imagine a scenario where CPU A does two smp_call_functions back to back,
      and CPU B does an smp_call_function in between.  We concentrate on how CPU
      C handles the calls:
      CPU A            CPU B                  CPU C              CPU D
                                              call_function.queue sees
                                              data from CPU A on list
                                              call_function.queue sees
                                                (stale) CPU A on list
                                                                 smp_call_function int
                                                                 clears last ref on A
                                                                 list_del_rcu, unlock
      smp_call_function reuses
      percpu *data A
                                              data->cpumask sees and
                                              clears bit in cpumask
                                              might be using old or new fn!
                                              decrements refs below 0
      set data->refs (too late!)
      The important thing to note is since the interrupt handler walks a
      potentially stale call_function.queue without any locking, then another
      cpu can view the percpu *data structure at any time, even when the owner
      is in the process of initialising it.
      The following test case hits the WARN_ON 100% of the time on my PowerPC
      box (having 128 threads does help :)
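      #include <linux/module.h>
      #include <linux/init.h>
      #include <linux/smp.h>
      #include <linux/workqueue.h>
      #define ITERATIONS 100

      /* Empty handler: the test only needs the IPIs themselves. */
      static void do_nothing_ipi(void *dummy)
      {
      }

      /* Work item run on each cpu: IPI all other cpus and wait, repeatedly. */
      static void do_ipis(struct work_struct *dummy)
      {
      	int i;

      	for (i = 0; i < ITERATIONS; i++)
      		smp_call_function(do_nothing_ipi, NULL, 1);
      	printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
      }

      static struct work_struct work[NR_CPUS];

      static int __init testcase_init(void)
      {
      	int cpu;

      	for_each_online_cpu(cpu) {
      		INIT_WORK(&work[cpu], do_ipis);
      		schedule_work_on(cpu, &work[cpu]);
      	}
      	return 0;
      }

      static void __exit testcase_exit(void)
      {
      }

      module_init(testcase_init);
      module_exit(testcase_exit);
      MODULE_LICENSE("GPL");
      MODULE_AUTHOR("Anton Blanchard");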
      I tried to fix it by ordering the read and the write of ->cpumask and
      ->refs.  In doing so I missed a critical case but Paul McKenney was able
      to spot my bug thankfully :) To ensure we aren't viewing previous
      iterations the interrupt handler needs to read ->refs then ->cpumask then
      ->refs _again_.
      Thanks to Milton Miller and Paul McKenney for helping to debug this issue.
      [miltonm@bga.com: add WARN_ON and BUG_ON, remove extra read of refs before initial read of mask that doesn't help (also noted by Peter Zijlstra), adjust comments, hopefully clarify scenario ]
      [miltonm@bga.com: remove excess tests]
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Milton Miller <miltonm@bga.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: <stable@kernel.org> [2.6.32+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 20 Jan, 2011 1 commit
  13. 13 Jan, 2011 1 commit
    • kernel: clean up USE_GENERIC_SMP_HELPERS · 351f8f8e
      Amerigo Wang authored
      An arch which needs USE_GENERIC_SMP_HELPERS has to select
      USE_GENERIC_SMP_HELPERS itself, rather than leaving the choice to the
      user, since such archs don't provide their own implementations.
      Also, move on_each_cpu() to kernel/smp.c, it is strange to put it in
      For an arch which doesn't use USE_GENERIC_SMP_HELPERS, e.g. blackfin, only
      on_each_cpu() is compiled.
      Signed-off-by: Amerigo Wang <amwang@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 27 Oct, 2010 1 commit
  15. 10 Sep, 2010 1 commit
    • generic-ipi: Fix deadlock in __smp_call_function_single · 27c379f7
      Heiko Carstens authored
      Just got my 6 way machine to a state where cpu 0 is in an
      endless loop within __smp_call_function_single.
      All other cpus are idle.
      The call trace on cpu 0 looks like this:
       ----> timer irq
      __smp_call_function_single() got called from nohz_balancer_kick()
      (inlined) with the remote cpu being 1, wait being 0 and the per
      cpu variable remote_sched_softirq_cb (call_single_data) of the
      current cpu (0).
      Then it loops forever when it tries to grab the lock of the
      call_single_data, since it is already locked and enqueued on cpu 0.
      My theory of how this could have happened: for some reason the
      scheduler decided to call __smp_call_function_single() on its own
      cpu, and sent an IPI to itself. The interrupt stays pending
      since IRQs are disabled. If then the hypervisor schedules the
      cpu away it might happen that upon rescheduling both the IPI and
      the timer IRQ are pending. If then interrupts are enabled again
      it depends which one gets scheduled first.
      If the timer interrupt gets delivered first we end up with the
      local deadlock as seen in the calltrace above.
      Let's make __smp_call_function_single() check if the target cpu is
      the current cpu and execute the function immediately just like
      smp_call_function_single does. That should prevent at least the
      scenario described here.
      It might also be that the scheduler is not supposed to call
      __smp_call_function_single with the remote cpu being the current
      cpu, but that is a different issue.
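The proposed check (run the function immediately when the target is the calling cpu, instead of queueing and sending a self-IPI) can be sketched in user space. All names here are hypothetical: `call_function_single_sketch()`, `struct csd_sketch`, the `pending` array standing in for the per-cpu queue, and `count_call()`. A boolean models the csd lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCPUS 4 /* assumption: small fixed CPU count for the sketch */

typedef void (*call_func_t)(void *info);

struct csd_sketch {
	call_func_t func;
	void *info;
	bool locked; /* stand-in for the csd lock flag */
};

static struct csd_sketch *pending[NCPUS]; /* stand-in for per-cpu queues */

/* Returns true when the function ran immediately on the calling cpu. */
static bool call_function_single_sketch(int this_cpu, int target,
					struct csd_sketch *c)
{
	assert(target >= 0 && target < NCPUS);
	if (target == this_cpu) {
		/* Real code would disable local interrupts around the call. */
		c->func(c->info);
		return true;
	}
	/*
	 * Real code would spin until the csd is unlocked. With a self-IPI
	 * still pending, a second call from the same cpu spun forever here.
	 */
	assert(!c->locked);
	c->locked = true;
	pending[target] = c;
	return false;
}

/* Example callback: counts how often it was invoked. */
static void count_call(void *info)
{
	(*(int *)info)++;
}
```

The local path never takes the lock, so a later call from a timer interrupt on the same cpu cannot find the csd already locked by its own earlier request.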
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Jens Axboe <jaxboe@fusionio.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      LKML-Reference: <20100910114729.GB2827@osiris.boeblingen.de.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  16. 27 May, 2010 1 commit
  17. 30 Mar, 2010 1 commit
    • include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h · 5a0e3ad6
      Tejun Heo authored
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      The script does the following.
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
      The conversion was done in the following steps.
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
      6. percpu.h was updated not to include slab.h.
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable on most builds of
      the specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
  18. 18 Jan, 2010 1 commit
  19. 16 Jan, 2010 1 commit
    • David John's avatar
      smp_call_function_any(): pass the node value to cpumask_of_node() · af2422c4
      David John authored
      The change in acpi_cpufreq to use smp_call_function_any causes a warning
      when it is called since the function erroneously passes the cpu id to
      cpumask_of_node rather than the node that the cpu is on.  Fix this.
      cpumask_of_node(3): node > nr_node_ids(1)
      Pid: 1, comm: swapper Not tainted 2.6.33-rc3-00097-g2c1f1895 #223
      Call Trace:
       [<ffffffff81028bb3>] cpumask_of_node+0x23/0x58
       [<ffffffff81061f51>] smp_call_function_any+0x65/0xfa
       [<ffffffff810160d1>] ? do_drv_read+0x0/0x2f
       [<ffffffff81015fba>] get_cur_val+0xb0/0x102
       [<ffffffff81016080>] get_cur_freq_on_cpu+0x74/0xc5
       [<ffffffff810168a7>] acpi_cpufreq_cpu_init+0x417/0x515
       [<ffffffff81562ce9>] ? __down_write+0xb/0xd
       [<ffffffff8148055e>] cpufreq_add_dev+0x278/0x922
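      The mix-up described above can be illustrated with a small userspace
      sketch.  This is not the kernel code: the topology tables and the toy_*
      helpers are hypothetical stand-ins for cpu_to_node() and
      cpumask_of_node(), assumed here only to show that the node mask has to
      be looked up by node id, after translating the CPU id to its node.

```c
#include <assert.h>

#define TOY_NR_CPUS  4
#define TOY_NR_NODES 2

/* Hypothetical topology: CPUs 0-1 sit on node 0, CPUs 2-3 on node 1. */
static const int toy_cpu_node[TOY_NR_CPUS] = { 0, 0, 1, 1 };

/* One bit per CPU; entry n holds the CPUs that belong to node n. */
static const unsigned long toy_node_mask[TOY_NR_NODES] = { 0x3UL, 0xcUL };

/* Stand-in for cpu_to_node(): which node is this CPU on? */
static int toy_cpu_to_node(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	return toy_cpu_node[cpu];
}

/* Stand-in for cpumask_of_node(): the index is a node id, not a CPU id. */
static unsigned long toy_cpumask_of_node(int node)
{
	assert(node >= 0 && node < TOY_NR_NODES);
	return toy_node_mask[node];
}

/* Correct lookup: translate the CPU to its node before asking for the mask. */
static unsigned long toy_node_mask_for_cpu(int cpu)
{
	return toy_cpumask_of_node(toy_cpu_to_node(cpu));
}
```

      Passing the CPU id straight to toy_cpumask_of_node() would trip the
      range assertion for CPU 3 in this toy setup, which mirrors the
      "node > nr_node_ids" warning quoted above.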
      Signed-off-by: David John <davidjon@xenontk.org>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 15 Dec, 2009 1 commit
  21. 14 Dec, 2009 1 commit
  22. 18 Nov, 2009 1 commit
    • Rusty Russell's avatar
      generic-ipi: Add smp_call_function_any() · 2ea6dec4
      Rusty Russell authored
      Andrew points out that acpi-cpufreq uses cpumask_any, when it really
      would prefer to use the same CPU if possible (to avoid an IPI).  In
      general, this seems a good idea to offer.
      [ tglx: Documented selection preference and inlined the UP case to
        avoid the copy of smp_call_function_single() and the extra
        EXPORT ]
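      A rough userspace sketch of such a selection preference follows.  The
      message only states that the current CPU is preferred; the remaining
      order used here (a CPU on the current node, then any other online CPU
      in the mask) is an assumption, and toy_pick_cpu() and its bitmask
      parameters are hypothetical names, not the kernel interface.

```c
#include <assert.h>

#define TOY_NR_CPUS 8

/* Return the lowest CPU set in mask, or -1 if no CPU is set. */
static int toy_first_cpu(unsigned long mask)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (mask & (1UL << cpu))
			return cpu;
	return -1;
}

/*
 * Pick the CPU to run on, in order of preference:
 *   1) the current CPU, if it is in mask (no IPI needed),
 *   2) an online CPU in mask on the current node (assumed tier),
 *   3) any other online CPU in mask (assumed tier).
 * Returns -1 when mask contains no online CPU.
 */
static int toy_pick_cpu(int this_cpu, unsigned long mask,
			unsigned long node_mask, unsigned long online_mask)
{
	int cpu;

	assert(this_cpu >= 0 && this_cpu < TOY_NR_CPUS);

	if (mask & (1UL << this_cpu))
		return this_cpu;

	cpu = toy_first_cpu(mask & node_mask & online_mask);
	if (cpu >= 0)
		return cpu;

	return toy_first_cpu(mask & online_mask);
}
```

      With CPUs 0-1 on the caller's node (node_mask 0x3) and CPUs 0-3 online,
      a caller on CPU 1 asking for mask 0x6 stays on CPU 1, while a caller on
      CPU 0 asking for mask 0xc has to fall back to a remote CPU.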
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Zhao Yakui <yakui.zhao@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  23. 23 Oct, 2009 1 commit
  24. 24 Sep, 2009 1 commit
  25. 23 Sep, 2009 1 commit
    • Xiao Guangrong's avatar
      generic-ipi: make struct call_function_data lockless · 54fdade1
      Xiao Guangrong authored
      This patch removes the spinlock from struct call_function_data, for
      the following reasons:
      1: Add a new cpumask interface, cpumask_test_and_clear_cpu(), which
         atomically tests and clears a specific cpu.  Using it instead of
         cpumask_test_cpu() followed by cpumask_clear_cpu() means data->lock
         is no longer needed to protect them in
         generic_smp_call_function_interrupt().
      2: In smp_call_function_many(), after csd_lock() returns, the current
         cpu's cfd_data has been deleted from the call_function list, so
         there is no race with other cpus.  From then on cfd_data is only
         used in smp_call_function_many(), which must be called with
         preemption disabled and not from a hardware interrupt handler or a
         bottom half handler, so only the corresponding cpu can use it and
         there is no race on the current cpu either.  cfd_data->lock is not
         needed to protect it.
      3: After 1 and 2, cfd_data->lock is only used to protect
         cfd_data->refs in generic_smp_call_function_interrupt(), so
         cfd_data->refs can be made an atomic_t and cfd_data->lock is no
         longer needed at all.
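      The two ideas above (an atomic test-and-clear on the cpumask and an
      atomic reference count) can be sketched in userspace with C11 atomics.
      This is only an illustration under assumptions: struct toy_cfd and the
      toy_* functions are hypothetical, a single unsigned long stands in for
      the cpumask, and nothing here is the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct call_function_data. */
struct toy_cfd {
	atomic_ulong cpumask;	/* one bit per CPU that still has to run */
	atomic_int refs;	/* number of CPUs that still have to run */
};

/* Arm the entry for the CPUs in mask; refs starts at the number of set bits. */
static void toy_cfd_prepare(struct toy_cfd *d, unsigned long mask)
{
	int refs = 0;

	for (unsigned long m = mask; m; m &= m - 1)
		refs++;
	atomic_store(&d->cpumask, mask);
	atomic_store(&d->refs, refs);
}

/* Atomically clear cpu's bit and report whether it was set beforehand. */
static bool toy_test_and_clear_cpu(int cpu, atomic_ulong *mask)
{
	unsigned long bit = 1UL << cpu;

	return (atomic_fetch_and(mask, ~bit) & bit) != 0;
}

/*
 * Handler side: a CPU only proceeds if its bit was still set, so visiting
 * the same entry twice is harmless.  Returns true for the CPU that drops
 * the last reference, i.e. the one that may release the entry.
 */
static bool toy_handle_ipi(struct toy_cfd *d, int cpu)
{
	if (!toy_test_and_clear_cpu(cpu, &d->cpumask))
		return false;
	/* ... the requested function would be called here ... */
	return atomic_fetch_sub(&d->refs, 1) == 1;
}
```

      Because both the bit clear and the decrement are single atomic
      read-modify-write operations, no separate lock is needed around them in
      this sketch.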
      Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      [akpm@linux-foundation.org: use atomic_dec_return()]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  26. 21 Aug, 2009 1 commit
  27. 07 Aug, 2009 1 commit
  28. 09 Jun, 2009 1 commit
  29. 13 Mar, 2009 1 commit
  30. 25 Feb, 2009 4 commits
    • Ingo Molnar's avatar
      generic-ipi: cleanups · 0b13fda1
      Ingo Molnar authored
      Andrew pointed out that there's some small amount of
      style rot in kernel/smp.c.
      Clean it up.
      Reported-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • Peter Zijlstra's avatar
      generic-ipi: remove CSD_FLAG_WAIT · 6e275637
      Peter Zijlstra authored
      Oleg noticed that we don't strictly need CSD_FLAG_WAIT, rework
      the code so that we can use CSD_FLAG_LOCK for both purposes.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • Peter Zijlstra's avatar
      generic-ipi: remove kmalloc() · 8969a5ed
      Peter Zijlstra authored
      Remove the use of kmalloc() from the smp_call_function_*() calls.
      Steven's generic-ipi patch (d7240b98: generic-ipi: use per cpu
      data for single cpu ipi calls) started the discussion on the use
      of kmalloc() in this code and fixed the
      smp_call_function_single(.wait=0) fallback case.
      In this patch we complete this by also providing means for the
      _many() call, which fully removes the need for kmalloc() in this code.
      The problem with the _many() call is that other cpus might still
      be observing our entry when we're done with it. It solved this
      by dynamically allocating data elements and RCU-freeing it.
      We solve it by using a single per-cpu entry which provides
      static storage and solves one half of the problem (avoiding
      referencing freed data).
      The other half, ensuring the queue iteration is still possible,
      is done by placing re-used entries at the head of the list. This
      means that if someone was still iterating that entry when it got
      moved, he will now re-visit the entries on the list he had
      already seen, but avoids skipping over entries like would have
      happened had we placed the new entry at the end.
      Furthermore, visiting entries twice is not a problem, since we
      remove our cpu from the entry's cpumask once it's called.
      Many thanks to Oleg for his suggestions and him poking holes in
      my earlier attempts.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • Nick Piggin's avatar
      generic IPI: simplify barriers and locking · 15d0d3b3
      Nick Piggin authored
      Simplify the barriers in generic remote function call interrupt code.
      Firstly, just unconditionally take the lock and check the list
      in the generic_call_function_single_interrupt IPI handler. As
      we've just taken an IPI here, the chances are fairly high that
      there will be work on the list for us, so do the locking
      unconditionally. This removes the tricky lockless list_empty
      check and dubious barriers. The change looks bigger than it is
      because it is just removing an outer loop.
      Secondly, clarify architecture specific IPI locking rules.
      Generic code has no tools to impose any sane ordering on IPIs if
      they go outside normal cache coherency, ergo the arch code must
      make them appear to obey cache coherency as a "memory operation"
      to initiate an IPI, and a "memory operation" to receive one.
      This way at least they can be reasoned about in generic code,
      and smp_mb used to provide ordering.
      The combination of these two changes means that explicit barriers
      can be taken out of queue handling for the single case -- shared
      data is explicitly locked, and ipi ordering must conform to
      that, so no barriers needed. An extra barrier is needed in the
      many handler, so as to ensure we load the list element after the
      IPI is received.
      Does any architecture actually *need* these barriers? For the
      initiator I could see it, but for the handler I would be
      surprised. So the other thing we could do for simplicity is just
      to require that, rather than just matching with cache coherency,
      we just require a full barrier before generating an IPI, and
      after receiving an IPI. In which case, the smp_mb()s can go
      away. But just for now, we'll be on the safe side and use the
      barriers (they're in the slow case anyway).
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: linux-arch@vger.kernel.org
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  31. 30 Jan, 2009 1 commit
    • Steven Rostedt's avatar
      generic-ipi: use per cpu data for single cpu ipi calls · d7240b98
      Steven Rostedt authored
      The smp_call_function can be passed a wait parameter telling it to
      wait for all the functions running on other CPUs to complete before
      returning, or to return without waiting. Unfortunately, this is
      currently just a suggestion and not mandatory. That is, the
      smp_call_function can decide not to return and wait instead.
      The reason for this is because it uses kmalloc to allocate storage
      to send to the called CPU and that CPU will free it when it is done.
      But if we fail to allocate the storage, the stack is used instead.
      This means we must wait for the called CPU to finish before
      returning.
      Unfortunately, some callers do not abide by this hint and act as if
      the non-wait option is mandatory. The MTRR code for instance will
      deadlock if the smp_call_function is set to wait. This is because
      the smp_call_function will wait for the other CPUs to finish their
      called functions, but those functions are waiting on the caller to
      This patch changes the generic smp_call_function code to use per cpu
      variables if the allocation of the data fails for a single CPU call. The
      smp_call_function_many will fall back to the smp_call_function_single
      if it fails its alloc. The smp_call_function_single is modified
      to not force the wait state.
      Since we now are using a single data per cpu we must synchronize the
      callers to prevent a second caller modifying the data before the
      first called IPI functions complete. To do so, I added a flag to
      the call_single_data called CSD_FLAG_LOCK. When the single CPU is
      called (which can be called when a many call fails an alloc), we
      set the LOCK bit on this per cpu data. When the caller finishes
      it clears the LOCK bit.
      The caller must wait till the LOCK bit is cleared before setting
      it. When it is cleared, there is no IPI function using it.
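      The LOCK-bit protocol described above can be sketched in userspace with
      C11 atomics.  This is an illustration under assumptions rather than the
      kernel code: struct toy_csd, the TOY_CSD_FLAG_LOCK value and the toy_*
      helpers are hypothetical, and the sketch assumes the bit is cleared on
      the handler side once the called function has finished.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_CSD_FLAG_LOCK 0x01u

/* Hypothetical stand-in for the per-cpu call_single_data slot. */
struct toy_csd {
	void (*func)(void *info);
	void *info;
	atomic_uint flags;
};

/* Try to claim the slot; returns true if the LOCK bit was clear before. */
static bool toy_csd_trylock(struct toy_csd *csd)
{
	unsigned int old = atomic_fetch_or_explicit(&csd->flags,
						    TOY_CSD_FLAG_LOCK,
						    memory_order_acquire);
	return !(old & TOY_CSD_FLAG_LOCK);
}

/* Caller side: wait until no IPI function is using the slot, then claim it. */
static void toy_csd_lock(struct toy_csd *csd)
{
	while (!toy_csd_trylock(csd))
		;	/* spin until the previous user clears the bit */
}

/* Release the slot so that the next caller may reuse it. */
static void toy_csd_unlock(struct toy_csd *csd)
{
	assert(atomic_load(&csd->flags) & TOY_CSD_FLAG_LOCK);
	atomic_fetch_and_explicit(&csd->flags, ~TOY_CSD_FLAG_LOCK,
				  memory_order_release);
}

/* Handler side: run the requested function, then clear the LOCK bit. */
static void toy_csd_run(struct toy_csd *csd)
{
	csd->func(csd->info);
	toy_csd_unlock(csd);
}

/* Example payload: bump the counter that info points to. */
static void toy_count(void *info)
{
	++*(int *)info;
}
```

      A second caller that reaches toy_csd_lock() while the bit is set spins
      until toy_csd_run() has finished with the slot, which is the
      synchronization the per-cpu data needs once it is shared between
      calls.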
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Jens Axboe <jens.axboe@oracle.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  32. 31 Dec, 2008 1 commit