    kernel/smp.c: fix smp_call_function_many() SMP race · 6dc19899
    Anton Blanchard authored
    I noticed a failure where we hit the following WARN_ON in
    generic_smp_call_function_interrupt:
    
                    if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
                            continue;
    
                    data->csd.func(data->csd.info);
    
                    refs = atomic_dec_return(&data->refs);
                    WARN_ON(refs < 0);      <-------------------------
    
    We atomically tested and cleared our bit in the cpumask, and yet the
    number of cpus left (i.e. refs) was already 0.  How can this be?
    
    It turns out commit 54fdade1 ("generic-ipi: make struct
    call_function_data lockless") is at fault.  It
    removes locking from smp_call_function_many and in doing so creates a
    rather complicated race.
    
    The problem comes about because:
    
     - The smp_call_function_many interrupt handler walks call_function.queue
       without any locking.
     - We reuse a percpu data structure in smp_call_function_many.
     - We do not wait for any RCU grace period before starting the next
       smp_call_function_many.
    
    Imagine a scenario where CPU A does two smp_call_functions back to back,
    and CPU B does an smp_call_function in between.  We concentrate on how CPU
    C handles the calls:
    
    CPU A            CPU B                  CPU C                           CPU D
    
    smp_call_function
                                            smp_call_function_interrupt
                                            walks call_function.queue,
                                            sees data from CPU A on list
    
                     smp_call_function
    
                                            smp_call_function_interrupt
                                            walks call_function.queue,
                                            sees (stale) CPU A on list
                                                                            smp_call_function int
                                                                            clears last ref on A
                                                                            list_del_rcu, unlock
    smp_call_function reuses
    percpu *data A
                                            data->cpumask sees and
                                            clears bit in cpumask
                                            might be using old or new fn!
                                            decrements refs below 0
    
    set data->refs (too late!)
    
    The important thing to note is since the interrupt handler walks a
    potentially stale call_function.queue without any locking, then another
    cpu can view the percpu *data structure at any time, even when the owner
    is in the process of initialising it.
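
    The interleaving above can be replayed deterministically in userspace.
    The following is only a simplified, single-threaded sketch of the
    scenario, not kernel code: struct cfd_sketch, handle_unfixed() and
    replay_race() are hypothetical names, the cpumask is reduced to one
    unsigned long, and the steps are simply executed in the order shown in
    the diagram.

```c
#include <assert.h>

/* Hypothetical stand-in for the reused per-cpu call_function_data. */
struct cfd_sketch {
	unsigned long cpumask;	/* one bit per target cpu */
	int refs;		/* cpus that still have to run func */
};

/*
 * Unfixed handler body for one list element: test and clear our bit,
 * (call func), then drop a reference.  Returns refs after the visit.
 */
static int handle_unfixed(struct cfd_sketch *d, int cpu)
{
	unsigned long bit = 1UL << cpu;

	if (!(d->cpumask & bit))
		return d->refs;		/* element is not for us */
	d->cpumask &= ~bit;
	/* data->csd.func(data->csd.info) would run here */
	return --d->refs;
}

/* Replays the diagram; returns the refs value CPU C computes at the end. */
static int replay_race(void)
{
	struct cfd_sketch a = { 0, 0 };
	const int cpu_c = 2, cpu_d = 3;
	int refs_seen;

	/* CPU A, first call: fully initialised, targets C and D. */
	a.cpumask = (1UL << cpu_c) | (1UL << cpu_d);
	a.refs = 2;

	/* C and D handle it; D clears the last ref and unlinks the element. */
	handle_unfixed(&a, cpu_c);
	handle_unfixed(&a, cpu_d);
	assert(a.refs == 0 && a.cpumask == 0);

	/* CPU A, second call: reuses *data, mask written, refs not yet set. */
	a.cpumask = (1UL << cpu_c) | (1UL << cpu_d);

	/* CPU C is still walking a stale view of the list and finds A. */
	refs_seen = handle_unfixed(&a, cpu_c);

	/* CPU A finally sets refs, too late. */
	a.refs = 2;

	return refs_seen;
}
```

    Calling replay_race() returns -1, which is the condition the WARN_ON
    catches: the bit was legitimately set and cleared, but refs had not been
    published yet.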
    
    The following test case hits the WARN_ON 100% of the time on my PowerPC
    box (having 128 threads does help :)
    
    #include <linux/module.h>
    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/smp.h>
    #include <linux/workqueue.h>
    
    #define ITERATIONS 100
    
    /* Empty IPI callback: only the delivery of the call matters here. */
    static void do_nothing_ipi(void *dummy)
    {
    }
    
    /* Runs on each cpu: issue back-to-back synchronous cross calls. */
    static void do_ipis(struct work_struct *dummy)
    {
    	int i;
    
    	for (i = 0; i < ITERATIONS; i++)
    		smp_call_function(do_nothing_ipi, NULL, 1);
    
    	printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
    }
    
    static struct work_struct work[NR_CPUS];
    
    /* Queue one work item per online cpu so every cpu is a caller at once. */
    static int __init testcase_init(void)
    {
    	int cpu;
    
    	for_each_online_cpu(cpu) {
    		INIT_WORK(&work[cpu], do_ipis);
    		schedule_work_on(cpu, &work[cpu]);
    	}
    
    	return 0;
    }
    
    static void __exit testcase_exit(void)
    {
    }
    
    module_init(testcase_init);
    module_exit(testcase_exit);
    MODULE_LICENSE("GPL");
    MODULE_AUTHOR("Anton Blanchard");
    
    I tried to fix it by ordering the read and the write of ->cpumask and
    ->refs.  In doing so I missed a critical case, but thankfully Paul
    McKenney was able to spot my bug :) To ensure we aren't viewing a
    previous iteration, the interrupt handler needs to read ->refs, then
    ->cpumask, then ->refs _again_.
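
    The read ordering described above can be sketched in userspace with C11
    atomics.  This is an illustration under stated assumptions, not the
    patch itself: struct cfd_sketch, owner_publish() and handle_checked()
    are hypothetical names, the cpumask is a single atomic word, and C11
    acquire/release ordering stands in for the kernel's barriers.  The
    bracketed note below says the first read of refs was later dropped as
    unhelpful; it is kept here only to mirror the sentence above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the reused per-cpu call_function_data. */
struct cfd_sketch {
	atomic_ulong cpumask;	/* one bit per target cpu */
	atomic_int refs;	/* 0 while the owner is still initialising */
};

/* Owner side: fill in the mask first, publish refs last. */
static void owner_publish(struct cfd_sketch *d, unsigned long mask, int nr)
{
	atomic_store_explicit(&d->cpumask, mask, memory_order_relaxed);
	atomic_store_explicit(&d->refs, nr, memory_order_release);
}

/* Handler side: returns true only if func would have been run. */
static bool handle_checked(struct cfd_sketch *d, int cpu)
{
	unsigned long bit = 1UL << cpu;
	unsigned long old;
	int refs;

	/* Read refs: zero means the element is not (yet) live. */
	if (atomic_load_explicit(&d->refs, memory_order_acquire) == 0)
		return false;
	/* Read cpumask: skip elements that do not target this cpu. */
	if (!(atomic_load_explicit(&d->cpumask, memory_order_acquire) & bit))
		return false;
	/* Read refs again: the mask may belong to a new, unpublished call. */
	if (atomic_load_explicit(&d->refs, memory_order_acquire) == 0)
		return false;

	/* data->csd.func(data->csd.info) would run here */

	old = atomic_fetch_and(&d->cpumask, ~bit);
	if (!(old & bit))
		return false;

	refs = atomic_fetch_sub(&d->refs, 1) - 1;
	assert(refs >= 0);
	return true;
}
```

    With the mask already written but refs still 0, handle_checked() leaves
    the element alone; once owner_publish() has stored refs, the same call
    clears the bit and drops one reference.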
    
    Thanks to Milton Miller and Paul McKenney for helping to debug this issue.
    
    [miltonm@bga.com: add WARN_ON and BUG_ON, remove extra read of refs
     before initial read of mask that doesn't help (also noted by Peter
     Zijlstra), adjust comments, hopefully clarify scenario]
    [miltonm@bga.com: remove excess tests]
    Signed-off-by: Anton Blanchard <anton@samba.org>
    Signed-off-by: Milton Miller <miltonm@bga.com>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
    Cc: <stable@kernel.org> [2.6.32+]
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>