• Shaohua Li's avatar
    smp: make smp_call_function_many() use logic similar to smp_call_function_single() · 9a46ad6d
    Shaohua Li authored
    I'm testing swapout workload in a two-socket Xeon machine.  The workload
    has 10 threads, each thread sequentially accesses separate memory
    region.  TLB flush overhead is very big in the workload.  For each page,
    page reclaim need move it from active lru list and then unmap it.  Both
    need a TLB flush.  And this is a multthread workload, TLB flush happens
    in 10 CPUs.  In X86, TLB flush uses generic smp_call)function.  So this
    workload stress smp_call_function_many heavily.
    
    Without patch, perf shows:
    +  24.49%  [k] generic_smp_call_function_interrupt
    -  21.72%  [k] _raw_spin_lock
       - _raw_spin_lock
          + 79.80% __page_check_address
          + 6.42% generic_smp_call_function_interrupt
          + 3.31% get_swap_page
          + 2.37% free_pcppages_bulk
          + 1.75% handle_pte_fault
          + 1.54% put_super
          + 1.41% grab_super_passive
          + 1.36% __swap_duplicate
          + 0.68% blk_flush_plug_list
          + 0.62% swap_info_get
    +   6.55%  [k] flush_tlb_func
    +   6.46%  [k] smp_call_function_many
    +   5.09%  [k] call_function_interrupt
    +   4.75%  [k] default_send_IPI_mask_sequence_phys
    +   2.18%  [k] find_next_bit
    
    swapout throughput is around 1300M/s.
    
    With the patch, perf shows:
    -  27.23%  [k] _raw_spin_lock
       - _raw_spin_lock
          + 80.53% __page_check_address
          + 8.39% generic_smp_call_function_single_interrupt
          + 2.44% get_swap_page
          + 1.76% free_pcppages_bulk
          + 1.40% handle_pte_fault
          + 1.15% __swap_duplicate
          + 1.05% put_super
          + 0.98% grab_super_passive
          + 0.86% blk_flush_plug_list
          + 0.57% swap_info_get
    +   8.25%  [k] default_send_IPI_mask_sequence_phys
    +   7.55%  [k] call_function_interrupt
    +   7.47%  [k] smp_call_function_many
    +   7.25%  [k] flush_tlb_func
    +   3.81%  [k] _raw_spin_lock_irqsave
    +   3.78%  [k] generic_smp_call_function_single_interrupt
    
    swapout throughput is around 1400M/s.  So there is around a 7%
    improvement, and total cpu utilization doesn't change.
    
    Without the patch, cfd_data is shared by all CPUs.
    generic_smp_call_function_interrupt does read/write cfd_data several times
    which will create a lot of cache ping-pong.  With the patch, the data
    becomes per-cpu.  The ping-pong is avoided.  And from the perf data, this
    doesn't make call_single_queue lock contend.
    
    Next step is to remove generic_smp_call_function_interrupt() from arch
    code.
    Signed-off-by: 's avatarShaohua Li <shli@fusionio.com>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: 's avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: 's avatarLinus Torvalds <torvalds@linux-foundation.org>
    9a46ad6d