    mm, page_alloc: only use per-cpu allocator for irq-safe requests
    Many workloads that allocate pages are not handling an interrupt at
    the time of allocation.  Because allocation requests may come from
    IRQ context, IRQs must be disabled/enabled around every page
    allocation.  This cost is the bulk of the free path and a significant
    percentage of the allocation path.
    
    This patch alters the locking and checks such that only irq-safe
    allocation requests use the per-cpu allocator.  All others acquire the
    irq-safe zone->lock and allocate from the buddy allocator.  It relies on
    disabling preemption to safely access the per-cpu structures.  It could
    be modified slightly to prevent soft IRQs from using the per-cpu
    lists, but it's not clear that is worthwhile.
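
    As an illustration of the control flow described above, the following
    is a minimal sketch of the order-0 allocation split.  The function,
    helper and field names (alloc_order0_page, remove_from_pcp_list,
    remove_from_buddy, per_cpu_pageset) are illustrative only and do not
    match the kernel source; only the in_interrupt() gating, the
    preempt_disable() protection of the per-cpu lists and the irq-safe
    zone->lock fallback reflect what this patch actually does.

    static struct page *alloc_order0_page(struct zone *zone, gfp_t gfp_flags)
    {
            struct page *page;
            unsigned long flags;

            if (!in_interrupt()) {
                    /*
                     * !IRQ context: the per-cpu lists can be accessed with
                     * only preemption disabled, avoiding the cost of
                     * local_irq_save/restore.
                     */
                    struct per_cpu_pages *pcp;

                    preempt_disable();
                    pcp = this_cpu_ptr(zone->per_cpu_pageset);   /* illustrative field */
                    page = remove_from_pcp_list(pcp, gfp_flags); /* illustrative helper */
                    preempt_enable();
                    return page;
            }

            /*
             * IRQ context: bypass the per-cpu lists and allocate directly
             * from the buddy allocator under the irq-safe zone->lock.
             */
            spin_lock_irqsave(&zone->lock, flags);
            page = remove_from_buddy(zone, gfp_flags);           /* illustrative helper */
            spin_unlock_irqrestore(&zone->lock, flags);
            return page;
    }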
    
    This modification may slow allocations from IRQ context slightly but the
    main gain from the per-cpu allocator is that it scales better for
    allocations from multiple contexts.  There is an implicit assumption
    that intensive allocations from IRQ contexts on multiple CPUs from a
    single NUMA node are rare and that the vast majority of scaling issues
    are encountered in !IRQ contexts such as page faulting.  It's worth
    noting that this patch is not required for a bulk page allocator but it
    significantly reduces the overhead.
    
    The following are results from a page allocator micro-benchmark.  Only
    order-0 is interesting as higher orders do not use the per-cpu
    allocator.
    
                                              4.10.0-rc2                 4.10.0-rc2
                                                 vanilla               irqsafe-v1r5
    Amean    alloc-odr0-1               287.15 (  0.00%)           219.00 ( 23.73%)
    Amean    alloc-odr0-2               221.23 (  0.00%)           183.23 ( 17.18%)
    Amean    alloc-odr0-4               187.00 (  0.00%)           151.38 ( 19.05%)
    Amean    alloc-odr0-8               167.54 (  0.00%)           132.77 ( 20.75%)
    Amean    alloc-odr0-16              156.00 (  0.00%)           123.00 ( 21.15%)
    Amean    alloc-odr0-32              149.00 (  0.00%)           118.31 ( 20.60%)
    Amean    alloc-odr0-64              138.77 (  0.00%)           116.00 ( 16.41%)
    Amean    alloc-odr0-128             145.00 (  0.00%)           118.00 ( 18.62%)
    Amean    alloc-odr0-256             136.15 (  0.00%)           125.00 (  8.19%)
    Amean    alloc-odr0-512             147.92 (  0.00%)           121.77 ( 17.68%)
    Amean    alloc-odr0-1024            147.23 (  0.00%)           126.15 ( 14.32%)
    Amean    alloc-odr0-2048            155.15 (  0.00%)           129.92 ( 16.26%)
    Amean    alloc-odr0-4096            164.00 (  0.00%)           136.77 ( 16.60%)
    Amean    alloc-odr0-8192            166.92 (  0.00%)           138.08 ( 17.28%)
    Amean    alloc-odr0-16384           159.00 (  0.00%)           138.00 ( 13.21%)
    Amean    free-odr0-1                165.00 (  0.00%)            89.00 ( 46.06%)
    Amean    free-odr0-2                113.00 (  0.00%)            63.00 ( 44.25%)
    Amean    free-odr0-4                 99.00 (  0.00%)            54.00 ( 45.45%)
    Amean    free-odr0-8                 88.00 (  0.00%)            47.38 ( 46.15%)
    Amean    free-odr0-16                83.00 (  0.00%)            46.00 ( 44.58%)
    Amean    free-odr0-32                80.00 (  0.00%)            44.38 ( 44.52%)
    Amean    free-odr0-64                72.62 (  0.00%)            43.00 ( 40.78%)
    Amean    free-odr0-128               78.00 (  0.00%)            42.00 ( 46.15%)
    Amean    free-odr0-256               80.46 (  0.00%)            57.00 ( 29.16%)
    Amean    free-odr0-512               96.38 (  0.00%)            64.69 ( 32.88%)
    Amean    free-odr0-1024             107.31 (  0.00%)            72.54 ( 32.40%)
    Amean    free-odr0-2048             108.92 (  0.00%)            78.08 ( 28.32%)
    Amean    free-odr0-4096             113.38 (  0.00%)            82.23 ( 27.48%)
    Amean    free-odr0-8192             112.08 (  0.00%)            82.85 ( 26.08%)
    Amean    free-odr0-16384            110.38 (  0.00%)            81.92 ( 25.78%)
    Amean    total-odr0-1               452.15 (  0.00%)           308.00 ( 31.88%)
    Amean    total-odr0-2               334.23 (  0.00%)           246.23 ( 26.33%)
    Amean    total-odr0-4               286.00 (  0.00%)           205.38 ( 28.19%)
    Amean    total-odr0-8               255.54 (  0.00%)           180.15 ( 29.50%)
    Amean    total-odr0-16              239.00 (  0.00%)           169.00 ( 29.29%)
    Amean    total-odr0-32              229.00 (  0.00%)           162.69 ( 28.96%)
    Amean    total-odr0-64              211.38 (  0.00%)           159.00 ( 24.78%)
    Amean    total-odr0-128             223.00 (  0.00%)           160.00 ( 28.25%)
    Amean    total-odr0-256             216.62 (  0.00%)           182.00 ( 15.98%)
    Amean    total-odr0-512             244.31 (  0.00%)           186.46 ( 23.68%)
    Amean    total-odr0-1024            254.54 (  0.00%)           198.69 ( 21.94%)
    Amean    total-odr0-2048            264.08 (  0.00%)           208.00 ( 21.24%)
    Amean    total-odr0-4096            277.38 (  0.00%)           219.00 ( 21.05%)
    Amean    total-odr0-8192            279.00 (  0.00%)           220.92 ( 20.82%)
    Amean    total-odr0-16384           269.38 (  0.00%)           219.92 ( 18.36%)
    
    This is the alloc, free and total overhead of allocating order-0 pages
    in batches of 1 page up to 16384 pages.  Avoiding the IRQ
    disable/enable massively reduces overhead.  Alloc overhead is roughly
    reduced by 14-20% in most cases.  The free path is reduced by 26-46%
    and the total reduction is significant.
    
    Many users require zeroing of pages from the page allocator, which is
    the bulk of the cost of allocation.  Hence, the impact on a basic
    page-faulting benchmark is not that significant.
    
                                  4.10.0-rc2            4.10.0-rc2
                                     vanilla          irqsafe-v1r5
    Hmean    page_test   656632.98 (  0.00%)   675536.13 (  2.88%)
    Hmean    brk_test   3845502.67 (  0.00%)  3867186.94 (  0.56%)
    Stddev   page_test    10543.29 (  0.00%)     4104.07 ( 61.07%)
    Stddev   brk_test     33472.36 (  0.00%)    15538.39 ( 53.58%)
    CoeffVar page_test        1.61 (  0.00%)        0.61 ( 62.15%)
    CoeffVar brk_test         0.87 (  0.00%)        0.40 ( 53.84%)
    Max      page_test   666513.33 (  0.00%)   678640.00 (  1.82%)
    Max      brk_test   3882800.00 (  0.00%)  3887008.66 (  0.11%)
    
    This is from aim9 and the most notable outcome is that fault variability
    is reduced by the patch.  The headline improvement is small as the
    overall fault cost (zeroing, page table insertion, etc.) dominates
    relative to disabling/enabling IRQs in the per-cpu allocator.
    
    Similarly, little benefit was seen on networking benchmarks, both
    localhost and between physical servers/clients, where other costs
    dominate.  It's possible that this will only be noticeable on very
    high-speed networks.
    
    Jesper Dangaard Brouer independently tested this with a separate
    microbenchmark from
      https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
    
    Micro-benchmarked with [1] page_bench02:
     modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \
      rmmod page_bench02 ; dmesg --notime | tail -n 4
    
    Compared to baseline: 213 cycles(tsc) 53.417 ns
     - against this     : 184 cycles(tsc) 46.056 ns
     - Saving           : -29 cycles
     - Very close to expected 27 cycles saving [see below [2]]
    
    Micro benchmarking via time_bench_sample[3], we get the cost of these
    operations:
    
     time_bench: Type:for_loop                 Per elem: 0 cycles(tsc) 0.232 ns (step:0)
     time_bench: Type:spin_lock_unlock         Per elem: 33 cycles(tsc) 8.334 ns (step:0)
     time_bench: Type:spin_lock_unlock_irqsave Per elem: 62 cycles(tsc) 15.607 ns (step:0)
     time_bench: Type:irqsave_before_lock      Per elem: 57 cycles(tsc) 14.344 ns (step:0)
     time_bench: Type:spin_lock_unlock_irq     Per elem: 34 cycles(tsc) 8.560 ns (step:0)
     time_bench: Type:simple_irq_disable_before_lock Per elem: 37 cycles(tsc) 9.289 ns (step:0)
     time_bench: Type:local_BH_disable_enable  Per elem: 19 cycles(tsc) 4.920 ns (step:0)
     time_bench: Type:local_IRQ_disable_enable Per elem: 7 cycles(tsc) 1.864 ns (step:0)
     time_bench: Type:local_irq_save_restore   Per elem: 38 cycles(tsc) 9.665 ns (step:0)
     [Mel's patch removes a ^^^^^^^^^^^^^^^^]            ^^^^^^^^^ expected saving - preempt cost
     time_bench: Type:preempt_disable_enable   Per elem: 11 cycles(tsc) 2.794 ns (step:0)
     [adds a preempt  ^^^^^^^^^^^^^^^^^^^^^^]            ^^^^^^^^^ adds this cost
     time_bench: Type:funcion_call_cost        Per elem: 6 cycles(tsc) 1.689 ns (step:0)
     time_bench: Type:func_ptr_call_cost       Per elem: 11 cycles(tsc) 2.767 ns (step:0)
     time_bench: Type:page_alloc_put           Per elem: 211 cycles(tsc) 52.803 ns (step:0)
    
    Thus, expected improvement is: 38-11 = 27 cycles.
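
    In other words, the patch swaps the IRQ save/restore pair around the
    per-cpu list access for a preempt disable/enable pair.  The fragment
    below is purely schematic (not the actual diff), annotated with the
    per-element costs measured above:

        /* Before: per-cpu list access had to be IRQ-safe */
        local_irq_save(flags);
        /* ... manipulate this CPU's per-cpu free lists ... */
        local_irq_restore(flags);  /* ~38 cycles per save/restore pair */

        /* After: per-cpu list access only needs preemption disabled */
        preempt_disable();
        /* ... manipulate this CPU's per-cpu free lists ... */
        preempt_enable();          /* ~11 cycles per pair -> ~27 cycles saved */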
    
    [mgorman@techsingularity.net: s/preempt_enable_no_resched/preempt_enable/]
      Link: http://lkml.kernel.org/r/20170208143128.25ahymqlyspjcixu@techsingularity.net
    Link: http://lkml.kernel.org/r/20170123153906.3122-5-mgorman@techsingularity.net
    
    
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
    Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>