1. 26 Oct, 2018 1 commit
    • Vlastimil Babka's avatar
      mm: rename and change semantics of nr_indirectly_reclaimable_bytes · b29940c1
      Vlastimil Babka authored
      The vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES was introduced by
      commit eb592546 ("mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES") with
      the goal of accounting objects that can be reclaimed, but cannot be
      allocated via a SLAB_RECLAIM_ACCOUNT cache.  This is now possible via
      kmalloc() with __GFP_RECLAIMABLE flag, and the dcache external names user
      is converted.
      
      The counter is however still useful for accounting direct page allocations
      (i.e.  not slab) with a shrinker, such as the ION page pool.  So keep it,
      and:
      
      - change granularity to pages to be more like other counters; sub-page
        allocations should be able to use kmalloc
      - rename the counter to NR_KERNEL_MISC_RECLAIMABLE
      - expose the counter again in vmstat as "nr_kernel_misc_reclaimable"; we can
        again remove the check for not printing "hidden" counters
      
      Link: http://lkml.kernel.org/r/20180731090649.16028-5-vbabka@suse.cz
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Acked-by: default avatarRoman Gushchin <guro@fb.com>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b29940c1
  2. 05 Oct, 2018 2 commits
  3. 28 Jun, 2018 1 commit
    • Sebastian Andrzej Siewior's avatar
      Revert mm/vmstat.c: fix vmstat_update() preemption BUG · 28557cc1
      Sebastian Andrzej Siewior authored
      Revert commit c7f26ccf ("mm/vmstat.c: fix vmstat_update() preemption
      BUG").  Steven saw a "using smp_processor_id() in preemptible" message
      and added a preempt_disable() section around it to keep it quiet.  This
      is not the right thing to do it does not fix the real problem.
      
      vmstat_update() is invoked by a kworker on a specific CPU.  This worker
      it bound to this CPU.  The name of the worker was "kworker/1:1" so it
      should have been a worker which was bound to CPU1.  A worker which can
      run on any CPU would have a `u' before the first digit.
      
      smp_processor_id() can be used in a preempt-enabled region as long as
      the task is bound to a single CPU which is the case here.  If it could
      run on an arbitrary CPU then this is the problem we have an should seek
      to resolve.
      
      Not only this smp_processor_id() must not be migrated to another CPU but
      also refresh_cpu_vm_stats() which might access wrong per-CPU variables.
      Not to mention that other code relies on the fact that such a worker
      runs on one specific CPU only.
      
      Therefore revert that commit and we should look instead what broke the
      affinity mask of the kworker.
      
      Link: http://lkml.kernel.org/r/20180504104451.20278-1-bigeasy@linutronix.de
      
      Signed-off-by: default avatarSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Steven J. Hill <steven.hill@cavium.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      28557cc1
  4. 16 May, 2018 1 commit
  5. 12 May, 2018 1 commit
  6. 11 Apr, 2018 1 commit
    • Roman Gushchin's avatar
      mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES · eb592546
      Roman Gushchin authored
      Patch series "indirectly reclaimable memory", v2.
      
      This patchset introduces the concept of indirectly reclaimable memory
      and applies it to fix the issue of when a big number of dentries with
      external names can significantly affect the MemAvailable value.
      
      This patch (of 3):
      
      Introduce a concept of indirectly reclaimable memory and adds the
      corresponding memory counter and /proc/vmstat item.
      
      Indirectly reclaimable memory is any sort of memory, used by the kernel
      (except of reclaimable slabs), which is actually reclaimable, i.e.  will
      be released under memory pressure.
      
      The counter is in bytes, as it's not always possible to count such
      objects in pages.  The name contains BYTES by analogy to
      NR_KERNEL_STACK_KB.
      
      Link: http://lkml.kernel.org/r/20180305133743.12746-2-guro@fb.com
      
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Joha...
      eb592546
  7. 28 Mar, 2018 1 commit
  8. 16 Nov, 2017 2 commits
  9. 09 Sep, 2017 3 commits
    • Kemi Wang's avatar
      mm: consider the number in local CPUs when reading NUMA stats · 63803222
      Kemi Wang authored
      To avoid deviation, the per cpu number of NUMA stats in
      vm_numa_stat_diff[] is included when a user *reads* the NUMA stats.
      
      Since NUMA stats does not be read by users frequently, and kernel does not
      need it to make a decision, it will not be a problem to make the readers
      more expensive.
      
      Link: http://lkml.kernel.org/r/1503568801-21305-4-git-send-email-kemi.wang@intel.com
      
      Signed-off-by: default avatarKemi Wang <kemi.wang@intel.com>
      Reported-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Ying Huang <ying.huang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      63803222
    • Kemi Wang's avatar
      mm: update NUMA counter threshold size · 1d90ca89
      Kemi Wang authored
      There is significant overhead in cache bouncing caused by zone counters
      (NUMA associated counters) update in parallel in multi-threaded page
      allocation (suggested by Dave Hansen).
      
      This patch updates NUMA counter threshold to a fixed size of MAX_U16 - 2,
      as a small threshold greatly increases the update frequency of the global
      counter from local per cpu counter(suggested by Ying Huang).
      
      The rationality is that these statistics counters don't affect the
      kernel's decision, unlike other VM counters, so it's not a problem to use
      a large threshold.
      
      With this patchset, we see 31.3% drop of CPU cycles(537-->369) for per
      single page allocation and reclaim on Jesper's page_bench03 benchmark.
      
      Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
      https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/
      bench
      
       Threshold   CPU cycles    Throughput(88 threads)
           32          799         241760478
           64          640         301628829
           125         537         358906028 <==> system by default (base)
           256         468         412397590
           512         428         450550704
           4096        399         482520943
           20000       394         489009617
           30000       395         488017817
           65533       369(-31.3%) 521661345(+45.3%) <==> with this patchset
           N/A         342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
      
      Link: http://lkml.kernel.org/r/1503568801-21305-3-git-send-email-kemi.wang@intel.com
      
      Signed-off-by: default avatarKemi Wang <kemi.wang@intel.com>
      Reported-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Suggested-by: default avatarDave Hansen <dave.hansen@intel.com>
      Suggested-by: default avatarYing Huang <ying.huang@intel.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1d90ca89
    • Kemi Wang's avatar
      mm: change the call sites of numa statistics items · 3a321d2a
      Kemi Wang authored
      Patch series "Separate NUMA statistics from zone statistics", v2.
      
      Each page allocation updates a set of per-zone statistics with a call to
      zone_statistics().  As discussed in 2017 MM summit, these are a
      substantial source of overhead in the page allocator and are very rarely
      consumed.  This significant overhead in cache bouncing caused by zone
      counters (NUMA associated counters) update in parallel in multi-threaded
      page allocation (pointed out by Dave Hansen).
      
      A link to the MM summit slides:
        http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
      
      To mitigate this overhead, this patchset separates NUMA statistics from
      zone statistics framework, and update NUMA counter threshold to a fixed
      size of MAX_U16 - 2, as a small threshold greatly increases the update
      frequency of the global counter from local per cpu counter (suggested by
      Ying Huang).  The rationality is that these statistics counters don't
      need to be read often, unlike other VM counters, so it's not a problem
      to use a large threshold and make readers more expensive.
      
      With this patchset, we see 31.3% drop of CPU cycles(537-->369, see
      below) for per single page allocation and reclaim on Jesper's
      page_bench03 benchmark.  Meanwhile, this patchset keeps the same style
      of virtual memory statistics with little end-user-visible effects (only
      move the numa stats to show behind zone page stats, see the first patch
      for details).
      
      I did an experiment of single page allocation and reclaim concurrently
      using Jesper's page_bench03 benchmark on a 2-Socket Broadwell-based
      server (88 processors with 126G memory) with different size of threshold
      of pcp counter.
      
      Benchmark provided by Jesper D Brouer(increase loop times to 10000000):
        https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
         Threshold   CPU cycles    Throughput(88 threads)
            32        799         241760478
            64        640         301628829
            125       537         358906028 <==> system by default
            256       468         412397590
            512       428         450550704
            4096      399         482520943
            20000     394         489009617
            30000     395         488017817
            65533     369(-31.3%) 521661345(+45.3%) <==> with this patchset
            N/A       342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
      
      This patch (of 3):
      
      In this patch, NUMA statistics is separated from zone statistics
      framework, all the call sites of NUMA stats are changed to use
      numa-stats-specific functions, it does not have any functionality change
      except that the number of NUMA stats is shown behind zone page stats
      when users *read* the zone info.
      
      E.g. cat /proc/zoneinfo
          ***Base***                           ***With this patch***
      nr_free_pages 3976                         nr_free_pages 3976
      nr_zone_inactive_anon 0                    nr_zone_inactive_anon 0
      nr_zone_active_anon 0                      nr_zone_active_anon 0
      nr_zone_inactive_file 0                    nr_zone_inactive_file 0
      nr_zone_active_file 0                      nr_zone_active_file 0
      nr_zone_unevictable 0                      nr_zone_unevictable 0
      nr_zone_write_pending 0                    nr_zone_write_pending 0
      nr_mlock     0                             nr_mlock     0
      nr_page_table_pages 0                      nr_page_table_pages 0
      nr_kernel_stack 0                          nr_kernel_stack 0
      nr_bounce    0                             nr_bounce    0
      nr_zspages   0                             nr_zspages   0
      numa_hit 0                                *nr_free_cma  0*
      numa_miss 0                                numa_hit     0
      numa_foreign 0                             numa_miss    0
      numa_interleave 0                          numa_foreign 0
      numa_local   0                             numa_interleave 0
      numa_other   0                             numa_local   0
      *nr_free_cma 0*                            numa_other 0
          ...                                        ...
      vm stats threshold: 10                     vm stats threshold: 10
          ...                                        ...
      
      The next patch updates the numa stats counter size and threshold.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.com
      
      Signed-off-by: default avatarKemi Wang <kemi.wang@intel.com>
      Reported-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Ying Huang <ying.huang@intel.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3a321d2a
  10. 07 Sep, 2017 6 commits
    • Huang Ying's avatar
      mm, swap: add swap readahead hit statistics · cbc65df2
      Huang Ying authored
      Patch series "mm, swap: VMA based swap readahead", v4.
      
      The swap readahead is an important mechanism to reduce the swap in
      latency.  Although pure sequential memory access pattern isn't very
      popular for anonymous memory, the space locality is still considered
      valid.
      
      In the original swap readahead implementation, the consecutive blocks in
      swap device are readahead based on the global space locality estimation.
      But the consecutive blocks in swap device just reflect the order of page
      reclaiming, don't necessarily reflect the access pattern in virtual
      memory space.  And the different tasks in the system may have different
      access patterns, which makes the global space locality estimation
      incorrect.
      
      In this patchset, when page fault occurs, the virtual pages near the
      fault address will be readahead instead of the swap slots near the fault
      swap slot in swap device.  This avoid to readahead the unrelated swap
      slots.  At the same time, the swap readahead is changed to work on
      per-VMA from globally.  So that the different access patterns of the
      different VMAs could be distinguished, and the different readahead
      policy could be applied accordingly.  The original core readahead
      detection and scaling algorithm is reused, because it is an effect
      algorithm to detect the space locality.
      
      In addition to the swap readahead changes, some new sysfs interface is
      added to show the efficiency of the readahead algorithm and some other
      swap statistics.
      
      This new implementation will incur more small random read, on SSD, the
      improved correctness of estimation and readahead target should beat the
      potential increased overhead, this is also illustrated in the test
      results below.  But on HDD, the overhead may beat the benefit, so the
      original implementation will be used by default.
      
      The test and result is as follow,
      
      Common test condition
      =====================
      
      Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
      Swap device: NVMe disk
      
      Micro-benchmark with combined access pattern
      ============================================
      
      vm-scalability, sequential swap test case, 4 processes to eat 50G
      virtual memory space, repeat the sequential memory writing until 300
      seconds.  The first round writing will trigger swap out, the following
      rounds will trigger sequential swap in and out.
      
      At the same time, run vm-scalability random swap test case in
      background, 8 processes to eat 30G virtual memory space, repeat the
      random memory write until 300 seconds.  This will trigger random swap-in
      in the background.
      
      This is a combined workload with sequential and random memory accessing
      at the same time.  The result (for sequential workload) is as follow,
      
      			Base		Optimized
      			----		---------
      throughput		345413 KB/s	414029 KB/s (+19.9%)
      latency.average		97.14 us	61.06 us (-37.1%)
      latency.50th		2 us		1 us
      latency.60th		2 us		1 us
      latency.70th		98 us		2 us
      latency.80th		160 us		2 us
      latency.90th		260 us		217 us
      latency.95th		346 us		369 us
      latency.99th		1.34 ms		1.09 ms
      ra_hit%			52.69%		99.98%
      
      The original swap readahead algorithm is confused by the background
      random access workload, so readahead hit rate is lower.  The VMA-base
      readahead algorithm works much better.
      
      Linpack
      =======
      
      The test memory size is bigger than RAM to trigger swapping.
      
      			Base		Optimized
      			----		---------
      elapsed_time		393.49 s	329.88 s (-16.2%)
      ra_hit%			86.21%		98.82%
      
      The score of base and optimized kernel hasn't visible changes.  But the
      elapsed time reduced and readahead hit rate improved, so the optimized
      kernel runs better for startup and tear down stages.  And the absolute
      value of readahead hit rate is high, shows that the space locality is
      still valid in some practical workloads.
      
      This patch (of 5):
      
      The statistics for total readahead pages and total readahead hits are
      recorded and exported via the following sysfs interface.
      
      /sys/kernel/mm/swap/ra_hits
      /sys/kernel/mm/swap/ra_total
      
      With them, the efficiency of the swap readahead could be measured, so
      that the swap readahead algorithm and parameters could be tuned
      accordingly.
      
      [akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n]
      Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.com
      
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cbc65df2
    • SeongJae Park's avatar
      mm/vmstat.c: fix wrong comment · f113e641
      SeongJae Park authored
      Comment for pagetypeinfo_showblockcount() is mistakenly duplicated from
      pagetypeinfo_show_free()'s comment.  This commit fixes it.
      
      Link: http://lkml.kernel.org/r/20170809185816.11244-1-sj38.park@gmail.com
      Fixes: 467c996c
      
       ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
      Signed-off-by: default avatarSeongJae Park <sj38.park@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f113e641
    • Wen Yang's avatar
      mm/vmstat: fix divide error at __fragmentation_index · 88d6ac40
      Wen Yang authored
      When order is -1 or too big, *1UL << order* will be 0, which will cause
      a divide error.  Although it seems that all callers of
      __fragmentation_index() will only do so with a valid order, the patch
      can make it more robust.
      
      Should prevent reoccurrences of
      https://bugzilla.kernel.org/show_bug.cgi?id=196555
      
      Link: http://lkml.kernel.org/r/1501751520-2598-1-git-send-email-wen.yang99@zte.com.cn
      
      Signed-off-by: default avatarWen Yang <wen.yang99@zte.com.cn>
      Reviewed-by: default avatarJiang Biao <jiang.biao2@zte.com.cn>
      Suggested-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      88d6ac40
    • Michal Hocko's avatar
      mm: rename global_page_state to global_zone_page_state · c41f012a
      Michal Hocko authored
      global_page_state is error prone as a recent bug report pointed out [1].
      It only returns proper values for zone based counters as the enum it
      gets suggests.  We already have global_node_page_state so let's rename
      global_page_state to global_zone_page_state to be more explicit here.
      All existing users seems to be correct:
      
      $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
            2 NR_BOUNCE
            2 NR_FREE_CMA_PAGES
           11 NR_FREE_PAGES
            1 NR_KERNEL_STACK_KB
            1 NR_MLOCK
            2 NR_PAGETABLE
      
      This patch shouldn't introduce any functional change.
      
      [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp
      
      Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c41f012a
    • Huang Ying's avatar
      mm, THP, swap: add THP swapping out fallback counting · fe490cc0
      Huang Ying authored
      When swapping out THP (Transparent Huge Page), instead of swapping out
      the THP as a whole, sometimes we have to fallback to split the THP into
      normal pages before swapping, because no free swap clusters are
      available, or cgroup limit is exceeded, etc.  To count the number of the
      fallback, a new VM event THP_SWPOUT_FALLBACK is added, and counted when
      we fallback to split the THP.
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-13-ying.huang@intel.com
      
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fe490cc0
    • Huang Ying's avatar
      mm: test code to write THP to swap device as a whole · 225311a4
      Huang Ying authored
      To support delay splitting THP (Transparent Huge Page) after swapped
      out, we need to enhance swap writing code to support to write a THP as a
      whole.  This will improve swap write IO performance.
      
      As Ming Lei <ming.lei@redhat.com> pointed out, this should be based on
      multipage bvec support, which hasn't been merged yet.  So this patch is
      only for testing the functionality of the other patches in the series.
      And will be reimplemented after multipage bvec support is merged.
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-7-ying.huang@intel.com
      
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      225311a4
  11. 10 Jul, 2017 1 commit
  12. 06 Jul, 2017 4 commits
  13. 12 May, 2017 1 commit
  14. 03 May, 2017 5 commits
    • David Rientjes's avatar
      mm, vmstat: suppress pcp stats for unpopulated zones in zoneinfo · 7dfb8bf3
      David Rientjes authored
      After "mm, vmstat: print non-populated zones in zoneinfo",
      /proc/zoneinfo will show unpopulated zones.
      
      The per-cpu pageset statistics are not relevant for unpopulated zones
      and can be potentially lengthy, so supress them when they are not
      interesting.
      
      Also moves lowmem reserve protection information above pcp stats since
      it is relevant for all zones per vm.lowmem_reserve_ratio.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703061400500.46428@chino.kir.corp.google.com
      
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7dfb8bf3
    • David Rientjes's avatar
      mm, vmstat: print non-populated zones in zoneinfo · b2bd8598
      David Rientjes authored
      Initscripts can use the information (protection levels) from
      /proc/zoneinfo to configure vm.lowmem_reserve_ratio at boot.
      
      vm.lowmem_reserve_ratio is an array of ratios for each configured zone
      on the system.  If a zone is not populated on an arch, /proc/zoneinfo
      suppresses its output.
      
      This results in there not being a 1:1 mapping between the set of zones
      emitted by /proc/zoneinfo and the zones configured by
      vm.lowmem_reserve_ratio.
      
      This patch shows statistics for non-populated zones in /proc/zoneinfo.
      The zones exist and hold a spot in the vm.lowmem_reserve_ratio array.
      Without this patch, it is not possible to determine which index in the
      array controls which zone if one or more zones on the system are not
      populated.
      
      Remaining users of walk_zones_in_node() are unchanged.  Files such as
      /proc/pagetypeinfo require certain zone data to be initialized properly
      for display, which is not done for unpopulated zones.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703031451310.98023@chino.kir.corp.google.com
      
      Signed-off-by: default avatarDavid Rientjes <rientjes@google.com>
      Reviewed-by: default avatarAnshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b2bd8598
    • Shaohua Li's avatar
      mm: move MADV_FREE pages into LRU_INACTIVE_FILE list · f7ad2a6c
      Shaohua Li authored
      madv()'s MADV_FREE indicate pages are 'lazyfree'.  They are still
      anonymous pages, but they can be freed without pageout.  To distinguish
      these from normal anonymous pages, we clear their SwapBacked flag.
      
      MADV_FREE pages could be freed without pageout, so they pretty much like
      used once file pages.  For such pages, we'd like to reclaim them once
      there is memory pressure.  Also it might be unfair reclaiming MADV_FREE
      pages always before used once file pages and we definitively want to
      reclaim the pages before other anonymous and file pages.
      
      To speed up MADV_FREE pages reclaim, we put the pages into
      LRU_INACTIVE_FILE list.  The rationale is LRU_INACTIVE_FILE list is tiny
      nowadays and should be full of used once file pages.  Reclaiming
      MADV_FREE pages will not have much interfere of anonymous and active
      file pages.  And the inactive file pages and MADV_FREE pages will be
      reclaimed according to their age, so we don't reclaim too many MADV_FREE
      pages too.  Putting the MADV_FREE pages into LRU_INACTIVE_FILE_LIST also
      means we can reclaim the pages without swap support.  This idea is
      suggested by Johannes.
      
      This patch doesn't move MADV_FREE pages to LRU_INACTIVE_FILE list yet to
      avoid bisect failure, next patch will do it.
      
      The patch is based on Minchan's original patch.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
      
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Suggested-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f7ad2a6c
    • Johannes Weiner's avatar
      mm: delete NR_PAGES_SCANNED and pgdat_reclaimable() · c822f622
      Johannes Weiner authored
      NR_PAGES_SCANNED counts number of pages scanned since the last page free
      event in the allocator.  This was used primarily to measure the
      reclaimability of zones and nodes, and determine when reclaim should
      give up on them.  In that role, it has been replaced in the preceding
      patches by a different mechanism.
      
      Being implemented as an efficient vmstat counter, it was automatically
      exported to userspace as well.  It's however unlikely that anyone
      outside the kernel is using this counter in any meaningful way.
      
      Remove the counter and the unused pgdat_reclaimable().
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
      
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c822f622
    • Johannes Weiner's avatar
      mm: fix 100% CPU kswapd busyloop on unreclaimable nodes · c73322d0
      Johannes Weiner authored
      Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
      cleanups".
      
      Jia reported a scenario in which the kswapd of a node indefinitely spins
      at 100% CPU usage.  We have seen similar cases at Facebook.
      
      The kernel's current method of judging its ability to reclaim a node (or
      whether to back off and sleep) is based on the amount of scanned pages
      in proportion to the amount of reclaimable pages.  In Jia's and our
      scenarios, there are no reclaimable pages in the node, however, and the
      condition for backing off is never met.  Kswapd busyloops in an attempt
      to restore the watermarks while having nothing to work with.
      
      This series reworks the definition of an unreclaimable node based not on
      scanning but on whether kswapd is able to actually reclaim pages in
      MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criteria
      the page allocator uses for giving up on direct reclaim and invoking the
      OOM killer.  If it cannot free any pages, kswapd will go to sleep and
      leave further attempts to direct reclaim invocations, which will either
      make progress and re-enable kswapd, or invoke the OOM killer.
      
      Patch #1 fixes the immediate problem Jia reported, the remainder are
      smaller fixlets, cleanups, and overall phasing out of the old method.
      
      Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
      and directly related to #5, but in itself not relevant to the series.
      
      If the whole series is too ambitious for 4.11, I would consider the
      first three patches fixes, the rest cleanups.
      
      This patch (of 9):
      
      Jia He reports a problem with kswapd spinning at 100% CPU when
      requesting more hugepages than memory available in the system:
      
      $ echo 4000 >/proc/sys/vm/nr_hugepages
      
      top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
      Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
      KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
      KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
      
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
      
      At that time, there are no reclaimable pages left in the node, but as
      kswapd fails to restore the high watermarks it refuses to go to sleep.
      
      Kswapd needs to back away from nodes that fail to balance.  Up until
      commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
      nodes") kswapd had such a mechanism.  It considered zones whose
      theoretically reclaimable pages it had reclaimed six times over as
      unreclaimable and backed away from them.  This guard was erroneously
      removed as the patch changed the definition of a balanced node.
      
      However, simply restoring this code wouldn't help in the case reported
      here: there *are* no reclaimable pages that could be scanned until the
      threshold is met.  Kswapd would stay awake anyway.
      
      Introduce a new and much simpler way of backing off.  If kswapd runs
      through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
      page, make it back off from the node.  This is the same number of shots
      direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
      that node until a direct reclaimer manages to reclaim some pages, thus
      proving the node reclaimable again.
      
      [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
        Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
      [shakeelb@google.com: fix condition for throttle_direct_reclaim]
        Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
      
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarShakeel Butt <shakeelb@google.com>
      Reported-by: default avatarJia He <hejianet@gmail.com>
      Tested-by: default avatarJia He <hejianet@gmail.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c73322d0
  15. 19 Apr, 2017 1 commit
    • Michal Hocko's avatar
      mm: make mm_percpu_wq non freezable · 80d136e1
      Michal Hocko authored
      Geert has reported a freeze during PM resume and some additional
      debugging has shown that the device_resume worker cannot make a forward
      progress because it waits for an event which is stuck waiting in
      drain_all_pages:
      
        INFO: task kworker/u4:0:5 blocked for more than 120 seconds.
              Not tainted 4.11.0-rc7-koelsch-00029-g005882e5-dirty #3476
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        kworker/u4:0    D    0     5      2 0x00000000
        Workqueue: events_unbound async_run_entry_fn
          __schedule
          schedule
          schedule_timeout
          wait_for_common
          dpm_wait_for_superior
          device_resume
          async_resume
          async_run_entry_fn
          process_one_work
          worker_thread
          kthread
        [...]
        bash            D    0  1703   1694 0x00000000
          __schedule
          schedule
          schedule_timeout
          wait_for_common
          flush_work
          drain_all_pages
          start_isolate_page_range
          alloc_contig_range
          cma_alloc
          __alloc_from_contiguous
          cma_allocator_alloc
          __dma_alloc
          arm_dma_alloc
          sh_eth_ring_init
          sh_eth_open
          sh_eth_resume
          dpm_run_callback
          device_resume
          dpm_resume
          dpm_resume_end
          suspend_devices_and_enter
          pm_suspend
          state_store
          kernfs_fop_write
          __vfs_write
          vfs_write
          SyS_write
        [...]
        Showing busy workqueues and worker pools:
        [...]
        workqueue mm_percpu_wq: flags=0xc
          pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=0/0
            delayed: drain_local_pages_wq, vmstat_update
          pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=0/0
            delayed: drain_local_pages_wq BAR(1703), vmstat_update
      
      Tetsuo has properly noted that mm_percpu_wq is created as WQ_FREEZABLE
      so it is frozen this early during resume so we are effectively
      deadlocked.  Fix this by dropping WQ_FREEZABLE when creating
      mm_percpu_wq.  We really want to have it operational all the time.
      
      Fixes: ce612879
      
       ("mm: move pcp and lru-pcp draining into single wq")
      Reported-and-tested-by: default avatarGeert Uytterhoeven <geert@linux-m68k.org>
      Debugged-by: default avatarTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      80d136e1
  16. 08 Apr, 2017 1 commit
  17. 01 Apr, 2017 1 commit
    • Michal Hocko's avatar
      mm: move mm_percpu_wq initialization earlier · 597b7305
      Michal Hocko authored
      Yang Li has reported that drain_all_pages triggers a WARN_ON which means
      that this function is called earlier than the mm_percpu_wq is
      initialized on arm64 with CMA configured:
      
        WARNING: CPU: 2 PID: 1 at mm/page_alloc.c:2423 drain_all_pages+0x244/0x25c
        Modules linked in:
        CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-next-20170310-00027-g64dfbc5 #127
        Hardware name: Freescale Layerscape 2088A RDB Board (DT)
        task: ffffffc07c4a6d00 task.stack: ffffffc07c4a8000
        PC is at drain_all_pages+0x244/0x25c
        LR is at start_isolate_page_range+0x14c/0x1f0
        [...]
         drain_all_pages+0x244/0x25c
         start_isolate_page_range+0x14c/0x1f0
         alloc_contig_range+0xec/0x354
         cma_alloc+0x100/0x1fc
         dma_alloc_from_contiguous+0x3c/0x44
         atomic_pool_init+0x7c/0x208
         arm64_dma_init+0x44/0x4c
         do_one_initcall+0x38/0x128
         kernel_init_freeable+0x1a0/0x240
         kernel_init+0x10/0xfc
         ret_from_fork+0x10/0x20
      
      Fix this by moving the whole setup_vmstat which is an initcall right now
      to init_mm_internals which will be called right after the WQ subsystem
      is initialized.
      
      Link: http://lkml.kernel.org/r/20170315164021.28532-1-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarYang Li <pku.leo@gmail.com>
      Tested-by: default avatarYang Li <pku.leo@gmail.com>
      Tested-by: default avatarXiaolong Ye <xiaolong.ye@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      597b7305
  18. 10 Mar, 2017 1 commit
  19. 23 Feb, 2017 1 commit
  20. 01 Dec, 2016 3 commits
  21. 08 Oct, 2016 2 commits
    • Joe Perches's avatar
      seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char · 75ba1d07
      Joe Perches authored
      Allow some seq_puts removals by taking a string instead of a single
      char.
      
      [akpm@linux-foundation.org: update vmstat_show(), per Joe]
      Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
      
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      75ba1d07
    • Alexey Dobriyan's avatar
      proc: much faster /proc/vmstat · 68ba0326
      Alexey Dobriyan authored
      Every current KDE system has process named ksysguardd polling files
      below once in several seconds:
      
      	$ strace -e trace=open -p $(pidof ksysguardd)
      	Process 1812 attached
      	open("/etc/mtab", O_RDONLY|O_CLOEXEC)   = 8
      	open("/etc/mtab", O_RDONLY|O_CLOEXEC)   = 8
      	open("/proc/net/dev", O_RDONLY)         = 8
      	open("/proc/net/wireless", O_RDONLY)    = -1 ENOENT (No such file or directory)
      	open("/proc/stat", O_RDONLY)            = 8
      	open("/proc/vmstat", O_RDONLY)          = 8
      
      Hell knows what it is doing but speed up reading /proc/vmstat by 33%!
      
      Benchmark is open+read+close 1.000.000 times.
      
      			BEFORE
      $ perf stat -r 10 taskset -c 3 ./proc-vmstat
      
       Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):
      
            13146.768464      task-clock (msec)         #    0.960 CPUs utilized            ( +-  0.60% )
                      15      context-switches          #    0.001 K/sec                    ( +-  1.41% )
                       1      cpu-migrations            #    0.000 K/sec                    ( +- 11.11% )
                     104      page-faults               #    0.008 K/sec                    ( +-  0.57% )
          45,489,799,349      cycles                    #    3.460 GHz                      ( +-  0.03% )
           9,970,175,743      stalled-cycles-frontend   #   21.92% frontend cycles idle     ( +-  0.10% )
           2,800,298,015      stalled-cycles-backend    #   6.16% backend cycles idle       ( +-  0.32% )
          79,241,190,850      instructions              #    1.74  insn per cycle
                                                        #    0.13  stalled cycles per insn  ( +-  0.00% )
          17,616,096,146      branches                  # 1339.956 M/sec                    ( +-  0.00% )
             176,106,232      branch-misses             #    1.00% of all branches          ( +-  0.18% )
      
            13.691078109 seconds time elapsed                                          ( +-  0.03% )
            ^^^^^^^^^^^^
      
      			AFTER
      $ perf stat -r 10 taskset -c 3 ./proc-vmstat
      
       Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):
      
             8688.353749      task-clock (msec)         #    0.950 CPUs utilized            ( +-  1.25% )
                      10      context-switches          #    0.001 K/sec                    ( +-  2.13% )
                       1      cpu-migrations            #    0.000 K/sec
                     104      page-faults               #    0.012 K/sec                    ( +-  0.56% )
          30,384,010,730      cycles                    #    3.497 GHz                      ( +-  0.07% )
          12,296,259,407      stalled-cycles-frontend   #   40.47% frontend cycles idle     ( +-  0.13% )
           3,370,668,651      stalled-cycles-backend    #  11.09% backend cycles idle       ( +-  0.69% )
          28,969,052,879      instructions              #    0.95  insn per cycle
                                                        #    0.42  stalled cycles per insn  ( +-  0.01% )
           6,308,245,891      branches                  #  726.058 M/sec                    ( +-  0.00% )
             214,685,502      branch-misses             #    3.40% of all branches          ( +-  0.26% )
      
             9.146081052 seconds time elapsed                                          ( +-  0.07% )
             ^^^^^^^^^^^
      
      vsnprintf() is slow because:
      
      1. format_decode() is busy looking for format specifier: 2 branches
         per character (not in this case, but in others)
      
      2. approximately million branches while parsing format mini language
         and everywhere
      
      3.  just look at what string() does /proc/vmstat is good case because
         most of its content are strings
      
      Link: http://lkml.kernel.org/r/20160806125455.GA1187@p183.telecom.by
      
      Signed-off-by: default avatarAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      68ba0326