1. 28 Mar, 2018 1 commit
  2. 16 Nov, 2017 2 commits
  3. 09 Sep, 2017 3 commits
    • mm: consider the number in local CPUs when reading NUMA stats · 63803222
      Kemi Wang authored
      To avoid deviation, the per-CPU counts of NUMA stats held in
      vm_numa_stat_diff[] are included when a user *reads* the NUMA stats.

      Since NUMA stats are not read by users frequently, and the kernel does
      not need them to make decisions, making the readers more expensive is
      not a problem.
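
The read-side idea can be sketched in user-space C. This is only an illustration under stated assumptions: `struct numa_counter`, `NR_CPUS_SKETCH`, and both function names are hypothetical stand-ins, not the kernel's data structures. The precise reader adds every CPU's pending delta to the global total, which costs one pass over all CPUs per read.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count for this sketch */

/*
 * Hypothetical stand-in for one NUMA counter: a global total plus small
 * per-CPU deltas that have not been folded into the total yet.
 */
struct numa_counter {
	unsigned long global;
	uint16_t cpu_diff[NR_CPUS_SKETCH];
};

/* Cheap read: ignores pending per-CPU deltas, so it can lag behind. */
unsigned long numa_read_fast(const struct numa_counter *c)
{
	return c->global;
}

/* Precise read: walks every CPU and adds its pending delta. */
unsigned long numa_read_precise(const struct numa_counter *c)
{
	unsigned long sum = c->global;

	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		sum += c->cpu_diff[cpu];
	return sum;
}
```

With large per-CPU deltas the fast read can be off by tens of thousands of events per CPU, which is why the precise variant matters once the threshold is raised.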
      
      Link: http://lkml.kernel.org/r/1503568801-21305-4-git-send-email-kemi.wang@intel.com
      Signed-off-by: Kemi Wang <kemi.wang@intel.com>
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Ying Huang <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: update NUMA counter threshold size · 1d90ca89
      Kemi Wang authored
      There is significant overhead from cache bouncing when zone counters
      (the NUMA-associated counters) are updated in parallel during
      multi-threaded page allocation (suggested by Dave Hansen).

      This patch updates the NUMA counter threshold to a fixed size of
      MAX_U16 - 2, as a small threshold greatly increases how often the
      global counter is updated from the local per-CPU counters (suggested by
      Ying Huang).

      The rationale is that these statistics counters don't affect the
      kernel's decisions, unlike other VM counters, so it's not a problem to
      use a large threshold.
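
A minimal user-space sketch of the update side, assuming a single CPU and hypothetical names (`struct numa_counter_cpu`, `numa_count_event`, `count_spills`): events accumulate in a 16-bit local counter, and the shared counter is only written when the local value exceeds the threshold. The fewer the spills, the less the shared cache line bounces between CPUs.

```c
#include <assert.h>
#include <stdint.h>

/* 65533, the "MAX_U16 - 2" value named in the commit message. */
#define NUMA_THRESHOLD_SKETCH (UINT16_MAX - 2)

/* Hypothetical per-CPU part of one counter. */
struct numa_counter_cpu {
	uint16_t diff;
};

/*
 * Count one event on this CPU. Returns 1 if the shared counter had to be
 * written (a "spill"), 0 if only the local counter changed.
 */
int numa_count_event(unsigned long *global, struct numa_counter_cpu *pcp,
		     unsigned int threshold)
{
	unsigned int v = (unsigned int)pcp->diff + 1u;

	/* Keep v within 16 bits so it can be stored back into diff. */
	assert(threshold <= NUMA_THRESHOLD_SKETCH);

	if (v > threshold) {
		*global += v;	/* fold the local delta into the total */
		pcp->diff = 0;
		return 1;
	}
	pcp->diff = (uint16_t)v;
	return 0;
}

/* How many times does the shared counter get written for `events`? */
unsigned long count_spills(unsigned long events, unsigned int threshold)
{
	unsigned long global = 0, spills = 0;
	struct numa_counter_cpu pcp = { 0 };

	for (unsigned long i = 0; i < events; i++)
		spills += (unsigned long)numa_count_event(&global, &pcp,
							  threshold);
	return spills;
}
```

In this sketch, one million events on one CPU write the shared counter 7936 times with a threshold of 125, and 15 times with a threshold of 65533.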
      
      With this patchset, we see a 31.3% drop in CPU cycles (537 --> 369) per
      single page allocation and reclaim on Jesper's page_bench03 benchmark.
      
      Benchmark provided by Jesper D Brouer (loop count increased to 10000000):
      https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
       Threshold   CPU cycles    Throughput(88 threads)
           32          799         241760478
           64          640         301628829
           125         537         358906028 <==> system by default (base)
           256         468         412397590
           512         428         450550704
           4096        399         482520943
           20000       394         489009617
           30000       395         488017817
           65533       369(-31.3%) 521661345(+45.3%) <==> with this patchset
           N/A         342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
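
The percentages in the table can be rechecked from the raw columns with a small helper. The function name `percent_change` is introduced here for illustration only; the input figures are the ones in the table above.

```c
#include <assert.h>

/* Relative change of `value` against `base`, in percent. */
double percent_change(double base, double value)
{
	return (value - base) / base * 100.0;
}
```

For example, `percent_change(537, 369)` is about -31.3 (CPU cycles) and `percent_change(358906028, 521661345)` is about +45.3 (throughput), matching the annotated row for this patchset.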
      
      Link: http://lkml.kernel.org/r/1503568801-21305-3-git-send-email-kemi.wang@intel.com
      Signed-off-by: Kemi Wang <kemi.wang@intel.com>
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Suggested-by: Dave Hansen <dave.hansen@intel.com>
      Suggested-by: Ying Huang <ying.huang@intel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: change the call sites of numa statistics items · 3a321d2a
      Kemi Wang authored
      Patch series "Separate NUMA statistics from zone statistics", v2.
      
      Each page allocation updates a set of per-zone statistics with a call to
      zone_statistics().  As discussed at the 2017 MM summit, these are a
      substantial source of overhead in the page allocator and are very rarely
      consumed.  The overhead comes from cache bouncing when zone counters
      (the NUMA-associated counters) are updated in parallel during
      multi-threaded page allocation (pointed out by Dave Hansen).
      
      A link to the MM summit slides:
        http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf
      
      To mitigate this overhead, this patchset separates NUMA statistics from
      the zone statistics framework, and updates the NUMA counter threshold to
      a fixed size of MAX_U16 - 2, as a small threshold greatly increases how
      often the global counter is updated from the local per-CPU counters
      (suggested by Ying Huang).  The rationale is that these statistics
      counters don't need to be read often, unlike other VM counters, so it's
      not a problem to use a large threshold and make readers more expensive.

      With this patchset, we see a 31.3% drop in CPU cycles (537 --> 369, see
      below) per single page allocation and reclaim on Jesper's page_bench03
      benchmark.  Meanwhile, this patchset keeps the same style of virtual
      memory statistics with little end-user-visible effect (it only moves
      the NUMA stats to be shown after the zone page stats, see the first
      patch for details).

      I ran an experiment of concurrent single page allocation and reclaim
      using Jesper's page_bench03 benchmark on a 2-socket Broadwell-based
      server (88 processors with 126G memory) with different pcp counter
      threshold sizes.
      
      Benchmark provided by Jesper D Brouer (loop count increased to 10000000):
        https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
         Threshold   CPU cycles    Throughput(88 threads)
            32        799         241760478
            64        640         301628829
            125       537         358906028 <==> system by default
            256       468         412397590
            512       428         450550704
            4096      399         482520943
            20000     394         489009617
            30000     395         488017817
            65533     369(-31.3%) 521661345(+45.3%) <==> with this patchset
            N/A       342(-36.3%) 562900157(+56.8%) <==> disable zone_statistics
      
      This patch (of 3):
      
      In this patch, NUMA statistics are separated from the zone statistics
      framework, and all the call sites of NUMA stats are changed to use
      numa-stats-specific functions.  There is no functional change except
      that the NUMA stats are shown after the zone page stats when users
      *read* the zone info.
      
      E.g. cat /proc/zoneinfo
          ***Base***                           ***With this patch***
      nr_free_pages 3976                         nr_free_pages 3976
      nr_zone_inactive_anon 0                    nr_zone_inactive_anon 0
      nr_zone_active_anon 0                      nr_zone_active_anon 0
      nr_zone_inactive_file 0                    nr_zone_inactive_file 0
      nr_zone_active_file 0                      nr_zone_active_file 0
      nr_zone_unevictable 0                      nr_zone_unevictable 0
      nr_zone_write_pending 0                    nr_zone_write_pending 0
      nr_mlock     0                             nr_mlock     0
      nr_page_table_pages 0                      nr_page_table_pages 0
      nr_kernel_stack 0                          nr_kernel_stack 0
      nr_bounce    0                             nr_bounce    0
      nr_zspages   0                             nr_zspages   0
      numa_hit 0                                *nr_free_cma  0*
      numa_miss 0                                numa_hit     0
      numa_foreign 0                             numa_miss    0
      numa_interleave 0                          numa_foreign 0
      numa_local   0                             numa_interleave 0
      numa_other   0                             numa_local   0
      *nr_free_cma 0*                            numa_other 0
          ...                                        ...
      vm stats threshold: 10                     vm stats threshold: 10
          ...                                        ...
      
      The next patch updates the numa stats counter size and threshold.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1503568801-21305-2-git-send-email-kemi.wang@intel.com
      Signed-off-by: Kemi Wang <kemi.wang@intel.com>
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Ying Huang <ying.huang@intel.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 07 Sep, 2017 6 commits
    • mm, swap: add swap readahead hit statistics · cbc65df2
      Huang Ying authored
      Patch series "mm, swap: VMA based swap readahead", v4.
      
      Swap readahead is an important mechanism for reducing swap-in latency.
      Although a purely sequential memory access pattern isn't very common
      for anonymous memory, spatial locality is still considered valid.

      In the original swap readahead implementation, consecutive blocks in
      the swap device are read ahead based on a global spatial locality
      estimate.  But consecutive blocks in the swap device just reflect the
      order of page reclaim, and don't necessarily reflect the access pattern
      in virtual memory space.  Different tasks in the system may also have
      different access patterns, which makes the global spatial locality
      estimate incorrect.

      In this patchset, when a page fault occurs, the virtual pages near the
      fault address are read ahead instead of the swap slots near the
      faulting swap slot in the swap device.  This avoids reading ahead
      unrelated swap slots.  At the same time, swap readahead is changed to
      work per-VMA instead of globally, so that the different access patterns
      of different VMAs can be distinguished and different readahead
      policies applied accordingly.  The original core readahead detection
      and scaling algorithm is reused, because it is an effective algorithm
      for detecting spatial locality.
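
To make "virtual pages near the fault address" concrete, here is a deliberately simplified sketch. It assumes 4 KiB pages and a forward-only window, and every name in it (`struct ra_window`, `vma_ra_window`, `PAGE_SHIFT_SKETCH`) is hypothetical; the patchset's own window selection and scaling logic is not reproduced here. The point it illustrates is only that the window is expressed in virtual addresses and clamped to the VMA, rather than in swap slot offsets.

```c
#include <assert.h>

#define PAGE_SHIFT_SKETCH 12	/* assumes 4 KiB pages */

/* Half-open byte range [start, end) of virtual addresses to read ahead. */
struct ra_window {
	unsigned long start;
	unsigned long end;
};

/*
 * Pick up to nr_pages pages starting at the faulting page, never leaving
 * the VMA [vma_start, vma_end). Forward-only for simplicity.
 */
struct ra_window vma_ra_window(unsigned long fault_addr,
			       unsigned long vma_start,
			       unsigned long vma_end,
			       unsigned int nr_pages)
{
	unsigned long page_size = 1UL << PAGE_SHIFT_SKETCH;
	unsigned long faddr = fault_addr & ~(page_size - 1);
	struct ra_window w;

	/* The fault must lie inside the VMA it is attributed to. */
	assert(vma_start <= faddr && faddr < vma_end);

	w.start = faddr;
	w.end = faddr + (unsigned long)nr_pages * page_size;
	if (w.end > vma_end)
		w.end = vma_end;	/* clamp to the VMA boundary */
	return w;
}
```

Because the window never crosses a VMA boundary, pages belonging to an unrelated mapping are not pulled in, even if their swap slots happen to be adjacent on the device.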
      
      In addition to the swap readahead changes, some new sysfs interfaces
      are added to show the efficiency of the readahead algorithm and some
      other swap statistics.

      This new implementation will incur more small random reads.  On SSD,
      the improved correctness of estimation and readahead target should
      outweigh the potential increased overhead, as the test results below
      also illustrate.  But on HDD the overhead may outweigh the benefit, so
      the original implementation is used by default.

      The tests and results are as follows.
      
      Common test condition
      =====================
      
      Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
      Swap device: NVMe disk
      
      Micro-benchmark with combined access pattern
      ============================================
      
      vm-scalability, sequential swap test case: 4 processes eat 50G of
      virtual memory space, repeating sequential memory writes for 300
      seconds.  The first round of writes triggers swap out; the following
      rounds trigger sequential swap in and out.

      At the same time, the vm-scalability random swap test case runs in the
      background: 8 processes eat 30G of virtual memory space, repeating
      random memory writes for 300 seconds.  This triggers random swap-in in
      the background.

      This is a combined workload with sequential and random memory accesses
      at the same time.  The results (for the sequential workload) are as
      follows:
      
      			Base		Optimized
      			----		---------
      throughput		345413 KB/s	414029 KB/s (+19.9%)
      latency.average		97.14 us	61.06 us (-37.1%)
      latency.50th		2 us		1 us
      latency.60th		2 us		1 us
      latency.70th		98 us		2 us
      latency.80th		160 us		2 us
      latency.90th		260 us		217 us
      latency.95th		346 us		369 us
      latency.99th		1.34 ms		1.09 ms
      ra_hit%			52.69%		99.98%
      
      The original swap readahead algorithm is confused by the background
      random access workload, so its readahead hit rate is lower.  The
      VMA-based readahead algorithm works much better.
      
      Linpack
      =======
      
      The test memory size is bigger than RAM to trigger swapping.
      
      			Base		Optimized
      			----		---------
      elapsed_time		393.49 s	329.88 s (-16.2%)
      ra_hit%			86.21%		98.82%
      
      The score of the base and optimized kernels shows no visible change.
      But the elapsed time is reduced and the readahead hit rate improved, so
      the optimized kernel runs better during the startup and teardown
      stages.  The high absolute readahead hit rate also shows that spatial
      locality is still valid in some practical workloads.
      
      This patch (of 5):
      
      The statistics for total readahead pages and total readahead hits are
      recorded and exported via the following sysfs interface.
      
      /sys/kernel/mm/swap/ra_hits
      /sys/kernel/mm/swap/ra_total
      
      With them, the efficiency of the swap readahead could be measured, so
      that the swap readahead algorithm and parameters could be tuned
      accordingly.
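
A hit rate like the ra_hit% figures above can be derived from the two counters. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the sysfs paths are the ones named in this commit message, which may not exist on a given kernel.

```c
#include <assert.h>
#include <stdio.h>

/* Read one unsigned decimal counter from a file; 0 on success, -1 on error. */
int read_counter(const char *path, unsigned long long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%llu", out) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/* Hit rate in percent; defined as 0 while nothing has been read ahead. */
double ra_hit_percent(unsigned long long hits, unsigned long long total)
{
	if (total == 0)
		return 0.0;
	return 100.0 * (double)hits / (double)total;
}
```

A caller would read `/sys/kernel/mm/swap/ra_hits` and `/sys/kernel/mm/swap/ra_total` with `read_counter()` and pass both values to `ra_hit_percent()`, treating a read failure as "statistic not available".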
      
      [akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n]
      Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmstat.c: fix wrong comment · f113e641
      SeongJae Park authored
      Comment for pagetypeinfo_showblockcount() is mistakenly duplicated from
      pagetypeinfo_show_free()'s comment.  This commit fixes it.
      
      Link: http://lkml.kernel.org/r/20170809185816.11244-1-sj38.park@gmail.com
      Fixes: 467c996c ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
      Signed-off-by: SeongJae Park <sj38.park@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmstat: fix divide error at __fragmentation_index · 88d6ac40
      Wen Yang authored
      When order is -1 or too big, *1UL << order* will be 0, which will cause
      a divide error.  Although it seems that all callers of
      __fragmentation_index() will only do so with a valid order, the patch
      can make it more robust.
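
The hazard can be shown with a small user-space sketch. In C, shifting by a negative count or by at least the width of the type is undefined, and a result of 0 then leads to a division by zero. The helpers below are hypothetical (`pages_for_order`, `blocks_available`) and validate the order before shifting and dividing; the bound used here is simply the bit width of `unsigned long`, which is an assumption of this sketch and not necessarily the bound the patch checks.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG_SKETCH (sizeof(unsigned long) * CHAR_BIT)

/* Pages in a block of the given order, or 0 if the order is invalid. */
unsigned long pages_for_order(int order)
{
	if (order < 0 || (unsigned int)order >= BITS_PER_LONG_SKETCH)
		return 0;
	return 1UL << order;
}

/*
 * How many blocks of the given order fit into free_pages.
 * Returns -1 instead of dividing by zero when the order is invalid.
 */
long blocks_available(unsigned long free_pages, int order)
{
	unsigned long requested = pages_for_order(order);

	if (requested == 0)
		return -1;
	return (long)(free_pages / requested);
}
```

Rejecting the bad order up front turns a crash into an error value the caller can handle.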
      
      Should prevent reoccurrences of
      https://bugzilla.kernel.org/show_bug.cgi?id=196555
      
      Link: http://lkml.kernel.org/r/1501751520-2598-1-git-send-email-wen.yang99@zte.com.cn
      Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
      Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rename global_page_state to global_zone_page_state · c41f012a
      Michal Hocko authored
      global_page_state is error prone as a recent bug report pointed out [1].
      It only returns proper values for zone based counters as the enum it
      gets suggests.  We already have global_node_page_state so let's rename
      global_page_state to global_zone_page_state to be more explicit here.
      All existing users seem to be correct:
      
      $ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
            2 NR_BOUNCE
            2 NR_FREE_CMA_PAGES
           11 NR_FREE_PAGES
            1 NR_KERNEL_STACK_KB
            1 NR_MLOCK
            2 NR_PAGETABLE
      
      This patch shouldn't introduce any functional change.
      
      [1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp
      
      Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, THP, swap: add THP swapping out fallback counting · fe490cc0
      Huang Ying authored
      When swapping out a THP (Transparent Huge Page), instead of swapping
      out the THP as a whole, sometimes we have to fall back to splitting the
      THP into normal pages before swapping, because no free swap clusters
      are available, or the cgroup limit is exceeded, etc.  To count these
      fallbacks, a new VM event THP_SWPOUT_FALLBACK is added, and counted
      when we fall back to splitting the THP.
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-13-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: test code to write THP to swap device as a whole · 225311a4
      Huang Ying authored
      To support delaying the split of a THP (Transparent Huge Page) until
      after it is swapped out, we need to enhance the swap writing code to
      support writing a THP as a whole.  This will improve swap write IO
      performance.

      As Ming Lei <ming.lei@redhat.com> pointed out, this should be based on
      multipage bvec support, which hasn't been merged yet.  So this patch is
      only for testing the functionality of the other patches in the series,
      and will be reimplemented after multipage bvec support is merged.
      
      Link: http://lkml.kernel.org/r/20170724051840.2309-7-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ross Zwisler <ross.zwisler@intel.com> [for brd.c, zram_drv.c, pmem.c]
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Vishal L Verma <vishal.l.verma@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 10 Jul, 2017 1 commit
  6. 06 Jul, 2017 4 commits
  7. 12 May, 2017 1 commit
  8. 03 May, 2017 5 commits
    • mm, vmstat: suppress pcp stats for unpopulated zones in zoneinfo · 7dfb8bf3
      David Rientjes authored
      After "mm, vmstat: print non-populated zones in zoneinfo",
      /proc/zoneinfo will show unpopulated zones.
      
      The per-cpu pageset statistics are not relevant for unpopulated zones
      and can be potentially lengthy, so suppress them when they are not
      interesting.
      
      Also moves lowmem reserve protection information above pcp stats since
      it is relevant for all zones per vm.lowmem_reserve_ratio.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703061400500.46428@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmstat: print non-populated zones in zoneinfo · b2bd8598
      David Rientjes authored
      Initscripts can use the information (protection levels) from
      /proc/zoneinfo to configure vm.lowmem_reserve_ratio at boot.
      
      vm.lowmem_reserve_ratio is an array of ratios for each configured zone
      on the system.  If a zone is not populated on an arch, /proc/zoneinfo
      suppresses its output.
      
      This results in there not being a 1:1 mapping between the set of zones
      emitted by /proc/zoneinfo and the zones configured by
      vm.lowmem_reserve_ratio.
      
      This patch shows statistics for non-populated zones in /proc/zoneinfo.
      The zones exist and hold a spot in the vm.lowmem_reserve_ratio array.
      Without this patch, it is not possible to determine which index in the
      array controls which zone if one or more zones on the system are not
      populated.
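
The index mismatch can be illustrated with a small sketch. Everything here is hypothetical (the struct, the function, and the example zone layout); it only models "position in the printed output" with and without skipping unpopulated zones.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical description of one configured zone. */
struct zone_sketch {
	const char *name;
	int populated;
};

/*
 * Position of zone `name` among the zones that get printed, or -1 if it
 * is not printed. skip_unpopulated models output that omits empty zones.
 */
int printed_index(const struct zone_sketch *zones, int nr,
		  const char *name, int skip_unpopulated)
{
	int idx = 0;

	for (int i = 0; i < nr; i++) {
		if (skip_unpopulated && !zones[i].populated)
			continue;
		if (strcmp(zones[i].name, name) == 0)
			return idx;
		idx++;
	}
	return -1;
}
```

With an example layout of DMA (populated), DMA32 (unpopulated), and Normal (populated), Normal is the second zone printed when empty zones are skipped but the third entry in a per-zone array, so a reader of the skipped output would pick the wrong array slot. Printing every zone keeps the two positions aligned.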
      
      Remaining users of walk_zones_in_node() are unchanged.  Files such as
      /proc/pagetypeinfo require certain zone data to be initialized properly
      for display, which is not done for unpopulated zones.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1703031451310.98023@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move MADV_FREE pages into LRU_INACTIVE_FILE list · f7ad2a6c
      Shaohua Li authored
      madvise()'s MADV_FREE indicates that pages are 'lazyfree'.  They are
      still anonymous pages, but they can be freed without pageout.  To
      distinguish these from normal anonymous pages, we clear their
      SwapBacked flag.
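
From user space, pages become 'lazyfree' through the madvise(2) call. The sketch below maps anonymous memory, dirties it, and marks it with MADV_FREE. The function name `lazyfree_demo` and its return convention are made up for this example, and it assumes a Linux system whose kernel and C library provide MADV_FREE; where they do not, it reports that instead of failing.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Returns 0 if the range was marked lazyfree, 1 if MADV_FREE is not
 * supported here, and -1 on any other error.
 */
int lazyfree_demo(size_t len)
{
	int ret;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return -1;

	memset(buf, 0xaa, len);		/* touch the pages so they exist */

#ifdef MADV_FREE
	if (madvise(buf, len, MADV_FREE) == 0)
		ret = 0;		/* kernel may now drop these pages */
	else
		ret = (errno == EINVAL) ? 1 : -1;
#else
	ret = 1;			/* headers do not define MADV_FREE */
#endif

	munmap(buf, len);
	return ret;
}
```

After a successful call the contents are no longer guaranteed: the kernel may free the pages under memory pressure, unless the program writes to them again first.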
      
      MADV_FREE pages can be freed without pageout, so they are pretty much
      like used-once file pages.  For such pages, we'd like to reclaim them
      once there is memory pressure.  Also, it might be unfair to always
      reclaim MADV_FREE pages before used-once file pages, and we definitely
      want to reclaim them before other anonymous and file pages.

      To speed up MADV_FREE page reclaim, we put the pages into the
      LRU_INACTIVE_FILE list.  The rationale is that the LRU_INACTIVE_FILE
      list is tiny nowadays and should be full of used-once file pages.
      Reclaiming MADV_FREE pages will not interfere much with anonymous and
      active file pages.  And the inactive file pages and MADV_FREE pages
      will be reclaimed according to their age, so we don't reclaim too many
      MADV_FREE pages either.  Putting the MADV_FREE pages into the
      LRU_INACTIVE_FILE list also means we can reclaim the pages without swap
      support.  This idea was suggested by Johannes.

      This patch doesn't move MADV_FREE pages to the LRU_INACTIVE_FILE list
      yet, to avoid bisect failures; the next patch will do it.
      
      The patch is based on Minchan's original patch.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: delete NR_PAGES_SCANNED and pgdat_reclaimable() · c822f622
      Johannes Weiner authored
      NR_PAGES_SCANNED counts number of pages scanned since the last page free
      event in the allocator.  This was used primarily to measure the
      reclaimability of zones and nodes, and determine when reclaim should
      give up on them.  In that role, it has been replaced in the preceding
      patches by a different mechanism.
      
      Being implemented as an efficient vmstat counter, it was automatically
      exported to userspace as well.  It's however unlikely that anyone
      outside the kernel is using this counter in any meaningful way.
      
      Remove the counter and the unused pgdat_reclaimable().
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix 100% CPU kswapd busyloop on unreclaimable nodes · c73322d0
      Johannes Weiner authored
      Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
      cleanups".
      
      Jia reported a scenario in which the kswapd of a node indefinitely spins
      at 100% CPU usage.  We have seen similar cases at Facebook.
      
      The kernel's current method of judging its ability to reclaim a node (or
      whether to back off and sleep) is based on the amount of scanned pages
      in proportion to the amount of reclaimable pages.  In Jia's and our
      scenarios, there are no reclaimable pages in the node, however, and the
      condition for backing off is never met.  Kswapd busyloops in an attempt
      to restore the watermarks while having nothing to work with.
      
      This series reworks the definition of an unreclaimable node based not on
      scanning but on whether kswapd is able to actually reclaim pages in
      MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criterion
      the page allocator uses for giving up on direct reclaim and invoking the
      OOM killer.  If it cannot free any pages, kswapd will go to sleep and
      leave further attempts to direct reclaim invocations, which will either
      make progress and re-enable kswapd, or invoke the OOM killer.
      
      Patch #1 fixes the immediate problem Jia reported, the remainder are
      smaller fixlets, cleanups, and overall phasing out of the old method.
      
      Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
      and directly related to #5, but in itself not relevant to the series.
      
      If the whole series is too ambitious for 4.11, I would consider the
      first three patches fixes, the rest cleanups.
      
      This patch (of 9):
      
      Jia He reports a problem with kswapd spinning at 100% CPU when
      requesting more hugepages than memory available in the system:
      
      $ echo 4000 >/proc/sys/vm/nr_hugepages
      
      top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
      Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
      KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
      KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
      
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
      
      At that time, there are no reclaimable pages left in the node, but as
      kswapd fails to restore the high watermarks it refuses to go to sleep.
      
      Kswapd needs to back away from nodes that fail to balance.  Up until
      commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
      nodes") kswapd had such a mechanism.  It considered zones whose
      theoretically reclaimable pages it had reclaimed six times over as
      unreclaimable and backed away from them.  This guard was erroneously
      removed as the patch changed the definition of a balanced node.
      
      However, simply restoring this code wouldn't help in the case reported
      here: there *are* no reclaimable pages that could be scanned until the
      threshold is met.  Kswapd would stay awake anyway.
      
      Introduce a new and much simpler way of backing off.  If kswapd runs
      through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
      page, make it back off from the node.  This is the same number of shots
      direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
      that node until a direct reclaimer manages to reclaim some pages, thus
      proving the node reclaimable again.
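
The back-off rule can be modelled with a per-node failure counter. This is a user-space sketch under stated assumptions: `struct node_sketch` and both functions are hypothetical, and resetting the counter whenever a run reclaims something is this sketch's reading of "consecutive" runs, not a claim about where the patch resets it.

```c
#include <assert.h>

#define MAX_RECLAIM_RETRIES 16	/* value stated in the commit message */

/* Hypothetical per-node state for this sketch. */
struct node_sketch {
	int kswapd_failures;
};

/*
 * Account one finished kswapd run on the node. Returns 1 if kswapd should
 * keep working on the node, 0 if it should back off and sleep.
 */
int kswapd_run_done(struct node_sketch *node, unsigned long nr_reclaimed)
{
	if (nr_reclaimed)
		node->kswapd_failures = 0;	/* progress was made */
	else
		node->kswapd_failures++;	/* another fruitless run */

	return node->kswapd_failures < MAX_RECLAIM_RETRIES;
}

/* A direct reclaimer freed pages: the node counts as reclaimable again. */
void direct_reclaim_progress(struct node_sketch *node)
{
	node->kswapd_failures = 0;
}
```

Unlike a rule based on scanned pages, this counter still advances when there is nothing to scan at all, which is exactly the case reported here.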
      
      [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
        Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
      [shakeelb@google.com: fix condition for throttle_direct_reclaim]
        Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: Jia He <hejianet@gmail.com>
      Tested-by: Jia He <hejianet@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c73322d0
  9. 19 Apr, 2017 1 commit
    • Michal Hocko's avatar
      mm: make mm_percpu_wq non freezable · 80d136e1
      Michal Hocko authored
      Geert has reported a freeze during PM resume, and some additional
      debugging has shown that the device_resume worker cannot make forward
      progress because it waits for an event which is stuck waiting in
      drain_all_pages:
      
        INFO: task kworker/u4:0:5 blocked for more than 120 seconds.
              Not tainted 4.11.0-rc7-koelsch-00029-g005882e5-dirty #3476
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        kworker/u4:0    D    0     5      2 0x00000000
        Workqueue: events_unbound async_run_entry_fn
          __schedule
          schedule
          schedule_timeout
          wait_for_common
          dpm_wait_for_superior
          device_resume
          async_resume
          async_run_entry_fn
          process_one_work
          worker_thread
          kthread
        [...]
        bash            D    0  1703   1694 0x00000000
          __schedule
          schedule
          schedule_timeout
          wait_for_common
          flush_work
          drain_all_pages
          start_isolate_page_range
          alloc_contig_range
          cma_alloc
          __alloc_from_contiguous
          cma_allocator_alloc
          __dma_alloc
          arm_dma_alloc
          sh_eth_ring_init
          sh_eth_open
          sh_eth_resume
          dpm_run_callback
          device_resume
          dpm_resume
          dpm_resume_end
          suspend_devices_and_enter
          pm_suspend
          state_store
          kernfs_fop_write
          __vfs_write
          vfs_write
          SyS_write
        [...]
        Showing busy workqueues and worker pools:
        [...]
        workqueue mm_percpu_wq: flags=0xc
          pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=0/0
            delayed: drain_local_pages_wq, vmstat_update
          pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=0/0
            delayed: drain_local_pages_wq BAR(1703), vmstat_update
      
      Tetsuo has correctly noted that mm_percpu_wq is created as WQ_FREEZABLE,
      so it is still frozen this early during resume and we are effectively
      deadlocked.  Fix this by dropping WQ_FREEZABLE when creating
      mm_percpu_wq.  We really want to have it operational all the time.
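
The `flags=0xc` shown for mm_percpu_wq in the workqueue dump can be decoded with a few bit tests. The sketch below assumes the bit values that include/linux/workqueue.h used around v4.11 (an assumption, so check the header of the tree in question); the two helper functions are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit values, mirroring include/linux/workqueue.h around v4.11. */
enum {
    WQ_UNBOUND     = 1 << 1,
    WQ_FREEZABLE   = 1 << 2,
    WQ_MEM_RECLAIM = 1 << 3,
};

/* True if a workqueue created with these flags is frozen during suspend. */
bool wq_is_freezable(unsigned int flags)
{
    return (flags & WQ_FREEZABLE) != 0;
}

/* Hypothetical helper: the same flags with WQ_FREEZABLE cleared. */
unsigned int wq_drop_freezable(unsigned int flags)
{
    unsigned int fixed = flags & ~(unsigned int)WQ_FREEZABLE;

    assert(!wq_is_freezable(fixed));
    return fixed;
}
```

Under these assumed values, 0xc is WQ_FREEZABLE | WQ_MEM_RECLAIM, and clearing the freezable bit leaves 0x8, i.e. WQ_MEM_RECLAIM alone.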
      
      Fixes: ce612879 ("mm: move pcp and lru-pcp draining into single wq")
      Reported-and-tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      80d136e1
  10. 08 Apr, 2017 1 commit
  11. 01 Apr, 2017 1 commit
    • Michal Hocko's avatar
      mm: move mm_percpu_wq initialization earlier · 597b7305
      Michal Hocko authored
      Yang Li has reported that drain_all_pages triggers a WARN_ON which means
      that this function is called earlier than the mm_percpu_wq is
      initialized on arm64 with CMA configured:
      
        WARNING: CPU: 2 PID: 1 at mm/page_alloc.c:2423 drain_all_pages+0x244/0x25c
        Modules linked in:
        CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-next-20170310-00027-g64dfbc5 #127
        Hardware name: Freescale Layerscape 2088A RDB Board (DT)
        task: ffffffc07c4a6d00 task.stack: ffffffc07c4a8000
        PC is at drain_all_pages+0x244/0x25c
        LR is at start_isolate_page_range+0x14c/0x1f0
        [...]
         drain_all_pages+0x244/0x25c
         start_isolate_page_range+0x14c/0x1f0
         alloc_contig_range+0xec/0x354
         cma_alloc+0x100/0x1fc
         dma_alloc_from_contiguous+0x3c/0x44
         atomic_pool_init+0x7c/0x208
         arm64_dma_init+0x44/0x4c
         do_one_initcall+0x38/0x128
         kernel_init_freeable+0x1a0/0x240
         kernel_init+0x10/0xfc
         ret_from_fork+0x10/0x20
      
      Fix this by moving the whole setup_vmstat which is an initcall right now
      to init_mm_internals which will be called right after the WQ subsystem
      is initialized.
      
      Link: http://lkml.kernel.org/r/20170315164021.28532-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Yang Li <pku.leo@gmail.com>
      Tested-by: Yang Li <pku.leo@gmail.com>
      Tested-by: Xiaolong Ye <xiaolong.ye@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      597b7305
  12. 10 Mar, 2017 1 commit
  13. 23 Feb, 2017 1 commit
  14. 01 Dec, 2016 3 commits
  15. 08 Oct, 2016 4 commits
    • Joe Perches's avatar
      seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char · 75ba1d07
      Joe Perches authored
      Allow some seq_puts removals by taking a string instead of a single
      char.
      
      [akpm@linux-foundation.org: update vmstat_show(), per Joe]
      Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.com
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      75ba1d07
    • Alexey Dobriyan's avatar
      proc: much faster /proc/vmstat · 68ba0326
      Alexey Dobriyan authored
      Every current KDE system has a process named ksysguardd polling the
      files below once every several seconds:
      
      	$ strace -e trace=open -p $(pidof ksysguardd)
      	Process 1812 attached
      	open("/etc/mtab", O_RDONLY|O_CLOEXEC)   = 8
      	open("/etc/mtab", O_RDONLY|O_CLOEXEC)   = 8
      	open("/proc/net/dev", O_RDONLY)         = 8
      	open("/proc/net/wireless", O_RDONLY)    = -1 ENOENT (No such file or directory)
      	open("/proc/stat", O_RDONLY)            = 8
      	open("/proc/vmstat", O_RDONLY)          = 8
      
      Hell knows what it is doing but speed up reading /proc/vmstat by 33%!
      
      Benchmark is open+read+close 1.000.000 times.
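
The source of ./proc-vmstat is not included in the message. A minimal sketch of such an open+read+close loop might look like the following; the function name, buffer size, and return convention are assumptions made for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open `path`, read it to EOF, and close it, `iterations` times over.
 * Returns the total number of bytes read, or -1 on the first error.
 */
long long bench_open_read_close(const char *path, long iterations)
{
    char buf[16384];
    long long total = 0;

    assert(path);
    for (long i = 0; i < iterations; i++) {
        ssize_t n;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            return -1;

        /* Drain the file; each read() re-runs the kernel's show routine. */
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            total += n;

        close(fd);
        if (n < 0)
            return -1;
    }
    return total;
}
```

A driver would call `bench_open_read_close("/proc/vmstat", 1000000)` from `main()` and be run under `perf stat`, pinned to one CPU with `taskset` as in the measurements that follow.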
      
      			BEFORE
      $ perf stat -r 10 taskset -c 3 ./proc-vmstat
      
       Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):
      
            13146.768464      task-clock (msec)         #    0.960 CPUs utilized            ( +-  0.60% )
                      15      context-switches          #    0.001 K/sec                    ( +-  1.41% )
                       1      cpu-migrations            #    0.000 K/sec                    ( +- 11.11% )
                     104      page-faults               #    0.008 K/sec                    ( +-  0.57% )
          45,489,799,349      cycles                    #    3.460 GHz                      ( +-  0.03% )
           9,970,175,743      stalled-cycles-frontend   #   21.92% frontend cycles idle     ( +-  0.10% )
           2,800,298,015      stalled-cycles-backend    #   6.16% backend cycles idle       ( +-  0.32% )
          79,241,190,850      instructions              #    1.74  insn per cycle
                                                        #    0.13  stalled cycles per insn  ( +-  0.00% )
          17,616,096,146      branches                  # 1339.956 M/sec                    ( +-  0.00% )
             176,106,232      branch-misses             #    1.00% of all branches          ( +-  0.18% )
      
            13.691078109 seconds time elapsed                                          ( +-  0.03% )
            ^^^^^^^^^^^^
      
      			AFTER
      $ perf stat -r 10 taskset -c 3 ./proc-vmstat
      
       Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):
      
             8688.353749      task-clock (msec)         #    0.950 CPUs utilized            ( +-  1.25% )
                      10      context-switches          #    0.001 K/sec                    ( +-  2.13% )
                       1      cpu-migrations            #    0.000 K/sec
                     104      page-faults               #    0.012 K/sec                    ( +-  0.56% )
          30,384,010,730      cycles                    #    3.497 GHz                      ( +-  0.07% )
          12,296,259,407      stalled-cycles-frontend   #   40.47% frontend cycles idle     ( +-  0.13% )
           3,370,668,651      stalled-cycles-backend    #  11.09% backend cycles idle       ( +-  0.69% )
          28,969,052,879      instructions              #    0.95  insn per cycle
                                                        #    0.42  stalled cycles per insn  ( +-  0.01% )
           6,308,245,891      branches                  #  726.058 M/sec                    ( +-  0.00% )
             214,685,502      branch-misses             #    3.40% of all branches          ( +-  0.26% )
      
             9.146081052 seconds time elapsed                                          ( +-  0.07% )
             ^^^^^^^^^^^
      
      vsnprintf() is slow because:
      
      1. format_decode() is busy looking for a format specifier: 2 branches
         per character (not in this case, but in others)

      2. approximately a million branches while parsing the format mini
         language and everywhere

      3. just look at what string() does; /proc/vmstat is a good case
         because most of its content is strings
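
To illustrate why skipping the format parser helps, here is a user-space sketch that emits one "name value" line with a plain copy and a hand-rolled decimal conversion instead of a printf-style call. It is not the kernel's seq_file code; `put_name_value` is a hypothetical name, and the output is deliberately not NUL-terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Write "<name> <value>\n" into buf without parsing any format string.
 * Returns the number of bytes written, or 0 if the line does not fit.
 */
size_t put_name_value(char *buf, size_t size, const char *name,
                      unsigned long long value)
{
    char digits[20];    /* 2^64 - 1 has 20 decimal digits */
    size_t ndigits = 0;
    size_t name_len;
    size_t len;

    assert(buf && name);
    name_len = strlen(name);

    /* Generate digits least-significant first. */
    do {
        digits[ndigits++] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);

    /* name + ' ' + digits + '\n' */
    len = name_len + 1 + ndigits + 1;
    if (len > size)
        return 0;

    memcpy(buf, name, name_len);
    buf += name_len;
    *buf++ = ' ';
    while (ndigits > 0)
        *buf++ = digits[--ndigits];    /* reverse into output order */
    *buf = '\n';

    return len;
}
```

The only per-character work left is the copy and one division per digit; there is no scan for '%' and no dispatch on a conversion specifier.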
      
      Link: http://lkml.kernel.org/r/20160806125455.GA1187@p183.telecom.by
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68ba0326
    • Tim Chen's avatar
      cpu: fix node state for whether it contains CPU · 03e86dba
      Tim Chen authored
      In current kernel code, we only call node_set_state(cpu_to_node(cpu),
      N_CPU) when a cpu is hot plugged.  But we do not set the node state for
      N_CPU when the cpus are brought online during boot.
      
      So this could lead to failure when we check to see if a node contains
      cpu with node_state(node_id, N_CPU).
      
      One use case is in the node_reclaim function:
      
              /*
               * Only run node reclaim on the local node or on nodes that do not
               * have associated processors. This will favor the local processor
               * over remote processors and spread off node memory allocations
               * as wide as possible.
               */
              if (node_state(pgdat->node_id, N_CPU) &&
                  pgdat->node_id != numa_node_id())
                      return NODE_RECLAIM_NOSCAN;
      
      I instrumented the kernel to call the function below after boot and it
      always returns 0 on an x86 desktop machine until I apply the attached
      patch.
      
         int num_cpu_node(void)
         {
             int i, nr_cpu_nodes = 0;
      
             for_each_node(i) {
                     if (node_state(i, N_CPU))
                             ++nr_cpu_nodes;
             }
      
             return nr_cpu_nodes;
         }
      
      Fix this by checking each node for online CPU when we initialize
      vmstat that's responsible for maintaining node state.
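
The idea of the fix can be sketched as a user-space model: walk the online CPUs once at init and mark each CPU's node. `struct topo_model`, its limits, and `init_cpu_node_state` are hypothetical names for illustration; the boolean array only plays the role of the N_CPU node state and is not how the kernel stores it.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 8     /* hypothetical limits for the model */
#define MAX_CPUS  64

struct topo_model {
    int  cpu_to_node[MAX_CPUS];     /* node each CPU belongs to */
    bool cpu_online[MAX_CPUS];      /* CPUs brought up during boot */
    bool node_has_cpu[MAX_NODES];   /* stands in for the N_CPU state */
};

/* Run once at init: mark every node that has at least one online CPU. */
void init_cpu_node_state(struct topo_model *t)
{
    for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
        int node = t->cpu_to_node[cpu];

        assert(node >= 0 && node < MAX_NODES);
        if (t->cpu_online[cpu])
            t->node_has_cpu[node] = true;
    }
}

/* Same check as the instrumentation in the message, against the model. */
int num_cpu_node(const struct topo_model *t)
{
    int nr_cpu_nodes = 0;

    for (int node = 0; node < MAX_NODES; node++)
        if (t->node_has_cpu[node])
            ++nr_cpu_nodes;

    return nr_cpu_nodes;
}
```

Before `init_cpu_node_state()` runs, the count is 0 even though CPUs are online, which mirrors the symptom reported above.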
      
      Link: http://lkml.kernel.org/r/20160829175922.GA21775@linux.intel.com
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: <Huang@linux.intel.com>
      Cc: Ying <ying.huang@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03e86dba
    • Joonsoo Kim's avatar
      mm/page_owner: move page_owner specific function to page_owner.c · e2f612e6
      Joonsoo Kim authored
      There is no reason for page_owner-specific functions to reside in
      vmstat.c.
      
      Link: http://lkml.kernel.org/r/1471315879-32294-4-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e2f612e6
  16. 28 Jul, 2016 5 commits
    • Mel Gorman's avatar
      mm: remove reclaim and compaction retry approximations · 5a1c84b4
      Mel Gorman authored
      If per-zone LRU accounting is available then there is no point
      approximating whether reclaim and compaction should retry based on pgdat
      statistics.  This is effectively a revert of "mm, vmstat: remove zone
      and node double accounting by approximating retries" with the difference
      that inactive/active stats are still available.  This preserves the
      history of why the approximation was retried and why it had to be
      reverted to handle OOM kills on 32-bit systems.
      
      Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a1c84b4
    • Minchan Kim's avatar
      mm: add per-zone lru list stat · 71c799f4
      Minchan Kim authored
      When I ran a stress test with hackbench, I frequently got OOM messages,
      which never happened with zone-lru.
      
        gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
        ..
        ..
         __alloc_pages_nodemask+0xe52/0xe60
         ? new_slab+0x39c/0x3b0
         new_slab+0x39c/0x3b0
         ___slab_alloc.constprop.87+0x6da/0x840
         ? __alloc_skb+0x3c/0x260
         ? _raw_spin_unlock_irq+0x27/0x60
         ? trace_hardirqs_on_caller+0xec/0x1b0
         ? finish_task_switch+0xa6/0x220
         ? poll_select_copy_remaining+0x140/0x140
         __slab_alloc.isra.81.constprop.86+0x40/0x6d
         ? __alloc_skb+0x3c/0x260
         kmem_cache_alloc+0x22c/0x260
         ? __alloc_skb+0x3c/0x260
         __alloc_skb+0x3c/0x260
         alloc_skb_with_frags+0x4e/0x1a0
         sock_alloc_send_pskb+0x16a/0x1b0
         ? wait_for_unix_gc+0x31/0x90
         ? alloc_set_pte+0x2ad/0x310
         unix_stream_sendmsg+0x28d/0x340
         sock_sendmsg+0x2d/0x40
         sock_write_iter+0x6c/0xc0
         __vfs_write+0xc0/0x120
         vfs_write+0x9b/0x1a0
         ? __might_fault+0x49/0xa0
         SyS_write+0x44/0x90
         do_fast_syscall_32+0xa6/0x1e0
         sysenter_past_esp+0x45/0x74
      
        Mem-Info:
        active_anon:104698 inactive_anon:105791 isolated_anon:192
         active_file:433 inactive_file:283 isolated_file:22
         unevictable:0 dirty:0 writeback:296 unstable:0
         slab_reclaimable:6389 slab_unreclaimable:78927
         mapped:474 shmem:0 pagetables:101426 bounce:0
         free:10518 free_pcp:334 free_cma:0
        Node 0 active_anon:418792kB inactive_anon:423164kB active_file:1732kB inactive_file:1132kB unevictable:0kB isolated(anon):768kB isolated(file):88kB mapped:1896kB dirty:0kB writeback:1184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1478632 all_unreclaimable? yes
        DMA free:3304kB min:68kB low:84kB high:100kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:4088kB kernel_stack:0kB pagetables:2480kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 809 1965 1965
        Normal free:3436kB min:3604kB low:4504kB high:5404kB present:897016kB managed:858460kB mlocked:0kB slab_reclaimable:25556kB slab_unreclaimable:311712kB kernel_stack:164608kB pagetables:30844kB bounce:0kB free_pcp:620kB local_pcp:104kB free_cma:0kB
        lowmem_reserve[]: 0 0 9247 9247
        HighMem free:33808kB min:512kB low:1796kB high:3080kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:372252kB bounce:0kB free_pcp:428kB local_pcp:72kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 2*4kB (UM) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 0*1024kB 1*2048kB (U) 0*4096kB = 3192kB
        Normal: 33*4kB (MH) 79*8kB (ME) 11*16kB (M) 4*32kB (M) 2*64kB (ME) 2*128kB (EH) 7*256kB (EH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3244kB
        HighMem: 2590*4kB (UM) 1568*8kB (UM) 491*16kB (UM) 60*32kB (UM) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33064kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        25121 total pagecache pages
        24160 pages in swap cache
        Swap cache stats: add 86371, delete 62211, find 42865/60187
        Free swap  = 4015560kB
        Total swap = 4192252kB
        524186 pages RAM
        295934 pages HighMem/MovableOnly
        9658 pages reserved
        0 pages cma reserved
      
      The order-0 allocation for the normal zone failed while there was a lot
      of reclaimable memory (i.e., anonymous memory with free swap).  I wanted
      to analyze the problem but it was hard because we removed the per-zone
      lru stat, so I couldn't know how much anonymous memory there was in the
      normal/dma zones.

      When we investigate an OOM problem, the reclaimable memory count is a
      crucial stat for finding the problem.  Without it, it's hard to parse
      the OOM message, so I believe we should keep it.
      
      With per-zone lru stat,
      
        gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
        Mem-Info:
        active_anon:101103 inactive_anon:102219 isolated_anon:0
         active_file:503 inactive_file:544 isolated_file:0
         unevictable:0 dirty:0 writeback:34 unstable:0
         slab_reclaimable:6298 slab_unreclaimable:74669
         mapped:863 shmem:0 pagetables:100998 bounce:0
         free:23573 free_pcp:1861 free_cma:0
        Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
        DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 809 1965 1965
        Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
        lowmem_reserve[]: 0 0 9247 9247
        HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
        Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
        HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        54409 total pagecache pages
        53215 pages in swap cache
        Swap cache stats: add 300982, delete 247765, find 157978/226539
        Free swap  = 3803244kB
        Total swap = 4192252kB
        524186 pages RAM
        295934 pages HighMem/MovableOnly
        9642 pages reserved
        0 pages cma reserved
      
      With that, we can see the normal zone has 86M of reclaimable memory, so
      we know something goes wrong in reclaim (I will fix the problem in the
      next patch).
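
As a worked example of reading the per-zone LRU fields, the sketch below adds up the four LRU sizes printed on the "Normal" zone line of the second report. The struct and function names are hypothetical, and treating anonymous pages as reclaimable only when swap is free is a simplifying assumption of this sketch.

```c
#include <assert.h>

/* Per-zone LRU sizes in kB, as printed on a zone line of the OOM report. */
struct zone_lru_kb {
    unsigned long active_anon;
    unsigned long inactive_anon;
    unsigned long active_file;
    unsigned long inactive_file;
};

/*
 * Rough upper bound on what reclaim could target in the zone.
 * Anonymous pages are only counted when there is free swap to take them.
 */
unsigned long zone_lru_total_kb(const struct zone_lru_kb *z, int have_free_swap)
{
    unsigned long total;

    assert(z);
    total = z->active_file + z->inactive_file;
    if (have_free_swap)
        total += z->active_anon + z->inactive_anon;
    return total;
}
```

With the Normal zone values from the report (active_anon 86304kB, inactive_anon 0kB, active_file 160kB, inactive_file 376kB) and free swap available, the total is 86840kB, almost all of it active anonymous memory.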
      
      [mgorman@techsingularity.net: rename zone LRU stats in /proc/vmstat]
       Link: http://lkml.kernel.org/r/20160725072300.GK10438@techsingularity.net
      Link: http://lkml.kernel.org/r/1469110261-7365-2-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71c799f4
    • Mel Gorman's avatar
      mm, vmstat: remove zone and node double accounting by approximating retries · bca67592
      Mel Gorman authored
      The number of LRU pages, dirty pages and writeback pages must be
      accounted for on both zones and nodes because of the reclaim retry
      logic, compaction retry logic and highmem calculations all depending on
      per-zone stats.
      
      Many lowmem allocations are immune from OOM kill due to a check in
      __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
      03668b3c ("oom: avoid oom killer for lowmem allocations").  The
      exception is costly high-order allocations or allocations that cannot
      fail.  If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
      allocations then it would fall through to __alloc_pages_direct_compact.
      
      This patch will blindly retry reclaim for zone-constrained allocations
      in should_reclaim_retry up to MAX_RECLAIM_RETRIES.  This is not ideal
      but without per-zone stats there are not many alternatives.  The impact
      is that zone-constrained allocations may delay before considering the
      OOM killer.
      
      As there is no guarantee enough memory can ever be freed to satisfy
      compaction, this patch avoids retrying compaction for zone-constrained
      allocations.
      
      In combination, that means that the per-node stats can be used when
      deciding whether to continue reclaim using a rough approximation.  While
      it is possible this will make the wrong decision on occasion, it will
      not infinite loop as the number of reclaim attempts is capped by
      MAX_RECLAIM_RETRIES.
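
The retry policy described in the last few paragraphs can be summarised as a three-way decision. This is a simplified sketch, not the kernel's should_reclaim_retry(): the function name, its parameters, and the exact boundary at which the cap applies are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 16

/*
 * Decide whether reclaim should be retried.
 *   zone_constrained  - the allocation is limited to lower zones
 *   no_progress_loops - consecutive reclaim attempts that freed nothing
 *   stats_say_retry   - verdict of the rough per-node approximation
 */
bool should_retry_reclaim(bool zone_constrained, int no_progress_loops,
                          bool stats_say_retry)
{
    assert(no_progress_loops >= 0);

    /* Hard cap: guarantees the caller cannot loop forever. */
    if (no_progress_loops >= MAX_RECLAIM_RETRIES)
        return false;

    /* No per-zone stats to consult, so retry blindly up to the cap. */
    if (zone_constrained)
        return true;

    /* Otherwise trust the per-node approximation, even if occasionally wrong. */
    return stats_say_retry;
}
```

Because the cap is checked first, a wrong answer from the approximation can cost at most a bounded number of extra reclaim passes.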
      
      The final step is calculating the number of dirtyable highmem pages.  As
      those calculations only care about the global count of file pages in
      highmem, this patch uses a global counter instead of per-zone stats, as
      that is sufficient.
      
      In combination, this allows the per-zone LRU and dirty state counters to
      be removed.
      
      [mgorman@techsingularity.net: fix acct_highmem_file_pages()]
        Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.net
      Link: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bca67592
    • Mel Gorman's avatar
      mm, vmstat: print node-based stats in zoneinfo file · e2ecc8a7
      Mel Gorman authored
      There are a number of stats that were previously accessible via zoneinfo
      that are now invisible.  While it is possible to create a new file for
      the node stats, this may be missed by users.  Instead this patch prints
      the stats under the first populated zone in /proc/zoneinfo.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-34-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e2ecc8a7
    • Mel Gorman's avatar
      mm: vmstat: account per-zone stalls and pages skipped during reclaim · 7cc30fcf
      Mel Gorman authored
      The vmstat allocstall was fairly useful in the general sense but
      node-based LRUs change that.  It's important to know if a stall was for
      an address-limited allocation request as this will require skipping
      pages from other zones.  This patch adds pgstall_* counters to replace
      allocstall.  The sum of the counters will equal the old allocstall so it
      can be trivially recalculated.  A high number of address-limited
      allocation requests may result in a lot of useless LRU scanning for
      suitable pages.
      
      As address-limited allocations require pages to be skipped, it's
      important to know how much useless LRU scanning took place so this patch
      adds pgskip* counters.  This yields the following model
      
      1. The number of address-space limited stalls can be accounted for (pgstall)
      2. The amount of useless work required to reclaim the data is accounted (pgskip)
      3. The total number of scans is available from pgscan_kswapd and pgscan_direct
         so from that the ratio of useful to useless scans can be calculated.
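
The ratio mentioned in point 3 can be computed from the three kinds of counters. The sketch below assumes that skipped pages are included in the pgscan totals, which the message implies but does not state; the struct and function names are hypothetical.

```c
#include <assert.h>

/* Snapshot of the relevant /proc/vmstat counters (pgskip summed over zones). */
struct reclaim_counters {
    unsigned long long pgscan_kswapd;
    unsigned long long pgscan_direct;
    unsigned long long pgskip;
};

/*
 * Fraction of LRU scanning (0.0 to 1.0) spent on pages that were skipped
 * because they belong to a zone the allocation cannot use.
 */
double useless_scan_fraction(const struct reclaim_counters *c)
{
    unsigned long long total;
    unsigned long long skipped;

    assert(c);
    total = c->pgscan_kswapd + c->pgscan_direct;
    if (total == 0)
        return 0.0;

    /* Clamp in case the assumption about the totals does not hold. */
    skipped = c->pgskip < total ? c->pgskip : total;
    return (double)skipped / (double)total;
}
```

For example, 750 pages scanned by kswapd, 250 by direct reclaim, and 250 skipped gives a useless fraction of 0.25, so three quarters of the scanning was useful.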
      
      [mgorman@techsingularity.net: s/pgstall/allocstall/]
        Link: http://lkml.kernel.org/r/1468404004-5085-3-git-send-email-mgorman@techsingularity.net
      Link: http://lkml.kernel.org/r/1467970510-21195-33-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cc30fcf