1. 24 Sep, 2019 1 commit
  2. 12 Jul, 2019 2 commits
    • Huang Ying
      mm/swap_state.c: simplify total_swapcache_pages() with get_swap_device() · 054f1d1f
      Huang Ying authored
      total_swapcache_pages() may race with swapper_spaces[] allocation and
      freeing.  Previously, this was protected with a swapper_spaces[]-specific
      RCU mechanism.  To simplify the logic and code, that mechanism is replaced
      with get/put_swap_device().  The number of code lines is reduced too.
      Although not so important, swapoff() performance also improves, because
      one synchronize_rcu() call during swapoff() is deleted.
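
      A minimal sketch of the simplified accounting loop described above,
      assuming the existing swapper_spaces[] / nr_swapper_spaces[] arrays (the
      bad-entry check that avoids the get_swap_device() warning is omitted):

	unsigned long total_swapcache_pages(void)
	{
		struct swap_info_struct *si;
		unsigned long ret = 0;
		unsigned int i, j;

		for (i = 0; i < MAX_SWAPFILES; i++) {
			/* pin the device so swapoff cannot free swapper_spaces[i] */
			si = get_swap_device(swp_entry(i, 1));
			if (!si)
				continue;
			for (j = 0; j < nr_swapper_spaces[i]; j++)
				ret += swapper_spaces[i][j].nrpages;
			put_swap_device(si);
		}
		return ret;
	}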
      
      [ying.huang@intel.com: fix bad swap file entry warning]
        Link: http://lkml.kernel.org/r/20190531024102.21723-1-ying.huang@intel.com
      Link: http://lkml.kernel.org/r/20190527082714.12151-1-ying.huang@intel.com
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      054f1d1f
    • Huang Ying
      mm, swap: fix race between swapoff and some swap operations · eb085574
      Huang Ying authored
      When swapin is performed, after getting the swap entry information from
      the page table, the system will swap in the swap entry without any lock
      held to prevent the swap device from being swapped off.  This may cause a
      race like the one below:
      
      CPU 1				CPU 2
      -----				-----
      				do_swap_page
      				  swapin_readahead
      				    __read_swap_cache_async
      swapoff				      swapcache_prepare
        p->swap_map = NULL		        __swap_duplicate
      					  p->swap_map[?] /* !!! NULL pointer access */
      
      Because swapoff is usually done only at system shutdown, the race may not
      hit many people in practice.  But it is still a race that needs to be fixed.
      
      To fix the race, get_swap_device() is added to check whether the specified
      swap entry is valid in its swap device.  If so, it will keep the swap
      entry valid by preventing the swap device from being swapped off, until
      put_swap_device() is called.
      
      Because swapoff() is a very rare code path, to make the normal path run as
      fast as possible, rcu_read_lock/unlock() and synchronize_rcu() are used
      instead of a reference count to implement get/put_swap_device().  From
      get_swap_device() to put_swap_device(), the RCU read side is locked, so
      synchronize_rcu() in swapoff() will wait until put_swap_device() is
      called.
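
      A minimal sketch of the reader-side pattern this enables, with a
      hypothetical caller name (error handling elided):

	/* example_swapin_access() is an illustrative name, not from the patch */
	static bool example_swapin_access(swp_entry_t entry)
	{
		struct swap_info_struct *si;

		si = get_swap_device(entry);
		if (!si)
			return false;	/* entry is stale or device is being swapped off */
		/*
		 * Until put_swap_device(), this device's swap_map[],
		 * cluster_info and swap cache stay valid: swapoff()'s
		 * synchronize_rcu() waits for this read-side section.
		 */
		put_swap_device(si);
		return true;
	}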
      
      In addition to the swap_map, cluster_info, etc. data structures in struct
      swap_info_struct, the swap cache radix tree will be freed after swapoff,
      so this patch also fixes the race between swap cache lookup and swapoff.
      
      Races between some other swap cache usages and swapoff are fixed too by
      calling synchronize_rcu() between clearing PageSwapCache() and freeing the
      swap cache data structures.
      
      Another possible method to fix this is to use preempt_disable() +
      stop_machine() to prevent the swap device from being swapped off while its
      data structures are being accessed.  The hot-path overhead of both methods
      is similar.  The advantages of the RCU-based method are:
      
      1. stop_machine() may disturb the normal execution of code paths on other
         CPUs.
      
      2. The file cache uses RCU to protect its radix tree.  If a similar
         mechanism is used for the swap cache too, it is easier to share code
         between them.
      
      3. RCU is used to protect swap cache in total_swapcache_pages() and
         exit_swap_address_space() already.  The two mechanisms can be
         merged to simplify the logic.
      
      Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
      Fixes: 235b6217 ("mm/swap: add cluster lock")

      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Not-nacked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb085574
  3. 06 Jul, 2019 1 commit
    • Linus Torvalds
      Revert "mm: page cache: store only head pages in i_pages" · 69bf4b6b
      Linus Torvalds authored
      This reverts commit 5fd4ca2d.
      
      Mikhail Gavrilov reports that it causes the VM_BUG_ON_PAGE() in
      __delete_from_swap_cache() to trigger:
      
         page:ffffd6d34dff0000 refcount:1 mapcount:1 mapping:ffff97812323a689 index:0xfecec363
         anon
         flags: 0x17fffe00080034(uptodate|lru|active|swapbacked)
         raw: 0017fffe00080034 ffffd6d34c67c508 ffffd6d3504b8d48 ffff97812323a689
         raw: 00000000fecec363 0000000000000000 0000000100000000 ffff978433ace000
         page dumped because: VM_BUG_ON_PAGE(entry != page)
         page->mem_cgroup:ffff978433ace000
         ------------[ cut here ]------------
         kernel BUG at mm/swap_state.c:170!
         invalid opcode: 0000 [#1] SMP NOPTI
         CPU: 1 PID: 221 Comm: kswapd0 Not tainted 5.2.0-0.rc2.git0.1.fc31.x86_64 #1
         Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 2202 04/11/2019
         RIP: 0010:__delete_from_swap_cache+0x20d/0x240
         Code: 30 65 48 33 04 25 28 00 00 00 75 4a 48 83 c4 38 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 c7 c6 2f dc 0f 8a 48 89 c7 e8 93 1b fd ff <0f> 0b 48 c7 c6 a8 74 0f 8a e8 85 1b fd ff 0f 0b 48 c7 c6 a8 7d 0f
         RSP: 0018:ffffa982036e7980 EFLAGS: 00010046
         RAX: 0000000000000021 RBX: 0000000000000040 RCX: 0000000000000006
         RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff97843d657900
         RBP: 0000000000000001 R08: ffffa982036e7835 R09: 0000000000000535
         R10: ffff97845e21a46c R11: ffffa982036e7835 R12: ffff978426387120
         R13: 0000000000000000 R14: ffffd6d34dff0040 R15: ffffd6d34dff0000
         FS:  0000000000000000(0000) GS:ffff97843d640000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 00002cba88ef5000 CR3: 000000078a97c000 CR4: 00000000003406e0
         Call Trace:
          delete_from_swap_cache+0x46/0xa0
          try_to_free_swap+0xbc/0x110
          swap_writepage+0x13/0x70
          pageout.isra.0+0x13c/0x350
          shrink_page_list+0xc14/0xdf0
          shrink_inactive_list+0x1e5/0x3c0
          shrink_node_memcg+0x202/0x760
          shrink_node+0xe0/0x470
          balance_pgdat+0x2d1/0x510
          kswapd+0x220/0x420
          kthread+0xfb/0x130
          ret_from_fork+0x22/0x40
      
      and it's not immediately obvious why it happens.  It's too late in the
      rc cycle to do anything but revert for now.
      
      Link: https://lore.kernel.org/lkml/CABXGCsN9mYmBD-4GaaeW_NrDu+FDXLzr_6x+XNxfmFV6QkYCDg@mail.gmail.com/
      
      Reported-and-bisected-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69bf4b6b
  4. 14 May, 2019 1 commit
  5. 06 Mar, 2019 2 commits
    • Yang Shi
      mm: swap: add comment for swap_vma_readahead · e9f59873
      Yang Shi authored
      swap_vma_readahead()'s comment is missing, just add it.
      
      Link: http://lkml.kernel.org/r/1546543673-108536-2-git-send-email-yang.shi@linux.alibaba.com
      
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9f59873
    • Yang Shi
      mm: swap: check if swap backing device is congested or not · 8fd2e0b5
      Yang Shi authored
      Swap readahead would read in a few pages regardless of whether the
      underlying device is busy or not.  It may incur a long waiting time if the
      device is congested, and it may also exacerbate the congestion.
      
      Use inode_read_congested() to check whether the underlying device is busy,
      like file page readahead does.  Get the inode from swap_info_struct.
      
      Although we could add the inode information to swap_address_space
      (address_space->host), it may lead to unexpected side effects, e.g. it
      may break mapping_cap_account_dirty().  Using the inode from
      swap_info_struct seems simple and good enough.
      
      The check is only done in vma_cluster_readahead(), since
      swap_vma_readahead() is only used for non-rotational devices, which are
      much less likely to be congested than a traditional HDD.
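
      A minimal sketch of the kind of check described, assuming the backing
      file is reachable as si->swap_file (the bail-out label is illustrative):

	struct inode *inode = si->swap_file->f_mapping->host;

	/* skip readahead when the backing device is already congested */
	if (inode_read_congested(inode))
		goto skip;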
      
      Although swap slots may be consecutive on a swap partition, they may still
      be fragmented in a swap file.  This check would help to reduce excessive
      stalls in such cases.
      
      A test with page_fault1 of will-it-scale (the traces sometimes just show
      runtest.py, which is the wrapper script of page_fault1), which basically
      launches NR_CPU threads that each generate 128MB of anonymous pages, on my
      virtual machine with a congested HDD shows that the long tail latency is
      reduced significantly.
      
      Without the patch
       page_fault1_thr-1490  [023]   129.311706: funcgraph_entry:      #57377.796 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.369103: funcgraph_entry:        5.642us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.369119: funcgraph_entry:      #1289.592 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.370411: funcgraph_entry:        4.957us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.370419: funcgraph_entry:        1.940us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.378847: funcgraph_entry:      #1411.385 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.380262: funcgraph_entry:        3.916us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.380275: funcgraph_entry:      #4287.751 us |  do_swap_page();
      
      With the patch
            runtest.py-1417  [020]   301.925911: funcgraph_entry:      #9870.146 us |  do_swap_page();
            runtest.py-1417  [020]   301.935785: funcgraph_entry:        9.802us   |  do_swap_page();
            runtest.py-1417  [020]   301.935799: funcgraph_entry:        3.551us   |  do_swap_page();
            runtest.py-1417  [020]   301.935806: funcgraph_entry:        2.142us   |  do_swap_page();
            runtest.py-1417  [020]   301.935853: funcgraph_entry:        6.938us   |  do_swap_page();
            runtest.py-1417  [020]   301.935864: funcgraph_entry:        3.765us   |  do_swap_page();
            runtest.py-1417  [020]   301.935871: funcgraph_entry:        3.600us   |  do_swap_page();
            runtest.py-1417  [020]   301.935878: funcgraph_entry:        7.202us   |  do_swap_page();
      
      [akpm@linux-foundation.org: code cleanup]
      [yang.shi@linux.alibaba.com: add comment]
        Link: http://lkml.kernel.org/r/bbc7bda7-62d0-df1a-23ef-d369e865bdca@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1546543673-108536-1-git-send-email-yang.shi@linux.alibaba.com
      
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Tim Chen <tim.c.chen@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8fd2e0b5
  6. 26 Oct, 2018 1 commit
    • Johannes Weiner
      mm: workingset: tell cache transitions from workingset thrashing · 1899ad18
      Johannes Weiner authored
      Refaults happen during transitions between workingsets as well as during
      in-place thrashing.  Knowing the difference between the two has a range of
      applications, including measuring the impact of memory shortage on system
      performance, as well as the ability to balance pressure more smartly
      between the filesystem cache and the swap-backed workingset.
      
      During workingset transitions, inactive cache refaults and pushes out
      established active cache.  When that active cache isn't stale, however,
      and also ends up refaulting, that's bonafide thrashing.
      
      Introduce a new page flag that tells on eviction whether the page has been
      active or not in its lifetime.  This bit is then stored in the shadow
      entry, to classify refaults as transitioning or thrashing.
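
      An illustrative sketch of the classification described above; the
      shadow-entry packing helpers and the counters here are simplified
      placeholders, not the patch's exact code:

	/* eviction: remember whether the page was ever activated */
	shadow = pack_shadow_bit(node, eviction, PageWorkingset(page));

	/* refault: the stored bit separates transition from thrashing */
	unpack_shadow_bit(shadow, &node, &eviction, &was_active);
	if (was_active)
		account_thrashing();	/* established workingset refaulting */
	else
		account_transition();	/* new workingset being built up */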
      
      How many page->flags does this leave us with on 32-bit?
      
      	20 bits are always page flags
      
      	21 if you have an MMU
      
      	23 with the zone bits for DMA, Normal, HighMem, Movable
      
      	29 with the sparsemem section bits
      
      	30 if PAE is enabled
      
      	31 with this patch.
      
      So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
      that's not enough, the system can switch to discontigmem and re-gain the 6
      or 7 sparsemem section bits.
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1899ad18
  7. 21 Oct, 2018 3 commits
  8. 12 Jun, 2018 1 commit
    • Kees Cook
      treewide: kvzalloc() -> kvcalloc() · 778e1cdd
      Kees Cook authored
      
      
      The kvzalloc() function has a 2-factor argument form, kvcalloc(). This
      patch replaces cases of:
      
              kvzalloc(a * b, gfp)
      
      with:
              kvcalloc(a, b, gfp)
      
      as well as handling cases of:
      
              kvzalloc(a * b * c, gfp)
      
      with:
      
              kvzalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kvcalloc(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kvzalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
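
      For instance, for a hypothetical array of n entries, the first form of
      the conversion looks like this:

	/* before: open-coded multiplication can overflow silently */
	entries = kvzalloc(n * sizeof(*entries), GFP_KERNEL);

	/* after: the same zeroed allocation, with overflow checking of n * size */
	entries = kvcalloc(n, sizeof(*entries), GFP_KERNEL);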
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kvzalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kvzalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kvzalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kvzalloc
      + kvcalloc
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kvzalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kvzalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kvzalloc(C1 * C2 * C3, ...)
      |
        kvzalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kvzalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kvzalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kvzalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kvzalloc(sizeof(THING) * C2, ...)
      |
        kvzalloc(sizeof(TYPE) * C2, ...)
      |
        kvzalloc(C1 * C2 * C3, ...)
      |
        kvzalloc(C1 * C2, ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
      778e1cdd
  9. 08 Jun, 2018 1 commit
  10. 11 Apr, 2018 1 commit
  11. 06 Apr, 2018 3 commits
  12. 16 Nov, 2017 3 commits
  13. 02 Nov, 2017 1 commit
    • Greg Kroah-Hartman
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boilerplate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard...
      b2441318
  14. 13 Oct, 2017 1 commit
  15. 04 Oct, 2017 1 commit
  16. 07 Sep, 2017 4 commits
    • Huang Ying
      mm, swap: add sysfs interface for VMA based swap readahead · d9bfcfdc
      Huang Ying authored
      The sysfs interface to control the VMA based swap readahead is added as
      follows:
      
      /sys/kernel/mm/swap/vma_ra_enabled
      
      Enable the VMA based swap readahead algorithm, or use the original
      global swap readahead algorithm.
      
      /sys/kernel/mm/swap/vma_ra_max_order
      
      Set the max order of the readahead window size for the VMA based swap
      readahead algorithm.
      
      The corresponding ABI documentation is added too.
      
      Link: http://lkml.kernel.org/r/20170807054038.1843-5-ying.huang@intel.com
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d9bfcfdc
    • Huang Ying
      mm, swap: VMA based swap readahead · ec560175
      Huang Ying authored
      Swap readahead is an important mechanism to reduce swap-in latency.
      Although a pure sequential memory access pattern isn't very common for
      anonymous memory, spatial locality is still considered valid.
      
      In the original swap readahead implementation, consecutive blocks in the
      swap device are read ahead based on a global spatial locality estimation.
      But consecutive blocks in the swap device merely reflect the order of page
      reclaim and don't necessarily reflect the access pattern in virtual
      memory.  And different tasks in the system may have different access
      patterns, which makes the global spatial locality estimation incorrect.
      
      In this patch, when a page fault occurs, the virtual pages near the fault
      address are read ahead instead of the swap slots near the faulting swap
      slot in the swap device.  This avoids reading ahead unrelated swap slots.
      At the same time, swap readahead is changed to work per-VMA instead of
      globally, so that the different access patterns of different VMAs can be
      distinguished and a different readahead policy applied accordingly.  The
      original core readahead detection and scaling algorithm is reused, because
      it is an effective algorithm to detect spatial locality.
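
      An illustrative sketch of the per-VMA idea (the window computation and the
      PTE-walk helper are simplified placeholders, not the patch's exact code):

	/* read ahead virtual pages around the faulting address, inside the VMA */
	start = max(fault_addr - (win / 2) * PAGE_SIZE, vma->vm_start);
	end = min(start + win * PAGE_SIZE, vma->vm_end);
	for (addr = start; addr < end; addr += PAGE_SIZE) {
		pte_t pte = get_pte_at(vma->vm_mm, addr);	/* placeholder helper */

		if (!is_swap_pte(pte))
			continue;
		/* bring the neighbouring swap slot into the swap cache asynchronously */
		read_swap_cache_async(pte_to_swp_entry(pte), GFP_HIGHUSER_MOVABLE,
				      vma, addr, false);
	}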
      
      The test and results are as follows.
      
      Common test condition
      =====================
      
      Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
      Swap device: NVMe disk
      
      Micro-benchmark with combined access pattern
      ============================================
      
      vm-scalability, sequential swap test case, 4 processes to eat 50G
      virtual memory space, repeat the sequential memory writing until 300
      seconds.  The first round writing will trigger swap out, the following
      rounds will trigger sequential swap in and out.
      
      At the same time, run vm-scalability random swap test case in
      background, 8 processes to eat 30G virtual memory space, repeat the
      random memory write until 300 seconds.  This will trigger random swap-in
      in the background.
      
      This is a combined workload with sequential and random memory accessing
      at the same time.  The result (for sequential workload) is as follow,
      
      			Base		Optimized
      			----		---------
      throughput		345413 KB/s	414029 KB/s (+19.9%)
      latency.average		97.14 us	61.06 us (-37.1%)
      latency.50th		2 us		1 us
      latency.60th		2 us		1 us
      latency.70th		98 us		2 us
      latency.80th		160 us		2 us
      latency.90th		260 us		217 us
      latency.95th		346 us		369 us
      latency.99th		1.34 ms		1.09 ms
      ra_hit%			52.69%		99.98%
      
      The original swap readahead algorithm is confused by the background random
      access workload, so its readahead hit rate is lower.  The VMA-based
      readahead algorithm works much better.
      
      Linpack
      =======
      
      The test memory size is bigger than RAM to trigger swapping.
      
      			Base		Optimized
      			----		---------
      elapsed_time		393.49 s	329.88 s (-16.2%)
      ra_hit%			86.21%		98.82%
      
      The scores of the base and optimized kernels show no visible changes.  But
      the elapsed time is reduced and the readahead hit rate improved, so the
      optimized kernel runs better in the startup and tear-down stages.  And the
      absolute value of the readahead hit rate is high, which shows that spatial
      locality is still valid in some practical workloads.
      
      Link: http://lkml.kernel.org/r/20170807054038.1843-4-ying.huang@intel.com
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec560175
    • Huang Ying
      mm, swap: fix swap readahead marking · c4fa6309
      Huang Ying authored
      In the original implementation, it is possible that pages already in the
      swap cache (not newly read ahead) could be marked as readahead pages.
      This causes the swap readahead statistics to be wrong and influences the
      swap readahead algorithm too.
      
      This is fixed by marking a page as a readahead page only if it is newly
      allocated and read from the disk.
      
      When testing with Linpack, after the fix the swap readahead hit rate
      increased from ~66% to ~86%.
      
      Link: http://lkml.kernel.org/r/20170807054038.1843-3-ying.huang@intel.com
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4fa6309
    • Huang Ying
      mm, swap: add swap readahead hit statistics · cbc65df2
      Huang Ying authored
      Patch series "mm, swap: VMA based swap readahead", v4.
      
      Swap readahead is an important mechanism to reduce swap-in latency.
      Although a pure sequential memory access pattern isn't very common for
      anonymous memory, spatial locality is still considered valid.
      
      In the original swap readahead implementation, consecutive blocks in the
      swap device are read ahead based on a global spatial locality estimation.
      But consecutive blocks in the swap device merely reflect the order of page
      reclaim and don't necessarily reflect the access pattern in virtual memory
      space.  And different tasks in the system may have different access
      patterns, which makes the global spatial locality estimation incorrect.
      
      In this patchset, when a page fault occurs, the virtual pages near the
      fault address are read ahead instead of the swap slots near the faulting
      swap slot in the swap device.  This avoids reading ahead unrelated swap
      slots.  At the same time, swap readahead is changed to work per-VMA
      instead of globally, so that the different access patterns of different
      VMAs can be distinguished and a different readahead policy applied
      accordingly.  The original core readahead detection and scaling algorithm
      is reused, because it is an effective algorithm to detect spatial
      locality.
      
      In addition to the swap readahead changes, some new sysfs interfaces are
      added to show the efficiency of the readahead algorithm and some other
      swap statistics.
      
      This new implementation will incur more small random reads.  On SSD, the
      improved correctness of the estimation and readahead target should
      outweigh the potentially increased overhead; this is also illustrated in
      the test results below.  But on HDD, the overhead may outweigh the
      benefit, so the original implementation will be used by default.

      The test and results are as follows.
      
      Common test condition
      =====================
      
      Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
      Swap device: NVMe disk
      
      Micro-benchmark with combined access pattern
      ============================================
      
      vm-scalability, sequential swap test case, 4 processes to eat 50G
      virtual memory space, repeat the sequential memory writing until 300
      seconds.  The first round writing will trigger swap out, the following
      rounds will trigger sequential swap in and out.
      
      At the same time, run vm-scalability random swap test case in
      background, 8 processes to eat 30G virtual memory space, repeat the
      random memory write until 300 seconds.  This will trigger random swap-in
      in the background.
      
      This is a combined workload with sequential and random memory accessing
      at the same time.  The result (for sequential workload) is as follow,
      
      			Base		Optimized
      			----		---------
      throughput		345413 KB/s	414029 KB/s (+19.9%)
      latency.average		97.14 us	61.06 us (-37.1%)
      latency.50th		2 us		1 us
      latency.60th		2 us		1 us
      latency.70th		98 us		2 us
      latency.80th		160 us		2 us
      latency.90th		260 us		217 us
      latency.95th		346 us		369 us
      latency.99th		1.34 ms		1.09 ms
      ra_hit%			52.69%		99.98%
      
      The original swap readahead algorithm is confused by the background random
      access workload, so its readahead hit rate is lower.  The VMA-based
      readahead algorithm works much better.
      
      Linpack
      =======
      
      The test memory size is bigger than RAM to trigger swapping.
      
      			Base		Optimized
      			----		---------
      elapsed_time		393.49 s	329.88 s (-16.2%)
      ra_hit%			86.21%		98.82%
      
      The scores of the base and optimized kernels show no visible changes.  But
      the elapsed time is reduced and the readahead hit rate improved, so the
      optimized kernel runs better in the startup and tear-down stages.  And the
      absolute value of the readahead hit rate is high, which shows that spatial
      locality is still valid in some practical workloads.
      
      This patch (of 5):
      
      The statistics for total readahead pages and total readahead hits are
      recorded and exported via the following sysfs interface.
      
      /sys/kernel/mm/swap/ra_hits
      /sys/kernel/mm/swap/ra_total
      
      With them, the efficiency of the swap readahead could be measured, so
      that the swap readahead algorithm and parameters could be tuned
      accordingly.
      
      [akpm@linux-foundation.org: don't display swap stats if CONFIG_SWAP=n]
      Link: http://lkml.kernel.org/r/20170807054038.1843-2-ying.huang@intel.com
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cbc65df2
  17. 10 Jul, 2017 1 commit
    • Shaohua Li
      swap: add block io poll in swapin path · 23955622
      Shaohua Li authored
      For fast flash disks, async IO could introduce overhead because of context
      switches.  block-mq now supports IO polling, which improves performance
      and latency a lot.  swapin is a good place to use this technique, because
      the task is waiting for the swapped-in page to continue execution.
      
      In my virtual machine, directly reading 4k data from an NVMe device with
      iopoll is about 60% better than without polling.  With iopoll support in
      the swapin path, my microbenchmark (a task doing random memory writes) is
      about 10%~25% faster.  CPU utilization increases a lot though, to 2x and
      even 3x.  This will depend on disk speed.
      
      While iopoll in swapin isn't intended for all use cases, it's a win for
      latency-sensitive workloads with a high speed swap disk.  The block layer
      has a knob to control polling at runtime.  If polling isn't enabled in the
      block layer, there should be no noticeable change in swapin.
      
      I got a chance to run the same test on an NVMe device with DRAM as the
      media.  In a simple fio IO test, blkpoll boosts performance by 50% in the
      single thread test and ~20% in the 8 threads test.  So this is the
      baseline.  In the above swap test, blkpoll boosts performance by ~27% in
      the single thread test.  blkpoll uses 2x CPU time though.

      If we enable hybrid polling, the performance gain drops very slightly but
      the CPU time is only 50% worse than that without blkpoll.  Also, we can
      adjust the hybrid poll parameters; with that, the CPU time penalty is
      reduced further.  In the 8 threads test, blkpoll doesn't help though.  The
      performance is similar to that without blkpoll, and the CPU utilization is
      similar too.  There is lock contention in the swap path, and the CPU time
      spent on blkpoll isn't high.  So overall, blkpoll swapin isn't worse than
      without it.
      
      Swapin readahead might read in several pages at the same time and form a
      big IO request.  Since that IO will take a longer time, it doesn't make
      sense to poll, so the patch only does iopoll for single-page swapin.
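
      An illustrative sketch of that policy (the parameter threading is
      simplified and the names are not necessarily the patch's exact ones):

	/* poll only when a single page is read synchronously for this fault */
	bool do_poll = (nr_readahead_pages == 1);

	swap_readpage(page, do_poll);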
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/070c3c3e40b711e7b1390002c991e86a-b5408f0@7511894063d3764ff01ea8111f5a004d7dd700ed078797c204a24e620ddb965c
      
      Signed-off-by: Shaohua Li <shli@fb.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23955622
  18. 06 Jul, 2017 3 commits
    • Minchan Kim
      mm, THP, swap: move anonymous THP split logic to vmscan · 0f074658
      Minchan Kim authored
      add_to_swap() aims to allocate swap space (i.e., a swap slot and swap
      cache entry), so if it fails due to lack of space, e.g. for a THP (or HDD
      swap that tries THP swapout), the *caller* rather than add_to_swap()
      itself should split the THP page and retry with base pages, which is more
      natural.
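
      An illustrative sketch of the caller-side retry in shrink_page_list()
      that this describes (labels and error paths simplified):

	if (!add_to_swap(page)) {
		if (!PageTransHuge(page))
			goto activate_locked;
		/* split the THP and retry the swap allocation with base pages */
		if (split_huge_page_to_list(page, page_list))
			goto activate_locked;
		if (!add_to_swap(page))
			goto activate_locked;
	}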
      
      Link: http://lkml.kernel.org/r/20170515112522.32457-4-ying.huang@intel.com
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f074658
    • Minchan Kim
      mm, THP, swap: unify swap slot free functions to put_swap_page · 75f6d6d2
      Minchan Kim authored
      Now that get_swap_page() takes a struct page and allocates swap space
      according to the page size (i.e., normal or THP), it is cleaner to
      introduce put_swap_page() as the counterpart of get_swap_page().  It calls
      the right swap slot free function depending on the page's size.
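
      A minimal sketch of the counterpart helper as described, assuming the
      existing per-size free functions:

	void put_swap_page(struct page *page, swp_entry_t entry)
	{
		if (!PageTransHuge(page))
			swapcache_free(entry);
		else
			swapcache_free_cluster(entry);
	}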
      
      [ying.huang@intel.com: minor cleanup and fix]
      Link: http://lkml.kernel.org/r/20170515112522.32457-3-ying.huang@intel.com
      
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      75f6d6d2
    • Huang Ying
      mm, THP, swap: delay splitting THP during swap out · 38d8b4e6
      Huang Ying authored
      Patch series "THP swap: Delay splitting THP during swapping out", v11.
      
      This patchset is to optimize the performance of Transparent Huge Page
      (THP) swap.
      
      Recently, the performance of storage devices has improved so fast that we
      cannot saturate the disk bandwidth with a single logical CPU when doing
      page swap-out, even on a high-end server machine, because the performance
      of the storage device has improved faster than that of a single logical
      CPU.  It seems that the trend will not change in the near future.  On the
      other hand, THP becomes more and more popular because of increased memory
      sizes.  So it becomes necessary to optimize THP swap performance.
      
      The advantages of the THP swap support include:
      
       - Batch the swap operations for the THP to reduce lock
         acquiring/releasing, including allocating/freeing the swap space,
         adding/deleting to/from the swap cache, and writing/reading the swap
         space, etc. This will help improve the performance of the THP swap.
      
       - The THP swap space read/write will be 2M sequential IO. It is
         particularly helpful for the swap read, which are usually 4k random
         IO. This will improve the performance of the THP swap too.
      
       - It will help with memory fragmentation, especially when THP is
         heavily used by applications. The 2M contiguous pages will be
         freed up after the THP is swapped out.
      
       - It will improve THP utilization on systems with swap turned on,
         because the speed at which khugepaged collapses normal pages into
         a THP is quite slow. After a THP is split during swap-out, it will
         take quite a long time for the normal pages to collapse back into
         a THP after being swapped in. The high THP utilization helps the
         efficiency of page-based memory management too.
      
      There are some concerns regarding THP swap in, mainly because possible
      enlarged read/write IO size (for swap in/out) may put more overhead on
      the storage device.  To deal with that, the THP swap in should be turned
      on only when necessary.  For example, it can be selected via
      "always/never/madvise" logic, to be turned on globally, turned off
      globally, or turned on only for VMA with MADV_HUGEPAGE, etc.
      
      This patchset is the first step for the THP swap support.  The plan is
      to delay splitting THP step by step, finally avoid splitting THP during
      the THP swapping out and swap out/in the THP as a whole.
      
      As the first step, in this patchset, the splitting huge page is delayed
      from almost the first step of swapping out to after allocating the swap
      space for the THP and adding the THP into the swap cache.  This will
      reduce lock acquiring/releasing for the locks used for the swap cache
      management.
      
      With the patchset, the swap out throughput improves 15.5% (from about
      3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
      with 8 processes.  The test is done on a Xeon E5 v3 system.  The swap
      device used is a RAM simulated PMEM (persistent memory) device.  To test
      the sequential swapping out, the test case creates 8 processes, which
      sequentially allocate and write to the anonymous pages until the RAM and
      part of the swap device is used up.
      
      This patch (of 5):
      
      In this patch, splitting huge page is delayed from almost the first step
      of swapping out to after allocating the swap space for the THP
      (Transparent Huge Page) and adding the THP into the swap cache.  This
      will batch the corresponding operation, thus improve THP swap out
      throughput.
      
      This is the first step for the THP swap optimization.  The plan is to
      delay splitting the THP step by step and avoid splitting the THP
      finally.
      
      In this patch, one swap cluster is used to hold the contents of each THP
      swapped out.  So, the size of the swap cluster is changed to that of the
      THP (Transparent Huge Page) on the x86_64 architecture (512).  For other
      architectures which want such THP swap optimization,
      ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
      the architecture.  In effect, this will enlarge the swap cluster size by 2
      times on x86_64, which may make it harder to find a free cluster when the
      swap space becomes fragmented.  So, in theory, this may reduce continuous
      swap space allocation and sequential writes.  The performance tests in
      0day show no regressions caused by this.
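
      Concretely, this amounts to sizing a swap cluster to one PMD-sized THP
      when the architecture opts in; a sketch of the idea (the config symbol
      below is a placeholder for the Kconfig selection described above):

	#ifdef CONFIG_THP_SWAP_CLUSTER		/* placeholder for the arch opt-in */
	#define SWAPFILE_CLUSTER	HPAGE_PMD_NR	/* 512 pages == 2M on x86_64 */
	#else
	#define SWAPFILE_CLUSTER	256
	#endif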
      
      In the future of THP swap optimization, some information of the swapped
      out THP (such as compound map count) will be recorded in the
      swap_cluster_info data structure.
      
      The mem cgroup swap accounting functions are enhanced to support charge
      or uncharge a swap cluster backing a THP as a whole.
      
      The swap cluster allocate/free functions are added to allocate/free a
      swap cluster for a THP.  A fairly simple algorithm is used for swap
      cluster allocation: only the first swap device in the priority list is
      tried for allocating the swap cluster.  The function fails if the attempt
      is not successful, and the caller falls back to allocating a single swap
      slot instead.  This works well enough for normal cases.  If the difference
      in the number of free swap clusters among multiple swap devices is
      significant, it is possible that some THPs are split earlier than
      necessary.  For example, this could be caused by a big size difference
      among multiple swap devices.
      
      The swap cache functions are enhanced to support adding/deleting a THP
      to/from the swap cache as a set of (HPAGE_PMD_NR) sub-pages.  This may be
      enhanced in the future with a multi-order radix tree.  But because we will
      split the THP soon during swap-out, that optimization doesn't make much
      sense for this first step.
      
      The THP splitting functions are enhanced to support splitting a THP in the
      swap cache during swap-out.  The page lock will be held while allocating
      the swap cluster, adding the THP into the swap cache and splitting the
      THP.  So in code paths other than swap-out, if the THP needs to be split,
      PageSwapCache(THP) will always be false.
      
      The swap cluster is only available for SSD, so the THP swap optimization
      in this patchset has no effect for HDD.
      
      [ying.huang@intel.com: fix two issues in THP optimize patch]
        Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
      [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
      Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
      
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org> [for config option]
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> [for changes in huge_memory.c and huge_mm.h]
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      38d8b4e6
  19. 09 May, 2017 1 commit
    • Huang Ying
      mm, swap: use kvzalloc to allocate some swap data structures · 54f180d3
      Huang Ying authored
      Now vzalloc() is used in the swap code to allocate various data
      structures, such as the swap cache, swap slots cache, cluster info, etc.,
      because the size may be too large on some systems, so that normal
      kzalloc() may fail.  But using kzalloc() has some advantages, for example
      less memory fragmentation, less TLB pressure, etc.  So change the data
      structure allocation in the swap code to use kvzalloc(), which will try
      kzalloc() first and fall back to vzalloc() if kzalloc() fails.
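
      A minimal sketch of the allocation pattern described, using the cluster
      info array as an example:

	struct swap_cluster_info *cluster_info;

	/* try a physically contiguous kzalloc() first, fall back to vzalloc() */
	cluster_info = kvzalloc(nr_clusters * sizeof(*cluster_info), GFP_KERNEL);
	if (!cluster_info)
		return -ENOMEM;

	/* ... use the array ... */

	kvfree(cluster_info);	/* handles both kmalloc'ed and vmalloc'ed memory */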
      
      In general, although kmalloc() will reduce the number of high-order
      pages in short term, vmalloc() will cause more pain for memory
      fragmentation in the long term.  And the swap data structure allocation
      that is changed in this patch is expected to be long term allocation.
      
      From Dave Hansen:
       "for example, we have a two-page data structure. vmalloc() takes two
        effectively random order-0 pages, probably from two different 2M pages
        and pins them. That "kills" two 2M pages. kmalloc(), allocating two
        *contiguous* pages, will not cross a 2M boundary. That means it will
        only "kill" the possibility of a single 2M page. More 2M pages == less
        fragmentation."
      
      The allocations in this patch occur at swapon time, which is usually done
      during system boot, so we usually have a good chance of allocating the
      contiguous pages successfully.
      
      The allocation for swap_map[] in struct swap_info_struct is not changed,
      because that is usually quite large and vmalloc_to_page() is used for
      it.  That makes it a little harder to change.
      
      Link: http://lkml.kernel.org/r/20170407064911.25447-1-ying.huang@intel.com
      
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      Acked-by: Tim Chen <tim.c.chen@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      54f180d3
  20. 03 May, 2017 1 commit
  21. 23 Feb, 2017 4 commits
    • Huang Ying
      mm/swap: skip readahead only when swap slot cache is enabled · ba81f838
      Huang Ying authored
      During swapoff, a swap entry may have swap_map[] == SWAP_HAS_CACHE (for
      example, if it was just allocated).  If we return NULL in
      __read_swap_cache_async(), swapoff will abort.  So when the swap slot
      cache is disabled (for swapoff), we will wait for the page to be put into
      the swap cache in such a race condition.  This should not be a problem for
      the swap slot cache, because the swap slot cache should be drained after
      clearing swap_slot_cache_enabled.
      
      [ying.huang@intel.com: fix memory leak in __read_swap_cache_async()]
        Link: http://lkml.kernel.org/r/874lzt6znd.fsf@yhuang-dev.intel.com
      Link: http://lkml.kernel.org/r/5e2c5f6abe8e6eb0797408897b1bba80938e9b9d.1484082593.git.tim.c.chen@linux.intel.com
      
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ba81f838
    • Tim Chen's avatar
      mm/swap: add cache for swap slots allocation · 67afa38e
      Tim Chen authored
      We add per cpu caches for swap slots that can be allocated and freed
      quickly without the need to touch the swap info lock.
      
      Two separate caches are maintained for swap slots allocated and swap
      slots returned.  This is to allow the swap slots to be returned to the
      global pool in a batch so they will have a chance to be coalesced with
      other slots in a cluster.  We do not reuse the slots that are returned
      right away, as it may increase fragmentation of the slots.
      
      The swap allocation cache is protected by a mutex, as we may sleep when
      searching for empty slots in the cache.  The swap free cache is protected
      by a spinlock, as we cannot sleep in the free path.
      
      We refill the swap slots cache when we run out of slots, and we disable
      the swap slots cache and drain the slots if the global number of slots
      falls below a low watermark threshold.  We re-enable the cache again when
      the number of available slots rises above a high watermark.
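
      A hedged sketch of the per-CPU cache described above (the field names
      follow the posted series and may differ in detail): the allocation side
      is guarded by a mutex because refilling may sleep, while the return side
      uses a spinlock because slots can be freed from contexts that must not
      sleep.

        #include <linux/mutex.h>
        #include <linux/spinlock.h>
        #include <linux/mm_types.h>     /* swp_entry_t */

        struct swap_slots_cache {
                bool            lock_initialized;
                struct mutex    alloc_lock;     /* protects slots, nr, cur */
                swp_entry_t     *slots;         /* batch of pre-allocated slots */
                int             nr;             /* slots remaining in the batch */
                int             cur;            /* next slot to hand out */
                spinlock_t      free_lock;      /* protects slots_ret, n_ret */
                swp_entry_t     *slots_ret;     /* freed slots, returned in a batch */
                int             n_ret;          /* slots waiting to be returned */
        };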
      
      [ying.huang@intel.com: use raw_cpu_ptr over this_cpu_ptr for swap slots access]
      [tim.c.chen@linux.intel.com: add comments on locks in swap_slots.h]
        Link: http://lkml.kernel.org/r/20170118180327.GA24225@linux.intel.com
      Link: http://lkml.kernel.org/r/35de301a4eaa8daa2977de6e987f2c154385eb66.1484082593.git.tim.c.chen@linux.intel.com
      
      Signed-off-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      67afa38e
    • Tim Chen's avatar
      mm/swap: skip readahead for unreferenced swap slots · e8c26ab6
      Tim Chen authored
      We can avoid needlessly allocating pages for swap slots that are not used
      by anyone.  No pages have to be read in for these slots.
      
      Link: http://lkml.kernel.org/r/0784b3f20b9bd3aa5552219624cb78dc4ae710c9.1484082593.git.tim.c.chen@linux.intel.com
      
      Signed-off-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e8c26ab6
    • Huang, Ying's avatar
      mm/swap: split swap cache into 64MB trunks · 4b3ef9da
      Huang, Ying authored
      This patch improves the scalability of swap out/in by using fine-grained
      locks for the swap cache.  In the current kernel, one address space is
      used for each swap device, and in the common configuration the number of
      swap devices is very small (one is typical).  This causes heavy lock
      contention on the radix tree of the address space if multiple tasks swap
      out/in concurrently.
      
      But in fact, there is no dependency between pages in the swap cache, so
      we can split the one shared address space per swap device into several
      address spaces to reduce the lock contention.  In this patch, the shared
      address space is split into 64MB trunks; 64MB was chosen to balance
      memory space usage against the lock contention reduction.
      
      The size of struct address_space on the x86_64 architecture is 408B, so
      with the patch 6528B more memory will be used for every 1GB of swap space
      on x86_64 (1GB / 64MB = 16 address spaces * 408B = 6528B).
      
      One address space is still shared by the swap entries in the same 64MB
      trunk.  To avoid lock contention during the first round of swap space
      allocation, the order of the swap clusters in the initial free cluster
      list is changed: the swap space distance between consecutive swap
      clusters in the free cluster list is at least 64MB.  After the first
      round of allocation, the swap clusters are expected to be freed randomly,
      so the lock contention should be reduced effectively.
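
      A sketch consistent with the split described above (definitions mirror
      the proposed layout; treat the details as illustrative): each swap device
      gets an array of address_spaces, one per 64MB of swap space, and the swap
      offset of an entry selects the address space that covers it.

        /* With 4KB pages, 64MB of swap == 2^14 pages share one address_space. */
        #define SWAP_ADDRESS_SPACE_SHIFT        14
        #define SWAP_ADDRESS_SPACE_PAGES        (1 << SWAP_ADDRESS_SPACE_SHIFT)

        extern struct address_space *swapper_spaces[];

        #define swap_address_space(entry)                           \
                (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
                        >> SWAP_ADDRESS_SPACE_SHIFT])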
      
      Link: http://lkml.kernel.org/r/735bab895e64c930581ffb0a05b661e01da82bc5.1484082593.git.tim.c.chen@linux.intel.com
      
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4b3ef9da
  22. 08 Oct, 2016 3 commits
    • Huang Ying's avatar
      mm, swap: use offset of swap entry as key of swap cache · f6ab1f7f
      Huang Ying authored
      This patch improves the performance of swap cache operations when the
      type of the swap device is not 0.  Originally, the whole swap entry value
      is used as the key of the swap cache, even though there is one radix tree
      for each swap device.  If the type of the swap device is not 0, the
      height of the radix tree of the swap cache is increased unnecessarily,
      especially on 64-bit architectures.  For example, for a 1GB swap device
      on the x86_64 architecture, the height of the radix tree of the swap
      cache is 11.  But if the offset of the swap entry is used as the key of
      the swap cache, the height of the radix tree of the swap cache is 4.  The
      increased height causes unnecessary radix tree descending and increased
      cache footprint.
      
      This patch reduces the height of the radix tree of the swap cache by
      using the offset of the swap entry instead of the whole swap entry value
      as the key of the swap cache.  In a 32-process sequential swap-out test
      case on a Xeon E5 v3 system with a RAM disk as swap, the lock contention
      on the swap cache spinlock is reduced from 20.15% to 12.19% when the type
      of the swap device is 1.
      
      Use the whole swap entry as key,
      
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 10.37,
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 9.78,
      
      Use the swap offset as key,
      
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 6.25,
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 5.94,
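
      A hedged sketch of the keying change (simplified; the helper is
      illustrative and locking/preload are elided): the swap cache radix tree
      is indexed by the swap offset instead of the whole swap entry value, so a
      non-zero swap type no longer inflates the index and the tree height.

        #include <linux/fs.h>           /* struct address_space */
        #include <linux/radix-tree.h>
        #include <linux/swapops.h>      /* swp_offset() */

        static int swap_cache_insert(struct address_space *address_space,
                                     swp_entry_t entry, struct page *page)
        {
                /* Old code used entry.val as the index; the offset keeps the tree flat. */
                pgoff_t idx = swp_offset(entry);

                return radix_tree_insert(&address_space->page_tree, idx, page);
        }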
      
      Link: http://lkml.kernel.org/r/1473270649-27229-1-git-send-email-ying.huang@intel.com
      
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f6ab1f7f
    • Aaron Lu's avatar
      thp: reduce usage of huge zero page's atomic counter · 6fcb52a5
      Aaron Lu authored
      The global zero page is used to satisfy an anonymous read fault.  If
      THP (Transparent HugePage) is enabled, then the global huge zero page is
      used.  The global huge zero page uses an atomic counter for reference
      counting and is allocated/freed dynamically according to its counter
      value.
      
      CPU time spent on that counter will greatly increase if there are a lot
      of processes doing anonymous read faults.  This patch proposes a way to
      reduce the access to the global counter so that the CPU load can be
      reduced accordingly.
      
      To do this, a new flag of the mm_struct is introduced:
      MMF_USED_HUGE_ZERO_PAGE.  With this flag, the process only needs to touch
      the global counter in two cases:
      
       1 The first time it uses the global huge zero page;
       2 When the mm_users count of its mm_struct reaches zero.
      
      Note that right now, the huge zero page is eligible to be freed as soon
      as its last use goes away.  With this patch, the page will not be
      eligible to be freed until the last process that ever used it exits.
      
      And with the use of mm_users, a kthread is not eligible to use the huge
      zero page either.  Since no kthread is using the huge zero page today,
      there is no difference after applying this patch.  But if that is not
      desired, it can be changed to trigger when mm_count reaches zero instead.
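
      A hedged sketch of the fast path described above, using the flag name
      from this changelog (the identifier and helpers in the merged code may
      differ); huge_zero_page and get/put_huge_zero_page() stand for the
      pre-existing global page and its refcounting helpers in mm/huge_memory.c.

        struct page *mm_get_huge_zero_page(struct mm_struct *mm)
        {
                /* Fast path: this mm already holds a reference; no atomic counter. */
                if (test_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                        return READ_ONCE(huge_zero_page);

                /* First use by this mm: take a reference on the global page. */
                if (!get_huge_zero_page())
                        return NULL;

                /* Another thread of this mm may have won the race; drop the extra ref. */
                if (test_and_set_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                        put_huge_zero_page();

                return READ_ONCE(huge_zero_page);
        }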
      
      Case used for test on Haswell EP:
      
        usemem -n 72 --readonly -j 0x200000 100G
      
      This spawns 72 processes, each of which mmaps 100GB of anonymous space
      and then reads that space sequentially, read-only, with a step of 2MB.
      
        CPU cycles from perf report for base commit:
            54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
        CPU cycles from perf report for this commit:
             0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page
      
      Performance (throughput) of the workload for the base commit: 1784430792
      Performance (throughput) of the workload for this commit: 4726928591
      That is a 164% increase.
      
      Runtime of the workload for the base commit: 707592 us
      Runtime of the workload for this commit: 303970 us
      That is a 57% drop.
      
      Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
      
      Signed-off-by: default avatarAaron Lu <aaron.lu@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6fcb52a5
    • Huang Ying's avatar
      mm: don't use radix tree writeback tags for pages in swap cache · 371a096e
      Huang Ying authored
      File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK, etc.)
      to accelerate finding the pages with a specific tag in the radix tree
      during inode writeback.  But anonymous pages in the swap cache have no
      inode writeback, so there is no need to find pages with writeback tags in
      their radix tree, and it is not necessary to touch the radix tree
      writeback tags for pages in the swap cache.
      
      Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is
      introduced for address spaces which don't need to update the writeback
      tags.  The flag is set for swap caches.  It may be used for DAX file
      systems, etc.
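
      A hedged sketch of the mechanism (the flag and helper names follow the
      posted patch and should be treated as assumptions): swap cache address
      spaces set a flag saying they never need writeback tags, and the
      writeback accounting paths test it before touching the radix tree tags.

        #include <linux/pagemap.h>      /* enum mapping_flags gains AS_NO_WRITEBACK_TAGS */

        static inline void mapping_set_no_writeback_tags(struct address_space *mapping)
        {
                set_bit(AS_NO_WRITEBACK_TAGS, &mapping->flags);
        }

        static inline bool mapping_use_writeback_tags(struct address_space *mapping)
        {
                return !test_bit(AS_NO_WRITEBACK_TAGS, &mapping->flags);
        }

        /*
         * In __test_set_page_writeback() and friends (simplified):
         *
         *      if (mapping && mapping_use_writeback_tags(mapping))
         *              radix_tree_tag_set(&mapping->page_tree, page_index(page),
         *                                 PAGECACHE_TAG_WRITEBACK);
         */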
      
      With this patch, the swap-out bandwidth improved 22.3% (from ~1.2GB/s to
      ~1.48GB/s) in the vm-scalability swap-w-seq test case with 8 processes.
      The test is done on a Xeon E5 v3 system.  The swap device used is a RAM
      simulated PMEM (persistent memory) device.  The improvement comes from
      the reduced contention on the swap cache radix tree lock.  To test
      sequential swapping out, the test case uses 8 processes, which
      sequentially allocate and write to the anonymous pages until RAM and
      part of the swap device is used up.
      
      Details of the comparison are as follows:
      
      base             base+patch
      ---------------- --------------------------
               %stddev     %change         %stddev
                   \          |                \
         2506952 ±  2%     +28.1%    3212076 ±  7%  vm-scalability.throughput
         1207402 ±  7%     +22.3%    1476578 ±  6%  vmstat.swap.so
           10.86 ± 12%     -23.4%       8.31 ± 16%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
           10.82 ± 13%     -33.1%       7.24 ± 14%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_zone_memcg
           10.36 ± 11%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__test_set_page_writeback.bdev_write_page.__swap_writepage.swap_writepage
           10.52 ± 12%    -100.0%       0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.test_clear_page_writeback.end_page_writeback.page_endio.pmem_rw_page
      
      Link: http://lkml.kernel.org/r/1472578089-5560-1-git-send-email-ying.huang@intel.com
      
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      371a096e