1. 23 Feb, 2017 3 commits
    • Huang, Ying's avatar
      mm/swap: split swap cache into 64MB trunks · 4b3ef9da
      Huang, Ying authored
      The patch is to improve the scalability of the swap out/in via using
      fine grained locks for the swap cache.  In current kernel, one address
      space will be used for each swap device.  And in the common
      configuration, the number of the swap device is very small (one is
      typical).  This causes the heavy lock contention on the radix tree of
      the address space if multiple tasks swap out/in concurrently.
      
      But in fact, there is no dependency between pages in the swap cache.  So
      that, we can split the one shared address space for each swap device
      into several address spaces to reduce the lock contention.  In the
      patch, the shared address space is split into 64MB trunks.  64MB is
      chosen to balance the memory space usage and effect of lock contention
      reduction.
      
      The size of struct address_space on x86_64 architecture is 408B, so with
      the patch, 6528B more memory will be used for every 1GB swap space on
      x86_64 architecture.
      
      One address space is still shared for the swap entries in the same 64M
      trunks.  To avoid lock contention for the first round of swap space
      allocation, the order of the swap clusters in the initial free clusters
      list is changed.  The swap space distance between the consecutive swap
      clusters in the free cluster list is at least 64M.  After the first
      round of allocation, the swap clusters are expected to be freed
      randomly, so the lock contention should be reduced effectively.
      
      Link: http://lkml.kernel.org/r/735bab895e64c930581ffb0a05b661e01da82bc5.1484082593.git.tim.c.chen@linux.intel.com
      
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net> escreveu:
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4b3ef9da
    • Huang, Ying's avatar
      mm/swap: add cluster lock · 235b6217
      Huang, Ying authored
      This patch is to reduce the lock contention of swap_info_struct->lock
      via using a more fine grained lock in swap_cluster_info for some swap
      operations.  swap_info_struct->lock is heavily contended if multiple
      processes reclaim pages simultaneously.  Because there is only one lock
      for each swap device.  While in common configuration, there is only one
      or several swap devices in the system.  The lock protects almost all
      swap related operations.
      
      In fact, many swap operations only access one element of
      swap_info_struct->swap_map array.  And there is no dependency between
      different elements of swap_info_struct->swap_map.  So a fine grained
      lock can be used to allow parallel access to the different elements of
      swap_info_struct->swap_map.
      
      In this patch, a spinlock is added to swap_cluster_info to protect the
      elements of swap_info_struct->swap_map in the swap cluster and the
      fields of swap_cluster_info.  This reduced locking contention for
      swap_info_struct->swap_map access greatly.
      
      Because of the added spinlock, the size of swap_cluster_info increases
      from 4 bytes to 8 bytes on the 64 bit and 32 bit system.  This will use
      additional 4k RAM for every 1G swap space.
      
      Because the size of swap_cluster_info is much smaller than the size of
      the cache line (8 vs 64 on x86_64 architecture), there may be false
      cache line sharing between spinlocks in swap_cluster_info.  To avoid the
      false sharing in the first round of the swap cluster allocation, the
      order of the swap clusters in the free clusters list is changed.  So
      that, the swap_cluster_info sharing the same cache line will be placed
      as far as possible.  After the first round of allocation, the order of
      the clusters in free clusters list is expected to be random.  So the
      false sharing should be not serious.
      
      Compared with a previous implementation using bit_spin_lock, the
      sequential swap out throughput improved about 3.2%.  Test was done on a
      Xeon E5 v3 system.  The swap device used is a RAM simulated PMEM
      (persistent memory) device.  To test the sequential swapping out, the
      test case created 32 processes, which sequentially allocate and write to
      the anonymous pages until the RAM and part of the swap device is used.
      
      [ying.huang@intel.com: v5]
        Link: http://lkml.kernel.org/r/878tqeuuic.fsf_-_@yhuang-dev.intel.com
      [minchan@kernel.org: initialize spinlock for swap_cluster_info]
        Link: http://lkml.kernel.org/r/1486434945-29753-1-git-send-email-minchan@kernel.org
      [hughd@google.com: annotate nested locking for cluster lock]
        Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1702161050540.21773@eggly.anvils
      Link: http://lkml.kernel.org/r/dbb860bbd825b1aaba18988015e8963f263c3f0d.1484082593.git.tim.c.chen@linux.intel.com
      
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net> escreveu:
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      235b6217
    • Huang, Ying's avatar
      mm/swap: fix kernel message in swap_info_get() · 6a991fc7
      Huang, Ying authored
      Patch series "mm/swap: Regular page swap optimizations", v5.
      
      Times have changed.  Coming generation of Solid state Block device
      latencies are getting down to sub 100 usec, which is within an order of
      magnitude of DRAM, and their performance is orders of magnitude higher
      than the single- spindle rotational media we've swapped to historically.
      
      This could benefit many usage scenearios.  For example cloud providers
      who overcommit their memory (as VM don't use all the memory
      provisioned).  Having a fast swap will allow them to be more aggressive
      in memory overcommit and fit more VMs to a platform.
      
      In our testing [see footnote], the median latency that the kernel adds
      to a page fault is 15 usec, which comes quite close to the amount that
      will be contributed by the underlying I/O devices.
      
      The software latency comes mostly from contentions on the locks
      protecting the radix tree of the swap cache and also the locks
      protecting the individual swap devices.  The lock contentions already
      consumed 35% of cpu cycles in our test.  In the very near future,
      software latency will become the bottleneck to swap performnace as block
      device I/O latency gets within the shouting distance of DRAM speed.
      
      This patch set, reduced the median page fault latency from 15 usec to 4
      usec (375% reduction) for DRAM based pmem block device.
      
      This patch (of 9):
      
      swap_info_get() is used not only in swap free code path but also in
      page_swapcount(), etc.  So the original kernel message in swap_info_get()
      is not correct now.  Fix it via replacing "swap_free" to "swap_info_get"
      in the message.
      
      Link: http://lkml.kernel.org/r/9b5f8bd6266f9da978c373f2384c8044df5e262c.1484082593.git.tim.c.chen@linux.intel.com
      
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: default avatarTim Chen <tim.c.chen@linux.intel.com>
      Reviewed-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net> escreveu:
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6a991fc7
  2. 11 Jan, 2017 1 commit
    • Minchan Kim's avatar
      mm: support anonymous stable page · f0571429
      Minchan Kim authored
      During developemnt for zram-swap asynchronous writeback, I found strange
      corruption of compressed page, resulting in:
      
        Modules linked in: zram(E)
        CPU: 3 PID: 1520 Comm: zramd-1 Tainted: G            E   4.8.0-mm1-00320-ge0d4894c9c38-dirty #3274
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        task: ffff88007620b840 task.stack: ffff880078090000
        RIP: set_freeobj.part.43+0x1c/0x1f
        RSP: 0018:ffff880078093ca8  EFLAGS: 00010246
        RAX: 0000000000000018 RBX: ffff880076798d88 RCX: ffffffff81c408c8
        RDX: 0000000000000018 RSI: 0000000000000000 RDI: 0000000000000246
        RBP: ffff880078093cb0 R08: 0000000000000000 R09: 0000000000000000
        R10: ffff88005bc43030 R11: 0000000000001df3 R12: ffff880076798d88
        R13: 000000000005bc43 R14: ffff88007819d1b8 R15: 0000000000000001
        FS:  0000000000000000(0000) GS:ffff88007e380000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fc934048f20 CR3: 0000000077b01000 CR4: 00000000000406e0
        Call Trace:
          obj_malloc+0x22b/0x260
          zs_malloc+0x1e4/0x580
          zram_bvec_rw+0x4cd/0x830 [zram]
          page_requests_rw+0x9c/0x130 [zram]
          zram_thread+0xe6/0x173 [zram]
          kthread+0xca/0xe0
          ret_from_fork+0x25/0x30
      
      With investigation, it reveals currently stable page doesn't support
      anonymous page.  IOW, reuse_swap_page can reuse the page without waiting
      writeback completion so it can overwrite page zram is compressing.
      
      Unfortunately, zram has used per-cpu stream feature from v4.7.
      It aims for increasing cache hit ratio of scratch buffer for
      compressing. Downside of that approach is that zram should ask
      memory space for compressed page in per-cpu context which requires
      stricted gfp flag which could be failed. If so, it retries to
      allocate memory space out of per-cpu context so it could get memory
      this time and compress the data again, copies it to the memory space.
      
      In this scenario, zram assumes the data should never be changed
      but it is not true unless stable page supports. So, If the data is
      changed under us, zram can make buffer overrun because second
      compression size could be bigger than one we got in previous trial
      and blindly, copy bigger size object to smaller buffer which is
      buffer overrun. The overrun breaks zsmalloc free object chaining
      so system goes crash like above.
      
      I think below is same problem.
      https://bugzilla.suse.com/show_bug.cgi?id=997574
      
      Unfortunately, reuse_swap_page should be atomic so that we cannot wait on
      writeback in there so the approach in this patch is simply return false if
      we found it needs stable page.  Although it increases memory footprint
      temporarily, it happens rarely and it should be reclaimed easily althoug
      it happened.  Also, It would be better than waiting of IO completion,
      which is critial path for application latency.
      
      Fixes: da9556a2 ("zram: user per-cpu compression streams")
      Link: http://lkml.kernel.org/r/20161120233015.GA14113@bbox
      Link: http://lkml.kernel.org/r/1482366980-3782-2-git-send-email-minchan@kernel.org
      
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Hyeoncheol Lee <cheol.lee@lge.com>
      Cc: <yjay.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: <stable@vger.kernel.org> [4.7+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f0571429
  3. 13 Dec, 2016 1 commit
  4. 11 Nov, 2016 1 commit
  5. 08 Oct, 2016 2 commits
    • Huang Ying's avatar
      mm, swap: use offset of swap entry as key of swap cache · f6ab1f7f
      Huang Ying authored
      This patch is to improve the performance of swap cache operations when
      the type of the swap device is not 0.  Originally, the whole swap entry
      value is used as the key of the swap cache, even though there is one
      radix tree for each swap device.  If the type of the swap device is not
      0, the height of the radix tree of the swap cache will be increased
      unnecessary, especially on 64bit architecture.  For example, for a 1GB
      swap device on the x86_64 architecture, the height of the radix tree of
      the swap cache is 11.  But if the offset of the swap entry is used as
      the key of the swap cache, the height of the radix tree of the swap
      cache is 4.  The increased height causes unnecessary radix tree
      descending and increased cache footprint.
      
      This patch reduces the height of the radix tree of the swap cache via
      using the offset of the swap entry instead of the whole swap entry value
      as the key of the swap cache.  In 32 processes sequential swap out test
      case on a Xeon E5 v3 system with RAM disk as swap, the lock contention
      for the spinlock of the swap cache is reduced from 20.15% to 12.19%,
      when the type of the swap device is 1.
      
      Use the whole swap entry as key,
      
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 10.37,
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 9.78,
      
      Use the swap offset as key,
      
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 6.25,
        perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 5.94,
      
      Link: http://lkml.kernel.org/r/1473270649-27229-1-git-send-email-ying.huang@intel.com
      
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f6ab1f7f
    • Huang Ying's avatar
      mm, swap: add swap_cluster_list · 6b534915
      Huang Ying authored
      This is a code clean up patch without functionality changes.  The
      swap_cluster_list data structure and its operations are introduced to
      provide some better encapsulation for the free cluster and discard
      cluster list operations.  This avoid some code duplication, improved the
      code readability, and reduced the total line number.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1472067356-16004-1-git-send-email-ying.huang@intel.com
      
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shaohua Li <shli@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6b534915
  6. 19 Sep, 2016 1 commit
    • Santosh Shilimkar's avatar
      mm: fix the page_swap_info() BUG_ON check · c8de641b
      Santosh Shilimkar authored
      Commit 62c230bc ("mm: add support for a filesystem to activate
      swap files and use direct_IO for writing swap pages") replaced the
      swap_aops dirty hook from __set_page_dirty_no_writeback() with
      swap_set_page_dirty().
      
      For normal cases without these special SWP flags code path falls back to
      __set_page_dirty_no_writeback() so the behaviour is expected to be the
      same as before.
      
      But swap_set_page_dirty() makes use of the page_swap_info() helper to
      get the swap_info_struct to check for the flags like SWP_FILE,
      SWP_BLKDEV etc as desired for those features.  This helper has
      BUG_ON(!PageSwapCache(page)) which is racy and safe only for the
      set_page_dirty_lock() path.
      
      For the set_page_dirty() path which is often needed for cases to be
      called from irq context, kswapd() can toggle the flag behind the back
      while the call is getting executed when system is low on memory and
      heavy swapping is ongoing.
      
      This ends up with undesired kernel panic.
      
      This patch just moves the check outside the helper to its users
      appropriately to fix kernel panic for the described path.  Couple of
      users of helpers already take care of SwapCache condition so I skipped
      them.
      
      Link: http://lkml.kernel.org/r/1473460718-31013-1-git-send-email-santosh.shilimkar@oracle.com
      
      Signed-off-by: default avatarSantosh Shilimkar <santosh.shilimkar@oracle.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joe Perches <joe@perches.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>	[4.7.x]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c8de641b
  7. 26 Jul, 2016 1 commit
    • Vlastimil Babka's avatar
      mm, frontswap: convert frontswap_enabled to static key · 8ea1d2a1
      Vlastimil Babka authored
      I have noticed that frontswap.h first declares "frontswap_enabled" as
      extern bool variable, and then overrides it with "#define
      frontswap_enabled (1)" for CONFIG_FRONTSWAP=Y or (0) when disabled.  The
      bool variable isn't actually instantiated anywhere.
      
      This all looks like an unfinished attempt to make frontswap_enabled
      reflect whether a backend is instantiated.  But in the current state,
      all frontswap hooks call unconditionally into frontswap.c just to check
      if frontswap_ops is non-NULL.  This should at least be checked inline,
      but we can further eliminate the overhead when CONFIG_FRONTSWAP is
      enabled and no backend registered, using a static key that is initially
      disabled, and gets enabled only upon first backend registration.
      
      Thus, checks for "frontswap_enabled" are replaced with
      "frontswap_enabled()" wrapping the static key check.  There are two
      exceptions:
      
      - xen's selfballoon_process() was testing frontswap_enabled in code guarded
        by #ifdef CONFIG_FRONTSWAP, which was effectively always true when reachable.
        The patch just removes this check. Using frontswap_enabled() does not sound
        correct here, as this can be true even without xen's own backend being
        registered.
      
      - in SYSCALL_DEFINE2(swapon), change the check to IS_ENABLED(CONFIG_FRONTSWAP)
        as it seems the bitmap allocation cannot currently be postponed until a
        backend is registered. This means that frontswap will still have some
        memory overhead by being configured, but without a backend.
      
      After the patch, we can expect that some functions in frontswap.c are
      called only when frontswap_ops is non-NULL.  Change the checks there to
      VM_BUG_ONs.  While at it, convert other BUG_ONs to VM_BUG_ONs as
      frontswap has been stable for some time.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1463152235-9717-1-git-send-email-vbabka@suse.cz
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8ea1d2a1
  8. 12 May, 2016 1 commit
    • Andrea Arcangeli's avatar
      mm: thp: calculate the mapcount correctly for THP pages during WP faults · 6d0a07ed
      Andrea Arcangeli authored
      This will provide fully accuracy to the mapcount calculation in the
      write protect faults, so page pinning will not get broken by false
      positive copy-on-writes.
      
      total_mapcount() isn't the right calculation needed in
      reuse_swap_page(), so this introduces a page_trans_huge_mapcount()
      that is effectively the full accurate return value for page_mapcount()
      if dealing with Transparent Hugepages, however we only use the
      page_trans_huge_mapcount() during COW faults where it strictly needed,
      due to its higher runtime cost.
      
      This also provide at practical zero cost the total_mapcount
      information which is needed to know if we can still relocate the page
      anon_vma to the local vma. If page_trans_huge_mapcount() returns 1 we
      can reuse the page no matter if it's a pte or a pmd_trans_huge
      triggering the fault, but we can only relocate the page anon_vma to
      the local vma->anon_vma if we're sure it's only this "vma" mapping the
      whole THP physical range.
      
      Kirill A. Shutemov discovered the problem with moving the page
      anon_vma to the local vma->anon_vma in a previous version of this
      patch and another problem in the way page_move_anon_rmap() was called.
      
      Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a
      previous version, because reuse_swap_page must be a macro to call
      page_trans_huge_mapcount from swap.h, so this uses a macro again
      instead of an inline function. With this change at least it's a less
      dangerous usage than it was before, because "page" is used only once
      now, while with the previous code reuse_swap_page(page++) would have
      called page_mapcount on page+1 and it would have increased page twice
      instead of just once.
      
      Dean Luick noticed an uninitialized variable that could result in a
      rmap inefficiency for the non-THP case in a previous version.
      
      Mike Marciniszyn said:
      
      : Our RDMA tests are seeing an issue with memory locking that bisects to
      : commit 61f5d698 ("mm: re-enable THP")
      :
      : The test program registers two rather large MRs (512M) and RDMA
      : writes data to a passive peer using the first and RDMA reads it back
      : into the second MR and compares that data.  The sizes are chosen randomly
      : between 0 and 1024 bytes.
      :
      : The test will get through a few (<= 4 iterations) and then gets a
      : compare error.
      :
      : Tracing indicates the kernel logical addresses associated with the individual
      : pages at registration ARE correct , the data in the "RDMA read response only"
      : packets ARE correct.
      :
      : The "corruption" occurs when the packet crosse two pages that are not physically
      : contiguous.   The second page reads back as zero in the program.
      :
      : It looks like the user VA at the point of the compare error no longer points to
      : the same physical address as was registered.
      :
      : This patch totally resolves the issue!
      
      Link: http://lkml.kernel.org/r/1462547040-1737-2-git-send-email-aarcange@redhat.com
      
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: default avatar"Kirill A. Shutemov" <kirill@shutemov.name>
      Reviewed-by: default avatarDean Luick <dean.luick@intel.com>
      Tested-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Tested-by: default avatarMike Marciniszyn <mike.marciniszyn@intel.com>
      Tested-by: default avatarJosh Collier <josh.d.collier@intel.com>
      Cc: Marc Haber <mh+linux-kernel@zugschlus.de>
      Cc: <stable@vger.kernel.org>	[4.5]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6d0a07ed
  9. 04 Apr, 2016 1 commit
    • Kirill A. Shutemov's avatar
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov authored
      
      
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  10. 17 Mar, 2016 1 commit
  11. 22 Jan, 2016 1 commit
    • Al Viro's avatar
      wrappers for ->i_mutex access · 5955102c
      Al Viro authored
      
      
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      5955102c
  12. 21 Jan, 2016 2 commits
    • Vladimir Davydov's avatar
      mm: free swap cache aggressively if memcg swap is full · 5ccc5aba
      Vladimir Davydov authored
      
      
      Swap cache pages are freed aggressively if swap is nearly full (>50%
      currently), because otherwise we are likely to stop scanning anonymous
      when we near the swap limit even if there is plenty of freeable swap cache
      pages.  We should follow the same trend in case of memory cgroup, which
      has its own swap limit.
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5ccc5aba
    • Vladimir Davydov's avatar
      mm: memcontrol: charge swap to cgroup2 · 37e84351
      Vladimir Davydov authored
      
      
      This patchset introduces swap accounting to cgroup2.
      
      This patch (of 7):
      
      In the legacy hierarchy we charge memsw, which is dubious, because:
      
       - memsw.limit must be >= memory.limit, so it is impossible to limit
         swap usage less than memory usage. Taking into account the fact that
         the primary limiting mechanism in the unified hierarchy is
         memory.high while memory.limit is either left unset or set to a very
         large value, moving memsw.limit knob to the unified hierarchy would
         effectively make it impossible to limit swap usage according to the
         user preference.
      
       - memsw.usage != memory.usage + swap.usage, because a page occupying
         both swap entry and a swap cache page is charged only once to memsw
         counter. As a result, it is possible to effectively eat up to
         memory.limit of memory pages *and* memsw.limit of swap entries, which
         looks unexpected.
      
      That said, we should provide a different swap limiting mechanism for
      cgroup2.
      
      This patch adds mem_cgroup->swap counter, which charges the actual number
      of swap entries used by a cgroup.  It is only charged in the unified
      hierarchy, while the legacy hierarchy memsw logic is left intact.
      
      The swap usage can be monitored using new memory.swap.current file and
      limited using memory.swap.max.
      
      Note, to charge swap resource properly in the unified hierarchy, we have
      to make swap_entry_free uncharge swap only when ->usage reaches zero, not
      just ->count, i.e.  when all references to a swap entry, including the one
      taken by swap cache, are gone.  This is necessary, because otherwise
      swap-in could result in uncharging swap even if the page is still in swap
      cache and hence still occupies a swap entry.  At the same time, this
      shouldn't break memsw counter logic, where a page is never charged twice
      for using both memory and swap, because in case of legacy hierarchy we
      uncharge swap on commit (see mem_cgroup_commit_charge).
      Signed-off-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      37e84351
  13. 16 Jan, 2016 4 commits
  14. 15 Jan, 2016 2 commits
  15. 05 Jan, 2016 1 commit
  16. 08 Sep, 2015 1 commit
    • Minchan Kim's avatar
      mm: /proc/pid/smaps:: show proportional swap share of the mapping · 8334b962
      Minchan Kim authored
      
      
      We want to know per-process workingset size for smart memory management
      on userland and we use swap(ex, zram) heavily to maximize memory
      efficiency so workingset includes swap as well as RSS.
      
      On such system, if there are lots of shared anonymous pages, it's really
      hard to figure out exactly how many each process consumes memory(ie, rss
      + wap) if the system has lots of shared anonymous memory(e.g, android).
      
      This patch introduces SwapPss field on /proc/<pid>/smaps so we can get
      more exact workingset size per process.
      
      Bongkyu tested it. Result is below.
      
      1. 50M used swap
      SwapTotal: 461976 kB
      SwapFree: 411192 kB
      
      $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
      48236
      $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
      141184
      
      2. 240M used swap
      SwapTotal: 461976 kB
      SwapFree: 216808 kB
      
      $ adb shell cat /proc/*/smaps | grep "SwapPss:" | awk '{sum += $2} END {print sum}';
      230315
      $ adb shell cat /proc/*/smaps | grep "Swap:" | awk '{sum += $2} END {print sum}';
      1387744
      
      [akpm@linux-foundation.org: simplify kunmap_atomic() call]
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Reported-by: default avatarBongkyu Kim <bongkyu.kim@lge.com>
      Tested-by: default avatarBongkyu Kim <bongkyu.kim@lge.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8334b962
  17. 21 Aug, 2015 1 commit
    • Hugh Dickins's avatar
      mm: fix potential data race in SyS_swapon · 6f179af8
      Hugh Dickins authored
      
      
      While running KernelThreadSanitizer (ktsan) on upstream kernel with
      trinity, we got a few reports from SyS_swapon, here is one of them:
      
      Read of size 8 by thread T307 (K7621):
       [<     inlined    >] SyS_swapon+0x3c0/0x1850 SYSC_swapon mm/swapfile.c:2395
       [<ffffffff812242c0>] SyS_swapon+0x3c0/0x1850 mm/swapfile.c:2345
       [<ffffffff81e97c8a>] ia32_do_call+0x1b/0x25
      
      Looks like the swap_lock should be taken when iterating through the
      swap_info array on lines 2392 - 2401: q->swap_file may be reset to
      NULL by another thread before it is dereferenced for f_mapping.
      
      But why is that iteration needed at all?  Doesn't the claim_swapfile()
      which follows do all that is needed to check for a duplicate entry -
      FMODE_EXCL on a bdev, testing IS_SWAPFILE under i_mutex on a regfile?
      
      Well, not quite: bd_may_claim() allows the same "holder" to claim the
      bdev again, so we do need to use a different holder than "sys_swapon";
      and we should not replace appropriate -EBUSY by inappropriate -EINVAL.
      
      Index i was reused in a cpu loop further down: renamed cpu there.
      Reported-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      6f179af8
  18. 23 Jun, 2015 1 commit
  19. 15 Apr, 2015 1 commit
  20. 11 Dec, 2014 1 commit
  21. 08 Aug, 2014 2 commits
    • Johannes Weiner's avatar
      mm: memcontrol: rewrite uncharge API · 0a31bc97
      Johannes Weiner authored
      
      
      The memcg uncharging code that is involved towards the end of a page's
      lifetime - truncation, reclaim, swapout, migration - is impressively
      complicated and fragile.
      
      Because anonymous and file pages were always charged before they had their
      page->mapping established, uncharges had to happen when the page type
      could still be known from the context; as in unmap for anonymous, page
      cache removal for file and shmem pages, and swap cache truncation for swap
      pages.  However, these operations happen well before the page is actually
      freed, and so a lot of synchronization is necessary:
      
      - Charging, uncharging, page migration, and charge migration all need
        to take a per-page bit spinlock as they could race with uncharging.
      
      - Swap cache truncation happens during both swap-in and swap-out, and
        possibly repeatedly before the page is actually freed.  This means
        that the memcg swapout code is called from many contexts that make
        no sense and it has to figure out the direction from page state to
        make sure memory and memory+swap are always correctly charged.
      
      - On page migration, the old page might be unmapped but then reused,
        so memcg code has to prevent untimely uncharging in that case.
        Because this code - which should be a simple charge transfer - is so
        special-cased, it is not reusable for replace_page_cache().
      
      But now that charged pages always have a page->mapping, introduce
      mem_cgroup_uncharge(), which is called after the final put_page(), when we
      know for sure that nobody is looking at the page anymore.
      
      For page migration, introduce mem_cgroup_migrate(), which is called after
      the migration is successful and the new page is fully rmapped.  Because
      the old page is no longer uncharged after migration, prevent double
      charges by decoupling the page's memcg association (PCG_USED and
      pc->mem_cgroup) from the page holding an actual charge.  The new bits
      PCG_MEM and PCG_MEMSW represent the respective charges and are transferred
      to the new page during migration.
      
      mem_cgroup_migrate() is suitable for replace_page_cache() as well,
      which gets rid of mem_cgroup_replace_page_cache().  However, care
      needs to be taken because both the source and the target page can
      already be charged and on the LRU when fuse is splicing: grab the page
      lock on the charge moving side to prevent changing pc->mem_cgroup of a
      page under migration.  Also, the lruvecs of both pages change as we
      uncharge the old and charge the new during migration, and putback may
      race with us, so grab the lru lock and isolate the pages iff on LRU to
      prevent races and ensure the pages are on the right lruvec afterward.
      
      Swap accounting is massively simplified: because the page is no longer
      uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can
      transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry
      before the final put_page() in page reclaim.
      
      Finally, page_cgroup changes are now protected by whatever protection the
      page itself offers: anonymous pages are charged under the page table lock,
      whereas page cache insertions, swapin, and migration hold the page lock.
      Uncharging happens under full exclusion with no outstanding references.
      Charging and uncharging also ensure that the page is off-LRU, which
      serializes against charge migration.  Remove the very costly page_cgroup
      lock and set pc->flags non-atomically.
      
      [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable]
      [vdavydov@parallels.com: fix flags definition]
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Tested-by: default avatarJet Chen <jet.chen@intel.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Tested-by: default avatarFelipe Balbi <balbi@ti.com>
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a31bc97
    • Johannes Weiner's avatar
      mm: memcontrol: rewrite charge API · 00501b53
      Johannes Weiner authored
      
      
      These patches rework memcg charge lifetime to integrate more naturally
      with the lifetime of user pages.  This drastically simplifies the code and
      reduces charging and uncharging overhead.  The most expensive part of
      charging and uncharging is the page_cgroup bit spinlock, which is removed
      entirely after this series.
      
      Here are the top-10 profile entries of a stress test that reads a 128G
      sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
       executing in the root memcg).  Before:
      
          15.36%              cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.31%              cat  [kernel.kallsyms]   [k] memset
          11.48%              cat  [kernel.kallsyms]   [k] do_mpage_readpage
           4.23%              cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.38%              cat  [kernel.kallsyms]   [k] put_page
           2.32%              cat  [kernel.kallsyms]   [k] __mem_cgroup_commit_charge
           2.18%          kswapd0  [kernel.kallsyms]   [k] __mem_cgroup_uncharge_common
           1.92%          kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.86%              cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.62%              cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
      
      After:
      
          15.67%           cat  [kernel.kallsyms]   [k] copy_user_generic_string
          13.48%           cat  [kernel.kallsyms]   [k] memset
          11.42%           cat  [kernel.kallsyms]   [k] do_mpage_readpage
           3.98%           cat  [kernel.kallsyms]   [k] get_page_from_freelist
           2.46%           cat  [kernel.kallsyms]   [k] put_page
           2.13%       kswapd0  [kernel.kallsyms]   [k] shrink_page_list
           1.88%           cat  [kernel.kallsyms]   [k] __radix_tree_lookup
           1.67%           cat  [kernel.kallsyms]   [k] __pagevec_lru_add_fn
           1.39%       kswapd0  [kernel.kallsyms]   [k] free_pcppages_bulk
           1.30%           cat  [kernel.kallsyms]   [k] kfree
      
      As you can see, the memcg footprint has shrunk quite a bit.
      
         text    data     bss     dec     hex filename
        37970    9892     400   48262    bc86 mm/memcontrol.o.old
        35239    9892     400   45531    b1db mm/memcontrol.o
      
      This patch (of 4):
      
      The memcg charge API charges pages before they are rmapped - i.e.  have an
      actual "type" - and so every callsite needs its own set of charge and
      uncharge functions to know what type is being operated on.  Worse,
      uncharge has to happen from a context that is still type-specific, rather
      than at the end of the page's lifetime with exclusive access, and so
      requires a lot of synchronization.
      
      Rewrite the charge API to provide a generic set of try_charge(),
      commit_charge() and cancel_charge() transaction operations, much like
      what's currently done for swap-in:
      
        mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
        pages from the memcg if necessary.
      
        mem_cgroup_commit_charge() commits the page to the charge once it
        has a valid page->mapping and PageAnon() reliably tells the type.
      
        mem_cgroup_cancel_charge() aborts the transaction.
      
      This reduces the charge API and enables subsequent patches to
      drastically simplify uncharging.
      
      As pages need to be committed after rmap is established but before they
      are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
      additions again.  Revive lru_cache_add_active_or_unevictable().
      
      [hughd@google.com: fix shmem_unuse]
      [hughd@google.com: Add comments on the private use of -EAGAIN]
      Signed-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      00501b53
  22. 04 Jun, 2014 3 commits
    • Chen Yucong's avatar
      mm/swapfile.c: delete the "last_in_cluster < scan_base" loop in the body of scan_swap_map() · 50088c44
      Chen Yucong authored
      Via commit ebc2a1a6
      
       ("swap: make cluster allocation per-cpu"), we
      can find that all SWP_SOLIDSTATE "seek is cheap"(SSD case) has already
      gone to si->cluster_info scan_swap_map_try_ssd_cluster() route.  So that
      the "last_in_cluster < scan_base" loop in the body of scan_swap_map()
      has already become a dead code snippet, and it should have been deleted.
      
      This patch is to delete the redundant loop as Hugh and Shaohua
      suggested.
      
      [hughd@google.com: fix comment, simplify code]
      Signed-off-by: default avatarChen Yucong <slaoub@gmail.com>
      Cc: Shaohua Li <shli@kernel.org>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      50088c44
    • Dan Streetman's avatar
      swap: change swap_list_head to plist, add swap_avail_head · 18ab4d4c
      Dan Streetman authored
      
      
      Originally get_swap_page() started iterating through the singly-linked
      list of swap_info_structs using swap_list.next or highest_priority_index,
      which both were intended to point to the highest priority active swap
      target that was not full.  The first patch in this series changed the
      singly-linked list to a doubly-linked list, and removed the logic to start
      at the highest priority non-full entry; it starts scanning at the highest
      priority entry each time, even if the entry is full.
      
      Replace the manually ordered swap_list_head with a plist, swap_active_head.
      Add a new plist, swap_avail_head.  The original swap_active_head plist
      contains all active swap_info_structs, as before, while the new
      swap_avail_head plist contains only swap_info_structs that are active and
      available, i.e. not full.  Add a new spinlock, swap_avail_lock, to protect
      the swap_avail_head list.
      
      Mel Gorman suggested using plists since they internally handle ordering
      the list entries based on priority, which is exactly what swap was doing
      manually.  All the ordering code is now removed, and swap_info_struct
      entries and simply added to their corresponding plist and automatically
      ordered correctly.
      
      Using a new plist for available swap_info_structs simplifies and
      optimizes get_swap_page(), which no longer has to iterate over full
      swap_info_structs.  Using a new spinlock for swap_avail_head plist
      allows each swap_info_struct to add or remove themselves from the
      plist when they become full or not-full; previously they could not
      do so because the swap_info_struct->lock is held when they change
      from full<->not-full, and the swap_lock protecting the main
      swap_active_head must be ordered before any swap_info_struct->lock.
      Signed-off-by: default avatarDan Streetman <ddstreet@ieee.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      18ab4d4c
    • Dan Streetman's avatar
      swap: change swap_info singly-linked list to list_head · adfab836
      Dan Streetman authored
      The logic controlling the singly-linked list of swap_info_struct entries
      for all active, i.e.  swapon'ed, swap targets is rather complex, because:
      
       - it stores the entries in priority order
       - there is a pointer to the highest priority entry
       - there is a pointer to the highest priority not-full entry
       - there is a highest_priority_index variable set outside the swap_lock
       - swap entries of equal priority should be used equally
      
      this complexity leads to bugs such as: https://lkml.org/lkml/2014/2/13/181
      where different priority swap targets are incorrectly used equally.
      
      That bug probably could be solved with the existing singly-linked lists,
      but I think it would only add more complexity to the already difficult to
      understand get_swap_page() swap_list iteration logic.
      
      The first patch changes from a singly-linked list to a doubly-linked list
      using list_heads; the highest_priority_index and related code are removed
      and get_swap_page() starts each iteration at the highest priority
      swap_info entry, even if it's full.  While this does introduce unnecessary
      list iteration (i.e.  Schlemiel the painter's algorithm) in the case where
      one or more of the highest priority entries are full, the iteration and
      manipulation code is much simpler and behaves correctly re: the above bug;
      and the fourth patch removes the unnecessary iteration.
      
      The second patch adds some minor plist helper functions; nothing new
      really, just functions to match existing regular list functions.  These
      are used by the next two patches.
      
      The third patch adds plist_requeue(), which is used by get_swap_page() in
      the next patch - it performs the requeueing of same-priority entries
      (which moves the entry to the end of its priority in the plist), so that
      all equal-priority swap_info_structs get used equally.
      
      The fourth patch converts the main list into a plist, and adds a new plist
      that contains only swap_info entries that are both active and not full.
      As Mel suggested using plists allows removing all the ordering code from
      swap - plists handle ordering automatically.  The list naming is also
      clarified now that there are two lists, with the original list changed
      from swap_list_head to swap_active_head and the new list named
      swap_avail_head.  A new spinlock is also added for the new list, so
      swap_info entries can be added or removed from the new list immediately as
      they become full or not full.
      
      This patch (of 4):
      
      Replace the singly-linked list tracking active, i.e.  swapon'ed,
      swap_info_struct entries with a doubly-linked list using struct
      list_heads.  Simplify the logic iterating and manipulating the list of
      entries, especially get_swap_page(), by using standard list_head
      functions, and removing the highest priority iteration logic.
      
      The change fixes the bug:
      https://lkml.org/lkml/2014/2/13/181
      
      
      in which different priority swap entries after the highest priority entry
      are incorrectly used equally in pairs.  The swap behavior is now as
      advertised, i.e. different priority swap entries are used in order, and
      equal priority swap targets are used concurrently.
      Signed-off-by: default avatarDan Streetman <ddstreet@ieee.org>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
      Cc: Weijie Yang <weijieut@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      adfab836
  23. 06 Feb, 2014 1 commit
  24. 24 Jan, 2014 2 commits
  25. 13 Nov, 2013 2 commits
  26. 17 Oct, 2013 1 commit
    • Krzysztof Kozlowski's avatar
      swap: fix set_blocksize race during swapon/swapoff · 5b808a23
      Krzysztof Kozlowski authored
      
      
      Fix race between swapoff and swapon.  Swapoff used old_block_size from
      swap_info outside of swapon_mutex so it could be overwritten by
      concurrent swapon.
      
      The race has visible effect only if more than one swap block device
      exists with different block sizes (e.g.  /dev/sda1 with block size 4096
      and /dev/sdb1 with 512).  In such case it leads to setting the blocksize
      of swapped off device with wrong blocksize.
      
      The bug can be triggered with multiple concurrent swapoff and swapon:
      0. Swap for some device is on.
      1. swapoff:
      First the swapoff is called on this device and "struct swap_info_struct
      *p" is assigned. This is done under swap_lock however this lock is
      released for the call try_to_unuse().
      
      2. swapon:
      After the assignment above (and before acquiring swapon_mutex &
      swap_lock by swapoff) the swapon is called on the same device.
      The p->old_block_size is assigned to the value of block_size the device.
      This block size should be the same as previous but sometimes it is not.
      The swapon ends successfully.
      
      3. swapoff:
      Swapoff resumes, grabs the locks and mutex and continues to disable this
      swap device. Now it sets the block size to value taken from swap_info
      which was overwritten by swapon in 2.
      Signed-off-by: default avatarKrzysztof Kozlowski <k.kozlowski@samsung.com>
      Reported-by: default avatarWeijie Yang <weijie.yang.kh@gmail.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Shaohua Li <shli@fusionio.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5b808a23
  27. 11 Sep, 2013 1 commit
    • Shaohua Li's avatar
      swap: make cluster allocation per-cpu · ebc2a1a6
      Shaohua Li authored
      
      
      swap cluster allocation is to get better request merge to improve
      performance.  But the cluster is shared globally, if multiple tasks are
      doing swap, this will cause interleave disk access.  While multiple tasks
      swap is quite common, for example, each numa node has a kswapd thread
      doing swap and multiple threads/processes doing direct page reclaim.
      
      ioscheduler can't help too much here, because tasks don't send swapout IO
      down to block layer in the meantime.  Block layer does merge some IOs, but
      a lot not, depending on how many tasks are doing swapout concurrently.  In
      practice, I've seen a lot of small size IO in swapout workloads.
      
      We makes the cluster allocation per-cpu here.  The interleave disk access
      issue goes away.  All tasks swapout to their own cluster, so swapout will
      become sequential, which can be easily merged to big size IO.  If one CPU
      can't get its per-cpu cluster (for example, there is no free cluster
      anymore in the swap), it will fallback to scan swap_map.  The CPU can
      still continue swap.  We don't need recycle free swap entries of other
      CPUs.
      
      In my test (swap to a 2-disk raid0 partition), this improves around 10%
      swapout throughput, and request size is increased significantly.
      
      How does this impact swap readahead is uncertain though.  On one side,
      page reclaim always isolates and swaps several adjancent pages, this will
      make page reclaim write the pages sequentially and benefit readahead.  On
      the other side, several CPU write pages interleave means the pages don't
      live _sequentially_ but relatively _near_.  In the per-cpu allocation
      case, if adjancent pages are written by different cpus, they will live
      relatively _far_.  So how this impacts swap readahead depends on how many
      pages page reclaim isolates and swaps one time.  If the number is big,
      this patch will benefit swap readahead.  Of course, this is about
      sequential access pattern.  The patch has no impact for random access
      pattern, because the new cluster allocation algorithm is just for SSD.
      
      Alternative solution is organizing swap layout to be per-mm instead of
      this per-cpu approach.  In the per-mm layout, we allocate a disk range for
      each mm, so pages of one mm live in swap disk adjacently.  per-mm layout
      has potential issues of lock contention if multiple reclaimers are swap
      pages from one mm.  For a sequential workload, per-mm layout is better to
      implement swap readahead, because pages from the mm are adjacent in disk.
      But per-cpu layout isn't very bad in this workload, as page reclaim always
      isolates and swaps several pages one time, such pages will still live in
      disk sequentially and readahead can utilize this.  For a random workload,
      per-mm layout isn't beneficial of request merge, because it's quite
      possible pages from different mm are swapout in the meantime and IO can't
      be merged in per-mm layout.  while with per-cpu layout we can merge
      requests from any mm.  Considering random workload is more popular in
      workloads with swap (and per-cpu approach isn't too bad for sequential
      workload too), I'm choosing per-cpu layout.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarShaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ebc2a1a6