1. 28 Dec, 2018 1 commit
  2. 26 Oct, 2018 2 commits
  3. 16 Oct, 2018 1 commit
  4. 04 Sep, 2018 1 commit
  5. 24 Aug, 2018 2 commits
  6. 08 Jun, 2018 1 commit
  7. 16 Apr, 2018 1 commit
  8. 14 Apr, 2018 1 commit
  9. 11 Apr, 2018 2 commits
    • Kees Cook's avatar
      exec: pass stack rlimit into mm layout functions · 8f2af155
      Kees Cook authored
      Patch series "exec: Pin stack limit during exec".
      
      Attempts to solve problems with the stack limit changing during exec
      continue to be frustrated[1][2].  In addition to the specific issues
      around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
      other places during exec where the stack limit is used and is assumed to
      be unchanging.  Given the many places it gets used and the fact that it
      can be manipulated/raced via setrlimit() and prlimit(), I think the only
      way to handle this is to move away from the "current" view of the stack
      limit and instead attach it to the bprm, and plumb this down into the
      functions that need to know the stack limits.  This series implements
      the approach.
      
      [1] 04e35f44 ("exec: avoid RLIMIT_STACK races with prlimit()")
      [2] 779f4e1c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
      [3] to security@kernel.org, "Subject: existing rlimit races?"
      
      This patch (of 3):
      
      Since it is possible that the stack rlimit can change externally during
      exec (either via another thread calling setrlimit() or another process
      calling prlimit()), provide a way to pass the rlimit down into the
      per-architecture mm layout functions so that the rlimit can stay in the
      bprm structure instead of sitting in the signal structure until exec is
      finalized.
      
      Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
      
      Signed-off-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Cc: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8f2af155
    • Roman Gushchin's avatar
      mm: treat indirectly reclaimable memory as free in overcommit logic · d79f7aa4
      Roman Gushchin authored
      Indirectly reclaimable memory can consume a significant part of total
      memory and it's actually reclaimable (it will be released under actual
      memory pressure).
      
      So, the overcommit logic should treat it as free.
      
      Otherwise, it's possible to cause random system-wide memory allocation
      failures by consuming a significant amount of memory by indirectly
      reclaimable memory, e.g.  dentry external names.
      
      If overcommit policy GUESS is used, it might be used for denial of
      service attack under some conditions.
      
      The following program illustrates the approach.  It causes the kernel to
      allocate an unreclaimable kmalloc-256 chunk for each stat() call, so
      that at some point the overcommit logic may start blocking large
      allocation system-wide.
      
        int main()
        {
        	char buf[256];
        	unsigned long i;
        	struct stat statbuf;
      
        	buf[0] = '/';
        	for (i = 1; i < sizeof(buf); i++)
        		buf[i] = '_';
      
        	for (i = 0; 1; i++) {
        		sprintf(&buf[248], "%8lu", i);
        		stat(buf, &statbuf);
        	}
      
        	return 0;
        }
      
      This patch in combination with related indirectly reclaimable memory
      patches closes this issue.
      
      Link: http://lkml.kernel.org/r/20180313130041.8078-1-guro@fb.com
      
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d79f7aa4
  10. 06 Apr, 2018 1 commit
    • Huang Ying's avatar
      mm: fix races between swapoff and flush dcache · cb9f753a
      Huang Ying authored
      Thanks to commit 4b3ef9da ("mm/swap: split swap cache into 64MB
      trunks"), after swapoff the address_space associated with the swap
      device will be freed.  So page_mapping() users which may touch the
      address_space need some kind of mechanism to prevent the address_space
      from being freed during accessing.
      
      The dcache flushing functions (flush_dcache_page(), etc) in architecture
      specific code may access the address_space of swap device for anonymous
      pages in swap cache via page_mapping() function.  But in some cases
      there are no mechanisms to prevent the swap device from being swapoff,
      for example,
      
        CPU1					CPU2
        __get_user_pages()			swapoff()
          flush_dcache_page()
            mapping = page_mapping()
              ...				  exit_swap_address_space()
              ...				    kvfree(spaces)
              mapping_mapped(mapping)
      
      The address space may be accessed after being freed.
      
      But from cachetlb.txt and Russell King, flush_dcache_page() only care
      about file cache pages, for anonymous pages, flush_anon_page() should be
      used.  The implementation of flush_dcache_page() in all architectures
      follows this too.  They will check whether page_mapping() is NULL and
      whether mapping_mapped() is true to determine whether to flush the
      dcache immediately.  And they will use interval tree (mapping->i_mmap)
      to find all user space mappings.  While mapping_mapped() and
      mapping->i_mmap isn't used by anonymous pages in swap cache at all.
      
      So, to fix the race between swapoff and flush dcache, __page_mapping()
      is add to return the address_space for file cache pages and NULL
      otherwise.  All page_mapping() invoking in flush dcache functions are
      replaced with page_mapping_file().
      
      [akpm@linux-foundation.org: simplify page_mapping_file(), per Mike]
      Link: http://lkml.kernel.org/r/20180305083634.15174-1-ying.huang@intel.com
      
      Signed-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cb9f753a
  11. 07 Jan, 2018 2 commits
  12. 07 Sep, 2017 1 commit
  13. 10 Aug, 2017 1 commit
  14. 12 Jul, 2017 2 commits
    • Michal Hocko's avatar
      mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes · cc965a29
      Michal Hocko authored
      Now that __GFP_RETRY_MAYFAIL has a reasonable semantic regardless of the
      request size we can drop the hackish implementation for !costly orders.
      __GFP_RETRY_MAYFAIL retries as long as the reclaim makes a forward
      progress and backs of when we are out of memory for the requested size.
      Therefore we do not need to enforce__GFP_NORETRY for !costly orders just
      to silent the oom killer anymore.
      
      Link: http://lkml.kernel.org/r/20170623085345.11304-5-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cc965a29
    • Michal Hocko's avatar
      mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Michal Hocko authored
      __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
      the page allocator.  This has been true but only for allocations
      requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has been always
      ignored for smaller sizes.  This is a bit unfortunate because there is
      no way to express the same semantic for those requests and they are
      considered too important to fail so they might end up looping in the
      page allocator for ever, similarly to GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of __GFP_REPEAT flag has been removed for !costly requests we can
      give the original flag a better name and more importantly a more useful
      semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
      that the allocator would try really hard but there is no promise of a
      success.  This will work independent of the order and overrides the
      default allocator behavior.  Page allocator users have several levels of
      guarantee vs.  cost options (take GFP_KERNEL as an example)
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most light weight mode which even
         doesn't kick the background reclaim. Should be used carefully because
         it might deplete the memory and the next user might hit the more
         aggressive reclaim
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because they already had their semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
      there is no progress and we have already passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted except
      the most disruptive one (the OOM killer) and a user defined fallback
      behavior is more sensible than keep retrying in the page allocator.
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dcda9b04
  15. 06 Jul, 2017 1 commit
  16. 02 Jun, 2017 1 commit
  17. 12 May, 2017 1 commit
    • Michal Hocko's avatar
      mm, vmalloc: fix vmalloc users tracking properly · 8594a21c
      Michal Hocko authored
      Commit 1f5307b1 ("mm, vmalloc: properly track vmalloc users") has
      pulled asm/pgtable.h include dependency to linux/vmalloc.h and that
      turned out to be a bad idea for some architectures.  E.g.  m68k fails
      with
      
         In file included from arch/m68k/include/asm/pgtable_mm.h:145:0,
                          from arch/m68k/include/asm/pgtable.h:4,
                          from include/linux/vmalloc.h:9,
                          from arch/m68k/kernel/module.c:9:
         arch/m68k/include/asm/mcf_pgtable.h: In function 'nocache_page':
      >> arch/m68k/include/asm/mcf_pgtable.h:339:43: error: 'init_mm' undeclared (first use in this function)
          #define pgd_offset_k(address) pgd_offset(&init_mm, address)
      
      as spotted by kernel build bot. nios2 fails for other reason
      
        In file included from include/asm-generic/io.h:767:0,
                         from arch/nios2/include/asm/io.h:61,
                         from include/linux/io.h:25,
                         from arch/nios2/include/asm/pgtable.h:18,
           ...
      8594a21c
  18. 09 May, 2017 3 commits
    • Michal Hocko's avatar
      mm, vmalloc: use __GFP_HIGHMEM implicitly · 19809c2d
      Michal Hocko authored
      __vmalloc* allows users to provide gfp flags for the underlying
      allocation.  This API is quite popular
      
        $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
        77
      
      The only problem is that many people are not aware that they really want
      to give __GFP_HIGHMEM along with other flags because there is really no
      reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
      which are mapped to the kernel vmalloc space.  About half of users don't
      use this flag, though.  This signals that we make the API unnecessarily
      too complex.
      
      This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
      be mapped to the vmalloc space.  Current users which add __GFP_HIGHMEM
      are simplified and drop the flag.
      
      Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMatthew Wilcox <mawilcox@microsoft.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Vlastimil Babka <vbabka@s...
      19809c2d
    • Michal Hocko's avatar
      mm: support __GFP_REPEAT in kvmalloc_node for >32kB · 6c5ab651
      Michal Hocko authored
      vhost code uses __GFP_REPEAT when allocating vhost_virtqueue resp.
      vhost_vsock because it would really like to prefer kmalloc to the
      vmalloc fallback - see 23cc5a99 ("vhost-net: extend device
      allocation to vmalloc") for more context.  Michael Tsirkin has also
      noted:
      
       "__GFP_REPEAT overhead is during allocation time. Using vmalloc means
        all accesses are slowed down. Allocation is not on data path, accesses
        are."
      
      The similar applies to other vhost_kvzalloc users.
      
      Let's teach kvmalloc_node to handle __GFP_REPEAT properly.  There are
      two things to be careful about.  First we should prevent from the OOM
      killer and so have to involve __GFP_NORETRY by default and secondly
      override __GFP_REPEAT for !costly order requests as the __GFP_REPEAT is
      ignored for !costly orders.
      
      Supporting __GFP_REPEAT like semantic for !costly request is possible it
      would require changes in the page allocator.  This is out of scope of
      this patch.
      
      This patch shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170306103032.2540-3-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6c5ab651
    • Michal Hocko's avatar
      mm: introduce kv[mz]alloc helpers · a7c3e901
      Michal Hocko authored
      Patch series "kvmalloc", v5.
      
      There are many open coded kmalloc with vmalloc fallback instances in the
      tree.  Most of them are not careful enough or simply do not care about
      the underlying semantic of the kmalloc/page allocator which means that
      a) some vmalloc fallbacks are basically unreachable because the kmalloc
      part will keep retrying until it succeeds b) the page allocator can
      invoke a really disruptive steps like the OOM killer to move forward
      which doesn't sound appropriate when we consider that the vmalloc
      fallback is available.
      
      As it can be seen implementing kvmalloc requires quite an intimate
      knowledge if the page allocator and the memory reclaim internals which
      strongly suggests that a helper should be implemented in the memory
      subsystem proper.
      
      Most callers, I could find, have been converted to use the helper
      instead.  This is patch 6.  There are some more relying on __GFP_REPEAT
      in the networking stack which I have converted as well and Eric Dumazet
      was not...
      a7c3e901
  19. 02 Mar, 2017 2 commits
  20. 25 Feb, 2017 1 commit
  21. 24 Dec, 2016 1 commit
  22. 20 Oct, 2016 1 commit
  23. 19 Oct, 2016 1 commit
  24. 18 Oct, 2016 1 commit
  25. 28 Jul, 2016 1 commit
  26. 26 Jul, 2016 2 commits
    • Kirill A. Shutemov's avatar
      rmap: support file thp · dd78fedd
      Kirill A. Shutemov authored
      Naive approach: on mapping/unmapping the page as compound we update
      ->_mapcount on each 4k page.  That's not efficient, but it's not obvious
      how we can optimize this.  We can look into optimization later.
      
      PG_double_map optimization doesn't work for file pages since lifecycle
      of file pages is different comparing to anon pages: file page can be
      mapped again at any time.
      
      Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com
      
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dd78fedd
    • Minchan Kim's avatar
      mm: migrate: support non-lru movable page migration · bda807d4
      Minchan Kim authored
      We have allowed migration for only LRU pages until now and it was enough
      to make high-order pages.  But recently, embedded system(e.g., webOS,
      android) uses lots of non-movable pages(e.g., zram, GPU memory) so we
      have seen several reports about troubles of small high-order allocation.
      For fixing the problem, there were several efforts (e,g,.  enhance
      compaction algorithm, SLUB fallback to 0-order page, reserved memory,
      vmalloc and so on) but if there are lots of non-movable pages in system,
      their solutions are void in the long run.
      
      So, this patch is to support facility to change non-movable pages with
      movable.  For the feature, this patch introduces functions related to
      migration to address_space_operations as well as some page flags.
      
      If a driver want to make own pages movable, it should define three
      functions which are function pointers of struct
      address_space_operations.
      
      1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);
      
      What VM expects on isolate_page function of driver is to return *true*
      if driver isolates page successfully.  On returing true, VM marks the
      page as PG_isolated so concurrent isolation in several CPUs skip the
      page for isolation.  If a driver cannot isolate the page, it should
      return *false*.
      
      Once page is successfully isolated, VM uses page.lru fields so driver
      shouldn't expect to preserve values in that fields.
      
      2. int (*migratepage) (struct address_space *mapping,
      		struct page *newpage, struct page *oldpage, enum migrate_mode);
      
      After isolation, VM calls migratepage of driver with isolated page.  The
      function of migratepage is to move content of the old page to new page
      and set up fields of struct page newpage.  Keep in mind that you should
      indicate to the VM the oldpage is no longer movable via
      __ClearPageMovable() under page_lock if you migrated the oldpage
      successfully and returns 0.  If driver cannot migrate the page at the
      moment, driver can return -EAGAIN.  On -EAGAIN, VM will retry page
      migration in a short time because VM interprets -EAGAIN as "temporal
      migration failure".  On returning any error except -EAGAIN, VM will give
      up the page migration without retrying in this time.
      
      Driver shouldn't touch page.lru field VM using in the functions.
      
      3. void (*putback_page)(struct page *);
      
      If migration fails on isolated page, VM should return the isolated page
      to the driver so VM calls driver's putback_page with migration failed
      page.  In this function, driver should put the isolated page back to the
      own data structure.
      
      4. non-lru movable page flags
      
      There are two page flags for supporting non-lru movable page.
      
      * PG_movable
      
      Driver should use the below function to make page movable under
      page_lock.
      
      	void __SetPageMovable(struct page *page, struct address_space *mapping)
      
      It needs argument of address_space for registering migration family
      functions which will be called by VM.  Exactly speaking, PG_movable is
      not a real flag of struct page.  Rather than, VM reuses page->mapping's
      lower bits to represent it.
      
      	#define PAGE_MAPPING_MOVABLE 0x2
      	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;
      
      so driver shouldn't access page->mapping directly.  Instead, driver
      should use page_mapping which mask off the low two bits of page->mapping
      so it can get right struct address_space.
      
      For testing of non-lru movable page, VM supports __PageMovable function.
      However, it doesn't guarantee to identify non-lru movable page because
      page->mapping field is unified with other variables in struct page.  As
      well, if driver releases the page after isolation by VM, page->mapping
      doesn't have stable value although it has PAGE_MAPPING_MOVABLE (Look at
      __ClearPageMovable).  But __PageMovable is cheap to catch whether page
      is LRU or non-lru movable once the page has been isolated.  Because LRU
      pages never can have PAGE_MAPPING_MOVABLE in page->mapping.  It is also
      good for just peeking to test non-lru movable pages before more
      expensive checking with lock_page in pfn scanning to select victim.
      
      For guaranteeing non-lru movable page, VM provides PageMovable function.
      Unlike __PageMovable, PageMovable functions validates page->mapping and
      mapping->a_ops->isolate_page under lock_page.  The lock_page prevents
      sudden destroying of page->mapping.
      
      Driver using __SetPageMovable should clear the flag via
      __ClearMovablePage under page_lock before the releasing the page.
      
      * PG_isolated
      
      To prevent concurrent isolation among several CPUs, VM marks isolated
      page as PG_isolated under lock_page.  So if a CPU encounters PG_isolated
      non-lru movable page, it can skip it.  Driver doesn't need to manipulate
      the flag because VM will set/clear it automatically.  Keep in mind that
      if driver sees PG_isolated page, it means the page have been isolated by
      VM so it shouldn't touch page.lru field.  PG_isolated is alias with
      PG_reclaim flag so driver shouldn't use the flag for own purpose.
      
      [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
        Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
      Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
      
      Signed-off-by: default avatarGioh Kim <gi-oh.kim@profitbricks.com>
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarGanesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: John Einar Reitan <john.reitan@foss.arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bda807d4
  27. 24 May, 2016 2 commits
    • Michal Hocko's avatar
      mm: make vm_mmap killable · 9fbeb5ab
      Michal Hocko authored
      
      
      All the callers of vm_mmap seem to check for the failure already and
      bail out in one way or another on the error which means that we can
      change it to use killable version of vm_mmap_pgoff and return -EINTR if
      the current task gets killed while waiting for mmap_sem.  This also
      means that vm_mmap_pgoff can be killable by default and drop the
      additional parameter.
      
      This will help in the OOM conditions when the oom victim might be stuck
      waiting for the mmap_sem for write which in turn can block oom_reaper
      which relies on the mmap_sem for read to make a forward progress and
      reclaim the address space of the victim.
      
      Please note that load_elf_binary is ignoring vm_mmap error for
      current->personality & MMAP_PAGE_ZERO case but that shouldn't be a
      problem because the address is not used anywhere and we never return to
      the userspace if we got killed.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9fbeb5ab
    • Michal Hocko's avatar
      mm: make mmap_sem for write waits killable for mm syscalls · dc0ef0df
      Michal Hocko authored
      This is a follow up work for oom_reaper [1].  As the async OOM killing
      depends on oom_sem for read we would really appreciate if a holder for
      write didn't stood in the way.  This patchset is changing many of
      down_write calls to be killable to help those cases when the writer is
      blocked and waiting for readers to release the lock and so help
      __oom_reap_task to process the oom victim.
      
      Most of the patches are really trivial because the lock is help from a
      shallow syscall paths where we can return EINTR trivially and allow the
      current task to die (note that EINTR will never get to the userspace as
      the task has fatal signal pending).  Others seem to be easy as well as
      the callers are already handling fatal errors and bail and return to
      userspace which should be sufficient to handle the failure gracefully.
      I am not familiar with all those code paths so a deeper review is really
      appreciated.
      
      As this work is touching more areas which are not directly co...
      dc0ef0df
  28. 20 May, 2016 1 commit
  29. 17 Mar, 2016 1 commit
  30. 16 Feb, 2016 1 commit
    • Dave Hansen's avatar
      mm/gup: Overload get_user_pages() functions · cde70140
      Dave Hansen authored
      
      
      The concept here was a suggestion from Ingo.  The implementation
      horrors are all mine.
      
      This allows get_user_pages(), get_user_pages_unlocked(), and
      get_user_pages_locked() to be called with or without the
      leading tsk/mm arguments.  We will give a compile-time warning
      about the old style being __deprecated and we will also
      WARN_ON() if the non-remote version is used for a remote-style
      access.
      
      Doing this, folks will get nice warnings and will not break the
      build.  This should be nice for -next and will hopefully let
      developers fix up their own code instead of maintainers needing
      to do it at merge time.
      
      The way we do this is hideous.  It uses the __VA_ARGS__ macro
      functionality to call different functions based on the number
      of arguments passed to the macro.
      
      There's an additional hack to ensure that our EXPORT_SYMBOL()
      of the deprecated symbols doesn't trigger a warning.
      
      We should be able to remove this mess as soon as -rc1 hits in
      the release after this is merged.
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Geliang Tang <geliangtang@163.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Leon Romanovsky <leon@leon.nu>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210155.73222EE1@viggo.jf.intel.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      cde70140