1. 12 Jul, 2017 1 commit
    • Michal Hocko's avatar
      mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Michal Hocko authored
      __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
      the page allocator.  This has been true but only for allocations
      requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has been always
      ignored for smaller sizes.  This is a bit unfortunate because there is
      no way to express the same semantic for those requests and they are
      considered too important to fail so they might end up looping in the
      page allocator for ever, similarly to GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of __GFP_REPEAT flag has been removed for !costly requests we can
      give the original flag a better name and more importantly a more useful
      semantic.  Let's rename it to __GFP_RETRY_MAYFAIL which tells the user
      that the allocator would try really hard but there is no promise of a
      success.  This will work independent of the order and overrides the
      default allocator behavior.  Page allocator users have several levels of
      guarantee vs.  cost options (take GFP_KERNEL as an example)
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most light weight mode which even
         doesn't kick the background reclaim. Should be used carefully because
         it might deplete the memory and the next user might hit the more
         aggressive reclaim
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because they already had their semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
      there is no progress and we have already passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted except
      the most disruptive one (the OOM killer) and a user defined fallback
      behavior is more sensible than keep retrying in the page allocator.
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dcda9b04
  2. 02 Jun, 2017 1 commit
  3. 12 May, 2017 1 commit
    • Michal Hocko's avatar
      mm, vmalloc: fix vmalloc users tracking properly · 8594a21c
      Michal Hocko authored
      Commit 1f5307b1 ("mm, vmalloc: properly track vmalloc users") has
      pulled asm/pgtable.h include dependency to linux/vmalloc.h and that
      turned out to be a bad idea for some architectures.  E.g.  m68k fails
      with
      
         In file included from arch/m68k/include/asm/pgtable_mm.h:145:0,
                          from arch/m68k/include/asm/pgtable.h:4,
                          from include/linux/vmalloc.h:9,
                          from arch/m68k/kernel/module.c:9:
         arch/m68k/include/asm/mcf_pgtable.h: In function 'nocache_page':
      >> arch/m68k/include/asm/mcf_pgtable.h:339:43: error: 'init_mm' undeclared (first use in this function)
          #define pgd_offset_k(address) pgd_offset(&init_mm, address)
      
      as spotted by kernel build bot. nios2 fails for other reason
      
        In file included from include/asm-generic/io.h:767:0,
                         from arch/nios2/include/asm/io.h:61,
                         from include/linux/io.h:25,
                         from arch/nios2/include/asm/pgtable.h:18,
           ...
      8594a21c
  4. 09 May, 2017 3 commits
    • Michal Hocko's avatar
      mm, vmalloc: use __GFP_HIGHMEM implicitly · 19809c2d
      Michal Hocko authored
      __vmalloc* allows users to provide gfp flags for the underlying
      allocation.  This API is quite popular
      
        $ git grep "=[[:space:]]__vmalloc\|return[[:space:]]*__vmalloc" | wc -l
        77
      
      The only problem is that many people are not aware that they really want
      to give __GFP_HIGHMEM along with other flags because there is really no
      reason to consume precious lowmemory on CONFIG_HIGHMEM systems for pages
      which are mapped to the kernel vmalloc space.  About half of users don't
      use this flag, though.  This signals that we make the API unnecessarily
      too complex.
      
      This patch simply uses __GFP_HIGHMEM implicitly when allocating pages to
      be mapped to the vmalloc space.  Current users which add __GFP_HIGHMEM
      are simplified and drop the flag.
      
      Link: http://lkml.kernel.org/r/20170307141020.29107-1-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarMatthew Wilcox <mawilcox@microsoft.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Cristopher Lameter <cl@linux.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      19809c2d
    • Michal Hocko's avatar
      mm: support __GFP_REPEAT in kvmalloc_node for >32kB · 6c5ab651
      Michal Hocko authored
      vhost code uses __GFP_REPEAT when allocating vhost_virtqueue resp.
      vhost_vsock because it would really like to prefer kmalloc to the
      vmalloc fallback - see 23cc5a99 ("vhost-net: extend device
      allocation to vmalloc") for more context.  Michael Tsirkin has also
      noted:
      
       "__GFP_REPEAT overhead is during allocation time. Using vmalloc means
        all accesses are slowed down. Allocation is not on data path, accesses
        are."
      
      The similar applies to other vhost_kvzalloc users.
      
      Let's teach kvmalloc_node to handle __GFP_REPEAT properly.  There are
      two things to be careful about.  First we should prevent from the OOM
      killer and so have to involve __GFP_NORETRY by default and secondly
      override __GFP_REPEAT for !costly order requests as the __GFP_REPEAT is
      ignored for !costly orders.
      
      Supporting __GFP_REPEAT like semantic for !costly request is possible it
      would require changes in the page allocator.  This is out of scope of
      this patch.
      
      This patch shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170306103032.2540-3-mhocko@kernel.org
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6c5ab651
    • Michal Hocko's avatar
      mm: introduce kv[mz]alloc helpers · a7c3e901
      Michal Hocko authored
      Patch series "kvmalloc", v5.
      
      There are many open coded kmalloc with vmalloc fallback instances in the
      tree.  Most of them are not careful enough or simply do not care about
      the underlying semantic of the kmalloc/page allocator which means that
      a) some vmalloc fallbacks are basically unreachable because the kmalloc
      part will keep retrying until it succeeds b) the page allocator can
      invoke a really disruptive steps like the OOM killer to move forward
      which doesn't sound appropriate when we consider that the vmalloc
      fallback is available.
      
      As it can be seen implementing kvmalloc requires quite an intimate
      knowledge if the page allocator and the memory reclaim internals which
      strongly suggests that a helper should be implemented in the memory
      subsystem proper.
      
      Most callers, I could find, have been converted to use the helper
      instead.  This is patch 6.  There are some more relying on __GFP_REPEAT
      in the networking stack which I have converted as well and Eric Dumazet
      was not...
      a7c3e901
  5. 02 Mar, 2017 2 commits
  6. 25 Feb, 2017 1 commit
  7. 24 Dec, 2016 1 commit
  8. 20 Oct, 2016 1 commit
  9. 19 Oct, 2016 1 commit
  10. 18 Oct, 2016 1 commit
  11. 28 Jul, 2016 1 commit
  12. 26 Jul, 2016 2 commits
    • Kirill A. Shutemov's avatar
      rmap: support file thp · dd78fedd
      Kirill A. Shutemov authored
      Naive approach: on mapping/unmapping the page as compound we update
      ->_mapcount on each 4k page.  That's not efficient, but it's not obvious
      how we can optimize this.  We can look into optimization later.
      
      PG_double_map optimization doesn't work for file pages since lifecycle
      of file pages is different comparing to anon pages: file page can be
      mapped again at any time.
      
      Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com
      
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dd78fedd
    • Minchan Kim's avatar
      mm: migrate: support non-lru movable page migration · bda807d4
      Minchan Kim authored
      We have allowed migration for only LRU pages until now and it was enough
      to make high-order pages.  But recently, embedded system(e.g., webOS,
      android) uses lots of non-movable pages(e.g., zram, GPU memory) so we
      have seen several reports about troubles of small high-order allocation.
      For fixing the problem, there were several efforts (e,g,.  enhance
      compaction algorithm, SLUB fallback to 0-order page, reserved memory,
      vmalloc and so on) but if there are lots of non-movable pages in system,
      their solutions are void in the long run.
      
      So, this patch is to support facility to change non-movable pages with
      movable.  For the feature, this patch introduces functions related to
      migration to address_space_operations as well as some page flags.
      
      If a driver want to make own pages movable, it should define three
      functions which are function pointers of struct
      address_space_operations.
      
      1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);
      
      What VM expects on isolate_page function of driver is to return *true*
      if driver isolates page successfully.  On returing true, VM marks the
      page as PG_isolated so concurrent isolation in several CPUs skip the
      page for isolation.  If a driver cannot isolate the page, it should
      return *false*.
      
      Once page is successfully isolated, VM uses page.lru fields so driver
      shouldn't expect to preserve values in that fields.
      
      2. int (*migratepage) (struct address_space *mapping,
      		struct page *newpage, struct page *oldpage, enum migrate_mode);
      
      After isolation, VM calls migratepage of driver with isolated page.  The
      function of migratepage is to move content of the old page to new page
      and set up fields of struct page newpage.  Keep in mind that you should
      indicate to the VM the oldpage is no longer movable via
      __ClearPageMovable() under page_lock if you migrated the oldpage
      successfully and returns 0.  If driver cannot migrate the page at the
      moment, driver can return -EAGAIN.  On -EAGAIN, VM will retry page
      migration in a short time because VM interprets -EAGAIN as "temporal
      migration failure".  On returning any error except -EAGAIN, VM will give
      up the page migration without retrying in this time.
      
      Driver shouldn't touch page.lru field VM using in the functions.
      
      3. void (*putback_page)(struct page *);
      
      If migration fails on isolated page, VM should return the isolated page
      to the driver so VM calls driver's putback_page with migration failed
      page.  In this function, driver should put the isolated page back to the
      own data structure.
      
      4. non-lru movable page flags
      
      There are two page flags for supporting non-lru movable page.
      
      * PG_movable
      
      Driver should use the below function to make page movable under
      page_lock.
      
      	void __SetPageMovable(struct page *page, struct address_space *mapping)
      
      It needs argument of address_space for registering migration family
      functions which will be called by VM.  Exactly speaking, PG_movable is
      not a real flag of struct page.  Rather than, VM reuses page->mapping's
      lower bits to represent it.
      
      	#define PAGE_MAPPING_MOVABLE 0x2
      	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;
      
      so driver shouldn't access page->mapping directly.  Instead, driver
      should use page_mapping which mask off the low two bits of page->mapping
      so it can get right struct address_space.
      
      For testing of non-lru movable page, VM supports __PageMovable function.
      However, it doesn't guarantee to identify non-lru movable page because
      page->mapping field is unified with other variables in struct page.  As
      well, if driver releases the page after isolation by VM, page->mapping
      doesn't have stable value although it has PAGE_MAPPING_MOVABLE (Look at
      __ClearPageMovable).  But __PageMovable is cheap to catch whether page
      is LRU or non-lru movable once the page has been isolated.  Because LRU
      pages never can have PAGE_MAPPING_MOVABLE in page->mapping.  It is also
      good for just peeking to test non-lru movable pages before more
      expensive checking with lock_page in pfn scanning to select victim.
      
      For guaranteeing non-lru movable page, VM provides PageMovable function.
      Unlike __PageMovable, PageMovable functions validates page->mapping and
      mapping->a_ops->isolate_page under lock_page.  The lock_page prevents
      sudden destroying of page->mapping.
      
      Driver using __SetPageMovable should clear the flag via
      __ClearMovablePage under page_lock before the releasing the page.
      
      * PG_isolated
      
      To prevent concurrent isolation among several CPUs, VM marks isolated
      page as PG_isolated under lock_page.  So if a CPU encounters PG_isolated
      non-lru movable page, it can skip it.  Driver doesn't need to manipulate
      the flag because VM will set/clear it automatically.  Keep in mind that
      if driver sees PG_isolated page, it means the page have been isolated by
      VM so it shouldn't touch page.lru field.  PG_isolated is alias with
      PG_reclaim flag so driver shouldn't use the flag for own purpose.
      
      [opensource.ganesh@gmail.com: mm/compaction: remove local variable is_lru]
        Link: http://lkml.kernel.org/r/20160618014841.GA7422@leo-test
      Link: http://lkml.kernel.org/r/1464736881-24886-3-git-send-email-minchan@kernel.org
      
      Signed-off-by: default avatarGioh Kim <gi-oh.kim@profitbricks.com>
      Signed-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarGanesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: John Einar Reitan <john.reitan@foss.arm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bda807d4
  13. 24 May, 2016 2 commits
    • Michal Hocko's avatar
      mm: make vm_mmap killable · 9fbeb5ab
      Michal Hocko authored
      
      
      All the callers of vm_mmap seem to check for the failure already and
      bail out in one way or another on the error which means that we can
      change it to use killable version of vm_mmap_pgoff and return -EINTR if
      the current task gets killed while waiting for mmap_sem.  This also
      means that vm_mmap_pgoff can be killable by default and drop the
      additional parameter.
      
      This will help in the OOM conditions when the oom victim might be stuck
      waiting for the mmap_sem for write which in turn can block oom_reaper
      which relies on the mmap_sem for read to make a forward progress and
      reclaim the address space of the victim.
      
      Please note that load_elf_binary is ignoring vm_mmap error for
      current->personality & MMAP_PAGE_ZERO case but that shouldn't be a
      problem because the address is not used anywhere and we never return to
      the userspace if we got killed.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9fbeb5ab
    • Michal Hocko's avatar
      mm: make mmap_sem for write waits killable for mm syscalls · dc0ef0df
      Michal Hocko authored
      This is a follow up work for oom_reaper [1].  As the async OOM killing
      depends on oom_sem for read we would really appreciate if a holder for
      write didn't stood in the way.  This patchset is changing many of
      down_write calls to be killable to help those cases when the writer is
      blocked and waiting for readers to release the lock and so help
      __oom_reap_task to process the oom victim.
      
      Most of the patches are really trivial because the lock is help from a
      shallow syscall paths where we can return EINTR trivially and allow the
      current task to die (note that EINTR will never get to the userspace as
      the task has fatal signal pending).  Others seem to be easy as well as
      the callers are already handling fatal errors and bail and return to
      userspace which should be sufficient to handle the failure gracefully.
      I am not familiar with all those code paths so a deeper review is really
      appreciated.
      
      As this work is touching more areas which are not directly connected I
      have tried to keep the CC list as small as possible and people who I
      believed would be familiar are CCed only to the specific patches (all
      should have received the cover though).
      
      This patchset is based on linux-next and it depends on
      down_write_killable for rw_semaphores which got merged into tip
      locking/rwsem branch and it is merged into this next tree.  I guess it
      would be easiest to route these patches via mmotm because of the
      dependency on the tip tree but if respective maintainers prefer other
      way I have no objections.
      
      I haven't covered all the mmap_write(mm->mmap_sem) instances here
      
        $ git grep "down_write(.*\<mmap_sem\>)" next/master | wc -l
        98
        $ git grep "down_write(.*\<mmap_sem\>)" | wc -l
        62
      
      I have tried to cover those which should be relatively easy to review in
      this series because this alone should be a nice improvement.  Other
      places can be changed on top.
      
      [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
      [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org
      
      
      
      This patch (of 18):
      
      This is the first step in making mmap_sem write waiters killable.  It
      focuses on the trivial ones which are taking the lock early after
      entering the syscall and they are not changing state before.
      
      Therefore it is very easy to change them to use down_write_killable and
      immediately return with -EINTR.  This will allow the waiter to pass away
      without blocking the mmap_sem which might be required to make a forward
      progress.  E.g.  the oom reaper will need the lock for reading to
      dismantle the OOM victim address space.
      
      The only tricky function in this patch is vm_mmap_pgoff which has many
      call sites via vm_mmap.  To reduce the risk keep vm_mmap with the
      original non-killable semantic for now.
      
      vm_munmap callers do not bother checking the return value so open code
      it into the munmap syscall path for now for simplicity.
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dc0ef0df
  14. 20 May, 2016 1 commit
  15. 17 Mar, 2016 1 commit
  16. 16 Feb, 2016 1 commit
    • Dave Hansen's avatar
      mm/gup: Overload get_user_pages() functions · cde70140
      Dave Hansen authored
      
      
      The concept here was a suggestion from Ingo.  The implementation
      horrors are all mine.
      
      This allows get_user_pages(), get_user_pages_unlocked(), and
      get_user_pages_locked() to be called with or without the
      leading tsk/mm arguments.  We will give a compile-time warning
      about the old style being __deprecated and we will also
      WARN_ON() if the non-remote version is used for a remote-style
      access.
      
      Doing this, folks will get nice warnings and will not break the
      build.  This should be nice for -next and will hopefully let
      developers fix up their own code instead of maintainers needing
      to do it at merge time.
      
      The way we do this is hideous.  It uses the __VA_ARGS__ macro
      functionality to call different functions based on the number
      of arguments passed to the macro.
      
      There's an additional hack to ensure that our EXPORT_SYMBOL()
      of the deprecated symbols doesn't trigger a warning.
      
      We should be able to remove this mess as soon as -rc1 hits in
      the release after this is merged.
      Signed-off-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Geliang Tang <geliangtang@163.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Leon Romanovsky <leon@leon.nu>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xie XiuQi <xiexiuqi@huawei.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210155.73222EE1@viggo.jf.intel.com
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      cde70140
  17. 03 Feb, 2016 1 commit
    • Johannes Weiner's avatar
      proc: revert /proc/<pid>/maps [stack:TID] annotation · 65376df5
      Johannes Weiner authored
      Commit b7643757 ("procfs: mark thread stack correctly in
      proc/<pid>/maps") added [stack:TID] annotation to /proc/<pid>/maps.
      
      Finding the task of a stack VMA requires walking the entire thread list,
      turning this into quadratic behavior: a thousand threads means a
      thousand stacks, so the rendering of /proc/<pid>/maps needs to look at a
      million combinations.
      
      The cost is not in proportion to the usefulness as described in the
      patch.
      
      Drop the [stack:TID] annotation to make /proc/<pid>/maps (and
      /proc/<pid>/numa_maps) usable again for higher thread counts.
      
      The [stack] annotation inside /proc/<pid>/task/<tid>/maps is retained, as
      identifying the stack VMA there is an O(1) operation.
      
      Siddesh said:
       "The end users needed a way to identify thread stacks programmatically and
        there wasn't a way to do that.  I'm afraid I no longer remember (or have
        access to the resources that would aid my memory since I changed
        employers) the details of their require...
      65376df5
  18. 21 Jan, 2016 1 commit
  19. 16 Jan, 2016 2 commits
    • Kirill A. Shutemov's avatar
      mm: prepare page_referenced() and page_idle to new THP refcounting · b20ce5e0
      Kirill A. Shutemov authored
      
      
      Both page_referenced() and page_idle_clear_pte_refs_one() assume that
      THP can only be mapped with PMD, so there's no reason to look on PTEs
      for PageTransHuge() pages.  That's no true anymore: THP can be mapped
      with PTEs too.
      
      The patch removes PageTransHuge() test from the functions and opencode
      page table check.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b20ce5e0
    • Kirill A. Shutemov's avatar
      mm: sanitize page->mapping for tail pages · 1c290f64
      Kirill A. Shutemov authored
      
      
      We don't define meaning of page->mapping for tail pages.  Currently it's
      always NULL, which can be inconsistent with head page and potentially
      lead to problems.
      
      Let's poison the pointer to catch all illigal uses.
      
      page_rmapping(), page_mapping() and page_anon_vma() are changed to look
      on head page.
      
      The only illegal use I've caught so far is __GPF_COMP pages from sound
      subsystem, mapped with PTEs.  do_shared_fault() is changed to use
      page_rmapping() instead of direct access to fault_page->mapping.
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1c290f64
  20. 04 Jan, 2016 1 commit
  21. 06 Nov, 2015 1 commit
  22. 15 Apr, 2015 1 commit
  23. 14 Feb, 2015 1 commit
    • Andrzej Hajda's avatar
      mm/util: add kstrdup_const · a4bb1e43
      Andrzej Hajda authored
      
      
      kstrdup() is often used to duplicate strings where neither source neither
      destination will be ever modified.  In such case we can just reuse the
      source instead of duplicating it.  The problem is that we must be sure
      that the source is non-modifiable and its life-time is long enough.
      
      I suspect the good candidates for such strings are strings located in
      kernel .rodata section, they cannot be modifed because the section is
      read-only and their life-time is equal to kernel life-time.
      
      This small patchset proposes alternative version of kstrdup -
      kstrdup_const, which returns source string if it is located in .rodata
      otherwise it fallbacks to kstrdup.  To verify if the source is in
      .rodata function checks if the address is between sentinels
      __start_rodata, __end_rodata.  I guess it should work with all
      architectures.
      
      The main patch is accompanied by four patches constifying kstrdup for
      cases where situtation described above happens frequently.
      
      I have tested the patchset on mobile platform (exynos4210-trats) and it
      saves 3272 string allocations.  Since minimal allocation is 32 or 64
      bytes depending on Kconfig options the patchset saves respectively about
      100KB or 200KB of memory.
      
      Stats from tested platform show that the main offender is sysfs:
      
      By caller:
        2260 __kernfs_new_node
          631 clk_register+0xc8/0x1b8
          318 clk_register+0x34/0x1b8
            51 kmem_cache_create
            12 alloc_vfsmnt
      
      By string (with count >= 5):
          883 power
          876 subsystem
          135 parameters
          132 device
           61 iommu_group
          ...
      
      This patch (of 5):
      
      Add an alternative version of kstrdup which returns pointer to constant
      char array.  The function checks if input string is in persistent and
      read-only memory section, if yes it returns the input string, otherwise it
      fallbacks to kstrdup.
      
      kstrdup_const is accompanied by kfree_const performing conditional memory
      deallocation of the string.
      Signed-off-by: default avatarAndrzej Hajda <a.hajda@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Mike Turquette <mturquette@linaro.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg KH <greg@kroah.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a4bb1e43
  24. 12 Feb, 2015 1 commit
  25. 10 Oct, 2014 1 commit
    • Oleg Nesterov's avatar
      proc/maps: make vm_is_stack() logic namespace-friendly · 58cb6548
      Oleg Nesterov authored
      
      
      - Rename vm_is_stack() to task_of_stack() and change it to return
        "struct task_struct *" rather than the global (and thus wrong in
        general) pid_t.
      
      - Add the new pid_of_stack() helper which calls task_of_stack() and
        uses the right namespace to report the correct pid_t.
      
        Unfortunately we need to define this helper twice, in task_mmu.c
        and in task_nommu.c. perhaps it makes sense to add fs/proc/util.c
        and move at least pid_of_stack/task_of_stack there to avoid the
        code duplication.
      
      - Change show_map_vma() and show_numa_map() to use the new helper.
      Signed-off-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      58cb6548
  26. 08 Aug, 2014 1 commit
  27. 07 Aug, 2014 1 commit
  28. 06 May, 2014 1 commit
  29. 07 Apr, 2014 1 commit
  30. 07 Mar, 2014 1 commit
  31. 22 Jan, 2014 1 commit
    • Jerome Marchand's avatar
      mm: add overcommit_kbytes sysctl variable · 49f0ce5f
      Jerome Marchand authored
      
      
      Some applications that run on HPC clusters are designed around the
      availability of RAM and the overcommit ratio is fine tuned to get the
      maximum usage of memory without swapping.  With growing memory, the
      1%-of-all-RAM grain provided by overcommit_ratio has become too coarse
      for these workload (on a 2TB machine it represents no less than 20GB).
      
      This patch adds the new overcommit_kbytes sysctl variable that allow a
      much finer grain.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix nommu build]
      Signed-off-by: default avatarJerome Marchand <jmarchan@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      49f0ce5f
  32. 15 Jan, 2014 1 commit
    • Mikulas Patocka's avatar
      mm: fix crash when using XFS on loopback · 03e5ac2f
      Mikulas Patocka authored
      Commit 8456a648 ("slab: use struct page for slab management") causes
      a crash in the LVM2 testsuite on PA-RISC (the crashing test is
      fsadm.sh).  The testsuite doesn't crash on 3.12, crashes on 3.13-rc1 and
      later.
      
       Bad Address (null pointer deref?): Code=15 regs=000000413edd89a0 (Addr=000006202224647d)
       CPU: 3 PID: 24008 Comm: loop0 Not tainted 3.13.0-rc6 #5
       task: 00000001bf3c0048 ti: 000000413edd8000 task.ti: 000000413edd8000
      
            YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
       PSW: 00001000000001101111100100001110 Not tainted
       r00-03  000000ff0806f90e 00000000405c8de0 000000004013e6c0 000000413edd83f0
       r04-07  00000000405a95e0 0000000000000200 00000001414735f0 00000001bf349e40
       r08-11  0000000010fe3d10 0000000000000001 00000040829c7778 000000413efd9000
       r12-15  0000000000000000 000000004060d800 0000000010fe3000 0000000010fe3000
       r16-19  000000413edd82a0 00000041078ddbc0 0000000000000010 0000000000000001
       r20-23  0008f3d0d83a8000 0000000000000000 00000040829c7778 0000000000000080
       r24-27  00000001bf349e40 00000001bf349e40 202d66202224640d 00000000405a95e0
       r28-31  202d662022246465 000000413edd88f0 000000413edd89a0 0000000000000001
       sr00-03  000000000532c000 0000000000000000 0000000000000000 000000000532c000
       sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
      
       IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000401fe42c 00000000401fe430
        IIR: 539c0030    ISR: 00000000202d6000  IOR: 000006202224647d
        CPU:        3   CR30: 000000413edd8000 CR31: 0000000000000000
        ORIG_R28: 00000000405a95e0
        IAOQ[0]: vma_interval_tree_iter_first+0x14/0x48
        IAOQ[1]: vma_interval_tree_iter_first+0x18/0x48
        RP(r2): flush_dcache_page+0x128/0x388
       Backtrace:
         flush_dcache_page+0x128/0x388
         lo_splice_actor+0x90/0x148 [loop]
         splice_from_pipe_feed+0xc0/0x1d0
         __splice_from_pipe+0xac/0xc0
         lo_direct_splice_actor+0x1c/0x70 [loop]
         splice_direct_to_actor+0xec/0x228
         lo_receive+0xe4/0x298 [loop]
         loop_thread+0x478/0x640 [loop]
         kthread+0x134/0x168
         end_fault_vector+0x20/0x28
         xfs_setsize_buftarg+0x0/0x90 [xfs]
      
       Kernel panic - not syncing: Bad Address (null pointer deref?)
      
      Commit 8456a648
      
       changes the page structure so that the slab
      subsystem reuses the page->mapping field.
      
      The crash happens in the following way:
       * XFS allocates some memory from slab and issues a bio to read data
         into it.
       * the bio is sent to the loopback device.
       * lo_receive creates an actor and calls splice_direct_to_actor.
       * lo_splice_actor copies data to the target page.
       * lo_splice_actor calls flush_dcache_page because the page may be
         mapped by userspace.  In that case we need to flush the kernel cache.
       * flush_dcache_page asks for the list of userspace mappings, however
         that page->mapping field is reused by the slab subsystem for a
         different purpose.  This causes the crash.
      
      Note that other architectures without coherent caches (sparc, arm, mips)
      also call page_mapping from flush_dcache_page, so they may crash in the
      same way.
      
      This patch fixes this bug by testing if the page is a slab page in
      page_mapping and returning NULL if it is.
      
      The patch also fixes VM_BUG_ON(PageSlab(page)) that could happen in
      earlier kernels in the same scenario on architectures without cache
      coherence when CONFIG_DEBUG_VM is enabled - so it should be backported
      to stable kernels.
      
      In the old kernels, the function page_mapping is placed in
      include/linux/mm.h, so you should modify the patch accordingly when
      backporting it.
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Cc: John David Anglin <dave.anglin@bell.net>]
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: default avatarPekka Enberg <penberg@kernel.org>
      Reviewed-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      03e5ac2f
  33. 13 Nov, 2013 1 commit
  34. 11 Sep, 2013 1 commit