1. 31 Jan, 2020 18 commits
• John Hubbard
      mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM · c4237f8b
      John Hubbard authored
      As it says in the updated comment in gup.c: current FOLL_LONGTERM
      behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the FS
      DAX check requirement on vmas.
      
      However, the corresponding restriction in get_user_pages_remote() was
      slightly stricter than is actually required: it forbade all
      FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
      that do not set the "locked" arg.
      
      Update the code and comments to loosen the restriction, allowing
      FOLL_LONGTERM in some cases.
      
      Also, copy the DAX check ("if a VMA is DAX, don't allow long term
      pinning") from the VFIO call site, all the way into the internals of
      get_user_pages_remote() and __gup_longterm_locked().  That is:
      get_user_pages_remote() calls __gup_longterm_locked(), which in turn
      calls check_dax_vmas().  This check will then be removed from the VFIO
      call site in a subsequent patch.
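A hedged sketch of the loosened check described above (illustrative only; the
exact code and argument order in gup.c may differ slightly):

    if (gup_flags & FOLL_LONGTERM) {
        /*
         * FOLL_LONGTERM is incompatible with FAULT_FLAG_ALLOW_RETRY, so
         * callers that pass a non-NULL "locked" arg are still rejected;
         * everyone else goes through the DAX-checking path.
         */
        if (WARN_ON_ONCE(locked))
            return -EINVAL;
        return __gup_longterm_locked(tsk, mm, start, nr_pages, pages,
                                     vmas, gup_flags | FOLL_TOUCH);
    }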
      
      Thanks to Jason Gunthorpe for pointing out a clean way to fix this, and
      to Dan Williams for helping clarify the DAX refactoring.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-7-jhubbard@nvidia.com
      
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4237f8b
• John Hubbard
      mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages · 07d80269
      John Hubbard authored
      An upcoming patch changes and complicates the refcounting and especially
      the "put page" aspects of it.  In order to keep everything clean,
      refactor the devmap page release routines:
      
      * Rename put_devmap_managed_page() to page_is_devmap_managed(), and
        limit the functionality to "read only": return a bool, with no side
        effects.
      
      * Add a new routine, put_devmap_managed_page(), to handle decrementing
        the refcount for ZONE_DEVICE pages.
      
      * Change callers (just release_pages() and put_page()) to check
        page_is_devmap_managed() before calling the new
        put_devmap_managed_page() routine.  This is a performance point:
  put_page() is a hot path, so we need to avoid non-inline function calls
        where possible.
      
      * Rename __put_devmap_managed_page() to free_devmap_managed_page(), and
        limit the functionality to unconditionally freeing a devmap page.
      
      This is originally based on a separate patch by Ira Weiny, which applied
      to an early version of the put_user_page() experiments.  Since then,
      Jérôme Glisse suggested the refactoring described above.
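A hedged sketch of the resulting caller pattern (simplified; the real inline
code lives in include/linux/mm.h):

    static inline void put_page(struct page *page)
    {
        page = compound_head(page);

        /*
         * page_is_devmap_managed() is now a read-only check with no side
         * effects; only when it returns true do we take the out-of-line
         * put_devmap_managed_page() path that actually drops the
         * ZONE_DEVICE refcount.
         */
        if (page_is_devmap_managed(page)) {
            put_devmap_managed_page(page);
            return;
        }

        if (put_page_testzero(page))
            __put_page(page);
    }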
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-5-jhubbard@nvidia.com
      
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Suggested-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      07d80269
• Dan Williams
      mm: Cleanup __put_devmap_managed_page() vs ->page_free() · 429589d6
      Dan Williams authored
      After the removal of the device-public infrastructure there are only 2
->page_free() callbacks in the kernel.  One of those is a
      device-private callback in the nouveau driver, the other is a generic
      wakeup needed in the DAX case.  In the hopes that all ->page_free()
      callbacks can be migrated to common core kernel functionality, move the
      device-private specific actions in __put_devmap_managed_page() under the
      is_device_private_page() conditional, including the ->page_free()
      callback.  For the other page types just open-code the generic wakeup.
      
      Yes, the wakeup is only needed in the MEMORY_DEVICE_FSDAX case, but it
      does no harm in the MEMORY_DEVICE_DEVDAX and MEMORY_DEVICE_PCI_P2PDMA
      case.
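A hedged sketch of the resulting split (simplified; details such as the exact
refcount handling are omitted):

    static void __put_devmap_managed_page(struct page *page)
    {
        if (page_ref_dec_return(page) == 1) {
            /* Generic wakeup, open-coded: only FSDAX needs it, but it
             * is harmless for DEVDAX and PCI_P2PDMA pages. */
            if (!is_device_private_page(page)) {
                wake_up_var(&page->_refcount);
                return;
            }
            /* Only device-private pages keep a ->page_free() callback. */
            page->mapping = NULL;
            page->pgmap->ops->page_free(page);
        }
    }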
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-4-jhubbard@nvidia.com
      
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      429589d6
• John Hubbard
      mm/gup: move try_get_compound_head() to top, fix minor issues · a707cdd5
      John Hubbard authored
      An upcoming patch uses try_get_compound_head() more widely, so move it to
      the top of gup.c.
      
      Also fix a tiny spelling error and a checkpatch.pl warning.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-3-jhubbard@nvidia.com
      
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a707cdd5
• John Hubbard
      mm/gup: factor out duplicate code from four routines · a43e9820
      John Hubbard authored
      Patch series "mm/gup: prereqs to track dma-pinned pages: FOLL_PIN", v12.
      
      Overview:
      
      This is a prerequisite to solving the problem of proper interactions
      between file-backed pages, and [R]DMA activities, as discussed in [1],
      [2], [3], and in a remarkable number of email threads since about
      2017.  :)
      
      A new internal gup flag, FOLL_PIN is introduced, and thoroughly
      documented in the last patch's Documentation/vm/pin_user_pages.rst.
      
      I believe that this will provide a good starting point for doing the
      layout lease work that Ira Weiny has been working on.  That's because
      these new wrapper functions provide a clean, constrained, systematically
      named set of functionality that, again, is required in order to even
      know if a page is "dma-pinned".
      
      In contrast to earlier approaches, the page tracking can be
      incrementally applied to the kernel call sites that, until now, have
      been simply calling get_user_pages() ("gup").  In other words, opt-in by
      changing from this:
      
          get_user_pages() (sets FOLL_GET)
          put_page()
      
      to this:
          pin_user_pages() (sets FOLL_PIN)
          unpin_user_page()
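For illustration, a hedged sketch of what such an opt-in conversion looks like
at a hypothetical call site (not code from this series):

    /* Before: */
    nr = get_user_pages(start, nr_pages, FOLL_WRITE, pages, NULL);
    /* ... DMA to/from the pages ... */
    for (i = 0; i < nr; i++)
        put_page(pages[i]);

    /* After: */
    nr = pin_user_pages(start, nr_pages, FOLL_WRITE, pages, NULL);
    /* ... DMA to/from the pages ... */
    for (i = 0; i < nr; i++)
        unpin_user_page(pages[i]);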
      
      Testing:
      
      * I've done some overall kernel testing (LTP, and a few other goodies),
        and some directed testing to exercise some of the changes. And as you
        can see, gup_benchmark is enhanced to exercise this. Basically, I've
        been able to runtime test the core get_user_pages() and
        pin_user_pages() and related routines, but not so much on several of
        the call sites--but those are generally just a couple of lines
        changed, each.
      
        Not much of the kernel is actually using this, which on one hand
        reduces risk quite a lot. But on the other hand, testing coverage
        is low. So I'd love it if, in particular, the Infiniband and PowerPC
        folks could do a smoke test of this series for me.
      
        Runtime testing for the call sites so far is pretty light:
      
          * io_uring: Some directed tests from liburing exercise this, and
                      they pass.
          * process_vm_access.c: A small directed test passes.
          * gup_benchmark: the enhanced version hits the new gup.c code, and
                           passes.
          * infiniband: Ran rdma-core tests: rdma-core/build/bin/run_tests.py
          * VFIO: compiles (I'm vowing to set up a run time test soon, but it's
                            not ready just yet)
          * powerpc: it compiles...
          * drm/via: compiles...
          * goldfish: compiles...
          * net/xdp: compiles...
          * media/v4l2: compiles...
      
      [1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
      [2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
      [3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
      
      This patch (of 22):
      
      There are four locations in gup.c that have a fair amount of code
      duplication.  This means that changing one requires making the same
      changes in four places, not to mention reading the same code four times,
      and wondering if there are subtle differences.
      
      Factor out the common code into static functions, thus reducing the
      overall line count and the code's complexity.
      
      Also, take the opportunity to slightly improve the efficiency of the
      error cases, by doing a mass subtraction of the refcount, surrounded by
      get_page()/put_page().
      
Also, further simplify (slightly), by waiting until the successful
      end of each routine, to increment *nr.
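A hedged sketch of the kind of helper this factors out (the name and exact
form are illustrative):

    /* Drop 'refs' references taken on a compound head in one go, instead
     * of making 'refs' separate put_page() calls. */
    static void put_compound_head(struct page *page, int refs)
    {
        if (refs > 1)
            page_ref_sub(page, refs - 1);
        put_page(page);
    }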
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-2-jhubbard@nvidia.com
      
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a43e9820
• Wei Yang
      be9d3045
• Qiujun Huang
      mm: fix gup_pud_range · 15494520
      Qiujun Huang authored
Sorry for not following up on this for a long time.  I have hit it again.
      
      patch v1   https://lkml.org/lkml/2019/9/20/656
      
      do_machine_check()
        do_memory_failure()
          memory_failure()
            hw_poison_user_mappings()
              try_to_unmap()
                pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
      
      ...and now we have a swap entry that indicates that the page entry
      refers to a bad (and poisoned) page of memory, but gup_fast() at this
      level of the page table was ignoring swap entries, and incorrectly
      assuming that "!pxd_none() == valid and present".
      
And this was not just a poisoned page problem, but a general swap entry
      problem.  So, any swap entry type (device memory migration, numa
      migration, or just regular swapping) could lead to the same problem.
      
      Fix this by checking for pxd_present(), instead of pxd_none().
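A hedged sketch of the change at one level (using the PUD level as an
example; the same idea applies wherever the fast path walked such entries):

    /* Before: a swap/hwpoison entry is not "none", so it slipped through */
    if (pud_none(pud))
        return 0;

    /* After: anything that is not present (including hwpoison, device and
     * migration swap entries) must bail out of the fast path */
    if (!pud_present(pud))
        return 0;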
      
      Link: http://lkml.kernel.org/r/1578479084-15508-1-git-send-email-hqjagain@gmail.com
      
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      15494520
• Ira Weiny
      mm/filemap.c: clean up filemap_write_and_wait() · ddf8f376
      Ira Weiny authored
      At some point filemap_write_and_wait() and
      filemap_write_and_wait_range() got the exact same implementation with
the exception of the range being specified in *_range().
      
      Similar to other functions in fs.h which call *_range(..., 0,
      LLONG_MAX), change filemap_write_and_wait() to be a static inline which
calls filemap_write_and_wait_range().
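A hedged sketch of the resulting static inline (per the description above;
the actual declaration is in include/linux/fs.h):

    static inline int filemap_write_and_wait(struct address_space *mapping)
    {
        return filemap_write_and_wait_range(mapping, 0, LLONG_MAX);
    }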
      
      Link: http://lkml.kernel.org/r/20191129160713.30892-1-ira.weiny@intel.com
      
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ddf8f376
• Vlastimil Babka
      mm/debug.c: always print flags in dump_page() · 5b57b8f2
      Vlastimil Babka authored
      Commit 76a1850e ("mm/debug.c: __dump_page() prints an extra line")
      inadvertently removed printing of page flags for pages that are neither
      anon nor ksm nor have a mapping.  Fix that.
      
      Using pr_cont() again would be a solution, but the commit explicitly
      removed its use.  Avoiding the danger of mixing up split lines from
      multiple CPUs might be beneficial for near-panic dumps like this, so fix
      this without reintroducing pr_cont().
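A hedged sketch of the restored output line (the format string is
illustrative of the %pGp-based flags dump, not necessarily the exact one used
in __dump_page()):

    pr_warn("%sflags: %#lx(%pGp)\n", type, page->flags, &page->flags);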
      
      Link: http://lkml.kernel.org/r/9f884d5c-ca60-dc7b-219c-c081c755fab6@suse.cz
Fixes: 76a1850e ("mm/debug.c: __dump_page() prints an extra line")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reported-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b57b8f2
• He Zhe
      mm/kmemleak: turn kmemleak_lock and object->lock to raw_spinlock_t · 8c96f1bc
      He Zhe authored
kmemleak_lock, as a rwlock on RT, can possibly be acquired in atomic
context, which does not work.
      
Since the kmemleak operation is performed in atomic context, make it a
raw_spinlock_t so it can also be acquired on RT.  This is used for
debugging and is not enabled by default in a production-like environment
(where performance/latency matters), so it makes sense to make it a
raw_spinlock_t instead of trying to get rid of the atomic context.  Also
turn the kmemleak_object->lock into a raw_spinlock_t, which is acquired
(nested) while the kmemleak_lock is held.
      
The time spent in "echo scan > kmemleak" slightly improved on a 64-core
box with this patch applied after boot.
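A hedged sketch of the conversion pattern (illustrative; the actual patch
converts every lock/unlock site in mm/kmemleak.c, including object->lock):

    static DEFINE_RAW_SPINLOCK(kmemleak_lock);      /* was a rwlock_t */

    unsigned long flags;

    raw_spin_lock_irqsave(&kmemleak_lock, flags);
    /* ... look up / update the object tree ... */
    raw_spin_unlock_irqrestore(&kmemleak_lock, flags);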
      
      [bigeasy@linutronix.de: redo the description, update comments. Merge the individual bits:  He Zhe did the kmemleak_lock, Liu Haitao the ->lock and Yongxin Liu forwarded Liu's patch.]
      Link: http://lkml.kernel.org/r/20191219170834.4tah3prf2gdothz4@linutronix.de
      Link: https://lkml.kernel.org/r/20181218150744.GB20197@arrakis.emea.arm.com
      Link: https://lkml.kernel.org/r/1542877459-144382-1-git-send-email-zhe.he@windriver.com
      Link: https://lkml.kernel.org/r/20190927082230.34152-1-yongxin.liu@windriver.com
      
Signed-off-by: He Zhe <zhe.he@windriver.com>
Signed-off-by: Liu Haitao <haitao.liu@windriver.com>
Signed-off-by: Yongxin Liu <yongxin.liu@windriver.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8c96f1bc
• Yu Zhao
      mm/slub.c: avoid slub allocation while holding list_lock · 90e9f6a6
      Yu Zhao authored
      If we are already under list_lock, don't call kmalloc().  Otherwise we
      will run into a deadlock because kmalloc() also tries to grab the same
      lock.
      
      Fix the problem by using a static bitmap instead.
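A hedged sketch of the approach (illustrative; the real bitmap in mm/slub.c is
sized for the maximum objects per slab and protected by its own spinlock):

    /* Pre-allocated, so no kmalloc() is needed while n->list_lock is held. */
    static DECLARE_BITMAP(object_map, MAX_OBJS_PER_PAGE);
    static DEFINE_SPINLOCK(object_map_lock);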
      
        WARNING: possible recursive locking detected
        --------------------------------------------
        mount-encrypted/4921 is trying to acquire lock:
        (&(&n->list_lock)->rlock){-.-.}, at: ___slab_alloc+0x104/0x437
      
        but task is already holding lock:
        (&(&n->list_lock)->rlock){-.-.}, at: __kmem_cache_shutdown+0x81/0x3cb
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(&(&n->list_lock)->rlock);
          lock(&(&n->list_lock)->rlock);
      
         *** DEADLOCK ***
      
      Link: http://lkml.kernel.org/r/20191108193958.205102-2-yuzhao@google.com
      
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90e9f6a6
• Yang Shi
      mm: move_pages: report the number of non-attempted pages · 5984fabb
      Yang Shi authored
      Since commit a49bd4d7 ("mm, numa: rework do_pages_move"), the
      semantic of move_pages() has changed to return the number of
non-migrated pages if they were the result of non-fatal reasons (usually a
      busy page).
      
      This was an unintentional change that hasn't been noticed except for LTP
      tests which checked for the documented behavior.
      
There are two ways to handle this change.  We could go back to the
original behavior and return -EAGAIN whenever migrate_pages() is not
able to migrate pages due to non-fatal reasons.  Another option would be
      to simply continue with the changed semantic and extend move_pages
      documentation to clarify that -errno is returned on an invalid input or
      when migration simply cannot succeed (e.g.  -ENOMEM, -EBUSY) or the
      number of pages that couldn't have been migrated due to ephemeral
      reasons (e.g.  page is pinned or locked for other reasons).
      
      This patch implements the second option because this behavior is in
      place for some time without anybody complaining and possibly new users
depending on it.  It also allows for slightly easier error handling, as
the caller knows that it is worth retrying when err > 0.
      
But since under the new semantics the operation is aborted immediately
if migration fails due to ephemeral reasons, the number of non-attempted
pages needs to be included in the return value as well.
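From a caller's point of view, the clarified semantics look roughly like this
(hedged userspace sketch using the numaif.h wrapper):

    #include <numaif.h>

    long ret = move_pages(0 /* self */, count, pages, nodes, status,
                          MPOL_MF_MOVE);
    if (ret < 0) {
        /* -errno: invalid input, or migration cannot succeed at all */
    } else if (ret > 0) {
        /* ret pages were not migrated (or not even attempted) for
         * ephemeral reasons; check status[] and retry those pages */
    }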
      
      Link: http://lkml.kernel.org/r/1580160527-109104-1-git-send-email-yang.shi@linux.alibaba.com
Fixes: a49bd4d7 ("mm, numa: rework do_pages_move")
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
      Cc: <stable@vger.kernel.org>    [4.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5984fabb
• Wei Yang
      mm: thp: don't need care deferred split queue in memcg charge move path · fac0516b
      Wei Yang authored
If compound is true, this means it is a PMD-mapped THP, which implies
the page is not linked to any deferred split list.  So the first code
chunk will not be executed.

For the same reason, it would not be proper to add this page to a
deferred split list, so the second code chunk is not correct either.
      
      Based on this, we should remove the defer list related code.
      
      [yang.shi@linux.alibaba.com: better patch title]
      Link: http://lkml.kernel.org/r/20200117233836.3434-1-richardw.yang@linux.intel.com
Fixes: 87eaceb3 ("mm: thp: make deferred split shrinker memcg aware")
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>    [5.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fac0516b
• Dan Williams
      mm/memory_hotplug: fix remove_memory() lockdep splat · f1037ec0
      Dan Williams authored
      The daxctl unit test for the dax_kmem driver currently triggers the
      (false positive) lockdep splat below.  It results from the fact that
      remove_memory_block_devices() is invoked under the mem_hotplug_lock()
      causing lockdep entanglements with cpu_hotplug_lock() and sysfs (kernfs
      active state tracking).  It is a false positive because the sysfs
      attribute path triggering the memory remove is not the same attribute
path associated with the memory-block device.
      
sysfs_break_active_protection() is not applicable since there is no real
deadlock conflict; instead, move the memory-block device removal outside
the lock.  The mem_hotplug_lock() is not needed to synchronize the
      memory-block device removal vs the page online state, that is already
      handled by lock_device_hotplug().  Specifically, lock_device_hotplug()
      is sufficient to allow try_remove_memory() to check the offline state of
      the memblocks and be assured that any in progress online attempts are
      flushed / blocked by kernfs_drain() / attribute removal.
      
      The add_memory() path safely creates memblock devices under the
      mem_hotplug_lock().  There is no kernfs active state synchronization in
      the memblock device_register() path, so nothing to fix there.
      
      This change is only possible thanks to the recent change that refactored
      memory block device removal out of arch_remove_memory() (commit
      4c4b7f9b "mm/memory_hotplug: remove memory block devices before
      arch_remove_memory()"), and David's due diligence tracking down the
      guarantees afforded by kernfs_drain().  Not flagged for -stable since
      this only impacts ongoing development and lockdep validation, not a
      runtime issue.
      
          ======================================================
          WARNING: possible circular locking dependency detected
          5.5.0-rc3+ #230 Tainted: G           OE
          ------------------------------------------------------
          lt-daxctl/6459 is trying to acquire lock:
          ffff99c7f0003510 (kn->count#241){++++}, at: kernfs_remove_by_name_ns+0x41/0x80
      
          but task is already holding lock:
          ffffffffa76a5450 (mem_hotplug_lock.rw_sem){++++}, at: percpu_down_write+0x20/0xe0
      
          which lock already depends on the new lock.
      
          the existing dependency chain (in reverse order) is:
      
          -> #2 (mem_hotplug_lock.rw_sem){++++}:
                 __lock_acquire+0x39c/0x790
                 lock_acquire+0xa2/0x1b0
                 get_online_mems+0x3e/0xb0
                 kmem_cache_create_usercopy+0x2e/0x260
                 kmem_cache_create+0x12/0x20
                 ptlock_cache_init+0x20/0x28
                 start_kernel+0x243/0x547
                 secondary_startup_64+0xb6/0xc0
      
          -> #1 (cpu_hotplug_lock.rw_sem){++++}:
                 __lock_acquire+0x39c/0x790
                 lock_acquire+0xa2/0x1b0
                 cpus_read_lock+0x3e/0xb0
                 online_pages+0x37/0x300
                 memory_subsys_online+0x17d/0x1c0
                 device_online+0x60/0x80
                 state_store+0x65/0xd0
                 kernfs_fop_write+0xcf/0x1c0
                 vfs_write+0xdb/0x1d0
                 ksys_write+0x65/0xe0
                 do_syscall_64+0x5c/0xa0
                 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
          -> #0 (kn->count#241){++++}:
                 check_prev_add+0x98/0xa40
                 validate_chain+0x576/0x860
                 __lock_acquire+0x39c/0x790
                 lock_acquire+0xa2/0x1b0
                 __kernfs_remove+0x25f/0x2e0
                 kernfs_remove_by_name_ns+0x41/0x80
                 remove_files.isra.0+0x30/0x70
                 sysfs_remove_group+0x3d/0x80
                 sysfs_remove_groups+0x29/0x40
                 device_remove_attrs+0x39/0x70
                 device_del+0x16a/0x3f0
                 device_unregister+0x16/0x60
                 remove_memory_block_devices+0x82/0xb0
                 try_remove_memory+0xb5/0x130
                 remove_memory+0x26/0x40
                 dev_dax_kmem_remove+0x44/0x6a [kmem]
                 device_release_driver_internal+0xe4/0x1c0
                 unbind_store+0xef/0x120
                 kernfs_fop_write+0xcf/0x1c0
                 vfs_write+0xdb/0x1d0
                 ksys_write+0x65/0xe0
                 do_syscall_64+0x5c/0xa0
                 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
          other info that might help us debug this:
      
          Chain exists of:
            kn->count#241 --> cpu_hotplug_lock.rw_sem --> mem_hotplug_lock.rw_sem
      
           Possible unsafe locking scenario:
      
                 CPU0                    CPU1
                 ----                    ----
            lock(mem_hotplug_lock.rw_sem);
                                         lock(cpu_hotplug_lock.rw_sem);
                                         lock(mem_hotplug_lock.rw_sem);
            lock(kn->count#241);
      
           *** DEADLOCK ***
      
      No fixes tag as this has been a long standing issue that predated the
      addition of kernfs lockdep annotations.
      
      Link: http://lkml.kernel.org/r/157991441887.2763922.4770790047389427325.stgit@dwillia2-desk3.amr.corp.intel.com
      
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1037ec0
• Wei Yang
      mm/migrate.c: also overwrite error when it is bigger than zero · dfe9aa23
      Wei Yang authored
      If we get here after successfully adding page to list, err would be 1 to
      indicate the page is queued in the list.
      
      Current code has two problems:
      
        * on success, 0 is not returned
  * on error, if add_page_for_migration() returns 1, and the following err1
    from do_move_pages_to_node() is set, then err1 is not returned since err
    is 1
      
      And these behaviors break the user interface.
      
      Link: http://lkml.kernel.org/r/20200119065753.21694-1-richardw.yang@linux.intel.com
Fixes: e0153fc2 ("mm: move_pages: return valid node id in status if the page is already on the target node")
Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dfe9aa23
• Pingfan Liu
      mm/sparse.c: reset section's mem_map when fully deactivated · 1f503443
      Pingfan Liu authored
      After commit ba72b4c8 ("mm/sparsemem: support sub-section hotplug"),
      when a mem section is fully deactivated, section_mem_map still records
      the section's start pfn, which is not used any more and will be
      reassigned during re-addition.
      
      In analogy with alloc/free pattern, it is better to clear all fields of
      section_mem_map.
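A hedged sketch of the clearing described (illustrative; the change sits in
the sparse section deactivation path):

    /* Fully deactivated: clear the whole field rather than leaving the
     * stale start pfn (and flag bits) behind. */
    ms->section_mem_map = 0;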
      
Besides this, it breaks the user-space tool "makedumpfile" [1], which
assumes that a hot-removed section has its mem_map set to NULL, instead
of checking directly against the SECTION_MARKED_PRESENT bit.  (It would
be better for makedumpfile to change that assumption, which needs a
separate patch.)

The bug can be reproduced on IBM POWERVM by running "drmgr -c mem -r -q 5",
triggering a crash, and saving the vmcore with makedumpfile.
      
      [1]: makedumpfile, commit e73016540293 ("[v1.6.7] Update version")
      
      Link: http://lkml.kernel.org/r/1579487594-28889-1-git-send-email-kernelfans@gmail.com
      
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f503443
• Dan Carpenter
      mm/mempolicy.c: fix out of bounds write in mpol_parse_str() · c7a91bc7
      Dan Carpenter authored
      What we are trying to do is change the '=' character to a NUL terminator
      and then at the end of the function we restore it back to an '='.  The
      problem is there are two error paths where we jump to the end of the
      function before we have replaced the '=' with NUL.
      
      We end up putting the '=' in the wrong place (possibly one element
      before the start of the buffer).
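A hedged, simplified sketch of the pattern being fixed (variable names are
illustrative, not necessarily those used in mm/mempolicy.c):

    char *flags = strchr(str, '=');
    /* ... */
    if (flags)
        *flags++ = '\0';        /* NUL-terminate the mode string */
    /* ... parsing ... */
out:
    if (flags)
        *--flags = '=';         /* restore the original string */

    /* Bug: two error paths jump to "out" before the '=' was replaced, so
     * flags still points at the '=' itself and the write above lands one
     * element too early, possibly before the start of the buffer. */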
      
      Link: http://lkml.kernel.org/r/20200115055426.vdjwvry44nfug7yy@kili.mountain
      Reported-by: syzbot+e64a13c5369a194d67df@syzkaller.appspotmail.com
Fixes: 095f1fc4 ("mempolicy: rework shmem mpol parsing and display")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Dmitry Vyukov <dvyukov@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c7a91bc7
• Theodore Ts'o
      memcg: fix a crash in wb_workfn when a device disappears · 68f23b89
      Theodore Ts'o authored
      Without memcg, there is a one-to-one mapping between the bdi and
      bdi_writeback structures.  In this world, things are fairly
straightforward; the first thing bdi_unregister() does is to shut down
the bdi_writeback structure (or wb), and part of that shutdown ensures
that no other work is queued against the wb, and that the wb is fully
drained.
      
      With memcg, however, there is a one-to-many relationship between the bdi
      and bdi_writeback structures; that is, there are multiple wb objects
      which can all point to a single bdi.  There is a refcount which prevents
      the bdi object from being released (and hence, unregistered).  So in
      theory, the bdi_unregister() *should* only get called once its refcount
      goes to zero (bdi_put will drop the refcount, and when it is zero,
      release_bdi gets called, which calls bdi_unregister).
      
Unfortunately, del_gendisk() in block/genhd.c never got the memo about
      the Brave New memcg World, and calls bdi_unregister directly.  It does
      this without informing the file system, or the memcg code, or anything
      else.  This causes the root wb associated with the bdi to be
unregistered, but none of the memcg-specific wbs are shut down.  So when
one of these wbs is woken up to do delayed work, it tries to dereference
its wb->bdi->dev to fetch the device name, but
      unfortunately bdi->dev is now NULL, thanks to the bdi_unregister()
      called by del_gendisk().  As a result, *boom*.
      
      Fortunately, it looks like the rest of the writeback path is perfectly
      happy with bdi->dev and bdi->owner being NULL, so the simplest fix is to
      create a bdi_dev_name() function which can handle bdi->dev being NULL.
      This also allows us to bulletproof the writeback tracepoints to prevent
      them from dereferencing a NULL pointer and crashing the kernel if one is
      tracing with memcg's enabled, and an iSCSI device dies or a USB storage
      stick is pulled.
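A hedged sketch of the helper described (per the commit text; the placeholder
string and exact location are illustrative, and the real version also covers
the tracepoints):

    static inline const char *bdi_dev_name(struct backing_dev_info *bdi)
    {
        if (!bdi || !bdi->dev)
            return "(unknown)";
        return dev_name(bdi->dev);
    }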
      
      The most common way of triggering this will be hotremoval of a device
      while writeback with memcg enabled is going on.  It was triggering
      several times a day in a heavily loaded production environment.
      
      Google Bug Id: 145475544
      
      Link: https://lore.kernel.org/r/20191227194829.150110-1-tytso@mit.edu
      Link: http://lkml.kernel.org/r/20191228005211.163952-1-tytso@mit.edu
      
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68f23b89
  2. 24 Jan, 2020 1 commit
  3. 21 Jan, 2020 1 commit
  4. 20 Jan, 2020 1 commit
  5. 14 Jan, 2020 14 commits
• Jason Gunthorpe
      mm/mmu_notifiers: Use 'interval_sub' as the variable for mmu_interval_notifier · 5292e24a
      Jason Gunthorpe authored
      
      
      The 'interval_sub' is placed on the 'notifier_subscriptions' interval
      tree.
      
      This eliminates the poor name 'mni' for this variable.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      5292e24a
• Jason Gunthorpe
      mm/mmu_notifiers: Use 'subscription' as the variable name for mmu_notifier · 1991722a
      Jason Gunthorpe authored
      
      
      The 'subscription' is placed on the 'notifier_subscriptions' list.
      
      This eliminates the poor name 'mn' for this variable.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      1991722a
• Jason Gunthorpe
      mm/mmu_notifier: Rename struct mmu_notifier_mm to mmu_notifier_subscriptions · 984cfe4e
      Jason Gunthorpe authored
      
      
      The name mmu_notifier_mm implies that the thing is a mm_struct pointer,
      and is difficult to abbreviate. The struct is actually holding the
      interval tree and hlist containing the notifiers subscribed to a mm.
      
      Use 'subscriptions' as the variable name for this struct instead of the
      really terrible and misleading 'mmn_mm'.
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      984cfe4e
• Dmitry Safonov
      x86/vdso: Handle faults on timens page · af34ebeb
      Dmitry Safonov authored
      
      
      If a task belongs to a time namespace then the VVAR page which contains
      the system wide VDSO data is replaced with a namespace specific page
      which has the same layout as the VVAR page.
Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20191112012724.250792-25-dima@arista.com
      
      af34ebeb
• Adrian Huang
      mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid · 2fe20210
      Adrian Huang authored
      When booting with amd_iommu=off, the following WARNING message
      appears:
      
        AMD-Vi: AMD IOMMU disabled on kernel command-line
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2772 flush_workqueue+0x42e/0x450
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc3-amd-iommu #6
        Hardware name: Lenovo ThinkSystem SR655-2S/7D2WRCZ000, BIOS D8E101L-1.00 12/05/2019
        RIP: 0010:flush_workqueue+0x42e/0x450
        Code: ff 0f 0b e9 7a fd ff ff 4d 89 ef e9 33 fe ff ff 0f 0b e9 7f fd ff ff 0f 0b e9 bc fd ff ff 0f 0b e9 a8 fd ff ff e8 52 2c fe ff <0f> 0b 31 d2 48 c7 c6 e0 88 c5 95 48 c7 c7 d8 ad f0 95 e8 19 f5 04
        Call Trace:
         kmem_cache_destroy+0x69/0x260
         iommu_go_to_state+0x40c/0x5ab
         amd_iommu_prepare+0x16/0x2a
         irq_remapping_prepare+0x36/0x5f
         enable_IR_x2apic+0x21/0x172
         default_setup_apic_routing+0x12/0x6f
         apic_intr_mode_init+0x1a1/0x1f1
         x86_late_time_init+0x17/0x1c
         start_kernel+0x480/0x53f
         secondary_startup_64+0xb6/0xc0
        ---[ end trace 30894107c3749449 ]---
        x2apic: IRQ remapping doesn't support X2APIC mode
        x2apic disabled
      
The warning is caused by the call to kmem_cache_destroy() in
free_iommu_resources().  Here is the call path:
      
        free_iommu_resources
          kmem_cache_destroy
            flush_memcg_workqueue
              flush_workqueue
      
The root cause is that the IOMMU subsystem runs before the workqueue
subsystem, so the variable 'wq_online' is still 'false'.  This causes
the check 'if (WARN_ON(!wq_online))' in flush_workqueue() to trigger.
      
Since the variable 'memcg_kmem_cache_wq' has not been allocated at that
point, it is unnecessary to call flush_memcg_workqueue().  Skipping the
call prevents the WARNING triggered by flush_workqueue().
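A hedged sketch of the guard (illustrative; per the title, the point is
simply to skip the flush while the workqueue pointer is still NULL):

    static void flush_memcg_workqueue(struct kmem_cache *s)
    {
        /*
         * Early in boot (e.g. from the IOMMU init path) the memcg kmem
         * workqueue has not been created yet; there is nothing to flush.
         */
        if (!memcg_kmem_cache_wq)
            return;
        /* ... existing flush logic ... */
    }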
      
      Link: http://lkml.kernel.org/r/20200103085503.1665-1-ahuang12@lenovo.com
Fixes: 92ee383f ("mm: fix race between kmem_cache destroy, create and deactivate")
Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
Reported-by: Xiaochun Lee <lixc17@lenovo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2fe20210
• Wen Yang
      mm/page-writeback.c: improve arithmetic divisions · 0a5d1a7f
      Wen Yang authored
      Use div64_ul() instead of do_div() if the divisor is unsigned long, to
      avoid truncation to 32-bit on 64-bit platforms.
      
      Link: http://lkml.kernel.org/r/20200102081442.8273-4-wenyang@linux.alibaba.com
      
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a5d1a7f
• Wen Yang
      mm/page-writeback.c: use div64_ul() for u64-by-unsigned-long divide · d3ac946e
      Wen Yang authored
The two variables 'numerator' and 'denominator', though declared as
long, should actually be unsigned long (according to the implementation
of the fprop_fraction_percpu() function).
      
      And do_div() does a 64-by-32 division, while the divisor 'denominator'
      is unsigned long, thus 64-bit on 64-bit platforms.  Hence the proper
      function to call is div64_ul().
      
      Link: http://lkml.kernel.org/r/20200102081442.8273-3-wenyang@linux.alibaba.com
      
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d3ac946e
• Wen Yang
      mm/page-writeback.c: avoid potential division by zero in wb_min_max_ratio() · 6d9e8c65
      Wen Yang authored
      Patch series "use div64_ul() instead of div_u64() if the divisor is
      unsigned long".
      
We were first inspired by commit b0ab99e7 ("sched: Fix possible divide
by zero in avg_atom() calculation"); then, while reviewing the recently
analyzed mm code, we found this suspicious place.
      
       201                 if (min) {
       202                         min *= this_bw;
       203                         do_div(min, tot_bw);
       204                 }
      
      And we also disassembled and confirmed it:
      
        /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 201
        0xffffffff811c37da <__wb_calc_thresh+234>:      xor    %r10d,%r10d
        0xffffffff811c37dd <__wb_calc_thresh+237>:      test   %rax,%rax
        0xffffffff811c37e0 <__wb_calc_thresh+240>:      je 0xffffffff811c3800 <__wb_calc_thresh+272>
        /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 202
        0xffffffff811c37e2 <__wb_calc_thresh+242>:      imul   %r8,%rax
        /usr/src/debug/kernel-4.9.168-016.ali3000/linux-4.9.168-016.ali3000.alios7.x86_64/mm/page-writeback.c: 203
        0xffffffff811c37e6 <__wb_calc_thresh+246>:      mov    %r9d,%r10d    ---> truncates it to 32 bits here
        0xffffffff811c37e9 <__wb_calc_thresh+249>:      xor    %edx,%edx
        0xffffffff811c37eb <__wb_calc_thresh+251>:      div    %r10
        0xffffffff811c37ee <__wb_calc_thresh+254>:      imul   %rbx,%rax
        0xffffffff811c37f2 <__wb_calc_thresh+258>:      shr    $0x2,%rax
        0xffffffff811c37f6 <__wb_calc_thresh+262>:      mul    %rcx
        0xffffffff811c37f9 <__wb_calc_thresh+265>:      shr    $0x2,%rdx
        0xffffffff811c37fd <__wb_calc_thresh+269>:      mov    %rdx,%r10
      
      This series uses div64_ul() instead of div_u64() if the divisor is
      unsigned long, to avoid truncation to 32-bit on 64-bit platforms.
      
      This patch (of 3):
      
The variables 'min' and 'max' are unsigned long and do_div() truncates
them to 32 bits, which means a value can test as non-zero yet be
truncated to zero for the division.  Fix this issue by using div64_ul()
instead.
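A hedged sketch of the fix for the snippet quoted above:

    if (min) {
        min *= this_bw;
        min = div64_ul(min, tot_bw);    /* was: do_div(min, tot_bw); */
    }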
      
      Link: http://lkml.kernel.org/r/20200102081442.8273-2-wenyang@linux.alibaba.com
Fixes: 693108a8 ("writeback: make bdi->min/max_ratio handling cgroup writeback aware")
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d9e8c65
• Vlastimil Babka
      mm, debug_pagealloc: don't rely on static keys too early · 8e57f8ac
      Vlastimil Babka authored
      Commit 96a2b03f ("mm, debug_pagelloc: use static keys to enable
      debugging") has introduced a static key to reduce overhead when
      debug_pagealloc is compiled in but not enabled.  It relied on the
      assumption that jump_label_init() is called before parse_early_param()
      as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
      it is safe to enable the static key.
      
      However, it turns out multiple architectures call parse_early_param()
      earlier from their setup_arch().  x86 also calls jump_label_init() even
earlier, so no issue was found while testing the commit, but the same is
not true for e.g. ppc64 and s390, where the kernel would not boot with
debug_pagealloc=on, as found by our QA.
      
      To fix this without tricky changes to init code of multiple
      architectures, this patch partially reverts the static key conversion
      from 96a2b03f.  Init-time and non-fastpath calls (such as in arch
      code) of debug_pagealloc_enabled() will again test a simple bool
      variable.  Fastpath mm code is converted to a new
      debug_pagealloc_enabled_static() variant that relies on the static key,
      which is enabled in a well-defined point in mm_init() where it's
      guaranteed that jump_label_init() has been called, regardless of
      architecture.
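A hedged sketch of the resulting two variants (simplified from the
description; the helpers live in include/linux/mm.h):

    /* Safe for init-time and arch code, before jump labels are ready: */
    static inline bool debug_pagealloc_enabled(void)
    {
        return IS_ENABLED(CONFIG_DEBUG_PAGEALLOC) &&
               _debug_pagealloc_enabled_early;
    }

    /* Fast-path variant: only valid once mm_init() has enabled the
     * static key (jump_label_init() is guaranteed to have run). */
    static inline bool debug_pagealloc_enabled_static(void)
    {
        if (!IS_ENABLED(CONFIG_DEBUG_PAGEALLOC))
            return false;
        return static_branch_unlikely(&_debug_pagealloc_enabled);
    }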
      
      [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
        Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
      Fixes: 96a2b03f ("mm, debug_pagelloc: use static keys to enable debugging")
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8e57f8ac
• Roman Gushchin
      mm: memcg/slab: fix percpu slab vmstats flushing · 4a87e2a2
      Roman Gushchin authored
      Currently slab percpu vmstats are flushed twice: during the memcg
      offlining and just before freeing the memcg structure.  Each time percpu
      counters are summed, added to the atomic counterparts and propagated up
      by the cgroup tree.
      
      The second flushing is required due to how recursive vmstats are
      implemented: counters are batched in percpu variables on a local level,
      and once a percpu value is crossing some predefined threshold, it spills
      over to atomic values on the local and each ascendant levels.  It means
      that without flushing some numbers cached in percpu variables will be
      dropped on floor each time a cgroup is destroyed.  And with uptime the
      error on upper levels might become noticeable.
      
      The first flushing aims to make counters on ancestor levels more
precise.  Dying cgroups may remain in the dying state for a long time.
      After kmem_cache reparenting which is performed during the offlining
      slab counters of the dying cgroup don't have any chances to be updated,
      because any slab operations will be performed on the parent level.  It
      means that the inaccuracy caused by percpu batching will not decrease up
      to the final destruction of the cgroup.  By the original idea flushing
      slab counters during the offlining should minimize the visible
      inaccuracy of slab counters on the parent level.
      
      The problem is that percpu counters are not zeroed after the first
      flushing.  So every cached percpu value is summed twice.  It creates a
      small error (up to 32 pages per cpu, but usually less) which accumulates
      on parent cgroup level.  After creating and destroying of thousands of
      child cgroups, slab counter on parent level can be way off the real
      value.
      
      For now, let's just stop flushing slab counters on memcg offlining.  It
      can't be done correctly without scheduling a work on each cpu: reading
      and zeroing it during css offlining can race with an asynchronous
      update, which doesn't expect values to be changed underneath.
      
      With this change, slab counters on parent level will become eventually
      consistent.  Once all dying children are gone, values are correct.  And
      if not, the error is capped by 32 * NR_CPUS pages per dying cgroup.
      
It's not perfect, as slabs are reparented, so any updates after the
      reparenting will happen on the parent level.  It means that if a slab
      page was allocated, a counter on child level was bumped, then the page
      was reparented and freed, the annihilation of positive and negative
      counter values will not happen until the child cgroup is released.  It
makes slab counters different from others, and might eventually push us to
      implement flushing in a correct form again.  But it's also a question of
      performance: scheduling a work on each cpu isn't free, and it's an open
      question if the benefit of having more accurate counters is worth it.
      
      We might also consider flushing all counters on offlining, not only slab
      counters.
      
      So let's fix the main problem now: make the slab counters eventually
      consistent, so at least the error won't grow with uptime (or more
      precisely the number of created and destroyed cgroups).  And think about
      the accuracy of counters separately.
      
      Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
Fixes: bee07b33 ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a87e2a2
• Kirill A. Shutemov
      mm/shmem.c: thp, shmem: fix conflict of above-47bit hint address and PMD alignment · 99158997
      Kirill A. Shutemov authored
      Shmem/tmpfs tries to provide THP-friendly mappings if huge pages are
      enabled.  But it doesn't work well with above-47bit hint address.
      
      Normally, the kernel doesn't create userspace mappings above 47-bit,
      even if the machine allows this (such as with 5-level paging on x86-64).
      Not all user space is ready to handle wide addresses.  It's known that
      at least some JIT compilers use higher bits in pointers to encode their
      information.
      
      Userspace can ask for an allocation from the full address space by
      specifying a hint address (with or without MAP_FIXED) above 47 bits.  If
      the application doesn't need a particular address, but wants to allocate
      from the whole address space, it can specify -1 as the hint address.
      
      Unfortunately, this trick breaks THP alignment in shmem/tmpfs:
      shmem_get_unmapped_area() would not try to allocate a PMD-aligned area
      if *any* hint address is specified.
      
      This can be fixed by requesting the aligned area if we failed to
      allocate at the user-specified hint address.  The request with inflated
      length will also take the user-specified hint address.  This way we will
      not lose an allocation request from the full address space.
      
      [kirill@shutemov.name: fold in a fixup]
        Link: http://lkml.kernel.org/r/20191223231309.t6bh5hkbmokihpfu@box
      Link: http://lkml.kernel.org/r/20191220142548.7118-3-kirill.shutemov@linux.intel.com
      Fixes: b569bab7 ("x86/mm: Prepare to expose larger address space to userspace")
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Willhalm, Thomas" <thomas.willhalm@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Bruggeman, Otto G" <otto.g.bruggeman@intel.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      99158997
    • Kirill A. Shutemov's avatar
      mm/huge_memory.c: thp: fix conflict of above-47bit hint address and PMD alignment · 97d3d0f9
      Kirill A. Shutemov authored
      Patch series "Fix two above-47bit hint address vs.  THP bugs".
      
      The two get_unmapped_area() implementations have to be fixed to provide
      THP-friendly mappings if above-47bit hint address is specified.
      
      This patch (of 2):
      
      Filesystems use thp_get_unmapped_area() to provide THP-friendly
      mappings, for DAX in particular.
      
      Normally, the kernel doesn't create userspace mappings above 47-bit,
      even if the machine allows this (such as with 5-level paging on x86-64).
      Not all user space is ready to handle wide addresses.  It's known that
      at least some JIT compilers use higher bits in pointers to encode their
      information.
      
      Userspace can ask for an allocation from the full address space by
      specifying a hint address (with or without MAP_FIXED) above 47 bits.  If
      the application doesn't need a particular address, but wants to allocate
      from the whole address space, it can specify -1 as the hint address.
      
      Unfortunately, this trick breaks thp_get_unmapped_area(): the function
      would not try to allocate a PMD-aligned area if *any* hint address is
      specified.
      
      Modify the routine to handle it correctly:
      
       - Try to allocate the space at the specified hint address with length
         padding required for PMD alignment.
       - If failed, retry without length padding (but with the same hint
         address);
       - If the returned address matches the hint address return it.
       - Otherwise, align the address as required for THP and return.
      
      The user specified hint address is passed down to get_unmapped_area() so
      above-47bit hint address will be taken into account without breaking
      alignment requirements.
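      
      For reference, a minimal userspace sketch of the same idea: over-allocate
      by one PMD size at the (possibly above-47bit) hint, then trim to an
      aligned start.  The 2 MiB constant, the hint value and the lack of
      trimming of the excess are illustrative simplifications; this is not the
      kernel's thp_get_unmapped_area() code:
      
        #include <stdio.h>
        #include <stdint.h>
        #include <sys/mman.h>
      
        #define PMD_SIZE (2UL << 20)            /* 2 MiB on x86-64 */
      
        int main(void)
        {
                void *hint = (void *)(1UL << 47);       /* above-47bit hint */
                size_t len = 8UL << 20;                 /* 8 MiB mapping */
      
                /* 1. try the hint with the length padded by PMD_SIZE */
                void *addr = mmap(hint, len + PMD_SIZE, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (addr == MAP_FAILED) {
                        /* 2. retry without padding, keeping the same hint */
                        addr = mmap(hint, len, PROT_READ | PROT_WRITE,
                                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                        if (addr == MAP_FAILED)
                                return 1;
                        /* 3. if this matches the hint, it can be used as-is */
                        printf("unpadded mapping at %p%s\n", addr,
                               addr == hint ? " (hint honoured)" : "");
                        return 0;
                }
      
                /* 4. round the start up to the next PMD boundary */
                uintptr_t aligned = ((uintptr_t)addr + PMD_SIZE - 1) & ~(PMD_SIZE - 1);
                printf("padded mapping at %p, PMD-aligned start %p\n",
                       addr, (void *)aligned);
                return 0;
        }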
      
      Link: http://lkml.kernel.org/r/20191220142548.7118-2-kirill.shutemov@linux.intel.com
      Fixes: b569bab7 ("x86/mm: Prepare to expose larger address space to userspace")
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: default avatarThomas Willhalm <thomas.willhalm@intel.com>
      Tested-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Bruggeman, Otto G" <otto.g.bruggeman@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      97d3d0f9
    • David Hildenbrand's avatar
      mm/memory_hotplug: don't free usage map when removing a re-added early section · 8068df3b
      David Hildenbrand authored
      When we remove an early section, we don't free the usage map, as the
      usage maps of other sections are placed into the same page.  Once the
      section is removed, it is no longer an early section (in particular, the
      memmap is freed).  When we re-add that section, the usage map is reused;
      however, the section is no longer an early section.  When removing that
      section again, we try to kfree() a usage map that was allocated during
      early boot - bad.
      
      Let's check against PageReserved() to see if we are dealing with a
      usage map that was allocated during boot.  We could also check against
      !(PageSlab(usage_page) || PageCompound(usage_page)), but PageReserved()
      is cleaner.
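      
      A condensed kernel-style sketch of that check (the surrounding
      section_deactivate() context is omitted and the helper name is made up
      for illustration):
      
        /* free a section's usage map only if it was not allocated at boot */
        static void free_usage_map(struct mem_section_usage *usage)
        {
                struct page *usage_page = virt_to_page(usage);
      
                /* boot-time (memblock) allocations sit in PageReserved pages */
                if (!PageReserved(usage_page))
                        kfree(usage);
        }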
      
      Can be triggered using memtrace under ppc64/powernv:
      
        $ mount -t debugfs none /sys/kernel/debug/
        $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
        $ echo 0x20000000 > /sys/kernel/debug/powerpc/memtrace/enable
         ------------[ cut here ]------------
         kernel BUG at mm/slub.c:3969!
         Oops: Exception in kernel mode, sig: 5 [#1]
         LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
         Modules linked in:
         CPU: 0 PID: 154 Comm: sh Not tainted 5.5.0-rc2-next-20191216-00005-g0be1dba7b7c0 #61
         NIP kfree+0x338/0x3b0
         LR section_deactivate+0x138/0x200
         Call Trace:
           section_deactivate+0x138/0x200
           __remove_pages+0x114/0x150
           arch_remove_memory+0x3c/0x160
           try_remove_memory+0x114/0x1a0
           __remove_memory+0x20/0x40
           memtrace_enable_set+0x254/0x850
           simple_attr_write+0x138/0x160
           full_proxy_write+0x8c/0x110
           __vfs_write+0x38/0x70
           vfs_write+0x11c/0x2a0
           ksys_write+0x84/0x140
           system_call+0x5c/0x68
         ---[ end trace 4b053cbd84e0db62 ]---
      
      The first invocation will offline+remove memory blocks.  The second
      invocation will first add+online them again, in order to offline+remove
      them again (usually we are lucky and the exact same memory blocks will
      get "reallocated").
      
      Tested on powernv with boot memory: The usage map will not get freed.
      Tested on x86-64 with DIMMs: The usage map will get freed.
      
      Using Dynamic Memory under a Power DLPAR can trigger it easily.
      
      Triggering removal of memory from the HMC GUI (I assume after it was
      previously removed and re-added) can crash the kernel with the same call
      trace and is fixed by this patch.
      
      Link: http://lkml.kernel.org/r/20191217104637.5509-1-david@redhat.com
      Fixes: 326e1b8f ("mm/sparsemem: introduce a SECTION_IS_EARLY flag")
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Tested-by: default avatarPingfan Liu <piliu@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8068df3b
    • Vlastimil Babka's avatar
      mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations · cc638f32
      Vlastimil Babka authored
      THP page faults now attempt a __GFP_THISNODE allocation first, which
      should only compact existing free memory, followed by another attempt
      that can allocate from any node using reclaim/compaction effort
      specified by global defrag setting and madvise.
      
      This patch makes the following changes to the scheme:
      
       - Before the patch, the first allocation relies on a check for
         pageblock order and __GFP_IO to prevent excessive reclaim.  However,
         this also affects the second attempt, which is not limited to a
         single node.
      
         Instead of that, reuse the existing check for costly order
         __GFP_NORETRY allocations, and make sure the first THP attempt uses
         __GFP_NORETRY. As a side-effect, all costly order __GFP_NORETRY
         allocations will bail out if compaction needs reclaim, while
         previously they only bailed out when compaction was deferred due to
         previous failures.
      
         This should be still acceptable within the __GFP_NORETRY semantics.
      
       - Before the patch, the second allocation attempt (on all nodes) was
         passing __GFP_NORETRY. This is redundant as the check for pageblock
         order (discussed above) was stronger. It's also contrary to
         madvise(MADV_HUGEPAGE) which means some effort to allocate THP is
         requested.
      
         After this patch, the second attempt doesn't pass __GFP_THISNODE nor
         __GFP_NORETRY.
      
      To sum up, THP page faults now try the following attempts:
      
      1. local node only THP allocation with no reclaim, just compaction.
      2. for madvised VMAs, or always when synchronous compaction is enabled:
         THP allocation from any node, with effort determined by the global
         defrag setting and VMA madvise
      3. fallback to base pages on any node
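      
      A condensed kernel-style sketch of that sequence; gfp_for_defrag_madvise()
      is a stand-in for the real gfp-mask helper, and this is not the actual
      huge_memory.c code:
      
        static struct page *thp_fault_alloc_sketch(struct vm_area_struct *vma)
        {
                struct page *page;
      
                /* 1. local node only: compact existing memory, no reclaim */
                page = alloc_pages(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE |
                                   __GFP_NORETRY, HPAGE_PMD_ORDER);
                if (page)
                        return page;
      
                /*
                 * 2. any node, with effort determined by the defrag setting and
                 *    VMA madvise; neither __GFP_THISNODE nor __GFP_NORETRY.
                 */
                return alloc_pages(gfp_for_defrag_madvise(vma), HPAGE_PMD_ORDER);
                /* 3. if this also fails, the caller falls back to base pages */
        }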
      
      Link: http://lkml.kernel.org/r/08a3f4dd-c3ce-0009-86c5-9ee51aba8557@suse.cz
      Fixes: b39d0ee2 ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cc638f32
  6. 06 Jan, 2020 1 commit
    • Catalin Marinas's avatar
      arm64: Revert support for execute-only user mappings · 24cecc37
      Catalin Marinas authored
      The ARMv8 64-bit architecture supports execute-only user permissions by
      clearing the PTE_USER and PTE_UXN bits, practically making it a mostly
      privileged mapping, but one from which user code running at EL0 can
      still execute.
      
      The downside, however, is that the kernel at EL1 inadvertently reading
      such a mapping would not trip over the PAN (privileged access never)
      protection.
      
      Revert the relevant bits from commit cab15ce6 ("arm64: Introduce
      execute-only page access permissions") so that PROT_EXEC implies
      PROT_READ (and therefore PTE_USER) until the architecture gains proper
      support for execute-only user mappings.
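      
      For context, a small userspace example of the affected mapping type.
      After the revert, a PROT_EXEC-only request on arm64 effectively behaves
      as PROT_READ | PROT_EXEC, so the read below succeeds rather than
      faulting:
      
        #include <stdio.h>
        #include <sys/mman.h>
      
        int main(void)
        {
                /* request an execute-only anonymous mapping */
                unsigned char *p = mmap(NULL, 4096, PROT_EXEC,
                                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED)
                        return 1;
      
                /* with PROT_EXEC implying PROT_READ, this prints 0 */
                printf("first byte: %d\n", p[0]);
                return 0;
        }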
      
      Fixes: cab15ce6 ("arm64: Introduce execute-only page access permissions")
      Cc: <stable@vger.kernel.org> # 4.9.x-
      Acked-by: default avatarWill Deacon <will@kernel.org>
      Signed-off-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      24cecc37
  7. 04 Jan, 2020 4 commits
    • Waiman Long's avatar
      mm/hugetlb: defer freeing of huge pages if in non-task context · c77c0a8a
      Waiman Long authored
      The following lockdep splat was observed when a certain hugetlbfs test
      was run:
      
        ================================
        WARNING: inconsistent lock state
        4.18.0-159.el8.x86_64+debug #1 Tainted: G        W --------- -  -
        --------------------------------
        inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
        swapper/30/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
        ffffffff9acdc038 (hugetlb_lock){+.?.}, at: free_huge_page+0x36f/0xaa0
        {SOFTIRQ-ON-W} state was registered at:
          lock_acquire+0x14f/0x3b0
          _raw_spin_lock+0x30/0x70
          __nr_hugepages_store_common+0x11b/0xb30
          hugetlb_sysctl_handler_common+0x209/0x2d0
          proc_sys_call_handler+0x37f/0x450
          vfs_write+0x157/0x460
          ksys_write+0xb8/0x170
          do_syscall_64+0xa5/0x4d0
          entry_SYSCALL_64_after_hwframe+0x6a/0xdf
        irq event stamp: 691296
        hardirqs last  enabled at (691296): [<ffffffff99bb034b>] _raw_spin_unlock_irqrestore+0x4b/0x60
        hardirqs last disabled at (691295): [<ffffffff99bb0ad2>] _raw_spin_lock_irqsave+0x22/0x81
        softirqs last  enabled at (691284): [<ffffffff97ff0c63>] irq_enter+0xc3/0xe0
        softirqs last disabled at (691285): [<ffffffff97ff0ebe>] irq_exit+0x23e/0x2b0
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(hugetlb_lock);
          <Interrupt>
            lock(hugetlb_lock);
      
         *** DEADLOCK ***
            :
        Call Trace:
         <IRQ>
         __lock_acquire+0x146b/0x48c0
         lock_acquire+0x14f/0x3b0
         _raw_spin_lock+0x30/0x70
         free_huge_page+0x36f/0xaa0
         bio_check_pages_dirty+0x2fc/0x5c0
         clone_endio+0x17f/0x670 [dm_mod]
         blk_update_request+0x276/0xe50
         scsi_end_request+0x7b/0x6a0
         scsi_io_completion+0x1c6/0x1570
         blk_done_softirq+0x22e/0x350
         __do_softirq+0x23d/0xad8
         irq_exit+0x23e/0x2b0
         do_IRQ+0x11a/0x200
         common_interrupt+0xf/0xf
         </IRQ>
      
      Both the hugetlb_lock and the subpool lock can be acquired in
      free_huge_page().  One way to solve the problem is to make both locks
      irq-safe.  However, Mike Kravetz had learned that the hugetlb_lock is
      held for a linear scan of ALL hugetlb pages during a cgroup reparenting
      operation.  The lock would be held with irqs disabled for too long,
      unless we can break hugetlb_lock down into finer-grained locks with
      shorter lock hold times.
      
      Another alternative is to defer the freeing to a workqueue job.  This
      patch implements the deferred freeing by adding a free_hpage_workfn()
      work function to do the actual freeing.  The free_huge_page() call in a
      non-task context saves the page to be freed in the hpage_freelist linked
      list in a lockless manner using the llist APIs.
      
      The generic workqueue is used to process the work, but a dedicated
      workqueue can be used instead if it is desirable to have the huge page
      freed ASAP.
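      
      A simplified sketch of that scheme (the real patch reuses an unused
      struct page field for the llist_node; page_to_llist_node(),
      llist_node_to_page() and __free_huge_page() are stand-ins here):
      
        static LLIST_HEAD(hpage_freelist);
        static void free_hpage_workfn(struct work_struct *work);
        static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
      
        void free_huge_page(struct page *page)
        {
                if (!in_task()) {
                        /* lockless push; kick the work only for the first entry */
                        if (llist_add(page_to_llist_node(page), &hpage_freelist))
                                schedule_work(&free_hpage_work);
                        return;
                }
                __free_huge_page(page);         /* may take hugetlb_lock safely */
        }
      
        static void free_hpage_workfn(struct work_struct *work)
        {
                struct llist_node *node = llist_del_all(&hpage_freelist);
      
                while (node) {
                        struct page *page = llist_node_to_page(node);
      
                        node = node->next;
                        __free_huge_page(page);
                }
        }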
      
      Thanks to Kirill Tkhai <ktkhai@virtuozzo.com> for suggesting the use of
      llist APIs which simplify the code.
      
      Link: http://lkml.kernel.org/r/20191217170331.30893-1-longman@redhat.com
      
      Signed-off-by: default avatarWaiman Long <longman@redhat.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarDavidlohr Bueso <dbueso@suse.de>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarKirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c77c0a8a
    • Navid Emamdoost's avatar
      mm/gup: fix memory leak in __gup_benchmark_ioctl · a7c46c0c
      Navid Emamdoost authored
      In the implementation of __gup_benchmark_ioctl() the allocated pages
      should be released before returning in case of an invalid cmd.  Release
      pages via kvfree().
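      
      A condensed sketch of the corrected flow (names simplified; not the
      exact gup_benchmark.c code):
      
        static long gup_benchmark_sketch(unsigned int cmd, unsigned long nr_pages)
        {
                struct page **pages;
                long ret = 0;
      
                pages = kvcalloc(nr_pages, sizeof(struct page *), GFP_KERNEL);
                if (!pages)
                        return -ENOMEM;
      
                switch (cmd) {
                case GUP_FAST_BENCHMARK:
                        /* ... run the selected pinning benchmark ... */
                        break;
                default:
                        ret = -EINVAL;          /* was: return -1, leaking pages */
                        break;
                }
      
                kvfree(pages);                  /* freed on every path now */
                return ret;
        }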
      
      [akpm@linux-foundation.org: rework code flow, return -EINVAL rather than -1]
      Link: http://lkml.kernel.org/r/20191211174653.4102-1-navid.emamdoost@gmail.com
      Fixes: 714a3a1e ("mm/gup_benchmark.c: add additional pinning methods")
      Signed-off-by: default avatarNavid Emamdoost <navid.emamdoost@gmail.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a7c46c0c
    • Ilya Dryomov's avatar
      mm/oom: fix pgtables units mismatch in Killed process message · 941f762b
      Ilya Dryomov authored
      pr_err() expects kB, but mm_pgtables_bytes() returns the number of bytes.
      As everything else is printed in kB, I chose to fix the value rather than
      the string.
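      
      The conversion itself is a single shift; roughly (shortened, with the
      other fields of the oom_kill.c format string omitted):
      
        /* mm_pgtables_bytes() is in bytes; shift to kB like the other fields */
        pr_err("... pgtables:%lukB ...\n", mm_pgtables_bytes(victim->mm) >> 10);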
      
      Before:
      
      [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
      ...
      [   1878]  1000  1878   217253   151144  1269760        0             0 python
      ...
      Out of memory: Killed process 1878 (python) total-vm:869012kB, anon-rss:604572kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:1269760kB oom_score_adj:0
      
      After:
      
      [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
      ...
      [   1436]  1000  1436   217253   151890  1294336        0             0 python
      ...
      Out of memory: Killed process 1436 (python) total-vm:869012kB, anon-rss:607516kB, file-rss:44kB, shmem-rss:0kB, UID:1000 pgtables:1264kB oom_score_adj:0
      
      Link: http://lkml.kernel.org/r/20191211202830.1600-1-idryomov@gmail.com
      Fixes: 70cb6d26 ("mm/oom: add oom_score_adj and pgtables to Killed process message")
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Edward Chron <echron@arista.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      941f762b
    • Yang Shi's avatar
      mm: move_pages: return valid node id in status if the page is already on the target node · e0153fc2
      Yang Shi authored
      Felix Abecassis reports that move_pages() returns random status values
      if the pages are already on the target node, as shown by the test
      program below:
      
        /* link with -lnuma for set_mempolicy()/move_pages() */
        #include <stdio.h>
        #include <stdlib.h>
        #include <stdint.h>
        #include <unistd.h>
        #include <sys/mman.h>
        #include <numaif.h>

        int main(void)
        {
      	const long node_id = 1;
      	const long page_size = sysconf(_SC_PAGESIZE);
      	const int64_t num_pages = 8;
      
      	unsigned long nodemask =  1 << node_id;
      	long ret = set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask));
      	if (ret < 0)
      		return (EXIT_FAILURE);
      
      	void **pages = malloc(sizeof(void*) * num_pages);
      	for (int i = 0; i < num_pages; ++i) {
      		pages[i] = mmap(NULL, page_size, PROT_WRITE | PROT_READ,
      				MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS,
      				-1, 0);
      		if (pages[i] == MAP_FAILED)
      			return (EXIT_FAILURE);
      	}
      
      	ret = set_mempolicy(MPOL_DEFAULT, NULL, 0);
      	if (ret < 0)
      		return (EXIT_FAILURE);
      
      	int *nodes = malloc(sizeof(int) * num_pages);
      	int *status = malloc(sizeof(int) * num_pages);
      	for (int i = 0; i < num_pages; ++i) {
      		nodes[i] = node_id;
      		status[i] = 0xd0; /* simulate garbage values */
      	}
      
      	ret = move_pages(0, num_pages, pages, nodes, status, MPOL_MF_MOVE);
      	printf("move_pages: %ld\n", ret);
      	for (int i = 0; i < num_pages; ++i)
      		printf("status[%d] = %d\n", i, status[i]);
        }
      
      Then running the program would return nonsense status values:
      
        $ ./move_pages_bug
        move_pages: 0
        status[0] = 208
        status[1] = 208
        status[2] = 208
        status[3] = 208
        status[4] = 208
        status[5] = 208
        status[6] = 208
        status[7] = 208
      
      This is because the status is not set if the page is already on the
      target node, but move_pages() should return a valid status as long as it
      succeeds.  The valid status may be an errno or a node id.
      
      We can't simply initialize the status array to zero since the pages may
      not be on node 0.  Fix it by updating status with the node id which the
      page is already on.
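      
      A condensed sketch of the fix's effect (simplified; not the exact
      mm/migrate.c code, and migrate_page_to_node() is a made-up helper):
      
        if (page_to_nid(page) == node) {
                /* already on the target node: report that node id as status */
                status[i] = node;
        } else {
                int err = migrate_page_to_node(page, node);
      
                status[i] = err ? err : node;
        }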
      
      Link: http://lkml.kernel.org/r/1575584353-125392-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: a49bd4d7 ("mm, numa: rework do_pages_move")
      Signed-off-by: default avatarYang Shi <yang.shi@linux.alibaba.com>
      Reported-by: default avatarFelix Abecassis <fabecassis@nvidia.com>
      Tested-by: default avatarFelix Abecassis <fabecassis@nvidia.com>
      Suggested-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Reviewed-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>	[4.17+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e0153fc2