1. 02 Apr, 2020 6 commits
    • mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages · 47e29d32
      John Hubbard authored
      For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS
      scheme tends to overflow too easily: each tail page increments the head
      page->_refcount by GUP_PIN_COUNTING_BIAS (1024).  That limits the number
      of huge pages that can be pinned.
      
      This patch removes that limitation by using an exact form of pin counting
      for compound pages of order > 1.  The "order > 1" is required because this
      approach stores the count in the 3rd struct page of the compound page, and
      order 1 compound pages have only two pages, so there is nowhere to put it.
      
      A new struct page field, hpage_pinned_refcount, has been added, replacing
      a padding field in the union (so no new space is used).
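
      A minimal sketch of the two counting modes, using the helpers this patch
      adds (bodies simplified; the real code also keeps ordinary references on
      the head page):

          static inline bool hpage_pincount_available(struct page *page)
          {
                  /*
                   * hpage_pinned_refcount lives in the 3rd struct page of
                   * the compound page, so two-page (order 1) compound
                   * pages cannot support it.
                   */
                  page = compound_head(page);
                  return PageCompound(page) && compound_order(page) > 1;
          }

          static inline atomic_t *compound_pincount_ptr(struct page *page)
          {
                  return &page[2].hpage_pinned_refcount;
          }

          /* Pinning "refs" times then becomes, in essence: */
          if (hpage_pincount_available(page))
                  atomic_add(refs, compound_pincount_ptr(page));    /* exact */
          else
                  page_ref_add(page, refs * GUP_PIN_COUNTING_BIAS); /* fuzzy */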
      
      This enhancement also has a useful side effect: huge pages and compound
      pages (of order > 1) do not suffer from the "potential false positives"
      problem that is discussed in the page_maybe_dma_pinned() comment block.  That is
      because these compound pages have extra space for tracking things, so they
      get exact pin counts instead of overloading page->_refcount.
      
      Documentation/core-api/pin_user_pages.rst is updated accordingly.
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: track FOLL_PIN pages · 3faa52c0
      John Hubbard authored
      Add tracking of pages that were pinned via FOLL_PIN.  This tracking is
      implemented via overloading of page->_refcount: pins are added by adding
      GUP_PIN_COUNTING_BIAS (1024) to the refcount.  This provides a fuzzy
      indication of pinning, and it can have false positives (and that's OK).
      Please see the pre-existing Documentation/core-api/pin_user_pages.rst for
      details.
      
      As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN
      (typically via pin_user_pages*()) are required to ultimately free such
      pages via unpin_user_page().
      
      Please also note the limitation, discussed in pin_user_pages.rst under the
      "TODO: for 1GB and larger huge pages" section.  (That limitation will be
      removed in a following patch.)
      
      The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be
      thought of as "FOLL_GET for DIO and/or RDMA use".
      
      Pages that have been pinned via FOLL_PIN are identifiable via a new
      function call:
      
         bool page_maybe_dma_pinned(struct page *page);
      
      What to do in response to encountering such a page is left to later
      patchsets.  There is discussion about this in [1], [2], [3], and [4].
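
      As introduced here (the exact-pincount patch that follows refines this
      for huge pages), the check is essentially a threshold test; a minimal
      sketch:

          static inline bool page_maybe_dma_pinned(struct page *page)
          {
                  /*
                   * Fuzzy by design: roughly GUP_PIN_COUNTING_BIAS ordinary
                   * get_page() references on one page can produce a false
                   * positive, and that's acceptable.
                   */
                  return page_ref_count(compound_head(page)) >=
                          GUP_PIN_COUNTING_BIAS;
          }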
      
      This also changes a BUG_ON() to a WARN_ON() in follow_page_mask().
      
      [1] Some slow progress on get_user_pages() (Apr 2, 2019):
          https://lwn.net/Articles/784574/
      [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
          https://lwn.net/Articles/774411/
      [3] The trouble with get_user_pages() (Apr 30, 2018):
          https://lwn.net/Articles/753027/
      [4] LWN kernel index: get_user_pages():
          https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
      
      [jhubbard@nvidia.com: add kerneldoc]
        Link: http://lkml.kernel.org/r/20200307021157.235726-1-jhubbard@nvidia.com
      [imbrenda@linux.ibm.com: if pin fails, we need to unpin, a simple put_page will not be enough]
        Link: http://lkml.kernel.org/r/20200306132537.783769-2-imbrenda@linux.ibm.com
      [akpm@linux-foundation.org: fix put_compound_head defined but not used]
      Suggested-by: Jan Kara <jack@suse.cz>
      Suggested-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-7-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: require FOLL_GET for get_user_pages_fast() · 94202f12
      John Hubbard authored
      Internal to mm/gup.c, require that get_user_pages_fast() and
      __get_user_pages_fast() identify themselves by setting FOLL_GET.  This is
      required in order to make decisions, in upcoming patches, based on which
      of FOLL_PIN and FOLL_GET are set (both, either, or neither).
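
      The key hunk, sketched (the comment paraphrases the patch's reasoning;
      surrounding code omitted):

          /* In get_user_pages_fast(), before any pinning work: */
          /*
           * The caller may or may not have set FOLL_GET; either way is OK.
           * Internally, though, fast gup must mark itself as a FOLL_GET
           * request, so later code can tell the FOLL_GET/FOLL_PIN
           * combinations apart.
           */
          gup_flags |= FOLL_GET;
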
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-6-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: pass gup flags to two more routines · 3b78d834
      John Hubbard authored
      In preparation for an upcoming patch, pass the gup flags argument to two
      more routines: put_compound_head() and undo_dev_pagemap().
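
      Sketched, this is pure signature plumbing (the flags are unused until
      the next patch; parameter placement is assumed from the final code):

          /* Before: */
          static void put_compound_head(struct page *page, int refs);
          static void undo_dev_pagemap(int *nr, int nr_start,
                                       struct page **pages);

          /* After: */
          static void put_compound_head(struct page *page, int refs,
                                        unsigned int flags);
          static void undo_dev_pagemap(int *nr, int nr_start,
                                       unsigned int flags,
                                       struct page **pages);
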
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-5-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: pass a flags arg to __gup_device_* functions · 86dfbed4
      John Hubbard authored
      A subsequent patch requires access to gup flags, so pass the flags
      argument through to the __gup_device_* functions.
      
      Also placate checkpatch.pl by shortening a nearby line.
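
      The same kind of plumbing, sketched for one of the helpers (signature
      assumed from the final code):

          /* Before: */
          static int __gup_device_huge(unsigned long pfn, unsigned long addr,
                          unsigned long end, struct page **pages, int *nr);

          /* After: */
          static int __gup_device_huge(unsigned long pfn, unsigned long addr,
                          unsigned long end, unsigned int flags,
                          struct page **pages, int *nr);
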
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-3-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: split get_user_pages_remote() into two routines · 22bf29b6
      John Hubbard authored
      Patch series "mm/gup: track FOLL_PIN pages", v6.
      
      This activates tracking of FOLL_PIN pages.  This is in support of fixing
      the get_user_pages()+DMA problem described in [1]-[4].
      
      FOLL_PIN support is now in the main linux tree.  However, the patch to use
      FOLL_PIN to track pages was *not* submitted, because Leon saw an RDMA test
      suite failure that involved (I think) page refcount overflows when huge
      pages were used.
      
      This series definitively solves that kind of overflow problem, by adding
      an exact pincount, for compound pages (of order > 1), in the 3rd struct
      page of a compound page.  If available, that form of pincounting is used,
      instead of the GUP_PIN_COUNTING_BIAS approach.  Thanks again to Jan Kara
      for that idea.
      
      Other interesting changes:
      
      * dump_page(): added one or two new things to report for compound
        pages: head refcount (for all compound pages), and map_pincount (for
        compound pages of order > 1).
      
      * Documentation/core-api/pin_user_pages.rst: removed the "TODO" for the
        huge page refcount upper limit problems, and added notes about how it
        works now.  Also added a note about the dump_page() enhancements.
      
      * Added some comments in gup.c and mm.h, to explain that there are two
        ways to count pinned pages: exact (for compound pages of order > 1) and
        fuzzy (GUP_PIN_COUNTING_BIAS: for all other pages).
      
      ============================================================
      General notes about the tracking patch:
      
      This is a prerequisite to solving the problem of proper interactions
      between file-backed pages, and [R]DMA activities, as discussed in [1],
      [2], [3], [4] and in a remarkable number of email threads since about
      2017.  :)
      
      In contrast to earlier approaches, the page tracking can be incrementally
      applied to the kernel call sites that, until now, have been simply calling
      get_user_pages() ("gup").  In other words, opt-in by changing from this:
      
          get_user_pages() (sets FOLL_GET)
          put_page()
      
      to this:
          pin_user_pages() (sets FOLL_PIN)
          unpin_user_page()
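
      A hypothetical call-site conversion, sketched with the real API names
      (the driver function and its buffer handling are invented for
      illustration):

          static int demo_dma_to_user_buf(unsigned long uaddr, int npages,
                                          struct page **pages)
          {
                  /* Was: get_user_pages_fast(uaddr, npages, FOLL_WRITE, pages) */
                  int ret = pin_user_pages_fast(uaddr, npages, FOLL_WRITE,
                                                pages);

                  if (ret <= 0)
                          return ret ? ret : -EFAULT;

                  /* ... set up the device to DMA into these pages ... */

                  unpin_user_pages(pages, ret);   /* was: put_page() per page */
                  return 0;
          }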
      
      ============================================================
      Future steps:
      
      * Convert more subsystems from get_user_pages() to pin_user_pages().
        The first probably needs to be bio/biovecs, because any filesystem
        testing is too difficult without those in place.
      
      * Change VFS and filesystems to respond appropriately when encountering
        dma-pinned pages.
      
      * Work with Ira and others to connect this all up with file system
        leases.
      
      [1] Some slow progress on get_user_pages() (Apr 2, 2019):
          https://lwn.net/Articles/784574/
      
      [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
          https://lwn.net/Articles/774411/
      
      [3] The trouble with get_user_pages() (Apr 30, 2018):
          https://lwn.net/Articles/753027/
      
      [4] LWN kernel index: get_user_pages()
          https://lwn.net/Kernel/Index/#Memory_management-get_user_pages
      
      This patch (of 12):
      
      An upcoming patch requires reusing the implementation of
      get_user_pages_remote().  Split up get_user_pages_remote() into an outer
      routine that checks flags, and an implementation routine that will be
      reused.  This makes subsequent changes much easier to understand.
      
      There should be no change in behavior due to this patch.
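
      A condensed sketch of the resulting shape (the outer flag check shown
      is this series' FOLL_PIN assertion; treat details as approximate):

          static long __get_user_pages_remote(struct task_struct *tsk,
                          struct mm_struct *mm, unsigned long start,
                          unsigned long nr_pages, unsigned int gup_flags,
                          struct page **pages, struct vm_area_struct **vmas,
                          int *locked)
          {
                  /* ... the previous body of get_user_pages_remote(),
                   * including its FOLL_LONGTERM handling, moves here
                   * unchanged ... */
          }

          long get_user_pages_remote(struct task_struct *tsk,
                          struct mm_struct *mm, unsigned long start,
                          unsigned long nr_pages, unsigned int gup_flags,
                          struct page **pages, struct vm_area_struct **vmas,
                          int *locked)
          {
                  /* Outer routine: only flag checking lives here. */
                  if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
                          return -EINVAL;

                  return __get_user_pages_remote(tsk, mm, start, nr_pages,
                                                 gup_flags, pages, vmas,
                                                 locked);
          }
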
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-2-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 04 Feb, 2020 1 commit
  3. 31 Jan, 2020 8 commits
    • mm, tree-wide: rename put_user_page*() to unpin_user_page*() · f1f6a7dd
      John Hubbard authored
      Rename put_user_page*() to unpin_user_page*(), in order to provide a
      clearer, more symmetric API for pinning and unpinning DMA pages.  This
      way, pin_user_pages*() calls match up with unpin_user_pages*() calls,
      and the API is a lot closer to being self-explanatory.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-23-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: introduce pin_user_pages*() and FOLL_PIN · eddb1c22
      John Hubbard authored
      Introduce pin_user_pages*() variations of get_user_pages*() calls, and
      also pin_longterm_pages*() variations.
      
      For now, these are placeholder calls, until the various call sites are
      converted to use the correct get_user_pages*() or pin_user_pages*() API.
      
      These variants will eventually all set FOLL_PIN, which is also
      introduced, and thoroughly documented.
      
          pin_user_pages()
          pin_user_pages_remote()
          pin_user_pages_fast()
      
      All pages that are pinned via the above calls must be unpinned via
      put_user_page().
      
      The underlying rules are:
      
      * FOLL_PIN is a gup-internal flag, so the call sites should not directly
        set it.  That behavior is enforced with assertions.
      
      * Call sites that want to indicate that they are going to do DirectIO
        ("DIO") or something with similar characteristics, should call a
        get_user_pages()-like wrapper call that sets FOLL_PIN.  These wrappers
        will:
      
          * Start with "pin_user_pages" instead of "get_user_pages".  That
            makes it easy to find and audit the call sites.
      
          * Set FOLL_PIN
      
      * Pages that are received via FOLL_PIN must be returned via
        put_user_page().
      
      Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases in
      this documentation.  (I've reworded it and expanded upon it.)
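
      A sketch of the wrapper pattern those rules describe (close to, but not
      verbatim, the pin_user_pages() added here):

          long pin_user_pages(unsigned long start, unsigned long nr_pages,
                              unsigned int gup_flags, struct page **pages,
                              struct vm_area_struct **vmas)
          {
                  /* FOLL_PIN is gup-internal: callers must not pass it in. */
                  if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
                          return -EINVAL;

                  gup_flags |= FOLL_PIN;
                  return __gup_longterm_locked(current, current->mm, start,
                                               nr_pages, pages, vmas,
                                               gup_flags);
          }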
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-12-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> [Documentation]
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: allow FOLL_FORCE for get_user_pages_fast() · f4000fdf
      John Hubbard authored
      Commit 817be129 ("mm: validate get_user_pages_fast flags") allowed
      only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast().
      This, combined with the fact that get_user_pages_fast() falls back to
      "slow gup", which *does* accept FOLL_FORCE, leads to an odd situation:
      if you need FOLL_FORCE, you cannot call get_user_pages_fast().
      
      There does not appear to be any reason for filtering out FOLL_FORCE.
      There is nothing in the _fast() implementation that requires that we
      avoid writing to the pages.  So it appears to have been an oversight.
      
      Fix by allowing FOLL_FORCE to be set for get_user_pages_fast().
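
      The fix amounts to widening the flag filter in get_user_pages_fast();
      sketched:

          /* Before (from commit 817be129): */
          if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
                  return -EINVAL;

          /* After: FOLL_FORCE is admitted too. */
          if (WARN_ON_ONCE(gup_flags &
                           ~(FOLL_WRITE | FOLL_LONGTERM | FOLL_FORCE)))
                  return -EINVAL;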
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-9-jhubbard@nvidia.com
      Fixes: 817be129 ("mm: validate get_user_pages_fast flags")
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM · c4237f8b
      John Hubbard authored
      As it says in the updated comment in gup.c: current FOLL_LONGTERM
      behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the FS
      DAX check requirement on vmas.
      
      However, the corresponding restriction in get_user_pages_remote() was
      slightly stricter than is actually required: it forbade all
      FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
      that do not set the "locked" arg.
      
      Update the code and comments to loosen the restriction, allowing
      FOLL_LONGTERM in some cases.
      
      Also, copy the DAX check ("if a VMA is DAX, don't allow long term
      pinning") from the VFIO call site, all the way into the internals of
      get_user_pages_remote() and __gup_longterm_locked().  That is:
      get_user_pages_remote() calls __gup_longterm_locked(), which in turn
      calls check_dax_vmas().  This check will then be removed from the VFIO
      call site in a subsequent patch.
      
      Thanks to Jason Gunthorpe for pointing out a clean way to fix this, and
      to Dan Williams for helping clarify the DAX refactoring.
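
      A condensed sketch of the loosened check in get_user_pages_remote()
      (close to the patch, with error-return details elided):

          if (gup_flags & FOLL_LONGTERM) {
                  /* Only the "locked" usage is incompatible: */
                  if (WARN_ON_ONCE(locked))
                          return -EINVAL;
                  /*
                   * __gup_longterm_locked() checks the vmas (even when the
                   * vmas arg is NULL) and rejects FS DAX vmas:
                   */
                  return __gup_longterm_locked(tsk, mm, start, nr_pages,
                                               pages, vmas,
                                               gup_flags | FOLL_TOUCH |
                                               FOLL_REMOTE);
          }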
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-7-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Tested-by: Alex Williamson <alex.williamson@redhat.com>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: move try_get_compound_head() to top, fix minor issues · a707cdd5
      John Hubbard authored
      An upcoming patch uses try_get_compound_head() more widely, so move it to
      the top of gup.c.
      
      Also fix a tiny spelling error and a checkpatch.pl warning.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-3-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: factor out duplicate code from four routines · a43e9820
      John Hubbard authored
      Patch series "mm/gup: prereqs to track dma-pinned pages: FOLL_PIN", v12.
      
      Overview:
      
      This is a prerequisite to solving the problem of proper interactions
      between file-backed pages, and [R]DMA activities, as discussed in [1],
      [2], [3], and in a remarkable number of email threads since about
      2017.  :)
      
      A new internal gup flag, FOLL_PIN is introduced, and thoroughly
      documented in the last patch's Documentation/vm/pin_user_pages.rst.
      
      I believe that this will provide a good starting point for doing the
      layout lease work that Ira Weiny has been working on.  That's because
      these new wrapper functions provide a clean, constrained, systematically
      named set of functionality that, again, is required in order to even
      know if a page is "dma-pinned".
      
      In contrast to earlier approaches, the page tracking can be
      incrementally applied to the kernel call sites that, until now, have
      been simply calling get_user_pages() ("gup").  In other words, opt-in by
      changing from this:
      
          get_user_pages() (sets FOLL_GET)
          put_page()
      
      to this:
          pin_user_pages() (sets FOLL_PIN)
          unpin_user_page()
      
      Testing:
      
      * I've done some overall kernel testing (LTP, and a few other goodies),
        and some directed testing to exercise some of the changes. And as you
        can see, gup_benchmark is enhanced to exercise this. Basically, I've
        been able to runtime test the core get_user_pages() and
        pin_user_pages() and related routines, but not so much on several of
        the call sites--but those are generally just a couple of lines
        changed, each.
      
        Not much of the kernel is actually using this, which on one hand
        reduces risk quite a lot. But on the other hand, testing coverage
        is low. So I'd love it if, in particular, the Infiniband and PowerPC
        folks could do a smoke test of this series for me.
      
        Runtime testing for the call sites so far is pretty light:
      
          * io_uring: Some directed tests from liburing exercise this, and
                      they pass.
          * process_vm_access.c: A small directed test passes.
          * gup_benchmark: the enhanced version hits the new gup.c code, and
                           passes.
          * infiniband: Ran rdma-core tests: rdma-core/build/bin/run_tests.py
          * VFIO: compiles (I'm vowing to set up a run time test soon, but it's
                            not ready just yet)
          * powerpc: it compiles...
          * drm/via: compiles...
          * goldfish: compiles...
          * net/xdp: compiles...
          * media/v4l2: compiles...
      
      [1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
      [2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
      [3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
      
      This patch (of 22):
      
      There are four locations in gup.c that have a fair amount of code
      duplication.  This means that changing one requires making the same
      changes in four places, not to mention reading the same code four times,
      and wondering if there are subtle differences.
      
      Factor out the common code into static functions, thus reducing the
      overall line count and the code's complexity.
      
      Also, take the opportunity to slightly improve the efficiency of the
      error cases, by doing a mass subtraction of the refcount, surrounded by
      get_page()/put_page().
      
      Also, further simplify (slightly) by waiting until the successful end of
      each routine to increment *nr.
      
      Link: http://lkml.kernel.org/r/20200107224558.2362728-2-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • be9d3045
      Wei Yang authored
    • mm: fix gup_pud_range · 15494520
      Qiujun Huang authored
      Sorry for not following up on this for a long time.  I hit the problem
      again.
      
      patch v1   https://lkml.org/lkml/2019/9/20/656
      
      do_machine_check()
        do_memory_failure()
          memory_failure()
            hw_poison_user_mappings()
              try_to_unmap()
                pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
      
      ...and now we have a swap entry that indicates that the page entry
      refers to a bad (and poisoned) page of memory, but gup_fast() at this
      level of the page table was ignoring swap entries, and incorrectly
      assuming that "!pxd_none() == valid and present".
      
      And this was not just a poisoned page problem, but a general swap entry
      problem.  So, any swap entry type (device memory migration, numa
      migration, or just regular swapping) could lead to the same problem.
      
      Fix this by checking for pxd_present(), instead of pxd_none().
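
      Sketched at the pud level (the same one-line change applies at each
      level of the fast-gup walk):

          /* Before: a swap entry is not "none", so it slipped through: */
          if (pud_none(pud))
                  return 0;

          /* After: any non-present entry (hwpoison, migration, swap)
           * bails out of the fast walk: */
          if (unlikely(!pud_present(pud)))
                  return 0;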
      
      Link: http://lkml.kernel.org/r/1578479084-15508-1-git-send-email-hqjagain@gmail.com
      Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 01 Dec, 2019 2 commits
  5. 19 Oct, 2019 1 commit
  6. 26 Sep, 2019 1 commit
  7. 24 Sep, 2019 3 commits
  8. 17 Jul, 2019 1 commit
  9. 12 Jul, 2019 14 commits
  10. 02 Jul, 2019 1 commit
  11. 01 Jun, 2019 1 commit
    • mm/gup: continue VM_FAULT_RETRY processing even for pre-faults · df17277b
      Mike Rapoport authored
      When get_user_pages*() is called with pages = NULL, the processing of
      VM_FAULT_RETRY terminates early without actually retrying to fault-in all
      the pages.
      
      If the pages in the requested range belong to a VMA that has userfaultfd
      registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
      has populated the page, but for the gup pre-fault case there's no actual
      retry and the caller will get no pages although they are present.
      
      This issue was uncovered when running post-copy memory restore in CRIU
      after d9c9ce34 ("x86/fpu: Fault-in user stack if
      copy_fpstate_to_sigframe() fails").
      
      After this change, the copying of FPU state to the sigframe switched from
      copy_to_user() variants which caused a real page fault to get_user_pages()
      with pages parameter set to NULL.
      
      In post-copy mode of CRIU, the destination memory is managed with
      userfaultfd and lack of the retry for pre-fault case in get_user_pages()
      causes a crash of the restored process.
      
      Making the pre-fault behavior of get_user_pages() the same as the "normal"
      one fixes the issue.
      
      Link: http://lkml.kernel.org/r/1557844195-18882-1-git-send-email-rppt@linux.ibm.com
      Fixes: d9c9ce34 ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Tested-by: Andrei Vagin <avagin@gmail.com> [https://travis-ci.org/avagin/linux/builds/533184940]
      Tested-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 21 May, 2019 1 commit