1. 31 Jan, 2020 5 commits
• mm: fix get_user_pages_remote()'s handling of FOLL_LONGTERM · c4237f8b
  John Hubbard authored
      As it says in the updated comment in gup.c: current FOLL_LONGTERM
      behavior is incompatible with FAULT_FLAG_ALLOW_RETRY because of the FS
      DAX check requirement on vmas.
      
      However, the corresponding restriction in get_user_pages_remote() was
      slightly stricter than is actually required: it forbade all
      FOLL_LONGTERM callers, but we can actually allow FOLL_LONGTERM callers
      that do not set the "locked" arg.
      
      Update the code and comments to loosen the restriction, allowing
      FOLL_LONGTERM in some cases.
      
      Also, copy the DAX check ("if a VMA is DAX, don't allow long term
      pinning") from the VFIO call site, all the way into the internals of
      get_user_pages_remote() and __gup_longterm_locked().  That is:
      get_user_pages_remote() calls __gup_longterm_locked(), which in turn
      calls check_dax_vmas().  This check will then be removed from the VFIO
      call site in a subsequent patch.
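The loosened rule can be modeled in a few lines of userspace C. This is a sketch only: the flag value, the helper name gup_remote_flags_ok(), and the simplified "locked" handling are illustrative, not the kernel's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag value, not the kernel's real FOLL_LONGTERM bit. */
#define FOLL_LONGTERM 0x10000

/*
 * Model of the loosened restriction in get_user_pages_remote():
 * the old rule rejected every FOLL_LONGTERM caller; the new rule rejects
 * FOLL_LONGTERM only when the caller also passes a non-NULL "locked"
 * arg, i.e. asks for FAULT_FLAG_ALLOW_RETRY behavior.
 */
static bool gup_remote_flags_ok(unsigned int gup_flags, int *locked)
{
    if ((gup_flags & FOLL_LONGTERM) && locked != NULL)
        return false;   /* still incompatible: longterm pin + retry */
    return true;        /* everything else, incl. plain FOLL_LONGTERM */
}
```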
      
      Thanks to Jason Gunthorpe for pointing out a clean way to fix this, and
      to Dan Williams for helping clarify the DAX refactoring.
      
Link: http://lkml.kernel.org/r/20200107224558.2362728-7-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/gup: move try_get_compound_head() to top, fix minor issues · a707cdd5
  John Hubbard authored
      An upcoming patch uses try_get_compound_head() more widely, so move it to
      the top of gup.c.
      
      Also fix a tiny spelling error and a checkpatch.pl warning.
      
Link: http://lkml.kernel.org/r/20200107224558.2362728-3-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/gup: factor out duplicate code from four routines · a43e9820
  John Hubbard authored
      Patch series "mm/gup: prereqs to track dma-pinned pages: FOLL_PIN", v12.
      
      Overview:
      
      This is a prerequisite to solving the problem of proper interactions
      between file-backed pages, and [R]DMA activities, as discussed in [1],
      [2], [3], and in a remarkable number of email threads since about
      2017.  :)
      
      A new internal gup flag, FOLL_PIN is introduced, and thoroughly
      documented in the last patch's Documentation/vm/pin_user_pages.rst.
      
      I believe that this will provide a good starting point for doing the
      layout lease work that Ira Weiny has been working on.  That's because
      these new wrapper functions provide a clean, constrained, systematically
      named set of functionality that, again, is required in order to even
      know if a page is "dma-pinned".
      
      In contrast to earlier approaches, the page tracking can be
      incrementally applied to the kernel call sites that, until now, have
      been simply calling get_user_pages() ("gup").  In other words, opt-in by
      changing from this:
      
          get_user_pages() (sets FOLL_GET)
          put_page()
      
      to this:
          pin_user_pages() (sets FOLL_PIN)
          unpin_user_page()
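The point of the separate names can be modeled in userspace C. This is a sketch under simplified assumptions: plain counters stand in for struct page refcounting, and the *_model helpers are illustrative, not the kernel API.

```c
#include <assert.h>

/*
 * Model of the opt-in conversion: FOLL_PIN pins are accounted
 * separately from ordinary FOLL_GET references, which is what later
 * makes the question "is this page dma-pinned?" answerable at all.
 */
struct page_model { int refs; int pins; };

/* Old pair: get_user_pages() + put_page(). */
static void get_user_page_model(struct page_model *p)   { p->refs++; }
static void put_page_model(struct page_model *p)        { p->refs--; }

/* New pair: pin_user_pages() + unpin_user_page(). */
static void pin_user_page_model(struct page_model *p)   { p->refs++; p->pins++; }
static void unpin_user_page_model(struct page_model *p) { p->refs--; p->pins--; }

static int page_dma_pinned(const struct page_model *p)  { return p->pins > 0; }
```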
      
      Testing:
      
      * I've done some overall kernel testing (LTP, and a few other goodies),
        and some directed testing to exercise some of the changes. And as you
        can see, gup_benchmark is enhanced to exercise this. Basically, I've
        been able to runtime test the core get_user_pages() and
        pin_user_pages() and related routines, but not so much on several of
        the call sites--but those are generally just a couple of lines
        changed, each.
      
        Not much of the kernel is actually using this, which on one hand
        reduces risk quite a lot. But on the other hand, testing coverage
        is low. So I'd love it if, in particular, the Infiniband and PowerPC
        folks could do a smoke test of this series for me.
      
        Runtime testing for the call sites so far is pretty light:
      
          * io_uring: Some directed tests from liburing exercise this, and
                      they pass.
          * process_vm_access.c: A small directed test passes.
          * gup_benchmark: the enhanced version hits the new gup.c code, and
                           passes.
          * infiniband: Ran rdma-core tests: rdma-core/build/bin/run_tests.py
          * VFIO: compiles (I'm vowing to set up a run time test soon, but it's
                            not ready just yet)
          * powerpc: it compiles...
          * drm/via: compiles...
          * goldfish: compiles...
          * net/xdp: compiles...
          * media/v4l2: compiles...
      
      [1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/
      [2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/
      [3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/
      
      This patch (of 22):
      
      There are four locations in gup.c that have a fair amount of code
      duplication.  This means that changing one requires making the same
      changes in four places, not to mention reading the same code four times,
      and wondering if there are subtle differences.
      
      Factor out the common code into static functions, thus reducing the
      overall line count and the code's complexity.
      
      Also, take the opportunity to slightly improve the efficiency of the
      error cases, by doing a mass subtraction of the refcount, surrounded by
      get_page()/put_page().
      
Also, further simplify (slightly), by waiting until the successful
end of each routine to increment *nr.
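The error-path idea can be sketched in userspace C (illustrative only: struct page_model, get_page(), put_page() and page_ref_sub() stand in for the kernel helpers, and undo_refs() is a hypothetical name):

```c
#include <assert.h>

/*
 * Instead of dropping N speculative references with N put_page()
 * calls, hold one temporary reference and subtract the rest in a
 * single operation.
 */
struct page_model { int refcount; };

static void get_page(struct page_model *p)            { p->refcount++; }
static void put_page(struct page_model *p)            { p->refcount--; }
static void page_ref_sub(struct page_model *p, int n) { p->refcount -= n; }

/* Undo "refs" speculative references taken on a (compound) page. */
static void undo_refs(struct page_model *head, int refs)
{
    get_page(head);           /* keep the page alive across the subtraction */
    page_ref_sub(head, refs); /* mass subtraction instead of a loop */
    put_page(head);           /* drop the temporary reference */
}
```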
      
Link: http://lkml.kernel.org/r/20200107224558.2362728-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Leon Romanovsky <leonro@mellanox.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• Wei Yang · be9d3045
• mm: fix gup_pud_range · 15494520
  Qiujun Huang authored
Sorry for not following up on this for a long time; I ran into the issue again.
      
      patch v1   https://lkml.org/lkml/2019/9/20/656
      
      do_machine_check()
        do_memory_failure()
          memory_failure()
            hw_poison_user_mappings()
              try_to_unmap()
                pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
      
      ...and now we have a swap entry that indicates that the page entry
      refers to a bad (and poisoned) page of memory, but gup_fast() at this
      level of the page table was ignoring swap entries, and incorrectly
      assuming that "!pxd_none() == valid and present".
      
And this was not just a poisoned page problem, but a general swap entry
problem.  So, any swap entry type (device memory migration, NUMA
migration, or just regular swapping) could lead to the same problem.
      
      Fix this by checking for pxd_present(), instead of pxd_none().
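The difference between the two checks can be modeled in userspace C (illustrative: pud_t and the accessors are stand-ins for the kernel's page-table helpers, and the bit values are made up). A swap entry is non-none but also non-present, so "!pud_none()" wrongly treats it as a valid, present entry, while "pud_present()" does not:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { unsigned long val; } pud_t;

#define PUD_PRESENT_BIT 0x1UL   /* illustrative bit, not the real layout */

static bool pud_none(pud_t pud)    { return pud.val == 0; }
static bool pud_present(pud_t pud) { return pud.val & PUD_PRESENT_BIT; }

/* Should gup_pud_range() walk through this entry? */
static bool walk_ok_old(pud_t pud) { return !pud_none(pud); }   /* buggy */
static bool walk_ok_new(pud_t pud) { return pud_present(pud); } /* fixed */
```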
      
Link: http://lkml.kernel.org/r/1578479084-15508-1-git-send-email-hqjagain@gmail.com
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 01 Dec, 2019 2 commits
  3. 19 Oct, 2019 1 commit
  4. 26 Sep, 2019 1 commit
  5. 24 Sep, 2019 3 commits
  6. 17 Jul, 2019 1 commit
  7. 12 Jul, 2019 14 commits
  8. 02 Jul, 2019 1 commit
  9. 01 Jun, 2019 1 commit
• mm/gup: continue VM_FAULT_RETRY processing even for pre-faults · df17277b
  Mike Rapoport authored
      When get_user_pages*() is called with pages = NULL, the processing of
      VM_FAULT_RETRY terminates early without actually retrying to fault-in all
      the pages.
      
      If the pages in the requested range belong to a VMA that has userfaultfd
      registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
      has populated the page, but for the gup pre-fault case there's no actual
      retry and the caller will get no pages although they are present.
      
      This issue was uncovered when running post-copy memory restore in CRIU
      after d9c9ce34 ("x86/fpu: Fault-in user stack if
      copy_fpstate_to_sigframe() fails").
      
      After this change, the copying of FPU state to the sigframe switched from
      copy_to_user() variants which caused a real page fault to get_user_pages()
      with pages parameter set to NULL.
      
      In post-copy mode of CRIU, the destination memory is managed with
      userfaultfd and lack of the retry for pre-fault case in get_user_pages()
      causes a crash of the restored process.
      
      Making the pre-fault behavior of get_user_pages() the same as the "normal"
      one fixes the issue.
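The control-flow change can be sketched in userspace C (illustrative throughout: fake_fault() and prefault_one() are hypothetical stand-ins, not kernel functions):

```c
#include <assert.h>
#include <stddef.h>

#define VM_FAULT_RETRY 1

/*
 * Stand-in fault handler: returns VM_FAULT_RETRY a few times, the way
 * handle_userfault() does while userspace populates the page, then
 * succeeds.
 */
static int fake_fault(int *retries_left)
{
    if (*retries_left > 0) {
        (*retries_left)--;
        return VM_FAULT_RETRY;
    }
    return 0;
}

/*
 * Model of the fixed pre-fault path: keep retrying on VM_FAULT_RETRY
 * even when the caller passed pages == NULL.  Returns 1 once the page
 * has actually been faulted in.
 */
static int prefault_one(void **pages, int *retries_left)
{
    (void)pages;            /* pages == NULL no longer ends the loop */
    for (;;) {
        if (fake_fault(retries_left) == VM_FAULT_RETRY)
            continue;       /* the old code gave up here for pre-faults */
        return 1;
    }
}
```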
      
      Link: http://lkml.kernel.org/r/1557844195-18882-1-git-send-email-rppt@linux.ibm.com
      Fixes: d9c9ce34 ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Tested-by: Andrei Vagin <avagin@gmail.com> [https://travis-ci.org/avagin/linux/builds/533184940]
Tested-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 21 May, 2019 1 commit
  11. 14 May, 2019 5 commits
• mm: introduce put_user_page*(), placeholder versions · fc1d8e7c
  John Hubbard authored
      A discussion of the overall problem is below.
      
As mentioned in patch 0001, the steps to fix the problem are:
      
      1) Provide put_user_page*() routines, intended to be used
         for releasing pages that were pinned via get_user_pages*().
      
      2) Convert all of the call sites for get_user_pages*(), to
         invoke put_user_page*(), instead of put_page(). This involves dozens of
         call sites, and will take some time.
      
      3) After (2) is complete, use get_user_pages*() and put_user_page*() to
         implement tracking of these pages. This tracking will be separate from
         the existing struct page refcounting.
      
      4) Use the tracking and identification of these pages, to implement
         special handling (especially in writeback paths) when the pages are
         backed by a filesystem.
      
      Overview
      ========
      
      Some kernel components (file systems, device drivers) need to access
      memory that is specified via process virtual address.  For a long time,
      the API to achieve that was get_user_pages ("GUP") and its variations.
      However, GUP has critical limitations that have been overlooked; in
      particular, GUP does not interact correctly with filesystems in all
      situations.  That means that file-backed memory + GUP is a recipe for
      potential problems, some of which have already occurred in the field.
      
      GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem
      code to get the struct page behind a virtual address and to let storage
      hardware perform a direct copy to or from that page.  This is a
      short-lived access pattern, and as such, the window for a concurrent
writeback of a GUP'd page was small enough that there were not (we think)
      any reported problems.  Also, userspace was expected to understand and
      accept that Direct IO was not synchronized with memory-mapped access to
      that data, nor with any process address space changes such as munmap(),
      mremap(), etc.
      
      Over the years, more GUP uses have appeared (virtualization, device
      drivers, RDMA) that can keep the pages they get via GUP for a long period
      of time (seconds, minutes, hours, days, ...).  This long-term pinning
      makes an underlying design problem more obvious.
      
      In fact, there are a number of key problems inherent to GUP:
      
      Interactions with file systems
      ==============================
      
      File systems expect to be able to write back data, both to reclaim pages,
      and for data integrity.  Allowing other hardware (NICs, GPUs, etc) to gain
      write access to the file memory pages means that such hardware can dirty
      the pages, without the filesystem being aware.  This can, in some cases
      (depending on filesystem, filesystem options, block device, block device
      options, and other variables), lead to data corruption, and also to kernel
      bugs of the form:
      
          kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
          backtrace:
              ext4_writepage
              __writepage
              write_cache_pages
              ext4_writepages
              do_writepages
              __writeback_single_inode
              writeback_sb_inodes
              __writeback_inodes_wb
              wb_writeback
              wb_workfn
              process_one_work
              worker_thread
              kthread
              ret_from_fork
      
      ...which is due to the file system asserting that there are still buffer
      heads attached:
      
              ({                                                      \
                      BUG_ON(!PagePrivate(page));                     \
                      ((struct buffer_head *)page_private(page));     \
              })
      
      Dave Chinner's description of this is very clear:
      
          "The fundamental issue is that ->page_mkwrite must be called on every
          write access to a clean file backed page, not just the first one.
          How long the GUP reference lasts is irrelevant, if the page is clean
          and you need to dirty it, you must call ->page_mkwrite before it is
          marked writeable and dirtied. Every. Time."
      
      This is just one symptom of the larger design problem: real filesystems
      that actually write to a backing device, do not actually support
      get_user_pages() being called on their pages, and letting hardware write
      directly to those pages--even though that pattern has been going on since
      about 2005 or so.
      
      Long term GUP
      =============
      
      Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
      writeable mapping is created), and the pages are file-backed.  That can
      lead to filesystem corruption.  What happens is that when a file-backed
      page is being written back, it is first mapped read-only in all of the CPU
      page tables; the file system then assumes that nobody can write to the
      page, and that the page content is therefore stable.  Unfortunately, the
GUP callers generally do not monitor changes to the CPU page tables; they
instead assume that the following pattern is safe (it's not):
      
          get_user_pages()
      
          Hardware can keep a reference to those pages for a very long time,
          and write to it at any time.  Because "hardware" here means "devices
          that are not a CPU", this activity occurs without any interaction with
          the kernel's file system code.
      
          for each page
              set_page_dirty
              put_page()
      
      In fact, the GUP documentation even recommends that pattern.
      
      Anyway, the file system assumes that the page is stable (nothing is
      writing to the page), and that is a problem: stable page content is
      necessary for many filesystem actions during writeback, such as checksum,
      encryption, RAID striping, etc.  Furthermore, filesystem features like COW
(copy on write) or snapshot also rely on being able to use a new page as
the memory for that range inside the file.
      
      Corruption during write back is clearly possible here.  To solve that, one
      idea is to identify pages that have active GUP, so that we can use a
      bounce page to write stable data to the filesystem.  The filesystem would
      work on the bounce page, while any of the active GUP might write to the
      original page.  This would avoid the stable page violation problem, but
      note that it is only part of the overall solution, because other problems
      remain.
      
      Other filesystem features that need to replace the page with a new one can
      be inhibited for pages that are GUP-pinned.  This will, however, alter and
      limit some of those filesystem features.  The only fix for that would be
      to require GUP users to monitor and respond to CPU page table updates.
      Subsystems such as ODP and HMM do this, for example.  This aspect of the
      problem is still under discussion.
      
      Direct IO
      =========
      
      Direct IO can cause corruption, if userspace does Direct-IO that writes to
      a range of virtual addresses that are mmap'd to a file.  The pages written
      to are file-backed pages that can be under write back, while the Direct IO
      is taking place.  Here, Direct IO races with a write back: it calls GUP
      before page_mkclean() has replaced the CPU pte with a read-only entry.
      The race window is pretty small, which is probably why years have gone by
      before we noticed this problem: Direct IO is generally very quick, and
      tends to finish up before the filesystem gets around to do anything with
      the page contents.  However, it's still a real problem.  The solution is
      to never let GUP return pages that are under write back, but instead,
      force GUP to take a write fault on those pages.  That way, GUP will
      properly synchronize with the active write back.  This does not change the
      required GUP behavior, it just avoids that race.
      
      Details
      =======
      
      Introduces put_user_page(), which simply calls put_page().  This provides
      a way to update all get_user_pages*() callers, so that they call
      put_user_page(), instead of put_page().
      
      Also introduces put_user_pages(), and a few dirty/locked variations, as a
      replacement for release_pages(), and also as a replacement for open-coded
      loops that release multiple pages.  These may be used for subsequent
      performance improvements, via batching of pages to be released.
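Since the commit itself says put_user_page() simply calls put_page() at this stage, the placeholder shape can be sketched in userspace C (a model: struct page and put_page() are stood in by a simple refcount):

```c
#include <assert.h>

/*
 * At this stage put_user_page() is deliberately nothing more than
 * put_page(); the distinct name is what allows call sites to be
 * converted now, ahead of the real pin tracking.
 */
struct page_model { int refcount; };

static void put_page(struct page_model *page)
{
    page->refcount--;
}

static void put_user_page(struct page_model *page)
{
    put_page(page);             /* placeholder: identical behavior */
}

static void put_user_pages(struct page_model **pages, unsigned long npages)
{
    unsigned long i;

    for (i = 0; i < npages; i++)
        put_user_page(pages[i]);   /* replaces open-coded release loops */
}
```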
      
      This is the first step of fixing a problem (also described in [1] and [2])
      with interactions between get_user_pages ("gup") and filesystems.
      
      Problem description: let's start with a bug report.  Below, is what
      happens sometimes, under memory pressure, when a driver pins some pages
      via gup, and then marks those pages dirty, and releases them.  Note that
      the gup documentation actually recommends that pattern.  The problem is
      that the filesystem may do a writeback while the pages were gup-pinned,
      and then the filesystem believes that the pages are clean.  So, when the
      driver later marks the pages as dirty, that conflicts with the
      filesystem's page tracking and results in a BUG(), like this one that I
      experienced:
      
          kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
          backtrace:
              ext4_writepage
              __writepage
              write_cache_pages
              ext4_writepages
              do_writepages
              __writeback_single_inode
              writeback_sb_inodes
              __writeback_inodes_wb
              wb_writeback
              wb_workfn
              process_one_work
              worker_thread
              kthread
              ret_from_fork
      
      ...which is due to the file system asserting that there are still buffer
      heads attached:
      
              ({                                                      \
                      BUG_ON(!PagePrivate(page));                     \
                      ((struct buffer_head *)page_private(page));     \
              })
      
      Dave Chinner's description of this is very clear:
      
          "The fundamental issue is that ->page_mkwrite must be called on
          every write access to a clean file backed page, not just the first
          one.  How long the GUP reference lasts is irrelevant, if the page is
          clean and you need to dirty it, you must call ->page_mkwrite before it
          is marked writeable and dirtied.  Every.  Time."
      
      This is just one symptom of the larger design problem: real filesystems
      that actually write to a backing device, do not actually support
      get_user_pages() being called on their pages, and letting hardware write
      directly to those pages--even though that pattern has been going on since
      about 2005 or so.
      
The steps to fix it are:
      
      1) (This patch): provide put_user_page*() routines, intended to be used
         for releasing pages that were pinned via get_user_pages*().
      
      2) Convert all of the call sites for get_user_pages*(), to
         invoke put_user_page*(), instead of put_page(). This involves dozens of
         call sites, and will take some time.
      
      3) After (2) is complete, use get_user_pages*() and put_user_page*() to
         implement tracking of these pages. This tracking will be separate from
         the existing struct page refcounting.
      
      4) Use the tracking and identification of these pages, to implement
         special handling (especially in writeback paths) when the pages are
         backed by a filesystem.
      
      [1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
      [2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
      
Link: http://lkml.kernel.org/r/20190327023632.13307-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>		[docs]
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
Tested-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/gup: add FOLL_LONGTERM capability to GUP fast · 7af75561
  Ira Weiny authored
      DAX pages were previously unprotected from longterm pins when users called
      get_user_pages_fast().
      
      Use the new FOLL_LONGTERM flag to check for DEVMAP pages and fall back to
      regular GUP processing if a DEVMAP page is encountered.
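The fallback check can be modeled in userspace C (illustrative: the bit values, pte_devmap_model() and fast_walk_may_pin() are stand-ins, not the kernel's helpers):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values, not the kernel's real flags or pte layout. */
#define FOLL_LONGTERM 0x10000
#define PTE_DEVMAP    0x2UL

static bool pte_devmap_model(unsigned long pte) { return pte & PTE_DEVMAP; }

/*
 * Model of the check added to the lockless fast walk: when a DEVMAP
 * (e.g. FS DAX) pte is seen and the caller asked for FOLL_LONGTERM,
 * refuse to pin here so the caller falls back to the slow path, which
 * performs the FS DAX vma checks.
 */
static bool fast_walk_may_pin(unsigned long pte, unsigned int gup_flags)
{
    if (pte_devmap_model(pte) && (gup_flags & FOLL_LONGTERM))
        return false;   /* longterm pin of a DAX page: take slow path */
    return true;
}
```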
      
      [ira.weiny@intel.com: v3]
        Link: http://lkml.kernel.org/r/20190328084422.29911-5-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190328084422.29911-5-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-5-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/gup: change GUP fast to use flags rather than a write 'bool' · 73b0140b
  Ira Weiny authored
      To facilitate additional options to get_user_pages_fast() change the
      singular write parameter to be gup_flags.
      
      This patch does not change any functionality.  New functionality will
      follow in subsequent patches.
      
      Some of the get_user_pages_fast() call sites were unchanged because they
      already passed FOLL_WRITE or 0 for the write parameter.
      
      NOTE: It was suggested to change the ordering of the get_user_pages_fast()
      arguments to ensure that callers were converted.  This breaks the current
      GUP call site convention of having the returned pages be the final
      parameter.  So the suggestion was rejected.
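The interface change can be sketched in userspace C (a model: the flag value and function bodies are illustrative stubs, only the signature shape follows the description above):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag value, not the kernel's real FOLL_WRITE bit. */
#define FOLL_WRITE 0x01

struct page_model;

static unsigned int last_gup_flags;   /* recorded for inspection */

/*
 * Sketch of the new signature: the singular "int write" parameter
 * becomes "unsigned int gup_flags", and the pages array stays last,
 * per the existing GUP call-site convention.
 */
static int gup_fast_model(unsigned long start, int nr_pages,
                          unsigned int gup_flags,
                          struct page_model **pages)
{
    (void)start; (void)pages;
    last_gup_flags = gup_flags;
    return nr_pages;              /* stub: pretend all pages were pinned */
}

/* A converted call site: the old boolean becomes an explicit flag. */
static int call_site(int write)
{
    return gup_fast_model(0, 1, write ? FOLL_WRITE : 0, NULL);
}
```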
      
      Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Mike Marshall <hubcap@omnibond.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/gup: change write parameter to flags in fast walk · b798bec4
  Ira Weiny authored
      In order to support more options in the GUP fast walk, change the write
      parameter to flags throughout the call stack.
      
      This patch does not change functionality and passes FOLL_WRITE where write
      was previously used.
      
      Link: http://lkml.kernel.org/r/20190328084422.29911-3-ira.weiny@intel.com
Link: http://lkml.kernel.org/r/20190317183438.2057-3-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b798bec4
    • Ira Weiny's avatar
      mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM · 932f4a63
      Ira Weiny authored
      Patch series "Add FOLL_LONGTERM to GUP fast and use it".
      
      HFI1, qib, and mthca use get_user_pages_fast() due to its performance
      advantages.  These pages can be held for a significant time.  But
      get_user_pages_fast() does not protect against mapping FS DAX pages.
      
      Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
      retains the performance while also adding the FS DAX checks.  XDP has also
      shown interest in using this functionality.[1]
      
      In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
      and remove the specialized get_user_pages_longterm call.
      
      [1] https://lkml.org/lkml/2019/3/19/939
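      
      The caller-side conversion can be sketched as below. The flag values and
      the stub entry point are assumptions for illustration; the real flags and
      the real get_user_pages() signature are defined in the kernel headers.
      
      ```c
      #include <assert.h>
      #include <stdio.h>
      
      /* Illustrative stand-ins for the kernel's gup flags; values assumed. */
      #define FOLL_WRITE    0x01
      #define FOLL_LONGTERM 0x10000
      
      /* Stub modelling the unified entry point: with FOLL_LONGTERM set, the
       * DAX-safe longterm path is taken; otherwise behavior is unchanged.
       * Returns 1 for the longterm path, 0 for the ordinary path. */
      static int gup_path(unsigned int gup_flags)
      {
              return (gup_flags & FOLL_LONGTERM) ? 1 : 0;
      }
      
      int main(void)
      {
              /* Before: get_user_pages_longterm(start, n, FOLL_WRITE, pages, vmas);
               * After:  get_user_pages(start, n, FOLL_WRITE | FOLL_LONGTERM,
               *                        pages, vmas);                           */
              assert(gup_path(FOLL_WRITE | FOLL_LONGTERM) == 1);
              assert(gup_path(FOLL_WRITE) == 0);
              printf("ok\n");
              return 0;
      }
      ```
      
      Folding the longterm behavior into a flag, rather than keeping a separate
      entry point, is what later lets get_user_pages_fast() honor the same
      restriction without yet another specialized call.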
      
      "longterm" is a relative thing and at this point is probably a misnomer.
      This is really flagging a pin which is going to be given to hardware and
      can't move.  I've thought of a couple of alternative names but I think we
      have to settle on if we are going to use FL_LAYOUT or something else to
      solve the "longterm" problem.  Then I think we can change the flag to a
      better name.
      
      Secondly, it depends on how often you are registering memory.  I have
      spoken with some RDMA users who consider MR in the performance path...
      For the overall application performance.  I don't have the numbers as the
      tests for HFI1 were done a long time ago.  But there was a significant
      advantage.  Some of which is probably due to the fact that you don't have
      to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
      the use of *_fast (they rework the use of mmap_sem) and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.  Also
      to this point others are looking to use *_fast.
      
      As an aside, Jason pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      This patch (of 7):
      
      This patch starts a series which aims to support FOLL_LONGTERM in
      get_user_pages_fast().  Some callers would like to do a longterm (user
      controlled) pin of pages with the fast variant of GUP for performance
      purposes.
      
      Rather than have a separate get_user_pages_longterm() call, introduce
      FOLL_LONGTERM and change the longterm callers to use it.
      
      This patch does not change any functionality.  In the short term
      "longterm" or user controlled pins are unsafe for Filesystems and FS DAX
      in particular has been blocked.  However, callers of get_user_pages_fast()
      were not "protected".
      
      FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
      requires vmas to determine if DAX is in use.
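      
      A minimal sketch of that restriction, assuming a toy vma type and an
      illustrative check_dax_vmas-style helper (names and the error value are
      assumptions, not the kernel's actual definitions):
      
      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdio.h>
      
      /* Illustrative stand-in for the kernel's FOLL_LONGTERM; value assumed. */
      #define FOLL_LONGTERM 0x10000
      
      /* Toy vma carrying only the property this check inspects. */
      struct vma { bool is_fsdax; };
      
      /* Sketch of the rule: a FOLL_LONGTERM pin must be refused on FS DAX
       * vmas.  Because this needs the vma array, it cannot run in the
       * IRQs-off fast walk, where no vmas are looked up. */
      static int check_dax_vmas_sketch(struct vma **vmas, int n,
                                       unsigned int gup_flags)
      {
              if (!(gup_flags & FOLL_LONGTERM))
                      return 0;
              for (int i = 0; i < n; i++)
                      if (vmas[i]->is_fsdax)
                              return -1;  /* stand-in for an -EOPNOTSUPP-style error */
              return 0;
      }
      
      int main(void)
      {
              struct vma normal = { false }, dax = { true };
              struct vma *vmas[2] = { &normal, &dax };
              assert(check_dax_vmas_sketch(vmas, 2, FOLL_LONGTERM) == -1);
              assert(check_dax_vmas_sketch(vmas, 2, 0) == 0);
              printf("dax check ok\n");
              return 0;
      }
      ```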
      
      NOTE: In merging with the CMA changes we opt to change the
      get_user_pages() call in check_and_migrate_cma_pages() to a call of
      __get_user_pages_locked() on the newly migrated pages.  This makes the
      code read better in that we are calling __get_user_pages_locked() on the
      pages before and after a potential migration.
      
      As a side effect some of the interfaces are cleaned up but this is not the
      primary purpose of the series.
      
      In review[1] it was asked:
      
      <quote>
      > This I don't get - if you do lock down long term mappings performance
      > of the actual get_user_pages call shouldn't matter to start with.
      >
      > What do I miss?
      
      A couple of points.
      
      First "longterm" is a relative thing and at this point is probably a
      misnomer.  This is really flagging a pin which is going to be given to
      hardware and can't move.  I've thought of a couple of alternative names
      but I think we have to settle on if we are going to use FL_LAYOUT or
      something else to solve the "longterm" problem.  Then I think we can
      change the flag to a better name.
      
      Second, it depends on how often you are registering memory.  I have spoken
      with some RDMA users who consider MR in the performance path...  For the
      overall application performance.  I don't have the numbers as the tests
      for HFI1 were done a long time ago.  But there was a significant
      advantage.  Some of which is probably due to the fact that you don't have
      to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
      the use of *_fast (they rework the use of mmap_sem) and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.  Also
      to this point others are looking to use *_fast.
      
      As an aside, Jason pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      </quote>
      
      [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965
      
      [ira.weiny@intel.com: v3]
        Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      932f4a63
  12. 14 Apr, 2019 1 commit
  13. 06 Mar, 2019 1 commit
    • Aneesh Kumar K.V's avatar
      mm: update get_user_pages_longterm to migrate pages allocated from CMA region · 9a4e9f3b
      Aneesh Kumar K.V authored
      This patch updates get_user_pages_longterm() to migrate pages allocated
      out of the CMA region.  This makes sure that we don't keep non-movable
      pages (due to page reference count) in the CMA area.
      
      This will be used by ppc64 in a later patch to avoid pinning pages in
      the CMA region.  ppc64 uses the CMA region for allocation of the
      hardware page table (hash page table), and not being able to migrate
      pages out of the CMA region results in page table allocation failures.
      
      One case where we hit this easily is when a guest uses a VFIO
      passthrough device.  VFIO locks all the guest's memory, and if the
      guest memory is backed by the CMA region, it becomes unmovable,
      resulting in fragmenting the CMA and possibly preventing other guests
      from allocating a large enough hash page table.
      
      NOTE: We allocate the new page without using __GFP_THISNODE
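      
      The check-and-migrate pass can be sketched as follows, assuming a toy
      page type; migrate_stub() and the helper name are illustrative stand-ins
      for the real migration machinery, not kernel functions.
      
      ```c
      #include <assert.h>
      #include <stdbool.h>
      #include <stdio.h>
      
      /* Toy page: just the one property the migration pass inspects. */
      struct page { bool in_cma; };
      
      /* Stand-in for the real migration machinery: pretend the contents were
       * moved to a freshly allocated non-CMA page. */
      static void migrate_stub(struct page *p)
      {
              p->in_cma = false;
      }
      
      /* Sketch of the pass: any page a longterm pin grabbed from the CMA
       * region is replaced by a page outside CMA before the pin may stand,
       * so the CMA area is never held unmovable by a pinned page. */
      static int check_and_migrate_cma_sketch(struct page *pages, int n)
      {
              int migrated = 0;
              for (int i = 0; i < n; i++) {
                      if (pages[i].in_cma) {
                              migrate_stub(&pages[i]);
                              migrated++;
                      }
              }
              return migrated;
      }
      
      int main(void)
      {
              struct page pages[3] = { { true }, { false }, { true } };
              assert(check_and_migrate_cma_sketch(pages, 3) == 2);
              for (int i = 0; i < 3; i++)
                      assert(!pages[i].in_cma);
              printf("cma migration ok\n");
              return 0;
      }
      ```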
      
      Link: http://lkml.kernel.org/r/20190114095438.32470-3-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a4e9f3b
  14. 13 Feb, 2019 1 commit
  15. 11 Feb, 2019 1 commit
  16. 04 Jan, 2019 1 commit