1. 12 Jul, 2019 10 commits
  2. 01 Jun, 2019 1 commit
    • Mike Rapoport's avatar
      mm/gup: continue VM_FAULT_RETRY processing even for pre-faults · df17277b
      Mike Rapoport authored
      When get_user_pages*() is called with pages = NULL, the processing of
      VM_FAULT_RETRY terminates early without actually retrying to fault-in all
      the pages.
      
      If the pages in the requested range belong to a VMA that has userfaultfd
      registered, handle_userfault() returns VM_FAULT_RETRY *after* user space
      has populated the page, but for the gup pre-fault case there's no actual
      retry and the caller will get no pages although they are present.
      
      This issue was uncovered when running post-copy memory restore in CRIU
      after d9c9ce34 ("x86/fpu: Fault-in user stack if
      copy_fpstate_to_sigframe() fails").
      
      After this change, the copying of FPU state to the sigframe switched from
      copy_to_user() variants which caused a real page fault to get_user_pages()
      with pages parameter set to NULL.
      
      In post-copy mode of CRIU, the destination memory is managed with
      userfaultfd and lack of the retry for pre-fault case in get_user_pages()
      causes a crash of the restored process.
      
      Making the pre-fault behavior of get_user_pages() the same as the "normal"
      one fixes the issue.
      
      Link: http://lkml.kernel.org/r/1557844195-18882-1-git-send-email-rppt@linux.ibm.com
      Fixes: d9c9ce34 ("x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails")
      Signed-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Tested-by: Andrei Vagin <avagin@gmail.com> [https://travis-ci.org/avagin/linux/builds/533184940]
      Tested-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      df17277b
  3. 21 May, 2019 1 commit
  4. 14 May, 2019 5 commits
    • John Hubbard's avatar
      mm: introduce put_user_page*(), placeholder versions · fc1d8e7c
      John Hubbard authored
      A discussion of the overall problem is below.
      
      As mentioned in patch 0001, the steps are to fix the problem are:
      
      1) Provide put_user_page*() routines, intended to be used
         for releasing pages that were pinned via get_user_pages*().
      
      2) Convert all of the call sites for get_user_pages*(), to
         invoke put_user_page*(), instead of put_page(). This involves dozens of
         call sites, and will take some time.
      
      3) After (2) is complete, use get_user_pages*() and put_user_page*() to
         implement tracking of these pages. This tracking will be separate from
         the existing struct page refcounting.
      
      4) Use the tracking and identification of these pages, to implement
         special handling (especially in writeback paths) when the pages are
         backed by a filesystem.
      
      Overview
      ========
      
      Some kernel components (file systems, device drivers) need to access
      memory that is specified via process virtual address.  For a long time,
      the API to achieve that was get_user_pages ("GUP") and its variations.
      However, GUP has critical limitations that have been overlooked; in
      particular, GUP does not interact correctly with filesystems in all
      situations.  That means that file-backed memory + GUP is a recipe for
      potential problems, some of which have already occurred in the field.
      
      GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem
      code to get the struct page behind a virtual address and to let storage
      hardware perform a direct copy to or from that page.  This is a
      short-lived access pattern, and as such, the window for a concurrent
      writeback of GUP'd page was small enough that there were not (we think)
      any reported problems.  Also, userspace was expected to understand and
      accept that Direct IO was not synchronized with memory-mapped access to
      that data, nor with any process address space changes such as munmap(),
      mremap(), etc.
      
      Over the years, more GUP uses have appeared (virtualization, device
      drivers, RDMA) that can keep the pages they get via GUP for a long period
      of time (seconds, minutes, hours, days, ...).  This long-term pinning
      makes an underlying design problem more obvious.
      
      In fact, there are a number of key problems inherent to GUP:
      
      Interactions with file systems
      ==============================
      
      File systems expect to be able to write back data, both to reclaim pages,
      and for data integrity.  Allowing other hardware (NICs, GPUs, etc) to gain
      write access to the file memory pages means that such hardware can dirty
      the pages, without the filesystem being aware.  This can, in some cases
      (depending on filesystem, filesystem options, block device, block device
      options, and other variables), lead to data corruption, and also to kernel
      bugs of the form:
      
          kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
          backtrace:
              ext4_writepage
              __writepage
              write_cache_pages
              ext4_writepages
              do_writepages
              __writeback_single_inode
              writeback_sb_inodes
              __writeback_inodes_wb
              wb_writeback
              wb_workfn
              process_one_work
              worker_thread
              kthread
              ret_from_fork
      
      ...which is due to the file system asserting that there are still buffer
      heads attached:
      
              ({                                                      \
                      BUG_ON(!PagePrivate(page));                     \
                      ((struct buffer_head *)page_private(page));     \
              })
      
      Dave Chinner's description of this is very clear:
      
          "The fundamental issue is that ->page_mkwrite must be called on every
          write access to a clean file backed page, not just the first one.
          How long the GUP reference lasts is irrelevant, if the page is clean
          and you need to dirty it, you must call ->page_mkwrite before it is
          marked writeable and dirtied. Every. Time."
      
      This is just one symptom of the larger design problem: real filesystems
      that actually write to a backing device, do not actually support
      get_user_pages() being called on their pages, and letting hardware write
      directly to those pages--even though that pattern has been going on since
      about 2005 or so.
      
      Long term GUP
      =============
      
      Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
      writeable mapping is created), and the pages are file-backed.  That can
      lead to filesystem corruption.  What happens is that when a file-backed
      page is being written back, it is first mapped read-only in all of the CPU
      page tables; the file system then assumes that nobody can write to the
      page, and that the page content is therefore stable.  Unfortunately, the
      GUP callers generally do not monitor changes to the CPU pages tables; they
      instead assume that the following pattern is safe (it's not):
      
          get_user_pages()
      
          Hardware can keep a reference to those pages for a very long time,
          and write to it at any time.  Because "hardware" here means "devices
          that are not a CPU", this activity occurs without any interaction with
          the kernel's file system code.
      
          for each page
              set_page_dirty
              put_page()
      
      In fact, the GUP documentation even recommends that pattern.
      
      Anyway, the file system assumes that the page is stable (nothing is
      writing to the page), and that is a problem: stable page content is
      necessary for many filesystem actions during writeback, such as checksum,
      encryption, RAID striping, etc.  Furthermore, filesystem features like COW
      (copy on write) or snapshot also rely on being able to use a new page for
      as memory for that memory range inside the file.
      
      Corruption during write back is clearly possible here.  To solve that, one
      idea is to identify pages that have active GUP, so that we can use a
      bounce page to write stable data to the filesystem.  The filesystem would
      work on the bounce page, while any of the active GUP might write to the
      original page.  This would avoid the stable page violation problem, but
      note that it is only part of the overall solution, because other problems
      remain.
      
      Other filesystem features that need to replace the page with a new one can
      be inhibited for pages that are GUP-pinned.  This will, however, alter and
      limit some of those filesystem features.  The only fix for that would be
      to require GUP users to monitor and respond to CPU page table updates.
      Subsystems such as ODP and HMM do this, for example.  This aspect of the
      problem is still under discussion.
      
      Direct IO
      =========
      
      Direct IO can cause corruption, if userspace does Direct-IO that writes to
      a range of virtual addresses that are mmap'd to a file.  The pages written
      to are file-backed pages that can be under write back, while the Direct IO
      is taking place.  Here, Direct IO races with a write back: it calls GUP
      before page_mkclean() has replaced the CPU pte with a read-only entry.
      The race window is pretty small, which is probably why years have gone by
      before we noticed this problem: Direct IO is generally very quick, and
      tends to finish up before the filesystem gets around to do anything with
      the page contents.  However, it's still a real problem.  The solution is
      to never let GUP return pages that are under write back, but instead,
      force GUP to take a write fault on those pages.  That way, GUP will
      properly synchronize with the active write back.  This does not change the
      required GUP behavior, it just avoids that race.
      
      Details
      =======
      
      Introduces put_user_page(), which simply calls put_page().  This provides
      a way to update all get_user_pages*() callers, so that they call
      put_user_page(), instead of put_page().
      
      Also introduces put_user_pages(), and a few dirty/locked variations, as a
      replacement for release_pages(), and also as a replacement for open-coded
      loops that release multiple pages.  These may be used for subsequent
      performance improvements, via batching of pages to be released.
      
      This is the first step of fixing a problem (also described in [1] and [2])
      with interactions between get_user_pages ("gup") and filesystems.
      
      Problem description: let's start with a bug report.  Below, is what
      happens sometimes, under memory pressure, when a driver pins some pages
      via gup, and then marks those pages dirty, and releases them.  Note that
      the gup documentation actually recommends that pattern.  The problem is
      that the filesystem may do a writeback while the pages were gup-pinned,
      and then the filesystem believes that the pages are clean.  So, when the
      driver later marks the pages as dirty, that conflicts with the
      filesystem's page tracking and results in a BUG(), like this one that I
      experienced:
      
          kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
          backtrace:
              ext4_writepage
              __writepage
              write_cache_pages
              ext4_writepages
              do_writepages
              __writeback_single_inode
              writeback_sb_inodes
              __writeback_inodes_wb
              wb_writeback
              wb_workfn
              process_one_work
              worker_thread
              kthread
              ret_from_fork
      
      ...which is due to the file system asserting that there are still buffer
      heads attached:
      
              ({                                                      \
                      BUG_ON(!PagePrivate(page));                     \
                      ((struct buffer_head *)page_private(page));     \
              })
      
      Dave Chinner's description of this is very clear:
      
          "The fundamental issue is that ->page_mkwrite must be called on
          every write access to a clean file backed page, not just the first
          one.  How long the GUP reference lasts is irrelevant, if the page is
          clean and you need to dirty it, you must call ->page_mkwrite before it
          is marked writeable and dirtied.  Every.  Time."
      
      This is just one symptom of the larger design problem: real filesystems
      that actually write to a backing device, do not actually support
      get_user_pages() being called on their pages, and letting hardware write
      directly to those pages--even though that pattern has been going on since
      about 2005 or so.
      
      The steps are to fix it are:
      
      1) (This patch): provide put_user_page*() routines, intended to be used
         for releasing pages that were pinned via get_user_pages*().
      
      2) Convert all of the call sites for get_user_pages*(), to
         invoke put_user_page*(), instead of put_page(). This involves dozens of
         call sites, and will take some time.
      
      3) After (2) is complete, use get_user_pages*() and put_user_page*() to
         implement tracking of these pages. This tracking will be separate from
         the existing struct page refcounting.
      
      4) Use the tracking and identification of these pages, to implement
         special handling (especially in writeback paths) when the pages are
         backed by a filesystem.
      
      [1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
      [2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
      
      Link: http://lkml.kernel.org/r/20190327023632.13307-2-jhubbard@nvidia.comSigned-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>		[docs]
      Reviewed-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Reviewed-by: default avatarChristoph Lameter <cl@linux.com>
      Tested-by: default avatarIra Weiny <ira.weiny@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fc1d8e7c
    • Ira Weiny's avatar
      mm/gup: add FOLL_LONGTERM capability to GUP fast · 7af75561
      Ira Weiny authored
      DAX pages were previously unprotected from longterm pins when users called
      get_user_pages_fast().
      
      Use the new FOLL_LONGTERM flag to check for DEVMAP pages and fall back to
      regular GUP processing if a DEVMAP page is encountered.
      
      [ira.weiny@intel.com: v3]
        Link: http://lkml.kernel.org/r/20190328084422.29911-5-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190328084422.29911-5-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-5-ira.weiny@intel.comSigned-off-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7af75561
    • Ira Weiny's avatar
      mm/gup: change GUP fast to use flags rather than a write 'bool' · 73b0140b
      Ira Weiny authored
      To facilitate additional options to get_user_pages_fast() change the
      singular write parameter to be gup_flags.
      
      This patch does not change any functionality.  New functionality will
      follow in subsequent patches.
      
      Some of the get_user_pages_fast() call sites were unchanged because they
      already passed FOLL_WRITE or 0 for the write parameter.
      
      NOTE: It was suggested to change the ordering of the get_user_pages_fast()
      arguments to ensure that callers were converted.  This breaks the current
      GUP call site convention of having the returned pages be the final
      parameter.  So the suggestion was rejected.
      
      Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.comSigned-off-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarMike Marshall <hubcap@omnibond.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      73b0140b
    • Ira Weiny's avatar
      mm/gup: change write parameter to flags in fast walk · b798bec4
      Ira Weiny authored
      In order to support more options in the GUP fast walk, change the write
      parameter to flags throughout the call stack.
      
      This patch does not change functionality and passes FOLL_WRITE where write
      was previously used.
      
      Link: http://lkml.kernel.org/r/20190328084422.29911-3-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-3-ira.weiny@intel.comSigned-off-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b798bec4
    • Ira Weiny's avatar
      mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM · 932f4a63
      Ira Weiny authored
      Pach series "Add FOLL_LONGTERM to GUP fast and use it".
      
      HFI1, qib, and mthca, use get_user_pages_fast() due to its performance
      advantages.  These pages can be held for a significant time.  But
      get_user_pages_fast() does not protect against mapping FS DAX pages.
      
      Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
      retains the performance while also adding the FS DAX checks.  XDP has also
      shown interest in using this functionality.[1]
      
      In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
      and remove the specialized get_user_pages_longterm call.
      
      [1] https://lkml.org/lkml/2019/3/19/939
      
      "longterm" is a relative thing and at this point is probably a misnomer.
      This is really flagging a pin which is going to be given to hardware and
      can't move.  I've thought of a couple of alternative names but I think we
      have to settle on if we are going to use FL_LAYOUT or something else to
      solve the "longterm" problem.  Then I think we can change the flag to a
      better name.
      
      Secondly, it depends on how often you are registering memory.  I have
      spoken with some RDMA users who consider MR in the performance path...
      For the overall application performance.  I don't have the numbers as the
      tests for HFI1 were done a long time ago.  But there was a significant
      advantage.  Some of which is probably due to the fact that you don't have
      to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
      the use of *_fast (they reworking the use of mmap_sem) and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.  Also
      to this point others are looking to use *_fast.
      
      As an aside, Jasons pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      This patch (of 7):
      
      This patch starts a series which aims to support FOLL_LONGTERM in
      get_user_pages_fast().  Some callers who would like to do a longterm (user
      controlled pin) of pages with the fast variant of GUP for performance
      purposes.
      
      Rather than have a separate get_user_pages_longterm() call, introduce
      FOLL_LONGTERM and change the longterm callers to use it.
      
      This patch does not change any functionality.  In the short term
      "longterm" or user controlled pins are unsafe for Filesystems and FS DAX
      in particular has been blocked.  However, callers of get_user_pages_fast()
      were not "protected".
      
      FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
      requires vmas to determine if DAX is in use.
      
      NOTE: In merging with the CMA changes we opt to change the
      get_user_pages() call in check_and_migrate_cma_pages() to a call of
      __get_user_pages_locked() on the newly migrated pages.  This makes the
      code read better in that we are calling __get_user_pages_locked() on the
      pages before and after a potential migration.
      
      As a side affect some of the interfaces are cleaned up but this is not the
      primary purpose of the series.
      
      In review[1] it was asked:
      
      <quote>
      > This I don't get - if you do lock down long term mappings performance
      > of the actual get_user_pages call shouldn't matter to start with.
      >
      > What do I miss?
      
      A couple of points.
      
      First "longterm" is a relative thing and at this point is probably a
      misnomer.  This is really flagging a pin which is going to be given to
      hardware and can't move.  I've thought of a couple of alternative names
      but I think we have to settle on if we are going to use FL_LAYOUT or
      something else to solve the "longterm" problem.  Then I think we can
      change the flag to a better name.
      
      Second, It depends on how often you are registering memory.  I have spoken
      with some RDMA users who consider MR in the performance path...  For the
      overall application performance.  I don't have the numbers as the tests
      for HFI1 were done a long time ago.  But there was a significant
      advantage.  Some of which is probably due to the fact that you don't have
      to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
      the use of *_fast (they reworking the use of mmap_sem) and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.  Also
      to this point others are looking to use *_fast.
      
      As an asside, Jasons pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      </quote>
      
      [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965
      
      [ira.weiny@intel.com: v3]
        Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.comSigned-off-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      932f4a63
  5. 14 Apr, 2019 1 commit
  6. 06 Mar, 2019 1 commit
    • Aneesh Kumar K.V's avatar
      mm: update get_user_pages_longterm to migrate pages allocated from CMA region · 9a4e9f3b
      Aneesh Kumar K.V authored
      This patch updates get_user_pages_longterm to migrate pages allocated
      out of CMA region.  This makes sure that we don't keep non-movable pages
      (due to page reference count) in the CMA area.
      
      This will be used by ppc64 in a later patch to avoid pinning pages in
      the CMA region.  ppc64 uses CMA region for allocation of the hardware
      page table (hash page table) and not able to migrate pages out of CMA
      region results in page table allocation failures.
      
      One case where we hit this easy is when a guest using a VFIO passthrough
      device.  VFIO locks all the guest's memory and if the guest memory is
      backed by CMA region, it becomes unmovable resulting in fragmenting the
      CMA and possibly preventing other guests from allocation a large enough
      hash page table.
      
      NOTE: We allocate the new page without using __GFP_THISNODE
      
      Link: http://lkml.kernel.org/r/20190114095438.32470-3-aneesh.kumar@linux.ibm.comSigned-off-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9a4e9f3b
  7. 13 Feb, 2019 1 commit
  8. 11 Feb, 2019 1 commit
  9. 04 Jan, 2019 2 commits
    • Davidlohr Bueso's avatar
      fa45f116
    • Linus Torvalds's avatar
      Remove 'type' argument from access_ok() function · 96d4f267
      Linus Torvalds authored
      Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
      of the user address range verification function since we got rid of the
      old racy i386-only code to walk page tables by hand.
      
      It existed because the original 80386 would not honor the write protect
      bit when in kernel mode, so you had to do COW by hand before doing any
      user access.  But we haven't supported that in a long time, and these
      days the 'type' argument is a purely historical artifact.
      
      A discussion about extending 'user_access_begin()' to do the range
      checking resulted this patch, because there is no way we're going to
      move the old VERIFY_xyz interface to that model.  And it's best done at
      the end of the merge window when I've done most of my merges, so let's
      just get this done once and for all.
      
      This patch was mostly done with a sed-script, with manual fix-ups for
      the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
      
      There were a couple of notable cases:
      
       - csky still had the old "verify_area()" name as an alias.
      
       - the iter_iov code had magical hardcoded knowledge of the actual
         values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
         really used it)
      
       - microblaze used the type argument for a debug printout
      
      but other than those oddities this should be a total no-op patch.
      
      I tried to fix up all architectures, did fairly extensive grepping for
      access_ok() uses, and the changes are trivial, but I may have missed
      something.  Any missed conversion should be trivially fixable, though.
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      96d4f267
  10. 30 Nov, 2018 1 commit
    • John Hubbard's avatar
      mm/gup: finish consolidating error handling · 08be37b7
      John Hubbard authored
      Commit df06b37f ("mm/gup: cache dev_pagemap while pinning pages")
      attempted to operate on each page that get_user_pages had retrieved.  In
      order to do that, it created a common exit point from the routine.
      However, one case was missed, which this patch fixes up.
      
      Also, there was still an unnecessary shadow declaration (with a
      different type) of the "ret" variable, which this patch removes.
      
      Keith's description of the situation is:
      
        This also fixes a potentially leaked dev_pagemap reference count if a
        failure occurs when an iteration crosses a vma boundary.  I don't think
        it's normal to have different vma's on a users mapped zone device
        memory, but good to fix anyway.
      
      I actually thought that this code:
      
          /* first iteration or cross vma bound */
          if (!vma || start >= vma->vm_end) {
      	        vma = find_extend_vma(mm, start);
      	        if (!vma && in_gate_area(mm, start)) {
      		            ret = get_gate_page(mm, start & PAGE_MASK,
      		                    gup_flags, &vma,
      		                    pages ? &pages[i] : NULL);
      		            if (ret)
      		                goto out;
      
      dealt with the "you're trying to pin the gate page, as part of this
      call", rather than the generic case of crossing a vma boundary.  (I
      think there's a fine point that I must be overlooking.) But it's still a
      valid case, either way.
      
      Link: http://lkml.kernel.org/r/20181121081402.29641-2-jhubbard@nvidia.com
      Fixes: df06b37f ("mm/gup: cache dev_pagemap while pinning pages")
      Signed-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Reviewed-by: default avatarKeith Busch <keith.busch@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      08be37b7
  11. 18 Nov, 2018 1 commit
  12. 31 Oct, 2018 1 commit
  13. 26 Oct, 2018 2 commits
  14. 24 Aug, 2018 1 commit
  15. 14 Jul, 2018 1 commit
    • Michal Hocko's avatar
      mm: do not bug_on on incorrect length in __mm_populate() · bb177a73
      Michal Hocko authored
      syzbot has noticed that a specially crafted library can easily hit
      VM_BUG_ON in __mm_populate
      
        kernel BUG at mm/gup.c:1242!
        invalid opcode: 0000 [#1] SMP
        CPU: 2 PID: 9667 Comm: a.out Not tainted 4.18.0-rc3 #644
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/19/2017
        RIP: 0010:__mm_populate+0x1e2/0x1f0
        Code: 55 d0 65 48 33 14 25 28 00 00 00 89 d8 75 21 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 75 18 f1 ff 0f 0b e8 6e 18 f1 ff <0f> 0b 31 db eb c9 e8 93 06 e0 ff 0f 1f 00 55 48 89 e5 53 48 89 fb
        Call Trace:
           vm_brk_flags+0xc3/0x100
           vm_brk+0x1f/0x30
           load_elf_library+0x281/0x2e0
           __ia32_sys_uselib+0x170/0x1e0
           do_fast_syscall_32+0xca/0x420
           entry_SYSENTER_compat+0x70/0x7f
      
      The reason is that the length of the new brk is not page aligned when we
      try to populate the it.  There is no reason to bug on that though.
      do_brk_flags already aligns the length properly so the mapping is
      expanded as it should.  All we need is to tell mm_populate about it.
      Besides that there is absolutely no reason to to bug_on in the first
      place.  The worst thing that could happen is that the last page wouldn't
      get populated and that is far from putting system into an inconsistent
      state.
      
      Fix the issue by moving the length sanitization code from do_brk_flags
      up to vm_brk_flags.  The only other caller of do_brk_flags is brk
      syscall entry and it makes sure to provide the proper length so t here
      is no need for sanitation and so we can use do_brk_flags without it.
      
      Also remove the bogus BUG_ONs.
      
      [osalvador@techadventures.net: fix up vm_brk_flags s@request@len@]
      Link: http://lkml.kernel.org/r/20180706090217.GI32658@dhcp22.suse.czSigned-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Reported-by: default avatarsyzbot <syzbot+5dcb560fe12aa5091c06@syzkaller.appspotmail.com>
      Tested-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: default avatarOscar Salvador <osalvador@suse.de>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb177a73
  16. 08 Jun, 2018 2 commits
    • Huang Ying's avatar
      mm, gup: prevent pmd checking race in follow_pmd_mask() · 68827280
      Huang Ying authored
      mmap_sem will be read locked when calling follow_pmd_mask().  But this
      cannot prevent PMD from being changed for all cases when PTL is
      unlocked, for example, from pmd_trans_huge() to pmd_none() via
      MADV_DONTNEED.  So it is possible for the pmd_present() check in
      follow_pmd_mask() to encounter an invalid PMD.  This may cause an
      incorrect VM_BUG_ON() or an infinite loop.  Fix this by reading the PMD
      entry into a local variable with READ_ONCE() and checking the local
      variable and pmd_none() in the retry loop.
      
      As Kirill pointed out, with PTL unlocked, the *pmd may be changed under
      us, so reading it directly again and again may incur weird bugs.  So
      although using *pmd directly other than for pmd_present() checking may
      be safe, it is still better to replace them to read *pmd once and check
      the local variable multiple times.
      
      When PTL unlocked, replace all *pmd with local variable was suggested by
      Kirill.
      
      Link: http://lkml.kernel.org/r/20180419083514.1365-1-ying.huang@intel.comSigned-off-by: default avatar"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: default avatarZi Yan <zi.yan@cs.rutgers.edu>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      68827280
    • Laurent Dufour's avatar
      mm: introduce ARCH_HAS_PTE_SPECIAL · 3010a5ea
      Laurent Dufour authored
      Currently the PTE special supports is turned on in per architecture
      header files.  Most of the time, it is defined in
      arch/*/include/asm/pgtable.h depending or not on some other per
      architecture static definition.
      
      This patch introduce a new configuration variable to manage this
      directly in the Kconfig files.  It would later replace
      __HAVE_ARCH_PTE_SPECIAL.
      
      Here notes for some architecture where the definition of
      __HAVE_ARCH_PTE_SPECIAL is not obvious:
      
      arm
       __HAVE_ARCH_PTE_SPECIAL which is currently defined in
      arch/arm/include/asm/pgtable-3level.h which is included by
      arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
      So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.
      
      powerpc
      __HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
       - arch/powerpc/include/asm/book3s/64/pgtable.h
       - arch/powerpc/include/asm/pte-common.h
      The first one is included if (PPC_BOOK3S & PPC64) while the second is
      included in all the other cases.
      So select ARCH_HAS_PTE_SPECIAL all the time.
      
      sparc:
      __HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
      defined(__arch64__) which are defined through the compiler in
      sparc/Makefile if !SPARC32 which I assume to be if SPARC64.
      So select ARCH_HAS_PTE_SPECIAL if SPARC64
      
      There is no functional change introduced by this patch.
      
      Link: http://lkml.kernel.org/r/1523433816-14460-2-git-send-email-ldufour@linux.vnet.ibm.comSigned-off-by: default avatarLaurent Dufour <ldufour@linux.vnet.ibm.com>
      Suggested-by: default avatarJerome Glisse <jglisse@redhat.com>
      Reviewed-by: default avatarJerome Glisse <jglisse@redhat.com>
      Acked-by: default avatarDavid Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Albert Ou <albert@sifive.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Christophe LEROY <christophe.leroy@c-s.fr>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3010a5ea
  17. 22 May, 2018 1 commit
  18. 17 May, 2018 1 commit
    • Willy Tarreau's avatar
      proc: do not access cmdline nor environ from file-backed areas · 7f7ccc2c
      Willy Tarreau authored
      proc_pid_cmdline_read() and environ_read() directly access the target
      process' VM to retrieve the command line and environment. If this
      process remaps these areas onto a file via mmap(), the requesting
      process may experience various issues such as extra delays if the
      underlying device is slow to respond.
      
      Let's simply refuse to access file-backed areas in these functions.
      For this we add a new FOLL_ANON gup flag that is passed to all calls
      to access_remote_vm(). The code already takes care of such failures
      (including unmapped areas). Accesses via /proc/pid/mem were not
      changed though.
      
      This was assigned CVE-2018-1120.
      
      Note for stable backports: the patch may apply to kernels prior to 4.11
      but silently miss one location; it must be checked that no call to
      access_remote_vm() keeps zero as the last argument.
      Reported-by: default avatarQualys Security Advisory <qsa@qualys.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7f7ccc2c
  19. 14 Apr, 2018 2 commits
  20. 06 Apr, 2018 1 commit
  21. 10 Mar, 2018 1 commit
  22. 08 Jan, 2018 1 commit
  23. 16 Dec, 2017 1 commit
    • Linus Torvalds's avatar
      Revert "mm: replace p??_write with pte_access_permitted in fault + gup paths" · f6f37321
      Linus Torvalds authored
      This reverts commits 5c9d2d5c, c7da82b8, and e7fe7b5c.
      
      We'll probably need to revisit this, but basically we should not
      complicate the get_user_pages_fast() case, and checking the actual page
      table protection key bits will require more care anyway, since the
      protection keys depend on the exact state of the VM in question.
      
      Particularly when doing a "remote" page lookup (ie in somebody elses VM,
      not your own), you need to be much more careful than this was.  Dave
      Hansen says:
      
       "So, the underlying bug here is that we now a get_user_pages_remote()
        and then go ahead and do the p*_access_permitted() checks against the
        current PKRU. This was introduced recently with the addition of the
        new p??_access_permitted() calls.
      
        We have checks in the VMA path for the "remote" gups and we avoid
        consulting PKRU for them. This got missed in the pkeys selftests
        because I did a ptrace read, but not a *write*. I also didn't
        explicitly test it against something where a COW needed to be done"
      
      It's also not entirely clear that it makes sense to check the protection
      key bits at this level at all.  But one possible eventual solution is to
      make the get_user_pages_fast() case just abort if it sees protection key
      bits set, which makes us fall back to the regular get_user_pages() case,
      which then has a vma and can do the check there if we want to.
      
      We'll see.
      
      Somewhat related to this all: what we _do_ want to do some day is to
      check the PAGE_USER bit - it should obviously always be set for user
      pages, but it would be a good check to have back.  Because we have no
      generic way to test for it, we lost it as part of moving over from the
      architecture-specific x86 GUP implementation to the generic one in
      commit e585513b ("x86/mm/gup: Switch GUP to the generic
      get_user_page_fast() implementation").
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f6f37321