1. 31 Jan, 2020 3 commits
  2. 18 Oct, 2019 1 commit
  3. 15 Oct, 2019 1 commit
    • vfio/type1: Initialize resv_msi_base · 95f89e09
      Joerg Roedel authored

      After enabling CONFIG_IOMMU_DMA on X86 a new warning appears when
      compiling vfio:
      
      drivers/vfio/vfio_iommu_type1.c: In function ‘vfio_iommu_type1_attach_group’:
      drivers/vfio/vfio_iommu_type1.c:1827:7: warning: ‘resv_msi_base’ may be used uninitialized in this function [-Wmaybe-uninitialized]
         ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
         ~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      The warning is a false positive, because the call to iommu_get_msi_cookie()
      only happens when vfio_iommu_has_sw_msi() returned true. And that only
      happens when it also set resv_msi_base.
      
      But initialize the variable anyway to get rid of the warning.
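
      The change itself is a one-line initializer.  A sketch of the relevant
      code in vfio_iommu_type1_attach_group() (abridged, not the verbatim diff):

        phys_addr_t resv_msi_base = 0;  /* initialization silences the warning */
        bool resv_msi;

        resv_msi = vfio_iommu_has_sw_msi(&group_resv_regions, &resv_msi_base);

        if (resv_msi) {
                /* only reached when resv_msi_base was set above */
                ret = iommu_get_msi_cookie(domain->domain, resv_msi_base);
                if (ret && ret != -ENODEV)
                        goto out_detach;
        }
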
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
  4. 26 Sep, 2019 1 commit
  5. 19 Aug, 2019 6 commits
  6. 24 Jul, 2019 2 commits
    • iommu: Introduce struct iommu_iotlb_gather for batching TLB flushes · a7d20dc1
      Will Deacon authored

      To permit batching of TLB flushes across multiple calls to the IOMMU
      driver's ->unmap() implementation, introduce a new structure for
      tracking the address range to be flushed and the granularity at which
      the flushing is required.
      
      This is hooked into the IOMMU API, and its callers are updated to make
      use of the new structure. Subsequent patches will plumb this into the
      IOMMU drivers as well, but for now the gathered information is ignored.
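
      The structure itself is small.  Roughly as introduced in
      include/linux/iommu.h (comments abridged):

        struct iommu_iotlb_gather {
                unsigned long   start;
                unsigned long   end;
                size_t          pgsize; /* page size of the last range added */
        };

        static inline void
        iommu_iotlb_gather_init(struct iommu_iotlb_gather *gather)
        {
                *gather = (struct iommu_iotlb_gather) {
                        .start = ULONG_MAX, /* first range always narrows it */
                };
        }
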
      Signed-off-by: Will Deacon <will@kernel.org>
    • iommu: Remove empty iommu_tlb_range_add() callback from iommu_ops · 6d1bcb95
      Will Deacon authored
      Commit add02cfd ("iommu: Introduce Interface for IOMMU TLB Flushing")
      added three new TLB flushing operations to the IOMMU API so that the
      underlying driver operations can be batched when unmapping large regions
      of IO virtual address space.
      
      However, the ->iotlb_range_add() callback has not been implemented by
      any IOMMU drivers (amd_iommu.c implements it as an empty function, which
      incurs the overhead of an indirect branch). Instead, drivers either flush
      the entire IOTLB in the ->iotlb_sync() callback or perform the necessary
      invalidation during ->unmap().
      
      Attempting to implement ->iotlb_range_add() for arm-smmu-v3.c revealed
      two major issues:
      
        1. The page size used to map the region in the page-table is not known,
           and so it is not generally possible to issue TLB flushes in the most
           efficient manner.
      
        2. The only mutable state passed to the callback is a pointer to the
           iommu_domain, which can be accessed concurrently and therefore
           requires expensive synchronisation to keep track of the outstanding
           flushes.
      
      Remove the callback entirely in preparation for extending ->unmap() and
      ->iotlb_sync() to update a token on the caller's stack.
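
      For reference, a sketch of the invalidation-related iommu_ops members
      once the callback is gone (other members omitted, signatures as of this
      point in the series); drivers batch work between ->unmap() and
      ->iotlb_sync() instead of receiving per-range hints:

        struct iommu_ops {
                /* ... */
                size_t (*unmap)(struct iommu_domain *domain, unsigned long iova,
                                size_t size);
                void (*flush_iotlb_all)(struct iommu_domain *domain);
                void (*iotlb_sync)(struct iommu_domain *domain);
                /* ->iotlb_range_add() has been removed */
        };
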
      Signed-off-by: Will Deacon <will@kernel.org>
  7. 17 Jul, 2019 1 commit
  8. 19 Jun, 2019 1 commit
  9. 14 May, 2019 1 commit
    • mm/gup: replace get_user_pages_longterm() with FOLL_LONGTERM · 932f4a63
      Ira Weiny authored
      Patch series "Add FOLL_LONGTERM to GUP fast and use it".
      
      HFI1, qib, and mthca use get_user_pages_fast() due to its performance
      advantages.  These pages can be held for a significant time.  But
      get_user_pages_fast() does not protect against mapping FS DAX pages.
      
      Introduce FOLL_LONGTERM and use this flag in get_user_pages_fast() which
      retains the performance while also adding the FS DAX checks.  XDP has also
      shown interest in using this functionality.[1]
      
      In addition we change get_user_pages() to use the new FOLL_LONGTERM flag
      and remove the specialized get_user_pages_longterm call.
      
      [1] https://lkml.org/lkml/2019/3/19/939
      
      "longterm" is a relative thing and at this point is probably a misnomer.
      This is really flagging a pin which is going to be given to hardware and
      can't move.  I've thought of a couple of alternative names but I think we
      have to settle on whether we are going to use FL_LAYOUT or something else
      to solve the "longterm" problem.  Then I think we can change the flag to a
      better name.
      
      Secondly, it depends on how often you are registering memory.  I have
      spoken with some RDMA users who consider MR (memory registration) part of
      the performance path for overall application performance.  I don't have
      the numbers, as the tests for HFI1 were done a long time ago, but there
      was a significant advantage.  Some of that is probably due to the fact
      that you don't have to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
      the use of *_fast (they rework the use of mmap_sem), and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.
      Others are also looking to use *_fast for the same reason.
      
      As an aside, Jason pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      This patch (of 7):
      
      This patch starts a series which aims to support FOLL_LONGTERM in
      get_user_pages_fast().  Some callers would like to do a longterm (user
      controlled) pin of pages with the fast variant of GUP for performance
      purposes.
      
      Rather than have a separate get_user_pages_longterm() call, introduce
      FOLL_LONGTERM and change the longterm callers to use it.
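
      A hedged sketch of what a converted caller looks like (the function and
      buffer names below are illustrative, not from the patch; by the end of
      the series the same flag is also accepted by get_user_pages_fast()):

        /* Illustrative only: pin a user buffer for long-lived hardware DMA. */
        static int pin_user_buffer(unsigned long uaddr, int nr_pages,
                                   struct page **pages)
        {
                /* FOLL_LONGTERM marks a pin that may be held indefinitely,
                 * letting GUP refuse FS DAX pages instead of pinning them.
                 */
                int n = get_user_pages(uaddr, nr_pages,
                                       FOLL_WRITE | FOLL_LONGTERM, pages, NULL);

                if (n < 0)
                        return n;
                if (n != nr_pages)
                        return -EFAULT; /* partial pin; release path elided */
                return 0;
        }
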
      
      This patch does not change any functionality.  In the short term,
      "longterm" or user controlled pins are unsafe for filesystems, and FS DAX
      in particular has been blocked.  However, callers of get_user_pages_fast()
      were not "protected".
      
      FOLL_LONGTERM can _only_ be supported with get_user_pages[_fast]() as it
      requires vmas to determine if DAX is in use.
      
      NOTE: In merging with the CMA changes we opt to change the
      get_user_pages() call in check_and_migrate_cma_pages() to a call of
      __get_user_pages_locked() on the newly migrated pages.  This makes the
      code read better in that we are calling __get_user_pages_locked() on the
      pages before and after a potential migration.
      
      As a side effect, some of the interfaces are cleaned up, but this is not
      the primary purpose of the series.
      
      In review[1] it was asked:
      
      <quote>
      > This I don't get - if you do lock down long term mappings performance
      > of the actual get_user_pages call shouldn't matter to start with.
      >
      > What do I miss?
      
      A couple of points.
      
      First "longterm" is a relative thing and at this point is probably a
      misnomer.  This is really flagging a pin which is going to be given to
      hardware and can't move.  I've thought of a couple of alternative names
      but I think we have to settle on if we are going to use FL_LAYOUT or
      something else to solve the "longterm" problem.  Then I think we can
      change the flag to a better name.
      
      Second, it depends on how often you are registering memory.  I have
      spoken with some RDMA users who consider MR (memory registration) part of
      the performance path for overall application performance.  I don't have
      the numbers, as the tests for HFI1 were done a long time ago, but there
      was a significant advantage.  Some of that is probably due to the fact
      that you don't have to hold mmap_sem.
      
      Finally, architecturally I think it would be good for everyone to use
      *_fast.  There are patches submitted to the RDMA list which would allow
      the use of *_fast (they rework the use of mmap_sem), and as soon as they
      are accepted I'll submit a patch to convert the RDMA core as well.
      Others are also looking to use *_fast for the same reason.
      
      As an aside, Jason pointed out in my previous submission that *_fast and
      *_unlocked look very much the same.  I agree and I think further cleanup
      will be coming.  But I'm focused on getting the final solution for DAX at
      the moment.
      
      </quote>
      
      [1] https://lore.kernel.org/lkml/20190220180255.GA12020@iweiny-DESK2.sc.intel.com/T/#md6abad2569f3bf6c1f03686c8097ab6563e94965
      
      [ira.weiny@intel.com: v3]
        Link: http://lkml.kernel.org/r/20190328084422.29911-2-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-2-ira.weiny@intel.com
      
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 12 Apr, 2019 2 commits
  11. 03 Apr, 2019 1 commit
    • vfio/type1: Limit DMA mappings per container · 49285593
      Alex Williamson authored

      Memory backed DMA mappings are accounted against a user's locked
      memory limit, including multiple mappings of the same memory.  This
      accounting bounds the number of such mappings that a user can create.
      However, DMA mappings that are not backed by memory, such as DMA
      mappings of device MMIO via mmaps, do not make use of page pinning
      and therefore do not count against the user's locked memory limit.
      These mappings still consume memory, but the memory is not well
      associated with the process for the purpose of OOM-killing a task.
      
      To add bounding on this use case, we introduce a limit to the total
      number of concurrent DMA mappings that a user is allowed to create.
      This limit is exposed as a tunable module option where the default
      value of 64K is expected to be well in excess of any reasonable use
      case (a large virtual machine configuration would typically only make
      use of tens of concurrent mappings).
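
      A sketch of the mechanism (abridged from the patch; the dma_avail
      counter lives in the per-container vfio_iommu and is replenished when a
      mapping is removed):

        static unsigned int dma_entry_limit __read_mostly = U16_MAX;
        module_param_named(dma_entry_limit, dma_entry_limit, uint, 0644);
        MODULE_PARM_DESC(dma_entry_limit,
                         "Maximum number of user DMA mappings per container (65535)");

        /* In vfio_dma_do_map(), under iommu->lock: */
        if (!iommu->dma_avail) {
                ret = -ENOSPC;
                goto out_unlock;
        }
        iommu->dma_avail--;     /* vfio_remove_dma() increments it again */
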
      
      This fixes CVE-2019-3882.
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Tested-by: Eric Auger <eric.auger@redhat.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Cornelia Huck <cohuck@redhat.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
  12. 08 Jan, 2019 1 commit
  13. 15 Nov, 2018 1 commit
  14. 06 Aug, 2018 1 commit
  15. 30 Jun, 2018 1 commit
  16. 08 Jun, 2018 1 commit
    • vfio/type1: Fix task tracking for QEMU vCPU hotplug · 48d8476b
      Alex Williamson authored

      MAP_DMA ioctls might be called from various threads within a process;
      for example, when using QEMU, the vCPU threads often generate these
      calls, and we therefore take a reference to that vCPU task.  However,
      QEMU also supports vCPU hotplug on some machines, and the task that
      called MAP_DMA may have exited by the time UNMAP_DMA is called,
      resulting in the mm_struct pointer being NULL and thus a failure to
      match against the existing mapping.
      
      To resolve this, we instead take a reference to the thread
      group_leader, which has the same mm_struct and resource limits but
      is less likely to exit, at least in the QEMU case.  A difficulty here is
      guaranteeing that the capabilities of the group_leader match that of
      the calling thread, which we resolve by tracking CAP_IPC_LOCK at the
      time of calling rather than at an indeterminate time in the future.
      Potentially this also results in better efficiency as this is now
      recorded once per MAP_DMA ioctl.
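
      A sketch of the new bookkeeping (abridged from the patch):

        /* In vfio_dma_do_map(): */
        get_task_struct(current->group_leader);
        dma->task = current->group_leader;      /* same mm and rlimits */
        dma->lock_cap = capable(CAP_IPC_LOCK);  /* sampled at MAP_DMA time */
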
      Reported-by: Xu Yandong <xuyandong2@huawei.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
  17. 02 Jun, 2018 1 commit
  18. 22 Mar, 2018 1 commit
  19. 21 Mar, 2018 1 commit
  20. 03 Mar, 2018 1 commit
    • vfio: disable filesystem-dax page pinning · 94db151d
      Dan Williams authored

      Filesystem-DAX is incompatible with 'longterm' page pinning. Without
      page cache indirection a DAX mapping maps filesystem blocks directly.
      This means that the filesystem must not modify a file's block map while
      any page in a mapping is pinned.  In order to prevent userspace from
      holding off filesystem operations indefinitely, disallow 'longterm'
      Filesystem-DAX mappings.
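
      Concretely, the pinning path switches to the _longterm GUP variant,
      which fails rather than pinning filesystem-DAX pages; a sketch
      (abridged from vaddr_get_pfn(), error handling elided):

        down_read(&mm->mmap_sem);       /* required by the slow-path variant */
        ret = get_user_pages_longterm(vaddr, 1,
                                      (prot & IOMMU_WRITE) ? FOLL_WRITE : 0,
                                      page, NULL);
        up_read(&mm->mmap_sem);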
      
      RDMA has the same conflict and the plan there is to add a 'with lease'
      mechanism to allow the kernel to notify userspace that the mapping is
      being torn down for block-map maintenance. Perhaps something similar can
      be put in place for vfio.
      
      Note that xfs and ext4 still report:
      
         "DAX enabled. Warning: EXPERIMENTAL, use at your own risk"
      
      ...at mount time, and resolving the dax-dma-vs-truncate problem is one
      of the last hurdles to remove that designation.
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: kvm@vger.kernel.org
      Cc: <stable@vger.kernel.org>
      Reported-by: Haozhong Zhang <haozhong.zhang@intel.com>
      Tested-by: Haozhong Zhang <haozhong.zhang@intel.com>
      Fixes: d475c634 ("dax,ext2: replace XIP read and write with DAX I/O")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  21. 20 Oct, 2017 1 commit
  22. 10 Aug, 2017 2 commits
  23. 18 Apr, 2017 2 commits
  24. 13 Apr, 2017 1 commit
    • vfio/type1: Remove locked page accounting workqueue · 0cfef2b7
      Alex Williamson authored

      If the mmap_sem is contended then the vfio type1 IOMMU backend will
      defer locked page accounting updates to a workqueue task.  This has a
      few problems and depending on which side the user tries to play, they
      might be over-penalized for unmaps that haven't yet been accounted or
      race the workqueue to enter more mappings than they're allowed.  The
      original intent of this workqueue mechanism seems to be focused on
      reducing latency through the ioctl, but we cannot do so at the cost
      of correctness.  Remove this workqueue mechanism and update the
      callers to allow for failure.  We can also now recheck the limit under
      write lock to make sure we don't exceed it.
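
      A sketch of the synchronous accounting that replaces the workqueue
      (abridged; the rlimit and task handling is simplified):

        down_write(&mm->mmap_sem);
        if (npage > 0 && !lock_cap &&
            mm->locked_vm + npage > rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT)
                ret = -ENOMEM;          /* caller unwinds its partial pin */
        else
                mm->locked_vm += npage; /* npage is negative on unmap */
        up_write(&mm->mmap_sem);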
      
      vfio_pin_pages_remote() also now necessarily includes an unwind path
      which we can jump to directly if the consecutive page pinning finds
      that we're exceeding the user's memory limits.  This avoids the
      current lazy approach which does accounting and mapping up to the
      fault, only to return an error on the next iteration to unwind the
      entire vfio_dma.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Kirti Wankhede <kwankhede@nvidia.com>
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
  25. 22 Mar, 2017 1 commit
    • iommu: Disambiguate MSI region types · 9d3a4de4
      Robin Murphy authored
      The introduction of reserved regions has left a couple of rough edges
      which we could do with sorting out sooner rather than later. Since we
      are not yet addressing the potential dynamic aspect of software-managed
      reservations and presenting them at arbitrary fixed addresses, it is
      incongruous that we end up displaying hardware vs. software-managed MSI
      regions to userspace differently, especially since ARM-based systems may
      actually require one or the other, or even potentially both at once
      (which iommu-dma currently has no hope of dealing with at all). Let's
      resolve the former user-visible inconsistency ASAP before the ABI has
      been baked into a kernel release, in a way that also lays the groundwork
      for the latter shortcoming to be addressed by follow-up patches.
      
      For clarity, rename the software-managed type to IOMMU_RESV_SW_MSI, use
      IOMMU_RESV_MSI to describe the hardware type, and document everything a
      little bit. Since the x86 MSI remapping hardware falls squarely under
      this meaning of IOMMU_RESV_MSI, apply that type to its regions as well,
      so that we tell the same story to userspace across all platforms.
      
      Secondly, as the various region types require quite different handling,
      and it really makes little sense to ever try combining them, convert the
      bitfield-esque #defines to a plain enum in the process before anyone
      gets the wrong impression.
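
      The resulting type, roughly as introduced in include/linux/iommu.h
      (comments abridged):

        enum iommu_resv_type {
                /* Memory region which must be mapped 1:1 at all times */
                IOMMU_RESV_DIRECT,
                /* Arbitrary "never map this or give it to a device" range */
                IOMMU_RESV_RESERVED,
                /* Hardware MSI region (untranslated) */
                IOMMU_RESV_MSI,
                /* Software-managed MSI translation window */
                IOMMU_RESV_SW_MSI,
        };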
      
      Fixes: d30ddcaa ("iommu: Add a new type field in iommu_resv_region")
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      CC: Alex Williamson <alex.williamson@redhat.com>
      CC: David Woodhouse <dwmw2@infradead.org>
      CC: kvm@vger.kernel.org
      Signed-off-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
  26. 02 Mar, 2017 2 commits
  27. 10 Feb, 2017 1 commit
  28. 23 Jan, 2017 1 commit