  1. Feb 22, 2021
    • KVM: x86/mmu: Consider the hva in mmu_notifier retry · 4a42d848
      David Stevens authored
      
      Track the range being invalidated by mmu_notifier and skip page fault
      retries if the fault address is not affected by the in-progress
      invalidation. Handle concurrent invalidations by finding the minimal
      range which includes all ranges being invalidated. Although the combined
      range may include unrelated addresses and cannot be shrunk as individual
      invalidation operations complete, it is unlikely the marginal gains of
      proper range tracking are worth the additional complexity.
      
      The primary benefit of this change is reducing the likelihood of
      extreme latency when handling a page fault, due to another thread
      having been preempted while modifying host virtual addresses.
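
      In sketch form, the idea looks roughly like this (field and helper
      names follow the commit message; treat the details as illustrative
      rather than the exact patch):

      static void account_invalidation(struct kvm *kvm,
                                       unsigned long start,
                                       unsigned long end)
      {
              if (!kvm->mmu_notifier_count) {
                      kvm->mmu_notifier_range_start = start;
                      kvm->mmu_notifier_range_end = end;
              } else {
                      /* Fold concurrent invalidations into one range. */
                      kvm->mmu_notifier_range_start =
                              min(kvm->mmu_notifier_range_start, start);
                      kvm->mmu_notifier_range_end =
                              max(kvm->mmu_notifier_range_end, end);
              }
              kvm->mmu_notifier_count++;
      }

      static bool hva_needs_retry(struct kvm *kvm, unsigned long hva)
      {
              /* Skip the retry if the faulting hva is not covered. */
              return kvm->mmu_notifier_count &&
                     hva >= kvm->mmu_notifier_range_start &&
                     hva < kvm->mmu_notifier_range_end;
      }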
      
      Signed-off-by: David Stevens <stevensd@chromium.org>
      Message-Id: <20210222024522.1751719-3-stevensd@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. Feb 09, 2021
    • KVM: Use kvm_pfn_t for local PFN variable in hva_to_pfn_remapped() · a9545779
      Sean Christopherson authored
      
      Use kvm_pfn_t, a.k.a. u64, for the local 'pfn' variable when retrieving
      a so-called "remapped" hva/pfn pair.  In theory, the hva could resolve to
      a pfn in high memory on a 32-bit kernel.
      
      This bug was inadvertently exposed by commit bd2fae8d ("KVM: do not
      assume PTE is writable after follow_pfn"), which added an error PFN value
      to the mix, causing gcc to complain about overflowing the unsigned long.
      
        arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function ‘hva_to_pfn_remapped’:
        include/linux/kvm_host.h:89:30: error: conversion from ‘long long unsigned int’
                                        to ‘long unsigned int’ changes value from
                                        ‘9218868437227405314’ to ‘2’ [-Werror=overflow]
         89 | #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
            |                              ^
        virt/kvm/kvm_main.c:1935:9: note: in expansion of macro ‘KVM_PFN_ERR_RO_FAULT’
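
      The fix itself is a one-line type change in hva_to_pfn_remapped(),
      roughly:

      /* Before: 'unsigned long' truncates a 64-bit pfn (or an error pfn
       * constant) on 32-bit kernels. */
      unsigned long pfn;

      /* After: kvm_pfn_t is a u64, wide enough for any pfn or error value. */
      kvm_pfn_t pfn;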
      
      Cc: stable@vger.kernel.org
      Fixes: add6a0cd ("KVM: MMU: try to fix up page faults before giving up")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210208201940.1258328-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • mm: provide a saner PTE walking API for modules · 9fd6dad1
      Paolo Bonzini authored
      
      Currently, the follow_pfn function is exported for modules but
      follow_pte is not.  However, follow_pfn is very easy to misuse,
      because it does not provide protections (so most of its callers
      assume the page is writable!) and because it returns after having
      already unlocked the page table lock.
      
      Provide instead a simplified version of follow_pte that does
      not have the pmdpp and range arguments.  The older version
      survives as follow_invalidate_pte() for use by fs/dax.c.
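
      The resulting split looks roughly like this (prototypes sketched
      from the description above):

      /* Simplified walker, safe to export: no pmd or notifier-range
       * output arguments. */
      int follow_pte(struct mm_struct *mm, unsigned long address,
                     pte_t **ptepp, spinlock_t **ptlp);

      /* The older, fuller version survives for fs/dax.c. */
      int follow_invalidate_pte(struct mm_struct *mm, unsigned long address,
                                struct mmu_notifier_range *range,
                                pte_t **ptepp, pmd_t **pmdpp,
                                spinlock_t **ptlp);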
      
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. Feb 04, 2021
    • KVM: x86/mmu: Use an rwlock for the x86 MMU · 531810ca
      Ben Gardon authored
      
      Add a read/write lock to be used in place of the MMU spinlock on x86.
      The rwlock will enable the TDP MMU to handle page faults and other
      operations in parallel in future commits.
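
      In sketch form (usage illustrative):

      /* Readers, e.g. parallel TDP MMU page faults, can proceed together. */
      read_lock(&kvm->mmu_lock);
      /* ... handle the fault ... */
      read_unlock(&kvm->mmu_lock);

      /* Writers, e.g. zapping paging structures, get exclusive access. */
      write_lock(&kvm->mmu_lock);
      /* ... modify paging structures ... */
      write_unlock(&kvm->mmu_lock);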
      
      Reviewed-by: Peter Feiner <pfeiner@google.com>
      Signed-off-by: Ben Gardon <bgardon@google.com>
      
      Message-Id: <20210202185734.1680553-19-bgardon@google.com>
      [Introduce virt/kvm/mmu_lock.h - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: use vzalloc() instead of vmalloc/memset · c910662c
      Tian Tao authored
      
      Fix the following warning:
      /virt/kvm/dirty_ring.c:70:20-27: WARNING: vzalloc should be used for
      ring -> dirty_gfns, instead of vmalloc/memset.
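
      The change amounts to (sketch):

      /* Before: separate allocation and zeroing. */
      ring->dirty_gfns = vmalloc(size);
      memset(ring->dirty_gfns, 0, size);

      /* After: a single call that returns zeroed memory. */
      ring->dirty_gfns = vzalloc(size);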
      
      Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
      Message-Id: <1611547045-13669-1-git-send-email-tiantao6@hisilicon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: do not assume PTE is writable after follow_pfn · bd2fae8d
      Paolo Bonzini authored
      
      In order to convert an HVA to a PFN, KVM usually tries to use
      the get_user_pages family of functions.  This, however, is not
      possible for VM_IO vmas; in that case, KVM instead uses follow_pfn.
      
      In doing so, however, KVM loses the information on whether the
      PFN is writable.  That is usually not a problem because the main
      use of VM_IO vmas with KVM is for BARs in PCI device assignment;
      however, it is a bug nonetheless.  To fix it, use follow_pte and
      check pte_write while under the protection of the PTE lock.  The
      information can be used to fail hva_to_pfn_remapped or passed back
      to the caller via *writable.
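
      A condensed sketch of the resulting hva_to_pfn_remapped() flow,
      using the simplified follow_pte() from the commit above (error
      handling trimmed):

      pte_t *ptep;
      spinlock_t *ptl;
      kvm_pfn_t pfn;

      if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
              return KVM_PFN_ERR_FAULT;

      if (write_fault && !pte_write(*ptep)) {
              pfn = KVM_PFN_ERR_RO_FAULT;     /* fail the write fault */
      } else {
              if (writable)
                      *writable = pte_write(*ptep);
              pfn = pte_pfn(*ptep);
      }
      pte_unmap_unlock(ptep, ptl);            /* drop the PTE lock */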
      
      Usage of follow_pfn was introduced in commit add6a0cd ("KVM: MMU: try to fix
      up page faults before giving up", 2016-07-05); however, even older versions
      have the same issue, all the way back to commit 2e2e3738 ("KVM:
      Handle vma regions with no backing page", 2008-07-20), as they also did
      not check whether the PFN was writable.
      
      Fixes: 2e2e3738 ("KVM: Handle vma regions with no backing page")
      Reported-by: David Stevens <stevensd@google.com>
      Cc: 3pvd@google.com
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. Jan 07, 2021
    • kvm: check tlbs_dirty directly · 88bf56d0
      Lai Jiangshan authored
      
      In kvm_mmu_notifier_invalidate_range_start(), tlbs_dirty is used as:
              need_tlb_flush |= kvm->tlbs_dirty;
      with need_tlb_flush's type being int and tlbs_dirty's type being long.
      
      It means that tlbs_dirty is always used as an int and the higher 32
      bits are unused.  We need to check tlbs_dirty in a correct way, and
      this change checks it directly without propagating it to need_tlb_flush.
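
      In sketch form:

      /* Before: the long is truncated through the int accumulator. */
      need_tlb_flush |= kvm->tlbs_dirty;
      if (need_tlb_flush)
              kvm_flush_remote_tlbs(kvm);

      /* After: test the long directly, with no truncation. */
      if (need_tlb_flush || kvm->tlbs_dirty)
              kvm_flush_remote_tlbs(kvm);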
      
      Note: it's _extremely_ unlikely this neglecting of higher 32 bits can
      cause problems in practice.  It would require encountering tlbs_dirty
      on a 4 billion count boundary, and KVM would need to be using shadow
      paging or be running a nested guest.
      
      Cc: stable@vger.kernel.org
      Fixes: a4ee1ca4 ("KVM: MMU: delay flush all tlbs on sync_page path")
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20201217154118.16497-1-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. Nov 15, 2020
    • KVM: Don't allocate dirty bitmap if dirty ring is enabled · 044c59c4
      Peter Xu authored
      
      Because the kvm dirty ring and the kvm dirty log are used in an
      exclusive way, let's avoid creating the dirty_bitmap when the kvm
      dirty ring is enabled.  Since the dirty_bitmap will now be created
      conditionally, we can no longer use it as a sign of "whether this
      memory slot has dirty tracking enabled".  Change such users to
      check the kvm memory slot flags instead.

      Note that there can still be cases where a kvm memory slot ends up
      with its dirty_bitmap allocated: if the memory slot is created
      before the dirty ring is enabled, and with the dirty tracking
      capability enabled, it will still have the dirty_bitmap.  However,
      this should not hurt much (e.g., the bitmaps are always freed if
      present), and real users normally won't trigger it, because the
      dirty tracking flag is in most cases applied to kvm slots only
      right before migration starts, which is far later than kvm
      initialization (VM start).
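
      A sketch of the two sides of the change (helper name illustrative):

      /* Creating a slot: skip the bitmap when the ring is in use. */
      if (!kvm->dirty_ring_size &&
          (new->flags & KVM_MEM_LOG_DIRTY_PAGES)) {
              if (kvm_alloc_dirty_bitmap(new))
                      return -ENOMEM;
      }

      /* Users now test the slot flag, not dirty_bitmap != NULL. */
      bool dirty_tracked = memslot->flags & KVM_MEM_LOG_DIRTY_PAGES;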
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012226.5868-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Make dirty ring exclusive to dirty bitmap log · b2cc64c4
      Peter Xu authored
      
      There's no good reason to use both the dirty bitmap logging and the
      new dirty ring buffer to track dirty bits.  We could probably even
      support both at the same time, but it would complicate things while
      helping little.  Let's simply make it the rule, before we enable
      the dirty ring on any arch, that the two interfaces cannot be used
      together.

      The big switch is KVM_CAP_DIRTY_LOG_RING capability enablement:
      that's where we switch from the default dirty logging to the dirty
      ring.  As long as kvm->dirty_ring_size is set up correctly, we
      switch once and for all to the dirty ring buffer mode for the
      current virtual machine.
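
      In sketch form, the exclusivity boils down to checks like this in
      the dirty-log ioctl handlers:

      /* e.g. in KVM_GET_DIRTY_LOG / KVM_CLEAR_DIRTY_LOG: */
      if (kvm->dirty_ring_size)
              return -ENXIO;  /* dirty ring in use, bitmap API refused */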
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012224.5818-1-peterx@redhat.com>
      [Change errno from EINVAL to ENXIO. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Implement ring-based dirty memory tracking · fb04a1ed
      Peter Xu authored

      This patch is heavily based on previous work from Lei Cao
      <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
      
      KVM currently uses large bitmaps to track dirty memory.  These bitmaps
      are copied to userspace when userspace queries KVM for its dirty page
      information.  The use of bitmaps is mostly sufficient for live
      migration, as large parts of memory are dirtied from one log-dirty
      pass to another.  However, in a checkpointing system, the number of
      dirty pages is small and in fact it is often bounded---the VM is
      paused when it has dirtied a pre-defined number of pages. Traversing a
      large, sparsely populated bitmap to find set bits is time-consuming,
      as is copying the bitmap to user-space.
      
      A similar issue exists for live migration when the guest memory is
      huge while the rate of page dirtying is trivial.  In that case, for
      each dirty sync we need to pull the whole dirty bitmap to userspace
      and analyse every bit even if it's mostly zeros.
      
      The preferred data structure for the above scenarios is a dense list of
      guest frame numbers (GFN).  This patch series stores the dirty list in
      kernel memory that can be memory mapped into userspace to allow speedy
      harvesting.
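
      The ring entries are small, fixed-size records, roughly (layout
      sketched; treat the exact fields as illustrative):

      struct kvm_dirty_gfn {
              __u32 flags;    /* dirty vs. harvested/reset state */
              __u32 slot;     /* (address space id << 16) | slot id */
              __u64 offset;   /* gfn offset within the memslot */
      };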
      
      This patch enables the dirty ring for X86 only.  However, it should
      be easy to extend to other archs as well.
      
      [1] https://patchwork.kernel.org/patch/10471409/
      
      
      
      Signed-off-by: Lei Cao <lei.cao@stratus.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012222.5767-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Pass in kvm pointer into mark_page_dirty_in_slot() · 28bd726a
      Peter Xu authored
      
      The context will be needed to implement the kvm dirty ring.
      
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012044.5151-5-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: remove kvm_clear_guest_page · 2f541442
      Paolo Bonzini authored
      
      kvm_clear_guest_page is not used anymore after "KVM: X86: Don't track dirty
      for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR]", except from kvm_clear_guest.
      We can just inline it in its sole user.
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm/eventfd: Drain events from eventfd in irqfd_wakeup() · b59e00dd
      David Woodhouse authored
      
      Don't allow the events to accumulate in the eventfd counter, drain them
      as they are handled.
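
      Sketch of the idea inside irqfd_wakeup(), placement illustrative:

      if (flags & EPOLLIN) {
              u64 cnt;

              /* Consume the count now instead of letting it accumulate. */
              eventfd_ctx_do_read(irqfd->eventfd, &cnt);
              schedule_work(&irqfd->inject);
      }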
      
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20201027135523.646811-4-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm/eventfd: Use priority waitqueue to catch events before userspace · e8dbf195
      David Woodhouse authored
      
      As far as I can tell, when we use posted interrupts we silently cut off
      the events from userspace, if it's listening on the same eventfd that
      feeds the irqfd.
      
      I like that behaviour. Let's do it all the time, even without posted
      interrupts. It makes it much easier to handle IRQ remapping invalidation
      without having to constantly add/remove the fd from the userspace poll
      set. We can just leave userspace polling on it, and the bypass will...
      well... bypass it.
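
      Sketch (the priority waitqueue helper is introduced by this
      series; usage illustrative):

      init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);

      /* Queue at the head of the eventfd's waitqueue, so the irqfd
       * sees and consumes the event before any ordinary (userspace)
       * waiter on the same eventfd. */
      add_wait_queue_priority(wqh, &irqfd->wait);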
      
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20201026175325.585623-2-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. Aug 21, 2020
    • KVM: Pass MMU notifier range flags to kvm_unmap_hva_range() · fdfe7cbd
      Will Deacon authored
      
      The 'flags' field of 'struct mmu_notifier_range' is used to indicate
      whether invalidate_range_{start,end}() are permitted to block. In the
      case of kvm_mmu_notifier_invalidate_range_start(), this field is not
      forwarded on to the architecture-specific implementation of
      kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
      whether or not to block.
      
      Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
      architectures are aware as to whether or not they are permitted to block.
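
      The signature change, roughly:

      -int kvm_unmap_hva_range(struct kvm *kvm,
      -                        unsigned long start, unsigned long end);
      +int kvm_unmap_hva_range(struct kvm *kvm,
      +                        unsigned long start, unsigned long end,
      +                        unsigned flags);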
      
      Cc: <stable@vger.kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
      Message-Id: <20200811102725.7121-2-will@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. Jul 24, 2020
    • entry: Provide infrastructure for work before transitioning to guest mode · 935ace2f
      Thomas Gleixner authored
      
      Entering a guest is similar to exiting to user space. Pending work like
      handling signals, rescheduling, task work etc. needs to be handled before
      that.
      
      Provide generic infrastructure to avoid duplication of the same handling
      code all over the place.
      
      The transfer to guest mode handling is different from the exit to usermode
      handling, e.g. vs. rseq and live patching, so a separate function is used.
      
      The initial list of work items handled is:
      
          TIF_SIGPENDING, TIF_NEED_RESCHED, TIF_NOTIFY_RESUME
      
      Architecture specific TIF flags can be added via defines in the
      architecture specific include files.
      
      The calling convention is also different from the syscall/interrupt entry
      functions as KVM invokes this from the outer vcpu_run() loop with
      interrupts and preemption enabled. To prevent missing a pending work item
      it invokes a check for pending TIF work from interrupt disabled code right
      before transitioning to guest mode. The lockdep, RCU and tracing state
      handling is also done directly around the switch to and from guest mode.
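
      A sketch of the shape of that check, made from the interrupts-
      disabled inner loop right before entering the guest (names
      approximate):

      static inline bool xfer_to_guest_mode_work_pending(void)
      {
              unsigned long ti_work = READ_ONCE(current_thread_info()->flags);

              /* XFER_TO_GUEST_MODE_WORK collects TIF_SIGPENDING,
               * TIF_NEED_RESCHED, TIF_NOTIFY_RESUME plus any
               * architecture specific TIF bits. */
              return ti_work & XFER_TO_GUEST_MODE_WORK;
      }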
      
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20200722220519.833296398@linutronix.de
  8. Jun 09, 2020
    • mmap locking API: use coccinelle to convert mmap_sem rwsem call sites · d8ed45c5
      Michel Lespinasse authored
      
      This change converts the existing mmap_sem rwsem calls to use the new mmap
      locking API instead.
      
      The change is generated using coccinelle with the following rule:
      
      // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .
      
      @@
      expression mm;
      @@
      (
      -init_rwsem
      +mmap_init_lock
      |
      -down_write
      +mmap_write_lock
      |
      -down_write_killable
      +mmap_write_lock_killable
      |
      -down_write_trylock
      +mmap_write_trylock
      |
      -up_write
      +mmap_write_unlock
      |
      -downgrade_write
      +mmap_write_downgrade
      |
      -down_read
      +mmap_read_lock
      |
      -down_read_killable
      +mmap_read_lock_killable
      |
      -down_read_trylock
      +mmap_read_trylock
      |
      -up_read
      +mmap_read_unlock
      )
      -(&mm->mmap_sem)
      +(mm)
      
      Signed-off-by: Michel Lespinasse <walken@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Liam Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ying Han <yinghan@google.com>
      Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: don't include asm/pgtable.h if linux/mm.h is already included · e31cf2f4
      Mike Rapoport authored
      
      Patch series "mm: consolidate definitions of page table accessors", v2.
      
      The low level page table accessors (pXY_index(), pXY_offset()) are
      duplicated across all architectures and sometimes more than once.  For
      instance, we have 31 definitions of pgd_offset() for 25 supported
      architectures.
      
      Most of these definitions are actually identical and typically it boils
      down to, e.g.
      
      static inline unsigned long pmd_index(unsigned long address)
      {
              return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
      }
      
      static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
      {
              return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
      }
      
      These definitions can be shared among 90% of the arches provided
      XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.
      
      For architectures that really need a custom version there is always the
      possibility to override the generic version with the usual ifdefs magic.
      
      These patches introduce include/linux/pgtable.h that replaces
      include/asm-generic/pgtable.h and add the definitions of the page table
      accessors to the new header.
      
      This patch (of 12):
      
      The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
      functions involving page table manipulations, e.g.  pte_alloc() and
      pmd_alloc().  So, there is no point to explicitly include <asm/pgtable.h>
      in the files that include <linux/mm.h>.
      
      The include statements in such cases are removed with a simple loop:
      
      	for f in $(git grep -l "include <linux/mm.h>") ; do
      		sed -i -e '/include <asm\/pgtable.h>/ d' $f
      	done
      
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
      Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. Jun 08, 2020
    • mm/gup.c: convert to use get_user_{page|pages}_fast_only() · dadbb612
      Souptick Joarder authored
      
      The API __get_user_pages_fast() is renamed to get_user_pages_fast_only()
      to align with pin_user_pages_fast_only().

      As part of this we get rid of the write parameter.  Instead, the caller
      will pass FOLL_WRITE to get_user_pages_fast_only().  This does not
      change any existing functionality of the API.
      
      All the callers are changed to pass FOLL_WRITE.
      
      Also introduce get_user_page_fast_only(), and use it in a few places
      that hard-code nr_pages to 1.
      
      Updated the documentation of the API.
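
      Callers change roughly like this:

      -	npages = __get_user_pages_fast(addr, npages, 1, pages);
      +	npages = get_user_pages_fast_only(addr, npages, FOLL_WRITE, pages);

      -	if (__get_user_pages_fast(addr, 1, 1, &page) == 1)
      +	if (get_user_page_fast_only(addr, FOLL_WRITE, &page))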
      
      Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Paul Mackerras <paulus@ozlabs.org>		[arch/powerpc/kvm]
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Michal Suchanek <msuchanek@suse.de>
      Link: http://lkml.kernel.org/r/1590396812-31277-1-git-send-email-jrdr.linux@gmail.com
      
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • KVM: x86: Fix APIC page invalidation race · e649b3f0
      Eiichi Tsukata authored

      Commit b1394e74 ("KVM: x86: fix APIC page invalidation") tried
      to fix inappropriate APIC page invalidation by re-introducing arch
      specific kvm_arch_mmu_notifier_invalidate_range() and calling it from
      kvm_mmu_notifier_invalidate_range_start. However, the patch left a
      possible race where the VMCS APIC address cache is updated *before*
      it is unmapped:
      
        (Invalidator) kvm_mmu_notifier_invalidate_range_start()
        (Invalidator) kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD)
        (KVM VCPU) vcpu_enter_guest()
        (KVM VCPU) kvm_vcpu_reload_apic_access_page()
        (Invalidator) actually unmap page
      
      Because of the above race, there can be a mismatch between the
      host physical address stored in the APIC_ACCESS_PAGE VMCS field and
      the host physical address stored in the EPT entry for the APIC GPA
      (0xfee00000).  When this happens, the processor will not trap APIC
      accesses and will instead show the raw contents of the APIC-access page.
      Because Windows OS periodically checks for unexpected modifications to
      the LAPIC register, this will show up as a BSOD crash with BugCheck
      CRITICAL_STRUCTURE_CORRUPTION (109), which we are currently seeing in
      https://bugzilla.redhat.com/show_bug.cgi?id=1751017.
      
      The root cause of the issue is that kvm_arch_mmu_notifier_invalidate_range()
      cannot guarantee that no additional references are taken to the pages in
      the range before kvm_mmu_notifier_invalidate_range_end().  Fortunately,
      this case is supported by the MMU notifier API, as documented in
      include/linux/mmu_notifier.h:
      
       * If the subsystem
       * can't guarantee that no additional references are taken to
       * the pages in the range, it has to implement the
       * invalidate_range() notifier to remove any references taken
       * after invalidate_range_start().
      
      The fix therefore is to reload the APIC-access page field in the VMCS
      from kvm_mmu_notifier_invalidate_range() instead of ..._range_start().
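
      In sketch form, close to the patch but trimmed:

      void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
                                                  unsigned long start,
                                                  unsigned long end)
      {
              unsigned long apic_address;

              /* Request the reload only if the invalidated range
               * covers the hva of the APIC-access page; this hook now
               * runs while the page is actually being unmapped. */
              apic_address = gfn_to_hva(kvm,
                              APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
              if (start <= apic_address && apic_address < end)
                      kvm_make_all_cpus_request(kvm,
                                      KVM_REQ_APIC_PAGE_RELOAD);
      }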
      
      Cc: stable@vger.kernel.org
      Fixes: b1394e74 ("KVM: x86: fix APIC page invalidation")
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=197951
      
      
      Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
      Message-Id: <20200606042627.61070-1-eiichi.tsukata@nutanix.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  10. Jun 01, 2020
    • KVM: check userspace_addr for all memslots · 09d952c9
      Paolo Bonzini authored
      
      The userspace_addr alignment and range checks are not performed for private
      memory slots that are prepared by KVM itself.  This is unnecessary and makes
      it questionable to use __*_user functions to access memory later on.  We also
      rely on the userspace address being aligned since we have an entire family
      of functions to map gfn to pfn.
      
      Fortunately skipping the check is completely unnecessary.  Only x86 uses
      private memslots and their userspace_addr is obtained from vm_mmap,
      therefore it must be below PAGE_OFFSET.  In fact, any attempt to pass
      an address above PAGE_OFFSET would have failed because such an address
      would return true for kvm_is_error_hva.
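
      The check now applied to all memslots is roughly:

      if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
          !access_ok((void __user *)(unsigned long)mem->userspace_addr,
                     mem->memory_size))
              return -EINVAL;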
      
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: acknowledgment mechanism for async pf page ready notifications · 557a961a
      Vitaly Kuznetsov authored
      
      If two page ready notifications happen back to back, the second one is
      not delivered, and the only mechanism we currently have is the
      kvm_check_async_pf_completion() check in the vcpu_run() loop.  The check
      will only be performed with the next vmexit, whenever it happens, and in
      some cases it may take a while.  With interrupt based page ready
      notification delivery
      it may take a while. With interrupt based page ready notification delivery
      the situation is even worse: unlike exceptions, interrupts are not handled
      immediately so we must check if the slot is empty. This is slow and
      unnecessary. Introduce dedicated MSR_KVM_ASYNC_PF_ACK MSR to communicate
      the fact that the slot is free and host should check its notification
      queue. Mandate using it for interrupt based 'page ready' APF event
      delivery.
      
      As kvm_check_async_pf_completion() is going away from vcpu_run() we need
      a way to communicate the fact that vcpu->async_pf.done queue has
      transitioned from empty to non-empty state. Introduce
      kvm_arch_async_page_present_queued() and KVM_REQ_APF_READY to do the job.
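
      Sketch of the host side of the handshake, inside the MSR-write
      handler (handling approximate):

      case MSR_KVM_ASYNC_PF_ACK:
              if (data & 0x1) {
                      /* The guest says the 'page ready' slot is free
                       * again; look for the next completed APF. */
                      kvm_check_async_pf_completion(vcpu);
              }
              break;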
      
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200525144125.143875-7-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>