1. 22 Jun, 2020 2 commits
  2. 09 Jun, 2020 1 commit
  3. 15 May, 2020 2 commits
  4. 20 Apr, 2020 1 commit
  5. 16 Mar, 2020 2 commits
  6. 15 Feb, 2020 1 commit
  7. 12 Feb, 2020 1 commit
    • Sean Christopherson's avatar
      KVM: x86/mmu: Fix struct guest_walker arrays for 5-level paging · f6ab0107
      Sean Christopherson authored
      Define PT_MAX_FULL_LEVELS as PT64_ROOT_MAX_LEVEL, i.e. 5, to fix shadow
      paging for 5-level guest page tables.  PT_MAX_FULL_LEVELS is used to
      size the arrays that track guest pages table information, i.e. using a
      "max levels" of 4 causes KVM to access garbage beyond the end of an
      array when querying state for level 5 entries.  E.g. FNAME(gpte_changed)
      will read garbage and most likely return %true for a level 5 entry,
      soft-hanging the guest because FNAME(fetch) will restart the guest
      instead of creating SPTEs because it thinks the guest PTE has changed.
      Note, KVM doesn't yet support 5-level nested EPT, so PT_MAX_FULL_LEVELS
      gets to stay "4" for the PTTYPE_EPT case.
      Fixes: 855feb67
       ("KVM: MMU: Add 5 level EPT & Shadow page table support.")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
  8. 27 Jan, 2020 4 commits
  9. 21 Jan, 2020 2 commits
    • Sean Christopherson's avatar
      KVM: x86/mmu: Micro-optimize nEPT's bad memptype/XWR checks · b5c3c1b3
      Sean Christopherson authored
      Rework the handling of nEPT's bad memtype/XWR checks to micro-optimize
      the checks as much as possible.  Move the check to a separate helper,
      __is_bad_mt_xwr(), which allows the guest_rsvd_check usage in
      paging_tmpl.h to omit the check entirely for paging32/64 (bad_mt_xwr is
      always zero for non-nEPT) while retaining the bitwise-OR of the current
      code for the shadow_zero_check in walk_shadow_page_get_mmio_spte().
      Add a comment for the bitwise-OR usage in the mmio spte walk to avoid
      future attempts to "fix" the code, which is what prompted this
      optimization in the first place[*].
      Opportunistically remove the superfluous '!= 0' and parantheses, and
      use BIT_ULL() instead of open coding its equivalent.
      The net effect is that code generation is largely unchanged for
      walk_shadow_page_get_mmio_spte(), marginally better for
      ept_prefetch_invalid_gpte(), and significantly improved for
      Note, walk_shadow_page_get_mmio_spte() can't use a templated version of
      the memtype/XRW as it works on the host's shadow PTEs, e.g. checks that
      KVM hasn't borked its EPT tables.  Even if it could be templated, the
      benefits of having a single implementation far outweight the few uops
      that would be saved for NPT or non-TDP paging, e.g. most compilers
      inline it all the way to up kvm_mmu_page_fault().
      [*] https://lkml.kernel.org/r/20200108001859.25254-1-sean.j.christopherson@intel.com
      Cc: Jim Mattson <jmattson@google.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Arvind Sankar <nivedita@alum.mit.edu>
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
    • Sean Christopherson's avatar
      KVM: x86/mmu: Reorder the reserved bit check in prefetch_invalid_gpte() · f8052a05
      Sean Christopherson authored
      Move the !PRESENT and !ACCESSED checks in FNAME(prefetch_invalid_gpte)
      above the call to is_rsvd_bits_set().  For a well behaved guest, the
      !PRESENT and !ACCESSED are far more likely to evaluate true than the
      reserved bit checks, and they do not require additional memory accesses.
       Dump of assembler code for function paging32_prefetch_invalid_gpte:
         0x0000000000044240 <+0>:     callq  0x44245 <paging32_prefetch_invalid_gpte+5>
         0x0000000000044245 <+5>:     mov    %rcx,%rax
         0x0000000000044248 <+8>:     shr    $0x7,%rax
         0x000000000004424c <+12>:    and    $0x1,%eax
         0x000000000004424f <+15>:    lea    0x0(,%rax,4),%r8
         0x0000000000044257 <+23>:    add    %r8,%rax
         0x000000000004425a <+26>:    mov    %rcx,%r8
         0x000000000004425d <+29>:    and    0x120(%rsi,%rax,8),%r8
         0x0000000000044265 <+37>:    mov    0x170(%rsi),%rax
         0x000000000004426c <+44>:    shr    %cl,%rax
         0x000000000004426f <+47>:    and    $0x1,%eax
         0x0000000000044272 <+50>:    or     %rax,%r8
         0x0000000000044275 <+53>:    jne    0x4427c <paging32_prefetch_invalid_gpte+60>
         0x0000000000044277 <+55>:    test   $0x1,%cl
         0x000000000004427a <+58>:    jne    0x4428a <paging32_prefetch_invalid_gpte+74>
         0x000000000004427c <+60>:    mov    %rdx,%rsi
         0x000000000004427f <+63>:    callq  0x44080 <drop_spte>
         0x0000000000044284 <+68>:    mov    $0x1,%eax
         0x0000000000044289 <+73>:    retq
         0x000000000004428a <+74>:    xor    %eax,%eax
         0x000000000004428c <+76>:    and    $0x20,%ecx
         0x000000000004428f <+79>:    jne    0x44289 <paging32_prefetch_invalid_gpte+73>
         0x0000000000044291 <+81>:    mov    %rdx,%rsi
         0x0000000000044294 <+84>:    callq  0x44080 <drop_spte>
         0x0000000000044299 <+89>:    mov    $0x1,%eax
         0x000000000004429e <+94>:    jmp    0x44289 <paging32_prefetch_invalid_gpte+73>
       End of assembler dump.
       Dump of assembler code for function paging32_prefetch_invalid_gpte:
         0x0000000000044240 <+0>:     callq  0x44245 <paging32_prefetch_invalid_gpte+5>
         0x0000000000044245 <+5>:     test   $0x1,%cl
         0x0000000000044248 <+8>:     je     0x4424f <paging32_prefetch_invalid_gpte+15>
         0x000000000004424a <+10>:    test   $0x20,%cl
         0x000000000004424d <+13>:    jne    0x4425d <paging32_prefetch_invalid_gpte+29>
         0x000000000004424f <+15>:    mov    %rdx,%rsi
         0x0000000000044252 <+18>:    callq  0x44080 <drop_spte>
         0x0000000000044257 <+23>:    mov    $0x1,%eax
         0x000000000004425c <+28>:    retq
         0x000000000004425d <+29>:    mov    %rcx,%rax
         0x0000000000044260 <+32>:    mov    (%rsi),%rsi
         0x0000000000044263 <+35>:    shr    $0x7,%rax
         0x0000000000044267 <+39>:    and    $0x1,%eax
         0x000000000004426a <+42>:    lea    0x0(,%rax,4),%r8
         0x0000000000044272 <+50>:    add    %r8,%rax
         0x0000000000044275 <+53>:    mov    %rcx,%r8
         0x0000000000044278 <+56>:    and    0x120(%rsi,%rax,8),%r8
         0x0000000000044280 <+64>:    mov    0x170(%rsi),%rax
         0x0000000000044287 <+71>:    shr    %cl,%rax
         0x000000000004428a <+74>:    and    $0x1,%eax
         0x000000000004428d <+77>:    mov    %rax,%rcx
         0x0000000000044290 <+80>:    xor    %eax,%eax
         0x0000000000044292 <+82>:    or     %rcx,%r8
         0x0000000000044295 <+85>:    je     0x4425c <paging32_prefetch_invalid_gpte+28>
         0x0000000000044297 <+87>:    mov    %rdx,%rsi
         0x000000000004429a <+90>:    callq  0x44080 <drop_spte>
         0x000000000004429f <+95>:    mov    $0x1,%eax
         0x00000000000442a4 <+100>:   jmp    0x4425c <paging32_prefetch_invalid_gpte+28>
       End of assembler dump.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
  10. 08 Jan, 2020 5 commits
    • Sean Christopherson's avatar
      KVM: x86/mmu: WARN on an invalid root_hpa · 0c7a98e3
      Sean Christopherson authored
      WARN on the existing invalid root_hpa checks in __direct_map() and
      FNAME(fetch).  The "legitimate" path that invalidated root_hpa in the
      middle of a page fault is long since gone, i.e. it should no longer be
      impossible to invalidate in the middle of a page fault[*].
      The root_hpa checks were added by two related commits
        989c6b34 ("KVM: MMU: handle invalid root_hpa at __direct_map")
       ("KVM: x86: handle invalid root_hpa everywhere")
      to fix a bug where nested_vmx_vmexit() could be called *in the middle*
      of a page fault.  At the time, vmx_interrupt_allowed(), which was and
      still is used by kvm_can_do_async_pf() via ->interrupt_allowed(),
      directly invoked nested_vmx_vmexit() to switch from L2 to L1 to emulate
      a VM-Exit on a pending interrupt.  Emulating the nested VM-Exit resulted
      in root_hpa being invalidated by kvm_mmu_reset_context() without
      explicitly terminating the page fault.
      Now that root_hpa is checked for validity by kvm_mmu_page_fault(), WARN
      on an invalid root_hpa to detect any flows that reset the MMU while
      handling a page fault.  The broken vmx_interrupt_allowed() behavior has
      long since been fixed and resetting the MMU during a page fault should
      not be considered legal behavior.
      [*] It's actually technically possible in FNAME(page_fault)() because it
          calls inject_page_fault() when the guest translation is invalid, but
          in that case the page fault handling is immediately terminated.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
    • Sean Christopherson's avatar
      KVM: x86/mmu: Move calls to thp_adjust() down a level · 4cd071d1
      Sean Christopherson authored
      Move the calls to thp_adjust() down a level from the page fault handlers
      to the map/fetch helpers and remove the page count shuffling done in
      Despite holding a reference to the underlying page while processing a
      page fault, the page fault flows don't actually rely on holding a
      reference to the page when thp_adjust() is called.  At that point, the
      fault handlers hold mmu_lock, which prevents mmu_notifier from completing
      any invalidations, and have verified no invalidations from mmu_notifier
      have occurred since the page reference was acquired (which is done prior
      to taking mmu_lock).
      The kvm_release_pfn_clean()/kvm_get_pfn() dance in thp_adjust() is a
      quirk that is necessitated because thp_adjust() modifies the pfn that is
      consumed by its caller.  Because the page fault handlers call
      kvm_release_pfn_clean() on said pfn, thp_adjust() needs to transfer the
      reference to the correct pfn purely for correctness when the pfn is
      Calling thp_adjust() from __direct_map() and FNAME(fetch) means the pfn
      adjustment doesn't change the pfn as seen by the page fault handlers,
      i.e. the pfn released by the page fault handlers is the same pfn that
      was returned by gfn_to_pfn().
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
    • Sean Christopherson's avatar
      KVM: x86/mmu: Incorporate guest's page level into max level for shadow MMU · cbe1e6f0
      Sean Christopherson authored
      Restrict the max level for a shadow page based on the guest's level
      instead of capping the level after the fact for host-mapped huge pages,
      e.g. hugetlbfs pages.  Explicitly capping the max level using the guest
      mapping level also eliminates FNAME(page_fault)'s subtle dependency on
      THP only supporting 2mb pages.
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
    • Sean Christopherson's avatar
      KVM: x86/mmu: Refactor handling of forced 4k pages in page faults · 39ca1ecb
      Sean Christopherson authored
      Refactor the page fault handlers and mapping_level() to track the max
      allowed page level instead of only tracking if a 4k page is mandatory
      due to one restriction or another.  This paves the way for cleanly
      consolidating tdp_page_fault() and nonpaging_page_fault(), and for
      eliminating a redundant check on mmu_gfn_lpage_is_disallowed().
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
    • Sean Christopherson's avatar
      KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM · 736c291c
      Sean Christopherson authored
      Convert a plethora of parameters and variables in the MMU and page fault
      flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.
      Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
      addresses.  When TDP is enabled, the fault address is a guest physical
      address and thus can be a 64-bit value, even when both KVM and its guest
      are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
      64-bit field, not a natural width field.
      Using a gva_t for the fault address means KVM will incorrectly drop the
      upper 32-bits of the GPA.  Ditto for gva_to_gpa() when it is used to
      translate L2 GPAs to L1 GPAs.
      Opportunistically rename variables and parameters to better reflect the
      dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
      "addr" instead of "vaddr" when the address may be either a GVA or an L2
      GPA.  Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
      a confusing "gpa_t gva" declaration; this also sets the stage for a
      future patch to combing nonpaging_page_fault() and tdp_page_fault() with
      minimal churn.
      Sprinkle in a few comments to document flows where an address is known
      to be a GVA and thus can be safely truncated to a 32-bit value.  Add
      WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
      document such cases and detect bugs.
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
  11. 21 Nov, 2019 1 commit
  12. 04 Nov, 2019 1 commit
    • Paolo Bonzini's avatar
      kvm: mmu: ITLB_MULTIHIT mitigation · b8e8c830
      Paolo Bonzini authored
      With some Intel processors, putting the same virtual address in the TLB
      as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit
      and cause the processor to issue a machine check resulting in a CPU lockup.
      Unfortunately when EPT page tables use huge pages, it is possible for a
      malicious guest to cause this situation.
      Add a knob to mark huge pages as non-executable. When the nx_huge_pages
      parameter is enabled (and we are using EPT), all huge pages are marked as
      NX. If the guest attempts to execute in one of those pages, the page is
      broken down into 4K pages, which are then marked executable.
      This is not an issue for shadow paging (except nested EPT), because then
      the host is in control of TLB flushes and the problematic situation cannot
      happen.  With nested EPT, again the nested guest can cause problems shadow
      and direct EPT is treated in the same way.
      [ tglx: Fixup default to auto and massage wording a bit ]
      Originally-by: default avatarJunaid Shahid <junaids@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
  13. 05 Jul, 2019 4 commits
  14. 19 Jun, 2019 1 commit
  15. 14 May, 2019 1 commit
    • Ira Weiny's avatar
      mm/gup: change GUP fast to use flags rather than a write 'bool' · 73b0140b
      Ira Weiny authored
      To facilitate additional options to get_user_pages_fast() change the
      singular write parameter to be gup_flags.
      This patch does not change any functionality.  New functionality will
      follow in subsequent patches.
      Some of the get_user_pages_fast() call sites were unchanged because they
      already passed FOLL_WRITE or 0 for the write parameter.
      NOTE: It was suggested to change the ordering of the get_user_pages_fast()
      arguments to ensure that callers were converted.  This breaks the current
      GUP call site convention of having the returned pages be the final
      parameter.  So the suggestion was rejected.
      Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
      Signed-off-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarMike Marshall <hubcap@omnibond.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  16. 30 Apr, 2019 1 commit
  17. 21 Dec, 2018 1 commit
  18. 16 Oct, 2018 1 commit
  19. 06 Aug, 2018 3 commits
  20. 16 Mar, 2018 1 commit
    • KarimAllah Ahmed's avatar
      KVM: x86: Update the exit_qualification access bits while walking an address · ddd6f0e9
      KarimAllah Ahmed authored
      ... to avoid having a stale value when handling an EPT misconfig for MMIO
      MMIO regions that are not passed-through to the guest are handled through
      EPT misconfigs. The first time a certain MMIO page is touched it causes an
      EPT violation, then KVM marks the EPT entry to cause an EPT misconfig
      instead. Any subsequent accesses to the entry will generate an EPT
      Things gets slightly complicated with nested guest handling for MMIO
      regions that are not passed through from L0 (i.e. emulated by L0
      An EPT violation for one of these MMIO regions from L2, exits to L0
      hypervisor. L0 would then look at the EPT12 mapping for L1 hypervisor and
      realize it is not present (or not sufficient to serve the request). Then L0
      injects an EPT violation to L1. L1 would then update its EPT mappings. The
      EXIT_QUALIFICATION value for L1 would come from exit_qualification variable
      in "struct vcpu". The problem is that this variable is only updated on EPT
      violation and not on EPT misconfig. So if an EPT violation because of a
      read happened first, then an EPT misconfig because of a write happened
      afterwards. The L0 hypervisor will still contain exit_qualification value
      from the previous read instead of the write and end up injecting an EPT
      violation to the L1 hypervisor with an out of date EXIT_QUALIFICATION.
      The EPT violation that is injected from L0 to L1 needs to have the correct
      EXIT_QUALIFICATION specially for the access bits because the individual
      access bits for MMIO EPTs are updated only on actual access of this
      specific type. So for the example above, the L1 hypervisor will keep
      updating only the read bit in the EPT then resume the L2 guest. The L2
      guest would end up causing another exit where the L0 *again* will inject
      another EPT violation to L1 hypervisor with *again* an out of date
      exit_qualification which indicates a read and not a write. Then this
      ping-pong just keeps happening without making any forward progress.
      The behavior of mapping MMIO regions changed in:
         commit a340b3e2
       ("kvm: Map PFN-type memory regions as writable (if possible)")
      ... where an EPT violation for a read would also fixup the write bits to
      avoid another EPT violation which by acciddent would fix the bug mentioned
      This commit fixes this situation and ensures that the access bits for the
      exit_qualifcation is up to date. That ensures that even L1 hypervisor
      running with a KVM version before the commit mentioned above would still
      ( The description above assumes EPT to be available and used by L1
        hypervisor + the L1 hypervisor is passing through the MMIO region to the L2
        guest while this MMIO region is emulated by the L0 user-space ).
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: default avatarKarimAllah Ahmed <karahmed@amazon.de>
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
  21. 12 Oct, 2017 1 commit
  22. 10 Oct, 2017 1 commit
    • Ladi Prosek's avatar
      KVM: MMU: always terminate page walks at level 1 · 829ee279
      Ladi Prosek authored
      is_last_gpte() is not equivalent to the pseudo-code given in commit
      6bb69c9b ("KVM: MMU: simplify last_pte_bitmap") because an incorrect
      value of last_nonleaf_level may override the result even if level == 1.
      It is critical for is_last_gpte() to return true on level == 1 to
      terminate page walks. Otherwise memory corruption may occur as level
      is used as an index to various data structures throughout the page
      walking code.  Even though the actual bug would be wherever the MMU is
      initialized (as in the previous patch), be defensive and ensure here
      that is_last_gpte() returns the correct value.
      This patch is also enough to fix CVE-2017-12188.
      Fixes: 6bb69c9b
      Cc: stable@vger.kernel.org
      Cc: Andy Honig <ahonig@google.com>
      Signed-off-by: default avatarLadi Prosek <lprosek@redhat.com>
      [Panic if walk_addr_generic gets an incorrect level; this is a serious
       bug and it's not worth a WARN_ON where the recovery path might hide
       further exploitable issues; suggested by Andrew Honig. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
  23. 18 Aug, 2017 1 commit
    • Paolo Bonzini's avatar
      KVM: x86: fix use of L1 MMIO areas in nested guests · 9034e6e8
      Paolo Bonzini authored
      There is currently some confusion between nested and L1 GPAs.  The
      assignment to "direct" in kvm_mmu_page_fault tries to fix that, but
      it is not enough.  What this patch does is fence off the MMIO cache
      completely when using shadow nested page tables, since we have neither
      a GVA nor an L1 GPA to put in the cache.  This also allows some
      simplifications in kvm_mmu_page_fault and FNAME(page_fault).
      The EPT misconfig likewise does not have an L1 GPA to pass to
      kvm_io_bus_write, so that must be skipped for guest mode.
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      [Changed comment to say "GPAs" instead of "L1's physical addresses", as
       per David's review. - Radim]
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
  24. 11 Aug, 2017 1 commit