  15 Apr, 2020 (1 commit)
    • KVM: VMX: Flush all EPTP/VPID contexts on remote TLB flush · e8eff282
      Sean Christopherson authored
      
      Flush all EPTP/VPID contexts if a TLB flush _may_ have been triggered by
      a remote or deferred TLB flush, i.e. by KVM_REQ_TLB_FLUSH.  Remote TLB
      flushes require all contexts to be invalidated, not just the active
      contexts, e.g. all mappings in all contexts for a given HVA need to be
      invalidated on a mmu_notifier invalidation.  Similarly, the instigator
      of the deferred TLB flush may be expecting all contexts to be flushed,
      e.g. vmx_vcpu_load_vmcs().
      
      Without nested VMX, flushing only the current EPTP/VPID context isn't
      problematic because KVM uses a constant VPID for each vCPU, and
      mmu_alloc_direct_roots() all but guarantees KVM will use a single EPTP
      for L1.  In the rare case where a different EPTP is created or reused,
      KVM (currently) unconditionally flushes the new EPTP context prior to
      entering the guest.
      
      With nested VMX, KVM conditionally uses a different VPID for L2, and
      unconditionally uses a different EPTP for L2.  Because KVM doesn't
      _intentionally_ guarantee L2's EPTP/VPID context is flushed on nested
      VM-Enter, it'd be possible for a malicious L1 to attack the host and/or
      different VMs by exploiting the lack of flushing for L2.
      
        1) Launch nested guest from malicious L1.

        2) Nested VM-Enter to L2.

        3) Access target GPA 'g'.  CPU inserts TLB entry tagged with L2's ASID
           mapping 'g' to host PFN 'x'.

        4) Nested VM-Exit to L1.

        5) L1 triggers kernel same-page merging (ksm) by duplicating/zeroing
           the page for PFN 'x'.

        6) Host kernel merges PFN 'x' with PFN 'y', i.e. unmaps PFN 'x' and
           remaps the page to PFN 'y'.  mmu_notifier sends invalidate command,
           KVM flushes TLB only for L1's ASID.

        7) Host kernel reallocates PFN 'x' to some other task/guest.

        8) Nested VM-Enter to L2.  KVM does not invalidate L2's EPTP or VPID.

        9) L2 accesses GPA 'g' and gains read/write access to PFN 'x' via its
           stale TLB entry.
      
      However, current KVM happens to unconditionally flush L1's EPTP/VPID
      context on nested VM-Exit.  That behavior is mostly unintentional:
      KVM doesn't go out of its way to flush EPTP/VPID on nested
      VM-Enter/VM-Exit; rather, a TLB flush is guaranteed to occur prior
      to re-entering L1 because __kvm_mmu_new_cr3() is always called with
      skip_tlb_flush=false.  On nested VM-Enter, the flush happens via
      kvm_init_shadow_ept_mmu() (nested EPT enabled) or via
      nested_vmx_load_cr3() (nested EPT disabled).  On nested VM-Exit it
      occurs via nested_vmx_load_cr3().
      
      This also fixes a bug where a deferred TLB flush in the context of L2,
      with EPT disabled, would flush L1's VPID instead of L2's VPID, as
      vmx_flush_tlb() flushes L1's VPID regardless of is_guest_mode().
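      A minimal sketch of the resulting flush logic (hedged: it reflects
      the intent of the fix rather than the literal patch, and assumes
      KVM's existing ept_sync_global()/vpid_sync_vcpu_global() helpers):

        static void vmx_flush_tlb_all_contexts(struct kvm_vcpu *vcpu)
        {
                /*
                 * Use the "all-context" variants of INVEPT/INVVPID so
                 * that every EPTP/VPID context is invalidated, not just
                 * the currently active one.
                 */
                if (enable_ept)
                        ept_sync_global();
                else if (enable_vpid)
                        vpid_sync_vcpu_global();
        }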
      
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Junaid Shahid <junaids@google.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: John Haxby <john.haxby@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Fixes: efebf0aa ("KVM: nVMX: Do not flush TLB on L1<->L2 transitions if L1 uses VPID and EPT")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200320212833.3507-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  23 Feb, 2020 (1 commit)
    • KVM: nVMX: Emulate MTF when performing instruction emulation · 5ef8acbd
      Oliver Upton authored
      Since commit 5f3d45e7 ("kvm/x86: add support for
      MONITOR_TRAP_FLAG"), KVM has allowed an L1 guest to use the monitor trap
      flag processor-based execution control for its L2 guest. KVM simply
      forwards any MTF VM-exits to the L1 guest, which works for normal
      instruction execution.
      
      However, when KVM needs to emulate an instruction on behalf of an L2
      guest, the monitor trap flag is not emulated.  Add the necessary
      logic to kvm_skip_emulated_instruction() to synthesize an MTF VM-exit
      to L1 upon instruction emulation for L2.
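      A hedged sketch of the idea (field and helper names approximate the
      actual patch): when an emulated instruction is skipped while the
      vCPU is running as L2, record that an MTF VM-exit is pending so it
      can be synthesized to L1 at the next opportunity:

        static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
        {
                struct vcpu_vmx *vmx = to_vmx(vcpu);

                /*
                 * Pend an MTF VM-exit to L1 iff L1 enabled the monitor
                 * trap flag for the L2 it is currently running.
                 */
                vmx->nested.mtf_pending = is_guest_mode(vcpu) &&
                                          nested_cpu_has_mtf(get_vmcs12(vcpu));
        }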
      
      Fixes: 5f3d45e7 ("kvm/x86: add support for MONITOR_TRAP_FLAG")
      Signed-off-by: Oliver Upton <oupton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  13 Jan, 2020 (1 commit)
    • x86/msr-index: Clean up bit defines for IA32_FEATURE_CONTROL MSR · 32ad73db
      Sean Christopherson authored
      
      As pointed out by Boris, the defines for bits in IA32_FEATURE_CONTROL
      are quite a mouthful, especially the VMX bits which must differentiate
      between enabling VMX inside and outside SMX (TXT) operation.  Rename the
      MSR and its bit defines to abbreviate FEATURE_CONTROL as FEAT_CTL to
      make them a little friendlier on the eyes.
      
      Arguably, the MSR itself should keep the full IA32_FEATURE_CONTROL name
      to match Intel's SDM, but a future patch will add a dedicated Kconfig
      option, file, and functions for the MSR.  Using the full name for
      those assets is
      rather unwieldy, so bite the bullet and use IA32_FEAT_CTL so that its
      nomenclature is consistent throughout the kernel.
      
      Opportunistically, fix a few other annoyances with the defines (a
      sketch of the resulting defines follows the list):
      
        - Relocate the bit defines so that they immediately follow the MSR
          define, e.g. aren't mistaken as belonging to MISC_FEATURE_CONTROL.
        - Add whitespace around the block of feature control defines to make
          it clear they're all related.
        - Use BIT() instead of manually encoding the bit shift.
        - Use "VMX" instead of "VMXON" to match the SDM.
        - Append "_ENABLED" to the LMCE (Local Machine Check Exception) bit to
          be consistent with the kernel's verbiage used for all other feature
          control bits.  Note, the SDM refers to the LMCE bit as LMCE_ON,
          likely to differentiate it from IA32_MCG_EXT_CTL.LMCE_EN.  Ignore
          the (literal) one-off usage of _ON, the SDM is simply "wrong".
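      A hedged reconstruction of the resulting defines (bit positions are
      per the SDM; the exact names added by the patch may differ slightly):

        #define MSR_IA32_FEAT_CTL                        0x0000003a

        #define FEAT_CTL_LOCKED                          BIT(0)
        #define FEAT_CTL_VMX_ENABLED_INSIDE_SMX          BIT(1)
        #define FEAT_CTL_VMX_ENABLED_OUTSIDE_SMX         BIT(2)
        #define FEAT_CTL_LMCE_ENABLED                    BIT(20)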
      
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20191221044513.21680-2-sean.j.christopherson@intel.com
  15 Nov, 2019 (3 commits)
    • KVM: nVMX: Add support for capturing highest observable L2 TSC · 662f1d1d
      Aaron Lewis authored
      
      The L1 hypervisor may include the IA32_TIME_STAMP_COUNTER MSR in the
      vmcs12 MSR VM-exit MSR-store area as a way of determining the highest
      TSC value that might have been observed by L2 prior to VM-exit. The
      current implementation does not capture a very tight bound on this
      value.  To tighten the bound, add the IA32_TIME_STAMP_COUNTER MSR to the
      vmcs02 VM-exit MSR-store area whenever it appears in the vmcs12 VM-exit
      MSR-store area.  When L0 processes the vmcs12 VM-exit MSR-store area
      during the emulation of an L2->L1 VM-exit, special-case the
      IA32_TIME_STAMP_COUNTER MSR, using the value stored in the vmcs02
      VM-exit MSR-store area to derive the value to be stored in the vmcs12
      VM-exit MSR-store area.
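      A hedged sketch of the special case (the 'msr_autostore_tsc' field
      is an assumption; kvm_read_l1_tsc() is KVM's existing helper for
      converting a host TSC value into L1's view of the TSC):

        static u64 nested_vmx_get_vmexit_tsc(struct kvm_vcpu *vcpu)
        {
                struct vcpu_vmx *vmx = to_vmx(vcpu);

                /*
                 * The CPU stored the TSC into the vmcs02 VM-exit
                 * MSR-store area at the hardware VM-exit; adjust that
                 * value so L1 sees the highest TSC L2 may have observed.
                 */
                return kvm_read_l1_tsc(vcpu, vmx->nested.msr_autostore_tsc);
        }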
      
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Aaron Lewis <aaronlewis@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: vmx: Rename NR_AUTOLOAD_MSRS to NR_LOADSTORE_MSRS · 7cfe0526
      Aaron Lewis authored
      
      Rename NR_AUTOLOAD_MSRS to NR_LOADSTORE_MSRS.  This is needed because
      a future patch will add an MSR-autostore area, after which the name
      AUTOLOAD will no longer make sense.
      
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Aaron Lewis <aaronlewis@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Update vmcs01 TPR_THRESHOLD if L2 changed L1 TPR · 02d496cf
      Liran Alon authored
      
      When L1 doesn't use TPR-Shadow to run L2, L0 configures vmcs02
      without TPR-Shadow and installs intercepts on CR8 accesses (load and
      store).

      If L1 also doesn't intercept L2's CR8 accesses, L0's intercepts will
      emulate the load/store on L1's LAPIC TPR.  If L2 then lowers the TPR
      such that there is now an injectable interrupt to L1,
      apic_update_ppr() will request a KVM_REQ_EVENT, which will trigger a
      call to update_cr8_intercept() to update TPR-Threshold to the
      highest pending IRR priority.
      
      However, this update to TPR-Threshold is done while the active vmcs
      is vmcs02 instead of vmcs01.  Thus, when L0 later emulates an exit
      from L2 to L1, L1 will still run with a high TPR-Threshold.  As a
      result, every VM-Entry to L1 immediately exits on TPR_BELOW_THRESHOLD
      and continues to do so indefinitely, until some condition causes
      KVM_REQ_EVENT to be set.
      (Note that the TPR_BELOW_THRESHOLD exit handler does not set
      KVM_REQ_EVENT until apic_update_ppr() notices a new injectable
      interrupt for the PPR.)
      
      To fix this issue, change update_cr8_intercept() such that if L2
      lowers L1's TPR in a way that requires lowering L1's TPR-Threshold,
      the update to TPR-Threshold is saved and applied to vmcs01 when L0
      emulates an exit from L2 to L1.
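      A hedged sketch of this deferral (the wrapper name and the
      'l1_tpr_threshold' field are assumptions; -1 denotes "no pending
      update"):

        static void vmx_set_tpr_threshold(struct kvm_vcpu *vcpu, int threshold)
        {
                struct vcpu_vmx *vmx = to_vmx(vcpu);

                if (is_guest_mode(vcpu))
                        /* vmcs02 is active: stash the update for vmcs01. */
                        vmx->nested.l1_tpr_threshold = threshold;
                else
                        vmcs_write32(TPR_THRESHOLD, threshold);
        }

        /* Applied while emulating the L2->L1 VM-exit, with vmcs01 active: */
        if (vmx->nested.l1_tpr_threshold != -1)
                vmcs_write32(TPR_THRESHOLD, vmx->nested.l1_tpr_threshold);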
      
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  12 Nov, 2019 (2 commits)
    • KVM: VMX: Introduce pi_is_pir_empty() helper · 29881b6e
      Joao Martins authored
      
      Streamline the PID.PIR check and change its call sites to use
      the newly added helper.
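      A hedged reconstruction of what such a helper plausibly looks like
      (pi_desc->pir and NR_VECTORS are KVM's existing posted-interrupt
      descriptor fields):

        static inline bool pi_is_pir_empty(struct pi_desc *pi_desc)
        {
                /* True iff no posted-interrupt request bit is set. */
                return bitmap_empty((unsigned long *)pi_desc->pir, NR_VECTORS);
        }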
      
      Suggested-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Do not change PID.NDST when loading a blocked vCPU · 132194ff
      Joao Martins authored
      When a vCPU enters the block phase, pi_pre_block() inserts the vCPU
      into a per-pCPU linked list of all vCPUs blocked on that pCPU.
      Afterwards, it changes PID.NV to POSTED_INTR_WAKEUP_VECTOR, whose
      handler (wakeup_handler()) is responsible for kicking (unblocking)
      any vCPU on that linked list that now has pending posted interrupts.

      While the vCPU is blocked (in kvm_vcpu_block()), it may be preempted,
      which will cause vmx_vcpu_pi_put() to set PID.SN.  If the vCPU is
      later scheduled to run on a different pCPU, vmx_vcpu_pi_load() will
      clear PID.SN but will also *overwrite PID.NDST to this different
      pCPU*, instead of keeping it set to the original pCPU on which the
      vCPU entered the block phase.
      
      This causes an issue: when a posted interrupt is delivered,
      wakeup_handler() executes but fails to find the blocked vCPU on its
      per-pCPU linked list, because the vCPU is queued on a *different*
      per-pCPU linked list, i.e. that of the original pCPU on which it
      entered the block phase.
      
      The regression was introduced by commit c112b5f5 ("KVM: x86:
      Recompute PID.ON when clearing PID.SN").  Therefore, partially revert
      it and reintroduce the condition in vmx_vcpu_pi_load() that avoids
      changing PID.NDST when loading a blocked vCPU.
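      A hedged sketch of the reintroduced guard in vmx_vcpu_pi_load() (the
      label and exact condition are approximations of the patch):

        /*
         * If PID.NV == POSTED_INTR_WAKEUP_VECTOR, the vCPU sits on the
         * blocked list of the pCPU recorded in PID.NDST, so only clear
         * PID.SN and leave PID.NDST untouched.
         */
        if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR || vcpu->cpu == cpu) {
                pi_clear_sn(pi_desc);
                goto after_clear_sn;
        }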
      
      Fixes: c112b5f5 ("KVM: x86: Recompute PID.ON when clearing PID.SN")
      Tested-by: Nathan Ni <nathan.ni@oracle.com>
      Co-developed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>