- 21 Apr, 2020 7 commits
-
-
Sean Christopherson authored
Introduce a new "extended register" type, EXIT_INFO_2 (to pair with the nomenclature in .get_exit_info()), and use it to cache VMX's vmcs.EXIT_INTR_INFO. Drop a comment in vmx_recover_nmi_blocking() that is obsoleted by the generic caching mechanism. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-6-sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Introduce a new "extended register" type, EXIT_INFO_1 (to pair with the nomenclature in .get_exit_info()), and use it to cache VMX's vmcs.EXIT_QUALIFICATION. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-5-sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Drop the call to vmx_segment_cache_clear() in vmx_switch_vmcs() now that the entire register cache is reset when switching the active VMCS, e.g. vmx_segment_cache_test_set() will reset the segment cache due to VCPU_EXREG_SEGMENTS being unavailable. Move vmx_segment_cache_clear() to vmx.c now that it's no longer invoked by the nested code. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-4-sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Reset the per-vCPU available and dirty register masks when switching between vmcs01 and vmcs02, as the masks track state relative to the current VMCS. The stale masks don't cause problems in the current code base because the registers are either unconditionally written on nested transitions or, in the case of segment registers, have an additional tracker that is manually reset. Note, by dropping (previously implicitly, now explicitly) the dirty mask when switching the active VMCS, KVM is technically losing writes to the associated fields. But, the only regs that can be dirtied (RIP, RSP and PDPTRs) are unconditionally written on nested transitions, e.g. explicit writeback is a waste of cycles, and a WARN_ON would be rather pointless. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200415203454.8296-3-sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Defer reloading L1's APIC page by logging the need for a reload and processing it during nested VM-Exit instead of unconditionally reloading the APIC page on nested VM-Exit. This eliminates a TLB flush on the majority of VM-Exits as the APIC page rarely needs to be reloaded. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-28-sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Move vmx_flush_tlb() to vmx.c and make it non-inline static now that all its callers live in vmx.c. No functional change intended. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-19-sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Drop @invalidate_gpa from ->tlb_flush() and kvm_vcpu_flush_tlb() now that all callers pass %true for said param, or ignore the param (SVM has an internal call to svm_flush_tlb() in svm_flush_tlb_guest that somewhat arbitrarily passes %false). Remove __vmx_flush_tlb() as it is no longer used. No functional change intended. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-17-sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 20 Apr, 2020 1 commit
-
-
Sean Christopherson authored
Remove the INVVPID capabilities checks from vpid_sync_vcpu_single() and vpid_sync_vcpu_global() now that all callers ensure the INVVPID variant is supported. Note, in some cases the guarantee is provided in concert with hardware_setup(), which enables VPID if and only if at least of invvpid_single() or invvpid_global() is supported. Drop the WARN_ON_ONCE() from vmx_flush_tlb() as vpid_sync_vcpu_single() will trigger a WARN() on INVVPID failure, i.e. if SINGLE_CONTEXT isn't supported. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-13-sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 15 Apr, 2020 1 commit
-
-
Sean Christopherson authored
Flush all EPTP/VPID contexts if a TLB flush _may_ have been triggered by a remote or deferred TLB flush, i.e. by KVM_REQ_TLB_FLUSH. Remote TLB flushes require all contexts to be invalidated, not just the active contexts, e.g. all mappings in all contexts for a given HVA need to be invalidated on a mmu_notifier invalidation. Similarly, the instigator of the deferred TLB flush may be expecting all contexts to be flushed, e.g. vmx_vcpu_load_vmcs(). Without nested VMX, flushing only the current EPTP/VPID context isn't problematic because KVM uses a constant VPID for each vCPU, and mmu_alloc_direct_roots() all but guarantees KVM will use a single EPTP for L1. In the rare case where a different EPTP is created or reused, KVM (currently) unconditionally flushes the new EPTP context prior to entering the guest. With nested VMX, KVM conditionally uses a different VPID for L2, and unconditionally uses a different EPTP for L2. Because KVM doesn't _intentionally_ guarantee L2's EPTP/VPID context is flushed on nested VM-Enter, it'd be possible for a malicious L1 to attack the host and/or different VMs by exploiting the lack of flushing for L2. 1) Launch nested guest from malicious L1. 2) Nested VM-Enter to L2. 3) Access target GPA 'g'. CPU inserts TLB entry tagged with L2's ASID mapping 'g' to host PFN 'x'. 2) Nested VM-Exit to L1. 3) L1 triggers kernel same-page merging (ksm) by duplicating/zeroing the page for PFN 'x'. 4) Host kernel merges PFN 'x' with PFN 'y', i.e. unmaps PFN 'x' and remaps the page to PFN 'y'. mmu_notifier sends invalidate command, KVM flushes TLB only for L1's ASID. 4) Host kernel reallocates PFN 'x' to some other task/guest. 5) Nested VM-Enter to L2. KVM does not invalidate L2's EPTP or VPID. 6) L2 accesses GPA 'g' and gains read/write access to PFN 'x' via its stale TLB entry. However, current KVM unconditionally flushes L1's EPTP/VPID context on nested VM-Exit. But, that behavior is mostly unintentional, KVM doesn't go out of its way to flush EPTP/VPID on nested VM-Enter/VM-Exit, rather a TLB flush is guaranteed to occur prior to re-entering L1 due to __kvm_mmu_new_cr3() always being called with skip_tlb_flush=false. On nested VM-Enter, this happens via kvm_init_shadow_ept_mmu() (nested EPT enabled) or in nested_vmx_load_cr3() (nested EPT disabled). On nested VM-Exit it occurs via nested_vmx_load_cr3(). This also fixes a bug where a deferred TLB flush in the context of L2, with EPT disabled, would flush L1's VPID instead of L2's VPID, as vmx_flush_tlb() flushes L1's VPID regardless of is_guest_mode(). Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Ben Gardon <bgardon@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Junaid Shahid <junaids@google.com> Cc: Liran Alon <liran.alon@oracle.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: John Haxby <john.haxby@oracle.com> Reviewed-by:
Liran Alon <liran.alon@oracle.com> Fixes: efebf0aa ("KVM: nVMX: Do not flush TLB on L1<->L2 transitions if L1 uses VPID and EPT") Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200320212833.3507-2-sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 23 Mar, 2020 1 commit
-
-
Sean Christopherson authored
Subsume loaded_vmcs_init() into alloc_loaded_vmcs(), its only remaining caller, and drop the VMCLEAR on the shadow VMCS, which is guaranteed to be NULL. loaded_vmcs_init() was previously used by loaded_vmcs_clear(), but loaded_vmcs_clear() also subsumed loaded_vmcs_init() to properly handle smp_wmb() with respect to VMCLEAR. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200321193751.24985-3-sean.j.christopherson@intel.com> Reviewed-by:
Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 16 Mar, 2020 3 commits
-
-
Paolo Bonzini authored
The set_cr3 callback is not setting the guest CR3, it is setting the root of the guest page tables, either shadow or two-dimensional. To make this clearer as well as to indicate that the MMU calls it via kvm_mmu_load_cr3, rename it to load_mmu_pgd. Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Move host_efer to common x86 code and use it for CPUID's is_efer_nx() to avoid constantly re-reading the MSR. No functional change intended. Reviewed-by:
Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Add helpers to query which of the (two) supported PT modes is active. The primary motivation is to help document that there is a third PT mode (host-only) that's currently not supported by KVM. As is, it's not obvious that PT_MODE_SYSTEM != !PT_MODE_HOST_GUEST and vice versa, e.g. that "pt_mode == PT_MODE_SYSTEM" and "pt_mode != PT_MODE_HOST_GUEST" are two distinct checks. No functional change intended. Reviewed-by:
Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 23 Feb, 2020 1 commit
-
-
Oliver Upton authored
Since commit 5f3d45e7 ("kvm/x86: add support for MONITOR_TRAP_FLAG"), KVM has allowed an L1 guest to use the monitor trap flag processor-based execution control for its L2 guest. KVM simply forwards any MTF VM-exits to the L1 guest, which works for normal instruction execution. However, when KVM needs to emulate an instruction on the behalf of an L2 guest, the monitor trap flag is not emulated. Add the necessary logic to kvm_skip_emulated_instruction() to synthesize an MTF VM-exit to L1 upon instruction emulation for L2. Fixes: 5f3d45e7 ("kvm/x86: add support for MONITOR_TRAP_FLAG") Signed-off-by:
Oliver Upton <oupton@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 17 Feb, 2020 1 commit
-
-
Benjamin Thiel authored
.. in order to fix a -Wmissing-prototypes warning. No functional change. Signed-off-by:
Benjamin Thiel <b.thiel@posteo.de> Signed-off-by:
Borislav Petkov <bp@suse.de> Cc: kvm@vger.kernel.org Link: https://lkml.kernel.org/r/20200123172945.7235-1-b.thiel@posteo.de
-
- 13 Jan, 2020 1 commit
-
-
Sean Christopherson authored
As pointed out by Boris, the defines for bits in IA32_FEATURE_CONTROL are quite a mouthful, especially the VMX bits which must differentiate between enabling VMX inside and outside SMX (TXT) operation. Rename the MSR and its bit defines to abbreviate FEATURE_CONTROL as FEAT_CTL to make them a little friendlier on the eyes. Arguably, the MSR itself should keep the full IA32_FEATURE_CONTROL name to match Intel's SDM, but a future patch will add a dedicated Kconfig, file and functions for the MSR. Using the full name for those assets is rather unwieldy, so bite the bullet and use IA32_FEAT_CTL so that its nomenclature is consistent throughout the kernel. Opportunistically, fix a few other annoyances with the defines: - Relocate the bit defines so that they immediately follow the MSR define, e.g. aren't mistaken as belonging to MISC_FEATURE_CONTROL. - Add whitespace around the block of feature control defines to make it clear they're all related. - Use BIT() instead of manually encoding the bit shift. - Use "VMX" instead of "VMXON" to match the SDM. - Append "_ENABLED" to the LMCE (Local Machine Check Exception) bit to be consistent with the kernel's verbiage used for all other feature control bits. Note, the SDM refers to the LMCE bit as LMCE_ON, likely to differentiate it from IA32_MCG_EXT_CTL.LMCE_EN. Ignore the (literal) one-off usage of _ON, the SDM is simply "wrong". Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20191221044513.21680-2-sean.j.christopherson@intel.com
-
- 04 Dec, 2019 1 commit
-
-
Jim Mattson authored
We will never need more guest_msrs than there are indices in vmx_msr_index. Thus, at present, the guest_msrs array will not exceed 168 bytes. Signed-off-by:
Jim Mattson <jmattson@google.com> Reviewed-by:
Liran Alon <liran.alon@oracle.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 15 Nov, 2019 3 commits
-
-
Aaron Lewis authored
The L1 hypervisor may include the IA32_TIME_STAMP_COUNTER MSR in the vmcs12 MSR VM-exit MSR-store area as a way of determining the highest TSC value that might have been observed by L2 prior to VM-exit. The current implementation does not capture a very tight bound on this value. To tighten the bound, add the IA32_TIME_STAMP_COUNTER MSR to the vmcs02 VM-exit MSR-store area whenever it appears in the vmcs12 VM-exit MSR-store area. When L0 processes the vmcs12 VM-exit MSR-store area during the emulation of an L2->L1 VM-exit, special-case the IA32_TIME_STAMP_COUNTER MSR, using the value stored in the vmcs02 VM-exit MSR-store area to derive the value to be stored in the vmcs12 VM-exit MSR-store area. Reviewed-by:
Liran Alon <liran.alon@oracle.com> Reviewed-by:
Jim Mattson <jmattson@google.com> Signed-off-by:
Aaron Lewis <aaronlewis@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Aaron Lewis authored
Rename NR_AUTOLOAD_MSRS to NR_LOADSTORE_MSRS. This needs to be done due to the addition of the MSR-autostore area that will be added in a future patch. After that the name AUTOLOAD will no longer make sense. Reviewed-by:
Jim Mattson <jmattson@google.com> Signed-off-by:
Aaron Lewis <aaronlewis@google.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Liran Alon authored
When L1 don't use TPR-Shadow to run L2, L0 configures vmcs02 without TPR-Shadow and install intercepts on CR8 access (load and store). If L1 do not intercept L2 CR8 access, L0 intercepts on those accesses will emulate load/store on L1's LAPIC TPR. If in this case L2 lowers TPR such that there is now an injectable interrupt to L1, apic_update_ppr() will request a KVM_REQ_EVENT which will trigger a call to update_cr8_intercept() to update TPR-Threshold to highest pending IRR priority. However, this update to TPR-Threshold is done while active vmcs is vmcs02 instead of vmcs01. Thus, when later at some point L0 will emulate an exit from L2 to L1, L1 will still run with high TPR-Threshold. This will result in every VMEntry to L1 to immediately exit on TPR_BELOW_THRESHOLD and continue to do so infinitely until some condition will cause KVM_REQ_EVENT to be set. (Note that TPR_BELOW_THRESHOLD exit handler do not set KVM_REQ_EVENT until apic_update_ppr() will notice a new injectable interrupt for PPR) To fix this issue, change update_cr8_intercept() such that if L2 lowers L1's TPR in a way that requires to lower L1's TPR-Threshold, save update to TPR-Threshold and apply it to vmcs01 when L0 emulates an exit from L2 to L1. Reviewed-by:
Joao Martins <joao.m.martins@oracle.com> Signed-off-by:
Liran Alon <liran.alon@oracle.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 12 Nov, 2019 2 commits
-
-
Joao Martins authored
Streamline the PID.PIR check and change its call sites to use the newly added helper. Suggested-by:
Liran Alon <liran.alon@oracle.com> Signed-off-by:
Joao Martins <joao.m.martins@oracle.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Joao Martins authored
When vCPU enters block phase, pi_pre_block() inserts vCPU to a per pCPU linked list of all vCPUs that are blocked on this pCPU. Afterwards, it changes PID.NV to POSTED_INTR_WAKEUP_VECTOR which its handler (wakeup_handler()) is responsible to kick (unblock) any vCPU on that linked list that now has pending posted interrupts. While vCPU is blocked (in kvm_vcpu_block()), it may be preempted which will cause vmx_vcpu_pi_put() to set PID.SN. If later the vCPU will be scheduled to run on a different pCPU, vmx_vcpu_pi_load() will clear PID.SN but will also *overwrite PID.NDST to this different pCPU*. Instead of keeping it with original pCPU which vCPU had entered block phase on. This results in an issue because when a posted interrupt is delivered, as the wakeup_handler() will be executed and fail to find blocked vCPU on its per pCPU linked list of all vCPUs that are blocked on this pCPU. Which is due to the vCPU being placed on a *different* per pCPU linked list i.e. the original pCPU in which it entered block phase. The regression is introduced by commit c112b5f5 ("KVM: x86: Recompute PID.ON when clearing PID.SN"). Therefore, partially revert it and reintroduce the condition in vmx_vcpu_pi_load() responsible for avoiding changing PID.NDST when loading a blocked vCPU. Fixes: c112b5f5 ("KVM: x86: Recompute PID.ON when clearing PID.SN") Tested-by:
Nathan Ni <nathan.ni@oracle.com> Co-developed-by:
Liran Alon <liran.alon@oracle.com> Signed-off-by:
Liran Alon <liran.alon@oracle.com> Signed-off-by:
Joao Martins <joao.m.martins@oracle.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 24 Sep, 2019 1 commit
-
-
Tao Xu authored
UMWAIT and TPAUSE instructions use 32bit IA32_UMWAIT_CONTROL at MSR index E1H to determines the maximum time in TSC-quanta that the processor can reside in either C0.1 or C0.2. This patch emulates MSR IA32_UMWAIT_CONTROL in guest and differentiate IA32_UMWAIT_CONTROL between host and guest. The variable mwait_control_cached in arch/x86/kernel/cpu/umwait.c caches the MSR value, so this patch uses it to avoid frequently rdmsr of IA32_UMWAIT_CONTROL. Co-developed-by:
Jingqi Liu <jingqi.liu@intel.com> Signed-off-by:
Jingqi Liu <jingqi.liu@intel.com> Signed-off-by:
Tao Xu <tao3.xu@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 10 Sep, 2019 1 commit
-
-
Peter Xu authored
The VMX ple_window is 32 bits wide, so logically it can overflow with an int. The module parameter is declared as unsigned int which is good, however the dynamic variable is not. Switching all the ple_window references to use unsigned int. The tracepoint changes will also affect SVM, but SVM is using an even smaller width (16 bits) so it's always fine. Suggested-by:
Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Peter Xu <peterx@redhat.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 18 Jun, 2019 15 commits
-
-
Sean Christopherson authored
Or: Don't re-initialize vmcs02's controls on every nested VM-Entry. VMWRITEs to the major VMCS controls are deceptively expensive. Intel CPUs with VMCS caching (Westmere and later) also optimize away consistency checks on VM-Entry, i.e. skip consistency checks if the relevant fields have not changed since the last successful VM-Entry (of the cached VMCS). Because uops are a precious commodity, uCode's dirty VMCS field tracking isn't as precise as software would prefer. Notably, writing any of the major VMCS fields effectively marks the entire VMCS dirty, i.e. causes the next VM-Entry to perform all consistency checks, which consumes several hundred cycles. Zero out the controls' shadow copies during VMCS allocation and use the optimized setter when "initializing" controls. While this technically affects both non-nested and nested virtualization, nested virtualization is the primary beneficiary as avoid VMWRITEs when prepare vmcs02 allows hardware to optimizie away consistency checks. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
... now that the shadow copies are per-VMCS. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
... to pave the way for not preserving the shadow copies across switches between vmcs01 and vmcs02, and eventually to avoid VMWRITEs to vmcs02 when the desired value is unchanged across nested VM-Enters. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Prepare to shadow all major control fields on a per-VMCS basis, which allows KVM to avoid costly VMWRITEs when switching between vmcs01 and vmcs02. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Prepare to shadow all major control fields on a per-VMCS basis, which allows KVM to avoid VMREADs when switching between vmcs01 and vmcs02, and more importantly can eliminate costly VMWRITEs to controls when preparing vmcs02. Shadowing exec controls also saves a VMREAD when opening virtual INTR/NMI windows, yay... Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Prepare to shadow all major control fields on a per-VMCS basis, which allows KVM to avoid costly VMWRITEs when switching between vmcs01 and vmcs02. Shadowing pin controls also allows a future patch to remove the per-VMCS 'hv_timer_armed' flag, as the shadow copy is a superset of said flag. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
... to pave the way for shadowing all (five) major VMCS control fields without massive amounts of error prone copy+paste+modify. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
KVM provides a module parameter to allow disabling virtual NMI support to simplify testing (hardware *without* virtual NMI support is hard to come by but it does have users). When preparing vmcs02, use the accessor for pin controls to ensure that the module param is respected for nested guests. Opportunistically swap the order of applying L0's and L1's pin controls to better align with other controls and to prepare for a future patche that will ignore L1's, but not L0's, preemption timer flag. Fixes: d02fcf50 ("kvm: vmx: Allow disabling virtual NMI support") Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
When switching between vmcs01 and vmcs02, there is no need to update state tracking for values that aren't tied to any particular VMCS as the per-vCPU values are already up-to-date (vmx_switch_vmcs() can only be called when the vCPU is loaded). Avoiding the update eliminates a RDMSR, and potentially a RDPKRU and posted-interrupt update (cmpxchg64() and more). Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
When switching between vmcs01 and vmcs02, KVM isn't actually switching between guest and host. If guest state is already loaded (the likely, if not guaranteed, case), keep the guest state loaded and manually swap the loaded_cpu_state pointer after propagating saved host state to the new vmcs0{1,2}. Avoiding the switch between guest and host reduces the latency of switching between vmcs01 and vmcs02 by several hundred cycles, and reduces the roundtrip time of a nested VM by upwards of 1000 cycles. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
vmx->loaded_cpu_state can only be NULL or equal to vmx->loaded_vmcs, so change it to a bool. Because the direction of the bool is now the opposite of vmx->guest_msrs_dirty, change the direction of vmx->guest_msrs_dirty so that they match. Finally, do not imply that MSRs have to be reloaded when vmx->guest_state_loaded is false; instead, set vmx->guest_msrs_ready to false explicitly in vmx_prepare_switch_to_host. Cc: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Many guest fields are rarely read (or written) by VMMs, i.e. likely aren't accessed between runs of a nested VMCS. Delay pulling rarely accessed guest fields from vmcs02 until they are VMREAD or until vmcs12 is dirtied. The latter case is necessary because nested VM-Entry will consume all manner of fields when vmcs12 is dirty, e.g. for consistency checks. Note, an alternative to synchronizing all guest fields on VMREAD would be to read *only* the field being accessed, but switching VMCS pointers is expensive and odds are good if one guest field is being accessed then others will soon follow, or that vmcs12 will be dirtied due to a VMWRITE (see above). And the full synchronization results in slightly cleaner code. Note, although GUEST_PDPTRs are relevant only for a 32-bit PAE guest, they are accessed quite frequently for said guests, and a separate patch is in flight to optimize away GUEST_PDTPR synchronziation for non-PAE guests. Skipping rarely accessed guest fields reduces the latency of a nested VM-Exit by ~200 cycles. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Nested virtualization involves copying data between many different types of VMCSes, e.g. vmcs02, vmcs12, shadow VMCS and eVMCS. Rename a variety of functions and flags to document both the source and destination of each sync. Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Sean Christopherson authored
Although the kernel may use multiple IDTs, KVM should only ever see the "real" IDT, e.g. the early init IDT is long gone by the time KVM runs and the debug stack IDT is only used for small windows of time in very specific flows. Before commit a547c6db ("KVM: VMX: Enable acknowledge interupt on vmexit"), the kernel's IDT base was consumed by KVM only when setting constant VMCS state, i.e. to set VMCS.HOST_IDTR_BASE. Because constant host state is done once per vCPU, there was ostensibly no need to cache the kernel's IDT base. When support for "ack interrupt on exit" was introduced, KVM added a second consumer of the IDT base as handling already-acked interrupts requires directly calling the interrupt handler, i.e. KVM uses the IDT base to find the address of the handler. Because interrupts are a fast path, KVM cached the IDT base to avoid having to VMREAD HOST_IDTR_BASE. Presumably, the IDT base was cached on a per-vCPU basis simply because the existing code grabbed the IDT base on a per-vCPU (VMCS) basis. Note, all post-boot IDTs use the same handlers for external interrupts, i.e. the "ack interrupt on exit" use of the IDT base would be unaffected even if the cached IDT somehow did not match the current IDT. And as for the original use case of setting VMCS.HOST_IDTR_BASE, if any of the above analysis is wrong then KVM has had a bug since the beginning of time since KVM has effectively been caching the IDT at vCPU creation since commit a8b732ca01c ("[PATCH] kvm: userspace interface"). Signed-off-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
Paolo Bonzini authored
Make it available to AMD hosts as well, just in case someone is trying to use an Intel processor's CPUID setup. Suggested-by:
Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-
- 24 May, 2019 1 commit
-
-
Yi Wang authored
We get a warning when build kernel W=1: arch/x86/kvm/vmx/vmx.c:6365:6: warning: no previous prototype for ‘vmx_update_host_rsp’ [-Wmissing-prototypes] void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp) Add the missing declaration to fix this. Signed-off-by:
Yi Wang <wang.yi59@zte.com.cn> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com>
-