1. 23 Jun, 2020 1 commit
    • Sean Christopherson's avatar
      KVM: VMX: Stop context switching MSR_IA32_UMWAIT_CONTROL · bf09fb6c
      Sean Christopherson authored
      Remove support for context switching between the guest's and host's
      desired UMWAIT_CONTROL.  Propagating the guest's value to hardware isn't
      required for correct functionality, e.g. KVM intercepts reads and writes
      to the MSR, and the latency effects of the settings controlled by the
      MSR are not architecturally visible.
      
      As a general rule, KVM should not allow the guest to control power
      management settings unless explicitly enabled by userspace, e.g. see
      KVM_CAP_X86_DISABLE_EXITS.  E.g. Intel's SDM explicitly states that C0.2
      can improve the performance of SMT siblings.  A devious guest could
      disable C0.2 so as to improve the performance of their workloads at the
      detriment to workloads running in the host or on other VMs.
      
      Wholesale removal of UMWAIT_CONTROL context switching also fixes a race
      condition where updates from the host may cause KVM to enter the guest
      with the incorrect value.  Because updates are are propagated to all
      CPUs via IPI (SMP function callback), the value in hardware may be
      stale with respect to the cached value and KVM could enter the guest
      with the wrong value in hardware.  As above, the guest can't observe the
      bad value, but it's a weird and confusing wart in the implementation.
      
      Removal also fixes the unnecessary usage of VMX's atomic load/store MSR
      lists.  Using the lists is only necessary for MSRs that are required for
      correct functionality immediately upon VM-Enter/VM-Exit, e.g. EFER on
      old hardware, or for MSRs that need to-the-uop precision, e.g. perf
      related MSRs.  For UMWAIT_CONTROL, the effects are only visible in the
      kernel via TPAUSE/delay(), and KVM doesn't do any form of delay in
      vcpu_vmx_run().  Using the atomic lists is undesirable as they are more
      expensive than direct RDMSR/WRMSR.
      
      Furthermore, even if giving the guest control of the MSR is legitimate,
      e.g. in pass-through scenarios, it's not clear that the benefits would
      outweigh the overhead.  E.g. saving and restoring an MSR across a VMX
      roundtrip costs ~250 cycles, and if the guest diverged from the host
      that cost would be paid on every run of the guest.  In other words, if
      there is a legitimate use case then it should be enabled by a new
      per-VM capability.
      
      Note, KVM still needs to emulate MSR_IA32_UMWAIT_CONTROL so that it can
      correctly expose other WAITPKG features to the guest, e.g. TPAUSE,
      UMWAIT and UMONITOR.
      
      Fixes: 6e3ba4ab
      
       ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL")
      Cc: stable@vger.kernel.org
      Cc: Jingqi Liu <jingqi.liu@intel.com>
      Cc: Tao Xu <tao3.xu@intel.com>
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200623005135.10414-1-sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      bf09fb6c
  2. 17 Feb, 2020 1 commit
  3. 24 Sep, 2019 1 commit
  4. 12 Aug, 2019 1 commit
    • Fenghua Yu's avatar
      x86/umwait: Fix error handling in umwait_init() · e7409258
      Fenghua Yu authored
      
      
      Currently, failure of cpuhp_setup_state() is ignored and the syscore ops
      and the control interfaces can still be added even after the failure. But,
      this error handling will cause a few issues:
      
      1. The CPUs may have different values in the IA32_UMWAIT_CONTROL
         MSR because there is no way to roll back the control MSR on
         the CPUs which already set the MSR before the failure.
      
      2. If the sysfs interface is added successfully, there will be a mismatch
         between the global control value and the control MSR:
         - The interface shows the default global control value. But,
           the control MSR is not set to the value because the CPU online
           function, which is supposed to set the MSR to the value,
           is not installed.
         - If the sysadmin changes the global control value through
           the interface, the control MSR on all current online CPUs is
           set to the new value. But, the control MSR on newly onlined CPUs
           after the value change will not be set to the new value due to
           lack of the CPU online function.
      
      3. On resume from suspend/hibernation, the boot CPU restores the control
         MSR to the global control value through the syscore ops. But, the
         control MSR on all APs is not set due to lake of the CPU online
         function.
      
      To solve the issues and enforce consistent behavior on the failure
      of the CPU hotplug setup, make the following changes:
      
      1. Cache the original control MSR value which is configured by
         hardware or BIOS before kernel boot. This value is likely to
         be 0. But it could be a different number as well. Cache the
         control MSR only once before the MSR is changed.
      2. Add the CPU offline function so that the MSR is restored to the
         original control value on all CPUs on the failure.
      3. On the failure, exit from cpumait_init() so that the syscore ops
         and the control interfaces are not added.
      Reported-by: default avatarValdis Kletnieks <valdis.kletnieks@vt.edu>
      Suggested-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarFenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/1565401237-60936-1-git-send-email-fenghua.yu@intel.com
      e7409258
  5. 23 Jun, 2019 3 commits
    • Fenghua Yu's avatar
      x86/umwait: Add sysfs interface to control umwait maximum time · bd9a0c97
      Fenghua Yu authored
      
      
      IA32_UMWAIT_CONTROL[31:2] determines the maximum time in TSC-quanta
      that processor can stay in C0.1 or C0.2. A zero value means no maximum
      time.
      
      Each instruction sets its own deadline in the instruction's implicit
      input EDX:EAX value. The instruction wakes up if the time-stamp counter
      reaches or exceeds the specified deadline, or the umwait maximum time
      expires, or a store happens in the monitored address range in umwait.
      
      The administrator can write an unsigned 32-bit number to
      /sys/devices/system/cpu/umwait_control/max_time to change the default
      value. Note that a value of zero means there is no limit. The lower two
      bits of the value must be zero.
      
      [ tglx: Simplify the write function. Massage changelog ]
      Signed-off-by: default avatarFenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarAshok Raj <ashok.raj@intel.com>
      Reviewed-by: default avatarTony Luck <tony.luck@intel.com>
      Cc: "Borislav Petkov" <bp@alien8.de>
      Cc: "H Peter Anvin" <hpa@zytor.com>
      Cc: "Andy Lutomirski" <luto@kernel.org>
      Cc: "Peter Zijlstra" <peterz@infradead.org>
      Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
      Link: https://lkml.kernel.org/r/1560994438-235698-5-git-send-email-fenghua.yu@intel.com
      bd9a0c97
    • Fenghua Yu's avatar
      x86/umwait: Add sysfs interface to control umwait C0.2 state · ff4b353f
      Fenghua Yu authored
      
      
      C0.2 state in umwait and tpause instructions can be enabled or disabled
      on a processor through IA32_UMWAIT_CONTROL MSR register.
      
      By default, C0.2 is enabled and the user wait instructions results in
      lower power consumption with slower wakeup time.
      
      But in real time systems which require faster wakeup time although power
      savings could be smaller, the administrator needs to disable C0.2 and all
      umwait invocations from user applications use C0.1.
      
      Create a sysfs interface which allows the administrator to control C0.2
      state during run time.
      
      Andy Lutomirski suggested to turn off local irqs before writing the MSR to
      ensure the cached control value is not changed by a concurrent sysfs write
      from a different CPU via IPI.
      
      [ tglx: Simplified the update logic in the write function and got rid of
        	all the convoluted type casts. Added a shared update function and
      	made the namespace consistent. Moved the sysfs create invocation.
      	Massaged changelog ]
      Signed-off-by: default avatarFenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarAshok Raj <ashok.raj@intel.com>
      Reviewed-by: default avatarTony Luck <tony.luck@intel.com>
      Cc: "Borislav Petkov" <bp@alien8.de>
      Cc: "H Peter Anvin" <hpa@zytor.com>
      Cc: "Andy Lutomirski" <luto@kernel.org>
      Cc: "Peter Zijlstra" <peterz@infradead.org>
      Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
      Link: https://lkml.kernel.org/r/1560994438-235698-4-git-send-email-fenghua.yu@intel.com
      ff4b353f
    • Fenghua Yu's avatar
      x86/umwait: Initialize umwait control values · bd688c69
      Fenghua Yu authored
      
      
      umwait or tpause allows the processor to enter a light-weight
      power/performance optimized state (C0.1 state) or an improved
      power/performance optimized state (C0.2 state) for a period specified by
      the instruction or until the system time limit or until a store to the
      monitored address range in umwait.
      
      IA32_UMWAIT_CONTROL MSR register allows the OS to enable/disable C0.2 on
      the processor and to set the maximum time the processor can reside in C0.1
      or C0.2.
      
      By default C0.2 is enabled so the user wait instructions can enter the
      C0.2 state to save more power with slower wakeup time.
      
      Andy Lutomirski proposed to set the maximum umwait time to 100000 cycles by
      default. A quote from Andy:
      
        "What I want to avoid is the case where it works dramatically differently
         on NO_HZ_FULL systems as compared to everything else. Also, UMWAIT may
         behave a bit differently if the max timeout is hit, and I'd like that
         path to get exercised widely by making it happen even on default
         configs."
      
      A sysfs interface to adjust the time and the C0.2 enablement is provided in
      a follow up change.
      
      [ tglx: Renamed MSR_IA32_UMWAIT_CONTROL_MAX_TIME to
        	MSR_IA32_UMWAIT_CONTROL_TIME_MASK because the constant is used as
        	mask throughout the code.
      	Massaged comments and changelog ]
      Signed-off-by: default avatarFenghua Yu <fenghua.yu@intel.com>
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: default avatarAshok Raj <ashok.raj@intel.com>
      Reviewed-by: default avatarAndy Lutomirski <luto@kernel.org>
      Cc: "Borislav Petkov" <bp@alien8.de>
      Cc: "H Peter Anvin" <hpa@zytor.com>
      Cc: "Peter Zijlstra" <peterz@infradead.org>
      Cc: "Tony Luck" <tony.luck@intel.com>
      Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
      Link: https://lkml.kernel.org/r/1560994438-235698-3-git-send-email-fenghua.yu@intel.com
      bd688c69