1. 14 May, 2019 1 commit
    • Ira Weiny's avatar
      mm/gup: change GUP fast to use flags rather than a write 'bool' · 73b0140b
      Ira Weiny authored
      To facilitate additional options to get_user_pages_fast() change the
      singular write parameter to be gup_flags.
      
      This patch does not change any functionality.  New functionality will
      follow in subsequent patches.
      
      Some of the get_user_pages_fast() call sites were unchanged because they
      already passed FOLL_WRITE or 0 for the write parameter.
      
      NOTE: It was suggested to change the ordering of the get_user_pages_fast()
      arguments to ensure that callers were converted.  This breaks the current
      GUP call site convention of having the returned pages be the final
      parameter.  So the suggestion was rejected.
      
      Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
      Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
      
      Signed-off-by: default avatarIra Weiny <ira.weiny@intel.com>
      Reviewed-by: default avatarMike Marshall <hubcap@omnibond.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      73b0140b
  2. 16 Apr, 2019 6 commits
    • Sean Christopherson's avatar
      KVM: x86: clear SMM flags before loading state while leaving SMM · 9ec19493
      Sean Christopherson authored
      
      
      RSM emulation is currently broken on VMX when the interrupted guest has
      CR4.VMXE=1.  Stop dancing around the issue of HF_SMM_MASK being set when
      loading SMSTATE into architectural state, e.g. by toggling it for
      problematic flows, and simply clear HF_SMM_MASK prior to loading
      architectural state (from SMRAM save state area).
      Reported-by: default avatarJon Doron <arilou@gmail.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Fixes: 5bea5123
      
       ("KVM: VMX: check nested state and CR4.VMXE against SMM")
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Tested-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      9ec19493
    • Sean Christopherson's avatar
      KVM: x86: Load SMRAM in a single shot when leaving SMM · ed19321f
      Sean Christopherson authored
      
      
      RSM emulation is currently broken on VMX when the interrupted guest has
      CR4.VMXE=1.  Rather than dance around the issue of HF_SMM_MASK being set
      when loading SMSTATE into architectural state, ideally RSM emulation
      itself would be reworked to clear HF_SMM_MASK prior to loading non-SMM
      architectural state.
      
      Ostensibly, the only motivation for having HF_SMM_MASK set throughout
      the loading of state from the SMRAM save state area is so that the
      memory accesses from GET_SMSTATE() are tagged with role.smm.  Load
      all of the SMRAM save state area from guest memory at the beginning of
      RSM emulation, and load state from the buffer instead of reading guest
      memory one-by-one.
      
      This paves the way for clearing HF_SMM_MASK prior to loading state,
      and also aligns RSM with the enter_smm() behavior, which fills a
      buffer and writes SMRAM save state in a single go.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ed19321f
    • WANG Chao's avatar
      x86/kvm: move kvm_load/put_guest_xcr0 into atomic context · 1811d979
      WANG Chao authored
      
      
      guest xcr0 could leak into host when MCE happens in guest mode. Because
      do_machine_check() could schedule out at a few places.
      
      For example:
      
      kvm_load_guest_xcr0
      ...
      kvm_x86_ops->run(vcpu) {
        vmx_vcpu_run
          vmx_complete_atomic_exit
            kvm_machine_check
              do_machine_check
                do_memory_failure
                  memory_failure
                    lock_page
      
      In this case, host_xcr0 is 0x2ff, guest vcpu xcr0 is 0xff. After schedule
      out, host cpu has guest xcr0 loaded (0xff).
      
      In __switch_to {
           switch_fpu_finish
             copy_kernel_to_fpregs
               XRSTORS
      
      If any bit i in XSTATE_BV[i] == 1 and xcr0[i] == 0, XRSTORS will
      generate #GP (In this case, bit 9). Then ex_handler_fprestore kicks in
      and tries to reinitialize fpu by restoring init fpu state. Same story as
      last #GP, except we get DOUBLE FAULT this time.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarWANG Chao <chao.wang@ucloud.cn>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1811d979
    • Vitaly Kuznetsov's avatar
      KVM: x86: svm: make sure NMI is injected after nmi_singlestep · 99c22179
      Vitaly Kuznetsov authored
      
      
      I noticed that apic test from kvm-unit-tests always hangs on my EPYC 7401P,
      the hanging test nmi-after-sti is trying to deliver 30000 NMIs and tracing
      shows that we're sometimes able to deliver a few but never all.
      
      When we're trying to inject an NMI we may fail to do so immediately for
      various reasons, however, we still need to inject it so enable_nmi_window()
      arms nmi_singlestep mode. #DB occurs as expected, but we're not checking
      for pending NMIs before entering the guest and unless there's a different
      event to process, the NMI will never get delivered.
      
      Make KVM_REQ_EVENT request on the vCPU from db_interception() to make sure
      pending NMIs are checked and possibly injected.
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      99c22179
    • Suthikulpanit, Suravee's avatar
      svm/avic: Fix invalidate logical APIC id entry · e44e3eac
      Suthikulpanit, Suravee authored
      Only clear the valid bit when invalidate logical APIC id entry.
      The current logic clear the valid bit, but also set the rest of
      the bits (including reserved bits) to 1.
      
      Fixes: 98d90582
      
       ('svm: Fix AVIC DFR and LDR handling')
      Signed-off-by: default avatarSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      e44e3eac
    • Suthikulpanit, Suravee's avatar
      Revert "svm: Fix AVIC incomplete IPI emulation" · 4a58038b
      Suthikulpanit, Suravee authored
      This reverts commit bb218fbc.
      
      As Oren Twaig pointed out the old discussion:
      
        https://patchwork.kernel.org/patch/8292231/
      
      that the change coud potentially cause an extra IPI to be sent to
      the destination vcpu because the AVIC hardware already set the IRR bit
      before the incomplete IPI #VMEXIT with id=1 (target vcpu is not running).
      Since writting to ICR and ICR2 will also set the IRR. If something triggers
      the destination vcpu to get scheduled before the emulation finishes, then
      this could result in an additional IPI.
      
      Also, the issue mentioned in the commit bb218fbc
      
       was misdiagnosed.
      
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Reported-by: default avatarOren Twaig <oren@scalemp.com>
      Signed-off-by: default avatarSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      4a58038b
  3. 05 Apr, 2019 2 commits
  4. 28 Mar, 2019 1 commit
    • Singh, Brijesh's avatar
      KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation) · 05d5a486
      Singh, Brijesh authored
      
      
      Errata#1096:
      
      On a nested data page fault when CR.SMAP=1 and the guest data read
      generates a SMAP violation, GuestInstrBytes field of the VMCB on a
      VMEXIT will incorrectly return 0h instead the correct guest
      instruction bytes .
      
      Recommend Workaround:
      
      To determine what instruction the guest was executing the hypervisor
      will have to decode the instruction at the instruction pointer.
      
      The recommended workaround can not be implemented for the SEV
      guest because guest memory is encrypted with the guest specific key,
      and instruction decoder will not be able to decode the instruction
      bytes. If we hit this errata in the SEV guest then log the message
      and request a guest shutdown.
      Reported-by: default avatarVenkatesh Srinivas <venkateshs@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: "Radim Krčmář" <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarBrijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      05d5a486
  5. 20 Feb, 2019 3 commits
    • Ben Gardon's avatar
      kvm: svm: Add memcg accounting to KVM allocations · 1ec69647
      Ben Gardon authored
      
      
      There are many KVM kernel memory allocations which are tied to the life of
      the VM process and should be charged to the VM process's cgroup. If the
      allocations aren't tied to the process, the OOM killer will not know
      that killing the process will free the associated kernel memory.
      Add __GFP_ACCOUNT flags to many of the allocations which are not yet being
      charged to the VM process's cgroup.
      
      Tested:
      	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
      	introduced no new failures.
      	Ran a kernel memory accounting test which creates a VM to touch
      	memory and then checks that the kernel memory allocated for the
      	process is within certain bounds.
      	With this patch we account for much more of the vmalloc and slab memory
      	allocated for the VM.
      Signed-off-by: default avatarBen Gardon <bgardon@google.com>
      Reviewed-by: default avatarShakeel Butt <shakeelb@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      1ec69647
    • Suthikulpanit, Suravee's avatar
      svm: Fix improper check when deactivate AVIC · c57cd3c8
      Suthikulpanit, Suravee authored
      The function svm_refresh_apicv_exec_ctrl() always returning prematurely
      as kvm_vcpu_apicv_active() always return false when calling from
      the function arch/x86/kvm/x86.c:kvm_vcpu_deactivate_apicv().
      This is because the apicv_active is set to false just before calling
      refresh_apicv_exec_ctrl().
      
      Also, we need to mark VMCB_AVIC bit as dirty instead of VMCB_INTR.
      
      So, fix svm_refresh_apicv_exec_ctrl() to properly deactivate AVIC.
      
      Fixes: 67034bb9
      
       ('KVM: SVM: Add irqchip_split() checks before enabling AVIC')
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      c57cd3c8
    • Suthikulpanit, Suravee's avatar
      svm: Fix AVIC DFR and LDR handling · 98d90582
      Suthikulpanit, Suravee authored
      Current SVM AVIC driver makes two incorrect assumptions:
        1. APIC LDR register cannot be zero
        2. APIC DFR for all vCPUs must be the same
      
      LDR=0 means the local APIC does not support logical destination mode.
      Therefore, the driver should mark any previously assigned logical APIC ID
      table entry as invalid, and return success.  Also, DFR is specific to
      a particular local APIC, and can be different among all vCPUs
      (as observed on Windows 10).
      
      These incorrect assumptions cause Windows 10 and FreeBSD VMs to fail
      to boot with AVIC enabled. So, instead of flush the whole logical APIC ID
      table, handle DFR and LDR for each vCPU independently.
      
      Fixes: 18f40c53
      
       ('svm: Add VMEXIT handlers for AVIC')
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Reported-by: default avatarJulian Stecklina <jsteckli@amazon.de>
      Signed-off-by: default avatarSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      98d90582
  6. 25 Jan, 2019 4 commits
    • Gustavo A. R. Silva's avatar
      KVM: x86: Mark expected switch fall-throughs · b2869f28
      Gustavo A. R. Silva authored
      
      
      In preparation to enabling -Wimplicit-fallthrough, mark switch
      cases where we are expecting to fall through.
      
      This patch fixes the following warnings:
      
      arch/x86/kvm/lapic.c:1037:27: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/lapic.c:1876:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/hyperv.c:1637:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/svm.c:4396:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/mmu.c:4372:36: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/x86.c:3835:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/x86.c:7938:23: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/vmx/vmx.c:2015:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      arch/x86/kvm/vmx/vmx.c:1773:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
      
      Warning level 3 was used: -Wimplicit-fallthrough=3
      
      This patch is part of the ongoing efforts to enabling -Wimplicit-fallthrough.
      Signed-off-by: default avatarGustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b2869f28
    • Vitaly Kuznetsov's avatar
      KVM: nSVM: clear events pending from svm_complete_interrupts() when exiting to L1 · 619ad846
      Vitaly Kuznetsov authored
      kvm-unit-tests' eventinj "NMI failing on IDT" test results in NMI being
      delivered to the host (L1) when it's running nested. The problem seems to
      be: svm_complete_interrupts() raises 'nmi_injected' flag but later we
      decide to reflect EXIT_NPF to L1. The flag remains pending and we do NMI
      injection upon entry so it got delivered to L1 instead of L2.
      
      It seems that VMX code solves the same issue in prepare_vmcs12(), this was
      introduced with code refactoring in commit 5f3d5799
      
       ("KVM: nVMX: Rework
      event injection and recovery").
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      619ad846
    • Suravee Suthikulpanit's avatar
      svm: Fix AVIC incomplete IPI emulation · bb218fbc
      Suravee Suthikulpanit authored
      
      
      In case of incomplete IPI with invalid interrupt type, the current
      SVM driver does not properly emulate the IPI, and fails to boot
      FreeBSD guests with multiple vcpus when enabling AVIC.
      
      Fix this by update APIC ICR high/low registers, which also
      emulate sending the IPI.
      Signed-off-by: default avatarSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      bb218fbc
    • Suravee Suthikulpanit's avatar
      svm: Add warning message for AVIC IPI invalid target · 37ef0c44
      Suravee Suthikulpanit authored
      
      
      Print warning message when IPI target ID is invalid due to one of
      the following reasons:
        * In logical mode: cluster > max_cluster (64)
        * In physical mode: target > max_physical (512)
        * Address is not present in the physical or logical ID tables
      Signed-off-by: default avatarSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      37ef0c44
  7. 11 Jan, 2019 1 commit
  8. 21 Dec, 2018 4 commits
  9. 19 Dec, 2018 1 commit
    • Vitaly Kuznetsov's avatar
      KVM: x86: nSVM: fix switch to guest mmu · 3cf85f9f
      Vitaly Kuznetsov authored
      Recent optimizations in MMU code broke nested SVM with NPT in L1
      completely: when we do nested_svm_{,un}init_mmu_context() we want
      to switch from TDP MMU to shadow MMU, both init_kvm_tdp_mmu() and
      kvm_init_shadow_mmu() check if re-configuration is needed by looking
      at cache source data. The data, however, doesn't change - it's only
      the type of the MMU which changes. We end up not re-initializing
      guest MMU as shadow and everything goes off the rails.
      
      The issue could have been fixed by putting MMU type into extended MMU
      role but this is not really needed. We can just split root and guest MMUs
      the exact same way we did for nVMX, their types never change in the
      lifetime of a vCPU.
      
      There is still room for improvement: currently, we reset all MMU roots
      when switching from L1 to L2 and back and this is not needed.
      
      Fixes: 7dcd5755
      
       ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
      Signed-off-by: default avatarVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      3cf85f9f
  10. 14 Dec, 2018 3 commits
  11. 27 Nov, 2018 4 commits
  12. 17 Oct, 2018 2 commits
  13. 16 Oct, 2018 2 commits
  14. 09 Oct, 2018 1 commit
    • Paolo Bonzini's avatar
      KVM: x86: support CONFIG_KVM_AMD=y with CONFIG_CRYPTO_DEV_CCP_DD=m · 853c1109
      Paolo Bonzini authored
      
      
      SEV requires access to the AMD cryptographic device APIs, and this
      does not work when KVM is builtin and the crypto driver is a module.
      Actually the Kconfig conditions for CONFIG_KVM_AMD_SEV try to disable
      SEV in that case, but it does not work because the actual crypto
      calls are not culled, only sev_hardware_setup() is.
      
      This patch adds two CONFIG_KVM_AMD_SEV checks that gate all the remaining
      SEV code; it fixes this particular configuration, and drops 5 KiB of
      code when CONFIG_KVM_AMD_SEV=n.
      Reported-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      853c1109
  15. 19 Sep, 2018 2 commits
    • Sean Christopherson's avatar
      KVM: VMX: use preemption timer to force immediate VMExit · d264ee0c
      Sean Christopherson authored
      
      
      A VMX preemption timer value of '0' is guaranteed to cause a VMExit
      prior to the CPU executing any instructions in the guest.  Use the
      preemption timer (if it's supported) to trigger immediate VMExit
      in place of the current method of sending a self-IPI.  This ensures
      that pending VMExit injection to L1 occurs prior to executing any
      instructions in the guest (regardless of nesting level).
      
      When deferring VMExit injection, KVM generates an immediate VMExit
      from the (possibly nested) guest by sending itself an IPI.  Because
      hardware interrupts are blocked prior to VMEnter and are unblocked
      (in hardware) after VMEnter, this results in taking a VMExit(INTR)
      before any guest instruction is executed.  But, as this approach
      relies on the IPI being received before VMEnter executes, it only
      works as intended when KVM is running as L0.  Because there are no
      architectural guarantees regarding when IPIs are delivered, when
      running nested the INTR may "arrive" long after L2 is running e.g.
      L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.
      
      For the most part, this unintended delay is not an issue since the
      events being injected to L1 also do not have architectural guarantees
      regarding their timing.  The notable exception is the VMX preemption
      timer[1], which is architecturally guaranteed to cause a VMExit prior
      to executing any instructions in the guest if the timer value is '0'
      at VMEnter.  Specifically, the delay in injecting the VMExit causes
      the preemption timer KVM unit test to fail when run in a nested guest.
      
      Note: this approach is viable even on CPUs with a broken preemption
      timer, as broken in this context only means the timer counts at the
      wrong rate.  There are no known errata affecting timer value of '0'.
      
      [1] I/O SMIs also have guarantees on when they arrive, but I have
          no idea if/how those are emulated in KVM.
      Signed-off-by: default avatarSean Christopherson <sean.j.christopherson@intel.com>
      [Use a hook for SVM instead of leaving the default in x86.c - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      d264ee0c
    • Andy Shevchenko's avatar
      KVM: SVM: Switch to bitmap_zalloc() · a101c9d6
      Andy Shevchenko authored
      
      
      Switch to bitmap_zalloc() to show clearly what we are allocating.
      Besides that it returns pointer of bitmap type instead of opaque void *.
      Signed-off-by: default avatarAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      a101c9d6
  16. 30 Aug, 2018 3 commits