1. 01 Jun, 2017 1 commit
    • ZhuangYanying's avatar
      KVM: x86: Fix nmi injection failure when vcpu got blocked · 47a66eed
      ZhuangYanying authored
      
      
      When spin_lock_irqsave() deadlock occurs inside the guest, vcpu threads,
      other than the lock-holding one, would enter into S state because of
      pvspinlock. Then inject NMI via libvirt API "inject-nmi", the NMI could
      not be injected into vm.
      
      The reason is:
      1 It sets nmi_queued to 1 when calling ioctl KVM_NMI in qemu, and sets
      cpu->kvm_vcpu_dirty to true in do_inject_external_nmi() meanwhile.
      2 It sets nmi_queued to 0 in process_nmi(), before entering guest, because
      cpu->kvm_vcpu_dirty is true.
      
      It's not enough just to check nmi_queued to decide whether to stay in
      vcpu_block() or not. NMI should be injected immediately at any situation.
      Add checking nmi_pending, and testing KVM_REQ_NMI replaces nmi_queued
      in vm_vcpu_has_events().
      
      Do the same change for SMIs.
      
      Signed-off-by: default avatarZhuang Yanying <ann.zhuangyanying@huawei.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      47a66eed
  2. 19 May, 2017 3 commits
    • Radim Krčmář's avatar
      KVM: x86: zero base3 of unusable segments · f0367ee1
      Radim Krčmář authored
      
      
      Static checker noticed that base3 could be used uninitialized if the
      segment was not present (useable).  Random stack values probably would
      not pass VMCS entry checks.
      
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Fixes: 1aa36616
      
       ("KVM: x86 emulator: consolidate segment accessors")
      Reviewed-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      f0367ee1
    • Wanpeng Li's avatar
      KVM: X86: Fix read out-of-bounds vulnerability in kvm pio emulation · cbfc6c91
      Wanpeng Li authored
      
      
      Huawei folks reported a read out-of-bounds vulnerability in kvm pio emulation.
      
      - "inb" instruction to access PIT Mod/Command register (ioport 0x43, write only,
        a read should be ignored) in guest can get a random number.
      - "rep insb" instruction to access PIT register port 0x43 can control memcpy()
        in emulator_pio_in_emulated() to copy max 0x400 bytes but only read 1 bytes,
        which will disclose the unimportant kernel memory in host but no crash.
      
      The similar test program below can reproduce the read out-of-bounds vulnerability:
      
      void hexdump(void *mem, unsigned int len)
      {
              unsigned int i, j;
      
              for(i = 0; i < len + ((len % HEXDUMP_COLS) ? (HEXDUMP_COLS - len % HEXDUMP_COLS) : 0); i++)
              {
                      /* print offset */
                      if(i % HEXDUMP_COLS == 0)
                      {
                              printf("0x%06x: ", i);
                      }
      
                      /* print hex data */
                      if(i < len)
                      {
                              printf("%02x ", 0xFF & ((char*)mem)[i]);
                      }
                      else /* end of block, just aligning for ASCII dump */
                      {
                              printf("   ");
                      }
      
                      /* print ASCII dump */
                      if(i % HEXDUMP_COLS == (HEXDUMP_COLS - 1))
                      {
                              for(j = i - (HEXDUMP_COLS - 1); j <= i; j++)
                              {
                                      if(j >= len) /* end of block, not really printing */
                                      {
                                              putchar(' ');
                                      }
                                      else if(isprint(((char*)mem)[j])) /* printable char */
                                      {
                                              putchar(0xFF & ((char*)mem)[j]);
                                      }
                                      else /* other char */
                                      {
                                              putchar('.');
                                      }
                              }
                              putchar('\n');
                      }
              }
      }
      
      int main(void)
      {
      	int i;
      	if (iopl(3))
      	{
      		err(1, "set iopl unsuccessfully\n");
      		return -1;
      	}
      	static char buf[0x40];
      
      	/* test ioport 0x40,0x41,0x42,0x43,0x44,0x45 */
      
      	memset(buf, 0xab, sizeof(buf));
      
      	asm volatile("push %rdi;");
      	asm volatile("mov %0, %%rdi;"::"q"(buf));
      
      	asm volatile ("mov $0x40, %rdx;");
      	asm volatile ("in %dx,%al;");
      	asm volatile ("stosb;");
      
      	asm volatile ("mov $0x41, %rdx;");
      	asm volatile ("in %dx,%al;");
      	asm volatile ("stosb;");
      
      	asm volatile ("mov $0x42, %rdx;");
      	asm volatile ("in %dx,%al;");
      	asm volatile ("stosb;");
      
      	asm volatile ("mov $0x43, %rdx;");
      	asm volatile ("in %dx,%al;");
      	asm volatile ("stosb;");
      
      	asm volatile ("mov $0x44, %rdx;");
      	asm volatile ("in %dx,%al;");
      	asm volatile ("stosb;");
      
      	asm volatile ("mov $0x45, %rdx;");
      	asm volatile ("in %dx,%al;");
      	asm volatile ("stosb;");
      
      	asm volatile ("pop %rdi;");
      	hexdump(buf, 0x40);
      
      	printf("\n");
      
      	/* ins port 0x40 */
      
      	memset(buf, 0xab, sizeof(buf));
      
      	asm volatile("push %rdi;");
      	asm volatile("mov %0, %%rdi;"::"q"(buf));
      
      	asm volatile ("mov $0x20, %rcx;");
      	asm volatile ("mov $0x40, %rdx;");
      	asm volatile ("rep insb;");
      
      	asm volatile ("pop %rdi;");
      	hexdump(buf, 0x40);
      
      	printf("\n");
      
      	/* ins port 0x43 */
      
      	memset(buf, 0xab, sizeof(buf));
      
      	asm volatile("push %rdi;");
      	asm volatile("mov %0, %%rdi;"::"q"(buf));
      
      	asm volatile ("mov $0x20, %rcx;");
      	asm volatile ("mov $0x43, %rdx;");
      	asm volatile ("rep insb;");
      
      	asm volatile ("pop %rdi;");
      	hexdump(buf, 0x40);
      
      	printf("\n");
      	return 0;
      }
      
      The vcpu->arch.pio_data buffer is used by both in/out instrutions emulation
      w/o clear after using which results in some random datas are left over in
      the buffer. Guest reads port 0x43 will be ignored since it is write only,
      however, the function kernel_pio() can't distigush this ignore from successfully
      reads data from device's ioport. There is no new data fill the buffer from
      port 0x43, however, emulator_pio_in_emulated() will copy the stale data in
      the buffer to the guest unconditionally. This patch fixes it by clearing the
      buffer before in instruction emulation to avoid to grant guest the stale data
      in the buffer.
      
      In addition, string I/O is not supported for in kernel device. So there is no
      iteration to read ioport %RCX times for string I/O. The function kernel_pio()
      just reads one round, and then copy the io size * %RCX to the guest unconditionally,
      actually it copies the one round ioport data w/ other random datas which are left
      over in the vcpu->arch.pio_data buffer to the guest. This patch fixes it by
      introducing the string I/O support for in kernel device in order to grant the right
      ioport datas to the guest.
      
      Before the patch:
      
      0x000000: fe 38 93 93 ff ff ab ab .8......
      0x000008: ab ab ab ab ab ab ab ab ........
      0x000010: ab ab ab ab ab ab ab ab ........
      0x000018: ab ab ab ab ab ab ab ab ........
      0x000020: ab ab ab ab ab ab ab ab ........
      0x000028: ab ab ab ab ab ab ab ab ........
      0x000030: ab ab ab ab ab ab ab ab ........
      0x000038: ab ab ab ab ab ab ab ab ........
      
      0x000000: f6 00 00 00 00 00 00 00 ........
      0x000008: 00 00 00 00 00 00 00 00 ........
      0x000010: 00 00 00 00 4d 51 30 30 ....MQ00
      0x000018: 30 30 20 33 20 20 20 20 00 3
      0x000020: ab ab ab ab ab ab ab ab ........
      0x000028: ab ab ab ab ab ab ab ab ........
      0x000030: ab ab ab ab ab ab ab ab ........
      0x000038: ab ab ab ab ab ab ab ab ........
      
      0x000000: f6 00 00 00 00 00 00 00 ........
      0x000008: 00 00 00 00 00 00 00 00 ........
      0x000010: 00 00 00 00 4d 51 30 30 ....MQ00
      0x000018: 30 30 20 33 20 20 20 20 00 3
      0x000020: ab ab ab ab ab ab ab ab ........
      0x000028: ab ab ab ab ab ab ab ab ........
      0x000030: ab ab ab ab ab ab ab ab ........
      0x000038: ab ab ab ab ab ab ab ab ........
      
      After the patch:
      
      0x000000: 1e 02 f8 00 ff ff ab ab ........
      0x000008: ab ab ab ab ab ab ab ab ........
      0x000010: ab ab ab ab ab ab ab ab ........
      0x000018: ab ab ab ab ab ab ab ab ........
      0x000020: ab ab ab ab ab ab ab ab ........
      0x000028: ab ab ab ab ab ab ab ab ........
      0x000030: ab ab ab ab ab ab ab ab ........
      0x000038: ab ab ab ab ab ab ab ab ........
      
      0x000000: d2 e2 d2 df d2 db d2 d7 ........
      0x000008: d2 d3 d2 cf d2 cb d2 c7 ........
      0x000010: d2 c4 d2 c0 d2 bc d2 b8 ........
      0x000018: d2 b4 d2 b0 d2 ac d2 a8 ........
      0x000020: ab ab ab ab ab ab ab ab ........
      0x000028: ab ab ab ab ab ab ab ab ........
      0x000030: ab ab ab ab ab ab ab ab ........
      0x000038: ab ab ab ab ab ab ab ab ........
      
      0x000000: 00 00 00 00 00 00 00 00 ........
      0x000008: 00 00 00 00 00 00 00 00 ........
      0x000010: 00 00 00 00 00 00 00 00 ........
      0x000018: 00 00 00 00 00 00 00 00 ........
      0x000020: ab ab ab ab ab ab ab ab ........
      0x000028: ab ab ab ab ab ab ab ab ........
      0x000030: ab ab ab ab ab ab ab ab ........
      0x000038: ab ab ab ab ab ab ab ab ........
      
      Reported-by: default avatarMoguofang <moguofang@huawei.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Moguofang <moguofang@huawei.com>
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      cbfc6c91
    • Wanpeng Li's avatar
      KVM: x86: Fix potential preemption when get the current kvmclock timestamp · e2c2206a
      Wanpeng Li authored
      
      
       BUG: using __this_cpu_read() in preemptible [00000000] code: qemu-system-x86/2809
       caller is __this_cpu_preempt_check+0x13/0x20
       CPU: 2 PID: 2809 Comm: qemu-system-x86 Not tainted 4.11.0+ #13
       Call Trace:
        dump_stack+0x99/0xce
        check_preemption_disabled+0xf5/0x100
        __this_cpu_preempt_check+0x13/0x20
        get_kvmclock_ns+0x6f/0x110 [kvm]
        get_time_ref_counter+0x5d/0x80 [kvm]
        kvm_hv_process_stimers+0x2a1/0x8a0 [kvm]
        ? kvm_hv_process_stimers+0x2a1/0x8a0 [kvm]
        ? kvm_arch_vcpu_ioctl_run+0xac9/0x1ce0 [kvm]
        kvm_arch_vcpu_ioctl_run+0x5bf/0x1ce0 [kvm]
        kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
        ? kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
        ? __fget+0xf3/0x210
        do_vfs_ioctl+0xa4/0x700
        ? __fget+0x114/0x210
        SyS_ioctl+0x79/0x90
        entry_SYSCALL_64_fastpath+0x23/0xc2
       RIP: 0033:0x7f9d164ed357
        ? __this_cpu_preempt_check+0x13/0x20
      
      This can be reproduced by run kvm-unit-tests/hyperv_stimer.flat w/
      CONFIG_PREEMPT and CONFIG_DEBUG_PREEMPT enabled.
      
      Safe access to per-CPU data requires a couple of constraints, though: the
      thread working with the data cannot be preempted and it cannot be migrated
      while it manipulates per-CPU variables. If the thread is preempted, the
      thread that replaces it could try to work with the same variables; migration
      to another CPU could also cause confusion. However there is no preemption
      disable when reads host per-CPU tsc rate to calculate the current kvmclock
      timestamp.
      
      This patch fixes it by utilizing get_cpu/put_cpu pair to guarantee both
      __this_cpu_read() and rdtsc() are not preempted.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      Reviewed-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      e2c2206a
  3. 15 May, 2017 1 commit
    • Wanpeng Li's avatar
      KVM: x86: Fix load damaged SSEx MXCSR register · a575813b
      Wanpeng Li authored
      
      
      Reported by syzkaller:
      
         BUG: unable to handle kernel paging request at ffffffffc07f6a2e
         IP: report_bug+0x94/0x120
         PGD 348e12067
         P4D 348e12067
         PUD 348e14067
         PMD 3cbd84067
         PTE 80000003f7e87161
      
         Oops: 0003 [#1] SMP
         CPU: 2 PID: 7091 Comm: kvm_load_guest_ Tainted: G           OE   4.11.0+ #8
         task: ffff92fdfb525400 task.stack: ffffbda6c3d04000
         RIP: 0010:report_bug+0x94/0x120
         RSP: 0018:ffffbda6c3d07b20 EFLAGS: 00010202
          do_trap+0x156/0x170
          do_error_trap+0xa3/0x170
          ? kvm_load_guest_fpu.part.175+0x12a/0x170 [kvm]
          ? mark_held_locks+0x79/0xa0
          ? retint_kernel+0x10/0x10
          ? trace_hardirqs_off_thunk+0x1a/0x1c
          do_invalid_op+0x20/0x30
          invalid_op+0x1e/0x30
         RIP: 0010:kvm_load_guest_fpu.part.175+0x12a/0x170 [kvm]
          ? kvm_load_guest_fpu.part.175+0x1c/0x170 [kvm]
          kvm_arch_vcpu_ioctl_run+0xed6/0x1b70 [kvm]
          kvm_vcpu_ioctl+0x384/0x780 [kvm]
          ? kvm_vcpu_ioctl+0x384/0x780 [kvm]
          ? sched_clock+0x13/0x20
          ? __do_page_fault+0x2a0/0x550
          do_vfs_ioctl+0xa4/0x700
          ? up_read+0x1f/0x40
          ? __do_page_fault+0x2a0/0x550
          SyS_ioctl+0x79/0x90
          entry_SYSCALL_64_fastpath+0x23/0xc2
      
      SDM mentioned that "The MXCSR has several reserved bits, and attempting to write
      a 1 to any of these bits will cause a general-protection exception(#GP) to be
      generated". The syzkaller forks' testcase overrides xsave area w/ random values
      and steps on the reserved bits of MXCSR register. The damaged MXCSR register
      values of guest will be restored to SSEx MXCSR register before vmentry. This
      patch fixes it by catching userspace override MXCSR register reserved bits w/
      random values and bails out immediately.
      
      Reported-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      a575813b
  4. 09 May, 2017 1 commit
    • Michal Hocko's avatar
      mm: introduce kv[mz]alloc helpers · a7c3e901
      Michal Hocko authored
      Patch series "kvmalloc", v5.
      
      There are many open coded kmalloc with vmalloc fallback instances in the
      tree.  Most of them are not careful enough or simply do not care about
      the underlying semantic of the kmalloc/page allocator which means that
      a) some vmalloc fallbacks are basically unreachable because the kmalloc
      part will keep retrying until it succeeds b) the page allocator can
      invoke a really disruptive steps like the OOM killer to move forward
      which doesn't sound appropriate when we consider that the vmalloc
      fallback is available.
      
      As it can be seen implementing kvmalloc requires quite an intimate
      knowledge if the page allocator and the memory reclaim internals which
      strongly suggests that a helper should be implemented in the memory
      subsystem proper.
      
      Most callers, I could find, have been converted to use the helper
      instead.  This is patch 6.  There are some more relying on __GFP_REPEAT
      in the networking stack which I have converted as well and Eric Dumazet
      was not opposed [2] to convert them as well.
      
      [1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com
      
      This patch (of 9):
      
      Using kmalloc with the vmalloc fallback for larger allocations is a
      common pattern in the kernel code.  Yet we do not have any common helper
      for that and so users have invented their own helpers.  Some of them are
      really creative when doing so.  Let's just add kv[mz]alloc and make sure
      it is implemented properly.  This implementation makes sure to not make
      a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also
      to not warn about allocation failures.  This also rules out the OOM
      killer as the vmalloc is a more approapriate fallback than a disruptive
      user visible action.
      
      This patch also changes some existing users and removes helpers which
      are specific for them.  In some cases this is not possible (e.g.
      ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and
      require GFP_NO{FS,IO} context which is not vmalloc compatible in general
      (note that the page table allocation is GFP_KERNEL).  Those need to be
      fixed separately.
      
      While we are at it, document that __vmalloc{_node} about unsupported gfp
      mask because there seems to be a lot of confusion out there.
      kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
      superset) flags to catch new abusers.  Existing ones would have to die
      slowly.
      
      [sfr@canb.auug.org.au: f2fs fixup]
        Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.org
      
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarStephen Rothwell <sfr@canb.auug.org.au>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>	[ext4 part]
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a7c3e901
  5. 03 May, 2017 1 commit
    • Paolo Bonzini's avatar
      Revert "KVM: Support vCPU-based gfn->hva cache" · 4e335d9e
      Paolo Bonzini authored
      This reverts commit bbd64115.
      
      I've been sitting on this revert for too long and it unfortunately
      missed 4.11.  It's also the reason why I haven't merged ring-based
      dirty tracking for 4.12.
      
      Using kvm_vcpu_memslots in kvm_gfn_to_hva_cache_init and
      kvm_vcpu_write_guest_offset_cached means that the MSR value can
      now be used to access SMRAM, simply by making it point to an SMRAM
      physical address.  This is problematic because it lets the guest
      OS overwrite memory that it shouldn't be able to touch.
      
      Cc: stable@vger.kernel.org
      Fixes: bbd64115
      
      
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      4e335d9e
  6. 02 May, 2017 1 commit
  7. 27 Apr, 2017 4 commits
    • Ladi Prosek's avatar
      KVM: x86: fix emulation of RSM and IRET instructions · 6ed071f0
      Ladi Prosek authored
      On AMD, the effect of set_nmi_mask called by emulate_iret_real and em_rsm
      on hflags is reverted later on in x86_emulate_instruction where hflags are
      overwritten with ctxt->emul_flags (the kvm_set_hflags call). This manifests
      as a hang when rebooting Windows VMs with QEMU, OVMF, and >1 vcpu.
      
      Instead of trying to merge ctxt->emul_flags into vcpu->arch.hflags after
      an instruction is emulated, this commit deletes emul_flags altogether and
      makes the emulator access vcpu->arch.hflags using two new accessors. This
      way all changes, on the emulator side as well as in functions called from
      the emulator and accessing vcpu state with emul_to_vcpu, are preserved.
      
      More details on the bug and its manifestation with Windows and OVMF:
      
        It's a KVM bug in the interaction between SMI/SMM and NMI, specific to AMD.
        I believe that the SMM part explains why we started seeing this only with
        OVMF.
      
        KVM masks and unmasks NMI when entering and leaving SMM. When KVM emulates
        the RSM instruction in em_rsm, the set_nmi_mask call doesn't stick because
        later on in x86_emulate_instruction we overwrite arch.hflags with
        ctxt->emul_flags, effectively reverting the effect of the set_nmi_mask call.
        The AMD-specific hflag of interest here is HF_NMI_MASK.
      
        When rebooting the system, Windows sends an NMI IPI to all but the current
        cpu to shut them down. Only after all of them are parked in HLT will the
        initiating cpu finish the restart. If NMI is masked, other cpus never get
        the memo and the initiating cpu spins forever, waiting for
        hal!HalpInterruptProcessorsStarted to drop. That's the symptom we observe.
      
      Fixes: a584539b
      
       ("KVM: x86: pass the whole hflags field to emulator and back")
      Signed-off-by: default avatarLadi Prosek <lprosek@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      6ed071f0
    • Andrew Jones's avatar
      KVM: add explicit barrier to kvm_vcpu_kick · cde9af6e
      Andrew Jones authored
      
      
      kvm_vcpu_kick() must issue a general memory barrier prior to reading
      vcpu->mode in order to ensure correctness of the mutual-exclusion
      memory barrier pattern used with vcpu->requests.  While the cmpxchg
      called from kvm_vcpu_kick():
      
       kvm_vcpu_kick
         kvm_arch_vcpu_should_kick
           kvm_vcpu_exiting_guest_mode
             cmpxchg
      
      implies general memory barriers before and after the operation, that
      implication is only valid when cmpxchg succeeds.  We need an explicit
      barrier for when it fails, otherwise a VCPU thread on its entry path
      that reads zero for vcpu->requests does not exclude the possibility
      the requesting thread sees !IN_GUEST_MODE when it reads vcpu->mode.
      
      kvm_make_all_cpus_request already had a barrier, so we remove it, as
      now it would be redundant.
      
      Signed-off-by: default avatarAndrew Jones <drjones@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      cde9af6e
    • Radim Krčmář's avatar
    • Radim Krčmář's avatar
      KVM: add kvm_{test,clear}_request to replace {test,clear}_bit · 72875d8a
      Radim Krčmář authored
      
      
      Users were expected to use kvm_check_request() for testing and clearing,
      but request have expanded their use since then and some users want to
      only test or do a faster clear.
      
      Make sure that requests are not directly accessed with bit operations.
      
      Reviewed-by: default avatarChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      Reviewed-by: default avatarAndrew Jones <drjones@redhat.com>
      Reviewed-by: default avatarCornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      72875d8a
  8. 21 Apr, 2017 3 commits
    • Marcelo Tosatti's avatar
      KVM: x86: remove irq disablement around KVM_SET_CLOCK/KVM_GET_CLOCK · e891a32e
      Marcelo Tosatti authored
      
      
      The disablement of interrupts at KVM_SET_CLOCK/KVM_GET_CLOCK
      attempts to disable software suspend from causing "non atomic behaviour" of
      the operation:
      
          Add a helper function to compute the kernel time and convert nanoseconds
          back to CPU specific cycles.  Note that these must not be called in preemptible
          context, as that would mean the kernel could enter software suspend state,
          which would cause non-atomic operation.
      
      However, assume the kernel can enter software suspend at the following 2 points:
      
              ktime_get_ts(&ts);
      1.
      						hypothetical_ktime_get_ts(&ts)
              monotonic_to_bootbased(&ts);
      2.
      
      monotonic_to_bootbased() should be correct relative to a ktime_get_ts(&ts)
      performed after point 1 (that is after resuming from software suspend),
      hypothetical_ktime_get_ts()
      
      Therefore it is also correct for the ktime_get_ts(&ts) before point 1,
      which is
      
      	ktime_get_ts(&ts) = hypothetical_ktime_get_ts(&ts) + time-to-execute-suspend-code
      
      Note CLOCK_MONOTONIC does not count during suspension.
      
      So remove the irq disablement, which causes the following warning on
      -RT kernels:
      
       With this reasoning, and the -RT bug that the irq disablement causes
       (because spin_lock is now a sleeping lock), remove the IRQ protection as it
       causes:
      
       [ 1064.668109] in_atomic(): 0, irqs_disabled(): 1, pid: 15296, name:m
       [ 1064.668110] INFO: lockdep is turned off.
       [ 1064.668110] irq event stamp: 0
       [ 1064.668112] hardirqs last  enabled at (0): [<          (null)>]  )
       [ 1064.668116] hardirqs last disabled at (0): [] c0
       [ 1064.668118] softirqs last  enabled at (0): [] c0
       [ 1064.668118] softirqs last disabled at (0): [<          (null)>]  )
       [ 1064.668121] CPU: 13 PID: 15296 Comm: qemu-kvm Not tainted 3.10.0-1
       [ 1064.668121] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 5
       [ 1064.668123]  ffff8c1796b88000 00000000afe7344c ffff8c179abf3c68 f3
       [ 1064.668125]  ffff8c179abf3c90 ffffffff930ccb3d ffff8c1b992b3610 f0
       [ 1064.668126]  00007ffc1a26fbc0 ffff8c179abf3cb0 ffffffff9375f694 f0
       [ 1064.668126] Call Trace:
       [ 1064.668132]  [] dump_stack+0x19/0x1b
       [ 1064.668135]  [] __might_sleep+0x12d/0x1f0
       [ 1064.668138]  [] rt_spin_lock+0x24/0x60
       [ 1064.668155]  [] __get_kvmclock_ns+0x36/0x110 [k]
       [ 1064.668159]  [] ? futex_wait_queue_me+0x103/0x10
       [ 1064.668171]  [] kvm_arch_vm_ioctl+0xa2/0xd70 [k]
       [ 1064.668173]  [] ? futex_wait+0x1ac/0x2a0
      
      v2: notice get_kvmclock_ns with the same problem (Pankaj).
      v3: remove useless helper function (Pankaj).
      
      Signed-off-by: default avatarMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      e891a32e
    • Michael S. Tsirkin's avatar
      kvm: better MWAIT emulation for guests · 668fffa3
      Michael S. Tsirkin authored
      
      
      Guests that are heavy on futexes end up IPI'ing each other a lot. That
      can lead to significant slowdowns and latency increase for those guests
      when running within KVM.
      
      If only a single guest is needed on a host, we have a lot of spare host
      CPU time we can throw at the problem. Modern CPUs implement a feature
      called "MWAIT" which allows guests to wake up sleeping remote CPUs without
      an IPI - thus without an exit - at the expense of never going out of guest
      context.
      
      The decision whether this is something sensible to use should be up to the
      VM admin, so to user space. We can however allow MWAIT execution on systems
      that support it properly hardware wise.
      
      This patch adds a CAP to user space and a KVM cpuid leaf to indicate
      availability of native MWAIT execution. With that enabled, the worst a
      guest can do is waste as many cycles as a "jmp ." would do, so it's not
      a privilege problem.
      
      We consciously do *not* expose the feature in our CPUID bitmap, as most
      people will want to benefit from sleeping vCPUs to allow for over commit.
      
      Reported-by: default avatar"Gabriel L. Somlo" <gsomlo@gmail.com>
      Signed-off-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      [agraf: fix amd, change commit message]
      Signed-off-by: default avatarAlexander Graf <agraf@suse.de>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      668fffa3
    • Kyle Huey's avatar
      KVM: x86: virtualize cpuid faulting · db2336a8
      Kyle Huey authored
      
      
      Hardware support for faulting on the cpuid instruction is not required to
      emulate it, because cpuid triggers a VM exit anyways. KVM handles the relevant
      MSRs (MSR_PLATFORM_INFO and MSR_MISC_FEATURES_ENABLE) and upon a
      cpuid-induced VM exit checks the cpuid faulting state and the CPL.
      kvm_require_cpl is even kind enough to inject the GP fault for us.
      
      Signed-off-by: default avatarKyle Huey <khuey@kylehuey.com>
      Reviewed-by: default avatarDavid Matlack <dmatlack@google.com>
      [Return "1" from kvm_emulate_cpuid, it's not void. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      db2336a8
  9. 12 Apr, 2017 8 commits
  10. 07 Apr, 2017 4 commits
    • Jim Mattson's avatar
      kvm: nVMX: Disallow userspace-injected exceptions in guest mode · 28d06353
      Jim Mattson authored
      
      
      The userspace exception injection API and code path are entirely
      unprepared for exceptions that might cause a VM-exit from L2 to L1, so
      the best course of action may be to simply disallow this for now.
      
      1. The API provides no mechanism for userspace to specify the new DR6
      bits for a #DB exception or the new CR2 value for a #PF
      exception. Presumably, userspace is expected to modify these registers
      directly with KVM_SET_SREGS before the next KVM_RUN ioctl. However, in
      the event that L1 intercepts the exception, these registers should not
      be changed. Instead, the new values should be provided in the
      exit_qualification field of vmcs12 (Intel SDM vol 3, section 27.1).
      
      2. In the case of a userspace-injected #DB, inject_pending_event()
      clears DR7.GD before calling vmx_queue_exception(). However, in the
      event that L1 intercepts the exception, this is too early, because
      DR7.GD should not be modified by a #DB that causes a VM-exit directly
      (Intel SDM vol 3, section 27.1).
      
      3. If the injected exception is a #PF, nested_vmx_check_exception()
      doesn't properly check whether or not L1 is interested in the
      associated error code (using the #PF error code mask and match fields
      from vmcs12). It may either return 0 when it should call
      nested_vmx_vmexit() or vice versa.
      
      4. nested_vmx_check_exception() assumes that it is dealing with a
      hardware-generated exception intercept from L2, with some of the
      relevant details (the VM-exit interruption-information and the exit
      qualification) live in vmcs02. For userspace-injected exceptions, this
      is not the case.
      
      5. prepare_vmcs12() assumes that when its exit_intr_info argument
      specifies valid information with a valid error code that it can VMREAD
      the VM-exit interruption error code from vmcs02. For
      userspace-injected exceptions, this is not the case.
      
      Signed-off-by: default avatarJim Mattson <jmattson@google.com>
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      28d06353
    • David Hildenbrand's avatar
      KVM: x86: fix user triggerable warning in kvm_apic_accept_events() · 28bf2888
      David Hildenbrand authored
      If we already entered/are about to enter SMM, don't allow switching to
      INIT/SIPI_RECEIVED, otherwise the next call to kvm_apic_accept_events()
      will report a warning.
      
      Same applies if we are already in MP state INIT_RECEIVED and SMM is
      requested to be turned on. Refuse to set the VCPU events in this case.
      
      Fixes: cd7764fe
      
       ("KVM: x86: latch INITs while in system management mode")
      Cc: stable@vger.kernel.org # 4.2+
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      28bf2888
    • Paolo Bonzini's avatar
      kvm: make KVM_CAP_COALESCED_MMIO architecture agnostic · 30422558
      Paolo Bonzini authored
      
      
      Remove code from architecture files that can be moved to virt/kvm, since there
      is already common code for coalesced MMIO.
      
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      [Removed a pointless 'break' after 'return'.]
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      30422558
    • Paolo Bonzini's avatar
      KVM: x86: drop legacy device assignment · ad6260da
      Paolo Bonzini authored
      
      
      Legacy device assignment has been deprecated since 4.2 (released
      1.5 years ago).  VFIO is better and everyone should have switched to it.
      If they haven't, this should convince them. :)
      
      Reviewed-by: default avatarAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ad6260da
  11. 28 Mar, 2017 1 commit
  12. 23 Mar, 2017 2 commits
    • Wanpeng Li's avatar
      KVM: x86: correct async page present tracepoint · 24dccf83
      Wanpeng Li authored
      
      
      After async pf setup successfully, there is a broadcast wakeup w/ special
      token 0xffffffff which tells vCPU that it should wake up all processes
      waiting for APFs though there is no real process waiting at the moment.
      
      The async page present tracepoint print prematurely and fails to catch the
      special token setup. This patch fixes it by moving the async page present
      tracepoint after the special token setup.
      
      Before patch:
      
      qemu-system-x86-8499  [006] ...1  5973.473292: kvm_async_pf_ready: token 0x0 gva 0x0
      
      After patch:
      
      qemu-system-x86-8499  [006] ...1  5973.473292: kvm_async_pf_ready: token 0xffffffff gva 0x0
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      24dccf83
    • Peter Xu's avatar
      KVM: x86: use pic/ioapic destructor when destroy vm · c761159c
      Peter Xu authored
      
      
      We have specific destructors for pic/ioapic, we'd better use them when
      destroying the VM as well.
      
      Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      c761159c
  13. 02 Mar, 2017 1 commit
  14. 17 Feb, 2017 2 commits
    • Paolo Bonzini's avatar
      KVM: x86: remove code for lazy FPU handling · bd7e5b08
      Paolo Bonzini authored
      
      
      The FPU is always active now when running KVM.
      
      Reviewed-by: default avatarDavid Matlack <dmatlack@google.com>
      Reviewed-by: default avatarBandan Das <bsd@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      bd7e5b08
    • Paolo Bonzini's avatar
      KVM: race-free exit from KVM_RUN without POSIX signals · 460df4c1
      Paolo Bonzini authored
      
      
      The purpose of the KVM_SET_SIGNAL_MASK API is to let userspace "kick"
      a VCPU out of KVM_RUN through a POSIX signal.  A signal is attached
      to a dummy signal handler; by blocking the signal outside KVM_RUN and
      unblocking it inside, this possible race is closed:
      
                VCPU thread                     service thread
         --------------------------------------------------------------
              check flag
                                                set flag
                                                raise signal
              (signal handler does nothing)
              KVM_RUN
      
      However, one issue with KVM_SET_SIGNAL_MASK is that it has to take
      tsk->sighand->siglock on every KVM_RUN.  This lock is often on a
      remote NUMA node, because it is on the node of a thread's creator.
      Taking this lock can be very expensive if there are many userspace
      exits (as is the case for SMP Windows VMs without Hyper-V reference
      time counter).
      
      As an alternative, we can put the flag directly in kvm_run so that
      KVM can see it:
      
                VCPU thread                     service thread
         --------------------------------------------------------------
                                                raise signal
              signal handler
                set run->immediate_exit
              KVM_RUN
                check run->immediate_exit
      
      Reviewed-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      460df4c1
  15. 16 Feb, 2017 1 commit
  16. 15 Feb, 2017 4 commits
    • Paolo Bonzini's avatar
      kvm: x86: do not use KVM_REQ_EVENT for APICv interrupt injection · b95234c8
      Paolo Bonzini authored
      Since bf9f6ac8
      
       ("KVM: Update Posted-Interrupts Descriptor when vCPU
      is blocked", 2015-09-18) the posted interrupt descriptor is checked
      unconditionally for PIR.ON.  Therefore we don't need KVM_REQ_EVENT to
      trigger the scan and, if NMIs or SMIs are not involved, we can avoid
      the complicated event injection path.
      
      Calling kvm_vcpu_kick if PIR.ON=1 is also useless, though it has been
      there since APICv was introduced.
      
      However, without the KVM_REQ_EVENT safety net KVM needs to be much
      more careful about races between vmx_deliver_posted_interrupt and
      vcpu_enter_guest.  First, the IPI for posted interrupts may be issued
      between setting vcpu->mode = IN_GUEST_MODE and disabling interrupts.
      If that happens, kvm_trigger_posted_interrupt returns true, but
      smp_kvm_posted_intr_ipi doesn't do anything about it.  The guest is
      entered with PIR.ON, but the posted interrupt IPI has not been sent
      and the interrupt is only delivered to the guest on the next vmentry
      (if any).  To fix this, disable interrupts before setting vcpu->mode.
      This ensures that the IPI is delayed until the guest enters non-root mode;
      it is then trapped by the processor causing the interrupt to be injected.
      
      Second, the IPI may be issued between kvm_x86_ops->sync_pir_to_irr(vcpu)
      and vcpu->mode = IN_GUEST_MODE.  In this case, kvm_vcpu_kick is called
      but it (correctly) doesn't do anything because it sees vcpu->mode ==
      OUTSIDE_GUEST_MODE.  Again, the guest is entered with PIR.ON but no
      posted interrupt IPI is pending; this time, the fix for this is to move
      the RVI update after IN_GUEST_MODE.
      
      Both issues were mostly masked by the liberal usage of KVM_REQ_EVENT,
      though the second could actually happen with VT-d posted interrupts.
      In both race scenarios KVM_REQ_EVENT would cancel guest entry, resulting
      in another vmentry which would inject the interrupt.
      
      This saves about 300 cycles on the self_ipi_* tests of vmexit.flat.
      
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b95234c8
    • Paolo Bonzini's avatar
      KVM: x86: do not scan IRR twice on APICv vmentry · 76dfafd5
      Paolo Bonzini authored
      
      
      Calls to apic_find_highest_irr are scanning IRR twice, once
      in vmx_sync_pir_from_irr and once in apic_search_irr.  Change
      sync_pir_from_irr to get the new maximum IRR from kvm_apic_update_irr;
      now that it does the computation, it can also do the RVI write.
      
      In order to avoid complications in svm.c, make the callback optional.
      
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      76dfafd5
    • Paolo Bonzini's avatar
    • Paolo Bonzini's avatar
      kvm: nVMX: move nested events check to kvm_vcpu_running · 0ad3bed6
      Paolo Bonzini authored
      
      
      vcpu_run calls kvm_vcpu_running, not kvm_arch_vcpu_runnable,
      and the former does not call check_nested_events.
      
      Once KVM_REQ_EVENT is removed from the APICv interrupt injection
      path, however, this would leave no place to trigger a vmexit
      from L2 to L1, causing a missed interrupt delivery while in guest
      mode.  This is caught by the "ack interrupt on exit" test in
      vmx.flat.
      
      [This does not change the calls to check_nested_events in
       inject_pending_event.  That is material for a separate cleanup.]
      
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      0ad3bed6
  17. 09 Feb, 2017 1 commit
    • Arnd Bergmann's avatar
      KVM: x86: hide KVM_HC_CLOCK_PAIRING on 32 bit · 8ef81a9a
      Arnd Bergmann authored
      The newly added hypercall doesn't work on x86-32:
      
      arch/x86/kvm/x86.c: In function 'kvm_pv_clock_pairing':
      arch/x86/kvm/x86.c:6163:6: error: implicit declaration of function 'kvm_get_walltime_and_clockread';did you mean 'kvm_get_time_scale'? [-Werror=implicit-function-declaration]
      
      This adds an #ifdef around it, matching the one around the related
      functions that are also only implemented on 64-bit systems.
      
      Fixes: 55dd00a7
      
       ("KVM: x86: add KVM_HC_CLOCK_PAIRING hypercall")
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      8ef81a9a
  18. 08 Feb, 2017 1 commit