  1. Apr 30, 2019
    • KVM: fix KVM_CLEAR_DIRTY_LOG for memory slots of unaligned size · 76d58e0f
      Paolo Bonzini authored
      
      If a memory slot's size is not a multiple of 64 pages (256K), then
      the KVM_CLEAR_DIRTY_LOG API is unusable: clearing the final 64 pages
      either requires the requested page range to go beyond memslot->npages,
      or requires log->num_pages to be unaligned, and kvm_clear_dirty_log_protect
      requires log->num_pages to be both in range and aligned.
      
      To allow this case, allow log->num_pages not to be a multiple of 64 if
      it ends exactly on the last page of the slot.
      
      Reported-by: Peter Xu <peterx@redhat.com>
      Fixes: 98938aa8 ("KVM: validate userspace input in kvm_clear_dirty_log_protect()", 2019-01-02)
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      76d58e0f
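The relaxed rule in the KVM_CLEAR_DIRTY_LOG fix above can be modelled in a few lines. This is a minimal Python sketch, not the kernel code: the function name and parameters are hypothetical, and the requirement that the range starts on a 64-page boundary is an assumption added here (the commit message only discusses the length and range of `log->num_pages`).

```python
PAGES_PER_WORD = 64  # one 64-bit dirty-bitmap word covers 64 pages


def clear_range_is_valid(npages: int, first_page: int, num_pages: int) -> bool:
    """Model of the relaxed check (hypothetical helper, not a kernel function).

    The range must stay inside the slot, and its length must be a multiple
    of 64 unless the range ends exactly on the last page of the slot.
    """
    # Assumption: the start of the range is 64-page aligned.
    if first_page % PAGES_PER_WORD != 0:
        return False
    # The range may not extend past the end of the memory slot.
    if first_page > npages or num_pages > npages - first_page:
        return False
    ends_at_slot_end = first_page + num_pages == npages
    return ends_at_slot_end or num_pages % PAGES_PER_WORD == 0
```

For a 100-page slot, clearing pages 64..99 (`first_page=64, num_pages=36`) is accepted under this rule because the range ends on the slot's last page, whereas `first_page=0, num_pages=36` is still rejected.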
    • x86/kvm/mmu: reset MMU context when 32-bit guest switches PAE · 0699c64a
      Vitaly Kuznetsov authored
      
      Commit 47c42e6b ("KVM: x86: fix handling of role.cr4_pae and rename it
      to 'gpte_size'") introduced a regression: 32-bit PAE guests stopped
      working. The issue appears to be: when the guest switches (enables) PAE we need
      to re-initialize the MMU context (set context->root_level, do
      reset_rsvds_bits_mask(), ...) but init_kvm_tdp_mmu() doesn't do that
      because we threw away the is_pae(vcpu) flag from the MMU role. Restore it to
      kvm_mmu_extended_role (as we now don't need it in the base role) to fix
      the issue.
      
      Fixes: 47c42e6b ("KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size'")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0699c64a
    • KVM: x86: Whitelist port 0x7e for pre-incrementing %rip · 8764ed55
      Sean Christopherson authored
      
      KVM's recent bug fix to update %rip after emulating I/O broke userspace
      that relied on the previous behavior of incrementing %rip prior to
      exiting to userspace.  When running a Windows XP guest on AMD hardware,
      Qemu may patch "OUT 0x7E" instructions in reaction to the OUT itself.
      Because KVM's old behavior was to increment %rip before exiting to
      userspace to handle the I/O, Qemu manually adjusted %rip to account for
      the OUT instruction.
      
      Arguably this is a userspace bug as KVM requires userspace to re-enter
      the kernel to complete instruction emulation before taking any other
      actions.  That being said, this is a bit of a grey area and breaking
      userspace that has worked for many years is bad.
      
      Pre-increment %rip on OUT to port 0x7e before exiting to userspace to
      hack around the issue.
      
      Fixes: 45def77e ("KVM: x86: update %rip after emulating IO")
      Reported-by: Simon Becherer <simon@becherer.de>
      Reported-and-tested-by: Iakov Karpov <srid@rkmail.ru>
      Reported-by: Gabriele Balducci <balducci@units.it>
      Reported-by: Antti Antinoja <reader@fennosys.fi>
      Cc: stable@vger.kernel.org
      Cc: Takashi Iwai <tiwai@suse.com>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8764ed55
  2. Apr 29, 2019
  3. Apr 27, 2019
    • KVM: VMX: Move RSB stuffing to before the first RET after VM-Exit · f2fde6a5
      Rick Edgecombe authored
      
      The not-so-recent change to move VMX's VM-Exit handling to a dedicated
      "function" unintentionally exposed KVM to a speculative attack from the
      guest by executing a RET prior to stuffing the RSB.  Make RSB stuffing
      happen immediately after VM-Exit, before any unpaired returns.
      
      Alternatively, the VM-Exit path could postpone full RSB stuffing until
      its current location by stuffing the RSB only as needed, or by avoiding
      returns in the VM-Exit path entirely, but both alternatives are beyond
      ugly since vmx_vmexit() has multiple indirect callers (by way of
      vmx_vmenter()).  And putting the RSB stuffing immediately after VM-Exit
      makes it much less likely to be re-broken in the future.
      
      Note, the cost of PUSH/POP could be avoided in the normal flow by
      pairing the PUSH RAX with the POP RAX in __vmx_vcpu_run() and adding
      a POP to nested_vmx_check_vmentry_hw(), but such a weird/subtle
      dependency is likely to cause problems in the long run, and PUSH/POP
      will take all of a few cycles, which is peanuts compared to the number
      of cycles required to fill the RSB.
      
      Fixes: 453eafbe ("KVM: VMX: Move VM-Enter + VM-Exit handling to non-inline sub-routines")
      Reported-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f2fde6a5
  4. Apr 18, 2019
    • KVM: lapic: Convert guest TSC to host time domain if necessary · b6aa57c6
      Sean Christopherson authored
      
      To minimize the latency of timer interrupts as observed by the guest,
      KVM adjusts the values it programs into the host timers to account for
      the host's overhead of programming and handling the timer event.  In
      the event that the adjustments are too aggressive, i.e. the timer fires
      earlier than the guest expects, KVM busy waits immediately prior to
      entering the guest.
      
      Currently, KVM manually converts the delay from nanoseconds to clock
      cycles.  But, the conversion is done in the guest's time domain, while
      the delay occurs in the host's time domain.  This is perfectly ok when
      the guest and host are using the same TSC ratio, but if the guest is
      using a different ratio then the delay may not be accurate and could
      wait too little or too long.
      
      When the guest is not using the host's ratio, convert the delay from
      guest clock cycles to host nanoseconds and use ndelay() instead of
      __delay() to provide more accurate timing.  Because converting to
      nanoseconds is relatively expensive, e.g. requires division and more
      multiplication ops, continue using __delay() directly when guest and
      host TSCs are running at the same ratio.
      
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Fixes: 3b8a5df6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b6aa57c6
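The conversion described in the commit above (guest clock cycles to host nanoseconds, with a fast path when both TSCs run at the same ratio) might look roughly like the following Python model. All names are hypothetical, the guest TSC frequency is assumed to be given in kHz, and the returned tuples merely stand in for the choice between `__delay()` and `ndelay()`.

```python
def guest_cycles_to_host_ns(guest_cycles: int, guest_tsc_khz: int) -> int:
    """Convert a delay in guest TSC cycles to nanoseconds.

    A TSC at f kHz ticks f * 1000 times per second, so
    ns = cycles * 1e9 / (f * 1000) = cycles * 1e6 / f.
    """
    return guest_cycles * 1_000_000 // guest_tsc_khz


def busy_wait_plan(guest_cycles: int, guest_tsc_khz: int, same_ratio: bool):
    """Pick the delay primitive, mirroring the trade-off in the commit.

    Returns ("cycles", n) where a cycle-based delay is accurate enough,
    or ("ns", n) where the delay must first be converted to nanoseconds.
    """
    if same_ratio:
        # Guest and host TSC tick at the same rate: skip the division.
        return ("cycles", guest_cycles)
    return ("ns", guest_cycles_to_host_ns(guest_cycles, guest_tsc_khz))
```

With a guest TSC of 1,500,000 kHz, a 3000-cycle delay corresponds to 2000 ns; waiting 3000 cycles of a faster or slower host TSC instead would undershoot or overshoot that time.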
    • KVM: lapic: Allow user to disable adaptive tuning of timer advancement · c3941d9e
      Sean Christopherson authored
      
      The introduction of adaptive tuning of lapic timer advancement did not
      allow for the scenario where userspace would want to disable adaptive
      tuning but still employ timer advancement, e.g. for testing purposes or
      to handle a use case where adaptive tuning is unable to settle on a
      suitable time.  This is especially pertinent now that KVM places a hard
      threshold on the maximum advancement time.
      
      Rework the timer semantics to accept signed values, with a value of '-1'
      being interpreted as "use adaptive tuning with KVM's internal default",
      and any other value being used as an explicit advancement time, e.g. a
      time of '0' effectively disables advancement.
      
      Note, this does not completely restore the original behavior of
      lapic_timer_advance_ns.  Prior to tracking the advancement per vCPU,
      which is necessary to support autotuning, userspace could adjust
      lapic_timer_advance_ns for a *running* vCPU.  With per-vCPU tracking, the
      module params are snapshotted at vCPU creation, i.e. applying a new
      advancement effectively requires restarting a VM.
      
      Dynamically updating a running vCPU is possible, e.g. a helper could be
      added to retrieve the desired delay, choosing between the global module
      param and the per-VCPU value depending on whether or not auto-tuning is
      (globally) enabled, but introduces a great deal of complexity.  The
      wrapper itself is not complex, but understanding and documenting the
      effects of dynamically toggling auto-tuning and/or adjusting the timer
      advancement is nigh impossible since the behavior would be dependent on
      KVM's implementation as well as compiler optimizations.  In other words,
      providing stable behavior would require extremely careful consideration
      now and in the future.
      
      Given that the expected use of a manually-tuned timer advancement is to
      "tune once, run many", use the vastly simpler approach of recognizing
      changes to the module params only when creating a new vCPU.
      
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Cc: stable@vger.kernel.org
      Fixes: 3b8a5df6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c3941d9e
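The signed semantics described in the commit above, snapshotted once at vCPU creation, can be sketched as follows. This is an illustrative Python model with hypothetical names; the 1000 ns starting value used for adaptive tuning is an assumption, since the commit message only refers to "KVM's internal default".

```python
from typing import NamedTuple

ASSUMED_DEFAULT_NS = 1000  # assumption: stand-in for KVM's internal default


class TimerAdvance(NamedTuple):
    advance_ns: int  # advancement applied to this vCPU's timer
    adaptive: bool   # whether adaptive tuning may keep adjusting it


def snapshot_timer_advance(param: int,
                           default_ns: int = ASSUMED_DEFAULT_NS) -> TimerAdvance:
    """Interpret the module parameter once, when a vCPU is created."""
    if param == -1:
        # -1: use adaptive tuning, starting from the internal default.
        return TimerAdvance(default_ns, True)
    # Any other value is an explicit advancement; 0 disables advancement.
    return TimerAdvance(param, False)
```

Because the value is captured per vCPU at creation, changing the parameter later only affects vCPUs created afterwards, which is the "tune once, run many" model the commit settles on.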
    • KVM: lapic: Track lapic timer advance per vCPU · 39497d76
      Sean Christopherson authored
      
      Automatically adjusting the globally-shared timer advancement could
      corrupt the timer, e.g. if multiple vCPUs are concurrently adjusting
      the advancement value.  That could be partially fixed by using a local
      variable for the arithmetic, but it would still be susceptible to a
      race when setting timer_advance_adjust_done.
      
      And because virtual_tsc_khz and tsc_scaling_ratio are per-vCPU, the
      correct calibration for a given vCPU may not apply to all vCPUs.
      
      Furthermore, lapic_timer_advance_ns is marked __read_mostly, which is
      effectively violated when finding a stable advancement takes an extended
      amount of time.
      
      Opportunistically change the definition of lapic_timer_advance_ns to
      a u32 so that it matches the style of struct kvm_timer.  Explicitly
      pass the param to kvm_create_lapic() so that it doesn't have to be
      exposed to lapic.c, thus reducing the probability of unintentionally
      using the global value instead of the per-vCPU value.
      
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Cc: stable@vger.kernel.org
      Fixes: 3b8a5df6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      39497d76
    • KVM: lapic: Disable timer advancement if adaptive tuning goes haywire · 57bf67e7
      Sean Christopherson authored
      
      To minimize the latency of timer interrupts as observed by the guest,
      KVM adjusts the values it programs into the host timers to account for
      the host's overhead of programming and handling the timer event.  Now
      that the timer advancement is automatically tuned during runtime, it's
      effectively unbounded by default, e.g. if KVM is running as L1 the
      advancement can measure in hundreds of milliseconds.
      
      Disable timer advancement if adaptive tuning yields an advancement of
      more than 5000ns, as large advancements can break reasonable assumptions
      of the guest, e.g. that a timer configured to fire after 1ms won't
      arrive on the next instruction.  Although KVM busy waits to mitigate the
      case of a timer event arriving too early, complications can arise when
      shifting the interrupt too far, e.g. kvm-unit-test's vmx.interrupt test
      will fail when its "host" exits on interrupts as KVM may inject the INTR
      before the guest executes STI+HLT.   Arguably the unit test is "broken"
      in the sense that delaying a timer interrupt by 1ms doesn't technically
      guarantee the interrupt will arrive after STI+HLT, but it's a reasonable
      assumption that KVM should support.
      
      Furthermore, an unbounded advancement also effectively unbounds the time
      spent busy waiting, e.g. if the guest programs a timer with a very large
      delay.
      
      5000ns is a somewhat arbitrary threshold.  When running on bare metal,
      which is the intended use case, timer advancement is expected to be in
      the general vicinity of 1000ns.  5000ns is high enough that false
      positives are unlikely, while not being so high as to negatively affect
      the host's performance/stability.
      
      Note, a future patch will enable userspace to disable KVM's adaptive
      tuning, which will allow privileged userspace to specify an
      advancement value in excess of this arbitrary threshold in order to
      satisfy an abnormal use case.
      
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Fixes: 3b8a5df6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      57bf67e7
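The 5000 ns cutoff from the commit above amounts to a one-line guard. The Python sketch below uses hypothetical names and only models the decision, not how KVM measures or adjusts the advancement.

```python
MAX_ADAPTIVE_ADVANCE_NS = 5000  # threshold quoted in the commit message


def apply_tuned_advance(tuned_ns: int) -> int:
    """Return the advancement to use after an adaptive tuning step.

    An advancement of more than 5000 ns is treated as tuning gone wrong,
    and timer advancement is disabled (0) rather than clamped.
    """
    if tuned_ns > MAX_ADAPTIVE_ADVANCE_NS:
        return 0
    return tuned_ns
```

Disabling instead of clamping matches the wording "Disable timer advancement if adaptive tuning yields an advancement of more than 5000ns"; a value of exactly 5000 ns is still accepted in this model.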
    • x86: kvm: hyper-v: deal with buggy TLB flush requests from WS2012 · da66761c
      Vitaly Kuznetsov authored
      
      It was reported that with some special Multi Processor Group configuration,
      e.g:
       bcdedit.exe /set groupsize 1
       bcdedit.exe /set maxgroup on
       bcdedit.exe /set groupaware on
      for a 16-vCPU guest WS2012 shows BSOD on boot when PV TLB flush mechanism
      is in use.
      
      Tracing kvm_hv_flush_tlb immediately reveals the issue:
      
       kvm_hv_flush_tlb: processor_mask 0x0 address_space 0x0 flags 0x2
      
      The only flag set in this request is HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES,
      however, processor_mask is 0x0 and no HV_FLUSH_ALL_PROCESSORS is specified.
      We don't flush anything and apparently it's not what Windows expects.
      
      TLFS doesn't say anything about such requests and newer Windows versions
      seem to be unaffected. This all feels like a WS2012 bug, which is, however,
      easy to work around in KVM: flush everything when we see an empty
      flush request; over-flushing doesn't hurt.
      
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      da66761c
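The workaround in the commit above boils down to treating an empty processor mask like a request for all processors. A small Python model follows, with a hypothetical helper name; the value 0x2 for HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES follows from the trace line, while bit 0 for HV_FLUSH_ALL_PROCESSORS is an assumption here.

```python
HV_FLUSH_ALL_PROCESSORS = 1 << 0              # assumed bit position
HV_FLUSH_ALL_VIRTUAL_ADDRESS_SPACES = 1 << 1  # 0x2, as in the trace above


def vcpus_to_flush(processor_mask: int, flags: int, num_vcpus: int) -> list:
    """Return the vCPU indices whose TLBs should be flushed.

    An empty mask without HV_FLUSH_ALL_PROCESSORS is treated as
    "flush everything"; over-flushing is harmless, under-flushing is not.
    """
    all_cpus = bool(flags & HV_FLUSH_ALL_PROCESSORS) or processor_mask == 0
    if all_cpus:
        return list(range(num_vcpus))
    return [i for i in range(num_vcpus) if processor_mask & (1 << i)]
```

The request from the trace (`processor_mask 0x0`, `flags 0x2`) selects every vCPU under this rule, instead of flushing nothing.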
    • KVM: x86: Consider LAPIC TSC-Deadline timer expired if deadline too short · c09d65d9
      Liran Alon authored
      
      If the guest sets MSR_IA32_TSCDEADLINE to a value such that, in the host
      time domain, the deadline is closer than lapic_timer_advance_ns, we can
      end up calling hrtimer_start() with an expiration time in the past.
      
      Because lapic_timer.timer is initialized with HRTIMER_MODE_ABS_PINNED, it
      is not allowed to run in softirq and therefore will never expire.
      
      To avoid such a scenario, verify that the deadline expiration time in the
      host time domain is further away than (now + lapic_timer_advance_ns).
      
      A future patch could also add a min_timer_deadline_ns module parameter,
      similar to min_timer_period_us, to avoid races where the nanoseconds it
      takes to run this logic still result in hrtimer_start() being called with
      an expiration time in the past.
      
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c09d65d9
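The check described in the commit above can be illustrated with a short Python model. Names and the tuple return values are hypothetical; all times are nanoseconds in the host time domain.

```python
def arm_tscdeadline(now_ns: int, deadline_ns: int, advance_ns: int):
    """Decide whether to arm the host timer or treat the deadline as expired.

    The timer is armed only if the deadline lies further away than
    now + advance_ns; otherwise subtracting the advancement would yield
    an expiration time in the past.
    """
    if deadline_ns > now_ns + advance_ns:
        # Fire early by advance_ns to hide the host's own latency.
        return ("arm", deadline_ns - advance_ns)
    # Too close: handle the timer as already expired.
    return ("expired", None)
```

For example, with `now_ns=1000` and `advance_ns=1000`, a deadline of 5000 arms the timer for 4000, while a deadline of 1500 is handled as already expired instead of being armed for 500, which lies in the past.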
    • Merge tag 'kvm-ppc-fixes-5.1-1' of... · 78671ab4
      Paolo Bonzini authored
      Merge tag 'kvm-ppc-fixes-5.1-1' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD
      
      KVM/PPC fixes for 5.1
      
      - Fix host hang in the HTM assist code for POWER9
      - Take srcu read lock around memslot lookup
      78671ab4
  5. Apr 16, 2019
  6. Apr 15, 2019
    • KVM: x86/mmu: Fix an inverted list_empty() check when zapping sptes · cfd32acf
      Sean Christopherson authored
      
      A recently introduced helper for handling zap vs. remote flush
      incorrectly bails early, effectively leaking defunct shadow pages.
      Manifests as a slab BUG when exiting KVM due to the shadow pages
      being alive when their associated cache is destroyed.
      
      ==========================================================================
      BUG kvm_mmu_page_header: Objects remaining in kvm_mmu_page_header on ...
      --------------------------------------------------------------------------
      Disabling lock debugging due to kernel taint
      INFO: Slab 0x00000000fc436387 objects=26 used=23 fp=0x00000000d023caee ...
      CPU: 6 PID: 4315 Comm: rmmod Tainted: G    B             5.1.0-rc2+ #19
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      Call Trace:
       dump_stack+0x46/0x5b
       slab_err+0xad/0xd0
       ? on_each_cpu_mask+0x3c/0x50
       ? ksm_migrate_page+0x60/0x60
       ? on_each_cpu_cond_mask+0x7c/0xa0
       ? __kmalloc+0x1ca/0x1e0
       __kmem_cache_shutdown+0x13a/0x310
       shutdown_cache+0xf/0x130
       kmem_cache_destroy+0x1d5/0x200
       kvm_mmu_module_exit+0xa/0x30 [kvm]
       kvm_arch_exit+0x45/0x60 [kvm]
       kvm_exit+0x6f/0x80 [kvm]
       vmx_exit+0x1a/0x50 [kvm_intel]
       __x64_sys_delete_module+0x153/0x1f0
       ? exit_to_usermode_loop+0x88/0xc0
       do_syscall_64+0x4f/0x100
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: a2113634 ("KVM: x86/mmu: Split remote_flush+zap case out of kvm_mmu_flush_or_zap()")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cfd32acf
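The effect of an inverted emptiness check, as in the commit above, is easy to reproduce in a toy model. The Python sketch below is a simplified stand-in for the helper, not the kernel function: the name, the string return values and the exact role of `remote_flush` are assumptions, and the `inverted` flag exists only to reproduce the bug.

```python
def remote_flush_or_zap(invalid_list: list, remote_flush: bool,
                        inverted: bool = False) -> str:
    """Zap queued pages, or flush remotely, or do nothing.

    With inverted=True the early-exit test checks for a non-empty list
    instead of an empty one, so queued pages are never freed.
    """
    empty = not invalid_list
    bail = (not empty) if inverted else empty
    if not remote_flush and bail:
        return "nothing"
    if invalid_list:
        invalid_list.clear()  # stands in for freeing the shadow pages
        return "zapped"
    return "flushed"
```

Calling the buggy variant with a non-empty list and `remote_flush=False` returns early and leaves the list untouched, which corresponds to the leaked shadow pages reported by the slab BUG when the cache is destroyed.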
  7. Apr 10, 2019
  8. Apr 09, 2019
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 869e3305
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Off by one and bounds checking fixes in NFC, from Dan Carpenter.
      
       2) There have been many weird regressions in r8169 since we turned ASPM
          support on, some are still not understood nor completely resolved.
          Let's turn this back off for now. From Heiner Kallweit.
      
       3) Signedness fixes for ethtool speed value handling, from Michael
          Zhivich.
      
       4) Handle timestamps properly in macb driver, from Paul Thomas.
      
       5) Two erspan fixes, it's the usual "skb ->data potentially reallocated
          and we're holding a stale protocol header pointer". From Lorenzo
          Bianconi.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
        bnxt_en: Reset device on RX buffer errors.
        bnxt_en: Improve RX consumer index validity check.
        net: macb driver, check for SKBTX_HW_TSTAMP
        qlogic: qlcnic: fix use of SPEED_UNKNOWN ethtool constant
        broadcom: tg3: fix use of SPEED_UNKNOWN ethtool constant
        ethtool: avoid signed-unsigned comparison in ethtool_validate_speed()
        net: ip6_gre: fix possible use-after-free in ip6erspan_rcv
        net: ip_gre: fix possible use-after-free in erspan_rcv
        r8169: disable ASPM again
        MAINTAINERS: ieee802154: update documentation file pattern
        net: vrf: Fix ping failed when vrf mtu is set to 0
        selftests: add a tc matchall test case
        nfc: nci: Potential off by one in ->pipes[] array
        NFC: nci: Add some bounds checking in nci_hci_cmd_received()
      869e3305
    • Merge branch 'fixes-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security · a556810d
      Linus Torvalds authored
      Pull TPM fixes from James Morris:
       "From Jarkko: These are critical fixes for v5.1. Contains also couple
        of new selftests for v5.1 features (partial reads in /dev/tpm0)"
      
      * 'fixes-v5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
        selftests/tpm2: Open tpm dev in unbuffered mode
        selftests/tpm2: Extend tests to cover partial reads
        KEYS: trusted: fix -Wvarags warning
        tpm: Fix the type of the return value in calc_tpm2_event_size()
        KEYS: trusted: allow trusted.ko to initialize w/o a TPM
        tpm: fix an invalid condition in tpm_common_poll
        tpm: turn on TPM on suspend for TPM 1.x
      a556810d
    • Merge tag 'xtensa-20190408' of git://github.com/jcmvbkbc/linux-xtensa · 10d43397
      Linus Torvalds authored
      Pull xtensa fixes from Max Filippov:
      
       - fix syscall number passed to trace_sys_exit
      
       - fix syscall number initialization in start_thread
      
       - fix level interpretation in the return_address
      
       - fix format string warning in init_pmd
      
      * tag 'xtensa-20190408' of git://github.com/jcmvbkbc/linux-xtensa:
        xtensa: fix format string warning in init_pmd
        xtensa: fix return_address
        xtensa: fix initialization of pt_regs::syscall in start_thread
        xtensa: use actual syscall number in do_syscall_trace_leave
      10d43397
  9. Apr 08, 2019
    • Merge branch 'bnxt_en-fixes' · e063f459
      David S. Miller authored
      
      Michael Chan says:
      
      ====================
      bnxt_en: 2 bug fixes.
      
      The first patch prevents possible driver crash if we get a bad RX index
      from the hardware.  The second patch resets the device when the hardware
      reports buffer error to recover from the error.
      
      Please queue these for -stable also.  Thanks.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e063f459