Skip to content
Snippets Groups Projects
  1. Sep 15, 2017
  2. Sep 14, 2017
    • Wanpeng Li's avatar
      KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously · 9a6e7c39
      Wanpeng Li authored
      
      qemu-system-x86-8600  [004] d..1  7205.687530: kvm_entry: vcpu 2
      qemu-system-x86-8600  [004] ....  7205.687532: kvm_exit: reason EXCEPTION_NMI rip 0xffffffffa921297d info ffffeb2c0e44e018 80000b0e
      qemu-system-x86-8600  [004] ....  7205.687532: kvm_page_fault: address ffffeb2c0e44e018 error_code 0
      qemu-system-x86-8600  [004] ....  7205.687620: kvm_try_async_get_page: gva = 0xffffeb2c0e44e018, gfn = 0x427e4e
      qemu-system-x86-8600  [004] .N..  7205.687628: kvm_async_pf_not_present: token 0x8b002 gva 0xffffeb2c0e44e018
          kworker/4:2-7814  [004] ....  7205.687655: kvm_async_pf_completed: gva 0xffffeb2c0e44e018 address 0x7fcc30c4e000
      qemu-system-x86-8600  [004] ....  7205.687703: kvm_async_pf_ready: token 0x8b002 gva 0xffffeb2c0e44e018
      qemu-system-x86-8600  [004] d..1  7205.687711: kvm_entry: vcpu 2
      
      After running some memory intensive workload in guest, I catch the kworker
      which completes the GUP too quickly, and queues an "Page Ready" #PF exception
      after the "Page not Present" exception before the next vmentry as the above
      trace which will result in #DF injected to guest.
      
      This patch fixes it by clearing the queue for "Page not Present" if "Page Ready"
      occurs before the next vmentry since the GUP has already got the required page
      and shadow page table has already been fixed by "Page Ready" handler.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      Fixes: 7c90705b ("KVM: Inject asynchronous page fault into a PV guest if page is swapped out.")
      [Changed indentation and added clearing of injected. - Radim]
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      9a6e7c39
    • Wanpeng Li's avatar
      KVM: X86: Don't block vCPU if there is pending exception · a5f01f8e
      Wanpeng Li authored
      
      Don't block vCPU if there is pending exception.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: default avatarWanpeng Li <wanpeng.li@hotmail.com>
      Reviewed-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      a5f01f8e
    • Suravee Suthikulpanit's avatar
      KVM: SVM: Add irqchip_split() checks before enabling AVIC · 67034bb9
      Suravee Suthikulpanit authored
      
      SVM AVIC hardware accelerates guest write to APIC_EOI register
      (for edge-trigger interrupt), which means it does not trap to KVM.
      
      So, only enable SVM AVIC only in split irqchip mode.
      (e.g. launching qemu w/ option '-machine kernel_irqchip=split').
      
      Suggested-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Fixes: 44a95dae ("KVM: x86: Detect and Initialize AVIC support")
      [Removed pr_debug - Radim.]
      Signed-off-by: default avatarRadim Krčmář <rkrcmar@redhat.com>
      67034bb9
  3. Sep 13, 2017
  4. Sep 12, 2017
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Fix bug causing host SLB to be restored incorrectly · 67f8a8c1
      Paul Mackerras authored
      
      Aneesh Kumar reported seeing host crashes when running recent kernels
      on POWER8.  The symptom was an oops like this:
      
      Unable to handle kernel paging request for data at address 0xf00000000786c620
      Faulting instruction address: 0xc00000000030e1e4
      Oops: Kernel access of bad area, sig: 11 [#1]
      LE SMP NR_CPUS=2048 NUMA PowerNV
      Modules linked in: powernv_op_panel
      CPU: 24 PID: 6663 Comm: qemu-system-ppc Tainted: G        W 4.13.0-rc7-43932-gfc36c59 #2
      task: c000000fdeadfe80 task.stack: c000000fdeb68000
      NIP:  c00000000030e1e4 LR: c00000000030de6c CTR: c000000000103620
      REGS: c000000fdeb6b450 TRAP: 0300   Tainted: G        W        (4.13.0-rc7-43932-gfc36c59)
      MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24044428  XER: 20000000
      CFAR: c00000000030e134 DAR: f00000000786c620 DSISR: 40000000 SOFTE: 0
      GPR00: 0000000000000000 c000000fdeb6b6d0 c0000000010bd000 000000000000e1b0
      GPR04: c00000000115e168 c000001fffa6e4b0 c00000000115d000 c000001e1b180386
      GPR08: f000000000000000 c000000f9a8913e0 f00000000786c600 00007fff587d0000
      GPR12: c000000fdeb68000 c00000000fb0f000 0000000000000001 00007fff587cffff
      GPR16: 0000000000000000 c000000000000000 00000000003fffff c000000fdebfe1f8
      GPR20: 0000000000000004 c000000fdeb6b8a8 0000000000000001 0008000000000040
      GPR24: 07000000000000c0 00007fff587cffff c000000fdec20bf8 00007fff587d0000
      GPR28: c000000fdeca9ac0 00007fff587d0000 00007fff587c0000 00007fff587d0000
      NIP [c00000000030e1e4] __get_user_pages_fast+0x434/0x1070
      LR [c00000000030de6c] __get_user_pages_fast+0xbc/0x1070
      Call Trace:
      [c000000fdeb6b6d0] [c00000000139dab8] lock_classes+0x0/0x35fe50 (unreliable)
      [c000000fdeb6b7e0] [c00000000030ef38] get_user_pages_fast+0xf8/0x120
      [c000000fdeb6b830] [c000000000112318] kvmppc_book3s_hv_page_fault+0x308/0xf30
      [c000000fdeb6b960] [c00000000010e10c] kvmppc_vcpu_run_hv+0xfdc/0x1f00
      [c000000fdeb6bb20] [c0000000000e915c] kvmppc_vcpu_run+0x2c/0x40
      [c000000fdeb6bb40] [c0000000000e5650] kvm_arch_vcpu_ioctl_run+0x110/0x300
      [c000000fdeb6bbe0] [c0000000000d6468] kvm_vcpu_ioctl+0x528/0x900
      [c000000fdeb6bd40] [c0000000003bc04c] do_vfs_ioctl+0xcc/0x950
      [c000000fdeb6bde0] [c0000000003bc930] SyS_ioctl+0x60/0x100
      [c000000fdeb6be30] [c00000000000b96c] system_call+0x58/0x6c
      Instruction dump:
      7ca81a14 2fa50000 41de0010 7cc8182a 68c60002 78c6ffe2 0b060000 3cc2000a
      794a3664 390610d8 e9080000 7d485214 <e90a0020> 7d435378 790507e1 408202f0
      ---[ end trace fad4a342d0414aa2 ]---
      
      It turns out that what has happened is that the SLB entry for the
      vmmemap region hasn't been reloaded on exit from a guest, and it has
      the wrong page size.  Then, when the host next accesses the vmemmap
      region, it gets a page fault.
      
      Commit a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with
      KVM", 2017-07-24) modified the guest exit code so that it now only clears
      out the SLB for hash guest.  The code tests the radix flag and puts the
      result in a non-volatile CR field, CR2, and later branches based on CR2.
      
      Unfortunately, the kvmppc_save_tm function, which gets called between
      those two points, modifies all the user-visible registers in the case
      where the guest was in transactional or suspended state, except for a
      few which it restores (namely r1, r2, r9 and r13).  Thus the hash/radix indication in CR2 gets corrupted.
      
      This fixes the problem by re-doing the comparison just before the
      result is needed.  For good measure, this also adds comments next to
      the call sites of kvmppc_save_tm and kvmppc_restore_tm pointing out
      that non-volatile register state will be lost.
      
      Cc: stable@vger.kernel.org # v4.13
      Fixes: a25bd72b ("powerpc/mm/radix: Workaround prefetch issue with KVM")
      Tested-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      67f8a8c1
    • Paul Mackerras's avatar
      KVM: PPC: Book3S HV: Hold kvm->lock around call to kvmppc_update_lpcr · cf5f6f31
      Paul Mackerras authored
      
      Commit 468808bd ("KVM: PPC: Book3S HV: Set process table for HPT
      guests on POWER9", 2017-01-30) added a call to kvmppc_update_lpcr()
      which doesn't hold the kvm->lock mutex around the call, as required.
      This adds the lock/unlock pair, and for good measure, includes
      the kvmppc_setup_partition_table() call in the locked region, since
      it is altering global state of the VM.
      
      This error appears not to have any fatal consequences for the host;
      the consequences would be that the VCPUs could end up running with
      different LPCR values, or an update to the LPCR value by userspace
      using the one_reg interface could get overwritten, or the update
      done by kvmhv_configure_mmu() could get overwritten.
      
      Cc: stable@vger.kernel.org # v4.10+
      Fixes: 468808bd ("KVM: PPC: Book3S HV: Set process table for HPT guests on POWER9")
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      cf5f6f31
    • Benjamin Herrenschmidt's avatar
      KVM: PPC: Book3S HV: Don't access XIVE PIPR register using byte accesses · d222af07
      Benjamin Herrenschmidt authored
      
      The XIVE interrupt controller on POWER9 machines doesn't support byte
      accesses to any register in the thread management area other than the
      CPPR (current processor priority register).  In particular, when
      reading the PIPR (pending interrupt priority register), we need to
      do a 32-bit or 64-bit load.
      
      Cc: stable@vger.kernel.org # v4.13
      Fixes: 2c4fb78f ("KVM: PPC: Book3S HV: Workaround POWER9 DD1.0 bug causing IPB bit loss")
      Signed-off-by: default avatarBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarPaul Mackerras <paulus@ozlabs.org>
      d222af07
  5. Sep 07, 2017
    • Andy Lutomirski's avatar
      x86/mm: Document how CR4.PCIDE restore works · 1c9fe440
      Andy Lutomirski authored
      
      While debugging a problem, I thought that using
      cr4_set_bits_and_update_boot() to restore CR4.PCIDE would be
      helpful.  It turns out to be counterproductive.
      
      Add a comment documenting how this works.
      
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1c9fe440
    • Andy Lutomirski's avatar
      x86/mm: Reinitialize TLB state on hotplug and resume · 72c0098d
      Andy Lutomirski authored
      
      When Linux brings a CPU down and back up, it switches to init_mm and then
      loads swapper_pg_dir into CR3.  With PCID enabled, this has the side effect
      of masking off the ASID bits in CR3.
      
      This can result in some confusion in the TLB handling code.  If we
      bring a CPU down and back up with any ASID other than 0, we end up
      with the wrong ASID active on the CPU after resume.  This could
      cause our internal state to become corrupt, although major
      corruption is unlikely because init_mm doesn't have any user pages.
      More obviously, if CONFIG_DEBUG_VM=y, we'll trip over an assertion
      in the next context switch.  The result of *that* is a failure to
      resume from suspend with probability 1 - 1/6^(cpus-1).
      
      Fix it by reinitializing cpu_tlbstate on resume and CPU bringup.
      
      Reported-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Reported-by: default avatarJiri Kosina <jikos@kernel.org>
      Fixes: 10af6235 ("x86/mm: Implement PCID based optimization: try to preserve old TLB entries using PCID")
      Signed-off-by: default avatarAndy Lutomirski <luto@kernel.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      72c0098d
    • Rik van Riel's avatar
      mm,fork: introduce MADV_WIPEONFORK · d2cd9ede
      Rik van Riel authored
      Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
      in the child process after fork.  This differs from MADV_DONTFORK in one
      important way.
      
      If a child process accesses memory that was MADV_WIPEONFORK, it will get
      zeroes.  The address ranges are still valid, they are just empty.
      
      If a child process accesses memory that was MADV_DONTFORK, it will get a
      segmentation fault, since those address ranges are no longer valid in
      the child after fork.
      
      Since MADV_DONTFORK also seems to be used to allow very large programs
      to fork in systems with strict memory overcommit restrictions, changing
      the semantics of MADV_DONTFORK might break existing programs.
      
      MADV_WIPEONFORK only works on private, anonymous VMAs.
      
      The use case is libraries that store or cache information, and want to
      know that they need to regenerate it in the child process after fork.
      
      Examples of this would be:
       - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
         check, which is too slow without a PID cache)
       - PKCS#11 API reinitialization check (mandated by specification)
       - glibc's upcoming PRNG (reseed after fork)
       - OpenSSL PRNG (reseed after fork)
      
      The security benefits of a forking server having a re-inialized PRNG in
      every child process are pretty obvious.  However, due to libraries
      having all kinds of internal state, and programs getting compiled with
      many different versions of each library, it is unreasonable to expect
      calling programs to re-initialize everything manually after fork.
      
      A further complication is the proliferation of clone flags, programs
      bypassing glibc's functions to call clone directly, and programs calling
      unshare, causing the glibc pthread_atfork hook to not get called.
      
      It would be better to have the kernel take care of this automatically.
      
      The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
      MADV_WIPEONFORK.
      
      This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
      
          https://man.openbsd.org/minherit.2
      
      [akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
      Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
      
      
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Reported-by: default avatarFlorian Weimer <fweimer@redhat.com>
      Reported-by: default avatarColm MacCártaigh <colm@allcosts.net>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Drewry <wad@chromium.org>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d2cd9ede
    • Rik van Riel's avatar
      x86,mpx: make mpx depend on x86-64 to free up VMA flag · df3735c5
      Rik van Riel authored
      Patch series "mm,fork,security: introduce MADV_WIPEONFORK", v4.
      
      If a child process accesses memory that was MADV_WIPEONFORK, it will get
      zeroes.  The address ranges are still valid, they are just empty.
      
      If a child process accesses memory that was MADV_DONTFORK, it will get a
      segmentation fault, since those address ranges are no longer valid in
      the child after fork.
      
      Since MADV_DONTFORK also seems to be used to allow very large programs
      to fork in systems with strict memory overcommit restrictions, changing
      the semantics of MADV_DONTFORK might break existing programs.
      
      The use case is libraries that store or cache information, and want to
      know that they need to regenerate it in the child process after fork.
      
      Examples of this would be:
       - systemd/pulseaudio API checks (fail after fork) (replacing a getpid
         check, which is too slow without a PID cache)
       - PKCS#11 API reinitialization check (mandated by specification)
       - glibc's upcoming PRNG (reseed after fork)
       - OpenSSL PRNG (reseed after fork)
      
      The security benefits of a forking server having a re-inialized PRNG in
      every child process are pretty obvious.  However, due to libraries
      having all kinds of internal state, and programs getting compiled with
      many different versions of each library, it is unreasonable to expect
      calling programs to re-initialize everything manually after fork.
      
      A further complication is the proliferation of clone flags, programs
      bypassing glibc's functions to call clone directly, and programs calling
      unshare, causing the glibc pthread_atfork hook to not get called.
      
      It would be better to have the kernel take care of this automatically.
      
      The patchset also adds MADV_KEEPONFORK, to undo the effects of a prior
      MADV_WIPEONFORK.
      
      This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
      
          https://man.openbsd.org/minherit.2
      
      This patch (of 2):
      
      MPX only seems to be available on 64 bit CPUs, starting with Skylake and
      Goldmont.  Move VM_MPX into the 64 bit only portion of vma->vm_flags, in
      order to free up a VMA flag.
      
      Link: http://lkml.kernel.org/r/20170811212829.29186-2-riel@redhat.com
      
      
      Signed-off-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarDave Hansen <dave.hansen@intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Colm MacCártaigh <colm@allcosts.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      df3735c5
    • Mike Kravetz's avatar
      mm: arch: consolidate mmap hugetlb size encodings · aafd4562
      Mike Kravetz authored
      A non-default huge page size can be encoded in the flags argument of the
      mmap system call.  The definitions for these encodings are in arch
      specific header files.  However, all architectures use the same values.
      
      Consolidate all the definitions in the primary user header file
      (uapi/linux/mman.h).  Include definitions for all known huge page sizes.
      Use the generic encoding definitions in hugetlb_encode.h as the basis
      for these definitions.
      
      Link: http://lkml.kernel.org/r/1501527386-10736-3-git-send-email-mike.kravetz@oracle.com
      
      
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aafd4562
    • Dou Liyang's avatar
      metag/numa: remove the unused parent_node() macro · f0cd3406
      Dou Liyang authored
      Commit a7be6e5a ("mm: drop useless local parameters of
      __register_one_node()") removes the last user of parent_node().
      
      The parent_node() macro in METAG architecture is unnecessary.
      
      Remove it for cleanup.
      
      Link: http://lkml.kernel.org/r/1501076076-1974-4-git-send-email-douly.fnst@cn.fujitsu.com
      
      
      Signed-off-by: default avatarDou Liyang <douly.fnst@cn.fujitsu.com>
      Reported-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Cc: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f0cd3406
  6. Sep 05, 2017
  7. Sep 04, 2017
Loading