- Feb 04, 2025
Paolo Bonzini authored
The only statement in a kvm_arch_post_init_vm implementation can be moved into the x86 kvm_arch_init_vm. Do so and remove all traces from architecture-independent code.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- Jan 31, 2025
Sean Christopherson authored
Exempt KVM-internal memslots from the KVM_MEM_MAX_NR_PAGES restriction, as the limit on the number of pages exists purely to play nice with dirty bitmap operations, which use 32-bit values to index the bitmaps, and dirty logging isn't supported for KVM-internal memslots.

Link: https://lore.kernel.org/all/20240802205003.353672-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250123144627.312456-2-imbrenda@linux.ibm.com
Signed-off-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Message-ID: <20250123144627.312456-2-imbrenda@linux.ibm.com>
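For illustration, a minimal sketch of the shape such a check could take, assuming the exemption keys off the KVM-internal slot id range (ids at or above KVM_USER_MEM_SLOTS); this is not the verbatim kernel code:

    /* The 32-bit dirty-bitmap index only constrains slots that can be
     * dirty-logged, i.e. userspace memslots; KVM-internal slots are
     * exempt from the size cap.
     */
    if (id < KVM_USER_MEM_SLOTS &&
        mem->memory_size >> PAGE_SHIFT > KVM_MEM_MAX_NR_PAGES)
            return -EINVAL;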
-
- Jan 15, 2025
Sean Christopherson authored
Disallow all flags for KVM-internal memslots, as all existing flags require some amount of userspace interaction to have any meaning. In addition to guarding against KVM goofs, explicitly disallowing dirty logging of KVM-internal memslots will (hopefully) allow exempting KVM-internal memslots from the KVM_MEM_MAX_NR_PAGES limit, which appears to exist purely because the dirty bitmap operations use a 32-bit index.

Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
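The guard presumably reduces to a single early check in the set-memslot path; a hedged sketch, assuming internal slots use the id range at or above KVM_USER_MEM_SLOTS:

    /* KVM-internal memslots must not specify any flags; every existing
     * flag requires userspace interaction to be meaningful.
     */
    if (id >= KVM_USER_MEM_SLOTS && mem->flags)
            return -EINVAL;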
-
Sean Christopherson authored
Now that there's no outer wrapper for __kvm_set_memory_region() and it's static, drop its double-underscore prefix.

No functional change intended.

Cc: Tao Su <tao1.su@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
-
Sean Christopherson authored
Add a dedicated API for setting internal memslots, and have it explicitly disallow setting userspace memslots. Setting a userspace memslot without a direct command from userspace would result in all manner of issues.

No functional change intended.

Cc: Tao Su <tao1.su@linux.intel.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
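A sketch of what such a dedicated API could look like; the exact prototype is an assumption, not copied from the patch:

    int kvm_set_internal_memslot(struct kvm *kvm,
                                 const struct kvm_userspace_memory_region2 *mem)
    {
            /* Internal slots live above the userspace slot id range;
             * refuse anything that looks like a userspace memslot.
             */
            if (WARN_ON_ONCE(mem->slot < KVM_USER_MEM_SLOTS))
                    return -EINVAL;

            return kvm_set_memory_region(kvm, mem);
    }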
-
Sean Christopherson authored
Add proper lockdep assertions in __kvm_set_memory_region() and __x86_set_memory_region() instead of relying on comments. Opportunistically delete __kvm_set_memory_region()'s entire function comment, as the API doesn't allocate memory or select a gfn, and the "mostly for framebuffers" comment hasn't been true for a very long time.

Cc: Tao Su <tao1.su@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
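The conversion from comments to assertions presumably boils down to lockdep_assert_held(); a sketch:

    /* Enforce, rather than merely document, the locking requirement. */
    lockdep_assert_held(&kvm->slots_lock);

Unlike a comment, the assertion fires at runtime (under CONFIG_LOCKDEP) if a caller ever reaches the function without the lock.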
-
Sean Christopherson authored
Open code kvm_set_memory_region() into its sole caller in preparation for adding a dedicated API for setting internal memslots. Opportunistically use the fancy new guard(mutex) to avoid a local 'r' variable.

Cc: Tao Su <tao1.su@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Christoph Schlameuss <schlameuss@linux.ibm.com>
Link: https://lore.kernel.org/r/20250111002022.1230573-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
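guard(mutex) comes from the kernel's scope-based cleanup helpers in <linux/cleanup.h>; a hedged sketch of how the open-coded caller could look (not the verbatim patch):

    static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm,
                                              struct kvm_userspace_memory_region2 *mem)
    {
            /* Scoped lock: mutex_unlock() runs automatically at every
             * return, so no local 'r' is needed to funnel the result out.
             */
            guard(mutex)(&kvm->slots_lock);

            return __kvm_set_memory_region(kvm, mem);
    }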
-
- Dec 23, 2024
Isaku Yamahata authored
Add a new member to struct kvm_gfn_range to indicate which mapping (private vs. shared) to operate on: enum kvm_gfn_range_filter attr_filter. Update the core zapping operations to set it appropriately.

TDX utilizes two GPA aliases for the same memslots, one for private memory and one for shared memory. For private memory, KVM cannot always perform the same operations it does on memory for default VMs, such as zapping pages and having them be faulted back in, as this requires guest coordination. However, some operations, such as guest-driven conversion of memory between private and shared, should zap private memory.

Internally to the MMU, private and shared mappings are tracked on separate roots. Mapping and zapping operations will operate on the respective GFN alias for each root (private or shared), so zapping operations will by default zap both aliases. Add a field in struct kvm_gfn_range to allow callers to specify which aliases to target, so they can operate only on the aliases appropriate for their specific operation.

There was feedback that target aliases should be specified such that the default value (0) operates on both aliases. Several options were considered. Variations with separate bools, defined such that the default behavior was to process both aliases, either allowed nonsensical configurations or were confusing for the caller. A simple enum was also explored and came close, but was hard to process in the caller. Instead, use an enum with the default value (0) reserved as a disallowed value, and catch ranges that didn't have the target aliases specified by looking for that specific value.

Set the target alias appropriately for these MMU operations (see the sketch after this list):
- For KVM's mmu notifier callbacks, zap shared pages only, because private pages won't have a userspace mapping.
- For setting memory attributes, kvm_arch_pre_set_memory_attributes() chooses the aliases based on the attribute.
- For guest_memfd invalidations, zap private only.

Link: https://lore.kernel.org/kvm/ZivIF9vjKcuGie3s@google.com/
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240718211230.1492011-3-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
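A sketch of the new member, assuming bit-flag enumerators so that zero can be reserved as the disallowed "no filter specified" value; the enumerator names and surrounding fields are illustrative:

    enum kvm_gfn_range_filter {
            KVM_FILTER_SHARED  = BIT(0),
            KVM_FILTER_PRIVATE = BIT(1),
    };

    struct kvm_gfn_range {
            struct kvm_memory_slot *slot;
            gfn_t start;
            gfn_t end;
            /* Which GPA alias(es) to operate on; 0 is disallowed, so a
             * caller that forgets to set the filter is caught.
             */
            enum kvm_gfn_range_filter attr_filter;
            bool may_block;
    };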
-
Yan Zhao authored
Remove the RCU-protected attribute from slot->gmem.file. There is no need to use the RCU primitives rcu_assign_pointer()/synchronize_rcu() to update this pointer.

- slot->gmem.file is updated in 3 places: kvm_gmem_bind(), kvm_gmem_unbind(), and kvm_gmem_release(). All of them are protected by kvm->slots_lock.

- slot->gmem.file is read in 2 paths:
  (1) kvm_gmem_populate
        kvm_gmem_get_file
          __kvm_gmem_get_pfn
  (2) kvm_gmem_get_pfn
        kvm_gmem_get_file
          __kvm_gmem_get_pfn

Path (1), kvm_gmem_populate(), requires holding kvm->slots_lock, so slot->gmem.file is protected by kvm->slots_lock in this path.

Path (2), kvm_gmem_get_pfn(), does not require holding kvm->slots_lock. However, it's also not guarded by rcu_read_lock() and rcu_read_unlock(), so the synchronize_rcu() in kvm_gmem_unbind()/kvm_gmem_release() will not actually wait for the readers in kvm_gmem_get_pfn() due to the lack of an RCU read-side critical section. Path (2) is safe without RCU protection because:
a) kvm_gmem_bind() is called on a new memslot, before the memslot is visible to kvm_gmem_get_pfn().
b) kvm->srcu ensures that kvm_gmem_unbind() and freeing of a memslot occur after the memslot is no longer visible to kvm_gmem_get_pfn().
c) get_file_active() ensures that kvm_gmem_get_pfn() will not access a stale file if kvm_gmem_release() sets it to NULL: if kvm_gmem_release() occurs before kvm_gmem_get_pfn(), get_file_active() will return NULL; if get_file_active() does not return NULL, kvm_gmem_release() cannot occur until after kvm_gmem_get_pfn() releases the file reference.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Message-ID: <20241104084303.29909-1-yan.y.zhao@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
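Given point (c), the lockless reader presumably reduces to get_file_active(), which returns NULL once the file's refcount has already hit zero; a hedged sketch:

    static struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
    {
            /* No RCU needed: either a reference is obtained before
             * kvm_gmem_release() drops the last one, or NULL is seen.
             */
            return get_file_active(&slot->gmem.file);
    }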
-
- Dec 16, 2024
Sean Christopherson authored
Now that KVM takes vcpu->mutex inside kvm->lock when creating a vCPU, drop the hack to manually inform lockdep of the kvm->lock => vcpu->mutex ordering. This effectively reverts commit 42a90008 ("KVM: Ensure lockdep knows about kvm->lock vs. vcpu->mutex ordering rule").

Cc: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241009150455.1057573-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
-
Sean Christopherson authored
WARN once instead of triggering a BUG if xa_insert() fails because it encountered an existing entry. While KVM guarantees there should be no existing entry, there's no reason to BUG the kernel, as KVM needs to gracefully handle failure anyways.

Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241009150455.1057573-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
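A sketch of the softened failure handling; the error-path label is illustrative:

    r = xa_insert(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL_ACCOUNT);
    /* A pre-existing entry would be a KVM bug, but it's not worth
     * killing the kernel: warn once and unwind gracefully.
     */
    if (WARN_ON_ONCE(r))
            goto unlock_vcpu_destroy;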
-
Sean Christopherson authored
Now that KVM loads from vcpu_array if and only if the target index is valid with respect to online_vcpus, i.e. now that it is safe to erase a not-fully-onlined vCPU entry, revert to storing into vcpu_array before success is guaranteed. If xa_store() fails, which _should_ be impossible, then putting the vCPU's reference to 'struct kvm' results in a refcounting bug as the vCPU fd has been installed and owns the vCPU's reference.

This was found by inspection, but forcing the xa_store() to fail confirms the problem:

 | Unable to handle kernel paging request at virtual address ffff800080ecd960
 | Call trace:
 |  _raw_spin_lock_irq+0x2c/0x70
 |  kvm_irqfd_release+0x24/0xa0
 |  kvm_vm_release+0x1c/0x38
 |  __fput+0x88/0x2ec
 |  ____fput+0x10/0x1c
 |  task_work_run+0xb0/0xd4
 |  do_exit+0x210/0x854
 |  do_group_exit+0x70/0x98
 |  get_signal+0x6b0/0x73c
 |  do_signal+0xa4/0x11e8
 |  do_notify_resume+0x60/0x12c
 |  el0_svc+0x64/0x68
 |  el0t_64_sync_handler+0x84/0xfc
 |  el0t_64_sync+0x190/0x194
 | Code: b9000909 d503201f 2a1f03e1 52800028 (88e17c08)

Practically speaking, this is a non-issue as xa_store() can't fail, absent a nasty kernel bug. But the code is visually jarring and technically broken.

This reverts commit afb2acb2.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Michal Luczaj <mhal@rbox.co>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marc Zyngier <maz@kernel.org>
Reported-by: Will Deacon <will@kernel.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241009150455.1057573-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
-
Sean Christopherson authored
During vCPU creation, acquire vcpu->mutex prior to exposing the vCPU to userspace, and hold the mutex until online_vcpus is bumped, i.e. until the vCPU is fully online from KVM's perspective. To ensure asynchronous vCPU ioctls also wait for the vCPU to come online, explicitly check online_vcpus at the start of kvm_vcpu_ioctl(), and take the vCPU's mutex to wait if necessary (having to wait for any ioctl should be exceedingly rare, i.e. not worth optimizing).

Reported-by: Will Deacon <will@kernel.org>
Reported-by: Michal Luczaj <mhal@rbox.co>
Link: https://lore.kernel.org/all/20240730155646.1687-1-will@kernel.org
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241009150455.1057573-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
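A sketch of the creation-side ordering this describes; error handling and exact placement are assumptions:

    mutex_lock(&vcpu->mutex);       /* taken while holding kvm->lock */

    r = create_vcpu_fd(vcpu);       /* vCPU becomes visible to userspace */
    if (r < 0)
            goto unlock;

    atomic_inc(&kvm->online_vcpus); /* vCPU is now fully online */
    mutex_unlock(&vcpu->mutex);     /* blocked async ioctls may proceed */

Any ioctl that races with creation blocks on vcpu->mutex and, once it acquires the mutex, observes a fully-onlined vCPU.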
-
- Nov 14, 2024
Paolo Bonzini authored
kvm_vm_create_worker_thread() is meant to be used for kthreads that can consume significant amounts of CPU time on behalf of a VM or in response to how the VM behaves (for example how it accesses its memory). Therefore it wants to charge the CPU time consumed by that work to the VM's container.

However, because of these threads, cgroups which have kvm instances inside never complete freezing. This can be trivially reproduced:

    root@test ~# mkdir /sys/fs/cgroup/test
    root@test ~# echo $$ > /sys/fs/cgroup/test/cgroup.procs
    root@test ~# qemu-system-x86_64 -nographic -enable-kvm

and in another terminal:

    root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze
    root@test ~# cat /sys/fs/cgroup/test/cgroup.events
    populated 1
    frozen 0

The cgroup freezing happens in the signal delivery path, but kvm_nx_huge_page_recovery_worker, while joining non-root cgroups, never calls into the signal delivery path and thus never gets frozen. Because the cgroup freezer determines whether a given cgroup is frozen by comparing the number of frozen threads to the total number of threads in the cgroup, the cgroup never becomes frozen and users waiting for the state transition may hang indefinitely.

Since the worker kthread is tied to a user process, it's better if it behaves similarly to user tasks as much as possible, including being able to receive SIGSTOP and SIGCONT. In fact, vhost_task is all that kvm_vm_create_worker_thread() wanted to be and more: not only does it inherit the userspace process's cgroups, it has other niceties like being parented properly in the process tree. Use it instead of the homegrown alternative.

Incidentally, the new code is also better behaved when you flip recovery back and forth to disabled and back to enabled. If your recovery period is 1 minute, it will run the next recovery after 1 minute independent of how many times you flipped the parameter.

(Commit message based on emails from Tejun.)

Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Luca Boccassi <bluca@debian.org>
Acked-by: Tejun Heo <tj@kernel.org>
Tested-by: Luca Boccassi <bluca@debian.org>
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
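A hedged sketch of the replacement; vhost_task_create()'s signature has shifted across kernel versions, so treat the exact arguments (and the helper names) as assumptions:

    /* The vhost_task inherits the owning process's cgroups and is
     * parented properly in the process tree, unlike the old kthread.
     */
    kvm->arch.nx_huge_page_recovery_thread =
            vhost_task_create(kvm_nx_huge_page_recovery_worker,
                              kvm_nx_huge_page_recovery_worker_kill,
                              kvm, "kvm-nx-lpage-recovery");
    if (kvm->arch.nx_huge_page_recovery_thread)
            vhost_task_start(kvm->arch.nx_huge_page_recovery_thread);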
-
- Nov 03, 2024
Al Viro authored
In all of those, failure exits prior to fdget() are plain returns, and the only thing done after fdput() is (on failure exits) a kfree(), which can be done before fdput() just fine.

NOTE: in acrn_irqfd_assign() the 'fail:' failure exit is wrong for eventfd_ctx_fileget() failure (we only want fdput() there), and once we stop doing that, it doesn't need to check whether eventfd is NULL or ERR_PTR(...) there.

NOTE: in privcmd we move fdget() up before the allocation - more to the point, before the copy_from_user() attempt.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
All failure exits prior to fdget() leave the scope, and all matching fdput() calls are immediately followed by leaving the scope. [the xfs_ioc_commit_range() chunk is moved here as well]

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
Al Viro authored
fdget() is the first thing done in scope, and all matching fdput() calls are immediately followed by leaving the scope.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
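These conversions target the kernel's scope-based fd helpers (CLASS(fd, ...), fd_empty(), fd_file() from <linux/file.h>); a generic sketch of the resulting pattern, with a hypothetical function for illustration:

    int example_ioctl(int fd)
    {
            CLASS(fd, f)(fd);       /* fdput() runs at scope exit */

            if (fd_empty(f))
                    return -EBADF;

            /* Plain returns are now safe anywhere in the scope. */
            return do_something(fd_file(f));
    }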
-
- Oct 30, 2024
Sean Christopherson authored
Add a Kconfig to allow architectures to opt out of a TLB flush when a young page is aged, as invalidating TLB entries is not functionally required on most KVM-supported architectures. Stale TLB entries can result in false negatives and theoretically lead to suboptimal reclaim, but in practice all observations have been that the performance gained by skipping TLB flushes outweighs any performance lost by reclaiming hot pages. E.g. the primary MMUs for x86, RISC-V, s390, and PPC Book3S elide the TLB flush for ptep_clear_flush_young(), and arm64's MMU skips the trailing DSB that's required for ordering (presumably because there are optimizations related to eliding other TLB flushes when doing make-before-break).

Link: https://lore.kernel.org/r/20241011021051.1557902-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
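On the generic side, the opt-out presumably gates the flush with IS_ENABLED(); a sketch in which the Kconfig symbol name is an assumption:

    /* Skip the TLB flush on aging if the architecture opted out; a
     * stale entry only risks a false-negative "young" result.
     */
    if (!IS_ENABLED(CONFIG_KVM_ELIDE_TLB_FLUSH_IF_YOUNG) && flush)
            kvm_flush_remote_tlbs(kvm);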
-
Sean Christopherson authored
To avoid jitter on KVM_RUN due to synchronize_rcu(), use a rwlock instead of RCU to protect vcpu->pid, a.k.a. the pid of the task last used to run a vCPU. When userspace is doing M:N scheduling of tasks to vCPUs, e.g. to run SEV migration helper vCPUs during post-copy, the synchronize_rcu() needed to change the PID associated with the vCPU can stall for hundreds of milliseconds, which is problematic for latency-sensitive post-copy operations. In the directed yield path, do not acquire the lock if it's contended, i.e. if the associated PID is changing, as that means the vCPU's task is already running.

Reported-by: Steve Rutherford <srutherford@google.com>
Reviewed-by: Steve Rutherford <srutherford@google.com>
Acked-by: Oliver Upton <oliver.upton@linux.dev>
Link: https://lore.kernel.org/r/20240802200136.329973-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
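A sketch of the two sides, assuming the rwlock lives in the vCPU as pid_lock (field name is an assumption):

    /* KVM_RUN path: writers are rare, so the write lock is cheap. */
    write_lock(&vcpu->pid_lock);
    oldpid = vcpu->pid;
    vcpu->pid = get_task_pid(current, PIDTYPE_PID);
    write_unlock(&vcpu->pid_lock);
    put_pid(oldpid);

    /* Directed-yield path: if the lock is contended, the pid is
     * changing, i.e. the vCPU's task is already running - bail.
     */
    if (!read_trylock(&vcpu->pid_lock))
            return 0;
    task = get_pid_task(vcpu->pid, PIDTYPE_PID);
    read_unlock(&vcpu->pid_lock);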
-
Sean Christopherson authored
Do "return 0" instead of initializing and returning a local variable in kvm_vcpu_yield_to(), e.g. so that it's more obvious what the function returns if there is no task. No functional change intended. Acked-by:
Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240802200136.329973-2-seanjc@google.com Signed-off-by:
Sean Christopherson <seanjc@google.com>
-
Sean Christopherson authored
Rework kvm_vcpu_on_spin() to use a single for-loop instead of making "two" passes over all vCPUs. Given N=kvm->last_boosted_vcpu, the logic is to iterate from vCPU[N+1]..vCPU[N-1], i.e. using two loops is just a kludgy way of handling the wrap from the last vCPU to vCPU0 when a boostable vCPU isn't found in vCPU[N+1]..vCPU[MAX]. Open code the xa_load() instead of using kvm_get_vcpu() to avoid reading online_vcpus in every loop, as well as the accompanying smp_rmb(), i.e. make it a custom kvm_for_each_vcpu(), for all intents and purposes. Opportunistically clean up the comment explaining the logic.

Link: https://lore.kernel.org/r/20240802202121.341348-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
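The wrap-around can be expressed with modular arithmetic over online_vcpus; a sketch of the iteration shape (not the verbatim rework):

    int nr_vcpus = atomic_read(&kvm->online_vcpus);
    int start = READ_ONCE(kvm->last_boosted_vcpu);
    int i;

    for (i = 0; i < nr_vcpus; i++) {
            /* Visit vCPU[N+1]..vCPU[N-1], wrapping past the last vCPU. */
            int idx = (start + i + 1) % nr_vcpus;
            struct kvm_vcpu *vcpu = xa_load(&kvm->vcpu_array, idx);

            if (!vcpu)
                    continue;
            if (kvm_vcpu_yield_to(vcpu) > 0) {
                    WRITE_ONCE(kvm->last_boosted_vcpu, idx);
                    break;
            }
    }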
-
Christophe JAILLET authored
'struct kvm_device_ops' is not modified in this driver. Constifying this structure moves some data to a read-only section, and so increases overall security, especially when the structure holds function pointers.

On x86_64, with allmodconfig:

Before:
   text   data    bss    dec    hex filename
   2605    169     16   2790    ae6 virt/kvm/vfio.o

After:
   text   data    bss    dec    hex filename
   2685     89     16   2790    ae6 virt/kvm/vfio.o

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/e7361a1bb7defbb0f7056b884e83f8d75ac9fe21.1727517084.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
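The change itself is a single keyword; sketching the pattern (initializer fields beyond .name are omitted for brevity):

    /* const moves the ops table, function pointers included, into
     * .rodata, matching the before/after data-section sizes above.
     */
    static const struct kvm_device_ops kvm_vfio_ops = {
            .name = "kvm-vfio",
    };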
-
- Oct 25, 2024
Sean Christopherson authored
Now that KVM no longer relies on an ugly heuristic to find its struct page references, i.e. now that KVM can't get false positives on VM_MIXEDMAP pfns, remove KVM's hack to elevate the refcount for pfns that happen to have a valid struct page. In addition to removing a long-standing wart in KVM, this allows KVM to map non-refcounted struct page memory into the guest, e.g. for exposing GPU TTM buffers to KVM guests.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-86-seanjc@google.com>
-
Sean Christopherson authored
Remove all kvm_{release,set}_pfn_*() APIs now that all users are gone.

No functional change intended.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-85-seanjc@google.com>
-
Sean Christopherson authored
Now that the legacy gfn_to_pfn() APIs are gone, and all callers of hva_to_pfn() pass in a refcounted_page pointer, make it a required field to ensure all future usage in KVM plays nice.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-82-seanjc@google.com>
-
Sean Christopherson authored
Drop gfn_to_pfn() and all its variants now that all users are gone.

No functional change intended.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-80-seanjc@google.com>
-
Sean Christopherson authored
Rework gfn_to_page() to support read-only accesses so that it can be used by arm64 to get MTE tags out of guest memory. Opportunistically rewrite the comment to be even more stern about using gfn_to_page(), as there are very few scenarios where requiring a struct page is actually the right thing to do (though there are such scenarios). Add a FIXME to call out that KVM probably should be pinning pages, not just getting pages.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-77-seanjc@google.com>
-
Sean Christopherson authored
Convert gfn_to_page() to the new kvm_follow_pfn() internal API, which will eventually allow removing gfn_to_pfn() and kvm_pfn_to_refcounted_page().

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-76-seanjc@google.com>
-
Sean Christopherson authored
Provide the "struct page" associated with a guest_memfd pfn as an output from __kvm_gmem_get_pfn() so that KVM guest page fault handlers can directly put the page instead of having to rely on kvm_pfn_to_refcounted_page(). Tested-by:
Alex Bennée <alex.bennee@linaro.org> Signed-off-by:
Sean Christopherson <seanjc@google.com> Tested-by:
Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by:
Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-47-seanjc@google.com>
-
Sean Christopherson authored
Refactor guest_memfd usage of __kvm_gmem_get_pfn() to pass the index into the guest_memfd file instead of the gfn, i.e. resolve the index based on the slot+gfn in the caller instead of in __kvm_gmem_get_pfn(). This will allow kvm_gmem_get_pfn() to retrieve and return the specific "struct page", which requires the index into the folio, without redoing the index calculation multiple times (which isn't costly, just hard to follow). Opportunistically add a kvm_gmem_get_index() helper to make the copy+pasted code easier to understand.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-46-seanjc@google.com>
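The helper presumably folds the slot's base gfn and the gmem offset together; a hedged sketch:

    /* Translate a gfn into an index into the guest_memfd file. */
    static pgoff_t kvm_gmem_get_index(struct kvm_memory_slot *slot, gfn_t gfn)
    {
            return gfn - slot->base_gfn + slot->gmem.pgoff;
    }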
-
Sean Christopherson authored
Add a new dedicated API, kvm_faultin_pfn(), for servicing guest page faults, i.e. for getting pages/pfns that will be mapped into the guest via an mmu_notifier-protected KVM MMU. Keep struct kvm_follow_pfn buried in internal code, as having __kvm_faultin_pfn() take "out" params is actually cleaner for several architectures, e.g. it allows the caller to have its own "page fault" structure without having to marshal data to/from kvm_follow_pfn. Long term, common KVM would ideally provide a kvm_page_fault structure, a la x86's struct of the same name. But all architectures need to be converted to a common API before that can happen.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-44-seanjc@google.com>
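A sketch of the wrapper, assuming it marshals the "out" params into the private struct kvm_follow_pfn internally; the prototypes are assumptions, not copied from the patch:

    static inline kvm_pfn_t kvm_faultin_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
                                            bool write, bool *writable,
                                            struct page **refcounted_page)
    {
            /* Resolve the gfn via the vCPU's memslots; the refcounted
             * page, if any, is handed back for the caller to release.
             */
            return __kvm_faultin_pfn(kvm_vcpu_gfn_to_memslot(vcpu, gfn), gfn,
                                     write ? FOLL_WRITE : 0, writable,
                                     refcounted_page);
    }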
-
Sean Christopherson authored
Add an off-by-default module param to control whether or not KVM is allowed to map memory that isn't pinned, i.e. that KVM can't guarantee won't be freed while it is mapped into KVM and/or the guest. Don't remove the functionality entirely, as there are use cases where mapping unpinned memory is safe (as defined by the platform owner), e.g. when memory is hidden from the kernel and managed by userspace, in which case userspace is already fully trusted to not muck with guest memory mappings. But for more typical setups, mapping unpinned memory is wildly unsafe, and unnecessary. The APIs are used exclusively by x86's nested virtualization support, and there is no known (or sane) use case for mapping PFN-mapped memory into a KVM guest _and_ letting the guest use it for virtualization structures.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-36-seanjc@google.com>
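A sketch of such an off-by-default module param; the parameter name is an assumption:

    /* Opt-in for mapping memory whose lifetime KVM cannot guarantee;
     * off by default, read-only at runtime (0444).
     */
    static bool allow_unsafe_mappings;
    module_param(allow_unsafe_mappings, bool, 0444);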
-
Sean Christopherson authored
When creating a memory map for read, don't request a writable pfn from the primary MMU. While creating read-only mappings can be theoretically slower, as they don't play nice with fast GUP due to the need to break CoW before mapping the underlying PFN, practically speaking, creating a mapping isn't a super hot path, and getting a writable mapping for reading is weird and confusing.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-35-seanjc@google.com>
-
Sean Christopherson authored
Now that all kvm_vcpu_{,un}map() users pass "true" for @dirty, have them pass "true" as a @writable param to kvm_vcpu_map(), and thus create a read-only mapping when possible.

Note, creating read-only mappings can be theoretically slower, as they don't play nice with fast GUP due to the need to break CoW before mapping the underlying PFN. But practically speaking, creating a mapping isn't a super hot path, and getting a writable mapping for reading is weird and confusing.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-34-seanjc@google.com>
-
Sean Christopherson authored
Pin, as in FOLL_PIN, pages when mapping them for direct access by KVM. As per Documentation/core-api/pin_user_pages.rst, writing to a page that was gotten via FOLL_GET is explicitly disallowed.

  Correct (uses FOLL_PIN calls):
      pin_user_pages()
      write to the data within the pages
      unpin_user_pages()

  INCORRECT (uses FOLL_GET calls):
      get_user_pages()
      write to the data within the pages
      put_page()

Unfortunately, FOLL_PIN is a "private" flag, and so kvm_follow_pfn must use a one-off bool instead of being able to piggyback the "flags" field.

Link: https://lwn.net/Articles/930667
Link: https://lore.kernel.org/all/cover.1683044162.git.lstoakes@gmail.com
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-32-seanjc@google.com>
-
David Stevens authored
Migrate kvm_vcpu_map() to kvm_follow_pfn(), and have it track whether or not the map holds a refcounted struct page. Precisely tracking struct page references will eventually allow removing kvm_pfn_to_refcounted_page() and its various wrappers.

Signed-off-by: David Stevens <stevensd@chromium.org>
[sean: use a pointer instead of a boolean]
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-31-seanjc@google.com>
-
Sean Christopherson authored
Track refcounted struct page memory using kvm_follow_pfn.refcounted_page instead of relying on kvm_release_pfn_clean() to correctly detect that the pfn is associated with a struct page.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-30-seanjc@google.com>
-
Sean Christopherson authored
Hoist the kvm_{set,release}_page_{clean,dirty}() APIs further up in kvm_main.c so that they can be used by the kvm_follow_pfn family of APIs.

No functional change intended.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-29-seanjc@google.com>
-
Sean Christopherson authored
Add kvm_follow_pfn.refcounted_page as an output for the "to pfn" APIs to "return" the struct page that is associated with the returned pfn (if KVM acquired a reference to the page). This will eventually allow removing KVM's hacky kvm_pfn_to_refcounted_page() code, which is error prone and can't detect pfns that are valid, but aren't (currently) refcounted.

Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-28-seanjc@google.com>
-
Sean Christopherson authored
Use a single pointer instead of a single-entry array for the struct page pointer in hva_to_pfn_fast(). Using an array makes the code unnecessarily annoying to read and update.

No functional change intended.

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-27-seanjc@google.com>
-