  1. Nov 14, 2024
      KVM: x86: switch hugepage recovery thread to vhost_task · d96c77bd
      Paolo Bonzini authored
      
      kvm_vm_create_worker_thread() is meant to be used for kthreads that
      can consume significant amounts of CPU time on behalf of a VM or in
      response to how the VM behaves (for example how it accesses its memory).
      Therefore it wants to charge the CPU time consumed by that work to
      the VM's container.
      
      However, because of these threads, cgroups that contain KVM instances
      never finish freezing.  This can be trivially reproduced:
      
        root@test ~# mkdir /sys/fs/cgroup/test
        root@test ~# echo $$ > /sys/fs/cgroup/test/cgroup.procs
        root@test ~# qemu-system-x86_64 -nographic -enable-kvm
      
      and in another terminal:
      
        root@test ~# echo 1 > /sys/fs/cgroup/test/cgroup.freeze
        root@test ~# cat /sys/fs/cgroup/test/cgroup.events
        populated 1
        frozen 0
      
      cgroup freezing happens in the signal delivery path, but
      kvm_nx_huge_page_recovery_worker, although it joins non-root cgroups,
      never calls into the signal delivery path and thus never gets frozen. Because
      the cgroup freezer determines whether a given cgroup is frozen by
      comparing the number of frozen threads to the total number of threads
      in the cgroup, the cgroup never becomes frozen and users waiting for
      the state transition may hang indefinitely.
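
As a rough illustration of the waiting problem described above, a caller can bound the wait instead of blocking forever. The sketch below polls a cgroup.events file for "frozen 1" and gives up after a timeout. The helper name wait_frozen and its arguments are hypothetical and are not part of the kernel or of any existing tool.

```shell
#!/bin/sh
# wait_frozen EVENTS_FILE TIMEOUT_SECS   (hypothetical helper, for illustration)
# Poll a cgroup.events file until it reports "frozen 1".
# Returns 0 once frozen, 1 if the timeout expires first.
wait_frozen() {
    events_file=$1
    timeout=$2
    elapsed=0
    while :; do
        # cgroup.events holds "key value" lines, e.g. "frozen 0".
        if grep -qx 'frozen 1' "$events_file"; then
            return 0
        fi
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}
```

With the reproduction above, a call such as `wait_frozen /sys/fs/cgroup/test/cgroup.events 10 || echo "freeze did not complete" >&2` would report the stuck freeze after about ten seconds rather than hanging.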
      
      Since the worker kthread is tied to a user process, it's better if
      it behaves like a user task as much as possible, including
      the ability to send it SIGSTOP and SIGCONT.  In fact, vhost_task is all
      that kvm_vm_create_worker_thread() wanted to be and more: not only does
      it inherit the userspace process's cgroups, it also has other niceties
      like being parented properly in the process tree.  Use it instead of
      the homegrown alternative.
      
      Incidentally, the new code is also better behaved when you toggle
      recovery between disabled and enabled.  If your recovery period
      is 1 minute, it will run the next recovery after 1 minute, regardless
      of how many times you flipped the parameter.
      
      (Commit message based on emails from Tejun).
      
      Reported-by: Tejun Heo <tj@kernel.org>
      Reported-by: Luca Boccassi <bluca@debian.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Tested-by: Luca Boccassi <bluca@debian.org>
      Cc: stable@vger.kernel.org
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>