1. 29 Mar, 2020 1 commit
    • mm: fork: fix kernel_stack memcg stats for various stack implementations · 8380ce47
      Roman Gushchin authored
      Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the
      space for task stacks can be allocated using __vmalloc_node_range(),
      alloc_pages_node() and kmem_cache_alloc_node().
      In the first and second cases the page->mem_cgroup pointer is set, but
      in the third it is not: memcg membership of a slab page should be
      determined using the memcg_from_slab_page() function, which looks at
      page->slab_cache->memcg_params.memcg.  In this case, using
      mod_memcg_page_state() (as in account_kernel_stack()) is incorrect: the
      page->mem_cgroup pointer is NULL even for pages charged to a non-root
      memory cgroup.
      It can lead to kernel_stack per-memcg counters permanently showing 0 on
      some architectures (depending on the configuration).
      In order to fix it, let's introduce a mod_memcg_obj_state() helper,
      which takes a pointer to a kernel object as its first argument, uses
      mem_cgroup_from_obj() to get an RCU-protected memcg pointer, and calls
      mod_memcg_state().  This handles all possible configurations
      (CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without
      spilling any memcg/kmem specifics into fork.c.
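      Based on the description above, the helper can be sketched roughly as
      follows (kernel-internal code, shown only for illustration; the names
      follow the changelog, but the exact upstream body may differ):

```c
/* Sketch: adjust a per-memcg vmstat counter for whichever memory cgroup
 * the object pointed to by p is charged to, regardless of whether the
 * backing page came from vmalloc, the page allocator, or the slab. */
void mod_memcg_obj_state(void *p, int idx, int val)
{
	struct mem_cgroup *memcg;

	rcu_read_lock();
	memcg = mem_cgroup_from_obj(p);
	if (memcg)
		mod_memcg_state(memcg, idx, val);
	rcu_read_unlock();
}
```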
      Note: This is a special version of the patch created for stable
      backports.  It contains code from the following two patches:
        - mm: memcg/slab: introduce mem_cgroup_from_obj()
        - mm: fork: fix kernel_stack memcg stats for various stack implementations
      [guro@fb.com: introduce mem_cgroup_from_obj()]
        Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com
      Fixes: 4d96ba35 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 28 Feb, 2020 1 commit
  3. 14 Jan, 2020 2 commits
    • mm/mmu_notifier: Rename struct mmu_notifier_mm to mmu_notifier_subscriptions · 984cfe4e
      Jason Gunthorpe authored
      The name mmu_notifier_mm implies that the thing is a mm_struct pointer,
      and is difficult to abbreviate. The struct is actually holding the
      interval tree and hlist containing the notifiers subscribed to a mm.
      Use 'subscriptions' as the variable name for this struct instead of the
      really terrible and misleading 'mmn_mm'.
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
    • ns: Introduce Time Namespace · 769071ac
      Andrei Vagin authored
      Time Namespace isolates clock values.
      The kernel provides access to several clocks: CLOCK_REALTIME,
      CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.
      CLOCK_REALTIME
            System-wide clock that measures real (i.e., wall-clock) time.
      CLOCK_MONOTONIC
            Clock that cannot be set and represents monotonic time since
            some unspecified starting point.
      CLOCK_BOOTTIME
            Identical to CLOCK_MONOTONIC, except it also includes any time
            that the system is suspended.
      For many users, the time namespace means the ability to change the date
      and time in a container (CLOCK_REALTIME). Providing per-namespace
      notions of CLOCK_REALTIME would be complex with a massive overhead, but
      has dubious value.
      But in the context of checkpoint/restore functionality, monotonic and
      boottime clocks become interesting. Both clocks are monotonic with
      unspecified starting points. These clocks are widely used to measure time
      slices and set timers. After restoring or migrating processes, it has to be
      guaranteed that they never go backward. In an ideal case, the behavior of
      these clocks should be the same as for a case when a whole system is
      suspended. All this means that it is required to set CLOCK_MONOTONIC and
      CLOCK_BOOTTIME clocks, which can be achieved by adding per-namespace
      offsets for clocks.
      A time namespace is similar to a pid namespace in the way it is
      created: the unshare(CLONE_NEWTIME) system call creates a new time
      namespace but doesn't move the calling process into it. All subsequent
      children of the process will be born in the new time namespace, or a
      process can use the setns() system call to join a namespace.
      This scheme allows setting clock offsets for a namespace, before any
      processes appear in it.
      All available clone flags have been used, so CLONE_NEWTIME uses the
      highest bit of CSIGNAL. This means it can be used only with the
      unshare() and clone3() system calls.
      [ tglx: Adjusted paragraph about clone3() to reality and massaged the
        	changelog a bit. ]
      Co-developed-by: Dmitry Safonov <dima@arista.com>
      Signed-off-by: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://criu.org/Time_namespace
      Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
      Link: https://lore.kernel.org/r/20191112012724.250792-4-dima@arista.com
  4. 07 Jan, 2020 1 commit
  5. 01 Dec, 2019 1 commit
  6. 23 Nov, 2019 1 commit
  7. 20 Nov, 2019 4 commits
  8. 15 Nov, 2019 1 commit
    • fork: extend clone3() to support setting a PID · 49cb2fc4
      Adrian Reber authored
      The main motivation to add set_tid to clone3() is CRIU.
      To restore a process with the same PID/TID CRIU currently uses
      /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
      ns_last_pid and then (quickly) does a clone(). This works most of the
      time, but it is racy. It is also slow as it requires multiple syscalls.
      Extending clone3() to support *set_tid makes it possible to restore a
      process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
      free of races (as long as the desired PID/TID is available).
      This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
      on clone3() with *set_tid as are currently in place for ns_last_pid.
      The original version of this change used a single value for set_tid.
      At the 2019 LPC, after set_tid was presented, it was decided to change
      set_tid to an array to enable setting the PID of a process in multiple
      PID namespaces at the same time. If a process is created in a PID
      namespace it is possible to influence the PID inside and outside of
      the PID namespace. Details are also in the corresponding selftest.
      To create a process with the following PIDs:
            PID NS level         Requested PID
              0 (host)              31496
              1                        42
              2                         1
      For that example the two newly introduced parameters to struct
      clone_args (set_tid and set_tid_size) would need to be:
        set_tid[0] = 1;
        set_tid[1] = 42;
        set_tid[2] = 31496;
        set_tid_size = 3;
      If only the PIDs of the two innermost nested PID namespaces should be
      defined it would look like this:
        set_tid[0] = 1;
        set_tid[1] = 42;
        set_tid_size = 2;
      The PID of the newly created process would then be the next available
      free PID in the PID namespace level 0 (host) and 42 in the PID namespace
      at level 1 and the PID of the process in the innermost PID namespace
      would be 1.
      The set_tid array is used to specify the PID of a process starting
      from the innermost nested PID namespace up to set_tid_size PID
      namespaces. set_tid_size cannot be larger than the current PID
      namespace level.
      Signed-off-by: Adrian Reber <areber@redhat.com>
      Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
      Acked-by: Andrei Vagin <avagin@gmail.com>
      Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  9. 13 Nov, 2019 1 commit
  10. 05 Nov, 2019 1 commit
    • clone3: validate stack arguments · fa729c4d
      Christian Brauner authored
      Validate the stack arguments and set up the stack depending on whether
      it grows down or up.
      Legacy clone() required userspace to know in which direction the stack
      grows and to pass the stack pointer appropriately. To make things more
      confusing, microblaze uses a variant of the clone() syscall selected by
      CONFIG_CLONE_BACKWARDS3 that takes an additional stack_size argument.
      IA64 has a separate clone2() syscall which also takes an additional
      stack_size argument. Finally, parisc has a stack that grows upwards.
      Userspace therefore has a lot of nasty code like the following:
       #define __STACK_SIZE (8 * 1024 * 1024)
       pid_t sys_clone(int (*fn)(void *), void *arg, int flags, int *pidfd)
       {
               pid_t ret;
               void *stack;

               stack = malloc(__STACK_SIZE);
               if (!stack)
                       return -ENOMEM;

       #ifdef __ia64__
               ret = __clone2(fn, stack, __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
       #elif defined(__parisc__) /* stack grows up */
               ret = clone(fn, stack, flags | SIGCHLD, arg, pidfd);
       #else
               ret = clone(fn, stack + __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
       #endif
               return ret;
       }
      or even crazier variants such as [3].
      With clone3() we have the ability to validate the stack. We can check that
      when stack_size is passed, the stack pointer is valid and the other way
      around. We can also check that the memory area userspace gave us is fine to
      use via access_ok(). Furthermore, we probably should not require
      userspace to know in which direction the stack is growing. It is easy
      for us to do this in the kernel and I couldn't find the original
      reasoning behind exposing this detail to userspace.
      /* Intentional user visible API change */
      clone3() was released with 5.3. Currently, it is not documented and very
      unclear to userspace how the stack and stack_size argument have to be
      passed. After talking to glibc folks we concluded that trying to change
      clone3() to setup the stack instead of requiring userspace to do this is
      the right course of action.
      Note, that this is an explicit change in user visible behavior we introduce
      with this patch. If it breaks someone's use-case we will revert! (And then
      e.g. place the new behavior under an appropriate flag.)
      Breaking someone's use-case is very unlikely though. First, neither glibc
      nor musl currently expose a wrapper for clone3(). Second, there is no real
      motivation for anyone to use clone3() directly since it does not provide
      features that legacy clone doesn't. New features for clone3() will first
      happen in v5.5 which is why v5.4 is still a good time to try and make that
      change now and backport it to v5.3. Searches on [4] did not reveal any
      packages calling clone3().
      [1]: https://lore.kernel.org/r/CAG48ez3q=BeNcuVTKBN79kJui4vC6nw0Bfq6xc-i0neheT17TA@mail.gmail.com
      [2]: https://lore.kernel.org/r/20191028172143.4vnnjpdljfnexaq5@wittgenstein
      [3]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/raw-clone.h#L31
      [4]: https://codesearch.debian.net
      Fixes: 7f192e3c ("fork: add clone3")
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-api@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: <stable@vger.kernel.org> # 5.3
      Cc: GNU C Library <libc-alpha@sourceware.org>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Aleksa Sarai <cyphar@cyphar.com>
      Link: https://lore.kernel.org/r/20191031113608.20713-1-christian.brauner@ubuntu.com
  11. 21 Oct, 2019 1 commit
  12. 17 Oct, 2019 1 commit
    • pidfd: check pid has attached task in fdinfo · 3d6d8da4
      Christian Brauner authored
      Currently, when a task is dead we still print the pid it used to use in
      the fdinfo files of its pidfds. This doesn't make much sense since the
      pid may have already been reused. So verify that the task is still alive
      by introducing the pid_has_task() helper which will be used by other
      callers in follow-up patches.
      If the task is not alive anymore, we will print -1. This allows us to
      differentiate between a task not being present in a given pid namespace
      - in which case we already print 0 - and a task having been reaped.
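      A sketch of what the pid_has_task() check described above amounts to
      (kernel-internal code, for illustration): a pid still has an attached
      task of a given type iff the corresponding tasks list is non-empty.

```c
static inline bool pid_has_task(struct pid *pid, enum pid_type type)
{
	/* Empty list: every task attached with this pid type is gone. */
	return !hlist_empty(&pid->tasks[type]);
}
```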
      Note that this uses PIDTYPE_PID for the check. Technically, we could've
      checked PIDTYPE_TGID since pidfds currently only refer to thread-group
      leaders, but if they ever stop doing so, this check becomes problematic
      without it being immediately obvious to non-experts imho. If a thread
      is created via clone(CLONE_THREAD) then struct pid has a single
      non-empty list pid->tasks[PIDTYPE_PID], and this pid can't be used as
      a PIDTYPE_TGID, meaning pid->tasks[PIDTYPE_TGID] will return NULL even
      though the thread-group leader might still be very much alive. So
      checking PIDTYPE_PID is fine and is easier to maintain should we ever
      allow pidfds to refer to threads.
      Cc: Jann Horn <jannh@google.com>
      Cc: Christian Kellner <christian@kellner.me>
      Cc: linux-api@vger.kernel.org
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Link: https://lore.kernel.org/r/20191017101832.5985-1-christian.brauner@ubuntu.com
  13. 15 Oct, 2019 1 commit
  14. 07 Oct, 2019 1 commit
    • kernel/sysctl.c: do not override max_threads provided by userspace · b0f53dbc
      Michal Hocko authored
      Partially revert 16db3d3f ("kernel/sysctl.c: threads-max observe
      limits") because the patch is causing a regression to any workload which
      needs to override the auto-tuning of the limit provided by kernel.
      set_max_threads is implementing a boot time guesstimate to provide a
      sensible limit of the concurrently running threads so that runaways will
      not deplete all the memory.  This is a good thing in general but there
      are workloads which might need to increase this limit for an application
      to run (reportedly WebSphere MQ is affected) and that is simply not
      possible after the mentioned change.  It is also very dubious to
      override an admin decision by an estimation that doesn't have any direct
      relation to correctness of the kernel operation.
      Fix this by dropping set_max_threads from sysctl_max_threads so any
      value is accepted as long as it fits into MAX_THREADS which is important
      to check because allowing more threads could break internal robust futex
      restriction.  While at it, do not use MIN_THREADS as the lower boundary
      because it is also only a heuristic for automatic estimation and admin
      might have a good reason to stop new threads to be created even when
      below this limit.
      This became more severe when we switched x86 from 4k to 8k kernel
      stacks.  Starting since 6538b8ea ("x86_64: expand kernel stack to
      16K") (3.16) we use THREAD_SIZE_ORDER = 2 and that halved the
      auto-tuned value.
      In the particular case
        kernel.threads-max = 515561
        kernel.threads-max = 200000
      Neither of the two values is really insane on a 32GB machine.
      I am not sure we want/need to tune the max_thread value further.  If
      anything the tuning should be removed altogether if proven not useful in
      general.  But we definitely need a way to override this auto-tuning.
      Link: http://lkml.kernel.org/r/20190922065801.GB18814@dhcp22.suse.cz
      Fixes: 16db3d3f ("kernel/sysctl.c: threads-max observe limits")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 03 Oct, 2019 1 commit
  16. 01 Oct, 2019 1 commit
  17. 26 Sep, 2019 1 commit
  18. 25 Sep, 2019 2 commits
    • tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue · 0ff7b2cf
      Eric W. Biederman authored
      In the ordinary case today the RCU grace period for a task_struct is
      triggered when another process waits for its zombie and causes the
      kernel to call release_task().  As the waiting task has to receive a
      signal and then act upon it before this happens, typically this will
      occur after the original task has been removed from the runqueue.
      Unfortunately, in some cases, such as self-reaping tasks, it can be
      shown that release_task() will be called, starting the grace period
      for the task_struct, long before the task leaves the runqueue.
      Therefore use put_task_struct_rcu_user() in finish_task_switch() to
      guarantee that there is an RCU lifetime after the task leaves the
      runqueue.
      Besides the change in the start of the RCU grace period for the
      task_struct, this change may delay perf_event_delayed_put and
      trace_sched_process_free.  The function perf_event_delayed_put boils
      down to just a WARN_ON for cases that I assume never happen.  So I
      don't see any problem with delaying it.
      The function trace_sched_process_free is a trace point and thus
      visible to user space.  Occasionally userspace has the strangest
      dependencies, so this has a minuscule chance of causing a regression.
      This change only changes the timing of when the tracepoint is called.
      The change in timing arguably gives userspace a more accurate picture
      of what is going on.  So I don't expect there to be a regression.
      In the case where a task self-reaps we are pretty much guaranteed that
      the RCU grace period is delayed.  So we should get quite a bit of
      coverage of this worst case for the change in a normal threaded
      workload.  So I expect any issues to turn up quickly or not at all.
      I have lightly tested this change and everything appears to work as
      expected.
      Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
      Inspired-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/87r24jdpl5.fsf_-_@x220.int.ebiederm.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • tasks: Add a count of task RCU users · 3fbd7ee2
      Eric W. Biederman authored
      Add a count of the number of RCU users (currently 1) of the task
      struct so that we can later add the scheduler case and get rid of the
      very subtle task_rcu_dereference(), and just use rcu_dereference().
      As suggested by Oleg have the count overlap rcu_head so that no
      additional space in task_struct is required.
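      The overlap Oleg suggested can be sketched as an anonymous union in
      task_struct (kernel-internal, for illustration): rcu_users is only
      needed while the task can still gain users, and rcu only after the
      last user has dropped its reference, so the two never live at the
      same time.

```c
union {
	refcount_t rcu_users;  /* number of RCU users, starts at 1 */
	struct rcu_head rcu;   /* queues the delayed task_struct free */
};
```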
      Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
      Inspired-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/87woebdplt.fsf_-_@x220.int.ebiederm.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  19. 12 Sep, 2019 1 commit
    • fork: block invalid exit signals with clone3() · a0eb9abd
      Eugene Syromiatnikov authored
      Previously, the higher 32 bits of the exit_signal field were lost when
      copied to the kernel args structure (which uses int as the type for
      the respective field). Moreover, as Oleg has noted, exit_signal is
      used unchecked, so it has to be checked for sanity before use; for the
      legacy syscalls, applying the CSIGNAL mask guarantees that it is at
      least non-negative; however, no such check is done in the clone3()
      code path, and that can break at least thread_group_leader.
      This commit adds a check to copy_clone_args_from_user() to verify that
      the exit signal is limited by CSIGNAL as with legacy clone() and that
      the signal is valid. With this we don't get the legacy clone behavior
      where an invalid signal could be handed down and would only be
      detected and ignored in do_notify_parent(). Users of clone3() will now
      get a proper error when they pass an invalid exit signal. Note that
      this is not a user-visible behavior change since no kernel with
      clone3() has been released yet.
      The following program will cause a splat on a non-fixed clone3() version
      and will fail correctly on a fixed version:
       #define _GNU_SOURCE
       #include <errno.h>
       #include <linux/sched.h>
       #include <linux/types.h>
       #include <sched.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/syscall.h>
       #include <sys/wait.h>
       #include <unistd.h>

       int main(int argc, char *argv[])
       {
              pid_t pid = -1;
              struct clone_args args = {0};
              args.exit_signal = -1;

              pid = syscall(__NR_clone3, &args, sizeof(struct clone_args));
              if (pid < 0) {
                      fprintf(stderr, "clone3: %s\n", strerror(errno));
                      exit(EXIT_FAILURE);
              }
              if (pid == 0)
                      exit(EXIT_SUCCESS);

              wait(NULL);
              exit(EXIT_SUCCESS);
       }
      Fixes: 7f192e3c ("fork: add clone3")
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Suggested-by: Dmitry V. Levin <ldv@altlinux.org>
      Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com>
      Link: https://lore.kernel.org/r/4b38fa4ce420b119a4c6345f42fe3cec2de9b0b5.1568223594.git.esyr@redhat.com
      [christian.brauner@ubuntu.com: simplify check and rework commit message]
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  20. 28 Aug, 2019 3 commits
  21. 20 Aug, 2019 1 commit
  22. 12 Aug, 2019 1 commit
  23. 01 Aug, 2019 1 commit
    • pidfd: add P_PIDFD to waitid() · 3695eae5
      Christian Brauner authored
      This adds the P_PIDFD type to waitid().
      One of the last remaining bits for the pidfd api is to make it
      possible to wait on pidfds. With P_PIDFD added to waitid() the parts
      of userspace that want to use the pidfd api to exclusively manage
      processes can do so.
      One of the things this will unblock in the future is the ability to make
      it possible to retrieve the exit status via waitid(P_PIDFD) for
      non-parent processes if handed a _suitable_ pidfd that has this feature
      set. This is similar to what you can do on FreeBSD with kqueue(). It
      might even end up being possible to wait on a process as a non-parent if
      an appropriate property is enabled on the pidfd.
      With P_PIDFD no scoping of the process identified by the pidfd is
      possible, i.e. it explicitly blocks things such as wait4(-1), wait4(0),
      waitid(P_ALL), waitid(P_PGID) etc. It only allows for semantics
      equivalent to wait4(pid), waitid(P_PID). Users that need scoping should
      rely on pid-based wait*() syscalls for now.
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lore.kernel.org/r/20190727222229.6516-2-christian@brauner.io
  24. 25 Jul, 2019 1 commit
    • sched/fair: Don't free p->numa_faults with concurrent readers · 16d51a59
      Jann Horn authored
      When going through execve(), zero out the NUMA fault statistics instead of
      freeing them.
      During execve, the task is reachable through procfs and the scheduler. A
      concurrent /proc/*/sched reader can read data from a freed ->numa_faults
      allocation (confirmed by KASAN) and write it back to userspace.
      I believe that it would also be possible for a use-after-free read to occur
      through a race between a NUMA fault and execve(): task_numa_fault() can
      lead to task_numa_compare(), which invokes task_weight() on the currently
      running task of a different CPU.
      Another way to fix this would be to make ->numa_faults RCU-managed or
      add extra locking, but it seems easier to wipe the NUMA fault
      statistics on execve().
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Fixes: 82727018 ("sched/numa: Call task_numa_free() from do_execve()")
      Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  25. 14 Jul, 2019 1 commit
  26. 01 Jul, 2019 1 commit
    • fork: return proper negative error code · 28dd29c0
      Christian Brauner authored
      Make sure to return a proper negative error code from copy_process()
      when anon_inode_getfile() fails with CLONE_PIDFD.
      Otherwise _do_fork() will not detect an error and get_task_pid() will
      operate on a nonsensical pointer:
      R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c
      R13: 00007ffc15fbb0ff R14: 00007ff07e47e9c0 R15: 0000000000000000
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 1 PID: 7990 Comm: syz-executor290 Not tainted 5.2.0-rc6+ #9
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
      RIP: 0010:get_task_pid+0xe1/0x210 kernel/pid.c:372
      Code: 89 ff e8 62 27 5f 00 49 8b 07 44 89 f1 4c 8d bc c8 90 01 00 00 eb 0c
      e8 0d fe 25 00 49 81 c7 38 05 00 00 4c 89 f8 48 c1 e8 03 <80> 3c 18 00 74
      08 4c 89 ff e8 31 27 5f 00 4d 8b 37 e8 f9 47 12 00
      RSP: 0018:ffff88808a4a7d78 EFLAGS: 00010203
      RAX: 00000000000000a7 RBX: dffffc0000000000 RCX: ffff888088180600
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff88808a4a7d90 R08: ffffffff814fb3a8 R09: ffffed1015d66bf8
      R10: ffffed1015d66bf8 R11: 1ffff11015d66bf7 R12: 0000000000041ffc
      R13: 1ffff11011494fbc R14: 0000000000000000 R15: 000000000000053d
      FS:  00007ff07e47e700(0000) GS:ffff8880aeb00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000004b5100 CR3: 0000000094df2000 CR4: 00000000001406e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
        _do_fork+0x1b9/0x5f0 kernel/fork.c:2360
        __do_sys_clone kernel/fork.c:2454 [inline]
        __se_sys_clone kernel/fork.c:2448 [inline]
        __x64_sys_clone+0xc1/0xd0 kernel/fork.c:2448
        do_syscall_64+0xfe/0x140 arch/x86/entry/common.c:301
      Link: https://lore.kernel.org/lkml/000000000000e0dc0d058c9e7142@google.com
      Reported-and-tested-by: syzbot+002e636502bc4b64eb5c@syzkaller.appspotmail.com
      Fixes: 6fd2fe49 ("copy_process(): don't use ksys_close() on cleanups")
      Cc: Jann Horn <jannh@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Christian Brauner <christian@brauner.io>
  27. 29 Jun, 2019 1 commit
  28. 28 Jun, 2019 1 commit
    • pidfd: add polling support · b53b0b9d
      Joel Fernandes (Google) authored
      This patch adds polling support to pidfd.
      Android low memory killer (LMK) needs to know when a process dies once
      it is sent the kill signal. It does so by checking for the existence
      of /proc/pid, which is both racy and slow. For example, the wrong PID
      may be checked for existence if a PID is reused between when LMK sends
      a kill signal and when it checks for the existence of the PID.
      Using the polling support, LMK will be able to get notified when a
      process exits in a race-free and fast way, and it allows the LMK to do
      other things (such as polling on other fds) while awaiting the process
      being killed to die.
      For notification to polling processes, we follow the same existing
      mechanism in the kernel used when the parent of the task group is to be
      notified of a child's death (do_notify_parent). This is precisely the
      point at which tasks waiting on a poll of the pidfd are awakened in this patch.
      We have decided to include the waitqueue in struct pid for the following
      reasons:
      1. The wait queue has to survive for the lifetime of the poll. Including
         it in task_struct would not be an option in this case because the task can
         be reaped and destroyed before the poll returns.
      2. Including the waitqueue in struct pid means that during
         de_thread(), the new thread group leader automatically gets the new
         waitqueue/pid even though its task_struct is different.
      Appropriate test cases are added in the second patch to provide coverage of
      all the cases the patch is handling.
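      The usage pattern this enables can be sketched from userspace as follows.
      This is a minimal illustration, not code from the patch: it assumes a
      Linux 5.2+ kernel where CLONE_PIDFD stores a pidfd through the
      parent_tidptr slot of the raw clone syscall (x86-64 argument order:
      flags, stack, parent_tidptr, child_tidptr, tls), and it polls that fd
      until the child dies:

      ```c
      #define _GNU_SOURCE
      #include <assert.h>
      #include <poll.h>
      #include <sched.h>
      #include <signal.h>
      #include <stdio.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      #ifndef CLONE_PIDFD
      #define CLONE_PIDFD 0x00001000   /* from linux/sched.h, for older libcs */
      #endif

      int main(void)
      {
              int pidfd = -1;
              /* raw clone: the kernel writes the pidfd into &pidfd */
              pid_t pid = syscall(SYS_clone, CLONE_PIDFD | SIGCHLD, NULL,
                                  &pidfd, NULL, 0);
              if (pid < 0) {
                      perror("clone");        /* kernel without CLONE_PIDFD */
                      return 1;
              }
              if (pid == 0)
                      _exit(0);               /* child exits immediately */

              /* poll() wakes up when do_notify_parent() fires on child exit */
              struct pollfd fds = { .fd = pidfd, .events = POLLIN };
              int ready = poll(&fds, 1, 5000);
              assert(ready == 1 && (fds.revents & POLLIN));
              printf("pidfd became readable: child %d exited\n", (int)pid);
              close(pidfd);
              return 0;
      }
      ```

      This replaces the racy /proc/pid existence check: the pidfd refers to the
      struct pid, so PID reuse cannot redirect the wait to the wrong process.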
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Jonathan Kowalski <bl0pbl33p@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: kernel-team@android.com
      Reviewed-by: default avatarOleg Nesterov <oleg@redhat.com>
      Co-developed-by: default avatarDaniel Colascione <dancol@google.com>
      Signed-off-by: default avatarDaniel Colascione <dancol@google.com>
      Signed-off-by: default avatarJoel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: default avatarChristian Brauner <christian@brauner.io>
  29. 27 Jun, 2019 1 commit
  30. 24 Jun, 2019 1 commit
    • Dmitry V. Levin's avatar
      fork: don't check parent_tidptr with CLONE_PIDFD · 9014143b
      Dmitry V. Levin authored
      Give userspace a cheap and reliable way to tell whether CLONE_PIDFD is
      supported by the kernel or not. The easiest way is to pass an invalid
      file descriptor value in parent_tidptr, perform the syscall and verify
      that parent_tidptr has been changed to a valid file descriptor value.
      CLONE_PIDFD uses parent_tidptr to return pidfds. CLONE_PARENT_SETTID
      will use parent_tidptr to return the tid of the parent. The two flags
      cannot be used together. Old kernels that only support
      CLONE_PARENT_SETTID will not verify the value pointed to by
      parent_tidptr. This behavior is unchanged even with the introduction of
      CLONE_PIDFD.
      However, if CLONE_PIDFD is specified the kernel will currently check the
      value pointed to by parent_tidptr before placing the pidfd in the memory
      pointed to. EINVAL will be returned if the value in parent_tidptr is
      not 0.
      If CLONE_PIDFD is supported and fd 0 is closed, then the returned pidfd
      can and likely will be 0 and parent_tidptr will be unchanged. This means
      userspace must either check CLONE_PIDFD support beforehand or check that
      fd 0 is not closed when invoking CLONE_PIDFD.
      The check for pidfd == 0 was introduced during the v5.2 merge window by
      commit b3e58382 ("clone: add CLONE_PIDFD") to ensure that
      CLONE_PIDFD could be potentially extended by passing in flags through
      the return argument.
      However, that extension would look horrible, and with the upcoming
      introduction of the clone3 syscall in v5.3 there is no need to extend
      legacy clone syscall this way. (Even if it would need to be extended,
      CLONE_DETACHED can be reused with CLONE_PIDFD.)
      So remove the pidfd == 0 check. Userspace that needs to be portable to
      kernels without CLONE_PIDFD support can then be advised to initialize
      pidfd to -1 and check the pidfd value returned by CLONE_PIDFD.
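      The detection recipe the commit recommends can be sketched like this.
      It is an illustration under the assumptions stated in the message: with
      this fix applied, initializing the pidfd slot to -1 is accepted and the
      kernel overwrites it with a valid fd, so an unchanged -1 (or a failed
      clone) signals lack of support:

      ```c
      #define _GNU_SOURCE
      #include <sched.h>
      #include <signal.h>
      #include <stdio.h>
      #include <sys/syscall.h>
      #include <sys/wait.h>
      #include <unistd.h>

      #ifndef CLONE_PIDFD
      #define CLONE_PIDFD 0x00001000
      #endif

      int main(void)
      {
              int pidfd = -1;    /* sentinel: -1 is never a valid fd */
              pid_t pid = syscall(SYS_clone, CLONE_PIDFD | SIGCHLD, NULL,
                                  &pidfd, NULL, 0);
              if (pid == 0)
                      _exit(0);
              if (pid > 0 && pidfd >= 0)
                      printf("CLONE_PIDFD supported (pidfd=%d)\n", pidfd);
              else
                      printf("CLONE_PIDFD not supported\n");
              if (pid > 0)
                      waitpid(pid, NULL, 0);
              return 0;
      }
      ```

      Note that on a v5.2 kernel without this fix, the -1 sentinel itself makes
      clone() fail with EINVAL, which still distinguishes the cases cheaply.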
      Fixes: b3e58382
       ("clone: add CLONE_PIDFD")
      Signed-off-by: default avatarDmitry V. Levin <ldv@altlinux.org>
      Signed-off-by: default avatarChristian Brauner <christian@brauner.io>
  31. 22 Jun, 2019 1 commit
  32. 20 Jun, 2019 1 commit
    • Christian Brauner's avatar
      arch: handle arches who do not yet define clone3 · d68dbb0c
      Christian Brauner authored
      This cleanly handles arches that do not yet define clone3.
      clone3() was initially placed under __ARCH_WANT_SYS_CLONE under the
      assumption that this would cleanly handle all architectures. It does
      not.
      Architectures such as nios2 or h8300 simply take the asm-generic syscall
      definitions and generate their syscall table from it. Since they don't
      define __ARCH_WANT_SYS_CLONE the build would fail complaining about
      sys_clone3 missing. The reason this doesn't happen for legacy clone is
      that nios2 and h8300 provide assembly stubs for sys_clone. This seems to
      be done for architectural reasons.
      The build failures for nios2 and h8300 were luckily caught in -next.
      The solution is to define __ARCH_WANT_SYS_CLONE3 that architectures can
      add. Additionally, we need a cond_syscall(clone3) for architectures such
      as nios2 or h8300 that generate their syscall table in the way I
      explained above.
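      The resulting wiring can be sketched as follows. This is an illustrative
      arrangement of the identifiers named in the commit message
      (__ARCH_WANT_SYS_CLONE3, cond_syscall); the file paths in the comments
      are assumptions about where such definitions conventionally live, not
      quotes from the patch:

      ```c
      /* arch/<arch>/include/asm/unistd.h: each architecture opts in explicitly */
      #define __ARCH_WANT_SYS_CLONE3

      /* asm-generic syscall list: the entry is wired up only when requested */
      #ifdef __ARCH_WANT_SYS_CLONE3
      __SYSCALL(__NR_clone3, sys_clone3)
      #endif

      /* for arches generating their table from the asm-generic list
       * (e.g. nios2, h8300): provide a stub that returns -ENOSYS */
      cond_syscall(clone3);
      ```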
      Fixes: 8f3220a8
       ("arch: wire-up clone3() syscall")
      Signed-off-by: default avatarChristian Brauner <christian@brauner.io>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Adrian Reber <adrian@lisas.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: x86@kernel.org
  33. 10 Jun, 2019 1 commit