Skip to content
  • Adrian Reber's avatar
    fork: extend clone3() to support setting a PID · 49cb2fc4
    Adrian Reber authored
    
    
    The main motivation to add set_tid to clone3() is CRIU.
    
    To restore a process with the same PID/TID CRIU currently uses
    /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
    ns_last_pid and then (quickly) does a clone(). This works most of the
    time, but it is racy. It is also slow as it requires multiple syscalls.
    
    Extending clone3() to support *set_tid makes it possible restore a
    process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
    race free (as long as the desired PID/TID is available).
    
    This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
    on clone3() with *set_tid as they are currently in place for ns_last_pid.
    
    The original version of this change was using a single value for
    set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
    decided to change set_tid to an array to enable setting the PID of a
    process in multiple PID namespaces at the same time. If a process is
    created in a PID namespace it is possible to influence the PID inside
    and outside of the PID namespace. Details also in the corresponding
    selftest.
    
    To create a process with the following PIDs:
    
          PID NS level         Requested PID
            0 (host)              31496
            1                        42
            2                         1
    
    For that example the two newly introduced parameters to struct
    clone_args (set_tid and set_tid_size) would need to be:
    
      set_tid[0] = 1;
      set_tid[1] = 42;
      set_tid[2] = 31496;
      set_tid_size = 3;
    
    If only the PIDs of the two innermost nested PID namespaces should be
    defined it would look like this:
    
      set_tid[0] = 1;
      set_tid[1] = 42;
      set_tid_size = 2;
    
    The PID of the newly created process would then be the next available
    free PID in the PID namespace level 0 (host) and 42 in the PID namespace
    at level 1 and the PID of the process in the innermost PID namespace
    would be 1.
    
    The set_tid array is used to specify the PID of a process starting
    from the innermost nested PID namespaces up to set_tid_size PID namespaces.
    
    set_tid_size cannot be larger then the current PID namespace level.
    
    Signed-off-by: default avatarAdrian Reber <areber@redhat.com>
    Reviewed-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
    Reviewed-by: default avatarOleg Nesterov <oleg@redhat.com>
    Reviewed-by: default avatarDmitry Safonov <0x7f454c46@gmail.com>
    Acked-by: default avatarAndrei Vagin <avagin@gmail.com>
    Link: https://lore.kernel.org/r/20191115123621.142252-1-areber@redhat.com
    
    
    Signed-off-by: default avatarChristian Brauner <christian.brauner@ubuntu.com>
    49cb2fc4