1. 09 Jul, 2019 4 commits
    • Jens Axboe's avatar
      io_uring: add support for recvmsg() · aa1fa28f
      Jens Axboe authored
      This is done through IORING_OP_RECVMSG. This opcode uses the same
      sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the
      msghdr struct in the sqe->addr field as well.
      We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't
      block, and punt to async execution if it would have.
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Jens Axboe's avatar
      io_uring: add support for sendmsg() · 0fa03c62
      Jens Axboe authored
      This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags
      for the flags argument, and the msghdr struct is passed in the
      sqe->addr field.
      We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't
      block, and punt to async execution if it would have.
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Al Viro's avatar
      coallocate socket_wq with socket itself · 333f7909
      Al Viro authored
      socket->wq is assign-once, set when we are initializing both
      struct socket it's in and struct socket_wq it points to.  As the
      matter of fact, the only reason for separate allocation was the
      ability to RCU-delay freeing of socket_wq.  RCU-delaying the
      freeing of socket itself gets rid of that need, so we can just
      fold struct socket_wq into the end of struct socket and simplify
      the life both for sock_alloc_inode() (one allocation instead of
      two) and for tun/tap oddballs, where we used to embed struct socket
      and struct socket_wq into the same structure (now - embedding just
      the struct socket).
      Note that reference to struct socket_wq in struct sock does remain
      a reference - that's unchanged.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Al Viro's avatar
      sockfs: switch to ->free_inode() · 6d7855c5
      Al Viro authored
      we do have an RCU-delayed part there already (freeing the wq),
      so it's not like the pipe situation; moreover, it might be
      worth considering coallocating wq with the rest of struct sock_alloc.
      ->sk_wq in struct sock would remain a pointer as it is, but
      the object it normally points to would be coallocated with
      struct socket...
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  2. 03 Jul, 2019 1 commit
  3. 27 Jun, 2019 1 commit
    • Stanislav Fomichev's avatar
      bpf: implement getsockopt and setsockopt hooks · 0d01da6a
      Stanislav Fomichev authored
      Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
      BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
      BPF_CGROUP_SETSOCKOPT can modify user setsockopt arguments before
      passing them down to the kernel or bypass kernel completely.
      BPF_CGROUP_GETSOCKOPT can can inspect/modify getsockopt arguments that
      kernel returns.
      Both hooks reuse existing PTR_TO_PACKET{,_END} infrastructure.
      The buffer memory is pre-allocated (because I don't think there is
      a precedent for working with __user memory from bpf). This might be
      slow to do for each {s,g}etsockopt call, that's why I've added
      __cgroup_bpf_prog_array_is_empty that exits early if there is nothing
      attached to a cgroup. Note, however, that there is a race between
      __cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where cgroup
      program layout might have changed; this should not be a problem
      because in general there is a race between multiple calls to
      {s,g}etsocktop and user adding/removing bpf progs from a cgroup.
      The return code of the BPF program is handled as follows:
      * 0: EPERM
      * 1: success, continue with next BPF program in the cgroup chain
      * allow overwriting setsockopt arguments (Alexei Starovoitov):
        * use set_fs (same as kernel_setsockopt)
        * buffer is always kzalloc'd (no small on-stack buffer)
      * use s32 for optlen (Andrii Nakryiko)
      * return only 0 or 1 (Alexei Starovoitov)
      * always run all progs (Alexei Starovoitov)
      * use optval=0 as kernel bypass in setsockopt (Alexei Starovoitov)
        (decided to use optval=-1 instead, optval=0 might be a valid input)
      * call getsockopt hook after kernel handlers (Alexei Starovoitov)
      * rework cgroup chaining; stop as soon as bpf program returns
        0 or 2; see patch with the documentation for the details
      * drop Andrii's and Martin's Acked-by (not sure they are comfortable
        with the new state of things)
      * skip copy_to_user() and put_user() when ret == 0 (Martin Lau)
      * don't export bpf_sk_fullsock helper (Martin Lau)
      * size != sizeof(__u64) for uapi pointers (Martin Lau)
      * offsetof instead of bpf_ctx_range when checking ctx access (Martin Lau)
      * typos in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY comments (Andrii Nakryiko)
      * reverse christmas tree in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY (Andrii
      * use __bpf_md_ptr instead of __u32 for optval{,_end} (Martin Lau)
      * use BPF_FIELD_SIZEOF() for consistency (Martin Lau)
      * new CG_SOCKOPT_ACCESS macro to wrap repeated parts
      * moved bpf_sockopt_kern fields around to remove a hole (Martin Lau)
      * aligned bpf_sockopt_kern->buf to 8 bytes (Martin Lau)
      * bpf_prog_array_is_empty instead of bpf_prog_array_length (Martin Lau)
      * added [0,2] return code check to verifier (Martin Lau)
      * dropped unused buf[64] from the stack (Martin Lau)
      * use PTR_TO_SOCKET for bpf_sockopt->sk (Martin Lau)
      * dropped bpf_target_off from ctx rewrites (Martin Lau)
      * use return code for kernel bypass (Martin Lau & Andrii Nakryiko)
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: default avatarStanislav Fomichev <sdf@google.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
  4. 05 Jun, 2019 1 commit
  5. 31 May, 2019 1 commit
  6. 30 May, 2019 1 commit
  7. 25 May, 2019 2 commits
    • David Howells's avatar
      vfs: Convert sockfs to use the new mount API · fba9be49
      David Howells authored
      Convert the sockfs filesystem to the new internal mount API as the old
      one will be obsoleted and removed.  This allows greater flexibility in
      communication of mount parameters between userspace, the VFS and the
      See Documentation/filesystems/mount_api.txt for more information.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: netdev@vger.kernel.org
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
    • Al Viro's avatar
      mount_pseudo(): drop 'name' argument, switch to d_make_root() · 1f58bb18
      Al Viro authored
      Once upon a time we used to set ->d_name of e.g. pipefs root
      so that d_path() on pipes would work.  These days it's
      completely pointless - dentries of pipes are not even connected
      to pipefs root.  However, mount_pseudo() had set the root
      dentry name (passed as the second argument) and callers
      kept inventing names to pass to it.  Including those that
      didn't *have* any non-root dentries to start with...
      All of that had been pointless for about 8 years now; it's
      time to get rid of that cargo-culting...
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  8. 19 May, 2019 1 commit
    • Randy Dunlap's avatar
      net: fix kernel-doc warnings for socket.c · 85806af0
      Randy Dunlap authored
      Fix kernel-doc warnings by moving the kernel-doc notation to be
      immediately above the functions that it describes.
      Fixes these warnings for sock_sendmsg() and sock_recvmsg():
      ../net/socket.c:658: warning: Excess function parameter 'sock' description in 'INDIRECT_CALLABLE_DECLARE'
      ../net/socket.c:658: warning: Excess function parameter 'msg' description in 'INDIRECT_CALLABLE_DECLARE'
      ../net/socket.c:889: warning: Excess function parameter 'sock' description in 'INDIRECT_CALLABLE_DECLARE'
      ../net/socket.c:889: warning: Excess function parameter 'msg' description in 'INDIRECT_CALLABLE_DECLARE'
      ../net/socket.c:889: warning: Excess function parameter 'flags' description in 'INDIRECT_CALLABLE_DECLARE'
      Signed-off-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  9. 05 May, 2019 1 commit
  10. 26 Apr, 2019 1 commit
  11. 19 Apr, 2019 2 commits
    • Arnd Bergmann's avatar
      net: socket: implement 64-bit timestamps · 0768e170
      Arnd Bergmann authored
      The 'timeval' and 'timespec' data structures used for socket timestamps
      are going to be redefined in user space based on 64-bit time_t in future
      versions of the C library to deal with the y2038 overflow problem,
      which breaks the ABI definition.
      Unlike many modern ioctl commands, SIOCGSTAMP and SIOCGSTAMPNS do not
      use the _IOR() macro to encode the size of the transferred data, so it
      remains ambiguous whether the application uses the old or new layout.
      The best workaround I could find is rather ugly: we redefine the command
      code based on the size of the respective data structure with a ternary
      operator. This lets it get evaluated as late as possible, hopefully after
      that structure is visible to the caller. We cannot use an #ifdef here,
      because inux/sockios.h might have been included before any libc header
      that could determine the size of time_t.
      The ioctl implementation now interprets the new command codes as always
      referring to the 64-bit structure on all architectures, while the old
      architecture specific command code still refers to the old architecture
      specific layout. The new command number is only used when they are
      actually different.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Arnd Bergmann's avatar
      net: rework SIOCGSTAMP ioctl handling · c7cbdbf2
      Arnd Bergmann authored
      The SIOCGSTAMP/SIOCGSTAMPNS ioctl commands are implemented by many
      socket protocol handlers, and all of those end up calling the same
      sock_get_timestamp()/sock_get_timestampns() helper functions, which
      results in a lot of duplicate code.
      With the introduction of 64-bit time_t on 32-bit architectures, this
      gets worse, as we then need four different ioctl commands in each
      socket protocol implementation.
      To simplify that, let's add a new .gettstamp() operation in
      struct proto_ops, and move ioctl implementation into the common
      sock_ioctl()/compat_sock_ioctl_trans() functions that these all go
      We can reuse the sock_get_timestamp() implementation, but generalize
      it so it can deal with both native and compat mode, as well as
      timeval and timespec structures.
      Acked-by: default avatarStefan Schmidt <stefan@datenfreihafen.org>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Acked-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Link: https://lore.kernel.org/lkml/CAK8P3a038aDQQotzua_QtKGhq8O9n+rdiz2=WDCp82ys8eUT+A@mail.gmail.com/Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  12. 15 Mar, 2019 1 commit
  13. 25 Feb, 2019 1 commit
    • Eric Biggers's avatar
      net: socket: set sock->sk to NULL after calling proto_ops::release() · ff7b11aa
      Eric Biggers authored
      Commit 9060cb71 ("net: crypto set sk to NULL when af_alg_release.")
      fixed a use-after-free in sockfs_setattr() when an AF_ALG socket is
      closed concurrently with fchownat().  However, it ignored that many
      other proto_ops::release() methods don't set sock->sk to NULL and
      therefore allow the same use-after-free:
          - base_sock_release
          - bnep_sock_release
          - cmtp_sock_release
          - data_sock_release
          - dn_release
          - hci_sock_release
          - hidp_sock_release
          - iucv_sock_release
          - l2cap_sock_release
          - llcp_sock_release
          - llc_ui_release
          - rawsock_release
          - rfcomm_sock_release
          - sco_sock_release
          - svc_release
          - vcc_release
          - x25_release
      Rather than fixing all these and relying on every socket type to get
      this right forever, just make __sock_release() set sock->sk to NULL
      itself after calling proto_ops::release().
      Reproducer that produces the KASAN splat when any of these socket types
      are configured into the kernel:
          #include <pthread.h>
          #include <stdlib.h>
          #include <sys/socket.h>
          #include <unistd.h>
          pthread_t t;
          volatile int fd;
          void *close_thread(void *arg)
              for (;;) {
                  usleep(rand() % 100);
          int main()
              pthread_create(&t, NULL, close_thread, NULL);
              for (;;) {
                  fd = socket(rand() % 50, rand() % 11, 0);
                  fchownat(fd, "", 1000, 1000, 0x1000);
      Fixes: 86741ec2 ("net: core: Add a UID field to struct sock.")
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Acked-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  14. 03 Feb, 2019 4 commits
  15. 30 Jan, 2019 4 commits
  16. 18 Dec, 2018 1 commit
    • Arnd Bergmann's avatar
      y2038: socket: Add compat_sys_recvmmsg_time64 · e11d4284
      Arnd Bergmann authored
      recvmmsg() takes two arguments to pointers of structures that differ
      between 32-bit and 64-bit architectures: mmsghdr and timespec.
      For y2038 compatbility, we are changing the native system call from
      timespec to __kernel_timespec with a 64-bit time_t (in another patch),
      and use the existing compat system call on both 32-bit and 64-bit
      architectures for compatibility with traditional 32-bit user space.
      As we now have two variants of recvmmsg() for 32-bit tasks that are both
      different from the variant that we use on 64-bit tasks, this means we
      also require two compat system calls!
      The solution I picked is to flip things around: The existing
      compat_sys_recvmmsg() call gets moved from net/compat.c into net/socket.c
      and now handles the case for old user space on all architectures that
      have set CONFIG_COMPAT_32BIT_TIME.  A new compat_sys_recvmmsg_time64()
      call gets added in the old place for 64-bit architectures only, this
      one handles the case of a compat mmsghdr structure combined with
      In the indirect sys_socketcall(), we now need to call either
      do_sys_recvmmsg() or __compat_sys_recvmmsg(), depending on what kind of
      architecture we are on. For compat_sys_socketcall(), no such change is
      needed, we always call __compat_sys_recvmmsg().
      I decided to not add a new SYS_RECVMMSG_TIME64 socketcall: Any libc
      implementation for 64-bit time_t will need significant changes including
      an updated asm/unistd.h, and it seems better to consistently use the
      separate syscalls that configuration, leaving the socketcall only for
      backward compatibility with 32-bit time_t based libc.
      The naming is asymmetric for the moment, so both existing syscalls
      entry points keep their names, while the new ones are recvmmsg_time32
      and compat_recvmmsg_time64 respectively. I expect that we will rename
      the compat syscalls later as we start using generated syscall tables
      everywhere and add these entry points.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
  17. 18 Nov, 2018 1 commit
  18. 23 Oct, 2018 1 commit
    • David Howells's avatar
      iov_iter: Separate type from direction and use accessor functions · aa563d7b
      David Howells authored
      In the iov_iter struct, separate the iterator type from the iterator
      direction and use accessor functions to access them in most places.
      Convert a bunch of places to use switch-statements to access them rather
      then chains of bitwise-AND statements.  This makes it easier to add further
      iterator types.  Also, this can be more efficient as to implement a switch
      of small contiguous integers, the compiler can use ~50% fewer compare
      instructions than it has to use bitwise-and instructions.
      Further, cease passing the iterator type into the iterator setup function.
      The iterator function can set that itself.  Only the direction is required.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
  19. 18 Oct, 2018 1 commit
    • Wenwen Wang's avatar
      net: socket: fix a missing-check bug · b6168562
      Wenwen Wang authored
      In ethtool_ioctl(), the ioctl command 'ethcmd' is checked through a switch
      statement to see whether it is necessary to pre-process the ethtool
      structure, because, as mentioned in the comment, the structure
      ethtool_rxnfc is defined with padding. If yes, a user-space buffer 'rxnfc'
      is allocated through compat_alloc_user_space(). One thing to note here is
      that, if 'ethcmd' is ETHTOOL_GRXCLSRLALL, the size of the buffer 'rxnfc' is
      partially determined by 'rule_cnt', which is actually acquired from the
      user-space buffer 'compat_rxnfc', i.e., 'compat_rxnfc->rule_cnt', through
      get_user(). After 'rxnfc' is allocated, the data in the original user-space
      buffer 'compat_rxnfc' is then copied to 'rxnfc' through copy_in_user(),
      including the 'rule_cnt' field. However, after this copy, no check is
      re-enforced on 'rxnfc->rule_cnt'. So it is possible that a malicious user
      race to change the value in the 'compat_rxnfc->rule_cnt' between these two
      copies. Through this way, the attacker can bypass the previous check on
      'rule_cnt' and inject malicious data. This can cause undefined behavior of
      the kernel and introduce potential security risk.
      This patch avoids the above issue via copying the value acquired by
      get_user() to 'rxnfc->rule_cn', if 'ethcmd' is ETHTOOL_GRXCLSRLALL.
      Signed-off-by: default avatarWenwen Wang <wang6495@umn.edu>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  20. 05 Oct, 2018 1 commit
  21. 13 Sep, 2018 1 commit
  22. 29 Aug, 2018 1 commit
    • Arnd Bergmann's avatar
      y2038: socket: Change recvmmsg to use __kernel_timespec · c2e6c856
      Arnd Bergmann authored
      This converts the recvmmsg() system call in all its variations to use
      'timespec64' internally for its timeout, and have a __kernel_timespec64
      argument in the native entry point. This lets us change the type to use
      64-bit time_t at a later point while using the 32-bit compat system call
      emulation for existing user space.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
  23. 14 Aug, 2018 1 commit
  24. 31 Jul, 2018 1 commit
  25. 30 Jul, 2018 2 commits
  26. 29 Jul, 2018 2 commits
  27. 12 Jul, 2018 1 commit