1. 29 May, 2020 1 commit
  2. 27 May, 2020 1 commit
  3. 19 May, 2020 2 commits
  4. 11 May, 2020 1 commit
    • Christoph Hellwig's avatar
      net: cleanly handle kernel vs user buffers for ->msg_control · 1f466e1f
      Christoph Hellwig authored
      The msg_control field in struct msghdr can either contain a user
      pointer when used with the recvmsg system call, or a kernel pointer
      when used with sendmsg.  To complicate things further kernel_recvmsg
      can stuff a kernel pointer in and then use set_fs to make the uaccess
      helpers accept it.
      Replace it with a union of a kernel pointer msg_control field, and
      a user pointer msg_control_user one, and allow kernel_recvmsg operate
      on a proper kernel pointer using a bitfield to override the normal
      choice of a user pointer for recvmsg.
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  5. 20 Mar, 2020 1 commit
  6. 10 Mar, 2020 1 commit
  7. 08 Jan, 2020 1 commit
  8. 13 Dec, 2019 1 commit
  9. 10 Dec, 2019 1 commit
  10. 06 Dec, 2019 1 commit
  11. 03 Dec, 2019 2 commits
  12. 26 Nov, 2019 3 commits
  13. 25 Nov, 2019 1 commit
    • Linus Torvalds's avatar
      vfs: mark pipes and sockets as stream-like file descriptors · d8e464ec
      Linus Torvalds authored
      In commit 3975b097 ("convert stream-like files -> stream_open, even
      if they use noop_llseek") Kirill used a coccinelle script to change
      "nonseekable_open()" to "stream_open()", which changed the trivial cases
      of stream-like file descriptors to the new model with FMODE_STREAM.
      However, the two big cases - sockets and pipes - don't actually have
      that trivial pattern at all, and were thus never converted to
      FMODE_STREAM even though it makes lots of sense to do so.
      That's particularly true when looking forward to the next change:
      getting rid of FMODE_ATOMIC_POS entirely, and just using FMODE_STREAM to
      decide whether f_pos updates are needed or not.  And if they are, we'll
      always do them atomically.
      This came up because KCSAN (correctly) noted that the non-locked f_pos
      updates are data races: they are clearly benign for the case where we
      don't care, but it would be good to just not have that issue exist at
      Note that the reason we used FMODE_ATOMIC_POS originally is that only
      doing it for the minimal required case is "safer" in that it's possible
      that the f_pos locking can cause unnecessary serialization across the
      whole write() call.  And in the worst case, that kind of serialization
      can cause deadlock issues: think writers that need readers to empty the
      state using the same file descriptor.
      [ Note that the locking is per-file descriptor - because it protects
        "f_pos", which is obviously per-file descriptor - so it only affects
        cases where you literally use the same file descriptor to both read
        and write.
        So a regular pipe that has separate reading and writing file
        descriptors doesn't really have this situation even though it's the
        obvious case of "reader empties what a bit writer concurrently fills"
        But we want to make pipes as being stream-line anyway, because we
        don't want the unnecessary overhead of locking, and because a named
        pipe can be (ab-)used by reading and writing to the same file
        descriptor. ]
      There are likely a lot of other cases that might want FMODE_STREAM, and
      looking for ".llseek = no_llseek" users and other cases that don't have
      an lseek file operation at all and making them use "stream_open()" might
      be a good idea.  But pipes and sockets are likely to be the two main
      Cc: Kirill Smelkov <kirr@nexedi.com>
      Cc: Eic Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Marco Elver <elver@google.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Paul McKenney <paulmck@kernel.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  14. 15 Nov, 2019 2 commits
  15. 29 Oct, 2019 1 commit
  16. 23 Oct, 2019 2 commits
  17. 09 Jul, 2019 4 commits
    • Jens Axboe's avatar
      io_uring: add support for recvmsg() · aa1fa28f
      Jens Axboe authored
      This is done through IORING_OP_RECVMSG. This opcode uses the same
      sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the
      msghdr struct in the sqe->addr field as well.
      We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't
      block, and punt to async execution if it would have.
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Jens Axboe's avatar
      io_uring: add support for sendmsg() · 0fa03c62
      Jens Axboe authored
      This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags
      for the flags argument, and the msghdr struct is passed in the
      sqe->addr field.
      We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't
      block, and punt to async execution if it would have.
      Acked-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    • Al Viro's avatar
      coallocate socket_wq with socket itself · 333f7909
      Al Viro authored
      socket->wq is assign-once, set when we are initializing both
      struct socket it's in and struct socket_wq it points to.  As the
      matter of fact, the only reason for separate allocation was the
      ability to RCU-delay freeing of socket_wq.  RCU-delaying the
      freeing of socket itself gets rid of that need, so we can just
      fold struct socket_wq into the end of struct socket and simplify
      the life both for sock_alloc_inode() (one allocation instead of
      two) and for tun/tap oddballs, where we used to embed struct socket
      and struct socket_wq into the same structure (now - embedding just
      the struct socket).
      Note that reference to struct socket_wq in struct sock does remain
      a reference - that's unchanged.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Al Viro's avatar
      sockfs: switch to ->free_inode() · 6d7855c5
      Al Viro authored
      we do have an RCU-delayed part there already (freeing the wq),
      so it's not like the pipe situation; moreover, it might be
      worth considering coallocating wq with the rest of struct sock_alloc.
      ->sk_wq in struct sock would remain a pointer as it is, but
      the object it normally points to would be coallocated with
      struct socket...
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  18. 03 Jul, 2019 1 commit
  19. 27 Jun, 2019 1 commit
    • Stanislav Fomichev's avatar
      bpf: implement getsockopt and setsockopt hooks · 0d01da6a
      Stanislav Fomichev authored
      Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
      BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
      BPF_CGROUP_SETSOCKOPT can modify user setsockopt arguments before
      passing them down to the kernel or bypass kernel completely.
      BPF_CGROUP_GETSOCKOPT can can inspect/modify getsockopt arguments that
      kernel returns.
      Both hooks reuse existing PTR_TO_PACKET{,_END} infrastructure.
      The buffer memory is pre-allocated (because I don't think there is
      a precedent for working with __user memory from bpf). This might be
      slow to do for each {s,g}etsockopt call, that's why I've added
      __cgroup_bpf_prog_array_is_empty that exits early if there is nothing
      attached to a cgroup. Note, however, that there is a race between
      __cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where cgroup
      program layout might have changed; this should not be a problem
      because in general there is a race between multiple calls to
      {s,g}etsocktop and user adding/removing bpf progs from a cgroup.
      The return code of the BPF program is handled as follows:
      * 0: EPERM
      * 1: success, continue with next BPF program in the cgroup chain
      * allow overwriting setsockopt arguments (Alexei Starovoitov):
        * use set_fs (same as kernel_setsockopt)
        * buffer is always kzalloc'd (no small on-stack buffer)
      * use s32 for optlen (Andrii Nakryiko)
      * return only 0 or 1 (Alexei Starovoitov)
      * always run all progs (Alexei Starovoitov)
      * use optval=0 as kernel bypass in setsockopt (Alexei Starovoitov)
        (decided to use optval=-1 instead, optval=0 might be a valid input)
      * call getsockopt hook after kernel handlers (Alexei Starovoitov)
      * rework cgroup chaining; stop as soon as bpf program returns
        0 or 2; see patch with the documentation for the details
      * drop Andrii's and Martin's Acked-by (not sure they are comfortable
        with the new state of things)
      * skip copy_to_user() and put_user() when ret == 0 (Martin Lau)
      * don't export bpf_sk_fullsock helper (Martin Lau)
      * size != sizeof(__u64) for uapi pointers (Martin Lau)
      * offsetof instead of bpf_ctx_range when checking ctx access (Martin Lau)
      * typos in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY comments (Andrii Nakryiko)
      * reverse christmas tree in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY (Andrii
      * use __bpf_md_ptr instead of __u32 for optval{,_end} (Martin Lau)
      * use BPF_FIELD_SIZEOF() for consistency (Martin Lau)
      * new CG_SOCKOPT_ACCESS macro to wrap repeated parts
      * moved bpf_sockopt_kern fields around to remove a hole (Martin Lau)
      * aligned bpf_sockopt_kern->buf to 8 bytes (Martin Lau)
      * bpf_prog_array_is_empty instead of bpf_prog_array_length (Martin Lau)
      * added [0,2] return code check to verifier (Martin Lau)
      * dropped unused buf[64] from the stack (Martin Lau)
      * use PTR_TO_SOCKET for bpf_sockopt->sk (Martin Lau)
      * dropped bpf_target_off from ctx rewrites (Martin Lau)
      * use return code for kernel bypass (Martin Lau & Andrii Nakryiko)
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: default avatarStanislav Fomichev <sdf@google.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
  20. 05 Jun, 2019 1 commit
  21. 31 May, 2019 1 commit
  22. 30 May, 2019 1 commit
  23. 25 May, 2019 2 commits
    • David Howells's avatar
      vfs: Convert sockfs to use the new mount API · fba9be49
      David Howells authored
      Convert the sockfs filesystem to the new internal mount API as the old
      one will be obsoleted and removed.  This allows greater flexibility in
      communication of mount parameters between userspace, the VFS and the
      See Documentation/filesystems/mount_api.txt for more information.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      cc: netdev@vger.kernel.org
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
    • Al Viro's avatar
      mount_pseudo(): drop 'name' argument, switch to d_make_root() · 1f58bb18
      Al Viro authored
      Once upon a time we used to set ->d_name of e.g. pipefs root
      so that d_path() on pipes would work.  These days it's
      completely pointless - dentries of pipes are not even connected
      to pipefs root.  However, mount_pseudo() had set the root
      dentry name (passed as the second argument) and callers
      kept inventing names to pass to it.  Including those that
      didn't *have* any non-root dentries to start with...
      All of that had been pointless for about 8 years now; it's
      time to get rid of that cargo-culting...
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  24. 19 May, 2019 1 commit
    • Randy Dunlap's avatar
      net: fix kernel-doc warnings for socket.c · 85806af0
      Randy Dunlap authored
      Fix kernel-doc warnings by moving the kernel-doc notation to be
      immediately above the functions that it describes.
      Fixes these warnings for sock_sendmsg() and sock_recvmsg():
      ../net/socket.c:658: warning: Excess function parameter 'sock' description in 'INDIRECT_CALLABLE_DECLARE'
      ../net/socket.c:658: warning: Excess function parameter 'msg' description in 'INDIRECT_CALLABLE_DECLARE'
      ../net/socket.c:889: warning: Excess function parameter 'sock' description in 'INDIRECT_CALLABLE_DECLARE'
      ../net/socket.c:889: warning: Excess function parameter 'msg' description in 'INDIRECT_CALLABLE_DECLARE'
      ../net/socket.c:889: warning: Excess function parameter 'flags' description in 'INDIRECT_CALLABLE_DECLARE'
      Signed-off-by: default avatarRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  25. 05 May, 2019 1 commit
  26. 26 Apr, 2019 1 commit
  27. 19 Apr, 2019 2 commits
    • Arnd Bergmann's avatar
      net: socket: implement 64-bit timestamps · 0768e170
      Arnd Bergmann authored
      The 'timeval' and 'timespec' data structures used for socket timestamps
      are going to be redefined in user space based on 64-bit time_t in future
      versions of the C library to deal with the y2038 overflow problem,
      which breaks the ABI definition.
      Unlike many modern ioctl commands, SIOCGSTAMP and SIOCGSTAMPNS do not
      use the _IOR() macro to encode the size of the transferred data, so it
      remains ambiguous whether the application uses the old or new layout.
      The best workaround I could find is rather ugly: we redefine the command
      code based on the size of the respective data structure with a ternary
      operator. This lets it get evaluated as late as possible, hopefully after
      that structure is visible to the caller. We cannot use an #ifdef here,
      because inux/sockios.h might have been included before any libc header
      that could determine the size of time_t.
      The ioctl implementation now interprets the new command codes as always
      referring to the 64-bit structure on all architectures, while the old
      architecture specific command code still refers to the old architecture
      specific layout. The new command number is only used when they are
      actually different.
      Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Arnd Bergmann's avatar
      net: rework SIOCGSTAMP ioctl handling · c7cbdbf2
      Arnd Bergmann authored
      The SIOCGSTAMP/SIOCGSTAMPNS ioctl commands are implemented by many
      socket protocol handlers, and all of those end up calling the same
      sock_get_timestamp()/sock_get_timestampns() helper functions, which
      results in a lot of duplicate code.
      With the introduction of 64-bit time_t on 32-bit architectures, this
      gets worse, as we then need four different ioctl commands in each
      socket protocol implementation.
      To simplify that, let's add a new .gettstamp() operation in
      struct proto_ops, and move ioctl implementation into the common
      sock_ioctl()/compat_sock_ioctl_trans() functions that these all go
      We can reuse the sock_get_timestamp() implementation, but generalize
      it so it can deal with both native and compat mode, as well as
      timeval and timespec structures.
      Acked-by: default avatarStefan Schmidt <stefan@datenfreihafen.org>
      Acked-by: default avatarNeil Horman <nhorman@tuxdriver.com>
      Acked-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Link: https://lore.kernel.org/lkml/CAK8P3a038aDQQotzua_QtKGhq8O9n+rdiz2=WDCp82ys8eUT+A@mail.gmail.com/Signed-off-by: default avatarArnd Bergmann <arnd@arndb.de>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  28. 15 Mar, 2019 1 commit
  29. 25 Feb, 2019 1 commit
    • Eric Biggers's avatar
      net: socket: set sock->sk to NULL after calling proto_ops::release() · ff7b11aa
      Eric Biggers authored
      Commit 9060cb71 ("net: crypto set sk to NULL when af_alg_release.")
      fixed a use-after-free in sockfs_setattr() when an AF_ALG socket is
      closed concurrently with fchownat().  However, it ignored that many
      other proto_ops::release() methods don't set sock->sk to NULL and
      therefore allow the same use-after-free:
          - base_sock_release
          - bnep_sock_release
          - cmtp_sock_release
          - data_sock_release
          - dn_release
          - hci_sock_release
          - hidp_sock_release
          - iucv_sock_release
          - l2cap_sock_release
          - llcp_sock_release
          - llc_ui_release
          - rawsock_release
          - rfcomm_sock_release
          - sco_sock_release
          - svc_release
          - vcc_release
          - x25_release
      Rather than fixing all these and relying on every socket type to get
      this right forever, just make __sock_release() set sock->sk to NULL
      itself after calling proto_ops::release().
      Reproducer that produces the KASAN splat when any of these socket types
      are configured into the kernel:
          #include <pthread.h>
          #include <stdlib.h>
          #include <sys/socket.h>
          #include <unistd.h>
          pthread_t t;
          volatile int fd;
          void *close_thread(void *arg)
              for (;;) {
                  usleep(rand() % 100);
          int main()
              pthread_create(&t, NULL, close_thread, NULL);
              for (;;) {
                  fd = socket(rand() % 50, rand() % 11, 0);
                  fchownat(fd, "", 1000, 1000, 0x1000);
      Fixes: 86741ec2 ("net: core: Add a UID field to struct sock.")
      Signed-off-by: default avatarEric Biggers <ebiggers@google.com>
      Acked-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>