1. 28 Apr, 2017 1 commit
  2. 10 Mar, 2017 1 commit
    • David Howells's avatar
      net: Work around lockdep limitation in sockets that use sockets · cdfbabfb
      David Howells authored
      Lockdep issues a circular dependency warning when AFS issues an operation
      through AF_RXRPC from a context in which the VFS/VM holds the mmap_sem.
      The theory lockdep comes up with is as follows:
       (1) If the pagefault handler decides it needs to read pages from AFS, it
           calls AFS with mmap_sem held and AFS begins an AF_RXRPC call, but
           creating a call requires the socket lock:
      	mmap_sem must be taken before sk_lock-AF_RXRPC
       (2) afs_open_socket() opens an AF_RXRPC socket and binds it.  rxrpc_bind()
           binds the underlying UDP socket whilst holding its socket lock.
           inet_bind() takes its own socket lock:
      	sk_lock-AF_RXRPC must be taken before sk_lock-AF_INET
       (3) Reading from a TCP socket into a userspace buffer might cause a fault
           and thus cause the kernel to take the mmap_sem, but the TCP socket is
           locked whilst doing this:
      	sk_lock-AF_INET must be taken before mmap_sem
      However, lockdep's theory is wrong in this...
  3. 09 Mar, 2017 1 commit
    • Paolo Abeni's avatar
      net/tunnel: set inner protocol in network gro hooks · 294acf1c
      Paolo Abeni authored
      The gso code of several tunnels type (gre and udp tunnels)
      takes for granted that the skb->inner_protocol is properly
      initialized and drops the packet elsewhere.
      On the forwarding path no one is initializing such field,
      so gro encapsulated packets are dropped on forward.
      Since commit 38720352 ("gre: Use inner_proto to obtain
      inner header protocol"), this can be reproduced when the
      encapsulated packets use gre as the tunneling protocol.
      The issue happens also with vxlan and geneve tunnels since
      commit 8bce6d7d ("udp: Generalize skb_udp_segment"), if the
      forwarding host's ingress nic has h/w offload for such tunnel
      and a vxlan/geneve device is configured on top of it, regardless
      of the configured peer address and vni.
      To address the issue, this change initialize the inner_protocol
      field for encapsulated packets in both ipv4 and ipv6 gro complete
      Fixes: 38720352 ("gre: Use inner_proto to obtain inner header protocol")
      Fixes: 8bce6d7d
       ("udp: Generalize skb_udp_segment")
      Signed-off-by: default avatarPaolo Abeni <pabeni@redhat.com>
      Acked-by: default avatarAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  4. 15 Feb, 2017 1 commit
    • Steffen Klassert's avatar
      net: Add a skb_gro_flush_final helper. · 5f114163
      Steffen Klassert authored
      Add a skb_gro_flush_final helper to prepare for  consuming
      skbs in call_gro_receive. We will extend this helper to not
      touch the skb if the skb is consumed by a gro callback with
      a followup patch. We need this to handle the upcomming IPsec
      ESP callbacks as they reinject the skb to the napi_gro_receive
      asynchronous. The handler is used in all gro_receive functions
      that can call the ESP gro handlers.
      Signed-off-by: default avatarSteffen Klassert <steffen.klassert@secunet.com>
  5. 25 Jan, 2017 2 commits
    • Willy Tarreau's avatar
      net/tcp-fastopen: make connect()'s return case more consistent with non-TFO · 3979ad7e
      Willy Tarreau authored
      Without TFO, any subsequent connect() call after a successful one returns
      -1 EISCONN. The last API update ensured that __inet_stream_connect() can
      return -1 EINPROGRESS in response to sendmsg() when TFO is in use to
      indicate that the connection is now in progress. Unfortunately since this
      function is used both for connect() and sendmsg(), it has the undesired
      side effect of making connect() now return -1 EINPROGRESS as well after
      a successful call, while at the same time poll() returns POLLOUT. This
      can confuse some applications which happen to call connect() and to
      check for -1 EISCONN to ensure the connection is usable, and for which
      EINPROGRESS indicates a need to poll, causing a loop.
      This problem was encountered in haproxy where a call to connect() is
      precisely used in certain cases to confirm a connection's readiness.
      While arguably haproxy's behaviour should be improved here, it seems
      important to aim at a more robust behaviour when the goal of the new
      API is to make it easier to implement TFO in existing applications.
      This patch simply ensures that we preserve the same semantics as in
      the non-TFO case on the connect() syscall when using TFO, while still
      returning -1 EINPROGRESS on sendmsg(). For this we simply tell
      __inet_stream_connect() whether we're doing a regular connect() or in
      fact connecting for a sendmsg() call.
      Cc: Wei Wang <weiwan@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarWilly Tarreau <w@1wt.eu>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Wei Wang's avatar
      net/tcp-fastopen: Add new API support · 19f6d3f3
      Wei Wang authored
      This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
      alternative way to perform Fast Open on the active side (client). Prior
      to this patch, a client needs to replace the connect() call with
      sendto(MSG_FASTOPEN). This can be cumbersome for applications who want
      to use Fast Open: these socket operations are often done in lower layer
      libraries used by many other applications. Changing these libraries
      and/or the socket call sequences are not trivial. A more convenient
      approach is to perform Fast Open by simply enabling a socket option when
      the socket is created w/o changing other socket calls sequence:
        s = socket()
          create a new socket
        setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
          newly introduced sockopt
          If set, new functionality described below will be used.
          Return ENOTSUPP if TFO is not supported or not enabled in the
          With cookie present, return 0 immediately.
          With no cookie, initiate 3WHS with TFO cookie-request option and
          return -1 with errno = EINPROGRESS.
          With cookie present, send out SYN with data and return the number of
          bytes buffered.
          With no cookie, and 3WHS not yet completed, return -1 with errno =
          No MSG_FASTOPEN flag is needed.
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
          write() is not called yet.
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
          established but no msg is received yet.
          Return number of bytes read if socket is established and there is
          msg received.
      The new API simplifies life for applications that always perform a write()
      immediately after a successful connect(). Such applications can now take
      advantage of Fast Open by merely making one new setsockopt() call at the time
      of creating the socket. Nothing else about the application's socket call
      sequence needs to change.
      Signed-off-by: default avatarWei Wang <weiwan@google.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  6. 24 Jan, 2017 1 commit
    • Krister Johansen's avatar
      Introduce a sysctl that modifies the value of PROT_SOCK. · 4548b683
      Krister Johansen authored
      Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl
      that denotes the first unprivileged inet port in the namespace.  To
      disable all privileged ports set this to zero.  It also checks for
      overlap with the local port range.  The privileged and local range may
      not overlap.
      The use case for this change is to allow containerized processes to bind
      to priviliged ports, but prevent them from ever being allowed to modify
      their container's network configuration.  The latter is accomplished by
      ensuring that the network namespace is not a child of the user
      namespace.  This modification was needed to allow the container manager
      to disable a namespace's priviliged port restrictions without exposing
      control of the network namespace to processes in the user namespace.
      Signed-off-by: default avatarKrister Johansen <kjlx@templeofstupid.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  7. 29 Dec, 2016 1 commit
  8. 24 Dec, 2016 1 commit
  9. 02 Dec, 2016 1 commit
  10. 30 Nov, 2016 1 commit
  11. 03 Nov, 2016 1 commit
    • WANG Cong's avatar
      inet: fix sleeping inside inet_wait_for_connect() · 14135f30
      WANG Cong authored
      Andrey reported this kernel warning:
        WARNING: CPU: 0 PID: 4608 at kernel/sched/core.c:7724
        __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
        do not call blocking ops when !TASK_RUNNING; state=1 set at
        [<ffffffff811f5a5c>] prepare_to_wait+0xbc/0x210
        Modules linked in:
        CPU: 0 PID: 4608 Comm: syz-executor Not tainted 4.9.0-rc2+ #320
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
         ffff88006625f7a0 ffffffff81b46914 ffff88006625f818 0000000000000000
         ffffffff84052960 0000000000000000 ffff88006625f7e8 ffffffff81111237
         ffff88006aceac00 ffffffff00001e2c ffffed000cc4beff ffffffff84052960
        Call Trace:
         [<     inline     >] __dump_stack lib/dump_stack.c:15
         [<ffffffff81b46914>] dump_stack+0xb3/0x10f lib/dump_stack.c:51
         [<ffffffff81111237>] __warn+0x1a7/0x1f0 kernel/panic.c:550
         [<ffffffff8111132c>] warn_slowpath_fmt+0xac/0xd0 kernel/panic.c:565
         [<ffffffff811922fc>] __might_sleep+0x14c/0x1a0 kernel/sched/core.c:7719
         [<     inline     >] slab_pre_alloc_hook mm/slab.h:393
         [<     inline     >] slab_alloc_node mm/slub.c:2634
         [<     inline     >] slab_alloc mm/slub.c:2716
         [<ffffffff81508da0>] __kmalloc_track_caller+0x150/0x2a0 mm/slub.c:4240
         [<ffffffff8146be14>] kmemdup+0x24/0x50 mm/util.c:113
         [<ffffffff8388b2cf>] dccp_feat_clone_sp_val.part.5+0x4f/0xe0 net/dccp/feat.c:374
         [<     inline     >] dccp_feat_clone_sp_val net/dccp/feat.c:1141
         [<     inline     >] dccp_feat_change_recv net/dccp/feat.c:1141
         [<ffffffff8388d491>] dccp_feat_parse_options+0xaa1/0x13d0 net/dccp/feat.c:1411
         [<ffffffff83894f01>] dccp_parse_options+0x721/0x1010 net/dccp/options.c:128
         [<ffffffff83891280>] dccp_rcv_state_process+0x200/0x15b0 net/dccp/input.c:644
         [<ffffffff838b8a94>] dccp_v4_do_rcv+0xf4/0x1a0 net/dccp/ipv4.c:681
         [<     inline     >] sk_backlog_rcv ./include/net/sock.h:872
         [<ffffffff82b7ceb6>] __release_sock+0x126/0x3a0 net/core/sock.c:2044
         [<ffffffff82b7d189>] release_sock+0x59/0x1c0 net/core/sock.c:2502
         [<     inline     >] inet_wait_for_connect net/ipv4/af_inet.c:547
         [<ffffffff8316b2a2>] __inet_stream_connect+0x5d2/0xbb0 net/ipv4/af_inet.c:617
         [<ffffffff8316b8d5>] inet_stream_connect+0x55/0xa0 net/ipv4/af_inet.c:656
         [<ffffffff82b705e4>] SYSC_connect+0x244/0x2f0 net/socket.c:1533
         [<ffffffff82b72dd4>] SyS_connect+0x24/0x30 net/socket.c:1514
         [<ffffffff83fbf701>] entry_SYSCALL_64_fastpath+0x1f/0xc2
      Unlike commit 26cabd31
      ("sched, net: Clean up sk_wait_event() vs. might_sleep()"), the
      sleeping function is called before schedule_timeout(), this is indeed
      a bug. Fix this by moving the wait logic to the new API, it is similar
      to commit ff960a73
      ("netdev, sched/wait: Fix sleeping inside wait event").
      Reported-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  12. 20 Oct, 2016 1 commit
    • Sabrina Dubroca's avatar
      net: add recursion limit to GRO · fcd91dd4
      Sabrina Dubroca authored
      Currently, GRO can do unlimited recursion through the gro_receive
      handlers.  This was fixed for tunneling protocols by limiting tunnel GRO
      to one level with encap_mark, but both VLAN and TEB still have this
      problem.  Thus, the kernel is vulnerable to a stack overflow, if we
      receive a packet composed entirely of VLAN headers.
      This patch adds a recursion counter to the GRO layer to prevent stack
      overflow.  When a gro_receive function hits the recursion limit, GRO is
      aborted for this skb and it is processed normally.  This recursion
      counter is put in the GRO CB, but could be turned into a percpu counter
      if we run out of space in the CB.
      Thanks to Vladimír Beneš <vbenes@redhat.com> for the initial bug report.
      Fixes: CVE-2016-7039
      Fixes: 9b174d88 ("net: Add Transparent Ethernet Bridging GRO support.")
      Fixes: 66e5133f
       ("vlan: Add GRO support for non hardware accelerated vlan")
      Signed-off-by: default avatarSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: default avatarJiri Benc <jbenc@redhat.com>
      Acked-by: default avatarHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: default avatarTom Herbert <tom@herbertland.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  13. 20 Sep, 2016 1 commit
  14. 29 Aug, 2016 1 commit
  15. 24 Aug, 2016 1 commit
  16. 12 Jul, 2016 1 commit
    • Paul Gortmaker's avatar
      ipv4: af_inet: make it explicitly non-modular · d3fc0353
      Paul Gortmaker authored
      The Makefile controlling compilation of this file is obj-y,
      meaning that it currently is never being built as a module.
      Since MODULE_ALIAS is a no-op for non-modular code, we can simply
      remove the MODULE_ALIAS_NETPROTO variant used here.
      We replace module.h with kmod.h since the file does make use of
      request_module() in order to load other modules from here.
      We don't have to worry about init.h coming in via the removed
      module.h since the file explicitly includes init.h already.
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: netdev@vger.kernel.org
      Signed-off-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  17. 23 May, 2016 1 commit
    • Ezequiel Garcia's avatar
      ipv4: Fix non-initialized TTL when CONFIG_SYSCTL=n · 049bbf58
      Ezequiel Garcia authored
      Commit fa50d974 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
      moves the default TTL assignment, and as side-effect IPv4 TTL now
      has a default value only if sysctl support is enabled (CONFIG_SYSCTL=y).
      The sysctl_ip_default_ttl is fundamental for IP to work properly,
      as it provides the TTL to be used as default. The defautl TTL may be
      used in ip_selected_ttl, through the following flow:
      This commit fixes the issue by assigning net->ipv4.sysctl_ip_default_ttl
      in net_init_net, called during ipv4's initialization.
      Without this commit, a kernel built without sysctl support will send
      all IP packets with zero TTL (unless a TTL is explicitly set, e.g.
      with setsockopt).
      Given a similar issue might appear on the other knobs that were
      namespaceify, this commit also moves them.
      Fixes: fa50d974 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
      Signed-off-by: Ezequiel Garcia <e...
  18. 20 May, 2016 3 commits
  19. 14 Apr, 2016 3 commits
    • Alexander Duyck's avatar
      GSO: Support partial segmentation offload · 802ab55a
      Alexander Duyck authored
      This patch adds support for something I am referring to as GSO partial.
      The basic idea is that we can support a broader range of devices for
      segmentation if we use fixed outer headers and have the hardware only
      really deal with segmenting the inner header.  The idea behind the naming
      is due to the fact that everything before csum_start will be fixed headers,
      and everything after will be the region that is handled by hardware.
      With the current implementation it allows us to add support for the
      following GSO types with an inner TSO_MANGLEID or TSO6 offload:
      In the case of hardware that already supports tunneling we may be able to
      extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
      hardware can support updating inner IPv4 headers.
      Signed-off-by: default avatarAlexander Duyck <aduyck@mirantis.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Alexander Duyck's avatar
      GRO: Add support for TCP with fixed IPv4 ID field, limit tunnel IP ID values · 1530545e
      Alexander Duyck authored
      This patch does two things.
      First it allows TCP to aggregate TCP frames with a fixed IPv4 ID field.  As
      a result we should now be able to aggregate flows that were converted from
      IPv6 to IPv4.  In addition this allows us more flexibility for future
      implementations of segmentation as we may be able to use a fixed IP ID when
      segmenting the flow.
      The second thing this does is that it places limitations on the outer IPv4
      ID header in the case of tunneled frames.  Specifically it forces the IP ID
      to be incrementing by 1 unless the DF bit is set in the outer IPv4 header.
      This way we can avoid creating overlapping series of IP IDs that could
      possibly be fragmented if the frame goes through GRO and is then
      resegmented via GSO.
      Signed-off-by: default avatarAlexander Duyck <aduyck@mirantis.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
    • Alexander Duyck's avatar
      GSO: Add GSO type for fixed IPv4 ID · cbc53e08
      Alexander Duyck authored
      This patch adds support for TSO using IPv4 headers with a fixed IP ID
      field.  This is meant to allow us to do a lossless GRO in the case of TCP
      flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
      In addition I am adding a feature that for now I am referring to TSO with
      IP ID mangling.  Basically when this flag is enabled the device has the
      option to either output the flow with incrementing IP IDs or with a fixed
      IP ID regardless of what the original IP ID ordering was.  This is useful
      in cases where the DF bit is set and we do not care if the original IP ID
      value is maintained.
      Signed-off-by: default avatarAlexander Duyck <aduyck@mirantis.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  20. 07 Apr, 2016 1 commit
  21. 05 Apr, 2016 1 commit
    • samanthakumar's avatar
      udp: enable MSG_PEEK at non-zero offset · 627d2d6b
      samanthakumar authored
      Enable peeking at UDP datagrams at the offset specified with socket
      option SOL_SOCKET/SO_PEEK_OFF. Peek at any datagram in the queue, up
      to the end of the given datagram.
      Implement the SO_PEEK_OFF semantics introduced in commit ef64a54f
      ("sock: Introduce the SO_PEEK_OFF sock option"). Increase the offset
      on peek, decrease it on regular reads.
      When peeking, always checksum the packet immediately, to avoid
      recomputation on subsequent peeks and final read.
      The socket lock is not held for the duration of udp_recvmsg, so
      peek and read operations can run concurrently. Only the last store
      to sk_peek_off is preserved.
      Signed-off-by: default avatarSam Kumar <samanthakumar@google.com>
      Signed-off-by: default avatarWillem de Bruijn <willemb@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  22. 22 Mar, 2016 1 commit
    • Deepa Dinamani's avatar
      net: ipv4: Fix truncated timestamp returned by inet_current_timestamp() · 3ba9d300
      Deepa Dinamani authored
      The millisecond timestamps returned by the function is
      converted to network byte order by making a call to htons().
      htons() only returns __be16 while __be32 is required here.
      This was identified by the sparse warning from the buildbot:
      net/ipv4/af_inet.c:1405:16: sparse: incorrect type in return
      			    expression (different base types)
      net/ipv4/af_inet.c:1405:16: expected restricted __be32
      net/ipv4/af_inet.c:1405:16: got restricted __be16 [usertype] <noident>
      Change the function to use htonl() to return the correct __be32 type
      instead so that the millisecond value doesn't get truncated.
      Signed-off-by: default avatarDeepa Dinamani <deepa.kernel@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: James Morris <jmorris@namei.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Fixes: 822c8685
       ("net: ipv4: Convert IP network timestamps to be y2038 safe")
      Reported-by: Fengguang Wu <fengguang.wu@intel.com> [0-day test robot]
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  23. 20 Mar, 2016 2 commits
  24. 01 Mar, 2016 1 commit
  25. 17 Feb, 2016 1 commit
  26. 11 Feb, 2016 1 commit
  27. 14 Dec, 2015 1 commit
    • Hannes Frederic Sowa's avatar
      net: add validation for the socket syscall protocol argument · 79462ad0
      Hannes Frederic Sowa authored
      郭永刚 reported that one could simply crash the kernel as root by
      using a simple program:
      	int socket_fd;
      	struct sockaddr_in addr;
      	addr.sin_port = 0;
      	addr.sin_addr.s_addr = INADDR_ANY;
      	addr.sin_family = 10;
      	socket_fd = socket(10,3,0x40000000);
      	connect(socket_fd , &addr,16);
      AF_INET, AF_INET6 sockets actually only support 8-bit protocol
      identifiers. inet_sock's skc_protocol field thus is sized accordingly,
      thus larger protocol identifiers simply cut off the higher bits and
      store a zero in the protocol fields.
      This could lead to e.g. NULL function pointer because as a result of
      the cut off inet_num is zero and we call down to inet_autobind, which
      is NULL for raw sockets.
      kernel: Call Trace:
      kernel:  [<ffffffff816db90e>] ? inet_autobind+0x2e/0x70
      kernel:  [<ffffffff816db9a4>] inet_dgram_connect+0x54/0x80
      kernel:  [<ffffffff81645069>] SYSC_connect+0xd9/0x110
      kernel:  [<ffffffff810ac51b>] ? ptrace_notify+0x5b/0x80
      kernel:  [<ffffffff810...
  28. 30 Sep, 2015 1 commit
  29. 29 Sep, 2015 1 commit
    • Eric Dumazet's avatar
      tcp: prepare fastopen code for upcoming listener changes · 0536fcc0
      Eric Dumazet authored
      While auditing TCP stack for upcoming 'lockless' listener changes,
      I found I had to change fastopen_init_queue() to properly init the object
      before publishing it.
      Otherwise an other cpu could try to lock the spinlock before it gets
      properly initialized.
      Instead of adding appropriate barriers, just remove dynamic memory
      allocations :
      - Structure is 28 bytes on 64bit arches. Using additional 8 bytes
        for holding a pointer seems overkill.
      - Two listeners can share same cache line and performance would suffer.
      If we really want to save few bytes, we would instead dynamically allocate
      whole struct request_sock_queue in the future.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
  30. 18 Sep, 2015 1 commit
  31. 01 Sep, 2015 1 commit
  32. 31 Aug, 2015 3 commits