1. 28 Apr, 2017 1 commit
  2. 21 Apr, 2017 1 commit
    • gso: Validate assumption of frag_list segementation · 43170c4e
      Ilan Tayari authored
      Commit 07b26c94 ("gso: Support partial splitting at the frag_list
      pointer") assumes that all SKBs in a frag_list (except maybe the last
      one) contain the same amount of GSO payload.
      
      This assumption is not always correct, resulting in the following
      warning message in the log:
          skb_segment: too many frags
      
      For example, mlx5 driver in Striding RQ mode creates some RX SKBs with
      one frag, and some with 2 frags.
      After GRO, the frag_list SKBs end up having different amounts of payload.
      If this frag_list SKB is then forwarded, the aforementioned assumption
      is violated.
      
      Validate the assumption, and fall back to software GSO if it is not true.
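
The validation described above can be modelled in user space. This is a minimal sketch, not the kernel code: `struct frag_skb` and `frag_list_is_uniform()` are hypothetical stand-ins for the real `sk_buff` chain and for the check performed before splitting at the frag_list pointer. It accepts a list only when every entry except possibly the last carries the same payload length as the first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one skb on a frag_list; only the length matters. */
struct frag_skb {
    size_t len;             /* GSO payload bytes carried by this skb */
    struct frag_skb *next;  /* next skb on the list, or NULL */
};

/*
 * Return true when all entries except possibly the last have the same
 * payload length as the first entry. A false result corresponds to the
 * "fall back to software GSO" case in the commit message.
 */
bool frag_list_is_uniform(const struct frag_skb *list)
{
    const struct frag_skb *iter;
    size_t frag_len;

    if (!list)
        return true;            /* nothing to violate the assumption */

    frag_len = list->len;
    for (iter = list; iter; iter = iter->next) {
        /* Only the last entry (iter->next == NULL) may differ. */
        if (iter->len != frag_len && iter->next)
            return false;
    }
    return true;
}
```

A list of lengths 1448, 1448, 500 passes, while 1448, 2896, 500 fails because a non-final entry differs from the first.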
      
      Change-Id: Ia03983f4a47b6534dd987d7a2aad96d54d46d212
      Fixes: 07b26c94 ("gso: Support partial splitting at the frag_list pointer")
      Signed-off-by: Ilan Tayari <ilant@mellanox.com>
      Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 17 Apr, 2017 1 commit
    • net-timestamp: avoid use-after-free in ip_recv_error · 1862d620
      Willem de Bruijn authored
      Syzkaller reported a use-after-free in ip_recv_error at line
      
          info->ipi_ifindex = skb->dev->ifindex;
      
      This function is called on dequeue from the error queue, at which
      point the device pointer may no longer be valid.
      
      Save ifindex on enqueue in __skb_complete_tx_timestamp, when the
      pointer is valid or NULL. Store it in temporary storage skb->cb.
      
      It is safe to reference skb->dev here, as called from device drivers
      or dev_queue_xmit. The exception is when called from tcp_ack_tstamp;
      in that case it is NULL and ifindex is set to 0 (invalid).
      
      Do not return a pktinfo cmsg if ifindex is 0. This maintains the
      current behavior of not returning a cmsg if skb->dev was NULL.
      
      On dequeue, the ipv4 path will cast from sock_exterr_skb to
      in_pktinfo. Both have ifindex as their first element, so no explicit
      conversion is needed. This is by design, introduced in commit
      0b922b7a ("net: original ingress device index in PKTINFO"). For
      ipv6 ip6_datagram_support_cmsg converts to in6_pktinfo.
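
The save-on-enqueue idea can be illustrated with a small user-space model. This is a sketch under stated assumptions: `struct skb_model`, `struct err_cb_model`, `enqueue_error()` and `dequeue_ifindex()` are hypothetical names, and the 48-byte `cb` array only imitates the role of `skb->cb`. The point is that the dequeue side reads the saved index and never dereferences the device pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for net_device and sk_buff. */
struct net_device_model {
    int ifindex;
};

struct skb_model {
    struct net_device_model *dev;  /* may be stale by dequeue time */
    unsigned char cb[48];          /* scratch space, in the role of skb->cb */
};

/* Layout kept in cb; ifindex comes first, as the commit text describes. */
struct err_cb_model {
    int ifindex;
};

/* Enqueue side: dev is valid or NULL here, so capture the index now. */
void enqueue_error(struct skb_model *skb)
{
    struct err_cb_model cb;

    assert(sizeof(cb) <= sizeof(skb->cb));
    cb.ifindex = skb->dev ? skb->dev->ifindex : 0;  /* 0 means "invalid" */
    memcpy(skb->cb, &cb, sizeof(cb));
}

/* Dequeue side: read the saved value only, never skb->dev. */
int dequeue_ifindex(const struct skb_model *skb)
{
    struct err_cb_model cb;

    memcpy(&cb, skb->cb, sizeof(cb));
    return cb.ifindex;
}

/* A pktinfo cmsg is only worth returning when the index is non-zero. */
int should_emit_pktinfo(const struct skb_model *skb)
{
    return dequeue_ifindex(skb) != 0;
}
```

In this model the device pointer can be cleared or freed between the two calls without affecting the value reported on dequeue.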
      
      Fixes: 829ae9d6 ("net-timestamp: allow reading recv cmsg on errqueue with origin tstamp")
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 22 Mar, 2017 2 commits
  5. 07 Mar, 2017 2 commits
  6. 03 Feb, 2017 1 commit
  7. 02 Feb, 2017 1 commit
  8. 27 Jan, 2017 1 commit
    • net: adjust skb->truesize in pskb_expand_head() · 158f323b
      Eric Dumazet authored
      
      
      Slava Shwartsman reported a warning in skb_try_coalesce(), when we
      detect skb->truesize is completely wrong.
      
      In his case, the issue came from IPv6 reassembly coping with malicious
      datagrams that forced various pskb_may_pull() calls to reallocate a bigger
      skb->head than the one allocated by the NIC driver before entering the GRO
      layer.
      
      Current code does not change skb->truesize, leaving this burden to
      callers if they care enough.
      
      Blindly changing skb->truesize in pskb_expand_head() is not
      easy, as some producers might track skb->truesize, for example
      in the xmit path for back-pressure feedback (sk->sk_wmem_alloc).
      
      We can detect the cases where it should be safe to change
      skb->truesize :
      
      1) skb is not attached to a socket.
      2) If it is attached to a socket, destructor is sock_edemux()
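
The two conditions above reduce to a single predicate. The following is a simplified user-space sketch: `struct skb_model`, `sock_edemux_model()` and `sock_wfree_model()` are hypothetical stand-ins, and the accounting helper only imitates what a head reallocation would add to the truesize.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct skb_model;
typedef void (*destructor_fn)(struct skb_model *);

/* Hypothetical stand-in for the sk_buff fields involved. */
struct skb_model {
    void *sk;                  /* owning socket, or NULL */
    destructor_fn destructor;  /* called when the skb is freed */
    unsigned int truesize;     /* memory charged to this skb */
};

/* Stand-ins for two destructors; only their identity matters here. */
void sock_edemux_model(struct skb_model *skb) { (void)skb; }
void sock_wfree_model(struct skb_model *skb) { (void)skb; }

/* Safe when no socket is attached, or the destructor is the edemux one. */
bool truesize_adjust_is_safe(const struct skb_model *skb)
{
    return skb->sk == NULL || skb->destructor == sock_edemux_model;
}

/* Account for a head that grew from old_size to new_size bytes. */
void expand_head_account(struct skb_model *skb,
                         unsigned int old_size, unsigned int new_size)
{
    assert(new_size >= old_size);
    if (truesize_adjust_is_safe(skb))
        skb->truesize += new_size - old_size;
    /* Otherwise the caller keeps responsibility for the accounting. */
}
```

An skb whose destructor tracks write memory is left untouched in this model, which mirrors the concern about producers that track skb->truesize themselves.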
      
      My audit gave only two callers doing their own skb->truesize
      manipulation.
      
      I had to remove skb parameter in sock_edemux macro when
      CONFIG_INET is not set to avoid a compile error.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Slava Shwartsman <slavash@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 11 Jan, 2017 1 commit
  10. 09 Jan, 2017 1 commit
  11. 25 Dec, 2016 1 commit
    • ktime: Get rid of the union · 2456e855
      Thomas Gleixner authored
      
      
      ktime is a union because the initial implementation stored the time in
      scalar nanoseconds on 64-bit machines and in an endianness-optimized
      timespec variant for 32-bit machines. The Y2038 cleanup removed the
      timespec variant and switched everything to scalar nanoseconds. The union
      remained, but became completely pointless.
      
      Get rid of the union and just keep ktime_t as a simple typedef of type s64.
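
The before and after shapes can be compared in a small user-space model. This is a sketch with hypothetical names (`union ktime_old`, `ktime_new_t`, and the `_model` helpers); it uses `int64_t` in place of the kernel's s64 and ignores overflow handling.

```c
#include <assert.h>
#include <stdint.h>

/* Before: a union whose only remaining member is the scalar value. */
union ktime_old {
    int64_t tv64;
};

/* After: a plain signed 64-bit nanosecond count. */
typedef int64_t ktime_new_t;

#define NSEC_PER_SEC_MODEL 1000000000LL

/* Build a time value from seconds and nanoseconds (no overflow checks). */
ktime_new_t ktime_set_model(int64_t secs, long nsecs)
{
    assert(nsecs >= 0 && nsecs < NSEC_PER_SEC_MODEL);
    return secs * NSEC_PER_SEC_MODEL + nsecs;
}

/* With a scalar type, arithmetic needs no member access. */
ktime_new_t ktime_add_model(ktime_new_t a, ktime_new_t b)
{
    return a + b;
}

int64_t ktime_to_ns_model(ktime_new_t kt)
{
    return kt;
}
```

With the union, every such operation had to go through the `tv64` member; with the typedef, ordinary integer expressions work directly.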
      
      The conversion was done with coccinelle and some manual mopping up.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
  12. 24 Dec, 2016 1 commit
  13. 10 Dec, 2016 1 commit
  14. 08 Dec, 2016 1 commit
    • udp: under rx pressure, try to condense skbs · c8c8b127
      Eric Dumazet authored
      
      
      Under UDP flood, many softirq producers try to add packets to
      UDP receive queue, and one user thread is burning one cpu trying
      to dequeue packets as fast as possible.
      
      Two parts of the per packet cost are :
      - copying payload from kernel space to user space,
      - freeing memory pieces associated with skb.
      
      If socket is under pressure, softirq handler(s) can try to pull in
      skb->head the payload of the packet if it fits.
      
      Meaning the softirq handler(s) can free/reuse the page fragment
      immediately, instead of letting udp_recvmsg() do this hundreds of usec
      later, possibly from another node.
      
      Additional gains :
      - We reduce skb->truesize and thus can store more packets per SO_RCVBUF
      - We avoid cache line misses at copyout() time and consume_skb() time,
      and avoid one put_page() with potential alien freeing on NUMA hosts.
      
      This comes at the cost of a copy, bounded to available tail room, which
      is usually small. (We might have to fix GRO_MAX_HEAD which looks bigger
      than necessary)
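
The trade described above (one bounded copy in exchange for an early page release and a smaller truesize) can be sketched in user space. This is a simplified model, not the kernel helper: `struct skb_model` and `condense()` are hypothetical, the 256-byte head is an arbitrary size, and a single heap buffer stands in for the page fragment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in: a linear head plus at most one fragment. */
struct skb_model {
    unsigned char head[256];  /* linear area (arbitrary size for the sketch) */
    size_t head_len;          /* bytes used in head */
    unsigned char *frag;      /* heap buffer standing in for a page frag */
    size_t frag_len;          /* payload bytes in frag */
    size_t frag_truesize;     /* memory charged for the fragment */
    size_t truesize;          /* total memory charged to the skb */
};

/*
 * If the fragment payload fits in the tail room of the head, copy it
 * there, release the fragment at once and shrink truesize.
 * Returns true when the skb was condensed.
 */
bool condense(struct skb_model *skb)
{
    size_t tailroom;

    assert(skb->head_len <= sizeof(skb->head));
    tailroom = sizeof(skb->head) - skb->head_len;

    if (!skb->frag || skb->frag_len > tailroom)
        return false;             /* nothing to do, or it does not fit */

    memcpy(skb->head + skb->head_len, skb->frag, skb->frag_len);
    skb->head_len += skb->frag_len;

    free(skb->frag);              /* producer frees the memory right away */
    skb->frag = NULL;
    skb->frag_len = 0;

    skb->truesize -= skb->frag_truesize;
    skb->frag_truesize = 0;
    return true;
}
```

The copy is bounded by the tail room, so a packet whose payload does not fit is left as it was.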
      
      This patch gave me about 5 % increase in throughput in my tests.
      
      The skb_condense() helper could probably be used in other contexts.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 02 Dec, 2016 1 commit
  16. 30 Nov, 2016 1 commit
    • tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING · 1c885808
      Francis Yan authored
      
      
      This patch exports the sender chronograph stats via the socket
      SO_TIMESTAMPING channel. Currently we can instrument how long a
      particular application unit of data was queued in TCP by tracking
      SOF_TIMESTAMPING_TX_SOFTWARE and SOF_TIMESTAMPING_TX_SCHED. Exporting
      these sender chronograph stats along with these timestamps allows a
      further breakdown of the various sender limitations. For example, a
      video server can tell whether a particular chunk of video on a
      connection takes a long time to deliver because TCP was experiencing
      a small receive window. Before this patch, that could not be
      determined without packet traces.
      
      To prepare these stats, the user needs to set
      SOF_TIMESTAMPING_OPT_STATS and SOF_TIMESTAMPING_OPT_TSONLY flags
      while requesting other SOF_TIMESTAMPING TX timestamps. When the
      timestamps are available in the error queue, the stats are returned
      in a separate control message of type SCM_TIMESTAMPING_OPT_STATS,
      in a list of TLVs (struct nlattr) of types: TCP_NLA_BUSY_TIME,
      TCP_NLA_RWND_LIMITED, TCP_NLA_SNDBUF_LIMITED. Unit is microsecond.
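
Walking such a list of TLVs could look like the sketch below. It is deliberately independent of kernel headers: `struct tlv_hdr` is a local type that mirrors the length/type header and 4-byte alignment of `struct nlattr`, and `tlv_put_u64()` / `tlv_find_u64()` are hypothetical helpers. The attribute type numbers used with them are placeholders, not the real TCP_NLA_* values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local header mirroring the nlattr layout: total length, then type. */
struct tlv_hdr {
    uint16_t len;   /* header plus payload, excluding padding */
    uint16_t type;
};

/* Attributes start on 4-byte boundaries. */
#define TLV_ALIGN(n) (((n) + 3u) & ~3u)

/* Append one 64-bit attribute at offset off; returns the next offset. */
size_t tlv_put_u64(unsigned char *buf, size_t cap, size_t off,
                   uint16_t type, uint64_t val)
{
    struct tlv_hdr h;

    h.len = (uint16_t)(sizeof(struct tlv_hdr) + sizeof(val));
    h.type = type;
    assert(off + TLV_ALIGN(h.len) <= cap);
    memcpy(buf + off, &h, sizeof(h));
    memcpy(buf + off + sizeof(h), &val, sizeof(val));
    return off + TLV_ALIGN(h.len);
}

/*
 * Look up the first 64-bit attribute of the given type.
 * Returns 1 and stores the value on success, 0 if absent or malformed.
 */
int tlv_find_u64(const unsigned char *buf, size_t buflen,
                 uint16_t type, uint64_t *out)
{
    size_t off = 0;

    while (off + sizeof(struct tlv_hdr) <= buflen) {
        struct tlv_hdr h;

        memcpy(&h, buf + off, sizeof(h));
        if (h.len < sizeof(h) || h.len > buflen - off)
            return 0;                       /* truncated or corrupt entry */
        if (h.type == type && h.len - sizeof(h) == sizeof(uint64_t)) {
            memcpy(out, buf + off + sizeof(h), sizeof(*out));
            return 1;
        }
        off += TLV_ALIGN(h.len);            /* skip to the next attribute */
    }
    return 0;
}
```

In a real receiver the buffer would be the payload of the SCM_TIMESTAMPING_OPT_STATS control message read from the error queue, and the looked-up values would be durations in microseconds.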
      Signed-off-by: Francis Yan <francisyyan@gmail.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 24 Nov, 2016 1 commit
    • tcp: enhance tcp_collapse_retrans() with skb_shift() · f8071cde
      Eric Dumazet authored
      In commit 2331ccc5 ("tcp: enhance tcp collapsing"),
      we made a first step allowing copying right skb to left skb head.
      
      Since all skbs in socket write queue are headless (but possibly the very
      first one), this strategy often does not work.
      
      This patch extends tcp_collapse_retrans() to perform frag shifting,
      thanks to skb_shift() helper.
      
      This helper needs to not BUG on non headless skbs, as callers are ok
      with that.
      
      Tested:
      
      Following packetdrill test now passes :
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 8>
         +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      +.100 < . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
         +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
         +0 write(4, ..., 200) = 200
         +0 > P. 1:201(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 201:401(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 401:601(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 601:801(200) ack 1
      +.001 write(4, ..., 200) = 200
         +0 > P. 801:1001(200) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1001:1101(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1101:1201(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1201:1301(100) ack 1
      +.001 write(4, ..., 100) = 100
         +0 > P. 1301:1401(100) ack 1
      
      +.099 < . 1:1(0) ack 201 win 257
      +.001 < . 1:1(0) ack 201 win 257 <nop,nop,sack 1001:1401>
         +0 > P. 201:1001(800) ack 1
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 20 Nov, 2016 1 commit
  19. 08 Nov, 2016 1 commit
    • sock: do not set sk_err in sock_dequeue_err_skb · f5f99309
      Soheil Hassas Yeganeh authored
      
      
      Do not set sk_err when dequeuing errors from the error queue.
      Doing so results in:
      a) Bugs: By overwriting existing sk_err values, it possibly
         hides legitimate errors. It is also incorrect when local
         errors are queued with ip_local_error. That happens in the
         context of a system call, which already returns the error
         code.
      b) Inconsistent behavior: When there are pending errors on
         the error queue, sk_err is sometimes 0 (e.g., for
         the first timestamp on the error queue) and sometimes
         set to an error code (after dequeuing the first
         timestamp).
      c) Suboptimality: Setting sk_err to ENOMSG on simple
         TX timestamps can abort parallel reads and writes.
      
      Removing this line doesn't break userspace. This is because
      userspace code cannot rely on sk_err for detecting whether
      there is something on the error queue. Except for ICMP messages
      received for UDP and RAW, sk_err is not set at enqueue time,
      and as a result sk_err can be 0 while there are plenty of
      errors on the error queue.
      
      For ICMP packets in UDP and RAW, sk_err is set when they are
      enqueued on the error queue, but that does not result in aborting
      reads and writes. For such cases, sk_err is only readable via
      getsockopt(SO_ERROR) which will reset the value of sk_err on
      its own. More importantly, prior to this patch,
      recvmsg(MSG_ERRQUEUE) has a race on setting sk_err (i.e.,
      sk_err is set by sock_dequeue_err_skb without atomic ops or
      locks) which can store 0 in sk_err even when we have ICMP
      messages pending. Removing this line from sock_dequeue_err_skb
      eliminates that race.
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 04 Oct, 2016 2 commits
    • net: skbuff: Limit skb_vlan_pop/push() to expect skb->data at mac header · b6a79208
      Shmulik Ladkani authored
      
      
      skb_vlan_pop/push were too generic, trying to support the cases where
      skb->data is at mac header, and cases where skb->data is arbitrarily
      elsewhere.
      
      Supporting an arbitrary skb->data was complex and bogus:
       - It failed to unwind skb->data to its original location post actual
         pop/push.
         (Also, semantic is not well defined for unwinding: If data was into
          the eth header, need to use same offset from start; But if data was
          at network header or beyond, need to adjust the original offset
          according to the push/pull)
       - It mangled the rcsum post actual push/pop, without taking into account
         that the eth bytes might already have been pulled out of the csum.
      
      Most callers (ovs, bpf) already had their skb->data at mac_header upon
      invoking skb_vlan_pop/push.
      Last caller that failed to do so (act_vlan) has been recently fixed.
      
      Therefore, to simplify things, no longer support arbitrary skb->data
      inputs for skb_vlan_pop/push().
      
      skb->data is expected to be exactly at mac_header; WARN otherwise.
      Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Pravin Shelar <pshelar@ovn.org>
      Cc: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • skb_splice_bits(): get rid of callback · 25869262
      Al Viro authored
      
      
      since pipe_lock is the outermost now, we don't need to drop/regain
      socket locks around the call of splice_to_pipe() from skb_splice_bits(),
      which kills the need to have a socket-specific callback; we can just
      call splice_to_pipe() and be done with that.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  21. 22 Sep, 2016 3 commits
  22. 20 Sep, 2016 1 commit
  23. 09 Sep, 2016 1 commit
    • tcp: use an RB tree for ooo receive queue · 9f5afeae
      Yaogong Wang authored
      
      
      Over the years, TCP BDP has increased by several orders of magnitude,
      and some people are considering reaching the 2 Gbyte limit.
      
      Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
      MSS.
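
The figure above can be reproduced with a short calculation. This is a sketch under one stated assumption: the MSS of 1448 bytes used in the usage note (a 1500-byte MTU with TCP timestamps) is not given in the commit message. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Largest window expressible with a 16-bit field and the given scale. */
uint64_t max_window_bytes(unsigned int wscale)
{
    assert(wscale <= 14);           /* 14 is the window scale limit */
    return (uint64_t)65535 << wscale;
}

/* Number of full MSS-sized segments that fit in the window. */
uint64_t full_segments(uint64_t window, unsigned int mss)
{
    assert(mss > 0);
    return window / mss;
}
```

With a window scale of 14 the window is 65535 << 14 = 1,073,725,440 bytes, and dividing by an assumed MSS of 1448 gives 741,523 full segments, in line with the roughly 740,000 quoted.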
      
      In presence of packet losses (or reorders), TCP stores incoming packets
      into an out of order queue, and number of skbs sitting there waiting for
      the missing packets to be received can be in the 10^5 range.
      
      Most packets are appended to the tail of this queue, and when
      packets can finally be transferred to receive queue, we scan the queue
      from its head.
      
      However, in presence of heavy losses, we might have to find an arbitrary
      point in this queue, involving a linear scan for every incoming packet,
      throwing away cpu caches.
      
      This patch converts it to a RB tree, to get bounded latencies.
      
      Yaogong wrote a preliminary patch about 2 years ago.
      Eric did the rebase, added ofo_last_skb cache, polishing and tests.
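
The access pattern motivating the change (mostly tail appends, occasional inserts at an arbitrary point) can be illustrated with a much simpler structure. The sketch below is not an RB tree: it swaps in a sorted array with binary search plus a tail fast path, which shows the logarithmic lookup and the effect of caching the last element, but still pays a linear `memmove()` on mid-queue inserts that a tree avoids. All names are hypothetical, and sequence number wraparound is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OOO_CAP 64  /* arbitrary capacity for the sketch */

struct ooo_queue {
    uint32_t seq[OOO_CAP];    /* start sequence numbers, ascending */
    size_t n;                 /* entries in use */
    unsigned long tail_hits;  /* inserts served by the tail fast path */
};

/* Insert a segment start; returns 0 on success, -1 if full or duplicate. */
int ooo_insert(struct ooo_queue *q, uint32_t seq)
{
    size_t lo = 0, hi = q->n;

    assert(q->n <= OOO_CAP);
    if (q->n == OOO_CAP)
        return -1;

    /* Fast path: most segments land after the current last entry. */
    if (q->n == 0 || seq > q->seq[q->n - 1]) {
        q->seq[q->n++] = seq;
        q->tail_hits++;
        return 0;
    }

    /* Slow path: binary search for the first entry >= seq. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (q->seq[mid] < seq)
            lo = mid + 1;
        else
            hi = mid;
    }
    if (q->seq[lo] == seq)
        return -1;                      /* already queued */

    /* Shift the tail up by one slot; a tree would relink nodes instead. */
    memmove(&q->seq[lo + 1], &q->seq[lo], (q->n - lo) * sizeof(q->seq[0]));
    q->seq[lo] = seq;
    q->n++;
    return 0;
}
```

Inserting 3000, 4448 and then 1552 takes the fast path twice and the search path once, leaving the entries in ascending order.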
      
      Tested with network dropping between 1 and 10 % packets, with good
      success (about 30 % increase of throughput in stress tests)
      
      Next step would be to also use an RB tree for the write queue at sender
      side ;)
      Signed-off-by: Yaogong Wang <wygivan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 01 Jul, 2016 1 commit
  25. 04 Jun, 2016 1 commit
  26. 03 Jun, 2016 4 commits
  27. 10 May, 2016 1 commit
  28. 04 May, 2016 2 commits
  29. 25 Apr, 2016 1 commit
    • skbuff: Add pskb_extract() helper function · 6fa01ccd
      Sowmini Varadhan authored
      
      
      A pattern of skb usage seen in modules such as RDS-TCP is to
      extract `to_copy' bytes from the received TCP segment, starting
      at some offset `off' into a new skb `clone'. This is done in
      the ->data_ready callback, where the clone skb is queued up for rx on
      the PF_RDS socket, while the parent TCP segment is returned unchanged
      back to the TCP engine.
      
      The existing code uses the sequence
      	clone = skb_clone(..);
      	pskb_pull(clone, off, ..);
      	pskb_trim(clone, to_copy, ..);
      with the intention of discarding the first `off' bytes. However,
      skb_clone() + pskb_pull() implies pskb_expand_head(), which ends
      up doing a redundant memcpy of bytes that will then get discarded
      in __pskb_pull_tail().
      
      To avoid this inefficiency, this commit adds pskb_extract() that
      creates the clone, and memcpy's only the relevant header/frag/frag_list
      to the start of `clone'. pskb_trim() is then invoked to trim clone
      down to the requested to_copy bytes.
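
The difference between the two approaches comes down to which bytes get copied. The sketch below models only that aspect on flat buffers: `struct buf_model` and `extract()` are hypothetical, and real skbs with frags and frag_lists are far more involved. It copies exactly the `to_copy` bytes starting at `off` and leaves the source untouched.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical flat-buffer stand-in for an skb's payload. */
struct buf_model {
    unsigned char *data;
    size_t len;
};

/*
 * Copy bytes [off, off + to_copy) of src into a freshly allocated buffer.
 * The leading off bytes are never copied, and src is left unchanged.
 * Returns 0 on success, -1 on a bad range or allocation failure.
 */
int extract(const struct buf_model *src, size_t off, size_t to_copy,
            struct buf_model *out)
{
    assert(src && out);
    if (off > src->len || to_copy > src->len - off)
        return -1;

    out->data = malloc(to_copy ? to_copy : 1);  /* avoid malloc(0) */
    if (!out->data)
        return -1;

    memcpy(out->data, src->data + off, to_copy);
    out->len = to_copy;
    return 0;
}
```

The caller owns the returned buffer and frees it when done, while the source segment can be handed back unchanged.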
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  30. 16 Apr, 2016 1 commit
    • vlan: pull on __vlan_insert_tag error path and fix csum correction · 9241e2df
      Daniel Borkmann authored
      When __vlan_insert_tag() fails from skb_vlan_push() path due to the
      skb_cow_head(), we need to undo the __skb_push() in the error path
      as well that was done earlier to move skb->data pointer to mac header.
      
      Moreover, I noticed that when in the non-error path the __skb_pull()
      is done and the original offset to mac header was non-zero, we fixup
      from a wrong skb->data offset in the checksum complete processing.
      
      So the skb_postpush_rcsum() really needs to be done before __skb_pull()
      where skb->data still points to the mac header start and thus operates
      under the same conditions as in __vlan_insert_tag().
      
      Fixes: 93515d53 ("net: move vlan pop/push functions into common code")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  31. 14 Apr, 2016 1 commit
    • GSO: Support partial segmentation offload · 802ab55a
      Alexander Duyck authored
      
      
      This patch adds support for something I am referring to as GSO partial.
      The basic idea is that we can support a broader range of devices for
      segmentation if we use fixed outer headers and have the hardware only
      really deal with segmenting the inner header.  The idea behind the naming
      is due to the fact that everything before csum_start will be fixed headers,
      and everything after will be the region that is handled by hardware.
      
      With the current implementation it allows us to add support for the
      following GSO types with an inner TSO_MANGLEID or TSO6 offload:
      NETIF_F_GSO_GRE
      NETIF_F_GSO_GRE_CSUM
      NETIF_F_GSO_IPIP
      NETIF_F_GSO_SIT
      NETIF_F_GSO_UDP_TUNNEL
      NETIF_F_GSO_UDP_TUNNEL_CSUM
      
      In the case of hardware that already supports tunneling we may be able to
      extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
      hardware can support updating inner IPv4 headers.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>