  1. Sep 09, 2017
    • bpf: make error reporting in bpf_warn_invalid_xdp_action more clear · 9beb8bed
      Daniel Borkmann authored
      
      Distinguish between an illegal XDP action code and one that is
      merely unsupported by the driver, to provide better feedback when
      we throw a one-time warning here. The reason is that with 814abfab
      ("xdp: add bpf_redirect helper function") not all drivers support
      the new XDP return code yet, and thus they fall into their
      'default' case when checking return codes after program return,
      which then triggers a bpf_warn_invalid_xdp_action() stating that
      the return code is illegal, even though from the XDP perspective
      it is not.
      
      I decided not to place something like an XDP_ACT_MAX define into
      uapi because i) we don't have this for any other program type
      either, ii) future action codes could carry further encoding
      there, which would render such a define unsuitable and we wouldn't
      be able to rip it out again, and iii) we rarely add new action
      codes.
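
      A minimal sketch of what such a warning helper could look like,
      assuming XDP_REDIRECT is the highest action code defined at the
      time:

      void bpf_warn_invalid_xdp_action(u32 act)
      {
              /* Assumption: XDP_REDIRECT is the largest defined return code. */
              const u32 act_max = XDP_REDIRECT;

              WARN_ONCE(1, "%s XDP return value %u, expect packet loss!\n",
                        act > act_max ? "Illegal" : "Driver unsupported",
                        act);
      }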
      
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: rcu lock and preempt disable missing around generic xdp · bbbe211c
      John Fastabend authored
      
      do_xdp_generic() must be called inside an RCU critical section
      with preemption disabled, to ensure that BPF programs are valid
      and that the per-cpu variables used for redirect operations are
      consistent. This patch ensures this is true and fixes the splat
      below.

      The netif_receive_skb_internal() code path is now broken into two
      RCU critical sections. I decided it was better to limit the
      preempt_disable/enable block to just the xdp static key portion;
      the fallout is additional rcu_read_lock/unlock calls, which seems
      like the best option to me.
      
      [  607.596901] =============================
      [  607.596906] WARNING: suspicious RCU usage
      [  607.596912] 4.13.0-rc4+ #570 Not tainted
      [  607.596917] -----------------------------
      [  607.596923] net/core/dev.c:3948 suspicious rcu_dereference_check() usage!
      [  607.596927]
      [  607.596927] other info that might help us debug this:
      [  607.596927]
      [  607.596933]
      [  607.596933] rcu_scheduler_active = 2, debug_locks = 1
      [  607.596938] 2 locks held by pool/14624:
      [  607.596943]  #0:  (rcu_read_lock_bh){......}, at: [<ffffffff95445ffd>] ip_finish_output2+0x14d/0x890
      [  607.596973]  #1:  (rcu_read_lock_bh){......}, at: [<ffffffff953c8e3a>] __dev_queue_xmit+0x14a/0xfd0
      [  607.597000]
      [  607.597000] stack backtrace:
      [  607.597006] CPU: 5 PID: 14624 Comm: pool Not tainted 4.13.0-rc4+ #570
      [  607.597011] Hardware name: Dell Inc. Precision Tower 5810/0HHV7N, BIOS A17 03/01/2017
      [  607.597016] Call Trace:
      [  607.597027]  dump_stack+0x67/0x92
      [  607.597040]  lockdep_rcu_suspicious+0xdd/0x110
      [  607.597054]  do_xdp_generic+0x313/0xa50
      [  607.597068]  ? time_hardirqs_on+0x5b/0x150
      [  607.597076]  ? mark_held_locks+0x6b/0xc0
      [  607.597088]  ? netdev_pick_tx+0x150/0x150
      [  607.597117]  netif_rx_internal+0x205/0x3f0
      [  607.597127]  ? do_xdp_generic+0xa50/0xa50
      [  607.597144]  ? lock_downgrade+0x2b0/0x2b0
      [  607.597158]  ? __lock_is_held+0x93/0x100
      [  607.597187]  netif_rx+0x119/0x190
      [  607.597202]  loopback_xmit+0xfd/0x1b0
      [  607.597214]  dev_hard_start_xmit+0x127/0x4e0
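
      A rough sketch of the locking shape this describes (simplified;
      the static key and helper names follow the description above and
      may differ from the actual code in net/core/dev.c):

      static int netif_receive_skb_internal(struct sk_buff *skb)
      {
              int ret;

              /* First critical section: generic XDP, with preemption
               * disabled only around the static-key-guarded XDP portion. */
              if (static_key_false(&generic_xdp_needed)) {
                      preempt_disable();
                      rcu_read_lock();
                      ret = do_xdp_generic(skb);
                      rcu_read_unlock();
                      preempt_enable();

                      if (ret != XDP_PASS)
                              return NET_RX_DROP;
              }

              /* Second, separate critical section for normal delivery. */
              rcu_read_lock();
              ret = __netif_receive_skb(skb);
              rcu_read_unlock();

              return ret;
      }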
      
      Fixes: d4455169 ("net: xdp: support xdp generic on virtual devices")
      Fixes: b5cdae32 ("net: Generic XDP")
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: don't select potentially stale ri->map from buggy xdp progs · 109980b8
      Daniel Borkmann authored
      
      We can potentially run into a couple of issues with the XDP
      bpf_redirect_map() helper. The ri->map in the per CPU storage
      can become stale in several ways, mostly due to misuse, where
      we can then trigger a use after free on the map:
      
      i) prog A calls bpf_redirect_map(), returns XDP_REDIRECT, and runs
      on a driver that does not support XDP_REDIRECT yet. The ri->map on
      that CPU becomes stale when the XDP program is unloaded from the
      driver and a prog B is loaded on a different driver which does
      support the XDP_REDIRECT return code. prog B would only have to
      omit the call to bpf_redirect_map() and just return XDP_REDIRECT,
      and it would then access the freed map in xdp_do_redirect(), since
      it was not cleared for that CPU.
      
      ii) prog A calls bpf_redirect_map() and returns a code other than
      XDP_REDIRECT. prog A is then detached, which triggers release of
      the map. prog B is attached and, similarly to i), would just
      return XDP_REDIRECT without having called bpf_redirect_map(), and
      would thus access the freed map in xdp_do_redirect(), since it was
      not cleared for that CPU.
      
      iii) prog A is attached to generic XDP, calls the
      bpf_redirect_map() helper and returns XDP_REDIRECT.
      xdp_do_generic_redirect() currently does not handle ri->map (this
      will be fixed by Jesper), so it is not being reset. Later loading
      e.g. a native prog B which would, say, call bpf_xdp_redirect() and
      then return XDP_REDIRECT would find in xdp_do_redirect() that a
      map was set, and using it causes a use after free on map access.
      
      The fix thus needs to avoid accessing stale ri->map pointers. A
      naive way would be to call a BPF function from drivers that just
      resets it to NULL for all XDP return codes but XDP_REDIRECT,
      including XDP_REDIRECT for drivers not supporting it yet (and
      letting ri->map be handled in xdp_do_generic_redirect()). There is
      a less intrusive way that does not require drivers to call a reset
      for each BPF run.
      
      The verifier knows we're calling into the bpf_xdp_redirect_map()
      helper, so it can do a small insn rewrite, transparent to the prog
      itself, in the sense that it fills R4 with a pointer to the prog's
      own bpf_prog. We have that pointer at verification time anyway,
      and R4 is allowed to be used since, per the calling convention, we
      scratch R0 to R5 anyway, so they become inaccessible and the
      program cannot read them prior to a write. The helper then stores
      the prog pointer in the current CPU's struct redirect_info. Later,
      in xdp_do_*_redirect(), we check whether the redirect_info's prog
      pointer is the same as the passed xdp_prog pointer; if that's the
      case then all is good, since the prog holds a ref on the map
      anyway, so it is always valid at that point in time and must have
      a reference count of at least 1. In the unlikely case they are not
      equal, it means we got a stale pointer, so we clear it and bail
      out right there. Also reset the map and the owning prog in
      bpf_xdp_redirect(), so that bpf_xdp_redirect_map() and
      bpf_xdp_redirect() won't get mixed up; only the last call should
      take precedence. A tc bpf_redirect() doesn't use the map anywhere
      yet, so there is no need to clear it there, since it is never
      accessed in that layer.
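
      A rough sketch of the mechanism (field and helper names assumed
      from the description, not necessarily matching the final patch):

      /* Helper side: the verifier fills R4 (map_owner) with the prog pointer. */
      BPF_CALL_4(bpf_xdp_redirect_map, struct bpf_map *, map, u32, ifindex,
                 u64, flags, struct bpf_prog *, map_owner)
      {
              struct redirect_info *ri = this_cpu_ptr(&redirect_info);

              if (unlikely(flags))
                      return XDP_ABORTED;

              ri->ifindex   = ifindex;
              ri->flags     = flags;
              ri->map       = map;
              ri->map_owner = map_owner;

              return XDP_REDIRECT;
      }

      /* Redirect side, in xdp_do_redirect(): bail out if the stored owner
       * is not the prog that just ran, i.e. ri->map is stale. */
      if (ri->map && ri->map_owner != xdp_prog) {
              ri->map = NULL;
              ri->map_owner = NULL;
              return -EINVAL;
      }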
      
      Note that in case the prog is released, and thus the map as well,
      we're still under the RCU read-side critical section at that time
      and have preemption disabled as well. Once we commit with
      __dev_map_insert_ctx() from xdp_do_redirect_map() and set the map
      to ri->map_to_flush, we still wait for an xdp_do_flush_map() to
      finish in devmap dismantle time once the flush_needed bit is set,
      so that is fine.
      
      Fixes: 97f91a7c ("bpf: add bpf_redirect_map helper routine")
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip6_tunnel: fix setting hop_limit value for ipv6 tunnel · 18e1173d
      Haishuang Yan authored
      
      Similar to the vxlan/geneve tunnels, if hop_limit is zero, it
      should fall back to ip6_dst_hoplimit().
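
      A minimal sketch of that fallback (variable names assumed from the
      ip6_tnl_xmit() path):

      /* Use the configured hop limit if non-zero, otherwise derive it
       * from the route, as vxlan/geneve do. */
      ipv6h->hop_limit = t->parms.hop_limit ? : ip6_dst_hoplimit(dst);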
      
      Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip_tunnel: fix setting ttl and tos value in collect_md mode · 0f693f19
      Haishuang Yan authored
      
      The ttl and tos variables are declared and assigned, but are then
      not used in the iptunnel_xmit() call.
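
      A minimal sketch of the intended call, assuming the standard
      iptunnel_xmit() signature and that ttl and tos were already
      derived from the tunnel key:

      iptunnel_xmit(skb->sk, rt, skb, fl4.saddr, fl4.daddr, proto,
                    tos, ttl, df, !net_eq(tunnel->net, dev_net(dev)));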
      
      Fixes: cfc7381b ("ip_tunnel: add collect_md mode to IPIP tunnel")
      Cc: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. Sep 05, 2017
    • xprtrdma: Use xprt_pin_rqst in rpcrdma_reply_handler · 9590d083
      Chuck Lever authored
      
      Adopt the use of xprt_pin_rqst to eliminate contention between
      Call-side users of rb_lock and the use of rb_lock in
      rpcrdma_reply_handler.
      
      This replaces the mechanism introduced in 431af645 ("xprtrdma:
      Fix client lock-up after application signal fires").
      
      Use recv_lock to quickly find the completing rqst, pin it, then
      drop the lock. At that point invalidation and pull-up of the Reply
      XDR can be done. Both are often expensive operations.
      
      Finally, take recv_lock again to signal completion to the RPC
      layer. It also protects adjustment of "cwnd".
      
      This greatly reduces the amount of time a lock is held by the
      reply handler. Comparing lock_stat results shows a marked decrease
      in contention on rb_lock and recv_lock.
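
      A rough sketch of the locking shape described above, using the
      existing xprt_lookup_rqst()/xprt_pin_rqst()/xprt_unpin_rqst()/
      xprt_complete_rqst() helpers (surrounding details omitted):

      spin_lock(&xprt->recv_lock);
      rqst = xprt_lookup_rqst(xprt, xid);
      if (!rqst)
              goto out_norqst;
      xprt_pin_rqst(rqst);
      spin_unlock(&xprt->recv_lock);

      /* Invalidation and pull-up of the Reply XDR happen here, without
       * recv_lock held. */

      spin_lock(&xprt->recv_lock);
      /* The same lock also covers the cwnd adjustment. */
      xprt_complete_rqst(rqst->rq_task, status);
      xprt_unpin_rqst(rqst);
      spin_unlock(&xprt->recv_lock);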
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      [trond.myklebust@primarydata.com: Remove call to rpcrdma_buffer_put() from
         the "out_norqst:" path in rpcrdma_reply_handler.]
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
    • rds: Fix non-atomic operation on shared flag variable · f530f39f
      Håkon Bugge authored
      
      The bits in m_flags in struct rds_message are used for a variety
      of purposes, and from different contexts. To avoid losing updates
      to m_flags, use the atomic set_bit() instead of the non-atomic
      equivalent.
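
      A minimal before/after sketch (the flag name is illustrative):

      /* Before: non-atomic read-modify-write on a word that other
       * contexts update concurrently. */
      __set_bit(RDS_MSG_MAPPED, &rm->m_flags);

      /* After: atomic bitop, so concurrent updates to other bits in
       * m_flags are not lost. */
      set_bit(RDS_MSG_MAPPED, &rm->m_flags);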
      
      Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
      Reviewed-by: Knut Omang <knut.omang@oracle.com>
      Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
      Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: don't use GFP_KERNEL under spin lock · 2c8468dc
      Jakub Kicinski authored
      
      The new TC IDR code uses GFP_KERNEL under a spin lock, which
      leads to:
      
      [  582.621091] BUG: sleeping function called from invalid context at ../mm/slab.h:416
      [  582.629721] in_atomic(): 1, irqs_disabled(): 0, pid: 3379, name: tc
      [  582.636939] 2 locks held by tc/3379:
      [  582.641049]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff910354ce>] rtnetlink_rcv_msg+0x92e/0x1400
      [  582.650958]  #1:  (&(&tn->idrinfo->lock)->rlock){+.-.+.}, at: [<ffffffff9110a5e0>] tcf_idr_create+0x2f0/0x8e0
      [  582.662217] Preemption disabled at:
      [  582.662222] [<ffffffff9110a5e0>] tcf_idr_create+0x2f0/0x8e0
      [  582.672592] CPU: 9 PID: 3379 Comm: tc Tainted: G        W       4.13.0-rc7-debug-00648-g43503a79b9f0 #287
      [  582.683432] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016
      [  582.691937] Call Trace:
      ...
      [  582.742460]  kmem_cache_alloc+0x286/0x540
      [  582.747055]  radix_tree_node_alloc.constprop.6+0x4a/0x450
      [  582.753209]  idr_get_free_cmn+0x627/0xf80
      ...
      [  582.815525]  idr_alloc_cmn+0x1a8/0x270
      ...
      [  582.833804]  tcf_idr_create+0x31b/0x8e0
      ...
      
      Try to preallocate the memory with idr_preload(GFP_KERNEL) (as
      suggested by Eric Dumazet), and change the allocation flags used
      under the spin lock.
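
      A minimal sketch of the resulting pattern, assuming the usual
      idr_preload()/idr_alloc() pairing (lock and field names
      illustrative):

      idr_preload(GFP_KERNEL);        /* may sleep, taken outside the lock */
      spin_lock_bh(&idrinfo->lock);
      err = idr_alloc(&idrinfo->action_idr, NULL, index, index + 1,
                      GFP_ATOMIC);    /* non-sleeping flags under the lock */
      spin_unlock_bh(&idrinfo->lock);
      idr_preload_end();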
      
      Fixes: 65a206c0 ("net/sched: Change act_api and act_xxx modules to use IDR")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Simon Horman <simon.horman@netronome.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rxrpc: Make service connection lookup always check for retry · fdade4f6
      David Howells authored
      
      When an RxRPC service packet comes in, the target connection is
      looked up by an rb-tree search under RCU and a read-locked
      seqlock; the seqlock retry check is, however, currently skipped if
      we got a match, but probably shouldn't be, in case the connection
      we found gets replaced whilst we're doing the search.
      
      Make the lookup procedure always go through need_seqretry(), even if the
      lookup was successful.  This makes sure we always pick up on a write-lock
      event.
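
      A minimal sketch of the read-side pattern (lock and walk helper
      names assumed): always feed the result through need_seqretry(),
      even when the rb-tree walk found a match:

      struct rxrpc_connection *conn;
      int seq = 0;

      do {
              read_seqbegin_or_lock(&peer->service_conn_lock, &seq);
              conn = walk_service_conn_tree(peer, cid); /* hypothetical helper */
      } while (need_seqretry(&peer->service_conn_lock, seq));
      done_seqretry(&peer->service_conn_lock, seq);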
      
      On the other hand, since we don't take a ref on the object, but rely on RCU
      to prevent its destruction after dropping the seqlock, I'm not sure this is
      necessary.
      
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • svcrdma: Estimate Send Queue depth properly · 26fb2254
      Chuck Lever authored
      
      The rdma_rw API adjusts max_send_wr upwards during the
      rdma_create_qp() call. If the ULP actually wants to take advantage
      of these extra resources, it must increase the size of its send
      completion queue (created before rdma_create_qp is called) and
      increase its send queue accounting limit.
      
      Use the new rdma_rw_mr_factor API to figure out the correct value
      to use for the Send Queue and Send Completion Queue depths.
      
      And, ensure that the chosen Send Queue depth for a newly created
      transport does not overrun the QP WR limit of the underlying device.
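
      A rough sketch of the sizing, using rdma_rw_mr_factor() and
      clamping against the device's max_qp_wr limit (variable names
      assumed):

      ctxts = rdma_rw_mr_factor(dev, newxprt->sc_port_num, RPCSVC_MAXPAGES);
      ctxts *= newxprt->sc_max_requests;
      sq_depth = newxprt->sc_max_requests + ctxts;
      if (sq_depth > dev->attrs.max_qp_wr)
              sq_depth = dev->attrs.max_qp_wr;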
      
      Lastly, there's no longer a need to carry the Send Queue depth in
      struct svcxprt_rdma, since the value is used only in the
      svc_rdma_accept() path.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Limit RQ depth · 5a25bfd2
      Chuck Lever authored
      
      Ensure that the chosen Receive Queue depth for a newly created
      transport does not overrun the QP WR limit of the underlying device.
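
      A minimal sketch of that clamp (variable names assumed):

      rq_depth = newxprt->sc_max_requests + newxprt->sc_max_bc_requests;
      if (rq_depth > dev->attrs.max_qp_wr)
              rq_depth = dev->attrs.max_qp_wr;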
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Populate tail iovec when receiving · 193bcb7b
      Chuck Lever authored
      
      So that NFS WRITE payloads can eventually be placed directly into a
      file's page cache, enable the RPC-over-RDMA transport to present
      these payloads in the xdr_buf's page list, while placing trailing
      content (such as a GETATTR operation) in the xdr_buf's tail.
      
      After this change, the RPC-over-RDMA "copy tail" hack, added by
      commit a97c331f ("svcrdma: Handle additional inline content"),
      is no longer needed and can be removed.
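
      A rough sketch of the receive-side xdr_buf layout this enables
      (field names from struct xdr_buf and struct svc_rqst; lengths
      illustrative):

      /* WRITE payload pages land in the page list ... */
      rqstp->rq_arg.page_len = payload_len;

      /* ... while trailing inline content (e.g. an appended GETATTR) is
       * presented in the tail iovec instead of being copied. */
      rqstp->rq_arg.tail[0].iov_base = trailing_base;
      rqstp->rq_arg.tail[0].iov_len  = trailing_len;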
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>