1. 09 May, 2016 19 commits
  2. 07 May, 2016 1 commit
  3. 06 May, 2016 20 commits
    • mlxsw: spectrum: Fix ordering in mlxsw_sp_fini · 5113bfdb
      Jiri Pirko authored
      Fixes: 0f433fa0 ("mlxsw: spectrum_buffers: Implement shared buffer configuration")
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • macvtap: add namespace support to the sysfs device class · 17af2bce
      Marc Angel authored
      When creating macvtaps that are expected to have the same ifindex
      in different network namespaces, only the first one will succeed.
      The others will fail with a sysfs_warn_dup warning due to them trying
      to create the following sysfs link (with 'NN' the ifindex of macvtapX):
      
      /sys/class/macvtap/tapNN -> /sys/devices/virtual/net/macvtapX/tapNN
      
      This is reproducible by running the following commands:
      
      ip netns add ns1
      ip netns add ns2
      ip link add veth0 type veth peer name veth1
      ip link set veth0 netns ns1
      ip link set veth1 netns ns2
      ip netns exec ns1 ip l add link veth0 macvtap0 type macvtap
      ip netns exec ns2 ip l add link veth1 macvtap1 type macvtap
      
      The last command will fail with "RTNETLINK answers: File exists" (along
      with the kernel warning) but retrying it will work because the ifindex
      was incremented.
      
      The 'net' device class is isolated between network namespaces so each
      one has its own hierarchy of net devices.
      This isn't the case for the 'macvtap' device class.
      The problem occurs half-way through the netdev registration, when
      `macvtap_device_event` is called back to create the 'tapNN' macvtap
      class device under the 'macvtapX' net class device.
      
      This patch adds namespace support to the 'macvtap' device class so
      that /sys/class/macvtap is no longer shared between net namespaces.
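
The collision described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `class_dir_sk`, `class_add_sk` and the fixed-size table are hypothetical names invented for illustration and are not the driver-core API. A shared class directory rejects a second 'tapNN' with -EEXIST, while a namespace-aware one keys entries by (namespace, name).

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_ENTRIES_SK 16

/* Hypothetical model of one entry under /sys/class/<class>/ */
struct class_entry_sk {
    int  ns;            /* owning network namespace (illustrative id) */
    char name[16];      /* e.g. "tap5" */
};

/* Hypothetical model of a device class directory */
struct class_dir_sk {
    bool   ns_aware;    /* false: one directory shared by all namespaces */
    size_t n;
    struct class_entry_sk e[MAX_ENTRIES_SK];
};

/* Returns 0 on success, -EEXIST on a duplicate name, -ENOSPC when full. */
int class_add_sk(struct class_dir_sk *d, int ns, const char *name)
{
    for (size_t i = 0; i < d->n; i++) {
        /* Without namespace support every entry is visible to everyone. */
        bool same_ns = !d->ns_aware || d->e[i].ns == ns;

        if (same_ns && strcmp(d->e[i].name, name) == 0)
            return -EEXIST;
    }
    if (d->n == MAX_ENTRIES_SK)
        return -ENOSPC;

    d->e[d->n].ns = ns;
    strncpy(d->e[d->n].name, name, sizeof(d->e[d->n].name) - 1);
    d->e[d->n].name[sizeof(d->e[d->n].name) - 1] = '\0';
    d->n++;
    return 0;
}
```

With `ns_aware` cleared, adding "tap5" from a second namespace fails the same way the reproducer reports "File exists"; with it set, both additions succeed.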
      
      However, making the macvtap sysfs class namespace-aware has the side
      effect of changing /sys/devices/virtual/net/macvtapX/tapNN into
      /sys/devices/virtual/net/macvtapX/macvtap/tapNN.

      This is due to commit 24b1442d ("Driver-core: Always create class
      directories for classses that support namespaces") and the fact that
      class devices supporting namespaces are really not supposed to be placed
      directly under other class devices.
      
      To avoid breaking userland, a tapNN symlink pointing to macvtap/tapNN is
      created inside the macvtapX directory.
      Signed-off-by: Marc Angel <marc@arista.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: tcp: ip_send_unicast_reply() is not BH safe · 47dcc20a
      Eric Dumazet authored
      I forgot that ip_send_unicast_reply() is not BH safe (yet).
      
      Disabling preemption before calling it was not a good move.
      
      Fixes: c10d9310 ("tcp: do not assume TCP code is non preemptible")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Andres Lagar-Cavilla <andreslc@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bpf-direct-pkt-access' · 4b307a8e
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      bpf: introduce direct packet access
      
      This set of patches introduces 'direct packet access' from
      cls_bpf and act_bpf programs (which are root only).

      Current bpf programs use LD_ABS and LD_IND instructions, which have
      to do 'if (off < skb_headlen)' for every packet access.
      That is fine for socket filters, but too slow for XDP, since a single
      LD_ABS insn consumes 3% of CPU. Therefore we have to amortize the cost
      of the length check over multiple packet accesses via direct access
      to the skb->data and data_end pointers.
      
      Existing packet parsers typically look like:
        if (load_half(skb, offsetof(struct ethhdr, h_proto)) != ETH_P_IP)
           return 0;
        if (load_byte(skb, ETH_HLEN + offsetof(struct iphdr, protocol)) != IPPROTO_UDP ||
            load_byte(skb, ETH_HLEN) != 0x45)
           return 0;
        ...
      With 'direct packet access' the bpf program becomes:
         void *data = (void *)(long)skb->data;
         void *data_end = (void *)(long)skb->data_end;
         struct eth_hdr *eth = data;
         struct iphdr *iph = data + sizeof(*eth);
         struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph);
      
         if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end)
            return 0;
         if (eth->h_proto != htons(ETH_P_IP))
            return 0;
         if (iph->protocol != IPPROTO_UDP || iph->ihl != 5)
            return 0;
         ...
      which is more natural to write and significantly faster.
      See patch 6 for performance tests:
      21Mpps(old) vs 24Mpps(new) with just 5 loads.
      For more complex parsers the performance gain is higher.
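
The single-length-check pattern can be tried outside the kernel. The following is a minimal userspace sketch, not the sample from the patch set: `is_plain_udp_sk` and the `*_SK` constants are hypothetical names, and the header offsets are hard-coded on the assumption of an untagged Ethernet frame carrying IPv4.

```c
#include <stddef.h>
#include <stdint.h>

enum { ETH_HLEN_SK = 14, IP_HLEN_SK = 20, UDP_HLEN_SK = 8 };
enum { ETH_P_IP_SK = 0x0800, IPPROTO_UDP_SK = 17 };

/* Returns 1 for IPv4/UDP with a 20-byte IP header, 0 otherwise. */
int is_plain_udp_sk(const uint8_t *data, const uint8_t *data_end)
{
    /*
     * One length check covers every load below. In userspace C the
     * comparison is written on the length rather than on data + N,
     * so that no out-of-bounds pointer is ever formed.
     */
    if ((size_t)(data_end - data) < ETH_HLEN_SK + IP_HLEN_SK + UDP_HLEN_SK)
        return 0;

    /* EtherType sits at bytes 12..13 in network byte order. */
    if (((data[12] << 8) | data[13]) != ETH_P_IP_SK)
        return 0;

    /* First IP byte: version 4 in the high nibble, IHL 5 in the low. */
    if (data[ETH_HLEN_SK] != 0x45)
        return 0;

    /* The protocol field is byte 9 of the IP header. */
    if (data[ETH_HLEN_SK + 9] != IPPROTO_UDP_SK)
        return 0;

    return 1;
}
```

After the initial check, each field read is a plain load with no further branching on the packet length, which is the property the verifier has to prove for the real bpf program.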
      
      The other approach implemented in [1] was adding two new instructions
      to interpreter and JITs and was too hard to use from llvm side.
      The approach presented here doesn't need any instruction changes,
      but the verifier has to work harder to check safety of the packet access.
      
      Patch 1 prepares the code and Patch 2 adds new checks for direct
      packet access and all of them are gated with 'env->allow_ptr_leaks'
      which is true for root only.
      Patch 3 improves search pruning for large programs.
      Patch 4 wires in verifier's changes with net/core/filter side.
      Patch 5 updates docs.
      Patches 6 and 7 add tests.
      
      [1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dw
      
      
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples/bpf: add verifier tests · 883e44e4
      Alexei Starovoitov authored
      
      
      Add a few tests for the "pointer to packet" logic of the verifier.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • samples/bpf: add 'pointer to packet' tests · 65d472fb
      Alexei Starovoitov authored
      
      
      parse_simple.c - packet parser example with a single length check that
      filters out udp packets for port 9

      parse_varlen.c - variable length parser that understands multiple vlan headers,
      ipip, ipip6 and ip options to filter out udp or tcp packets on port 9.
      The packet is parsed layer by layer with multiple length checks.
      
      parse_ldabs.c - classic style of packet parsing using LD_ABS instruction.
      Same functionality as parse_simple.
      
      simple = 24.1Mpps per core
      varlen = 22.7Mpps
      ldabs  = 21.4Mpps
      
      The parser with LD_ABS instructions is slower than the full direct access
      parser, even though the latter does more packet accesses and checks.
      
      These examples demonstrate the choice bpf program authors can make between
      flexibility of the parser vs speed.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: add documentation for 'direct packet access' · f9c8d19d
      Alexei Starovoitov authored
      
      
      explain how verifier checks safety of packet access
      and update email addresses.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: wire in data and data_end for cls_act_bpf · db58ba45
      Alexei Starovoitov authored
      
      
      Allow cls_bpf and act_bpf programs to access the skb->data and skb->data_end pointers.
      The bpf helpers that change skb->data need to update the data_end pointer as well.
      The verifier checks that programs always reload the data and data_end pointers
      after calls to such bpf helpers.
      We cannot add a 'data_end' pointer to struct qdisc_skb_cb directly,
      since it's embedded as-is by infiniband ipoib, so a wrapper struct is needed.
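
The wrapper-struct idea can be sketched in plain C. Everything below is an assumption made for illustration: the `*_sketch` names, the field layout and the 48-byte control-block budget are modelled loosely on the description above and are not copied from the kernel headers. The point is that a second user embedding the base struct already fills the budget, so a new pointer has to live in a separate wrapper rather than in the base struct itself.

```c
#include <stddef.h>

#define CB_BYTES_SKETCH 48   /* assumed size of the per-packet scratch area */

/* Stand-in for the shared base struct that other code embeds as-is. */
struct qdisc_cb_sketch {
    unsigned int   pkt_len;
    unsigned short queue_mapping;
    unsigned short classid;
    unsigned char  data[20];
};

/* Stand-in for an unrelated user (ipoib-like) that embeds the base struct. */
struct ipoib_like_cb_sketch {
    struct qdisc_cb_sketch qdisc_cb;
    unsigned char hwaddr[20];
};

/* Wrapper used only by the bpf path: base struct first, extra pointer after. */
struct bpf_cb_sketch {
    struct qdisc_cb_sketch qdisc_cb;
    void *data_end;
};

/* Both users must still fit in the scratch area. */
_Static_assert(sizeof(struct ipoib_like_cb_sketch) <= CB_BYTES_SKETCH,
               "ipoib-like cb must fit");
_Static_assert(sizeof(struct bpf_cb_sketch) <= CB_BYTES_SKETCH,
               "bpf wrapper cb must fit");

/* Recompute data_end from the start of the linear data and its length. */
void compute_data_end_sketch(struct bpf_cb_sketch *cb, void *data, size_t headlen)
{
    cb->data_end = (unsigned char *)data + headlen;
}
```

Any helper that moves or resizes the linear data would call something like `compute_data_end_sketch()` again, which is why programs must reload both pointers afterwards.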
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: improve verifier state equivalence · 735b4333
      Alexei Starovoitov authored
      
      
      since UNKNOWN_VALUE type is weaker than CONST_IMM we can un-teach
      verifier its recognition of constants in conditional branches
      without affecting safety.
      Ex:
      if (reg == 123) {
        .. here verifier was marking reg->type as CONST_IMM
           instead keep reg as UNKNOWN_VALUE
      }
      
      Two verifier states with UNKNOWN_VALUE are equivalent, whereas
      CONST_IMM_X != CONST_IMM_Y, since CONST_IMM is used for stack range
      verification and other cases.
      So help search pruning by marking registers as UNKNOWN_VALUE
      where possible instead of CONST_IMM.
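
Why this helps pruning can be shown with a deliberately simplified model. The sketch below is not the verifier's code: `reg_state_sketch`, `reg_equiv_sketch` and `after_eq_branch_sketch` are hypothetical, and the real state comparison covers many more register types.

```c
#include <stdbool.h>

enum reg_type_sketch { UNKNOWN_VALUE_SK, CONST_IMM_SK };

struct reg_state_sketch {
    enum reg_type_sketch type;
    long imm;                   /* meaningful only for CONST_IMM_SK */
};

/* Two register states match if the types agree and, for constants,
 * the values agree as well. */
bool reg_equiv_sketch(const struct reg_state_sketch *a,
                      const struct reg_state_sketch *b)
{
    if (a->type != b->type)
        return false;
    if (a->type == CONST_IMM_SK)
        return a->imm == b->imm;
    return true;
}

/* State of `reg` inside the taken branch of `if (reg == val)`.
 * narrow == true models the old behaviour (mark as CONST_IMM),
 * narrow == false models the new one (leave as UNKNOWN_VALUE). */
struct reg_state_sketch after_eq_branch_sketch(struct reg_state_sketch reg,
                                               long val, bool narrow)
{
    if (narrow && reg.type == UNKNOWN_VALUE_SK) {
        reg.type = CONST_IMM_SK;
        reg.imm = val;
    }
    return reg;
}
```

Two paths that compare the same unknown register against 123 and 456 end up in different states under the old behaviour and in equivalent states under the new one, so the second path can be pruned.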
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: direct packet access · 969bf05e
      Alexei Starovoitov authored
      Extended BPF carried over two instructions from classic to access
      packet data: LD_ABS and LD_IND. They're highly optimized in JITs,
      but due to their design they have to do length check for every access.
      When BPF is processing 20M packets per second, a single LD_ABS after JIT
      consumes 3% of CPU. Hence the need to optimize it further by amortizing
      the cost of 'off < skb_headlen' over multiple packet accesses.
      One option is to introduce two new eBPF instructions LD_ABS_DW and LD_IND_DW
      with similar usage as skb_header_pointer().
      The kernel part for interpreter and x64 JIT was implemented in [1], but such
      new insns behave like old ld_abs and abort the program with 'return 0' if
      access is beyond linear data. Such hidden control flow is hard to work around,
      plus changing JITs and rolling out new llvm is inconvenient.
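
To make the amortization argument concrete, here is a small userspace model that counts bounds checks. It is only a sketch: `pkt_view_sk` and the two summing helpers are hypothetical, and the checked load returns -1 where the real instructions would abort the program.

```c
#include <stddef.h>
#include <stdint.h>

struct pkt_view_sk {
    const uint8_t *data;
    size_t         len;
    unsigned       checks;   /* bounds checks performed so far */
};

/* LD_ABS-like load: every call validates its own offset. */
int load_byte_checked_sk(struct pkt_view_sk *p, size_t off)
{
    p->checks++;
    if (off >= p->len)
        return -1;
    return p->data[off];
}

/* Sums the first n bytes using one checked load per byte. */
int sum_checked_sk(struct pkt_view_sk *p, size_t n, unsigned *sum)
{
    *sum = 0;
    for (size_t i = 0; i < n; i++) {
        int b = load_byte_checked_sk(p, i);

        if (b < 0)
            return -1;
        *sum += (unsigned)b;
    }
    return 0;
}

/* Sums the first n bytes after a single up-front length check. */
int sum_amortized_sk(struct pkt_view_sk *p, size_t n, unsigned *sum)
{
    p->checks++;
    if (n > p->len)
        return -1;

    *sum = 0;
    for (size_t i = 0; i < n; i++)
        *sum += p->data[i];
    return 0;
}
```

For five loads the first variant performs five checks and the second performs one, which is the cost the direct-access approach removes.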
      
      Therefore allow cls_bpf/act_bpf programs to access skb->data directly:
      int bpf_prog(struct __sk_buff *skb)
      {
        struct iphdr *ip;
      
        if (skb->data + sizeof(struct iphdr) + ETH_HLEN > skb->data_end)
            /* packet too small */
            return 0;
      
        ip = skb->data + ETH_HLEN;
      
        /* access IP header fields with direct loads */
        if (ip->version != 4 || ip->saddr == 0x7f000001)
            return 1;
        [...]
      }
      
      This solution avoids introduction of new instructions. llvm stays
      the same and all JITs stay the same, but verifier has to work extra hard
      to prove safety of the above program.
      
      For XDP the direct store instructions can be allowed as well.
      
      The skb->data is NET_IP_ALIGNED, so for common cases the verifier can check
      the alignment. Complex packet parsers where the packet pointer is adjusted
      incrementally cannot be tracked for alignment, so allow byte access in such
      cases, and misaligned access on architectures that define efficient_unaligned_access.
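
The alignment rule in the paragraph above can be written down as a small decision function. This is a sketch under assumptions, not the verifier's implementation: the function name and parameters are hypothetical, and `NET_IP_ALIGN_SK` is set to 2, a common default that some architectures override.

```c
#include <stdbool.h>

#define NET_IP_ALIGN_SK 2   /* assumption: skb->data sits 2 bytes past a 4-byte boundary */

/*
 * known_off:           the pointer's offset from skb->data is a tracked constant
 * reg_off, insn_off:   that offset plus the displacement used by the load
 * size:                access width in bytes (1, 2, 4 or 8)
 * efficient_unaligned: the architecture handles unaligned loads cheaply
 *
 * Returns 0 if the access is allowed, -1 otherwise.
 */
int check_pkt_alignment_sk(bool known_off, int reg_off, int insn_off,
                           int size, bool efficient_unaligned)
{
    if (size == 1)
        return 0;               /* byte access is always aligned */
    if (efficient_unaligned)
        return 0;               /* the architecture tolerates misalignment */
    if (!known_off)
        return -1;              /* incrementally adjusted pointer: bytes only */
    if ((NET_IP_ALIGN_SK + reg_off + insn_off) % size != 0)
        return -1;
    return 0;
}
```

With these assumptions a 4-byte load at offset 14 (the start of the IP header after a 14-byte Ethernet header) is accepted, because 2 + 14 is a multiple of 4, while a 4-byte load at offset 0 is rejected.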
      
      [1] https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/?h=ld_abs_dw
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: cleanup verifier code · 1a0dc1ac
      Alexei Starovoitov authored
      
      
      cleanup verifier code and prepare it for addition of "pointer to packet" logic
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 95aef7ce
      David S. Miller authored
      
      
      Jeff Kirsher says:
      
      ====================
      40GbE Intel Wired LAN Driver Updates 2016-05-05
      
      This series contains updates to i40e and i40evf.
      
      The theme behind this series is code reduction, yeah!  Jesse provides
      most of the changes starting with a refactor of the interpretation of
      a tunnel which lets us start using the hardware's parsing.  Removed
      the packet split receive routine and ancillary code in preparation
      for the Rx-refactor.  The refactor of the receive routine
      aligns it with the one in ixgbe, which was highly
      optimized.  The hardware supports a 16 byte descriptor for receive,
      but the driver was never using it in production.  There was no performance
      benefit to the real driver of 16 byte descriptors, so drop a whole lot
      of complexity while getting rid of the code.  Fixed a bug where while
      changing the number of descriptors using ethtool, the driver did not
      test the limits of the system memory before permanently assuming it
      would be able to get receive buffer memory.
      
      Mitch fixes a memory leak of one page each time the driver is opened by
      allocating the correct number of receive buffers and not fiddling with
      next_to_use in the VF driver.

      Arnd Bergmann fixed an indentation issue by adding the appropriate
      curly braces in i40e_vc_config_promiscuous_mode_msg().
      
      Julia Lawall fixed an issue found by Coccinelle, where i40e_client_ops
      structure can be const since it is never modified.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Bluetooth: Add support for Intel Bluetooth device 8265 [8087:0a2b] · a0af53b5
      Tedd Ho-Jeong An authored
      
      
      This patch adds support for Intel Bluetooth device 8265 also known
      as Windstorm Peak (WsP).
      
      T:  Bus=01 Lev=01 Prnt=01 Port=01 Cnt=02 Dev#=  6 Spd=12   MxCh= 0
      D:  Ver= 2.00 Cls=e0(wlcon) Sub=01 Prot=01 MxPS=64 #Cfgs=  1
      P:  Vendor=8087 ProdID=0a2b Rev= 0.10
      C:* #Ifs= 2 Cfg#= 1 Atr=e0 MxPwr=100mA
      I:* If#= 0 Alt= 0 #EPs= 3 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=81(I) Atr=03(Int.) MxPS=  64 Ivl=1ms
      E:  Ad=02(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms
      E:  Ad=82(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
      I:* If#= 1 Alt= 0 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=   0 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=   0 Ivl=1ms
      I:  If#= 1 Alt= 1 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=   9 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=   9 Ivl=1ms
      I:  If#= 1 Alt= 2 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=  17 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=  17 Ivl=1ms
      I:  If#= 1 Alt= 3 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=  25 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=  25 Ivl=1ms
      I:  If#= 1 Alt= 4 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=  33 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=  33 Ivl=1ms
      I:  If#= 1 Alt= 5 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
      E:  Ad=03(O) Atr=01(Isoc) MxPS=  49 Ivl=1ms
      E:  Ad=83(I) Atr=01(Isoc) MxPS=  49 Ivl=1ms
      Signed-off-by: Tedd Ho-Jeong An <tedd.an@intel.com>
      Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
    • net: vrf: Create FIB tables on link create · b3b4663c
      David Ahern authored
      
      
      Tables have to exist for VRFs to function. Ensure they exist
      when a VRF device is created.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cnic: call cp->stop_hw() in cnic_start_hw() on allocation failure · f37bd0cc
      Jon Maxwell authored
      We recently had a system crash in the cnic module. Vmcore analysis confirmed
      that "ip link up" was executed which failed due to an allocation failure
      because of memory fragmentation. Further analysis revealed that the cnic irq
      vector was still allocated after the "ip link up" that failed. When
      "ip link down" was executed it called free_msi_irqs() which crashed the system
      because the cnic irq was still in use.
      
      PANIC: "kernel BUG at drivers/pci/msi.c:411!"
      
      The code execution was:
      
      cnic_netdev_event()
      if (event == NETDEV_UP) {
      .
      .
              if (!cnic_start_hw(dev))
      cnic_start_hw()
      calls cnic_cm_open() which failed with -ENOMEM
      cnic_start_hw() then took the err1 path:
      
      err1:
             cp->free_resc(dev); <---- frees resources but not irq vector
             pci_dev_put(dev->pcidev);
             return err;
      }
      
      
      
      This returns control back to cnic_netdev_event() but now the cnic irq vector
      is still allocated even though cnic_cm_open() failed. The next
      "ip link down" will trigger the crash.
      
      The cnic_start_hw() routine is not handling the allocation failure correctly.
      Fix this by checking whether CNIC_DRV_STATE_HANDLES_IRQ flag is set indicating
      that the hardware has been started in cnic_start_hw(). If it has then call
      cp->stop_hw() which frees the cnic irq vector and cnic resources. Otherwise
      just maintain the previous behaviour and free cnic resources.
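
The shape of the fixed error path can be modelled in userspace. The sketch below is not the cnic driver: `cnic_sketch`, the `*_sketch` helpers and the `fail_cm_open` switch are hypothetical stand-ins used to show the flag check on the error label.

```c
#include <errno.h>
#include <stdbool.h>

#define DRV_STATE_HANDLES_IRQ_SKETCH 0x1u

struct cnic_sketch {
    unsigned flags;
    bool     irq_allocated;
    bool     resc_allocated;
};

/* Frees resources only; leaves the irq vector alone. */
static void free_resc_sketch(struct cnic_sketch *cp)
{
    cp->resc_allocated = false;
}

/* Full teardown: releases the irq vector and then the resources. */
static void stop_hw_sketch(struct cnic_sketch *cp)
{
    cp->irq_allocated = false;
    cp->flags &= ~DRV_STATE_HANDLES_IRQ_SKETCH;
    free_resc_sketch(cp);
}

/* fail_cm_open simulates the late allocation failure from the report. */
int start_hw_sketch(struct cnic_sketch *cp, bool fail_cm_open)
{
    int err = 0;

    cp->resc_allocated = true;
    cp->irq_allocated = true;
    cp->flags |= DRV_STATE_HANDLES_IRQ_SKETCH;   /* hardware started */

    if (fail_cm_open) {
        err = -ENOMEM;
        goto err1;
    }
    return 0;

err1:
    /* If the hardware was started, undo that too, not just the resources. */
    if (cp->flags & DRV_STATE_HANDLES_IRQ_SKETCH)
        stop_hw_sketch(cp);
    else
        free_resc_sketch(cp);
    return err;
}
```

After a simulated failure both the irq vector and the resources are released, so a later "link down" has nothing stale to free.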
      
      I reproduced this by injecting an ENOMEM error into cnic_cm_alloc_mem()'s return
      code.

      # ip link set dev enpX down
      # ip link set dev enpX up <--- hits allocation failure
      # ip link set dev enpX down <--- crashes here
      
      With this patch I confirmed there was no crash in the reproducer.
      Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • i40e: constify i40e_client_ops structure · 3949c4ac
      Julia Lawall authored
      
      
      The i40e_client_ops structure is never modified, so declare it as const.
      
      Done with the help of Coccinelle.
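
As a generic illustration of the pattern (not the i40e code itself), an operations table that is only ever read can be declared const so that it lands in read-only data. All names below are hypothetical.

```c
/* Hypothetical callback table, standing in for a driver's ops structure. */
struct client_ops_sk {
    int (*open)(int id);
    int (*close)(int id);
};

static int demo_open_sk(int id)  { return id + 1; }
static int demo_close_sk(int id) { return id - 1; }

/* Never modified after initialisation, so it is declared const. */
static const struct client_ops_sk demo_ops_sk = {
    .open  = demo_open_sk,
    .close = demo_close_sk,
};

/* Users take a pointer-to-const and can only call through it. */
int client_open_sk(const struct client_ops_sk *ops, int id)
{
    return ops->open(id);
}

const struct client_ops_sk *demo_client_ops_sk(void)
{
    return &demo_ops_sk;
}
```

Any accidental write through such a pointer becomes a compile-time error.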
      Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
      Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: fix misleading indentation · ce927db4
      Arnd Bergmann authored
      
      
      Newly added code in i40e_vc_config_promiscuous_mode_msg() is indented
      in a way that gcc rightly complains about:
      
      drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c: In function 'i40e_vc_config_promiscuous_mode_msg':
      drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:1543:4: error: this 'if' clause does not guard... [-Werror=misleading-indentation]
          if (f->vlan >= 0 && f->vlan <= I40E_MAX_VLANID)
          ^~
      drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c:1550:5: note: ...this statement, but the latter is misleadingly indented as if it is guarded by the 'if'
           aq_err = pf->hw.aq.asq_last_status;
      
      From the context, it looks like the aq_err assignment was meant to be
      inside of the conditional expression, so I'm adding the appropriate
      curly braces now.
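
A reduced example of the same class of bug, with the braces in place, might look as follows. It is a sketch with hypothetical names (`aq_sketch`, `config_vlan_sketch`) and does not reproduce the driver function; the increment of `*calls` stands in for the admin-queue call.

```c
struct aq_sketch {
    int last_status;    /* status left behind by the most recent command */
};

/*
 * Returns the admin-queue status for a valid VLAN id and 0 otherwise.
 * Without the braces, the status read would run unconditionally and
 * could return a stale value for an out-of-range VLAN.
 */
int config_vlan_sketch(int vlan, int max_vlan, struct aq_sketch *aq, int *calls)
{
    int aq_err = 0;

    if (vlan >= 0 && vlan <= max_vlan) {
        (*calls)++;                     /* stands in for issuing the command */
        aq_err = aq->last_status;       /* only meaningful after a command */
    }
    return aq_err;
}
```

With the braces, an invalid VLAN leaves `aq_err` at 0 even when `last_status` still holds an earlier error.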
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Fixes: 5676a8b9 ("i40e: Add VF promiscuous mode driver support")
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e: Test memory before ethtool alloc succeeds · 147e81ec
      Jesse Brandeburg authored
      
      
      When testing on systems with very limited amounts of RAM, a bug was
      found where, while changing the number of descriptors using ethtool,
      the driver didn't test the limits of system memory before permanently
      assuming it would be able to get receive buffer memory.
      
      Work around this issue by pre-allocation of the receive buffer
      memory, in the "ghost" ring, which is then used during reinit
      using the new ring length.
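
The pre-allocation idea can be sketched in userspace: build the replacement ring first and touch the live ring only once every buffer has been obtained. The names below (`ring_sk`, `ring_resize_sk`, the injectable allocator) are hypothetical and the code does not mirror the driver's data structures.

```c
#include <errno.h>
#include <stdlib.h>

typedef void *(*alloc_fn_sk)(size_t);

struct ring_sk {
    void  **bufs;
    size_t  count;
};

/* Allocator that always fails, used to simulate memory pressure. */
void *failing_alloc_sk(size_t n)
{
    (void)n;
    return NULL;
}

static void ring_free_sk(struct ring_sk *r)
{
    for (size_t i = 0; i < r->count; i++)
        free(r->bufs[i]);
    free(r->bufs);
    r->bufs = NULL;
    r->count = 0;
}

static int ring_alloc_sk(struct ring_sk *r, size_t count, size_t buf_size,
                         alloc_fn_sk alloc)
{
    r->count = 0;
    r->bufs = calloc(count, sizeof(*r->bufs));
    if (!r->bufs)
        return -ENOMEM;

    for (size_t i = 0; i < count; i++) {
        r->bufs[i] = alloc(buf_size);
        if (!r->bufs[i]) {
            ring_free_sk(r);    /* frees only what was allocated so far */
            return -ENOMEM;
        }
        r->count++;
    }
    return 0;
}

/* Resize by filling a "ghost" ring first; the live ring survives failure. */
int ring_resize_sk(struct ring_sk *live, size_t new_count, size_t buf_size,
                   alloc_fn_sk alloc)
{
    struct ring_sk ghost;
    int err = ring_alloc_sk(&ghost, new_count, buf_size, alloc);

    if (err)
        return err;             /* live ring left untouched */

    ring_free_sk(live);
    *live = ghost;
    return 0;
}
```

If the allocation for the new length fails, the caller gets -ENOMEM and keeps a working ring of the old length instead of ending up with no receive buffers.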
      
      Change-Id: I92d7a5fb59a6c884b2efdd1ec652845f101c3359
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40evf: Allocate Rx buffers properly · b163098e
      Mitch Williams authored
      
      
      Allocate the correct number of RX buffers, and don't fiddle with
      next_to_use. The common RX code handles all of this. This fixes a memory
      leak of one page each time the driver is opened.
      
      Change-Id: Id06eca353086e084921f047acad28c14745684ee
      Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • i40e/i40evf: Remove unused hardware receive descriptor code · bec60fc4
      Jesse Brandeburg authored
      
      
      The hardware supports a 16 byte descriptor for receive, but the
      driver was never using it in production.  There was no performance
      benefit to the real driver of 16 byte descriptors, so drop a whole
      lot of complexity while getting rid of the code.
      
      Also since the previous patch made us use no-split mode all the
      time, drop any support in the driver for any other value in dtype
      and assume it is always zero (aka no-split).
      
      Hooray for code removal!
      
      Change-ID: I2257e902e4dad84a07b94db6d2e6f4ce69b27bc0
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>