Commits · 80532384af4ccb6d6cc965fa9232aaa2c456362c · Martyn Welch / linux

Sep 08, 2017

net: sched: fix memleak for chain zero · 80532384

Jiri Pirko authored 7 years ago

There's a memleak happening for chain 0. The thing is, chain 0 needs to
be always present, not created on demand. Therefore tcf_block_get upon
creation of block calls the tcf_chain_create function directly. The
chain is created with refcnt == 1, which is not correct in this case and
causes the memleak. So move the refcnt increment into tcf_chain_get
function even for the case when chain needs to be created.

Reported-by: Jakub Kicinski <kubakici@wp.pl>
Fixes: 5bc17018 ("net: sched: introduce multichain support for filters")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Tested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

80532384

Sep 07, 2017

tipc: remove unnecessary call to dev_net() · 8e0deed9

Kleber Sacilotto de Souza authored 7 years ago


The net device is already stored in the 'net' variable, so no need to call
dev_net() again.

Signed-off-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8e0deed9

netlink: access nlk groups safely in netlink bind and getname · f7736080

Xin Long authored 7 years ago


Now there is no lock protecting nlk ngroups/groups' accessing in
netlink bind and getname. It's safe from nlk groups' setting in
netlink_release, but not from netlink_realloc_groups called by
netlink_setsockopt.

netlink_lock_table is needed in both netlink bind and getname when
accessing nlk groups.

Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f7736080

netlink: fix an use-after-free issue for nlk groups · be82485f

Xin Long authored 7 years ago


ChunYu found a netlink use-after-free issue by syzkaller:

[28448.842981] BUG: KASAN: use-after-free in __nla_put+0x37/0x40 at addr ffff8807185e2378
[28448.969918] Call Trace:
[...]
[28449.117207]  __nla_put+0x37/0x40
[28449.132027]  nla_put+0xf5/0x130
[28449.146261]  sk_diag_fill.isra.4.constprop.5+0x5a0/0x750 [netlink_diag]
[28449.176608]  __netlink_diag_dump+0x25a/0x700 [netlink_diag]
[28449.202215]  netlink_diag_dump+0x176/0x240 [netlink_diag]
[28449.226834]  netlink_dump+0x488/0xbb0
[28449.298014]  __netlink_dump_start+0x4e8/0x760
[28449.317924]  netlink_diag_handler_dump+0x261/0x340 [netlink_diag]
[28449.413414]  sock_diag_rcv_msg+0x207/0x390
[28449.432409]  netlink_rcv_skb+0x149/0x380
[28449.467647]  sock_diag_rcv+0x2d/0x40
[28449.484362]  netlink_unicast+0x562/0x7b0
[28449.564790]  netlink_sendmsg+0xaa8/0xe60
[28449.661510]  sock_sendmsg+0xcf/0x110
[28449.865631]  __sys_sendmsg+0xf3/0x240
[28450.000964]  SyS_sendmsg+0x32/0x50
[28450.016969]  do_syscall_64+0x25c/0x6c0
[28450.154439]  entry_SYSCALL64_slow_path+0x25/0x25

It was caused by no protection between nlk groups' free in netlink_release
and nlk groups' accessing in sk_diag_dump_groups. The similar issue also
exists in netlink_seq_show().

This patch is to defer nlk groups' free in deferred_put_nlk_sk.

Reported-by: ChunYu Wang <chunwang@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be82485f

sched: Use __qdisc_drop instead of kfree_skb in sch_prio and sch_qfq · 39ad1297

Gao Feng authored 7 years ago


The commit 520ac30f ("net_sched: drop packets after root qdisc lock
is released) made a big change of tc for performance. There are two points
left in sch_prio and sch_qfq which are not changed with that commit. Now
enhance them now with __qdisc_drop.

Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

39ad1297

Sep 06, 2017

mac80211: fix deadlock in driver-managed RX BA session start · bde59c47

Johannes Berg authored 7 years ago


When an RX BA session is started by the driver, and it has to tell
mac80211 about it, the corresponding bit in tid_rx_manage_offl gets
set and the BA session work is scheduled. Upon testing this bit, it
will call __ieee80211_start_rx_ba_session(), thus deadlocking as it
already holds the ampdu_mlme.mtx, which that acquires again.

Fix this by adding ___ieee80211_start_rx_ba_session(), a version of
the function that requires the mutex already held.

Cc: stable@vger.kernel.org
Fixes: 699cb58c ("mac80211: manage RX BA session offload without SKB queue")
Reported-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

bde59c47

mac80211: Complete ampdu work schedule during session tear down · 98e93e96

Ilan peer authored 7 years ago


Commit 7a7c0a64 ("mac80211: fix TX aggregation start/stop callback race")
added a cancellation of the ampdu work after the loop that stopped the
Tx and Rx BA sessions. However, in some cases, e.g., during HW reconfig,
the low level driver might call mac80211 APIs to complete the stopping
of the BA sessions, which would queue the ampdu work to handle the actual
completion. This work needs to be performed as otherwise mac80211 data
structures would not be properly synced.

Fix this by checking if BA session STOP_CB bit is set after the BA session
cancellation and properly clean the session.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
[Johannes: the work isn't flushed because that could do other things we
 don't want, and the locking situation isn't clear]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

98e93e96

cfg80211: honor NL80211_RRF_NO_HT40{MINUS,PLUS} · 4e0854a7

Emmanuel Grumbach authored 7 years ago


Honor the NL80211_RRF_NO_HT40{MINUS,PLUS} flags in
reg_process_ht_flags_channel. Not doing so leads can lead
to a firmware assert in iwlwifi for example.

Fixes: b0d7aa59 ("cfg80211: allow wiphy specific regdomain management")
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

4e0854a7

Sep 05, 2017

rds: Fix non-atomic operation on shared flag variable · f530f39f

Håkon Bugge authored 7 years ago


The bits in m_flags in struct rds_message are used for a plurality of
reasons, and from different contexts. To avoid any missing updates to
m_flags, use the atomic set_bit() instead of the non-atomic equivalent.

Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Knut Omang <knut.omang@oracle.com>
Reviewed-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f530f39f

net: sched: don't use GFP_KERNEL under spin lock · 2c8468dc

Jakub Kicinski authored 7 years ago


The new TC IDR code uses GFP_KERNEL under spin lock.  Which leads
to:

[  582.621091] BUG: sleeping function called from invalid context at ../mm/slab.h:416
[  582.629721] in_atomic(): 1, irqs_disabled(): 0, pid: 3379, name: tc
[  582.636939] 2 locks held by tc/3379:
[  582.641049]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff910354ce>] rtnetlink_rcv_msg+0x92e/0x1400
[  582.650958]  #1:  (&(&tn->idrinfo->lock)->rlock){+.-.+.}, at: [<ffffffff9110a5e0>] tcf_idr_create+0x2f0/0x8e0
[  582.662217] Preemption disabled at:
[  582.662222] [<ffffffff9110a5e0>] tcf_idr_create+0x2f0/0x8e0
[  582.672592] CPU: 9 PID: 3379 Comm: tc Tainted: G        W       4.13.0-rc7-debug-00648-g43503a79b9f0 #287
[  582.683432] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016
[  582.691937] Call Trace:
...
[  582.742460]  kmem_cache_alloc+0x286/0x540
[  582.747055]  radix_tree_node_alloc.constprop.6+0x4a/0x450
[  582.753209]  idr_get_free_cmn+0x627/0xf80
...
[  582.815525]  idr_alloc_cmn+0x1a8/0x270
...
[  582.833804]  tcf_idr_create+0x31b/0x8e0
...

Try to preallocate the memory with idr_prealloc(GFP_KERNEL)
(as suggested by Eric Dumazet), and change the allocation
flags under spin lock.

Fixes: 65a206c0 ("net/sched: Change act_api and act_xxx modules to use IDR")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2c8468dc

rxrpc: Make service connection lookup always check for retry · fdade4f6

David Howells authored 7 years ago


When an RxRPC service packet comes in, the target connection is looked up
by an rb-tree search under RCU and a read-locked seqlock; the seqlock retry
check is, however, currently skipped if we got a match, but probably
shouldn't be in case the connection we found gets replaced whilst we're
doing a search.

Make the lookup procedure always go through need_seqretry(), even if the
lookup was successful.  This makes sure we always pick up on a write-lock
event.

On the other hand, since we don't take a ref on the object, but rely on RCU
to prevent its destruction after dropping the seqlock, I'm not sure this is
necessary.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fdade4f6

net: dsa: tag_brcm: Set output queue from skb queue mapping · 0f15b098

Florian Fainelli authored 7 years ago


We originally used skb->priority but that was not quite correct as this
bitfield needs to contain the egress switch queue we intend to send this
SKB to.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0f15b098

net: dsa: Allow switch drivers to indicate number of TX queues · 55199df6

Florian Fainelli authored 7 years ago


Let switch drivers indicate how many TX queues they support. Some
switches, such as Broadcom Starfighter 2 are designed with 8 egress
queues. Future changes will allow us to leverage the queue mapping and
direct the transmission towards a particular queue.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

55199df6

bridge: switchdev: Use an helper to clear forward mark · f1c2eddf

Ido Schimmel authored 7 years ago


Instead of using ifdef in the C file.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Suggested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Tested-by: Yotam Gigi <yotamg@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f1c2eddf

flow_dissector: Add limit for number of headers to dissect · 1eed4dfb

Tom Herbert authored 7 years ago


In flow dissector there are no limits to the number of nested
encapsulations or headers that might be dissected which makes for a
nice DOS attack. This patch sets a limit of the number of headers
that flow dissector will parse.

Headers includes network layer headers, transport layer headers, shim
headers for encapsulation, IPv6 extension headers, etc. The limit for
maximum number of headers to parse has be set to fifteen to account for
a reasonable number of encapsulations, extension headers, VLAN,
in a packet. Note that this limit does not supercede the STOP_AT_*
flags which may stop processing before the headers limit is reached.

Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

1eed4dfb

flow_dissector: Cleanup control flow · 3a1214e8

Tom Herbert authored 7 years ago


__skb_flow_dissect is riddled with gotos that make discerning the flow,
debugging, and extending the capability difficult. This patch
reorganizes things so that we only perform goto's after the two main
switch statements (no gotos within the cases now). It also eliminates
several goto labels so that there are only two labels that can be target
for goto.

Reported-by: Alexander Popov <alex.popov@linux.com>
Signed-off-by: Tom Herbert <tom@quantonium.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

3a1214e8

net/ncsi: fix ncsi_vlan_rx_{add,kill}_vid references · fd0c88b7

Arnd Bergmann authored 7 years ago

We get a new link error in allmodconfig kernels after ftgmac100
started using the ncsi helpers:

ERROR: "ncsi_vlan_rx_kill_vid" [drivers/net/ethernet/faraday/ftgmac100.ko] undefined!
ERROR: "ncsi_vlan_rx_add_vid" [drivers/net/ethernet/faraday/ftgmac100.ko] undefined!

Related to that, we get another error when CONFIG_NET_NCSI is disabled:

drivers/net/ethernet/faraday/ftgmac100.c:1626:25: error: 'ncsi_vlan_rx_add_vid' undeclared here (not in a function); did you mean 'ncsi_start_dev'?
drivers/net/ethernet/faraday/ftgmac100.c:1627:26: error: 'ncsi_vlan_rx_kill_vid' undeclared here (not in a function); did you mean 'ncsi_vlan_rx_add_vid'?

This fixes both problems at once, using a 'static inline' stub helper
for the disabled case, and exporting the functions when they are present.

Fixes: 51564585 ("ftgmac100: Support NCSI VLAN filtering when available")
Fixes: 21acf630 ("net/ncsi: Configure VLAN tag filter")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

fd0c88b7

nl80211: look for HT/VHT capabilities in beacon's tail · ba83bfb1

Igor Mitsyanko authored 7 years ago

There are no HT/VHT capabilities in cfg80211_ap_settings::beacon_ies,
these should be looked for in beacon's tail instead.

Fixes: 66cd794e ("nl80211: add HT/VHT capabilities to AP parameters")
Signed-off-by: Igor Mitsyanko <igor.mitsyanko.os@quantenna.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

ba83bfb1

mac80211: flush hw_roc_start work before cancelling the ROC · 6e46d8ce

Avraham Stern authored 7 years ago


When HW ROC is supported it is possible that after the HW notified
that the ROC has started, the ROC was cancelled and another ROC was
added while the hw_roc_start worker is waiting on the mutex (since
cancelling the ROC and adding another one also holds the same mutex).
As a result, the hw_roc_start worker will continue to run after the
new ROC is added but before it is actually started by the HW.
This may result in notifying userspace that the ROC has started before
it actually does, or in case of management tx ROC, in an attempt to
tx while not on the right channel.

In addition, when the driver will notify mac80211 that the second ROC
has started, mac80211 will warn that this ROC has already been
notified.

Fix this by flushing the hw_roc_start work before cancelling an ROC.

Cc: stable@vger.kernel.org
Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

6e46d8ce

mac80211: agg-tx: call drv_wake_tx_queue in proper context · 979e1f08

Johannes Berg authored 7 years ago


Since drv_wake_tx_queue() is normally called in the TX path, which
is already in an RCU critical section, we should call it the same
way in the aggregation code path, so if the driver expects to be
able to use RCU, it'll already be protected without having to enter
a nested critical section.

Additionally, disable soft-IRQs, since not doing so could cause
issues in a driver that relies on them already being disabled like
in the other path.

Fixes: ba8c3d6f ("mac80211: add an intermediate software queue implementation")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

979e1f08

mac80211: Fix null pointer dereference with iTXQ support · 89e9bfc4

Chunho Lee authored 7 years ago


This change adds null pointer check before dereferencing pointer dev on
netif_tx_start_all_queues() when an interface is added.
With iTXQ support, netif_tx_start_all_queues() is always called while
an interface is added. however, the netdev queues are not associated
and dev is null when the interface is either NL80211_IFTYPE_P2P_DEVICE
or NL80211_IFTYPE_NAN.

Signed-off-by: Chunho Lee <ch.lee@newracom.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

89e9bfc4

mac80211: add MESH IE in the correct order · b44eebea

Liad Kaufman authored 7 years ago


VHT MESH support was added, but the order of the IEs
wasn't enforced. Fix that.

Signed-off-by: Liad Kaufman <liad.kaufman@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

b44eebea

mac80211: shorten debug prints using ht_dbg() to avoid warning · d81b0fd0

Sharon Dvir authored 7 years ago


Invoking ht_dbg() with too long of a string will print a warning.
Shorten the messages while retaining the printed patameters.

Signed-off-by: Sharon Dvir <sharon.dvir@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

d81b0fd0

mac80211: fix VLAN handling with TXQs · 53168215

Johannes Berg authored 7 years ago


With TXQs, the AP_VLAN interfaces are resolved to their owner AP
interface when enqueuing the frame, which makes sense since the
frame really goes out on that as far as the driver is concerned.

However, this introduces a problem: frames to be encrypted with
a VLAN-specific GTK will now be encrypted with the AP GTK, since
the information about which virtual interface to use to select
the key is taken from the TXQ.

Fix this by preserving info->control.vif and using that in the
dequeue function. This now requires doing the driver-mapping
in the dequeue as well.

Since there's no way to filter the frames that are sitting on a
TXQ, drop all frames, which may affect other interfaces, when an
AP_VLAN is removed.

Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

53168215

mac80211: fix incorrect assignment of reassoc value · 7a7d3e4c

Simon Dinkin authored 7 years ago


this fix minor issue in the log message.
in ieee80211_rx_mgmt_assoc_resp function, when assigning the
reassoc value from the mgmt frame control:
ieee80211_is_reassoc_resp function need to be used, instead of
ieee80211_is_reassoc_req function.

Signed-off-by: Simon Dinkin <simon.dinkin@tandemg.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>

7a7d3e4c

Sep 04, 2017

netfilter: nf_tables: support for recursive chain deletion · 9dee1474

Pablo Neira Ayuso authored 7 years ago

This patch sorts out an asymmetry in deletions. Currently, table and set
deletion commands come with an implicit content flush on deletion.
However, chain deletion results in -EBUSY if there is content in this
chain, so no implicit flush happens. So you have to send a flush command
in first place to delete chains, this is inconsistent and it can be
annoying in terms of user experience.

This patch uses the new NLM_F_NONREC flag to request non-recursive chain
deletion, ie. if the chain to be removed contains rules, then this
returns EBUSY. This problem was discussed during the NFWS'17 in Faro,
Portugal. In iptables, you hit -EBUSY if you try to delete a chain that
contains rules, so you have to flush first before you can remove
anything. Since iptables-compat uses the nf_tables netlink interface, it
has to use the NLM_F_NONREC flag from userspace to retain the original
iptables semantics, ie. bail out on removing chains that contain rules.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

9dee1474

netfilter: nf_tables: use NLM_F_NONREC for deletion requests · a8278400

Pablo Neira Ayuso authored 7 years ago


Bail out if user requests non-recursive deletion for tables and sets.
This new flags tells nf_tables netlink interface to reject deletions if
tables and sets have content.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

a8278400

netfilter: nf_tables: add nf_tables_addchain() · 4035285f

Pablo Neira Ayuso authored 7 years ago


Wrap the chain addition path in a function to make it more maintainable.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

4035285f

netfilter: nf_tables: add nf_tables_updchain() · 2c4a488a

Pablo Neira Ayuso authored 7 years ago


nf_tables_newchain() is too large, wrap the chain update path in a
function to make it more maintainable.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

2c4a488a

net: Remove CONFIG_NETFILTER_DEBUG and _ASSERT() macros. · 9efdb14f

Varsha Rao authored 7 years ago


This patch removes CONFIG_NETFILTER_DEBUG and _ASSERT() macros as they
are no longer required. Replace _ASSERT() macros with WARN_ON().

Signed-off-by: Varsha Rao <rvarsha016@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

9efdb14f

net: Replace NF_CT_ASSERT() with WARN_ON(). · 44d6e2f2

Varsha Rao authored 7 years ago


This patch removes NF_CT_ASSERT() and instead uses WARN_ON().

Signed-off-by: Varsha Rao <rvarsha016@gmail.com>

44d6e2f2

netfilter: remove unused hooknum arg from packet functions · d1c1e39d
Florian Westphal authored 7 years ago
```
tested with allmodconfig build.

Signed-off-by: Florian Westphal <fw@strlen.de>
```
d1c1e39d

netfilter: nft_limit: add stateful object type · a6912055

Pablo M. Bermudo Garay authored 7 years ago


Register a new limit stateful object type into the stateful object
infrastructure.

Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

a6912055

netfilter: nft_limit: replace pkt_bytes with bytes · 6e323887

Pablo M. Bermudo Garay authored 7 years ago


Just a small refactor patch in order to improve the code readability.

Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

6e323887

netfilter: nf_tables: add select_ops for stateful objects · dfc46034

Pablo M. Bermudo Garay authored 7 years ago


This patch adds support for overloading stateful objects operations
through the select_ops() callback, just as it is implemented for
expressions.

This change is needed for upcoming additions to the stateful objects
infrastructure.

Signed-off-by: Pablo M. Bermudo Garay <pablombg@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

dfc46034

netfilter: xt_hashlimit: add rate match mode · bea74641

Vishwanath Pai authored 7 years ago


This patch adds a new feature to hashlimit that allows matching on the
current packet/byte rate without rate limiting. This can be enabled
with a new flag --hashlimit-rate-match. The match returns true if the
current rate of packets is above/below the user specified value.

The main difference between the existing algorithm and the new one is
that the existing algorithm rate-limits the flow whereas the new
algorithm does not. Instead it *classifies* the flow based on whether
it is above or below a certain rate. I will demonstrate this with an
example below. Let us assume this rule:

iptables -A INPUT -m hashlimit --hashlimit-above 10/s -j new_chain

If the packet rate is 15/s, the existing algorithm would ACCEPT 10
packets every second and send 5 packets to "new_chain".

But with the new algorithm, as long as the rate of 15/s is sustained,
all packets will continue to match and every packet is sent to new_chain.

This new functionality will let us classify different flows based on
their current rate, so that further decisions can be made on them based on
what the current rate is.

This is how the new algorithm works:
We divide time into intervals of 1 (sec/min/hour) as specified by
the user. We keep track of the number of packets/bytes processed in the
current interval. After each interval we reset the counter to 0.

When we receive a packet for match, we look at the packet rate
during the current interval and the previous interval to make a
decision:

if [ prev_rate < user and cur_rate < user ]
        return Below
else
        return Above

Where cur_rate is the number of packets/bytes seen in the current
interval, prev is the number of packets/bytes seen in the previous
interval and 'user' is the rate specified by the user.

We also provide flexibility to the user for choosing the time
interval using the option --hashilmit-interval. For example the user can
keep a low rate like x/hour but still keep the interval as small as 1
second.

To preserve backwards compatibility we have to add this feature in a new
revision, so I've created revision 3 for hashlimit. The two new options
we add are:

--hashlimit-rate-match
--hashlimit-rate-interval

I have updated the help text to add these new options. Also added a few
tests for the new options.

Suggested-by: Igor Lubashev <ilubashe@akamai.com>
Reviewed-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

bea74641

Sep 03, 2017

l2tp: pass tunnel pointer to ->session_create() · f026bc29

Guillaume Nault authored 7 years ago


Using l2tp_tunnel_find() in pppol2tp_session_create() and
l2tp_eth_create() is racy, because no reference is held on the
returned session. These functions are only used to implement the
->session_create callback which is run by l2tp_nl_cmd_session_create().
Therefore searching for the parent tunnel isn't necessary because
l2tp_nl_cmd_session_create() already has a pointer to it and holds a
reference.

This patch modifies ->session_create()'s prototype to directly pass the
the parent tunnel as parameter, thus avoiding searching for it in
pppol2tp_session_create() and l2tp_eth_create().

Since we have to touch the ->session_create() call in
l2tp_nl_cmd_session_create(), let's also remove the useless conditional:
we know that ->session_create isn't NULL at this point because it's
already been checked earlier in this same function.

Finally, one might be tempted to think that the removed
l2tp_tunnel_find() calls were harmless because they would return the
same tunnel as the one held by l2tp_nl_cmd_session_create() anyway.
But that tunnel might be removed and a new one created with same tunnel
Id before the l2tp_tunnel_find() call. In this case l2tp_tunnel_find()
would return the new tunnel which wouldn't be protected by the
reference held by l2tp_nl_cmd_session_create().

Fixes: 309795f4 ("l2tp: Add netlink control API for L2TP")
Fixes: d9e31d17 ("l2tp: Add L2TP ethernet pseudowire support")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

f026bc29

l2tp: prevent creation of sessions on terminated tunnels · f3c66d4e

Guillaume Nault authored 7 years ago

l2tp_tunnel_destruct() sets tunnel->sock to NULL, then removes the
tunnel from the pernet list and finally closes all its sessions.
Therefore, it's possible to add a session to a tunnel that is still
reachable, but for which tunnel->sock has already been reset. This can
make l2tp_session_create() dereference a NULL pointer when calling
sock_hold(tunnel->sock).

This patch adds the .acpt_newsess field to struct l2tp_tunnel, which is
used by l2tp_tunnel_closeall() to prevent addition of new sessions to
tunnels. Resetting tunnel->sock is done after l2tp_tunnel_closeall()
returned, so that l2tp_session_add_to_tunnel() can safely take a
reference on it when .acpt_newsess is true.

The .acpt_newsess field is modified in l2tp_tunnel_closeall(), rather
than in l2tp_tunnel_destruct(), so that it benefits all tunnel removal
mechanisms. E.g. on UDP tunnels, a session could be added to a tunnel
after l2tp_udp_encap_destroy() proceeded. This would prevent the tunnel
from being removed because of the references held by this new session
on the tunnel and its socket. Even though the session could be removed
manually later on, this defeats the purpose of
commit 9980d001 ("l2tp: add udp encap socket destroy handler").

Fixes: fd558d18 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

f3c66d4e

Revert "net: fix percpu memory leaks" · 5a63643e

Jesper Dangaard Brouer authored 7 years ago


This reverts commit 1d6119ba.

After reverting commit 6d7b857d ("net: use lib/percpu_counter API
for fragmentation mem accounting") then here is no need for this
fix-up patch.  As percpu_counter is no longer used, it cannot
memory leak it any-longer.

Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting")
Fixes: 1d6119ba ("net: fix percpu memory leaks")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5a63643e

Revert "net: use lib/percpu_counter API for fragmentation mem accounting" · fb452a1a

Jesper Dangaard Brouer authored 7 years ago


This reverts commit 6d7b857d.

There is a bug in fragmentation codes use of the percpu_counter API,
that can cause issues on systems with many CPUs.

The frag_mem_limit() just reads the global counter (fbc->count),
without considering other CPUs can have upto batch size (130K) that
haven't been subtracted yet.  Due to the 3MBytes lower thresh limit,
this become dangerous at >=24 CPUs (3*1024*1024/130000=24).

The correct API usage would be to use __percpu_counter_compare() which
does the right thing, and takes into account the number of (online)
CPUs and batch size, to account for this and call __percpu_counter_sum()
when needed.

We choose to revert the use of the lib/percpu_counter API for frag
memory accounting for several reasons:

1) On systems with CPUs > 24, the heavier fully locked
   __percpu_counter_sum() is always invoked, which will be more
   expensive than the atomic_t that is reverted to.

Given systems with more than 24 CPUs are becoming common this doesn't
seem like a good option.  To mitigate this, the batch size could be
decreased and thresh be increased.

2) The add_frag_mem_limit+sub_frag_mem_limit pairs happen on the RX
   CPU, before SKBs are pushed into sockets on remote CPUs.  Given
   NICs can only hash on L2 part of the IP-header, the NIC-RXq's will
   likely be limited.  Thus, a fair chance that atomic add+dec happen
   on the same CPU.

Revert note that commit 1d6119ba ("net: fix percpu memory leaks")
removed init_frag_mem_limit() and instead use inet_frags_init_net().
After this revert, inet_frags_uninit_net() becomes empty.

Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting")
Fixes: 1d6119ba ("net: fix percpu memory leaks")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

fb452a1a