- Sep 08, 2017
-
-
Jiri Pirko authored
There's a memleak happening for chain 0. The thing is, chain 0 needs to be always present, not created on demand. Therefore tcf_block_get upon creation of block calls the tcf_chain_create function directly. The chain is created with refcnt == 1, which is not correct in this case and causes the memleak. So move the refcnt increment into tcf_chain_get function even for the case when chain needs to be created. Reported-by:
Jakub Kicinski <kubakici@wp.pl> Fixes: 5bc17018 ("net: sched: introduce multichain support for filters") Signed-off-by:
Jiri Pirko <jiri@mellanox.com> Tested-by:
Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Sep 07, 2017
-
-
Kleber Sacilotto de Souza authored
The net device is already stored in the 'net' variable, so no need to call dev_net() again. Signed-off-by:
Kleber Sacilotto de Souza <kleber.souza@canonical.com> Acked-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Xin Long authored
Now there is no lock protecting nlk ngroups/groups' accessing in netlink bind and getname. It's safe from nlk groups' setting in netlink_release, but not from netlink_realloc_groups called by netlink_setsockopt. netlink_lock_table is needed in both netlink bind and getname when accessing nlk groups. Acked-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
Xin Long <lucien.xin@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Xin Long authored
ChunYu found a netlink use-after-free issue by syzkaller: [28448.842981] BUG: KASAN: use-after-free in __nla_put+0x37/0x40 at addr ffff8807185e2378 [28448.969918] Call Trace: [...] [28449.117207] __nla_put+0x37/0x40 [28449.132027] nla_put+0xf5/0x130 [28449.146261] sk_diag_fill.isra.4.constprop.5+0x5a0/0x750 [netlink_diag] [28449.176608] __netlink_diag_dump+0x25a/0x700 [netlink_diag] [28449.202215] netlink_diag_dump+0x176/0x240 [netlink_diag] [28449.226834] netlink_dump+0x488/0xbb0 [28449.298014] __netlink_dump_start+0x4e8/0x760 [28449.317924] netlink_diag_handler_dump+0x261/0x340 [netlink_diag] [28449.413414] sock_diag_rcv_msg+0x207/0x390 [28449.432409] netlink_rcv_skb+0x149/0x380 [28449.467647] sock_diag_rcv+0x2d/0x40 [28449.484362] netlink_unicast+0x562/0x7b0 [28449.564790] netlink_sendmsg+0xaa8/0xe60 [28449.661510] sock_sendmsg+0xcf/0x110 [28449.865631] __sys_sendmsg+0xf3/0x240 [28450.000964] SyS_sendmsg+0x32/0x50 [28450.016969] do_syscall_64+0x25c/0x6c0 [28450.154439] entry_SYSCALL64_slow_path+0x25/0x25 It was caused by no protection between nlk groups' free in netlink_release and nlk groups' accessing in sk_diag_dump_groups. The similar issue also exists in netlink_seq_show(). This patch is to defer nlk groups' free in deferred_put_nlk_sk. Reported-by:
ChunYu Wang <chunwang@redhat.com> Acked-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
Xin Long <lucien.xin@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Gao Feng authored
The commit 520ac30f ("net_sched: drop packets after root qdisc lock is released) made a big change of tc for performance. There are two points left in sch_prio and sch_qfq which are not changed with that commit. Now enhance them now with __qdisc_drop. Signed-off-by:
Gao Feng <gfree.wind@vip.163.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Sep 06, 2017
-
-
Johannes Berg authored
When an RX BA session is started by the driver, and it has to tell mac80211 about it, the corresponding bit in tid_rx_manage_offl gets set and the BA session work is scheduled. Upon testing this bit, it will call __ieee80211_start_rx_ba_session(), thus deadlocking as it already holds the ampdu_mlme.mtx, which that acquires again. Fix this by adding ___ieee80211_start_rx_ba_session(), a version of the function that requires the mutex already held. Cc: stable@vger.kernel.org Fixes: 699cb58c ("mac80211: manage RX BA session offload without SKB queue") Reported-by:
Matteo Croce <mcroce@redhat.com> Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Ilan peer authored
Commit 7a7c0a64 ("mac80211: fix TX aggregation start/stop callback race") added a cancellation of the ampdu work after the loop that stopped the Tx and Rx BA sessions. However, in some cases, e.g., during HW reconfig, the low level driver might call mac80211 APIs to complete the stopping of the BA sessions, which would queue the ampdu work to handle the actual completion. This work needs to be performed as otherwise mac80211 data structures would not be properly synced. Fix this by checking if BA session STOP_CB bit is set after the BA session cancellation and properly clean the session. Signed-off-by:
Ilan Peer <ilan.peer@intel.com> [Johannes: the work isn't flushed because that could do other things we don't want, and the locking situation isn't clear] Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Emmanuel Grumbach authored
Honor the NL80211_RRF_NO_HT40{MINUS,PLUS} flags in reg_process_ht_flags_channel. Not doing so leads can lead to a firmware assert in iwlwifi for example. Fixes: b0d7aa59 ("cfg80211: allow wiphy specific regdomain management") Signed-off-by:
Emmanuel Grumbach <emmanuel.grumbach@intel.com> Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
- Sep 05, 2017
-
-
Håkon Bugge authored
The bits in m_flags in struct rds_message are used for a plurality of reasons, and from different contexts. To avoid any missing updates to m_flags, use the atomic set_bit() instead of the non-atomic equivalent. Signed-off-by:
Håkon Bugge <haakon.bugge@oracle.com> Reviewed-by:
Knut Omang <knut.omang@oracle.com> Reviewed-by:
Wei Lin Guay <wei.lin.guay@oracle.com> Acked-by:
Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jakub Kicinski authored
The new TC IDR code uses GFP_KERNEL under spin lock. Which leads to: [ 582.621091] BUG: sleeping function called from invalid context at ../mm/slab.h:416 [ 582.629721] in_atomic(): 1, irqs_disabled(): 0, pid: 3379, name: tc [ 582.636939] 2 locks held by tc/3379: [ 582.641049] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff910354ce>] rtnetlink_rcv_msg+0x92e/0x1400 [ 582.650958] #1: (&(&tn->idrinfo->lock)->rlock){+.-.+.}, at: [<ffffffff9110a5e0>] tcf_idr_create+0x2f0/0x8e0 [ 582.662217] Preemption disabled at: [ 582.662222] [<ffffffff9110a5e0>] tcf_idr_create+0x2f0/0x8e0 [ 582.672592] CPU: 9 PID: 3379 Comm: tc Tainted: G W 4.13.0-rc7-debug-00648-g43503a79b9f0 #287 [ 582.683432] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016 [ 582.691937] Call Trace: ... [ 582.742460] kmem_cache_alloc+0x286/0x540 [ 582.747055] radix_tree_node_alloc.constprop.6+0x4a/0x450 [ 582.753209] idr_get_free_cmn+0x627/0xf80 ... [ 582.815525] idr_alloc_cmn+0x1a8/0x270 ... [ 582.833804] tcf_idr_create+0x31b/0x8e0 ... Try to preallocate the memory with idr_prealloc(GFP_KERNEL) (as suggested by Eric Dumazet), and change the allocation flags under spin lock. Fixes: 65a206c0 ("net/sched: Change act_api and act_xxx modules to use IDR") Signed-off-by:
Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by:
Simon Horman <simon.horman@netronome.com> Acked-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
David Howells authored
When an RxRPC service packet comes in, the target connection is looked up by an rb-tree search under RCU and a read-locked seqlock; the seqlock retry check is, however, currently skipped if we got a match, but probably shouldn't be in case the connection we found gets replaced whilst we're doing a search. Make the lookup procedure always go through need_seqretry(), even if the lookup was successful. This makes sure we always pick up on a write-lock event. On the other hand, since we don't take a ref on the object, but rely on RCU to prevent its destruction after dropping the seqlock, I'm not sure this is necessary. Signed-off-by:
David Howells <dhowells@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Florian Fainelli authored
We originally used skb->priority but that was not quite correct as this bitfield needs to contain the egress switch queue we intend to send this SKB to. Signed-off-by:
Florian Fainelli <f.fainelli@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Florian Fainelli authored
Let switch drivers indicate how many TX queues they support. Some switches, such as Broadcom Starfighter 2 are designed with 8 egress queues. Future changes will allow us to leverage the queue mapping and direct the transmission towards a particular queue. Signed-off-by:
Florian Fainelli <f.fainelli@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
Instead of using ifdef in the C file. Signed-off-by:
Ido Schimmel <idosch@mellanox.com> Suggested-by:
Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Tested-by:
Yotam Gigi <yotamg@mellanox.com> Acked-by:
Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Tom Herbert authored
In flow dissector there are no limits to the number of nested encapsulations or headers that might be dissected which makes for a nice DOS attack. This patch sets a limit of the number of headers that flow dissector will parse. Headers includes network layer headers, transport layer headers, shim headers for encapsulation, IPv6 extension headers, etc. The limit for maximum number of headers to parse has be set to fifteen to account for a reasonable number of encapsulations, extension headers, VLAN, in a packet. Note that this limit does not supercede the STOP_AT_* flags which may stop processing before the headers limit is reached. Reported-by:
Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by:
Tom Herbert <tom@quantonium.net> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Tom Herbert authored
__skb_flow_dissect is riddled with gotos that make discerning the flow, debugging, and extending the capability difficult. This patch reorganizes things so that we only perform goto's after the two main switch statements (no gotos within the cases now). It also eliminates several goto labels so that there are only two labels that can be target for goto. Reported-by:
Alexander Popov <alex.popov@linux.com> Signed-off-by:
Tom Herbert <tom@quantonium.net> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Arnd Bergmann authored
We get a new link error in allmodconfig kernels after ftgmac100 started using the ncsi helpers: ERROR: "ncsi_vlan_rx_kill_vid" [drivers/net/ethernet/faraday/ftgmac100.ko] undefined! ERROR: "ncsi_vlan_rx_add_vid" [drivers/net/ethernet/faraday/ftgmac100.ko] undefined! Related to that, we get another error when CONFIG_NET_NCSI is disabled: drivers/net/ethernet/faraday/ftgmac100.c:1626:25: error: 'ncsi_vlan_rx_add_vid' undeclared here (not in a function); did you mean 'ncsi_start_dev'? drivers/net/ethernet/faraday/ftgmac100.c:1627:26: error: 'ncsi_vlan_rx_kill_vid' undeclared here (not in a function); did you mean 'ncsi_vlan_rx_add_vid'? This fixes both problems at once, using a 'static inline' stub helper for the disabled case, and exporting the functions when they are present. Fixes: 51564585 ("ftgmac100: Support NCSI VLAN filtering when available") Fixes: 21acf630 ("net/ncsi: Configure VLAN tag filter") Signed-off-by:
Arnd Bergmann <arnd@arndb.de> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Igor Mitsyanko authored
There are no HT/VHT capabilities in cfg80211_ap_settings::beacon_ies, these should be looked for in beacon's tail instead. Fixes: 66cd794e ("nl80211: add HT/VHT capabilities to AP parameters") Signed-off-by:
Igor Mitsyanko <igor.mitsyanko.os@quantenna.com> Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Avraham Stern authored
When HW ROC is supported it is possible that after the HW notified that the ROC has started, the ROC was cancelled and another ROC was added while the hw_roc_start worker is waiting on the mutex (since cancelling the ROC and adding another one also holds the same mutex). As a result, the hw_roc_start worker will continue to run after the new ROC is added but before it is actually started by the HW. This may result in notifying userspace that the ROC has started before it actually does, or in case of management tx ROC, in an attempt to tx while not on the right channel. In addition, when the driver will notify mac80211 that the second ROC has started, mac80211 will warn that this ROC has already been notified. Fix this by flushing the hw_roc_start work before cancelling an ROC. Cc: stable@vger.kernel.org Signed-off-by:
Avraham Stern <avraham.stern@intel.com> Signed-off-by:
Luca Coelho <luciano.coelho@intel.com> Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Johannes Berg authored
Since drv_wake_tx_queue() is normally called in the TX path, which is already in an RCU critical section, we should call it the same way in the aggregation code path, so if the driver expects to be able to use RCU, it'll already be protected without having to enter a nested critical section. Additionally, disable soft-IRQs, since not doing so could cause issues in a driver that relies on them already being disabled like in the other path. Fixes: ba8c3d6f ("mac80211: add an intermediate software queue implementation") Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Chunho Lee authored
This change adds null pointer check before dereferencing pointer dev on netif_tx_start_all_queues() when an interface is added. With iTXQ support, netif_tx_start_all_queues() is always called while an interface is added. however, the netdev queues are not associated and dev is null when the interface is either NL80211_IFTYPE_P2P_DEVICE or NL80211_IFTYPE_NAN. Signed-off-by:
Chunho Lee <ch.lee@newracom.com> Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Liad Kaufman authored
VHT MESH support was added, but the order of the IEs wasn't enforced. Fix that. Signed-off-by:
Liad Kaufman <liad.kaufman@intel.com> Signed-off-by:
Luca Coelho <luciano.coelho@intel.com> Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Sharon Dvir authored
Invoking ht_dbg() with too long of a string will print a warning. Shorten the messages while retaining the printed patameters. Signed-off-by:
Sharon Dvir <sharon.dvir@intel.com> Signed-off-by:
Luca Coelho <luciano.coelho@intel.com> Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Johannes Berg authored
With TXQs, the AP_VLAN interfaces are resolved to their owner AP interface when enqueuing the frame, which makes sense since the frame really goes out on that as far as the driver is concerned. However, this introduces a problem: frames to be encrypted with a VLAN-specific GTK will now be encrypted with the AP GTK, since the information about which virtual interface to use to select the key is taken from the TXQ. Fix this by preserving info->control.vif and using that in the dequeue function. This now requires doing the driver-mapping in the dequeue as well. Since there's no way to filter the frames that are sitting on a TXQ, drop all frames, which may affect other interfaces, when an AP_VLAN is removed. Cc: stable@vger.kernel.org Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
Simon Dinkin authored
this fix minor issue in the log message. in ieee80211_rx_mgmt_assoc_resp function, when assigning the reassoc value from the mgmt frame control: ieee80211_is_reassoc_resp function need to be used, instead of ieee80211_is_reassoc_req function. Signed-off-by:
Simon Dinkin <simon.dinkin@tandemg.com> Signed-off-by:
Johannes Berg <johannes.berg@intel.com>
-
- Sep 04, 2017
-
-
Pablo Neira Ayuso authored
This patch sorts out an asymmetry in deletions. Currently, table and set deletion commands come with an implicit content flush on deletion. However, chain deletion results in -EBUSY if there is content in this chain, so no implicit flush happens. So you have to send a flush command in first place to delete chains, this is inconsistent and it can be annoying in terms of user experience. This patch uses the new NLM_F_NONREC flag to request non-recursive chain deletion, ie. if the chain to be removed contains rules, then this returns EBUSY. This problem was discussed during the NFWS'17 in Faro, Portugal. In iptables, you hit -EBUSY if you try to delete a chain that contains rules, so you have to flush first before you can remove anything. Since iptables-compat uses the nf_tables netlink interface, it has to use the NLM_F_NONREC flag from userspace to retain the original iptables semantics, ie. bail out on removing chains that contain rules. Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Pablo Neira Ayuso authored
Bail out if user requests non-recursive deletion for tables and sets. This new flags tells nf_tables netlink interface to reject deletions if tables and sets have content. Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Pablo Neira Ayuso authored
Wrap the chain addition path in a function to make it more maintainable. Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Pablo Neira Ayuso authored
nf_tables_newchain() is too large, wrap the chain update path in a function to make it more maintainable. Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Varsha Rao authored
This patch removes CONFIG_NETFILTER_DEBUG and _ASSERT() macros as they are no longer required. Replace _ASSERT() macros with WARN_ON(). Signed-off-by:
Varsha Rao <rvarsha016@gmail.com> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Varsha Rao authored
This patch removes NF_CT_ASSERT() and instead uses WARN_ON(). Signed-off-by:
Varsha Rao <rvarsha016@gmail.com>
-
Florian Westphal authored
tested with allmodconfig build. Signed-off-by:
Florian Westphal <fw@strlen.de>
-
Pablo M. Bermudo Garay authored
Register a new limit stateful object type into the stateful object infrastructure. Signed-off-by:
Pablo M. Bermudo Garay <pablombg@gmail.com> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Pablo M. Bermudo Garay authored
Just a small refactor patch in order to improve the code readability. Signed-off-by:
Pablo M. Bermudo Garay <pablombg@gmail.com> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Pablo M. Bermudo Garay authored
This patch adds support for overloading stateful objects operations through the select_ops() callback, just as it is implemented for expressions. This change is needed for upcoming additions to the stateful objects infrastructure. Signed-off-by:
Pablo M. Bermudo Garay <pablombg@gmail.com> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
Vishwanath Pai authored
This patch adds a new feature to hashlimit that allows matching on the current packet/byte rate without rate limiting. This can be enabled with a new flag --hashlimit-rate-match. The match returns true if the current rate of packets is above/below the user specified value. The main difference between the existing algorithm and the new one is that the existing algorithm rate-limits the flow whereas the new algorithm does not. Instead it *classifies* the flow based on whether it is above or below a certain rate. I will demonstrate this with an example below. Let us assume this rule: iptables -A INPUT -m hashlimit --hashlimit-above 10/s -j new_chain If the packet rate is 15/s, the existing algorithm would ACCEPT 10 packets every second and send 5 packets to "new_chain". But with the new algorithm, as long as the rate of 15/s is sustained, all packets will continue to match and every packet is sent to new_chain. This new functionality will let us classify different flows based on their current rate, so that further decisions can be made on them based on what the current rate is. This is how the new algorithm works: We divide time into intervals of 1 (sec/min/hour) as specified by the user. We keep track of the number of packets/bytes processed in the current interval. After each interval we reset the counter to 0. When we receive a packet for match, we look at the packet rate during the current interval and the previous interval to make a decision: if [ prev_rate < user and cur_rate < user ] return Below else return Above Where cur_rate is the number of packets/bytes seen in the current interval, prev is the number of packets/bytes seen in the previous interval and 'user' is the rate specified by the user. We also provide flexibility to the user for choosing the time interval using the option --hashilmit-interval. For example the user can keep a low rate like x/hour but still keep the interval as small as 1 second. To preserve backwards compatibility we have to add this feature in a new revision, so I've created revision 3 for hashlimit. The two new options we add are: --hashlimit-rate-match --hashlimit-rate-interval I have updated the help text to add these new options. Also added a few tests for the new options. Suggested-by:
Igor Lubashev <ilubashe@akamai.com> Reviewed-by:
Josh Hunt <johunt@akamai.com> Signed-off-by:
Vishwanath Pai <vpai@akamai.com> Signed-off-by:
Pablo Neira Ayuso <pablo@netfilter.org>
-
- Sep 03, 2017
-
-
Guillaume Nault authored
Using l2tp_tunnel_find() in pppol2tp_session_create() and l2tp_eth_create() is racy, because no reference is held on the returned session. These functions are only used to implement the ->session_create callback which is run by l2tp_nl_cmd_session_create(). Therefore searching for the parent tunnel isn't necessary because l2tp_nl_cmd_session_create() already has a pointer to it and holds a reference. This patch modifies ->session_create()'s prototype to directly pass the the parent tunnel as parameter, thus avoiding searching for it in pppol2tp_session_create() and l2tp_eth_create(). Since we have to touch the ->session_create() call in l2tp_nl_cmd_session_create(), let's also remove the useless conditional: we know that ->session_create isn't NULL at this point because it's already been checked earlier in this same function. Finally, one might be tempted to think that the removed l2tp_tunnel_find() calls were harmless because they would return the same tunnel as the one held by l2tp_nl_cmd_session_create() anyway. But that tunnel might be removed and a new one created with same tunnel Id before the l2tp_tunnel_find() call. In this case l2tp_tunnel_find() would return the new tunnel which wouldn't be protected by the reference held by l2tp_nl_cmd_session_create(). Fixes: 309795f4 ("l2tp: Add netlink control API for L2TP") Fixes: d9e31d17 ("l2tp: Add L2TP ethernet pseudowire support") Signed-off-by:
Guillaume Nault <g.nault@alphalink.fr> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
l2tp_tunnel_destruct() sets tunnel->sock to NULL, then removes the tunnel from the pernet list and finally closes all its sessions. Therefore, it's possible to add a session to a tunnel that is still reachable, but for which tunnel->sock has already been reset. This can make l2tp_session_create() dereference a NULL pointer when calling sock_hold(tunnel->sock). This patch adds the .acpt_newsess field to struct l2tp_tunnel, which is used by l2tp_tunnel_closeall() to prevent addition of new sessions to tunnels. Resetting tunnel->sock is done after l2tp_tunnel_closeall() returned, so that l2tp_session_add_to_tunnel() can safely take a reference on it when .acpt_newsess is true. The .acpt_newsess field is modified in l2tp_tunnel_closeall(), rather than in l2tp_tunnel_destruct(), so that it benefits all tunnel removal mechanisms. E.g. on UDP tunnels, a session could be added to a tunnel after l2tp_udp_encap_destroy() proceeded. This would prevent the tunnel from being removed because of the references held by this new session on the tunnel and its socket. Even though the session could be removed manually later on, this defeats the purpose of commit 9980d001 ("l2tp: add udp encap socket destroy handler"). Fixes: fd558d18 ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts") Signed-off-by:
Guillaume Nault <g.nault@alphalink.fr> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jesper Dangaard Brouer authored
This reverts commit 1d6119ba. After reverting commit 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting") then here is no need for this fix-up patch. As percpu_counter is no longer used, it cannot memory leak it any-longer. Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting") Fixes: 1d6119ba ("net: fix percpu memory leaks") Signed-off-by:
Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jesper Dangaard Brouer authored
This reverts commit 6d7b857d. There is a bug in fragmentation codes use of the percpu_counter API, that can cause issues on systems with many CPUs. The frag_mem_limit() just reads the global counter (fbc->count), without considering other CPUs can have upto batch size (130K) that haven't been subtracted yet. Due to the 3MBytes lower thresh limit, this become dangerous at >=24 CPUs (3*1024*1024/130000=24). The correct API usage would be to use __percpu_counter_compare() which does the right thing, and takes into account the number of (online) CPUs and batch size, to account for this and call __percpu_counter_sum() when needed. We choose to revert the use of the lib/percpu_counter API for frag memory accounting for several reasons: 1) On systems with CPUs > 24, the heavier fully locked __percpu_counter_sum() is always invoked, which will be more expensive than the atomic_t that is reverted to. Given systems with more than 24 CPUs are becoming common this doesn't seem like a good option. To mitigate this, the batch size could be decreased and thresh be increased. 2) The add_frag_mem_limit+sub_frag_mem_limit pairs happen on the RX CPU, before SKBs are pushed into sockets on remote CPUs. Given NICs can only hash on L2 part of the IP-header, the NIC-RXq's will likely be limited. Thus, a fair chance that atomic add+dec happen on the same CPU. Revert note that commit 1d6119ba ("net: fix percpu memory leaks") removed init_frag_mem_limit() and instead use inet_frags_init_net(). After this revert, inet_frags_uninit_net() becomes empty. Fixes: 6d7b857d ("net: use lib/percpu_counter API for fragmentation mem accounting") Fixes: 1d6119ba ("net: fix percpu memory leaks") Signed-off-by:
Jesper Dangaard Brouer <brouer@redhat.com> Acked-by:
Florian Westphal <fw@strlen.de> Signed-off-by:
David S. Miller <davem@davemloft.net>
-