- Mar 25, 2016
-
-
Geliang Tang authored
Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code. Signed-off-by:
Geliang Tang <geliangtang@163.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Don't open-code sizeof_footer() in read_partial_message() and ceph_msg_revoke(). Also, after switching to sizeof_footer(), it's now possible to use con_out_kvec_add() in prepare_write_message_footer(). Signed-off-by:
Ilya Dryomov <idryomov@gmail.com> Reviewed-by:
Alex Elder <elder@linaro.org>
-
Yan, Zheng authored
This helper duplicates last extent operation in OSD request, then adjusts the new extent operation's offset and length. The helper is for scatterd page writeback, which adds nonconsecutive dirty pages to single OSD request. Signed-off-by:
Yan, Zheng <zyan@redhat.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Turn r_ops into a flexible array member to enable large, consisting of up to 16 ops, OSD requests. The use case is scattered writeback in cephfs and, as far as the kernel client is concerned, 16 is just a made up number. r_ops had size 3 for copyup+hint+write, but copyup is really a special case - it can only happen once. ceph_osd_request_cache is therefore stuffed with num_ops=2 requests, anything bigger than that is allocated with kmalloc(). req_mempool is backed by ceph_osd_request_cache, which means either num_ops=1 or num_ops=2 for use_mempool=true - all existing users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with that. Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
ceph_osd_request_cache was introduced a long time ago. Also, osd_req is about to get a flexible array member, which ceph_osd_request_cache is going to be aware of. Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Although msg_size is calculated correctly, the terms are grouped in a misleading way - snaps appears to not have room for a u32 length. Move calculation closer to its use and regroup terms. No functional change. Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Yan, Zheng authored
This avoids defining large array of r_reply_op_{len,result} in in struct ceph_osd_request. Signed-off-by:
Yan, Zheng <zyan@redhat.com> Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Follow userspace nomenclature on this - the next commit adds outdata_len. Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
This can happen if __close_session() in ceph_monc_stop() races with a connection reset. We need to ignore such faults, otherwise it's likely we would take !hunting, call __schedule_delayed() and end up with delayed_work() executing on invalid memory, among other things. The (two!) con->private tests are useless, as nothing ever clears con->private. Nuke them. Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Doing __schedule_delayed() in the hunting branch is pointless, as the tick will have already been scheduled by then. What we need to do instead is *reschedule* it in the !hunting branch, after reopen_session() changes hunt_mult, which affects the delay. This helps with spacing out connection attempts and avoiding things like two back-to-back attempts followed by a longer period of waiting around. Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
hunting is now set in __open_session() and cleared in finish_hunting(), instead of all around. The "session lost" message is printed not only on connection resets, but also on keepalive timeouts. Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Unless we are in the process of setting up a client (i.e. connecting to the monitor cluster for the first time), apply a backoff: every time we want to reopen a session, increase our timeout by a multiple (currently 2); when we complete the connection, reduce that multipler by 50%. Mirrors ceph.git commit 794c86fd289bd62a35ed14368fa096c46736e9a2. Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Split ping interval and ping timeout: ping interval is 10s; keepalive timeout is 30s. Make monc_ping_timeout a constant while at it - it's not actually exported as a mount option (and the rest of tick-related settings won't be either), so it's got no place in ceph_options. Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Don't try to reconnect to the same monitor when we fail to establish a session within a timeout or it's lost. For that, pick_new_mon() needs to see the old value of cur_mon, so don't clear it in __close_session() - all calls to __close_session() but one are followed by __open_session() anyway. __open_session() is only called when a new session needs to be established, so the "already open?" branch, which is now in the way, is simply dropped. Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
It is currently hard-coded in the mon_client that mdsmap and monmap subs are continuous, while osdmap sub is always "onetime". To better handle full clusters/pools in the osd_client, we need to be able to issue continuous osdmap subs. Revamp subs code to allow us to specify for each sub whether it should be continuous or not. Although not strictly required for the above, switch to SUBSCRIBE2 protocol while at it, eliminating the ambiguity between a request for "every map since X" and a request for "just the latest" when we don't have a map yet (i.e. have epoch 0). SUBSCRIBE2 feature bit is now required - it's been supported since pre-argonaut (2010). Move "got mdsmap" call to the end of ceph_mdsc_handle_map() - calling in before we validate the epoch and successfully install the new map can mess up mon_client sub state. Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Coupling hunting state with subscribe state is not a good idea. Clear hunting when we complete the authentication handshake. Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
Ilya Dryomov authored
Our debugfs dir name is a concatenation of cluster fsid and client unique ID ("global_id"). It used to be the case that we learned global_id first, nowadays we always learn fsid first - the monmap is sent before any auth replies are. ceph_debugfs_client_init() call in ceph_monc_handle_map() is therefore never executed and can be removed. Its counterpart in handle_auth_reply() doesn't really belong there either: having to do monc->client and unlocking early to work around lockdep is a testament to that. Move it into __ceph_open_session(), where it can be called unconditionally. Signed-off-by:
Ilya Dryomov <idryomov@gmail.com>
-
- Mar 24, 2016
-
-
Haishuang Yan authored
As ping_v6_sendmsg is used only in this file, making it static The body of "pingv6_prot" and "pingv6_protosw" were moved at the middle of the file, to avoid having to declare some static prototypes. Signed-off-by:
Haishuang Yan <yanhaishuang@cmss.chinamobile.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 23, 2016
-
-
Alexander Duyck authored
This patch corrects an oversight in which we were allowing the encap_level value to pass from the outer headers to the inner headers. As a result we were incorrectly identifying UDP or GRE tunnels as also making use of ipip or sit when the second header actually represented a tunnel encapsulated in either a UDP or GRE tunnel which already had the features masked. Fixes: 76443456 ("net: Move GSO csum into SKB_GSO_CB") Reported-by:
Tom Herbert <tom@herbertland.com> Signed-off-by:
Alexander Duyck <aduyck@mirantis.com> Acked-by:
Tom Herbert <tom@herbertland.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 22, 2016
-
-
Andy Lutomirski authored
The code wants to prevent compat code from receiving messages. Use in_compat_syscall for this. Signed-off-by:
Andy Lutomirski <luto@kernel.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Andy Lutomirski authored
SCTP unfortunately has a different ABI for SCTP_SOCKOPT_CONNECTX3 for 32-bit and 64-bit callers. Use in_compat_syscall to correctly distinguish them on all architectures. Signed-off-by:
Andy Lutomirski <luto@kernel.org> Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: David Miller <davem@davemloft.net> Signed-off-by:
Andrew Morton <akpm@linux-foundation.org> Signed-off-by:
Linus Torvalds <torvalds@linux-foundation.org>
-
Claudio Imbrenda authored
When a thread is prepared for waiting by calling prepare_to_wait, sleeping is not allowed until either the wait has taken place or finish_wait has been called. The existing code in af_vsock imposed unnecessary no-sleep assumptions to a broad list of backend functions. This patch shrinks the influence of prepare_to_wait to the area where it is strictly needed, therefore relaxing the no-sleep restriction there. Signed-off-by:
Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Claudio Imbrenda authored
This reverts commit 59888180 ("vsock: Fix blocking ops call in prepare_to_wait") The commit reverted with this patch caused us to potentially miss wakeups. Since the condition is not checked between the prepare_to_wait and the schedule(), if a wakeup happens after the condition is checked but before the sleep happens, we will miss it. ( A description of the problem can be found here: http://www.makelinux.net/ldd3/chp-6-sect-2 ). By reverting the patch, the behaviour is still incorrect (since we shouldn't sleep between the prepare_to_wait and the schedule) but at least it will not miss wakeups. The next patch in the series actually fixes the behaviour. Signed-off-by:
Claudio Imbrenda <imbrenda@linux.vnet.ibm.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Lance Richardson authored
Field fl4.flowi4_flags is not initialized in fib_compute_spec_dst() before calling fib_lookup(), which means fib_table_lookup() is using non-deterministic data at this line: if (!(flp->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF)) { Fix by initializing the entire fl4 structure, which will prevent similar issues as fields are added in the future by ensuring that all fields are initialized to zero unless explicitly initialized to another value. Fixes: 58189ca7 ("net: Fix vti use case with oif in dst lookups") Suggested-by:
David Ahern <dsa@cumulusnetworks.com> Signed-off-by:
Lance Richardson <lrichard@redhat.com> Acked-by:
David Ahern <dsa@cumulusnetworks.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Paolo Abeni authored
Currently, ingress ipv4 broadcast datagrams are dropped since, in udp_v4_early_demux(), ip_check_mc_rcu() is invoked even on bcast packets. This patch addresses the issue, invoking ip_check_mc_rcu() only for mcast packets. Fixes: 6e540309 ("ipv4/udp: Verify multicast group is ours in upd_v4_early_demux()") Signed-off-by:
Paolo Abeni <pabeni@redhat.com> Acked-by:
Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
David Decotigny authored
By returning -ENOIOCTLCMD, sock_do_ioctl() falls back to calling dev_ioctl(), which provides support for NIC driver ioctls, which includes ethtool support. This is similar to the way ioctls are handled in udp.c or tcp.c. This removes the requirement that ethtool for example be tied to the support of a specific L3 protocol (ethtool uses an AF_INET socket today). Signed-off-by:
David Decotigny <decot@googlers.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Deepa Dinamani authored
The millisecond timestamps returned by the function is converted to network byte order by making a call to htons(). htons() only returns __be16 while __be32 is required here. This was identified by the sparse warning from the buildbot: net/ipv4/af_inet.c:1405:16: sparse: incorrect type in return expression (different base types) net/ipv4/af_inet.c:1405:16: expected restricted __be32 net/ipv4/af_inet.c:1405:16: got restricted __be16 [usertype] <noident> Change the function to use htonl() to return the correct __be32 type instead so that the millisecond value doesn't get truncated. Signed-off-by:
Deepa Dinamani <deepa.kernel@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: James Morris <jmorris@namei.org> Cc: Patrick McHardy <kaber@trash.net> Cc: Arnd Bergmann <arnd@arndb.de> Fixes: 822c8685 ("net: ipv4: Convert IP network timestamps to be y2038 safe") Reported-by: Fengguang Wu <fengguang.wu@intel.com> [0-day test robot] Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Dave Jones authored
commit 911362c7 ("net: add dst_cache support") added a new kconfig option that gets selected by other networking options. It seems the intent wasn't to offer this as a user-selectable option given the lack of help text, so this patch converts it to a silent option. Signed-off-by:
Dave Jones <davej@codemonkey.org.uk> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 21, 2016
-
-
Eli Cohen authored
Add two new NLAs to support configuration of Infiniband node or port GUIDs. New applications can choose to use this interface to configure GUIDs with iproute2 with commands such as: ip link set dev ib0 vf 0 node_guid 00:02:c9:03:00:21:6e:70 ip link set dev ib0 vf 0 port_guid 00:02:c9:03:00:21:6e:78 A new ndo, ndo_sef_vf_guid is introduced to notify the net device of the request to change the GUID. Signed-off-by:
Eli Cohen <eli@mellanox.com> Reviewed-by:
Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by:
Doug Ledford <dledford@redhat.com>
-
Eric Dumazet authored
It can be useful to lower max_gso_segs on NIC with very low number of TX descriptors like bcmgenet. However, this is defeated by bridge since it does not propagate the lower value of max_gso_segs and max_gso_size. Signed-off-by:
Eric Dumazet <edumazet@google.com> Cc: Petri Gynther <pgynther@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
It can be useful to report dev->gso_max_segs and dev->gso_max_size so that "ip -d link" can display them to help debugging. For the moment, these attributes are read-only. Signed-off-by:
Eric Dumazet <edumazet@google.com> Cc: Petri Gynther <pgynther@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Luis de Bethencourt authored
When the function dev_get_phys_port_name was added it missed a description for it's len argument. Adding it. Fixes: db24a904 ("net: add support for phys_port_name") Signed-off-by:
Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 20, 2016
-
-
Luis de Bethencourt authored
Commit 22e0f8b9 ("net: sched: make bstats per cpu and estimator RCU safe") added the argument cpu_bstats to functions gen_new_estimator and gen_replace_estimator and now the descriptions of these are missing for the documentation. Adding them. Signed-off-by:
Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Luis de Bethencourt authored
Function gnet_stats_copy_basic is missing the description of the cpu argument in the documentation. Adding it. Signed-off-by:
Luis de Bethencourt <luisbg@osg.samsung.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jesse Gross authored
If a packet is either locally encapsulated or processed through GRO it is marked with the offloads that it requires. However, when it is decapsulated these tunnel offload indications are not removed. This means that if we receive an encapsulated TCP packet, aggregate it with GRO, decapsulate, and retransmit the resulting frame on a NIC that does not support encapsulation, we won't be able to take advantage of hardware offloads even though it is just a simple TCP packet at this point. This fixes the problem by stripping off encapsulation offload indications when packets are decapsulated. The performance impacts of this bug are significant. In a test where a Geneve encapsulated TCP stream is sent to a hypervisor, GRO'ed, decapsulated, and bridged to a VM performance is improved by 60% (5Gbps->8Gbps) as a result of avoiding unnecessary segmentation at the VM tap interface. Reported-by:
Ramu Ramamurthy <sramamur@linux.vnet.ibm.com> Fixes: 68c33163 ("v4 GRE: Add TCP segmentation offload for GRE") Signed-off-by:
Jesse Gross <jesse@kernel.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jesse Gross authored
When drivers express support for TSO of encapsulated packets, they only mean that they can do it for one layer of encapsulation. Supporting additional levels would mean updating, at a minimum, more IP length fields and they are unaware of this. No encapsulation device expresses support for handling offloaded encapsulated packets, so we won't generate these types of frames in the transmit path. However, GRO doesn't have a check for multiple levels of encapsulation and will attempt to build them. UDP tunnel GRO actually does prevent this situation but it only handles multiple UDP tunnels stacked on top of each other. This generalizes that solution to prevent any kind of tunnel stacking that would cause problems. Fixes: bf5a755f ("net-gre-gro: Add GRE support to the GRO stack") Signed-off-by:
Jesse Gross <jesse@kernel.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jesse Gross authored
ipip encapsulated packets can be merged together by GRO but the result does not have the proper GSO type set or even marked as being encapsulated at all. Later retransmission of these packets will likely fail if the device does not support ipip offloads. This is similar to the issue resolved in IPv6 sit in feec0cb3 ("ipv6: gro: support sit protocol"). Reported-by:
Patrick Boutilier <boutilpj@ednet.ns.ca> Fixes: 9667e9bb ("ipip: Add gro callbacks to ipip offload") Tested-by:
Patrick Boutilier <boutilpj@ednet.ns.ca> Acked-by:
Eric Dumazet <edumazet@google.com> Signed-off-by:
Jesse Gross <jesse@kernel.org> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Marcelo Ricardo Leitner authored
SCTP is a protocol that is aligned to a word (4 bytes). Thus using bare MTU can sometimes return values that are not aligned, like for loopback, which is 65536 but ipv4_mtu() limits that to 65535. This mis-alignment will cause the last non-aligned bytes to never be used and can cause issues with congestion control. So it's better to just consider a lower MTU and keep congestion control calcs saner as they are based on PMTU. Same applies to icmp frag needed messages, which is also fixed by this patch. One other effect of this is the inability to send MTU-sized packet without queueing or fragmentation and without hitting Nagle. As the check performed at sctp_packet_can_append_data(): if (chunk->skb->len + q->out_qlen >= transport->pathmtu - packet->overhead) /* Enough data queued to fill a packet */ return SCTP_XMIT_OK; with the above example of MTU, if there are no other messages queued, one cannot send a packet that just fits one packet (65532 bytes) and without causing DATA chunk fragmentation or a delay. v2: - Added WORD_TRUNC macro Signed-off-by:
Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Marcelo Ricardo Leitner authored
Currently, if a chunk is scheduled to be sent through a transport that is currently unconfirmed, it will be leaked as it is dequeued from outq and is not re-queued nor freed. As I'm not aware of any situation that may lead to this situation, I'm fixing this by freeing the chunk and also logging a trace so that we can fix the other bug if it ever happens. Signed-off-by:
Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Marcelo Ricardo Leitner authored
The SACK can be lost pretty much elsewhere, but if its allocation fail, we know we are not sending it, so it is better to revert a_rwnd to its previous value as this may give it a chance to issue a window update later. Signed-off-by:
Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-