summaryrefslogtreecommitdiff
path: root/net/ipv6
AgeCommit message (Collapse)Author
2014-08-22net: ipv6: fib: don't sleep inside atomic lockBenjamin Block
The function fib6_commit_metrics() allocates a piece of memory in mode GFP_KERNEL while holding an atomic lock from higher up in the stack, in the function __ip6_ins_rt(). This produces the following BUG: > BUG: sleeping function called from invalid context at mm/slub.c:1250 > in_atomic(): 1, irqs_disabled(): 0, pid: 2909, name: dhcpcd > 2 locks held by dhcpcd/2909: > #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff81978e67>] rtnl_lock+0x17/0x20 > #1: (&tb->tb6_lock){++--+.}, at: [<ffffffff81a6951a>] ip6_route_add+0x65a/0x800 > CPU: 1 PID: 2909 Comm: dhcpcd Not tainted 3.17.0-rc1 #1 > Hardware name: ASUS All Series/Q87T, BIOS 0216 10/16/2013 > 0000000000000008 ffff8800c8f13858 ffffffff81af135a 0000000000000000 > ffff880212202430 ffff8800c8f13878 ffffffff810f8d3a ffff880212202c98 > 0000000000000010 ffff8800c8f138c8 ffffffff8121ad0e 0000000000000001 > Call Trace: > [<ffffffff81af135a>] dump_stack+0x4e/0x68 > [<ffffffff810f8d3a>] __might_sleep+0x10a/0x120 > [<ffffffff8121ad0e>] kmem_cache_alloc_trace+0x4e/0x190 > [<ffffffff81a6bcd6>] ? fib6_commit_metrics+0x66/0x110 > [<ffffffff81a6bcd6>] fib6_commit_metrics+0x66/0x110 > [<ffffffff81a6cbf3>] fib6_add+0x883/0xa80 > [<ffffffff81a6951a>] ? ip6_route_add+0x65a/0x800 > [<ffffffff81a69535>] ip6_route_add+0x675/0x800 > [<ffffffff81a68f2a>] ? ip6_route_add+0x6a/0x800 > [<ffffffff81a6990c>] inet6_rtm_newroute+0x5c/0x80 > [<ffffffff8197cf01>] rtnetlink_rcv_msg+0x211/0x260 > [<ffffffff81978e67>] ? rtnl_lock+0x17/0x20 > [<ffffffff81119708>] ? lock_release_holdtime+0x28/0x180 > [<ffffffff81978e67>] ? rtnl_lock+0x17/0x20 > [<ffffffff8197ccf0>] ? __rtnl_unlock+0x20/0x20 > [<ffffffff819a989e>] netlink_rcv_skb+0x6e/0xd0 > [<ffffffff81978ee5>] rtnetlink_rcv+0x25/0x40 > [<ffffffff819a8e59>] netlink_unicast+0xd9/0x180 > [<ffffffff819a9600>] netlink_sendmsg+0x700/0x770 > [<ffffffff81103735>] ? local_clock+0x25/0x30 > [<ffffffff8194e83c>] sock_sendmsg+0x6c/0x90 > [<ffffffff811f98e3>] ? might_fault+0xa3/0xb0 > [<ffffffff8195ca6d>] ? verify_iovec+0x7d/0xf0 > [<ffffffff8194ec3e>] ___sys_sendmsg+0x37e/0x3b0 > [<ffffffff8111ef15>] ? trace_hardirqs_on_caller+0x185/0x220 > [<ffffffff81af979e>] ? mutex_unlock+0xe/0x10 > [<ffffffff819a55ec>] ? netlink_insert+0xbc/0xe0 > [<ffffffff819a65e5>] ? netlink_autobind.isra.30+0x125/0x150 > [<ffffffff819a6520>] ? netlink_autobind.isra.30+0x60/0x150 > [<ffffffff819a84f9>] ? netlink_bind+0x159/0x230 > [<ffffffff811f989a>] ? might_fault+0x5a/0xb0 > [<ffffffff8194f25e>] ? SYSC_bind+0x7e/0xd0 > [<ffffffff8194f8cd>] __sys_sendmsg+0x4d/0x80 > [<ffffffff8194f912>] SyS_sendmsg+0x12/0x20 > [<ffffffff81afc692>] system_call_fastpath+0x16/0x1b Fixing this by replacing the mode GFP_KERNEL with GFP_ATOMIC. Signed-off-by: Benjamin Block <bebl@mageta.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-14tcp: fix tcp_release_cb() to dispatch via address family for mtu_reduced()Neal Cardwell
Make sure we use the correct address-family-specific function for handling MTU reductions from within tcp_release_cb(). Previously AF_INET6 sockets were incorrectly always using the IPv6 code path when sometimes they were handling IPv4 traffic and thus had an IPv4 dst. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Diagnosed-by: Willem de Bruijn <willemb@google.com> Fixes: 563d34d057862 ("tcp: dont drop MTU reduction indications") Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-14sit: Fix ipip6_tunnel_lookup device matching criteriaShmulik Ladkani
As of 4fddbf5d78 ("sit: strictly restrict incoming traffic to tunnel link device"), when looking up a tunnel, tunnel's underlying interface (t->parms.link) is verified to match incoming traffic's ingress device. However the comparison was incorrectly based on skb->dev->iflink. Instead, dev->ifindex should be used, which correctly represents the interface from which the IP stack hands the ipip6 packets. This allows setting up sit tunnels bound to vlan interfaces (otherwise incoming ipip6 traffic on the vlan interface was dropped due to ipip6_tunnel_lookup match failure). Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06Merge branch 'akpm' (patchbomb from Andrew Morton)Linus Torvalds
Merge incoming from Andrew Morton: - Various misc things. - arch/sh updates. - Part of ocfs2. Review is slow. - Slab updates. - Most of -mm. - printk updates. - lib/ updates. - checkpatch updates. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (226 commits) checkpatch: update $declaration_macros, add uninitialized_var checkpatch: warn on missing spaces in broken up quoted checkpatch: fix false positives for --strict "space after cast" test checkpatch: fix false positive MISSING_BREAK warnings with --file checkpatch: add test for native c90 types in unusual order checkpatch: add signed generic types checkpatch: add short int to c variable types checkpatch: add for_each tests to indentation and brace tests checkpatch: fix brace style misuses of else and while checkpatch: add --fix option for a couple OPEN_BRACE misuses checkpatch: use the correct indentation for which() checkpatch: add fix_insert_line and fix_delete_line helpers checkpatch: add ability to insert and delete lines to patch/file checkpatch: add an index variable for fixed lines checkpatch: warn on break after goto or return with same tab indentation checkpatch: emit a warning on file add/move/delete checkpatch: add test for commit id formatting style in commit log checkpatch: emit fewer kmalloc_array/kcalloc conversion warnings checkpatch: improve "no space after cast" test checkpatch: allow multiple const * types ...
2014-08-06list: fix order of arguments for hlist_add_after(_rcu)Ken Helias
All other add functions for lists have the new item as first argument and the position where it is added as second argument. This was changed for no good reason in this function and makes using it unnecessary confusing. The name was changed to hlist_add_behind() to cause unconverted code to generate a compile error instead of using the wrong parameter order. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Ken Helias <kenhelias@firemail.de> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> [intel driver bits] Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-06tcp: md5: check md5 signature without socket lockDmitry Popov
Since a8afca032 (tcp: md5: protects md5sig_info with RCU) tcp_md5_do_lookup doesn't require socket lock, rcu_read_lock is enough. Therefore socket lock is no longer required for tcp_v{4,6}_inbound_md5_hash too, so we can move these calls (wrapped with rcu_read_{,un}lock) before bh_lock_sock: from tcp_v{4,6}_do_rcv to tcp_v{4,6}_rcv. Signed-off-by: Dmitry Popov <ixaphire@qrator.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/Makefile net/ipv6/sysctl_net_ipv6.c Two ipv6_table_template[] additions overlap, so the index of the ipv6_table[x] assignments needed to be adjusted. In the drivers/net/Makefile case, we've gotten rid of the garbage whereby we had to list every single USB networking driver in the top-level Makefile, there is just one "USB_NETWORKING" that guards everything. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-05net-timestamp: add key to disambiguate concurrent datagramsWillem de Bruijn
Datagrams timestamped on transmission can coexist in the kernel stack and be reordered in packet scheduling. When reading looped datagrams from the socket error queue it is not always possible to unique correlate looped data with original send() call (for application level retransmits). Even if possible, it may be expensive and complex, requiring packet inspection. Introduce a data-independent ID mechanism to associate timestamps with send calls. Pass an ID alongside the timestamp in field ee_data of sock_extended_err. The ID is a simple 32 bit unsigned int that is associated with the socket and incremented on each send() call for which software tx timestamp generation is enabled. The feature is enabled only if SOF_TIMESTAMPING_OPT_ID is set, to avoid changing ee_data for existing applications that expect it 0. The counter is reset each time the flag is reenabled. Reenabling does not change the ID of already submitted data. It is possible to receive out of order IDs if the timestamp stream is not quiesced first. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02ipv6: data of fwmark_reflect sysctl needs to be updated on netns constructionHannes Frederic Sowa
Fixes: e110861f86094cd ("net: add a sysctl to reflect the fwmark on replies") Cc: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02inet: frags: use kmem_cache for inet_frag_queueNikolay Aleksandrov
Use kmem_cache to allocate/free inet_frag_queue objects since they're all the same size per inet_frags user and are alloced/freed in high volumes thus making it a perfect case for kmem_cache. Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02inet: frags: use INET_FRAG_EVICTED to prevent icmp messagesNikolay Aleksandrov
Now that we have INET_FRAG_EVICTED we might as well use it to stop sending icmp messages in the "frag_expire" functions instead of stripping INET_FRAG_FIRST_IN from their flags when evicting. Also fix the comment style in ip6_expire_frag_queue(). Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Reviewed-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02inet: frags: rename last_in to flagsNikolay Aleksandrov
The last_in field has been used to store various flags different from first/last frag in so give it a more descriptive name: flags. Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02inet: frags: use INC_STATS_BH in the ipv6 reassembly codeNikolay Aleksandrov
Softirqs are already disabled so no need to do it again, thus let's be consistent and use the IP6_INC_STATS_BH variant. Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31net: use inet6_iif instead of IP6CB()->iifDuan Jiong
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-31net: fix the counter ICMP_MIB_INERRORS/ICMP6_MIB_INERRORSDuan Jiong
When dealing with ICMPv[46] Error Message, function icmp_socket_deliver() and icmpv6_notify() do some valid checks on packet's length, but then some protocols check packet's length redaudantly. So remove those duplicated statements, and increase counter ICMP_MIB_INERRORS/ICMP6_MIB_INERRORS in function icmp_socket_deliver() and icmpv6_notify() respectively. In addition, add missed counter in udp6/udplite6 when socket is NULL. Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-30Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2014-07-30 This is the last pull request for ipsec-next before I'll be off for two weeks starting on friday. David, can you please take urgent ipsec patches directly into net/net-next during this time? 1) Error handling simplifications for vti and vti6. From Mathias Krause. 2) Remove a duplicate semicolon after a return statement. From Christoph Paasch. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-29ipv6: fail early when creating netdev named all or defaultWANG Cong
We create a proc dir for each network device, this will cause conflicts when the devices have name "all" or "default". Rather than emitting an ugly kernel warning, we could just fail earlier by checking the device name. Reported-by: Stephane Chazelas <stephane.chazelas@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-28ip: make IP identifiers less predictableEric Dumazet
In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and Jedidiah describe ways exploiting linux IP identifier generation to infer whether two machines are exchanging packets. With commit 73f156a6e8c1 ("inetpeer: get rid of ip_id_count"), we changed IP id generation, but this does not really prevent this side-channel technique. This patch adds a random amount of perturbation so that IP identifiers for a given destination [1] are no longer monotonically increasing after an idle period. Note that prandom_u32_max(1) returns 0, so if generator is used at most once per jiffy, this patch inserts no hole in the ID suite and do not increase collision probability. This is jiffies based, so in the worst case (HZ=1000), the id can rollover after ~65 seconds of idle time, which should be fine. We also change the hash used in __ip_select_ident() to not only hash on daddr, but also saddr and protocol, so that ICMP probes can not be used to infer information for other protocols. For IPv6, adds saddr into the hash as well, but not nexthdr. If I ping the patched target, we can see ID are now hard to predict. 21:57:11.008086 IP (...) A > target: ICMP echo request, seq 1, length 64 21:57:11.010752 IP (... id 2081 ...) target > A: ICMP echo reply, seq 1, length 64 21:57:12.013133 IP (...) A > target: ICMP echo request, seq 2, length 64 21:57:12.015737 IP (... id 3039 ...) target > A: ICMP echo reply, seq 2, length 64 21:57:13.016580 IP (...) A > target: ICMP echo request, seq 3, length 64 21:57:13.019251 IP (... id 3437 ...) target > A: ICMP echo reply, seq 3, length 64 [1] TCP sessions uses a per flow ID generator not changed by this patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jeffrey Knockel <jeffk@cs.unm.edu> Reported-by: Jedidiah R. Crandall <crandall@cs.unm.edu> Cc: Willy Tarreau <w@1wt.eu> Cc: Hannes Frederic Sowa <hannes@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27inet: frag: set limits and make init_net's high_thresh limit globalNikolay Aleksandrov
This patch makes init_net's high_thresh limit to be the maximum for all namespaces, thus introducing a global memory limit threshold equal to the sum of the individual high_thresh limits which are capped. It also introduces some sane minimums for low_thresh as it shouldn't be able to drop below 0 (or > high_thresh in the unsigned case), and overall low_thresh should not ever be above high_thresh, so we make the following relations for a namespace: init_net: high_thresh - max(not capped), min(init_net low_thresh) low_thresh - max(init_net high_thresh), min (0) all other namespaces: high_thresh = max(init_net high_thresh), min(namespace's low_thresh) low_thresh = max(namespace's high_thresh), min(0) The major issue with having low_thresh > high_thresh is that we'll schedule eviction but never evict anything and thus rely only on the timers. Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27inet: frag: use seqlock for hash rebuildFlorian Westphal
rehash is rare operation, don't force readers to take the read-side rwlock. Instead, we only have to detect the (rare) case where the secret was altered while we are trying to insert a new inetfrag queue into the table. If it was changed, drop the bucket lock and recompute the hash to get the 'new' chain bucket that we have to insert into. Joint work with Nikolay Aleksandrov. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27inet: frag: remove periodic secret rebuild timerFlorian Westphal
merge functionality into the eviction workqueue. Instead of rebuilding every n seconds, take advantage of the upper hash chain length limit. If we hit it, mark table for rebuild and schedule workqueue. To prevent frequent rebuilds when we're completely overloaded, don't rebuild more than once every 5 seconds. ipfrag_secret_interval sysctl is now obsolete and has been marked as deprecated, it still can be changed so scripts won't be broken but it won't have any effect. A comment is left above each unused secret_timer variable to avoid confusion. Joint work with Nikolay Aleksandrov. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27inet: frag: remove lru listFlorian Westphal
no longer used. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27inet: frag: don't account number of fragment queuesFlorian Westphal
The 'nqueues' counter is protected by the lru list lock, once thats removed this needs to be converted to atomic counter. Given this isn't used for anything except for reporting it to userspace via /proc, just remove it. We still report the memory currently used by fragment reassembly queues. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27inet: frag: move eviction of queues to work queueFlorian Westphal
When the high_thresh limit is reached we try to toss the 'oldest' incomplete fragment queues until memory limits are below the low_thresh value. This happens in softirq/packet processing context. This has two drawbacks: 1) processors might evict a queue that was about to be completed by another cpu, because they will compete wrt. resource usage and resource reclaim. 2) LRU list maintenance is expensive. But when constantly overloaded, even the 'least recently used' element is recent, so removing 'lru' queue first is not 'fairer' than removing any other fragment queue. This moves eviction out of the fast path: When the low threshold is reached, a work queue is scheduled which then iterates over the table and removes the queues that exceed the memory limits of the namespace. It sets a new flag called INET_FRAG_EVICTED on the evicted queues so the proper counters will get incremented when the queue is forcefully expired. When the high threshold is reached, no more fragment queues are created until we're below the limit again. The LRU list is now unused and will be removed in a followup patch. Joint work with Nikolay Aleksandrov. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27inet: frag: move evictor calls into frag_find functionFlorian Westphal
First step to move eviction handling into a work queue. We lose two spots that accounted evicted fragments in MIB counters. Accounting will be restored since the upcoming work-queue evictor invokes the frag queue timer callbacks instead. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27inet: frag: remove hash size assumptions from callersFlorian Westphal
hide actual hash size from individual users: The _find function will now fold the given hash value into the required range. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27inet: frag: constify match, hashfn and constructor argumentsFlorian Westphal
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-24ipv6: remove obsolete comment in ip6_append_data()Li RongQing
After 11878b40e[net-timestamp: SOCK_RAW and PING timestamping], this comment becomes obsolete since the codes check not only UDP socket, but also RAW sock; and the codes are clear, not need the comments Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-24neigh: remove exceptional & on function nameHimangi Saraogi
In this file, function names are otherwise used as pointers without &. A simplified version of the Coccinelle semantic patch that makes this change is as follows: // <smpl> @r@ identifier f; @@ f(...) { ... } @@ identifier r.f; @@ - &f + f // </smpl> Signed-off-by: Himangi Saraogi <himangi774@gmail.com> Acked-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-23sock: remove skb argument from sk_rcvqueues_fullSorin Dumitru
It hasn't been used since commit 0fd7bac(net: relax rcvbuf limits). Signed-off-by: Sorin Dumitru <sorin@returnze.ro> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/infiniband/hw/cxgb4/device.c The cxgb4 conflict was simply overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains updates for your net-next tree, they are: 1) Use kvfree() helper function from x_tables, from Eric Dumazet. 2) Remove extra timer from the conntrack ecache extension, use a workqueue instead to redeliver lost events to userspace instead, from Florian Westphal. 3) Removal of the ulog targets for ebtables and iptables. The nflog infrastructure superseded this almost 9 years ago, time to get rid of this code. 4) Replace the list of loggers by an array now that we can only have two possible non-overlapping logger flavours, ie. kernel ring buffer and netlink logging. 5) Move Eric Dumazet's log buffer code to nf_log to reuse it from all of the supported per-family loggers. 6) Consolidate nf_log_packet() as an unified interface for packet logging. After this patch, if the struct nf_loginfo is available, it explicitly selects the logger that is used. 7) Move ip and ip6 logging code from xt_LOG to the corresponding per-family loggers. Thus, x_tables and nf_tables share the same code for packet logging. 8) Add generic ARP packet logger, which is used by nf_tables. The format aims to be consistent with the output of xt_LOG. 9) Add generic bridge packet logger. Again, this is used by nf_tables and it routes the packets to the real family loggers. As a result, we get consistent logging format for the bridge family. The ebt_log logging code has been intentionally left in place not to break backward compatibility since the logging output differs from xt_LOG. 10) Update nft_log to explicitly request the required family logger when needed. 11) Finish nft_log so it supports arp, ip, ip6, bridge and inet families. Allowing selection between netlink and kernel buffer ring logging. 12) Several fixes coming after the netfilter core logging changes spotted by robots. 13) Use IS_ENABLED() macros whenever possible in the netfilter tree, from Duan Jiong. 14) Removal of a couple of unnecessary branch before kfree, from Fabian Frederick. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-16udp: Use hash2 for long hash1 chains in __udp*_lib_mcast_deliver.David Held
Many multicast sources can have the same port which can result in a very large list when hashing by port only. Hash by address and port instead if this is the case. This makes multicast more similar to unicast. On a 24-core machine receiving from 500 multicast sockets on the same port, before this patch 80% of system CPU was used up by spin locking and only ~25% of packets were successfully delivered. With this patch, all packets are delivered and kernel overhead is ~8% system CPU on spinlocks. Signed-off-by: David Held <drheld@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-16udp: Simplify __udp*_lib_mcast_deliver.David Held
Switch to using sk_nulls_for_each which shortens the code and makes it easier to update. Signed-off-by: David Held <drheld@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-16net-gre-gro: Fix a bug that breaks the forwarding pathJerry Chu
Fixed a bug that was introduced by my GRE-GRO patch (bf5a755f5e9186406bbf50f4087100af5bd68e40 net-gre-gro: Add GRE support to the GRO stack) that breaks the forwarding path because various GSO related fields were not set. The bug will cause on the egress path either the GSO code to fail, or a GRE-TSO capable (NETIF_F_GSO_GRE) NICs to choke. The following fix has been tested for both cases. Signed-off-by: H.K. Jerry Chu <hkchu@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15net-timestamp: SOCK_RAW and PING timestampingWillem de Bruijn
Add SO_TIMESTAMPING to sockets of type PF_INET[6]/SOCK_RAW: Add the necessary sock_tx_timestamp calls to the datapath for RAW sockets (ping sockets already had these calls). Fix the IP output path to pass the timestamp flags on the first fragment also for these sockets. The existing code relies on transhdrlen != 0 to indicate a first fragment. For these sockets, that assumption does not hold. This fixes http://bugzilla.kernel.org/show_bug.cgi?id=77221 Tested SOCK_RAW on IPv4 and IPv6, not PING. Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15ipv6: remove unnecessary break after returnFabian Frederick
Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15netfilter: remove unnecessary break after returnFabian Frederick
Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15net: set name_assign_type in alloc_netdev()Tom Gundersen
Extend alloc_netdev{,_mq{,s}}() to take name_assign_type as argument, and convert all users to pass NET_NAME_UNKNOWN. Coccinelle patch: @@ expression sizeof_priv, name, setup, txqs, rxqs, count; @@ ( -alloc_netdev_mqs(sizeof_priv, name, setup, txqs, rxqs) +alloc_netdev_mqs(sizeof_priv, name, NET_NAME_UNKNOWN, setup, txqs, rxqs) | -alloc_netdev_mq(sizeof_priv, name, setup, count) +alloc_netdev_mq(sizeof_priv, name, NET_NAME_UNKNOWN, setup, count) | -alloc_netdev(sizeof_priv, name, setup) +alloc_netdev(sizeof_priv, name, NET_NAME_UNKNOWN, setup) ) v9: move comments here from the wrong commit Signed-off-by: Tom Gundersen <teg@jklm.no> Reviewed-by: David Herrmann <dh.herrmann@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-11ipv6: Use BUG_ONHimangi Saraogi
The semantic patch that makes this transformation is as follows: // <smpl> @@ expression e; @@ -if (e) BUG(); +BUG_ON(e); // </smpl> Signed-off-by: Himangi Saraogi <himangi774@gmail.com> Acked-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-11net: ipv6: Use BUG_ONHimangi Saraogi
The semantic patch that makes the transformation is as follows: // <smpl> @@ expression e; @@ -if (e) BUG(); +BUG_ON(e); // </smpl> Signed-off-by: Himangi Saraogi <himangi774@gmail.com> Acked-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-11ipv6: addrconf: implement address generation modesJiri Pirko
This patch introduces a possibility for userspace to set various (so far two) modes of generating addresses. This is useful for example for NetworkManager because it can set the mode to NONE and take care of link local addresses itself. That allow it to have the interface up, monitoring carrier but still don't have any addresses on it. One more use-case by Dan Williams: <quote> WWAN devices often have their LL address provided by the firmware of the device, which sometimes refuses to respond to incorrect LL addresses when doing DHCPv6 or IPv6 ND. The kernel cannot generate the correct LL address for two reasons: 1) WWAN pseudo-ethernet interfaces often construct a fake MAC address, or read a meaningless MAC address from the firmware. Thus the EUI64 and the IPv6LL address the kernel assigns will be wrong. The real LL address is often retrieved from the firmware with AT or proprietary commands. 2) WWAN PPP interfaces receive their LL address from IPV6CP, not from kernel assignments. Only after IPV6CP has completed do we know the LL address of the PPP interface and its peer. But the kernel has already assigned an incorrect LL address to the interface. So being able to suppress the kernel LL address generation and assign the one retrieved from the firmware is less complicated and more robust. </quote> Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-10ipv6: fix the check when handle RALi RongQing
d9333196572(ipv6: Allow accepting RA from local IP addresses.) made the wrong check, whether or not to accept RA with source-addr found on local machine, when accept_ra_from_local is 0. Fixes: d9333196572(ipv6: Allow accepting RA from local IP addresses.) Cc: Ben Greear <greearb@candelatech.com> Cc: Hannes Frederic Sowa <hannes@redhat.com> Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-07ipv6: Implement automatic flow label generation on transmitTom Herbert
Automatically generate flow labels for IPv6 packets on transmit. The flow label is computed based on skb_get_hash. The flow label will only automatically be set when it is zero otherwise (i.e. flow label manager hasn't set one). This supports the transmit side functionality of RFC 6438. Added an IPv6 sysctl auto_flowlabels to enable/disable this behavior system wide, and added IPV6_AUTOFLOWLABEL socket option to enable this functionality per socket. By default, auto flowlabels are disabled to avoid possible conflicts with flow label manager, however if this feature proves useful we may want to enable it by default. It should also be noted that FreeBSD has already implemented automatic flow labels (including the sysctl and socket option). In FreeBSD, automatic flow labels default to enabled. Performance impact: Running super_netperf with 200 flows for TCP_RR and UDP_RR for IPv6. Note that in UDP case, __skb_get_hash will be called for every packet with explains slight regression. In the TCP case the hash is saved in the socket so there is no regression. Automatic flow labels disabled: TCP_RR: 86.53% CPU utilization 127/195/322 90/95/99% latencies 1.40498e+06 tps UDP_RR: 90.70% CPU utilization 118/168/243 90/95/99% latencies 1.50309e+06 tps Automatic flow labels enabled: TCP_RR: 85.90% CPU utilization 128/199/337 90/95/99% latencies 1.40051e+06 UDP_RR 92.61% CPU utilization 115/164/236 90/95/99% latencies 1.4687e+06 Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-07net: Save TX flow hash in sock and set in skbuf on xmitTom Herbert
For a connected socket we can precompute the flow hash for setting in skb->hash on output. This is a performance advantage over calculating the skb->hash for every packet on the connection. The computation is done using the common hash algorithm to be consistent with computations done for packets of the connection in other states where thers is no socket (e.g. time-wait, syn-recv, syn-cookies). This patch adds sk_txhash to the sock structure. inet_set_txhash and ip6_set_txhash functions are added which are called from points in TCP and UDP where socket moves to established state. skb_set_hash_from_sk is a function which sets skb->hash from the sock txhash value. This is called in UDP and TCP transmit path when transmitting within the context of a socket. Tested: ran super_netperf with 200 TCP_RR streams over a vxlan interface (in this case skb_get_hash called on every TX packet to create a UDP source port). Before fix: 95.02% CPU utilization 154/256/505 90/95/99% latencies 1.13042e+06 tps Time in functions: 0.28% skb_flow_dissect 0.21% __skb_get_hash After fix: 94.95% CPU utilization 156/254/485 90/95/99% latencies 1.15447e+06 Neither __skb_get_hash nor skb_flow_dissect appear in perf Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-07tcp: switch snt_synack back to measuring transmit time of first SYNACKNeal Cardwell
Always store in snt_synack the time at which the server received the first client SYN and attempted to send the first SYNACK. Recent commit aa27fc501 ("tcp: tcp_v[46]_conn_request: fix snt_synack initialization") resolved an inconsistency between IPv4 and IPv6 in the initialization of snt_synack. This commit brings back the idea from 843f4a55e (tcp: use tcp_v4_send_synack on first SYN-ACK), which was going for the original behavior of snt_synack from the commit where it was added in 9ad7c049f0f79 ("tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side") in v3.1. In addition to being simpler (and probably a tiny bit faster), unconditionally storing the time of the first SYNACK attempt has been useful because it allows calculating a performance metric quantifying how long it took to establish a passive TCP connection. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Cc: Octavian Purdila <octavian.purdila@intel.com> Cc: Jerry Chu <hkchu@google.com> Acked-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-01inet: move ipv6only in sock_commonEric Dumazet
When an UDP application switches from AF_INET to AF_INET6 sockets, we have a small performance degradation for IPv4 communications because of extra cache line misses to access ipv6only information. This can also be noticed for TCP listeners, as ipv6_only_sock() is also used from __inet_lookup_listener()->compute_score() This is magnified when SO_REUSEPORT is used. Move ipv6only into struct sock_common so that it is available at no extra cost in lookups. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-01ipv6: Allow accepting RA from local IP addresses.Ben Greear
This can be used in virtual networking applications, and may have other uses as well. The option is disabled by default. A specific use case is setting up virtual routers, bridges, and hosts on a single OS without the use of network namespaces or virtual machines. With proper use of ip rules, routing tables, veth interface pairs and/or other virtual interfaces, and applications that can bind to interfaces and/or IP addresses, it is possibly to create one or more virtual routers with multiple hosts attached. The host interfaces can act as IPv6 systems, with radvd running on the ports in the virtual routers. With the option provided in this patch enabled, those hosts can now properly obtain IPv6 addresses from the radvd. Signed-off-by: Ben Greear <greearb@candelatech.com> Signed-off-by: David S. Miller <davem@davemloft.net>