path: root/net/netfilter/nf_conntrack_core.c
2018-04-13  netfilter: conntrack: don't call iter for non-confirmed conntracks  (Florian Westphal)
[ Upstream commit b0feacaad13a0aa9657c37ed80991575981e2e3b ]

nf_ct_iterate_cleanup_net currently calls the iter() callback also for
conntracks on the unconfirmed list, but this is unsafe. Accesses to the
nf_conn itself are fine, but some users access the extension area in
the iter() callback, and that only works reliably for confirmed
conntracks (ct->ext can be reallocated at any time for unconfirmed
conntracks).

The second issue is that there is a short window where a conntrack
entry is neither on the list nor in the table: to confirm an entry, it
is first removed from the unconfirmed list, then inserted into the
table.

Fix this by iterating the unconfirmed list first and marking all
entries as dying, then waiting for the rcu grace period. This makes
sure all entries that were about to be confirmed either are in the main
table, or will be dropped soon.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
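A minimal sketch of the fix's shape, assuming the per-cpu unconfirmed
lists this file uses (the helper name is illustrative, not the patch's):

	static void mark_unconfirmed_dying(struct net *net)
	{
		int cpu;

		for_each_possible_cpu(cpu) {
			struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu);
			struct nf_conntrack_tuple_hash *h;
			struct hlist_nulls_node *n;

			spin_lock_bh(&pcpu->lock);
			hlist_nulls_for_each_entry(h, n, &pcpu->unconfirmed, hnnode)
				set_bit(IPS_DYING_BIT,
					&nf_ct_tuplehash_to_ctrack(h)->status);
			spin_unlock_bh(&pcpu->lock);
		}

		/* in-flight confirms are now either in the main table or doomed */
		synchronize_net();
	}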
2017-11-18  netfilter: nat: avoid use of nf_conn_nat extension  (Florian Westphal)
commit 6e699867f84c0f358fed233fe6162173aca28e04 upstream.

A successful insert into the bysource hash sets the IPS_SRC_NAT_DONE
status bit, so we can check that instead of the presence of the nat
extension, which requires an extra dereference.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
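In code, the cheaper test boils down to a status-bit check on the
nf_conn itself, something like:

	/* IPS_SRC_NAT_DONE_BIT lives in ct->status: no ct->ext deref needed */
	if (test_bit(IPS_SRC_NAT_DONE_BIT, &ct->status)) {
		/* entry was inserted into the bysource hash */
	}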
2017-09-27  net/netfilter/nf_conntrack_core: Fix net_conntrack_lock()  (Manfred Spraul)
commit 3ef0c7a730de0bae03d86c19570af764fa3c4445 upstream.

As we want to remove spin_unlock_wait() and replace it with explicit
spin_lock()/spin_unlock() calls, we can use this to simplify the
locking.

In addition:
- Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
- The new code avoids the backwards loop.

Only slightly tested, I did not manage to trigger calls to
nf_conntrack_all_lock().

V2: With improved comments, to clearly show how the barriers pair.

Fixes: b16c29191dc8 ("netfilter: nf_conntrack: use safer way to lock all buckets")
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: netfilter-devel@vger.kernel.org
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
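A sketch of how the global/local pairing works after this change
(simplified from the upstream code; lock annotations omitted):

	void nf_conntrack_lock(spinlock_t *lock)
	{
		spin_lock(lock);

		/* ACQUIRE pairs with the store in nf_conntrack_all_lock() */
		while (unlikely(smp_load_acquire(&nf_conntrack_locks_all))) {
			spin_unlock(lock);
			spin_lock(&nf_conntrack_locks_all_lock);
			spin_unlock(&nf_conntrack_locks_all_lock);
			spin_lock(lock);
		}
	}

	static void nf_conntrack_all_lock(void)
	{
		int i;

		spin_lock(&nf_conntrack_locks_all_lock);
		nf_conntrack_locks_all = true;

		for (i = 0; i < CONNTRACK_LOCKS; i++) {
			spin_lock(&nf_conntrack_locks[i]);
			/* drains any current holder of bucket lock i */
			spin_unlock(&nf_conntrack_locks[i]);
		}
	}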
2017-03-12  netfilter: conntrack: refine gc worker heuristics, redux  (Florian Westphal)
commit e5072053b09642b8ff417d47da05b84720aea3ee upstream.

This further refines the changes made to the conntrack gc_worker in
commit e0df8cae6c16 ("netfilter: conntrack: refine gc worker
heuristics").

The main idea of that change was to reduce the scan interval when
evictions take place. However, on the reporters' setup, there are 1-2
million conntrack entries in total and roughly 8k new (and closing)
connections per second. In this case we'll always evict at least one
entry per gc cycle and the scan interval is always at 1 jiffy because
of this test:

	} else if (expired_count) {
		gc_work->next_gc_run /= 2U;
		next_run = msecs_to_jiffies(1);

being true almost all the time.

Given we scan ~10k entries per run, it's clearly wrong to reduce the
interval based on a nonzero eviction count; it only wastes cpu cycles
since the vast majority of conntracks are not timed out. Thus only look
at the ratio (scanned entries vs. evicted entries) to decide whether to
reduce or not. Because the evictor is supposed to only kick in when the
system turns idle after a busy period, pick a high ratio -- this makes
it 50%. We thus keep the idea of increasing the scan rate when it's
likely that the table contains many expired entries.

In order to not let timed-out entries hang around for too long
(important when using event logging, in which case we want to destroy
events in a timely fashion), we now scan the full table within at most
GC_MAX_SCAN_JIFFIES (16 seconds) even in the worst-case scenario where
all timed-out entries sit in the same slot.

I tested this with a vm under synflood (with sysctl
net.netfilter.nf_conntrack_tcp_timeout_syn_recv=3). While the flood is
ongoing, the interval now stays at its max rate (GC_MAX_SCAN_JIFFIES /
GC_MAX_BUCKETS_DIV -> 125ms).

With feedback from Nicolas Dichtel.

Reported-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Fixes: b87a2f9199ea82eaadc ("netfilter: conntrack: add gc worker to remove timed-out entries")
Signed-off-by: Florian Westphal <fw@strlen.de>
Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Denys Fedoryshchenko <nuclearcat@nuclearcat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
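The ratio-based decision described above looks roughly like this (a
sketch; constant names follow the message, exact values are the
patch's):

	ratio = scanned ? expired_count * 100 / scanned : 0;
	if (ratio > GC_EVICT_RATIO) {		/* 50, per the description */
		gc_work->next_gc_run = min_interval;	/* keep scanning fast */
	} else {
		unsigned int max = GC_MAX_SCAN_JIFFIES / GC_MAX_BUCKETS_DIV;

		/* back off, but never slower than a full-table scan
		 * every GC_MAX_SCAN_JIFFIES */
		gc_work->next_gc_run += min_interval;
		if (gc_work->next_gc_run > max)
			gc_work->next_gc_run = max;
	}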
2017-03-12  netfilter: conntrack: remove GC_MAX_EVICTS break  (Florian Westphal)
commit 524b698db06b9b6da7192e749f637904e2f62d7b upstream.

Instead of breaking the loop and instantly rescheduling, don't bother
checking this in the first place (the loop calls cond_resched for every
bucket anyway).

Suggested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-08  netfilter: conntrack: refine gc worker heuristics  (Florian Westphal)
Nicolas Dichtel says: After commit b87a2f9199ea ("netfilter: conntrack:
add gc worker to remove timed-out entries"), netlink conntrack deletion
events may be sent with a huge delay.

Nicolas further points at this line:

	goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);

and indeed, this isn't optimal at all. The rationale here was to ensure
that we don't block other work items for too long, even if
nf_conntrack_htable_size is huge. But in order to have some guarantee
about the maximum time period in which a scan of the full conntrack
table completes, we should always use a fixed slice size, so that once
every N scans the full table has been examined at least once.

We also need to balance this vs. the case where the system is either
idle (i.e., conntrack table (almost) empty) or very busy (i.e. eviction
happens from the packet path).

So, after some discussion with Nicolas:

1. We want a hard guarantee that we scan the entire table at least once
   every X seconds -> need to scan a fraction of the table (get rid of
   the upper bound).
2. We don't want to eat cycles on an idle or very busy system
   -> increase the interval if we did not evict any entries.
3. We don't want to block other work items for too long
   -> make the fraction really small, and prefer a small scan interval
   instead.
4. We want a reasonably short time in which we detect a timed-out entry
   when the system went idle after a burst of traffic, while not doing
   scans all the time.
   -> Store the next gc scan in the worker, increasing delays when no
   eviction happened and shrinking the delay when we see timed-out
   entries (see the sketch after this list).

The old gc interval is turned into a max number; scans can now happen
every jiffy if stale entries are present. The longest possible time
period until an entry is evicted is now 2 minutes in the worst case
(the entry expires right after it was deemed 'not expired').

Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
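An illustrative sketch of point 4, the stored self-adjusting delay
(field and constant names here are assumptions, not the exact patch):

	/* in the gc worker, after a scan pass: */
	if (expired_count)
		gc_work->next_gc_run /= 2U;	/* saw stale entries: speed up */
	else if (gc_work->next_gc_run < GC_INTERVAL_MAX)
		gc_work->next_gc_run += HZ;	/* idle: back off toward the max */

	queue_delayed_work(system_long_wq, &gc_work->dwork,
			   gc_work->next_gc_run);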
2016-10-20  netfilter: conntrack: restart gc immediately if GC_MAX_EVICTS is reached  (Nicolas Dichtel)
When the maximum evictions number is reached, do not wait 5 seconds
before the next run.

CC: Florian Westphal <fw@strlen.de>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-25  Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next  (Pablo Neira Ayuso)
Conflicts:
	net/netfilter/core.c
	net/netfilter/nf_tables_netdev.c

Resolve two conflicts before pull request for David's net-next tree:

1) Between c73c24849011 ("netfilter: nf_tables_netdev: remove redundant
   ip_hdr assignment") from the net tree and commit ddc8b6027ad0
   ("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()").

2) Between e8bffe0cf964 ("net: Add _nf_(un)register_hooks symbols") and
   Aaron Conole's patches to replace list_head with single linked list.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-24  netfilter: Remove explicit rcu_read_lock in nf_hook_slow  (Aaron Conole)
All of the callers of nf_hook_slow already hold the rcu_read_lock, so
this cleanup removes the recursive call. This is just a cleanup, as the
locking code gracefully handles this situation.

Signed-off-by: Aaron Conole <aconole@bytheb.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-23  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
2016-09-13  netfilter: synproxy: Check oom when adding synproxy and seqadj ct extensions  (Gao Feng)
When memory is exhausted, nfct_seqadj_ext_add may fail to add the
synproxy and seqadj extensions, and nf_ct_seqadj_init doesn't check
whether nfct_seqadj returned a valid seqadj pointer. Now drop the
packet directly when we fail to add the seqadj extension, to avoid
dereferencing a NULL pointer in nf_ct_seqadj_init from
init_conntrack().

Signed-off-by: Gao Feng <fgao@ikuai8.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-09-12  netfilter: conntrack: remove packet hotpath stats  (Florian Westphal)
These counters sit in the hot path and do show up in perf; this is
especially true for 'found' and 'searched', which get incremented for
every packet processed. Information like

	searched=212030105 new=623431 found=333613 delete=623327

does not seem too helpful nowadays:

- on busy systems found and searched will overflow every few hours
  (these are 32bit integers), busier ones every few days.
- for debugging there are better methods, such as iptables' trace
  target and the conntrack log sysctls. Nowadays we also have the perf
  tool.

This removes packet path stat counters except those that are expected
to be 0 (or close to 0) on a normal system, e.g. 'insert_failed' (race
happened) or 'invalid' (proto tracker rejects). The insert stat is
retained for the ctnetlink case. The found stat is retained for the
tuple-is-taken check when NAT has to determine if it needs to pick a
different source address.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30  netfilter: remove __nf_ct_kill_acct helper  (Florian Westphal)
After timer removal this just calls nf_ct_delete, so remove the
__ prefix version and make nf_ct_kill a shorthand for nf_ct_delete.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30  netfilter: conntrack: resched gc again if eviction rate is high  (Florian Westphal)
If we evicted a large fraction of the scanned conntrack entries,
re-schedule the next gc cycle for immediate execution.

This triggers during tests where load is high, then drops to zero and
many connections will be in TW/CLOSE state with < 30 second timeouts.
Without this change it will take several minutes until the conntrack
count comes back to normal.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-30  netfilter: conntrack: add gc worker to remove timed-out entries  (Florian Westphal)
Conntrack gc worker to evict stale entries.

GC happens once every 5 seconds, but we only scan at most 1/64th of the
table (and not more than 8k buckets) to avoid hogging cpu. This means
that a complete scan of the table will take several minutes of
wall-clock time.

Considering that the gc run will never have to evict any entries during
normal operation, because those evictions happen from the packet path,
this should be fine. We only need gc to make sure userspace (conntrack
event listeners) eventually learns of the timeout, and for resource
reclaim in case the system becomes idle.

We do not disable BH, and we cond_resched for every bucket, so this
should not introduce noticeable latencies either. A skeleton of the
worker is sketched below.

A followup patch will add a small change to speed up GC for the extreme
case where most entries are timed out on an otherwise idle system.

v2: Use cond_resched_rcu_qs & add comment wrt. missing restart on nulls
    value change in gc worker, suggested by Eric Dumazet.
v3: don't call cancel_delayed_work_sync twice (again, Eric).

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
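A minimal skeleton of such a worker, assuming the constants named above
(the per-bucket eviction helper is hypothetical):

	static void gc_worker(struct work_struct *work)
	{
		unsigned int buckets = 0, goal;
		struct conntrack_gc_work *gc_work;

		gc_work = container_of(work, struct conntrack_gc_work, dwork.work);
		goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV,
			   GC_MAX_BUCKETS);

		do {
			/* hypothetical helper: walk one bucket under rcu and
			 * reap entries whose timeout has already passed;
			 * last_bucket wraps modulo the table size */
			evict_expired_in_bucket(gc_work->last_bucket++);
			cond_resched_rcu_qs();
		} while (++buckets < goal && !gc_work->exiting);

		if (!gc_work->exiting)
			queue_delayed_work(system_long_wq, &gc_work->dwork,
					   GC_INTERVAL);
	}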
2016-08-30  netfilter: conntrack: get rid of conntrack timer  (Florian Westphal)
With stats enabled this eats 80 bytes on x86_64 per nf_conn entry, as
Eric Dumazet pointed out during netfilter workshop 2016.

Eric also says: "Another reason was the fact that Thomas was about to
change max timer range [..]" (500462a9de657f8, 'timers: Switch to a
non-cascading wheel').

Remove the timer and use a 32bit jiffies value containing the timestamp
until which the entry is valid. During conntrack lookup, even before
doing the tuple comparison, check the timeout value and evict the entry
in case it is too old.

The dying bit is used as a synchronization point to avoid races where
multiple cpus try to evict the same entry. Because lookup is always
lockless, we need to bump the refcnt once when we evict, else we could
try to evict an already-dead entry that is being recycled. This is the
standard/expected way when conntrack entries are destroyed.

Followup patches will introduce garbage collection via work queue and
further places where we can reap obsoleted entries (e.g. during netlink
dumps); this is needed to avoid expired conntracks from hanging around
for too long when the lookup rate is low after a busy period.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
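The expiry test then reduces to a signed jiffies comparison against the
stored deadline, roughly the upstream helper:

	static inline bool nf_ct_is_expired(const struct nf_conn *ct)
	{
		/* ct->timeout is now an absolute 32bit jiffies deadline;
		 * the signed cast handles jiffies wraparound */
		return (__s32)(ct->timeout - nfct_time_stamp) <= 0;
	}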
2016-08-30  netfilter: restart search if moved to other chain  (Florian Westphal)
In case nf_conntrack_tuple_taken did not find a conflicting entry,
check that all entries in this hash slot were tested, and restart in
case an entry was moved to another chain.

Reported-by: Eric Dumazet <edumazet@google.com>
Fixes: ea781f197d6a ("netfilter: nf_conntrack: use SLAB_DESTROY_BY_RCU and get rid of call_rcu()")
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
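With SLAB_DESTROY_BY_RCU an entry can be freed and reused on another
chain mid-walk; the nulls marker at the end of each chain tells the
reader whether it ended up where it started. A sketch of the pattern
(the wrapper name is hypothetical):

	static struct nf_conntrack_tuple_hash *
	find_in_bucket(struct net *net, const struct nf_conntrack_zone *zone,
		       const struct nf_conntrack_tuple *tuple, unsigned int bucket)
	{
		struct nf_conntrack_tuple_hash *h;
		struct hlist_nulls_node *n;

	begin:
		hlist_nulls_for_each_entry_rcu(h, n, &nf_conntrack_hash[bucket],
					       hnnode) {
			if (nf_ct_key_equal(h, tuple, zone, net))
				return h;
		}
		/* the nulls value encodes the chain we started in; a mismatch
		 * means we were moved to another chain and must rescan */
		if (get_nulls_value(n) != bucket)
			goto begin;
		return NULL;
	}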
2016-08-18  netfilter: nf_conntrack: restore nf_conntrack_htable_size as exported symbol  (Pablo Neira Ayuso)
This is required to iterate over the hash table in cttimeout, ctnetlink
and nf_conntrack_ipv4.

  ERROR: "nf_conntrack_htable_size" [net/netfilter/nfnetlink_cttimeout.ko] undefined!
  ERROR: "nf_conntrack_htable_size" [net/netfilter/nf_conntrack_netlink.ko] undefined!
  ERROR: "nf_conntrack_htable_size" [net/ipv4/netfilter/nf_conntrack_ipv4.ko] undefined!

Fixes: adf0516845bcd0 ("netfilter: remove ip_conntrack* sysctl compat code")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-18  netfilter: conntrack: simplify the code by using nf_conntrack_get_ht  (Liping Zhang)
Since commit 64b87639c9cb ("netfilter: conntrack: fix race between
nf_conntrack proc read and hash resize") introduced nf_conntrack_get_ht,
there's no need to check nf_conntrack_generation again and again to get
the hash table and hash size. Also convert nf_conntrack_get_ht to an
inline function here.

Suggested-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-08-13  netfilter: remove ip_conntrack* sysctl compat code  (Pablo Neira Ayuso)
This backward compatibility has been around for more than ten years,
since Yasuyuki Kozakai introduced IPv6 in conntrack. These days, we
have the alternate /proc/net/nf_conntrack* entries, the ctnetlink
interface, and the conntrack utility got adopted by many people in the
user community, according to what I observed on the netfilter user
mailing list. So let's get rid of this.

Note that nf_conntrack_htable_size and unsigned int nf_conntrack_max do
not need to be exported as symbols anymore.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-27  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next  (Linus Torvalds)
Pull networking updates from David Miller:

1) Unified UDP encapsulation offload methods for drivers, from
   Alexander Duyck.
2) Make DSA binding more sane, from Andrew Lunn.
3) Support QCA9888 chips in ath10k, from Anilkumar Kolli.
4) Several workqueue usage cleanups, from Bhaktipriya Shridhar.
5) Add XDP (eXpress Data Path), essentially running BPF programs on RX
   packets as soon as the device sees them, with the option to mirror
   the packet on TX via the same interface. From Brenden Blanco and
   others.
6) Allow qdisc/class stats dumps to run lockless, from Eric Dumazet.
7) Add VLAN support to b53 and bcm_sf2, from Florian Fainelli.
8) Simplify netlink conntrack entry layout, from Florian Westphal.
9) Add ipv4 forwarding support to mlxsw spectrum driver, from Ido
   Schimmel, Yotam Gigi, and Jiri Pirko.
10) Add SKB array infrastructure and convert tun and macvtap over to
    it. From Michael S Tsirkin and Jason Wang.
11) Support qdisc packet injection in pktgen, from John Fastabend.
12) Add neighbour monitoring framework to TIPC, from Jon Paul Maloy.
13) Add NV congestion control support to TCP, from Lawrence Brakmo.
14) Add GSO support to SCTP, from Marcelo Ricardo Leitner.
15) Allow GRO and RPS to function on macsec devices, from Paolo Abeni.
16) Support MPLS over IPV4, from Simon Horman.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits)
  xgene: Fix build warning with ACPI disabled.
  be2net: perform temperature query in adapter regardless of its interface state
  l2tp: Correctly return -EBADF from pppol2tp_getname.
  net/mlx5_core/health: Remove deprecated create_singlethread_workqueue
  net: ipmr/ip6mr: update lastuse on entry change
  macsec: ensure rx_sa is set when validation is disabled
  tipc: dump monitor attributes
  tipc: add a function to get the bearer name
  tipc: get monitor threshold for the cluster
  tipc: make cluster size threshold for monitoring configurable
  tipc: introduce constants for tipc address validation
  net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()
  MAINTAINERS: xgene: Add driver and documentation path
  Documentation: dtb: xgene: Add MDIO node
  dtb: xgene: Add MDIO node
  drivers: net: xgene: ethtool: Use phy_ethtool_gset and sset
  drivers: net: xgene: Use exported functions
  drivers: net: xgene: Enable MDIO driver
  drivers: net: xgene: Add backward compatibility
  drivers: net: phy: xgene: Add MDIO driver
  ...
2016-07-25  Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull locking updates from Ingo Molnar:
"The locking tree was busier in this cycle than the usual pattern - a
couple of major projects happened to coincide.

The main changes are:

- implement the atomic_fetch_{add,sub,and,or,xor}() API natively
  across all SMP architectures (Peter Zijlstra)
- add atomic_fetch_{inc/dec}() as well, using the generic primitives
  (Davidlohr Bueso)
- optimize various aspects of rwsems (Jason Low, Davidlohr Bueso,
  Waiman Long)
- optimize smp_cond_load_acquire() on arm64 and implement LSE based
  atomic{,64}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
  on arm64 (Will Deacon)
- introduce smp_acquire__after_ctrl_dep() and fix various barrier
  mis-uses and bugs (Peter Zijlstra)
- after discovering ancient spin_unlock_wait() barrier bugs in its
  implementation and usage, strengthen its semantics and update/fix
  usage sites (Peter Zijlstra)
- optimize mutex_trylock() fastpath (Peter Zijlstra)
- ... misc fixes and cleanups"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits)
  locking/atomic: Introduce inc/dec variants for the atomic_fetch_$op() API
  locking/barriers, arch/arm64: Implement LDXR+WFE based smp_cond_load_acquire()
  locking/static_keys: Fix non static symbol Sparse warning
  locking/qspinlock: Use __this_cpu_dec() instead of full-blown this_cpu_dec()
  locking/atomic, arch/tile: Fix tilepro build
  locking/atomic, arch/m68k: Remove comment
  locking/atomic, arch/arc: Fix build
  locking/Documentation: Clarify limited control-dependency scope
  locking/atomic, arch/rwsem: Employ atomic_long_fetch_add()
  locking/atomic, arch/qrwlock: Employ atomic_fetch_add_acquire()
  locking/atomic, arch/mips: Convert to _relaxed atomics
  locking/atomic, arch/alpha: Convert to _relaxed atomics
  locking/atomic: Remove the deprecated atomic_{set,clear}_mask() functions
  locking/atomic: Remove linux/atomic.h:atomic_fetch_or()
  locking/atomic: Implement atomic{,64,_long}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
  locking/atomic: Fix atomic64_relaxed() bits
  locking/atomic, arch/xtensa: Implement atomic_fetch_{add,sub,and,or,xor}()
  locking/atomic, arch/x86: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
  locking/atomic, arch/tile: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
  locking/atomic, arch/sparc: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
  ...
2016-07-24  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next  (David S. Miller)
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next,
they are:

1) Count pre-established connections as active in "least connection"
   schedulers, so that pre-established connections avoid overloading
   backend servers on peak demands, from Michal Kubecek via Simon
   Horman.
2) Address a race condition when resizing the conntrack table by
   caching the bucket size when fully iterating over the hashtable in
   these three possible scenarios: 1) dump via /proc/net/nf_conntrack,
   2) unlinking userspace helper and 3) unlinking custom conntrack
   timeout. From Liping Zhang.
3) Revisit the early_drop() path to perform lockless traversal on
   conntrack eviction under stress, using del_timer() as the
   synchronization point to avoid two CPUs evicting the same entry,
   from Florian Westphal.
4) Move the NAT hlist_head to the nf_conn object; this simplifies the
   existing NAT extension and it doesn't increase size since recent
   patches to align nf_conn, from Florian.
5) Use rhashtable for the by-source NAT hashtable, also from Florian.
6) Don't allow --physdev-is-out from the OUTPUT chain, just like
   --physdev-out is not allowed either, from Hangbin Liu.
7) Automagically turn on nf_conntrack counters if the user tries to
   match ct bytes/packets from nftables, from Liping Zhang.
8) Remove possible_net_t fields in nf_tables set objects since we just
   simply pass the net pointer to the backend set type implementations.
9) Fix possible off-by-one in h323, from Toby DiPasquale.
10) early_drop() may be called from the ctnetlink path, so we must hold
    the rcu read lock from there too; this amends Florian's patch #3
    coming in this batch, from Liping Zhang.
11) Use binary search to validate jump offset in x_tables; this
    addresses the O(n!) validation that was introduced recently to
    resolve security issues with unprivileged namespaces, from Florian.
12) Fix reference leak to connlabel in the error path of nft_ct, from
    Zhang.
13) Three updates for nft_log: fix log prefix leak in the error path,
    bail out on loglevel larger than debug in nft_log, and set on the
    new NF_LOG_F_COPY_LEN flag when snaplen is specified. Again from
    Zhang.
14) Allow to filter rule dumps in nf_tables based on table and chain
    names.
15) Simplify connlabel to always use 128 bits to store labels and get
    rid of an unused function in xt_connlabel, from Florian.
16) Replace set_expect_timeout() by mod_timer() in the h323 conntrack
    helper, by Gao Feng.
17) Put back the x_tables module reference in nft_compat on error, from
    Liping Zhang.
18) Add a reference count to the x_tables extensions cache in
    nft_compat, so we can remove them when unused and avoid a crash if
    the extensions are rmmod'ed, again from Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-24  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
Just several instances of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-12  netfilter: conntrack: skip clash resolution if nat is in place  (Pablo Neira Ayuso)
The clash resolution is not easy to apply if the NAT table is
registered. Even if no NAT rules are installed, the null binding
ensures that a unique tuple is used, thus the packet that loses the
race gets a different source port number, as described by:

http://marc.info/?l=netfilter-devel&m=146818011604484&w=2

Clash resolution with NAT is also problematic if address/port ranges
are used, since the conntrack that wins the race may describe a
different mangling than what we may have earlier applied to the packet
via nf_nat_setup_info().

Fixes: 71d8c47fc653 ("netfilter: conntrack: introduce clash resolution on insertion race")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: Marc Dionne <marc.c.dionne@gmail.com>
2016-07-12  netfilter: conntrack: protect early_drop by rcu read lock  (Liping Zhang)
Users can add ct entries via nfnetlink (IPCTNL_MSG_CT_NEW), and if the
total number reaches nf_conntrack_max, we will try to drop some ct
entries. But in this case (the main call path is
ctnetlink_create_conntrack -> nf_conntrack_alloc -> early_drop),
rcu_read_lock is not held, so a race with hash resize will happen.

Fixes: 242922a02717 ("netfilter: conntrack: simplify early_drop")
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11  netfilter: conntrack: simplify early_drop  (Florian Westphal)
We don't need to acquire the bucket lock during early drop; we can use
lockless traversal just like ____nf_conntrack_find.

The timer deletion serves as the synchronization point: if another cpu
attempts to evict the same entry, only one will succeed with the timer
deletion.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-11  netfilter: conntrack: fix race between nf_conntrack proc read and hash resize  (Liping Zhang)
When we do "cat /proc/net/nf_conntrack" and meanwhile resize the
conntrack hash table via /sys/module/nf_conntrack/parameters/hashsize,
a race will happen, because the reader can observe a newly allocated
hash but the old size (or vice versa). So an oops will happen like
follows:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000017
  IP: [<ffffffffa0418e21>] seq_print_acct+0x11/0x50 [nf_conntrack]
  Call Trace:
  [<ffffffffa0412f4e>] ? ct_seq_show+0x14e/0x340 [nf_conntrack]
  [<ffffffff81261a1c>] seq_read+0x2cc/0x390
  [<ffffffff812a8d62>] proc_reg_read+0x42/0x70
  [<ffffffff8123bee7>] __vfs_read+0x37/0x130
  [<ffffffff81347980>] ? security_file_permission+0xa0/0xc0
  [<ffffffff8123cf75>] vfs_read+0x95/0x140
  [<ffffffff8123e475>] SyS_read+0x55/0xc0
  [<ffffffff817c2572>] entry_SYSCALL_64_fastpath+0x1a/0xa4

It is very easy to reproduce this kernel crash:

1. open one shell and input the following cmds:
     while : ; do
       echo $RANDOM > /sys/module/nf_conntrack/parameters/hashsize
     done
2. open more shells and input the following cmds:
     while : ; do
       cat /proc/net/nf_conntrack
     done
3. just wait a moment, the oops will happen soon.

The solution in this patch is based on Florian's commit 5e3c61f98175
("netfilter: conntrack: fix lookup race during hash resize"), adding a
wrapper function nf_conntrack_get_ht to get the hash and hsize, as
suggested by Florian Westphal.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-07-06  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next  (David S. Miller)
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next, they
are:

1) Don't use userspace datatypes in bridge netfilter code, from Tobin
   Harding.
2) Iterate only once over the expectation table when removing the
   helper module, instead of once per-netns, from Florian Westphal.
3) Extra sanitization in xt_hook_ops_alloc() to return an error in
   case we ever pass zero hooks.
4) Handle NFPROTO_INET from the logging core infrastructure, from
   Liping Zhang.
5) Autoload loggers when the TRACE target is used from rules; this
   doesn't change the behaviour in case the user already selected
   nfnetlink_log as the preferred way to print tracing logs, also from
   Liping Zhang.
6) Conntrack slabs with SLAB_HWCACHE_ALIGN to allow rearranging fields
   by cache lines; increases the size of entries by 11%. From Florian
   Westphal.
7) Skip zone comparison if CONFIG_NF_CONNTRACK_ZONES=n, from Florian.
8) Remove useless defensive check in nf_logger_find_get(), from
   Shivani Bhardwaj.
9) Remove the zone extension and place it in the conntrack object;
   this is always included in the hashing and we expect more intensive
   use of zones since containers are in place. Also from Florian
   Westphal.
10) Owner match now works from any namespace, from Eric Bierdeman.
11) Make sure we only reply with TCP reset to TCP traffic from
    nf_reject_ipv4, patch from Liping Zhang.
12) Introduce --nflog-size to indicate the amount of network packet
    bytes that are copied to userspace via log message, from Vishwanath
    Pai. This obsoletes --nflog-range, which was designed to achieve
    this but has never worked.
13) Introduce generic macros for nf_tables object generation masks.
14) Use the generation mask in table, chain and set objects in
    nf_tables. This fixes interferences with the ongoing preparation
    phase of the commit protocol and object listings going on at the
    same time. This update is introduced in three patches, one per
    object.
15) Check if the object is active in the next generation for element
    deactivation in the rbtree implementation, given that deactivation
    happens from the commit phase path and we have to observe the
    future status of the object.
16) Support for deletion of just-added elements in the hash set type.
17) Allow to resize the hashtable from the /proc entry, not only from
    the obscure /sys entry that maps to the module parameter, from
    Florian Westphal.
18) Get rid of NFT_BASECHAIN_DISABLED; this code is not exercised
    anymore since we tear down the ruleset whenever the netdevice goes
    away.
19) Support for matching inverted set lookups, from Arturo Borrero.
20) Simplify iptables_mangle_hook() by removing a superfluous extra
    branch.
21) Introduce ether_addr_equal_masked() and use it from the netfilter
    codebase, from Joe Perches.
22) Remove references to "Use netfilter MARK value as routing key"
    from the Netfilter Kconfig description, given that this toggle
    hasn't existed for 10 years already, from Moritz Sichert.
23) Introduce generic NF_INVF() and use it from the xtables codebase,
    from Joe Perches.
24) Setting the logger to NONE via /proc was not working unless
    explicit nul-termination was included in the string. This fix
    seems to leave the former behaviour there, so we don't break
    backwards compatibility.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-01  netfilter: conntrack: avoid integer overflow when resizing  (Florian Westphal)
The size computation can overflow, so we might allocate a very small
table when the bucket count is high on a 32bit platform.

Note: resize is only possible from init_netns.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
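The guard amounts to rejecting bucket counts whose size multiplication
would wrap; a sketch of its shape in the table allocation path:

	/* bail out before nr_slots * sizeof(...) can wrap on 32bit */
	if (nr_slots > (UINT_MAX / sizeof(struct hlist_nulls_head)))
		return NULL;

	sz = nr_slots * sizeof(struct hlist_nulls_head);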
2016-06-24  netfilter: conntrack: allow increasing bucket size via sysctl too  (Florian Westphal)
No need to restrict this to module parameter.

We export a copy of the real hash size -- when the user alters the
value we allocate the new table, copy entries etc. before we update the
real size to the requested one. This is also needed because the real
size is used by concurrent readers and cannot be changed without
synchronizing the conntrack generation seqcnt.

We only allow changing this value from the initial net namespace.

Tested using http-client-benchmark vs. httpterm with concurrent

	while true; do
		echo $RANDOM > /proc/sys/net/netfilter/nf_conntrack_buckets
	done

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23  netfilter: move zone info into struct nf_conn  (Florian Westphal)
Currently we store zone information as a conntrack extension. This has
one drawback: for every lookup we need to fetch the zone data from the
extension area.

This change places the zone data directly into the main conntrack
object structure and then removes the zone conntrack extension.

The zone data is just 4 bytes; it fits into a padding hole before the
tuplehash info, so we do not even increase the nf_conn structure size.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-23  netfilter: conntrack: align nf_conn on cacheline boundary  (Florian Westphal)
Increases struct size by 32 bytes (288 -> 320), but it is the right
thing; else any attempt to (re-)arrange nf_conn members by cacheline
won't work.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-15  netfilter: conntrack: destroy kmemcache on module removal  (Florian Westphal)
I forgot to move the kmem_cache_destroy into the exit path.

Fixes: 0c5366b3a8c7 ("netfilter: conntrack: use single slab cache")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-06-14  locking/spinlock, netfilter: Fix nf_conntrack_lock() barriers  (Peter Zijlstra)
Even with spin_unlock_wait() fixed, nf_conntrack_lock{,_all}() is
broken as it misses a bunch of memory barriers to order the whole
global vs. local locks scheme. Even x86 (and other TSO archs) are
affected.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
[ Updated the comments. ]
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-05-15  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net  (David S. Miller)
The nf_conntrack_core.c fix in 'net' is not relevant in 'net-next'
because we no longer have a per-netns conntrack hash.

The ip_gre.c conflict as well as the iwlwifi ones were cases of
overlapping changes.

Conflicts:
	drivers/net/wireless/intel/iwlwifi/mvm/tx.c
	net/ipv4/ip_gre.c
	net/netfilter/nf_conntrack_core.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-14  nf_conntrack: avoid kernel pointer value leak in slab name  (Linus Torvalds)
The slab name ends up being visible in the directory structure under
/sys, and even if you don't have access rights to the file you can see
the filenames. Just use a 64-bit counter instead of the pointer to the
'net' structure to generate a unique name.

This code will go away in 4.7 when the conntrack code moves to a single
kmemcache, but this is the backportable simple solution to avoiding
leaking kernel pointers to user space.

Fixes: 5b3501faa874 ("netfilter: nf_conntrack: per netns nf_conntrack_cachep")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-10  netfilter: conntrack: remove uninitialized shadow variable  (Arnd Bergmann)
A recent commit introduced an unconditional use of an uninitialized
variable, as reported in this gcc warning:

  net/netfilter/nf_conntrack_core.c: In function '__nf_conntrack_confirm':
  net/netfilter/nf_conntrack_core.c:632:33: error: 'ctinfo' may be used
  uninitialized in this function [-Werror=maybe-uninitialized]
    bytes = atomic64_read(&counter[CTINFO2DIR(ctinfo)].bytes);
  net/netfilter/nf_conntrack_core.c:628:26: note: 'ctinfo' was declared here
    enum ip_conntrack_info ctinfo;

The problem is that a local variable shadows the function parameter.
This removes the local variable, which looks like what Pablo originally
intended.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 71d8c47fc653 ("netfilter: conntrack: introduce clash resolution on insertion race")
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-09  netfilter: conntrack: use single slab cache  (Florian Westphal)
An earlier patch changed the lookup side to also net_eq() namespaces
after obtaining a reference on the conntrack, so a single kmemcache can
be used.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05  netfilter: conntrack: introduce clash resolution on insertion race  (Pablo Neira Ayuso)
This patch introduces nf_ct_resolve_clash() to resolve race conditions
on conntrack insertions.

This is particularly a problem for connection-less protocols such as
UDP, with no initial handshake: two or more packets may race to insert
the entry, resulting in packet drops. Another problematic scenario is
packets enqueued to userspace via NFQUEUE after the raw table, which
makes it easier to trigger this race.

To resolve this, the idea is to reset the conntrack entry to the one
that won the race. Packet and byte counters are also merged.

The 'insert_failed' stats still account for this situation; after this
patch, the drop counter is bumped whenever we drop packets, so we can
watch for unresolved clashes.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
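A trimmed sketch of the resolution path (close in shape to the function
this patch adds; error handling and annotations elided):

	static int nf_ct_resolve_clash(struct net *net, struct sk_buff *skb,
				       enum ip_conntrack_info ctinfo,
				       struct nf_conntrack_tuple_hash *h)
	{
		/* the entry already in the table, i.e. the one that won the race */
		struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);
		struct nf_conntrack_l4proto *l4proto;

		l4proto = __nf_ct_l4proto_find(nf_ct_l3num(ct), nf_ct_protonum(ct));
		if (l4proto->allow_clash && !nf_ct_is_dying(ct) &&
		    atomic_inc_not_zero(&ct->ct_general.use)) {
			nf_ct_acct_merge(ct, ctinfo, (struct nf_conn *)skb->nfct);
			nf_conntrack_put(skb->nfct);
			skb->nfct = &ct->ct_general;	/* reuse the winner */
			return NF_ACCEPT;
		}

		NF_CT_STAT_INC(net, drop);
		return NF_DROP;
	}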
2016-05-05  netfilter: conntrack: introduce nf_ct_acct_update()  (Pablo Neira Ayuso)
Introduce a helper function to update conntrack counters.
__nf_ct_kill_acct() was unnecessarily subtracting skb_network_offset(),
which is expected to be zero from the ipv4/ipv6 hooks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05  netfilter: conntrack: __nf_ct_l4proto_find() always returns valid pointer  (Pablo Neira Ayuso)
Remove the unnecessary check for a non-null pointer in
destroy_conntrack(), given that __nf_ct_l4proto_find() returns the
generic protocol tracker if the protocol is not supported.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05  netfilter: conntrack: consider ct netns in early_drop logic  (Florian Westphal)
When iterating, skip conntrack entries living in a different netns.

We could ignore netns and kill some other non-assured one, but it has
two problems:

- a netns can kill non-assured conntracks in other namespace
- we would start to 'over-subscribe' the affected/overlimit netns.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05  netfilter: conntrack: use a single hashtable for all namespaces  (Florian Westphal)
We already include the netns address in the hash and compare the netns
pointers during lookup, so even if namespaces have overlapping
addresses, entries will be spread across the table.

Assuming a 64k bucket size, this change saves 0.5 mbyte per namespace
on a 64bit system.

The NAT bysrc and expectation hashes are still per namespace; those
will be changed too, soon. A future patch will also make the conntrack
object slab cache global again.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05  netfilter: conntrack: make netns address part of hash  (Florian Westphal)
Once we place all conntracks into a global hash table, we want them to
be spread across the entire hash table, even if namespaces have
overlapping ip addresses.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05  netfilter: conntrack: check netns when comparing conntrack objects  (Florian Westphal)
Once we place all conntracks in the same hash table we must also
compare the netns pointer to skip conntracks that belong to a different
namespace.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
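After this change the key comparison looks essentially like the
upstream helper, with net_eq() as the extra test:

	static bool nf_ct_key_equal(struct nf_conntrack_tuple_hash *h,
				    const struct nf_conntrack_tuple *tuple,
				    const struct nf_conntrack_zone *zone,
				    const struct net *net)
	{
		struct nf_conn *ct = nf_ct_tuplehash_to_ctrack(h);

		return nf_ct_tuple_equal(tuple, &h->tuple) &&
		       nf_ct_zone_equal(ct, zone, NF_CT_DIRECTION(h)) &&
		       net_eq(net, nf_ct_net(ct));
	}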
2016-05-05  netfilter: conntrack: use nf_ct_key_equal() in more places  (Florian Westphal)
This prepares for an upcoming change that places all conntracks into a
single, global table. For this to work we will need to also compare the
net pointer during lookup. To avoid open-coding such a check, use the
nf_ct_key_equal helper and then later extend it to also consider
net_eq.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05  netfilter: conntrack: don't attempt to iterate over empty table  (Florian Westphal)
Once we place all conntracks into the same table, iteration becomes
more costly because the table contains conntracks that we are not
interested in (belonging to other netns). So don't bother scanning if
the current namespace has no entries.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2016-05-05  netfilter: conntrack: fix lookup race during hash resize  (Florian Westphal)
When resizing the conntrack hash table at runtime via

	echo 42 > /sys/module/nf_conntrack/parameters/hashsize

we are racing with the conntrack lookup path -- reads can happen in
parallel and nothing prevents readers from observing the newly
allocated hash but the old size (or vice versa).

So access to hash[bucket] can trigger an OOB read access in case the
table got expanded and we saw the new size but the old hash pointer
(or it got shrunk and we got the new hash ptr but the size of the old
and larger table):

  kasan: GPF could be caused by NULL-ptr deref or user memory access
  general protection fault: 0000 [#1] SMP KASAN
  CPU: 0 PID: 3 Comm: ksoftirqd/0 Not tainted 4.6.0-rc2+ #107
  [..]
  Call Trace:
  [<ffffffff822c3d6a>] ? nf_conntrack_tuple_taken+0x12a/0xe90
  [<ffffffff822c3ac1>] ? nf_ct_invert_tuplepr+0x221/0x3a0
  [<ffffffff8230e703>] get_unique_tuple+0xfb3/0x2760

Use a generation counter to obtain the address/length of the same
table. Also add a synchronize_net before freeing the old hash; AFAICS,
without it we might access ct_hash[bucket] after ct_hash has been
freed, provided that the lockless reader got delayed by another event:

	CPU1                      CPU2
	seq_begin
	seq_retry
	<delay>                   resize occurs
	                          free oldhash
	for_each(oldhash[size])

Note that resize is only supported in init_netns; it took over 2
minutes of constant resizing+flooding to produce the warning, so this
isn't a big problem in practice.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
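On the writer side this means publishing both fields under the
generation seqcount and waiting out a grace period before the free; a
sketch (locking context omitted, free helper assumed):

	write_seqcount_begin(&nf_conntrack_generation);
	nf_conntrack_htable_size = hashsize;	/* size and ... */
	nf_conntrack_hash = hash;		/* ... pointer change together */
	write_seqcount_end(&nf_conntrack_generation);

	/* wait for in-flight lockless readers before freeing the old table */
	synchronize_net();
	nf_ct_free_hashtable(old_hash, old_size);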
2016-05-05  netfilter: conntrack: keep BH enabled during lookup  (Florian Westphal)
No need to disable BH here anymore: stats are switched to the _ATOMIC
variant (== this_cpu_inc()), which nowadays generates the same code as
the non-_ATOMIC NF_STAT, at least on x86.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>