summaryrefslogtreecommitdiff
path: root/net/netlink
AgeCommit message (Collapse)Author
2015-12-09netlink: Add missing goto statement to netlink_insertHerbert Xu
The backport of 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink: Fix autobind race condition that leads to zero port ID") missed a goto statement, which causes netlink to break subtly. This was discovered by Stefan Priebe <s.priebe@profihost.ag>. Fixes: 4e2776241766 ("netlink: Fix autobind race condition that...") Reported-by: Stefan Priebe <s.priebe@profihost.ag> Reported-by: Philipp Hahn <pmhahn@pmhahn.de> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-27netlink: Trim skb to alloc size to avoid MSG_TRUNCArad, Ronen
[ Upstream commit db65a3aaf29ecce2e34271d52e8d2336b97bd9fe ] netlink_dump() allocates skb based on the calculated min_dump_alloc or a per socket max_recvmsg_len. min_alloc_size is maximum space required for any single netdev attributes as calculated by rtnl_calcit(). max_recvmsg_len tracks the user provided buffer to netlink_recvmsg. It is capped at 16KiB. The intention is to avoid small allocations and to minimize the number of calls required to obtain dump information for all net devices. netlink_dump packs as many small messages as could fit within an skb that was sized for the largest single netdev information. The actual space available within an skb is larger than what is requested. It could be much larger and up to near 2x with align to next power of 2 approach. Allowing netlink_dump to use all the space available within the allocated skb increases the buffer size a user has to provide to avoid truncaion (i.e. MSG_TRUNG flag set). It was observed that with many VLANs configured on at least one netdev, a larger buffer of near 64KiB was necessary to avoid "Message truncated" error in "ip link" or "bridge [-c[ompressvlans]] vlan show" when min_alloc_size was only little over 32KiB. This patch trims skb to allocated size in order to allow the user to avoid truncation with more reasonable buffer size. Signed-off-by: Ronen Arad <ronen.arad@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-03netlink: Replace rhash_portid with boundHerbert Xu
[ Upstream commit da314c9923fed553a007785a901fd395b7eb6c19 ] On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote: > > store_release and load_acquire are different from the usual memory > barriers and can't be paired this way. You have to pair store_release > and load_acquire. Besides, it isn't a particularly good idea to OK I've decided to drop the acquire/release helpers as they don't help us at all and simply pessimises the code by using full memory barriers (on some architectures) where only a write or read barrier is needed. > depend on memory barriers embedded in other data structures like the > above. Here, especially, rhashtable_insert() would have write barrier > *before* the entry is hashed not necessarily *after*, which means that > in the above case, a socket which appears to have set bound to a > reader might not visible when the reader tries to look up the socket > on the hashtable. But you are right we do need an explicit write barrier here to ensure that the hashing is visible. > There's no reason to be overly smart here. This isn't a crazy hot > path, write barriers tend to be very cheap, store_release more so. > Please just do smp_store_release() and note what it's paired with. It's not about being overly smart. It's about actually understanding what's going on with the code. I've seen too many instances of people simply sprinkling synchronisation primitives around without any knowledge of what is happening underneath, which is just a recipe for creating hard-to-debug races. > > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr, > > } > > } > > > > - if (!nlk->portid) { > > + if (!nlk->bound) { > > I don't think you can skip load_acquire here just because this is the > second deref of the variable. That doesn't change anything. Race > condition could still happen between the first and second tests and > skipping the second would lead to the same kind of bug. The reason this one is OK is because we do not use nlk->portid or try to get nlk from the hash table before we return to user-space. However, there is a real bug here that none of these acquire/release helpers discovered. The two bound tests here used to be a single one. Now that they are separate it is entirely possible for another thread to come in the middle and bind the socket. So we need to repeat the portid check in order to maintain consistency. > > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr, > > !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND)) > > return -EPERM; > > > > - if (!nlk->portid) > > + if (!nlk->bound) > > Don't we need load_acquire here too? Is this path holding a lock > which makes that unnecessary? Ditto. ---8<--- The commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ("netlink: Fix autobind race condition that leads to zero port ID") created some new races that can occur due to inconcsistencies between the two port IDs. Tejun is right that a barrier is unavoidable. Therefore I am reverting to the original patch that used a boolean to indicate that a user netlink socket has been bound. Barriers have been added where necessary to ensure that a valid portid and the hashed socket is visible. I have also changed netlink_insert to only return EBUSY if the socket is bound to a portid different to the requested one. This combined with only reading nlk->bound once in netlink_bind fixes a race where two threads that bind the socket at the same time with different port IDs may both succeed. Fixes: 1f770c0a09da ("netlink: Fix autobind race condition that leads to zero port ID") Reported-by: Tejun Heo <tj@kernel.org> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Nacked-by: Tejun Heo <tj@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-03netlink: Fix autobind race condition that leads to zero port IDHerbert Xu
[ Upstream commit 1f770c0a09da855a2b51af6d19de97fb955eca85 ] The commit c0bb07df7d981e4091432754e30c9c720e2c0c78 ("netlink: Reset portid after netlink_insert failure") introduced a race condition where if two threads try to autobind the same socket one of them may end up with a zero port ID. This led to kernel deadlocks that were observed by multiple people. This patch reverts that commit and instead fixes it by introducing a separte rhash_portid variable so that the real portid is only set after the socket has been successfully hashed. Fixes: c0bb07df7d98 ("netlink: Reset portid after netlink_insert failure") Reported-by: Tejun Heo <tj@kernel.org> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-10-03netlink, mmap: transform mmap skb into full skb on tapsDaniel Borkmann
[ Upstream commit 1853c949646005b5959c483becde86608f548f24 ] Ken-ichirou reported that running netlink in mmap mode for receive in combination with nlmon will throw a NULL pointer dereference in __kfree_skb() on nlmon_xmit(), in my case I can also trigger an "unable to handle kernel paging request". The problem is the skb_clone() in __netlink_deliver_tap_skb() for skbs that are mmaped. I.e. the cloned skb doesn't have a destructor, whereas the mmap netlink skb has it pointed to netlink_skb_destructor(), set in the handler netlink_ring_setup_skb(). There, skb->head is being set to NULL, so that in such cases, __kfree_skb() doesn't perform a skb_release_data() via skb_release_all(), where skb->head is possibly being freed through kfree(head) into slab allocator, although netlink mmap skb->head points to the mmap buffer. Similarly, the same has to be done also for large netlink skbs where the data area is vmalloced. Therefore, as discussed, make a copy for these rather rare cases for now. This fixes the issue on my and Ken-ichirou's test-cases. Reference: http://thread.gmane.org/gmane.linux.network/371129 Fixes: bcbde0d449ed ("net: netlink: virtual tap device management") Reported-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Ken-ichirou MATSUZAWA <chamaken@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29netlink: make sure -EBUSY won't escape from netlink_insertDaniel Borkmann
[ Upstream commit 4e7c1330689e27556de407d3fdadc65ffff5eb12 ] Linus reports the following deadlock on rtnl_mutex; triggered only once so far (extract): [12236.694209] NetworkManager D 0000000000013b80 0 1047 1 0x00000000 [12236.694218] ffff88003f902640 0000000000000000 ffffffff815d15a9 0000000000000018 [12236.694224] ffff880119538000 ffff88003f902640 ffffffff81a8ff84 00000000ffffffff [12236.694230] ffffffff81a8ff88 ffff880119c47f00 ffffffff815d133a ffffffff81a8ff80 [12236.694235] Call Trace: [12236.694250] [<ffffffff815d15a9>] ? schedule_preempt_disabled+0x9/0x10 [12236.694257] [<ffffffff815d133a>] ? schedule+0x2a/0x70 [12236.694263] [<ffffffff815d15a9>] ? schedule_preempt_disabled+0x9/0x10 [12236.694271] [<ffffffff815d2c3f>] ? __mutex_lock_slowpath+0x7f/0xf0 [12236.694280] [<ffffffff815d2cc6>] ? mutex_lock+0x16/0x30 [12236.694291] [<ffffffff814f1f90>] ? rtnetlink_rcv+0x10/0x30 [12236.694299] [<ffffffff8150ce3b>] ? netlink_unicast+0xfb/0x180 [12236.694309] [<ffffffff814f5ad3>] ? rtnl_getlink+0x113/0x190 [12236.694319] [<ffffffff814f202a>] ? rtnetlink_rcv_msg+0x7a/0x210 [12236.694331] [<ffffffff8124565c>] ? sock_has_perm+0x5c/0x70 [12236.694339] [<ffffffff814f1fb0>] ? rtnetlink_rcv+0x30/0x30 [12236.694346] [<ffffffff8150d62c>] ? netlink_rcv_skb+0x9c/0xc0 [12236.694354] [<ffffffff814f1f9f>] ? rtnetlink_rcv+0x1f/0x30 [12236.694360] [<ffffffff8150ce3b>] ? netlink_unicast+0xfb/0x180 [12236.694367] [<ffffffff8150d344>] ? netlink_sendmsg+0x484/0x5d0 [12236.694376] [<ffffffff810a236f>] ? __wake_up+0x2f/0x50 [12236.694387] [<ffffffff814cad23>] ? sock_sendmsg+0x33/0x40 [12236.694396] [<ffffffff814cb05e>] ? ___sys_sendmsg+0x22e/0x240 [12236.694405] [<ffffffff814cab75>] ? ___sys_recvmsg+0x135/0x1a0 [12236.694415] [<ffffffff811a9d12>] ? eventfd_write+0x82/0x210 [12236.694423] [<ffffffff811a0f9e>] ? fsnotify+0x32e/0x4c0 [12236.694429] [<ffffffff8108cb70>] ? wake_up_q+0x60/0x60 [12236.694434] [<ffffffff814cba09>] ? __sys_sendmsg+0x39/0x70 [12236.694440] [<ffffffff815d4797>] ? entry_SYSCALL_64_fastpath+0x12/0x6a It seems so far plausible that the recursive call into rtnetlink_rcv() looks suspicious. One way, where this could trigger is that the senders NETLINK_CB(skb).portid was wrongly 0 (which is rtnetlink socket), so the rtnl_getlink() request's answer would be sent to the kernel instead to the actual user process, thus grabbing rtnl_mutex() twice. One theory would be that netlink_autobind() triggered via netlink_sendmsg() internally overwrites the -EBUSY error to 0, but where it is wrongly originating from __netlink_insert() instead. That would reset the socket's portid to 0, which is then filled into NETLINK_CB(skb).portid later on. As commit d470e3b483dc ("[NETLINK]: Fix two socket hashing bugs.") also puts it, -EBUSY should not be propagated from netlink_insert(). It looks like it's very unlikely to reproduce. We need to trigger the rhashtable_insert_rehash() handler under a situation where rehashing currently occurs (one /rare/ way would be to hit ht->elasticity limits while not filled enough to expand the hashtable, but that would rather require a specifically crafted bind() sequence with knowledge about destination slots, seems unlikely). It probably makes sense to guard __netlink_insert() in any case and remap that error. It was suggested that EOVERFLOW might be better than an already overloaded ENOMEM. Reference: http://thread.gmane.org/gmane.linux.network/372676 Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-09-29netlink: don't hold mutex in rcu callback when releasing mmapd ringFlorian Westphal
[ Upstream commit 0470eb99b4721586ccac954faac3fa4472da0845 ] Kirill A. Shutemov says: This simple test-case trigers few locking asserts in kernel: int main(int argc, char **argv) { unsigned int block_size = 16 * 4096; struct nl_mmap_req req = { .nm_block_size = block_size, .nm_block_nr = 64, .nm_frame_size = 16384, .nm_frame_nr = 64 * block_size / 16384, }; unsigned int ring_size; int fd; fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC); if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0) exit(1); if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0) exit(1); ring_size = req.nm_block_nr * req.nm_block_size; mmap(NULL, 2 * ring_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); return 0; } +++ exited with 0 +++ BUG: sleeping function called from invalid context at /home/kas/git/public/linux-mm/kernel/locking/mutex.c:616 in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: init 3 locks held by init/1: #0: (reboot_mutex){+.+...}, at: [<ffffffff81080959>] SyS_reboot+0xa9/0x220 #1: ((reboot_notifier_list).rwsem){.+.+..}, at: [<ffffffff8107f379>] __blocking_notifier_call_chain+0x39/0x70 #2: (rcu_callback){......}, at: [<ffffffff810d32e0>] rcu_do_batch.isra.49+0x160/0x10c0 Preemption disabled at:[<ffffffff8145365f>] __delay+0xf/0x20 CPU: 1 PID: 1 Comm: init Not tainted 4.1.0-00009-gbddf4c4818e0 #253 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014 ffff88017b3d8000 ffff88027bc03c38 ffffffff81929ceb 0000000000000102 0000000000000000 ffff88027bc03c68 ffffffff81085a9d 0000000000000002 ffffffff81ca2a20 0000000000000268 0000000000000000 ffff88027bc03c98 Call Trace: <IRQ> [<ffffffff81929ceb>] dump_stack+0x4f/0x7b [<ffffffff81085a9d>] ___might_sleep+0x16d/0x270 [<ffffffff81085bed>] __might_sleep+0x4d/0x90 [<ffffffff8192e96f>] mutex_lock_nested+0x2f/0x430 [<ffffffff81932fed>] ? _raw_spin_unlock_irqrestore+0x5d/0x80 [<ffffffff81464143>] ? __this_cpu_preempt_check+0x13/0x20 [<ffffffff8182fc3d>] netlink_set_ring+0x1ed/0x350 [<ffffffff8182e000>] ? netlink_undo_bind+0x70/0x70 [<ffffffff8182fe20>] netlink_sock_destruct+0x80/0x150 [<ffffffff817e484d>] __sk_free+0x1d/0x160 [<ffffffff817e49a9>] sk_free+0x19/0x20 [..] Cong Wang says: We can't hold mutex lock in a rcu callback, [..] Thomas Graf says: The socket should be dead at this point. It might be simpler to add a netlink_release_ring() function which doesn't require locking at all. Reported-by: "Kirill A. Shutemov" <kirill@shutemov.name> Diagnosed-by: Cong Wang <cwang@twopensource.com> Suggested-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-16netlink: Reset portid after netlink_insert failureHerbert Xu
The commit c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink: eliminate nl_sk_hash_lock") breaks the autobind retry mechanism because it doesn't reset portid after a failed netlink_insert. This means that should autobind fail the first time around, then the socket will be stuck in limbo as it can never be bound again since it already has a non-zero portid. Fixes: c5adde9468b0 ("netlink: eliminate nl_sk_hash_lock") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-14netlink: move nl_table in read_mostly sectionEric Dumazet
netlink sockets creation and deletion heavily modify nl_table_users and nl_table_lock. If nl_table is sharing one cache line with one of them, netlink performance is really bad on SMP. ffffffff81ff5f00 B nl_table ffffffff81ff5f0c b nl_table_users Putting nl_table in read_mostly section increased performance of my open/delete netlink sockets test by about 80 % This came up while diagnosing a getaddrinfo() problem. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-05-03netlink: Remove max_size settingHerbert Xu
We currently limit the hash table size to 64K which is very bad as even 10 years ago it was relatively easy to generate millions of sockets. Since the hash table is naturally limited by memory allocation failure, we don't really need an explicit limit so this patch removes it. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@noironetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-04-25net: fix crash in build_skb()Eric Dumazet
When I added pfmemalloc support in build_skb(), I forgot netlink was using build_skb() with a vmalloc() area. In this patch I introduce __build_skb() for netlink use, and build_skb() is a wrapper handling both skb->head_frag and skb->pfmemalloc This means netlink no longer has to hack skb->head_frag [ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26! [ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN [ 1567.700067] Dumping ftrace buffer: [ 1567.700067] (ftrace buffer empty) [ 1567.700067] Modules linked in: [ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167 [ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000 [ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3)) [ 1567.700067] RSP: 0018:ffff8802467779d8 EFLAGS: 00010202 [ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c [ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049 [ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000 [ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000 [ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000 [ 1567.700067] FS: 00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000 [ 1567.700067] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0 [ 1567.700067] Stack: [ 1567.700067] ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000 [ 1567.700067] ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08 [ 1567.700067] ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821 [ 1567.700067] Call Trace: [ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316) [ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329) [ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311) [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273) [ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273) [ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623) [ 1567.774369] sock_write_iter (net/socket.c:823) [ 1567.774369] ? sock_sendmsg (net/socket.c:806) [ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491) [ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249) [ 1567.774369] ? default_llseek (fs/read_write.c:487) [ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701) [ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4)) [ 1567.774369] vfs_write (fs/read_write.c:539) [ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577) [ 1567.774369] ? SyS_read (fs/read_write.c:577) [ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63) [ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636) [ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42) [ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261) Fixes: 79930f5892e ("net: do not deplete pfmemalloc reserve") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-25rhashtable: provide len to obj_hashfnPatrick McHardy
nftables sets will be converted to use so called setextensions, moving the key to a non-fixed position. To hash it, the obj_hashfn must be used, however it so far doesn't receive the length parameter. Pass the key length to obj_hashfn() and convert existing users. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-24rhashtable: Disable automatic shrinking by defaultThomas Graf
Introduce a new bool automatic_shrinking to require the user to explicitly opt-in to automatic shrinking of tables. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23netlink: Use default rhashtable hashfnHerbert Xu
This patch removes the explicit jhash value for the hashfn parameter of rhashtable. As the key length is a multiple of 4, this means that we will actually end up using jhash2. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-21netlink: Remove netlink_compare_arg.trailerHerbert Xu
Instead of computing the offset from trailer, this patch computes netlink_compare_arg_len from the offset of portid and then adds 4 to it. This allows trailer to be removed. Reported-by: David Miller <davem@davemloft.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20netlink: Move namespace into hash keyHerbert Xu
Currently the name space is a de facto key because it has to match before we find an object in the hash table. However, it isn't in the hash value so all objects from different name spaces with the same port ID hash to the same bucket. This is bad as the number of name spaces is unbounded. This patch fixes this by using the namespace when doing the hash. Because the namespace field doesn't lie next to the portid field in the netlink socket, this patch switches over to the rhashtable interface without a fixed key. This patch also uses the new inlined rhashtable interface where possible. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18netlink: Use rhashtable max_size instead of max_shiftHerbert Xu
This patch converts netlink to use rhashtable max_size instead of the obsolete max_shift. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/rocker/rocker.c The rocker commit was two overlapping changes, one to rename the ->vport member to ->pport, and another making the bitmask expression use '1ULL' instead of plain '1'. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-02net: Remove iocb argument from sendmsg and recvmsgYing Xue
After TIPC doesn't depend on iocb argument in its internal implementations of sendmsg() and recvmsg() hooks defined in proto structure, no any user is using iocb argument in them at all now. Then we can drop the redundant iocb argument completely from kinds of implementations of both sendmsg() and recvmsg() in the entire networking stack. Cc: Christoph Hellwig <hch@lst.de> Suggested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-27rhashtable: remove indirection for grow/shrink decision functionsDaniel Borkmann
Currently, all real users of rhashtable default their grow and shrink decision functions to rht_grow_above_75() and rht_shrink_below_30(), so that there's currently no need to have this explicitly selectable. It can/should be generic and private inside rhashtable until a real use case pops up. Since we can make this private, we'll save us this additional indirection layer and can improve insertion/deletion time as well. Reference: http://patchwork.ozlabs.org/patch/443040/ Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/vxlan.c drivers/vhost/net.c include/linux/if_vlan.h net/core/dev.c The net/core/dev.c conflict was the overlap of one commit marking an existing function static whilst another was adding a new function. In the include/linux/if_vlan.h case, the type used for a local variable was changed in 'net', whereas the function got rewritten to fix a stacked vlan bug in 'net-next'. In drivers/vhost/net.c, Al Viro's iov_iter conversions in 'net-next' overlapped with an endainness fix for VHOST 1.0 in 'net'. In drivers/net/vxlan.c, vxlan_find_vni() added a 'flags' parameter in 'net-next' whereas in 'net' there was a bug fix to pass in the correct network namespace pointer in calls to this function. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04Merge branch 'for-davem' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs More iov_iter work from Al Viro. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04netlink: Use rhashtable walk iteratorHerbert Xu
This patch gets rid of the manual rhashtable walk in netlink which touches rhashtable internals that should not be exposed. It does so by using the rhashtable iterator primitives. In fact the existing code was very buggy. Some sockets weren't shown at all while others were shown more than once. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-02-04netlink: make the check for "send from tx_ring" deterministicAl Viro
As it is, zero msg_iovlen means that the first iovec in the kernel array of iovecs is left uninitialized, so checking if its ->iov_base is NULL is random. Since the real users of that thing are doing sendto(fd, NULL, 0, ...), they are getting msg_iovlen = 1 and msg_iov[0] = {NULL, 0}, which is what this test is trying to catch. As suggested by davem, let's just check that msg_iovlen was 1 and msg_iov[0].iov_base was NULL - _that_ is well-defined and it catches what we want to catch. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-01-30netlink: fix wrong subscription bitmask to group mapping inPablo Neira
The subscription bitmask passed via struct sockaddr_nl is converted to the group number when calling the netlink_bind() and netlink_unbind() callbacks. The conversion is however incorrect since bitmask (1 << 0) needs to be mapped to group number 1. Note that you cannot specify the group number 0 (usually known as _NONE) from setsockopt() using NETLINK_ADD_MEMBERSHIP since this is rejected through -EINVAL. This problem became noticeable since 97840cb ("netfilter: nfnetlink: fix insufficient validation in nfnetlink_bind") when binding to bitmask (1 << 0) in ctnetlink. Reported-by: Andre Tomt <andre@tomt.net> Reported-by: Ivan Delalande <colona@arista.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-28net: remove sock_iocbChristoph Hellwig
The sock_iocb structure is allocate on stack for each read/write-like operation on sockets, and contains various fields of which only the embedded msghdr and sometimes a pointer to the scm_cookie is ever used. Get rid of the sock_iocb and put a msghdr directly on the stack and pass the scm_cookie explicitly to netlink_mmap_sendmsg. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: arch/arm/boot/dts/imx6sx-sdb.dts net/sched/cls_bpf.c Two simple sets of overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-27netlink: Kill redundant net argument in netlink_insertHerbert Xu
The socket already carries the net namespace with it so there is no need to be passing another net around. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-18netlink: make nlmsg_end() and genlmsg_end() voidJohannes Berg
Contrary to common expectations for an "int" return, these functions return only a positive value -- if used correctly they cannot even return 0 because the message header will necessarily be in the skb. This makes the very common pattern of if (genlmsg_end(...) < 0) { ... } be a whole bunch of dead code. Many places also simply do return nlmsg_end(...); and the caller is expected to deal with it. This also commonly (at least for me) causes errors, because it is very common to write if (my_function(...)) /* error condition */ and if my_function() does "return nlmsg_end()" this is of course wrong. Additionally, there's not a single place in the kernel that actually needs the message length returned, and if anyone needs it later then it'll be very easy to just use skb->len there. Remove this, and make the functions void. This removes a bunch of dead code as described above. The patch adds lines because I did - return nlmsg_end(...); + nlmsg_end(...); + return 0; I could have preserved all the function's return values by returning skb->len, but instead I've audited all the places calling the affected functions and found that none cared. A few places actually compared the return value with <= 0 in dump functionality, but that could just be changed to < 0 with no change in behaviour, so I opted for the more efficient version. One instance of the error I've made numerous times now is also present in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't check for <0 or <=0 and thus broke out of the loop every single time. I've preserved this since it will (I think) have caused the messages to userspace to be formatted differently with just a single message for every SKB returned to userspace. It's possible that this isn't needed for the tools that actually use this, but I don't even know what they are so couldn't test that changing this behaviour would be acceptable. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-16genetlink: synchronize socket closing and family removalJohannes Berg
In addition to the problem Jeff Layton reported, I looked at the code and reproduced the same warning by subscribing and removing the genl family with a socket still open. This is a fairly tricky race which originates in the fact that generic netlink allows the family to go away while sockets are still open - unlike regular netlink which has a module refcount for every open socket so in general this cannot be triggered. Trying to resolve this issue by the obvious locking isn't possible as it will result in deadlocks between unregistration and group unbind notification (which incidentally lockdep doesn't find due to the home grown locking in the netlink table.) To really resolve this, introduce a "closing socket" reference counter (for generic netlink only, as it's the only affected family) in the core netlink code and use that in generic netlink to wait for all the sockets that are being closed at the same time as a generic netlink family is removed. This fixes the race that when a socket is closed, it will should call the unbind, but if the family is removed at the same time the unbind will not find it, leading to the warning. The real problem though is that in this case the unbind could actually find a new family that is registered to have a multicast group with the same ID, and call its mcast_unbind() leading to confusing. Also remove the warning since it would still trigger, but is now no longer a problem. This also moves the code in af_netlink.c to before unreferencing the module to avoid having the same problem in the normal non-genl case. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-16genetlink: disallow subscribing to unknown mcast groupsJohannes Berg
Jeff Layton reported that he could trigger the multicast unbind warning in generic netlink using trinity. I originally thought it was a race condition between unregistering the generic netlink family and closing the socket, but there's a far simpler explanation: genetlink currently allows subscribing to groups that don't (yet) exist, and the warning is triggered when unsubscribing again while the group still doesn't exist. Originally, I had a warning in the subscribe case and accepted it out of userspace API concerns, but the warning was of course wrong and removed later. However, I now think that allowing userspace to subscribe to groups that don't exist is wrong and could possibly become a security problem: Consider a (new) genetlink family implementing a permission check in the mcast_bind() function similar to the like the audit code does today; it would be possible to bypass the permission check by guessing the ID and subscribing to the group it exists. This is only possible in case a family like that would be dynamically loaded, but it doesn't seem like a huge stretch, for example wireless may be loaded when you plug in a USB device. To avoid this reject such subscription attempts. If this ends up causing userspace issues we may need to add a workaround in af_netlink to deny such requests but not return an error. Reported-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-16netlink: Fix netlink_insert EADDRINUSE errorHerbert Xu
The patch c5adde9468b0714a051eac7f9666f23eb10b61f7 ("netlink: eliminate nl_sk_hash_lock") introduced a bug where the EADDRINUSE error has been replaced by ENOMEM. This patch rectifies that problem. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-13netlink: eliminate nl_sk_hash_lockYing Xue
As rhashtable_lookup_compare_insert() can guarantee the process of search and insertion is atomic, it's safe to eliminate the nl_sk_hash_lock. After this, object insertion or removal will be protected with per bucket lock on write side while object lookup is guarded with rcu read lock on read side. Signed-off-by: Ying Xue <ying.xue@windriver.com> Cc: Thomas Graf <tgraf@suug.ch> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-03netlink: Lockless lookup with RCU grace period in socket releaseThomas Graf
Defers the release of the socket reference using call_rcu() to allow using an RCU read-side protected call to rhashtable_lookup() This restores behaviour and performance gains as previously introduced by e341694 ("netlink: Convert netlink_lookup() to use RCU protected hash table") without the side effect of severely delayed socket destruction. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-03rhashtable: Per bucket locks & deferred expansion/shrinkingThomas Graf
Introduces an array of spinlocks to protect bucket mutations. The number of spinlocks per CPU is configurable and selected based on the hash of the bucket. This allows for parallel insertions and removals of entries which do not share a lock. The patch also defers expansion and shrinking to a worker queue which allows insertion and removal from atomic context. Insertions and deletions may occur in parallel to it and are only held up briefly while the particular bucket is linked or unzipped. Mutations of the bucket table pointer is protected by a new mutex, read access is RCU protected. In the event of an expansion or shrinking, the new bucket table allocated is exposed as a so called future table as soon as the resize process starts. Lookups, deletions, and insertions will briefly use both tables. The future table becomes the main table after an RCU grace period and initial linking of the old to the new table was performed. Optimization of the chains to make use of the new number of buckets follows only the new table is in use. The side effect of this is that during that RCU grace period, a bucket traversal using any rht_for_each() variant on the main table will not see any insertions performed during the RCU grace period which would at that point land in the future table. The lookup will see them as it searches both tables if needed. Having multiple insertions and removals occur in parallel requires nelems to become an atomic counter. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-03rhashtable: Convert bucket iterators to take table and indexThomas Graf
This patch is in preparation to introduce per bucket spinlocks. It extends all iterator macros to take the bucket table and bucket index. It also introduces a new rht_dereference_bucket() to handle protected accesses to buckets. It introduces a barrier() to the RCU iterators to the prevent the compiler from caching the first element. The lockdep verifier is introduced as stub which always succeeds and properly implement in the next patch when the locks are introduced. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-01-03rhashtable: Do hashing inside of rhashtable_lookup_compare()Thomas Graf
Hash the key inside of rhashtable_lookup_compare() like rhashtable_lookup() does. This allows to simplify the hashing functions and keep them private. Signed-off-by: Thomas Graf <tgraf@suug.ch> Cc: netfilter-devel@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-29genetlink: A genl_bind() to an out-of-range multicast group should not WARN().David S. Miller
Users can request to bind to arbitrary multicast groups, so warning when the requested group number is out of range is not appropriate. And with the warning removed, and the 'err' variable properly given an initial value, we can remove 'found' altogether. Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-27netlink/genetlink: pass network namespace to bind/unbindJohannes Berg
Netlink families can exist in multiple namespaces, and for the most part multicast subscriptions are per network namespace. Thus it only makes sense to have bind/unbind notifications per network namespace. To achieve this, pass the network namespace of a given client socket to the bind/unbind functions. Also do this in generic netlink, and there also make sure that any bind for multicast groups that only exist in init_net is rejected. This isn't really a problem if it is accepted since a client in a different namespace will never receive any notifications from such a group, but it can confuse the family if not rejected (it's also possible to silently (without telling the family) accept it, but it would also have to be ignored on unbind so families that take any kind of action on bind/unbind won't do unnecessary work for invalid clients like that. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-27genetlink: pass multicast bind/unbind to familiesJohannes Berg
In order to make the newly fixed multicast bind/unbind functionality in generic netlink, pass them down to the appropriate family. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-27netlink: call unbind when releasing socketJohannes Berg
Currently, netlink_unbind() is only called when the socket explicitly unbinds, which limits its usefulness (luckily there are no users of it yet anyway.) Call netlink_unbind() also when a socket is released, so it becomes possible to track listeners with this callback and without also implementing a netlink notifier (and checking netlink_has_listeners() in there.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-27netlink: update listeners directly when removing socketJohannes Berg
The code is now confusing to read - first in one function down (netlink_remove) any group subscriptions are implicitly removed by calling __sk_del_bind_node(), but the subscriber database is only updated far later by calling netlink_update_listeners(). Move the latter call to just after removal from the list so it is easier to follow the code. This also enables moving the locking inside the kernel-socket conditional, which improves the normal socket destruction path. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-27netlink: rename netlink_unbind() to netlink_undo_bind()Johannes Berg
The new name is more expressive - this isn't a generic unbind function but rather only a little undo helper for use only in netlink_bind(). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-18netlink: Don't reorder loads/stores before marking mmap netlink frame as ↵Thomas Graf
available Each mmap Netlink frame contains a status field which indicates whether the frame is unused, reserved, contains data or needs to be skipped. Both loads and stores may not be reordeded and must complete before the status field is changed and another CPU might pick up the frame for use. Use an smp_mb() to cover needs of both types of callers to netlink_set_status(), callers which have been reading data frame from the frame, and callers which have been filling or releasing and thus writing to the frame. - Example code path requiring a smp_rmb(): memcpy(skb->data, (void *)hdr + NL_MMAP_HDRLEN, hdr->nm_len); netlink_set_status(hdr, NL_MMAP_STATUS_UNUSED); - Example code path requiring a smp_wmb(): hdr->nm_uid = from_kuid(sk_user_ns(sk), NETLINK_CB(skb).creds.uid); hdr->nm_gid = from_kgid(sk_user_ns(sk), NETLINK_CB(skb).creds.gid); netlink_frame_flush_dcache(hdr); netlink_set_status(hdr, NL_MMAP_STATUS_VALID); Fixes: f9c228 ("netlink: implement memory mapped recvmsg()") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-18netlink: Always copy on mmap TX.David Miller
Checking the file f_count and the nlk->mapped count is not completely sufficient to prevent the mmap'd area contents from changing from under us during netlink mmap sendmsg() operations. Be careful to sample the header's length field only once, because this could change from under us as well. Fixes: 5fd96123ee19 ("netlink: implement memory mapped sendmsg()") Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Thomas Graf <tgraf@suug.ch>
2014-12-10netlink: use jhash as hashfn for rhashtableDaniel Borkmann
For netlink, we shouldn't be using arch_fast_hash() as a hashing discipline, but rather jhash() instead. Since netlink sockets can be opened by any user, a local attacker would be able to easily create collisions with the DPDK-derived arch_fast_hash(), which trades off performance for security by using crc32 CPU instructions on x86_64. While it might have a legimite use case in other places, it should be avoided in netlink context, though. As rhashtable's API is very flexible, we could later on still decide on other hashing disciplines, if legitimate. Reference: http://thread.gmane.org/gmane.linux.kernel/1844123 Fixes: e341694e3eb5 ("netlink: Convert netlink_lookup() to use RCU protected hash table") Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-09put iov_iter into msghdrAl Viro
Note that the code _using_ ->msg_iter at that point will be very unhappy with anything other than unshifted iovec-backed iov_iter. We still need to convert users to proper primitives. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-24new helper: memcpy_from_msg()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-11-19netlink: Deletion of an unnecessary check before the function call ↵Markus Elfring
"__module_get" The __module_get() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-14Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/chelsio/cxgb4vf/sge.c drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c sge.c was overlapping two changes, one to use the new __dev_alloc_page() in net-next, and one to use s->fl_pg_order in net. ixgbe_phy.c was a set of overlapping whitespace changes. Signed-off-by: David S. Miller <davem@davemloft.net>