summaryrefslogtreecommitdiff
path: root/net/ipv4/raw.c
AgeCommit message (Collapse)Author
2023-06-05ipv{4,6}/raw: fix output xfrm lookup wrt protocolNicolas Dichtel
commit 3632679d9e4f879f49949bb5b050e0de553e4739 upstream. With a raw socket bound to IPPROTO_RAW (ie with hdrincl enabled), the protocol field of the flow structure, build by raw_sendmsg() / rawv6_sendmsg()), is set to IPPROTO_RAW. This breaks the ipsec policy lookup when some policies are defined with a protocol in the selector. For ipv6, the sin6_port field from 'struct sockaddr_in6' could be used to specify the protocol. Just accept all values for IPPROTO_RAW socket. For ipv4, the sin_port field of 'struct sockaddr_in' could not be used without breaking backward compatibility (the value of this field was never checked). Let's add a new kind of control message, so that the userland could specify which protocol is used. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") CC: stable@vger.kernel.org Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://lore.kernel.org/r/20230522120820.1319391-1-nicolas.dichtel@6wind.com Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-02-01ipv4: raw: lock the socket in raw_bind()Eric Dumazet
[ Upstream commit 153a0d187e767c68733b8e9f46218eb1f41ab902 ] For some reason, raw_bind() forgot to lock the socket. BUG: KCSAN: data-race in __ip4_datagram_connect / raw_bind write to 0xffff8881170d4308 of 4 bytes by task 5466 on cpu 0: raw_bind+0x1b0/0x250 net/ipv4/raw.c:739 inet_bind+0x56/0xa0 net/ipv4/af_inet.c:443 __sys_bind+0x14b/0x1b0 net/socket.c:1697 __do_sys_bind net/socket.c:1708 [inline] __se_sys_bind net/socket.c:1706 [inline] __x64_sys_bind+0x3d/0x50 net/socket.c:1706 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff8881170d4308 of 4 bytes by task 5468 on cpu 1: __ip4_datagram_connect+0xb7/0x7b0 net/ipv4/datagram.c:39 ip4_datagram_connect+0x2a/0x40 net/ipv4/datagram.c:89 inet_dgram_connect+0x107/0x190 net/ipv4/af_inet.c:576 __sys_connect_file net/socket.c:1900 [inline] __sys_connect+0x197/0x1b0 net/socket.c:1917 __do_sys_connect net/socket.c:1927 [inline] __se_sys_connect net/socket.c:1924 [inline] __x64_sys_connect+0x3d/0x50 net/socket.c:1924 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x00000000 -> 0x0003007f Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 5468 Comm: syz-executor.5 Not tainted 5.17.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-10-01netfilter: drop bridge nf reset from nf_resetFlorian Westphal
commit 174e23810cd31 ("sk_buff: drop all skb extensions on free and skb scrubbing") made napi recycle always drop skb extensions. The additional skb_ext_del() that is performed via nf_reset on napi skb recycle is not needed anymore. Most nf_reset() calls in the stack are there so queued skb won't block 'rmmod nf_conntrack' indefinitely. This removes the skb_ext_del from nf_reset, and renames it to a more fitting nf_reset_ct(). In a few selected places, add a call to skb_ext_reset to make sure that no active extensions remain. I am submitting this for "net", because we're still early in the release cycle. The patch applies to net-next too, but I think the rename causes needless divergence between those trees. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-13ip: support SO_MARK cmsgWillem de Bruijn
Enable setting skb->mark for UDP and RAW sockets using cmsg. This is analogous to existing support for TOS, TTL, txtime, etc. Packet sockets already support this as of commit c7d39e32632e ("packet: support per-packet fwmark for af_packet sendmsg"). Similar to other fields, implement by 1. initialize the sockcm_cookie.mark from socket option sk_mark 2. optionally overwrite this in ip_cmsg_send/ip6_datagram_send_ctl 3. initialize inet_cork.mark from sockcm_cookie.mark 4. initialize each (usually just one) skb->mark from inet_cork.mark Step 1 is handled in one location for most protocols by ipcm_init_sk as of commit 351782067b6b ("ipv4: ipcm_cookie initializers"). Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-25ipv4: Use return value of inet_iif() for __raw_v4_lookup in the while loopStephen Suryaputra
In commit 19e4e768064a8 ("ipv4: Fix raw socket lookup for local traffic"), the dif argument to __raw_v4_lookup() is coming from the returned value of inet_iif() but the change was done only for the first lookup. Subsequent lookups in the while loop still use skb->dev->ifIndex. Fixes: 19e4e768064a8 ("ipv4: Fix raw socket lookup for local traffic") Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152Thomas Gleixner
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-19net: Treat sock->sk_drops as an unsigned int when printingPatrick Talbert
Currently, procfs socket stats format sk_drops as a signed int (%d). For large values this will cause a negative number to be printed. We know the drop count can never be a negative so change the format specifier to %u. Signed-off-by: Patrick Talbert <ptalbert@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-08ipv4: Fix raw socket lookup for local trafficDavid Ahern
inet_iif should be used for the raw socket lookup. inet_iif considers rt_iif which handles the case of local traffic. As it stands, ping to a local address with the '-I <dev>' option fails ever since ping was changed to use SO_BINDTODEVICE instead of cmsg + IP_PKTINFO. IPv6 works fine. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Lots of conflicts, by happily all cases of overlapping changes, parallel adds, things of that nature. Thanks to Stephen Rothwell, Saeed Mahameed, and others for their guidance in these resolutions. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-17net: add missing SOF_TIMESTAMPING_OPT_ID supportWillem de Bruijn
SOF_TIMESTAMPING_OPT_ID is supported on TCP, UDP and RAW sockets. But it was missing on RAW with IPPROTO_IP, PF_PACKET and CAN. Add skb_setup_tx_timestamp that configures both tx_flags and tskey for these paths that do not need corking or use bytestream keys. Fixes: 09c2d251b707 ("net-timestamp: add key to disambiguate concurrent datagrams") Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-27net/ipv4: Fix missing raw_init when CONFIG_PROC_FS is disabledDavid Ahern
Randy reported when CONFIG_PROC_FS is not enabled: ld: net/ipv4/af_inet.o: in function `inet_init': af_inet.c:(.init.text+0x42d): undefined reference to `raw_init' Fix by moving the endif up to the end of the proc entries Fixes: 6897445fb194c ("net: provide a sysctl raw_l3mdev_accept for raw socket lookup with VRFs") Reported-by: Randy Dunlap <rdunlap@infradead.org> Cc: Mike Manning <mmanning@vyatta.att-mail.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-07net: fix raw socket lookup device bind matching with VRFsDuncan Eastoe
When there exist a pair of raw sockets one unbound and one bound to a VRF but equal in all other respects, when a packet is received in the VRF context, __raw_v4_lookup() matches on both sockets. This results in the packet being delivered over both sockets, instead of only the raw socket bound to the VRF. The bound device checks in __raw_v4_lookup() are replaced with a call to raw_sk_bound_dev_eq() which correctly handles whether the packet should be delivered over the unbound socket in such cases. In __raw_v6_lookup() the match on the device binding of the socket is similarly updated to use raw_sk_bound_dev_eq() which matches the handling in __raw_v4_lookup(). Importantly raw_sk_bound_dev_eq() takes the raw_l3mdev_accept sysctl into account. Signed-off-by: Duncan Eastoe <deastoe@vyatta.att-mail.com> Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Tested-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-07net: provide a sysctl raw_l3mdev_accept for raw socket lookup with VRFsMike Manning
Add a sysctl raw_l3mdev_accept to control raw socket lookup in a manner similar to use of tcp_l3mdev_accept for stream and of udp_l3mdev_accept for datagram sockets. Have this default to enabled for reasons of backwards compatibility. This is so as to specify the output device with cmsg and IP_PKTINFO, but using a socket not bound to the corresponding VRF. This allows e.g. older ping implementations to be run with specifying the device but without executing it in the VRF. If the option is disabled, packets received in a VRF context are only handled by a raw socket bound to the VRF, and correspondingly packets in the default VRF are only handled by a socket not bound to any VRF. Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Tested-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-02ipv4: Allow sending multicast packets on specific i/f using VRF socketRobert Shearman
It is useful to be able to use the same socket for listening in a specific VRF, as for sending multicast packets out of a specific interface. However, the bound device on the socket currently takes precedence and results in the packets not being sent. Relax the condition on overriding the output interface to use for sending packets out of UDP, raw and ping sockets to allow multicast packets to be sent using the specified multicast interface. Signed-off-by: Robert Shearman <rshearma@vyatta.att-mail.com> Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07ip: remove tx_flags from ipcm_cookie and use same logic for v4 and v6Willem de Bruijn
skb_shinfo(skb)->tx_flags is derived from sk->sk_tsflags, possibly after modification by __sock_cmsg_send, by calling sock_tx_timestamp. The IPv4 and IPv6 paths do this conversion differently. In IPv4, the individual protocols that support tx timestamps call this function and store the result in ipc.tx_flags. In IPv6, sock_tx_timestamp is called in __ip6_append_data. There is no need to store both tx_flags and ts_flags in the cookie as one is derived from the other. Convert when setting up the cork and remove the redundant field. This is similar to IPv6, only have the conversion happen only once per datagram, in ip(6)_setup_cork. Also change __ip6_append_data to match __ip_append_data. Only update tskey if timestamping is enabled with OPT_ID. The SOCK_.. test is redundant: only valid protocols can have non-zero cork->tx_flags. After this change the IPv4 and IPv6 logic is the same. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-07ipv4: ipcm_cookie initializersWillem de Bruijn
Initialize the cookie in one location to reduce code duplication and avoid bugs from inconsistent initialization, such as that fixed in commit 9887cba19978 ("ip: limit use of gso_size to udp"). Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04net: ipv4: Hook into time based transmissionJesus Sanchez-Palencia
Add a transmit_time field to struct inet_cork, then copy the timestamp from the CMSG cookie at ip_setup_cork() so we can safely copy it into the skb later during __ip_make_skb(). For the raw fast path, just perform the copy at raw_send_hdrinc(). Signed-off-by: Richard Cochran <rcochran@linutronix.de> Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16proc: introduce proc_create_net{,_data}Christoph Hellwig
Variants of proc_create{,_data} that directly take a struct seq_operations and deal with network namespaces in ->open and ->release. All callers of proc_create + seq_open_net converted over, and seq_{open,release}_net are removed entirely. Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-05-16ipv{4,6}/raw: simplify ѕeq_file codeChristoph Hellwig
Pass the hashtable to the proc private data instead of copying it into the per-file private data. Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-03-27net: Drop pernet_operations::asyncKirill Tkhai
Synchronous pernet_operations are not allowed anymore. All are asynchronous. So, drop the structure member. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-26net: Use octal not symbolic permissionsJoe Perches
Prefer the direct use of octal for permissions. Done with checkpatch -f --types=SYMBOLIC_PERMS --fix-inplace and some typing. Miscellanea: o Whitespace neatening around these conversions. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-03-22net: Revert "ipv4: fix a deadlock in ip_ra_control"Kirill Tkhai
This reverts commit 1215e51edad1. Since raw_close() is used on every RAW socket destruction, the changes made by 1215e51edad1 scale sadly. This clearly seen on endless unshare(CLONE_NEWNET) test, and cleanup_net() kwork spends a lot of time waiting for rtnl_lock() introduced by this commit. Previous patch moved IP_ROUTER_ALERT out of rtnl_lock(), so we revert this patch. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-13net: Convert pernet_subsys, registered from inet_init()Kirill Tkhai
arp_net_ops just addr/removes /proc entry. devinet_ops allocates and frees duplicate of init_net tables and (un)registers sysctl entries. fib_net_ops allocates and frees pernet tables, creates/destroys netlink socket and (un)initializes /proc entries. Foreign pernet_operations do not touch them. ip_rt_proc_ops only modifies pernet /proc entries. xfrm_net_ops creates/destroys /proc entries, allocates/frees pernet statistics, hashes and tables, and (un)initializes sysctl files. These are not touched by foreigh pernet_operations xfrm4_net_ops allocates/frees private pernet memory, and configures sysctls. sysctl_route_ops creates/destroys sysctls. rt_genid_ops only initializes fields of just allocated net. ipv4_inetpeer_ops allocated/frees net private memory. igmp_net_ops just creates/destroys /proc files and socket, noone else interested in. tcp_sk_ops seems to be safe, because tcp_sk_init() does not depend on any other pernet_operations modifications. Iteration over hash table in inet_twsk_purge() is made under RCU lock, and it's safe to iterate the table this way. Removing from the table happen from inet_twsk_deschedule_put(), but this function is safe without any extern locks, as it's synchronized inside itself. There are many examples, it's used in different context. So, it's safe to leave tcp_sk_exit_batch() unlocked. tcp_net_metrics_ops is synchronized on tcp_metrics_lock and safe. udplite4_net_ops only creates/destroys pernet /proc file. icmp_sk_ops creates percpu sockets, not touched by foreign pernet_operations. ipmr_net_ops creates/destroys pernet fib tables, (un)registers fib rules and /proc files. This seem to be safe to execute in parallel with foreign pernet_operations. af_inet_ops just sets up default parameters of newly created net. ipv4_mib_ops creates and destroys pernet percpu statistics. raw_net_ops, tcp4_net_ops, udp4_net_ops, ping_v4_net_ops and ip_proc_ops only create/destroy pernet /proc files. ip4_frags_ops creates and destroys sysctl file. So, it's safe to make the pernet_operations async. Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Acked-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-02-03Merge tag 'usercopy-v4.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardened usercopy whitelisting from Kees Cook: "Currently, hardened usercopy performs dynamic bounds checking on slab cache objects. This is good, but still leaves a lot of kernel memory available to be copied to/from userspace in the face of bugs. To further restrict what memory is available for copying, this creates a way to whitelist specific areas of a given slab cache object for copying to/from userspace, allowing much finer granularity of access control. Slab caches that are never exposed to userspace can declare no whitelist for their objects, thereby keeping them unavailable to userspace via dynamic copy operations. (Note, an implicit form of whitelisting is the use of constant sizes in usercopy operations and get_user()/put_user(); these bypass all hardened usercopy checks since these sizes cannot change at runtime.) This new check is WARN-by-default, so any mistakes can be found over the next several releases without breaking anyone's system. The series has roughly the following sections: - remove %p and improve reporting with offset - prepare infrastructure and whitelist kmalloc - update VFS subsystem with whitelists - update SCSI subsystem with whitelists - update network subsystem with whitelists - update process memory with whitelists - update per-architecture thread_struct with whitelists - update KVM with whitelists and fix ioctl bug - mark all other allocations as not whitelisted - update lkdtm for more sensible test overage" * tag 'usercopy-v4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (38 commits) lkdtm: Update usercopy tests for whitelisting usercopy: Restrict non-usercopy caches to size 0 kvm: x86: fix KVM_XEN_HVM_CONFIG ioctl kvm: whitelist struct kvm_vcpu_arch arm: Implement thread_struct whitelist for hardened usercopy arm64: Implement thread_struct whitelist for hardened usercopy x86: Implement thread_struct whitelist for hardened usercopy fork: Provide usercopy whitelisting for task_struct fork: Define usercopy region in thread_stack slab caches fork: Define usercopy region in mm_struct slab caches net: Restrict unwhitelisted proto caches to size 0 sctp: Copy struct sctp_sock.autoclose to userspace using put_user() sctp: Define usercopy region in SCTP proto slab cache caif: Define usercopy region in caif proto slab cache ip: Define usercopy region in IP proto slab cache net: Define usercopy region in struct proto slab cache scsi: Define usercopy region in scsi_sense_cache slab cache cifs: Define usercopy region in cifs_request slab cache vxfs: Define usercopy region in vxfs_inode slab cache ufs: Define usercopy region in ufs_inode_cache slab cache ...
2018-01-25net/ipv4: Allow send to local broadcast from a socket bound to a VRFDavid Ahern
Message sends to the local broadcast address (255.255.255.255) require uc_index or sk_bound_dev_if to be set to an egress device. However, responses or only received if the socket is bound to the device. This is overly constraining for processes running in an L3 domain. This patch allows a socket bound to the VRF device to send to the local broadcast address by using IP_UNICAST_IF to set the egress interface with packet receipt handled by the VRF binding. Similar to IP_MULTICAST_IF, relax the constraint on setting IP_UNICAST_IF if a socket is bound to an L3 master device. In this case allow uc_index to be set to an enslaved if sk_bound_dev_if is an L3 master device and is the master device for the ifindex. In udp and raw sendmsg, allow uc_index to override the oif if uc_index master device is oif (ie., the oif is an L3 master and the index is an L3 slave). Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-16net: delete /proc THIS_MODULE referencesAlexey Dobriyan
/proc has been ignoring struct file_operations::owner field for 10 years. Specifically, it started with commit 786d7e1612f0b0adb6046f19b906609e4fe8b1ba ("Fix rmmod/read/write races in /proc entries"). Notice the chunk where inode->i_fop is initialized with proxy struct file_operations for regular files: - if (de->proc_fops) - inode->i_fop = de->proc_fops; + if (de->proc_fops) { + if (S_ISREG(inode->i_mode)) + inode->i_fop = &proc_reg_file_ops; + else + inode->i_fop = de->proc_fops; + } VFS stopped pinning module at this point. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-15ip: Define usercopy region in IP proto slab cacheDavid Windsor
The ICMP filters for IPv4 and IPv6 raw sockets need to be copied to/from userspace. In support of usercopy hardening, this patch defines a region in the struct proto slab cache in which userspace copy operations are allowed. example usage trace: net/ipv4/raw.c: raw_seticmpfilter(...): ... copy_from_user(&raw_sk(sk)->filter, ..., optlen) raw_geticmpfilter(...): ... copy_to_user(..., &raw_sk(sk)->filter, len) net/ipv6/raw.c: rawv6_seticmpfilter(...): ... copy_from_user(&raw6_sk(sk)->filter, ..., optlen) rawv6_geticmpfilter(...): ... copy_to_user(..., &raw6_sk(sk)->filter, len) This region is known as the slab cache's usercopy region. Slab caches can now check that each dynamically sized copy operation involving cache-managed memory falls entirely within the slab's usercopy region. This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY whitelisting code in the last public patch of grsecurity/PaX based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Signed-off-by: David Windsor <dave@nullcore.net> [kees: split from network patch, provide usage trace] Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org>
2018-01-09net: ipv4: emulate READ_ONCE() on ->hdrincl bit-field in raw_sendmsg()Nicolai Stange
Commit 8f659a03a0ba ("net: ipv4: fix for a race condition in raw_sendmsg") fixed the issue of possibly inconsistent ->hdrincl handling due to concurrent updates by reading this bit-field member into a local variable and using the thus stabilized value in subsequent tests. However, aforementioned commit also adds the (correct) comment that /* hdrincl should be READ_ONCE(inet->hdrincl) * but READ_ONCE() doesn't work with bit fields */ because as it stands, the compiler is free to shortcut or even eliminate the local variable at its will. Note that I have not seen anything like this happening in reality and thus, the concern is a theoretical one. However, in order to be on the safe side, emulate a READ_ONCE() on the bit-field by doing it on the local 'hdrincl' variable itself: int hdrincl = inet->hdrincl; hdrincl = READ_ONCE(hdrincl); This breaks the chain in the sense that the compiler is not allowed to replace subsequent reads from hdrincl with reloads from inet->hdrincl. Fixes: 8f659a03a0ba ("net: ipv4: fix for a race condition in raw_sendmsg") Signed-off-by: Nicolai Stange <nstange@suse.de> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-12-11net: ipv4: fix for a race condition in raw_sendmsgMohamed Ghannam
inet->hdrincl is racy, and could lead to uninitialized stack pointer usage, so its value should be read only once. Fixes: c008ba5bdc9f ("ipv4: Avoid reading user iov twice after raw_probe_proto_opt") Signed-off-by: Mohamed Ghannam <simo.ghannam@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07net: ipv4: add second dif to multicast source filterDavid Ahern
Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-08-07net: ipv4: add second dif to raw socket lookupsDavid Ahern
Add a second device index, sdif, to raw socket lookups. sdif is the index for ingress devices enslaved to an l3mdev. It allows the lookups to consider the enslaved device as well as the L3 domain when searching for a socket. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-07-01net: convert sock.sk_refcnt from atomic_t to refcount_tReshetova, Elena
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. This patch uses refcount_inc_not_zero() instead of atomic_inc_not_zero_hint() due to absense of a _hint() version of refcount API. If the hint() version must be used, we might need to revisit API. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-04ipv4, ipv6: ensure raw socket message is big enough to hold an IP headerAlexander Potapenko
raw_send_hdrinc() and rawv6_send_hdrinc() expect that the buffer copied from the userspace contains the IPv4/IPv6 header, so if too few bytes are copied, parts of the header may remain uninitialized. This bug has been detected with KMSAN. For the record, the KMSAN report: ================================================================== BUG: KMSAN: use of unitialized memory in nf_ct_frag6_gather+0xf5a/0x44a0 inter: 0 CPU: 0 PID: 1036 Comm: probe Not tainted 4.11.0-rc5+ #2455 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 dump_stack+0x143/0x1b0 lib/dump_stack.c:52 kmsan_report+0x16b/0x1e0 mm/kmsan/kmsan.c:1078 __kmsan_warning_32+0x5c/0xa0 mm/kmsan/kmsan_instr.c:510 nf_ct_frag6_gather+0xf5a/0x44a0 net/ipv6/netfilter/nf_conntrack_reasm.c:577 ipv6_defrag+0x1d9/0x280 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68 nf_hook_entry_hookfn ./include/linux/netfilter.h:102 nf_hook_slow+0x13f/0x3c0 net/netfilter/core.c:310 nf_hook ./include/linux/netfilter.h:212 NF_HOOK ./include/linux/netfilter.h:255 rawv6_send_hdrinc net/ipv6/raw.c:673 rawv6_sendmsg+0x2fcb/0x41a0 net/ipv6/raw.c:919 inet_sendmsg+0x3f8/0x6d0 net/ipv4/af_inet.c:762 sock_sendmsg_nosec net/socket.c:633 sock_sendmsg net/socket.c:643 SYSC_sendto+0x6a5/0x7c0 net/socket.c:1696 SyS_sendto+0xbc/0xe0 net/socket.c:1664 do_syscall_64+0x72/0xa0 arch/x86/entry/common.c:285 entry_SYSCALL64_slow_path+0x25/0x25 arch/x86/entry/entry_64.S:246 RIP: 0033:0x436e03 RSP: 002b:00007ffce48baf38 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00000000004002b0 RCX: 0000000000436e03 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00007ffce48baf90 R08: 00007ffce48baf50 R09: 000000000000001c R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000401790 R14: 0000000000401820 R15: 0000000000000000 origin: 00000000d9400053 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:362 kmsan_internal_poison_shadow+0xb1/0x1a0 mm/kmsan/kmsan.c:257 kmsan_poison_shadow+0x6d/0xc0 mm/kmsan/kmsan.c:270 slab_alloc_node mm/slub.c:2735 __kmalloc_node_track_caller+0x1f4/0x390 mm/slub.c:4341 __kmalloc_reserve net/core/skbuff.c:138 __alloc_skb+0x2cd/0x740 net/core/skbuff.c:231 alloc_skb ./include/linux/skbuff.h:933 alloc_skb_with_frags+0x209/0xbc0 net/core/skbuff.c:4678 sock_alloc_send_pskb+0x9ff/0xe00 net/core/sock.c:1903 sock_alloc_send_skb+0xe4/0x100 net/core/sock.c:1920 rawv6_send_hdrinc net/ipv6/raw.c:638 rawv6_sendmsg+0x2918/0x41a0 net/ipv6/raw.c:919 inet_sendmsg+0x3f8/0x6d0 net/ipv4/af_inet.c:762 sock_sendmsg_nosec net/socket.c:633 sock_sendmsg net/socket.c:643 SYSC_sendto+0x6a5/0x7c0 net/socket.c:1696 SyS_sendto+0xbc/0xe0 net/socket.c:1664 do_syscall_64+0x72/0xa0 arch/x86/entry/common.c:285 return_from_SYSCALL_64+0x0/0x6a arch/x86/entry/entry_64.S:246 ================================================================== , triggered by the following syscalls: socket(PF_INET6, SOCK_RAW, IPPROTO_RAW) = 3 sendto(3, NULL, 0, 0, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "ff00::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EPERM A similar report is triggered in net/ipv4/raw.c if we use a PF_INET socket instead of a PF_INET6 one. Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-04-17ipv4: fix a deadlock in ip_ra_controlWANG Cong
Similar to commit 87e9f0315952 ("ipv4: fix a potential deadlock in mcast getsockopt() path"), there is a deadlock scenario for IP_ROUTER_ALERT too: CPU0 CPU1 ---- ---- lock(rtnl_mutex); lock(sk_lock-AF_INET); lock(rtnl_mutex); lock(sk_lock-AF_INET); Fix this by always locking RTNL first on all setsockopt() paths. Note, after this patch ip_ra_lock is no longer needed either. Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-02-07net: use dst_confirm_neigh for UDP, RAW, ICMP, L2TPJulian Anastasov
When same struct dst_entry can be used for many different neighbours we can not use it for pending confirmations. The datagram protocols can use MSG_CONFIRM to confirm the neighbour. When used with MSG_PROBE we do not reach the code where neighbour is confirmed, so we have to do the same slow lookup by using the dst_confirm_neigh() helper. When MSG_PROBE is not used, ip_append_data/ip6_append_data will set the skb flag dst_pending_confirm. Reported-by: YueHaibing <yuehaibing@huawei.com> Fixes: 5110effee8fd ("net: Do delayed neigh confirmation.") Fixes: f2bb4bedf35d ("ipv4: Cache output routes in fib_info nexthops.") Signed-off-by: Julian Anastasov <ja@ssi.bg> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-12-24Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds
This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-11-07net: Update raw socket bind to consider l3 domainDavid Ahern
Binding a raw socket to a local address fails if the socket is bound to an L3 domain: $ vrf-test -s -l 10.100.1.2 -R -I red error binding socket: 99: Cannot assign requested address Update raw_bind to look consider if sk_bound_dev_if is bound to an L3 domain and use inet_addr_type_table to lookup the address. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-04net: inet: Support UID-based routing in IP protocols.Lorenzo Colitti
- Use the UID in routing lookups made by protocol connect() and sendmsg() functions. - Make sure that routing lookups triggered by incoming packets (e.g., Path MTU discovery) take the UID of the socket into account. - For packets not associated with a userspace socket, (e.g., ping replies) use UID 0 inside the user namespace corresponding to the network namespace the socket belongs to. This allows all namespaces to apply routing and iptables rules to kernel-originated traffic in that namespaces by matching UID 0. This is better than using the UID of the kernel socket that is sending the traffic, because the UID of kernel sockets created at namespace creation time (e.g., the per-processor ICMP and TCP sockets) is the UID of the user that created the socket, which might not be mapped in the namespace. Tested: compiles allnoconfig, allyesconfig, allmodconfig Tested: https://android-review.googlesource.com/253302 Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-02net: ip, diag -- Adjust raw_abort to use unlocked __udp_disconnectCyrill Gorcunov
While being preparing patches for killing raw sockets via diag netlink interface I noticed that my runs are stuck: | [root@pcs7 ~]# cat /proc/`pidof ss`/stack | [<ffffffff816d1a76>] __lock_sock+0x80/0xc4 | [<ffffffff816d206a>] lock_sock_nested+0x47/0x95 | [<ffffffff8179ded6>] udp_disconnect+0x19/0x33 | [<ffffffff8179b517>] raw_abort+0x33/0x42 | [<ffffffff81702322>] sock_diag_destroy+0x4d/0x52 which has not been the case before. I narrowed it down to the commit | commit 286c72deabaa240b7eebbd99496ed3324d69f3c0 | Author: Eric Dumazet <edumazet@google.com> | Date: Thu Oct 20 09:39:40 2016 -0700 | | udp: must lock the socket in udp_disconnect() where we start locking the socket for different reason. So the raw_abort escaped the renaming and we have to fix this typo using __udp_disconnect instead. Fixes: 286c72deabaa ("udp: must lock the socket in udp_disconnect()") CC: David S. Miller <davem@davemloft.net> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: David Ahern <dsa@cumulusnetworks.com> CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> CC: James Morris <jmorris@namei.org> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> CC: Patrick McHardy <kaber@trash.net> CC: Andrey Vagin <avagin@openvz.org> CC: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Mostly simple overlapping changes. For example, David Ahern's adjacency list revamp in 'net-next' conflicted with an adjacency list traversal bug fix in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-23net: ip, diag -- Add diag interface for raw socketsCyrill Gorcunov
In criu we are actively using diag interface to collect sockets present in the system when dumping applications. And while for unix, tcp, udp[lite], packet, netlink it works as expected, the raw sockets do not have. Thus add it. v2: - add missing sock_put calls in raw_diag_dump_one (by eric.dumazet@) - implement @destroy for diag requests (by dsa@) v3: - add export of raw_abort for IPv6 (by dsa@) - pass net-admin flag into inet_sk_diag_fill due to changes in net-next branch (by dsa@) v4: - use @pad in struct inet_diag_req_v2 for raw socket protocol specification: raw module carries sockets which may have custom protocol passed from socket() syscall and sole @sdiag_protocol is not enough to match underlied ones - start reporting protocol specifed in socket() call when sockets are raw ones for the same reason: user space tools like ss may parse this attribute and use it for socket matching v5 (by eric.dumazet@): - use sock_hold in raw_sock_get instead of atomic_inc, we're holding (raw_v4_hashinfo|raw_v6_hashinfo)->lock when looking up so counter won't be zero here. v6: - use sdiag_raw_protocol() helper which will access @pad structure used for raw sockets protocol specification: we can't simply rename this member without breaking uapi v7: - sine sdiag_raw_protocol() helper is not suitable for uapi lets rather make an alias structure with proper names. __check_inet_diag_req_raw helper will catch if any of structure unintentionally changed. CC: David S. Miller <davem@davemloft.net> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: David Ahern <dsa@cumulusnetworks.com> CC: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> CC: James Morris <jmorris@namei.org> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> CC: Patrick McHardy <kaber@trash.net> CC: Andrey Vagin <avagin@openvz.org> CC: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20udp: must lock the socket in udp_disconnect()Eric Dumazet
Baozeng Ding reported KASAN traces showing uses after free in udp_lib_get_port() and other related UDP functions. A CONFIG_DEBUG_PAGEALLOC=y kernel would eventually crash. I could write a reproducer with two threads doing : static int sock_fd; static void *thr1(void *arg) { for (;;) { connect(sock_fd, (const struct sockaddr *)arg, sizeof(struct sockaddr_in)); } } static void *thr2(void *arg) { struct sockaddr_in unspec; for (;;) { memset(&unspec, 0, sizeof(unspec)); connect(sock_fd, (const struct sockaddr *)&unspec, sizeof(unspec)); } } Problem is that udp_disconnect() could run without holding socket lock, and this was causing list corruptions. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10net: ipv4: Remove l3mdev_get_saddrDavid Ahern
No longer needed Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04sock: enable timestamping using control messagesSoheil Hassas Yeganeh
Currently, SOL_TIMESTAMPING can only be enabled using setsockopt. This is very costly when users want to sample writes to gather tx timestamps. Add support for enabling SO_TIMESTAMPING via control messages by using tsflags added in `struct sockcm_cookie` (added in the previous patches in this series) to set the tx_flags of the last skb created in a sendmsg. With this patch, the timestamp recording bits in tx_flags of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg. Please note that this is only effective for overriding the recording timestamps flags. Users should enable timestamp reporting (e.g., SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using socket options and then should ask for SOF_TIMESTAMPING_TX_* using control messages per sendmsg to sample timestamps for each write. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04ipv4: process socket-level control messages in IPv4Soheil Hassas Yeganeh
Process socket-level control messages by invoking __sock_cmsg_send in ip_cmsg_send for control messages on the SOL_SOCKET layer. This makes sure whenever ip_cmsg_send is called in udp, icmp, and raw, we also process socket-level control messages. Note that this commit interprets new control messages that were ignored before. As such, this commit does not change the behavior of IPv4 control messages. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/phy/bcm7xxx.c drivers/net/phy/marvell.c drivers/net/vxlan.c All three conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-13ipv4: fix memory leaks in ip_cmsg_send() callersEric Dumazet
Dmitry reported memory leaks of IP options allocated in ip_cmsg_send() when/if this function returns an error. Callers are responsible for the freeing. Many thanks to Dmitry for the report and diagnostic. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11sock: struct proto hash function may errorCraig Gallek
In order to support fast reuseport lookups in TCP, the hash function defined in struct proto must be capable of returning an error code. This patch changes the function signature of all related hash functions to return an integer and handles or propagates this return value at all call sites. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04net: Propagate lookup failure in l3mdev_get_saddr to callerDavid Ahern
Commands run in a vrf context are not failing as expected on a route lookup: root@kenny:~# ip ro ls table vrf-red unreachable default root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254 ping: Warning: source address might be selected on device other than vrf-red. PING 10.100.1.254 (10.100.1.254) from 0.0.0.0 vrf-red: 56(84) bytes of data. --- 10.100.1.254 ping statistics --- 2 packets transmitted, 0 received, 100% packet loss, time 999ms Since the vrf table does not have a route for 10.100.1.254 the ping should have failed. The saddr lookup causes a full VRF table lookup. Propogating a lookup failure to the user allows the command to fail as expected: root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254 connect: No route to host Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-16raw: increment correct SNMP counters for ICMP messagesBen Cartwright-Cox
Sending ICMP packets with raw sockets ends up in the SNMP counters logging the type as the first byte of the IPv4 header rather than the ICMP header. This is fixed by adding the IP Header Length to the casting into a icmphdr struct. Signed-off-by: Ben Cartwright-Cox <ben@benjojo.co.uk> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>