summaryrefslogtreecommitdiff
path: root/drivers/vhost
AgeCommit message (Collapse)Author
2019-03-19vhost/vsock: fix vhost vsock cid hashing inconsistentZha Bin
commit 7fbe078c37aba3088359c9256c1a1d0c3e39ee81 upstream. The vsock core only supports 32bit CID, but the Virtio-vsock spec define CID (dst_cid and src_cid) as u64 and the upper 32bits is reserved as zero. This inconsistency causes one bug in vhost vsock driver. The scenarios is: 0. A hash table (vhost_vsock_hash) is used to map an CID to a vsock object. And hash_min() is used to compute the hash key. hash_min() is defined as: (sizeof(val) <= 4 ? hash_32(val, bits) : hash_long(val, bits)). That means the hash algorithm has dependency on the size of macro argument 'val'. 0. In function vhost_vsock_set_cid(), a 64bit CID is passed to hash_min() to compute the hash key when inserting a vsock object into the hash table. 0. In function vhost_vsock_get(), a 32bit CID is passed to hash_min() to compute the hash key when looking up a vsock for an CID. Because the different size of the CID, hash_min() returns different hash key, thus fails to look up the vsock object for an CID. To fix this bug, we keep CID as u64 in the IOCTLs and virtio message headers, but explicitly convert u64 to u32 when deal with the hash table and vsock core. Fixes: 834e772c8db0 ("vhost/vsock: fix use-after-free in network stack callers") Link: https://github.com/stefanha/virtio/blob/vsock/trunk/content.tex Signed-off-by: Zha Bin <zhabin@linux.alibaba.com> Reviewed-by: Liu Jiang <gerry@linux.alibaba.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Shengjing Zhu <i@zhsj.me> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-02-27vhost/scsi: Use copy_to_iter() to send control queue responseBijan Mottahedeh
[ Upstream commit 8e5dadfe76cf2862ebf3e4f22adef29982df7766 ] Uses copy_to_iter() instead of __copy_to_user() in order to ensure we support arbitrary layouts and an input buffer split across iov entries. Fixes: 0d02dbd68c47b ("vhost/scsi: Respond to control queue operations") Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-27vhost: return EINVAL if iovecs size does not match the message sizePavel Tikhomirov
[ Upstream commit 74ad7419489ddade8044e3c9ab064ad656520306 ] We've failed to copy and process vhost_iotlb_msg so let userspace at least know about it. For instance before these patch the code below runs without any error: int main() { struct vhost_msg msg; struct iovec iov; int fd; fd = open("/dev/vhost-net", O_RDWR); if (fd == -1) { perror("open"); return 1; } iov.iov_base = &msg; iov.iov_len = sizeof(msg)-4; if (writev(fd, &iov,1) == -1) { perror("writev"); return 1; } return 0; } Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-23vhost: correctly check the return value of translate_desc() in log_used()Jason Wang
[ Upstream commit 816db7663565cd23f74ed3d5c9240522e3fb0dda ] When fail, translate_desc() returns negative value, otherwise the number of iovs. So we should fail when the return value is negative instead of a blindly check against zero. Detected by CoverityScan, CID# 1442593: Control flow issues (DEADCODE) Fixes: cc5e71075947 ("vhost: log dirty page correctly") Acked-by: Michael S. Tsirkin <mst@redhat.com> Reported-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-02-06vhost: fix OOB in get_rx_bufs()Jason Wang
[ Upstream commit b46a0bf78ad7b150ef5910da83859f7f5a514ffd ] After batched used ring updating was introduced in commit e2b3b35eb989 ("vhost_net: batch used ring update in rx"). We tend to batch heads in vq->heads for more than one packet. But the quota passed to get_rx_bufs() was not correctly limited, which can result a OOB write in vq->heads. headcount = get_rx_bufs(vq, vq->heads + nvq->done_idx, vhost_len, &in, vq_log, &log, likely(mergeable) ? UIO_MAXIOV : 1); UIO_MAXIOV was still used which is wrong since we could have batched used in vq->heads, this will cause OOB if the next buffer needs more than 960 (1024 (UIO_MAXIOV) - 64 (VHOST_NET_BATCH)) heads after we've batched 64 (VHOST_NET_BATCH) heads: Acked-by: Stefan Hajnoczi <stefanha@redhat.com> ============================================================================= BUG kmalloc-8k (Tainted: G B ): Redzone overwritten ----------------------------------------------------------------------------- INFO: 0x00000000fd93b7a2-0x00000000f0713384. First byte 0xa9 instead of 0xcc INFO: Allocated in alloc_pd+0x22/0x60 age=3933677 cpu=2 pid=2674 kmem_cache_alloc_trace+0xbb/0x140 alloc_pd+0x22/0x60 gen8_ppgtt_create+0x11d/0x5f0 i915_ppgtt_create+0x16/0x80 i915_gem_create_context+0x248/0x390 i915_gem_context_create_ioctl+0x4b/0xe0 drm_ioctl_kernel+0xa5/0xf0 drm_ioctl+0x2ed/0x3a0 do_vfs_ioctl+0x9f/0x620 ksys_ioctl+0x6b/0x80 __x64_sys_ioctl+0x11/0x20 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 INFO: Slab 0x00000000d13e87af objects=3 used=3 fp=0x (null) flags=0x200000000010201 INFO: Object 0x0000000003278802 @offset=17064 fp=0x00000000e2e6652b Fixing this by allocating UIO_MAXIOV + VHOST_NET_BATCH iovs for vhost-net. This is done through set the limitation through vhost_dev_init(), then set_owner can allocate the number of iov in a per device manner. This fixes CVE-2018-16880. Fixes: e2b3b35eb989 ("vhost_net: batch used ring update in rx") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-31vhost: log dirty page correctlyJason Wang
[ Upstream commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4 ] Vhost dirty page logging API is designed to sync through GPA. But we try to log GIOVA when device IOTLB is enabled. This is wrong and may lead to missing data after migration. To solve this issue, when logging with device IOTLB enabled, we will: 1) reuse the device IOTLB translation result of GIOVA->HVA mapping to get HVA, for writable descriptor, get HVA through iovec. For used ring update, translate its GIOVA to HVA 2) traverse the GPA->HVA mapping to get the possible GPA and log through GPA. Pay attention this reverse mapping is not guaranteed to be unique, so we should log each possible GPA in this case. This fix the failure of scp to guest during migration. In -next, we will probably support passing GIOVA->GPA instead of GIOVA->HVA. Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API") Reported-by: Jintack Lim <jintack@cs.columbia.edu> Cc: Jintack Lim <jintack@cs.columbia.edu> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-12Revert "net: vhost: lock the vqs one by one"Jason Wang
This reverts commit 78139c94dc8c96a478e67dab3bee84dc6eccb5fd. We don't protect device IOTLB with vq mutex, which will lead e.g use after free for device IOTLB entries. And since we've switched to use mutex_trylock() in previous patch, it's safe to revert it without having deadlock. Fixes: commit 78139c94dc8c ("net: vhost: lock the vqs one by one") Cc: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-12vhost_net: switch to use mutex_trylock() in vhost_net_busy_poll()Jason Wang
We used to hold the mutex of paired virtqueue in vhost_net_busy_poll(). But this will results an inconsistent lock order which may cause deadlock if we try to bring back the protection of device IOTLB with vq mutex that requires to hold mutex of all virtqueues at the same time. Fix this simply by switching to use mutex_trylock(), when fail just skip the busy polling. This can happen when device IOTLB is under updating which should be rare. Fixes: commit 78139c94dc8c ("net: vhost: lock the vqs one by one") Cc: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-12vhost: make sure used idx is seen before log in vhost_add_used_n()Jason Wang
We miss a write barrier that guarantees used idx is updated and seen before log. This will let userspace sync and copy used ring before used idx is update. Fix this by adding a barrier before log_write(). Fixes: 8dd014adfea6f ("vhost-net: mergeable buffers support") Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "A decent batch of fixes here. I'd say about half are for problems that have existed for a while, and half are for new regressions added in the 4.20 merge window. 1) Fix 10G SFP phy module detection in mvpp2, from Baruch Siach. 2) Revert bogus emac driver change, from Benjamin Herrenschmidt. 3) Handle BPF exported data structure with pointers when building 32-bit userland, from Daniel Borkmann. 4) Memory leak fix in act_police, from Davide Caratti. 5) Check RX checksum offload in RX descriptors properly in aquantia driver, from Dmitry Bogdanov. 6) SKB unlink fix in various spots, from Edward Cree. 7) ndo_dflt_fdb_dump() only works with ethernet, enforce this, from Eric Dumazet. 8) Fix FID leak in mlxsw driver, from Ido Schimmel. 9) IOTLB locking fix in vhost, from Jean-Philippe Brucker. 10) Fix SKB truesize accounting in ipv4/ipv6/netfilter frag memory limits otherwise namespace exit can hang. From Jiri Wiesner. 11) Address block parsing length fixes in x25 from Martin Schiller. 12) IRQ and ring accounting fixes in bnxt_en, from Michael Chan. 13) For tun interfaces, only iface delete works with rtnl ops, enforce this by disallowing add. From Nicolas Dichtel. 14) Use after free in liquidio, from Pan Bian. 15) Fix SKB use after passing to netif_receive_skb(), from Prashant Bhole. 16) Static key accounting and other fixes in XPS from Sabrina Dubroca. 17) Partially initialized flow key passed to ip6_route_output(), from Shmulik Ladkani. 18) Fix RTNL deadlock during reset in ibmvnic driver, from Thomas Falcon. 19) Several small TCP fixes (off-by-one on window probe abort, NULL deref in tail loss probe, SNMP mis-estimations) from Yuchung Cheng" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (93 commits) net/sched: cls_flower: Reject duplicated rules also under skip_sw bnxt_en: Fix _bnxt_get_max_rings() for 57500 chips. bnxt_en: Fix NQ/CP rings accounting on the new 57500 chips. bnxt_en: Keep track of reserved IRQs. bnxt_en: Fix CNP CoS queue regression. net/mlx4_core: Correctly set PFC param if global pause is turned off. Revert "net/ibm/emac: wrong bit is used for STA control" neighbour: Avoid writing before skb->head in neigh_hh_output() ipv6: Check available headroom in ip6_xmit() even without options tcp: lack of available data can also cause TSO defer ipv6: sr: properly initialize flowi6 prior passing to ip6_route_output mlxsw: spectrum_switchdev: Fix VLAN device deletion via ioctl mlxsw: spectrum_router: Relax GRE decap matching check mlxsw: spectrum_switchdev: Avoid leaking FID's reference count mlxsw: spectrum_nve: Remove easily triggerable warnings ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes sctp: frag_point sanity check tcp: fix NULL ref in tail loss probe tcp: Do not underestimate rwnd_limited net: use skb_list_del_init() to remove from RX sublists ...
2018-12-06vhost/vsock: fix use-after-free in network stack callersStefan Hajnoczi
If the network stack calls .send_pkt()/.cancel_pkt() during .release(), a struct vhost_vsock use-after-free is possible. This occurs because .release() does not wait for other CPUs to stop using struct vhost_vsock. Switch to an RCU-enabled hashtable (indexed by guest CID) so that .release() can wait for other CPUs by calling synchronize_rcu(). This also eliminates vhost_vsock_lock acquisition in the data path so it could have a positive effect on performance. This is CVE-2018-14625 "kernel: use-after-free Read in vhost_transport_send_pkt". Cc: stable@vger.kernel.org Reported-and-tested-by: syzbot+bd391451452fb0b93039@syzkaller.appspotmail.com Reported-by: syzbot+e3e074963495f92a89ed@syzkaller.appspotmail.com Reported-by: syzbot+d5a0a170c5069658b141@syzkaller.appspotmail.com Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com>
2018-12-06vhost/vsock: fix reset orphans race with close timeoutStefan Hajnoczi
If a local process has closed a connected socket and hasn't received a RST packet yet, then the socket remains in the table until a timeout expires. When a vhost_vsock instance is released with the timeout still pending, the socket is never freed because vhost_vsock has already set the SOCK_DONE flag. Check if the close timer is pending and let it close the socket. This prevents the race which can leak sockets. Reported-by: Maximilian Riemensberger <riemensberger@cadami.net> Cc: Graham Whaley <graham.whaley@gmail.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2018-12-03vhost: fix IOTLB lockingJean-Philippe Brucker
Commit 78139c94dc8c ("net: vhost: lock the vqs one by one") moved the vq lock to improve scalability, but introduced a possible deadlock in vhost-iotlb. vhost_iotlb_notify_vq() now takes vq->mutex while holding the device's IOTLB spinlock. And on the vhost_iotlb_miss() path, the spinlock is taken while holding vq->mutex. Since calling vhost_poll_queue() doesn't require any lock, avoid the deadlock by not taking vq->mutex. Fixes: 78139c94dc8c ("net: vhost: lock the vqs one by one") Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-01Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio/vhost updates from Michael Tsirkin: "Fixes and tweaks: - virtio balloon page hinting support - vhost scsi control queue - misc fixes" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: MAINTAINERS: remove reference to bogus vsock file vhost/scsi: Use common handling code in request queue handler vhost/scsi: Extract common handling code from control queue handler vhost/scsi: Respond to control queue operations vhost/scsi: truncate T10 PI iov_iter to prot_bytes virtio-balloon: VIRTIO_BALLOON_F_PAGE_POISON mm/page_poison: expose page_poisoning_enabled to kernel modules virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT kvm_config: add CONFIG_VIRTIO_MENU
2018-10-31vhost: Fix Spectre V1 vulnerabilityJason Wang
The idx in vhost_vring_ioctl() was controlled by userspace, hence a potential exploitation of the Spectre variant 1 vulnerability. Fixing this by sanitizing idx before using it to index d->vqs. Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-10-24vhost/scsi: Use common handling code in request queue handlerBijan Mottahedeh
Change the request queue handler to use common handling routines same as the control queue handler. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2018-10-24vhost/scsi: Extract common handling code from control queue handlerBijan Mottahedeh
Prepare to change the request queue handler to use common handling routines. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2018-10-24vhost/scsi: Respond to control queue operationsBijan Mottahedeh
The vhost-scsi driver currently does not handle any control queue operations. In particular, vhost_scsi_ctl_handle_kick, merely prints out a debug message but does nothing else. This can cause guest VMs to hang. As part of SCSI recovery from an error, e.g., an I/O timeout, the SCSI midlayer attempts to abort the failed operation. The SCSI virtio driver translates the abort to a SCSI TMF request that gets put on the control queue (virtscsi_abort -> virtscsi_tmf). The SCSI virtio driver then waits indefinitely for this request to be completed, but it never will because vhost-scsi never responds to that request. To avoid a hang, always respond to control queue operations; explicitly reject TMF requests, and return a no-op response to event requests. Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2018-10-24vhost/scsi: truncate T10 PI iov_iter to prot_bytesGreg Edwards
Commands with protection information included were not truncating the protection iov_iter to the number of protection bytes in the command. This resulted in vhost_scsi mis-calculating the size of the protection SGL in vhost_scsi_calc_sgls(), and including both the protection and data SG entries in the protection SGL. Fixes: 09b13fa8c1a1 ("vhost/scsi: Add ANY_LAYOUT support in vhost_scsi_handle_vq") Signed-off-by: Greg Edwards <gedwards@ddn.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Fixes: 09b13fa8c1a1093e9458549ac8bb203a7c65c62a Cc: stable@vger.kernel.org Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-07net: vhost: remove bad code lineTonghao Zhang
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-26net: vhost: add rx busy polling in tx pathTonghao Zhang
This patch improves the guest receive performance. On the handle_tx side, we poll the sock receive queue at the same time. handle_rx do that in the same way. We set the poll-us=100us and use the netperf to test throughput and mean latency. When running the tests, the vhost-net kthread of that VM, is alway 100% CPU. The commands are shown as below. Rx performance is greatly improved by this patch. There is not notable performance change on tx with this series though. This patch is useful for bi-directional traffic. netperf -H IP -t TCP_STREAM -l 20 -- -O "THROUGHPUT, THROUGHPUT_UNITS, MEAN_LATENCY" Topology: [Host] ->linux bridge -> tap vhost-net ->[Guest] TCP_STREAM: * Without the patch: 19842.95 Mbps, 6.50 us mean latency * With the patch: 37598.20 Mbps, 3.43 us mean latency Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-26net: vhost: factor out busy polling logic to vhost_net_busy_poll()Tonghao Zhang
Factor out generic busy polling logic and will be used for in tx path in the next patch. And with the patch, qemu can set differently the busyloop_timeout for rx queue. To avoid duplicate codes, introduce the helper functions: * sock_has_rx_data(changed from sk_has_rx_data) * vhost_net_busy_poll_try_queue Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-26net: vhost: replace magic number of lock annotationTonghao Zhang
Use the VHOST_NET_VQ_XXX as a subclass for mutex_lock_nested. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-26net: vhost: lock the vqs one by oneTonghao Zhang
This patch changes the way that lock all vqs at the same, to lock them one by one. It will be used for next patch to avoid the deadlock. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21vhost_net: add a missing error returnDan Carpenter
We accidentally left out this error return so it leads to some use after free bugs later on. Fixes: 0a0be13b8fe2 ("vhost_net: batch submitting XDP buffers to underlayer sockets") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13vhost_net: batch submitting XDP buffers to underlayer socketsJason Wang
This patch implements XDP batching for vhost_net. The idea is first to try to do userspace copy and build XDP buff directly in vhost. Instead of submitting the packet immediately, vhost_net will batch them in an array and submit every 64 (VHOST_NET_BATCH) packets to the under layer sockets through msg_control of sendmsg(). When XDP is enabled on the TUN/TAP, TUN/TAP can process XDP inside a loop without caring GUP thus it can do batch map flushing. When XDP is not enabled or not supported, the underlayer socket need to build skb and pass it to network core. The batched packet submission allows us to do batching like netif_receive_skb_list() in the future. This saves lots of indirect calls for better cache utilization. For the case that we can't so batching e.g when sndbuf is limited or packet size is too large, we will go for usual one packet per sendmsg() way. Doing testpmd on various setups gives us: Test /+pps% XDP_DROP on TAP /+44.8% XDP_REDIRECT on TAP /+29% macvtap (skb) /+26% Netperf tests shows obvious improvements for small packet transmission: size/session/+thu%/+normalize% 64/ 1/ +2%/ 0% 64/ 2/ +3%/ +1% 64/ 4/ +7%/ +5% 64/ 8/ +8%/ +6% 256/ 1/ +3%/ 0% 256/ 2/ +10%/ +7% 256/ 4/ +26%/ +22% 256/ 8/ +27%/ +23% 512/ 1/ +3%/ +2% 512/ 2/ +19%/ +14% 512/ 4/ +43%/ +40% 512/ 8/ +45%/ +41% 1024/ 1/ +4%/ 0% 1024/ 2/ +27%/ +21% 1024/ 4/ +38%/ +73% 1024/ 8/ +15%/ +24% 2048/ 1/ +10%/ +7% 2048/ 2/ +16%/ +12% 2048/ 4/ 0%/ +2% 2048/ 8/ 0%/ +2% 4096/ 1/ +36%/ +60% 4096/ 2/ -11%/ -26% 4096/ 4/ 0%/ +14% 4096/ 8/ 0%/ +4% 16384/ 1/ -1%/ +5% 16384/ 2/ 0%/ +2% 16384/ 4/ 0%/ -3% 16384/ 8/ 0%/ +4% 65535/ 1/ 0%/ +10% 65535/ 2/ 0%/ +8% 65535/ 4/ 0%/ +1% 65535/ 8/ 0%/ +3% Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13tun: switch to new type of msg_controlJason Wang
This patch introduces to a new tun/tap specific msg_control: #define TUN_MSG_UBUF 1 #define TUN_MSG_PTR 2 struct tun_msg_ctl { int type; void *ptr; }; This allows us to pass different kinds of msg_control through sendmsg(). The first supported type is ubuf (TUN_MSG_UBUF) which will be used by the existed vhost_net zerocopy code. The second is XDP buff, which allows vhost_net to pass XDP buff to TUN. This could be used to implement accepting an array of XDP buffs from vhost_net in the following patches. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) ICE, E1000, IGB, IXGBE, and I40E bug fixes from the Intel folks. 2) Better fix for AB-BA deadlock in packet scheduler code, from Cong Wang. 3) bpf sockmap fixes (zero sized key handling, etc.) from Daniel Borkmann. 4) Send zero IPID in TCP resets and SYN-RECV state ACKs, to prevent attackers using it as a side-channel. From Eric Dumazet. 5) Memory leak in mediatek bluetooth driver, from Gustavo A. R. Silva. 6) Hook up rt->dst.input of ipv6 anycast routes properly, from Hangbin Liu. 7) hns and hns3 bug fixes from Huazhong Tan. 8) Fix RIF leak in mlxsw driver, from Ido Schimmel. 9) iova range check fix in vhost, from Jason Wang. 10) Fix hang in do_tcp_sendpages() with tls, from John Fastabend. 11) More r8152 chips need to disable RX aggregation, from Kai-Heng Feng. 12) Memory exposure in TCA_U32_SEL handling, from Kees Cook. 13) TCP BBR congestion control fixes from Kevin Yang. 14) hv_netvsc, ignore non-PCI devices, from Stephen Hemminger. 15) qed driver fixes from Tomer Tayar. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (77 commits) net: sched: Fix memory exposure from short TCA_U32_SEL qed: fix spelling mistake "comparsion" -> "comparison" vhost: correctly check the iova range when waking virtqueue qlge: Fix netdev features configuration. net: macb: do not disable MDIO bus at open/close time Revert "net: stmmac: fix build failure due to missing COMMON_CLK dependency" net: macb: Fix regression breaking non-MDIO fixed-link PHYs mlxsw: spectrum_switchdev: Do not leak RIFs when removing bridge i40e: fix condition of WARN_ONCE for stat strings i40e: Fix for Tx timeouts when interface is brought up if DCB is enabled ixgbe: fix driver behaviour after issuing VFLR ixgbe: Prevent unsupported configurations with XDP ixgbe: Replace GFP_ATOMIC with GFP_KERNEL igb: Replace mdelay() with msleep() in igb_integrated_phy_loopback() igb: Replace GFP_ATOMIC with GFP_KERNEL in igb_sw_init() igb: Use an advanced ctx descriptor for launchtime e1000: ensure to free old tx/rx rings in set_ringparam() e1000: check on netif_running() before calling e1000_up() ixgb: use dma_zalloc_coherent instead of allocator/memset ice: Trivial formatting fixes ...
2018-08-25vhost: correctly check the iova range when waking virtqueueJason Wang
We don't wakeup the virtqueue if the first byte of pending iova range is the last byte of the range we just got updated. This will lead a virtqueue to wait for IOTLB updating forever. Fixing by correct the check and wake up the virtqueue in this case. Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API") Reported-by: Peter Xu <peterx@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com> Tested-by: Peter Xu <peterx@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-24Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: "virtio, vhost: fixes, tweaks No new features but a bunch of tweaks such as switching balloon from oom notifier to shrinker" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: vhost/scsi: increase VHOST_SCSI_PREALLOC_PROT_SGLS to 2048 vhost: allow vhost-scsi driver to be built-in virtio: pci-legacy: Validate queue pfn virtio: mmio-v1: Validate queue PFN virtio_balloon: replace oom notifier with shrinker virtio-balloon: kzalloc the vb struct virtio-balloon: remove BUG() in init_vqs
2018-08-22vhost/scsi: increase VHOST_SCSI_PREALLOC_PROT_SGLS to 2048Greg Edwards
The current value of VHOST_SCSI_PREALLOC_PROT_SGLS is too small to accommodate larger I/Os, e.g. 16-32 MiB, when the VIRTIO_SCSI_F_T10_PI feature bit is negotiated and the backing store supports T10 PI. vhost-scsi rejects the command with errors like: [ 59.581317] vhost_scsi_calc_sgls: requested sgl_count: 1820 exceeds pre-allocated max_sgls: 512 Signed-off-by: Greg Edwards <gedwards@ddn.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2018-08-22vhost: allow vhost-scsi driver to be built-inGreg Edwards
It's useful to allow vhost-scsi to be built-in when testing vhost in L1 + L2 VMs and booting L1 VM with QEMU '-kernel' option. Signed-off-by: Greg Edwards <gedwards@ddn.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2018-08-15Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsiLinus Torvalds
Pull SCSI updates from James Bottomley: "This is mostly updates to the usual drivers: mpt3sas, lpfc, qla2xxx, hisi_sas, smartpqi, megaraid_sas, arcmsr. In addition, with the continuing absence of Nic we have target updates for tcmu and target core (all with reviews and acks). The biggest observable change is going to be that we're (again) trying to switch to mulitqueue as the default (a user can still override the setting on the kernel command line). Other major core stuff is the removal of the remaining Microchannel drivers, an update of the internal timers and some reworks of completion and result handling" * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (203 commits) scsi: core: use blk_mq_run_hw_queues in scsi_kick_queue scsi: ufs: remove unnecessary query(DM) UPIU trace scsi: qla2xxx: Fix issue reported by static checker for qla2x00_els_dcmd2_sp_done() scsi: aacraid: Spelling fix in comment scsi: mpt3sas: Fix calltrace observed while running IO & reset scsi: aic94xx: fix an error code in aic94xx_init() scsi: st: remove redundant pointer STbuffer scsi: qla2xxx: Update driver version to 10.00.00.08-k scsi: qla2xxx: Migrate NVME N2N handling into state machine scsi: qla2xxx: Save frame payload size from ICB scsi: qla2xxx: Fix stalled relogin scsi: qla2xxx: Fix race between switch cmd completion and timeout scsi: qla2xxx: Fix Management Server NPort handle reservation logic scsi: qla2xxx: Flush mailbox commands on chip reset scsi: qla2xxx: Fix unintended Logout scsi: qla2xxx: Fix session state stuck in Get Port DB scsi: qla2xxx: Fix redundant fc_rport registration scsi: qla2xxx: Silent erroneous message scsi: qla2xxx: Prevent sysfs access when chip is down scsi: qla2xxx: Add longer window for chip reset ...
2018-08-09Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Overlapping changes in RXRPC, changing to ktime_get_seconds() whilst adding some tracepoints. Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-08vhost: reset metadata cache when initializing new IOTLBJason Wang
We need to reset metadata cache during new IOTLB initialization, otherwise the stale pointers to previous IOTLB may be still accessed which will lead a use after free. Reported-by: syzbot+c51e6736a1bf614b3272@syzkaller.appspotmail.com Fixes: f88949138058 ("vhost: introduce O(1) vq metadata cache") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-06vhost: switch to use new message formatJason Wang
We use to have message like: struct vhost_msg { int type; union { struct vhost_iotlb_msg iotlb; __u8 padding[64]; }; }; Unfortunately, there will be a hole of 32bit in 64bit machine because of the alignment. This leads a different formats between 32bit API and 64bit API. What's more it will break 32bit program running on 64bit machine. So fixing this by introducing a new message type with an explicit 32bit reserved field after type like: struct vhost_msg_v2 { __u32 type; __u32 reserved; union { struct vhost_iotlb_msg iotlb; __u8 padding[64]; }; }; We will have a consistent ABI after switching to use this. To enable this capability, introduce a new ioctl (VHOST_SET_BAKCEND_FEATURE) for userspace to enable this feature (VHOST_BACKEND_F_IOTLB_V2). Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API") Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-02scsi: target: loop, usb, vhost, xen: use target_remove_sessionMike Christie
This converts drivers that were only calling transport_deregister_session to use target_remove_session. The calling of transport_deregister_session_configfs via target_remove_session for these types of drivers is ok, because they were not exporting info from fields like sess_acl_list, sess->se_tpg and sess->fabric_sess_ptr from configfs accessible functions, so they will see no difference. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Felipe Balbi <balbi@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Andrzej Pietrasiewicz <andrzej.p@samsung.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Juergen Gross <jgross@suse.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2018-08-02scsi: target: rename target_alloc_sessionMike Christie
Rename target_alloc_session to target_setup_session to avoid confusion with the other transport session allocation function that only allocates the session and because the target_alloc_session does so much more. It allocates the session, sets up the nacl and registers the session. The next patch will then add a remove function to match the setup in this one, so it should make sense for all drivers, except iscsi, to just call those 2 functions to setup and remove a session. iscsi will continue to be the odd driver. Signed-off-by: Mike Christie <mchristi@redhat.com> Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Chris Boot <bootc@bootc.net> Cc: Bryant G. Ly <bryantly@linux.vnet.ibm.com> Cc: Michael Cyr <mikecyr@linux.vnet.ibm.com> Cc: <qla2xxx-upstream@qlogic.com> Cc: Johannes Thumshirn <jth@kernel.org> Cc: Felipe Balbi <balbi@kernel.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Andrzej Pietrasiewicz <andrzej.p@samsung.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Juergen Gross <jgross@suse.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
2018-07-22vhost_net: batch update used ring for datacopy TXJason Wang
Like commit e2b3b35eb989 ("vhost_net: batch used ring update in rx"), this patches implements batch used ring update for datacopy TX (zerocopy has already done some kind of batching). Testpmd transmission from guest to host (XDP_DROP on tap) shows 25.8% improvement (from ~3.1Mpps to ~3.9Mpps) on Broadwell i7-5600U CPU @ 2.60GHz machine. Netperf TCP tests does not show obvious differences. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-22vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCHJason Wang
A more generic name which could be used for TX as well. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-22vhost_net: rename vhost_rx_signal_used() to vhost_net_signal_used()Jason Wang
Rename for reusing this for TX. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-22vhost_net: split out datacopy logicJason Wang
Instead of mixing zerocopy and datacopy logics, this patch tries to split datacopy logic out. This results for a more compact code and ad-hoc optimization could be done on top more easily. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-22vhost_net: introduce tx_can_batch()Jason Wang
Introduce tx_can_batch() to determine whether TX could be batched. This will help to reduce the code duplication in the future. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-22vhost_net: introduce get_tx_bufs()Jason Wang
Factor out logic of getting tx buffer and iov iter initialization. This will be used for reducing codes duplication in the future. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-22vhost_net: introduce vhost_exceeds_weight()Jason Wang
Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-22vhost_net: introduce helper to initialize tx iov iterJason Wang
Introduce init_iov_iter() in order to be reused by future patch. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-22vhost_net: drop unnecessary parameterJason Wang
Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04vhost_net: Avoid rx vring kicks during busyloopToshiaki Makita
We may run out of avail rx ring descriptor under heavy load but busypoll did not detect it so busypoll may have exited prematurely. Avoid this by checking rx ring full during busypoll. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04vhost_net: Avoid rx queue wake-ups during busypollToshiaki Makita
We may run handle_rx() while rx work is queued. For example a packet can push the rx work during the window before handle_rx calls vhost_net_disable_vq(). In that case busypoll immediately exits due to vhost_has_work() condition and enables vq again. This can lead to another unnecessary rx wake-ups, so poll rx work instead of enabling the vq. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-07-04vhost_net: Avoid tx vring kicks during busyloopToshiaki Makita
Under heavy load vhost busypoll may run without suppressing notification. For example tx zerocopy callback can push tx work while handle_tx() is running, then busyloop exits due to vhost_has_work() condition and enables notification but immediately reenters handle_tx() because the pushed work was tx. In this case handle_tx() tries to disable notification again, but when using event_idx it by design cannot. Then busyloop will run without suppressing notification. Another example is the case where handle_tx() tries to enable notification but avail idx is advanced so disables it again. This case also leads to the same situation with event_idx. The problem is that once we enter this situation busyloop does not work under heavy load for considerable amount of time, because notification is likely to happen during busyloop and handle_tx() immediately enables notification after notification happens. Specifically busyloop detects notification by vhost_has_work() and then handle_tx() calls vhost_enable_notify(). Because the detected work was the tx work, it enters handle_tx(), and enters busyloop without suppression again. This is likely to be repeated, so with event_idx we are almost not able to suppress notification in this case. To fix this, poll the work instead of enabling notification when busypoll is interrupted by something. IMHO vhost_has_work() is kind of interruption rather than a signal to completely cancel the busypoll, so let's run busypoll after the necessary work is done. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>