summaryrefslogtreecommitdiff
path: root/net/sunrpc
AgeCommit message (Collapse)Author
2019-01-23sunrpc: handle ENOMEM in rpcb_getport_asyncJ. Bruce Fields
commit 81c88b18de1f11f70c97f28ced8d642c00bb3955 upstream. If we ignore the error we'll hit a null dereference a little later. Reported-by: syzbot+4b98281f2401ab849f4b@syzkaller.appspotmail.com Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-16sunrpc: use-after-free in svc_process_common()Vasily Averin
commit d4b09acf924b84bae77cad090a9d108e70b43643 upstream. if node have NFSv41+ mounts inside several net namespaces it can lead to use-after-free in svc_process_common() svc_process_common() /* Setup reply header */ rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr(rqstp); <<< HERE svc_process_common() can use incorrect rqstp->rq_xprt, its caller function bc_svc_process() takes it from serv->sv_bc_xprt. The problem is that serv is global structure but sv_bc_xprt is assigned per-netnamespace. According to Trond, the whole "let's set up rqstp->rq_xprt for the back channel" is nothing but a giant hack in order to work around the fact that svc_process_common() uses it to find the xpt_ops, and perform a couple of (meaningless for the back channel) tests of xpt_flags. All we really need in svc_process_common() is to be able to run rqstp->rq_xprt->xpt_ops->xpo_prep_reply_hdr() Bruce J Fields points that this xpo_prep_reply_hdr() call is an awfully roundabout way just to do "svc_putnl(resv, 0);" in the tcp case. This patch does not initialiuze rqstp->rq_xprt in bc_svc_process(), now it calls svc_process_common() with rqstp->rq_xprt = NULL. To adjust reply header svc_process_common() just check rqstp->rq_prot and calls svc_tcp_prep_reply_hdr() for tcp case. To handle rqstp->rq_xprt = NULL case in functions called from svc_process_common() patch intruduces net namespace pointer svc_rqst->rq_bc_net and adjust SVC_NET() definition. Some other function was also adopted to properly handle described case. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Cc: stable@vger.kernel.org Fixes: 23c20ecd4475 ("NFS: callback up - users counting cleanup") Signed-off-by: J. Bruce Fields <bfields@redhat.com> v2: - added lost extern svc_tcp_prep_reply_hdr() - dropped trace_svc_process() changes Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-13sunrpc: use SVC_NET() in svcauth_gss_* functionsVasily Averin
commit b8be5674fa9a6f3677865ea93f7803c4212f3e10 upstream. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-13sunrpc: fix cache_head leak due to queued requestVasily Averin
commit 4ecd55ea074217473f94cfee21bb72864d39f8d7 upstream. After commit d202cce8963d, an expired cache_head can be removed from the cache_detail's hash. However, the expired cache_head may be waiting for a reply from a previously submitted request. Such a cache_head has an increased refcounter and therefore it won't be freed after cache_put(freeme). Because the cache_head was removed from the hash it cannot be found during cache_clean() and can be leaked forever, together with stalled cache_request and other taken resources. In our case we noticed it because an entry in the export cache was holding a reference on a filesystem. Fixes d202cce8963d ("sunrpc: never return expired entries in sunrpc_cache_lookup") Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Cc: stable@kernel.org # 2.6.35 Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-13SUNRPC: Fix a race with XPRT_CONNECTINGTrond Myklebust
[ Upstream commit cf76785d30712d90185455e752337acdb53d2a5d ] Ensure that we clear XPRT_CONNECTING before releasing the XPRT_LOCK so that we don't have races between the (asynchronous) socket setup code and tasks in xprt_connect(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Tested-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-01-09sock: Make sock->sk_stamp thread-safeDeepa Dinamani
[ Upstream commit 3a0ed3e9619738067214871e9cb826fa23b2ddb9 ] Al Viro mentioned (Message-ID <20170626041334.GZ10672@ZenIV.linux.org.uk>) that there is probably a race condition lurking in accesses of sk_stamp on 32-bit machines. sock->sk_stamp is of type ktime_t which is always an s64. On a 32 bit architecture, we might run into situations of unsafe access as the access to the field becomes non atomic. Use seqlocks for synchronization. This allows us to avoid using spinlocks for readers as readers do not need mutual exclusion. Another approach to solve this is to require sk_lock for all modifications of the timestamps. The current approach allows for timestamps to have their own lock: sk_stamp_lock. This allows for the patch to not compete with already existing critical sections, and side effects are limited to the paths in the patch. The addition of the new field maintains the data locality optimizations from commit 9115e8cd2a0c ("net: reorganize struct sock for better data locality") Note that all the instances of the sk_stamp accesses are either through the ioctl or the syscall recvmsg. Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-21SUNRPC: Fix a potential race in xprt_connect()Trond Myklebust
[ Upstream commit 0a9a4304f3614e25d9de9b63502ca633c01c0d70 ] If an asynchronous connection attempt completes while another task is in xprt_connect(), then the call to rpc_sleep_on() could end up racing with the call to xprt_wake_pending_tasks(). So add a second test of the connection state after we've put the task to sleep and set the XPRT_CONNECTING flag, when we know that there can be no asynchronous connection attempts still in progress. Fixes: 0b9e79431377d ("SUNRPC: Move the test for XPRT_CONNECTING into...") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-12-13SUNRPC: Fix leak of krb5p encode pagesChuck Lever
commit 8dae5398ab1ac107b1517e8195ed043d5f422bd0 upstream. call_encode can be invoked more than once per RPC call. Ensure that each call to gss_wrap_req_priv does not overwrite pointers to previously allocated memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-01SUNRPC: Fix a bogus get/put in generic_key_to_expire()Trond Myklebust
[ Upstream commit e3d5e573a54dabdc0f9f3cb039d799323372b251 ] Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-11-27SUNRPC: drop pointless static qualifier in xdr_get_next_encode_buffer()YueHaibing
[ Upstream commit 025911a5f4e36955498ed50806ad1b02f0f76288 ] There is no need to have the '__be32 *p' variable static since new value always be assigned before use it. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-11-21sunrpc: correct the computation for page_ptr when truncatingFrank Sorenson
commit 5d7a5bcb67c70cbc904057ef52d3fcfeb24420bb upstream. When truncating the encode buffer, the page_ptr is getting advanced, causing the next page to be skipped while encoding. The page is still included in the response, so the response contains a page of bogus data. We need to adjust the page_ptr backwards to ensure we encode the next page into the correct place. We saw this triggered when concurrent directory modifications caused nfsd4_encode_direct_fattr() to return nfserr_noent, and the resulting call to xdr_truncate_encode() corrupted the READDIR reply. Signed-off-by: Frank Sorenson <sorenson@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-11-13nfsd: Fix an Oops in free_session()Trond Myklebust
commit bb6ad5572c0022e17e846b382d7413cdcf8055be upstream. In call_xpt_users(), we delete the entry from the list, but we do not reinitialise it. This triggers the list poisoning when we later call unregister_xpt_user() in nfsd4_del_conns(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-15sunrpc: Don't use stack buffer with scatterlistLaura Abbott
[ Upstream commit 44090cc876926277329e1608bafc01b9f6da627f ] Fedora got a bug report from NFS: kernel BUG at include/linux/scatterlist.h:143! ... RIP: 0010:sg_init_one+0x7d/0x90 .. make_checksum+0x4e7/0x760 [rpcsec_gss_krb5] gss_get_mic_kerberos+0x26e/0x310 [rpcsec_gss_krb5] gss_marshal+0x126/0x1a0 [auth_rpcgss] ? __local_bh_enable_ip+0x80/0xe0 ? call_transmit_status+0x1d0/0x1d0 [sunrpc] call_transmit+0x137/0x230 [sunrpc] __rpc_execute+0x9b/0x490 [sunrpc] rpc_run_task+0x119/0x150 [sunrpc] nfs4_run_exchange_id+0x1bd/0x250 [nfsv4] _nfs4_proc_exchange_id+0x2d/0x490 [nfsv4] nfs41_discover_server_trunking+0x1c/0xa0 [nfsv4] nfs4_discover_server_trunking+0x80/0x270 [nfsv4] nfs4_init_client+0x16e/0x240 [nfsv4] ? nfs_get_client+0x4c9/0x5d0 [nfs] ? _raw_spin_unlock+0x24/0x30 ? nfs_get_client+0x4c9/0x5d0 [nfs] nfs4_set_client+0xb2/0x100 [nfsv4] nfs4_create_server+0xff/0x290 [nfsv4] nfs4_remote_mount+0x28/0x50 [nfsv4] mount_fs+0x3b/0x16a vfs_kern_mount.part.35+0x54/0x160 nfs_do_root_mount+0x7f/0xc0 [nfsv4] nfs4_try_mount+0x43/0x70 [nfsv4] ? get_nfs_version+0x21/0x80 [nfs] nfs_fs_mount+0x789/0xbf0 [nfs] ? pcpu_alloc+0x6ca/0x7e0 ? nfs_clone_super+0x70/0x70 [nfs] ? nfs_parse_mount_options+0xb40/0xb40 [nfs] mount_fs+0x3b/0x16a vfs_kern_mount.part.35+0x54/0x160 do_mount+0x1fd/0xd50 ksys_mount+0xba/0xd0 __x64_sys_mount+0x21/0x30 do_syscall_64+0x60/0x1f0 entry_SYSCALL_64_after_hwframe+0x49/0xbe This is BUG_ON(!virt_addr_valid(buf)) triggered by using a stack allocated buffer with a scatterlist. Convert the buffer for rc4salt to be dynamically allocated instead. Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1615258 Signed-off-by: Laura Abbott <labbott@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-09NFSv4 client live hangs after live data migration recoveryBill Baker
commit 0f90be132cbf1537d87a6a8b9e80867adac892f6 upstream. After a live data migration event at the NFS server, the client may send I/O requests to the wrong server, causing a live hang due to repeated recovery events. On the wire, this will appear as an I/O request failing with NFS4ERR_BADSESSION, followed by successful CREATE_SESSION, repeatedly. NFS4ERR_BADSSESSION is returned because the session ID being used was issued by the other server and is not valid at the old server. The failure is caused by async worker threads having cached the transport (xprt) in the rpc_task structure. After the migration recovery completes, the task is redispatched and the task resends the request to the wrong server based on the old value still present in tk_xprt. The solution is to recompute the tk_xprt field of the rpc_task structure so that the request goes to the correct server. Signed-off-by: Bill Baker <bill.baker@oracle.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Helen Chao <helen.chao@oracle.com> Fixes: fb43d17210ba ("SUNRPC: Use the multipath iterator to assign a ...") Cc: stable@vger.kernel.org # v4.9+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-22xprtrdma: Fix corner cases when handling device removalChuck Lever
commit 25524288631fc5b7d33259fca1e0dc38146be5d6 upstream. Michal Kalderon has found some corner cases around device unload with active NFS mounts that I didn't have the imagination to test when xprtrdma device removal was added last year. - The ULP device removal handler is responsible for deallocating the PD. That wasn't clear to me initially, and my own testing suggested it was not necessary, but that is incorrect. - The transport destruction path can no longer assume that there is a valid ID. - When destroying a transport, ensure that ib_free_cq() is not invoked on a CQ that was already released. Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com> Fixes: bebd031866ca ("xprtrdma: Support unplugging an HCA from ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-07-03xprtrdma: Return -ENOBUFS when no pages are availableChuck Lever
commit a8f688ec437dc2045cc8f0c89fe877d5803850da upstream. The use of -EAGAIN in rpcrdma_convert_iovs() is a latent bug: the transport never calls xprt_write_space() when more pages become available. -ENOBUFS will trigger the correct "delay briefly and call again" logic. Fixes: 7a89f9c626e3 ("xprtrdma: Honor ->send_request API contract") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org # 4.8+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26SUNRPC: Don't call __UDPX_INC_STATS() from a preemptible contextTrond Myklebust
[ Upstream commit 0afa6b4412988019db14c6bfb8c6cbdf120ca9ad ] Calling __UDPX_INC_STATS() from a preemptible context leads to a warning of the form: BUG: using __this_cpu_add() in preemptible [00000000] code: kworker/u5:0/31 caller is xs_udp_data_receive_workfn+0x194/0x270 CPU: 1 PID: 31 Comm: kworker/u5:0 Not tainted 4.15.0-rc8-00076-g90ea9f1 #2 Workqueue: xprtiod xs_udp_data_receive_workfn Call Trace: dump_stack+0x85/0xc1 check_preemption_disabled+0xce/0xe0 xs_udp_data_receive_workfn+0x194/0x270 process_one_work+0x318/0x620 worker_thread+0x20a/0x390 ? process_one_work+0x620/0x620 kthread+0x120/0x130 ? __kthread_bind_mask+0x60/0x60 ret_from_fork+0x24/0x30 Since we're taking a spinlock in those functions anyway, let's fix the issue by moving the call so that it occurs under the spinlock. Reported-by: kernel test robot <fengguang.wu@intel.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26svcrdma: Fix Read chunk round-upChuck Lever
[ Upstream commit 175e03101d36c3034f3c80038d4c28838351a7f2 ] A single NFSv4 WRITE compound can often have three operations: PUTFH, WRITE, then GETATTR. When the WRITE payload is sent in a Read chunk, the client places the GETATTR in the inline part of the RPC/RDMA message, just after the WRITE operation (sans payload). The position value in the Read chunk enables the receiver to insert the Read chunk at the correct place in the received XDR stream; that is between the WRITE and GETATTR. According to RFC 8166, an NFS/RDMA client does not have to add XDR round-up to the Read chunk that carries the WRITE payload. The receiver adds XDR round-up padding if it is absent and the receiver's XDR decoder requires it to be present. Commit 193bcb7b3719 ("svcrdma: Populate tail iovec when receiving") attempted to add support for receiving such a compound so that just the WRITE payload appears in rq_arg's page list, and the trailing GETATTR is placed in rq_arg's tail iovec. (TCP just strings the whole compound into the head iovec and page list, without regard to the alignment of the WRITE payload). The server transport logic also had to accommodate the optional XDR round-up of the Read chunk, which it did simply by lengthening the tail iovec when round-up was needed. This approach is adequate for the NFSv2 and NFSv3 WRITE decoders. Unfortunately it is not sufficient for nfsd4_decode_write. When the Read chunk length is a couple of bytes less than PAGE_SIZE, the computation at the end of nfsd4_decode_write allows argp->pagelen to go negative, which breaks the logic in read_buf that looks for the tail iovec. The result is that a WRITE operation whose payload length is just less than a multiple of a page succeeds, but the subsequent GETATTR in the same compound fails with NFS4ERR_OP_ILLEGAL because the XDR decoder can't find it. Clients ignore the error, but they must update their attribute cache via a separate round trip. As nfsd4_decode_write appears to expect the payload itself to always have appropriate XDR round-up, have svc_rdma_build_normal_read_chunk add the Read chunk XDR round-up to the page_len rather than lengthening the tail iovec. Reported-by: Olga Kornievskaia <kolga@netapp.com> Fixes: 193bcb7b3719 ("svcrdma: Populate tail iovec when receiving") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-26xprtrdma: Fix backchannel allocation of extra rpcrdma_repsChuck Lever
[ Upstream commit d698c4a02ee02053bbebe051322ff427a2dad56a ] The backchannel code uses rpcrdma_recv_buffer_put to add new reps to the free rep list. This also decrements rb_recv_count, which spoofs the receive overrun logic in rpcrdma_buffer_get_rep. Commit 9b06688bc3b9 ("xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)") replaced the original open-coded list_add with a call to rpcrdma_recv_buffer_put(), but then a year later, commit 05c974669ece ("xprtrdma: Fix receive buffer accounting") added rep accounting to rpcrdma_recv_buffer_put. It was an oversight to let the backchannel continue to use this function. The fix this, let's combine the "add to free list" logic with rpcrdma_create_rep. Also, do not allocate RPCRDMA_MAX_BC_REQUESTS rpcrdma_reps in rpcrdma_buffer_create and then allocate additional rpcrdma_reps in rpcrdma_bc_setup_reps. Allocating the extra reps during backchannel set-up is sufficient. Fixes: 05c974669ece ("xprtrdma: Fix receive buffer accounting") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-24rpc_pipefs: fix double-dput()Al Viro
commit 4a3877c4cedd95543f8726b0a98743ed8db0c0fb upstream. if we ever hit rpc_gssd_dummy_depopulate() dentry passed to it has refcount equal to 1. __rpc_rmpipe() drops it and dput() done after that hits an already freed dentry. Cc: stable@kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-19sunrpc: remove incorrect HMAC request initializationEric Biggers
commit f3aefb6a7066e24bfea7fcf1b07907576de69d63 upstream. make_checksum_hmac_md5() is allocating an HMAC transform and doing crypto API calls in the following order: crypto_ahash_init() crypto_ahash_setkey() crypto_ahash_digest() This is wrong because it makes no sense to init() the request before a key has been set, given that the initial state depends on the key. And digest() is short for init() + update() + final(), so in this case there's no need to explicitly call init() at all. Before commit 9fa68f620041 ("crypto: hash - prevent using keyed hashes without setting key") the extra init() had no real effect, at least for the software HMAC implementation. (There are also hardware drivers that implement HMAC-MD5, and it's not immediately obvious how gracefully they handle init() before setkey().) But now the crypto API detects this incorrect initialization and returns -ENOKEY. This is breaking NFS mounts in some cases. Fix it by removing the incorrect call to crypto_ahash_init(). Reported-by: Michael Young <m.a.young@durham.ac.uk> Fixes: 9fa68f620041 ("crypto: hash - prevent using keyed hashes without setting key") Fixes: fffdaef2eb4a ("gss_krb5: Add support for rc4-hmac encryption") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22xprtrdma: Fix BUG after a device removalChuck Lever
commit e89e8d8fcdc6751e86ccad794b052fe67e6ad619 upstream. Michal Kalderon reports a BUG that occurs just after device removal: [ 169.112490] rpcrdma: removing device qedr0 for 192.168.110.146:20049 [ 169.143909] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 [ 169.181837] IP: rpcrdma_dma_unmap_regbuf+0xa/0x60 [rpcrdma] The RPC/RDMA client transport attempts to allocate some resources on demand. Registered buffers are one such resource. These are allocated (or re-allocated) by xprt_rdma_allocate to hold RPC Call and Reply messages. A hardware resource is associated with each of these buffers, as they can be used for a Send or Receive Work Request. If a device is removed from under an NFS/RDMA mount, the transport layer is responsible for releasing all hardware resources before the device can be finally unplugged. A BUG results when the NFS mount hasn't yet seen much activity: the transport tries to release resources that haven't yet been allocated. rpcrdma_free_regbuf() already checks for this case, so just move that check to cover the DEVICE_REMOVAL case as well. Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com> Fixes: bebd031866ca ("xprtrdma: Support unplugging an HCA ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com> Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22xprtrdma: Fix calculation of ri_max_send_sgesChuck Lever
commit 1179e2c27efe21167ec9d882b14becefba2ee990 upstream. Commit 16f906d66cd7 ("xprtrdma: Reduce required number of send SGEs") introduced the rpcrdma_ia::ri_max_send_sges field. This fixes a problem where xprtrdma would not work if the device's max_sge capability was small (low single digits). At least RPCRDMA_MIN_SEND_SGES are needed for the inline parts of each RPC. ri_max_send_sges is set to this value: ia->ri_max_send_sges = max_sge - RPCRDMA_MIN_SEND_SGES; Then when marshaling each RPC, rpcrdma_args_inline uses that value to determine whether the device has enough Send SGEs to convey an NFS WRITE payload inline, or whether instead a Read chunk is required. More recently, commit ae72950abf99 ("xprtrdma: Add data structure to manage RDMA Send arguments") used the ri_max_send_sges value to calculate the size of an array, but that commit erroneously assumed ri_max_send_sges contains a value similar to the device's max_sge, and not one that was reduced by the minimum SGE count. This assumption results in the calculated size of the sendctx's Send SGE array to be too small. When the array is used to marshal an RPC, the code can write Send SGEs into the following sendctx element in that array, corrupting it. When the device's max_sge is large, this issue is entirely harmless; but it results in an oops in the provider's post_send method, if dev.attrs.max_sge is small. So let's straighten this out: ri_max_send_sges will now contain a value with the same meaning as dev.attrs.max_sge, which makes the code easier to understand, and enables rpcrdma_sendctx_create to calculate the size of the SGE array correctly. Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com> Fixes: 16f906d66cd7 ("xprtrdma: Reduce required number of send SGEs") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com> Cc: stable@vger.kernel.org # v4.10+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03SUNRPC: Allow connect to return EHOSTUNREACHTrond Myklebust
[ Upstream commit 4ba161a793d5f43757c35feff258d9f20a082940 ] Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20xprtrdma: Don't defer fencing an async RPC's chunksChuck Lever
[ Upstream commit 8f66b1a529047a972cb9602a919c53a95f3d7a2b ] In current kernels, waiting in xprt_release appears to be safe to do. I had erroneously believed that for ASYNC RPCs, waiting of any kind in xprt_release->xprt_rdma_free would result in deadlock. I've done injection testing and consulted with Trond to confirm that waiting in the RPC release path is safe. For the very few times where RPC resources haven't yet been released earlier by the reply handler, it is safe to wait synchronously in xprt_rdma_free for invalidation rather than defering it to MR recovery. Note: When the QP is error state, posting a LocalInvalidate should flush and mark the MR as bad. There is no way the remote HCA can access that MR via a QP in error state, so it is effectively already inaccessible and thus safe for the Upper Layer to access. The next time the MR is used it should be recognized and cleaned up properly by frwr_op_map. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20SUNRPC: Fix a race in the receive code pathTrond Myklebust
commit 90d91b0cd371193d9dbfa9beacab8ab9a4cb75e0 upstream. We must ensure that the call to rpc_sleep_on() in xprt_transmit() cannot race with the call to xprt_complete_rqst(). Reported-by: Chuck Lever <chuck.lever@oracle.com> Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=317 Fixes: ce7c252a8c74 ("SUNRPC: Add a separate spinlock to protect..") Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-20kernel: make groups_sort calling a responsibility group_info allocatorsThiago Rafael Becker
commit bdcf0a423ea1c40bbb40e7ee483b50fc8aa3d758 upstream. In testing, we found that nfsd threads may call set_groups in parallel for the same entry cached in auth.unix.gid, racing in the call of groups_sort, corrupting the groups for that entry and leading to permission denials for the client. This patch: - Make groups_sort globally visible. - Move the call to groups_sort to the modifiers of group_info - Remove the call to groups_sort from set_groups Link: http://lkml.kernel.org/r/20171211151420.18655-1-thiago.becker@gmail.com Signed-off-by: Thiago Rafael Becker <thiago.becker@gmail.com> Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com> Reviewed-by: NeilBrown <neilb@suse.com> Acked-by: "J. Bruce Fields" <bfields@fieldses.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-14sunrpc: Fix rpc_task_begin trace pointChuck Lever
[ Upstream commit b2bfe5915d5fe7577221031a39ac722a0a2a1199 ] The rpc_task_begin trace point always display a task ID of zero. Move the trace point call site so that it picks up the new task ID. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-30svcrdma: Preserve CB send buffer across retransmitsChuck Lever
commit 0bad47cada5defba13e98827d22d06f13258dfb3 upstream. During each NFSv4 callback Call, an RDMA Send completion frees the page that contains the RPC Call message. If the upper layer determines that a retransmit is necessary, this is too soon. One possible symptom: after a GARBAGE_ARGS response an NFSv4.1 callback request, the following BUG fires on the NFS server: kernel: BUG: Bad page state in process kworker/0:2H pfn:7d3ce2 kernel: page:ffffea001f4f3880 count:-2 mapcount:0 mapping: (null) index:0x0 kernel: flags: 0x2fffff80000000() kernel: raw: 002fffff80000000 0000000000000000 0000000000000000 fffffffeffffffff kernel: raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000 kernel: page dumped because: nonzero _refcount kernel: Modules linked in: cts rpcsec_gss_krb5 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue rpcrdm a ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pc lmul crc32_pclmul ghash_clmulni_intel pcbc iTCO_wdt iTCO_vendor_support aesni_intel crypto_simd glue_helper cryptd pcspkr lpc_ich i2c_i801 mei_me mf d_core mei raid0 sg wmi ioatdma ipmi_si ipmi_devintf ipmi_msghandler shpchp acpi_power_meter acpi_pad nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables xfs libcrc32c mlx4_en mlx4_ib mlx5_ib ib_core sd_mod sr_mod cdrom ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci crc32c_intel libahci drm mlx5_core igb libata mlx4_core dca i2c_algo_bit i2c_core nvme kernel: ptp nvme_core pps_core dm_mirror dm_region_hash dm_log dm_mod dax kernel: CPU: 0 PID: 11495 Comm: kworker/0:2H Not tainted 4.14.0-rc3-00001-g577ce48 #811 kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015 kernel: Workqueue: ib-comp-wq ib_cq_poll_work [ib_core] kernel: Call Trace: kernel: dump_stack+0x62/0x80 kernel: bad_page+0xfe/0x11a kernel: free_pages_check_bad+0x76/0x78 kernel: free_pcppages_bulk+0x364/0x441 kernel: ? ttwu_do_activate.isra.61+0x71/0x78 kernel: free_hot_cold_page+0x1c5/0x202 kernel: __put_page+0x2c/0x36 kernel: svc_rdma_put_context+0xd9/0xe4 [rpcrdma] kernel: svc_rdma_wc_send+0x50/0x98 [rpcrdma] This issue exists all the way back to v4.5, but refactoring and code re-organization prevents this simple patch from applying to kernels older than v4.12. The fix is the same, however, if someone needs to backport it. Reported-by: Ben Coddington <bcodding@redhat.com> BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=314 Fixes: 5d252f90a800 ('svcrdma: Add class for RDMA backwards ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-02Merge tag 'spdx_identifiers-4.14-rc8' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull initial SPDX identifiers from Greg KH: "License cleanup: add SPDX license identifiers to some files Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>" * tag 'spdx_identifiers-4.14-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: License cleanup: add SPDX license identifier to uapi header files with a license License cleanup: add SPDX license identifier to uapi header files with no license License cleanup: add SPDX GPL-2.0 license identifier to files with no license
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-19SUNRPC: Destroy transport from the system workqueueTrond Myklebust
The transport may need to flush transport connect and receive tasks that are running on rpciod. In order to do so safely, we need to ensure that the caller of cancel_work_sync() etc is not itself running on rpciod. Do so by running the destroy task from the system workqueue. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-10-16SUNRPC: fix a list corruption issue in xprt_release()Trond Myklebust
We remove the request from the receive list before we call xprt_wait_on_pinned_rqst(), and so we need to use list_del_init(). Otherwise, we will see list corruption when xprt_complete_rqst() is called. Reported-by: Emre Celebi <emre@primarydata.com> Fixes: ce7c252a8c741 ("SUNRPC: Add a separate spinlock to protect...") Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-10-01sunrpc: remove redundant initialization of sockColin Ian King
sock is being initialized and then being almost immediately updated hence the initialized value is not being used and is redundant. Remove the initialization. Cleans up clang warning: warning: Value stored to 'sock' during its initialization is never read Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-25IB: Correct MR length field to be 64-bitParav Pandit
The ib_mr->length represents the length of the MR in bytes as per the IBTA spec 1.3 section 11.2.10.3 (REGISTER PHYSICAL MEMORY REGION). Currently ib_mr->length field is defined as only 32-bits field. This might result into truncation and failed WRs of consumers who registers more than 4GB bytes memory regions and whose WRs accessing such MRs. This patch makes the length 64-bit to avoid such truncation. Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Faisal Latif <faisal.latif@intel.com> Fixes: 4c67e2bfc8b7 ("IB/core: Introduce new fast registration API") Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
2017-09-11Merge tag 'nfs-for-4.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Hightlights include: Stable bugfixes: - Fix mirror allocation in the writeback code to avoid a use after free - Fix the O_DSYNC writes to use the correct byte range - Fix 2 use after free issues in the I/O code Features: - Writeback fixes to split up the inode->i_lock in order to reduce contention - RPC client receive fixes to reduce the amount of time the xprt->transport_lock is held when receiving data from a socket into am XDR buffer. - Ditto fixes to reduce contention between call side users of the rdma rb_lock, and its use in rpcrdma_reply_handler. - Re-arrange rdma stats to reduce false cacheline sharing. - Various rdma cleanups and optimisations. - Refactor the NFSv4.1 exchange id code and clean up the code. - Const-ify all instances of struct rpc_xprt_ops Bugfixes: - Fix the NFSv2 'sec=' mount option. - NFSv4.1: don't use machine credentials for CLOSE when using 'sec=sys' - Fix the NFSv3 GRANT callback when the port changes on the server. - Fix livelock issues with COMMIT - NFSv4: Use correct inode in _nfs4_opendata_to_nfs4_state() when doing and NFSv4.1 open by filehandle" * tag 'nfs-for-4.14-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (69 commits) NFS: Count the bytes of skipped subrequests in nfs_lock_and_join_requests() NFS: Don't hold the group lock when calling nfs_release_request() NFS: Remove pnfs_generic_transfer_commit_list() NFS: nfs_lock_and_join_requests and nfs_scan_commit_list can deadlock NFS: Fix 2 use after free issues in the I/O code NFS: Sync the correct byte range during synchronous writes lockd: Delete an error message for a failed memory allocation in reclaimer() NFS: remove jiffies field from access cache NFS: flush data when locking a file to ensure cache coherence for mmap. SUNRPC: remove some dead code. NFS: don't expect errors from mempool_alloc(). xprtrdma: Use xprt_pin_rqst in rpcrdma_reply_handler xprtrdma: Re-arrange struct rx_stats NFS: Fix NFSv2 security settings NFSv4.1: don't use machine credentials for CLOSE when using 'sec=sys' SUNRPC: ECONNREFUSED should cause a rebind. NFS: Remove unused parameter gfp_flags from nfs_pageio_init() NFSv4: Fix up mirror allocation SUNRPC: Add a separate spinlock to protect the RPC request receive list SUNRPC: Cleanup xs_tcp_read_common() ...
2017-09-06SUNRPC: remove some dead code.NeilBrown
RPC_TASK_NO_RETRANS_TIMEOUT is set when cl_noretranstimeo is set, which happens when RPC_CLNT_CREATE_NO_RETRANS_TIMEOUT is set, which happens when NFS_CS_NO_RETRANS_TIMEOUT is set. This flag means "don't resend on a timeout, only resend if the connection gets broken for some reason". cl_discrtry is set when RPC_CLNT_CREATE_DISCRTRY is set, which happens when NFS_CS_DISCRTRY is set. This flag means "always disconnect before resending". NFS_CS_NO_RETRANS_TIMEOUT and NFS_CS_DISCRTRY are both only set in nfs4_init_client(), and it always sets both. So we will never have a situation where only one of the flags is set. So this code, which tests if timeout retransmits are allowed, and disconnection is required, will never run. So it makes sense to remove this code as it cannot be tested and could confuse people reading the code (like me). (alternately we could leave it there with a comment saying it is never actually used). Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-05xprtrdma: Use xprt_pin_rqst in rpcrdma_reply_handlerChuck Lever
Adopt the use of xprt_pin_rqst to eliminate contention between Call-side users of rb_lock and the use of rb_lock in rpcrdma_reply_handler. This replaces the mechanism introduced in 431af645cf66 ("xprtrdma: Fix client lock-up after application signal fires"). Use recv_lock to quickly find the completing rqst, pin it, then drop the lock. At that point invalidation and pull-up of the Reply XDR can be done. Both are often expensive operations. Finally, take recv_lock again to signal completion to the RPC layer. It also protects adjustment of "cwnd". This greatly reduces the amount of time a lock is held by the reply handler. Comparing lock_stat results shows a marked decrease in contention on rb_lock and recv_lock. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> [trond.myklebust@primarydata.com: Remove call to rpcrdma_buffer_put() from the "out_norqst:" path in rpcrdma_reply_handler.] Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-09-05Merge tag 'nfs-rdma-for-4.14-1' of ↵Trond Myklebust
git://git.linux-nfs.org/projects/anna/linux-nfs into linux-next NFS-over-RDMA client updates for Linux 4.14 Bugfixes and cleanups: - Constify rpc_xprt_ops - Harden RPC call encoding and decoding - Clean up rpc call decoding to use xdr_streams - Remove unused variables from various structures - Refactor code to remove imul instructions - Rearrange rx_stats structure for better cacheline sharing
2017-09-05svcrdma: Estimate Send Queue depth properlyChuck Lever
The rdma_rw API adjusts max_send_wr upwards during the rdma_create_qp() call. If the ULP actually wants to take advantage of these extra resources, it must increase the size of its send completion queue (created before rdma_create_qp is called) and increase its send queue accounting limit. Use the new rdma_rw_mr_factor API to figure out the correct value to use for the Send Queue and Send Completion Queue depths. And, ensure that the chosen Send Queue depth for a newly created transport does not overrun the QP WR limit of the underlying device. Lastly, there's no longer a need to carry the Send Queue depth in struct svcxprt_rdma, since the value is used only in the svc_rdma_accept() path. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-09-05svcrdma: Limit RQ depthChuck Lever
Ensure that the chosen Receive Queue depth for a newly created transport does not overrun the QP WR limit of the underlying device. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-09-05svcrdma: Populate tail iovec when receivingChuck Lever
So that NFS WRITE payloads can eventually be placed directly into a file's page cache, enable the RPC-over-RDMA transport to present these payloads in the xdr_buf's page list, while placing trailing content (such as a GETATTR operation) in the xdr_buf's tail. After this change, the RPC-over-RDMA's "copy tail" hack, added by commit a97c331f9aa9 ("svcrdma: Handle additional inline content"), is no longer needed and can be removed. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-09-05merge nfsd 4.13 bugfixes into nfsd for-4.14 branchJ. Bruce Fields
2017-08-24svcrdma: Clean up svc_rdma_build_read_chunk()Chuck Lever
Dan Carpenter <dan.carpenter@oracle.com> observed that the while() loop in svc_rdma_build_read_chunk() does not document the assumption that the loop interior is always executed at least once. Defensive: the function now returns -EINVAL if this assumption fails. Suggested-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24sunrpc: Const-ify struct sv_serv_opsChuck Lever
Close an attack vector by moving the arrays of per-server methods to read-only memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24sunrpc: Const-ify instances of struct svc_xprt_opsChuck Lever
Close an attack vector by moving the arrays of server-side transport methods to read-only memory. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-24net: sunrpc: svcsock: fix NULL-pointer exceptionVadim Lomovtsev
While running nfs/connectathon tests kernel NULL-pointer exception has been observed due to races in svcsock.c. Race is appear when kernel accepts connection by kernel_accept (which creates new socket) and start queuing ingress packets to new socket. This happens in ksoftirq context which could run concurrently on a different core while new socket setup is not done yet. The fix is to re-order socket user data init sequence and add write/read barrier calls to be sure that we got proper values for callback pointers before actually calling them. Test results: nfs/connectathon reports '0' failed tests for about 200+ iterations. Crash log: ---<-snip->--- [ 6708.638984] Unable to handle kernel NULL pointer dereference at virtual address 00000000 [ 6708.647093] pgd = ffff0000094e0000 [ 6708.650497] [00000000] *pgd=0000010ffff90003, *pud=0000010ffff90003, *pmd=0000010ffff80003, *pte=0000000000000000 [ 6708.660761] Internal error: Oops: 86000005 [#1] SMP [ 6708.665630] Modules linked in: nfsv3 nfnetlink_queue nfnetlink_log nfnetlink rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache overlay xt_CONNSECMARK xt_SECMARK xt_conntrack iptable_security ip_tables ah4 xfrm4_mode_transport sctp tun binfmt_misc ext4 jbd2 mbcache loop tcp_diag udp_diag inet_diag rpcrdma ib_isert iscsi_target_mod ib_iser rdma_cm iw_cm libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib ib_ucm ib_uverbs ib_umad ib_cm ib_core nls_koi8_u nls_cp932 ts_kmp nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack vfat fat ghash_ce sha2_ce sha1_ce cavium_rng_vf i2c_thunderx sg thunderx_edac i2c_smbus edac_core cavium_rng nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c nicvf nicpf ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops [ 6708.736446] ttm drm i2c_core thunder_bgx thunder_xcv mdio_thunder mdio_cavium dm_mirror dm_region_hash dm_log dm_mod [last unloaded: stap_3c300909c5b3f46dcacd49aab3334af_87021] [ 6708.752275] CPU: 84 PID: 0 Comm: swapper/84 Tainted: G W OE 4.11.0-4.el7.aarch64 #1 [ 6708.760787] Hardware name: www.cavium.com CRB-2S/CRB-2S, BIOS 0.3 Mar 13 2017 [ 6708.767910] task: ffff810006842e80 task.stack: ffff81000689c000 [ 6708.773822] PC is at 0x0 [ 6708.776739] LR is at svc_data_ready+0x38/0x88 [sunrpc] [ 6708.781866] pc : [<0000000000000000>] lr : [<ffff0000029d7378>] pstate: 60000145 [ 6708.789248] sp : ffff810ffbad3900 [ 6708.792551] x29: ffff810ffbad3900 x28: ffff000008c73d58 [ 6708.797853] x27: 0000000000000000 x26: ffff81000bbe1e00 [ 6708.803156] x25: 0000000000000020 x24: ffff800f7410bf28 [ 6708.808458] x23: ffff000008c63000 x22: ffff000008c63000 [ 6708.813760] x21: ffff800f7410bf28 x20: ffff81000bbe1e00 [ 6708.819063] x19: ffff810012412400 x18: 00000000d82a9df2 [ 6708.824365] x17: 0000000000000000 x16: 0000000000000000 [ 6708.829667] x15: 0000000000000000 x14: 0000000000000001 [ 6708.834969] x13: 0000000000000000 x12: 722e736f622e676e [ 6708.840271] x11: 00000000f814dd99 x10: 0000000000000000 [ 6708.845573] x9 : 7374687225000000 x8 : 0000000000000000 [ 6708.850875] x7 : 0000000000000000 x6 : 0000000000000000 [ 6708.856177] x5 : 0000000000000028 x4 : 0000000000000000 [ 6708.861479] x3 : 0000000000000000 x2 : 00000000e5000000 [ 6708.866781] x1 : 0000000000000000 x0 : ffff81000bbe1e00 [ 6708.872084] [ 6708.873565] Process swapper/84 (pid: 0, stack limit = 0xffff81000689c000) [ 6708.880341] Stack: (0xffff810ffbad3900 to 0xffff8100068a0000) [ 6708.886075] Call trace: [ 6708.888513] Exception stack(0xffff810ffbad3710 to 0xffff810ffbad3840) [ 6708.894942] 3700: ffff810012412400 0001000000000000 [ 6708.902759] 3720: ffff810ffbad3900 0000000000000000 0000000060000145 ffff800f79300000 [ 6708.910577] 3740: ffff000009274d00 00000000000003ea 0000000000000015 ffff000008c63000 [ 6708.918395] 3760: ffff810ffbad3830 ffff800f79300000 000000000000004d 0000000000000000 [ 6708.926212] 3780: ffff810ffbad3890 ffff0000080f88dc ffff800f79300000 000000000000004d [ 6708.934030] 37a0: ffff800f7930093c ffff000008c63000 0000000000000000 0000000000000140 [ 6708.941848] 37c0: ffff000008c2c000 0000000000040b00 ffff81000bbe1e00 0000000000000000 [ 6708.949665] 37e0: 00000000e5000000 0000000000000000 0000000000000000 0000000000000028 [ 6708.957483] 3800: 0000000000000000 0000000000000000 0000000000000000 7374687225000000 [ 6708.965300] 3820: 0000000000000000 00000000f814dd99 722e736f622e676e 0000000000000000 [ 6708.973117] [< (null)>] (null) [ 6708.977824] [<ffff0000086f9fa4>] tcp_data_queue+0x754/0xc5c [ 6708.983386] [<ffff0000086fa64c>] tcp_rcv_established+0x1a0/0x67c [ 6708.989384] [<ffff000008704120>] tcp_v4_do_rcv+0x15c/0x22c [ 6708.994858] [<ffff000008707418>] tcp_v4_rcv+0xaf0/0xb58 [ 6709.000077] [<ffff0000086df784>] ip_local_deliver_finish+0x10c/0x254 [ 6709.006419] [<ffff0000086dfea4>] ip_local_deliver+0xf0/0xfc [ 6709.011980] [<ffff0000086dfad4>] ip_rcv_finish+0x208/0x3a4 [ 6709.017454] [<ffff0000086e018c>] ip_rcv+0x2dc/0x3c8 [ 6709.022328] [<ffff000008692fc8>] __netif_receive_skb_core+0x2f8/0xa0c [ 6709.028758] [<ffff000008696068>] __netif_receive_skb+0x38/0x84 [ 6709.034580] [<ffff00000869611c>] netif_receive_skb_internal+0x68/0xdc [ 6709.041010] [<ffff000008696bc0>] napi_gro_receive+0xcc/0x1a8 [ 6709.046690] [<ffff0000014b0fc4>] nicvf_cq_intr_handler+0x59c/0x730 [nicvf] [ 6709.053559] [<ffff0000014b1380>] nicvf_poll+0x38/0xb8 [nicvf] [ 6709.059295] [<ffff000008697a6c>] net_rx_action+0x2f8/0x464 [ 6709.064771] [<ffff000008081824>] __do_softirq+0x11c/0x308 [ 6709.070164] [<ffff0000080d14e4>] irq_exit+0x12c/0x174 [ 6709.075206] [<ffff00000813101c>] __handle_domain_irq+0x78/0xc4 [ 6709.081027] [<ffff000008081608>] gic_handle_irq+0x94/0x190 [ 6709.086501] Exception stack(0xffff81000689fdf0 to 0xffff81000689ff20) [ 6709.092929] fde0: 0000810ff2ec0000 ffff000008c10000 [ 6709.100747] fe00: ffff000008c70ef4 0000000000000001 0000000000000000 ffff810ffbad9b18 [ 6709.108565] fe20: ffff810ffbad9c70 ffff8100169d3800 ffff810006843ab0 ffff81000689fe80 [ 6709.116382] fe40: 0000000000000bd0 0000ffffdf979cd0 183f5913da192500 0000ffff8a254ce4 [ 6709.124200] fe60: 0000ffff8a254b78 0000aaab10339808 0000000000000000 0000ffff8a0c2a50 [ 6709.132018] fe80: 0000ffffdf979b10 ffff000008d6d450 ffff000008c10000 ffff000008d6d000 [ 6709.139836] fea0: 0000000000000054 ffff000008cd3dbc 0000000000000000 0000000000000000 [ 6709.147653] fec0: 0000000000000000 0000000000000000 0000000000000000 ffff81000689ff20 [ 6709.155471] fee0: ffff000008085240 ffff81000689ff20 ffff000008085244 0000000060000145 [ 6709.163289] ff00: ffff81000689ff10 ffff00000813f1e4 ffffffffffffffff ffff00000813f238 [ 6709.171107] [<ffff000008082eb4>] el1_irq+0xb4/0x140 [ 6709.175976] [<ffff000008085244>] arch_cpu_idle+0x44/0x11c [ 6709.181368] [<ffff0000087bf3b8>] default_idle_call+0x20/0x30 [ 6709.187020] [<ffff000008116d50>] do_idle+0x158/0x1e4 [ 6709.191973] [<ffff000008116ff4>] cpu_startup_entry+0x2c/0x30 [ 6709.197624] [<ffff00000808e7cc>] secondary_start_kernel+0x13c/0x160 [ 6709.203878] [<0000000001bc71c4>] 0x1bc71c4 [ 6709.207967] Code: bad PC value [ 6709.211061] SMP: stopping secondary CPUs [ 6709.218830] Starting crashdump kernel... [ 6709.222749] Bye! ---<-snip>--- Signed-off-by: Vadim Lomovtsev <vlomovts@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2017-08-22xprtrdma: Re-arrange struct rx_statsChuck Lever
To reduce false cacheline sharing, separate counters that are likely to be accessed in the Call path from those accessed in the Reply path. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2017-08-20Merge branch 'bugfixes'Trond Myklebust
2017-08-20SUNRPC: ECONNREFUSED should cause a rebind.NeilBrown
If you - mount and NFSv3 filesystem - do some file locking which requires the server to make a GRANT call back - unmount - mount again and do the same locking then the second attempt at locking suffers a 30 second delay. Unmounting and remounting causes lockd to stop and restart, which causes it to bind to a new port. The server still thinks the old port is valid and gets ECONNREFUSED when trying to contact it. ECONNREFUSED should be seen as a hard error that is not worth retrying. Rebinding is the only reasonable response. This patch forces a rebind if that makes sense. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>