summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2017-09-22Merge tag 'v4.1.44' into 4.1-2.0.x-imxOtavio Salvador
Linux 4.1.44 * tag 'v4.1.44': (180 commits) Linux 4.1.44 mtd: fsl-quadspi: fix macro collision problems with READ/WRITE pinctrl: samsung: Remove bogus irq_[un]mask from resource management pinctrl: sunxi: add a missing function of A10/A20 pinctrl driver pnfs/blocklayout: require 64-bit sector_t iio: adc: vf610_adc: Fix VALT selection value for REFSEL bits usb:xhci:Add quirk for Certain failing HP keyboard on reset after resume usb: quirks: Add no-lpm quirk for Moshi USB to Ethernet Adapter USB: Check for dropped connection before switching to full speed uas: Add US_FL_IGNORE_RESIDUE for Initio Corporation INIC-3069 iio: light: tsl2563: use correct event code staging:iio:resolver:ad2s1210 fix negative IIO_ANGL_VEL read USB: hcd: Mark secondary HCD as dead if the primary one died USB: serial: pl2303: add new ATEN device id USB: serial: cp210x: add support for Qivicon USB ZigBee dongle USB: serial: option: add D-Link DWM-222 device ID nfs/flexfiles: fix leak of nfs4_ff_ds_version arrays fuse: initialize the flock flag in fuse_file on allocation iscsi-target: Fix iscsi_np reset hung task during parallel delete iscsi-target: fix memory leak in iscsit_setup_text_cmd() ... Signed-off-by: Otavio Salvador <otavio@ossystems.com.br>
2017-09-10pnfs/blocklayout: require 64-bit sector_tChristoph Hellwig
[ Upstream commit 8a9d6e964d318533ba3d2901ce153ba317c99a89 ] The blocklayout code does not compile cleanly for a 32-bit sector_t, and also has no reliable checks for devices sizes, which makes it unsafe to use with a kernel that doesn't support large block devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Arnd Bergmann <arnd@arndb.de> Fixes: 5c83746a0cf2 ("pnfs/blocklayout: in-kernel GETDEVICEINFO XDR parsing") Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-09-10nfs/flexfiles: fix leak of nfs4_ff_ds_version arraysWeston Andros Adamson
[ Upstream commit 1feb26162bee7b2f110facfec71b5c7bdbc7d14d ] The client was freeing the nfs4_ff_layout_ds, but not the contained nfs4_ff_ds_version array. Signed-off-by: Weston Andros Adamson <dros@primarydata.com> Cc: stable@vger.kernel.org # v4.0+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-09-10fuse: initialize the flock flag in fuse_file on allocationMateusz Jurczyk
[ Upstream commit 68227c03cba84a24faf8a7277d2b1a03c8959c2c ] Before the patch, the flock flag could remain uninitialized for the lifespan of the fuse_file allocation. Unless set to true in fuse_file_flock(), it would remain in an indeterminate state until read in an if statement in fuse_release_common(). This could consequently lead to taking an unexpected branch in the code. The bug was discovered by a runtime instrumentation designed to detect use of uninitialized memory in the kernel. Signed-off-by: Mateusz Jurczyk <mjurczyk@google.com> Fixes: 37fb3a30b462 ("fuse: fix flock") Cc: <stable@vger.kernel.org> # v3.1+ Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-09-10f2fs: sanity check checkpoint segno and blkoffJin Qian
[ Upstream commit 15d3042a937c13f5d9244241c7a9c8416ff6e82a ] Make sure segno and blkoff read from raw image are valid. Cc: stable@vger.kernel.org Signed-off-by: Jin Qian <jinqian@google.com> [Jaegeuk Kim: adjust minor coding style] Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-09-10ext4: fix overflow caused by missing cast in ext4_resize_fs()Jerry Lee
[ Upstream commit aec51758ce10a9c847a62a48a168f8c804c6e053 ] On a 32-bit platform, the value of n_blcoks_count may be wrong during the file system is resized to size larger than 2^32 blocks. This may caused the superblock being corrupted with zero blocks count. Fixes: 1c6bd7173d66 Signed-off-by: Jerry Lee <jerrylee@qnap.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org # 3.7+ Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-09-10ext4: fix SEEK_HOLE/SEEK_DATA for blocksize < pagesizeJan Kara
[ Upstream commit fcf5ea10992fbac3c7473a1db33d56a139333cd1 ] ext4_find_unwritten_pgoff() does not properly handle a situation when starting index is in the middle of a page and blocksize < pagesize. The following command shows the bug on filesystem with 1k blocksize: xfs_io -f -c "falloc 0 4k" \ -c "pwrite 1k 1k" \ -c "pwrite 3k 1k" \ -c "seek -a -r 0" foo In this example, neither lseek(fd, 1024, SEEK_HOLE) nor lseek(fd, 2048, SEEK_DATA) will return the correct result. Fix the problem by neglecting buffers in a page before starting offset. Reported-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> CC: stable@vger.kernel.org # 3.8+ Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-09-10Make file credentials available to the seqfile interfacesLinus Torvalds
[ Upstream commit 34dbbcdbf63360661ff7bda6c5f52f99ac515f92 ] A lot of seqfile users seem to be using things like %pK that uses the credentials of the current process, but that is actually completely wrong for filesystem interfaces. The unix semantics for permission checking files is to check permissions at _open_ time, not at read or write time, and that is not just a small detail: passing off stdin/stdout/stderr to a suid application and making the actual IO happen in privileged context is a classic exploit technique. So if we want to be able to look at permissions at read time, we need to use the file open credentials, not the current ones. Normal file accesses can just use "f_cred" (or any of the helper functions that do that, like file_ns_capable()), but the seqfile interfaces do not have any such options. It turns out that seq_file _does_ save away the user_ns information of the file, though. Since user_ns is just part of the full credential information, replace that special case with saving off the cred pointer instead, and suddenly seq_file has all the permission information it needs. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-09-10udf: Fix deadlock between writeback and udf_setsize()Jan Kara
[ Upstream commit f2e95355891153f66d4156bf3a142c6489cd78c6 ] udf_setsize() called truncate_setsize() with i_data_sem held. Thus truncate_pagecache() called from truncate_setsize() could lock a page under i_data_sem which can deadlock as page lock ranks below i_data_sem - e. g. writeback can hold page lock and try to acquire i_data_sem to map a block. Fix the problem by moving truncate_setsize() calls from under i_data_sem. It is safe for us to change i_size without holding i_data_sem as all the places that depend on i_size being stable already hold inode_lock. CC: stable@vger.kernel.org Fixes: 7e49b6f2480cb9a9e7322a91592e56a5c85361f5 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-09-10NFS: only invalidate dentrys that are clearly invalid.NeilBrown
[ Upstream commit cc89684c9a265828ce061037f1f79f4a68ccd3f7 ] Since commit bafc9b754f75 ("vfs: More precise tests in d_invalidate") in v3.18, a return of '0' from ->d_revalidate() will cause the dentry to be invalidated even if it has filesystems mounted on or it or on a descendant. The mounted filesystem is unmounted. This means we need to be careful not to return 0 unless the directory referred to truly is invalid. So -ESTALE or -ENOENT should invalidate the directory. Other errors such a -EPERM or -ERESTARTSYS should be returned from ->d_revalidate() so they are propagated to the caller. A particular problem can be demonstrated by: 1/ mount an NFS filesystem using NFSv3 on /mnt 2/ mount any other filesystem on /mnt/foo 3/ ls /mnt/foo 4/ turn off network, or otherwise make the server unable to respond 5/ ls /mnt/foo & 6/ cat /proc/$!/stack # note that nfs_lookup_revalidate is in the call stack 7/ kill -9 $! # this results in -ERESTARTSYS being returned 8/ observe that /mnt/foo has been unmounted. This patch changes nfs_lookup_revalidate() to only treat -ESTALE from nfs_lookup_verify_inode() and -ESTALE or -ENOENT from ->lookup() as indicating an invalid inode. Other errors are returned. Also nfs_check_inode_attributes() is changed to return -ESTALE rather than -EIO. This is consistent with the error returned in similar circumstances from nfs_update_inode(). As this bug allows any user to unmount a filesystem mounted on an NFS filesystem, this fix is suitable for stable kernels. Fixes: bafc9b754f75 ("vfs: More precise tests in d_invalidate") Cc: stable@vger.kernel.org (v3.18+) Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-09-10f2fs: Don't clear SGID when inheriting ACLsJaegeuk Kim
[ Upstream commit c925dc162f770578ff4a65ec9b08270382dba9e6 ] This patch copies commit b7f8a09f80: "btrfs: Don't clear SGID when inheriting ACLs" written by Jan. Fixes: 073931017b49d9458aa351605b43a7e34598caef CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Chao Yu <yuchao0@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-08-06Merge tag 'v4.1.43' into 4.1-2.0.x-imxOtavio Salvador
Linux 4.1.43 * tag 'v4.1.43': (182 commits) Linux 4.1.43 HID: core: prevent out-of-bound readings ipvs: SNAT packet replies only for NATed connections Revert "dmaengine: ep93xx: Don't drain the transfers in terminate_all()" staging: comedi: ni_mio_common: fix E series ni_ai_insn_read() data kvm: vmx: Do not disable intercepts for BNDCFGS tracing: Use SOFTIRQ_OFFSET for softirq dectection for more accurate results PM / QoS: return -EINVAL for bogus strings sched/topology: Optimize build_group_mask() sched/topology: Fix overlapping sched_group_mask crypto: caam - fix signals handling crypto: atmel - only treat EBUSY as transient if backlog crypto: talitos - Extend max key length for SHA384/512-HMAC and AEAD Add "shutdown" to "struct class". mnt: Make propagate_umount less slow for overlapping mount propagation trees mnt: In propgate_umount handle visiting mounts in any order mnt: In umount propagation reparent in a separate pass vt: fix unchecked __put_user() in tioclinux ioctls exec: Limit arg stack to at most 75% of _STK_LIM s390: reduce ELF_ET_DYN_BASE ...
2017-07-31mnt: Make propagate_umount less slow for overlapping mount propagation treesEric W. Biederman
[ Upstream commit 296990deb389c7da21c78030376ba244dc1badf5 ] Andrei Vagin pointed out that time to executue propagate_umount can go non-linear (and take a ludicrious amount of time) when the mount propogation trees of the mounts to be unmunted by a lazy unmount overlap. Make the walk of the mount propagation trees nearly linear by remembering which mounts have already been visited, allowing subsequent walks to detect when walking a mount propgation tree or a subtree of a mount propgation tree would be duplicate work and to skip them entirely. Walk the list of mounts whose propgatation trees need to be traversed from the mount highest in the mount tree to mounts lower in the mount tree so that odds are higher that the code will walk the largest trees first, allowing later tree walks to be skipped entirely. Add cleanup_umount_visitation to remover the code's memory of which mounts have been visited. Add the functions last_slave and skip_propagation_subtree to allow skipping appropriate parts of the mount propagation tree without needing to change the logic of the rest of the code. A script to generate overlapping mount propagation trees: $ cat runs.h set -e mount -t tmpfs zdtm /mnt mkdir -p /mnt/1 /mnt/2 mount -t tmpfs zdtm /mnt/1 mount --make-shared /mnt/1 mkdir /mnt/1/1 iteration=10 if [ -n "$1" ] ; then iteration=$1 fi for i in $(seq $iteration); do mount --bind /mnt/1/1 /mnt/1/1 done mount --rbind /mnt/1 /mnt/2 TIMEFORMAT='%Rs' nr=$(( ( 2 ** ( $iteration + 1 ) ) + 1 )) echo -n "umount -l /mnt/1 -> $nr " time umount -l /mnt/1 nr=$(cat /proc/self/mountinfo | grep zdtm | wc -l ) time umount -l /mnt/2 $ for i in $(seq 9 19); do echo $i; unshare -Urm bash ./run.sh $i; done Here are the performance numbers with and without the patch: mhash | 8192 | 8192 | 1048576 | 1048576 mounts | before | after | before | after ------------------------------------------------ 1025 | 0.040s | 0.016s | 0.038s | 0.019s 2049 | 0.094s | 0.017s | 0.080s | 0.018s 4097 | 0.243s | 0.019s | 0.206s | 0.023s 8193 | 1.202s | 0.028s | 1.562s | 0.032s 16385 | 9.635s | 0.036s | 9.952s | 0.041s 32769 | 60.928s | 0.063s | 44.321s | 0.064s 65537 | | 0.097s | | 0.097s 131073 | | 0.233s | | 0.176s 262145 | | 0.653s | | 0.344s 524289 | | 2.305s | | 0.735s 1048577 | | 7.107s | | 2.603s Andrei Vagin reports fixing the performance problem is part of the work to fix CVE-2016-6213. Cc: stable@vger.kernel.org Fixes: a05964f3917c ("[PATCH] shared mounts handling: umount") Reported-by: Andrei Vagin <avagin@openvz.org> Reviewed-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31mnt: In propgate_umount handle visiting mounts in any orderEric W. Biederman
[ Upstream commit 99b19d16471e9c3faa85cad38abc9cbbe04c6d55 ] While investigating some poor umount performance I realized that in the case of overlapping mount trees where some of the mounts are locked the code has been failing to unmount all of the mounts it should have been unmounting. This failure to unmount all of the necessary mounts can be reproduced with: $ cat locked_mounts_test.sh mount -t tmpfs test-base /mnt mount --make-shared /mnt mkdir -p /mnt/b mount -t tmpfs test1 /mnt/b mount --make-shared /mnt/b mkdir -p /mnt/b/10 mount -t tmpfs test2 /mnt/b/10 mount --make-shared /mnt/b/10 mkdir -p /mnt/b/10/20 mount --rbind /mnt/b /mnt/b/10/20 unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi' sleep 1 umount -l /mnt/b wait %% $ unshare -Urm ./locked_mounts_test.sh This failure is corrected by removing the prepass that marks mounts that may be umounted. A first pass is added that umounts mounts if possible and if not sets mount mark if they could be unmounted if they weren't locked and adds them to a list to umount possibilities. This first pass reconsiders the mounts parent if it is on the list of umount possibilities, ensuring that information of umoutability will pass from child to mount parent. A second pass then walks through all mounts that are umounted and processes their children unmounting them or marking them for reparenting. A last pass cleans up the state on the mounts that could not be umounted and if applicable reparents them to their first parent that remained mounted. While a bit longer than the old code this code is much more robust as it allows information to flow up from the leaves and down from the trunk making the order in which mounts are encountered in the umount propgation tree irrelevant. Cc: stable@vger.kernel.org Fixes: 0c56fe31420c ("mnt: Don't propagate unmounts to locked mounts") Reviewed-by: Andrei Vagin <avagin@virtuozzo.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31mnt: In umount propagation reparent in a separate passEric W. Biederman
[ Upstream commit 570487d3faf2a1d8a220e6ee10f472163123d7da ] It was observed that in some pathlogical cases that the current code does not unmount everything it should. After investigation it was determined that the issue is that mnt_change_mntpoint can can change which mounts are available to be unmounted during mount propagation which is wrong. The trivial reproducer is: $ cat ./pathological.sh mount -t tmpfs test-base /mnt cd /mnt mkdir 1 2 1/1 mount --bind 1 1 mount --make-shared 1 mount --bind 1 2 mount --bind 1/1 1/1 mount --bind 1/1 1/1 echo grep test-base /proc/self/mountinfo umount 1/1 echo grep test-base /proc/self/mountinfo $ unshare -Urm ./pathological.sh The expected output looks like: 46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000 47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000 47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 The output without the fix looks like: 46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000 47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000 47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 52 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000 That last mount in the output was in the propgation tree to be unmounted but was missed because the mnt_change_mountpoint changed it's parent before the walk through the mount propagation tree observed it. Cc: stable@vger.kernel.org Fixes: 1064f874abc0 ("mnt: Tuck mounts under others instead of creating shadow/side mounts.") Acked-by: Andrei Vagin <avagin@virtuozzo.com> Reviewed-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31exec: Limit arg stack to at most 75% of _STK_LIMKees Cook
[ Upstream commit da029c11e6b12f321f36dac8771e833b65cec962 ] To avoid pathological stack usage or the need to special-case setuid execs, just limit all arg stack usage to at most 75% of _STK_LIM (6MB). Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31binfmt_elf: use ELF_ET_DYN_BASE only for PIEKees Cook
[ Upstream commit eab09532d40090698b05a07c1c87f39fdbc5fab5 ] The ELF_ET_DYN_BASE position was originally intended to keep loaders away from ET_EXEC binaries. (For example, running "/lib/ld-linux.so.2 /bin/cat" might cause the subsequent load of /bin/cat into where the loader had been loaded.) With the advent of PIE (ET_DYN binaries with an INTERP Program Header), ELF_ET_DYN_BASE continued to be used since the kernel was only looking at ET_DYN. However, since ELF_ET_DYN_BASE is traditionally set at the top 1/3rd of the TASK_SIZE, a substantial portion of the address space is unused. For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs are loaded above the mmap region. This means they can be made to collide (CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with pathological stack regions. Lowering ELF_ET_DYN_BASE solves both by moving programs below the mmap region in all cases, and will now additionally avoid programs falling back to the mmap region by enforcing MAP_FIXED for program loads (i.e. if it would have collided with the stack, now it will fail to load instead of falling back to the mmap region). To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP) are loaded into the mmap region, leaving space available for either an ET_EXEC binary with a fixed location or PIE being loaded into mmap by the loader. Only PIE programs are loaded offset from ELF_ET_DYN_BASE, which means architectures can now safely lower their values without risk of loaders colliding with their subsequently loaded programs. For 64-bit, ELF_ET_DYN_BASE is best set to 4GB to allow runtimes to use the entire 32-bit address space for 32-bit pointers. Thanks to PaX Team, Daniel Micay, and Rik van Riel for inspiration and suggestions on how to implement this solution. Fixes: d1fd836dcf00 ("mm: split ET_DYN ASLR from mmap ASLR") Link: http://lkml.kernel.org/r/20170621173201.GA114489@beast Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Daniel Micay <danielmicay@gmail.com> Cc: Qualys Security Advisory <qsa@qualys.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Pratyush Anand <panand@redhat.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will.deacon@arm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31fs/dcache.c: fix spin lockup issue on nlru->lockSahitya Tummala
[ Upstream commit b17c070fb624cf10162cf92ea5e1ec25cd8ac176 ] __list_lru_walk_one() acquires nlru spin lock (nlru->lock) for longer duration if there are more number of items in the lru list. As per the current code, it can hold the spin lock for upto maximum UINT_MAX entries at a time. So if there are more number of items in the lru list, then "BUG: spinlock lockup suspected" is observed in the below path: spin_bug+0x90 do_raw_spin_lock+0xfc _raw_spin_lock+0x28 list_lru_add+0x28 dput+0x1c8 path_put+0x20 terminate_walk+0x3c path_lookupat+0x100 filename_lookup+0x6c user_path_at_empty+0x54 SyS_faccessat+0xd0 el0_svc_naked+0x24 This nlru->lock is acquired by another CPU in this path - d_lru_shrink_move+0x34 dentry_lru_isolate_shrink+0x48 __list_lru_walk_one.isra.10+0x94 list_lru_walk_node+0x40 shrink_dcache_sb+0x60 do_remount_sb+0xbc do_emergency_remount+0xb0 process_one_work+0x228 worker_thread+0x2e0 kthread+0xf4 ret_from_fork+0x10 Fix this lockup by reducing the number of entries to be shrinked from the lru list to 1024 at once. Also, add cond_resched() before processing the lru list again. Link: http://marc.info/?t=149722864900001&r=1&w=2 Link: http://lkml.kernel.org/r/1498707575-2472-1-git-send-email-stummala@codeaurora.org Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Suggested-by: Jan Kara <jack@suse.cz> Suggested-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Alexander Polakov <apolyakov@beget.ru> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31fs: completely ignore unknown open flagsChristoph Hellwig
[ Upstream commit 629e014bb8349fcf7c1e4df19a842652ece1c945 ] Currently we just stash anything we got into file->f_flags, and the report it in fcntl(F_GETFD). This patch just clears out all unknown flags so that we don't pass them to the fs or report them. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31fs: add a VALID_OPEN_FLAGSChristoph Hellwig
[ Upstream commit 80f18379a7c350c011d30332658aa15fe49a8fa5 ] Add a central define for all valid open flags, and use it in the uniqueness check. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31coredump: Ensure proper size of sparse core filesDave Kleikamp
[ Upstream commit 4d22c75d4c7b5c5f4bd31054f09103ee490878fd ] If the last section of a core file ends with an unmapped or zero page, the size of the file does not correspond with the last dump_skip() call. gdb complains that the file is truncated and can be confusing to users. After all of the vma sections are written, make sure that the file size is no smaller than the current file position. This problem can be demonstrated with gdb's bigcore testcase on the sparc architecture. Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31Btrfs: fix truncate down when no_holes feature is enabledLiu Bo
[ Upstream commit 91298eec05cd8d4e828cf7ee5d4a6334f70cf69a ] For such a file mapping, [0-4k][hole][8k-12k] In NO_HOLES mode, we don't have the [hole] extent any more. Commit c1aa45759e90 ("Btrfs: fix shrinking truncate when the no_holes feature is enabled") fixed disk isize not being updated in NO_HOLES mode when data is not flushed. However, even if data has been flushed, we can still have trouble in updating disk isize since we updated disk isize to 'start' of the last evicted extent. Reviewed-by: Chris Mason <clm@fb.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31NFSv4: fix a reference leak caused WARNING messagesKinglong Mee
[ Upstream commit 366a1569bff3fe14abfdf9285e31e05e091745f5 ] Because nfs4_opendata_access() has close the state when access is denied, so the state isn't leak. Rather than revert the commit a974deee47, I'd like clean the strange state close. [ 1615.094218] ------------[ cut here ]------------ [ 1615.094607] WARNING: CPU: 0 PID: 23702 at lib/list_debug.c:31 __list_add_valid+0x8e/0xa0 [ 1615.094913] list_add double add: new=ffff9d7901d9f608, prev=ffff9d7901d9f608, next=ffff9d7901ee8dd0. [ 1615.095458] Modules linked in: nfsv4(E) nfs(E) nfsd(E) tun bridge stp llc fuse ip_set nfnetlink vmw_vsock_vmci_transport vsock f2fs snd_seq_midi snd_seq_midi_event fscrypto coretemp ppdev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_rapl_perf vmw_balloon snd_ens1371 joydev gameport snd_ac97_codec ac97_bus snd_seq snd_pcm snd_rawmidi snd_timer snd_seq_device snd soundcore nfit parport_pc parport acpi_cpufreq tpm_tis tpm_tis_core tpm i2c_piix4 vmw_vmci shpchp auth_rpcgss nfs_acl lockd(E) grace sunrpc(E) xfs libcrc32c vmwgfx drm_kms_helper ttm drm crc32c_intel mptspi e1000 serio_raw scsi_transport_spi mptscsih mptbase ata_generic pata_acpi fjes [last unloaded: nfs] [ 1615.097663] CPU: 0 PID: 23702 Comm: fstest Tainted: G W E 4.11.0-rc1+ #517 [ 1615.098015] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 1615.098807] Call Trace: [ 1615.099183] dump_stack+0x63/0x86 [ 1615.099578] __warn+0xcb/0xf0 [ 1615.099967] warn_slowpath_fmt+0x5f/0x80 [ 1615.100370] __list_add_valid+0x8e/0xa0 [ 1615.100760] nfs4_put_state_owner+0x75/0xc0 [nfsv4] [ 1615.101136] __nfs4_close+0x109/0x140 [nfsv4] [ 1615.101524] nfs4_close_state+0x15/0x20 [nfsv4] [ 1615.101949] nfs4_close_context+0x21/0x30 [nfsv4] [ 1615.102691] __put_nfs_open_context+0xb8/0x110 [nfs] [ 1615.103155] put_nfs_open_context+0x10/0x20 [nfs] [ 1615.103586] nfs4_file_open+0x13b/0x260 [nfsv4] [ 1615.103978] do_dentry_open+0x20a/0x2f0 [ 1615.104369] ? nfs4_copy_file_range+0x30/0x30 [nfsv4] [ 1615.104739] vfs_open+0x4c/0x70 [ 1615.105106] ? may_open+0x5a/0x100 [ 1615.105469] path_openat+0x623/0x1420 [ 1615.105823] do_filp_open+0x91/0x100 [ 1615.106174] ? __alloc_fd+0x3f/0x170 [ 1615.106568] do_sys_open+0x130/0x220 [ 1615.106920] ? __put_cred+0x3d/0x50 [ 1615.107256] SyS_open+0x1e/0x20 [ 1615.107588] entry_SYSCALL_64_fastpath+0x1a/0xa9 [ 1615.107922] RIP: 0033:0x7fab599069b0 [ 1615.108247] RSP: 002b:00007ffcf0600d78 EFLAGS: 00000246 ORIG_RAX: 0000000000000002 [ 1615.108575] RAX: ffffffffffffffda RBX: 00007fab59bcfae0 RCX: 00007fab599069b0 [ 1615.108896] RDX: 0000000000000200 RSI: 0000000000000200 RDI: 00007ffcf060255e [ 1615.109211] RBP: 0000000000040010 R08: 0000000000000000 R09: 0000000000000016 [ 1615.109515] R10: 00000000000006a1 R11: 0000000000000246 R12: 0000000000041000 [ 1615.109806] R13: 0000000000040010 R14: 0000000000001000 R15: 0000000000002710 [ 1615.110152] ---[ end trace 96ed63b1306bf2f3 ]--- Fixes: a974deee47 ("NFSv4: Fix memory and state leak in...") Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31CIFS: Improve readdir verbosityPavel Shilovsky
[ Upstream commit dcd87838c06f05ab7650b249ebf0d5b57ae63e1e ] Downgrade the loglevel for SMB2 to prevent filling the log with messages if e.g. readdir was interrupted. Also make SMB2 and SMB1 codepaths do the same logging during readdir. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31autofs: sanity check status reported with AUTOFS_DEV_IOCTL_FAILNeilBrown
[ Upstream commit 9fa4eb8e490a28de40964b1b0e583d8db4c7e57c ] If a positive status is passed with the AUTOFS_DEV_IOCTL_FAIL ioctl, autofs4_d_automount() will return ERR_PTR(status) with that status to follow_automount(), which will then dereference an invalid pointer. So treat a positive status the same as zero, and map to ENOENT. See comment in systemd src/core/automount.c::automount_send_ready(). Link: http://lkml.kernel.org/r/871sqwczx5.fsf@notabene.neil.brown.name Signed-off-by: NeilBrown <neilb@suse.com> Cc: Ian Kent <raven@themaw.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31fs/exec.c: account for argv/envp pointersKees Cook
[ Upstream commit 98da7d08850fb8bdeb395d6368ed15753304aa0c ] When limiting the argv/envp strings during exec to 1/4 of the stack limit, the storage of the pointers to the strings was not included. This means that an exec with huge numbers of tiny strings could eat 1/4 of the stack limit in strings and then additional space would be later used by the pointers to the strings. For example, on 32-bit with a 8MB stack rlimit, an exec with 1677721 single-byte strings would consume less than 2MB of stack, the max (8MB / 4) amount allowed, but the pointers to the strings would consume the remaining additional stack space (1677721 * 4 == 6710884). The result (1677721 + 6710884 == 8388605) would exhaust stack space entirely. Controlling this stack exhaustion could result in pathological behavior in setuid binaries (CVE-2017-1000365). [akpm@linux-foundation.org: additional commenting from Kees] Fixes: b6a2fea39318 ("mm: variable length argument support") Link: http://lkml.kernel.org/r/20170622001720.GA32173@beast Signed-off-by: Kees Cook <keescook@chromium.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Qualys Security Advisory <qsa@qualys.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-31Call echo service immediately after socket reconnectSachin Prabhu
[ Upstream commit b8c600120fc87d53642476f48c8055b38d6e14c7 ] Commit 4fcd1813e640 ("Fix reconnect to not defer smb3 session reconnect long after socket reconnect") changes the behaviour of the SMB2 echo service and causes it to renegotiate after a socket reconnect. However under default settings, the echo service could take up to 120 seconds to be scheduled. The patch forces the echo service to be called immediately resulting a negotiate call being made immediately on reconnect. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-26Make __xfs_xattr_put_listen preperly report errors.Artem Savkov
[ Upstream commit 791cc43b36eb1f88166c8505900cad1b43c7fe1a ] Commit 2a6fba6 "xfs: only return -errno or success from attr ->put_listent" changes the returnvalue of __xfs_xattr_put_listen to 0 in case when there is insufficient space in the buffer assuming that setting context->count to -1 would be enough, but all of the ->put_listent callers only check seen_enough. This results in a failed assertion: XFS: Assertion failed: context->count >= 0, file: fs/xfs/xfs_xattr.c, line: 175 in insufficient buffer size case. This is only reproducible with at least 2 xattrs and only when the buffer gets depleted before the last one. Furthermore if buffersize is such that it is enough to hold the last xattr's name, but not enough to hold the sum of preceeding xattr names listxattr won't fail with ERANGE, but will suceed returning last xattr's name without the first character. The first character end's up overwriting data stored at (context->alist - 1). Signed-off-by: Artem Savkov <asavkov@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-26NFSv4: Don't perform cached access checks before we've OPENed the fileTrond Myklebust
[ Upstream commit 762674f86d0328d5dc923c966e209e1ee59663f2 ] Donald Buczek reports that a nfs4 client incorrectly denies execute access based on outdated file mode (missing 'x' bit). After the mode on the server is 'fixed' (chmod +x) further execution attempts continue to fail, because the nfs ACCESS call updates the access parameter but not the mode parameter or the mode in the inode. The root cause is ultimately that the VFS is calling may_open() before the NFS client has a chance to OPEN the file and hence revalidate the access and attribute caches. Al Viro suggests: >>> Make nfs_permission() relax the checks when it sees MAY_OPEN, if you know >>> that things will be caught by server anyway? >> >> That can work as long as we're guaranteed that everything that calls >> inode_permission() with MAY_OPEN on a regular file will also follow up >> with a vfs_open() or dentry_open() on success. Is this always the >> case? > > 1) in do_tmpfile(), followed by do_dentry_open() (not reachable by NFS since > it doesn't have ->tmpfile() instance anyway) > > 2) in atomic_open(), after the call of ->atomic_open() has succeeded. > > 3) in do_last(), followed on success by vfs_open() > > That's all. All calls of inode_permission() that get MAY_OPEN come from > may_open(), and there's no other callers of that puppy. Reported-by: Donald Buczek <buczek@molgen.mpg.de> Link: https://bugzilla.kernel.org/show_bug.cgi?id=109771 Link: http://lkml.kernel.org/r/1451046656-26319-1-git-send-email-buczek@molgen.mpg.de Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-26NFS: Ensure we revalidate attributes before using execute_ok()Trond Myklebust
[ Upstream commit 5c5fc09a1157a11dbe84e6421c3e0b37d05238cb ] Donald Buczek reports that NFS clients can also report incorrect results for access() due to lack of revalidation of attributes before calling execute_ok(). Looking closely, it seems chdir() is afflicted with the same problem. Fix is to ensure we call nfs_revalidate_inode_rcu() or nfs_revalidate_inode() as appropriate before deciding to trust execute_ok(). Reported-by: Donald Buczek <buczek@molgen.mpg.de> Link: http://lkml.kernel.org/r/1451331530-3748-1-git-send-email-buczek@molgen.mpg.de Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-26xfs: only return -errno or success from attr ->put_listentEric Sandeen
[ Upstream commit 2a6fba6d2311151598abaa1e7c9abd5f8d024a43 ] Today, the put_listent formatters return either 1 or 0; if they return 1, some callers treat this as an error and return it up the stack, despite "1" not being a valid (negative) error code. The intent seems to be that if the input buffer is full, we set seen_enough or set count = -1, and return 1; but some callers check the return before checking the seen_enough or count fields of the context. Fix this by only returning non-zero for actual errors encountered, and rely on the caller to first check the return value, then check the values in the context to decide what to do. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-26xfs: in _attrlist_by_handle, copy the cursor back to userspaceDarrick J. Wong
[ Upstream commit 0facef7fb053be4353c0a48c2f48c9dbee91cb19 ] When we're iterating inode xattrs by handle, we have to copy the cursor back to userspace so that a subsequent invocation actually retrieves subsequent contents. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-07-10Merge tag 'v4.1.42' into 4.1-2.0.x-imxOtavio Salvador
Linux 4.1.42 * tag 'v4.1.42': (146 commits) Linux 4.1.42 mm: fix new crash in unmapped_area_topdown() mm: larger stack guard gap, between vmas alarmtimer: Rate limit periodic intervals MIPS: Fix bnezc/jialc return address calculation usb: dwc3: exynos fix axius clock error path to do cleanup genirq: Release resources in __setup_irq() error path swap: cond_resched in swap_cgroup_prepare() mm/memory-failure.c: use compound_head() flags for huge pages USB: gadgetfs, dummy-hcd, net2280: fix locking for callbacks usb: xhci: ASMedia ASM1042A chipset need shorts TX quirk drivers/misc/c2port/c2port-duramar2150.c: checking for NULL instead of IS_ERR() usb: r8a66597-hcd: decrease timeout usb: r8a66597-hcd: select a different endpoint on timeout USB: gadget: dummy_hcd: fix hub-descriptor removable fields [media] pvrusb2: reduce stack usage pvr2_eeprom_analyze() usb: core: fix potential memory leak in error path during hcd creation USB: hub: fix SS max number of ports iio: proximity: as3935: recalibrate RCO after resume staging: rtl8188eu: prevent an underflow in rtw_check_beacon_data() ...
2017-06-28mm: larger stack guard gap, between vmasSasha Levin
[ Upstream commit 1be7107fbe18eed3e319a6c3e83c78254b693acb ] Stack guard page is a useful feature to reduce a risk of stack smashing into a different mapping. We have been using a single page gap which is sufficient to prevent having stack adjacent to a different mapping. But this seems to be insufficient in the light of the stack usage in userspace. E.g. glibc uses as large as 64kB alloca() in many commonly used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX] which is 256kB or stack strings with MAX_ARG_STRLEN. This will become especially dangerous for suid binaries and the default no limit for the stack size limit because those applications can be tricked to consume a large portion of the stack and a single glibc call could jump over the guard page. These attacks are not theoretical, unfortunatelly. Make those attacks less probable by increasing the stack guard gap to 1MB (on systems with 4k pages; but make it depend on the page size because systems with larger base pages might cap stack allocations in the PAGE_SIZE units) which should cover larger alloca() and VLA stack allocations. It is obviously not a full fix because the problem is somehow inherent, but it should reduce attack space a lot. One could argue that the gap size should be configurable from userspace, but that can be done later when somebody finds that the new 1MB is wrong for some special case applications. For now, add a kernel command line option (stack_guard_gap) to specify the stack gap size (in page units). Implementation wise, first delete all the old code for stack guard page: because although we could get away with accounting one extra page in a stack vma, accounting a larger gap can break userspace - case in point, a program run with "ulimit -S -v 20000" failed when the 1MB gap was counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK and strict non-overcommit mode. Instead of keeping gap inside the stack vma, maintain the stack guard gap as a gap between vmas: using vm_start_gap() in place of vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places which need to respect the gap - mainly arch_get_unmapped_area(), and and the vma tree's subtree_gap support for that. Original-patch-by: Oleg Nesterov <oleg@redhat.com> Original-patch-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25configfs: Fix race between create_link and configfs_rmdirNicholas Bellinger
[ Upstream commit ba80aa909c99802c428682c352b0ee0baac0acd3 ] This patch closes a long standing race in configfs between the creation of a new symlink in create_link(), while the symlink target's config_item is being concurrently removed via configfs_rmdir(). This can happen because the symlink target's reference is obtained by config_item_get() in create_link() before the CONFIGFS_USET_DROPPING bit set by configfs_detach_prep() during configfs_rmdir() shutdown is actually checked.. This originally manifested itself on ppc64 on v4.8.y under heavy load using ibmvscsi target ports with Novalink API: [ 7877.289863] rpadlpar_io: slot U8247.22L.212A91A-V1-C8 added [ 7879.893760] ------------[ cut here ]------------ [ 7879.893768] WARNING: CPU: 15 PID: 17585 at ./include/linux/kref.h:46 config_item_get+0x7c/0x90 [configfs] [ 7879.893811] CPU: 15 PID: 17585 Comm: targetcli Tainted: G O 4.8.17-customv2.22 #12 [ 7879.893812] task: c00000018a0d3400 task.stack: c0000001f3b40000 [ 7879.893813] NIP: d000000002c664ec LR: d000000002c60980 CTR: c000000000b70870 [ 7879.893814] REGS: c0000001f3b43810 TRAP: 0700 Tainted: G O (4.8.17-customv2.22) [ 7879.893815] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 28222242 XER: 00000000 [ 7879.893820] CFAR: d000000002c664bc SOFTE: 1 GPR00: d000000002c60980 c0000001f3b43a90 d000000002c70908 c0000000fbc06820 GPR04: c0000001ef1bd900 0000000000000004 0000000000000001 0000000000000000 GPR08: 0000000000000000 0000000000000001 d000000002c69560 d000000002c66d80 GPR12: c000000000b70870 c00000000e798700 c0000001f3b43ca0 c0000001d4949d40 GPR16: c00000014637e1c0 0000000000000000 0000000000000000 c0000000f2392940 GPR20: c0000001f3b43b98 0000000000000041 0000000000600000 0000000000000000 GPR24: fffffffffffff000 0000000000000000 d000000002c60be0 c0000001f1dac490 GPR28: 0000000000000004 0000000000000000 c0000001ef1bd900 c0000000f2392940 [ 7879.893839] NIP [d000000002c664ec] config_item_get+0x7c/0x90 [configfs] [ 7879.893841] LR [d000000002c60980] check_perm+0x80/0x2e0 [configfs] [ 7879.893842] Call Trace: [ 7879.893844] [c0000001f3b43ac0] [d000000002c60980] check_perm+0x80/0x2e0 [configfs] [ 7879.893847] [c0000001f3b43b10] [c000000000329770] do_dentry_open+0x2c0/0x460 [ 7879.893849] [c0000001f3b43b70] [c000000000344480] path_openat+0x210/0x1490 [ 7879.893851] [c0000001f3b43c80] [c00000000034708c] do_filp_open+0xfc/0x170 [ 7879.893853] [c0000001f3b43db0] [c00000000032b5bc] do_sys_open+0x1cc/0x390 [ 7879.893856] [c0000001f3b43e30] [c000000000009584] system_call+0x38/0xec [ 7879.893856] Instruction dump: [ 7879.893858] 409d0014 38210030 e8010010 7c0803a6 4e800020 3d220000 e94981e0 892a0000 [ 7879.893861] 2f890000 409effe0 39200001 992a0000 <0fe00000> 4bffffd0 60000000 60000000 [ 7879.893866] ---[ end trace 14078f0b3b5ad0aa ]--- To close this race, go ahead and obtain the symlink's target config_item reference only after the existing CONFIGFS_USET_DROPPING check succeeds. This way, if configfs_rmdir() wins create_link() will return -ENONET, and if create_link() wins configfs_rmdir() will return -EBUSY. Reported-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com> Tested-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25proc: add a schedule point in proc_pid_readdir()Eric Dumazet
[ Upstream commit 3ba4bceef23206349d4130ddf140819b365de7c8 ] We have seen proc_pid_readdir() invocations holding cpu for more than 50 ms. Add a cond_resched() to be gentle with other tasks. [akpm@linux-foundation.org: coding style fix] Link: http://lkml.kernel.org/r/1484238380.15816.42.camel@edumazet-glaptop3.roam.corp.google.com Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25romfs: use different way to generate fsid for BLOCK or MTDColy Li
[ Upstream commit f598f82e204ec0b17797caaf1b0311c52d43fb9a ] Commit 8a59f5d25265 ("fs/romfs: return f_fsid for statfs(2)") generates a 64bit id from sb->s_bdev->bd_dev. This is only correct when romfs is defined with CONFIG_ROMFS_ON_BLOCK. If romfs is only defined with CONFIG_ROMFS_ON_MTD, sb->s_bdev is NULL, referencing sb->s_bdev->bd_dev will triger an oops. Richard Weinberger points out that when CONFIG_ROMFS_BACKED_BY_BOTH=y, both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD are defined. Therefore when calling huge_encode_dev() to generate a 64bit id, I use the follow order to choose parameter, - CONFIG_ROMFS_ON_BLOCK defined use sb->s_bdev->bd_dev - CONFIG_ROMFS_ON_BLOCK undefined and CONFIG_ROMFS_ON_MTD defined use sb->s_dev when, - both CONFIG_ROMFS_ON_BLOCK and CONFIG_ROMFS_ON_MTD undefined leave id as 0 When CONFIG_ROMFS_ON_MTD is defined and sb->s_mtd is not NULL, sb->s_dev is set to a device ID generated by MTD_BLOCK_MAJOR and mtd index, otherwise sb->s_dev is 0. This is a try-best effort to generate a uniq file system ID, if all the above conditions are not meet, f_fsid of this romfs instance will be 0. Generally only one romfs can be built on single MTD block device, this method is enough to identify multiple romfs instances in a computer. Link: http://lkml.kernel.org/r/1482928596-115155-1-git-send-email-colyli@suse.de Signed-off-by: Coly Li <colyli@suse.de> Reported-by: Nong Li <nongli1031@gmail.com> Tested-by: Nong Li <nongli1031@gmail.com> Cc: Richard Weinberger <richard.weinberger@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25fscache: Clear outstanding writes when disabling a cookieDavid Howells
[ Upstream commit 6bdded59c8933940ac7e5b416448276ac89d1144 ] fscache_disable_cookie() needs to clear the outstanding writes on the cookie it's disabling because they cannot be completed after. Without this, fscache_nfs_open_file() gets stuck because it disables the cookie when the file is opened for writing but can't uncache the pages till afterwards - otherwise there's a race between the open routine and anyone who already has it open R/O and is still reading from it. Looking in /proc/pid/stack of the offending process shows: [<ffffffffa0142883>] __fscache_wait_on_page_write+0x82/0x9b [fscache] [<ffffffffa014336e>] __fscache_uncache_all_inode_pages+0x91/0xe1 [fscache] [<ffffffffa01740fa>] nfs_fscache_open_file+0x59/0x9e [nfs] [<ffffffffa01ccf41>] nfs4_file_open+0x17f/0x1b8 [nfsv4] [<ffffffff8117350e>] do_dentry_open+0x16d/0x2b7 [<ffffffff811743ac>] vfs_open+0x5c/0x65 [<ffffffff81184185>] path_openat+0x785/0x8fb [<ffffffff81184343>] do_filp_open+0x48/0x9e [<ffffffff81174710>] do_sys_open+0x13b/0x1cb [<ffffffff811747b9>] SyS_open+0x19/0x1b [<ffffffff81001c44>] do_syscall_64+0x80/0x17a [<ffffffff8165c2da>] return_from_SYSCALL_64+0x0/0x7a [<ffffffffffffffff>] 0xffffffffffffffff Reported-by: Jianhong Yin <jiyin@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jeff Layton <jlayton@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25btrfs: fix memory leak in update_space_info failure pathJeff Mahoney
[ Upstream commit 896533a7da929136d0432713f02a3edffece2826 ] If we fail to add the space_info kobject, we'll leak the memory for the percpu counter. Fixes: 6ab0a2029c (btrfs: publish allocation data in sysfs) Cc: <stable@vger.kernel.org> # v3.14+ Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25btrfs: use correct types for page indices in btrfs_page_exists_in_rangeDavid Sterba
[ Upstream commit cc2b702c52094b637a351d7491ac5200331d0445 ] Variables start_idx and end_idx are supposed to hold a page index derived from the file offsets. The int type is not the right one though, offsets larger than 1 << 44 will get silently trimmed off the high bits. (1 << 44 is 16TiB) What can go wrong, if start is below the boundary and end gets trimmed: - if there's a page after start, we'll find it (radix_tree_gang_lookup_slot) - the final check "if (page->index <= end_idx)" will unexpectedly fail The function will return false, ie. "there's no page in the range", although there is at least one. btrfs_page_exists_in_range is used to prevent races in: * in hole punching, where we make sure there are not pages in the truncated range, otherwise we'll wait for them to finish and redo truncation, but we're going to replace the pages with holes anyway so the only problem is the intermediate state * lock_extent_direct: we want to make sure there are no pages before we lock and start DIO, to prevent stale data reads For practical occurence of the bug, there are several constaints. The file must be quite large, the affected range must cross the 16TiB boundary and the internal state of the file pages and pending operations must match. Also, we must not have started any ordered data in the range, otherwise we don't even reach the buggy function check. DIO locking tries hard in several places to avoid deadlocks with buffered IO and avoids waiting for ranges. The worst consequence seems to be stale data read. CC: Liu Bo <bo.li.liu@oracle.com> CC: stable@vger.kernel.org # 3.16+ Fixes: fc4adbff823f7 ("btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking") Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25ufs: set correct ->s_maxsizeAl Viro
[ Upstream commit 6b0d144fa758869bdd652c50aa41aaf601232550 ] Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25fix ufs_isblockset()Al Viro
[ Upstream commit 414cf7186dbec29bd946c138d6b5c09da5955a08 ] Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25ext4: keep existing extra fields when inode expandsKonstantin Khlebnikov
[ Upstream commit 887a9730614727c4fff7cb756711b190593fc1df ] ext4_expand_extra_isize() should clear only space between old and new size. Fixes: 6dd4ee7cab7e # v2.6.23 Cc: stable@vger.kernel.org Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25ext4: fix SEEK_HOLEJan Kara
[ Upstream commit 7d95eddf313c88b24f99d4ca9c2411a4b82fef33 ] Currently, SEEK_HOLE implementation in ext4 may both return that there's a hole at some offset although that offset already has data and skip some holes during a search for the next hole. The first problem is demostrated by: xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "seek -h 0" file wrote 57344/57344 bytes at offset 0 56 KiB, 14 ops; 0.0000 sec (2.054 GiB/sec and 538461.5385 ops/sec) Whence Result HOLE 0 Where we can see that SEEK_HOLE wrongly returned offset 0 as containing a hole although we have written data there. The second problem can be demonstrated by: xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k" -c "seek -h 0" file wrote 57344/57344 bytes at offset 0 56 KiB, 14 ops; 0.0000 sec (1.978 GiB/sec and 518518.5185 ops/sec) wrote 8192/8192 bytes at offset 131072 8 KiB, 2 ops; 0.0000 sec (2 GiB/sec and 500000.0000 ops/sec) Whence Result HOLE 139264 Where we can see that hole at offsets 56k..128k has been ignored by the SEEK_HOLE call. The underlying problem is in the ext4_find_unwritten_pgoff() which is just buggy. In some cases it fails to update returned offset when it finds a hole (when no pages are found or when the first found page has higher index than expected), in some cases conditions for detecting hole are just missing (we fail to detect a situation where indices of returned pages are not contiguous). Fix ext4_find_unwritten_pgoff() to properly detect non-contiguous page indices and also handle all cases where we got less pages then expected in one place and handle it properly there. CC: stable@vger.kernel.org Fixes: c8c0df241cc2719b1262e627f999638411934f60 CC: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25nfsd4: fix null dereference on replayJ. Bruce Fields
[ Upstream commit 9a307403d374b993061f5992a6e260c944920d0b ] if we receive a compound such that: - the sessionid, slot, and sequence number in the SEQUENCE op match a cached succesful reply with N ops, and - the Nth operation of the compound is a PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH, then nfsd4_sequence will return 0 and set cstate->status to nfserr_replay_cache. The current filehandle will not be set. This will cause us to call check_nfsd_access with first argument NULL. To nfsd4_compound it looks like we just succesfully executed an operation that set a filehandle, but the current filehandle is not set. Fix this by moving the nfserr_replay_cache earlier. There was never any reason to have it after the encode_op label, since the only case where he hit that is when opdesc->op_func sets it. Note that there are two ways we could hit this case: - a client is resending a previously sent compound that ended with one of the four PUTFH-like operations, or - a client is sending a *new* compound that (incorrectly) shares sessionid, slot, and sequence number with a previously sent compound, and the length of the previously sent compound happens to match the position of a PUTFH-like operation in the new compound. The second is obviously incorrect client behavior. The first is also very strange--the only purpose of a PUTFH-like operation is to set the current filehandle to be used by the following operation, so there's no point in having it as the last in a compound. So it's likely this requires a buggy or malicious client to reproduce. Reported-by: Scott Mayhew <smayhew@redhat.com> Cc: stable@kernel.vger.org Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25xfs: fix unaligned access in xfs_btree_visit_blocksEric Sandeen
[ Upstream commit a4d768e702de224cc85e0c8eac9311763403b368 ] This structure copy was throwing unaligned access warnings on sparc64: Kernel unaligned access at TPC[1043c088] xfs_btree_visit_blocks+0x88/0xe0 [xfs] xfs_btree_copy_ptrs does a memcpy, which avoids it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25xfs: bad assertion for delalloc an extent that start at i_sizeZorro Lang
[ Upstream commit 892d2a5f705723b2cb488bfb38bcbdcf83273184 ] By run fsstress long enough time enough in RHEL-7, I find an assertion failure (harder to reproduce on linux-4.11, but problem is still there): XFS: Assertion failed: (iflags & BMV_IF_DELALLOC) != 0, file: fs/xfs/xfs_bmap_util.c The assertion is in xfs_getbmap() funciton: if (map[i].br_startblock == DELAYSTARTBLOCK && --> map[i].br_startoff <= XFS_B_TO_FSB(mp, XFS_ISIZE(ip))) ASSERT((iflags & BMV_IF_DELALLOC) != 0); When map[i].br_startoff == XFS_B_TO_FSB(mp, XFS_ISIZE(ip)), the startoff is just at EOF. But we only need to make sure delalloc extents that are within EOF, not include EOF. Signed-off-by: Zorro Lang <zlang@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25xfs: fix indlen accounting error on partial delalloc conversionBrian Foster
[ Upstream commit 0daaecacb83bc6b656a56393ab77a31c28139bc7 ] The delalloc -> real block conversion path uses an incorrect calculation in the case where the middle part of a delalloc extent is being converted. This is documented as a rare situation because XFS generally attempts to maximize contiguity by converting as much of a delalloc extent as possible. If this situation does occur, the indlen reservation for the two new delalloc extents left behind by the conversion of the middle range is calculated and compared with the original reservation. If more blocks are required, the delta is allocated from the global block pool. This delta value can be characterized as the difference between the new total requirement (temp + temp2) and the currently available reservation minus those blocks that have already been allocated (startblockval(PREV.br_startblock) - allocated). The problem is that the current code does not account for previously allocated blocks correctly. It subtracts the current allocation count from the (new - old) delta rather than the old indlen reservation. This means that more indlen blocks than have been allocated end up stashed in the remaining extents and free space accounting is broken as a result. Fix up the calculation to subtract the allocated block count from the original extent indlen and thus correctly allocate the reservation delta based on the difference between the new total requirement and the unused blocks from the original reservation. Also remove a bogus assert that contradicts the fact that the new indlen reservation can be larger than the original indlen reservation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25xfs: wait on new inodes during quotaoff dquot releaseBrian Foster
[ Upstream commit e20c8a517f259cb4d258e10b0cd5d4b30d4167a0 ] The quotaoff operation has a race with inode allocation that results in a livelock. An inode allocation that occurs before the quota status flags are updated acquires the appropriate dquots for the inode via xfs_qm_vop_dqalloc(). It then inserts the XFS_INEW inode into the perag radix tree, sometime later attaches the dquots to the inode and finally clears the XFS_INEW flag. Quotaoff expects to release the dquots from all inodes in the filesystem via xfs_qm_dqrele_all_inodes(). This invokes the AG inode iterator, which skips inodes in the XFS_INEW state because they are not fully constructed. If the scan occurs after dquots have been attached to an inode, but before XFS_INEW is cleared, the newly allocated inode will continue to hold a reference to the applicable dquots. When quotaoff invokes xfs_qm_dqpurge_all(), the reference count of those dquot(s) remain elevated and the dqpurge scan spins indefinitely. To address this problem, update the xfs_qm_dqrele_all_inodes() scan to wait on inodes marked on the XFS_INEW state. We wait on the inodes explicitly rather than skip and retry to avoid continuous retry loops due to a parallel inode allocation workload. Since quotaoff updates the quota state flags and uses a synchronous transaction before the dqrele scan, and dquots are attached to inodes after radix tree insertion iff quota is enabled, one INEW waiting pass through the AG guarantees that the scan has processed all inodes that could possibly hold dquot references. Reported-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
2017-06-25xfs: update ag iterator to support wait on new inodesBrian Foster
[ Upstream commit ae2c4ac2dd39b23a87ddb14ceddc3f2872c6aef5 ] The AG inode iterator currently skips new inodes as such inodes are inserted into the inode radix tree before they are fully constructed. Certain contexts require the ability to wait on the construction of new inodes, however. The fs-wide dquot release from the quotaoff sequence is an example of this. Update the AG inode iterator to support the ability to wait on inodes flagged with XFS_INEW upon request. Create a new xfs_inode_ag_iterator_flags() interface and support a set of iteration flags to modify the iteration behavior. When the XFS_AGITER_INEW_WAIT flag is set, include XFS_INEW flags in the radix tree inode lookup and wait on them before the callback is executed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>