summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2020-02-07Merge tag 'v4.14.164' into 4.14-2.3.x-imxMarcel Ziswiler
This is the 4.14.164 stable release Conflicts: arch/arm/Kconfig.debug arch/arm/boot/dts/imx7s.dtsi arch/arm/mach-imx/cpuidle-imx6q.c arch/arm/mach-imx/cpuidle-imx6sx.c arch/arm64/kernel/cpu_errata.c arch/arm64/kvm/hyp/tlb.c drivers/crypto/caam/caamalg.c drivers/crypto/mxs-dcp.c drivers/dma/imx-sdma.c drivers/gpio/gpio-vf610.c drivers/gpu/drm/bridge/adv7511/adv7511_drv.c drivers/input/keyboard/imx_keypad.c drivers/input/keyboard/snvs_pwrkey.c drivers/mmc/core/block.c drivers/mmc/core/queue.h drivers/mmc/host/sdhci-esdhc-imx.c drivers/net/can/flexcan.c drivers/net/can/rx-offload.c drivers/net/ethernet/freescale/fec_main.c drivers/net/wireless/ath/ath10k/pci.c drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c drivers/pci/dwc/pci-imx6.c drivers/spi/spi-fsl-lpspi.c drivers/usb/dwc3/gadget.c include/net/tcp.h sound/soc/fsl/Kconfig sound/soc/fsl/fsl_esai.c
2020-01-12fs: avoid softlockups in s_inodes iteratorsEric Sandeen
[ Upstream commit 04646aebd30b99f2cfa0182435a2ec252fcb16d0 ] Anything that walks all inodes on sb->s_inodes list without rescheduling risks softlockups. Previous efforts were made in 2 functions, see: c27d82f fs/drop_caches.c: avoid softlockups in drop_pagecache_sb() ac05fbb inode: don't softlockup when evicting inodes but there hasn't been an audit of all walkers, so do that now. This also consistently moves the cond_resched() calls to the bottom of each loop in cases where it already exists. One loop remains: remove_dquot_ref(), because I'm not quite sure how to deal with that one w/o taking the i_lock. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-09fix compat handling of FICLONERANGE, FIDEDUPERANGE and FS_IOC_FIEMAPAl Viro
commit 6b2daec19094a90435abe67d16fb43b1a5527254 upstream. Unlike FICLONE, all of those take a pointer argument; they do need compat_ptr() applied to arg. Fixes: d79bdd52d8be ("vfs: wire up compat ioctl for CLONE/CLONE_RANGE") Fixes: 54dbc1517237 ("vfs: hoist the btrfs deduplication ioctl to the vfs") Fixes: ceac204e1da9 ("fs: make fiemap work from compat_ioctl") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09xfs: don't check for AG deadlock for realtime files in bunmapiOmar Sandoval
commit 69ffe5960df16938bccfe1b65382af0b3de51265 upstream. Commit 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi") added a check in __xfs_bunmapi() to stop early if we would touch multiple AGs in the wrong order. However, this check isn't applicable for realtime files. In most cases, it just makes us do unnecessary commits. However, without the fix from the previous commit ("xfs: fix realtime file data space leak"), if the last and second-to-last extents also happen to have different "AG numbers", then the break actually causes __xfs_bunmapi() to return without making any progress, which sends xfs_itruncate_extents_flags() into an infinite loop. Fixes: 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09nfsd4: fix up replay_matches_cache()Scott Mayhew
commit 6e73e92b155c868ff7fce9d108839668caf1d9be upstream. When running an nfs stress test, I see quite a few cached replies that don't match up with the actual request. The first comment in replay_matches_cache() makes sense, but the code doesn't seem to match... fix it. This isn't exactly a bugfix, as the server isn't required to catch every case of a false retry. So, we may as well do this, but if this is fixing a problem then that suggests there's a client bug. Fixes: 53da6a53e1d4 ("nfsd4: catch some false session retries") Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09locks: print unsigned ino in /proc/locksAmir Goldstein
commit 98ca480a8f22fdbd768e3dad07024c8d4856576c upstream. An ino is unsigned, so display it as such in /proc/locks. Cc: stable@vger.kernel.org Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09pstore/ram: Write new dumps to start of recycled zonesAleksandr Yashkin
commit 9e5f1c19800b808a37fb9815a26d382132c26c3d upstream. The ram_core.c routines treat przs as circular buffers. When writing a new crash dump, the old buffer needs to be cleared so that the new dump doesn't end up in the wrong place (i.e. at the end). The solution to this problem is to reset the circular buffer state before writing a new Oops dump. Signed-off-by: Aleksandr Yashkin <a.yashkin@inango-systems.com> Signed-off-by: Nikolay Merinov <n.merinov@inango-systems.com> Signed-off-by: Ariel Gilman <a.gilman@inango-systems.com> Link: https://lore.kernel.org/r/20191223133816.28155-1-n.merinov@inango-systems.com Fixes: 896fc1f0c4c6 ("pstore/ram: Switch to persistent_ram routines") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-09xfs: fix mount failure crash on invalid iclog memory accessBrian Foster
[ Upstream commit 798a9cada4694ca8d970259f216cec47e675bfd5 ] syzbot (via KASAN) reports a use-after-free in the error path of xlog_alloc_log(). Specifically, the iclog freeing loop doesn't handle the case of a fully initialized ->l_iclog linked list. Instead, it assumes that the list is partially constructed and NULL terminated. This bug manifested because there was no possible error scenario after iclog list setup when the original code was added. Subsequent code and associated error conditions were added some time later, while the original error handling code was never updated. Fix up the error loop to terminate either on a NULL iclog or reaching the end of the list. Reported-by: syzbot+c732f8644185de340492@syzkaller.appspotmail.com Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04filldir[64]: remove WARN_ON_ONCE() for bad directory entriesLinus Torvalds
commit b9959c7a347d6adbb558fba7e36e9fef3cba3b07 upstream. This was always meant to be a temporary thing, just for testing and to see if it actually ever triggered. The only thing that reported it was syzbot doing disk image fuzzing, and then that warning is expected. So let's just remove it before -rc4, because the extra sanity testing should probably go to -stable, but we don't want the warning to do so. Reported-by: syzbot+3031f712c7ad5dd4d926@syzkaller.appspotmail.com Fixes: 8a23eb804ca4 ("Make filldir[64]() verify the directory entry filename is valid") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Siddharth Chandrasekaran <csiddharth@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-04Make filldir[64]() verify the directory entry filename is validLinus Torvalds
commit 8a23eb804ca4f2be909e372cf5a9e7b30ae476cd upstream. This has been discussed several times, and now filesystem people are talking about doing it individually at the filesystem layer, so head that off at the pass and just do it in getdents{64}(). This is partially based on a patch by Jann Horn, but checks for NUL bytes as well, and somewhat simplified. There's also commentary about how it might be better if invalid names due to filesystem corruption don't cause an immediate failure, but only an error at the end of the readdir(), so that people can still see the filenames that are ok. There's also been discussion about just how much POSIX strictly speaking requires this since it's about filesystem corruption. It's really more "protect user space from bad behavior" as pointed out by Jann. But since Eric Biederman looked up the POSIX wording, here it is for context: "From readdir: The readdir() function shall return a pointer to a structure representing the directory entry at the current position in the directory stream specified by the argument dirp, and position the directory stream at the next entry. It shall return a null pointer upon reaching the end of the directory stream. The structure dirent defined in the <dirent.h> header describes a directory entry. From definitions: 3.129 Directory Entry (or Link) An object that associates a filename with a file. Several directory entries can associate names with the same file. ... 3.169 Filename A name consisting of 1 to {NAME_MAX} bytes used to name a file. The characters composing the name may be selected from the set of all character values excluding the slash character and the null byte. The filenames dot and dot-dot have special meaning. A filename is sometimes referred to as a 'pathname component'." Note that I didn't bother adding the checks to any legacy interfaces that nobody uses. Also note that if this ends up being noticeable as a performance regression, we can fix that to do a much more optimized model that checks for both NUL and '/' at the same time one word at a time. We haven't really tended to optimize 'memchr()', and it only checks for one pattern at a time anyway, and we really _should_ check for NUL too (but see the comment about "soft errors" in the code about why it currently only checks for '/') See the CONFIG_DCACHE_WORD_ACCESS case of hash_name() for how the name lookup code looks for pathname terminating characters in parallel. Link: https://lore.kernel.org/lkml/20190118161440.220134-2-jannh@google.com/ Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jann Horn <jannh@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Siddharth Chandrasekaran <csiddharth@vmware.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-04userfaultfd: require CAP_SYS_PTRACE for UFFD_FEATURE_EVENT_FORKMike Rapoport
[ Upstream commit 3c1c24d91ffd536de0a64688a9df7f49e58fadbc ] A while ago Andy noticed (http://lkml.kernel.org/r/CALCETrWY+5ynDct7eU_nDUqx=okQvjm=Y5wJvA4ahBja=CQXGw@mail.gmail.com) that UFFD_FEATURE_EVENT_FORK used by an unprivileged user may have security implications. As the first step of the solution the following patch limits the availably of UFFD_FEATURE_EVENT_FORK only for those having CAP_SYS_PTRACE. The usage of CAP_SYS_PTRACE ensures compatibility with CRIU. Yet, if there are other users of non-cooperative userfaultfd that run without CAP_SYS_PTRACE, they would be broken :( Current implementation of UFFD_FEATURE_EVENT_FORK modifies the file descriptor table from the read() implementation of uffd, which may have security implications for unprivileged use of the userfaultfd. Limit availability of UFFD_FEATURE_EVENT_FORK only for callers that have CAP_SYS_PTRACE. Link: http://lkml.kernel.org/r/1572967777-8812-2-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Cc: Daniel Colascione <dancol@google.com> Cc: Jann Horn <jannh@google.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Nick Kralevich <nnk@google.com> Cc: Nosh Minwalla <nosh@google.com> Cc: Pavel Emelyanov <ovzxemul@gmail.com> Cc: Tim Murray <timmurray@google.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04ocfs2: fix passing zero to 'PTR_ERR' warningDing Xiang
[ Upstream commit 188c523e1c271d537f3c9f55b6b65bf4476de32f ] Fix a static code checker warning: fs/ocfs2/acl.c:331 ocfs2_acl_chmod() warn: passing zero to 'PTR_ERR' Link: http://lkml.kernel.org/r/1dee278b-6c96-eec2-ce76-fe6e07c6e20f@linux.alibaba.com Fixes: 5ee0fbd50fd ("ocfs2: revert using ocfs2_acl_chmod to avoid inode cluster lock hang") Signed-off-by: Ding Xiang <dingxiang@cmss.chinamobile.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04fs/quota: handle overflows of sysctl fs.quota.* and report as unsigned longKonstantin Khlebnikov
[ Upstream commit 6fcbcec9cfc7b3c6a2c1f1a23ebacedff7073e0a ] Quota statistics counted as 64-bit per-cpu counter. Reading sums per-cpu fractions as signed 64-bit int, filters negative values and then reports lower half as signed 32-bit int. Result may looks like: fs.quota.allocated_dquots = 22327 fs.quota.cache_hits = -489852115 fs.quota.drops = -487288718 fs.quota.free_dquots = 22083 fs.quota.lookups = -486883485 fs.quota.reads = 22327 fs.quota.syncs = 335064 fs.quota.writes = 3088689 Values bigger than 2^31-1 reported as negative. All counters except "allocated_dquots" and "free_dquots" are monotonic, thus they should be reported as is without filtering negative values. Kernel doesn't have generic helper for 64-bit sysctl yet, let's use at least unsigned long. Link: https://lore.kernel.org/r/157337934693.2078.9842146413181153727.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04f2fs: fix to update dir's i_pino during cross_renameChao Yu
[ Upstream commit 2a60637f06ac94869b2e630eaf837110d39bf291 ] As Eric reported: RENAME_EXCHANGE support was just added to fsstress in xfstests: commit 65dfd40a97b6bbbd2a22538977bab355c5bc0f06 Author: kaixuxia <xiakaixu1987@gmail.com> Date: Thu Oct 31 14:41:48 2019 +0800 fsstress: add EXCHANGE renameat2 support This is causing xfstest generic/579 to fail due to fsck.f2fs reporting errors. I'm not sure what the problem is, but it still happens even with all the fs-verity stuff in the test commented out, so that the test just runs fsstress. generic/579 23s ... [10:02:25] [ 7.745370] run fstests generic/579 at 2019-11-04 10:02:25 _check_generic_filesystem: filesystem on /dev/vdc is inconsistent (see /results/f2fs/results-default/generic/579.full for details) [10:02:47] Ran: generic/579 Failures: generic/579 Failed 1 of 1 tests Xunit report: /results/f2fs/results-default/result.xml Here's the contents of 579.full: _check_generic_filesystem: filesystem on /dev/vdc is inconsistent *** fsck.f2fs output *** [ASSERT] (__chk_dots_dentries:1378) --> Bad inode number[0x24] for '..', parent parent ino is [0xd10] The root cause is that we forgot to update directory's i_pino during cross_rename, fix it. Fixes: 32f9bc25cbda0 ("f2fs: support ->rename2()") Signed-off-by: Chao Yu <yuchao0@huawei.com> Tested-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04jbd2: Fix statistics for the number of logged blocksJan Kara
[ Upstream commit 015c6033068208d6227612c878877919f3fcf6b6 ] jbd2 statistics counting number of blocks logged in a transaction was wrong. It didn't count the commit block and more importantly it didn't count revoke descriptor blocks. Make sure these get properly counted. Reviewed-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-13-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04ext4: update direct I/O read lock pattern for IOCB_NOWAITMatthew Bobrowski
[ Upstream commit 548feebec7e93e58b647dba70b3303dcb569c914 ] This patch updates the lock pattern in ext4_direct_IO_read() to not block on inode lock in cases of IOCB_NOWAIT direct I/O reads. The locking condition implemented here is similar to that of 942491c9e6d6 ("xfs: fix AIM7 regression"). Fixes: 16c54688592c ("ext4: Allow parallel DIO reads") Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/c5d5e759f91747359fbd2c6f9a36240cf75ad79f.1572949325.git.mbobrowski@mbobrowski.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31ext4: unlock on error in ext4_expand_extra_isize()Dan Carpenter
commit 7f420d64a08c1dcd65b27be82a27cf2bdb2e7847 upstream. We need to unlock the xattr before returning on this error path. Cc: stable@kernel.org # 4.13 Fixes: c03b45b853f5 ("ext4, project: expand inode extra size if possible") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20191213185010.6k7yl2tck3wlsdkt@kili.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31ext4: check for directory entries too close to block endJan Kara
commit 109ba779d6cca2d519c5dd624a3276d03e21948e upstream. ext4_check_dir_entry() currently does not catch a case when a directory entry ends so close to the block end that the header of the next directory entry would not fit in the remaining space. This can lead to directory iteration code trying to access address beyond end of current buffer head leading to oops. CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191202170213.4761-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31ext4: fix ext4_empty_dir() for directories with holesJan Kara
commit 64d4ce892383b2ad6d782e080d25502f91bf2a38 upstream. Function ext4_empty_dir() doesn't correctly handle directories with holes and crashes on bh->b_data dereference when bh is NULL. Reorganize the loop to use 'offset' variable all the times instead of comparing pointers to current direntry with bh->b_data pointer. Also add more strict checking of '.' and '..' directory entries to avoid entering loop in possibly invalid state on corrupted filesystems. References: CVE-2019-19037 CC: stable@vger.kernel.org Fixes: 4e19d6b65fb4 ("ext4: allow directory holes") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191202170213.4761-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31btrfs: abort transaction after failed inode updates in create_subvolJosef Bacik
[ Upstream commit c7e54b5102bf3614cadb9ca32d7be73bad6cecf0 ] We can just abort the transaction here, and in fact do that for every other failure in this function except these two cases. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31btrfs: return error pointer from alloc_test_extent_bufferDan Carpenter
[ Upstream commit b6293c821ea8fa2a631a2112cd86cd435effeb8b ] Callers of alloc_test_extent_buffer have not correctly interpreted the return value as error pointer, as alloc_test_extent_buffer should behave as alloc_extent_buffer. The self-tests were unaffected but btrfs_find_create_tree_block could call both functions and that would cause problems up in the call chain. Fixes: faa2dbf004e8 ("Btrfs: add sanity tests for new qgroup accounting code") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31btrfs: don't prematurely free work in scrub_missing_raid56_worker()Omar Sandoval
[ Upstream commit 57d4f0b863272ba04ba85f86bfdc0f976f0af91c ] Currently, scrub_missing_raid56_worker() puts and potentially frees sblock (which embeds the work item) and then submits a bio through scrub_wr_submit(). This is another potential instance of the bug in "btrfs: don't prematurely free work in run_ordered_work()". Fix it by dropping the reference after we submit the bio. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31btrfs: don't prematurely free work in reada_start_machine_worker()Omar Sandoval
[ Upstream commit e732fe95e4cad35fc1df278c23a32903341b08b3 ] Currently, reada_start_machine_worker() frees the reada_machine_work and then calls __reada_start_machine() to do readahead. This is another potential instance of the bug in "btrfs: don't prematurely free work in run_ordered_work()". There _might_ already be a deadlock here: reada_start_machine_worker() can depend on itself through stacked filesystems (__read_start_machine() -> reada_start_machine_dev() -> reada_tree_block_flagged() -> read_extent_buffer_pages() -> submit_one_bio() -> btree_submit_bio_hook() -> btrfs_map_bio() -> submit_stripe_bio() -> submit_bio() onto a loop device can trigger readahead on the lower filesystem). Either way, let's fix it by freeing the work at the end. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31btrfs: don't prematurely free work in run_ordered_work()Omar Sandoval
[ Upstream commit c495dcd6fbe1dce51811a76bb85b4675f6494938 ] We hit the following very strange deadlock on a system with Btrfs on a loop device backed by another Btrfs filesystem: 1. The top (loop device) filesystem queues an async_cow work item from cow_file_range_async(). We'll call this work X. 2. Worker thread A starts work X (normal_work_helper()). 3. Worker thread A executes the ordered work for the top filesystem (run_ordered_work()). 4. Worker thread A finishes the ordered work for work X and frees X (work->ordered_free()). 5. Worker thread A executes another ordered work and gets blocked on I/O to the bottom filesystem (still in run_ordered_work()). 6. Meanwhile, the bottom filesystem allocates and queues an async_cow work item which happens to be the recently-freed X. 7. The workqueue code sees that X is already being executed by worker thread A, so it schedules X to be executed _after_ worker thread A finishes (see the find_worker_executing_work() call in process_one_work()). Now, the top filesystem is waiting for I/O on the bottom filesystem, but the bottom filesystem is waiting for the top filesystem to finish, so we deadlock. This happens because we are breaking the workqueue assumption that a work item cannot be recycled while it still depends on other work. Fix it by waiting to free the work item until we are done with all of the related ordered work. P.S.: One might ask why the workqueue code doesn't try to detect a recycled work item. It actually does try by checking whether the work item has the same work function (find_worker_executing_work()), but in our case the function is the same. This is the only key that the workqueue code has available to compare, short of adding an additional, layer-violating "custom key". Considering that we're the only ones that have ever hit this, we should just play by the rules. Unfortunately, we haven't been able to create a minimal reproducer other than our full container setup using a compress-force=zstd filesystem on top of another compress-force=zstd filesystem. Suggested-by: Tejun Heo <tj@kernel.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31btrfs: don't prematurely free work in end_workqueue_fn()Omar Sandoval
[ Upstream commit 9be490f1e15c34193b1aae17da58e14dd9f55a95 ] Currently, end_workqueue_fn() frees the end_io_wq entry (which embeds the work item) and then calls bio_endio(). This is another potential instance of the bug in "btrfs: don't prematurely free work in run_ordered_work()". In particular, the endio call may depend on other work items. For example, btrfs_end_dio_bio() can call btrfs_subio_endio_read() -> __btrfs_correct_data_nocsum() -> dio_read_error() -> submit_dio_repair_bio(), which submits a bio that is also completed through a end_workqueue_fn() work item. However, __btrfs_correct_data_nocsum() waits for the newly submitted bio to complete, thus it depends on another work item. This example currently usually works because we use different workqueue helper functions for BTRFS_WQ_ENDIO_DATA and BTRFS_WQ_ENDIO_DIO_REPAIR. However, it may deadlock with stacked filesystems and is fragile overall. The proper fix is to free the work item at the very end of the work function, so let's do that. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31Btrfs: fix removal logic of the tree mod log that leads to use-after-free issuesFilipe Manana
commit 6609fee8897ac475378388238456c84298bff802 upstream. When a tree mod log user no longer needs to use the tree it calls btrfs_put_tree_mod_seq() to remove itself from the list of users and delete all no longer used elements of the tree's red black tree, which should be all elements with a sequence number less then our equals to the caller's sequence number. However the logic is broken because it can delete and free elements from the red black tree that have a sequence number greater then the caller's sequence number: 1) At a point in time we have sequence numbers 1, 2, 3 and 4 in the tree mod log; 2) The task which got assigned the sequence number 1 calls btrfs_put_tree_mod_seq(); 3) Sequence number 1 is deleted from the list of sequence numbers; 4) The current minimum sequence number is computed to be the sequence number 2; 5) A task using sequence number 2 is at tree_mod_log_rewind() and gets a pointer to one of its elements from the red black tree through a call to tree_mod_log_search(); 6) The task with sequence number 1 iterates the red black tree of tree modification elements and deletes (and frees) all elements with a sequence number less then or equals to 2 (the computed minimum sequence number) - it ends up only leaving elements with sequence numbers of 3 and 4; 7) The task with sequence number 2 now uses the pointer to its element, already freed by the other task, at __tree_mod_log_rewind(), resulting in a use-after-free issue. When CONFIG_DEBUG_PAGEALLOC=y it produces a trace like the following: [16804.546854] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI [16804.547451] CPU: 0 PID: 28257 Comm: pool Tainted: G W 5.4.0-rc8-btrfs-next-51 #1 [16804.548059] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 [16804.548666] RIP: 0010:rb_next+0x16/0x50 (...) [16804.550581] RSP: 0018:ffffb948418ef9b0 EFLAGS: 00010202 [16804.551227] RAX: 6b6b6b6b6b6b6b6b RBX: ffff90e0247f6600 RCX: 6b6b6b6b6b6b6b6b [16804.551873] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90e0247f6600 [16804.552504] RBP: ffff90dffe0d4688 R08: 0000000000000001 R09: 0000000000000000 [16804.553136] R10: ffff90dffa4a0040 R11: 0000000000000000 R12: 000000000000002e [16804.553768] R13: ffff90e0247f6600 R14: 0000000000001663 R15: ffff90dff77862b8 [16804.554399] FS: 00007f4b197ae700(0000) GS:ffff90e036a00000(0000) knlGS:0000000000000000 [16804.555039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [16804.555683] CR2: 00007f4b10022000 CR3: 00000002060e2004 CR4: 00000000003606f0 [16804.556336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [16804.556968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [16804.557583] Call Trace: [16804.558207] __tree_mod_log_rewind+0xbf/0x280 [btrfs] [16804.558835] btrfs_search_old_slot+0x105/0xd00 [btrfs] [16804.559468] resolve_indirect_refs+0x1eb/0xc70 [btrfs] [16804.560087] ? free_extent_buffer.part.19+0x5a/0xc0 [btrfs] [16804.560700] find_parent_nodes+0x388/0x1120 [btrfs] [16804.561310] btrfs_check_shared+0x115/0x1c0 [btrfs] [16804.561916] ? extent_fiemap+0x59d/0x6d0 [btrfs] [16804.562518] extent_fiemap+0x59d/0x6d0 [btrfs] [16804.563112] ? __might_fault+0x11/0x90 [16804.563706] do_vfs_ioctl+0x45a/0x700 [16804.564299] ksys_ioctl+0x70/0x80 [16804.564885] ? trace_hardirqs_off_thunk+0x1a/0x20 [16804.565461] __x64_sys_ioctl+0x16/0x20 [16804.566020] do_syscall_64+0x5c/0x250 [16804.566580] entry_SYSCALL_64_after_hwframe+0x49/0xbe [16804.567153] RIP: 0033:0x7f4b1ba2add7 (...) [16804.568907] RSP: 002b:00007f4b197adc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [16804.569513] RAX: ffffffffffffffda RBX: 00007f4b100210d8 RCX: 00007f4b1ba2add7 [16804.570133] RDX: 00007f4b100210d8 RSI: 00000000c020660b RDI: 0000000000000003 [16804.570726] RBP: 000055de05a6cfe0 R08: 0000000000000000 R09: 00007f4b197add44 [16804.571314] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4b197add48 [16804.571905] R13: 00007f4b197add40 R14: 00007f4b100210d0 R15: 00007f4b197add50 (...) [16804.575623] ---[ end trace 87317359aad4ba50 ]--- Fix this by making btrfs_put_tree_mod_seq() skip deletion of elements that have a sequence number equals to the computed minimum sequence number, and not just elements with a sequence number greater then that minimum. Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31btrfs: handle ENOENT in btrfs_uuid_tree_iterateJosef Bacik
commit 714cd3e8cba6841220dce9063a7388a81de03825 upstream. If we get an -ENOENT back from btrfs_uuid_iter_rem when iterating the uuid tree we'll just continue and do btrfs_next_item(). However we've done a btrfs_release_path() at this point and no longer have a valid path. So increment the key and go back and do a normal search. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31btrfs: do not leak reloc root if we fail to read the fs rootJosef Bacik
commit ca1aa2818a53875cfdd175fb5e9a2984e997cce9 upstream. If we fail to read the fs root corresponding with a reloc root we'll just break out and free the reloc roots. But we remove our current reloc_root from this list higher up, which means we'll leak this reloc_root. Fix this by adding ourselves back to the reloc_roots list so we are properly cleaned up. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31btrfs: skip log replay on orphaned rootsJosef Bacik
commit 9bc574de590510eff899c3ca8dbaf013566b5efe upstream. My fsstress modifications coupled with generic/475 uncovered a failure to mount and replay the log if we hit a orphaned root. We do not want to replay the log for an orphan root, but it's completely legitimate to have an orphaned root with a log attached. Fix this by simply skipping replaying the log. We still need to pin it's root node so that we do not overwrite it while replaying other logs, as we re-read the log root at every stage of the replay. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31btrfs: do not call synchronize_srcu() in inode_tree_delJosef Bacik
commit f72ff01df9cf5db25c76674cac16605992d15467 upstream. Testing with the new fsstress uncovered a pretty nasty deadlock with lookup and snapshot deletion. Process A unlink -> final iput -> inode_tree_del -> synchronize_srcu(subvol_srcu) Process B btrfs_lookup <- srcu_read_lock() acquired here -> btrfs_iget -> find inode that has I_FREEING set -> __wait_on_freeing_inode() We're holding the srcu_read_lock() while doing the iget in order to make sure our fs root doesn't go away, and then we are waiting for the inode to finish freeing. However because the free'ing process is doing a synchronize_srcu() we deadlock. Fix this by dropping the synchronize_srcu() in inode_tree_del(). We don't need people to stop accessing the fs root at this point, we're only adding our empty root to the dead roots list. A larger much more invasive fix is forthcoming to address how we deal with fs roots, but this fixes the immediate problem. Fixes: 76dda93c6ae2 ("Btrfs: add snapshot/subvolume destroy ioctl") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31btrfs: don't double lock the subvol_sem for rename exchangeJosef Bacik
commit 943eb3bf25f4a7b745dd799e031be276aa104d82 upstream. If we're rename exchanging two subvols we'll try to lock this lock twice, which is bad. Just lock once if either of the ino's are subvols. Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-21CIFS: Respect O_SYNC and O_DIRECT flags during reconnectPavel Shilovsky
commit 44805b0e62f15e90d233485420e1847133716bdc upstream. Currently the client translates O_SYNC and O_DIRECT flags into corresponding SMB create options when openning a file. The problem is that on reconnect when the file is being re-opened the client doesn't set those flags and it causes a server to reject re-open requests because create options don't match. The latter means that any subsequent system call against that open file fail until a share is re-mounted. Fix this by properly setting SMB create options when re-openning files after reconnects. Fixes: 1013e760d10e6: ("SMB3: Don't ignore O_SYNC/O_DSYNC and O_DIRECT flags") Cc: Stable <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17gfs2: fix glock reference problem in gfs2_trans_remove_revokeBob Peterson
[ Upstream commit fe5e7ba11fcf1d75af8173836309e8562aefedef ] Commit 9287c6452d2b fixed a situation in which gfs2 could use a glock after it had been freed. To do that, it temporarily added a new glock reference by calling gfs2_glock_hold in function gfs2_add_revoke. However, if the bd element was removed by gfs2_trans_remove_revoke, it failed to drop the additional reference. This patch adds logic to gfs2_trans_remove_revoke to properly drop the additional glock reference. Fixes: 9287c6452d2b ("gfs2: Fix occasional glock use-after-free") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-17ext4: fix a bug in ext4_wait_for_tail_page_commityangerkun
commit 565333a1554d704789e74205989305c811fd9c7a upstream. No need to wait for any commit once the page is fully truncated. Besides, it may confuse e.g. concurrent ext4_writepage() with the page still be dirty (will be cleared by truncate_pagecache() in ext4_setattr()) but buffers has been freed; and then trigger a bug show as below: [ 26.057508] ------------[ cut here ]------------ [ 26.058531] kernel BUG at fs/ext4/inode.c:2134! ... [ 26.088130] Call trace: [ 26.088695] ext4_writepage+0x914/0xb28 [ 26.089541] writeout.isra.4+0x1b4/0x2b8 [ 26.090409] move_to_new_page+0x3b0/0x568 [ 26.091338] __unmap_and_move+0x648/0x988 [ 26.092241] unmap_and_move+0x48c/0xbb8 [ 26.093096] migrate_pages+0x220/0xb28 [ 26.093945] kernel_mbind+0x828/0xa18 [ 26.094791] __arm64_sys_mbind+0xc8/0x138 [ 26.095716] el0_svc_common+0x190/0x490 [ 26.096571] el0_svc_handler+0x60/0xd0 [ 26.097423] el0_svc+0x8/0xc Run the procedure (generate by syzkaller) parallel with ext3. void main() { int fd, fd1, ret; void *addr; size_t length = 4096; int flags; off_t offset = 0; char *str = "12345"; fd = open("a", O_RDWR | O_CREAT); assert(fd >= 0); /* Truncate to 4k */ ret = ftruncate(fd, length); assert(ret == 0); /* Journal data mode */ flags = 0xc00f; ret = ioctl(fd, _IOW('f', 2, long), &flags); assert(ret == 0); /* Truncate to 0 */ fd1 = open("a", O_TRUNC | O_NOATIME); assert(fd1 >= 0); addr = mmap(NULL, length, PROT_WRITE | PROT_READ, MAP_SHARED, fd, offset); assert(addr != (void *)-1); memcpy(addr, str, 5); mbind(addr, length, 0, 0, 0, MPOL_MF_MOVE); } And the bug will be triggered once we seen the below order. reproduce1 reproduce2 ... | ... truncate to 4k | change to journal data mode | | memcpy(set page dirty) truncate to 0: | ext4_setattr: | ... | ext4_wait_for_tail_page_commit | | mbind(trigger bug) truncate_pagecache(clean dirty)| ... ... | mbind will call ext4_writepage() since the page still be dirty, and then report the bug since the buffers has been free. Fix it by return directly once offset equals to 0 which means the page has been fully truncated. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: yangerkun <yangerkun@huawei.com> Link: https://lore.kernel.org/r/20190919063508.1045-1-yangerkun@huawei.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17ext4: work around deleting a file with i_nlink == 0 safelyTheodore Ts'o
commit c7df4a1ecb8579838ec8c56b2bb6a6716e974f37 upstream. If the file system is corrupted such that a file's i_links_count is too small, then it's possible that when unlinking that file, i_nlink will already be zero. Previously we were working around this kind of corruption by forcing i_nlink to one; but we were doing this before trying to delete the directory entry --- and if the file system is corrupted enough that ext4_delete_entry() fails, then we exit with i_nlink elevated, and this causes the orphan inode list handling to be FUBAR'ed, such that when we unmount the file system, the orphan inode list can get corrupted. A better way to fix this is to simply skip trying to call drop_nlink() if i_nlink is already zero, thus moving the check to the place where it makes the most sense. https://bugzilla.kernel.org/show_bug.cgi?id=205433 Link: https://lore.kernel.org/r/20191112032903.8828-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17reiserfs: fix extended attributes on the root directoryJeff Mahoney
commit 60e4cf67a582d64f07713eda5fcc8ccdaf7833e6 upstream. Since commit d0a5b995a308 (vfs: Add IOP_XATTR inode operations flag) extended attributes haven't worked on the root directory in reiserfs. This is due to reiserfs conditionally setting the sb->s_xattrs handler array depending on whether it located or create the internal privroot directory. It necessarily does this after the root inode is already read in. The IOP_XATTR flag is set during inode initialization, so it never gets set on the root directory. This commit unconditionally assigns sb->s_xattrs and clears IOP_XATTR on internal inodes. The old return values due to the conditional assignment are handled via open_xa_root, which now returns EOPNOTSUPP as the VFS would have done. Link: https://lore.kernel.org/r/20191024143127.17509-1-jeffm@suse.com CC: stable@vger.kernel.org Fixes: d0a5b995a308 ("vfs: Add IOP_XATTR inode operations flag") Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17ext4: Fix credit estimate for final inode freeingJan Kara
commit 65db869c754e7c271691dd5feabf884347e694f5 upstream. Estimate for the number of credits needed for final freeing of inode in ext4_evict_inode() was to small. We may modify 4 blocks (inode & sb for orphan deletion, bitmap & group descriptor for inode freeing) and not just 3. [ Fixed minor whitespace nit. -- TYT ] Fixes: e50e5129f384 ("ext4: xattr-in-inode support") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-6-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17quota: fix livelock in dquot_writeback_dquotsDmitry Monakhov
commit 6ff33d99fc5c96797103b48b7b0902c296f09c05 upstream. Write only quotas which are dirty at entry. XFSTEST: https://github.com/dmonakhov/xfstests/commit/b10ad23566a5bf75832a6f500e1236084083cddc Link: https://lore.kernel.org/r/20191031103920.3919-1-dmonakhov@openvz.org CC: stable@vger.kernel.org Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17ext2: check err when partial != NULLChengguang Xu
commit e705f4b8aa27a59f8933e8f384e9752f052c469c upstream. Check err when partial == NULL is meaningless because partial == NULL means getting branch successfully without error. CC: stable@vger.kernel.org Link: https://lore.kernel.org/r/20191105045100.7104-1-cgxu519@mykernel.net Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17quota: Check that quota is not dirty before releaseDmitry Monakhov
commit df4bb5d128e2c44848aeb36b7ceceba3ac85080d upstream. There is a race window where quota was redirted once we drop dq_list_lock inside dqput(), but before we grab dquot->dq_lock inside dquot_release() TASK1 TASK2 (chowner) ->dqput() we_slept: spin_lock(&dq_list_lock) if (dquot_dirty(dquot)) { spin_unlock(&dq_list_lock); dquot->dq_sb->dq_op->write_dquot(dquot); goto we_slept if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) { spin_unlock(&dq_list_lock); dquot->dq_sb->dq_op->release_dquot(dquot); dqget() mark_dquot_dirty() dqput() goto we_slept; } So dquot dirty quota will be released by TASK1, but on next we_sleept loop we detect this and call ->write_dquot() for it. XFSTEST: https://github.com/dmonakhov/xfstests/commit/440a80d4cbb39e9234df4d7240aee1d551c36107 Link: https://lore.kernel.org/r/20191031103920.3919-2-dmonakhov@openvz.org CC: stable@vger.kernel.org Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17ovl: relax WARN_ON() on rename to selfAmir Goldstein
commit 6889ee5a53b8d969aa542047f5ac8acdc0e79a91 upstream. In ovl_rename(), if new upper is hardlinked to old upper underneath overlayfs before upper dirs are locked, user will get an ESTALE error and a WARN_ON will be printed. Changes to underlying layers while overlayfs is mounted may result in unexpected behavior, but it shouldn't crash the kernel and it shouldn't trigger WARN_ON() either, so relax this WARN_ON(). Reported-by: syzbot+bb1836a212e69f8e201a@syzkaller.appspotmail.com Fixes: 804032fabb3b ("ovl: don't check rename to self") Cc: <stable@vger.kernel.org> # v4.9+ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17btrfs: record all roots for rename exchange on a subvolJosef Bacik
commit 3e1740993e43116b3bc71b0aad1e6872f6ccf341 upstream. Testing with the new fsstress support for subvolumes uncovered a pretty bad problem with rename exchange on subvolumes. We're modifying two different subvolumes, but we only start the transaction on one of them, so the other one is not added to the dirty root list. This is caught by btrfs_cow_block() with a warning because the root has not been updated, however if we do not modify this root again we'll end up pointing at an invalid root because the root item is never updated. Fix this by making sure we add the destination root to the trans list, the same as we do with normal renames. This fixes the corruption. Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17Btrfs: send, skip backreference walking for extents with many referencesFilipe Manana
commit fd0ddbe2509568b00df364156f47561e9f469f15 upstream. Backreference walking, which is used by send to figure if it can issue clone operations instead of write operations, can be very slow and use too much memory when extents have many references. This change simply skips backreference walking when an extent has more than 64 references, in which case we fallback to a write operation instead of a clone operation. This limit is conservative and in practice I observed no signicant slowdown with up to 100 references and still low memory usage up to that limit. This is a temporary workaround until there are speedups in the backref walking code, and as such it does not attempt to add extra interfaces or knobs to tweak the threshold. Reported-by: Atemu <atemu.main@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17btrfs: Remove btrfs_bio::flags memberQu Wenruo
commit 34b127aecd4fe8e6a3903e10f204a7b7ffddca22 upstream. The last user of btrfs_bio::flags was removed in commit 326e1dbb5736 ("block: remove management of bi_remaining when restoring original bi_end_io"), remove it. (Tagged for stable as the structure is heavily used and space savings are desirable.) CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17Btrfs: fix negative subv_writers counter and data space leak after buffered ↵Filipe Manana
write commit a0e248bb502d5165b3314ac3819e888fdcdf7d9f upstream. When doing a buffered write it's possible to leave the subv_writers counter of the root, used for synchronization between buffered nocow writers and snapshotting. This happens in an exceptional case like the following: 1) We fail to allocate data space for the write, since there's not enough available data space nor enough unallocated space for allocating a new data block group; 2) Because of that failure, we try to go to NOCOW mode, which succeeds and therefore we set the local variable 'only_release_metadata' to true and set the root's sub_writers counter to 1 through the call to btrfs_start_write_no_snapshotting() made by check_can_nocow(); 3) The call to btrfs_copy_from_user() returns zero, which is very unlikely to happen but not impossible; 4) No pages are copied because btrfs_copy_from_user() returned zero; 5) We call btrfs_end_write_no_snapshotting() which decrements the root's subv_writers counter to 0; 6) We don't set 'only_release_metadata' back to 'false' because we do it only if 'copied', the value returned by btrfs_copy_from_user(), is greater than zero; 7) On the next iteration of the while loop, which processes the same page range, we are now able to allocate data space for the write (we got enough data space released in the meanwhile); 8) After this if we fail at btrfs_delalloc_reserve_metadata(), because now there isn't enough free metadata space, or in some other place further below (prepare_pages(), lock_and_cleanup_extent_if_need(), btrfs_dirty_pages()), we break out of the while loop with 'only_release_metadata' having a value of 'true'; 9) Because 'only_release_metadata' is 'true' we end up decrementing the root's subv_writers counter to -1 (through a call to btrfs_end_write_no_snapshotting()), and we also end up not releasing the data space previously reserved through btrfs_check_data_free_space(). As a consequence the mechanism for synchronizing NOCOW buffered writes with snapshotting gets broken. Fix this by always setting 'only_release_metadata' to false at the start of each iteration. Fixes: 8257b2dc3c1a ("Btrfs: introduce btrfs_{start, end}_nocow_write() for each subvolume") Fixes: 7ee9e4405f26 ("Btrfs: check if we can nocow if we don't have data space") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17btrfs: use refcount_inc_not_zero in kill_all_nodesJosef Bacik
commit baf320b9d531f1cfbf64c60dd155ff80a58b3796 upstream. We hit the following warning while running down a different problem [ 6197.175850] ------------[ cut here ]------------ [ 6197.185082] refcount_t: underflow; use-after-free. [ 6197.194704] WARNING: CPU: 47 PID: 966 at lib/refcount.c:190 refcount_sub_and_test_checked+0x53/0x60 [ 6197.521792] Call Trace: [ 6197.526687] __btrfs_release_delayed_node+0x76/0x1c0 [ 6197.536615] btrfs_kill_all_delayed_nodes+0xec/0x130 [ 6197.546532] ? __btrfs_btree_balance_dirty+0x60/0x60 [ 6197.556482] btrfs_clean_one_deleted_snapshot+0x71/0xd0 [ 6197.566910] cleaner_kthread+0xfa/0x120 [ 6197.574573] kthread+0x111/0x130 [ 6197.581022] ? kthread_create_on_node+0x60/0x60 [ 6197.590086] ret_from_fork+0x1f/0x30 [ 6197.597228] ---[ end trace 424bb7ae00509f56 ]--- This is because the free side drops the ref without the lock, and then takes the lock if our refcount is 0. So you can have nodes on the tree that have a refcount of 0. Fix this by zero'ing out that element in our temporary array so we don't try to kill it again. CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17btrfs: check page->mapping when loading free space cacheJosef Bacik
commit 3797136b626ad4b6582223660c041efdea8f26b2 upstream. While testing 5.2 we ran into the following panic [52238.017028] BUG: kernel NULL pointer dereference, address: 0000000000000001 [52238.105608] RIP: 0010:drop_buffers+0x3d/0x150 [52238.304051] Call Trace: [52238.308958] try_to_free_buffers+0x15b/0x1b0 [52238.317503] shrink_page_list+0x1164/0x1780 [52238.325877] shrink_inactive_list+0x18f/0x3b0 [52238.334596] shrink_node_memcg+0x23e/0x7d0 [52238.342790] ? do_shrink_slab+0x4f/0x290 [52238.350648] shrink_node+0xce/0x4a0 [52238.357628] balance_pgdat+0x2c7/0x510 [52238.365135] kswapd+0x216/0x3e0 [52238.371425] ? wait_woken+0x80/0x80 [52238.378412] ? balance_pgdat+0x510/0x510 [52238.386265] kthread+0x111/0x130 [52238.392727] ? kthread_create_on_node+0x60/0x60 [52238.401782] ret_from_fork+0x1f/0x30 The page we were trying to drop had a page->private, but had no page->mapping and so called drop_buffers, assuming that we had a buffer_head on the page, and then panic'ed trying to deref 1, which is our page->private for data pages. This is happening because we're truncating the free space cache while we're trying to load the free space cache. This isn't supposed to happen, and I'll fix that in a followup patch. However we still shouldn't allow those sort of mistakes to result in messing with pages that do not belong to us. So add the page->mapping check to verify that we still own this page after dropping and re-acquiring the page lock. This page being unlocked as: btrfs_readpage extent_read_full_page __extent_read_full_page __do_readpage if (!nr) unlock_page <-- nr can be 0 only if submit_extent_page returns an error CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add callchain ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17iomap: Fix pipe page leakage during splicingJan Kara
commit 419e9c38aa075ed0cd3c13d47e15954b686bcdb6 upstream. When splicing using iomap_dio_rw() to a pipe, we may leak pipe pages because bio_iov_iter_get_pages() records that the pipe will have full extent worth of data however if file size is not block size aligned iomap_dio_rw() returns less than what bio_iov_iter_get_pages() set up and splice code gets confused leaking a pipe page with the file tail. Handle the situation similarly to the old direct IO implementation and revert iter to actually returned read amount which makes iter consistent with value returned from iomap_dio_rw() and thus the splice code is happy. Fixes: ff6a9292e6f6 ("iomap: implement direct I/O") CC: stable@vger.kernel.org Reported-by: syzbot+991400e8eba7e00a26e1@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17kernfs: fix ino wrap-around detectionTejun Heo
commit e23f568aa63f64cd6b355094224cc9356c0f696b upstream. When the 32bit ino wraps around, kernfs increments the generation number to distinguish reused ino instances. The wrap-around detection tests whether the allocated ino is lower than what the cursor but the cursor is pointing to the next ino to allocate so the condition never triggers. Fix it by remembering the last ino and comparing against that. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Fixes: 4a3ef68acacf ("kernfs: implement i_generation") Cc: Namhyung Kim <namhyung@kernel.org> Cc: stable@vger.kernel.org # v4.14+ Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17CIFS: Fix SMB2 oplock break processingPavel Shilovsky
commit fa9c2362497fbd64788063288dc4e74daf977ebb upstream. Even when mounting modern protocol version the server may be configured without supporting SMB2.1 leases and the client uses SMB2 oplock to optimize IO performance through local caching. However there is a problem in oplock break handling that leads to missing a break notification on the client who has a file opened. It latter causes big latencies to other clients that are trying to open the same file. The problem reproduces when there are multiple shares from the same server mounted on the client. The processing code tries to match persistent and volatile file ids from the break notification with an open file but it skips all share besides the first one. Fix this by looking up in all shares belonging to the server that issued the oplock break. Cc: Stable <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>