summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2014-06-08Merge branch 'next' (accumulated 3.16 merge window patches) into masterLinus Torvalds
Now that 3.15 is released, this merges the 'next' branch into 'master', bringing us to the normal situation where my 'master' branch is the merge window. * accumulated work in next: (6809 commits) ufs: sb mutex merge + mutex_destroy powerpc: update comments for generic idle conversion cris: update comments for generic idle conversion idle: remove cpu_idle() forward declarations nbd: zero from and len fields in NBD_CMD_DISCONNECT. mm: convert some level-less printks to pr_* MAINTAINERS: adi-buildroot-devel is moderated MAINTAINERS: add linux-api for review of API/ABI changes mm/kmemleak-test.c: use pr_fmt for logging fs/dlm/debug_fs.c: replace seq_printf by seq_puts fs/dlm/lockspace.c: convert simple_str to kstr fs/dlm/config.c: convert simple_str to kstr mm: mark remap_file_pages() syscall as deprecated mm: memcontrol: remove unnecessary memcg argument from soft limit functions mm: memcontrol: clean up memcg zoneinfo lookup mm/memblock.c: call kmemleak directly from memblock_(alloc|free) mm/mempool.c: update the kmemleak stack trace for mempool allocations lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations mm: introduce kmemleak_update_trace() mm/kmemleak.c: use %u to print ->checksum ...
2014-06-06Btrfs: send, fix corrupted path strings for long pathsFilipe Manana
If a path has more than 230 characters, we allocate a new buffer to use for the path, but we were forgotting to copy the contents of the previous buffer into the new one, which has random content from the kmalloc call. Test: mkfs.btrfs -f /dev/sdd mount /dev/sdd /mnt TEST_PATH="/mnt/fdmanana/.config/google-chrome-mysetup/Default/Pepper_Data/Shockwave_Flash/WritableRoot/#SharedObjects/JSHJ4ZKN/s.wsj.net/[[IMPORT]]/players.edgesuite.net/flash/plugins/osmf/advanced-streaming-plugin/v2.7/osmf1.6/Ak#" mkdir -p $TEST_PATH echo "hello world" > $TEST_PATH/amaiAdvancedStreamingPlugin.txt btrfs subvolume snapshot -r /mnt /mnt/mysnap1 btrfs send /mnt/mysnap1 -f /tmp/1.snap A test for xfstests follows. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Cc: Marc Merlin <marc@merlins.org> Tested-by: Marc MERLIN <marc@merlins.org> Signed-off-by: Chris Mason <clm@fb.com>
2014-06-04mm: non-atomically mark page accessed during page cache allocation where ↵Mel Gorman
possible aops->write_begin may allocate a new page and make it visible only to have mark_page_accessed called almost immediately after. Once the page is visible the atomic operations are necessary which is noticable overhead when writing to an in-memory filesystem like tmpfs but should also be noticable with fast storage. The objective of the patch is to initialse the accessed information with non-atomic operations before the page is visible. The bulk of filesystems directly or indirectly use grab_cache_page_write_begin or find_or_create_page for the initial allocation of a page cache page. This patch adds an init_page_accessed() helper which behaves like the first call to mark_page_accessed() but may called before the page is visible and can be done non-atomically. The primary APIs of concern in this care are the following and are used by most filesystems. find_get_page find_lock_page find_or_create_page grab_cache_page_nowait grab_cache_page_write_begin All of them are very similar in detail to the patch creates a core helper pagecache_get_page() which takes a flags parameter that affects its behavior such as whether the page should be marked accessed or not. Then old API is preserved but is basically a thin wrapper around this core function. Each of the filesystems are then updated to avoid calling mark_page_accessed when it is known that the VM interfaces have already done the job. There is a slight snag in that the timing of the mark_page_accessed() has now changed so in rare cases it's possible a page gets to the end of the LRU as PageReferenced where as previously it might have been repromoted. This is expected to be rare but it's worth the filesystem people thinking about it in case they see a problem with the timing change. It is also the case that some filesystems may be marking pages accessed that previously did not but it makes sense that filesystems have consistent behaviour in this regard. The test case used to evaulate this is a simple dd of a large file done multiple times with the file deleted on each iterations. The size of the file is 1/10th physical memory to avoid dirty page balancing. In the async case it will be possible that the workload completes without even hitting the disk and will have variable results but highlight the impact of mark_page_accessed for async IO. The sync results are expected to be more stable. The exception is tmpfs where the normal case is for the "IO" to not hit the disk. The test machine was single socket and UMA to avoid any scheduling or NUMA artifacts. Throughput and wall times are presented for sync IO, only wall times are shown for async as the granularity reported by dd and the variability is unsuitable for comparison. As async results were variable do to writback timings, I'm only reporting the maximum figures. The sync results were stable enough to make the mean and stddev uninteresting. The performance results are reported based on a run with no profiling. Profile data is based on a separate run with oprofile running. async dd 3.15.0-rc3 3.15.0-rc3 vanilla accessed-v2 ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%) tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%) btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%) ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%) xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%) The XFS figure is a bit strange as it managed to avoid a worst case by sheer luck but the average figures looked reasonable. samples percentage ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-03Merge branch 'locking-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next Pull core locking updates from Ingo Molnar: "The main changes in this cycle were: - reduced/streamlined smp_mb__*() interface that allows more usecases and makes the existing ones less buggy, especially in rarer architectures - add rwsem implementation comments - bump up lockdep limits" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) rwsem: Add comments to explain the meaning of the rwsem's count field lockdep: Increase static allocations arch: Mass conversion of smp_mb__*() arch,doc: Convert smp_mb__*() arch,xtensa: Convert smp_mb__*() arch,x86: Convert smp_mb__*() arch,tile: Convert smp_mb__*() arch,sparc: Convert smp_mb__*() arch,sh: Convert smp_mb__*() arch,score: Convert smp_mb__*() arch,s390: Convert smp_mb__*() arch,powerpc: Convert smp_mb__*() arch,parisc: Convert smp_mb__*() arch,openrisc: Convert smp_mb__*() arch,mn10300: Convert smp_mb__*() arch,mips: Convert smp_mb__*() arch,metag: Convert smp_mb__*() arch,m68k: Convert smp_mb__*() arch,m32r: Convert smp_mb__*() arch,ia64: Convert smp_mb__*() ...
2014-05-22Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull two btrfs fixes from Chris Mason: "This has two fixes that we've been testing for 3.16, but since both are safe and fix real bugs, it makes sense to send for 3.15 instead" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: send, fix incorrect ref access when using extrefs Btrfs: fix EIO on reading file after ioctl clone works on it
2014-05-20Btrfs: send, fix incorrect ref access when using extrefsFilipe Manana
When running send, if an inode only has extended reference items associated to it and no regular references, send.c:get_first_ref() was incorrectly assuming the reference it found was of type BTRFS_INODE_REF_KEY due to use of the wrong key variable. This caused weird behaviour when using the found item has a regular reference, such as weird path string, and occasionally (when lucky) a crash: [ 190.600652] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC [ 190.600994] Modules linked in: btrfs xor raid6_pq binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc psmouse serio_raw evbug pcspkr i2c_piix4 e1000 floppy [ 190.602565] CPU: 2 PID: 14520 Comm: btrfs Not tainted 3.13.0-fdm-btrfs-next-26+ #1 [ 190.602728] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 190.602868] task: ffff8800d447c920 ti: ffff8801fa79e000 task.ti: ffff8801fa79e000 [ 190.603030] RIP: 0010:[<ffffffff813266b4>] [<ffffffff813266b4>] memcpy+0x54/0x110 [ 190.603262] RSP: 0018:ffff8801fa79f880 EFLAGS: 00010202 [ 190.603395] RAX: ffff8800d4326e3f RBX: 000000000000036a RCX: ffff880000000000 [ 190.603553] RDX: 000000000000032a RSI: ffe708844042936a RDI: ffff8800d43271a9 [ 190.603710] RBP: ffff8801fa79f8c8 R08: 00000000003a4ef0 R09: 0000000000000000 [ 190.603867] R10: 793a4ef09f000000 R11: 9f0000000053726f R12: ffff8800d43271a9 [ 190.604020] R13: 0000160000000000 R14: ffff8802110134f0 R15: 000000000000036a [ 190.604020] FS: 00007fb423d09b80(0000) GS:ffff880216200000(0000) knlGS:0000000000000000 [ 190.604020] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 190.604020] CR2: 00007fb4229d4b78 CR3: 00000001f5d76000 CR4: 00000000000006e0 [ 190.604020] Stack: [ 190.604020] ffffffffa01f4d49 ffff8801fa79f8f0 00000000000009f9 ffff8801fa79f8c8 [ 190.604020] 00000000000009f9 ffff880211013260 000000000000f971 ffff88021147dba8 [ 190.604020] 00000000000009f9 ffff8801fa79f918 ffffffffa02367f5 ffff8801fa79f928 [ 190.604020] Call Trace: [ 190.604020] [<ffffffffa01f4d49>] ? read_extent_buffer+0xb9/0x120 [btrfs] [ 190.604020] [<ffffffffa02367f5>] fs_path_add_from_extent_buffer+0x45/0x60 [btrfs] [ 190.604020] [<ffffffffa0238806>] get_first_ref+0x1f6/0x210 [btrfs] [ 190.604020] [<ffffffffa0238994>] __get_cur_name_and_parent+0x174/0x3a0 [btrfs] [ 190.604020] [<ffffffff8118df3d>] ? kmem_cache_alloc_trace+0x11d/0x1e0 [ 190.604020] [<ffffffffa0236674>] ? fs_path_alloc+0x24/0x60 [btrfs] [ 190.604020] [<ffffffffa0238c91>] get_cur_path+0xd1/0x240 [btrfs] (...) Steps to reproduce (either crash or some weirdness like an odd path string): mkfs.btrfs -f -O extref /dev/sdd mount /dev/sdd /mnt mkdir /mnt/testdir touch /mnt/testdir/foobar for i in `seq 1 2550`; do ln /mnt/testdir/foobar /mnt/testdir/foobar_link_`printf "%04d" $i` done ln /mnt/testdir/foobar /mnt/testdir/final_foobar_name rm -f /mnt/testdir/foobar for i in `seq 1 2550`; do rm -f /mnt/testdir/foobar_link_`printf "%04d" $i` done btrfs subvolume snapshot -r /mnt /mnt/mysnap btrfs send /mnt/mysnap -f /tmp/mysnap.send Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2014-05-20Btrfs: fix EIO on reading file after ioctl clone works on itLiu Bo
For inline data extent, we need to make its length aligned, otherwise, we can get a phantom extent map which confuses readpages() to return -EIO. This can be detected by xfstests/btrfs/035. Reported-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-27Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: limit the path size in send to PATH_MAX Btrfs: correctly set profile flags on seqlock retry Btrfs: use correct key when repeating search for extent item Btrfs: fix inode caching vs tree log Btrfs: fix possible memory leaks in open_ctree() Btrfs: avoid triggering bug_on() when we fail to start inode caching task Btrfs: move btrfs_{set,clear}_and_info() to ctree.h btrfs: replace error code from btrfs_drop_extents btrfs: Change the hole range to a more accurate value. btrfs: fix use-after-free in mount_subvol()
2014-04-26Btrfs: limit the path size in send to PATH_MAXChris Mason
fs_path_ensure_buf is used to make sure our path buffers for send are big enough for the path names as we construct them. The buffer size is limited to 32K by the length field in the struct. But bugs in the path construction can end up trying to build a huge buffer, and we'll do invalid memmmoves when the buffer length field wraps. This patch is step one, preventing the overflows. Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24Btrfs: correctly set profile flags on seqlock retryFilipe Manana
If we had to retry on the profiles seqlock (due to a concurrent write), we would set bits on the input flags that corresponded both to the current profile and to previous values of the profile. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24Btrfs: use correct key when repeating search for extent itemFilipe Manana
If skinny metadata is enabled and our first tree search fails to find a skinny extent item, we may repeat a tree search for a "fat" extent item (if the previous item in the leaf is not the "fat" extent we're looking for). However we were not setting the new key's objectid to the right value, as we previously used the same key variable to peek at the previous item in the leaf, which has a different objectid. So just set the right objectid to avoid modifying/deleting a wrong item if we repeat the tree search. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24Btrfs: fix inode caching vs tree logMiao Xie
Currently, with inode cache enabled, we will reuse its inode id immediately after unlinking file, we may hit something like following: |->iput inode |->return inode id into inode cache |->create dir,fsync |->power off An easy way to reproduce this problem is: mkfs.btrfs -f /dev/sdb mount /dev/sdb /mnt -o inode_cache,commit=100 dd if=/dev/zero of=/mnt/data bs=1M count=10 oflag=sync inode_id=`ls -i /mnt/data | awk '{print $1}'` rm -f /mnt/data i=1 while [ 1 ] do mkdir /mnt/dir_$i test1=`stat /mnt/dir_$i | grep Inode: | awk '{print $4}'` if [ $test1 -eq $inode_id ] then dd if=/dev/zero of=/mnt/dir_$i/data bs=1M count=1 oflag=sync echo b > /proc/sysrq-trigger fi sleep 1 i=$(($i+1)) done mount /dev/sdb /mnt umount /dev/sdb btrfs check /dev/sdb We fix this problem by adding unlinked inode's id into pinned tree, and we can not reuse them until committing transaction. Cc: stable@vger.kernel.org Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24Btrfs: fix possible memory leaks in open_ctree()Wang Shilong
Fix possible memory leaks in the following error handling paths: read_tree_block() btrfs_recover_log_trees btrfs_commit_super() btrfs_find_orphan_roots() btrfs_cleanup_fs_roots() Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24Btrfs: avoid triggering bug_on() when we fail to start inode caching taskWang Shilong
When running stress test(including snapshots,balance,fstress), we trigger the following BUG_ON() which is because we fail to start inode caching task. [ 181.131945] kernel BUG at fs/btrfs/inode-map.c:179! [ 181.137963] invalid opcode: 0000 [#1] SMP [ 181.217096] CPU: 11 PID: 2532 Comm: btrfs Not tainted 3.14.0 #1 [ 181.240521] task: ffff88013b621b30 ti: ffff8800b6ada000 task.ti: ffff8800b6ada000 [ 181.367506] Call Trace: [ 181.371107] [<ffffffffa036c1be>] btrfs_return_ino+0x9e/0x110 [btrfs] [ 181.379191] [<ffffffffa038082b>] btrfs_evict_inode+0x46b/0x4c0 [btrfs] [ 181.387464] [<ffffffff810b5a70>] ? autoremove_wake_function+0x40/0x40 [ 181.395642] [<ffffffff811dc5fe>] evict+0x9e/0x190 [ 181.401882] [<ffffffff811dcde3>] iput+0xf3/0x180 [ 181.408025] [<ffffffffa03812de>] btrfs_orphan_cleanup+0x1ee/0x430 [btrfs] [ 181.416614] [<ffffffffa03a6abd>] btrfs_mksubvol.isra.29+0x3bd/0x450 [btrfs] [ 181.425399] [<ffffffffa03a6cd6>] btrfs_ioctl_snap_create_transid+0x186/0x190 [btrfs] [ 181.435059] [<ffffffffa03a6e3b>] btrfs_ioctl_snap_create_v2+0xeb/0x130 [btrfs] [ 181.444148] [<ffffffffa03a9656>] btrfs_ioctl+0xf76/0x2b90 [btrfs] [ 181.451971] [<ffffffff8117e565>] ? handle_mm_fault+0x475/0xe80 [ 181.459509] [<ffffffff8167ba0c>] ? __do_page_fault+0x1ec/0x520 [ 181.467046] [<ffffffff81185b35>] ? do_mmap_pgoff+0x2f5/0x3c0 [ 181.474393] [<ffffffff811d4da8>] do_vfs_ioctl+0x2d8/0x4b0 [ 181.481450] [<ffffffff811d5001>] SyS_ioctl+0x81/0xa0 [ 181.488021] [<ffffffff81680b69>] system_call_fastpath+0x16/0x1b We should avoid triggering BUG_ON() here, instead, we output warning messages and clear inode_cache option. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24Btrfs: move btrfs_{set,clear}_and_info() to ctree.hWang Shilong
Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24btrfs: replace error code from btrfs_drop_extentsDavid Sterba
There's a case which clone does not handle and used to BUG_ON instead, (testcase xfstests/btrfs/035), now returns EINVAL. This error code is confusing to the ioctl caller, as it normally signifies errorneous arguments. Change it to ENOPNOTSUPP which allows a fall back to copy instead of clone. This does not affect the common reflink operation. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-24btrfs: Change the hole range to a more accurate value.Qu Wenruo
Commit 3ac0d7b96a268a98bd474cab8bce3a9f125aaccf fixed the btrfs expanding write problem but the hole punched is sometimes too large for some iovec, which has unmapped data ranges. This patch will change to hole range to a more accurate value using the counts checked by the write check routines. Reported-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-18arch: Mass conversion of smp_mb__*()Peter Zijlstra
Mostly scripted conversion of the smp_mb__* barriers. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-arch@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-14btrfs: fix use-after-free in mount_subvol()Christoph Jaeger
Pointer 'newargs' is used after the memory that it points to has already been freed. Picked up by Coverity - CID 1201425. Fixes: 0723a0473f ("btrfs: allow mounting btrfs subvolumes with different ro/rw options") Signed-off-by: Christoph Jaeger <christophjaeger@linux.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "The first vfs pile, with deep apologies for being very late in this window. Assorted cleanups and fixes, plus a large preparatory part of iov_iter work. There's a lot more of that, but it'll probably go into the next merge window - it *does* shape up nicely, removes a lot of boilerplate, gets rid of locking inconsistencie between aio_write and splice_write and I hope to get Kent's direct-io rewrite merged into the same queue, but some of the stuff after this point is having (mostly trivial) conflicts with the things already merged into mainline and with some I want more testing. This one passes LTP and xfstests without regressions, in addition to usual beating. BTW, readahead02 in ltp syscalls testsuite has started giving failures since "mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages" - might be a false positive, might be a real regression..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) missing bits of "splice: fix racy pipe->buffers uses" cifs: fix the race in cifs_writev() ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure kill generic_file_buffered_write() ocfs2_file_aio_write(): switch to generic_perform_write() ceph_aio_write(): switch to generic_perform_write() xfs_file_buffered_aio_write(): switch to generic_perform_write() export generic_perform_write(), start getting rid of generic_file_buffer_write() generic_file_direct_write(): get rid of ppos argument btrfs_file_aio_write(): get rid of ppos kill the 5th argument of generic_file_buffered_write() kill the 4th argument of __generic_file_aio_write() lustre: don't open-code kernel_recvmsg() ocfs2: don't open-code kernel_recvmsg() drbd: don't open-code kernel_recvmsg() constify blk_rq_map_user_iov() and friends lustre: switch to kernel_sendmsg() ocfs2: don't open-code kernel_sendmsg() take iov_iter stuff to mm/iov_iter.c process_vm_access: tidy up a bit ...
2014-04-11Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull second set of btrfs updates from Chris Mason: "The most important changes here are from Josef, fixing a btrfs regression in 3.14 that can cause corruptions in the extent allocation tree when snapshots are in use. Josef also fixed some deadlocks in send/recv and other assorted races when balance is running" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (23 commits) Btrfs: fix compile warnings on on avr32 platform btrfs: allow mounting btrfs subvolumes with different ro/rw options btrfs: export global block reserve size as space_info btrfs: fix crash in remount(thread_pool=) case Btrfs: abort the transaction when we don't find our extent ref Btrfs: fix EINVAL checks in btrfs_clone Btrfs: fix unlock in __start_delalloc_inodes() Btrfs: scrub raid56 stripes in the right way Btrfs: don't compress for a small write Btrfs: more efficient io tree navigation on wait_extent_bit Btrfs: send, build path string only once in send_hole btrfs: filter invalid arg for btrfs resize Btrfs: send, fix data corruption due to incorrect hole detection Btrfs: kmalloc() doesn't return an ERR_PTR Btrfs: fix snapshot vs nocow writting btrfs: Change the expanding write sequence to fix snapshot related bug. btrfs: make device scan less noisy btrfs: fix lockdep warning with reclaim lock inversion Btrfs: hold the commit_root_sem when getting the commit root during send Btrfs: remove transaction from send ...
2014-04-11Btrfs: fix compile warnings on on avr32 platformWang Shilong
fs/btrfs/scrub.c: In function 'get_raid56_logic_offset': fs/btrfs/scrub.c:2269: warning: comparison of distinct pointer types lacks a cast fs/btrfs/scrub.c:2269: warning: right shift count >= width of type fs/btrfs/scrub.c:2269: warning: passing argument 1 of '__div64_32' from incompatible pointer type Since @rot is an int type, we should not use do_div(), fix it. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-10btrfs: allow mounting btrfs subvolumes with different ro/rw optionsHarald Hoyer
Given the following /etc/fstab entries: /dev/sda3 /mnt/foo btrfs subvol=foo,ro 0 0 /dev/sda3 /mnt/bar btrfs subvol=bar,rw 0 0 you can't issue: $ mount /mnt/foo $ mount /mnt/bar You would have to do: $ mount /mnt/foo $ mount -o remount,rw /mnt/foo $ mount --bind -o remount,ro /mnt/foo $ mount /mnt/bar or $ mount /mnt/bar $ mount --rw /mnt/foo $ mount --bind -o remount,ro /mnt/foo With this patch you can do $ mount /mnt/foo $ mount /mnt/bar $ cat /proc/self/mountinfo 49 33 0:41 /foo /mnt/foo ro,relatime shared:36 - btrfs /dev/sda3 rw,ssd,space_cache 87 33 0:41 /bar /mnt/bar rw,relatime shared:74 - btrfs /dev/sda3 rw,ssd,space_cache Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07mm: implement ->map_pages for page cacheKirill A. Shutemov
filemap_map_pages() is generic implementation of ->map_pages() for filesystems who uses page cache. It should be safe to use filemap_map_pages() for ->map_pages() if filesystem use filemap_fault() for ->fault(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <david@fromorbit.com> Cc: Ning Qu <quning@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07btrfs: export global block reserve size as space_infoDavid Sterba
Introduce a block group type bit for a global reserve and fill the space info for SPACE_INFO ioctl. This should replace the newly added ioctl (01e219e8069516cdb98594d417b8bb8d906ed30d) to get just the 'size' part of the global reserve, while the actual usage can be now visible in the 'btrfs fi df' output during ENOSPC stress. The unpatched userspace tools will show the blockgroup as 'unknown'. CC: Jeff Mahoney <jeffm@suse.com> CC: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07btrfs: fix crash in remount(thread_pool=) caseSergei Trofimovich
Reproducer: mount /dev/ubda /mnt mount -oremount,thread_pool=42 /mnt Gives a crash: ? btrfs_workqueue_set_max+0x0/0x70 btrfs_resize_thread_pool+0xe3/0xf0 ? sync_filesystem+0x0/0xc0 ? btrfs_resize_thread_pool+0x0/0xf0 btrfs_remount+0x1d2/0x570 ? kern_path+0x0/0x80 do_remount_sb+0xd9/0x1c0 do_mount+0x26a/0xbf0 ? kfree+0x0/0x1b0 SyS_mount+0xc4/0x110 It's a call btrfs_workqueue_set_max(fs_info->scrub_wr_completion_workers, new_pool_size); with fs_info->scrub_wr_completion_workers = NULL; as scrub wqs get created only on user's demand. Patch skips not-created-yet workqueues. Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org> CC: Qu Wenruo <quwenruo@cn.fujitsu.com> CC: Chris Mason <clm@fb.com> CC: Josef Bacik <jbacik@fb.com> CC: linux-btrfs@vger.kernel.org Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07Btrfs: abort the transaction when we don't find our extent refJosef Bacik
I'm not sure why we weren't aborting here in the first place, it is obviously a bad time from the fact that we print the leaf and yell loudly about it. Fix this up, otherwise we panic because our path could be pointing into oblivion. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07Btrfs: fix EINVAL checks in btrfs_cloneChris Mason
btrfs_drop_extents can now return -EINVAL, but only one caller in btrfs_clone was checking for it. This adds it to the caller for inline extents, which is where we really need it. Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07Btrfs: fix unlock in __start_delalloc_inodes()Wang Shilong
This patch fix a regression caused by the following patch: Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock break while loop will make us call @spin_unlock() without calling @spin_lock() before, fix it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07Btrfs: scrub raid56 stripes in the right wayWang Shilong
Steps to reproduce: # mkfs.btrfs -f /dev/sda[8-11] -m raid5 -d raid5 # mount /dev/sda8 /mnt # btrfs scrub start -BR /mnt # echo $? <--unverified errors make return value be 3 This is because we don't setup right mapping between physical and logical address for raid56, which makes checksum mismatch. But we will find everthing is fine later when rechecking using btrfs_map_block(). This patch fixed the problem by settuping right mappings and we only verify data stripes' checksums. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07Btrfs: don't compress for a small writeWang Shilong
To compress a small file range(<=blocksize) that is not an inline extent can not save disk space at all. skip it can save us some cpu time. This patch can also fix wrong setting nocompression flag for inode, say a case when @total_in is 4096, and then we get @total_compressed 52,because we do aligment to page cache size firstly, and then we get into conclusion @total_in=@total_compressed thus we will clear this inode's compression flag. An exception comes from inserting inline extent failure but we still have @total_compressed < @total_in,so we will still reset inode's flag, this is ok, because we don't have good compression effect. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07Btrfs: more efficient io tree navigation on wait_extent_bitFilipe Manana
If we don't reschedule use rb_next to find the next extent state instead of a full tree search, which is more efficient and safe since we didn't release the io tree's lock. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07Btrfs: send, build path string only once in send_holeFilipe Manana
There's no point building the path string in each iteration of the send_hole loop, as it produces always the same string. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07btrfs: filter invalid arg for btrfs resizeGui Hecheng
Originally following cmds will work: # btrfs fi resize -10A <mnt> # btrfs fi resize -10Gaha <mnt> Filter the arg by checking the return pointer of memparse. Signed-off-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07Btrfs: send, fix data corruption due to incorrect hole detectionFilipe Manana
During an incremental send, when we finish processing an inode (corresponding to a regular file) we would assume the gap between the end of the last processed file extent and the file's size corresponded to a file hole, and therefore incorrectly send a bunch of zero bytes to overwrite that region in the file. This affects only kernel 3.14. Reproducer: mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt xfs_io -f -c "falloc -k 0 268435456" /mnt/foo btrfs subvolume snapshot -r /mnt /mnt/mysnap0 xfs_io -c "pwrite -S 0x01 -b 9216 16190218 9216" /mnt/foo xfs_io -c "pwrite -S 0x02 -b 1121 198720104 1121" /mnt/foo xfs_io -c "pwrite -S 0x05 -b 9216 107887439 9216" /mnt/foo xfs_io -c "pwrite -S 0x06 -b 9216 225520207 9216" /mnt/foo xfs_io -c "pwrite -S 0x07 -b 67584 102138300 67584" /mnt/foo xfs_io -c "pwrite -S 0x08 -b 7000 94897484 7000" /mnt/foo xfs_io -c "pwrite -S 0x09 -b 113664 245083212 113664" /mnt/foo xfs_io -c "pwrite -S 0x10 -b 123 17937788 123" /mnt/foo xfs_io -c "pwrite -S 0x11 -b 39936 229573311 39936" /mnt/foo xfs_io -c "pwrite -S 0x12 -b 67584 174792222 67584" /mnt/foo xfs_io -c "pwrite -S 0x13 -b 9216 249253213 9216" /mnt/foo xfs_io -c "pwrite -S 0x16 -b 67584 150046083 67584" /mnt/foo xfs_io -c "pwrite -S 0x17 -b 39936 118246040 39936" /mnt/foo xfs_io -c "pwrite -S 0x18 -b 67584 215965442 67584" /mnt/foo xfs_io -c "pwrite -S 0x19 -b 33792 97096725 33792" /mnt/foo xfs_io -c "pwrite -S 0x20 -b 125952 166300596 125952" /mnt/foo xfs_io -c "pwrite -S 0x21 -b 123 1078957 123" /mnt/foo xfs_io -c "pwrite -S 0x25 -b 9216 212044492 9216" /mnt/foo xfs_io -c "pwrite -S 0x26 -b 7000 265037146 7000" /mnt/foo xfs_io -c "pwrite -S 0x27 -b 42757 215922685 42757" /mnt/foo xfs_io -c "pwrite -S 0x28 -b 7000 69865411 7000" /mnt/foo xfs_io -c "pwrite -S 0x29 -b 67584 67948958 67584" /mnt/foo xfs_io -c "pwrite -S 0x30 -b 39936 266967019 39936" /mnt/foo xfs_io -c "pwrite -S 0x31 -b 1121 19582453 1121" /mnt/foo xfs_io -c "pwrite -S 0x32 -b 17408 257710255 17408" /mnt/foo xfs_io -c "pwrite -S 0x33 -b 39936 3895518 39936" /mnt/foo xfs_io -c "pwrite -S 0x34 -b 125952 12045847 125952" /mnt/foo xfs_io -c "pwrite -S 0x35 -b 17408 19156379 17408" /mnt/foo xfs_io -c "pwrite -S 0x36 -b 39936 50160066 39936" /mnt/foo xfs_io -c "pwrite -S 0x37 -b 113664 9549793 113664" /mnt/foo xfs_io -c "pwrite -S 0x38 -b 105472 94391506 105472" /mnt/foo xfs_io -c "pwrite -S 0x39 -b 23552 143632863 23552" /mnt/foo xfs_io -c "pwrite -S 0x40 -b 39936 241283845 39936" /mnt/foo xfs_io -c "pwrite -S 0x41 -b 113664 199937606 113664" /mnt/foo xfs_io -c "pwrite -S 0x42 -b 67584 67380093 67584" /mnt/foo xfs_io -c "pwrite -S 0x43 -b 67584 26793129 67584" /mnt/foo xfs_io -c "pwrite -S 0x44 -b 39936 14421913 39936" /mnt/foo xfs_io -c "pwrite -S 0x45 -b 123 253097405 123" /mnt/foo xfs_io -c "pwrite -S 0x46 -b 1121 128233424 1121" /mnt/foo xfs_io -c "pwrite -S 0x47 -b 105472 91577959 105472" /mnt/foo xfs_io -c "pwrite -S 0x48 -b 1121 7245381 1121" /mnt/foo xfs_io -c "pwrite -S 0x49 -b 113664 182414694 113664" /mnt/foo xfs_io -c "pwrite -S 0x50 -b 9216 32750608 9216" /mnt/foo xfs_io -c "pwrite -S 0x51 -b 67584 266546049 67584" /mnt/foo xfs_io -c "pwrite -S 0x52 -b 67584 87969398 67584" /mnt/foo xfs_io -c "pwrite -S 0x53 -b 9216 260848797 9216" /mnt/foo xfs_io -c "pwrite -S 0x54 -b 39936 119461243 39936" /mnt/foo xfs_io -c "pwrite -S 0x55 -b 7000 200178693 7000" /mnt/foo xfs_io -c "pwrite -S 0x56 -b 9216 243316029 9216" /mnt/foo xfs_io -c "pwrite -S 0x57 -b 7000 209658229 7000" /mnt/foo xfs_io -c "pwrite -S 0x58 -b 101376 179745192 101376" /mnt/foo xfs_io -c "pwrite -S 0x59 -b 9216 64012300 9216" /mnt/foo xfs_io -c "pwrite -S 0x60 -b 125952 181705139 125952" /mnt/foo xfs_io -c "pwrite -S 0x61 -b 23552 235737348 23552" /mnt/foo xfs_io -c "pwrite -S 0x62 -b 113664 106021355 113664" /mnt/foo xfs_io -c "pwrite -S 0x63 -b 67584 135753552 67584" /mnt/foo xfs_io -c "pwrite -S 0x64 -b 23552 95730888 23552" /mnt/foo xfs_io -c "pwrite -S 0x65 -b 11 17311415 11" /mnt/foo xfs_io -c "pwrite -S 0x66 -b 33792 120695553 33792" /mnt/foo xfs_io -c "pwrite -S 0x67 -b 9216 17164631 9216" /mnt/foo xfs_io -c "pwrite -S 0x68 -b 9216 136065853 9216" /mnt/foo xfs_io -c "pwrite -S 0x69 -b 67584 37752198 67584" /mnt/foo xfs_io -c "pwrite -S 0x70 -b 101376 189717473 101376" /mnt/foo xfs_io -c "pwrite -S 0x71 -b 7000 227463698 7000" /mnt/foo xfs_io -c "pwrite -S 0x72 -b 9216 12655137 9216" /mnt/foo xfs_io -c "pwrite -S 0x73 -b 7000 7488866 7000" /mnt/foo xfs_io -c "pwrite -S 0x74 -b 113664 87813649 113664" /mnt/foo xfs_io -c "pwrite -S 0x75 -b 33792 25802183 33792" /mnt/foo xfs_io -c "pwrite -S 0x76 -b 39936 93524024 39936" /mnt/foo xfs_io -c "pwrite -S 0x77 -b 33792 113336388 33792" /mnt/foo xfs_io -c "pwrite -S 0x78 -b 105472 184955320 105472" /mnt/foo xfs_io -c "pwrite -S 0x79 -b 101376 225691598 101376" /mnt/foo xfs_io -c "pwrite -S 0x80 -b 23552 77023155 23552" /mnt/foo xfs_io -c "pwrite -S 0x81 -b 11 201888192 11" /mnt/foo xfs_io -c "pwrite -S 0x82 -b 11 115332492 11" /mnt/foo xfs_io -c "pwrite -S 0x83 -b 67584 230278015 67584" /mnt/foo xfs_io -c "pwrite -S 0x84 -b 11 120589073 11" /mnt/foo xfs_io -c "pwrite -S 0x85 -b 125952 202207819 125952" /mnt/foo xfs_io -c "pwrite -S 0x86 -b 113664 86672080 113664" /mnt/foo xfs_io -c "pwrite -S 0x87 -b 17408 208459603 17408" /mnt/foo xfs_io -c "pwrite -S 0x88 -b 7000 73372211 7000" /mnt/foo xfs_io -c "pwrite -S 0x89 -b 7000 42252122 7000" /mnt/foo xfs_io -c "pwrite -S 0x90 -b 23552 46784881 23552" /mnt/foo xfs_io -c "pwrite -S 0x91 -b 101376 63172351 101376" /mnt/foo xfs_io -c "pwrite -S 0x92 -b 23552 59341931 23552" /mnt/foo xfs_io -c "pwrite -S 0x93 -b 39936 239599283 39936" /mnt/foo xfs_io -c "pwrite -S 0x94 -b 67584 175643105 67584" /mnt/foo xfs_io -c "pwrite -S 0x97 -b 23552 105534880 23552" /mnt/foo xfs_io -c "pwrite -S 0x98 -b 113664 8236844 113664" /mnt/foo xfs_io -c "pwrite -S 0x99 -b 125952 144489686 125952" /mnt/foo xfs_io -c "pwrite -S 0xa0 -b 7000 73273112 7000" /mnt/foo xfs_io -c "pwrite -S 0xa1 -b 125952 194580243 125952" /mnt/foo xfs_io -c "pwrite -S 0xa2 -b 123 56296779 123" /mnt/foo xfs_io -c "pwrite -S 0xa3 -b 11 233066845 11" /mnt/foo xfs_io -c "pwrite -S 0xa4 -b 39936 197727090 39936" /mnt/foo xfs_io -c "pwrite -S 0xa5 -b 101376 53579812 101376" /mnt/foo xfs_io -c "pwrite -S 0xa6 -b 9216 85669738 9216" /mnt/foo xfs_io -c "pwrite -S 0xa7 -b 125952 21266322 125952" /mnt/foo xfs_io -c "pwrite -S 0xa8 -b 23552 125726568 23552" /mnt/foo xfs_io -c "pwrite -S 0xa9 -b 9216 18423680 9216" /mnt/foo xfs_io -c "pwrite -S 0xb0 -b 1121 165901483 1121" /mnt/foo btrfs subvolume snapshot -r /mnt /mnt/mysnap1 xfs_io -c "pwrite -S 0xff -b 10 16190218 10" /mnt/foo btrfs subvolume snapshot -r /mnt /mnt/mysnap2 md5sum /mnt/foo # returns 79e53f1466bfc09fd82b450689e6119e md5sum /mnt/mysnap2/foo # returns 79e53f1466bfc09fd82b450689e6119e too btrfs send /mnt/mysnap1 -f /tmp/1.snap btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/2.snap mkfs.btrfs -f /dev/sdc mount /dev/sdc /mnt btrfs receive /mnt -f /tmp/1.snap btrfs receive /mnt -f /tmp/2.snap md5sum /mnt/mysnap2/foo # returns 2bb414c5155767cedccd7063e51beabd !! A testcase for xfstests follows soon too. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07Btrfs: kmalloc() doesn't return an ERR_PTRDan Carpenter
The error handling was copy and pasted from memdup_user(). It should be checking for NULL obviously. Fixes: abccd00f8af2 ('btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07Btrfs: fix snapshot vs nocow writtingWang Shilong
While running fsstress and snapshots concurrently, we will hit something like followings: Thread 1 Thread 2 |->fallocate |->write pages |->join transaction |->add ordered extent |->end transaction |->flushing data |->creating pending snapshots |->write data into src root's fallocated space After above work flows finished, we will get a state that source and snapshot root share same space, but source root have written data into fallocated space, this will make fsck fail to verify checksums for snapshot root's preallocating file extent data.Nocow writting also has this same problem. Fix this problem by syncing snapshots with nocow writting: 1.for nocow writting,if there are pending snapshots, we will fall into COW way. 2.if there are pending nocow writes, snapshots for this root will be blocked until nocow writting finish. Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07btrfs: Change the expanding write sequence to fix snapshot related bug.Qu Wenruo
When testing fsstress with snapshot making background, some snapshot following problem. Snapshot 270: inode 323: size 0 Snapshot 271: inode 323: size 349145 |-------Hole---|---------Empty gap-------|-------Hole-----| 0 122880 172032 349145 Snapshot 272: inode 323: size 349145 |-------Hole---|------------Data---------|-------Hole-----| 0 122880 172032 349145 The fsstress operation on inode 323 is the following: write: offset 126832 len 43124 truncate: size 349145 Since the write with offset is consist of 2 operations: 1. punch hole 2. write data Hole punching is faster than data write, so hole punching in write and truncate is done first and then buffered write, so the snapshot 271 got empty gap, which will not pass btrfsck. To fix the bug, this patch will change the write sequence which will first punch a hole covering the write end if a hole is needed. Reported-by: Gui Hecheng <guihc.fnst@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07btrfs: make device scan less noisyDavid Sterba
Print the message only when the device is seen for the first time. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07btrfs: fix lockdep warning with reclaim lock inversionJeff Mahoney
When encountering memory pressure, testers have run into the following lockdep warning. It was caused by __link_block_group calling kobject_add with the groups_sem held. kobject_add calls kvasprintf with GFP_KERNEL, which gets us into reclaim context. The kobject doesn't actually need to be added under the lock -- it just needs to ensure that it's only added for the first block group to be linked. ========================================================= [ INFO: possible irq lock inversion dependency detected ] 3.14.0-rc8-default #1 Not tainted --------------------------------------------------------- kswapd0/169 just changed the state of lock: (&delayed_node->mutex){+.+.-.}, at: [<ffffffffa018baea>] __btrfs_release_delayed_node+0x3a/0x200 [btrfs] but this lock took another, RECLAIM_FS-unsafe lock in the past: (&found->groups_sem){+++++.} and interrupts could create inverse lock ordering between them. other info that might help us debug this: Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&found->groups_sem); local_irq_disable(); lock(&delayed_node->mutex); lock(&found->groups_sem); <Interrupt> lock(&delayed_node->mutex); *** DEADLOCK *** 2 locks held by kswapd0/169: #0: (shrinker_rwsem){++++..}, at: [<ffffffff81159e8a>] shrink_slab+0x3a/0x160 #1: (&type->s_umount_key#27){++++..}, at: [<ffffffff811bac6f>] grab_super_passive+0x3f/0x90 Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-07Btrfs: hold the commit_root_sem when getting the commit root during sendJosef Bacik
We currently rely too heavily on roots being read-only to save us from just accessing root->commit_root. We can easily balance blocks out from underneath a read only root, so to save us from getting screwed make sure we only access root->commit_root under the commit root sem. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-06Btrfs: remove transaction from sendJosef Bacik
Lets try this again. We can deadlock the box if we send on a box and try to write onto the same fs with the app that is trying to listen to the send pipe. This is because the writer could get stuck waiting for a transaction commit which is being blocked by the send. So fix this by making sure looking at the commit roots is always going to be consistent. We do this by keeping track of which roots need to have their commit roots swapped during commit, and then taking the commit_root_sem and swapping them all at once. Then make sure we take a read lock on the commit_root_sem in cases where we search the commit root to make sure we're always looking at a consistent view of the commit roots. Previously we had problems with this because we would swap a fs tree commit root and then swap the extent tree commit root independently which would cause the backref walking code to screw up sometimes. With this patch we no longer deadlock and pass all the weird send/receive corner cases. Thanks, Reportedy-by: Hugo Mills <hugo@carfax.org.uk> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-06Btrfs: don't clear uptodate if the eb is under IOJosef Bacik
So I have an awful exercise script that will run snapshot, balance and send/receive in parallel. This sometimes would crash spectacularly and when it came back up the fs would be completely hosed. Turns out this is because of a bad interaction of balance and send/receive. Send will hold onto its entire path for the whole send, but its blocks could get relocated out from underneath it, and because it doesn't old tree locks theres nothing to keep this from happening. So it will go to read in a slot with an old transid, and we could have re-allocated this block for something else and it could have a completely different transid. But because we think it is invalid we clear uptodate and re-read in the block. If we do this before we actually write out the new block we could write back stale data to the fs, and boom we're screwed. Now we definitely need to fix this disconnect between send and balance, but we really really need to not allow ourselves to accidently read in stale data over new data. So make sure we check if the extent buffer is not under io before clearing uptodate, this will kick back EIO to the caller instead of reading in stale data and keep us from corrupting the fs. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-06Btrfs: check for an extent_op on the locked refJosef Bacik
We could have possibly added an extent_op to the locked_ref while we dropped locked_ref->lock, so check for this case as well and loop around. Otherwise we could lose flag updates which would lead to extent tree corruption. Thanks, cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-06Btrfs: do not reset last_snapshot after relocationJosef Bacik
This was done to allow NO_COW to continue to be NO_COW after relocation but it is not right. When relocating we will convert blocks to FULL_BACKREF that we relocate. We can leave some of these full backref blocks behind if they are not cow'ed out during the relocation, like if we fail the relocation with ENOSPC and then just drop the reloc tree. Then when we go to cow the block again we won't lookup the extent flags because we won't think there has been a snapshot recently which means we will do our normal ref drop thing instead of adding back a tree ref and dropping the shared ref. This will cause btrfs_free_extent to blow up because it can't find the ref we are trying to free. This was found with my ref verifying tool. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-04-04Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Major changes for 3.14 include support for the newly added ZERO_RANGE and COLLAPSE_RANGE fallocate operations, and scalability improvements in the jbd2 layer and in xattr handling when the extended attributes spill over into an external block. Other than that, the usual clean ups and minor bug fixes" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (42 commits) ext4: fix premature freeing of partial clusters split across leaf blocks ext4: remove unneeded test of ret variable ext4: fix comment typo ext4: make ext4_block_zero_page_range static ext4: atomically set inode->i_flags in ext4_set_inode_flags() ext4: optimize Hurd tests when reading/writing inodes ext4: kill i_version support for Hurd-castrated file systems ext4: each filesystem creates and uses its own mb_cache fs/mbcache.c: doucple the locking of local from global data fs/mbcache.c: change block and index hash chain to hlist_bl_node ext4: Introduce FALLOC_FL_ZERO_RANGE flag for fallocate ext4: refactor ext4_fallocate code ext4: Update inode i_size after the preallocation ext4: fix partial cluster handling for bigalloc file systems ext4: delete path dealloc code in ext4_ext_handle_uninitialized_extents ext4: only call sync_filesystm() when remounting read-only fs: push sync_filesystem() down to the file system's remount_fs() jbd2: improve error messages for inconsistent journal heads jbd2: minimize region locked by j_list_lock in jbd2_journal_forget() jbd2: minimize region locked by j_list_lock in journal_get_create_access() ...
2014-04-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs changes from Chris Mason: "This is a pretty long stream of bug fixes and performance fixes. Qu Wenruo has replaced the btrfs async threads with regular kernel workqueues. We'll keep an eye out for performance differences, but it's nice to be using more generic code for this. We still have some corruption fixes and other patches coming in for the merge window, but this batch is tested and ready to go" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (108 commits) Btrfs: fix a crash of clone with inline extents's split btrfs: fix uninit variable warning Btrfs: take into account total references when doing backref lookup Btrfs: part 2, fix incremental send's decision to delay a dir move/rename Btrfs: fix incremental send's decision to delay a dir move/rename Btrfs: remove unnecessary inode generation lookup in send Btrfs: fix race when updating existing ref head btrfs: Add trace for btrfs_workqueue alloc/destroy Btrfs: less fs tree lock contention when using autodefrag Btrfs: return EPERM when deleting a default subvolume Btrfs: add missing kfree in btrfs_destroy_workqueue Btrfs: cache extent states in defrag code path Btrfs: fix deadlock with nested trans handles Btrfs: fix possible empty list access when flushing the delalloc inodes Btrfs: split the global ordered extents mutex Btrfs: don't flush all delalloc inodes when we doesn't get s_umount lock Btrfs: reclaim delalloc metadata more aggressively Btrfs: remove unnecessary lock in may_commit_transaction() Btrfs: remove the unnecessary flush when preparing the pages Btrfs: just do dirty page flush for the inode with compression before direct IO ...
2014-04-03mm + fs: store shadow entries in page cacheJohannes Weiner
Reclaim will be leaving shadow entries in the page cache radix tree upon evicting the real page. As those pages are found from the LRU, an iput() can lead to the inode being freed concurrently. At this point, reclaim must no longer install shadow pages because the inode freeing code needs to ensure the page tree is really empty. Add an address_space flag, AS_EXITING, that the inode freeing code sets under the tree lock before doing the final truncate. Reclaim will check for this flag before installing shadow pages. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03mm + fs: prepare for non-page entries in page cache radix treesJohannes Weiner
shmem mappings already contain exceptional entries where swap slot information is remembered. To be able to store eviction information for regular page cache, prepare every site dealing with the radix trees directly to handle entries other than pages. The common lookup functions will filter out non-page entries and return NULL for page cache holes, just as before. But provide a raw version of the API which returns non-page entries as well, and switch shmem over to use it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-03fs/direct-io.c: remove some left over checksDan Carpenter
We know that "ret > 0" is true here. These tests were left over from commit 02afc27faec9 ('direct-io: Handle O_(D)SYNC AIO') and aren't needed any more. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>