summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2015-08-16signalfd: fix information leak in signalfd_copyinfoAmanieu d'Antras
commit 3ead7c52bdb0ab44f4bb1feed505a8323cc12ba7 upstream. This function may copy the si_addr_lsb field to user mode when it hasn't been initialized, which can leak kernel stack data to user mode. Just checking the value of si_code is insufficient because the same si_code value is shared between multiple signals. This is solved by checking the value of si_signo in addition to si_code. Signed-off-by: Amanieu d'Antras <amanieu@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16ocfs2: fix BUG in ocfs2_downconvert_thread_do_work()Joseph Qi
commit 209f7512d007980fd111a74a064d70a3656079cf upstream. The "BUG_ON(list_empty(&osb->blocked_lock_list))" in ocfs2_downconvert_thread_do_work can be triggered in the following case: ocfs2dc has firstly saved osb->blocked_lock_count to local varibale processed, and then processes the dentry lockres. During the dentry put, it calls iput and then deletes rw, inode and open lockres from blocked list in ocfs2_mark_lockres_freeing. And this causes the variable `processed' to not reflect the number of blocked lockres to be processed, which triggers the BUG. Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-16fsnotify: fix oops in fsnotify_clear_marks_by_group_flags()Jan Kara
commit 8f2f3eb59dff4ec538de55f2e0592fec85966aab upstream. fsnotify_clear_marks_by_group_flags() can race with fsnotify_destroy_marks() so that when fsnotify_destroy_mark_locked() drops mark_mutex, a mark from the list iterated by fsnotify_clear_marks_by_group_flags() can be freed and thus the next entry pointer we have cached may become stale and we dereference free memory. Fix the problem by first moving marks to free to a special private list and then always free the first entry in the special list. This method is safe even when entries from the list can disappear once we drop the lock. Signed-off-by: Jan Kara <jack@suse.com> Reported-by: Ashish Sangwan <a.sangwan@samsung.com> Reviewed-by: Ashish Sangwan <a.sangwan@samsung.com> Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-10freeing unlinked file indefinitely delayedAl Viro
commit 75a6f82a0d10ef8f13cd8fe7212911a0252ab99e upstream. Normally opening a file, unlinking it and then closing will have the inode freed upon close() (provided that it's not otherwise busy and has no remaining links, of course). However, there's one case where that does *not* happen. Namely, if you open it by fhandle with cold dcache, then unlink() and close(). In normal case you get d_delete() in unlink(2) notice that dentry is busy and unhash it; on the final dput() it will be forcibly evicted from dcache, triggering iput() and inode removal. In this case, though, we end up with *two* dentries - disconnected (created by open-by-fhandle) and regular one (used by unlink()). The latter will have its reference to inode dropped just fine, but the former will not - it's considered hashed (it is on the ->s_anon list), so it will stay around until the memory pressure will finally do it in. As the result, we have the final iput() delayed indefinitely. It's trivial to reproduce - void flush_dcache(void) { system("mount -o remount,rw /"); } static char buf[20 * 1024 * 1024]; main() { int fd; union { struct file_handle f; char buf[MAX_HANDLE_SZ]; } x; int m; x.f.handle_bytes = sizeof(x); chdir("/root"); mkdir("foo", 0700); fd = open("foo/bar", O_CREAT | O_RDWR, 0600); close(fd); name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0); flush_dcache(); fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR); unlink("foo/bar"); write(fd, buf, sizeof(buf)); system("df ."); /* 20Mb eaten */ close(fd); system("df ."); /* should've freed those 20Mb */ flush_dcache(); system("df ."); /* should be the same as #2 */ } will spit out something like Filesystem 1K-blocks Used Available Use% Mounted on /dev/root 322023 303843 1131 100% / Filesystem 1K-blocks Used Available Use% Mounted on /dev/root 322023 303843 1131 100% / Filesystem 1K-blocks Used Available Use% Mounted on /dev/root 322023 283282 21692 93% / - inode gets freed only when dentry is finally evicted (here we trigger than by remount; normally it would've happened in response to memory pressure hell knows when). Acked-by: J. Bruce Fields <bfields@fieldses.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03hpfs: hpfs_error: Remove static buffer, use vsprintf extension %pV insteadJoe Perches
commit a28e4b2b18ccb90df402da3f21e1a83c9d4f8ec1 upstream. Removing unnecessary static buffers is good. Use the vsprintf %pV extension instead. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Mikulas Patocka <mikulas@twibright.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-039p: don't leave a half-initialized inode sitting aroundAl Viro
commit 0a73d0a204a4a04a1e110539c5a524ae51f91d6d upstream. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03fixing infinite OPEN loop in 4.0 stateid recoveryOlga Kornievskaia
commit e8d975e73e5fa05f983fbf2723120edcf68e0b38 upstream. Problem: When an operation like WRITE receives a BAD_STATEID, even though recovery code clears the RECLAIM_NOGRACE recovery flag before recovering the open state, because of clearing delegation state for the associated inode, nfs_inode_find_state_and_recover() gets called and it makes the same state with RECLAIM_NOGRACE flag again. As a results, when we restart looking over the open states, we end up in the infinite loop instead of breaking out in the next test of state flags. Solution: unset the RECLAIM_NOGRACE set because of calling of nfs_inode_find_state_and_recover() after returning from calling recover_open() function. Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03NFS: Fix size of NFSACL SETACL operationsChuck Lever
commit d683cc49daf7c5afca8cd9654aaa1bf63cdf2ad9 upstream. When encoding the NFSACL SETACL operation, reserve just the estimated size of the ACL rather than a fixed maximum. This eliminates needless zero padding on the wire that the server ignores. Fixes: ee5dc7732bd5 ('NFS: Fix "kernel BUG at fs/nfs/nfs3xdr.c:1338!"') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03fuse: initialize fc->release before calling itMiklos Szeredi
commit 0ad0b3255a08020eaf50e34ef0d6df5bdf5e09ed upstream. fc->release is called from fuse_conn_put() which was used in the error cleanup before fc->release was initialized. [Jeremiah Mahler <jmmahler@gmail.com>: assign fc->release after calling fuse_conn_init(fc) instead of before.] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Fixes: a325f9b92273 ("fuse: update fuse_conn_init() and separate out fuse_conn_kill()") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03Btrfs: use kmem_cache_free when freeing entry in inode cacheFilipe Manana
commit c3f4a1685bb87e59c886ee68f7967eae07d4dffa upstream. The free space entries are allocated using kmem_cache_zalloc(), through __btrfs_add_free_space(), therefore we should use kmem_cache_free() and not kfree() to avoid any confusion and any potential problem. Looking at the kfree() definition at mm/slab.c it has the following comment: /* * (...) * * Don't free memory not originally allocated by kmalloc() * or you will run into trouble. */ So better be safe and use kmem_cache_free(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03xfs: fix remote symlinks on V5/CRC filesystemsEric Sandeen
commit 2ac56d3d4bd625450a54d4c3f9292d58f6b88232 upstream. If we create a CRC filesystem, mount it, and create a symlink with a path long enough that it can't live in the inode, we get a very strange result upon remount: # ls -l mnt total 4 lrwxrwxrwx. 1 root root 929 Jun 15 16:58 link -> XSLM XSLM is the V5 symlink block header magic (which happens to be followed by a NUL, so the string looks terminated). xfs_readlink_bmap() advanced cur_chunk by the size of the header for CRC filesystems, but never actually used that pointer; it kept reading from bp->b_addr, which is the start of the block, rather than the start of the symlink data after the header. Looks like this problem goes back to v3.10. Fixing this gets us reading the proper link target, again. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03jbd2: fix ocfs2 corrupt when updating journal superblock failsJoseph Qi
commit 6f6a6fda294506dfe0e3e0a253bb2d2923f28f0a upstream. If updating journal superblock fails after journal data has been flushed, the error is omitted and this will mislead the caller as a normal case. In ocfs2, the checkpoint will be treated successfully and the other node can get the lock to update. Since the sb_start is still pointing to the old log block, it will rewrite the journal data during journal recovery by the other node. Thus the new updates will be overwritten and ocfs2 corrupts. So in above case we have to return the error, and ocfs2_commit_cache will take care of the error and prevent the other node to do update first. And only after recovering journal it can do the new updates. The issue discussion mail can be found at: https://oss.oracle.com/pipermail/ocfs2-devel/2015-June/010856.html http://comments.gmane.org/gmane.comp.file-systems.ext4/48841 [ Fixed bug in patch which allowed a non-negative error return from jbd2_cleanup_journal_tail() to leak out of jbd2_fjournal_flush(); this was causing xfstests ext4/306 to fail. -- Ted ] Reported-by: Yiwen Jiang <jiangyiwen@huawei.com> Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Tested-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03jbd2: use GFP_NOFS in jbd2_cleanup_journal_tail()Dmitry Monakhov
commit b4f1afcd068f6e533230dfed00782cd8a907f96b upstream. jbd2_cleanup_journal_tail() can be invoked by jbd2__journal_start() So allocations should be done with GFP_NOFS [Full stack trace snipped from 3.10-rh7] [<ffffffff815c4bd4>] dump_stack+0x19/0x1b [<ffffffff8105dba1>] warn_slowpath_common+0x61/0x80 [<ffffffff8105dcca>] warn_slowpath_null+0x1a/0x20 [<ffffffff815c2142>] slab_pre_alloc_hook.isra.31.part.32+0x15/0x17 [<ffffffff8119c045>] kmem_cache_alloc+0x55/0x210 [<ffffffff811477f5>] ? mempool_alloc_slab+0x15/0x20 [<ffffffff811477f5>] mempool_alloc_slab+0x15/0x20 [<ffffffff81147939>] mempool_alloc+0x69/0x170 [<ffffffff815cb69e>] ? _raw_spin_unlock_irq+0xe/0x20 [<ffffffff8109160d>] ? finish_task_switch+0x5d/0x150 [<ffffffff811f1a8e>] bio_alloc_bioset+0x1be/0x2e0 [<ffffffff8127ee49>] blkdev_issue_flush+0x99/0x120 [<ffffffffa019a733>] jbd2_cleanup_journal_tail+0x93/0xa0 [jbd2] -->GFP_KERNEL [<ffffffffa019aca1>] jbd2_log_do_checkpoint+0x221/0x4a0 [jbd2] [<ffffffffa019afc7>] __jbd2_log_wait_for_space+0xa7/0x1e0 [jbd2] [<ffffffffa01952d8>] start_this_handle+0x2d8/0x550 [jbd2] [<ffffffff811b02a9>] ? __memcg_kmem_put_cache+0x29/0x30 [<ffffffff8119c120>] ? kmem_cache_alloc+0x130/0x210 [<ffffffffa019573a>] jbd2__journal_start+0xba/0x190 [jbd2] [<ffffffff811532ce>] ? lru_cache_add+0xe/0x10 [<ffffffffa01c9549>] ? ext4_da_write_begin+0xf9/0x330 [ext4] [<ffffffffa01f2c77>] __ext4_journal_start_sb+0x77/0x160 [ext4] [<ffffffffa01c9549>] ext4_da_write_begin+0xf9/0x330 [ext4] [<ffffffff811446ec>] generic_file_buffered_write_iter+0x10c/0x270 [<ffffffff81146918>] __generic_file_write_iter+0x178/0x390 [<ffffffff81146c6b>] __generic_file_aio_write+0x8b/0xb0 [<ffffffff81146ced>] generic_file_aio_write+0x5d/0xc0 [<ffffffffa01bf289>] ext4_file_write+0xa9/0x450 [ext4] [<ffffffff811c31d9>] ? pipe_read+0x379/0x4f0 [<ffffffff811b93f0>] do_sync_write+0x90/0xe0 [<ffffffff811b9b6d>] vfs_write+0xbd/0x1e0 [<ffffffff811ba5b8>] SyS_write+0x58/0xb0 [<ffffffff815d4799>] system_call_fastpath+0x16/0x1b Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: replace open coded nofail allocation in ext4_free_blocks()Michal Hocko
commit 7444a072c387a93ebee7066e8aee776954ab0e41 upstream. ext4_free_blocks is looping around the allocation request and mimics __GFP_NOFAIL behavior without any allocation fallback strategy. Let's remove the open coded loop and replace it with __GFP_NOFAIL. Without the flag the allocator has no way to find out never-fail requirement and cannot help in any way. Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: correctly migrate a file with a hole at the beginningEryu Guan
commit 8974fec7d72e3e02752fe0f27b4c3719c78d9a15 upstream. Currently ext4_ind_migrate() doesn't correctly handle a file which contains a hole at the beginning of the file. This caused the migration to be done incorrectly, and then if there is a subsequent following delayed allocation write to the "hole", this would reclaim the same data blocks again and results in fs corruption. # assmuing 4k block size ext4, with delalloc enabled # skip the first block and write to the second block xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile # converting to indirect-mapped file, which would move the data blocks # to the beginning of the file, but extent status cache still marks # that region as a hole chattr -e /mnt/ext4/testfile # delayed allocation writes to the "hole", reclaim the same data block # again, results in i_blocks corruption xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile umount /mnt/ext4 e2fsck -nf /dev/sda6 ... Inode 53, i_blocks is 16, should be 8. Fix? no ... Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: be more strict when migrating to non-extent based fileEryu Guan
commit d6f123a9297496ad0b6335fe881504c4b5b2a5e5 upstream. Currently the check in ext4_ind_migrate() is not enough before doing the real conversion: a) delayed allocated extents could bypass the check on eh->eh_entries and eh->eh_depth This can be demonstrated by this script xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile chattr -e /mnt/ext4/testfile where testfile has two extents but still be converted to non-extent based file format. b) only extent length is checked but not the offset, which would result in data lose (delalloc) or fs corruption (nodelalloc), because non-extent based file only supports at most (12 + 2^10 + 2^20 + 2^30) blocks This can be demostrated by xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile chattr -e /mnt/ext4/testfile sync If delalloc is enabled, dmesg prints EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53 EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5 EXT4-fs (dm-4): This should not happen!! Data will be lost If delalloc is disabled, e2fsck -nf shows corruption Inode 53, i_size is 5497558142976, should be 4096. Fix? no Fix the two issues by a) forcing all delayed allocation blocks to be allocated before checking eh->eh_depth and eh->eh_entries b) limiting the last logical block of the extent is within direct map Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: fix reservation release on invalidatepage for delalloc fsLukas Czerner
commit 9705acd63b125dee8b15c705216d7186daea4625 upstream. On delalloc enabled file system on invalidatepage operation in ext4_da_page_release_reservation() we want to clear the delayed buffer and remove the extent covering the delayed buffer from the extent status tree. However currently there is a bug where on the systems with page size > block size we will always remove extents from the start of the page regardless where the actual delayed buffers are positioned in the page. This leads to the errors like this: EXT4-fs warning (device loop0): ext4_da_release_space:1225: ext4_da_release_space: ino 13, to_free 1 with only 0 reserved data blocks This however can cause data loss on writeback time if the file system is in ENOSPC condition because we're releasing reservation for someones else delayed buffer. Fix this by only removing extents that corresponds to the part of the page we want to invalidate. This problem is reproducible by the following fio receipt (however I was only able to reproduce it with fio-2.1 or older. [global] bs=8k iodepth=1024 iodepth_batch=60 randrepeat=1 size=1m directory=/mnt/test numjobs=20 [job1] ioengine=sync bs=1k direct=1 rw=randread filename=file1:file2 [job2] ioengine=libaio rw=randwrite direct=1 filename=file1:file2 [job3] bs=1k ioengine=posixaio rw=randwrite direct=1 filename=file1:file2 [job5] bs=1k ioengine=sync rw=randread filename=file1:file2 [job7] ioengine=libaio rw=randwrite filename=file1:file2 [job8] ioengine=posixaio rw=randwrite filename=file1:file2 [job10] ioengine=mmap rw=randwrite bs=1k filename=file1:file2 [job11] ioengine=mmap rw=randwrite direct=1 filename=file1:file2 Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: don't retry file block mapping on bigalloc fs with non-extent fileDarrick J. Wong
commit 292db1bc6c105d86111e858859456bcb11f90f91 upstream. ext4 isn't willing to map clusters to a non-extent file. Don't signal this with an out of space error, since the FS will retry the allocation (which didn't fail) forever. Instead, return EUCLEAN so that the operation will fail immediately all the way back to userspace. (The fix is either to run e2fsck -E bmap2extent, or to chattr +e the file.) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: call sync_blockdev() before invalidate_bdev() in put_super()Theodore Ts'o
commit 89d96a6f8e6491f24fc8f99fd6ae66820e85c6c1 upstream. Normally all of the buffers will have been forced out to disk before we call invalidate_bdev(), but there will be some cases, where a file system operation was aborted due to an ext4_error(), where there may still be some dirty buffers in the buffer cache for the device. So try to force them out to memory before calling invalidate_bdev(). This fixes a warning triggered by generic/081: WARNING: CPU: 1 PID: 3473 at /usr/projects/linux/ext4/fs/block_dev.c:56 __blkdev_put+0xb5/0x16f() Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-08-03ext4: fix race between truncate and __ext4_journalled_writepage()Theodore Ts'o
commit bdf96838aea6a265f2ae6cbcfb12a778c84a0b8e upstream. The commit cf108bca465d: "ext4: Invert the locking order of page_lock and transaction start" caused __ext4_journalled_writepage() to drop the page lock before the page was written back, as part of changing the locking order to jbd2_journal_start -> page_lock. However, this introduced a potential race if there was a truncate racing with the data=journalled writeback mode. Fix this by grabbing the page lock after starting the journal handle, and then checking to see if page had gotten truncated out from under us. This fixes a number of different warnings or BUG_ON's when running xfstests generic/086 in data=journalled mode, including: jbd2_journal_dirty_metadata: vdc-8: bad jh for block 115643: transaction (ee3fe7 c0, 164), jh->b_transaction ( (null), 0), jh->b_next_transaction ( (null), 0), jlist 0 - and - kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:2200! ... Call Trace: [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117 [<c02b2de5>] __ext4_journalled_invalidatepage+0x10f/0x117 [<c02b2ded>] ? __ext4_journalled_invalidatepage+0x117/0x117 [<c027d883>] ? lock_buffer+0x36/0x36 [<c02b2dfa>] ext4_journalled_invalidatepage+0xd/0x22 [<c0229139>] do_invalidatepage+0x22/0x26 [<c0229198>] truncate_inode_page+0x5b/0x85 [<c022934b>] truncate_inode_pages_range+0x156/0x38c [<c0229592>] truncate_inode_pages+0x11/0x15 [<c022962d>] truncate_pagecache+0x55/0x71 [<c02b913b>] ext4_setattr+0x4a9/0x560 [<c01ca542>] ? current_kernel_time+0x10/0x44 [<c026c4d8>] notify_change+0x1c7/0x2be [<c0256a00>] do_truncate+0x65/0x85 [<c0226f31>] ? file_ra_state_init+0x12/0x29 - and - WARNING: CPU: 1 PID: 1331 at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1396 irty_metadata+0x14a/0x1ae() ... Call Trace: [<c01b879f>] ? console_unlock+0x3a1/0x3ce [<c082cbb4>] dump_stack+0x48/0x60 [<c0178b65>] warn_slowpath_common+0x89/0xa0 [<c02ef2cf>] ? jbd2_journal_dirty_metadata+0x14a/0x1ae [<c0178bef>] warn_slowpath_null+0x14/0x18 [<c02ef2cf>] jbd2_journal_dirty_metadata+0x14a/0x1ae [<c02d8615>] __ext4_handle_dirty_metadata+0xd4/0x19d [<c02b2f44>] write_end_fn+0x40/0x53 [<c02b4a16>] ext4_walk_page_buffers+0x4e/0x6a [<c02b59e7>] ext4_writepage+0x354/0x3b8 [<c02b2f04>] ? mpage_release_unused_pages+0xd4/0xd4 [<c02b1b21>] ? wait_on_buffer+0x2c/0x2c [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8 [<c02b5a5b>] __writepage+0x10/0x2e [<c0225956>] write_cache_pages+0x22d/0x32c [<c02b5a4b>] ? ext4_writepage+0x3b8/0x3b8 [<c02b6ee8>] ext4_writepages+0x102/0x607 [<c019adfe>] ? sched_clock_local+0x10/0x10e [<c01a8a7c>] ? __lock_is_held+0x2e/0x44 [<c01a8ad5>] ? lock_is_held+0x43/0x51 [<c0226dff>] do_writepages+0x1c/0x29 [<c0276bed>] __writeback_single_inode+0xc3/0x545 [<c0277c07>] writeback_sb_inodes+0x21f/0x36d ... Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-10fs: Fix S_NOSEC handlingJan Kara
commit 2426f3910069ed47c0cc58559a6d088af7920201 upstream. file_remove_suid() could mistakenly set S_NOSEC inode bit when root was modifying the file. As a result following writes to the file by ordinary user would avoid clearing suid or sgid bits. Fix the bug by checking actual mode bits before setting S_NOSEC. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-03d_walk() might skip too muchJari Ruusu
When Al Viro's VFS deadlock fix "deal with deadlock in d_walk()" was backported to 3.10.y 3.4.y and 3.2.y stable kernel brances, the deadlock fix was copied to 3 different places. Later, a bug in that code was discovered. Al Viro's fix involved fixing only one part of code in mainline kernel. That fix is called "d_walk() might skip too much". 3.10.y 3.4.y and 3.2.y stable kernel brances need that later fix copied to 3 different places. Greg Kroah-Hartman included Al Viro's "d_walk() might skip too much" fix only once in 3.10.80 kernel, leaving 2 more places without a fix. The patch below was not written by me. I only applied Al Viro's "d_walk() might skip too much" fix 2 more times to 3.10.80 kernel, and cheched that the fixes went to correct places. With this patch applied, all 3 places that I am aware of 3.10.y stable branch are now fixed. Signed-off-by: Jari Ruusu <jariruusu@users.sourceforge.net> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-03Btrfs: make xattr replace operations atomicFilipe Manana
commit 5f5bc6b1e2d5a6f827bc860ef2dc5b6f365d1339 upstream. Replacing a xattr consists of doing a lookup for its existing value, delete the current value from the respective leaf, release the search path and then finally insert the new value. This leaves a time window where readers (getxattr, listxattrs) won't see any value for the xattr. Xattrs are used to store ACLs, so this has security implications. This change also fixes 2 other existing issues which were: *) Deleting the old xattr value without verifying first if the new xattr will fit in the existing leaf item (in case multiple xattrs are packed in the same item due to name hash collision); *) Returning -EEXIST when the flag XATTR_CREATE is given and the xattr doesn't exist but we have have an existing item that packs muliple xattrs with the same name hash as the input xattr. In this case we should return ENOSPC. A test case for xfstests follows soon. Thanks to Alexandre Oliva for reporting the non-atomicity of the xattr replace implementation. Reported-by: Alexandre Oliva <oliva@gnu.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> [shengyong: backport to 3.10 - FIX: CVE-2014-9710 - adjust context - ASSERT() was added v3.12, so we do check with if statement - set the first parameter of btrfs_item_nr() as NULL, because it is not used, and is removed in v3.13 ] Signed-off-by: Sheng Yong <shengyong1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-03fs: take i_mutex during prepare_binprm for set[ug]id executablesJann Horn
commit 8b01fc86b9f425899f8a3a8fc1c47d73c2c20543 upstream. This prevents a race between chown() and execve(), where chowning a setuid-user binary to root would momentarily make the binary setuid root. This patch was mostly written by Linus Torvalds. Signed-off-by: Jann Horn <jann@thejh.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Charles Williams <ciwillia@brocade.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Sheng Yong <shengyong1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-03get rid of s_files and files_lockAl Viro
commit eee5cc2702929fd41cce28058dc6d6717f723f87 upstream. The only thing we need it for is alt-sysrq-r (emergency remount r/o) and these days we can do just as well without going through the list of files. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> [wangkai: backport to 3.10: adjust context] Signed-off-by: Wang Kai <morgan.wang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-07-03fput: turn "list_head delayed_fput_list" into llist_headOleg Nesterov
commit 4f5e65a1cc90bbb15b9f6cdc362922af1bcc155a upstream. fput() and delayed_fput() can use llist and avoid the locking. This is unlikely path, it is not that this change can improve the performance, but this way the code looks simpler. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrey Vagin <avagin@openvz.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: David Howells <dhowells@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Wang Kai <morgan.wang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-29pipe: iovec: Fix memory corruption when retrying atomic copy as non-atomicBen Hutchings
pipe_iov_copy_{from,to}_user() may be tried twice with the same iovec, the first time atomically and the second time not. The second attempt needs to continue from the iovec position, pipe buffer offset and remaining length where the first attempt failed, but currently the pipe buffer offset and remaining length are reset. This will corrupt the piped data (possibly also leading to an information leak between processes) and may also corrupt kernel memory. This was fixed upstream by commits f0d1bec9d58d ("new helper: copy_page_from_iter()") and 637b58c2887e ("switch pipe_read() to copy_page_to_iter()"), but those aren't suitable for stable. This fix for older kernel versions was made by Seth Jennings for RHEL and I have extracted it from their update. CVE-2015-1805 References: https://bugzilla.redhat.com/show_bug.cgi?id=1202855 Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-22btrfs: cleanup orphans while looking up default subvolumeJeff Mahoney
commit 727b9784b6085c99c2f836bf4fcc2848dc9cf904 upstream. Orphans in the fs tree are cleaned up via open_ctree and subvolume orphans are cleaned via btrfs_lookup_dentry -- except when a default subvolume is in use. The name for the default subvolume uses a manual lookup that doesn't trigger orphan cleanup and needs to trigger it manually as well. This doesn't apply to the remount case since the subvolumes are cleaned up by walking the root radix tree. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-22btrfs: incorrect handling for fiemap_fill_next_extent returnChengyu Song
commit 26e726afe01c1c82072cf23a5ed89ce25f39d9f2 upstream. fiemap_fill_next_extent returns 0 on success, -errno on error, 1 if this was the last extent that will fit in user array. If 1 is returned, the return value may eventually returned to user space, which should not happen, according to manpage of ioctl. Signed-off-by: Chengyu Song <csong84@gatech.edu> Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-05fs/binfmt_elf.c:load_elf_binary(): return -EINVAL on zero-length mappingsAndrew Morton
commit 2b1d3ae940acd11be44c6eced5873d47c2e00ffa upstream. load_elf_binary() returns `retval', not `error'. Fixes: a87938b2e246b81b4fb ("fs/binfmt_elf.c: fix bug in loading of PIE binaries") Reported-by: James Hogan <james.hogan@imgtec.com> Cc: Michael Davidson <md@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-05vfs: read file_handle only once in handle_to_pathSasha Levin
commit 161f873b89136eb1e69477c847d5a5033239d9ba upstream. We used to read file_handle twice. Once to get the amount of extra bytes, and once to fetch the entire structure. This may be problematic since we do size verifications only after the first read, so if the number of extra bytes changes in userspace between the first and second calls, we'll have an incoherent view of file_handle. Instead, read the constant size once, and copy that over to the final structure without having to re-read it again. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-05ext4: check for zero length extent explicitlyEryu Guan
commit 2f974865ffdfe7b9f46a9940836c8b167342563d upstream. The following commit introduced a bug when checking for zero length extent 5946d08 ext4: check for overlapping extents in ext4_valid_extent_entries() Zero length extent could pass the check if lblock is zero. Adding the explicit check for zero length back. Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-05ext4: convert write_begin methods to stable_page_writes semanticsDmitry Monakhov
commit 7afe5aa59ed3da7b6161617e7f157c7c680dc41e upstream. Use wait_for_stable_page() instead of wait_on_page_writeback() Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alex Shi <alex.shi@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-05d_walk() might skip too muchAl Viro
commit 2159184ea01e4ae7d15f2017e296d4bc82d5aeb0 upstream. when we find that a child has died while we'd been trying to ascend, we should go into the first live sibling itself, rather than its sibling. Off-by-one in question had been introduced in "deal with deadlock in d_walk()" and the fix needs to be backported to all branches this one has been backported to. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-06-05fs, omfs: add NULL terminator in the end up the token listSasha Levin
commit dcbff39da3d815f08750552fdd04f96b51751129 upstream. match_token() expects a NULL terminator at the end of the token list so that it would know where to stop. Not having one causes it to overrun to invalid memory. In practice, passing a mount option that omfs didn't recognize would sometimes panic the system. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Bob Copeland <me@bobcopeland.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-17nilfs2: fix sanity check of btree level in nilfs_btree_root_broken()Ryusuke Konishi
commit d8fd150fe3935e1692bf57c66691e17409ebb9c1 upstream. The range check for b-tree level parameter in nilfs_btree_root_broken() is wrong; it accepts the case of "level == NILFS_BTREE_LEVEL_MAX" even though the level is limited to values in the range of 0 to (NILFS_BTREE_LEVEL_MAX - 1). Since the level parameter is read from storage device and used to index nilfs_btree_path array whose element count is NILFS_BTREE_LEVEL_MAX, it can cause memory overrun during btree operations if the boundary value is set to the level parameter on device. This fixes the broken sanity check and adds a comment to clarify that the upper bound NILFS_BTREE_LEVEL_MAX is exclusive. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-17ocfs2: dlm: fix race between purge and get lock resourceJunxiao Bi
commit b1432a2a35565f538586774a03bf277c27fc267d upstream. There is a race window in dlm_get_lock_resource(), which may return a lock resource which has been purged. This will cause the process to hang forever in dlmlock() as the ast msg can't be handled due to its lock resource not existing. dlm_get_lock_resource { ... spin_lock(&dlm->spinlock); tmpres = __dlm_lookup_lockres_full(dlm, lockid, namelen, hash); if (tmpres) { spin_unlock(&dlm->spinlock); >>>>>>>> race window, dlm_run_purge_list() may run and purge the lock resource spin_lock(&tmpres->spinlock); ... spin_unlock(&tmpres->spinlock); } } Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-13ext4: fix data corruption caused by unwritten and delayed extentsLukas Czerner
commit d2dc317d564a46dfc683978a2e5a4f91434e9711 upstream. Currently it is possible to lose whole file system block worth of data when we hit the specific interaction with unwritten and delayed extents in status extent tree. The problem is that when we insert delayed extent into extent status tree the only way to get rid of it is when we write out delayed buffer. However there is a limitation in the extent status tree implementation so that when inserting unwritten extent should there be even a single delayed block the whole unwritten extent would be marked as delayed. At this point, there is no way to get rid of the delayed extents, because there are no delayed buffers to write out. So when a we write into said unwritten extent we will convert it to written, but it still remains delayed. When we try to write into that block later ext4_da_map_blocks() will set the buffer new and delayed and map it to invalid block which causes the rest of the block to be zeroed loosing already written data. For now we can fix this by simply not allowing to set delayed status on written extent in the extent status tree. Also add WARN_ON() to make sure that we notice if this happens in the future. This problem can be easily reproduced by running the following xfs_io. xfs_io -f -c "pwrite -S 0xaa 4096 2048" \ -c "falloc 0 131072" \ -c "pwrite -S 0xbb 65536 2048" \ -c "fsync" /mnt/test/fff echo 3 > /proc/sys/vm/drop_caches xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff This can be theoretically also reproduced by at random by running fsx, but it's not very reliable, though on machines with bigger page size (like ppc) this can be seen more often (especially xfstest generic/127) Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06RCU pathwalk breakage when running into a symlink overmounting somethingAl Viro
commit 3cab989afd8d8d1bc3d99fef0e7ed87c31e7b647 upstream. Calling unlazy_walk() in walk_component() and do_last() when we find a symlink that needs to be followed doesn't acquire a reference to vfsmount. That's fine when the symlink is on the same vfsmount as the parent directory (which is almost always the case), but it's not always true - one _can_ manage to bind a symlink on top of something. And in such cases we end up with excessive mntput(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06ext4: make fsync to sync parent dir in no-journal for real this timeLukas Czerner
commit e12fb97222fc41e8442896934f76d39ef99b590a upstream. Previously commit 14ece1028b3ed53ffec1b1213ffc6acaf79ad77c added a support for for syncing parent directory of newly created inodes to make sure that the inode is not lost after a power failure in no-journal mode. However this does not work in majority of cases, namely: - if the directory has inline data - if the directory is already indexed - if the directory already has at least one block and: - the new entry fits into it - or we've successfully converted it to indexed So in those cases we might lose the inode entirely even after fsync in the no-journal mode. This also includes ext2 default mode obviously. I've noticed this while running xfstest generic/321 and even though the test should fail (we need to run fsck after a crash in no-journal mode) I could not find a newly created entries even when if it was fsynced before. Fix this by adjusting the ext4_add_entry() successful exit paths to set the inode EXT4_STATE_NEWENTRY so that fsync has the chance to fsync the parent directory as well. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Frank Mayhar <fmayhar@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06fs/binfmt_elf.c: fix bug in loading of PIE binariesMichael Davidson
commit a87938b2e246b81b4fb713edb371a9fa3c5c3c86 upstream. With CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE enabled, and a normal top-down address allocation strategy, load_elf_binary() will attempt to map a PIE binary into an address range immediately below mm->mmap_base. Unfortunately, load_elf_ binary() does not take account of the need to allocate sufficient space for the entire binary which means that, while the first PT_LOAD segment is mapped below mm->mmap_base, the subsequent PT_LOAD segment(s) end up being mapped above mm->mmap_base into the are that is supposed to be the "gap" between the stack and the binary. Since the size of the "gap" on x86_64 is only guaranteed to be 128MB this means that binaries with large data segments > 128MB can end up mapping part of their data segment over their stack resulting in corruption of the stack (and the data segment once the binary starts to run). Any PIE binary with a data segment > 128MB is vulnerable to this although address randomization means that the actual gap between the stack and the end of the binary is normally greater than 128MB. The larger the data segment of the binary the higher the probability of failure. Fix this by calculating the total size of the binary in the same way as load_elf_interp(). Signed-off-by: Michael Davidson <md@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06Btrfs: fix inode eviction infinite loop after cloning into itFilipe Manana
commit ccccf3d67294714af2d72a6fd6fd7d73b01c9329 upstream. If we attempt to clone a 0 length region into a file we can end up inserting a range in the inode's extent_io tree with a start offset that is greater then the end offset, which triggers immediately the following warning: [ 3914.619057] WARNING: CPU: 17 PID: 4199 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]() [ 3914.620886] BTRFS: end < start 4095 4096 (...) [ 3914.638093] Call Trace: [ 3914.638636] [<ffffffff81425fd9>] dump_stack+0x4c/0x65 [ 3914.639620] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb [ 3914.640789] [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs] [ 3914.642041] [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48 [ 3914.643236] [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs] [ 3914.644441] [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs] [ 3914.645711] [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs] [ 3914.646914] [<ffffffff8142b2fb>] ? _raw_spin_unlock+0x28/0x33 [ 3914.648058] [<ffffffffa03cbac4>] ? test_range_bit+0xcc/0xde [btrfs] [ 3914.650105] [<ffffffffa03cb3c3>] lock_extent+0x13/0x15 [btrfs] [ 3914.651361] [<ffffffffa03db39e>] lock_extent_range+0x3d/0xcd [btrfs] [ 3914.652761] [<ffffffffa03de1fe>] btrfs_ioctl_clone+0x278/0x388 [btrfs] [ 3914.654128] [<ffffffff811226dd>] ? might_fault+0x58/0xb5 [ 3914.655320] [<ffffffffa03e0909>] btrfs_ioctl+0xb51/0x2195 [btrfs] (...) [ 3914.669271] ---[ end trace 14843d3e2e622fc1 ]--- This later makes the inode eviction handler enter an infinite loop that keeps dumping the following warning over and over: [ 3915.117629] WARNING: CPU: 22 PID: 4228 at fs/btrfs/extent_io.c:435 insert_state+0x4b/0x10b [btrfs]() [ 3915.119913] BTRFS: end < start 4095 4096 (...) [ 3915.137394] Call Trace: [ 3915.137913] [<ffffffff81425fd9>] dump_stack+0x4c/0x65 [ 3915.139154] [<ffffffff81045390>] warn_slowpath_common+0xa1/0xbb [ 3915.140316] [<ffffffffa03ca44f>] ? insert_state+0x4b/0x10b [btrfs] [ 3915.141505] [<ffffffff810453f0>] warn_slowpath_fmt+0x46/0x48 [ 3915.142709] [<ffffffffa03ca44f>] insert_state+0x4b/0x10b [btrfs] [ 3915.143849] [<ffffffffa03ca729>] __set_extent_bit+0x107/0x3f4 [btrfs] [ 3915.145120] [<ffffffffa038c1e3>] ? btrfs_kill_super+0x17/0x23 [btrfs] [ 3915.146352] [<ffffffff811548f6>] ? deactivate_locked_super+0x3b/0x50 [ 3915.147565] [<ffffffffa03cb256>] lock_extent_bits+0x65/0x1bf [btrfs] [ 3915.148785] [<ffffffff8142b7e2>] ? _raw_write_unlock+0x28/0x33 [ 3915.149931] [<ffffffffa03bc325>] btrfs_evict_inode+0x196/0x482 [btrfs] [ 3915.151154] [<ffffffff81168904>] evict+0xa0/0x148 [ 3915.152094] [<ffffffff811689e5>] dispose_list+0x39/0x43 [ 3915.153081] [<ffffffff81169564>] evict_inodes+0xdc/0xeb [ 3915.154062] [<ffffffff81154418>] generic_shutdown_super+0x49/0xef [ 3915.155193] [<ffffffff811546d1>] kill_anon_super+0x13/0x1e [ 3915.156274] [<ffffffffa038c1e3>] btrfs_kill_super+0x17/0x23 [btrfs] (...) [ 3915.167404] ---[ end trace 14843d3e2e622fc2 ]--- So just bail out of the clone ioctl if the length of the region to clone is zero, without locking any extent range, in order to prevent this issue (same behaviour as a pwrite with a 0 length for example). This is trivial to reproduce. For example, the steps for the test I just made for fstests: mkfs.btrfs -f SCRATCH_DEV mount SCRATCH_DEV $SCRATCH_MNT touch $SCRATCH_MNT/foo touch $SCRATCH_MNT/bar $CLONER_PROG -s 0 -d 4096 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar umount $SCRATCH_MNT A test case for fstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-05-06Btrfs: fix log tree corruption when fs mounted with -o discardFilipe Manana
commit dcc82f4783ad91d4ab654f89f37ae9291cdc846a upstream. While committing a transaction we free the log roots before we write the new super block. Freeing the log roots implies marking the disk location of every node/leaf (metadata extent) as pinned before the new super block is written. This is to prevent the disk location of log metadata extents from being reused before the new super block is written, otherwise we would have a corrupted log tree if before the new super block is written a crash/reboot happens and the location of any log tree metadata extent ended up being reused and rewritten. Even though we pinned the log tree's metadata extents, we were issuing a discard against them if the fs was mounted with the -o discard option, resulting in corruption of the log tree if a crash/reboot happened before writing the new super block - the next time the fs was mounted, during the log replay process we would find nodes/leafs of the log btree with a content full of zeroes, causing the process to fail and require the use of the tool btrfs-zero-log to wipeout the log tree (and all data previously fsynced becoming lost forever). Fix this by not doing a discard when pinning an extent. The discard will be done later when it's safe (after the new super block is committed) at extent-tree.c:btrfs_finish_extent_commit(). Fixes: e688b7252f78 (Btrfs: fix extent pinning bugs in the tree log) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29dcache: Fix locking bugs in backported "deal with deadlock in d_walk()"Ben Hutchings
commit 20defcec264ceab2630356fb9d397f3d237b5e6d upstream in 3.2-stable Steven Rostedt reported: > Porting -rt to the latest 3.2 stable tree I triggered this bug: > > ===================================== > [ BUG: bad unlock balance detected! ] > ------------------------------------- > rm/1638 is trying to release lock (rcu_read_lock) at: > [<c04fde6c>] rcu_read_unlock+0x0/0x23 > but there are no more locks to release! > > other info that might help us debug this: > 2 locks held by rm/1638: > #0: (&sb->s_type->i_mutex_key#9/1){+.+.+.}, at: [<c04f93eb>] do_rmdir+0x5f/0xd2 > #1: (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<c04f9329>] vfs_rmdir+0x49/0xac > > stack backtrace: > Pid: 1638, comm: rm Not tainted 3.2.66-test-rt96+ #2 > Call Trace: > [<c083f390>] ? printk+0x1d/0x1f > [<c0463cdf>] print_unlock_inbalance_bug+0xc3/0xcd > [<c04653a8>] lock_release_non_nested+0x98/0x1ec > [<c046228d>] ? trace_hardirqs_off_caller+0x18/0x90 > [<c0456f1c>] ? local_clock+0x2d/0x50 > [<c04fde6c>] ? d_hash+0x2f/0x2f > [<c04fde6c>] ? d_hash+0x2f/0x2f > [<c046568e>] lock_release+0x192/0x1ad > [<c04fde83>] rcu_read_unlock+0x17/0x23 > [<c04ff344>] shrink_dcache_parent+0x227/0x270 > [<c04f9348>] vfs_rmdir+0x68/0xac > [<c04f9424>] do_rmdir+0x98/0xd2 > [<c04f03ad>] ? fput+0x1a3/0x1ab > [<c084dd42>] ? sysenter_exit+0xf/0x1a > [<c0465b58>] ? trace_hardirqs_on_caller+0x118/0x149 > [<c04fa3e0>] sys_unlinkat+0x2b/0x35 > [<c084dd13>] sysenter_do_call+0x12/0x12 > > > > > There's a path to calling rcu_read_unlock() without calling > rcu_read_lock() in have_submounts(). > > goto positive; > > positive: > if (!locked && read_seqretry(&rename_lock, seq)) > goto rename_retry; > > rename_retry: > rcu_read_unlock(); > > in the above path, rcu_read_lock() is never done before calling > rcu_read_unlock(); I reviewed locking contexts in all three functions that I changed when backporting "deal with deadlock in d_walk()". It's actually worse than this: - We don't hold this_parent->d_lock at the 'positive' label in have_submounts(), but it is unlocked after 'rename_retry'. - There is an rcu_read_unlock() after the 'out' label in select_parent(), but it's not held at the 'goto out'. Fix all three lock imbalances. Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Tested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29deal with deadlock in d_walk()Al Viro
commit ca5358ef75fc69fee5322a38a340f5739d997c10 upstream. ... by not hitting rename_retry for reasons other than rename having happened. In other words, do _not_ restart when finding that between unlocking the child and locking the parent the former got into __dentry_kill(). Skip the killed siblings instead... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Ben Hutchings <ben@decadent.org.uk> [hujianyang: Backported to 3.10 refer to the work of Ben Hutchings in 3.2: - As we only have try_to_ascend() and not d_walk(), apply this change to all callers of try_to_ascend() - Adjust context to make __dentry_kill() apply to d_kill()] Signed-off-by: hujianyang <hujianyang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29move d_rcu from overlapping d_child to overlapping d_aliasAl Viro
commit 946e51f2bf37f1656916eb75bd0742ba33983c28 upstream. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Ben Hutchings <ben@decadent.org.uk> [hujianyang: Backported to 3.10 refer to the work of Ben Hutchings in 3.2: - Apply name changes in all the different places we use d_alias and d_child - Move the WARN_ON() in __d_free() to d_free() as we don't have dentry_free()] Signed-off-by: hujianyang <hujianyang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29splice: Apply generic position and size checks to each writeBen Hutchings
commit 894c6350eaad7e613ae267504014a456e00a3e2a from the 3.2-stable branch. We need to check the position and size of file writes against various limits, using generic_write_check(). This was not being done for the splice write path. It was fixed upstream by commit 8d0207652cbe ("->splice_write() via ->write_iter()") but we can't apply that. CVE-2014-7822 Signed-off-by: Ben Hutchings <ben@decadent.org.uk> [Ben fixed it in 3.2 stable, i ported it to 3.10 stable] Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29jfs: fix readdir regressionDave Kleikamp
Upstream commit 44512449, "jfs: fix readdir cookie incompatibility with NFSv4", was backported incorrectly into the stable trees which used the filldir callback (rather than dir_emit). The position is being incorrectly passed to filldir for the . and .. entries. The still-maintained stable trees that need to be fixed are 3.2.y, 3.4.y and 3.10.y. https://bugzilla.kernel.org/show_bug.cgi?id=94741 Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: jfs-discussion@lists.sourceforge.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29remove extra definitions of U32_MAXAlex Elder
commit 04f9b74e4d96d349de12fdd4e6626af4a9f75e09 upstream. Now that the definition is centralized in <linux/kernel.h>, the definitions of U32_MAX (and related) elsewhere in the kernel can be removed. Signed-off-by: Alex Elder <elder@linaro.org> Acked-by: Sage Weil <sage@inktank.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-04-29conditionally define U32_MAXAlex Elder
commit 77719536dc00f8fd8f5abe6dadbde5331c37f996 upstream. The symbol U32_MAX is defined in several spots. Change these definitions to be conditional. This is in preparation for the next patch, which centralizes the definition in <linux/kernel.h>. Signed-off-by: Alex Elder <elder@linaro.org> Cc: Sage Weil <sage@inktank.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>