path: root/fs
2014-11-14  Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup  (Chris Mason)

commit 6e5aafb27419f32575b27ef9d6a31e5d54661aca upstream.

If we hit any errors in btrfs_lookup_csums_range, we'll loop through all the csums we allocate and free them. But the code was using list_entry incorrectly, and ended up trying to free the on-stack list_head instead.

This bug came from commit 0678b6185 ("btrfs: Don't BUG_ON kzalloc error in btrfs_lookup_csums_range()").

Signed-off-by: Chris Mason <clm@fb.com>
Reported-by: Erik Berg <btrfs@slipsprogrammoer.no>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
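To make the bug class concrete, here is a minimal sketch assuming a hypothetical csum_item entry type (this is not the actual btrfs code): list_entry() applied to the list head itself fabricates an "entry" pointer out of stack memory, while list_for_each_entry_safe() walks only the real entries and allows freeing during the walk.

    #include <linux/list.h>
    #include <linux/slab.h>

    struct csum_item {
            struct list_head list;
            /* checksum payload would live here */
    };

    static void free_csums(struct list_head *csums)
    {
            struct csum_item *item, *tmp;

            /*
             * Buggy pattern:
             *   item = list_entry(csums, struct csum_item, list);
             *   kfree(item);   -- "frees" the on-stack list head.
             */
            list_for_each_entry_safe(item, tmp, csums, list) {
                    list_del(&item->list);
                    kfree(item);    /* frees each real entry exactly once */
            }
    }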
2014-11-14  xfs: avoid false quotacheck after unclean shutdown  (Eric Sandeen)

commit 5ef828c4152726f56751c78ea844f08d2b2a4fa3 upstream.

Commit 83e782e ("xfs: Remove incore use of XFS_OQUOTA_ENFD and XFS_OQUOTA_CHKD") added a new function xfs_sb_quota_from_disk() which swaps on-disk XFS_OQUOTA_* flags for in-core XFS_GQUOTA_* and XFS_PQUOTA_* flags after the superblock is read.

However, if log recovery is required, the superblock is read again, and the modified in-core flags are re-read from disk, so we have XFS_OQUOTA_* flags in memory again. This causes the XFS_QM_NEED_QUOTACHECK() test to be true, because XFS_OQUOTA_CHKD is still set, and not XFS_GQUOTA_CHKD or XFS_PQUOTA_CHKD.

Change xfs_sb_from_disk to call xfs_sb_quota_from_disk and always convert the disk flags to in-memory flags. Add a lower-level function which can be called with "false" to not convert the flags, so that the sb verifier can verify exactly what was on disk, per Brian Foster's suggestion.

Reported-by: Cyril B. <cbay@excellency.fr>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Cc: Arkadiusz Miśkiewicz <arekm@maven.pl>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  quota: Properly return errors from dquot_writeback_dquots()  (Jan Kara)

commit 474d2605d119479e5aa050f738632e63589d4bb5 upstream.

Due to a switched left and right side of an assignment, dquot_writeback_dquots() never returned an error. This could result in errors during quota writeback not being reported to userspace properly. Fix it.

Coverity-id: 1226884
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
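The bug pattern is compact enough to sketch. A minimal illustration, assuming a hypothetical write_dquot() helper (not the exact dquot_writeback_dquots() code): the loop must latch the first failure into the variable it ultimately returns; swapping the sides of that assignment wipes the error out instead.

    extern int write_dquot(int type);   /* hypothetical helper */

    static int writeback_dquots_sketch(int ntypes)
    {
            int err = 0;
            int type;

            for (type = 0; type < ntypes; type++) {
                    int ret = write_dquot(type);

                    /* Buggy form: "ret = err;" -- discards ret, so the
                     * function can only ever return 0. Correct form: */
                    if (ret && !err)
                            err = ret;
            }
            return err;     /* first error seen, or 0 */
    }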
2014-11-14  ext3: Don't check quota format when there are no quota files  (Jan Kara)

commit 7938db449bbc55bbeb164bec7af406212e7e98f1 upstream.

The check whether quota format is set even though there are no quota files with journalled quota is pointless and it actually makes it impossible to turn off journalled quotas (as there's no way to unset journalled quota format). Just remove the check.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  nfsd4: fix crash on unknown operation number  (J. Bruce Fields)

commit 51904b08072a8bf2b9ed74d1bd7a5300a614471d upstream.

Unknown operation numbers are caught in nfsd4_decode_compound() which sets op->opnum to OP_ILLEGAL and op->status to nfserr_op_illegal. The error causes the main loop in nfsd4_proc_compound() to skip most processing. But nfsd4_proc_compound also peeks ahead at the next operation in one case and doesn't take similar precautions there.

Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  ext4: fix oops when loading block bitmap failed  (Jan Kara)

commit 599a9b77ab289d85c2d5c8607624efbe1f552b0f upstream.

When we fail to load the block bitmap in __ext4_new_inode() we will dereference a NULL pointer in ext4_journal_get_write_access(). So check for an error from ext4_read_block_bitmap().

Coverity-id: 989065
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
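A hedged sketch of the defensive pattern (the wrapper function and its error path are illustrative, not the exact __ext4_new_inode() code): the buffer head returned by ext4_read_block_bitmap() must be checked before being handed to the journalling layer.

    /* assumes fs/ext4/ext4.h for handle_t, ext4_group_t, etc. */
    static int use_bitmap_sketch(handle_t *handle, struct super_block *sb,
                                 ext4_group_t group)
    {
            struct buffer_head *bitmap_bh;
            int err;

            bitmap_bh = ext4_read_block_bitmap(sb, group);
            if (!bitmap_bh)
                    return -EIO;    /* don't pass NULL to the journal layer */

            err = ext4_journal_get_write_access(handle, bitmap_bh);
            brelse(bitmap_bh);
            return err;
    }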
2014-11-14  ext4: enable journal checksum when metadata checksum feature enabled  (Darrick J. Wong)

commit 98c1a7593fa355fda7f5a5940c8bf5326ca964ba upstream.

If metadata checksumming is turned on for the FS, we need to tell the journal to use checksumming too.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  ext4: fix overflow when updating superblock backups after resize  (Jan Kara)

commit 9378c6768e4fca48971e7b6a9075bc006eda981d upstream.

When there are no meta block groups, update_backups() will compute the backup block in 32-bit arithmetic, thus possibly overflowing the block number and corrupting the filesystem. OTOH, filesystems larger than 16 TB without meta block groups should be rare. Fix the problem by doing the counting in 64-bit arithmetic.

Coverity-id: 741252
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  ext4: check s_chksum_driver when looking for bg csum presence  (Darrick J. Wong)

commit 813d32f91333e4c33d5a19b67167c4bae42dae75 upstream.

Convert the ext4_has_group_desc_csum predicate to look for a checksum driver instead of the metadata_csum flag and change the bg checksum calculation function to look for GDT_CSUM before taking the crc16 path.

Without this patch, if we mount with ^uninit_bg,^metadata_csum and later metadata_csum gets turned on by accident, the block group checksum functions will incorrectly assume that checksumming is enabled (metadata_csum) but that crc16 should be used (!s_chksum_driver). This is totally wrong, so fix the predicate and the checksum formula selection.

(Granted, if the metadata_csum feature bit gets enabled on a live FS then something underhanded is going on, but we could at least avoid writing garbage into the on-disk fields.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  ext4: Replace open coded mdata csum feature to helper function  (Dmitry Monakhov)

commit 9aa5d32ba269bec0e7eaba2697a986a7b0bc8528 upstream.

Besides the fact that this replacement improves code readability, it also protects from errors caused by direct EXT4_SB(sb)->s_es manipulation, which may result in an attempt to use uninitialized csum machinery.

    #Testcase_BEGIN
    IMG=/dev/ram0
    MNT=/mnt
    mkfs.ext4 $IMG
    mount $IMG $MNT
    #Enable feature directly on disk, on mounted fs
    tune2fs -O metadata_csum $IMG
    # Provoke metadata update, likely results in an OOPS
    touch $MNT/test
    umount $MNT
    #Testcase_END

    # Replacement script
    @@
    expression E;
    @@
    - EXT4_HAS_RO_COMPAT_FEATURE(E, EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)
    + ext4_has_metadata_csum(E)

https://bugzilla.kernel.org/show_bug.cgi?id=82201

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  ext4: fix reservation overflow in ext4_da_write_begin  (Eric Sandeen)

commit 0ff8947fc5f700172b37cbca811a38eb9cb81e08 upstream.

Delalloc write journal reservations only reserve 1 credit, to update the inode if necessary. However, it may happen once in a filesystem's lifetime that a file will cross the 2G threshold, and require the LARGE_FILE feature to be set in the superblock as well, if it was not set already.

This overruns the transaction reservation, and can be demonstrated simply on any ext4 filesystem without the LARGE_FILE feature already set:

    dd if=/dev/zero of=testfile bs=1 seek=2147483646 count=1 \
        conv=notrunc of=testfile
    sync
    dd if=/dev/zero of=testfile bs=1 seek=2147483647 count=1 \
        conv=notrunc of=testfile

leads to:

    EXT4-fs: ext4_do_update_inode:4296: aborting transaction: error 28 in __ext4_handle_dirty_super
    EXT4-fs error (device loop0) in ext4_do_update_inode:4301: error 28
    EXT4-fs error (device loop0) in ext4_reserve_inode_write:4757: Readonly filesystem
    EXT4-fs error (device loop0) in ext4_dirty_inode:4876: error 28
    EXT4-fs error (device loop0) in ext4_da_write_end:2685: error 28

Adjust the number of credits based on whether the flag is already set, and whether the current write may extend past the LARGE_FILE limit.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  ext4: add ext4_iget_normal() which is to be used for dir tree lookups  (Theodore Ts'o)

commit f4bb2981024fc91b23b4d09a8817c415396dbabb upstream.

If there is a corrupted file system which has directory entries that point at reserved, metadata inodes, prohibit them from being used by treating them the same way we treat Boot Loader inodes --- that is, mark them to be bad inodes. This prohibits them from being opened, deleted, or modified via chmod, chown, utimes, etc.

In particular, this prevents a corrupted file system which has a directory entry which points at the journal inode from being deleted and its blocks released, after which point Much Hilarity Ensues.

Reported-by: Sami Liedes <sami.liedes@iki.fi>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  ext4: grab missed write_count for EXT4_IOC_SWAP_BOOT  (Dmitry Monakhov)

commit 3e67cfad22230ebed85c56cbe413876f33fea82b upstream.

Otherwise this provokes a complaint like the following:

    WARNING: CPU: 12 PID: 5795 at fs/ext4/ext4_jbd2.c:48 ext4_journal_check_start+0x4e/0xa0()
    Modules linked in: brd iTCO_wdt lpc_ich mfd_core igb ptp dm_mirror dm_region_hash dm_log dm_mod
    CPU: 12 PID: 5795 Comm: python Not tainted 3.17.0-rc2-00175-gae5344f #158
    Hardware name: Intel Corporation W2600CR/W2600CR, BIOS SE5C600.86B.99.99.x028.061320111235 06/13/2011
     0000000000000030 ffff8808116cfd28 ffffffff815c7dfc 0000000000000030
     0000000000000000 ffff8808116cfd68 ffffffff8106ce8c ffff8808116cfdc8
     ffff880813b16000 ffff880806ad6ae8 ffffffff81202008 0000000000000000
    Call Trace:
     [<ffffffff815c7dfc>] dump_stack+0x51/0x6d
     [<ffffffff8106ce8c>] warn_slowpath_common+0x8c/0xc0
     [<ffffffff81202008>] ? ext4_ioctl+0x9e8/0xeb0
     [<ffffffff8106ceda>] warn_slowpath_null+0x1a/0x20
     [<ffffffff8122867e>] ext4_journal_check_start+0x4e/0xa0
     [<ffffffff81228c10>] __ext4_journal_start_sb+0x90/0x110
     [<ffffffff81202008>] ext4_ioctl+0x9e8/0xeb0
     [<ffffffff8107b0bd>] ? ptrace_stop+0x24d/0x2f0
     [<ffffffff81088530>] ? alloc_pid+0x480/0x480
     [<ffffffff8107b1f2>] ? ptrace_do_notify+0x92/0xb0
     [<ffffffff81186545>] do_vfs_ioctl+0x4e5/0x550
     [<ffffffff815cdbcb>] ? _raw_spin_unlock_irq+0x2b/0x40
     [<ffffffff81186603>] SyS_ioctl+0x53/0x80
     [<ffffffff815ce2ce>] tracesys+0xd0/0xd5

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  ext4: fix mmap data corruption when blocksize < pagesize  (Jan Kara)

commit d6320cbfc92910a3e5f10c42d98c231c98db4f60 upstream.

Use truncate_isize_extended() when a hole is being created in a file so that ->page_mkwrite() will get called for the partial tail page if it is mmaped (see the first patch in the series for details).

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  ext4: don't check quota format when there are no quota files  (Jan Kara)

commit 279bf6d390933d5353ab298fcc306c391a961469 upstream.

The check whether quota format is set even though there are no quota files with journalled quota is pointless and it actually makes it impossible to turn off journalled quotas (as there's no way to unset journalled quota format). Just remove the check.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  ext4: check EA value offset when loading  (Darrick J. Wong)

commit a0626e75954078cfacddb00a4545dde821170bc5 upstream.

When loading extended attributes, check each entry's value offset to make sure it doesn't collide with the entries. Without this check it is easy to crash the kernel by mounting a malicious FS containing a file with an EA wherein e_value_offs = 0 and e_value_size > 0 and then deleting the EA, which corrupts the name list. (See the f_ea_value_crash test's FS image in e2fsprogs for an example.)

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  jbd2: free bh when descriptor block checksum fails  (Darrick J. Wong)

commit 064d83892e9ba547f7d4eae22cbca066d95210ce upstream.

Free the buffer head if the journal descriptor block fails checksum verification. This is the jbd2 port of the e2fsprogs patch "e2fsck: free bh on csum verify error in do_one_pass".

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  pstore: Fix duplicate {console,ftrace}-efi entries  (Valdis Kletnieks)

commit d4bf205da618bbd0b038e404d646f14e76915718 upstream.

The pstore filesystem still creates duplicate filename/inode pairs for some pstore types. Add the id to the filename to prevent that.

Before patch:

    [/sys/fs/pstore] ls -li
    total 0
    1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi
    1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi
    1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi
    1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi
    1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi
    1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi
    1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi
    1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi
    1250 -r--r--r--. 1 root root 67 Sep 29 17:09 console-efi

After:

    [/sys/fs/pstore] ls -li
    total 0
    1232 -r--r--r--. 1 root root 148 Sep 29 17:09 console-efi-141202499100000
    1231 -r--r--r--. 1 root root  67 Sep 29 17:09 console-efi-141202499200000
    1230 -r--r--r--. 1 root root 148 Sep 29 17:44 console-efi-141202705400000
    1229 -r--r--r--. 1 root root  67 Sep 29 17:44 console-efi-141202705500000
    1228 -r--r--r--. 1 root root  67 Sep 29 20:42 console-efi-141203772600000
    1227 -r--r--r--. 1 root root 148 Sep 29 23:42 console-efi-141204854900000
    1226 -r--r--r--. 1 root root  67 Sep 29 23:42 console-efi-141204855000000
    1225 -r--r--r--. 1 root root 148 Sep 29 23:59 console-efi-141204954200000
    1224 -r--r--r--. 1 root root  67 Sep 29 23:59 console-efi-141204954400000

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  mnt: Prevent pivot_root from creating a loop in the mount tree  (Eric W. Biederman)

commit 0d0826019e529f21c84687521d03f60cd241ca7d upstream.

Andy Lutomirski recently demonstrated that when chroot is used to set the root path below the path for the new ``root'' passed to pivot_root, the pivot_root system call succeeds and leaks mounts.

In examining the code I see that starting with a new root that is below the current root in the mount tree will result in a loop in the mount tree after the mounts are detached and then reattached to one another, resulting in all kinds of ugliness including a leak of the mounts involved in the mount loop.

Prevent this problem by ensuring that the new mount is reachable from the current root of the mount tree.

[Added stable cc. Fixes CVE-2014-7970. --Andy]

Reported-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/87bnpmihks.fsf@x220.int.ebiederm.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  kill wbuf_queued/wbuf_dwork_lock  (Al Viro)

commit 99358a1ca53e8e6ce09423500191396f0e6584d2 upstream.

schedule_delayed_work() happening when the work is already pending is a cheap no-op. Don't bother with ->wbuf_queued logics - it's both broken (cancelling ->wbuf_dwork leaves it set, as spotted by Jeff Harris) and pointless. It's cheaper to let schedule_delayed_work() handle that case.

Reported-by: Jeff Harris <jefftharris@gmail.com>
Tested-by: Jeff Harris <jefftharris@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
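The point generalizes: schedule_delayed_work() already returns false and does nothing when the work is pending, so a hand-rolled "queued" flag only adds state that can drift out of sync. A minimal sketch of the simplification, with an illustrative struct (not the actual jffs2 code):

    #include <linux/workqueue.h>

    struct wbuf_sketch {
            struct delayed_work dwork;
            int queued;     /* the flag this kind of patch deletes */
    };

    /* before: racy guard, and a cancel elsewhere leaves 'queued' set */
    static void queue_wbuf_old(struct wbuf_sketch *w, unsigned long delay)
    {
            if (!w->queued) {
                    w->queued = 1;
                    schedule_delayed_work(&w->dwork, delay);
            }
    }

    /* after: the work API's own pending check does the job */
    static void queue_wbuf_new(struct wbuf_sketch *w, unsigned long delay)
    {
            schedule_delayed_work(&w->dwork, delay);
    }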
2014-11-14  missing data dependency barrier in prepend_name()  (Al Viro)

commit 6d13f69444bd3d4888e43f7756449748f5a98bad upstream.

AFAICS, prepend_name() is broken on SMP alpha. Disclaimer: I don't have SMP alpha boxen to reproduce it on. However, it really looks like the race is real.

    CPU1: d_path() on /mnt/ramfs/<255-character>/foo
    CPU2: mv /mnt/ramfs/<255-character> /mnt/ramfs/<63-character>

CPU2 does d_alloc(), which allocates an external name, stores the name there including terminating NUL, does smp_wmb() and stores its address in dentry->d_name.name. It proceeds to d_add(dentry, NULL) and d_move() old dentry over to that. ->d_name.name value ends up in that dentry.

In the meanwhile, CPU1 gets to prepend_name() for that dentry. It fetches ->d_name.name and ->d_name.len; the former ends up pointing to new name (64-byte kmalloc'ed array), the latter - 255 (length of the old name). Nothing to force the ordering there, and normally that would be OK, since we'd run into the terminating NUL and stop. Except that it's alpha, and we'd need a data dependency barrier to guarantee that we see that store of NUL __d_alloc() has done.

In a similar situation dentry_cmp() would survive; it does explicit smp_read_barrier_depends() after fetching ->d_name.name. prepend_name() doesn't and it risks walking past the end of kmalloc'ed object and possibly oops due to taking a page fault in kernel mode.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
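A hedged sketch of the reader-side pattern, in the era-appropriate API (illustrative, not the exact prepend_name() fix): the data-dependency barrier pairs with the publisher's smp_wmb() so the bytes behind the freshly fetched pointer, including the terminating NUL, are guaranteed visible before they are read.

    #include <linux/compiler.h>
    #include <linux/string.h>

    /* prepend the dentry's name in front of 'end', returning the new start */
    static char *prepend_name_sketch(char *end, const struct dentry *dentry)
    {
            const char *name = ACCESS_ONCE(dentry->d_name.name);
            size_t len;

            /* pairs with the smp_wmb() in __d_alloc(): without this,
             * Alpha may see the new pointer but stale bytes behind it */
            smp_read_barrier_depends();

            len = strlen(name);     /* safe: the NUL is now visible */
            end -= len;
            memcpy(end, name, len);
            return end;
    }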
2014-11-14  vfs: fix data corruption when blocksize < pagesize for mmaped data  (Jan Kara)

commit 90a8020278c1598fafd071736a0846b38510309c upstream.

->page_mkwrite() is used by filesystems to allocate blocks under a page which is becoming writeably mmapped in some process' address space. This allows a filesystem to return a page fault if there is not enough space available, user exceeds quota or similar problem happens, rather than silently discarding data later when writepage is called.

However VFS fails to call ->page_mkwrite() in all the cases where filesystems need it when blocksize < pagesize. For example when blocksize = 1024, pagesize = 4096 the following is problematic:

    ftruncate(fd, 0);
    pwrite(fd, buf, 1024, 0);
    map = mmap(NULL, 1024, PROT_WRITE, MAP_SHARED, fd, 0);
    map[0] = 'a';          ----> page_mkwrite() for index 0 is called
    ftruncate(fd, 10000);  /* or even pwrite(fd, buf, 1, 10000) */
    mremap(map, 1024, 10000, 0);
    map[4095] = 'a';       ----> no page_mkwrite() called

At the moment ->page_mkwrite() is called, filesystem can allocate only one block for the page because i_size == 1024. Otherwise it would create blocks beyond i_size which is generally undesirable. But later at ->writepage() time, we also need to store data at offset 4095 but we don't have block allocated for it.

This patch introduces a helper function filesystems can use to have ->page_mkwrite() called at all the necessary moments.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
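The sequence above, expanded into a self-contained userspace reproducer (assumes a filesystem with blocksize 1024 and pagesize 4096; error handling kept minimal):

    #define _GNU_SOURCE             /* for mremap() */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            static char buf[1024];
            char *map;
            int fd;

            memset(buf, 'x', sizeof(buf));
            fd = open("testfile", O_RDWR | O_CREAT, 0644);
            if (fd < 0)
                    return 1;

            ftruncate(fd, 0);
            pwrite(fd, buf, 1024, 0);
            map = mmap(NULL, 1024, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (map == MAP_FAILED)
                    return 1;
            map[0] = 'a';           /* page_mkwrite() for index 0 */

            ftruncate(fd, 10000);   /* extend past the first page */
            map = mremap(map, 1024, 10000, 0);  /* grow in place */
            if (map == MAP_FAILED)
                    return 1;
            map[4095] = 'a';        /* before the fix: no page_mkwrite() */
            return 0;
    }

Before the fix, the store at offset 4095 dirties a page tail for which the filesystem was never given a chance to allocate a block.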
2014-11-14  UBIFS: fix free log space calculation  (Artem Bityutskiy)

commit ba29e721eb2df6df8f33c1f248388bb037a47914 upstream.

Hu (hujianyang <hujianyang@huawei.com>) discovered an issue in the 'empty_log_bytes()' function, which calculates how many bytes are left in the log:

    "If 'c->lhead_lnum + 1 == c->ltail_lnum' and 'c->lhead_offs ==
    c->leb_size', 'h' would be equal to 't' and 'empty_log_bytes()'
    would return 'c->log_bytes' instead of 0."

At this point it is not clear what would be the consequences of this, and whether this may lead to any problems, but this patch addresses the issue just in case.

Tested-by: hujianyang <hujianyang@huawei.com>
Reported-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
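A hedged reconstruction of the arithmetic in question, with the ambiguous head == tail case disambiguated (built from the description above, not necessarily the exact upstream diff):

    static long long empty_log_bytes_sketch(const struct ubifs_info *c)
    {
            long long h, t;

            h = (long long)c->lhead_lnum * c->leb_size + c->lhead_offs;
            t = (long long)c->ltail_lnum * c->leb_size;

            if (h > t)
                    return c->log_bytes + t - h;
            else if (h != t)
                    return t - h;
            else if (c->lhead_lnum != c->ltail_lnum)
                    return 0;               /* head wrapped onto the tail */
            else
                    return c->log_bytes;    /* head == tail: log is empty */
    }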
2014-11-14  UBIFS: fix a race condition  (Artem Bityutskiy)

commit 052c28073ff26f771d44ef33952a41d18dadd255 upstream.

Hu (hujianyang@huawei.com) discovered a race condition which may lead to a situation when UBIFS is unable to mount the file-system after an unclean reboot. The problem is theoretical, though.

In UBIFS, we have the log, which is basically a set of LEBs in a certain area. The log has a tail and a head. Every time the user writes data to the file-system, the UBIFS journal grows, and the log grows as well, because we append new reference nodes to the head of the log. So the head moves forward all the time, while the log tail stays at the same position. At any time, the UBIFS master node points to the tail of the log. When we mount the file-system, we scan the log, and we always start from its tail, because this is where the master node points to. The only occasion when the tail of the log changes is the commit operation.

The commit operation has 2 phases - "commit start" and "commit end". The former is relatively short, and does not involve much I/O. During this phase we mostly just build various in-memory lists of the things which have to be written to the flash media during the "commit end" phase.

During the commit start phase, what we do is we "clean" the log. Indeed, the commit operation will index all the data in the journal, so the entire journal "disappears", and therefore the data in the log becomes unneeded. So we just move the head of the log to the next LEB, and write the CS node there. This LEB will be the tail of the new log when the commit operation finishes.

When the "commit start" phase finishes, users may write more data to the file-system, in parallel with the ongoing "commit end" operation. At this point the log tail has not been changed yet; it is the same as it had been before we started the commit. The log head keeps moving forward, though. The commit operation now needs to write the new master node, and the new master node should point to the new log tail. After this the LEBs between the old log tail and the new log tail can be unmapped and re-used again.

And here is the possible problem. We do 2 operations:

(a) We first update the log tail position in memory (see 'ubifs_log_end_commit()').
(b) And then we write the master node (see the big block of code in 'do_commit()').

But nothing prevents the log head from moving forward between (a) and (b), and the log head may "wrap" now to the old log tail. And when the "wrap" happens, the contents of the log tail get erased. Now a power cut happens and we are in trouble. We end up with the old master node pointing to the old tail, which was erased. And replay fails because it expects the master node to point to the correct log tail at all times.

This patch merges the abovementioned (a) and (b) operations by moving the master node change code to the 'ubifs_log_end_commit()' function, so that it runs with the log mutex locked, which will prevent the log from being changed between operations (a) and (b).

Reported-by: hujianyang <hujianyang@huawei.com>
Tested-by: hujianyang <hujianyang@huawei.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  UBIFS: remove mst_mutex  (Artem Bityutskiy)

commit 07e19dff63e3d5d6500d831e36554ac9b1b0560e upstream.

The 'mst_mutex' is not needed because 'ubifs_write_master()' is only called on the mount path and the commit path. The mount path is sequential and there is no parallelism, and the commit path is also serialized - there is only one commit going on at a time.

Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  fs: allow open(dir, O_TMPFILE|..., 0) with mode 0  (Eric Rannaud)

commit 69a91c237ab0ebe4e9fdeaf6d0090c85275594ec upstream.

The man page for open(2) indicates that when O_CREAT is specified, the 'mode' argument applies only to future accesses to the file:

    Note that this mode applies only to future accesses of the newly
    created file; the open() call that creates a read-only file may
    well return a read/write file descriptor.

The man page for open(2) implies that 'mode' is treated identically by O_CREAT and O_TMPFILE. O_TMPFILE, however, behaves differently:

    int fd = open("/tmp", O_TMPFILE | O_RDWR, 0);
    assert(fd == -1);
    assert(errno == EACCES);

    int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);
    assert(fd > 0);

For O_CREAT, do_last() sets acc_mode to MAY_OPEN only:

    if (*opened & FILE_CREATED) {
            /* Don't check for write permission, don't truncate */
            open_flag &= ~O_TRUNC;
            will_truncate = false;
            acc_mode = MAY_OPEN;
            path_to_nameidata(path, nd);
            goto finish_open_created;
    }

But for O_TMPFILE, do_tmpfile() passes the full op->acc_mode to may_open(). This patch lines up the behavior of O_TMPFILE with O_CREAT: after the inode is created, may_open() is called with acc_mode = MAY_OPEN, in do_tmpfile().

A different, but related glibc bug revealed the discrepancy: https://sourceware.org/bugzilla/show_bug.cgi?id=17523

glibc lazily loads the 'mode' argument of open() and openat() using va_arg() only if O_CREAT is present in 'flags' (to support both the 2-argument and the 3-argument forms of open; same idea for openat()). However, glibc ignores the 'mode' argument if O_TMPFILE is in 'flags'.

On x86_64, for open(), it magically works anyway, as 'mode' is in RDX when entering open(), and is still in RDX on SYSCALL, which is where the kernel looks for the 3rd argument of a syscall. But openat() is not quite so lucky: 'mode' is in RCX when entering the glibc wrapper for openat(), while the kernel looks for the 4th argument of a syscall in R10. Indeed, the syscall calling convention differs from the regular calling convention in this respect on x86_64. So the kernel sees mode = 0 when trying to use glibc openat() with O_TMPFILE, and fails with EACCES.

Signed-off-by: Eric Rannaud <e@nanocritical.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14  fs: Fix theoretical division by 0 in super_cache_scan().  (Tetsuo Handa)

commit 475d0db742e3755c6b267f48577ff7cbb7dfda0d upstream.

total_objects could be 0 and is used as a denominator. While total_objects is a "long", total_objects == 0 is unlikely to happen for 3.12 and later kernels because 32-bit architectures would not be able to hold (1 << 32) objects. However, total_objects == 0 may happen for kernels between 3.1 and 3.11 because total_objects in prune_super() was an "int" and (e.g.) the x86_64 architecture might be able to hold (1 << 32) objects.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
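A minimal sketch of the guard (the SHRINK_STOP return value is my assumption about what the scan callback reports when there is nothing to do; the proportional split is illustrative):

    #include <linux/kernel.h>
    #include <linux/shrinker.h>

    static unsigned long cache_scan_sketch(struct shrink_control *sc,
                                           long total_objects,
                                           long fs_objects)
    {
            if (total_objects == 0)
                    return SHRINK_STOP;     /* avoid dividing by zero */

            /* proportional scan is safe now that the denominator != 0 */
            return mult_frac(sc->nr_to_scan, fs_objects, total_objects);
    }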
2014-11-14  fs: make cont_expand_zero interruptible  (Mikulas Patocka)

commit c2ca0fcd202863b14bd041a7fece2e789926c225 upstream.

This patch makes it possible to kill a process looping in cont_expand_zero. A process may spend a lot of time in this function, so it is desirable to be able to kill it.

It happened to me that I wanted to copy a piece of data from the disk to a file. By mistake, I used the "seek" parameter to dd instead of "skip". Due to the "seek" parameter, dd attempted to extend the file and became stuck doing so - the only possibility was to reset the machine or wait many hours until the filesystem ran out of space and cont_expand_zero failed. We need this patch to be able to terminate the process.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
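A hedged sketch of the loop shape (the zero_one_page() helper is hypothetical; the signal check is the essence of this kind of fix): testing for a fatal signal on each iteration is what makes a long zero-fill killable.

    /* inside a cont_expand_zero()-like loop over page indexes */
    while (index < end_index) {
            if (fatal_signal_pending(current))
                    return -EINTR;          /* let the process die */

            err = zero_one_page(mapping, index++);  /* hypothetical */
            if (err)
                    return err;

            balance_dirty_pages_ratelimited(mapping);
    }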
2014-11-14  lockd: Try to reconnect if statd has moved  (Benjamin Coddington)

commit 173b3afceebe76fa2205b2c8808682d5b541fe3c upstream.

If rpc.statd is restarted, upcalls to monitor hosts can fail with ECONNREFUSED. In that case force a lookup of statd's new port and retry the upcall.

Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30  xfs: ensure WB_SYNC_ALL writeback handles partial pages correctly  (Dave Chinner)

commit 0d085a529b427d97710e6a41f8a4f23e1757cd12 upstream.

XFS has been having trouble with stray delayed allocation extents beyond EOF for a long time. Recent changes to the collapse range code have triggered erroneous EBUSY errors on page invalidation for filesystems with block size smaller than page size. These have been caused by dirty buffers beyond EOF on a partial page which do not get written to disk during a sync.

The issue is that write-ahead in xfs_cluster_write() finds such a partial page and handles it by leaving the page dirty but pushing it into a writeback state. This used to work just fine, as the write_cache_pages() code would then find the dirty partial page in the next mapping tree lookup as the dirty tag is still set.

Unfortunately, when we moved to a mark and sweep approach to writeback to fix other writeback sync issues, we broke this. The act of marking the page as under writeback now clears the TOWRITE tag in the radix tree, even though the page is still dirty. This causes the TOWRITE tag to be cleared, and hence the next lookup on the mapping tree does not find the dirty partial page and so doesn't try to write it again.

This same writeback bug was found recently in ext4 and fixed in commit 1c8349a ("ext4: fix data integrity sync in ordered mode") without communication to the wider filesystem community. We can use exactly the same fix here so the TOWRITE flag is not cleared on partial page writes.

cc: stable@vger.kernel.org # dependent on 1c8349a17137b93f0a83f276c764a6df1b9a116e
Root-cause-found-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30  ecryptfs: avoid to access NULL pointer when write metadata in xattr  (Chao Yu)

commit 35425ea2492175fd39f6116481fe98b2b3ddd4ca upstream.

Christopher Head 2014-06-28 05:26:20 UTC described:

"I tried to reproduce this on 3.12.21. Instead, when I do "echo hello > foo" in an ecryptfs mount with ecryptfs_xattr specified, I get a kernel crash:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [<ffffffff8110eb39>] fsstack_copy_attr_all+0x2/0x61
    PGD d7840067 PUD b2c3c067 PMD 0
    Oops: 0002 [#1] SMP
    Modules linked in: nvidia(PO)
    CPU: 3 PID: 3566 Comm: bash Tainted: P O 3.12.21-gentoo-r1 #2
    Hardware name: ASUSTek Computer Inc. G60JX/G60JX, BIOS 206 03/15/2010
    task: ffff8801948944c0 ti: ffff8800bad70000 task.ti: ffff8800bad70000
    RIP: 0010:[<ffffffff8110eb39>] [<ffffffff8110eb39>] fsstack_copy_attr_all+0x2/0x61
    RSP: 0018:ffff8800bad71c10 EFLAGS: 00010246
    RAX: 00000000000181a4 RBX: ffff880198648480 RCX: 0000000000000000
    RDX: 0000000000000004 RSI: ffff880172010450 RDI: 0000000000000000
    RBP: ffff880198490e40 R08: 0000000000000000 R09: 0000000000000000
    R10: ffff880172010450 R11: ffffea0002c51e80 R12: 0000000000002000
    R13: 000000000000001a R14: 0000000000000000 R15: ffff880198490e40
    FS: 00007ff224caa700(0000) GS:ffff88019fcc0000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 00000000bb07f000 CR4: 00000000000007e0
    Stack:
     ffffffff811826e8 ffff8800a39d8000 0000000000000000 000000000000001a
     ffff8800a01d0000 ffff8800a39d8000 ffffffff81185fd5 ffffffff81082c2c
     00000001a39d8000 53d0abbc98490e40 0000000000000037 ffff8800a39d8220
    Call Trace:
     [<ffffffff811826e8>] ? ecryptfs_setxattr+0x40/0x52
     [<ffffffff81185fd5>] ? ecryptfs_write_metadata+0x1b3/0x223
     [<ffffffff81082c2c>] ? should_resched+0x5/0x23
     [<ffffffff8118322b>] ? ecryptfs_initialize_file+0xaf/0xd4
     [<ffffffff81183344>] ? ecryptfs_create+0xf4/0x142
     [<ffffffff810f8c0d>] ? vfs_create+0x48/0x71
     [<ffffffff810f9c86>] ? do_last.isra.68+0x559/0x952
     [<ffffffff810f7ce7>] ? link_path_walk+0xbd/0x458
     [<ffffffff810fa2a3>] ? path_openat+0x224/0x472
     [<ffffffff810fa7bd>] ? do_filp_open+0x2b/0x6f
     [<ffffffff81103606>] ? __alloc_fd+0xd6/0xe7
     [<ffffffff810ee6ab>] ? do_sys_open+0x65/0xe9
     [<ffffffff8157d022>] ? system_call_fastpath+0x16/0x1b
    RIP [<ffffffff8110eb39>] fsstack_copy_attr_all+0x2/0x61
     RSP <ffff8800bad71c10>
    CR2: 0000000000000000
    ---[ end trace df9dba5f1ddb8565 ]---"

If we create a file when we mount with the ecryptfs_xattr_metadata option, we will encounter a crash in this path:

    ->ecryptfs_create
      ->ecryptfs_initialize_file
        ->ecryptfs_write_metadata
          ->ecryptfs_write_metadata_to_xattr
            ->ecryptfs_setxattr
              ->fsstack_copy_attr_all

It's because our dentry->d_inode used in fsstack_copy_attr_all is NULL, and it will be initialized when ecryptfs_initialize_file finishes. So we should skip copying attrs from the lower inode when the value of ->d_inode is invalid.

Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Tyler Hicks <tyhicks@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30  fanotify: enable close-on-exec on events' fd when requested in fanotify_init()  (Yann Droneaud)

commit 0b37e097a648aa71d4db1ad108001e95b69a2da4 upstream.

According to commit 80af258867648 ("fanotify: groups can specify their f_flags for new fd"), file descriptors created as part of file access notification events inherit flags from the event_f_flags argument passed to syscall fanotify_init(2) [1]. Unfortunately O_CLOEXEC is currently silently ignored.

Indeed, event_f_flags are only given to dentry_open(), which only seems to care about O_ACCMODE and O_PATH in do_dentry_open(), O_DIRECT in open_check_o_direct() and O_LARGEFILE in generic_file_open().

It's a pity, since, according to some lookups on various search engines and http://codesearch.debian.net/, there's already some userspace code which uses O_CLOEXEC:

- in systemd's readahead [2]:

    fanotify_fd = fanotify_init(FAN_CLOEXEC|FAN_NONBLOCK,
                                O_RDONLY|O_LARGEFILE|O_CLOEXEC|O_NOATIME);

- in clsync [3]:

    #define FANOTIFY_EVFLAGS (O_LARGEFILE|O_RDONLY|O_CLOEXEC)
    int fanotify_d = fanotify_init(FANOTIFY_FLAGS, FANOTIFY_EVFLAGS);

- in examples [4] from the "Filesystem monitoring in the Linux kernel" article [5] by Aleksander Morgado:

    if ((fanotify_fd = fanotify_init(FAN_CLOEXEC,
                                     O_RDONLY | O_CLOEXEC | O_LARGEFILE)) < 0)

Additionally, since commit 48149e9d3a7e ("fanotify: check file flags passed in fanotify_init"), having O_CLOEXEC as part of fanotify_init()'s second argument is expressly allowed. So it seems expected to set the close-on-exec flag on the file descriptors if userspace is allowed to request it with O_CLOEXEC.

But Andrew Morton raised [6] the concern that enabling close-on-exec now might break existing applications which ask for O_CLOEXEC but expect the file descriptor to be inherited across exec(). On the other hand, as reported by Mihai Dontu [7], the lack of close-on-exec on the file descriptor returned as part of file access notification can break applications due to deadlock. So close-on-exec is needed for most applications.

Moreover, applications asking for close-on-exec are likely expecting it to be enabled, relying on O_CLOEXEC being effective. If not, it might weaken their security, as noted by Jan Kara [8].

So this patch replaces the call to macro get_unused_fd() by a call to function get_unused_fd_flags() with the event_f_flags value as argument. This way the O_CLOEXEC flag in the second argument of the fanotify_init(2) syscall is interpreted and close-on-exec gets enabled when requested.

[1] http://man7.org/linux/man-pages/man2/fanotify_init.2.html
[2] http://cgit.freedesktop.org/systemd/systemd/tree/src/readahead/readahead-collect.c?id=v208#n294
[3] https://github.com/xaionaro/clsync/blob/v0.2.1/sync.c#L1631
    https://github.com/xaionaro/clsync/blob/v0.2.1/configuration.h#L38
[4] http://www.lanedo.com/~aleksander/fanotify/fanotify-example.c
[5] http://www.lanedo.com/2013/filesystem-monitoring-linux-kernel/
[6] http://lkml.kernel.org/r/20141001153621.65e9258e65a6167bf2e4cb50@linux-foundation.org
[7] http://lkml.kernel.org/r/20141002095046.3715eb69@mdontu-l
[8] http://lkml.kernel.org/r/20141002104410.GB19748@quack.suse.cz

Link: http://lkml.kernel.org/r/cover.1411562410.git.ydroneaud@opteya.com
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Tested-by: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Mihai Dontu <mihai.dontu@gmail.com>
Cc: Pádraig Brady <P@draigBrady.com>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Michael Kerrisk-manpages <mtk.manpages@gmail.com>
Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Cc: Richard Guy Briggs <rgb@redhat.com>
Cc: Eric Paris <eparis@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
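The mechanical shape of the fix, sketched (I am assuming the flags are stashed in group->fanotify_data.f_flags, per the "groups can specify their f_flags for new fd" commit cited above):

    /* before: event_f_flags ignored when allocating the descriptor */
    client_fd = get_unused_fd();

    /* after: O_CLOEXEC (and friends) honored for the fd itself */
    client_fd = get_unused_fd_flags(group->fanotify_data.f_flags);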
2014-10-30  NFSv4.1: Fix an NFSv4.1 state renewal regression  (Andy Adamson)

commit d1f456b0b9545f1606a54cd17c20775f159bd2ce upstream.

Commit 2f60ea6b8ced ("NFSv4: The NFSv4.0 client must send RENEW calls if it holds a delegation") set the NFS4_RENEW_TIMEOUT flag in nfs4_renew_state, and does not put an nfs41_proc_async_sequence call (the NFSv4.1 lease renewal heartbeat call) on the wire to renew the NFSv4.1 state if the flag was not set.

The NFS4_RENEW_TIMEOUT flag is set when "now" is after the last renewal (cl_last_renewal) plus the lease time divided by 3. This is arbitrary and sometimes does the following:

In normal operation, the only way a future state renewal call is put on the wire is via a call to nfs4_schedule_state_renewal, which schedules a nfs4_renew_state workqueue task. nfs4_renew_state determines if the NFS4_RENEW_TIMEOUT flag should be set, and then calls nfs41_proc_async_sequence, which only gets sent if the NFS4_RENEW_TIMEOUT flag is set. Then the nfs41_proc_async_sequence rpc_release function schedules another state renewal via nfs4_schedule_state_renewal.

Without this change we can get into a state where an application stops accessing the NFSv4.1 share, and state renewal calls stop due to the NFS4_RENEW_TIMEOUT flag _not_ being set. The only way to recover from this situation is with a clientid re-establishment, once the application resumes and the server has timed out the lease and so returns NFS4ERR_BAD_SESSION on the subsequent SEQUENCE operation.

An example application:

    open, lock, write a file
    sleep for 6 * lease (could be less)
    unlock, close

In the above example with NFSv4.1 delegations enabled, without this change, there are no OP_SEQUENCE state renewal calls during the sleep, and the clientid is recovered due to lease expiration on the close. This issue does not occur with NFSv4.1 delegations disabled, nor with NFSv4.0, with or without delegations enabled.

Signed-off-by: Andy Adamson <andros@netapp.com>
Link: http://lkml.kernel.org/r/1411486536-23401-1-git-send-email-andros@netapp.com
Fixes: 2f60ea6b8ced (NFSv4: The NFSv4.0 client must send RENEW calls...)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30  NFSv4: fix open/lock state recovery error handling  (Trond Myklebust)

commit df817ba35736db2d62b07de6f050a4db53492ad8 upstream.

The current open/lock state recovery unfortunately does not handle errors such as NFS4ERR_CONN_NOT_BOUND_TO_SESSION correctly. Instead of looping, it just proceeds as if the state manager is finished recovering. This patch ensures that we loop back, handle higher priority errors and complete the open/lock state recovery.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30  NFSv4: Fix lock recovery when CREATE_SESSION/SETCLIENTID_CONFIRM fails  (Trond Myklebust)

commit a4339b7b686b4acc8b6de2b07d7bacbe3ae44b83 upstream.

If a NFSv4.x server returns NFS4ERR_STALE_CLIENTID in response to a CREATE_SESSION or SETCLIENTID_CONFIRM in order to tell us that it rebooted a second time, then the client will currently take this to mean that it must declare all locks to be stale, and hence ineligible for reboot recovery.

RFC3530 and RFC5661 both suggest that the client should instead rely on the server to respond to ineligible open share, lock and delegation reclaim requests with NFS4ERR_NO_GRACE in this situation.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30  Fixing lease renewal  (Olga Kornievskaia)

commit 8faaa6d5d48b201527e0451296d9e71d23afb362 upstream.

Commit c9fdeb28 removed a 'continue' after checking if the lease needs to be renewed. However, if the client hasn't moved, the code falls through to starting reboot recovery erroneously (i.e., sends open reclaim and gets back a stale_clientid error) before recovering from getting stale_clientid on the renew operation.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Fixes: c9fdeb280b8c (NFS: Add basic migration support to state manager thread)
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30  Btrfs: send, fix data corruption due to incorrect hole detection  (Filipe Manana)

commit 766b5e5ae78dd04a93a275690a49e23d7dcb1f39 upstream.

During an incremental send, when we finish processing an inode (corresponding to a regular file) we would assume the gap between the end of the last processed file extent and the file's size corresponded to a file hole, and therefore incorrectly send a bunch of zero bytes to overwrite that region in the file. This affects only kernel 3.14.

Reproducer:

    mkfs.btrfs -f /dev/sdc
    mount /dev/sdc /mnt
    xfs_io -f -c "falloc -k 0 268435456" /mnt/foo
    btrfs subvolume snapshot -r /mnt /mnt/mysnap0
    xfs_io -c "pwrite -S 0x01 -b 9216 16190218 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x02 -b 1121 198720104 1121" /mnt/foo
    xfs_io -c "pwrite -S 0x05 -b 9216 107887439 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x06 -b 9216 225520207 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x07 -b 67584 102138300 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x08 -b 7000 94897484 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x09 -b 113664 245083212 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x10 -b 123 17937788 123" /mnt/foo
    xfs_io -c "pwrite -S 0x11 -b 39936 229573311 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x12 -b 67584 174792222 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x13 -b 9216 249253213 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x16 -b 67584 150046083 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x17 -b 39936 118246040 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x18 -b 67584 215965442 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x19 -b 33792 97096725 33792" /mnt/foo
    xfs_io -c "pwrite -S 0x20 -b 125952 166300596 125952" /mnt/foo
    xfs_io -c "pwrite -S 0x21 -b 123 1078957 123" /mnt/foo
    xfs_io -c "pwrite -S 0x25 -b 9216 212044492 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x26 -b 7000 265037146 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x27 -b 42757 215922685 42757" /mnt/foo
    xfs_io -c "pwrite -S 0x28 -b 7000 69865411 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x29 -b 67584 67948958 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x30 -b 39936 266967019 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x31 -b 1121 19582453 1121" /mnt/foo
    xfs_io -c "pwrite -S 0x32 -b 17408 257710255 17408" /mnt/foo
    xfs_io -c "pwrite -S 0x33 -b 39936 3895518 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x34 -b 125952 12045847 125952" /mnt/foo
    xfs_io -c "pwrite -S 0x35 -b 17408 19156379 17408" /mnt/foo
    xfs_io -c "pwrite -S 0x36 -b 39936 50160066 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x37 -b 113664 9549793 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x38 -b 105472 94391506 105472" /mnt/foo
    xfs_io -c "pwrite -S 0x39 -b 23552 143632863 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x40 -b 39936 241283845 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x41 -b 113664 199937606 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x42 -b 67584 67380093 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x43 -b 67584 26793129 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x44 -b 39936 14421913 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x45 -b 123 253097405 123" /mnt/foo
    xfs_io -c "pwrite -S 0x46 -b 1121 128233424 1121" /mnt/foo
    xfs_io -c "pwrite -S 0x47 -b 105472 91577959 105472" /mnt/foo
    xfs_io -c "pwrite -S 0x48 -b 1121 7245381 1121" /mnt/foo
    xfs_io -c "pwrite -S 0x49 -b 113664 182414694 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x50 -b 9216 32750608 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x51 -b 67584 266546049 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x52 -b 67584 87969398 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x53 -b 9216 260848797 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x54 -b 39936 119461243 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x55 -b 7000 200178693 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x56 -b 9216 243316029 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x57 -b 7000 209658229 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x58 -b 101376 179745192 101376" /mnt/foo
    xfs_io -c "pwrite -S 0x59 -b 9216 64012300 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x60 -b 125952 181705139 125952" /mnt/foo
    xfs_io -c "pwrite -S 0x61 -b 23552 235737348 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x62 -b 113664 106021355 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x63 -b 67584 135753552 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x64 -b 23552 95730888 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x65 -b 11 17311415 11" /mnt/foo
    xfs_io -c "pwrite -S 0x66 -b 33792 120695553 33792" /mnt/foo
    xfs_io -c "pwrite -S 0x67 -b 9216 17164631 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x68 -b 9216 136065853 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x69 -b 67584 37752198 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x70 -b 101376 189717473 101376" /mnt/foo
    xfs_io -c "pwrite -S 0x71 -b 7000 227463698 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x72 -b 9216 12655137 9216" /mnt/foo
    xfs_io -c "pwrite -S 0x73 -b 7000 7488866 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x74 -b 113664 87813649 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x75 -b 33792 25802183 33792" /mnt/foo
    xfs_io -c "pwrite -S 0x76 -b 39936 93524024 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x77 -b 33792 113336388 33792" /mnt/foo
    xfs_io -c "pwrite -S 0x78 -b 105472 184955320 105472" /mnt/foo
    xfs_io -c "pwrite -S 0x79 -b 101376 225691598 101376" /mnt/foo
    xfs_io -c "pwrite -S 0x80 -b 23552 77023155 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x81 -b 11 201888192 11" /mnt/foo
    xfs_io -c "pwrite -S 0x82 -b 11 115332492 11" /mnt/foo
    xfs_io -c "pwrite -S 0x83 -b 67584 230278015 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x84 -b 11 120589073 11" /mnt/foo
    xfs_io -c "pwrite -S 0x85 -b 125952 202207819 125952" /mnt/foo
    xfs_io -c "pwrite -S 0x86 -b 113664 86672080 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x87 -b 17408 208459603 17408" /mnt/foo
    xfs_io -c "pwrite -S 0x88 -b 7000 73372211 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x89 -b 7000 42252122 7000" /mnt/foo
    xfs_io -c "pwrite -S 0x90 -b 23552 46784881 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x91 -b 101376 63172351 101376" /mnt/foo
    xfs_io -c "pwrite -S 0x92 -b 23552 59341931 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x93 -b 39936 239599283 39936" /mnt/foo
    xfs_io -c "pwrite -S 0x94 -b 67584 175643105 67584" /mnt/foo
    xfs_io -c "pwrite -S 0x97 -b 23552 105534880 23552" /mnt/foo
    xfs_io -c "pwrite -S 0x98 -b 113664 8236844 113664" /mnt/foo
    xfs_io -c "pwrite -S 0x99 -b 125952 144489686 125952" /mnt/foo
    xfs_io -c "pwrite -S 0xa0 -b 7000 73273112 7000" /mnt/foo
    xfs_io -c "pwrite -S 0xa1 -b 125952 194580243 125952" /mnt/foo
    xfs_io -c "pwrite -S 0xa2 -b 123 56296779 123" /mnt/foo
    xfs_io -c "pwrite -S 0xa3 -b 11 233066845 11" /mnt/foo
    xfs_io -c "pwrite -S 0xa4 -b 39936 197727090 39936" /mnt/foo
    xfs_io -c "pwrite -S 0xa5 -b 101376 53579812 101376" /mnt/foo
    xfs_io -c "pwrite -S 0xa6 -b 9216 85669738 9216" /mnt/foo
    xfs_io -c "pwrite -S 0xa7 -b 125952 21266322 125952" /mnt/foo
    xfs_io -c "pwrite -S 0xa8 -b 23552 125726568 23552" /mnt/foo
    xfs_io -c "pwrite -S 0xa9 -b 9216 18423680 9216" /mnt/foo
    xfs_io -c "pwrite -S 0xb0 -b 1121 165901483 1121" /mnt/foo
    btrfs subvolume snapshot -r /mnt /mnt/mysnap1
    xfs_io -c "pwrite -S 0xff -b 10 16190218 10" /mnt/foo
    btrfs subvolume snapshot -r /mnt /mnt/mysnap2
    md5sum /mnt/foo            # returns 79e53f1466bfc09fd82b450689e6119e
    md5sum /mnt/mysnap2/foo    # returns 79e53f1466bfc09fd82b450689e6119e too
    btrfs send /mnt/mysnap1 -f /tmp/1.snap
    btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/2.snap
    mkfs.btrfs -f /dev/sdc
    mount /dev/sdc /mnt
    btrfs receive /mnt -f /tmp/1.snap
    btrfs receive /mnt -f /tmp/2.snap
    md5sum /mnt/mysnap2/foo    # returns 2bb414c5155767cedccd7063e51beabd !!

A testcase for xfstests follows soon too.

Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30  fs: Add a missing permission check to do_umount  (Andy Lutomirski)

commit a1480dcc3c706e309a88884723446f2e84fedd5b upstream.

Accessing do_remount_sb should require global CAP_SYS_ADMIN, but only one of the two call sites was appropriately protected. Fixes CVE-2014-7975.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
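A hedged sketch of where the check lands (the surrounding branch is my reconstruction of the do_umount() "remount root read-only" path of that era, not a verbatim diff): this path reaches do_remount_sb(), so it needs the same privilege test the mount(2) remount path already performs.

    /* inside do_umount(), sketching the "umount -r /" style branch */
    if (&mnt->mnt == current->fs->root.mnt && !(flags & MNT_DETACH)) {
            if (!capable(CAP_SYS_ADMIN))    /* the added check */
                    return -EPERM;

            down_write(&sb->s_umount);
            if (!(sb->s_flags & MS_RDONLY))
                    retval = do_remount_sb(sb, MS_RDONLY, NULL, 0);
            up_write(&sb->s_umount);
            return retval;
    }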
2014-10-30  Btrfs: fix race in WAIT_SYNC ioctl  (Sage Weil)

commit 42383020beb1cfb05f5d330cc311931bc4917a97 upstream.

We check whether transid is already committed via last_trans_committed and then search through trans_list for pending transactions. If last_trans_committed is updated by btrfs_commit_transaction after we check it (there is no locking), we will fail to find the committed transaction and return EINVAL to the caller. This has been observed occasionally by ceph-osd (which uses this ioctl heavily).

Fix by rechecking whether the provided transid <= last_trans_committed after the search fails, and if so return 0.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
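A hedged sketch of the recheck (field names approximate; locking elided):

    struct btrfs_transaction *t;
    int ret = -EINVAL;

    list_for_each_entry(t, &fs_info->trans_list, list) {
            if (t->transid == transid) {
                    ret = 0;        /* found: wait for this commit */
                    break;
            }
    }

    /* Not in the list: it may have committed while we searched. */
    if (ret == -EINVAL && transid <= fs_info->last_trans_committed)
            ret = 0;                /* already durable, report success */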
2014-10-30  Btrfs: fix build_backref_tree issue with multiple shared blocks  (Josef Bacik)

commit bbe9051441effce51c9a533d2c56440df64db2d7 upstream.

Marc Merlin sent me a broken fs image months ago where it would blow up in the upper->checked BUG_ON() in build_backref_tree. This is because we had a scenario like this:

    block a -- level 4 (not shared)
       |
    block b -- level 3 (reloc block, shared)
       |
    block c -- level 2 (not shared)
       |
    block d -- level 1 (shared)
       |
    block e -- level 0 (shared)

We go to build a backref tree for block e, we notice block d is shared and add it to the list of blocks to look up its backrefs for. Now when we loop around we will check edges for the block, so we will see we looked up block c last time. So we look up block d and then see that the block that points to it is block c and we can just skip that edge since we've already been up this path.

The problem is that because we clear need_check when we see block d (as it is shared) we never add block b as needing to be checked. And because block c is in our path already we bail out before we walk up to block b and add it to the backref check list.

To fix this we need to reset need_check if we trip over a block that doesn't need to be checked. This will make sure that any subsequent blocks in the path as we're walking up afterwards are added to the list to be processed. With this patch I can now mount Marc's fs image and it'll complete the balance without panicking. Thanks,

Reported-by: Marc MERLIN <marc@merlins.org>
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30  Btrfs: cleanup error handling in build_backref_tree  (Josef Bacik)

commit 75bfb9aff45e44625260f52a5fd581b92ace3e62 upstream.

When balance panics it tends to panic in the

    BUG_ON(!upper->checked);

test, because it means it couldn't build the backref tree properly. This is annoying to users and frankly a recoverable error, nothing in this function is actually fatal since it is just an in-memory building of the backrefs for a given bytenr.

So go through and change all the BUG_ON()'s to ASSERT()'s, and fix the BUG_ON(!upper->checked) thing to just return an error. This patch also fixes the error handling so it tears down the work we've done properly. This code was horribly broken since we always just panic'ed instead of actually erroring out, so it needed to be completely re-worked. With this patch my broken image no longer panics when I mount it. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30  Btrfs: try not to ENOSPC on log replay  (Josef Bacik)

commit 1d52c78afbbf80b58299e076a159617d6b42fe3c upstream.

When doing log replay we may have to update inodes, which traditionally goes through our delayed inode stuff. This will try to move space over from the trans handle, but we don't reserve space in our trans handle on replay since we don't know how much we will need, so instead we try to flush. But because we have a trans handle open we won't flush anything, so if we are out of reserve space we will simply return ENOSPC.

Since we know that if an operation made it into the log then we definitely had space before the box bought the farm then we don't need to worry about doing this space reservation. Use the fs_info->log_root_recovering flag to skip the delayed inode stuff and update the item directly. Thanks,

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30  Btrfs: fix up bounds checking in lseek  (Liu Bo)

commit 4d1a40c66bed0b3fa43b9da5fbd5cbe332e4eccf upstream.

A user reported this. It is because lseek's SEEK_SET/SEEK_CUR/SEEK_END allow a negative value for @offset, but btrfs's SEEK_DATA/SEEK_HOLE don't prepare for that and convert the negative @offset into an unsigned type, so we get an (end < start) warning:

    [ 1269.835374] ------------[ cut here ]------------
    [ 1269.836809] WARNING: CPU: 0 PID: 1241 at fs/btrfs/extent_io.c:430 insert_state+0x11d/0x140()
    [ 1269.838816] BTRFS: end < start 4094 18446744073709551615
    [ 1269.840334] CPU: 0 PID: 1241 Comm: a.out Tainted: G W 3.16.0+ #306
    [ 1269.858229] Call Trace:
    [ 1269.858612] [<ffffffff81801a69>] dump_stack+0x4e/0x68
    [ 1269.858952] [<ffffffff8107894c>] warn_slowpath_common+0x8c/0xc0
    [ 1269.859416] [<ffffffff81078a36>] warn_slowpath_fmt+0x46/0x50
    [ 1269.859929] [<ffffffff813b0fbd>] insert_state+0x11d/0x140
    [ 1269.860409] [<ffffffff813b1396>] __set_extent_bit+0x3b6/0x4e0
    [ 1269.860805] [<ffffffff813b21c7>] lock_extent_bits+0x87/0x200
    [ 1269.861697] [<ffffffff813a5b28>] btrfs_file_llseek+0x148/0x2a0
    [ 1269.862168] [<ffffffff811f201e>] SyS_lseek+0xae/0xc0
    [ 1269.862620] [<ffffffff8180b212>] system_call_fastpath+0x16/0x1b
    [ 1269.862970] ---[ end trace 4d33ea885832054b ]---

This fix assumes that btrfs starts finding DATA/HOLE from the beginning of the file if the assigned @offset is negative. We also add alignment for lock_extent_bits's range.

Reported-by: Toralf Förster <toralf.foerster@gmx.de>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
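A hedged sketch of the normalization (variable names illustrative): clamp the signed offset before it is ever treated as an unsigned position, and align the range handed to the extent lock.

    /* SEEK_DATA / SEEK_HOLE: a negative offset means "from the start" */
    loff_t start = max_t(loff_t, 0, offset);

    /* aligned range for lock_extent_bits(), per the note above */
    u64 lockstart = round_down(start, blocksize);
    u64 lockend   = round_up(i_size_read(inode), blocksize);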
2014-10-30  Btrfs: add missing compression property remove in btrfs_ioctl_setflags  (Filipe Manana)

commit 78a017a2c92df9b571db0a55a016280f9019c65e upstream.

The behaviour of a 'chattr -c' consists of getting the current flags, clearing the FS_COMPR_FL bit and then sending the result to the set flags ioctl - this means the bit FS_NOCOMP_FL isn't set in the flags passed to the ioctl. This results in the compression property not being cleared from the inode - it was cleared only if the bit FS_NOCOMP_FL was set in the received flags.

Reproducer:

    $ mkfs.btrfs -f /dev/sdd
    $ mount /dev/sdd /mnt && cd /mnt
    $ mkdir a
    $ chattr +c a
    $ touch a/file
    $ lsattr a/file
    --------c------- a/file
    $ chattr -c a
    $ touch a/file2
    $ lsattr a/file2
    --------c------- a/file2
    $ lsattr -d a
    ---------------- a

Reported-by: Andreas Schneider <asn@cryptomilk.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
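A hedged sketch of the missing branch in btrfs_ioctl_setflags (reconstructed from the description; error handling elided, and the exact helper used upstream may differ):

    if (flags & FS_NOCOMP_FL) {
            /* existing path: force compression to "none" */
    } else if (flags & FS_COMPR_FL) {
            /* existing path: set the configured algorithm */
    } else {
            /* missing case: neither bit set ('chattr -c') -- delete
             * the compression property instead of leaving it behind */
            ret = btrfs_set_prop(inode, "btrfs.compression", NULL, 0, 0);
    }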
2014-10-30btrfs: wake up transaction thread from SYNC_FS ioctlDavid Sterba
commit 2fad4e83e12591eb3bd213875b9edc2d18e93383 upstream. The transaction thread may want to do more work, namely it pokes the cleaner kthread, which will start processing uncleaned subvols. This can be triggered by the user via the 'btrfs fi sync' command; otherwise there was a delay of up to 30 seconds before the cleaner started to clean old snapshots. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
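The change itself is tiny; a hedged sketch of the idea in btrfs_ioctl_sync() (fs/btrfs/ioctl.c):

    ret = btrfs_sync_fs(file_inode(file)->i_sb, 1);
    /*
     * The transaction kthread is what pokes the cleaner kthread, so
     * wake it now instead of letting dead subvolumes sit for up to
     * 30 seconds until the next periodic transaction commit.
     */
    wake_up_process(fs_info->transaction_kthread);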
2014-10-09mm: per-thread vma cachingDavidlohr Bueso
commit 615d6e8756c87149f2d4c1b93d471bca002bd849 upstream. This patch is a continuation of efforts to optimize find_vma(), avoiding potentially expensive rbtree walks to locate a vma upon faults. The original approach (https://lkml.org/lkml/2013/11/1/410), where the largest vma was also cached, ended up being too specific and random, so further comparison with other approaches was needed. There are two things to consider here: the cache hit rate and the latency of find_vma(). Improving the hit rate does not necessarily translate into finding the vma any faster, as the overhead of any fancy caching scheme can be too high to be worthwhile. We currently cache the last used vma for the whole address space, which provides a nice optimization, cutting the total cycles spent in find_vma() by as much as 2.5x for workloads with good locality. On the other hand, this simple scheme is pretty much useless for workloads with poor locality: analyzing ebizzy runs shows that, no matter how many threads are running, the mmap_cache hit rate is less than 2%, and in many situations below 1%. The proposed approach is to replace this scheme with a small per-thread cache, maximizing hit rates at a very low maintenance cost. Invalidations are performed by simply bumping up a 32-bit sequence number. The only expensive operation is the rare case of a sequence-number overflow, where all caches that share the same address space are flushed. Upon a miss, the proposed replacement policy is based on the page number that contains the virtual address in question. Concretely, the following results are seen on an 80-core, 8-socket x86-64 box:

1) System bootup: Most programs are single threaded, so the per-thread scheme improves the ~50% hit rate by just adding a few more slots to the cache.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 50.61%   | 19.90            |
| patched        | 73.45%   | 13.58            |
+----------------+----------+------------------+

2) Kernel build: This one is already pretty good with the current approach, as we're dealing with good locality.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 75.28%   | 11.03            |
| patched        | 88.09%   | 9.31             |
+----------------+----------+------------------+

3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 70.66%   | 17.14            |
| patched        | 91.15%   | 12.57            |
+----------------+----------+------------------+

4) Ebizzy: There's a fair amount of variation from run to run, but this approach always shows nearly perfect hit rates, while the baseline's is just about non-existent. The cycle counts fluctuate anywhere from ~60 to ~116 billion for the baseline scheme, but this approach reduces that considerably.
For instance, with 80 threads:

+----------------+----------+------------------+
| caching scheme | hit-rate | cycles (billion) |
+----------------+----------+------------------+
| baseline       | 1.06%    | 91.54            |
| patched        | 99.97%   | 14.18            |
+----------------+----------+------------------+

[akpm@linux-foundation.org: fix nommu build, per Davidlohr] [akpm@linux-foundation.org: document vmacache_valid() logic] [akpm@linux-foundation.org: attempt to untangle header files] [akpm@linux-foundation.org: add vmacache_find() BUG_ON] [hughd@google.com: add vmacache_valid_mm() (from Oleg)] [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: adjust and enhance comments] Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Michel Lespinasse <walken@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Tested-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
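A condensed sketch of the scheme described above, assuming the per-task fields this patch adds (the real code lives in mm/vmacache.c; locking and the seqnum-overflow flush are omitted):

    #define VMACACHE_BITS  2
    #define VMACACHE_SIZE  (1U << VMACACHE_BITS)    /* 4 slots per thread */
    #define VMACACHE_HASH(addr) (((addr) >> PAGE_SHIFT) & (VMACACHE_SIZE - 1))

    /* invalidation is just a sequence-number bump on the mm */
    static inline void vmacache_invalidate(struct mm_struct *mm)
    {
            mm->vmacache_seqnum++;
    }

    struct vm_area_struct *vmacache_find(struct mm_struct *mm, unsigned long addr)
    {
            int i;

            /* a stale seqnum means another thread changed the address space */
            if (current->vmacache_seqnum != mm->vmacache_seqnum)
                    return NULL;        /* caller falls back to the rbtree */

            for (i = 0; i < VMACACHE_SIZE; i++) {
                    struct vm_area_struct *vma = current->vmacache[i];

                    if (vma && vma->vm_start <= addr && vma->vm_end > addr)
                            return vma; /* hit */
            }
            return NULL;
    }

    /* on a miss, the replacement slot is picked by page number */
    static inline void vmacache_update(unsigned long addr,
                                       struct vm_area_struct *vma)
    {
            current->vmacache[VMACACHE_HASH(addr)] = vma;
    }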
2014-10-09hugetlb: ensure hugepage access is denied if hugepages are not supportedNishanth Aravamudan
commit 457c1b27ed56ec472d202731b12417bff023594a upstream. Currently, I am seeing the following when I `mount -t hugetlbfs /none /dev/hugetlbfs`, and then simply do a `ls /dev/hugetlbfs`. I think it's related to the fact that hugetlbfs is not correctly setting itself up in this state:

Unable to handle kernel paging request for data at address 0x00000031
Faulting instruction address: 0xc000000000245710
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA pSeries
....

In KVM guests on Power, in a guest not backed by hugepages, we see the following:

AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:         64 kB

HPAGE_SHIFT == 0 in this configuration, which indicates that hugepages are not supported at boot time, but this is only checked in hugetlb_init(). Extract the check into a helper function, and use it in a few relevant places. This makes hugetlbfs unsupported (not registered at all) in this environment. I believe this is fine, as there are no valid hugepages and that won't change at runtime.

[akpm@linux-foundation.org: use pr_info(), per Mel] [akpm@linux-foundation.org: fix build when HPAGE_SHIFT is undefined] Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
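Roughly, the extracted helper looks like this (a sketch; the real definition and its call sites may differ in detail):

    /* arch code leaves HPAGE_SHIFT at 0 when no hugepage support is found */
    static inline bool hugepages_supported(void)
    {
            return HPAGE_SHIFT != 0;
    }

    static int __init hugetlb_init(void)
    {
            /* the same check also guards hugetlbfs registration */
            if (!hugepages_supported())
                    return 0;
            /* ... normal hugepage setup ... */
            return 0;
    }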
2014-10-09CIFS: Fix SMB2 readdir error handlingPavel Shilovsky
commit 52755808d4525f4d5b86d112d36ffc7a46f3fb48 upstream. SMB2 servers indicate the end of a directory search with the STATUS_NO_MORE_FILES error code, which is not processed now. This causes the generic/257 xfstest to fail. Fix this by treating this error code as the end of search in SMB2_query_directory. Also, when negotiating the CIFS protocol we tell the server to close the search automatically at the end, so there is no need to do it ourselves; in the case of the SMB2 protocol we need to close it explicitly - separate the close-directory checks for the different protocols. Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
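A hedged sketch of the end-of-search handling added to SMB2_query_directory() (fs/cifs/smb2pdu.c); names follow the cifs code but the surrounding context is abbreviated:

    if (rc) {
            /* the server signals a completed listing via this status code */
            if (rc == -ENODATA &&
                rsp->hdr.Status == STATUS_NO_MORE_FILES) {
                    srch_inf->endOfSearch = true;   /* clean end, not an error */
                    rc = 0;
            }
            goto qdir_exit;
    }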
2014-10-09Fix problem recognizing symlinksSteve French
commit 19e81573fca7b87ced7701e01ba164b968d929bd upstream. Changeset eb85d94bd introduced a problem: if a cifs open fails during query info of a file, we will still try to close the file (this happens with certain types of reparse points) even though the file handle is not valid. In addition, for SMB2/SMB3 we were not mapping the return code that Windows returns when trying to open a file (like a Windows NFS symlink) which is a reparse point. Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
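A hedged sketch of the first half of the fix (the unconditional close); identifiers are abbreviated from the cifs query-info path and the open arguments are illustrative:

    rc = CIFSSMBOpen(xid, tcon, full_path, FILE_OPEN, GENERIC_READ,
                     create_options, &netfid, &oplock, NULL,
                     cifs_sb->local_nls, remap);
    /*
     * Previously the close ran even when the open had failed, sending
     * a close request for a handle that was never valid.
     */
    if (!rc)
            CIFSSMBClose(xid, tcon, netfid);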
2014-10-09udf: Avoid infinite loop when processing indirect ICBsJan Kara
commit c03aa9f6e1f938618e6db2e23afef0574efeeb65 upstream. We did not implement any bound on the number of indirect ICBs we follow when loading an inode. Thus a corrupted medium could cause the kernel to go into an infinite loop, possibly causing a stack overflow. Fix the possible stack overflow by removing recursion from __udf_read_inode() and limit the number of indirect ICBs we follow to avoid infinite loops. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Chuck Ebbert <cebbert.lkml@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
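A hedged sketch of the reworked loop in __udf_read_inode() (fs/udf/inode.c); validation details are omitted and the bound is per this patch's approach:

    #define UDF_MAX_ICB_NESTING 1024

    static void __udf_read_inode(struct inode *inode)
    {
            unsigned int indirections = 0;

    reread:
            /* ... read and validate this inode's (extended) file entry ... */

            /* strategy 4096 means the entry points at an indirect ICB */
            if (fe->icbTag.strategyType == cpu_to_le16(4096)) {
                    if (++indirections > UDF_MAX_ICB_NESTING) {
                            /* corrupted chain: stop rather than loop forever */
                            make_bad_inode(inode);
                            return;
                    }
                    /* follow the indirect ICB iteratively, not recursively */
                    iinfo->i_location = *loc;
                    goto reread;
            }
            /* ... normal inode setup ... */
    }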