summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2014-09-05NFSv4: Fix problems with close in the presence of a delegationTrond Myklebust
commit aee7af356e151494d5014f57b33460b162f181b5 upstream. In the presence of delegations, we can no longer assume that the state->n_rdwr, state->n_rdonly, state->n_wronly reflect the open stateid share mode, and so we need to calculate the initial value for calldata->arg.fmode using the state->flags. Reported-by: James Drews <drews@engr.wisc.edu> Fixes: 88069f77e1ac5 (NFSv41: Fix a potential state leakage when...) Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05NFSv3: Fix another acl regressionTrond Myklebust
commit f87d928f6d98644d39809a013a22f981d39017cf upstream. When creating a new object on the NFS server, we should not be sending posix setacl requests unless the preceding posix_acl_create returned a non-trivial acl. Doing so, causes Solaris servers in particular to return an EINVAL. Fixes: 013cdf1088d72 (nfs: use generic posix ACL infrastructure,,,) Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1132786 Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05svcrdma: Select NFSv4.1 backchannel transport based on forward channelChuck Lever
commit 3c45ddf823d679a820adddd53b52c6699c9a05ac upstream. The current code always selects XPRT_TRANSPORT_BC_TCP for the back channel, even when the forward channel was not TCP (eg, RDMA). When a 4.1 mount is attempted with RDMA, the server panics in the TCP BC code when trying to send CB_NULL. Instead, construct the transport protocol number from the forward channel transport or'd with XPRT_TRANSPORT_BC. Transports that do not support bi-directional RPC will not have registered a "BC" transport, causing create_backchannel_client() to fail immediately. Fixes: https://bugzilla.linux-nfs.org/show_bug.cgi?id=265 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05nfs3_list_one_acl(): check get_acl() result with IS_ERR_OR_NULLAndrey Utkin
commit 7a9e75a185e6b3a3860e6a26fb6e88691fc2c9d9 upstream. There was a check for result being not NULL. But get_acl() may return NULL, or ERR_PTR, or actual pointer. The purpose of the function where current change is done is to "list ACLs only when they are available", so any error condition of get_acl() mustn't be elevated, and returning 0 there is still valid. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81111 Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Fixes: 74adf83f5d77 (nfs: only show Posix ACLs in listxattr if actually...) Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05NFSD: Decrease nfsd_users in nfsd_startup_generic failKinglong Mee
commit d9499a95716db0d4bc9b67e88fd162133e7d6b08 upstream. A memory allocation failure could cause nfsd_startup_generic to fail, in which case nfsd_users wouldn't be incorrectly left elevated. After nfsd restarts nfsd_startup_generic will then succeed without doing anything--the first consequence is likely nfs4_start_net finding a bad laundry_wq and crashing. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Fixes: 4539f14981ce "nfsd: replace boolean nfsd_up flag by users counter" Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05jbd2: fix descriptor block size handling errors with journal_csumDarrick J. Wong
commit db9ee220361de03ee86388f9ea5e529eaad5323c upstream. It turns out that there are some serious problems with the on-disk format of journal checksum v2. The foremost is that the function to calculate descriptor tag size returns sizes that are too big. This causes alignment issues on some architectures and is compounded by the fact that some parts of jbd2 use the structure size (incorrectly) to determine the presence of a 64bit journal instead of checking the feature flags. Therefore, introduce journal checksum v3, which enlarges the descriptor block tag format to allow for full 32-bit checksums of journal blocks, fix the journal tag function to return the correct sizes, and fix the jbd2 recovery code to use feature flags to determine 64bitness. Add a few function helpers so we don't have to open-code quite so many pieces. Switching to a 16-byte block size was found to increase journal size overhead by a maximum of 0.1%, to convert a 32-bit journal with no checksumming to a 32-bit journal with checksum v3 enabled. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: TR Reardon <thomas_reardon@hotmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05jbd2: fix infinite loop when recovering corrupt journal blocksDarrick J. Wong
commit 022eaa7517017efe4f6538750c2b59a804dc7df7 upstream. When recovering the journal, don't fall into an infinite loop if we encounter a corrupt journal block. Instead, just skip the block and return an error, which fails the mount and thus forces the user to run a full filesystem fsck. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05ext4: update i_disksize coherently with block allocation on error pathDmitry Monakhov
commit 6603120e96eae9a5d6228681ae55c7fdc998d1bb upstream. In case of delalloc block i_disksize may be less than i_size. So we have to update i_disksize each time we allocated and submitted some blocks beyond i_disksize. We weren't doing this on the error paths, so fix this. testcase: xfstest generic/019 Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05Btrfs: fix crash on endio of reading corrupted blockLiu Bo
commit 38c1c2e44bacb37efd68b90b3f70386a8ee370ee upstream. The crash is ------------[ cut here ]------------ kernel BUG at fs/btrfs/extent_io.c:2124! [...] Workqueue: btrfs-endio normal_work_helper [btrfs] RIP: 0010:[<ffffffffa02d6055>] [<ffffffffa02d6055>] end_bio_extent_readpage+0xb45/0xcd0 [btrfs] This is in fact a regression. It is because we forgot to increase @offset properly in reading corrupted block, so that the @offset remains, and this leads to checksum errors while reading left blocks queued up in the same bio, and then ends up with hiting the above BUG_ON. Reported-by: Chris Murphy <lists@colorremedies.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05Btrfs: fix compressed write corruption on enospcLiu Bo
commit ce62003f690dff38d3164a632ec69efa15c32cbf upstream. When failing to allocate space for the whole compressed extent, we'll fallback to uncompressed IO, but we've forgotten to redirty the pages which belong to this compressed extent, and these 'clean' pages will simply skip 'submit' part and go to endio directly, at last we got data corruption as we write nothing. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Tested-By: Martin Steigerwald <martin@lichtvoll.de> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05Btrfs: read lock extent buffer while walking backrefsFilipe Manana
commit 6f7ff6d7832c6be13e8c95598884dbc40ad69fb7 upstream. Before processing the extent buffer, acquire a read lock on it, so that we're safe against concurrent updates on the extent buffer. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05Btrfs: fix csum tree corruption, duplicate and outdated checksumsFilipe Manana
commit 27b9a8122ff71a8cadfbffb9c4f0694300464f3b upstream. Under rare circumstances we can end up leaving 2 versions of a checksum for the same file extent range. The reason for this is that after calling btrfs_next_leaf we process slot 0 of the leaf it returns, instead of processing the slot set in path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after btrfs_next_leaf() releases the path and before it searches for the next leaf, another task might cause a split of the next leaf, which migrates some of its keys to the leaf we were processing before calling btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the same leaf but with path->slots[0] having a slot number corresponding to the first new key it got, that is, a slot number that didn't exist before calling btrfs_next_leaf(), as the leaf now has more keys than it had before. So we must really process the returned leaf starting at path->slots[0] always, as it isn't always 0, and the key at slot 0 can have an offset much lower than our search offset/bytenr. For example, consider the following scenario, where we have: sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568 four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472 Leaf N: slot = 0 slot = btrfs_header_nritems() - 1 |-------------------------------------------------------------------| | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] | |-------------------------------------------------------------------| Leaf N + 1: slot = 0 slot = btrfs_header_nritems() - 1 |--------------------------------------------------------------------| | [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 | |--------------------------------------------------------------------| Because we are at the last slot of leaf N, we call btrfs_next_leaf() to find the next highest key, which releases the current path and then searches for that next key. However after releasing the path and before finding that next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore btrfs_next_leaf() will returns us a path again with leaf N but with the slot pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N is then: slot = 0 slot = btrfs_header_nritems() - 2 slot = btrfs_header_nritems() - 1 |----------------------------------------------------------------------------------------------------| | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] [(CSUM CSUM 40161280), size 32] | |----------------------------------------------------------------------------------------------------| And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump into the "insert:" label, which will set tmp to: tmp = min((sums->len - total_bytes) >> blocksize_bits, (next_offset - file_key.offset) >> blocksize_bits) = min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) = min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4 and ins_size = csum_size * tmp = 4 * 4 = 16 bytes. In other words, we insert a new csum item in the tree with key (CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong, because the item with key (CSUM CSUM 40161280) (the one that was moved from leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288 bytes of our data and won't get those old checksums removed. So this leaves us 2 different checksums for 3 4kb blocks of data in the tree, and breaks the logical rule: Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover An obvious bad effect of this is that a subsequent csum tree lookup to get the checksum of any of the blocks with logical offset of 40161280, 40165376 or 40169472 (the last 3 4kb blocks of file data), will get the old checksums. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05Btrfs: Fix memory corruption by ulist_add_merge() on 32bit archTakashi Iwai
commit 4eb1f66dce6c4dc28dd90a7ffbe6b2b1cb08aa4e upstream. We've got bug reports that btrfs crashes when quota is enabled on 32bit kernel, typically with the Oops like below: BUG: unable to handle kernel NULL pointer dereference at 00000004 IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs] *pde = 00000000 Oops: 0000 [#1] SMP CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S W 3.15.2-1.gd43d97e-default #1 Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs] task: f1478130 ti: f147c000 task.ti: f147c000 EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0 EIP is at find_parent_nodes+0x360/0x1380 [btrfs] EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000 ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690 Stack: 00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050 00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000 00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000 Call Trace: [<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs] [<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs] [<f9206148>] normal_work_helper+0xc8/0x270 [btrfs] [<c025e38b>] process_one_work+0x11b/0x390 [<c025eea1>] worker_thread+0x101/0x340 [<c026432b>] kthread+0x9b/0xb0 [<c0712a71>] ret_from_kernel_thread+0x21/0x30 [<c0264290>] kthread_create_on_node+0x110/0x110 This indicates a NULL corruption in prefs_delayed list. The further investigation and bisection pointed that the call of ulist_add_merge() results in the corruption. ulist_add_merge() takes u64 as aux and writes a 64bit value into old_aux. The callers of this function in backref.c, however, pass a pointer of a pointer to old_aux. That is, the function overwrites 64bit value on 32bit pointer. This caused a NULL in the adjacent variable, in this case, prefs_delayed. Here is a quick attempt to band-aid over this: a new function, ulist_add_merge_ptr() is introduced to pass/store properly a pointer value instead of u64. There are still ugly void ** cast remaining in the callers because void ** cannot be taken implicitly. But, it's safer than explicit cast to u64, anyway. Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046 Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05ext4: fix BUG_ON in mb_free_blocks()Theodore Ts'o
commit c99d1e6e83b06744c75d9f5e491ed495a7086b7b upstream. If we suffer a block allocation failure (for example due to a memory allocation failure), it's possible that we will call ext4_discard_allocated_blocks() before we've actually allocated any blocks. In that case, fe_len and fe_start in ac->ac_f_ex will still be zero, and this will result in mb_free_blocks(inode, e4b, 0, 0) triggering the BUG_ON on mb_free_blocks(): BUG_ON(last >= (sb->s_blocksize << 3)); Fix this by bailing out of ext4_discard_allocated_blocks() if fs_len is zero. Also fix a missing ext4_mb_unload_buddy() call in ext4_discard_allocated_blocks(). Google-Bug-Id: 16844242 Fixes: 86f0afd463215fc3e58020493482faa4ac3a4d69 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05debugfs: Fix corrupted loop in debugfs_remove_recursiveSteven Rostedt
commit 485d44022a152c0254dd63445fdb81c4194cbf0e upstream. [ I'm currently running my tests on it now, and so far, after a few hours it has yet to blow up. I'll run it for 24 hours which it never succeeded in the past. ] The tracing code has a way to make directories within the debugfs file system as well as deleting them using mkdir/rmdir in the instance directory. This is very limited in functionality, such as there is no renames, and the parent directory "instance" can not be modified. The tracing code creates the instance directory from the debugfs code and then replaces the dentry->d_inode->i_op with its own to allow for mkdir/rmdir to work. When these are called, the d_entry and inode locks need to be released to call the instance creation and deletion code. That code has its own accounting and locking to serialize everything to prevent multiple users from causing harm. As the parent "instance" directory can not be modified this simplifies things. I created a stress test that creates several threads that randomly creates and deletes directories thousands of times a second. The code stood up to this test and I submitted it a while ago. Recently I added a new test that adds readers to the mix. While the instance directories were being added and deleted, readers would read from these directories and even enable tracing within them. This test was able to trigger a bug: general protection fault: 0000 [#1] PREEMPT SMP Modules linked in: ... CPU: 3 PID: 17789 Comm: rmdir Tainted: G W 3.15.0-rc2-test+ #41 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007 task: ffff88003786ca60 ti: ffff880077018000 task.ti: ffff880077018000 RIP: 0010:[<ffffffff811ed5eb>] [<ffffffff811ed5eb>] debugfs_remove_recursive+0x1bd/0x367 RSP: 0018:ffff880077019df8 EFLAGS: 00010246 RAX: 0000000000000002 RBX: ffff88006f0fe490 RCX: 0000000000000000 RDX: dead000000100058 RSI: 0000000000000246 RDI: ffff88003786d454 RBP: ffff88006f0fe640 R08: 0000000000000628 R09: 0000000000000000 R10: 0000000000000628 R11: ffff8800795110a0 R12: ffff88006f0fe640 R13: ffff88006f0fe640 R14: ffffffff81817d0b R15: ffffffff818188b7 FS: 00007ff13ae24700(0000) GS:ffff88007d580000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000003054ec7be0 CR3: 0000000076d51000 CR4: 00000000000007e0 Stack: ffff88007a41ebe0 dead000000100058 00000000fffffffe ffff88006f0fe640 0000000000000000 ffff88006f0fe678 ffff88007a41ebe0 ffff88003793a000 00000000fffffffe ffffffff810bde82 ffff88006f0fe640 ffff88007a41eb28 Call Trace: [<ffffffff810bde82>] ? instance_rmdir+0x15b/0x1de [<ffffffff81132e2d>] ? vfs_rmdir+0x80/0xd3 [<ffffffff81132f51>] ? do_rmdir+0xd1/0x139 [<ffffffff8124ad9e>] ? trace_hardirqs_on_thunk+0x3a/0x3c [<ffffffff814fea62>] ? system_call_fastpath+0x16/0x1b Code: fe ff ff 48 8d 75 30 48 89 df e8 c9 fd ff ff 85 c0 75 13 48 c7 c6 b8 cc d2 81 48 c7 c7 b0 cc d2 81 e8 8c 7a f5 ff 48 8b 54 24 08 <48> 8b 82 a8 00 00 00 48 89 d3 48 2d a8 00 00 00 48 89 44 24 08 RIP [<ffffffff811ed5eb>] debugfs_remove_recursive+0x1bd/0x367 RSP <ffff880077019df8> It took a while, but every time it triggered, it was always in the same place: list_for_each_entry_safe(child, next, &parent->d_subdirs, d_u.d_child) { Where the child->d_u.d_child seemed to be corrupted. I added lots of trace_printk()s to see what was wrong, and sure enough, it was always the child's d_u.d_child field. I looked around to see what touches it and noticed that in __dentry_kill() which calls dentry_free(): static void dentry_free(struct dentry *dentry) { /* if dentry was never visible to RCU, immediate free is OK */ if (!(dentry->d_flags & DCACHE_RCUACCESS)) __d_free(&dentry->d_u.d_rcu); else call_rcu(&dentry->d_u.d_rcu, __d_free); } I also noticed that __dentry_kill() unlinks the child->d_u.child under the parent->d_lock spin_lock. Looking back at the loop in debugfs_remove_recursive() it never takes the parent->d_lock to do the list walk. Adding more tracing, I was able to prove this was the issue: ftrace-t-15385 1.... 246662024us : dentry_kill <ffffffff81138b91>: free ffff88006d573600 rmdir-15409 2.... 246662024us : debugfs_remove_recursive <ffffffff811ec7e5>: child=ffff88006d573600 next=dead000000100058 The dentry_kill freed ffff88006d573600 just as the remove recursive was walking it. In order to fix this, the list walk needs to be modified a bit to take the parent->d_lock. The safe version is no longer necessary, as every time we remove a child, the parent->d_lock must be released and the list walk must start over. Each time a child is removed, even though it may still be on the list, it should be skipped by the first check in the loop: if (!debugfs_positive(child)) continue; Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05ext4: fix ext4_discard_allocated_blocks() if we can't allocate the pa structTheodore Ts'o
commit 86f0afd463215fc3e58020493482faa4ac3a4d69 upstream. If there is a failure while allocating the preallocation structure, a number of blocks can end up getting marked in the in-memory buddy bitmap, and then not getting released. This can result in the following corruption getting reported by the kernel: EXT4-fs error (device sda3): ext4_mb_generate_buddy:758: group 1126, 12793 clusters in bitmap, 12729 in gd In that case, we need to release the blocks using mb_free_blocks(). Tested: fs smoke test; also demonstrated that with injected errors, the file system is no longer getting corrupted Google-Bug-Id: 16657874 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-09-05isofs: Fix unbounded recursion when processing relocated directoriesJan Kara
commit 410dd3cf4c9b36f27ed4542ee18b1af5e68645a4 upstream. We did not check relocated directory in any way when processing Rock Ridge 'CL' tag. Thus a corrupted isofs image can possibly have a CL entry pointing to another CL entry leading to possibly unbounded recursion in kernel code and thus stack overflow or deadlocks (if there is a loop created from CL entries). Fix the problem by not allowing CL entry to point to a directory entry with CL entry (such use makes no good sense anyway) and by checking whether CL entry doesn't point to itself. Reported-by: Chris Evans <cevans@google.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-08-14xfs: log vector rounding leaks log spaceDave Chinner
commit 110dc24ad2ae4e9b94b08632fe1eb2fcdff83045 upstream. The addition of direct formatting of log items into the CIL linear buffer added alignment restrictions that the start of each vector needed to be 64 bit aligned. Hence padding was added in xlog_finish_iovec() to round up the vector length to ensure the next vector started with the correct alignment. This adds a small number of bytes to the size of the linear buffer that is otherwise unused. The issue is that we then use the linear buffer size to determine the log space used by the log item, and this includes the unused space. Hence when we account for space used by the log item, it's more than is actually written into the iclogs, and hence we slowly leak this space. This results on log hangs when reserving space, with threads getting stuck with these stack traces: Call Trace: [<ffffffff81d15989>] schedule+0x29/0x70 [<ffffffff8150d3a2>] xlog_grant_head_wait+0xa2/0x1a0 [<ffffffff8150d55d>] xlog_grant_head_check+0xbd/0x140 [<ffffffff8150ee33>] xfs_log_reserve+0x103/0x220 [<ffffffff814b7f05>] xfs_trans_reserve+0x2f5/0x310 ..... The 4 bytes is significant. Brain Foster did all the hard work in tracking down a reproducable leak to inode chunk allocation (it went away with the ikeep mount option). His rough numbers were that creating 50,000 inodes leaked 11 log blocks. This turns out to be roughly 800 inode chunks or 1600 inode cluster buffers. That works out at roughly 4 bytes per cluster buffer logged, and at that I started looking for a 4 byte leak in the buffer logging code. What I found was that a struct xfs_buf_log_format structure for an inode cluster buffer is 28 bytes in length. This gets rounded up to 32 bytes, but the vector length remains 28 bytes. Hence the CIL ticket reservation is decremented by 32 bytes (via lv->lv_buf_len) for that vector rather than 28 bytes which are written into the log. The fix for this problem is to separately track the bytes used by the log vectors in the item and use that instead of the buffer length when accounting for the log space that will be used by the formatted log item. Again, thanks to Brian Foster for doing all the hard work and long hours to isolate this leak and make finding the bug relatively simple. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Cc: Bill <billstuff2001@sbcglobal.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-31fs: umount on symlink leaks mnt countVasily Averin
commit 295dc39d941dc2ae53d5c170365af4c9d5c16212 upstream. Currently umount on symlink blocks following umount: /vz is separate mount # ls /vz/ -al | grep test drwxr-xr-x. 2 root root 4096 Jul 19 01:14 testdir lrwxrwxrwx. 1 root root 11 Jul 19 01:16 testlink -> /vz/testdir # umount -l /vz/testlink umount: /vz/testlink: not mounted (expected) # lsof /vz # umount /vz umount: /vz: device is busy. (unexpected) In this case mountpoint_last() gets an extra refcount on path->mnt Signed-off-by: Vasily Averin <vvs@openvz.org> Acked-by: Ian Kent <raven@themaw.net> Acked-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-31coredump: fix the setting of PF_DUMPCORESilesh C V
commit aed8adb7688d5744cb484226820163af31d2499a upstream. Commit 079148b919d0 ("coredump: factor out the setting of PF_DUMPCORE") cleaned up the setting of PF_DUMPCORE by removing it from all the linux_binfmt->core_dump() and moving it to zap_threads().But this ended up clearing all the previously set flags. This causes issues during core generation when tsk->flags is checked again (eg. for PF_USED_MATH to dump floating point registers). Fix this. Signed-off-by: Silesh C V <svellattu@mvista.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Mandeep Singh Baines <msb@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-31nfs: only show Posix ACLs in listxattr if actually presentChristoph Hellwig
commit 74adf83f5d7720925499b4938f930591f947b660 upstream. The big ACL switched nfs to use generic_listxattr, which calls all existing ->list handlers. Add a custom .listxattr implementation that only lists the ACLs if they actually are present on the given inode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Philippe Troin <phil@fifi.org> Tested-by: Philippe Troin <phil@fifi.org> Fixes: 013cdf1088d7 (nfs: use generic posix ACL infrastructure ...) Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-28aio: protect reqs_available updates from changes in interrupt handlersBenjamin LaHaise
commit 263782c1c95bbddbb022dc092fd89a36bb8d5577 upstream. As of commit f8567a3845ac05bb28f3c1b478ef752762bd39ef it is now possible to have put_reqs_available() called from irq context. While put_reqs_available() is per cpu, it did not protect itself from interrupts on the same CPU. This lead to aio_complete() corrupting the available io requests count when run under a heavy O_DIRECT workloads as reported by Robert Elliott. Fix this by disabling irq updates around the per cpu batch updates of reqs_available. Many thanks to Robert and folks for testing and tracking this down. Reported-by: Robert Elliot <Elliott@hp.com> Tested-by: Robert Elliot <Elliott@hp.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@infradead.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-28quota: missing lock in dqcache_shrink_scan()Niu Yawei
commit d68aab6b8f572406aa93b45ef6483934dd3b54a6 upstream. Commit 1ab6c4997e04 (fs: convert fs shrinkers to new scan/count API) accidentally removed locking from quota shrinker. Fix it - dqcache_shrink_scan() should use dq_list_lock to protect the scan on free_dquots list. Fixes: 1ab6c4997e04a00c50c6d786c2f046adc0d1f5de Signed-off-by: Niu Yawei <yawei.niu@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-28fuse: ignore entry-timeout on LOOKUP_REVALAnand Avati
commit 154210ccb3a871e631bf39fdeb7a8731d98af87b upstream. The following test case demonstrates the bug: sh# mount -t glusterfs localhost:meta-test /mnt/one sh# mount -t glusterfs localhost:meta-test /mnt/two sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; echo stuff > /mnt/one/file bash: /mnt/one/file: Stale file handle sh# echo stuff > /mnt/one/file; rm -f /mnt/two/file; sleep 1; echo stuff > /mnt/one/file On the second open() on /mnt/one, FUSE would have used the old nodeid (file handle) trying to re-open it. Gluster is returning -ESTALE. The ESTALE propagates back to namei.c:filename_lookup() where lookup is re-attempted with LOOKUP_REVAL. The right behavior now, would be for FUSE to ignore the entry-timeout and and do the up-call revalidation. Instead FUSE is ignoring LOOKUP_REVAL, succeeding the revalidation (because entry-timeout has not passed), and open() is again retried on the old file handle and finally the ESTALE is going back to the application. Fix: if revalidation is happening with LOOKUP_REVAL, then ignore entry-timeout and always do the up-call. Signed-off-by: Anand Avati <avati@redhat.com> Reviewed-by: Niels de Vos <ndevos@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-28fuse: handle large user and group IDMiklos Szeredi
commit 233a01fa9c4c7c41238537e8db8434667ff28a2f upstream. If the number in "user_id=N" or "group_id=N" mount options was larger than INT_MAX then fuse returned EINVAL. Fix this to handle all valid uid/gid values. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-28fuse: timeout comparison fixMiklos Szeredi
commit 126b9d4365b110c157bc4cbc32540dfa66c9c85a upstream. As suggested by checkpatch.pl, use time_before64() instead of direct comparison of jiffies64 values. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-17ext4: fix a potential deadlock in __ext4_es_shrink()Theodore Ts'o
commit 3f1f9b851311a76226140b55b1ea22111234a7c2 upstream. This fixes the following lockdep complaint: [ INFO: possible circular locking dependency detected ] 3.16.0-rc2-mm1+ #7 Tainted: G O ------------------------------------------------------- kworker/u24:0/4356 is trying to acquire lock: (&(&sbi->s_es_lru_lock)->rlock){+.+.-.}, at: [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0 but task is already holding lock: (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180 which lock already depends on the new lock. Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ei->i_es_lock); lock(&(&sbi->s_es_lru_lock)->rlock); lock(&ei->i_es_lock); lock(&(&sbi->s_es_lru_lock)->rlock); *** DEADLOCK *** 6 locks held by kworker/u24:0/4356: #0: ("writeback"){.+.+.+}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560 #1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff81071d00>] process_one_work+0x180/0x560 #2: (&type->s_umount_key#22){++++++}, at: [<ffffffff811a9c74>] grab_super_passive+0x44/0x90 #3: (jbd2_handle){+.+...}, at: [<ffffffff812979f9>] start_this_handle+0x189/0x5f0 #4: (&ei->i_data_sem){++++..}, at: [<ffffffff81247062>] ext4_map_blocks+0x132/0x550 #5: (&ei->i_es_lock){++++-.}, at: [<ffffffff81286961>] ext4_es_insert_extent+0x71/0x180 stack backtrace: CPU: 0 PID: 4356 Comm: kworker/u24:0 Tainted: G O 3.16.0-rc2-mm1+ #7 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Workqueue: writeback bdi_writeback_workfn (flush-253:0) ffffffff8213dce0 ffff880014b07538 ffffffff815df0bb 0000000000000007 ffffffff8213e040 ffff880014b07588 ffffffff815db3dd ffff880014b07568 ffff880014b07610 ffff88003b868930 ffff88003b868908 ffff88003b868930 Call Trace: [<ffffffff815df0bb>] dump_stack+0x4e/0x68 [<ffffffff815db3dd>] print_circular_bug+0x1fb/0x20c [<ffffffff810a7a3e>] __lock_acquire+0x163e/0x1d00 [<ffffffff815e89dc>] ? retint_restore_args+0xe/0xe [<ffffffff815ddc7b>] ? __slab_alloc+0x4a8/0x4ce [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0 [<ffffffff810a8707>] lock_acquire+0x87/0x120 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0 [<ffffffff8128592d>] ? ext4_es_free_extent+0x5d/0x70 [<ffffffff815e6f09>] _raw_spin_lock+0x39/0x50 [<ffffffff81285fff>] ? __ext4_es_shrink+0x4f/0x2e0 [<ffffffff8119760b>] ? kmem_cache_alloc+0x18b/0x1a0 [<ffffffff81285fff>] __ext4_es_shrink+0x4f/0x2e0 [<ffffffff812869b8>] ext4_es_insert_extent+0xc8/0x180 [<ffffffff812470f4>] ext4_map_blocks+0x1c4/0x550 [<ffffffff8124c4c4>] ext4_writepages+0x6d4/0xd00 ... Reported-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-by: Minchan Kim <minchan@kernel.org> Cc: Zheng Liu <gnehzuil.liu@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-17ext4: disable synchronous transaction batching if max_batch_time==0Eric Sandeen
commit 5dd214248f94d430d70e9230bda72f2654ac88a8 upstream. The mount manpage says of the max_batch_time option, This optimization can be turned off entirely by setting max_batch_time to 0. But the code doesn't do that. So fix the code to do that. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-17ext4: clarify ext4_error message in ext4_mb_generate_buddy_error()Theodore Ts'o
commit 94d4c066a4ff170a2671b1a9b153febbf36796f6 upstream. We are spending a lot of time explaining to users what this error means. Let's try to improve the message to avoid this problem. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-17ext4: clarify error count warning messagesTheodore Ts'o
commit ae0f78de2c43b6fadd007c231a352b13b5be8ed2 upstream. Make it clear that values printed are times, and that it is error since last fsck. Also add note about fsck version required. Signed-off-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-17ext4: fix unjournalled bg descriptor while initializing inode bitmapTheodore Ts'o
commit 61c219f5814277ecb71d64cb30297028d6665979 upstream. The first time that we allocate from an uninitialized inode allocation bitmap, if the block allocation bitmap is also uninitalized, we need to get write access to the block group descriptor before we start modifying the block group descriptor flags and updating the free block count, etc. Otherwise, there is the potential of a bad journal checksum (if journal checksums are enabled), and of the file system becoming inconsistent if we crash at exactly the wrong time. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09aio: block io_destroy() until all context requests are completedAnatol Pomozov
commit e02ba72aabfade4c9cd6e3263e9b57bf890ad25c upstream. deletes aio context and all resources related to. It makes sense that no IO operations connected to the context should be running after the context is destroyed. As we removed io_context we have no chance to get requests status or call io_getevents(). man page for io_destroy says that this function may block until all context's requests are completed. Before kernel 3.11 io_destroy() blocked indeed, but since aio refactoring in 3.11 it is not true anymore. Here is a pseudo-code that shows a testcase for a race condition discovered in 3.11: initialize io_context io_submit(read to buffer) io_destroy() // context is destroyed so we can free the resources free(buffers); // if the buffer is allocated by some other user he'll be surprised // to learn that the buffer still filled by an outstanding operation // from the destroyed io_context The fix is straight-forward - add a completion struct and wait on it in io_destroy, complete() should be called when number of in-fligh requests reaches zero. If two or more io_destroy() called for the same context simultaneously then only the first one waits for IO completion, other calls behaviour is undefined. Tested: ran http://pastebin.com/LrPsQ4RL testcase for several hours and do not see the race condition anymore. Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09nfsd: fix rare symlink decoding bugJ. Bruce Fields
commit 76f47128f9b33af1e96819746550d789054c9664 upstream. An NFS operation that creates a new symlink includes the symlink data, which is xdr-encoded as a length followed by the data plus 0 to 3 bytes of zero-padding as required to reach a 4-byte boundary. The vfs, on the other hand, wants null-terminated data. The simple way to handle this would be by copying the data into a newly allocated buffer with space for the final null. The current nfsd_symlink code tries to be more clever by skipping that step in the (likely) case where the byte following the string is already 0. But that assumes that the byte following the string is ours to look at. In fact, it might be the first byte of a page that we can't read, or of some object that another task might modify. Worse, the NFSv4 code tries to fix the problem by actually writing to that byte. In the NFSv2/v3 cases this actually appears to be safe: - nfs3svc_decode_symlinkargs explicitly null-terminates the data (after first checking its length and copying it to a new page). - NFSv2 limits symlinks to 1k. The buffer holding the rpc request is always at least a page, and the link data (and previous fields) have maximum lengths that prevent the request from reaching the end of a page. In the NFSv4 case the CREATE op is potentially just one part of a long compound so can end up on the end of a page if you're unlucky. The minimal fix here is to copy and null-terminate in the NFSv4 case. The nfsd_symlink() interface here seems too fragile, though. It should really either do the copy itself every time or just require a null-terminated string. Reported-by: Jeff Layton <jlayton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09ext4: Fix hole punching for files with indirect blocksJan Kara
commit a93cd4cf86466caa49cfe64607bea7f0bde3f916 upstream. Hole punching code for files with indirect blocks wrongly computed number of blocks which need to be cleared when traversing the indirect block tree. That could result in punching more blocks than actually requested and thus effectively cause a data loss. For example: fallocate -n -p 10240000 4096 will punch the range 10240000 - 12632064 instead of the range 1024000 - 10244096. Fix the calculation. Fixes: 8bad6fc813a3a5300f51369c39d315679fd88c72 Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09ext4: Fix buffer double free in ext4_alloc_branch()Jan Kara
commit c5c7b8ddfbf8cb3b2291e515a34ab1b8982f5a2d upstream. Error recovery in ext4_alloc_branch() calls ext4_forget() even for buffer corresponding to indirect block it did not allocate. This leads to brelse() being called twice for that buffer (once from ext4_forget() and once from cleanup in ext4_ind_map_blocks()) leading to buffer use count misaccounting. Eventually (but often much later because there are other users of the buffer) we will see messages like: VFS: brelse: Trying to free free buffer Another manifestation of this problem is an error: JBD2 unexpected failure: jbd2_journal_revoke: !buffer_revoked(bh); inconsistent data on disk The fix is easy - don't forget buffer we did not allocate. Also add an explanatory comment because the indexing at ext4_alloc_branch() is somewhat subtle. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09CIFS: fix mount failure with broken pathnames when smb3 mount with mapchars ↵Steve French
option commit ce36d9ab3bab06b7b5522f5c8b68fac231b76ffb upstream. When we SMB3 mounted with mapchars (to allow reserved characters : \ / > < * ? via the Unicode Windows to POSIX remap range) empty paths (eg when we open "" to query the root of the SMB3 directory on mount) were not null terminated so we sent garbarge as a path name on empty paths which caused SMB2/SMB2.1/SMB3 mounts to fail when mapchars was specified. mapchars is particularly important since Unix Extensions for SMB3 are not supported (yet) Signed-off-by: Steve French <smfrench@gmail.com> Reviewed-by: David Disseldorp <ddiss@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-09fs/cifs: fix regression in cifs_create_mf_symlink()Björn Baumbach
commit a1d0b84c308d7fdfb67eb76498116a6c2fdda507 upstream. commit d81b8a40e2ece0a9ab57b1fe1798e291e75bf8fc ("CIFS: Cleanup cifs open codepath") changed disposition to FILE_OPEN. Signed-off-by: Björn Baumbach <bb@sernet.de> Signed-off-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Stefan Metzmacher <metze@samba.org> Cc: Pavel Shilovsky <piastry@etersoft.ru> Cc: Steve French <sfrench@samba.org> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06reiserfs: call truncate_setsize under tailpack mutexJeff Mahoney
commit 22e7478ddbcb670e33fab72d0bbe7c394c3a2c84 upstream. Prior to commit 0e4f6a791b1e (Fix reiserfs_file_release()), reiserfs truncates serialized on i_mutex. They mostly still do, with the exception of reiserfs_file_release. That blocks out other writers via the tailpack mutex and the inode openers counter adjusted in reiserfs_file_open. However, NFS will call reiserfs_setattr without having called ->open, so we end up with a race when nfs is calling ->setattr while another process is releasing the file. Ultimately, it triggers the BUG_ON(inode->i_size != new_file_size) check in maybe_indirect_to_direct. The solution is to pull the lock into reiserfs_setattr to encompass the truncate_setsize call as well. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06xfs: xfs_readsb needs to check for magic numbersDave Chinner
commit 556b8883cfac3d3203557e161ea8005f8b5479b2 upstream. Commit daba542 ("xfs: skip verification on initial "guess" superblock read") dropped the use of a verifier for the initial superblock read so we can probe the sector size of the filesystem stored in the superblock. It, however, now fails to validate that what was read initially is actually an XFS superblock and hence will fail the sector size check and return ENOSYS. This causes probe-based mounts to fail because it expects XFS to return EINVAL when it doesn't recognise the superblock format. Reported-by: Plamen Petrov <plamen.sisi@gmail.com> Tested-by: Plamen Petrov <plamen.sisi@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06nfs: Fix cache_validity check in nfs_write_pageuptodate()Scott Mayhew
commit 18dd78c427513fb0f89365138be66e6ee8700d1b upstream. NFS_INO_INVALID_DATA cannot be ignored, even if we have a delegation. We're still having some problems with data corruption when multiple clients are appending to a file and those clients are being granted write delegations on open. To reproduce: Client A: vi /mnt/`hostname -s` while :; do echo "XXXXXXXXXXXXXXX" >>/mnt/file; sleep $(( $RANDOM % 5 )); done Client B: vi /mnt/`hostname -s` while :; do echo "YYYYYYYYYYYYYYY" >>/mnt/file; sleep $(( $RANDOM % 5 )); done What's happening is that in nfs_update_inode() we're recognizing that the file size has changed and we're setting NFS_INO_INVALID_DATA accordingly, but then we ignore the cache_validity flags in nfs_write_pageuptodate() because we have a delegation. As a result, in nfs_updatepage() we're extending the write to cover the full page even though we've not read in the data to begin with. Signed-off-by: Scott Mayhew <smayhew@redhat.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06NFS: populate ->net in mount data when remountingMateusz Guzik
commit a914722f333b3359d2f4f12919380a334176bb89 upstream. Otherwise the kernel oopses when remounting with IPv6 server because net is dereferenced in dev_get_by_name. Use net ns of current thread so that dev_get_by_name does not operate on foreign ns. Changing the address is prohibited anyway so this should not affect anything. Signed-off-by: Mateusz Guzik <mguzik@redhat.com> Cc: linux-nfs@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06NFS: Use raw_write_seqcount_begin/end int nfs4_reclaim_open_stateTrond Myklebust
commit abbec2da13f0e4c5d9b78b7e2c025a3e617228ba upstream. The addition of lockdep code to write_seqcount_begin/end has lead to a bunch of false positive claims of ABBA deadlocks with the so_lock spinlock. Audits show that this simply cannot happen because the read side code does not spin while holding so_lock. Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06NFS: Don't declare inode uptodate unless all attributes were checkedTrond Myklebust
commit 43b6535e717d2f656f71d9bd16022136b781c934 upstream. Fix a bug, whereby nfs_update_inode() was declaring the inode to be up to date despite not having checked all the attributes. The bug occurs because the temporary variable in which we cache the validity information is 'sanitised' before reapplying to nfsi->cache_validity. Reported-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06nfsd: getattr for FATTR4_WORD0_FILES_AVAIL needs the statfs bufferChristoph Hellwig
commit 12337901d654415d9f764b5f5ba50052e9700f37 upstream. Note nobody's ever noticed because the typical client probably never requests FILES_AVAIL without also requesting something else on the list. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06nfsd4: fix FREE_STATEID lockowner leakJ. Bruce Fields
commit 48385408b45523d9a432c66292d47ef43efcbb94 upstream. 27b11428b7de ("nfsd4: remove lockowner when removing lock stateid") introduced a memory leak. Reported-by: Jeff Layton <jeff.layton@primarydata.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06pNFS: Handle allocation errors correctly in filelayout_alloc_layout_hdr()Trond Myklebust
commit 6df200f5d5191bdde4d2e408215383890f956781 upstream. Return the NULL pointer when the allocation fails. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06UBIFS: Remove incorrect assertion in shrink_tnc()hujianyang
commit 72abc8f4b4e8574318189886de627a2bfe6cd0da upstream. I hit the same assert failed as Dolev Raviv reported in Kernel v3.10 shows like this: [ 9641.164028] UBIFS assert failed in shrink_tnc at 131 (pid 13297) [ 9641.234078] CPU: 1 PID: 13297 Comm: mmap.test Tainted: G O 3.10.40 #1 [ 9641.234116] [<c0011a6c>] (unwind_backtrace+0x0/0x12c) from [<c000d0b0>] (show_stack+0x20/0x24) [ 9641.234137] [<c000d0b0>] (show_stack+0x20/0x24) from [<c0311134>] (dump_stack+0x20/0x28) [ 9641.234188] [<c0311134>] (dump_stack+0x20/0x28) from [<bf22425c>] (shrink_tnc_trees+0x25c/0x350 [ubifs]) [ 9641.234265] [<bf22425c>] (shrink_tnc_trees+0x25c/0x350 [ubifs]) from [<bf2245ac>] (ubifs_shrinker+0x25c/0x310 [ubifs]) [ 9641.234307] [<bf2245ac>] (ubifs_shrinker+0x25c/0x310 [ubifs]) from [<c00cdad8>] (shrink_slab+0x1d4/0x2f8) [ 9641.234327] [<c00cdad8>] (shrink_slab+0x1d4/0x2f8) from [<c00d03d0>] (do_try_to_free_pages+0x300/0x544) [ 9641.234344] [<c00d03d0>] (do_try_to_free_pages+0x300/0x544) from [<c00d0a44>] (try_to_free_pages+0x2d0/0x398) [ 9641.234363] [<c00d0a44>] (try_to_free_pages+0x2d0/0x398) from [<c00c6a60>] (__alloc_pages_nodemask+0x494/0x7e8) [ 9641.234382] [<c00c6a60>] (__alloc_pages_nodemask+0x494/0x7e8) from [<c00f62d8>] (new_slab+0x78/0x238) [ 9641.234400] [<c00f62d8>] (new_slab+0x78/0x238) from [<c031081c>] (__slab_alloc.constprop.42+0x1a4/0x50c) [ 9641.234419] [<c031081c>] (__slab_alloc.constprop.42+0x1a4/0x50c) from [<c00f80e8>] (kmem_cache_alloc_trace+0x54/0x188) [ 9641.234459] [<c00f80e8>] (kmem_cache_alloc_trace+0x54/0x188) from [<bf227908>] (do_readpage+0x168/0x468 [ubifs]) [ 9641.234553] [<bf227908>] (do_readpage+0x168/0x468 [ubifs]) from [<bf2296a0>] (ubifs_readpage+0x424/0x464 [ubifs]) [ 9641.234606] [<bf2296a0>] (ubifs_readpage+0x424/0x464 [ubifs]) from [<c00c17c0>] (filemap_fault+0x304/0x418) [ 9641.234638] [<c00c17c0>] (filemap_fault+0x304/0x418) from [<c00de694>] (__do_fault+0xd4/0x530) [ 9641.234665] [<c00de694>] (__do_fault+0xd4/0x530) from [<c00e10c0>] (handle_pte_fault+0x480/0xf54) [ 9641.234690] [<c00e10c0>] (handle_pte_fault+0x480/0xf54) from [<c00e2bf8>] (handle_mm_fault+0x140/0x184) [ 9641.234716] [<c00e2bf8>] (handle_mm_fault+0x140/0x184) from [<c0316688>] (do_page_fault+0x150/0x3ac) [ 9641.234737] [<c0316688>] (do_page_fault+0x150/0x3ac) from [<c000842c>] (do_DataAbort+0x3c/0xa0) [ 9641.234759] [<c000842c>] (do_DataAbort+0x3c/0xa0) from [<c0314e38>] (__dabt_usr+0x38/0x40) After analyzing the code, I found a condition that may cause this failed in correct operations. Thus, I think this assertion is wrong and should be removed. Suppose there are two clean znodes and one dirty znode in TNC. So the per-filesystem atomic_t @clean_zn_cnt is (2). If commit start, dirty_znode is set to COW_ZNODE in get_znodes_to_commit() in case of potentially ops on this znode. We clear COW bit and DIRTY bit in write_index() without @tnc_mutex locked. We don't increase @clean_zn_cnt in this place. As the comments in write_index() shows, if another process hold @tnc_mutex and dirty this znode after we clean it, @clean_zn_cnt would be decreased to (1). We will increase @clean_zn_cnt to (2) with @tnc_mutex locked in free_obsolete_znodes() to keep it right. If shrink_tnc() performs between decrease and increase, it will release other 2 clean znodes it holds and found @clean_zn_cnt is less than zero (1 - 2 = -1), then hit the assertion. Because free_obsolete_znodes() will soon correct @clean_zn_cnt and no harm to fs in this case, I think this assertion could be removed. 2 clean zondes and 1 dirty znode, @clean_zn_cnt == 2 Thread A (commit) Thread B (write or others) Thread C (shrinker) ->write_index ->clear_bit(DIRTY_NODE) ->clear_bit(COW_ZNODE) @clean_zn_cnt == 2 ->mutex_locked(&tnc_mutex) ->dirty_cow_znode ->!ubifs_zn_cow(znode) ->!test_and_set_bit(DIRTY_NODE) ->atomic_dec(&clean_zn_cnt) ->mutex_unlocked(&tnc_mutex) @clean_zn_cnt == 1 ->mutex_locked(&tnc_mutex) ->shrink_tnc ->destroy_tnc_subtree ->atomic_sub(&clean_zn_cnt, 2) ->ubifs_assert <- hit ->mutex_unlocked(&tnc_mutex) @clean_zn_cnt == -1 ->mutex_lock(&tnc_mutex) ->free_obsolete_znodes ->atomic_inc(&clean_zn_cnt) ->mutux_unlock(&tnc_mutex) @clean_zn_cnt == 0 (correct after shrink) Signed-off-by: hujianyang <hujianyang@huawei.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-07-06UBIFS: fix an mmap and fsync race conditionhujianyang
commit 691a7c6f28ac90cccd0dbcf81348ea90b211bdd0 upstream. There is a race condition in UBIFS: Thread A (mmap) Thread B (fsync) ->__do_fault ->write_cache_pages -> ubifs_vm_page_mkwrite -> budget_space -> lock_page -> release/convert_page_budget -> SetPagePrivate -> TestSetPageDirty -> unlock_page -> lock_page -> TestClearPageDirty -> ubifs_writepage -> do_writepage -> release_budget -> ClearPagePrivate -> unlock_page -> !(ret & VM_FAULT_LOCKED) -> lock_page -> set_page_dirty -> ubifs_set_page_dirty -> TestSetPageDirty (set page dirty without budgeting) -> unlock_page This leads to situation where we have a diry page but no budget allocated for this page, so further write-back may fail with -ENOSPC. In this fix we return from page_mkwrite without performing unlock_page. We return VM_FAULT_LOCKED instead. After doing this, the race above will not happen. Signed-off-by: hujianyang <hujianyang@huawei.com> Tested-by: Laurence Withers <lwithers@guralp.com> Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30epoll: fix use-after-free in eventpoll_release_fileKonstantin Khlebnikov
commit ebe06187bf2aec10d537ce4595e416035367d703 upstream. This fixes use-after-free of epi->fllink.next inside list loop macro. This loop actually releases elements in the body. The list is rcu-protected but here we cannot hold rcu_read_lock because we need to lock mutex inside. The obvious solution is to use list_for_each_entry_safe(). RCU-ness isn't essential because nobody can change this list under us, it's final fput for this file. The bug was introduced by ae10b2b4eb01 ("epoll: optimize EPOLL_CTL_DEL using rcu") Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Jason Baron <jbaron@akamai.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-06-30btrfs: allocate raid type kobjects dynamicallyJeff Mahoney
commit c1895442be01c58449e3bf9272f22062a670e08f upstream. We are currently allocating space_info objects in an array when we allocate space_info. When a user does something like: # btrfs balance start -mconvert=raid1 -dconvert=raid1 /mnt # btrfs balance start -mconvert=single -dconvert=single /mnt -f # btrfs balance start -mconvert=raid1 -dconvert=raid1 / We can end up with memory corruption since the kobject hasn't been reinitialized properly and the name pointer was left set. The rationale behind allocating them statically was to avoid creating a separate kobject container that just contained the raid type. It used the index in the array to determine the index. Ultimately, though, this wastes more memory than it saves in all but the most complex scenarios and introduces kobject lifetime questions. This patch allocates the kobjects dynamically instead. Note that we also remove the kobject_get/put of the parent kobject since kobject_add and kobject_del do that internally. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>