summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2017-01-06xfs: set AGI buffer type in xlog_recover_clear_agi_bucketEric Sandeen
commit 6b10b23ca94451fae153a5cc8d62fd721bec2019 upstream. xlog_recover_clear_agi_bucket didn't set the type to XFS_BLFT_AGI_BUF, so we got a warning during log replay (or an ASSERT on a debug build). XFS (md0): Unknown buffer type 0! XFS (md0): _xfs_buf_ioapply: no ops on block 0xaea8802/0x1 Fix this, as was done in f19b872b for 2 other locations with the same problem. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06CIFS: Fix a possible memory corruption in push locksPavel Shilovsky
commit e3d240e9d505fc67f8f8735836df97a794bbd946 upstream. If maxBuf is not 0 but less than a size of SMB2 lock structure we can end up with a memory corruption. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06CIFS: Fix missing nls unload in smb2_reconnect()Pavel Shilovsky
commit 4772c79599564bd08ee6682715a7d3516f67433f upstream. Acked-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06CIFS: Fix a possible memory corruption during reconnectPavel Shilovsky
commit 53e0e11efe9289535b060a51d4cf37c25e0d0f2b upstream. We can not unlock/lock cifs_tcp_ses_lock while walking through ses and tcon lists because it can corrupt list iterator pointers and a tcon structure can be released if we don't hold an extra reference. Fix it by moving a reconnect process to a separate delayed work and acquiring a reference to every tcon that needs to be reconnected. Also do not send an echo request on newly established connections. Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06exec: Ensure mm->user_ns contains the execed filesEric W. Biederman
commit f84df2a6f268de584a201e8911384a2d244876e3 upstream. When the user namespace support was merged the need to prevent ptrace from revealing the contents of an unreadable executable was overlooked. Correct this oversight by ensuring that the executed file or files are in mm->user_ns, by adjusting mm->user_ns. Use the new function privileged_wrt_inode_uidgid to see if the executable is a member of the user namespace, and as such if having CAP_SYS_PTRACE in the user namespace should allow tracing the executable. If not update mm->user_ns to the parent user namespace until an appropriate parent is found. Reported-by: Jann Horn <jann@thejh.net> Fixes: 9e4a36ece652 ("userns: Fail exec for suid and sgid binaries with ids outside our user namespace.") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06fs: exec: apply CLOEXEC before changing dumpable task flagsAleksa Sarai
commit 613cc2b6f272c1a8ad33aefa21cad77af23139f7 upstream. If you have a process that has set itself to be non-dumpable, and it then undergoes exec(2), any CLOEXEC file descriptors it has open are "exposed" during a race window between the dumpable flags of the process being reset for exec(2) and CLOEXEC being applied to the file descriptors. This can be exploited by a process by attempting to access /proc/<pid>/fd/... during this window, without requiring CAP_SYS_PTRACE. The race in question is after set_dumpable has been (for get_link, though the trace is basically the same for readlink): [vfs] -> proc_pid_link_inode_operations.get_link -> proc_pid_get_link -> proc_fd_access_allowed -> ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS); Which will return 0, during the race window and CLOEXEC file descriptors will still be open during this window because do_close_on_exec has not been called yet. As a result, the ordering of these calls should be reversed to avoid this race window. This is of particular concern to container runtimes, where joining a PID namespace with file descriptors referring to the host filesystem can result in security issues (since PRCTL_SET_DUMPABLE doesn't protect against access of CLOEXEC file descriptors -- file descriptors which may reference filesystem objects the container shouldn't have access to). Cc: dev@opencontainers.org Cc: <stable@vger.kernel.org> # v3.2+ Reported-by: Michael Crosby <crosbymichael@gmail.com> Signed-off-by: Aleksa Sarai <asarai@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06f2fs: set ->owner for debugfs status file's file_operationsNicolai Stange
commit 05e6ea2685c964db1e675a24a4f4e2adc22d2388 upstream. The struct file_operations instance serving the f2fs/status debugfs file lacks an initialization of its ->owner. This means that although that file might have been opened, the f2fs module can still get removed. Any further operation on that opened file, releasing included, will cause accesses to unmapped memory. Indeed, Mike Marshall reported the following: BUG: unable to handle kernel paging request at ffffffffa0307430 IP: [<ffffffff8132a224>] full_proxy_release+0x24/0x90 <...> Call Trace: [] __fput+0xdf/0x1d0 [] ____fput+0xe/0x10 [] task_work_run+0x8e/0xc0 [] do_exit+0x2ae/0xae0 [] ? __audit_syscall_entry+0xae/0x100 [] ? syscall_trace_enter+0x1ca/0x310 [] do_group_exit+0x44/0xc0 [] SyS_exit_group+0x14/0x20 [] do_syscall_64+0x61/0x150 [] entry_SYSCALL64_slow_path+0x25/0x25 <...> ---[ end trace f22ae883fa3ea6b8 ]--- Fixing recursive fault but reboot is needed! Fix this by initializing the f2fs/status file_operations' ->owner with THIS_MODULE. This will allow debugfs to grab a reference to the f2fs module upon any open on that file, thus preventing it from getting removed. Fixes: 902829aa0b72 ("f2fs: move proc files to debugfs") Reported-by: Mike Marshall <hubcap@omnibond.com> Reported-by: Martin Brandenburg <martin@omnibond.com> Signed-off-by: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06ext4: do not perform data journaling when data is encryptedSergey Karamov
commit 73b92a2a5e97d17cc4d5c4fe9d724d3273fb6fd2 upstream. Currently data journalling is incompatible with encryption: enabling both at the same time has never been supported by design, and would result in unpredictable behavior. However, users are not precluded from turning on both features simultaneously. This change programmatically replaces data journaling for encrypted regular files with ordered data journaling mode. Background: Journaling encrypted data has not been supported because it operates on buffer heads of the page in the page cache. Namely, when the commit happens, which could be up to five seconds after caching, the commit thread uses the buffer heads attached to the page to copy the contents of the page to the journal. With encryption, it would have been required to keep the bounce buffer with ciphertext for up to the aforementioned five seconds, since the page cache can only hold plaintext and could not be used for journaling. Alternatively, it would be required to setup the journal to initiate a callback at the commit time to perform deferred encryption - in this case, not only would the data have to be written twice, but it would also have to be encrypted twice. This level of complexity was not justified for a mode that in practice is very rarely used because of the overhead from the data journalling. Solution: If data=journaled has been set as a mount option for a filesystem, or if journaling is enabled on a regular file, do not perform journaling if the file is also encrypted, instead fall back to the data=ordered mode for the file. Rationale: The intent is to allow seamless and proper filesystem operation when journaling and encryption have both been enabled, and have these two conflicting features gracefully resolved by the filesystem. Fixes: 4461471107b7 Signed-off-by: Sergey Karamov <skaramov@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06ext4: return -ENOMEM instead of successDan Carpenter
commit 578620f451f836389424833f1454eeeb2ffc9e9f upstream. We should set the error code if kzalloc() fails. Fixes: 67cf5b09a46f ("ext4: add the basic function for inline data support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06ext4: reject inodes with negative sizeDarrick J. Wong
commit 7e6e1ef48fc02f3ac5d0edecbb0c6087cd758d58 upstream. Don't load an inode with a negative size; this causes integer overflow problems in the VFS. [ Added EXT4_ERROR_INODE() to mark file system as corrupted. -TYT] Fixes: a48380f769df (ext4: rename i_dir_acl to i_size_high) Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06ext4: add sanity checking to count_overhead()Theodore Ts'o
commit c48ae41bafe31e9a66d8be2ced4e42a6b57fa814 upstream. The commit "ext4: sanity check the block and cluster size at mount time" should prevent any problems, but in case the superblock is modified while the file system is mounted, add an extra safety check to make sure we won't overrun the allocated buffer. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06ext4: fix in-superblock mount options processingTheodore Ts'o
commit 5aee0f8a3f42c94c5012f1673420aee96315925a upstream. Fix a large number of problems with how we handle mount options in the superblock. For one, if the string in the superblock is long enough that it is not null terminated, we could run off the end of the string and try to interpret superblocks fields as characters. It's unlikely this will cause a security problem, but it could result in an invalid parse. Also, parse_options is destructive to the string, so in some cases if there is a comma-separated string, it would be modified in the superblock. (Fortunately it only happens on file systems with a 1k block size.) Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06ext4: use more strict checks for inodes_per_block on mountTheodore Ts'o
commit cd6bb35bf7f6d7d922509bf50265383a0ceabe96 upstream. Centralize the checks for inodes_per_block and be more strict to make sure the inodes_per_block_group can't end up being zero. Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06ext4: fix stack memory corruption with 64k block sizeChandan Rajendra
commit 30a9d7afe70ed6bd9191d3000e2ef1a34fb58493 upstream. The number of 'counters' elements needed in 'struct sg' is super_block->s_blocksize_bits + 2. Presently we have 16 'counters' elements in the array. This is insufficient for block sizes >= 32k. In such cases the memcpy operation performed in ext4_mb_seq_groups_show() would cause stack memory corruption. Fixes: c9de560ded61f Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06ext4: fix mballoc breakage with 64k block sizeChandan Rajendra
commit 69e43e8cc971a79dd1ee5d4343d8e63f82725123 upstream. 'border' variable is set to a value of 2 times the block size of the underlying filesystem. With 64k block size, the resulting value won't fit into a 16-bit variable. Hence this commit changes the data type of 'border' to 'unsigned int'. Fixes: c9de560ded61f Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06ptrace: Capture the ptracer's creds not PT_PTRACE_CAPEric W. Biederman
commit 64b875f7ac8a5d60a4e191479299e931ee949b67 upstream. When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was overlooked. This can result in incorrect behavior when an application like strace traces an exec of a setuid executable. Further PT_PTRACE_CAP does not have enough information for making good security decisions as it does not report which user namespace the capability is in. This has already allowed one mistake through insufficient granulariy. I found this issue when I was testing another corner case of exec and discovered that I could not get strace to set PT_PTRACE_CAP even when running strace as root with a full set of caps. This change fixes the above issue with strace allowing stracing as root a setuid executable without disabling setuid. More fundamentaly this change allows what is allowable at all times, by using the correct information in it's decision. Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12") Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06block_dev: don't test bdev->bd_contains when it is not stableNeilBrown
commit bcc7f5b4bee8e327689a4d994022765855c807ff upstream. bdev->bd_contains is not stable before calling __blkdev_get(). When __blkdev_get() is called on a parition with ->bd_openers == 0 it sets bdev->bd_contains = bdev; which is not correct for a partition. After a call to __blkdev_get() succeeds, ->bd_openers will be > 0 and then ->bd_contains is stable. When FMODE_EXCL is used, blkdev_get() calls bd_start_claiming() -> bd_prepare_to_claim() -> bd_may_claim() This call happens before __blkdev_get() is called, so ->bd_contains is not stable. So bd_may_claim() cannot safely use ->bd_contains. It currently tries to use it, and this can lead to a BUG_ON(). This happens when a whole device is already open with a bd_holder (in use by dm in my particular example) and two threads race to open a partition of that device for the first time, one opening with O_EXCL and one without. The thread that doesn't use O_EXCL gets through blkdev_get() to __blkdev_get(), gains the ->bd_mutex, and sets bdev->bd_contains = bdev; Immediately thereafter the other thread, using FMODE_EXCL, calls bd_start_claiming() from blkdev_get(). This should fail because the whole device has a holder, but because bdev->bd_contains == bdev bd_may_claim() incorrectly reports success. This thread continues and blocks on bd_mutex. The first thread then sets bdev->bd_contains correctly and drops the mutex. The thread using FMODE_EXCL then continues and when it calls bd_may_claim() again in: BUG_ON(!bd_may_claim(bdev, whole, holder)); The BUG_ON fires. Fix this by removing the dependency on ->bd_contains in bd_may_claim(). As bd_may_claim() has direct access to the whole device, it can simply test if the target bdev is the whole device. Fixes: 6b4517a7913a ("block: implement bd_claiming and claiming block") Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06btrfs: make file clone aware of fatal signalsWang Xiaoguang
commit 69ae5e4459e43e56f03d0987e865fbac2b05af2a upstream. Indeed this just make the behavior similar to xfs when process has fatal signals pending, and it'll make fstests/generic/298 happy. Signed-off-by: Wang Xiaoguang <wangxg.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06Btrfs: don't BUG() during drop snapshotJosef Bacik
commit 4867268c57ff709a7b6b86ae6f6537d846d1443a upstream. Really there's lots of things that can go wrong here, kill all the BUG_ON()'s and replace the logic ones with ASSERT()'s and return EIO instead. Signed-off-by: Josef Bacik <jbacik@fb.com> [ switched to btrfs_err, errors go to common label ] Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06Btrfs: fix memory leak in do_walk_downLiu Bo
commit a958eab0ed7fdc1b977bc25d3af6efedaa945488 upstream. The extent buffer 'next' needs to be free'd conditionally. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06Btrfs: don't leak reloc root nodes on errorJosef Bacik
commit 6bdf131fac2336adb1a628f992ba32384f653a55 upstream. We don't track the reloc roots in any sort of normal way, so the only way the root/commit_root nodes get free'd is if the relocation finishes successfully and the reloc root is deleted. Fix this by free'ing them in free_reloc_roots. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06Btrfs: return gracefully from balance if fs tree is corruptedLiu Bo
commit 3561b9db70928f207be4570b48fc19898eeaef54 upstream. When relocating tree blocks, we firstly get block information from back references in the extent tree, we then search fs tree to try to find all parents of a block. However, if fs tree is corrupted, eg. if there're some missing items, we could come across these WARN_ONs and BUG_ONs. This makes us print some error messages and return gracefully from balance. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06Btrfs: bail out if block group has different mixed flagLiu Bo
commit 49303381f19ab16a371a061b67e783d3f570d56e upstream. Currently we allow inconsistence about mixed flag (BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA). We'd get ENOSPC if block group has mixed flag and btrfs doesn't. If that happens, we have one space_info with mixed flag and another space_info only with BTRFS_BLOCK_GROUP_METADATA, and global_block_rsv.space_info points to the latter one, but all bytes from block_group contributes to the mixed space_info, thus all the allocation will fail with ENOSPC. This adds a check for the above case. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> [ updated message ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06Btrfs: fix memory leak in reading btree blocksLiu Bo
commit 2571e739677f1e4c0c63f5ed49adcc0857923625 upstream. So we can read a btree block via readahead or intentional read, and we can end up with a memory leak when something happens as follows, 1) readahead starts to read block A but does not wait for read completion, 2) btree_readpage_end_io_hook finds that block A is corrupted, and it needs to clear all block A's pages' uptodate bit. 3) meanwhile an intentional read kicks in and checks block A's pages' uptodate to decide which page needs to be read. 4) when some pages have the uptodate bit during 3)'s check so 3) doesn't count them for eb->io_pages, but they are later cleared by 2) so we has to readpage on the page, we get the wrong eb->io_pages which results in a memory leak of this block. This fixes the problem by firstly getting all pages's locking and then checking pages' uptodate bit. t1(readahead) t2(readahead endio) t3(the following read) read_extent_buffer_pages end_bio_extent_readpage for pg in eb: for page 0,1,2 in eb: if pg is uptodate: btree_readpage_end_io_hook(pg) num_reads++ if uptodate: eb->io_pages = num_reads SetPageUptodate(pg) _______________ for pg in eb: for page 3 in eb: read_extent_buffer_pages if pg is NOT uptodate: btree_readpage_end_io_hook(pg) for pg in eb: __extent_read_full_page(pg) sanity check reports something wrong if pg is uptodate: clear_extent_buffer_uptodate(eb) num_reads++ for pg in eb: eb->io_pages = num_reads ClearPageUptodate(page) _______________ for pg in eb: if pg is NOT uptodate: __extent_read_full_page(pg) So t3's eb->io_pages is not consistent with the number of pages it's reading, and during endio(), atomic_dec_and_test(&eb->io_pages) will get a negative number so that we're not able to free the eb. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06Btrfs: fix qgroup rescan worker initializationFilipe Manana
commit 8d9eddad19467b008e0c881bc3133d7da94b7ec1 upstream. We were setting the qgroup_rescan_running flag to true only after the rescan worker started (which is a task run by a queue). So if a user space task starts a rescan and immediately after asks to wait for the rescan worker to finish, this second call might happen before the rescan worker task starts running, in which case the rescan wait ioctl returns immediatley, not waiting for the rescan worker to finish. This was making the fstest btrfs/022 fail very often. Fixes: d2c609b834d6 (btrfs: properly track when rescan worker is running) Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06btrfs: store and load values of stripes_min/stripes_max in balance status itemDavid Sterba
commit ed0df618b1b06d7431ee4d985317fc5419a5d559 upstream. The balance status item contains currently known filter values, but the stripes filter was unintentionally not among them. This would mean, that interrupted and automatically restarted balance does not apply the stripe filters. Fixes: dee32d0ac3719ef8d640efaf0884111df444730f Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06Btrfs: fix tree search logic when replaying directory entry deletesRobbie Ko
commit 2a7bf53f577e49c43de4ffa7776056de26db65d9 upstream. If a log tree has a layout like the following: leaf N: ... item 240 key (282 DIR_LOG_ITEM 0) itemoff 8189 itemsize 8 dir log end 1275809046 leaf N + 1: item 0 key (282 DIR_LOG_ITEM 3936149215) itemoff 16275 itemsize 8 dir log end 18446744073709551615 ... When we pass the value 1275809046 + 1 as the parameter start_ret to the function tree-log.c:find_dir_range() (done by replay_dir_deletes()), we end up with path->slots[0] having the value 239 (points to the last item of leaf N, item 240). Because the dir log item in that position has an offset value smaller than *start_ret (1275809046 + 1) we need to move on to the next leaf, however the logic for that is wrong since it compares the current slot to the number of items in the leaf, which is smaller and therefore we don't lookup for the next leaf but instead we set the slot to point to an item that does not exist, at slot 240, and we later operate on that slot which has unexpected content or in the worst case can result in an invalid memory access (accessing beyond the last page of leaf N's extent buffer). So fix the logic that checks when we need to lookup at the next leaf by first incrementing the slot and only after to check if that slot is beyond the last item of the current leaf. Signed-off-by: Robbie Ko <robbieko@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Fixes: e02119d5a7b4 (Btrfs: Add a write ahead tree log to optimize synchronous operations) Signed-off-by: Filipe Manana <fdmanana@suse.com> [Modified changelog for clarity and correctness] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06btrfs: limit async_work allocation and worker func durationMaxim Patlasov
commit 2939e1a86f758b55cdba73e29397dd3d94df13bc upstream. Problem statement: unprivileged user who has read-write access to more than one btrfs subvolume may easily consume all kernel memory (eventually triggering oom-killer). Reproducer (./mkrmdir below essentially loops over mkdir/rmdir): [root@kteam1 ~]# cat prep.sh DEV=/dev/sdb mkfs.btrfs -f $DEV mount $DEV /mnt for i in `seq 1 16` do mkdir /mnt/$i btrfs subvolume create /mnt/SV_$i ID=`btrfs subvolume list /mnt |grep "SV_$i$" |cut -d ' ' -f 2` mount -t btrfs -o subvolid=$ID $DEV /mnt/$i chmod a+rwx /mnt/$i done [root@kteam1 ~]# sh prep.sh [maxim@kteam1 ~]$ for i in `seq 1 16`; do ./mkrmdir /mnt/$i 2000 2000 & done [root@kteam1 ~]# for i in `seq 1 4`; do grep "kmalloc-128" /proc/slabinfo | grep -v dma; sleep 60; done kmalloc-128 10144 10144 128 32 1 : tunables 0 0 0 : slabdata 317 317 0 kmalloc-128 9992352 9992352 128 32 1 : tunables 0 0 0 : slabdata 312261 312261 0 kmalloc-128 24226752 24226752 128 32 1 : tunables 0 0 0 : slabdata 757086 757086 0 kmalloc-128 42754240 42754240 128 32 1 : tunables 0 0 0 : slabdata 1336070 1336070 0 The huge numbers above come from insane number of async_work-s allocated and queued by btrfs_wq_run_delayed_node. The problem is caused by btrfs_wq_run_delayed_node() queuing more and more works if the number of delayed items is above BTRFS_DELAYED_BACKGROUND. The worker func (btrfs_async_run_delayed_root) processes at least BTRFS_DELAYED_BATCH items (if they are present in the list). So, the machinery works as expected while the list is almost empty. As soon as it is getting bigger, worker func starts to process more than one item at a time, it takes longer, and the chances to have async_works queued more than needed is getting higher. The problem above is worsened by another flaw of delayed-inode implementation: if async_work was queued in a throttling branch (number of items >= BTRFS_DELAYED_WRITEBACK), corresponding worker func won't quit until the number of items < BTRFS_DELAYED_BACKGROUND / 2. So, it is possible that the func occupies CPU infinitely (up to 30sec in my experiments): while the func is trying to drain the list, the user activity may add more and more items to the list. The patch fixes both problems in straightforward way: refuse queuing too many works in btrfs_wq_run_delayed_node and bail out of worker func if at least BTRFS_DELAYED_WRITEBACK items are processed. Changed in v2: remove support of thresh == NO_THRESHOLD. Signed-off-by: Maxim Patlasov <mpatlasov@virtuozzo.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-12-02NFSv4.x: hide array-bounds warningArnd Bergmann
commit d55b352b01bc78fbc3d1bb650140668b87e58bf9 upstream. A correct bugfix introduced a harmless warning that shows up with gcc-7: fs/nfs/callback.c: In function 'nfs_callback_up': fs/nfs/callback.c:214:14: error: array subscript is outside array bounds [-Werror=array-bounds] What happens here is that the 'minorversion == 0' check tells the compiler that we assume minorversion can be something other than 0, but when CONFIG_NFS_V4_1 is disabled that would be invalid and result in an out-of-bounds access. The added check for IS_ENABLED(CONFIG_NFS_V4_1) tells gcc that this really can't happen, which makes the code slightly smaller and also avoids the warning. The bugfix that introduced the warning is marked for stable backports, we want this one backported to the same releases. Fixes: 98b0f80c2396 ("NFSv4.x: Fix a refcount leak in nfs_callback_up_net") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-26ext4: sanity check the block and cluster size at mount timeTheodore Ts'o
commit 8cdf3372fe8368f56315e66bea9f35053c418093 upstream. If the block size or cluster size is insane, reject the mount. This is important for security reasons (although we shouldn't be just depending on this check). Ref: http://www.securityfocus.com/archive/1/539661 Ref: https://bugzilla.redhat.com/show_bug.cgi?id=1332506 Reported-by: Borislav Petkov <bp@alien8.de> Reported-by: Nikolay Borisov <kernel@kyup.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-26fuse: fix fuse_write_end() if zero bytes were copiedMiklos Szeredi
commit 59c3b76cc61d1d676f965c192cc7969aa5cb2744 upstream. If pos is at the beginning of a page and copied is zero then page is not zeroed but is marked uptodate. Fix by skipping everything except unlock/put of page if zero bytes were copied. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Fixes: 6b12c1b37e55 ("fuse: Implement write_begin/write_end callbacks") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-18btrfs: qgroup: Prevent qgroup->reserved from going subzeroGoldwyn Rodrigues
commit 0b34c261e235a5c74dcf78bd305845bd15fe2b42 upstream. While free'ing qgroup->reserved resources, we much check if the page has not been invalidated by a truncate operation by checking if the page is still dirty before reducing the qgroup resources. Resources in such a case are free'd when the entire extent is released by delayed_ref. This fixes a double accounting while releasing resources in case of truncating a file, reproduced by the following testcase. SCRATCH_DEV=/dev/vdb SCRATCH_MNT=/mnt mkfs.btrfs -f $SCRATCH_DEV mount -t btrfs $SCRATCH_DEV $SCRATCH_MNT cd $SCRATCH_MNT btrfs quota enable $SCRATCH_MNT btrfs subvolume create a btrfs qgroup limit 500m a $SCRATCH_MNT sync for c in {1..15}; do dd if=/dev/zero bs=1M count=40 of=$SCRATCH_MNT/a/file; done sleep 10 sync sleep 5 touch $SCRATCH_MNT/a/newfile echo "Removing file" rm $SCRATCH_MNT/a/file Fixes: b9d0b38928 ("btrfs: Add handler for invalidate page") Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-18coredump: fix unfreezable coredumping taskAndrey Ryabinin
commit 70d78fe7c8b640b5acfad56ad341985b3810998a upstream. It could be not possible to freeze coredumping task when it waits for 'core_state->startup' completion, because threads are frozen in get_signal() before they got a chance to complete 'core_state->startup'. Inability to freeze a task during suspend will cause suspend to fail. Also CRIU uses cgroup freezer during dump operation. So with an unfreezable task the CRIU dump will fail because it waits for a transition from 'FREEZING' to 'FROZEN' state which will never happen. Use freezer_do_not_count() to tell freezer to ignore coredumping task while it waits for core_state->startup completion. Link: http://lkml.kernel.org/r/1475225434-3753-1-git-send-email-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Pavel Machek <pavel@ucw.cz> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-10ovl: fsync after copy-upMiklos Szeredi
commit 641089c1549d8d3df0b047b5de7e9a111362cdce upstream. Make sure the copied up file hits the disk before renaming to the final destination. If this is not done then the copy-up may corrupt the data in the file in case of a crash. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-10ubifs: Fix regression in ubifs_readdir()Richard Weinberger
commit a00052a296e54205cf238c75bd98d17d5d02a6db upstream. Commit c83ed4c9dbb35 ("ubifs: Abort readdir upon error") broke overlayfs support because the fix exposed an internal error code to VFS. Reported-by: Peter Rosin <peda@axentia.se> Tested-by: Peter Rosin <peda@axentia.se> Reported-by: Ralph Sennhauser <ralph.sennhauser@gmail.com> Tested-by: Ralph Sennhauser <ralph.sennhauser@gmail.com> Fixes: c83ed4c9dbb35 ("ubifs: Abort readdir upon error") Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-10ubifs: Abort readdir upon errorRichard Weinberger
commit c83ed4c9dbb358b9e7707486e167e940d48bfeed upstream. If UBIFS is facing an error while walking a directory, it reports this error and ubifs_readdir() returns the error code. But the VFS readdir logic does not make the getdents system call fail in all cases. When the readdir cursor indicates that more entries are present, the system call will just return and the libc wrapper will try again since it also knows that more entries are present. This causes the libc wrapper to busy loop for ever when a directory is corrupted on UBIFS. A common approach do deal with corrupted directory entries is skipping them by setting the cursor to the next entry. On UBIFS this approach is not possible since we cannot compute the next directory entry cursor position without reading the current entry. So all we can do is setting the cursor to the "no more entries" position and make getdents exit. Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-10btrfs: fix races on root_log_ctx listsChris Mason
commit 570dd45042a7c8a7aba1ee029c5dd0f5ccf41b9b upstream. btrfs_remove_all_log_ctxs takes a shortcut where it avoids walking the list because it knows all of the waiters are patiently waiting for the commit to finish. But, there's a small race where btrfs_sync_log can remove itself from the list if it finds a log commit is already done. Also, it uses list_del_init() to remove itself from the list, but there's no way to know if btrfs_remove_all_log_ctxs has already run, so we don't know for sure if it is safe to call list_del_init(). This gets rid of all the shortcuts for btrfs_remove_all_log_ctxs(), and just calls it with the proper locking. This is part two of the corruption fixed by cbd60aa7cd1. I should have done this in the first place, but convinced myself the optimizations were safe. A 12 hour run of dbench 2048 will eventually trigger a list debug WARN_ON for the list_del_init() in btrfs_sync_log(). Fixes: d1433debe7f4346cf9fc0dafc71c3137d2a97bc4 Reported-by: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-10libxfs: clean up _calc_dquots_per_chunkDarrick J. Wong
commit 58d789678546d46d7bbd809dd7dab417c0f23655 upstream. The function xfs_calc_dquots_per_chunk takes a parameter in units of basic blocks. The kernel seems to get the units wrong, but userspace got 'fixed' by commenting out the unnecessary conversion. Fix both. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-31posix_acl: Clear SGID bit when setting file permissionsJan Kara
commit 073931017b49d9458aa351605b43a7e34598caef upstream. When file permissions are modified via chmod(2) and the user is not in the owning group or capable of CAP_FSETID, the setgid bit is cleared in inode_change_ok(). Setting a POSIX ACL via setxattr(2) sets the file permissions as well as the new ACL, but doesn't clear the setgid bit in a similar way; this allows to bypass the check in chmod(2). Fix that. References: CVE-2016-7097 Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28ubifs: Fix xattr_names length in exit pathsRichard Weinberger
commit 843741c5778398ea67055067f4cc65ae6c80ca0e upstream. When the operation fails we also have to undo the changes we made to ->xattr_names. Otherwise listxattr() will report wrong lengths. Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28jbd2: fix incorrect unlock on j_list_lockTaesoo Kim
commit 559cce698eaf4ccecb2213b2519ea3a0413e5155 upstream. When 'jh->b_transaction == transaction' (asserted by below) J_ASSERT_JH(jh, (jh->b_transaction == transaction || ... 'journal->j_list_lock' will be incorrectly unlocked, since the the lock is aquired only at the end of if / else-if statements (missing the else case). Signed-off-by: Taesoo Kim <tsgatesv@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Fixes: 6e4862a5bb9d12be87e4ea5d9a60836ebed71d28 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28ext4: do not advertise encryption support when disabledEric Biggers
commit c4704a4fbe834eee4109ca064131d440941f6235 upstream. The sysfs file /sys/fs/ext4/features/encryption was present on kernels compiled with CONFIG_EXT4_FS_ENCRYPTION=n. This was misleading because such kernels do not actually support ext4 encryption. Therefore, only provide this file on kernels compiled with CONFIG_EXT4_FS_ENCRYPTION=y. Note: since the ext4 feature files are all hardcoded to have a contents of "supported", it really is the presence or absence of the file that is significant, not the contents (and this change reflects that). Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28ceph: fix error handling in ceph_read_iterNikolay Borisov
commit 0d7718f666be181fda1ba2d08f137d87c1419347 upstream. In case __ceph_do_getattr returns an error and the retry_op in ceph_read_iter is not READ_INLINE, then it's possible to invoke __free_page on a page which is NULL, this naturally leads to a crash. This can happen when, for example, a process waiting on a MDS reply receives sigterm. Fix this by explicitly checking whether the page is set or not. Signed-off-by: Nikolay Borisov <kernel@kyup.com> Reviewed-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28isofs: Do not return EACCES for unknown filesystemsJan Kara
commit a2ed0b391dd9c3ef1d64c7c3e370f4a5ffcd324a upstream. When isofs_mount() is called to mount a device read-write, it returns EACCES even before it checks that the device actually contains an isofs filesystem. This may confuse mount(8) which then tries to mount all subsequent filesystem types in read-only mode. Fix the problem by returning EACCES only once we verify that the device indeed contains an iso9660 filesystem. Fixes: 17b7f7cf58926844e1dd40f5eb5348d481deca6a Reported-by: Kent Overstreet <kent.overstreet@gmail.com> Reported-by: Karel Zak <kzak@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28Fix regression which breaks DFS mountingSachin Prabhu
commit d171356ff11ab1825e456dfb979755e01b3c54a1 upstream. Patch a6b5058 results in -EREMOTE returned by is_path_accessible() in cifs_mount() to be ignored which breaks DFS mounting. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28Cleanup missing frees on some ioctlsSteve French
commit 24df1483c272c99ed88b0cba135d0e1dfdee3930 upstream. Cleanup some missing mem frees on some cifs ioctls, and clarify others to make more obvious that no data is returned. Signed-off-by: Steve French <smfrench@gmail.com> Acked-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28Do not send SMB3 SET_INFO request if nothing is changingSteve French
commit 18dd8e1a65ddae2351d0f0d6dd4a334f441fc5fa upstream. [CIFS] We had cases where we sent a SMB2/SMB3 setinfo request with all timestamp (and DOS attribute) fields marked as 0 (ie do not change) e.g. on chmod or chown. Signed-off-by: Steve French <steve.french@primarydata.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28SMB3: GUIDs should be constructed as random but valid uuidsSteve French
commit fa70b87cc6641978b20e12cc5d517e9ffc0086d4 upstream. GUIDs although random, and 16 bytes, need to be generated as proper uuids. Signed-off-by: Steve French <steve.french@primarydata.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Reported-by: David Goebels <davidgoe@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28Set previous session id correctly on SMB3 reconnectSteve French
commit c2afb8147e69819885493edf3a7c1ce03aaf2d4e upstream. Signed-off-by: Steve French <steve.french@primarydata.com> Reported-by: David Goebel <davidgoe@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-10-28Display number of credits availableSteve French
commit 9742805d6b1bfb45d7f267648c34fb5bcd347397 upstream. In debugging smb3, it is useful to display the number of credits available, so we can see when the server has not granted sufficient operations for the client to make progress, or alternatively the client has requested too many credits (as we saw in a recent bug) so we can compare with the number of credits the server thinks we have. Add a /proc/fs/cifs/DebugData line to display the client view on how many credits are available. Signed-off-by: Steve French <steve.french@primarydata.com> Reported-by: Germano Percossi <germano.percossi@citrix.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>