summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2012-02-03sysfs: Complain bitterly about attempts to remove files from nonexistent ↵Eric W. Biederman
directories. commit ce597919361dcec97341151690e780eade2a9cf4 upstream. Recently an OOPS was observed from the usb serial io_ti driver when it tried to remove sysfs directories. Upon investigation it turns out this driver was always buggy and that a recent sysfs change had stopped guarding itself against removing attributes from sysfs directories that had already been removed. :( Historically we have been silent about attempting to files from nonexistent sysfs directories and have politely returned error codes. That has resulted in people writing broken code that ignores the error codes. Issue a kernel WARNING and a stack backtrace to make it clear in no uncertain terms that abusing sysfs is not ok, and the callers need to fix their code. This change transforms the io_ti OOPS into a more comprehensible error message and stack backtrace. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Reported-by: Wolfgang Frisch <wfpub@roembden.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-03jbd: Issue cache flush after checkpointingJan Kara
commit 353b67d8ced4dc53281c88150ad295e24bc4b4c5 upstream. When we reach cleanup_journal_tail(), there is no guarantee that checkpointed buffers are on a stable storage - especially if buffers were written out by log_do_checkpoint(), they are likely to be only in disk's caches. Thus when we update journal superblock, effectively removing old transaction from journal, this write of superblock can get to stable storage before those checkpointed buffers which can result in filesystem corruption after a crash. A similar problem can happen if we replay the journal and wipe it before flushing disk's caches. Thus we must unconditionally issue a cache flush before we update journal superblock in these cases. The fix is slightly complicated by the fact that we have to get log tail before we issue cache flush but we can store it in the journal superblock only after the cache flush. Otherwise we risk races where new tail is written before appropriate cache flush is finished. I managed to reproduce the corruption using somewhat tweaked Chris Mason's barrier-test scheduler. Also this should fix occasional reports of 'Bit already freed' filesystem errors which are totally unreproducible but inspection of several fs images I've gathered over time points to a problem like this. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-03xfs: Fix missing xfs_iunlock() on error recovery path in xfs_readlink()Jan Kara
commit 9b025eb3a89e041bab6698e3858706be2385d692 upstream. Commit b52a360b forgot to call xfs_iunlock() when it detected corrupted symplink and bailed out. Fix it by jumping to 'out' instead of doing return. CC: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Alex Elder <elder@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-03eCryptfs: Fix oops when printing debug info in extent crypto functionsTyler Hicks
commit 58ded24f0fcb85bddb665baba75892f6ad0f4b8a upstream. If pages passed to the eCryptfs extent-based crypto functions are not mapped and the module parameter ecryptfs_verbosity=1 was specified at loading time, a NULL pointer dereference will occur. Note that this wouldn't happen on a production system, as you wouldn't pass ecryptfs_verbosity=1 on a production system. It leaks private information to the system logs and is for debugging only. The debugging info printed in these messages is no longer very useful and rather than doing a kmap() in these debugging paths, it will be better to simply remove the debugging paths completely. https://launchpad.net/bugs/913651 Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-03eCryptfs: Check inode changes in setattrTyler Hicks
commit a261a03904849c3df50bd0300efb7fb3f865137d upstream. Most filesystems call inode_change_ok() very early in ->setattr(), but eCryptfs didn't call it at all. It allowed the lower filesystem to make the call in its ->setattr() function. Then, eCryptfs would copy the appropriate inode attributes from the lower inode to the eCryptfs inode. This patch changes that and actually calls inode_change_ok() on the eCryptfs inode, fairly early in ecryptfs_setattr(). Ideally, the call would happen earlier in ecryptfs_setattr(), but there are some possible inode initialization steps that must happen first. Since the call was already being made on the lower inode, the change in functionality should be minimal, except for the case of a file extending truncate call. In that case, inode_newsize_ok() was never being called on the eCryptfs inode. Rather than inode_newsize_ok() catching maximum file size errors early on, eCryptfs would encrypt zeroed pages and write them to the lower filesystem until the lower filesystem's write path caught the error in generic_write_checks(). This patch introduces a new function, called ecryptfs_inode_newsize_ok(), which checks if the new lower file size is within the appropriate limits when the truncate operation will be growing the lower file. In summary this change prevents eCryptfs truncate operations (and the resulting page encryptions), which would exceed the lower filesystem limits or FSIZE rlimits, from ever starting. Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Reviewed-by: Li Wang <liwang@nudt.edu.cn> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-03eCryptfs: Make truncate path killableTyler Hicks
commit 5e6f0d769017cc49207ef56996e42363ec26c1f0 upstream. ecryptfs_write() handles the truncation of eCryptfs inodes. It grabs a page, zeroes out the appropriate portions, and then encrypts the page before writing it to the lower filesystem. It was unkillable and due to the lack of sparse file support could result in tying up a large portion of system resources, while encrypting pages of zeros, with no way for the truncate operation to be stopped from userspace. This patch adds the ability for ecryptfs_write() to detect a pending fatal signal and return as gracefully as possible. The intent is to leave the lower file in a useable state, while still allowing a user to break out of the encryption loop. If a pending fatal signal is detected, the eCryptfs inode size is updated to reflect the modified inode size and then -EINTR is returned. Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-03ecryptfs: Improve metadata read failure loggingTim Gardner
commit 30373dc0c87ffef68d5628e77d56ffb1fa22e1ee upstream. Print inode on metadata read failure. The only real way of dealing with metadata read failures is to delete the underlying file system file. Having the inode allows one to 'find . -inum INODE`. [tyhicks@canonical.com: Removed some minor not-for-stable parts] Signed-off-by: Tim Gardner <tim.gardner@canonical.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-03eCryptfs: Sanitize write counts of /dev/ecryptfsTyler Hicks
commit db10e556518eb9d21ee92ff944530d84349684f4 upstream. A malicious count value specified when writing to /dev/ecryptfs may result in a a very large kernel memory allocation. This patch peeks at the specified packet payload size, adds that to the size of the packet headers and compares the result with the write count value. The resulting maximum memory allocation size is approximately 532 bytes. Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Reported-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-01-25proc: clear_refs: do not clear reserved pagesWill Deacon
commit 85e72aa5384b1a614563ad63257ded0e91d1a620 upstream. /proc/pid/clear_refs is used to clear the Referenced and YOUNG bits for pages and corresponding page table entries of the task with PID pid, which includes any special mappings inserted into the page tables in order to provide things like vDSOs and user helper functions. On ARM this causes a problem because the vectors page is mapped as a global mapping and since ec706dab ("ARM: add a vma entry for the user accessible vector page"), a VMA is also inserted into each task for this page to aid unwinding through signals and syscall restarts. Since the vectors page is required for handling faults, clearing the YOUNG bit (and subsequently writing a faulting pte) means that we lose the vectors page *globally* and cannot fault it back in. This results in a system deadlock on the next exception. To see this problem in action, just run: $ echo 1 > /proc/self/clear_refs on an ARM platform (as any user) and watch your system hang. I think this has been the case since 2.6.37 This patch avoids clearing the aforementioned bits for reserved pages, therefore leaving the vectors page intact on ARM. Since reserved pages are not candidates for swap, this change should not have any impact on the usefulness of clear_refs. Signed-off-by: Will Deacon <will.deacon@arm.com> Reported-by: Moussa Ba <moussaba@micron.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Russell King <rmk@arm.linux.org.uk> Acked-by: Nicolas Pitre <nico@linaro.org> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25cifs: lower default wsize when unix extensions are not usedJeff Layton
commit ce91acb3acae26f4163c5a6f1f695d1a1e8d9009 upstream. We've had some reports of servers (namely, the Solaris in-kernel CIFS server) that don't deal properly with writes that are "too large" even though they set CAP_LARGE_WRITE_ANDX. Change the default to better mirror what windows clients do. Cc: Pavel Shilovsky <piastry@etersoft.ru> Reported-by: Nick Davis <phireph0x@yahoo.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25xfs: fix endian conversion issue in discard codeDave Chinner
commit b1c770c273a4787069306fc82aab245e9ac72e9d upstream When finding the longest extent in an AG, we read the value directly out of the AGF buffer without endian conversion. This will give an incorrect length, resulting in FITRIM operations potentially not trimming everything that it should. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25proc: clean up and fix /proc/<pid>/mem handlingLinus Torvalds
commit e268337dfe26dfc7efd422a804dbb27977a3cccc upstream. Jüri Aedla reported that the /proc/<pid>/mem handling really isn't very robust, and it also doesn't match the permission checking of any of the other related files. This changes it to do the permission checks at open time, and instead of tracking the process, it tracks the VM at the time of the open. That simplifies the code a lot, but does mean that if you hold the file descriptor open over an execve(), you'll continue to read from the _old_ VM. That is different from our previous behavior, but much simpler. If somebody actually finds a load where this matters, we'll need to revert this commit. I suspect that nobody will ever notice - because the process mapping addresses will also have changed as part of the execve. So you cannot actually usefully access the fd across a VM change simply because all the offsets for IO would have changed too. Reported-by: Jüri Aedla <asd@ut.ee> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25fix cputime overflow in uptime_proc_showMartin Schwidefsky
commit c3e0ef9a298e028a82ada28101ccd5cf64d209ee upstream. For 32-bit architectures using standard jiffies the idletime calculation in uptime_proc_show will quickly overflow. It takes (2^32 / HZ) seconds of idle-time, or e.g. 12.45 days with no load on a quad-core with HZ=1000. Switch to 64-bit calculations. Cc: Michael Abbott <michael.abbott@diamond.ac.uk> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25pnfsblock: limit bio page countPeng Tao
commit 74a6eeb44ca6174d9cc93b9b8b4d58211c57bc80 upstream. One bio can have at most BIO_MAX_PAGES pages. We should limit it bec otherwise bio_alloc will fail when there are many pages in one read/write_pagelist. Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25pnfsblock: don't spinlock when freeing block_devPeng Tao
commit 93a3844ee0f843b05a1df4b52e1a19ff26b98d24 upstream. bl_free_block_dev() may sleep. We can not call it with spinlock held. Besides, there is no need to take bm_lock as we are last user freeing bm_devlist. Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25pnfsblock: acquire im_lock in _preload_rangePeng Tao
commit 39e567ae36fe03c2b446e1b83ee3d39bea08f90b upstream. When calling _add_entry, we should take the im_lock to protect agains other modifiers. Signed-off-by: Peng Tao <peng_tao@emc.com> Signed-off-by: Benny Halevy <bhalevy@tonian.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25fix shrink_dcache_parent() livelockMiklos Szeredi
commit eaf5f9073533cde21c7121c136f1c3f072d9cf59 upstream. Two (or more) concurrent calls of shrink_dcache_parent() on the same dentry may cause shrink_dcache_parent() to loop forever. Here's what appears to happen: 1 - CPU0: select_parent(P) finds C and puts it on dispose list, returns 1 2 - CPU1: select_parent(P) locks P->d_lock 3 - CPU0: shrink_dentry_list() locks C->d_lock dentry_kill(C) tries to lock P->d_lock but fails, unlocks C->d_lock 4 - CPU1: select_parent(P) locks C->d_lock, moves C from dispose list being processed on CPU0 to the new dispose list, returns 1 5 - CPU0: shrink_dentry_list() finds dispose list empty, returns 6 - Goto 2 with CPU0 and CPU1 switched Basically select_parent() steals the dentry from shrink_dentry_list() and thinks it found a new one, causing shrink_dentry_list() to think it's making progress and loop over and over. One way to trigger this is to make udev calls stat() on the sysfs file while it is going away. Having a file in /lib/udev/rules.d/ with only this one rule seems to the trick: ATTR{vendor}=="0x8086", ATTR{device}=="0x10ca", ENV{PCI_SLOT_NAME}="%k", ENV{MATCHADDR}="$attr{address}", RUN+="/bin/true" Then execute the following loop: while true; do echo -bond0 > /sys/class/net/bonding_masters echo +bond0 > /sys/class/net/bonding_masters echo -bond1 > /sys/class/net/bonding_masters echo +bond1 > /sys/class/net/bonding_masters done One fix would be to check all callers and prevent concurrent calls to shrink_dcache_parent(). But I think a better solution is to stop the stealing behavior. This patch adds a new dentry flag that is set when the dentry is added to the dispose list. The flag is cleared in dentry_lru_del() in case the dentry gets a new reference just before being pruned. If the dentry has this flag, select_parent() will skip it and let shrink_dentry_list() retry pruning it. With select_parent() skipping those dentries there will not be the appearance of progress (new dentries found) when there is none, hence shrink_dcache_parent() will not loop forever. Set the flag is also set in prune_dcache_sb() for consistency as suggested by Linus. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25dcache: use a dispose list in select_parentDave Chinner
commit b48f03b319ba78f3abf9a7044d1f436d8d90f4f9 upstream. select_parent currently abuses the dentry cache LRU to provide cleanup features for child dentries that need to be freed. It moves them to the tail of the LRU, then tells shrink_dcache_parent() to calls __shrink_dcache_sb to unconditionally move them to a dispose list (as DCACHE_REFERENCED is ignored). __shrink_dcache_sb() has to relock the dentries to move them off the LRU onto the dispose list, but otherwise does not touch the dentries that select_parent() moved to the tail of the LRU. It then passses the dispose list to shrink_dentry_list() which tries to free the dentries. IOWs, the use of __shrink_dcache_sb() is superfluous - we can build exactly the same list of dentries for disposal directly in select_parent() and call shrink_dentry_list() instead of calling __shrink_dcache_sb() to do that. This means that we avoid long holds on the lru lock walking the LRU moving dentries to the dispose list We also avoid the need to relock each dentry just to move it off the LRU, reducing the numebr of times we lock each dentry to dispose of them in shrink_dcache_parent() from 3 to 2 times. Further, we remove one of the two callers of __shrink_dcache_sb(). This also means that __shrink_dcache_sb can be moved into back into prune_dcache_sb() and we no longer have to handle referenced dentries conditionally, simplifying the code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25fsnotify: don't BUG in fsnotify_destroy_mark()Miklos Szeredi
commit fed474857efbed79cd390d0aee224231ca718f63 upstream. Removing the parent of a watched file results in "kernel BUG at fs/notify/mark.c:139". To reproduce add "-w /tmp/audit/dir/watched_file" to audit.rules rm -rf /tmp/audit/dir This is caused by fsnotify_destroy_mark() being called without an extra reference taken by the caller. Reported by Francesco Cosoleto here: https://bugzilla.novell.com/show_bug.cgi?id=689860 Fix by removing the BUG_ON and adding a comment about not accessing mark after the iput. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25nfsd: Fix oops when parsing a 0 length exportSasha Levin
commit b2ea70afade7080360ac55c4e64ff7a5fafdb67b upstream. expkey_parse() oopses when handling a 0 length export. This is easily triggerable from usermode by writing 0 bytes into '/proc/[proc id]/net/rpc/nfsd.fh/channel'. Below is the log: [ 1402.286893] BUG: unable to handle kernel paging request at ffff880077c49fff [ 1402.287632] IP: [<ffffffff812b4b99>] expkey_parse+0x28/0x2e1 [ 1402.287632] PGD 2206063 PUD 1fdfd067 PMD 1ffbc067 PTE 8000000077c49160 [ 1402.287632] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 1402.287632] CPU 1 [ 1402.287632] Pid: 20198, comm: trinity Not tainted 3.2.0-rc2-sasha-00058-gc65cd37 #6 [ 1402.287632] RIP: 0010:[<ffffffff812b4b99>] [<ffffffff812b4b99>] expkey_parse+0x28/0x2e1 [ 1402.287632] RSP: 0018:ffff880077f0fd68 EFLAGS: 00010292 [ 1402.287632] RAX: ffff880077c49fff RBX: 00000000ffffffea RCX: 0000000001043400 [ 1402.287632] RDX: 0000000000000000 RSI: ffff880077c4a000 RDI: ffffffff82283de0 [ 1402.287632] RBP: ffff880077f0fe18 R08: 0000000000000001 R09: ffff880000000000 [ 1402.287632] R10: 0000000000000000 R11: 0000000000000001 R12: ffff880077c4a000 [ 1402.287632] R13: ffffffff82283de0 R14: 0000000001043400 R15: ffffffff82283de0 [ 1402.287632] FS: 00007f25fec3f700(0000) GS:ffff88007d400000(0000) knlGS:0000000000000000 [ 1402.287632] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 1402.287632] CR2: ffff880077c49fff CR3: 0000000077e1d000 CR4: 00000000000406e0 [ 1402.287632] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1402.287632] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 1402.287632] Process trinity (pid: 20198, threadinfo ffff880077f0e000, task ffff880077db17b0) [ 1402.287632] Stack: [ 1402.287632] ffff880077db17b0 ffff880077c4a000 ffff880077f0fdb8 ffffffff810b411e [ 1402.287632] ffff880000000000 ffff880077db17b0 ffff880077c4a000 ffffffff82283de0 [ 1402.287632] 0000000001043400 ffffffff82283de0 ffff880077f0fde8 ffffffff81111f63 [ 1402.287632] Call Trace: [ 1402.287632] [<ffffffff810b411e>] ? lock_release+0x1af/0x1bc [ 1402.287632] [<ffffffff81111f63>] ? might_fault+0x97/0x9e [ 1402.287632] [<ffffffff81111f1a>] ? might_fault+0x4e/0x9e [ 1402.287632] [<ffffffff81a8bcf2>] cache_do_downcall+0x3e/0x4f [ 1402.287632] [<ffffffff81a8c950>] cache_write.clone.16+0xbb/0x130 [ 1402.287632] [<ffffffff81a8c9df>] ? cache_write_pipefs+0x1a/0x1a [ 1402.287632] [<ffffffff81a8c9f8>] cache_write_procfs+0x19/0x1b [ 1402.287632] [<ffffffff8118dc54>] proc_reg_write+0x8e/0xad [ 1402.287632] [<ffffffff8113fe81>] vfs_write+0xaa/0xfd [ 1402.287632] [<ffffffff8114142d>] ? fget_light+0x35/0x9e [ 1402.287632] [<ffffffff8113ff8b>] sys_write+0x48/0x6f [ 1402.287632] [<ffffffff81bbdb92>] system_call_fastpath+0x16/0x1b [ 1402.287632] Code: c0 c9 c3 55 48 63 d2 48 89 e5 48 8d 44 32 ff 41 57 41 56 41 55 41 54 53 bb ea ff ff ff 48 81 ec 88 00 00 00 48 89 b5 58 ff ff ff [ 1402.287632] 38 0a 0f 85 89 02 00 00 c6 00 00 48 8b 3d 44 4a e5 01 48 85 [ 1402.287632] RIP [<ffffffff812b4b99>] expkey_parse+0x28/0x2e1 [ 1402.287632] RSP <ffff880077f0fd68> [ 1402.287632] CR2: ffff880077c49fff [ 1402.287632] ---[ end trace 368ef53ff773a5e3 ]--- Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Neil Brown <neilb@suse.de> Cc: linux-nfs@vger.kernel.org Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25nfsd4: fix lockowner matchingJ. Bruce Fields
commit b93d87c19821ba7d3ee11557403d782e541071ad upstream. Lockowners are looked up by file as well as by owner, but we were forgetting to do a comparison on the file. This could cause an incorrect result from lockt. (Note looking up the inode from the lockowner is pretty awkward here. The data structures need fixing.) Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25Unused iocbs in a batch should not be accounted as active.Gleb Natapov
commit 69e4747ee9727d660b88d7e1efe0f4afcb35db1b upstream. Since commit 080d676de095 ("aio: allocate kiocbs in batches") iocbs are allocated in a batch during processing of first iocbs. All iocbs in a batch are automatically added to ctx->active_reqs list and accounted in ctx->reqs_active. If one (not the last one) of iocbs submitted by an user fails, further iocbs are not processed, but they are still present in ctx->active_reqs and accounted in ctx->reqs_active. This causes process to stuck in a D state in wait_for_all_aios() on exit since ctx->reqs_active will never go down to zero. Furthermore since kiocb_batch_free() frees iocb without removing it from active_reqs list the list become corrupted which may cause oops. Fix this by removing iocb from ctx->active_reqs and updating ctx->reqs_active in kiocb_batch_free(). Signed-off-by: Gleb Natapov <gleb@redhat.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25UBIFS: make debugging messages light againArtem Bityutskiy
commit 1f5d78dc4823a85f112aaa2d0f17624f8c2a6c52 upstream. We switch to dynamic debugging in commit 56e46742e846e4de167dde0e1e1071ace1c882a5 but did not take into account that now we do not control anymore whether a specific message is enabled or not. So now we lock the "dbg_lock" and release it in every debugging macro, which make them not so light-weight. This commit removes the "dbg_lock" protection from the debugging macros to fix the issue. The downside is that now our DBGKEY() stuff is broken, but this is not critical at all and will be fixed later. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25UBIFS: fix debugging messagesArtem Bityutskiy
commit d34315da9146253351146140ea4b277193ee5e5f upstream. Patch 56e46742e846e4de167dde0e1e1071ace1c882a5 broke UBIFS debugging messages: before that commit when UBIFS debugging was enabled, users saw few useful debugging messages after mount. However, that patch turned 'dbg_msg()' into 'pr_debug()', so to enable the debugging messages users have to enable them first via /sys/kernel/debug/dynamic_debug/control, which is very impractical. This commit makes 'dbg_msg()' to use 'printk()' instead of 'pr_debug()', just as it was before the breakage. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25nfs: fix regression in handling of context= option in NFSv4Jeff Layton
commit 8a0d551a59ac92d8ff048d6cb29d3a02073e81e8 upstream. Setting the security context of a NFSv4 mount via the context= mount option is currently broken. The NFSv4 codepath allocates a parsed options struct, and then parses the mount options to fill it. It eventually calls nfs4_remote_mount which calls security_init_mnt_opts. That clobbers the lsm_opts struct that was populated earlier. This bug also looks like it causes a small memory leak on each v4 mount where context= is used. Fix this by moving the initialization of the lsm_opts into nfs_alloc_parsed_mount_data. Also, add a destructor for nfs_parsed_mount_data to make it easier to free all of the allocations hanging off of it, and to ensure that the security_free_mnt_opts is called whenever security_init_mnt_opts is. I believe this regression was introduced quite some time ago, probably by commit c02d7adf. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25NFSv4: include bitmap in nfsv4 get acl dataAndy Adamson
commit bf118a342f10dafe44b14451a1392c3254629a1f upstream. The NFSv4 bitmap size is unbounded: a server can return an arbitrary sized bitmap in an FATTR4_WORD0_ACL request. Replace using the nfs4_fattr_bitmap_maxsz as a guess to the maximum bitmask returned by a server with the inclusion of the bitmap (xdr length plus bitmasks) and the acl data xdr length to the (cached) acl page data. This is a general solution to commit e5012d1f "NFSv4.1: update nfs4_fattr_bitmap_maxsz" and fixes hitting a BUG_ON in xdr_shrink_bufhead when getting ACLs. Fix a bug in decode_getacl that returned -EINVAL on ACLs > page when getxattr was called with a NULL buffer, preventing ACL > PAGE_SIZE from being retrieved. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25NFS - fix recent breakage to NFS error handling.NeilBrown
commit 2edb6bc3852c681c0d948245bd55108dc6407604 upstream. From c6d615d2b97fe305cbf123a8751ced859dca1d5e Mon Sep 17 00:00:00 2001 From: NeilBrown <neilb@suse.de> Date: Wed, 16 Nov 2011 09:39:05 +1100 Subject: NFS - fix recent breakage to NFS error handling. commit 02c24a82187d5a628c68edfe71ae60dc135cd178 made a small and presumably unintended change to write error handling in NFS. Previously an error from filemap_write_and_wait_range would only be of interest if nfs_file_fsync did not return an error. After this commit, an error from filemap_write_and_wait_range would mean that (the rest of) nfs_file_fsync would not even be called. This means that: 1/ you are more likely to see EIO than e.g. EDQUOT or ENOSPC. 2/ NFS_CONTEXT_ERROR_WRITE remains set for longer so more writes are synchronous. This patch restores previous behaviour. Cc: Josef Bacik <josef@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25NFSv4.1: fix backchannel slotid off-by-one bugAndy Adamson
commit 61f2e5106582d02f30b6807e3f9c07463c572ccb upstream. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25pnfs-obj: Must return layout on IO errorBoaz Harrosh
commit fe0fe83585f88346557868a803a479dfaaa0688a upstream. As mandated by the standard. In case of an IO error, a pNFS objects layout driver must return it's layout. This is because all device errors are reported to the server as part of the layout return buffer. This is implemented the same way PNFS_LAYOUTRET_ON_SETATTR is done, through a bit flag on the pnfs_layoutdriver_type->flags member. The flag is set by the layout driver that wants a layout_return preformed at pnfs_ld_{write,read}_done in case of an error. (Though I have not defined a wrapper like pnfs_ld_layoutret_on_setattr because this code is never called outside of pnfs.c and pnfs IO paths) Without this patch 3.[0-2] Kernels leak memory and have an annoying WARN_ON after every IO error utilizing the pnfs-obj driver. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25pnfs-obj: pNFS errors are communicated on iodata->pnfs_errorBoaz Harrosh
commit 5c0b4129c07b902b27d3f3ebc087757f534a3abd upstream. Some time along the way pNFS IO errors were switched to communicate with a special iodata->pnfs_error member instead of the regular RPC members. But objlayout was not switched over. Fix that! Without this fix any IO error is hanged, because IO is not switched to MDS and pages are never cleared or read. [Applies to 3.2.0. Same bug different patch for 3.1/0 Kernels] Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25ext4: fix undefined behavior in ext4_fill_flex_info()Xi Wang
commit d50f2ab6f050311dbf7b8f5501b25f0bf64a439b upstream. Commit 503358ae01b70ce6909d19dd01287093f6b6271c ("ext4: avoid divide by zero when trying to mount a corrupted file system") fixes CVE-2009-4307 by performing a sanity check on s_log_groups_per_flex, since it can be set to a bogus value by an attacker. sbi->s_log_groups_per_flex = sbi->s_es->s_log_groups_per_flex; groups_per_flex = 1 << sbi->s_log_groups_per_flex; if (groups_per_flex < 2) { ... } This patch fixes two potential issues in the previous commit. 1) The sanity check might only work on architectures like PowerPC. On x86, 5 bits are used for the shifting amount. That means, given a large s_log_groups_per_flex value like 36, groups_per_flex = 1 << 36 is essentially 1 << 4 = 16, rather than 0. This will bypass the check, leaving s_log_groups_per_flex and groups_per_flex inconsistent. 2) The sanity check relies on undefined behavior, i.e., oversized shift. A standard-confirming C compiler could rewrite the check in unexpected ways. Consider the following equivalent form, assuming groups_per_flex is unsigned for simplicity. groups_per_flex = 1 << sbi->s_log_groups_per_flex; if (groups_per_flex == 0 || groups_per_flex == 1) { We compile the code snippet using Clang 3.0 and GCC 4.6. Clang will completely optimize away the check groups_per_flex == 0, leaving the patched code as vulnerable as the original. GCC keeps the check, but there is no guarantee that future versions will do the same. Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-25ext4: add missing ext4_resize_end on error pathsDjalal Harouni
commit 014a1770371a028d22f364718c805f4216911ecd upstream. Online resize ioctls 'EXT4_IOC_GROUP_EXTEND' and 'EXT4_IOC_GROUP_ADD' call ext4_resize_begin() to check permissions and to set the EXT4_RESIZING bit lock, they do their work and they must finish with ext4_resize_end() which calls clear_bit_unlock() to unlock and to avoid -EBUSY errors for the next resize operations. This patch adds the missing ext4_resize_end() calls on error paths. Patch tested. Signed-off-by: Djalal Harouni <tixxdz@opendz.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-12xfs: fix acl count validation in xfs_acl_from_disk()Xi Wang
commit 093019cf1b18dd31b2c3b77acce4e000e2cbc9ce upstream. Commit fa8b18ed didn't prevent the integer overflow and possible memory corruption. "count" can go negative and bypass the check. Signed-off-by: Xi Wang <xi.wang@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-12udf: Fix deadlock when converting file from in-ICB one to normal oneJan Kara
commit d2eb8c359309ec45d6bf5b147303ab8e13be86ea upstream. During BKL removal in 2.6.38, conversion of files from in-ICB format to normal format got broken. We call ->writepage with i_data_sem held but udf_get_block() also acquires i_data_sem thus creating A-A deadlock. We fix the problem by dropping i_data_sem before calling ->writepage() which is safe since i_mutex still protects us against any changes in the file. Also fix pagelock - i_data_sem lock inversion in udf_expand_file_adinicb() by dropping i_data_sem before calling find_or_create_page(). Reported-by: Matthias Matiak <netzpython@mail-on.us> Tested-by: Matthias Matiak <netzpython@mail-on.us> Reviewed-by: Namjae Jeon <linkinjeon@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-12ext3: Don't warn from writepage when readonly inode is spotted after errorJan Kara
commit 33c104d415e92a51aaf638dc3d93920cfa601e5c upstream. WARN_ON_ONCE(IS_RDONLY(inode)) tends to trip when filesystem hits error and is remounted read-only. This unnecessarily scares users (well, they should be scared because of filesystem error, but the stack trace distracts them from the right source of their fear ;-). We could as well just remove the WARN_ON but it's not hard to fix it to not trip on filesystem with errors and not use more cycles in the common case so that's what we do. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-12reiserfs: Force inode evictions before umount to avoid crashJeff Mahoney
commit a9e36da655e54545c3289b2a0700b5c443de0edd upstream. This patch fixes a crash in reiserfs_delete_xattrs during umount. When shrink_dcache_for_umount clears the dcache from generic_shutdown_super, delayed evictions are forced to disk. If an evicted inode has extended attributes associated with it, it will need to walk the xattr tree to locate and remove them. But since shrink_dcache_for_umount will BUG if it encounters active dentries, the xattr tree must be released before it's called or it will crash during every umount. This patch forces the evictions to occur before generic_shutdown_super by calling shrink_dcache_sb first. The additional evictions caused by the removal of each associated xattr file and dir will be automatically handled as they're added to the LRU list. CC: reiserfs-devel@vger.kernel.org Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-12reiserfs: Fix quota mount option parsingJan Kara
commit a06d789b424190e9f59da391681f908486db2554 upstream. When jqfmt mount option is not specified on remount, we mistakenly clear s_jquota_fmt value stored in superblock. Fix the problem. CC: reiserfs-devel@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-12ore: FIX breakage when MISC_FILESYSTEMS is not setBoaz Harrosh
commit 831c2dc5f47c1dc79c32229d75065ada1dcc66e1 upstream. As Reported by Randy Dunlap When MISC_FILESYSTEMS is not enabled and NFS4.1 is: fs/built-in.o: In function `objio_alloc_io_state': objio_osd.c:(.text+0xcb525): undefined reference to `ore_get_rw_state' fs/built-in.o: In function `_write_done': objio_osd.c:(.text+0xcb58d): undefined reference to `ore_check_io' fs/built-in.o: In function `_read_done': ... When MISC_FILESYSTEMS, which is more of a GUI thing then anything else, is not selected. exofs/Kconfig is never examined during Kconfig, and it can not do it's magic stuff to automatically select everything needed. We must split exofs/Kconfig in two. The ore one is always included. And the exofs one is left in it's old place in the menu. Reported-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-12ore: Must support none-PAGE-aligned IOBoaz Harrosh
commit 724577ca355795b0a25c93ccbeee927871ca1a77 upstream. NFS might send us offsets that are not PAGE aligned. So we must read in the reminder of the first/last pages, in cases we need it for Parity calculations. We only add an sg segments to read the partial page. But we don't mark it as read=true because it is a lock-for-write page. TODO: In some cases (IO spans a single unit) we can just adjust the raid_unit offset/length, but this is left for later Kernels. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-12ore: fix BUG_ON, too few sgs when readingBoaz Harrosh
commit 361aba569f55dd159b850489a3538253afbb3973 upstream. When reading RAID5 files, in rare cases, we calculated too few sg segments. There should be two extra for the beginning and end partial units. Also "too few sg segments" should not be a BUG_ON there is all the mechanics in place to handle it, as a short read. So just return -ENOMEM and the rest of the code will gracefully split the IO. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-12ore: Fix crash in case of an IO error.Boaz Harrosh
commit ffefb8eaa367e8a5c14f779233d9da1fbc23d164 upstream. The users of ore_check_io() expect the reported device (In case of error) to be indexed relative to the passed-in ore_components table, and not the logical dev index. This causes a crash inside objlayoutdriver in case of an IO error. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-04minixfs: misplaced checks lead to dentry leakAl Viro
bitmap size sanity checks should be done *before* allocating ->s_root; there their cleanup on failure would be correct. As it is, we do iput() on root inode, but leak the root dentry... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Josh Boyer <jwboyer@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-04[CIFS] default ntlmv2 for cifs mount delayed to 3.3Steve French
Turned out the ntlmv2 (default security authentication) upgrade was harder to test than expected, and we ran out of time to test against Apple and a few other servers that we wanted to. Delay upgrade of default security from ntlm to ntlmv2 (on mount) to 3.3. Still works fine to specify it explicitly via "sec=ntlmv2" so this should be fine. Acked-by: Jeff Layton <jlayton@samba.org> Signed-off-by: Steve French <smfrench@gmail.com>
2012-01-03cifs: fix bad buffer length check in coalesce_t2Jeff Layton
The current check looks to see if the RFC1002 length is larger than CIFSMaxBufSize, and fails if it is. The buffer is actually larger than that by MAX_CIFS_HDR_SIZE. This bug has been around for a long time, but the fact that we used to cap the clients MaxBufferSize at the same level as the server tended to paper over it. Commit c974befa changed that however and caused this bug to bite in more cases. Reported-and-Tested-by: Konstantinos Skarlatos <k.skarlatos@gmail.com> Tested-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com>
2011-12-30Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: disable use of dcache for readdir etc.
2011-12-29Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds
* 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: log all dirty inodes in xfs_fs_sync_fs xfs: log the inode in ->write_inode calls for kupdate
2011-12-29procfs: do not confuse jiffies with cputime64_tAndreas Schwab
Commit 2a95ea6c0d129b4 ("procfs: do not overflow get_{idle,iowait}_time for nohz") did not take into account that one some architectures jiffies and cputime use different units. This causes get_idle_time() to return numbers in the wrong units, making the idle time fields in /proc/stat wrong. Instead of converting the usec value returned by get_cpu_{idle,iowait}_time_us to units of jiffies, use the new function usecs_to_cputime64 to convert it to the correct unit of cputime64_t. Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Artem S. Tashkinov" <t.artem@mailcity.com> Cc: Dave Jones <davej@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-29ceph: disable use of dcache for readdir etc.Sage Weil
Ceph attempts to use the dcache to satisfy negative lookups and readdir when the entire directory contents are in cache. Disable this behavior until lingering bugs in this code are shaken out; we'll re-enable these hooks once things are fully stable. Signed-off-by: Sage Weil <sage@newdream.net>
2011-12-26vfs: fix handling of lock allocation failure in lease-break caseLinus Torvalds
Bruce Fields notes that commit 778fc546f749 ("locks: fix tracking of inprogress lease breaks") introduced a possible error pointer dereference on failure to allocate memory. locks_conflict() will dereference the passed-in new lease lock structure that may be an error pointer. This means an open (without O_NONBLOCK set) on a file with a lease applied (generally only done when Samba or nfsd (with v4) is running) could crash if a kmalloc() fails. So instead of playing games with IS_ERROR() all over the place, just check the allocation failure early. That makes the code more straightforward, and avoids this possible bad pointer dereference. Based-on-patch-by: J. Bruce Fields <bfields@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-23Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linuxLinus Torvalds
for linus: writeback reason binary tracing format fix * tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: show writeback reason with __print_symbolic