summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2013-04-25Btrfs: make sure nbytes are right after log replayJosef Bacik
commit 4bc4bee4595662d8bff92180d5c32e3313a704b0 upstream. While trying to track down a tree log replay bug I noticed that fsck was always complaining about nbytes not being right for our fsynced file. That is because the new fsync stuff doesn't wait for ordered extents to complete, so the inodes nbytes are not necessarily updated properly when we log it. So to fix this we need to set nbytes to whatever it is on the inode that is on disk, so when we replay the extents we can just add the bytes that are being added as we replay the extent. This makes it work for the case that we have the wrong nbytes or the case that we logged everything and nbytes is actually correct. With this I'm no longer getting nbytes errors out of btrfsck. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Lingzhu Xiang <lxiang@redhat.com> Reviewed-by: CAI Qian <caiqian@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-25hfsplus: fix potential overflow in hfsplus_file_truncate()Vyacheslav Dubeyko
commit 12f267a20aecf8b84a2a9069b9011f1661c779b4 upstream. Change a u32 to loff_t hfsplus_file_truncate(). Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Hin-Tak Leung <htl10@users.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-16vfs: Revert spurious fix to spinning prevention in prune_icache_sbSuleiman Souhlal
commit 5b55d708335a9e3e4f61f2dadf7511502205ccd1 upstream. Revert commit 62a3ddef6181 ("vfs: fix spinning prevention in prune_icache_sb"). This commit doesn't look right: since we are looking at the tail of the list (sb->s_inode_lru.prev) if we want to skip an inode, we should put it back at the head of the list instead of the tail, otherwise we will keep spinning on it. Discovered when investigating why prune_icache_sb came top in perf reports of a swapping load. Signed-off-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-16cifs: Allow passwords which begin with a delimitorSachin Prabhu
commit c369c9a4a7c82d33329d869cbaf93304cc7a0c40 upstream. Fixes a regression in cifs_parse_mount_options where a password which begins with a delimitor is parsed incorrectly as being a blank password. Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-12reiserfs: Fix warning and inode leak when deleting inode with xattrsJan Kara
commit 35e5cbc0af240778e61113286c019837e06aeec6 upstream. After commit 21d8a15a (lookup_one_len: don't accept . and ..) reiserfs started failing to delete xattrs from inode. This was due to a buggy test for '.' and '..' in fill_with_dentries() which resulted in passing '.' and '..' entries to lookup_one_len() in some cases. That returned error and so we failed to iterate over all xattrs of and inode. Fix the test in fill_with_dentries() along the lines of the one in lookup_one_len(). Reported-by: Pawel Zawora <pzawora@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-12UBIFS: make space fixup work in the remount caseArtem Bityutskiy
commit 67e753ca41782913d805ff4a8a2b0f60b26b7915 upstream. The UBIFS space fixup is a useful feature which allows to fixup the "broken" flash space at the time of the first mount. The "broken" space is usually the result of using a "dumb" industrial flasher which is not able to skip empty NAND pages and just writes all 0xFFs to the empty space, which has grave side-effects for UBIFS when UBIFS trise to write useful data to those empty pages. The fix-up feature works roughly like this: 1. mkfs.ubifs sets the fixup flag in UBIFS superblock when creating the image (see -F option) 2. when the file-system is mounted for the first time, UBIFS notices the fixup flag and re-writes the entire media atomically, which may take really a lot of time. 3. UBIFS clears the fixup flag in the superblock. This works fine when the file system is mounted R/W for the very first time. But it did not really work in the case when we first mount the file-system R/O, and then re-mount R/W. The reason was that we started the fixup procedure too late, which we cannot really do because we have to fixup the space before it starts being used. Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Reported-by: Mark Jackson <mpfj-list@mimc.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05ext4: use atomic64_t for the per-flexbg free_clusters countTheodore Ts'o
commit 90ba983f6889e65a3b506b30dc606aa9d1d46cd2 upstream. A user who was using a 8TB+ file system and with a very large flexbg size (> 65536) could cause the atomic_t used in the struct flex_groups to overflow. This was detected by PaX security patchset: http://forums.grsecurity.net/viewtopic.php?f=3&t=3289&p=12551#p12551 This bug was introduced in commit 9f24e4208f7e, so it's been around since 2.6.30. :-( Fix this by using an atomic64_t for struct orlav_stats's free_clusters. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Lingzhu Xiang <lxiang@redhat.com> Reviewed-by: CAI Qian <caiqian@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05ext4: convert number of blocks to clusters properlyLukas Czerner
commit 810da240f221d64bf90020f25941b05b378186fe upstream. We're using macro EXT4_B2C() to convert number of blocks to number of clusters for bigalloc file systems. However, we should be using EXT4_NUM_B2C(). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: CAI Qian <caiqian@redhat.com> Signed-off-by: Lingzhu Xiang <lxiang@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05Btrfs: fix space leak when we fail to reserve metadata spaceJosef Bacik
commit f4881bc7a83eff263789dd524b7c269d138d4af5 upstream. Dave reported a warning when running xfstest 275. We have been leaking delalloc metadata space when our reservations fail. This is because we were improperly calculating how much space to free for our checksum reservations. The problem is we would sometimes free up space that had already been freed in another thread and we would end up with negative usage for the delalloc space. This patch fixes the problem by calculating how much space the other threads would have already freed, and then calculate how much space we need to free had we not done the reservation at all, and then freeing any excess space. This makes xfstests 275 no longer have leaked space. Thanks Reported-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Lingzhu Xiang <lxiang@redhat.com> Reviewed-by: CAI Qian <caiqian@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05nfsd4: reject "negative" acl lengthsJ. Bruce Fields
commit 64a817cfbded8674f345d1117b117f942a351a69 upstream. Since we only enforce an upper bound, not a lower bound, a "negative" length can get through here. The symptom seen was a warning when we attempt to a kmalloc with an excessive size. Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05loop: prevent bdev freeing while device in useAnatol Pomozov
commit c1681bf8a7b1b98edee8b862a42c19c4e53205fd upstream. struct block_device lifecycle is defined by its inode (see fs/block_dev.c) - block_device allocated first time we access /dev/loopXX and deallocated on bdev_destroy_inode. When we create the device "losetup /dev/loopXX afile" we want that block_device stay alive until we destroy the loop device with "losetup -d". But because we do not hold /dev/loopXX inode its counter goes 0, and inode/bdev can be destroyed at any moment. Usually it happens at memory pressure or when user drops inode cache (like in the test below). When later in loop_clr_fd() we want to use bdev we have use-after-free error with following stack: BUG: unable to handle kernel NULL pointer dereference at 0000000000000280 bd_set_size+0x10/0xa0 loop_clr_fd+0x1f8/0x420 [loop] lo_ioctl+0x200/0x7e0 [loop] lo_compat_ioctl+0x47/0xe0 [loop] compat_blkdev_ioctl+0x341/0x1290 do_filp_open+0x42/0xa0 compat_sys_ioctl+0xc1/0xf20 do_sys_open+0x16e/0x1d0 sysenter_dispatch+0x7/0x1a To prevent use-after-free we need to grab the device in loop_set_fd() and put it later in loop_clr_fd(). The issue is reprodusible on current Linus head and v3.3. Here is the test: dd if=/dev/zero of=loop.file bs=1M count=1 while [ true ]; do losetup /dev/loop0 loop.file echo 2 > /proc/sys/vm/drop_caches losetup -d /dev/loop0 done [ Doing bdgrab/bput in loop_set_fd/loop_clr_fd is safe, because every time we call loop_set_fd() we check that loop_device->lo_state is Lo_unbound and set it to Lo_bound If somebody will try to set_fd again it will get EBUSY. And if we try to loop_clr_fd() on unbound loop device we'll get ENXIO. loop_set_fd/loop_clr_fd (and any other loop ioctl) is called under loop_device->lo_ctl_mutex. ] Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05Btrfs: don't drop path when printing out tree errors in scrubJosef Bacik
commit d8fe29e9dea8d7d61fd140d8779326856478fc62 upstream. A user reported a panic where we were panicing somewhere in tree_backref_for_extent from scrub_print_warning. He only captured the trace but looking at scrub_print_warning we drop the path right before we mess with the extent buffer to print out a bunch of stuff, which isn't right. So fix this by dropping the path after we use the eb if we need to. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05Btrfs: limit the global reserve to 512mbJosef Bacik
commit fdf30d1c1b386e1b73116cc7e0fb14e962b763b0 upstream. A user reported a problem where he was getting early ENOSPC with hundreds of gigs of free data space and 6 gigs of free metadata space. This is because the global block reserve was taking up the entire free metadata space. This is ridiculous, we have infrastructure in place to throttle if we start using too much of the global reserve, so instead of letting it get this huge just limit it to 512mb so that users can still get work done. This allowed the user to complete his rsync without issues. Thanks Reported-and-tested-by: Stefan Priebe <s.priebe@profihost.ag> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05Btrfs: fix race between mmap writes and compressionChris Mason
commit 4adaa611020fa6ac65b0ac8db78276af4ec04e63 upstream. Btrfs uses page_mkwrite to ensure stable pages during crc calculations and mmap workloads. We call clear_page_dirty_for_io before we do any crcs, and this forces any application with the file mapped to wait for the crc to finish before it is allowed to change the file. With compression on, the clear_page_dirty_for_io step is happening after we've compressed the pages. This means the applications might be changing the pages while we are compressing them, and some of those modifications might not hit the disk. This commit adds the clear_page_dirty_for_io before compression starts and makes sure to redirty the page if we have to fallback to uncompressed IO as well. Signed-off-by: Chris Mason <chris.mason@fusionio.com> Reported-by: Alexandre Oliva <oliva@gnu.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05pnfs-block: removing DM device maybe cause oops when call dev_removefanchaoting
commit 4376c94618c26225e69e17b7c91169c45a90b292 upstream. when pnfs block using device mapper,if umounting later,it maybe cause oops. we apply "1 + sizeof(bl_umount_request)" memory for msg->data, the memory maybe overflow when we do "memcpy(&dataptr [sizeof(bl_msg)], &bl_umount_request, sizeof(bl_umount_request))", because the size of bl_msg is more than 1 byte. Signed-off-by: fanchaoting<fanchaoting@cn.fujitsu.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05sysfs: handle failure path correctly for readdir()Ming Lei
commit e5110f411d2ee35bf8d202ccca2e89c633060dca upstream. In case of 'if (filp->f_pos == 0 or 1)' of sysfs_readdir(), the failure from filldir() isn't handled, and the reference counter of the sysfs_dirent object pointed by filp->private_data will be released without clearing filp->private_data, so use after free bug will be triggered later. This patch returns immeadiately under the situation for fixing the bug, and it is reasonable to return from readdir() when filldir() fails. Reported-by: Dave Jones <davej@redhat.com> Tested-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-05sysfs: fix race between readdir and lseekMing Lei
commit 991f76f837bf22c5bb07261cfd86525a0a96650c upstream. While readdir() is running, lseek() may set filp->f_pos as zero, then may leave filp->private_data pointing to one sysfs_dirent object without holding its reference counter, so the sysfs_dirent object may be used after free in next readdir(). This patch holds inode->i_mutex to avoid the problem since the lock is always held in readdir path. Reported-by: Dave Jones <davej@redhat.com> Tested-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-28exec: use -ELOOP for max recursion depthKees Cook
commit d740269867021faf4ce38a449353d2b986c34a67 upstream. To avoid an explosion of request_module calls on a chain of abusive scripts, fail maximum recursion with -ELOOP instead of -ENOEXEC. As soon as maximum recursion depth is hit, the error will fail all the way back up the chain, aborting immediately. This also has the side-effect of stopping the user's shell from attempting to reexecute the top-level file as a shell script. As seen in the dash source: if (cmd != path_bshell && errno == ENOEXEC) { *argv-- = cmd; *argv = cmd = path_bshell; goto repeat; } The above logic was designed for running scripts automatically that lacked the "#!" header, not to re-try failed recursion. On a legitimate -ENOEXEC, things continue to behave as the shell expects. Additionally, when tracking recursion, the binfmt handlers should not be involved. The recursion being tracked is the depth of calls through search_binary_handler(), so that function should be exclusively responsible for tracking the depth. Signed-off-by: Kees Cook <keescook@chromium.org> Cc: halfdog <me@halfdog.net> Cc: P J P <ppandit@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-28udf: avoid info leak on exportMathias Krause
commit 0143fc5e9f6f5aad4764801015bc8d4b4a278200 upstream. For type 0x51 the udf.parent_partref member in struct fid gets copied uninitialized to userland. Fix this by initializing it to 0. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Jan Kara <jack@suse.cz> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-28isofs: avoid info leak on exportMathias Krause
commit fe685aabf7c8c9f138e5ea900954d295bf229175 upstream. For type 1 the parent_offset member in struct isofs_fid gets copied uninitialized to userland. Fix this by initializing it to 0. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Jan Kara <jack@suse.cz> Cc: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-28udf: Fix bitmap overflow on large filesystems with small block sizeJan Kara
commit 89b1f39eb4189de745fae554b0d614d87c8d5c63 upstream. For large UDF filesystems with 512-byte blocks the number of necessary bitmap blocks is larger than 2^16 so s_nr_groups in udf_bitmap overflows (the number will overflow for filesystems larger than 128 GB with 512-byte blocks). That results in ENOSPC errors despite the filesystem has plenty of free space. Fix the problem by changing s_nr_groups' type to 'int'. That is enough even for filesystems 2^32 blocks (UDF maximum) and 512-byte blocksize. Reported-and-tested-by: v10lator@myway.de Signed-off-by: Jan Kara <jack@suse.cz> Cc: Jim Trigg <jtrigg@spamcop.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-28ext4: fix data=journal fast mount/umount hangTheodore Ts'o
commit 2b405bfa84063bfa35621d2d6879f52693c614b0 upstream. In data=journal mode, if we unmount the file system before a transaction has a chance to complete, when the journal inode is being evicted, we can end up calling into jbd2_log_wait_commit() for the last transaction, after the journalling machinery has been shut down. Arguably we should adjust ext4_should_journal_data() to return FALSE for the journal inode, but the only place it matters is ext4_evict_inode(), and so to save a bit of CPU time, and to make the patch much more obviously correct by inspection(tm), we'll fix it by explicitly not trying to waiting for a journal commit when we are evicting the journal inode, since it's guaranteed to never succeed in this case. This can be easily replicated via: mount -t ext4 -o data=journal /dev/vdb /vdb ; umount /vdb ------------[ cut here ]------------ WARNING: at /usr/projects/linux/ext4/fs/jbd2/journal.c:542 __jbd2_log_start_commit+0xba/0xcd() Hardware name: Bochs JBD2: bad log_start_commit: 3005630206 3005630206 0 0 Modules linked in: Pid: 2909, comm: umount Not tainted 3.8.0-rc3 #1020 Call Trace: [<c015c0ef>] warn_slowpath_common+0x68/0x7d [<c02b7e7d>] ? __jbd2_log_start_commit+0xba/0xcd [<c015c177>] warn_slowpath_fmt+0x2b/0x2f [<c02b7e7d>] __jbd2_log_start_commit+0xba/0xcd [<c02b8075>] jbd2_log_start_commit+0x24/0x34 [<c0279ed5>] ext4_evict_inode+0x71/0x2e3 [<c021f0ec>] evict+0x94/0x135 [<c021f9aa>] iput+0x10a/0x110 [<c02b7836>] jbd2_journal_destroy+0x190/0x1ce [<c0175284>] ? bit_waitqueue+0x50/0x50 [<c028d23f>] ext4_put_super+0x52/0x294 [<c020efe3>] generic_shutdown_super+0x48/0xb4 [<c020f071>] kill_block_super+0x22/0x60 [<c020f3e0>] deactivate_locked_super+0x22/0x49 [<c020f5d6>] deactivate_super+0x30/0x33 [<c0222795>] mntput_no_expire+0x107/0x10c [<c02233a7>] sys_umount+0x2cf/0x2e0 [<c02233ca>] sys_oldumount+0x12/0x14 [<c08096b8>] syscall_call+0x7/0xb ---[ end trace 6a954cc790501c1f ]--- jbd2_log_wait_commit: error: j_commit_request=-1289337090, tid=0 Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-28ext4: fix the wrong number of the allocated blocks in ext4_split_extent()Zheng Liu
commit 3a2256702e47f68f921dfad41b1764d05c572329 upstream. This commit fixes a wrong return value of the number of the allocated blocks in ext4_split_extent. When the length of blocks we want to allocate is greater than the length of the current extent, we return a wrong number. Let's see what happens in the following case when we call ext4_split_extent(). map: [48, 72] ex: [32, 64, u] 'ex' will be split into two parts: ex1: [32, 47, u] ex2: [48, 64, w] 'map->m_len' is returned from this function, and the value is 24. But the real length is 16. So it should be fixed. Meanwhile in this commit we use right length of the allocated blocks when get_reserved_cluster_alloc in ext4_ext_handle_uninitialized_extents is called. Signed-off-by: Zheng Liu <wenqing.lz@taobao.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-28jbd2: fix use after free in jbd2_journal_dirty_metadata()Jan Kara
commit ad56edad089b56300fd13bb9eeb7d0424d978239 upstream. jbd2_journal_dirty_metadata() didn't get a reference to journal_head it was working with. This is OK in most of the cases since the journal head should be attached to a transaction but in rare occasions when we are journalling data, __ext4_journalled_writepage() can race with jbd2_journal_invalidatepage() stripping buffers from a page and thus journal head can be freed under hands of jbd2_journal_dirty_metadata(). Fix the problem by getting own journal head reference in jbd2_journal_dirty_metadata() (and also in jbd2_journal_set_triggers() which can possibly have the same issue). Reported-by: Zheng Liu <gnehzuil.liu@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-28cifs: ignore everything in SPNEGO blob after mechTypesJeff Layton
commit f853c616883a8de966873a1dab283f1369e275a1 upstream. We've had several reports of people attempting to mount Windows 8 shares and getting failures with a return code of -EINVAL. The default sec= mode changed recently to sec=ntlmssp. With that, we expect and parse a SPNEGO blob from the server in the NEGOTIATE reply. The current decode_negTokenInit function first parses all of the mechTypes and then tries to parse the rest of the negTokenInit reply. The parser however currently expects a mechListMIC or nothing to follow the mechTypes, but Windows 8 puts a mechToken field there instead to carry some info for the new NegoEx stuff. In practice, we don't do anything with the fields after the mechTypes anyway so I don't see any real benefit in continuing to parse them. This patch just has the kernel ignore the fields after the mechTypes. We'll probably need to reinstate some of this if we ever want to support NegoEx. Reported-by: Jason Burgess <jason@jacknife2.dns2go.com> Reported-by: Yan Li <elliot.li.tech@gmail.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-20block: use i_size_write() in bd_set_size()Guo Chao
commit d646a02a9d44d1421f273ae3923d97b47b918176 upstream. blkdev_ioctl(GETBLKSIZE) uses i_size_read() to read size of block device. If we update block size directly, reader may see intermediate result in some machines and configurations. Use i_size_write() instead. Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: M. Hindess <hindessm@uk.ibm.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> Acked-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-20btrfs: use rcu_barrier() to wait for bdev puts at unmountEric Sandeen
commit bc178622d40d87e75abc131007342429c9b03351 upstream. Doing this would reliably fail with -EBUSY for me: # mount /dev/sdb2 /mnt/scratch; umount /mnt/scratch; mkfs.btrfs -f /dev/sdb2 ... unable to open /dev/sdb2: Device or resource busy because mkfs.btrfs tries to open the device O_EXCL, and somebody still has it. Using systemtap to track bdev gets & puts shows a kworker thread doing a blkdev put after mkfs attempts a get; this is left over from the unmount path: btrfs_close_devices __btrfs_close_devices call_rcu(&device->rcu, free_device); free_device INIT_WORK(&device->rcu_work, __free_device); schedule_work(&device->rcu_work); so unmount might complete before __free_device fires & does its blkdev_put. Adding an rcu_barrier() to btrfs_close_devices() causes unmount to wait until all blkdev_put()s are done, and the device is truly free once unmount completes. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-20ext3: Fix format string issuesLars-Peter Clausen
commit 8d0c2d10dd72c5292eda7a06231056a4c972e4cc upstream. ext3_msg() takes the printk prefix as the second parameter and the format string as the third parameter. Two callers of ext3_msg omit the prefix and pass the format string as the second parameter and the first parameter to the format string as the third parameter. In both cases this string comes from an arbitrary source. Which means the string may contain format string characters, which will lead to undefined and potentially harmful behavior. The issue was introduced in commit 4cf46b67eb("ext3: Unify log messages in ext3") and is fixed by this patch. Signed-off-by: Lars-Peter Clausen <lars@metafoo.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-14vfs: fix pipe counter breakageAl Viro
commit a930d8790552658140d7d0d2e316af4f0d76a512 upstream. If you open a pipe for neither read nor write, the pipe code will not add any usage counters to the pipe, causing the 'struct pipe_inode_info" to be potentially released early. That doesn't normally matter, since you cannot actually use the pipe, but the pipe release code - particularly fasync handling - still expects the actual pipe infrastructure to all be there. And rather than adding NULL pointer checks, let's just disallow this case, the same way we already do for the named pipe ("fifo") case. This is ancient going back to pre-2.4 days, and until trinity, nobody naver noticed. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-14Fix: compat_rw_copy_check_uvector() misuse in aio, readv, writev, and ↵Mathieu Desnoyers
security keys commit 8aec0f5d4137532de14e6554fd5dd201ff3a3c49 upstream. Looking at mm/process_vm_access.c:process_vm_rw() and comparing it to compat_process_vm_rw() shows that the compatibility code requires an explicit "access_ok()" check before calling compat_rw_copy_check_uvector(). The same difference seems to appear when we compare fs/read_write.c:do_readv_writev() to fs/compat.c:compat_do_readv_writev(). This subtle difference between the compat and non-compat requirements should probably be debated, as it seems to be error-prone. In fact, there are two others sites that use this function in the Linux kernel, and they both seem to get it wrong: Now shifting our attention to fs/aio.c, we see that aio_setup_iocb() also ends up calling compat_rw_copy_check_uvector() through aio_setup_vectored_rw(). Unfortunately, the access_ok() check appears to be missing. Same situation for security/keys/compat.c:compat_keyctl_instantiate_key_iov(). I propose that we add the access_ok() check directly into compat_rw_copy_check_uvector(), so callers don't have to worry about it, and it therefore makes the compat call code similar to its non-compat counterpart. Place the access_ok() check in the same location where copy_from_user() can trigger a -EFAULT error in the non-compat code, so the ABI behaviors are alike on both compat and non-compat. While we are here, fix compat_do_readv_writev() so it checks for compat_rw_copy_check_uvector() negative return values. And also, fix a memory leak in compat_keyctl_instantiate_key_iov() error handling. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-14NFS: Don't allow NFS silly-renamed files to be deleted, no signalTrond Myklebust
commit 5a7a613a47a715711b3f2d3322a0eac21d459166 upstream. Commit 73ca100 broke the code that prevents the client from deleting a silly renamed dentry. This affected "delete on last close" semantics as after that commit, nothing prevented removal of silly-renamed files. As a result, a process holding a file open could easily get an ESTALE on the file in a directory where some other process issued 'rm -rf some_dir_containing_the_file' twice. Before the commit, any attempt at unlinking silly renamed files would fail inside may_delete() with -EBUSY because of the DCACHE_NFSFS_RENAMED flag. The following testcase demonstrates the problem: tail -f /nfsmnt/dir/file & rm -rf /nfsmnt/dir rm -rf /nfsmnt/dir # second removal does not fail, 'tail' process receives ESTALE The problem with the above commit is that it unhashes the old and new dentries from the lookup path, even in the normal case when a signal is not encountered and it would have been safe to call d_move. Unfortunately the old dentry has the special DCACHE_NFSFS_RENAMED flag set on it. Unhashing has the side-effect that future lookups call d_alloc(), allocating a new dentry without the special flag for any silly-renamed files. As a result, subsequent calls to unlink silly renamed files do not fail but allow the removal to go through. This will result in ESTALE errors for any other process doing operations on the file. To fix this, go back to using d_move on success. For the signal case, it's unclear what we may safely do beyond d_drop. Reported-by: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-14cifs: ensure that cifs_get_root() only traverses directoriesJeff Layton
commit ce2ac52105aa663056dfc17966ebed1bf93e6e64 upstream. Kjell Braden reported this oops: [ 833.211970] BUG: unable to handle kernel NULL pointer dereference at (null) [ 833.212816] IP: [< (null)>] (null) [ 833.213280] PGD 1b9b2067 PUD e9f7067 PMD 0 [ 833.213874] Oops: 0010 [#1] SMP [ 833.214344] CPU 0 [ 833.214458] Modules linked in: des_generic md4 nls_utf8 cifs vboxvideo drm snd_intel8x0 snd_ac97_codec ac97_bus snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq bnep rfcomm snd_timer bluetooth snd_seq_device ppdev snd vboxguest parport_pc joydev mac_hid soundcore snd_page_alloc psmouse i2c_piix4 serio_raw lp parport usbhid hid e1000 [ 833.215629] [ 833.215629] Pid: 1752, comm: mount.cifs Not tainted 3.0.0-rc7-bisectcifs-fec11dd9a0+ #18 innotek GmbH VirtualBox/VirtualBox [ 833.215629] RIP: 0010:[<0000000000000000>] [< (null)>] (null) [ 833.215629] RSP: 0018:ffff8800119c9c50 EFLAGS: 00010282 [ 833.215629] RAX: ffffffffa02186c0 RBX: ffff88000c427780 RCX: 0000000000000000 [ 833.215629] RDX: 0000000000000000 RSI: ffff88000c427780 RDI: ffff88000c4362e8 [ 833.215629] RBP: ffff8800119c9c88 R08: ffff88001fc15e30 R09: 00000000d69515c7 [ 833.215629] R10: ffffffffa0201972 R11: ffff88000e8f6a28 R12: ffff88000c4362e8 [ 833.215629] R13: 0000000000000000 R14: 0000000000000000 R15: ffff88001181aaa6 [ 833.215629] FS: 00007f2986171700(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000 [ 833.215629] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 833.215629] CR2: 0000000000000000 CR3: 000000001b982000 CR4: 00000000000006f0 [ 833.215629] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 833.215629] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 833.215629] Process mount.cifs (pid: 1752, threadinfo ffff8800119c8000, task ffff88001c1c16f0) [ 833.215629] Stack: [ 833.215629] ffffffff8116a9b5 ffff8800119c9c88 ffffffff81178075 0000000000000286 [ 833.215629] 0000000000000000 ffff88000c4276c0 ffff8800119c9ce8 ffff8800119c9cc8 [ 833.215629] ffffffff8116b06e ffff88001bc6fc00 ffff88000c4276c0 ffff88000c4276c0 [ 833.215629] Call Trace: [ 833.215629] [<ffffffff8116a9b5>] ? d_alloc_and_lookup+0x45/0x90 [ 833.215629] [<ffffffff81178075>] ? d_lookup+0x35/0x60 [ 833.215629] [<ffffffff8116b06e>] __lookup_hash.part.14+0x9e/0xc0 [ 833.215629] [<ffffffff8116b1d6>] lookup_one_len+0x146/0x1e0 [ 833.215629] [<ffffffff815e4f7e>] ? _raw_spin_lock+0xe/0x20 [ 833.215629] [<ffffffffa01eef0d>] cifs_do_mount+0x26d/0x500 [cifs] [ 833.215629] [<ffffffff81163bd3>] mount_fs+0x43/0x1b0 [ 833.215629] [<ffffffff8117d41a>] vfs_kern_mount+0x6a/0xd0 [ 833.215629] [<ffffffff8117e584>] do_kern_mount+0x54/0x110 [ 833.215629] [<ffffffff8117fdc2>] do_mount+0x262/0x840 [ 833.215629] [<ffffffff81108a0e>] ? __get_free_pages+0xe/0x50 [ 833.215629] [<ffffffff8117f9ca>] ? copy_mount_options+0x3a/0x180 [ 833.215629] [<ffffffff8118075d>] sys_mount+0x8d/0xe0 [ 833.215629] [<ffffffff815ece82>] system_call_fastpath+0x16/0x1b [ 833.215629] Code: Bad RIP value. [ 833.215629] RIP [< (null)>] (null) [ 833.215629] RSP <ffff8800119c9c50> [ 833.215629] CR2: 0000000000000000 [ 833.238525] ---[ end trace ec00758b8d44f529 ]--- When walking down the path on the server, it's possible to hit a symlink. The path walking code assumes that the caller will handle that situation properly, but cifs_get_root() isn't set up for it. This patch prevents the oops by simply returning an error. A better solution would be to try and chase the symlinks here, but that's fairly complicated to handle. Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=53221 Reported-and-tested-by: Kjell Braden <afflux@pentabarf.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-14btrfs: Init io_lock after cloning btrfs device structThomas Gleixner
commit 1cba0cdf5e4dbcd9e5fa5b54d7a028e55e2ca057 upstream. __btrfs_close_devices() clones btrfs device structs with memcpy(). Some of the fields in the clone are reinitialized, but it's missing to init io_lock. In mainline this goes unnoticed, but on RT it leaves the plist pointing to the original about to be freed lock struct. Initialize io_lock after cloning, so no references to the original struct are left. Reported-and-tested-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04pstore: Avoid deadlock in panic and emergency-restart pathSeiji Aguchi
commit 9f244e9cfd70c7c0f82d3c92ce772ab2a92d9f64 upstream. [Issue] When pstore is in panic and emergency-restart paths, it may be blocked in those paths because it simply takes spin_lock. This is an example scenario which pstore may hang up in a panic path: - cpuA grabs psinfo->buf_lock - cpuB panics and calls smp_send_stop - smp_send_stop sends IRQ to cpuA - after 1 second, cpuB gives up on cpuA and sends an NMI instead - cpuA is now in an NMI handler while still holding buf_lock - cpuB is deadlocked This case may happen if a firmware has a bug and cpuA is stuck talking with it more than one second. Also, this is a similar scenario in an emergency-restart path: - cpuA grabs psinfo->buf_lock and stucks in a firmware - cpuB kicks emergency-restart via either sysrq-b or hangcheck timer. And then, cpuB is deadlocked by taking psinfo->buf_lock again. [Solution] This patch avoids the deadlocking issues in both panic and emergency_restart paths by introducing a function, is_non_blocking_path(), to check if a cpu can be blocked in current path. With this patch, pstore is not blocked even if another cpu has taken a spin_lock, in those paths by changing from spin_lock_irqsave to spin_trylock_irqsave. In addition, according to a comment of emergency_restart() in kernel/sys.c, spin_lock shouldn't be taken in an emergency_restart path to avoid deadlock. This patch fits the comment below. <snip> /** * emergency_restart - reboot the system * * Without shutting down any hardware or taking any locks * reboot the system. This is called when we know we are in * trouble so this is our best effort to reboot. This is * safe to call in interrupt context. */ void emergency_restart(void) <snip> Signed-off-by: Seiji Aguchi <seiji.aguchi@hds.com> Acked-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: CAI Qian <caiqian@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04fuse: don't WARN when nlink is zeroMiklos Szeredi
commit dfca7cebc2679f3d129f8e680a8f199a7ad16e38 upstream. drop_nlink() warns if nlink is already zero. This is triggerable by a buggy userspace filesystem. The cure, I think, is worse than the disease so disable the warning. Reported-by: Tero Roponen <tero.roponen@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04nfsd: Fix memleakmajianpeng
commit 2d32b29a1c2830f7c42caa8258c714acd983961f upstream. When free nfs-client, it must free the ->cl_stateids. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04ext4: fix free clusters calculation in bigalloc filesystemLukas Czerner
commit 304e220f0879198b1f5309ad6f0be862b4009491 upstream. ext4_has_free_clusters() should tell us whether there is enough free clusters to allocate, however number of free clusters in the file system is converted to blocks using EXT4_C2B() which is not only wrong use of the macro (we should have used EXT4_NUM_B2C) but it's also completely wrong concept since everything else is in cluster units. Moreover when calculating number of root clusters we should be using macro EXT4_NUM_B2C() instead of EXT4_B2C() otherwise the result might be off by one. However r_blocks_count should always be a multiple of the cluster ratio so doing a plain bit shift should be enough here. We avoid using EXT4_B2C() because it's confusing. As a result of the first problem number of free clusters is much bigger than it should have been and ext4_has_free_clusters() would return 1 even if there is really not enough free clusters available. Fix this by removing the EXT4_C2B() conversion of free clusters and using bit shift when calculating number of root clusters. This bug affects number of xfstests tests covering file system ENOSPC situation handling. With this patch most of the ENOSPC problems with bigalloc file system disappear, especially the errors caused by delayed allocation not having enough space when the actual allocation is finally requested. Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04ext4: fix xattr block allocation/release with bigallocLukas Czerner
commit 1231b3a1eb5740192aeebf5344dd6d6da000febf upstream. Currently when new xattr block is created or released we we would call dquot_free_block() or dquot_alloc_block() respectively, among the else decrementing or incrementing the number of blocks assigned to the inode by one block. This however does not work for bigalloc file system because we always allocate/free the whole cluster so we have to count with that in dquot_free_block() and dquot_alloc_block() as well. Use the clusters-to-blocks conversion EXT4_C2B() when passing number of blocks to the dquot_alloc/free functions to fix the problem. The problem has been revealed by xfstests #117 (and possibly others). Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04ext4: fix race in ext4_mb_add_n_trim()Niu Yawei
commit f1167009711032b0d747ec89a632a626c901a1ad upstream. In ext4_mb_add_n_trim(), lg_prealloc_lock should be taken when changing the lg_prealloc_list. Signed-off-by: Niu Yawei <yawei.niu@intel.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04ext4: check bh in ext4_read_block_bitmap()Eryu Guan
commit 15b49132fc972c63894592f218ea5a9a61b1a18f upstream. Validate the bh pointer before using it, since ext4_read_block_bitmap_nowait() might return NULL. I've seen this in fsfuzz testing. EXT4-fs error (device loop0): ext4_read_block_bitmap_nowait:385: comm touch: Cannot get buffer for block bitmap - block_group = 0, block_bitmap = 3925999616 BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff8121de25>] ext4_wait_block_bitmap+0x25/0xe0 ... Call Trace: [<ffffffff8121e1e5>] ext4_read_block_bitmap+0x35/0x60 [<ffffffff8125e9c6>] ext4_free_blocks+0x236/0xb80 [<ffffffff811d0d36>] ? __getblk+0x36/0x70 [<ffffffff811d0a5f>] ? __find_get_block+0x8f/0x210 [<ffffffff81191ef3>] ? kmem_cache_free+0x33/0x140 [<ffffffff812678e5>] ext4_xattr_release_block+0x1b5/0x1d0 [<ffffffff812679be>] ext4_xattr_delete_inode+0xbe/0x100 [<ffffffff81222a7c>] ext4_free_inode+0x7c/0x4d0 [<ffffffff812277b8>] ? ext4_mark_inode_dirty+0x88/0x230 [<ffffffff8122993c>] ext4_evict_inode+0x32c/0x490 [<ffffffff811b8cd7>] evict+0xa7/0x1c0 [<ffffffff811b8ed3>] iput_final+0xe3/0x170 [<ffffffff811b8f9e>] iput+0x3e/0x50 [<ffffffff812316fd>] ext4_add_nondir+0x4d/0x90 [<ffffffff81231d0b>] ext4_create+0xeb/0x170 [<ffffffff811aae9c>] vfs_create+0xac/0xd0 [<ffffffff811ac845>] lookup_open+0x185/0x1c0 [<ffffffff8129e3b9>] ? selinux_inode_permission+0xa9/0x170 [<ffffffff811acb54>] do_last+0x2d4/0x7a0 [<ffffffff811af743>] path_openat+0xb3/0x480 [<ffffffff8116a8a1>] ? handle_mm_fault+0x251/0x3b0 [<ffffffff811afc49>] do_filp_open+0x49/0xa0 [<ffffffff811bbaad>] ? __alloc_fd+0xdd/0x150 [<ffffffff8119da28>] do_sys_open+0x108/0x1f0 [<ffffffff8119db51>] sys_open+0x21/0x30 [<ffffffff81618959>] system_call_fastpath+0x16/0x1b Also fix comment for ext4_read_block_bitmap_nowait() Signed-off-by: Eryu Guan <guaneryu@gmail.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04fs: Fix possible use-after-free with AIOJan Kara
commit 54c807e71d5ac59dee56c685f2b66e27cd54c475 upstream. Running AIO is pinning inode in memory using file reference. Once AIO is completed using aio_complete(), file reference is put and inode can be freed from memory. So we have to be sure that calling aio_complete() is the last thing we do with the inode. Acked-by: Jeff Moyer <jmoyer@redhat.com> CC: Christoph Hellwig <hch@infradead.org> CC: Jens Axboe <axboe@kernel.dk> CC: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04ocfs2: ac->ac_allow_chain_relink=0 won't disable group relinkXiaowei.Hu
commit 309a85b6861fedbb48a22d45e0e079d1be993b3a upstream. ocfs2_block_group_alloc_discontig() disables chain relink by setting ac->ac_allow_chain_relink = 0 because it grabs clusters from multiple cluster groups. It doesn't keep the credits for all chain relink,but ocfs2_claim_suballoc_bits overrides this in this call trace: ocfs2_block_group_claim_bits()->ocfs2_claim_clusters()-> __ocfs2_claim_clusters()->ocfs2_claim_suballoc_bits() ocfs2_claim_suballoc_bits set ac->ac_allow_chain_relink = 1; then call ocfs2_search_chain() one time and disable it again, and then we run out of credits. Fix is to allow relink by default and disable it in ocfs2_block_group_alloc_discontig. Without this patch, End-users will run into a crash due to run out of credits, backtrace like this: RIP: 0010:[<ffffffffa0808b14>] [<ffffffffa0808b14>] jbd2_journal_dirty_metadata+0x164/0x170 [jbd2] RSP: 0018:ffff8801b919b5b8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88022139ddc0 RCX: ffff880159f652d0 RDX: ffff880178aa3000 RSI: ffff880159f652d0 RDI: ffff880087f09bf8 RBP: ffff8801b919b5e8 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000001e00 R11: 00000000000150b0 R12: ffff880159f652d0 R13: ffff8801a0cae908 R14: ffff880087f09bf8 R15: ffff88018d177800 FS: 00007fc9b0b6b6e0(0000) GS:ffff88022fd40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 000000000040819c CR3: 0000000184017000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process dd (pid: 9945, threadinfo ffff8801b919a000, task ffff880149a264c0) Call Trace: ocfs2_journal_dirty+0x2f/0x70 [ocfs2] ocfs2_relink_block_group+0x111/0x480 [ocfs2] ocfs2_search_chain+0x455/0x9a0 [ocfs2] ... Signed-off-by: Xiaowei.Hu <xiaowei.hu@oracle.com> Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04ocfs2: fix ocfs2_init_security_and_acl() to initialize acl correctlyJeff Liu
commit 32918dd9f19e5960af4cdfa41190bb843fb2247b upstream. We need to re-initialize the security for a new reflinked inode with its parent dirs if it isn't specified to be preserved for ocfs2_reflink(). However, the code logic is broken at ocfs2_init_security_and_acl() although ocfs2_init_security_get() succeed. As a result, ocfs2_acl_init() does not involked and therefore the default ACL of parent dir was missing on the new inode. Note this was introduced by 9d8f13ba3 ("security: new security_inode_init_security API adds function callback") To reproduce: set default ACL for the parent dir(ocfs2 in this case): $ setfacl -m default:user:jeff:rwx ../ocfs2/ $ getfacl ../ocfs2/ # file: ../ocfs2/ # owner: jeff # group: jeff user::rwx group::r-x other::r-x default:user::rwx default:user:jeff:rwx default:group::r-x default:mask::rwx default:other::r-x $ touch a $ getfacl a # file: a # owner: jeff # group: jeff user::rw- group::rw- other::r-- Before patching, create reflink file b from a, the user default ACL entry(user:jeff:rwx)was missing: $ ./ocfs2_reflink a b $ getfacl b # file: b # owner: jeff # group: jeff user::rw- group::rw- other::r-- In this case, the end user can also observed an error message at syslog: (ocfs2_reflink,3229,2):ocfs2_init_security_and_acl:7193 ERROR: status = 0 After applying this patch, create reflink file c from a: $ ./ocfs2_reflink a c $ getfacl c # file: c # owner: jeff # group: jeff user::rw- user:jeff:rwx #effective:rw- group::r-x #effective:r-- mask::rw- other::r-- Test program: /* Usage: reflink <source> <dest> */ #include <stdio.h> #include <stdint.h> #include <stdbool.h> #include <string.h> #include <errno.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <sys/ioctl.h> static int reflink_file(char const *src_name, char const *dst_name, bool preserve_attrs) { int fd; #ifndef REFLINK_ATTR_NONE # define REFLINK_ATTR_NONE 0 #endif #ifndef REFLINK_ATTR_PRESERVE # define REFLINK_ATTR_PRESERVE 1 #endif #ifndef OCFS2_IOC_REFLINK struct reflink_arguments { uint64_t old_path; uint64_t new_path; uint64_t preserve; }; # define OCFS2_IOC_REFLINK _IOW ('o', 4, struct reflink_arguments) #endif struct reflink_arguments args = { .old_path = (unsigned long) src_name, .new_path = (unsigned long) dst_name, .preserve = preserve_attrs ? REFLINK_ATTR_PRESERVE : REFLINK_ATTR_NONE, }; fd = open(src_name, O_RDONLY); if (fd < 0) { fprintf(stderr, "Failed to open %s: %s\n", src_name, strerror(errno)); return -1; } if (ioctl(fd, OCFS2_IOC_REFLINK, &args) < 0) { fprintf(stderr, "Failed to reflink %s to %s: %s\n", src_name, dst_name, strerror(errno)); return -1; } } int main(int argc, char *argv[]) { if (argc != 3) { fprintf(stdout, "Usage: %s source dest\n", argv[0]); return 1; } return reflink_file(argv[1], argv[2], 0); } Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Tao Ma <boyu.mt@taobao.com> Cc: Mimi Zohar <zohar@linux.vnet.ibm.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-03-04ocfs2: fix possible use-after-free with AIOJan Kara
commit 9b171e0c74ca0549d0610990a862dd895870f04a upstream. Running AIO is pinning inode in memory using file reference. Once AIO is completed using aio_complete(), file reference is put and inode can be freed from memory. So we have to be sure that calling aio_complete() is the last thing we do with the inode. Signed-off-by: Jan Kara <jack@suse.cz> Acked-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28NLM: Ensure that we resend all pending blocking locks after a reclaimTrond Myklebust
commit 666b3d803a511fbc9bc5e5ea8ce66010cf03ea13 upstream. Currently, nlmclnt_lock will break out of the for(;;) loop when the reclaimer wakes up the blocking lock thread by setting nlm_lck_denied_grace_period. This causes the lock request to fail with an ENOLCK error. The intention was always to ensure that we resend the lock request after the grace period has expired. Reported-by: Wangyuan Zhang <Wangyuan.Zhang@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28umount oops when remove blocklayoutdriver firstfanchaoting
commit 5a12cca697aca5dfba42a7d4c3356acc0445a2b0 upstream. now pnfs client uses block layout, maybe we can remove blocklayoutdriver first. if we umount later, it can cause oops in unset_pnfs_layoutdriver. because nfss->pnfs_curr_ld->clear_layoutdriver is invalid. reproduce it: modprobe blocklayoutdriver mount -t nfs4 -o minorversion=1 pnfsip:/ /mnt/ rmmod blocklayoutdriver umount /mnt then you can see following CPU 0 Pid: 17023, comm: umount.nfs4 Tainted: GF O 3.7.0-rc6-pnfs #1 VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform RIP: 0010:[<ffffffffa04cfe6d>] [<ffffffffa04cfe6d>] unset_pnfs_layoutdriver+0x1d/0x70 [nfsv4] RSP: 0018:ffff8800022d9e48 EFLAGS: 00010286 RAX: ffffffffa04a1b00 RBX: ffff88000b013800 RCX: 0000000000000001 RDX: ffffffff81ae8ee0 RSI: ffff880001ee94b8 RDI: ffff88000b013800 RBP: ffff8800022d9e58 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff880001ee9400 R13: ffff8800105978c0 R14: 00007fff25846c08 R15: 0000000001bba550 FS: 00007f45ae7f0700(0000) GS:ffff880012c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffffffffa04a1b38 CR3: 0000000002c0c000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process umount.nfs4 (pid: 17023, threadinfo ffff8800022d8000, task ffff880006e48aa0) Stack: ffff8800105978c0 ffff88000b013800 ffff8800022d9e78 ffffffffa04cd0ce ffff8800022d9e78 ffff88000b013800 ffff8800022d9ea8 ffffffffa04755a7 ffff8800022d9ea8 ffff880002f96400 ffff88000b013800 ffff880002f96400 Call Trace: [<ffffffffa04cd0ce>] nfs4_destroy_server+0x1e/0x30 [nfsv4] [<ffffffffa04755a7>] nfs_free_server+0xb7/0x150 [nfs] [<ffffffffa047d4d5>] nfs_kill_super+0x35/0x40 [nfs] [<ffffffff81178d35>] deactivate_locked_super+0x45/0x70 [<ffffffff8117986a>] deactivate_super+0x4a/0x70 [<ffffffff81193ee2>] mntput_no_expire+0xd2/0x130 [<ffffffff81194d62>] sys_umount+0x72/0xe0 [<ffffffff8154af59>] system_call_fastpath+0x16/0x1b Code: 06 e1 b8 ea ff ff ff eb 9e 0f 1f 44 00 00 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 8b 87 80 03 00 00 48 89 fb 48 85 c0 74 29 <48> 8b 40 38 48 85 c0 74 02 ff d0 48 8b 03 3e ff 48 04 0f 94 c2 RIP [<ffffffffa04cfe6d>] unset_pnfs_layoutdriver+0x1d/0x70 [nfsv4] RSP <ffff8800022d9e48> CR2: ffffffffa04a1b38 ---[ end trace 29f75aaedda058bf ]--- Signed-off-by: fanchaoting<fanchaoting@cn.fujitsu.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28ocfs2: unlock super lock if lockres refresh failedJunxiao Bi
commit 3278bb748d2437eb1464765f36429e5d6aa91c38 upstream. If lockres refresh failed, the super lock will never be released which will cause some processes on other cluster nodes hung forever. Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28fs/block_dev.c: page cache wrongly left invalidated after revalidate_disk()MITSUNARI Shigeo
commit 7630b661da330b35dd57b6f5d6d62b386f2dd751 upstream. We found that bdev->bd_invalidated was left set once revalidate_disk() is called, which results in page cache flush every time that device is open. Specifically, we found this problem in MD block device. Once we resize a MD device, mdadm --monitor periodically flush all page cache for that device every 60 or 1000 seconds when it opens the device. This bug lies since at least 3.2.0 till the latest kernel(3.6.2). Patch is attached. The following steps will reproduce the problem. 1. prepair a block device (eg /dev/sdb). 2. create two partitions: sudo parted /dev/sdb mklabel gpt mkpart primary 0% 50% mkpart primary 50% 100% 3. create a md device. sudo mdadm -C /dev/md/hoge -l 1 -n 2 -e 1.2 --assume-clean --auto=md --symlink=no /dev/sdb1 /dev/sdb2 4. create file system and mount it sudo mkfs.ext3 /dev/md/hoge sudo mkdir /mnt/test sudo mount /dev/md/hoge /mnt/test 5. try to resize the device sudo mdadm -G /dev/md/hoge --size=max 6. create a file to fill file cache. sudo dd if=/dev/urandom of=/mnt/test/data bs=1M count=10 and verify the current status of file by free command. 7. mdadm monitor will open the md device every 1000 seconds and you will find all file cache on the device are cleared. The timing can be reduced by the following steps. a) kill mdadm and restart it with --delay option /sbin/mdadm --monitor --delay=30 --pid-file /var/run/mdadm/monitor.pid --daemonise --scan --syslog or open the md device directly. sudo dd if=/dev/md/hoge of=/dev/null bs=4096 count=1 Signed-off-by: MITSUNARI Shigeo <herumi@nifty.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-28inotify: remove broken mask checks causing unmount to be EINVALJim Somerville
commit 676a0675cf9200ac047fb50825f80867b3bb733b upstream. Running the command: inotifywait -e unmount /mnt/disk immediately aborts with a -EINVAL return code. This is however a valid parameter. This abort occurs only if unmount is the sole event parameter. If other event parameters are supplied, then the unmount event wait will work. The problem was introduced by commit 44b350fc23e ("inotify: Fix mask checks"). In that commit, it states: The mask checks in inotify_update_existing_watch() and inotify_new_watch() are useless because inotify_arg_to_mask() sets FS_IN_IGNORED and FS_EVENT_ON_CHILD bits anyway. But instead of removing the useless checks, it did this: mask = inotify_arg_to_mask(arg); - if (unlikely(!mask)) + if (unlikely(!(mask & IN_ALL_EVENTS))) return -EINVAL; The problem is that IN_ALL_EVENTS doesn't include IN_UNMOUNT, and other parts of the code keep IN_UNMOUNT separate from IN_ALL_EVENTS. So the check should be: if (unlikely(!(mask & (IN_ALL_EVENTS | IN_UNMOUNT)))) But inotify_arg_to_mask(arg) always sets the IN_UNMOUNT bit in the mask anyway, so the check is always going to pass and thus should simply be removed. Also note that inotify_arg_to_mask completely controls what mask bits get set from arg, there's no way for invalid bits to get enabled there. Lets fix it by simply removing the useless broken checks. Signed-off-by: Jim Somerville <Jim.Somerville@windriver.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: John McCutchan <john@johnmccutchan.com> Cc: Robert Love <rlove@rlove.org> Cc: Eric Paris <eparis@parisplace.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-11nilfs2: fix fix very long mount time issueVyacheslav Dubeyko
commit a9bae189542e71f91e61a4428adf6e5a7dfe8063 upstream. There exists a situation when GC can work in background alone without any other filesystem activity during significant time. The nilfs_clean_segments() method calls nilfs_segctor_construct() that updates superblocks in the case of NILFS_SC_SUPER_ROOT and THE_NILFS_DISCONTINUED flags are set. But when GC is working alone the nilfs_clean_segments() is called with unset THE_NILFS_DISCONTINUED flag. As a result, the update of superblocks doesn't occurred all this time and in the case of SPOR superblocks keep very old values of last super root placement. SYMPTOMS: Trying to mount a NILFS2 volume after SPOR in such environment ends with very long mounting time (it can achieve about several hours in some cases). REPRODUCING PATH: 1. It needs to use external USB HDD, disable automount and doesn't make any additional filesystem activity on the NILFS2 volume. 2. Generate temporary file with size about 100 - 500 GB (for example, dd if=/dev/zero of=<file_name> bs=1073741824 count=200). The size of file defines duration of GC working. 3. Then it needs to delete file. 4. Start GC manually by means of command "nilfs-clean -p 0". When you start GC by means of such way then, at the end, superblocks is updated by once. So, for simulation of SPOR, it needs to wait sometime (15 - 40 minutes) and simply switch off USB HDD manually. 5. Switch on USB HDD again and try to mount NILFS2 volume. As a result, NILFS2 volume will mount during very long time. REPRODUCIBILITY: 100% FIX: This patch adds checking that superblocks need to update and set THE_NILFS_DISCONTINUED flag before nilfs_clean_segments() call. Reported-by: Sergey Alexandrov <splavgm@gmail.com> Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Tested-by: Vyacheslav Dubeyko <slava@dubeyko.com> Acked-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>