summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2011-03-23ext3: Always set dx_node's fake_dirent explicitly.Eric Sandeen
commit d7433142b63d727b5a217c37b1a1468b116a9771 upstream. (crossport of 1f7bebb9e911d870fa8f997ddff838e82b5715ea by Andreas Schlick <schlick@lavabit.com>) When ext3_dx_add_entry() has to split an index node, it has to ensure that name_len of dx_node's fake_dirent is also zero, because otherwise e2fsck won't recognise it as an intermediate htree node and consider the htree to be corrupted. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-23NFS: Fix a decoding problem in nfs3_decode_direntTrond Myklebust
[This needs to be applied to 2.6.37 only. The bug in question was inadvertently fixed by a series of cleanups in 2.6.38, but the patches in question are too large to be backported. This patch is a minimal fix that serves the same purpose.] When we decode a filename followed by an 8-byte cookie, we need to consider the fact that the filename and cookie are 32-bit word aligned. Presently, we may end up copying insufficient amounts of data when xdr_inline_decode() needs to invoke xdr_copy_to_scratch to deal with a page boundary. The following patch fixes the issue by first decoding the filename, and then decoding the cookie. Reported-by: Neil Brown <neilb@suse.de> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Reviewed-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-23Increase OSF partition limit from 8 to 18Linus Torvalds
commit 34d211a2d5df4984a35b18d8ccacbe1d10abb067 upstream. It turns out that while a maximum of 8 partitions may be what people "should" have had, you can actually fit up to 18 entries(*) in a sector. And some people clearly were taking advantage of that, like Michael Cree, who had ten partitions on one of his OSF disks. (*) The OSF partition data starts at byte offset 64 in the first sector, and the array of 16-byte partition entries start at offset 148 in the on-disk partition structure. Reported-by: Michael Cree <mcree@orcon.net.nz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-23nfs: fix compilation warningJovi Zhang
commit 43b7c3f051dea504afccc39bcb56d8e26c2e0b77 upstream. this commit fix compilation warning as following: linux-2.6/fs/nfs/nfs4proc.c:3265: warning: comparison of distinct pointer types lacks a cast Signed-off-by: Jovi Zhang <bookjovi@gmail.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-23Btrfs: deal with short returns from copy_from_userChris Mason
commit 31339acd07b4ba687906702085127895a56eb920 upstream. When copy_from_user is only able to copy some of the bytes we requested, we may end up creating a partially up to date page. To avoid garbage in the page, we need to treat a partial copy as a zero length copy. This makes the rest of the file_write code drop the page and retry the whole copy instead of marking the partially up to date page as dirty. Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-23Fix corrupted OSF partition table parsingTimo Warns
commit 1eafbfeb7bdf59cfe173304c76188f3fd5f1fd05 upstream. The kernel automatically evaluates partition tables of storage devices. The code for evaluating OSF partitions contains a bug that leaks data from kernel heap memory to userspace for certain corrupted OSF partitions. In more detail: for (i = 0 ; i < le16_to_cpu(label->d_npartitions); i++, partition++) { iterates from 0 to d_npartitions - 1, where d_npartitions is read from the partition table without validation and partition is a pointer to an array of at most 8 d_partitions. Add the proper and obvious validation. Signed-off-by: Timo Warns <warns@pre-sense.de> [ Changed the patch trivially to not repeat the whole le16_to_cpu() thing, and to use an explicit constant for the magic value '8' ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-23NFS: NFSROOT should default to "proto=udp"Chuck Lever
commit 53d4737580535e073963b91ce87d4216e434fab5 upstream. There have been a number of recent reports that NFSROOT is no longer working with default mount options, but fails only with certain NICs. Brian Downing <bdowning@lavos.net> bisected to commit 56463e50 "NFS: Use super.c for NFSROOT mount option parsing". Among other things, this commit changes the default mount options for NFSROOT to use TCP instead of UDP as the underlying transport. TCP seems less able to deal with NICs that are slow to initialize. The system logs that have accompanied reports of problems all show that NFSROOT attempts to establish a TCP connection before the NIC is fully initialized, and thus the TCP connection attempt fails. When a TCP connection attempt fails during a mount operation, the NFS stack needs to fail the operation. Usually user space knows how and when to retry it. The network layer does not report a distinct error code for this particular failure mode. Thus, there isn't a clean way for the RPC client to see that it needs to retry in this case, but not in others. Because NFSROOT is used in some environments where it is not possible to update the kernel command line to specify "udp", the proper thing to do is change NFSROOT to use UDP by default, as it did before commit 56463e50. To make it easier to see how to change default mount options for NFSROOT and to distinguish default settings from mandatory settings, I've adjusted a couple of areas to document the specifics. root_nfs_cat() is also modified to deal with commas properly when concatenating strings containing mount option lists. This keeps root_nfs_cat() call sites simpler, now that we may be concatenating multiple mount option strings. Tested-by: Brian Downing <bdowning@lavos.net> Tested-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-14NFS: NFSv4 readdir loses entriesChuck Lever
commit d1205f87bbb8040c1408bbd9e0a720310b2b0b9b upstream. On recent 2.6.38-rc kernels, connectathon basic test 6 fails on NFSv4 mounts of OpenSolaris with something like: > ./test6: readdir > ./test6: (/mnt/klimt/matisse.test) didn't read expected 'file.12' dir entry, pass 0 > ./test6: (/mnt/klimt/matisse.test) didn't read expected 'file.82' dir entry, pass 0 > ./test6: (/mnt/klimt/matisse.test) didn't read expected 'file.164' dir entry, pass 0 > ./test6: (/mnt/klimt/matisse.test) Test failed with 3 errors > basic tests failed > Tests failed, leaving /mnt/klimt mounted > [cel@matisse cthon04]$ I narrowed the problem down to nfs4_decode_dirent() reporting that the decode buffer had overflowed while decoding the entries for those missing files. verify_attr_len() assumes both it's pointer arguments reside on the same page. When these arguments point to locations on two different pages, verify_attr_len() can report false errors. This can happen now that a large NFSv4 readdir result can span pages. We have reasonably good checking in nfs4_decode_dirent() anyway, so it should be safe to simply remove the extra checking. At a guess, this was introduced by commit 6650239a, "NFS: Don't use vm_map_ram() in readdir". Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-14nfsd: wrong index used in inner looproel
commit 3ec07aa9522e3d5e9d5ede7bef946756e623a0a0 upstream. Index i was already used in the outer loop Signed-off-by: Roel Kluin <roel.kluin@gmail.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-14nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v3)Neil Horman
commit e9e3d724e2145f5039b423c290ce2b2c3d8f94bc upstream. The "bad_page()" page allocator sanity check was reported recently (call chain as follows): bad_page+0x69/0x91 free_hot_cold_page+0x81/0x144 skb_release_data+0x5f/0x98 __kfree_skb+0x11/0x1a tcp_ack+0x6a3/0x1868 tcp_rcv_established+0x7a6/0x8b9 tcp_v4_do_rcv+0x2a/0x2fa tcp_v4_rcv+0x9a2/0x9f6 do_timer+0x2df/0x52c ip_local_deliver+0x19d/0x263 ip_rcv+0x539/0x57c netif_receive_skb+0x470/0x49f :virtio_net:virtnet_poll+0x46b/0x5c5 net_rx_action+0xac/0x1b3 __do_softirq+0x89/0x133 call_softirq+0x1c/0x28 do_softirq+0x2c/0x7d do_IRQ+0xec/0xf5 default_idle+0x0/0x50 ret_from_intr+0x0/0xa default_idle+0x29/0x50 cpu_idle+0x95/0xb8 start_kernel+0x220/0x225 _sinittext+0x22f/0x236 It occurs because an skb with a fraglist was freed from the tcp retransmit queue when it was acked, but a page on that fraglist had PG_Slab set (indicating it was allocated from the Slab allocator (which means the free path above can't safely free it via put_page. We tracked this back to an nfsv4 setacl operation, in which the nfs code attempted to fill convert the passed in buffer to an array of pages in __nfs4_proc_set_acl, which gets used by the skb->frags list in xs_sendpages. __nfs4_proc_set_acl just converts each page in the buffer to a page struct via virt_to_page, but the vfs allocates the buffer via kmalloc, meaning the PG_slab bit is set. We can't create a buffer with kmalloc and free it later in the tcp ack path with put_page, so we need to either: 1) ensure that when we create the list of pages, no page struct has PG_Slab set or 2) not use a page list to send this data Given that these buffers can be multiple pages and arbitrarily sized, I think (1) is the right way to go. I've written the below patch to allocate a page from the buddy allocator directly and copy the data over to it. This ensures that we have a put_page free-able page for every entry that winds up on an skb frag list, so it can be safely freed when the frame is acked. We do a put page on each entry after the rpc_call_sync call so as to drop our own reference count to the page, leaving only the ref count taken by tcp_sendpages. This way the data will be properly freed when the ack comes in Successfully tested by myself to solve the above oops. Note, as this is the result of a setacl operation that exceeded a page of data, I think this amounts to a local DOS triggerable by an uprivlidged user, so I'm CCing security on this as well. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> CC: Trond Myklebust <Trond.Myklebust@netapp.com> CC: Jeff Layton <jlayton@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-07nilfs2: fix regression that i-flag is not set on changeless checkpointsRyusuke Konishi
commit 72746ac643928f6c3113b5aa783d8ea1b13949d2 upstream. According to the report from Jiro SEKIBA titled "regression in 2.6.37?" (Message-Id: <8739n8vs1f.wl%jir@sekiba.com>), on 2.6.37 and later kernels, lscp command no longer displays "i" flag on checkpoints that snapshot operations or garbage collection created. This is a regression of nilfs2 checkpointing function, and it's critical since it broke behavior of a part of nilfs2 applications. For instance, snapshot manager of TimeBrowse gets to create meaningless snapshots continuously; snapshot creation triggers another checkpoint, but applications cannot distinguish whether the new checkpoint contains meaningful changes or not without the i-flag. This patch fixes the regression and brings that application behavior back to normal. Reported-by: Jiro SEKIBA <jir@unicus.jp> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: Jiro SEKIBA <jir@unicus.jp> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-07ext2: Fix link count corruption under heavy link+rename loadJosh Hunt
commit e8a80c6f769dd4622d8b211b398452158ee60c0b upstream. vfs_rename_other() does not lock renamed inode with i_mutex. Thus changing i_nlink in a non-atomic manner (which happens in ext2_rename()) can corrupt it as reported and analyzed by Josh. In fact, there is no good reason to mess with i_nlink of the moved file. We did it presumably to simulate linking into the new directory and unlinking from an old one. But the practical effect of this is disputable because fsck can possibly treat file as being properly linked into both directories without writing any error which is confusing. So we just stop increment-decrement games with i_nlink which also fixes the corruption. CC: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Josh Hunt <johunt@akamai.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-07fuse: fix hang of single threaded fuseblk filesystemMiklos Szeredi
commit 5a18ec176c934ca1bc9dc61580a5e0e90a9b5733 upstream. Single threaded NTFS-3G could get stuck if a delayed RELEASE reply triggered a DESTROY request via path_put(). Fix this by a) making RELEASE requests synchronous, whenever possible, on fuseblk filesystems b) if not possible (triggered by an asynchronous read/write) then do the path_put() in a separate thread with schedule_work(). Reported-by: Oliver Neukum <oneukum@suse.de> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-07Ocfs2/refcounttree: Fix a bug for refcounttree to writeback clusters in a ↵Tristan Ye
right number. commit acf3bb007e5636ef4c17505affb0974175108553 upstream. Current refcounttree codes actually didn't writeback the new pages out in write-back mode, due to a bug of always passing a ZERO number of clusters to 'ocfs2_cow_sync_writeback', the patch tries to pass a proper one in. Signed-off-by: Tristan Ye <tristan.ye@oracle.com> Signed-off-by: Joel Becker <jlbec@evilplan.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-07ocfs2: Check heartbeat mode for kernel stacks onlyMark Fasheh
commit 52c303c56c3638944b5f733e3961dc58eb8c7270 upstream. Commit 2c442719e90a44a6982c033d69df4aae4b167cfa added some checks for proper heartbeat mode when the o2cb stack is running. Unfortunately, it didn't take into account that a userpsace stack could be running. Fix this by only doing the check if o2cb is in use. This patch allows userspace stacks to mount the fs again. Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <jlbec@evilplan.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-07Fix over-zealous flush_disk when changing device size.NeilBrown
commit 93b270f76e7ef3b81001576860c2701931cdc78b upstream. There are two cases when we call flush_disk. In one, the device has disappeared (check_disk_change) so any data will hold becomes irrelevant. In the oter, the device has changed size (check_disk_size_change) so data we hold may be irrelevant. In both cases it makes sense to discard any 'clean' buffers, so they will be read back from the device if needed. In the former case it makes sense to discard 'dirty' buffers as there will never be anywhere safe to write the data. In the second case it *does*not* make sense to discard dirty buffers as that will lead to file system corruption when you simply enlarge the containing devices. flush_disk calls __invalidate_devices. __invalidate_device calls both invalidate_inodes and invalidate_bdev. invalidate_inodes *does* discard I_DIRTY inodes and this does lead to fs corruption. invalidate_bev *does*not* discard dirty pages, but I don't really care about that at present. So this patch adds a flag to __invalidate_device (calling it __invalidate_device2) to indicate whether dirty buffers should be killed, and this is passed to invalidate_inodes which can choose to skip dirty inodes. flusk_disk then passes true from check_disk_change and false from check_disk_size_change. dm avoids tripping over this problem by calling i_size_write directly rathher than using check_disk_size_change. md does use check_disk_size_change and so is affected. This regression was introduced by commit 608aeef17a which causes check_disk_size_change to call flush_disk, so it is suitable for any kernel since 2.6.27. Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: Andrew Patterson <andrew.patterson@hp.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-07ldm: corrupted partition table can cause kernel oopsTimo Warns
commit 294f6cf48666825d23c9372ef37631232746e40d upstream. The kernel automatically evaluates partition tables of storage devices. The code for evaluating LDM partitions (in fs/partitions/ldm.c) contains a bug that causes a kernel oops on certain corrupted LDM partitions. A kernel subsystem seems to crash, because, after the oops, the kernel no longer recognizes newly connected storage devices. The patch changes ldm_parse_vmdb() to Validate the value of vblk_size. Signed-off-by: Timo Warns <warns@pre-sense.de> Cc: Eugene Teo <eugeneteo@kernel.sg> Acked-by: Richard Russon <ldm@flatcap.org> Cc: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-07epoll: prevent creating circular epoll structuresDavide Libenzi
commit 22bacca48a1755f79b7e0f192ddb9fbb7fc6e64e upstream. In several places, an epoll fd can call another file's ->f_op->poll() method with ep->mtx held. This is in general unsafe, because that other file could itself be an epoll fd that contains the original epoll fd. The code defends against this possibility in its own ->poll() method using ep_call_nested, but there are several other unsafe calls to ->poll elsewhere that can be made to deadlock. For example, the following simple program causes the call in ep_insert recursively call the original fd's ->poll, leading to deadlock: #include <unistd.h> #include <sys/epoll.h> int main(void) { int e1, e2, p[2]; struct epoll_event evt = { .events = EPOLLIN }; e1 = epoll_create(1); e2 = epoll_create(2); pipe(p); epoll_ctl(e2, EPOLL_CTL_ADD, e1, &evt); epoll_ctl(e1, EPOLL_CTL_ADD, p[0], &evt); write(p[1], p, sizeof p); epoll_ctl(e1, EPOLL_CTL_ADD, e2, &evt); return 0; } On insertion, check whether the inserted file is itself a struct epoll, and if so, do a recursive walk to detect whether inserting this file would create a loop of epoll structures, which could lead to deadlock. [nelhage@ksplice.com: Use epmutex to serialize concurrent inserts] Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Nelson Elhage <nelhage@ksplice.com> Reported-by: Nelson Elhage <nelhage@ksplice.com> Tested-by: Nelson Elhage <nelhage@ksplice.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-03-07mm: prevent concurrent unmap_mapping_range() on the same inodeMiklos Szeredi
commit 2aa15890f3c191326678f1bd68af61ec6b8753ec upstream. Michael Leun reported that running parallel opens on a fuse filesystem can trigger a "kernel BUG at mm/truncate.c:475" Gurudas Pai reported the same bug on NFS. The reason is, unmap_mapping_range() is not prepared for more than one concurrent invocation per inode. For example: thread1: going through a big range, stops in the middle of a vma and stores the restart address in vm_truncate_count. thread2: comes in with a small (e.g. single page) unmap request on the same vma, somewhere before restart_address, finds that the vma was already unmapped up to the restart address and happily returns without doing anything. Another scenario would be two big unmap requests, both having to restart the unmapping and each one setting vm_truncate_count to its own value. This could go on forever without any of them being able to finish. Truncate and hole punching already serialize with i_mutex. Other callers of unmap_mapping_range() do not, and it's difficult to get i_mutex protection for all callers. In particular ->d_revalidate(), which calls invalidate_inode_pages2_range() in fuse, may be called with or without i_mutex. This patch adds a new mutex to 'struct address_space' to prevent running multiple concurrent unmap_mapping_range() on the same mapping. [ We'll hopefully get rid of all this with the upcoming mm preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex lockbreak" patch in particular. But that is for 2.6.39 ] Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reported-by: Michael Leun <lkml20101129@newton.leun.net> Reported-by: Gurudas Pai <gurudas.pai@oracle.com> Tested-by: Gurudas Pai <gurudas.pai@oracle.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-24eCryptfs: Copy up lower inode attrs in getattrTyler Hicks
commit 55f9cf6bbaa682958a7dd2755f883b768270c3ce upstream. The lower filesystem may do some type of inode revalidation during a getattr call. eCryptfs should take advantage of that by copying the lower inode attributes to the eCryptfs inode after a call to vfs_getattr() on the lower inode. I originally wrote this fix while working on eCryptfs on nfsv3 support, but discovered it also fixed an eCryptfs on ext4 nanosecond timestamp bug that was reported. https://bugs.launchpad.net/bugs/613873 Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-24cifs: Fix regression in LANMAN (LM) auth codeShirish Pargaonkar
commit 5e640927a597a7c3e72b61e8bce74c22e906de65 upstream. LANMAN response length was changed to 16 bytes instead of 24 bytes. Revert it back to 24 bytes. Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-24cifs: fix handling of scopeid in cifs_convert_addressJeff Layton
commit 9616125611ee47693186533d76e403856a36b3c8 upstream. The code finds, the '%' sign in an ipv6 address and copies that to a buffer allocated on the stack. It then ignores that buffer, and passes 'pct' to simple_strtoul(), which doesn't work right because we're comparing 'endp' against a completely different string. Fix it by passing the correct pointer. While we're at it, this is a good candidate for conversion to strict_strtoul as well. Cc: David Howells <dhowells@redhat.com> Reported-by: Björn JACKE <bj@sernet.de> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-24fs/partitions: Validate map_count in Mac partition tablesTimo Warns
commit fa7ea87a057958a8b7926c1a60a3ca6d696328ed upstream. Validate number of blocks in map and remove redundant variable. Signed-off-by: Timo Warns <warns@pre-sense.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-24s390: remove task_show_regsMartin Schwidefsky
commit 261cd298a8c363d7985e3482946edb4bfedacf98 upstream. task_show_regs used to be a debugging aid in the early bringup days of Linux on s390. /proc/<pid>/status is a world readable file, it is not a good idea to show the registers of a process. The only correct fix is to remove task_show_regs. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-24nfsd: correctly handle return value from nfsd_map_name_to_*NeilBrown
commit 47c85291d3dd1a51501555000b90f8e281a0458e upstream. These functions return an nfs status, not a host_err. So don't try to convert before returning. This is a regression introduced by 3c726023402a2f3b28f49b9d90ebf9e71151157d; I fixed up two of the callers, but missed these two. Reported-by: Herbert Poetzl <herbert@13thfloor.at> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-24btrfs: prevent heap corruption in btrfs_ioctl_space_info()Dan Rosenberg
commit 51788b1bdd0d68345bab0af4301e7fa429277228 upstream. Commit bf5fc093c5b625e4259203f1cee7ca73488a5620 refactored btrfs_ioctl_space_info() and introduced several security issues. space_args.space_slots is an unsigned 64-bit type controlled by a possibly unprivileged caller. The comparison as a signed int type allows providing values that are treated as negative and cause the subsequent allocation size calculation to wrap, or be truncated to 0. By providing a size that's truncated to 0, kmalloc() will return ZERO_SIZE_PTR. It's also possible to provide a value smaller than the slot count. The subsequent loop ignores the allocation size when copying data in, resulting in a heap overflow or write to ZERO_SIZE_PTR. The fix changes the slot count type and comparison typecast to u64, which prevents truncation or signedness errors, and also ensures that we don't copy more data than we've allocated in the subsequent loop. Note that zero-size allocations are no longer possible since there is already an explicit check for space_args.space_slots being 0 and truncation of this value is no longer an issue. Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com> Signed-off-by: Josef Bacik <josef@redhat.com> Reviewed-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-24CRED: Fix kernel panic upon security_file_alloc() failure.Tetsuo Handa
commit 78d2978874e4e10e97dfd4fd79db45bdc0748550 upstream. In get_empty_filp() since 2.6.29, file_free(f) is called with f->f_cred == NULL when security_file_alloc() returned an error. As a result, kernel will panic() due to put_cred(NULL) call within RCU callback. Fix this bug by assigning f->f_cred before calling security_file_alloc(). Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-24xfs: fix dquot shaker deadlockDave Chinner
commit 0fbca4d1c3932c27c4794bf5c2b5fc961cf5a54f upstream. Commit 368e136 ("xfs: remove duplicate code from dquot reclaim") fails to unlock the dquot freelist when the number of loop restarts is exceeded in xfs_qm_dqreclaim_one(). This causes hangs in memory reclaim. Rework the loop control logic into an unwind stack that all the different cases jump into. This means there is only one set of code that processes the loop exit criteria, and simplifies the unlocking of all the items from different points in the loop. It also fixes a double increment of the restart counter from the qi_dqlist_lock case. Reported-by: Malcolm Scott <lkml@malc.org.uk> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com> Cc: Arkadiusz Miskiewicz <a.miskiewicz@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-24NFSD: memory corruption due to writing beyond the stat arrayKonstantin Khorenko
commit 3aa6e0aa8ab3e64bbfba092c64d42fd1d006b124 upstream. If nfsd fails to find an exported via NFS file in the readahead cache, it should increment corresponding nfsdstats counter (ra_depth[10]), but due to a bug it may instead write to ra_depth[11], corrupting the following field. In a kernel with NFSDv4 compiled in the corruption takes the form of an increment of a counter of the number of NFSv4 operation 0's received; since there is no operation 0, this is harmless. In a kernel with NFSDv4 disabled it corrupts whatever happens to be in the memory beyond nfsdstats. Signed-off-by: Konstantin Khorenko <khorenko@openvz.org> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17nilfs2: fix crash after one superblock became unavailableRyusuke Konishi
commit 0ca7a5b9ac5d301845dd6382ff25a699b6263a81 upstream. Fixes the following kernel oops in nilfs_setup_super() which could arise if one of two super-blocks is unavailable. > BUG: unable to handle kernel NULL pointer dereference at (null) > Pid: 3529, comm: mount.nilfs2 Not tainted 2.6.37 #1 / > EIP: 0060:[<c03196bc>] EFLAGS: 00010202 CPU: 3 > EIP is at memcpy+0xc/0x1b > Call Trace: > [<f953720e>] ? nilfs_setup_super+0x6c/0xa5 [nilfs2] > [<f95369e9>] ? nilfs_get_root_dentry+0x81/0xcb [nilfs2] > [<f9537a08>] ? nilfs_mount+0x4f9/0x62c [nilfs2] > [<c02745cf>] ? kstrdup+0x36/0x3f > [<f953750f>] ? nilfs_mount+0x0/0x62c [nilfs2] > [<c0293940>] ? vfs_kern_mount+0x4d/0x12c > [<c02a5100>] ? get_fs_type+0x76/0x8f > [<c0293a68>] ? do_kern_mount+0x33/0xbf > [<c02a784a>] ? do_mount+0x2ed/0x714 > [<c02a6171>] ? copy_mount_options+0x28/0xfc > [<c02a7ce3>] ? sys_mount+0x72/0xaf > [<c0473085>] ? syscall_call+0x7/0xb Reported-by: Wakko Warner <wakko@animx.eu.org> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Tested-by: Wakko Warner <wakko@animx.eu.org> LKML-Reference: <20110121024918.GA29598@animx.eu.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17fs/direct-io.c: don't try to allocate more than BIO_MAX_PAGES in a bioDavid Dillow
commit 20d9600cb407b0b55fef6ee814b60345c6f58264 upstream. When using devices that support max_segments > BIO_MAX_PAGES (256), direct IO tries to allocate a bio with more pages than allowed, which leads to an oops in dio_bio_alloc(). Clamp the request to the supported maximum, and change dio_bio_alloc() to reflect that bio_alloc() will always return a bio when called with __GFP_WAIT and a valid number of vectors. [akpm@linux-foundation.org: remove redundant BUG_ON()] Signed-off-by: David Dillow <dillowda@ornl.gov> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17cifs: Fix regression during share-level security mounts (Repost)Shirish Pargaonkar
commit 540b2e377797d8715469d408b887baa9310c5f3e upstream. NTLM response length was changed to 16 bytes instead of 24 bytes that are sent in Tree Connection Request during share-level security share mounts. Revert it back to 24 bytes. Reported-and-Tested-by: Grzegorz Ozanski <grzegorz.ozanski@intel.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Acked-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17CIFS: Fix oplock break handling (try #2)Pavel Shilovsky
commit 12fed00de963433128b5366a21a55808fab2f756 upstream. When we get oplock break notification we should set the appropriate value of OplockLevel field in oplock break acknowledge according to the oplock level held by the client in this time. As we only can have level II oplock or no oplock in the case of oplock break, we should be aware only about clientCanCacheRead field in cifsInodeInfo structure. Also fix bug connected with wrong interpretation of OplockLevel field during oplock break notification processing. Signed-off-by: Pavel Shilovsky <piastryyy@gmail.com> Signed-off-by: Steve French <sfrench@us.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17epoll: epoll_wait() should not use timespec_add_ns()Eric Dumazet
commit 0781b909b5586f4db720b5d1838b78f9d8e42f14 upstream. commit 95aac7b1cd224f ("epoll: make epoll_wait() use the hrtimer range feature") added a performance regression because it uses timespec_add_ns() with potential very large 'ns' values. [akpm@linux-foundation.org: s/epoll_set_mstimeout/ep_set_mstimeout/, per Davide] Reported-by: Simon Kirby <sim@hostway.ca> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Shawn Bohrer <shawn.bohrer@gmail.com> Acked-by: Davide Libenzi <davidel@xmailserver.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17Revert "exofs: Set i_mapping->backing_dev_info anyway"Boaz Harrosh
commit 0b0abeaf3d30cec03ac6497fe978b8f7edecc5ae upstream. This reverts commit 115e19c53501edc11f730191f7f047736815ae3d. Apparently setting inode->bdi to one's own sb->s_bdi stops VFS from sending *read-aheads*. This problem was bisected to this commit. A revert fixes it. I'll investigate farther why is this happening for the next Kernel, but for now a revert. I'm sending to stable@kernel.org as well, since it exists also in 2.6.37. 2.6.36 is good and does not have this patch. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17ext4: make grpinfo slab cache names staticEric Sandeen
commit 2892c15ddda6a76dc10b7499e56c0f3b892e5a69 upstream. In 2.6.37 I was running into oopses with repeated module loads & unloads. I tracked this down to: fb1813f4 ext4: use dedicated slab caches for group_info structures (this was in addition to the features advert unload problem) The kstrdup & subsequent kfree of the cache name was causing a double free. In slub, at least, if I read it right it allocates & frees the name itself, slab seems to do something different... so in slub I think we were leaking -our- cachep->name, and double freeing the one allocated by slub. After getting lost in slab/slub/slob a bit, I just looked at other sized-caches that get allocated. jbd2, biovec, sgpool all do it more or less the way jbd2 does. Below patch follows the jbd2 method of dynamically allocating a cache at mount time from a list of static names. (This might also possibly fix a race creating the caches with parallel mounts running). [Folded in a fix from Dan Carpenter which fixed an off-by-one error in the original patch] Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17ext4: Fix data corruption with multi-block writepages supportCurt Wohlgemuth
commit d50bdd5aa55127635fd8a5c74bd2abb256bd34e3 upstream. This fixes a corruption problem with the multi-block writepages submittal change for ext4, from commit bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc ("ext4: use bio layer instead of buffer layer in mpage_da_submit_io"). (Note that this corruption is not present in 2.6.37 on ext4, because the corruption was detected after the feature was merged in 2.6.37-rc1, and so it was turned off by adding a non-default mount option, mblk_io_submit. With this commit, which hopefully fixes the last of the bugs with this feature, we'll be able to turn on this performance feature by default in 2.6.38, and remove the mblk_io_submit option.) The ext4 code path to bundle multiple pages for writeback in ext4_bio_write_page() had a bug: we should be clearing buffer head dirty flags *before* we submit the bio, not in the completion routine. The patch below was tested on 2.6.37 under KVM with the postgresql script which was submitted by Jon Nelson as documented in commit 1449032be1. Without the patch, I'd hit the corruption problem about 50-70% of the time. With the patch, I executed the script > 100 times with no corruption seen. I also fixed a bug to make sure ext4_end_bio() doesn't dereference the bio after the bio_put() call. Reported-by: Jon Nelson <jnelson@jamponi.net> Reported-by: Matthias Bayer <jackdachef@gmail.com> Signed-off-by: Curt Wohlgemuth <curtw@google.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17ext4: unregister features interface on module unloadLukas Czerner
commit 8f021222c1e2756ea4c9dde93b23e1d2a0a4ec37 upstream. Ext4 features interface was not properly unregistered which led to problems while unloading/reloading ext4 module. This commit fixes that by adding proper kobject unregistration code into ext4_exit_fs() as well as fail-path of ext4_init_fs() Reported-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17ext4: fix panic on module unload when stopping lazyinit threadEric Sandeen
commit 8f1f745331c1b560f53c0d60e55a4f4f43f7cce5 upstream. https://bugzilla.kernel.org/show_bug.cgi?id=27652 If the lazyinit thread is running, the teardown function ext4_destroy_lazyinit_thread() has problems: ext4_clear_request_list(); while (ext4_li_info->li_task) { wake_up(&ext4_li_info->li_wait_daemon); wait_event(ext4_li_info->li_wait_task, ext4_li_info->li_task == NULL); } Clearing the request list will cause the thread to exit and free ext4_li_info, so then we're waiting on something which is getting freed. Fix this up by making the thread respond to kthread_stop, and exit, without the need to wait for that exit in some other homegrown way. Reported-and-Tested-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17ext4: fix memory leak in ext4_free_branchesTheodore Ts'o
commit 1c5b9e9065567876c2d4a7a16d78f0fed154a5bf upstream. Commit 40389687 moved a call to ext4_forget() out of ext4_free_branches and let ext4_free_blocks() handle calling bforget(). But that change unfortunately did not replace the call to ext4_forget() with brelse(), which was needed to drop the in-use count of the indirect block's buffer head, which lead to a memory leak when deleting files that used indirect blocks. Fix this. Thanks to Hugh Dickins for pointing this out. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17ext4: fix trimming of a single groupJan Kara
commit ca6e909f9bebe709bc65a3ee605ce32969db0452 upstream. When ext4_trim_fs() is called to trim a part of a single group, the logic will wrongly set last block of the interval to 'len' instead of 'first_block + len'. Thus a shorter interval is possibly trimmed. Fix it. CC: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17ext4: fix uninitialized variable in ext4_register_li_requestAndrew Morton
commit 6c5a6cb998854f3c579ecb2bc1423d302bcb1b76 upstream. fs/ext4/super.c: In function 'ext4_register_li_request': fs/ext4/super.c:2936: warning: 'ret' may be used uninitialized in this function It looks buggy to me, too. Cc: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17writeback: avoid livelocking WB_SYNC_ALL writebackJan Kara
commit b9543dac5bbc4aef0a598965b6b34f6259ab9a9b upstream. When wb_writeback() is called in WB_SYNC_ALL mode, work->nr_to_write is usually set to LONG_MAX. The logic in wb_writeback() then calls __writeback_inodes_sb() with nr_to_write == MAX_WRITEBACK_PAGES and we easily end up with non-positive nr_to_write after the function returns, if the inode has more than MAX_WRITEBACK_PAGES dirty pages at the moment. When nr_to_write is <= 0 wb_writeback() decides we need another round of writeback but this is wrong in some cases! For example when a single large file is continuously dirtied, we would never finish syncing it because each pass would be able to write MAX_WRITEBACK_PAGES and inode dirty timestamp never gets updated (as inode is never completely clean). Thus __writeback_inodes_sb() would write the redirtied inode again and again. Fix the issue by setting nr_to_write to LONG_MAX in WB_SYNC_ALL mode. We do not need nr_to_write in WB_SYNC_ALL mode anyway since write_cache_pages() does livelock avoidance using page tagging in WB_SYNC_ALL mode. This makes wb_writeback() call __writeback_inodes_sb() only once on WB_SYNC_ALL. The latter function won't livelock because it works on - a finite set of files by doing queue_io() once at the beginning - a finite set of pages by PAGECACHE_TAG_TOWRITE page tagging After this patch, program from http://lkml.org/lkml/2010/10/24/154 is no longer able to stall sync forever. [fengguang.wu@intel.com: fix locking comment] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Engelhardt <jengelh@medozas.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17writeback: stop background/kupdate works from livelocking other worksJan Kara
commit aa373cf550994623efb5d49a4d8775bafd10bbc1 upstream. Background writeback is easily livelockable in a loop in wb_writeback() by a process continuously re-dirtying pages (or continuously appending to a file). This is in fact intended as the target of background writeback is to write dirty pages it can find as long as we are over dirty_background_threshold. But the above behavior gets inconvenient at times because no other work queued in the flusher thread's queue gets processed. In particular, since e.g. sync(1) relies on flusher thread to do all the IO for it, sync(1) can hang forever waiting for flusher thread to do the work. Generally, when a flusher thread has some work queued, someone submitted the work to achieve a goal more specific than what background writeback does. Moreover by working on the specific work, we also reduce amount of dirty pages which is exactly the target of background writeout. So it makes sense to give specific work a priority over a generic page cleaning. Thus we interrupt background writeback if there is some other work to do. We return to the background writeback after completing all the queued work. This may delay the writeback of expired inodes for a while, however the expired inodes will eventually be flushed to disk as long as the other works won't livelock. [fengguang.wu@intel.com: update comment] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Engelhardt <jengelh@medozas.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17writeback: integrated background writeback workJan Kara
commit 6585027a5e8cb490e3a761b2f3f3c3acf722aff2 upstream. Check whether background writeback is needed after finishing each work. When bdi flusher thread finishes doing some work check whether any kind of background writeback needs to be done (either because dirty_background_ratio is exceeded or because we need to start flushing old inodes). If so, just do background write back. This way, bdi_start_background_writeback() just needs to wake up the flusher thread. It will do background writeback as soon as there is no other work. This is a preparatory patch for the next patch which stops background writeback as soon as there is other work to do. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Engelhardt <jengelh@medozas.de> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17block: fix accounting bug on cross partition mergesJerome Marchand
commit 09e099d4bafea3b15be003d548bdf94b4b6e0e17 upstream. /proc/diskstats would display a strange output as follows. $ cat /proc/diskstats |grep sda 8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089 8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691 ~~~~~~~~~~ 8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390 8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92 8 4 sda4 4 0 8 0 0 0 0 0 0 0 0 8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137 Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE. The detailed root cause is as follows. Assuming that there are two partition, sda1 and sda2. 1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight is 0 and sda2's one is 1. | hd_struct->in_flight --------------------------- sda1 | 0 sda2 | 1 --------------------------- 2. A bio belongs to sda1 is issued and is merged into the request mentioned on step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed from sda2 region to sda1 region. However the two partition's hd_struct->in_flight are not changed. | hd_struct->in_flight --------------------------- sda1 | 0 sda2 | 1 --------------------------- 3. The request is finished and blk_account_io_done() is called. In this case, sda2's hd_struct->in_flight, not a sda1's one, is decremented. | hd_struct->in_flight --------------------------- sda1 | -1 sda2 | 1 --------------------------- The patch fixes the problem by caching the partition lookup inside the request structure, hence making sure that the increment and decrement will always happen on the same partition struct. This also speeds up IO with accounting enabled, since it cuts down on the number of lookups we have to do. Also add a refcount to struct hd_struct to keep the partition in memory as long as users exist. We use kref_test_and_get() to ensure we don't add a reference to a partition which is going away. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17/proc/kcore: fix seekingDave Anderson
commit ceff1a770933e2ca2bf995b453dade4ec47a9878 upstream. Commit 34aacb2920 ("procfs: Use generic_file_llseek in /proc/kcore") broke seeking on /proc/kcore. This changes it back to use default_llseek in order to restore the original behavior. The problem with generic_file_llseek is that it only allows seeks up to inode->i_sb->s_maxbytes, which is 2GB-1 on procfs, where the memory file offset values in the /proc/kcore PT_LOAD segments may exceed or start beyond that offset value. A similar revert was made for /proc/vmcore. Signed-off-by: Dave Anderson <anderson@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17nfsd4: name->id mapping should fail with BADOWNER not BADNAMEJ. Bruce Fields
commit f6af99ec1b261e21219d5eba99e3af48fc6c32d4 upstream. According to rfc 3530 BADNAME is for strings that represent paths; BADOWNER is for user/group names that don't map. And the too-long name should probably be BADOWNER as well; it's effectively the same as if we couldn't map it. Reported-by: Trond Myklebust <Trond.Myklebust@netapp.com> Reported-by: Simon Kirby <sim@hostway.ca> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17NFS: Fix "kernel BUG at fs/aio.c:554!"Chuck Lever
commit 839f7ad6932d95f4d5ae7267b95c574714ff3d5b upstream. Nick Piggin reports: > I'm getting use after frees in aio code in NFS > > [ 2703.396766] Call Trace: > [ 2703.396858] [<ffffffff8100b057>] ? native_sched_clock+0x27/0x80 > [ 2703.396959] [<ffffffff8108509e>] ? put_lock_stats+0xe/0x40 > [ 2703.397058] [<ffffffff81088348>] ? lock_release_holdtime+0xa8/0x140 > [ 2703.397159] [<ffffffff8108a2a5>] lock_acquire+0x95/0x1b0 > [ 2703.397260] [<ffffffff811627db>] ? aio_put_req+0x2b/0x60 > [ 2703.397361] [<ffffffff81039701>] ? get_parent_ip+0x11/0x50 > [ 2703.397464] [<ffffffff81612a31>] _raw_spin_lock_irq+0x41/0x80 > [ 2703.397564] [<ffffffff811627db>] ? aio_put_req+0x2b/0x60 > [ 2703.397662] [<ffffffff811627db>] aio_put_req+0x2b/0x60 > [ 2703.397761] [<ffffffff811647fe>] do_io_submit+0x2be/0x7c0 > [ 2703.397895] [<ffffffff81164d0b>] sys_io_submit+0xb/0x10 > [ 2703.397995] [<ffffffff8100307b>] system_call_fastpath+0x16/0x1b > > Adding some tracing, it is due to nfs completing the request then > returning something other than -EIOCBQUEUED, so aio.c > also completes the request. To address this, prevent the NFS direct I/O engine from completing async iocbs when the forward path returns an error without starting any I/O. This fix appears to survive ^C during both "xfstest no. 208" and "fsx -Z." It's likely this bug has existed for a very long while, as we are seeing very similar symptoms in OEL 5. Copying stable. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-02-17NFS: Fix an NFS client lockdep issueTrond Myklebust
commit e00b8a24041f37e56b4b8415ce4eba1cbc238065 upstream. There is no reason to be freeing the delegation cred in the rcu callback, and doing so is resulting in a lockdep complaint that rpc_credcache_lock is being called from both softirq and non-softirq contexts. Reported-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>