summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2008-02-06vm audit: add VM_DONTEXPAND to mmap for drivers that need it (CVE-2008-0007)Nick Piggin
Drivers that register a ->fault handler, but do not range-check the offset argument, must set VM_DONTEXPAND in the vm_flags in order to prevent an expanding mremap from overflowing the resource. I've audited the tree and attempted to fix these problems (usually by adding VM_DONTEXPAND where it is not obvious). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-02-06vfs: coredumping fix (CVE-2007-6206)Ingo Molnar
vfs: coredumping fix patch c46f739dd39db3b07ab5deb4e3ec81e1c04a91af in mainline fix: http://bugzilla.kernel.org/show_bug.cgi?id=3043 only allow coredumping to the same uid that the coredumping task runs under. Signed-off-by: Ingo Molnar <mingo@elte.hu> Acked-by: Alan Cox <alan@redhat.com> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Al Viro <viro@ftp.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: maximilian attems <max@stro.at> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-01-14Use access mode instead of open flags to determine needed permissions ↵Linus Torvalds
(CVE-2008-0001) patch 974a9f0b47da74e28f68b9c8645c3786aa5ace1a in mainline Way back when (in commit 834f2a4a1554dc5b2598038b3fe8703defcbe467, aka "VFS: Allow the filesystem to return a full file pointer on open intent" to be exact), Trond changed the open logic to keep track of the original flags to a file open, in order to pass down the the intent of a dentry lookup to the low-level filesystem. However, when doing that reorganization, it changed the meaning of namei_flags, and thus inadvertently changed the test of access mode for directories (and RO filesystem) to use the wrong flag. So fix those test back to use access mode ("acc_mode") rather than the open flag ("flag"). Issue noticed by Bill Roman at Datalight. Reported-and-tested-by: Bill Roman <bill.roman@datalight.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-12-14knfsd: Validate filehandle type in fsid_sourceNeil Brown
patch b8da0d1c27f144bce999c653467106f3f0d5a308 in mainline. fsid_source decided where to get the 'fsid' number to return for a GETATTR based on the type of filehandle. It can be from the device, from the fsid, or from the UUID. It is possible for the filehandle to be inconsistent with the export information, so make sure the export information actually has the info implied by the value returned by fsid_source. Signed-off-by: Neil Brown <neilb@suse.de> Cc: "Luiz Fernando N. Capitulino" <lcapitulino@gmail.com> Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oliver Pintr <oliver.pntr@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-11-21ocfs2: fix write() performance regressionMark Fasheh
ocfs2: fix write() performance regression patch 4e9563fd55ff4479f2b118d0757d121dd0cfc39c in mainline. On file systems which don't support sparse files, Ocfs2_map_page_blocks() was reading blocks on appending writes. This caused write performance to suffer dramatically. Fix this by detecting an appending write on a nonsparse fs and skipping the read. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-11-05minixfs: limit minixfs printks on corrupted dir i_size (CVE-2006-6058)Eric Sandeen
patch 44ec6f3f89889a469773b1fd894f8fcc07c29cf in mainline This attempts to address CVE-2006-6058 http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2006-6058 first reported at http://projects.info-pull.com/mokb/MOKB-17-11-2006.html Essentially a corrupted minix dir inode reporting a very large i_size will loop for a very long time in minix_readdir, minix_find_entry, etc, because on EIO they just move on to try the next page. This is under the BKL, printk-storming as well. This can lock up the machine for a very long time. Simply ratelimiting the printks gets things back under control. Make the message a bit more informative while we're here. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Cc: Bodo Eggert <7eggert@gmx.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-10-10NLM: Fix a memory leak in nlmsvc_testlockTrond Myklebust
changeset a6d85430424d44e946e0946bfaad607115510989 in upstream The recent fix for a circular lock dependency unfortunately introduced a potential memory leak in the event where the call to nlmsvc_lookup_host fails for some reason. Thanks to Roel Kluin for spotting this. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-10NLM: Fix a circular lock dependency in lockdTrond Myklebust
commit 255129d1e9ca0ed3d69d5517fae3e03d7ab4b806 in upstream. The problem is that the garbage collector for the 'host' structures nlm_gc_hosts(), holds nlm_host_mutex while calling down to nlmsvc_mark_resources, which, eventually takes the file->f_mutex. We cannot therefore call nlmsvc_lookup_host() from within nlmsvc_create_block, since the caller will already hold file->f_mutex, so the attempt to grab nlm_host_mutex may deadlock. Fix the problem by calling nlmsvc_lookup_host() outside the file->f_mutex. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-09-26dir_index: error out instead of BUG on corrupt dx dirsEric Sandeen
commit 3d82abae9523c33d4a16fdfdfd2bdde316d7b56a in mainline. Convert asserts (BUGs) in dx_probe from bad on-disk data to recoverable errors with helpful warnings. With help catching other asserts from Duane Griffin <duaneg@dghda.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Duane Griffin <duaneg@dghda.com> Acked-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-09-26nfs: fix oops re sysctls and V4 supportAlexey Dobriyan
commit 49af7ee181f4f516ac99eba85d3f70ed42cabe76 in mainline. NFS unregisters sysctls only if V4 support is compiled in. However, sysctl table is not V4 specific, so unregister it always. Steps to reproduce: [build nfs.ko with CONFIG_NFS_V4=n] modrobe nfs rmmod nfs ls /proc/sys Unable to handle kernel paging request at ffffffff880661c0 RIP: [<ffffffff802af8e3>] proc_sys_readdir+0xd3/0x350 PGD 203067 PUD 207063 PMD 7e216067 PTE 0 Oops: 0000 [1] SMP CPU 1 Modules linked in: lockd nfs_acl sunrpc Pid: 3335, comm: ls Not tainted 2.6.23-rc3-bloat #2 RIP: 0010:[<ffffffff802af8e3>] [<ffffffff802af8e3>] proc_sys_readdir+0xd3/0x350 RSP: 0018:ffff81007fd93e78 EFLAGS: 00010286 RAX: ffffffff880661c0 RBX: ffffffff80466370 RCX: ffffffff880661c0 RDX: 00000000000014c0 RSI: ffff81007f3ad020 RDI: ffff81007efd8b40 RBP: 0000000000000018 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000001 R11: ffffffff802a8570 R12: ffffffff880661c0 R13: ffff81007e219640 R14: ffff81007efd8b40 R15: ffff81007ded7280 FS: 00002ba25ef03060(0000) GS:ffff81007ff81258(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: ffffffff880661c0 CR3: 000000007dfaf000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process ls (pid: 3335, threadinfo ffff81007fd92000, task ffff81007d8a0000) Stack: ffff81007f3ad150 ffffffff80283f30 ffff81007fd93f48 ffff81007efd8b40 ffff81007ee00440 0000000422222222 0000000200035593 ffffffff88037e9a 2222222222222222 ffffffff80466500 ffff81007e416400 ffff81007e219640 Call Trace: [<ffffffff80283f30>] filldir+0x0/0xf0 [<ffffffff80283f30>] filldir+0x0/0xf0 [<ffffffff802840c7>] vfs_readdir+0xa7/0xc0 [<ffffffff80284376>] sys_getdents+0x96/0xe0 [<ffffffff8020bb3e>] system_call+0x7e/0x83 Code: 41 8b 14 24 85 d2 74 dc 49 8b 44 24 08 48 85 c0 74 e7 49 3b RIP [<ffffffff802af8e3>] proc_sys_readdir+0xd3/0x350 RSP <ffff81007fd93e78> CR2: ffffffff880661c0 Kernel panic - not syncing: Fatal exception Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-09-26ext34: ensure do_split leaves enough free space in both blocksEric Sandeen
commit ef2b02d3e617cb0400eedf2668f86215e1b0e6af in mainline. The do_split() function for htree dir blocks is intended to split a leaf block to make room for a new entry. It sorts the entries in the original block by hash value, then moves the last half of the entries to the new block - without accounting for how much space this actually moves. (IOW, it moves half of the entry *count* not half of the entry *space*). If by chance we have both large & small entries, and we move only the smallest entries, and we have a large new entry to insert, we may not have created enough space for it. The patch below stores each record size when calculating the dx_map, and then walks the hash-sorted dx_map, calculating how many entries must be moved to more evenly split the existing entries between the old block and the new block, guaranteeing enough space for the new entry. The dx_map "offs" member is reduced to u16 so that the overall map size does not change - it is temporarily stored at the end of the new block, and if it grows too large it may be overwritten. By making offs and size both u16, we won't grow the map size. Also add a few comments to the functions involved. This fixes the testcase reported by hooanon05@yahoo.co.jp on the linux-ext4 list, "ext3 dir_index causes an error" Thanks to Andreas Dilger for discussing the problem & solution with me. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Andreas Dilger <adilger@clusterfs.com> Tested-by: Junjiro Okajima <hooanon05@yahoo.co.jp> Cc: Theodore Ts'o <tytso@mit.edu> Cc: ext4 <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-09-26Leases can be hidden by flocksPavel Emelyanov
commit 0e2f6db88a6900bc9db576d6b478b12ee60d61f7 in mainline. The inode->i_flock list contains the leases, flocks and posix locks in the specified order. However, the flocks are added in the head of this list thus hiding the leases from F_GETLEASE command, from time_out_leases() and other code that expects the leases to come first. The following example will demonstrate this: #define _GNU_SOURCE #include <unistd.h> #include <fcntl.h> #include <stdio.h> #include <sys/file.h> static void show_lease(int fd) { int res; res = fcntl(fd, F_GETLEASE); switch (res) { case F_RDLCK: printf("Read lease\n"); break; case F_WRLCK: printf("Write lease\n"); break; case F_UNLCK: printf("No leases\n"); break; default: printf("Some shit\n"); break; } } int main(int argc, char **argv) { int fd, res; fd = open(argv[1], O_RDONLY); if (fd == -1) { perror("Can't open file"); return 1; } res = fcntl(fd, F_SETLEASE, F_WRLCK); if (res == -1) { perror("Can't set lease"); return 1; } show_lease(fd); if (flock(fd, LOCK_SH) == -1) { perror("Can't flock shared"); return 1; } show_lease(fd); return 0; } The first call to show_lease() will show the write lease set, but the second will show no leases. Fix the flock adding so that the leases always stay in the head of this list. Found during making the flocks pid-namespaces aware. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Acked-by: "J. Bruce Fields" <bfields@fieldses.org> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-09-26afs: mntput called before dputAndreas Gruenbacher
commit 1a1a1a758bf0107d1f78ff1d622f45987803d894 in mainline. dput must be called before mntput here. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Acked-By: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-09-26splice: fix direct splice error handlingJens Axboe
This is a splice patch for 2.6.22 and 2.6.21 (and earlier, I did not check. Let me know if you still maintain older stable trees!). It fixes an infinite loop in do_splice_direct(), when there's either nothing to read or nothing to write and blocking doesn't help. It could be things like running out of disk space. We need to exit both for failure and zero return, or we could be going around forever. This got fixed in 2.6.23-git with commit 51a92c0f6ce8fa85fa0e18ecda1d847e606e8066 Herbert Poetzl <herbert@13thfloor.at> noticed this bug in 2.6.22, and has verified that this minimal fix works. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-09-26JFFS2: fix write deadlock regressionJason Lunz
Changeset fc0e01974ccccc7530b7634a63ee3fcc57b845ea from mainline. I've bisected the deadlock when many small appends are done on jffs2 down to this commit: commit 6fe6900e1e5b6fa9e5c59aa5061f244fe3f467e2 Author: Nick Piggin <npiggin@suse.de> Date: Sun May 6 14:49:04 2007 -0700 mm: make read_cache_page synchronous Ensure pages are uptodate after returning from read_cache_page, which allows us to cut out most of the filesystem-internal PageUptodate calls. I didn't have a great look down the call chains, but this appears to fixes 7 possible use-before uptodate in hfs, 2 in hfsplus, 1 in jfs, a few in ecryptfs, 1 in jffs2, and a possible cleared data overwritten with readpage in block2mtd. All depending on whether the filler is async and/or can return with a !uptodate page. It introduced a wait to read_cache_page, as well as a read_cache_page_async function equivalent to the old read_cache_page without any callers. Switching jffs2_gc_fetch_page to read_cache_page_async for the old behavior makes the deadlocks go away, but maybe reintroduces the use-before-uptodate problem? I don't understand the mm/fs interaction well enough to say. [It's fine. dwmw2.] Signed-off-by: Jason Lunz <lunz@falooley.org> Signed-off-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-08-30signalfd: make it group-wide, fix posix-timers schedulingOleg Nesterov
With this patch any thread can dequeue its own private signals via signalfd, even if it was created by another sub-thread. To do so, we pass "current" to dequeue_signal() if the caller is from the same thread group. This also fixes the scheduling of posix timers broken by the previous patch. If the caller doesn't belong to this thread group, we can't handle __SI_TIMER case properly anyway. Perhaps we should forbid the cross-process signalfd usage and convert ctx->tsk to ctx->sighand. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Davide Libenzi <davidel@xmailserver.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Roland McGrath <roland@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-08-30ocfs2: Fix bad source start calculation during kernel writesMark Fasheh
[PATCH] ocfs2: Fix bad source start calculation during kernel writes For in-kernel writes ocfs2_get_write_source() should be starting the buffer at a page boundary as the math in ocfs2_map_and_write_user_data() will pad it back out to the correct write offset. Instead, we were passing the raw offset, which caused ocfs2_map_and_write_user_data() start too far into the buffer, resulting in corruptions from nfs client writes. Signed-off-by: Mark Fasheh <mark.fasheh@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-08-22JFFS2 locking regression fix.David Woodhouse
Commit a491486a2087ac3dfc00efb4f838c8d684afaf54 introduced a locking problem in JFFS2 -- we up() the alloc_sem when we weren't previously holding it. This leads to all kinds of fun behaviour later. There was a _reason_ for the if (1 /* alternative path needs testing */ || which the above-mentioned commit removed :) Discovered and debugged by Giulio Fedel <giulio.fedel@andorsystems.com> Signed-off-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-08-20Reset current->pdeath_signal on SUID binary execution (CVE-2007-3848)Marcel Holtmann
This fixes a vulnerability in the "parent process death signal" implementation discoverd by Wojciech Purczynski of COSEINC PTE Ltd. and iSEC Security Research. http://marc.info/?l=bugtraq&m=118711306802632&w=2 Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-08-15direct-io: fix error-path crashesBadari Pulavarty
Need to initialize map_bh.b_state to zero. Otherwise, in case of a faulty user-buffer its possible to go into dio_zero_block() and submit a page by mistake - since it checks for buffer_new(). http://marc.info/?l=linux-kernel&m=118551339032528&w=2 akpm: Linus had a (better) patch to just do a kzalloc() in there, but it got lost. Probably this version is better for -stable anwyay. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Acked-by: Joe Jin <joe.jin@oracle.com> Acked-by: Zach Brown <zach.brown@oracle.com> Cc: gurudas pai <gurudas.pai@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-08-09sysfs: release mutex when kmalloc() failed in sysfs_open_file().YOSHIFUJI Hideaki
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-08-09jbd2 commit: fix transaction droppingJan Kara
We have to check that also the second checkpoint list is non-empty before dropping the transaction. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Chuck Ebbert <cebbert@redhat.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-08-09jbd commit: fix transaction droppingJan Kara
We have to check that also the second checkpoint list is non-empty before dropping the transaction. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Chuck Ebbert <cebbert@redhat.com> Cc: Kirill Korotaev <dev@openvz.org> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-08-09fs: 9p/conv.c error path fixMariusz Kozlowski
When buf_check_overflow() returns != 0 we will hit kfree(ERR_PTR(err)) and it will not be happy about it. Signed-off-by: Mariusz Kozlowski <m.kozlowski@tuxland.pl> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-08-09nfsd: fix possible read-ahead cache and export table corruptionJ. Bruce Fields
The value of nperbucket calculated here is too small--we should be rounding up instead of down--with the result that the index j in the following loop can overflow the raparm_hash array. At least in my case, the next thing in memory turns out to be export_table, so the symptoms I see are crashes caused by the appearance of four zeroed-out export entries in the first bucket of the hash table of exports (which were actually entries in the readahead cache, a pointer to which had been written to the export table in this initialization code). It looks like the bug was probably introduced with commit fce1456a19f5c08b688c29f00ef90fdfa074c79b ("knfsd: make the readahead params cache SMP-friendly"). Cc: Greg Banks <gnb@melbourne.sgi.com> Signed-off-by: "J. Bruce Fields" <bfields@citi.umich.edu> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-08-09eCryptfs: ecryptfs_setattr() bugfixMichael Halcrow
There is another bug recently introduced into the ecryptfs_setattr() function in 2.6.22. eCryptfs will attempt to treat special files like regular eCryptfs files on chmod, chown, and so forth. This leads to a NULL pointer dereference. This patch validates that the file is a regular file before proceeding with operations related to the inode's crypt_stat. Thanks to Ryusuke Konishi for finding this bug and suggesting the fix. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-08-09"ext4_ext_put_in_cache" uses __u32 to receive physical block numberMingming Cao
Yan Zheng wrote: > I think I found a bug in ext4/extents.c, "ext4_ext_put_in_cache" uses > "__u32" to receive physical block number. "ext4_ext_put_in_cache" is > used in "ext4_ext_get_blocks", it sets ext4 inode's extent cache > according most recently tree lookup (higher 16 bits of saved physical > block number are always zero). when serving a mapping request, > "ext4_ext_get_blocks" first check whether the logical block is in > inode's extent cache. if the logical block is in the cache and the > cached region isn't a gap, "ext4_ext_get_blocks" gets physical block > number by using cached region's physical block number and offset in > the cached region. as described above, "ext4_ext_get_blocks" may > return wrong result when there are physical block numbers bigger than > 0xffffffff. > You are right. Thanks for reporting this! Signed-off-by: Mingming Cao <cmm@us.ibm.com> Cc: Yan Zheng <yanzheng@21cn.com> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-08-09splice: fix double page unlockJens Axboe
If add_to_page_cache_lru() fails, the page will not be locked. But splice jumps to an error path that does a page release and unlock, causing a BUG() in unlock_page(). Fix this by adding one more label that just releases the page. This bug was actually triggered on EL5 by gurudas pai <gurudas.pai@oracle.com> using fio. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-08-09make timerfd return a u64 and fix the __put_userDavide Libenzi
Davi fixed a missing cast in the __put_user(), that was making timerfd return a single byte instead of the full value. Talking with Michael about the timerfd man page, we think it'd be better to use a u64 for the returned value, to align it with the eventfd implementation. This is an ABI change. The timerfd code is new in 2.6.22 and if we merge this into 2.6.23 then we should also merge it into 2.6.22.x. That will leave a few early 2.6.22 kernels out in the wild which might misbehave when a future timerfd-enabled glibc is run on them. mtk says: The difference would be that read() will only return 4 bytes, while the application will expect 8. If the application is checking the size of returned value, as it should, then it will be able to detect the problem (it could even be sophisticated enough to know that if this is a 4-byte return, then it is running on an old 2.6.22 kernel). If the application is not checking the return from read(), then its 8-byte buffer will not be filled -- the contents of the last 4 bytes will be undefined, so the u64 value as a whole will be junk. When I wrote up that description above, I forgot a crucial detail. The above description described the difference between the new behavior implemented by the patch, and the current (i.e., 2.6.22) *intended* behavior. However, as I originally remarked to Davide, the 2.6.22 read() behavior is broken: it should return 4 bytes on a read(), but as originally implemented, only the least significant byte contained valid information. (In other words, the top 3 bytes of overrun information were simply being discarded.) So the patch both fixes a bug in the originally intended behavior, and changes the intended behavior (to return 8 bytes from a read() instead of 4). Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Cc: Davi Arnaut <davi@haxent.com.br> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-07-08Fix permission checking for the new utimensat() system callLinus Torvalds
Commit 1c710c896eb461895d3c399e15bb5f20b39c9073 added the utimensat() system call, but didn't handle the case of checking for the writability of the target right, when the target was a file descriptor, not a filename. We cannot use vfs_permission(MAY_WRITE) for that case, and need to simply check whether the file descriptor is writable. The oops from using the wrong function was noticed and narrowed down by Markus Trippelsdorf. Cc: Ulrich Drepper <drepper@redhat.com> Cc: Markus Trippelsdorf <markus@trippelsdorf.de> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Al Viro <viro@ftp.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-07DLM must depend on SYSFSAdrian Bunk
The dependency of DLM on SYSFS got lost in commit 6ed7257b46709e87d79ac2b6b819b7e0c9184998 resulting in the following compile error with CONFIG_DLM=y, CONFIG_SYSFS=n: <-- snip --> ... LD .tmp_vmlinux1 fs/built-in.o: In function `dlm_lockspace_init': /home/bunk/linux/kernel-2.6/linux-2.6.22-rc6-mm1/fs/dlm/lockspace.c:231: undefined reference to `kernel_subsys' fs/built-in.o: In function `configfs_init': /home/bunk/linux/kernel-2.6/linux-2.6.22-rc6-mm1/fs/configfs/mount.c:143: undefined reference to `kernel_subsys' make[1]: *** [.tmp_vmlinux1] Error 1 <-- snip --> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-06Fix elf_core_dump() when writing arch specific notes (spu coredumps)Michael Ellerman
elf_core_dump() supports dumping arch specific ELF notes, via the #define ELF_CORE_WRITE_EXTRA_NOTES. Currently the only user of this is the powerpc spu coredump code. There is a bug in the handling of foffset WRT the arch notes, which causes us to erroneously increment foffset by the size of the arch notes, leaving a block of zeroes in the file, and causing all subsequent data in the file to be at <supposed position> + <arch note size>. eg: LOAD 0x050000 0x00100000 0x00000000 0x20000 0x20000 R E 0x10000 Tells us we should have a chunk of data at 0x50000. The truth is the data is at 0x90dbc = 0x50000 + 0x40dbc (the size of the arch notes). This bug prevents gdb from reading the core file correctly. The simplest fix is to simply remember the size of the arch notes, and add it to foffset after we've written the arch notes. The only drawback is that if the arch code doesn't write as many bytes as it said it would, we end up with a broken core dump again. For now I think that's a reasonable requirement. Tested on a Cell blade, gdb no longer complains about the core file being bogus. While I'm here I should point out that the spu coredump code does not work if we're dumping to a pipe - we'll have to wait for 23 to fix that. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-04[JFFS2] Fix readinode failure when read_dnode() detects CRC failure.David Woodhouse
We should have stopped returning 1 from read_dnode() to indicate failure. We can just mark the damn thing obsolete immediately. But I missed a case where we don't. Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2007-07-03dio: remove bogus refcounting BUG_ONZach Brown
Badari Pulavarty reported a case of this BUG_ON is triggering during testing. It's completely bogus and should be removed. It's trying to notice if we left references to the dio hanging around in the sync case. They should have been dropped as IO completed while this path was in dio_await_completion(). This condition will also be checked, via some twisty logic, by the BUG_ON(ret != -EIOCBQUEUED) a few lines lower. So to start this BUG_ON() is redundant. More fatally, it's dereferencing dio-> after having dropped its reference. It's only safe to dereference the dio after releasing the lock if the final reference was just dropped. Another CPU might free the dio in bio completion and reuse the memory after this path drops the dio lock but before the BUG_ON() is evaluated. This patch passed aio+dio regression unit tests and aio-stress on ext3. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-28Introduce fixed sys_sync_file_range2() syscall, implement on PowerPC and ARMDavid Woodhouse
Not all the world is an i386. Many architectures need 64-bit arguments to be aligned in suitable pairs of registers, and the original sys_sync_file_range(int, loff_t, loff_t, int) was therefore wasting an argument register for padding after the first integer. Since we don't normally have more than 6 arguments for system calls, that left no room for the final argument on some architectures. Fix this by introducing sys_sync_file_range2(int, int, loff_t, loff_t) which all fits nicely. In fact, ARM already had that, but called it sys_arm_sync_file_range. Move it to fs/sync.c and rename it, then implement the needed compatibility routine. And stop the missing syscall check from bitching about the absence of sys_sync_file_range() if we've implemented sys_sync_file_range2() instead. Tested on PPC32 and with 32-bit and 64-bit userspace on PPC64. Signed-off-by: David Woodhouse <dwmw2@infradead.org> Acked-by: Russell King <rmk+kernel@arm.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-28ext2: fix return of uninitialised variableAndrew Morton
gcc correctly says fs/ext2/super.c: In function 'ext2_remount': fs/ext2/super.c:1055: warning: 'err' may be used uninitialized in this function Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-28avoid spurious POLLIN returns in signalfdDavide Libenzi
The new code in kernel/signal.c does not allow fetching private signals from another task. This patch avoid spurious POLLIN returns from a signalfd poll(2) operation. Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-28zero out last page for llseek/writeMichael Halcrow
When one llseek's past the end of the file and then writes, every page past the previous end of the file should be cleared. Trevor found that the code, as is, does not assure that the very last page is always cleared. This patch takes care of that. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-28eCryptfs: initialize crypt_stat in setattrMichael Halcrow
Recent changes in eCryptfs have made it possible to get to ecryptfs_setattr() with an uninitialized crypt_stat struct. This results in a wide and colorful variety of unpleasantries. This patch properly initializes the crypt_stat structure in ecryptfs_setattr() when it is necessary to do so. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-28eCryptfs: fix write zeros behaviorMichael Halcrow
This patch fixes the processes involved in wiping regions of the data during truncate and write events, fixing a kernel hang in 2.6.22-rc4 while assuring that zero values are written out to the appropriate locations during events in which the i_size will change. The range passed to ecryptfs_truncate() from ecryptfs_prepare_write() includes the page that is the object of ecryptfs_prepare_write(). This leads to a kernel hang as read_cache_page() is executed on the same page in the ecryptfs_truncate() execution path. This patch remedies this by limiting the range passed to ecryptfs_truncate() so as to exclude the page that is the object of ecryptfs_prepare_write(); it also adds code to ecryptfs_prepare_write() to zero out the region of its own page when writing past the i_size position. This patch also modifies ecryptfs_truncate() so that when a file is truncated to a smaller size, eCryptfs will zero out the contents of the new last page from the new size through to the end of the last page. Signed-off-by: Michael Halcrow <mhalcrow@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-24ext4: lost brelse in ext4_read_inode()Kirill Korotaev
One of error path in ext4_read_inode() leaks bh since brelse is forgoten. Signed-off-by: Kirill Korotaev <dev@openvz.org> Acked-by: Vasily Averin <vvs@sw.ru> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-24ext3: lost brelse in ext3_read_inode()Kirill Korotaev
One of error path in ext3_read_inode() leaks bh since brelse is forgoten. Signed-off-by: Kirill Korotaev <dev@openvz.org> Acked-by: Vasily Averin <vvs@sw.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-24ext2: disallow setting xip on remountCarsten Otte
Yan Zheng pointed out that ext2_remount lacks checking if -o xip should be enabled or not. This patch checks for presence of direct_access on the backing block device and if the blocksize meets the requirements. Signed-off-by: Carsten Otte <cotte@de.ibm.com> Cc: Yan Zheng <yanzheng@21cn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-19[XFS] s/memclear_highpage_flush/zero_user_page/Christoph Hellwig
SGI-PV: 957103 SGI-Modid: xfs-linux-melb:xfs-kern:28678a Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Tim Shimmin <tes@sgi.com>
2007-06-16shm: fix the filename of hugetlb sysv shared memoryEric W. Biederman
Some user space tools need to identify SYSV shared memory when examining /proc/<pid>/maps. To do so they look for a block device with major zero, a dentry named SYSV<sysv key>, and having the minor of the internal sysv shared memory kernel mount. To help these tools and to make it easier for people just browsing /proc/<pid>/maps this patch modifies hugetlb sysv shared memory to use the SYSV<key> dentry naming convention. User space tools will still have to be aware that hugetlb sysv shared memory lives on a different internal kernel mount and so has a different block device minor number from the rest of sysv shared memory. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Albert Cahalan <acahalan@gmail.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-16udf: fix possible leakage of blocksJan Kara
We have to take care that when we call udf_discard_prealloc() from udf_clear_inode() we have to write inode ourselves afterwards (otherwise, some changes might be lost leading to leakage of blocks, use of free blocks or improperly aligned extents). Also udf_discard_prealloc() does two different things - it removes preallocated blocks and truncates the last extent to exactly match i_size. We move the latter functionality to udf_truncate_tail_extent(), call udf_discard_prealloc() when last reference to a file is dropped and call udf_truncate_tail_extent() when inode is being removed from inode cache (udf_clear_inode() call). We cannot call udf_truncate_tail_extent() earlier as subsequent open+write would find the last block of the file mapped and happily write to the end of it, although the last extent says it's shorter. [akpm@linux-foundation.org: Make checkpatch.pl happier] Signed-off-by: Jan Kara <jack@suse.cz> Cc: Eric Sandeen <sandeen@sandeen.net> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-16fuse: ->fs_flags fixletAlexey Dobriyan
fs/fuse/inode.c:658:3: error: Initializer entry defined twice fs/fuse/inode.c:661:3: also defined here Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-06-15splice: only check do_wakeup in splice_to_pipe() for a real pipeJens Axboe
We only ever set do_wakeup to non-zero if the pipe has an inode backing, so it's pointless to check outside the pipe->inode check. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-06-15splice: fix leak of pages on short splice to pipeJens Axboe
If the destination pipe is full and we already transferred data, we break out instead of waiting for more pipe room. The exit logic looks at spd->nr_pages to see if we moved everything inside the spd container, but we decrement that variable in the loop to decide when spd has emptied. Instead we want to compare to the original page count in the spd, so cache that in a local variable. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2007-06-15splice: adjust balance_dirty_pages_ratelimited() callJens Axboe
As we have potentially dirtied more than 1 page, we should indicate as such to the dirty page balancing. So call balance_dirty_pages_ratelimited_nr() and pass in the approximate number of pages we dirtied. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>