summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2009-01-18mm lockless pagecache barrier fixNick Piggin
commit e8c82c2e23e3527e0c9dc195e432c16784d270fa upstream. An XFS workload showed up a bug in the lockless pagecache patch. Basically it would go into an "infinite" loop, although it would sometimes be able to break out of the loop! The reason is a missing compiler barrier in the "increment reference count unless it was zero" case of the lockless pagecache protocol in the gang lookup functions. This would cause the compiler to use a cached value of struct page pointer to retry the operation with, rather than reload it. So the page might have been removed from pagecache and freed (refcount==0) but the lookup would not correctly notice the page is no longer in pagecache, and keep attempting to increment the refcount and failing, until the page gets reallocated for something else. This isn't a data corruption because the condition will be detected if the page has been reallocated. However it can result in a lockup. Linus points out that ACCESS_ONCE is also required in that pointer load, even if it's absence is not causing a bug on our particular build. The most general way to solve this is just to put an rcu_dereference in radix_tree_deref_slot. Assembly of find_get_pages, before: .L220: movq (%rbx), %rax #* ivtmp.1162, tmp82 movq (%rax), %rdi #, prephitmp.1149 .L218: testb $1, %dil #, prephitmp.1149 jne .L217 #, testq %rdi, %rdi # prephitmp.1149 je .L203 #, cmpq $-1, %rdi #, prephitmp.1149 je .L217 #, movl 8(%rdi), %esi # <variable>._count.counter, c testl %esi, %esi # c je .L218 #, after: .L212: movq (%rbx), %rax #* ivtmp.1109, tmp81 movq (%rax), %rdi #, ret testb $1, %dil #, ret jne .L211 #, testq %rdi, %rdi # ret je .L197 #, cmpq $-1, %rdi #, ret je .L211 #, movl 8(%rdi), %esi # <variable>._count.counter, c testl %esi, %esi # c je .L212 #, (notice the obvious infinite loop in the first example, if page->count remains 0) Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18mm: fix assertionNick Piggin
commit 18e6959c385f3edf3991fa6662a53dac4eb10d5b upstream. This assertion is incorrect for lockless pagecache. By definition if we have an unpinned page that we are trying to take a speculative reference to, it may become the tail of a compound page at any time (if it is freed, then reallocated as a compound page). It was still a valid assertion for the vmscan.c LRU isolation case, but it doesn't seem incredibly helpful... if somebody wants it, they can put it back directly where it applies in the vmscan code. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18fs: symlink write_begin allocation context fixNick Piggin
commit 54566b2c1594c2326a645a3551f9d989f7ba3c5e upstream. With the write_begin/write_end aops, page_symlink was broken because it could no longer pass a GFP_NOFS type mask into the point where the allocations happened. They are done in write_begin, which would always assume that the filesystem can be entered from reclaim. This bug could cause filesystem deadlocks. The funny thing with having a gfp_t mask there is that it doesn't really allow the caller to arbitrarily tinker with the context in which it can be called. It couldn't ever be GFP_ATOMIC, for example, because it needs to take the page lock. The only thing any callers care about is __GFP_FS anyway, so turn that into a single flag. Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on this flag in their write_begin function. Change __grab_cache_page to accept a nofs argument as well, to honour that flag (while we're there, change the name to grab_cache_page_write_begin which is more instructive and does away with random leading underscores). This is really a more flexible way to go in the end anyway -- if a filesystem happens to want any extra allocations aside from the pagecache ones in ints write_begin function, it may now use GFP_KERNEL (rather than GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a random example). [kosaki.motohiro@jp.fujitsu.com: fix ubifs] [kosaki.motohiro@jp.fujitsu.com: fix fuse] Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Cleaned up the calling convention: just pass in the AOP flags untouched to the grab_cache_page_write_begin() function. That just simplifies everybody, and may even allow future expansion of the logic. - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 33Heiko Carstens
commit 2b66421995d2e93c9d1a0111acf2581f8529c6e5 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 32Heiko Carstens
commit d4e82042c4cfa87a7d51710b71f568fe80132551 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18powerpc: Enable syscall wrappers for 64-bitBenjamin Herrenschmidt
commit ee6a093222549ac0c72cfd296c69fa5e7d6daa34 upstream. This enables the use of syscall wrappers to do proper sign extension for 64-bit programs. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrapper infrastructureHeiko Carstens
commit 1a94bc34768e463a93cb3751819709ab0ea80a01 upstream. From: Martin Schwidefsky <schwidefsky@de.ibm.com> By selecting HAVE_SYSCALL_WRAPPERS architectures can activate system call wrappers in order to sign extend system call arguments. All architectures where the ABI defines that the caller of a function has to perform sign extension probably need this. Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18Rename old_readdir to sys_old_readdirHeiko Carstens
commit e55380edf68796d75bf41391a781c68ee678587d upstream. This way it matches the generic system call name convention. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18Convert all system calls to return a longHeiko Carstens
commit 2ed7c03ec17779afb4fcfa3b8c61df61bd4879ba upstream. Convert all system calls to return a long. This should be a NOP since all converted types should have the same size anyway. With the exception of sys_exit_group which returned void. But that doesn't matter since the system call doesn't return. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18Move compat system call declarations to compat header fileHeiko Carstens
commit 4c696ba7982501d43dea11dbbaabd2aa8a19cc42 upstream. Move declarations to correct header file. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18inotify: fix type errors in interfacesMichael Kerrisk
commit 4ae8978cf92a96257cd8998a49e781be83571d64 upstream. The problems lie in the types used for some inotify interfaces, both at the kernel level and at the glibc level. This mail addresses the kernel problem. I will follow up with some suggestions for glibc changes. For the sys_inotify_rm_watch() interface, the type of the 'wd' argument is currently 'u32', it should be '__s32' . That is Robert's suggestion, and is consistent with the other declarations of watch descriptors in the kernel source, in particular, the inotify_event structure in include/linux/inotify.h: struct inotify_event { __s32 wd; /* watch descriptor */ __u32 mask; /* watch mask */ __u32 cookie; /* cookie to synchronize two events */ __u32 len; /* length (including nulls) of name */ char name[0]; /* stub for possible name */ }; The patch makes the changes needed for inotify_rm_watch(). Signed-off-by: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Robert Love <rlove@google.com> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18sched_clock: prevent scd->clock from moving backwards, take #2Thomas Gleixner
commit 1c5745aa380efb6417b5681104b007c8612fb496 upstream. Redo: 5b7dba4: sched_clock: prevent scd->clock from moving backwards which had to be reverted due to s2ram hangs: ca7e716: Revert "sched_clock: prevent scd->clock from moving backwards" ... this time with resume restoring GTOD later in the sequence taken into account as well. The "timekeeping_suspended" flag is not very nice but we cannot call into GTOD before it has been properly resumed and the scheduler will run very early in the resume sequence. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-14parisc: disable UP-optimized flush_tlb_mmKyle McMartin
commit 5289f46b9de04bde181d833d48df9671b69c4b08 upstream. flush_tlb_mm's "optimized" uniprocessor case of allocating a new context for userspace is exposing a race where we can suddely return to a syscall with the protection id and space id out of sync, trapping on the next userspace access. Debugged-by: James Bottomley <James.Bottomley@HansenPartnership.com> Tested-by: Helge Deller <deller@gmx.de> Signed-off-by: Kyle McMartin <kyle@mcmartin.ca> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-18can: Fix CAN_(EFF|RTR)_FLAG handling in can_filterOliver Hartkopp
commit d253eee20195b25e298bf162a6e72f14bf4803e5 upstream. Due to a wrong safety check in af_can.c it was not possible to filter for SFF frames with a specific CAN identifier without getting the same selected CAN identifier from a received EFF frame also. This fix has a minimum (but user visible) impact on the CAN filter API and therefore the CAN version is set to a new date. Indeed the 'old' API is still working as-is. But when now setting CAN_(EFF|RTR)_FLAG in can_filter.can_mask you might get less traffic than before - but still the stuff that you expected to get for your defined filter ... Thanks to Kurt Van Dijck for pointing at this issue and for the review. Signed-off-by: Oliver Hartkopp <oliver@hartkopp.net> Acked-by: Kurt Van Dijck <kurt.van.dijck@eia.be> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-18x86 Fix VMI crash on boot in 2.6.28-rc8Zachary Amsden
commit ae8d04e2ecbb233926860e9ce145eac19c7835dc upstream. VMI initialiation can relocate the fixmap, causing early_ioremap to malfunction if it is initialized before the relocation. To fix this, VMI activation is split into two phases; the detection, which must happen before setting up ioremap, and the activation, which must happen after parsing early boot parameters. This fixes a crash on boot when VMI is enabled under VMware. Signed-off-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-13pnp: make the resource type an unsigned longRene Herman
commit b563cf59c4d67da7d671788a9848416bfa4180ab upstream. PnP encodes the resource type directly as its struct resource->flags value which is an unsigned long. Make it so... Signed-off-by: Rene Herman <rene.herman@gmail.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Acked-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-13Allow recursion in binfmt_script and binfmt_miscKirill A. Shutemov
commit bf2a9a39639b8b51377905397a5005f444e9a892 upstream. binfmt_script and binfmt_misc disallow recursion to avoid stack overflow using sh_bang and misc_bang. It causes problem in some cases: $ echo '#!/bin/ls' > /tmp/t0 $ echo '#!/tmp/t0' > /tmp/t1 $ echo '#!/tmp/t1' > /tmp/t2 $ chmod +x /tmp/t* $ /tmp/t2 zsh: exec format error: /tmp/t2 Similar problem with binfmt_misc. This patch introduces field 'recursion_depth' into struct linux_binprm to track recursion level in binfmt_misc and binfmt_script. If recursion level more then BINPRM_MAX_RECURSION it generates -ENOEXEC. [akpm@linux-foundation.org: make linux_binprm.recursion_depth a uint] Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-13jbd: fix error handling for checkpoint ioHidehiro Kawai
commit 4afe978530702c934dfdb11f54073136818b2119 upstream. When a checkpointing IO fails, current JBD code doesn't check the error and continue journaling. This means latest metadata can be lost from both the journal and filesystem. This patch leaves the failed metadata blocks in the journal space and aborts journaling in the case of log_do_checkpoint(). To achieve this, we need to do: 1. don't remove the failed buffer from the checkpoint list where in the case of __try_to_free_cp_buf() because it may be released or overwritten by a later transaction 2. log_do_checkpoint() is the last chance, remove the failed buffer from the checkpoint list and abort the journal 3. when checkpointing fails, don't update the journal super block to prevent the journaled contents from being cleaned. For safety, don't update j_tail and j_tail_sequence either 4. when checkpointing fails, notify this error to the ext3 layer so that ext3 don't clear the needs_recovery flag, otherwise the journaled contents are ignored and cleaned in the recovery phase 5. if the recovery fails, keep the needs_recovery flag 6. prevent cleanup_journal_tail() from being called between __journal_drop_transaction() and journal_abort() (a race issue between journal_flush() and __log_wait_for_space() Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Jan Kara <jack@suse.cz> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-13Enforce a minimum SG_IO timeoutLinus Torvalds
commit f2f1fa78a155524b849edf359e42a3001ea652c0 upstream. There's no point in having too short SG_IO timeouts, since if the command does end up timing out, we'll end up through the reset sequence that is several seconds long in order to abort the command that timed out. As a result, shorter timeouts than a few seconds simply do not make sense, as the recovery would be longer than the timeout itself. Add a BLK_MIN_SG_TIMEOUT to match the existign BLK_DEFAULT_SG_TIMEOUT. Suggested-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Jens Axboe <jens.axboe@oracle.com> Cc: Jeff Garzik <jeff@garzik.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05jbd2: fix /proc setup for devices that contain '/' in their namesTheodore Ts'o
trimed down version of commit 05496769e5da83ce22ed97345afd9c7b71d6bd24 upstream. Some devices such as "cciss/c0d0p9" will cause jbd2 setup and teardown failures when /proc filenames are created with embedded slashes. This is a slimmed down version of commit 05496769, with the stack reduction aspects of the patch omitted to meet the -stable criteria. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05net: Fix soft lockups/OOM issues w/ unix garbage collector (CVE-2008-5300)dann frazier
commit 5f23b734963ec7eaa3ebcd9050da0c9b7d143dd3 upstream. This is an implementation of David Miller's suggested fix in: https://bugzilla.redhat.com/show_bug.cgi?id=470201 It has been updated to use wait_event() instead of wait_event_interruptible(). Paraphrasing the description from the above report, it makes sendmsg() block while UNIX garbage collection is in progress. This avoids a situation where child processes continue to queue new FDs over a AF_UNIX socket to a parent which is in the exit path and running garbage collection on these FDs. This contention can result in soft lockups and oom-killing of unrelated processes. Signed-off-by: dann frazier <dannf@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05libata: blacklist Seagate drives which time out FLUSH_CACHE when used with NCQTejun Heo
commit ac70a964b0e22a95af3628c344815857a01461b7 upstream. Some recent Seagate harddrives have firmware bug which causes FLUSH CACHE to timeout under certain circumstances if NCQ is being used. This can be worked around by disabling NCQ and fixed by updating the firmware. Implement ATA_HORKAGE_FIRMWARE_UPDATE and blacklist these devices. The wiki page has been updated to contain information on this issue. http://ata.wiki.kernel.org/index.php/Known_issues Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jeff Garzik <jgarzik@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05x86: Hibernate: Fix breakage on x86_32 with CONFIG_NUMA setRafael J. Wysocki
backport of commit 97a70e548bd97d5a46ae9d44f24aafcc013fd701 to the 2.6.27 kernel. The NUMA code on x86_32 creates special memory mapping that allows each node's pgdat to be located in this node's memory. For this purpose it allocates a memory area at the end of each node's memory and maps this area so that it is accessible with virtual addresses belonging to low memory. As a result, if there is high memory, these NUMA-allocated areas are physically located in high memory, although they are mapped to low memory addresses. Our hibernation code does not take that into account and for this reason hibernation fails on all x86_32 systems with CONFIG_NUMA=y and with high memory present. Fix this by adding a special mapping for the NUMA-allocated memory areas to the temporary page tables created during the last phase of resume. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05PCI: Hotplug core: remove 'name'Alex Chiang
commit 58319b802a614f10f1b5238fbde7a4b2e9a60069 upstream. Now that the PCI core manages the 'name' for each individual hotplug driver, and all drivers (except rpaphp) have been converted to use hotplug_slot_name(), there is no need for the PCI hotplug core to drag around its own copy of name either. Cc: kristen.c.accardi@intel.com Cc: matthew@wil.cx Acked-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com> Signed-off-by: Alex Chiang <achiang@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05PCI, PCI Hotplug: introduce slot_name helpersAlex Chiang
commit 0ad772ec464d3fcf9d210836b97e654f393606c4 upstream In preparation for cleaning up the various hotplug drivers such that they don't have to manage their own 'name' parameters anymore, we provide the following convenience functions: pci_slot_name() hotplug_slot_name() These helpers will be used by individual hotplug drivers. Cc: kristen.c.accardi@intel.com Cc: matthew@wil.cx Acked-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com> Signed-off-by: Alex Chiang <achiang@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05PCI: update pci_create_slot() to take a 'hotplug' paramAlex Chiang
commit 828f37683e6d3ab5912989df0d04201db7ad798e upstream. Slot detection drivers can co-exist with hotplug drivers. The names of the detected/claimed slots may be different depending on module load order. For legacy reasons, we need to allow hotplug drivers to override the slot name if a detection driver is loaded first (and they find the same slots). Creating and overriding slot names should be an atomic operation, otherwise you get a locking nightmare as various drivers race to call pci_create_slot(). pci_create_slot() is already serialized by grabbing the pci_bus_sem. We update the API and add a 'hotplug' param, which is: set if the caller is a hotplug driver NULL if the caller is a detection driver pci_create_slot() does not actually use the 'hotplug' parameter in this patch. A later patch will add the logic that uses it. Cc: kristen.c.accardi@intel.com Cc: matthew@wil.cx Acked-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com> Signed-off-by: Alex Chiang <achiang@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05PCI Hotplug core: add 'name' param pci_hp_register interfaceAlex Chiang
commit 1359f2701b96abd9bb69c1273fb995a093b6409a upstream. Update pci_hp_register() to take a const char *name parameter. The motivation for this is to clean up the individual hotplug drivers so that each one does not have to manage its own name. The PCI core should be the place where we manage the name. We update the interface and all callsites first, in a "no functional change" manner, and clean up the drivers later. Cc: kristen.c.accardi@intel.com Acked-by: Kenji Kaneshige <kaneshige.kenji@jp.fujitsu.com> Reviewed-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Alex Chiang <achiang@hp.com> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05x86: always define DECLARE_PCI_UNMAP* macrosJoerg Roedel
commit b627c8b17ccacba38c975bc0f69a49fc4e5261c9 upstream. Impact: fix boot crash on AMD IOMMU if CONFIG_GART_IOMMU is off Currently these macros evaluate to a no-op except the kernel is compiled with GART or Calgary support. But we also need these macros when we have SWIOTLB, VT-d or AMD IOMMU in the kernel. Since we always compile at least with SWIOTLB we can define these macros always. This patch is also for stable backport for the same reason the SWIOTLB default selection patch is. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05lib/idr.c: fix rcu related race with idr_findManfred Spraul
commit 6ff2d39b91aec3dcae951afa982059e3dd9b49dc upstream. 2nd part of the fixes needed for http://bugzilla.kernel.org/show_bug.cgi?id=11796. When the idr tree is either grown or shrunk, then the update to the number of layers and the top pointer were not atomic. This race caused crashes. The attached patch fixes that by replicating the layers counter in each layer, thus idr_find doesn't need idp->layers anymore. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Clement Calmels <cboulte@gmail.com> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Cc: Pierre Peiffer <peifferp@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05Fix inotify watch removal/umount racesAl Viro
commit 8f7b0ba1c853919b85b54774775f567f30006107 upstream. Inotify watch removals suck violently. To kick the watch out we need (in this order) inode->inotify_mutex and ih->mutex. That's fine if we have a hold on inode; however, for all other cases we need to make damn sure we don't race with umount. We can *NOT* just grab a reference to a watch - inotify_unmount_inodes() will happily sail past it and we'll end with reference to inode potentially outliving its superblock. Ideally we just want to grab an active reference to superblock if we can; that will make sure we won't go into inotify_umount_inodes() until we are done. Cleanup is just deactivate_super(). However, that leaves a messy case - what if we *are* racing with umount() and active references to superblock can't be acquired anymore? We can bump ->s_count, grab ->s_umount, which will almost certainly wait until the superblock is shut down and the watch in question is pining for fjords. That's fine, but there is a problem - we might have hit the window between ->s_active getting to 0 / ->s_count - below S_BIAS (i.e. the moment when superblock is past the point of no return and is heading for shutdown) and the moment when deactivate_super() acquires ->s_umount. We could just do drop_super() yield() and retry, but that's rather antisocial and this stuff is luser-triggerable. OTOH, having grabbed ->s_umount and having found that we'd got there first (i.e. that ->s_root is non-NULL) we know that we won't race with inotify_umount_inodes(). So we could grab a reference to watch and do the rest as above, just with drop_super() instead of deactivate_super(), right? Wrong. We had to drop ih->mutex before we could grab ->s_umount. So the watch could've been gone already. That still can be dealt with - we need to save watch->wd, do idr_find() and compare its result with our pointer. If they match, we either have the damn thing still alive or we'd lost not one but two races at once, the watch had been killed and a new one got created with the same ->wd at the same address. That couldn't have happened in inotify_destroy(), but inotify_rm_wd() could run into that. Still, "new one got created" is not a problem - we have every right to kill it or leave it alone, whatever's more convenient. So we can use idr_find(...) == watch && watch->inode->i_sb == sb as "grab it and kill it" check. If it's been our original watch, we are fine, if it's a newcomer - nevermind, just pretend that we'd won the race and kill the fscker anyway; we are safe since we know that its superblock won't be going away. And yes, this is far beyond mere "not very pretty"; so's the entire concept of inotify to start with. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Greg KH <greg@kroah.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-05epoll: introduce resource usage limitsDavide Libenzi
commit 7ef9964e6d1b911b78709f144000aacadd0ebc21 upstream. It has been thought that the per-user file descriptors limit would also limit the resources that a normal user can request via the epoll interface. Vegard Nossum reported a very simple program (a modified version attached) that can make a normal user to request a pretty large amount of kernel memory, well within the its maximum number of fds. To solve such problem, default limits are now imposed, and /proc based configuration has been introduced. A new directory has been created, named /proc/sys/fs/epoll/ and inside there, there are two configuration points: max_user_instances = Maximum number of devices - per user max_user_watches = Maximum number of "watched" fds - per user The current default for "max_user_watches" limits the memory used by epoll to store "watches", to 1/32 of the amount of the low RAM. As example, a 256MB 32bit machine, will have "max_user_watches" set to roughly 90000. That should be enough to not break existing heavy epoll users. The default value for "max_user_instances" is set to 128, that should be enough too. This also changes the userspace, because a new error code can now come out from EPOLL_CTL_ADD (-ENOSPC). The EMFILE from epoll_create() was already listed, so that should be ok. [akpm@linux-foundation.org: use get_current_user()] Signed-off-by: Davide Libenzi <davidel@xmailserver.org> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Reported-by: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-11-20USB: don't register endpoints for interfaces that are going awayAlan Stern
commit 352d026338378b1f13f044e33c1047da6e470056 upstream. This patch (as1155) fixes a bug in usbcore. When interfaces are deleted, either because the device was disconnected or because of a configuration change, the extra attribute files and child endpoint devices may get left behind. This is because the core removes them before calling device_del(). But during device_del(), after the driver is unbound the core will reinstall altsetting 0 and recreate those extra attributes and children. The patch prevents this by adding a flag to record when the interface is in the midst of being unregistered. When the flag is set, the attribute files and child devices will not be created. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-11-20block: fix nr_phys_segments miscalculation bugFUJITA Tomonori
commit 8677142710516d986d932d6f1fba7be8382c1fec upstream backported by Nikanth Karthikesan <knikanth@suse.de> to the 2.6.27.y tree. block: fix nr_phys_segments miscalculation bug This fixes the bug reported by Nikanth Karthikesan <knikanth@suse.de>: http://lkml.org/lkml/2008/10/2/203 The root cause of the bug is that blk_phys_contig_segment miscalculates q->max_segment_size. blk_phys_contig_segment checks: req->biotail->bi_size + next_req->bio->bi_size > q->max_segment_size But blk_recalc_rq_segments might expect that req->biotail and the previous bio in the req are supposed be merged into one segment. blk_recalc_rq_segments might also expect that next_req->bio and the next bio in the next_req are supposed be merged into one segment. In such case, we merge two requests that can't be merged here. Later, blk_rq_map_sg gives more segments than it should. We need to keep track of segment size in blk_recalc_rq_segments and use it to see if two requests can be merged. This patch implements it in the similar way that we used to do for hw merging (virtual merging). Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Cc: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-11-13Fix __pfn_to_page(pfn) for CONFIG_DISCONTIGMEM=yRafael J. Wysocki
commit c5d712433ff57a66d8fb79a57a4fc7a7c3467b97 upstream Fix the __pfn_to_page(pfn) macro so that it doesn't evaluate its argument twice in the CONFIG_DISCONTIGMEM=y case, because 'pfn' may be a result of a funtion call having side effects. For example, the hibernation code applies pfn_to_page(pfn) to the result of a function returning the pfn corresponding to the next set bit in a bitmap and the current bit position is modified on each call. This leads to "interesting" failures for CONFIG_DISCONTIGMEM=y due to the current behavior of __pfn_to_page(pfn). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-11-13net: unix: fix inflight counting bug in garbage collectorMiklos Szeredi
commit 6209344f5a3795d34b7f2c0061f49802283b6bdd upstream Previously I assumed that the receive queues of candidates don't change during the GC. This is only half true, nothing can be received from the queues (see comment in unix_gc()), but buffers could be added through the other half of the socket pair, which may still have file descriptors referring to it. This can result in inc_inflight_move_tail() erronously increasing the "inflight" counter for a unix socket for which dec_inflight() wasn't previously called. This in turn can trigger the "BUG_ON(total_refs < inflight_refs)" in a later garbage collection run. Fix this by only manipulating the "inflight" counter for sockets which are candidates themselves. Duplicating the file references in unix_attach_fds() is also needed to prevent a socket becoming a candidate for GC while the skb that contains it is not yet queued. Reported-by: Andrea Bittau <a.bittau@cs.ucl.ac.uk> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-11-13MTD: [NOR] Fix cfi_send_gen_cmd handling of x16 devices in x8 mode (v4)Eric W. Biederman
commit 467622ef2acb01986eab37ef96c3632b3ea35999 upstream For "unlock" cycles to 16bit devices in 8bit compatibility mode we need to use the byte addresses 0xaaa and 0x555. These effectively match the word address 0x555 and 0x2aa, except the latter has its low bit set. Most chips don't care about the value of the 'A-1' pin in x8 mode, but some -- like the ST M29W320D -- do. So we need to be careful to set it where appropriate. cfi_send_gen_cmd is only ever passed addresses where the low byte is 0x00, 0x55 or 0xaa. Of those, only addresses ending 0xaa are affected by this patch, by masking in the extra low bit when the device is known to be in compatibility mode. [dwmw2: Do it only when (cmd_ofs & 0xff) == 0xaa] v4: Fix stupid typo in cfi_build_cmd_addr that failed to compile I'm writing this patch way to late at night. v3: Bring all of the work back into cfi_build_cmd_addr including calling of map_bankwidth(map) and cfi_interleave(cfi) So every caller doesn't need to. v2: Only modified the address if we our device_type is larger than our bus width. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-11-07net: Fix recursive descent in __scm_destroy().David Miller
commit f8d570a4745835f2238a33b537218a1bb03fc671 and 3b53fbf4314594fa04544b02b2fc6e607912da18 upstream (because once wasn't good enough...) __scm_destroy() walks the list of file descriptors in the scm_fp_list pointed to by the scm_cookie argument. Those, in turn, can close sockets and invoke __scm_destroy() again. There is nothing which limits how deeply this can occur. The idea for how to fix this is from Linus. Basically, we do all of the fput()s at the top level by collecting all of the scm_fp_list objects hit by an fput(). Inside of the initial __scm_destroy() we keep running the list until it is empty. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-11-06math-emu: Fix signalling of underflow and inexact while packing result.Kumar Gala
[ Upstream commit 930cc144a043ff95e56b6888fa51c618b33f89e7 ] I'm trying to move the powerpc math-emu code to use the include/math-emu bits. In doing so I've been using TestFloat to see how good or bad we are doing. For the most part the current math-emu code that PPC uses has a number of issues that the code in include/math-emu seems to solve (plus bugs we've had for ever that no one every realized). Anyways, I've come across a case that we are flagging underflow and inexact because we think we have a denormalized result from a double precision divide: 000.FFFFFFFFFFFFF / 3FE.FFFFFFFFFFFFE soft: 001.0000000000000 ..... syst: 001.0000000000000 ...ux What it looks like is the results out of FP_DIV_D are: D: sign: 0 mantissa: 01000000 00000000 exp: -1023 (0) The problem seems like we aren't normalizing the result and bumping the exp. Now that I'm digging into this a bit I'm thinking my issue has to do with the fix DaveM put in place from back in Aug 2007 (commit 405849610fd96b4f34cd1875c4c033228fea6c0f): [MATH-EMU]: Fix underflow exception reporting. 2) we ended up rounding back up to normal (this is the case where we set the exponent to 1 and set the fraction to zero), this should set inexact too ... Another example, "0x0.0000000000001p-1022 / 16.0", should signal both inexact and underflow. The cpu implementations and ieee1754 literature is very clear about this. This is case #2 above. Here is the distilled glibc test case from Jakub Jelinek which prompted that commit: -------------------- #include <float.h> #include <fenv.h> #include <stdio.h> volatile double d = DBL_MIN; volatile double e = 0x0.0000000000001p-1022; volatile double f = 16.0; int main (void) { printf ("%x\n", fetestexcept (FE_UNDERFLOW)); d /= f; printf ("%x\n", fetestexcept (FE_UNDERFLOW)); e /= f; printf ("%x\n", fetestexcept (FE_UNDERFLOW)); return 0; } -------------------- It looks like the case I have we are exact before rounding, but think it looks like the rounding case since it appears as if "overflow is set". 000.FFFFFFFFFFFFF / 3FE.FFFFFFFFFFFFE = 001.0000000000000 I think the following adds the check for my case and still works for the issue your commit was trying to resolve. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-10-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/bart/ide-2.6: ide: workaround for bogus gcc warning in ide_sysfs_register_port() ide-cd: Optiarc DVD RW AD-7200A does play audio IDE: Fix platform device registration in Swarm IDE driver (v2) ide-dma: fix ide_build_dmatable() for TRM290 ide-cd: temporary tray close fix
2008-10-06[MIPS] IP27: Fix build errors if CONFIG_MAPPED_KERNEL=yRalf Baechle
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2008-10-05ide-cd: temporary tray close fixBorislav Petkov
This one fixes http://bugzilla.kernel.org/show_bug.cgi?id=11602. A more generic fix for drives which cannot autoclose tray will follow. Signed-off-by: Borislav Petkov <petkovbb@gmail.com> Cc: Jens Axboe <jens.axboe@oracle.com> [bart: add an extra parentheses for consistency with the rest of kernel code] Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
2008-10-03include/linux/stacktrace.h: declare struct task_structAndrew Morton
include/linux/stacktrace.h:13: warning: 'struct task_struct' declared inside parameter list (This might be a hard error on sparc64, which uses this header and has -Werror) Reported-by: "Randy.Dunlap" <rdunlap@xenotime.net> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-03[MIPS] SMTC: Fix SMTC dyntick support.Kevin D. Kissell
Rework of SMTC support to make it work with the new clock event system, allowing "tickless" operation, and to make it compatible with the use of the "wait_irqoff" idle loop. The new clocking scheme means that the previously optional IPI instant replay mechanism is now required, and has been made more robust. Signed-off-by: Kevin D. Kissell <kevink@paralogos.com> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2008-10-03[MIPS] SMTC: Close tiny holes in the SMTC IPI replay system.Kevin D. Kissell
Signed-off-by: Kevin D. Kissell <kevink@paralogos.com> Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2008-10-03[MIPS] Build fix: Fix irq flags typeRalf Baechle
Though from a hardware perspective it would be sensible to use only a 32-bit unsigned int type Linux defines interrupt flags to be stored in an unsigned long and nothing else. Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2008-10-02mm: tiny-shmem nommu fixNick Piggin
The previous patch db203d53d474aa068984e409d807628f5841da1b ("mm: tiny-shmem fix lock ordering: mmap_sem vs i_mutex") to fix the lock ordering in tiny-shmem breaks shared anonymous and IPC memory on NOMMU architectures because it was using the expanding truncate to signal ramfs to allocate a physically contiguous RAM backing the inode (otherwise it is unusable for "memory mapping" it to userspace). However do_truncate is what caused the lock ordering error, due to it taking i_mutex. In this case, we can actually just call ramfs directly to allocate memory for the mapping, rather than go via truncate. Acked-by: David Howells <dhowells@redhat.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-02inotify: fix lock ordering wrt do_page_fault's mmap_semNick Piggin
Fix inotify lock order reversal with mmap_sem due to holding locks over copy_to_user. Signed-off-by: Nick Piggin <npiggin@suse.de> Reported-by: "Daniel J Blueman" <daniel.blueman@gmail.com> Tested-by: "Daniel J Blueman" <daniel.blueman@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-01Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: af_key: Free dumping state on socket close XFRM,IPv6: initialize ip6_dst_blackhole_ops.kmem_cachep ipv6: NULL pointer dereferrence in tcp_v6_send_ack tcp: Fix NULL dereference in tcp_4_send_ack() sctp: Fix kernel panic while process protocol violation parameter iucv: Fix mismerge again. ipsec: Fix pskb_expand_head corruption in xfrm_state_check_space
2008-09-30Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: hrtimer: prevent migration of per CPU hrtimers hrtimer: mark migration state hrtimer: fix migration of CB_IRQSAFE_NO_SOFTIRQ hrtimers hrtimer: migrate pending list on cpu offline Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2008-09-30sctp: Fix kernel panic while process protocol violation parameterWei Yongjun
Since call to function sctp_sf_abort_violation() need paramter 'arg' with 'struct sctp_chunk' type, it will read the chunk type and chunk length from the chunk_hdr member of chunk. But call to sctp_sf_violation_paramlen() always with 'struct sctp_paramhdr' type's parameter, it will be passed to sctp_sf_abort_violation(). This may cause kernel panic. sctp_sf_violation_paramlen() |-- sctp_sf_abort_violation() |-- sctp_make_abort_violation() This patch fixed this problem. This patch also fix two place which called sctp_sf_violation_paramlen() with wrong paramter type. Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>