summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2011-08-29Add a personality to report 2.6.x version numbersAndi Kleen
commit be27425dcc516fd08245b047ea57f83b8f6f0903 upstream. I ran into a couple of programs which broke with the new Linux 3.0 version. Some of those were binary only. I tried to use LD_PRELOAD to work around it, but it was quite difficult and in one case impossible because of a mix of 32bit and 64bit executables. For example, all kind of management software from HP doesnt work, unless we pretend to run a 2.6 kernel. $ uname -a Linux svivoipvnx001 3.0.0-08107-g97cd98f #1062 SMP Fri Aug 12 18:11:45 CEST 2011 i686 i686 i386 GNU/Linux $ hpacucli ctrl all show Error: No controllers detected. $ rpm -qf /usr/sbin/hpacucli hpacucli-8.75-12.0 Another notable case is that Python now reports "linux3" from sys.platform(); which in turn can break things that were checking sys.platform() == "linux2": https://bugzilla.mozilla.org/show_bug.cgi?id=664564 It seems pretty clear to me though it's a bug in the apps that are using '==' instead of .startswith(), but this allows us to unbreak broken programs. This patch adds a UNAME26 personality that makes the kernel report a 2.6.40+x version number instead. The x is the x in 3.x. I know this is somewhat ugly, but I didn't find a better workaround, and compatibility to existing programs is important. Some programs also read /proc/sys/kernel/osrelease. This can be worked around in user space with mount --bind (and a mount namespace) To use: wget ftp://ftp.kernel.org/pub/linux/kernel/people/ak/uname26/uname26.c gcc -o uname26 uname26.c ./uname26 program Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29x86, mtrr: lock stop machine during MTRR rendezvous sequenceSuresh Siddha
commit 6d3321e8e2b3bf6a5892e2ef673c7bf536e3f904 upstream. MTRR rendezvous sequence using stop_one_cpu_nowait() can potentially happen in parallel with another system wide rendezvous using stop_machine(). This can lead to deadlock (The order in which works are queued can be different on different cpu's. Some cpu's will be running the first rendezvous handler and others will be running the second rendezvous handler. Each set waiting for the other set to join for the system wide rendezvous, leading to a deadlock). MTRR rendezvous sequence is not implemented using stop_machine() as this gets called both from the process context aswell as the cpu online paths (where the cpu has not come online and the interrupts are disabled etc). stop_machine() works with only online cpus. For now, take the stop_machine mutex in the MTRR rendezvous sequence that gets called from an online cpu (here we are in the process context and can potentially sleep while taking the mutex). And the MTRR rendezvous that gets triggered during cpu online doesn't need to take this stop_machine lock (as the stop_machine() already ensures that there is no cpu hotplug going on in parallel by doing get_online_cpus()) TBD: Pursue a cleaner solution of extending the stop_machine() infrastructure to handle the case where the calling cpu is still not online and use this for MTRR rendezvous sequence. fixes: https://bugzilla.novell.com/show_bug.cgi?id=672008 Reported-by: Vadim Kotelnikov <vadimuzzz@inbox.ru> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/20110623182056.807230326@sbsiddha-MOBL3.sc.intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29genirq: Fix wrong bit operationjhbird.choi@samsung.com
commit 1dd75f91ae713049eb6baaa640078f3a6549e522 upstream. (!msk & 0x01) should be !(msk & 0x01) Signed-off-by: Jonghwan Choi <jhbird.choi@samsung.com> Link: http://lkml.kernel.org/r/1311229754-6003-1-git-send-email-jhbird.choi@samsung.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-15futex: Fix regression with read only mappingsShawn Bohrer
commit 9ea71503a8ed9184d2d0b8ccc4d269d05f7940ae upstream. commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw parameter from get_futex_key()) in 2.6.33 fixed two problems: First, It prevented a loop when encountering a ZERO_PAGE. Second, it fixed RW MAP_PRIVATE futex operations by forcing the COW to occur by unconditionally performing a write access get_user_pages_fast() to get the page. The commit also introduced a user-mode regression in that it broke futex operations on read-only memory maps. For example, this breaks workloads that have one or more reader processes doing a FUTEX_WAIT on a futex within a read only shared file mapping, and a writer processes that has a writable mapping issuing the FUTEX_WAKE. This fixes the regression for valid futex operations on RO mappings by trying a RO get_user_pages_fast() when the RW get_user_pages_fast() fails. This change makes it necessary to also check for invalid use cases, such as anonymous RO mappings (which can never change) and the ZERO_PAGE which the commit referenced above was written to address. This patch does restore the original behavior with RO MAP_PRIVATE mappings, which have inherent user-mode usage problems and don't really make sense. With this patch performing a FUTEX_WAIT within a RO MAP_PRIVATE mapping will be successfully woken provided another process updates the region of the underlying mapped file. However, the mmap() man page states that for a MAP_PRIVATE mapping: It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region. So user-mode users attempting to use futex operations on RO MAP_PRIVATE mappings are depending on unspecified behavior. Additionally a RO MAP_PRIVATE mapping could fail to wake up in the following case. Thread-A: call futex(FUTEX_WAIT, memory-region-A). get_futex_key() return inode based key. sleep on the key Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A) Thread-B: write memory-region-A. COW happen. This process's memory-region-A become related to new COWed private (ie PageAnon=1) page. Thread-B: call futex(FUETX_WAKE, memory-region-A). get_futex_key() return mm based key. IOW, we fail to wake up Thread-A. Once again doing something like this is just silly and users who do something like this get what they deserve. While RO MAP_PRIVATE mappings are nonsensical, checking for a private mapping requires walking the vmas and was deemed too costly to avoid a userspace hang. This Patch is based on Peter Zijlstra's initial patch with modifications to only allow RO mappings for futex operations that need VERIFY_READ access. Reported-by: David Oliver <david@rgmadvisors.com> Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Darren Hart <dvhart@linux.intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: peterz@infradead.org Cc: eric.dumazet@gmail.com Cc: zvonler@rgmadvisors.com Cc: hughd@google.com Link: http://lkml.kernel.org/r/1309450892-30676-1-git-send-email-sbohrer@rgmadvisors.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04mm/futex: fix futex writes on archs with SW tracking of dirty & youngBenjamin Herrenschmidt
commit 2efaca927f5cd7ecd0f1554b8f9b6a9a2c329c03 upstream. I haven't reproduced it myself but the fail scenario is that on such machines (notably ARM and some embedded powerpc), if you manage to hit that futex path on a writable page whose dirty bit has gone from the PTE, you'll livelock inside the kernel from what I can tell. It will go in a loop of trying the atomic access, failing, trying gup to "fix it up", getting succcess from gup, go back to the atomic access, failing again because dirty wasn't fixed etc... So I think you essentially hang in the kernel. The scenario is probably rare'ish because affected architecture are embedded and tend to not swap much (if at all) so we probably rarely hit the case where dirty is missing or young is missing, but I think Shan has a piece of SW that can reliably reproduce it using a shared writable mapping & fork or something like that. On archs who use SW tracking of dirty & young, a page without dirty is effectively mapped read-only and a page without young unaccessible in the PTE. Additionally, some architectures might lazily flush the TLB when relaxing write protection (by doing only a local flush), and expect a fault to invalidate the stale entry if it's still present on another processor. The futex code assumes that if the "in_atomic()" access -EFAULT's, it can "fix it up" by causing get_user_pages() which would then be equivalent to taking the fault. However that isn't the case. get_user_pages() will not call handle_mm_fault() in the case where the PTE seems to have the right permissions, regardless of the dirty and young state. It will eventually update those bits ... in the struct page, but not in the PTE. Additionally, it will not handle the lazy TLB flushing that can be required by some architectures in the fault case. Basically, gup is the wrong interface for the job. The patch provides a more appropriate one which boils down to just calling handle_mm_fault() since what we are trying to do is simulate a real page fault. The futex code currently attempts to write to user memory within a pagefault disabled section, and if that fails, tries to fix it up using get_user_pages(). This doesn't work on archs where the dirty and young bits are maintained by software, since they will gate access permission in the TLB, and will not be updated by gup(). In addition, there's an expectation on some archs that a spurious write fault triggers a local TLB flush, and that is missing from the picture as well. I decided that adding those "features" to gup() would be too much for this already too complex function, and instead added a new simpler fixup_user_fault() which is essentially a wrapper around handle_mm_fault() which the futex code can call. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout] Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Reported-by: Shan Hai <haishan.bai@gmail.com> Tested-by: Shan Hai <haishan.bai@gmail.com> Cc: David Laight <David.Laight@ACULAB.COM> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Darren Hart <darren.hart@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04tracing: Have "enable" file use refcounts like the "filter" fileSteven Rostedt
commit 40ee4dffff061399eb9358e0c8fcfbaf8de4c8fe upstream. The "enable" file for the event system can be removed when a module is unloaded and the event system only has events from that module. As the event system nr_events count goes to zero, it may be freed if its ref_count is also set to zero. Like the "filter" file, the "enable" file may be opened by a task and referenced later, after a module has been unloaded and the events for that event system have been removed. Although the "filter" file referenced the event system structure, the "enable" file only references a pointer to the event system name. Since the name is freed when the event system is removed, it is possible that an access to the "enable" file may reference a freed pointer. Update the "enable" file to use the subsystem_open() routine that the "filter" file uses, to keep a reference to the event system structure while the "enable" file is opened. Reported-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04tracing: Fix bug when reading system filters on module removalSteven Rostedt
commit e9dbfae53eeb9fc3d4bb7da3df87fa9875f5da02 upstream. The event system is freed when its nr_events is set to zero. This happens when a module created an event system and then later the module is removed. Modules may share systems, so the system is allocated when it is created and freed when the modules are unloaded and all the events under the system are removed (nr_events set to zero). The problem arises when a task opened the "filter" file for the system. If the module is unloaded and it removed the last event for that system, the system structure is freed. If the task that opened the filter file accesses the "filter" file after the system has been freed, the system will access an invalid pointer. By adding a ref_count, and using it to keep track of what is using the event system, we can free it after all users are finished with the event system. Reported-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-04perf: Fix software event overflowPeter Zijlstra
The below patch is for -stable only, upstream has a much larger patch that contains the below hunk in commit a8b0ca17b80e92faab46ee7179ba9e99ccb61233 Vince found that under certain circumstances software event overflows go wrong and deadlock. Avoid trying to delete a timer from the timer callback. Reported-by: Vince Weaver <vweaver1@eecs.utk.edu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-07-20Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: signal: align __lock_task_sighand() irq disabling and RCU softirq,rcu: Inform RCU of irq_exit() activity sched: Add irq_{enter,exit}() to scheduler_ipi() rcu: protect __rcu_read_unlock() against scheduler-using irq handlers rcu: Streamline code produced by __rcu_read_unlock() rcu: Fix RCU_BOOST race handling current->rcu_read_unlock_special rcu: decrease rcu_report_exp_rnp coupling with scheduler
2011-07-20Merge branch 'rcu/urgent' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/urgent
2011-07-20signal: align __lock_task_sighand() irq disabling and RCUPaul E. McKenney
The __lock_task_sighand() function calls rcu_read_lock() with interrupts and preemption enabled, but later calls rcu_read_unlock() with interrupts disabled. It is therefore possible that this RCU read-side critical section will be preempted and later RCU priority boosted, which means that rcu_read_unlock() will call rt_mutex_unlock() in order to deboost itself, but with interrupts disabled. This results in lockdep splats, so this commit nests the RCU read-side critical section within the interrupt-disabled region of code. This prevents the RCU read-side critical section from being preempted, and thus prevents the attempt to deboost with interrupts disabled. It is quite possible that a better long-term fix is to make rt_mutex_unlock() disable irqs when acquiring the rt_mutex structure's ->wait_lock. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-07-20softirq,rcu: Inform RCU of irq_exit() activityPeter Zijlstra
The rcu_read_unlock_special() function relies on in_irq() to exclude scheduler activity from interrupt level. This fails because exit_irq() can invoke the scheduler after clearing the preempt_count() bits that in_irq() uses to determine that it is at interrupt level. This situation can result in failures as follows: $task IRQ SoftIRQ rcu_read_lock() /* do stuff */ <preempt> |= UNLOCK_BLOCKED rcu_read_unlock() --t->rcu_read_lock_nesting irq_enter(); /* do stuff, don't use RCU */ irq_exit(); sub_preempt_count(IRQ_EXIT_OFFSET); invoke_softirq() ttwu(); spin_lock_irq(&pi->lock) rcu_read_lock(); /* do stuff */ rcu_read_unlock(); rcu_read_unlock_special() rcu_report_exp_rnp() ttwu() spin_lock_irq(&pi->lock) /* deadlock */ rcu_read_unlock_special(t); Ed can simply trigger this 'easy' because invoke_softirq() immediately does a ttwu() of ksoftirqd/# instead of doing the in-place softirq stuff first, but even without that the above happens. Cure this by also excluding softirqs from the rcu_read_unlock_special() handler and ensuring the force_irqthreads ksoftirqd/# wakeup is done from full softirq context. [ Alternatively, delaying the ->rcu_read_lock_nesting decrement until after the special handling would make the thing more robust in the face of interrupts as well. And there is a separate patch for that. ] Cc: Thomas Gleixner <tglx@linutronix.de> Reported-and-tested-by: Ed Tomlinson <edt@aei.ca> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-07-20sched: Add irq_{enter,exit}() to scheduler_ipi()Peter Zijlstra
Ensure scheduler_ipi() calls irq_{enter,exit} when it does some actual work. Traditionally we never did any actual work from the resched IPI and all magic happened in the return from interrupt path. Now that we do do some work, we need to ensure irq_{enter,exit} are called so that we don't confuse things. This affects things like timekeeping, NO_HZ and RCU, basically everything with a hook in irq_enter/exit. Explicit examples of things going wrong are: sched_clock_cpu() -- has a callback when leaving NO_HZ state to take a new reading from GTOD and TSC. Without this callback, time is stuck in the past. RCU -- needs in_irq() to work in order to avoid some nasty deadlocks Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-07-20rcu: protect __rcu_read_unlock() against scheduler-using irq handlersPaul E. McKenney
The addition of RCU read-side critical sections within runqueue and priority-inheritance lock critical sections introduced some deadlock cycles, for example, involving interrupts from __rcu_read_unlock() where the interrupt handlers call wake_up(). This situation can cause the instance of __rcu_read_unlock() invoked from interrupt to do some of the processing that would otherwise have been carried out by the task-level instance of __rcu_read_unlock(). When the interrupt-level instance of __rcu_read_unlock() is called with a scheduler lock held from interrupt-entry/exit situations where in_irq() returns false, deadlock can result. This commit resolves these deadlocks by using negative values of the per-task ->rcu_read_lock_nesting counter to indicate that an instance of __rcu_read_unlock() is in flight, which in turn prevents instances from interrupt handlers from doing any special processing. This patch is inspired by Steven Rostedt's earlier patch that similarly made __rcu_read_unlock() guard against interrupt-mediated recursion (see https://lkml.org/lkml/2011/7/15/326), but this commit refines Steven's approach to avoid the need for preemption disabling on the __rcu_read_unlock() fastpath and to also avoid the need for manipulating a separate per-CPU variable. This patch avoids need for preempt_disable() by instead using negative values of the per-task ->rcu_read_lock_nesting counter. Note that nested rcu_read_lock()/rcu_read_unlock() pairs are still permitted, but they will never see ->rcu_read_lock_nesting go to zero, and will therefore never invoke rcu_read_unlock_special(), thus preventing them from seeing the RCU_READ_UNLOCK_BLOCKED bit should it be set in ->rcu_read_unlock_special. This patch also adds a check for ->rcu_read_unlock_special being negative in rcu_check_callbacks(), thus preventing the RCU_READ_UNLOCK_NEED_QS bit from being set should a scheduling-clock interrupt occur while __rcu_read_unlock() is exiting from an outermost RCU read-side critical section. Of course, __rcu_read_unlock() can be preempted during the time that ->rcu_read_lock_nesting is negative. This could result in the setting of the RCU_READ_UNLOCK_BLOCKED bit after __rcu_read_unlock() checks it, and would also result it this task being queued on the corresponding rcu_node structure's blkd_tasks list. Therefore, some later RCU read-side critical section would enter rcu_read_unlock_special() to clean up -- which could result in deadlock if that critical section happened to be in the scheduler where the runqueue or priority-inheritance locks were held. This situation is dealt with by making rcu_preempt_note_context_switch() check for negative ->rcu_read_lock_nesting, thus refraining from queuing the task (and from setting RCU_READ_UNLOCK_BLOCKED) if we are already exiting from the outermost RCU read-side critical section (in other words, we really are no longer actually in that RCU read-side critical section). In addition, rcu_preempt_note_context_switch() invokes rcu_read_unlock_special() to carry out the cleanup in this case, which clears out the ->rcu_read_unlock_special bits and dequeues the task (if necessary), in turn avoiding needless delay of the current RCU grace period and needless RCU priority boosting. It is still illegal to call rcu_read_unlock() while holding a scheduler lock if the prior RCU read-side critical section has ever had either preemption or irqs enabled. However, the common use case is legal, namely where then entire RCU read-side critical section executes with irqs disabled, for example, when the scheduler lock is held across the entire lifetime of the RCU read-side critical section. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-07-20sched: Avoid creating superfluous NUMA domains on non-NUMA systemsPeter Zijlstra
When creating sched_domains, stop when we've covered the entire target span instead of continuing to create domains, only to later find they're redundant and throw them away again. This avoids single node systems from touching funny NUMA sched_domain creation code and reduces the risks of the new SD_OVERLAP code. Requested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Anton Blanchard <anton@samba.org> Cc: mahesh@linux.vnet.ibm.com Cc: benh@kernel.crashing.org Cc: linuxppc-dev@lists.ozlabs.org Link: http://lkml.kernel.org/r/1311180177.29152.57.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-20sched: Allow for overlapping sched_domain spansPeter Zijlstra
Allow for sched_domain spans that overlap by giving such domains their own sched_group list instead of sharing the sched_groups amongst each-other. This is needed for machines with more than 16 nodes, because sched_domain_node_span() will generate a node mask from the 16 nearest nodes without regard if these masks have any overlap. Currently sched_domains have a sched_group that maps to their child sched_domain span, and since there is no overlap we share the sched_group between the sched_domains of the various CPUs. If however there is overlap, we would need to link the sched_group list in different ways for each cpu, and hence sharing isn't possible. In order to solve this, allocate private sched_groups for each CPU's sched_domain but have the sched_groups share a sched_group_power structure such that we can uniquely track the power. Reported-and-tested-by: Anton Blanchard <anton@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-08bxqw9wis3qti9u5inifh3y@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-20sched: Break out cpu_power from the sched_group structurePeter Zijlstra
In order to prepare for non-unique sched_groups per domain, we need to carry the cpu_power elsewhere, so put a level of indirection in. Reported-and-tested-by: Anton Blanchard <anton@samba.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-qkho2byuhe4482fuknss40ad@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-19rcu: Streamline code produced by __rcu_read_unlock()Paul E. McKenney
Given some common flag combinations, particularly -Os, gcc will inline rcu_read_unlock_special() despite its being in an unlikely() clause. Use noinline to prohibit this misoptimization. In addition, move the second barrier() in __rcu_read_unlock() so that it is not on the common-case code path. This will allow the compiler to generate better code for the common-case path through __rcu_read_unlock(). Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
2011-07-19rcu: Fix RCU_BOOST race handling current->rcu_read_unlock_specialPaul E. McKenney
The RCU_BOOST commits for TREE_PREEMPT_RCU introduced an other-task write to a new RCU_READ_UNLOCK_BOOSTED bit in the task_struct structure's ->rcu_read_unlock_special field, but, as noted by Steven Rostedt, without correctly synchronizing all accesses to ->rcu_read_unlock_special. This could result in bits in ->rcu_read_unlock_special being spuriously set and cleared due to conflicting accesses, which in turn could result in deadlocks between the rcu_node structure's ->lock and the scheduler's rq and pi locks. These deadlocks would result from RCU incorrectly believing that the just-ended RCU read-side critical section had been preempted and/or boosted. If that RCU read-side critical section was executed with either rq or pi locks held, RCU's ensuing (incorrect) calls to the scheduler would cause the scheduler to attempt to once again acquire the rq and pi locks, resulting in deadlock. More complex deadlock cycles are also possible, involving multiple rq and pi locks as well as locks from multiple rcu_node structures. This commit fixes synchronization by creating ->rcu_boosted field in task_struct that is accessed and modified only when holding the ->lock in the rcu_node structure on which the task is queued (on that rcu_node structure's ->blkd_tasks list). This results in tasks accessing only their own current->rcu_read_unlock_special fields, making unsynchronized access once again legal, and keeping the rcu_read_unlock() fastpath free of atomic instructions and memory barriers. The reason that the rcu_read_unlock() fastpath does not need to access the new current->rcu_boosted field is that this new field cannot be non-zero unless the RCU_READ_UNLOCK_BLOCKED bit is set in the current->rcu_read_unlock_special field. Therefore, rcu_read_unlock() need only test current->rcu_read_unlock_special: if that is zero, then current->rcu_boosted must also be zero. This bug does not affect TINY_PREEMPT_RCU because this implementation of RCU accesses current->rcu_read_unlock_special with irqs disabled, thus preventing races on the !SMP systems that TINY_PREEMPT_RCU runs on. Maybe-reported-by: Dave Jones <davej@redhat.com> Maybe-reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
2011-07-19rcu: decrease rcu_report_exp_rnp coupling with schedulerPaul E. McKenney
PREEMPT_RCU read-side critical sections blocking an expedited grace period invoke rcu_report_exp_rnp(). When the last such critical section has completed, rcu_report_exp_rnp() invokes the scheduler to wake up the task that invoked synchronize_rcu_expedited() -- needlessly holding the root rcu_node structure's lock while doing so, thus needlessly providing a way for RCU and the scheduler to deadlock. This commit therefore releases the root rcu_node structure's lock before calling wake_up(). Reported-by: Ed Tomlinson <edt@aei.ca> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-07-15Merge branch 'rcu/urgent' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu * 'rcu/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu: rcu: Prevent RCU callbacks from executing before scheduler initialized
2011-07-15sched: Fix 32bit racePeter Zijlstra
Commit 3fe1698b7fe0 ("sched: Deal with non-atomic min_vruntime reads on 32bit") forgot to initialize min_vruntime_copy which could lead to an infinite while loop in task_waking_fair() under some circumstances (early boot, lucky timing). [ This bug was also reported by others that blamed it on the RCU initialization problems ] Reported-and-tested-by: Bruno Wolff III <bruno@wolff.to> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-13rcu: Prevent RCU callbacks from executing before scheduler initializedPaul E. McKenney
Under some rare but real combinations of configuration parameters, RCU callbacks are posted during early boot that use kernel facilities that are not yet initialized. Therefore, when these callbacks are invoked, hard hangs and crashes ensue. This commit therefore prevents RCU callbacks from being invoked until after the scheduler is fully up and running, as in after multiple tasks have been spawned. It might well turn out that a better approach is to identify the specific RCU callbacks that are causing this problem, but that discussion will wait until such time as someone really needs an RCU callback to be invoked (as opposed to merely registered) during early boot. Reported-by: julie Sullivan <kernelmail.jms@gmail.com> Reported-by: RKK <kulkarni.ravi4@gmail.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Tested-by: julie Sullivan <kernelmail.jms@gmail.com> Tested-by: RKK <kulkarni.ravi4@gmail.com>
2011-07-12Merge branch 'fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/linux-arm-soc: pcmcia: pxa2xx/vpac270: free gpios on exist rather than requesting ARM: pxa/raumfeld: fix device name for codec ak4104 ARM: pxa/raumfeld: display initialisation fixes ARM: pxa/raumfeld: adapt to upcoming hardware change ARM: pxa: fix gpio_to_chip() clash with gpiolib namespace genirq: replace irq_gc_ack() with {set,clr}_bit variants (fwd) arm: mach-vt8500: add forgotten irq_data conversion ARM: pxa168: correct nand pmu setting ARM: pxa910: correct nand pmu setting ARM: pxa: fix PGSR register address calculation
2011-07-07Merge branch 'pm-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6 * 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6: PM / Hibernate: Fix free_unnecessary_pages()
2011-07-07Merge branches 'core-urgent-for-linus', 'perf-urgent-for-linus' and ↵Linus Torvalds
'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: debugobjects: Fix boot crash when kmemleak and debugobjects enabled * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: jump_label: Fix jump_label update for modules oprofile, x86: Fix race in nmi handler while starting counters * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Disable (revert) SCHED_LOAD_SCALE increase sched, cgroups: Fix MIN_SHARES on 64-bit boxen
2011-07-07genirq: replace irq_gc_ack() with {set,clr}_bit variants (fwd)Simon Guinot
This fixes a regression introduced by e59347a "arm: orion: Use generic irq chip". Depending on the device, interrupts acknowledgement is done by setting or by clearing a dedicated register. Replace irq_gc_ack() with some {set,clr}_bit variants allows to handle both cases. Note that this patch affects the following SoCs: Davinci, Samsung and Orion. Except for this last, the change is minor: irq_gc_ack() is just renamed into irq_gc_ack_set_bit(). For the Orion SoCs, the edge GPIO interrupts support is currently broken. irq_gc_ack() try to acknowledge a such interrupt by setting the corresponding cause register bit. The Orion GPIO device expect the opposite. To fix this issue, the irq_gc_ack_clr_bit() variant is used. Tested on Network Space v2. Reported-by: Joey Oravec <joravec@drewtech.com> Signed-off-by: Simon Guinot <sguinot@lacie.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2011-07-06PM / Hibernate: Fix free_unnecessary_pages()Rafael J. Wysocki
There is a bug in free_unnecessary_pages() that causes it to attempt to free too many pages in some cases, which triggers the BUG_ON() in memory_bm_clear_bit() for copy_bm. Namely, if count_data_pages() is initially greater than alloc_normal, we get to_free_normal equal to 0 and "save" greater from 0. In that case, if the sum of "save" and count_highmem_pages() is greater than alloc_highmem, we subtract a positive number from to_free_normal. Hence, since to_free_normal was 0 before the subtraction and is an unsigned int, the result is converted to a huge positive number that is used as the number of pages to free. Fix this bug by checking if to_free_normal is actually greater than or equal to the number we're going to subtract from it. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Reported-and-tested-by: Matthew Garrett <mjg@redhat.com> Cc: stable@kernel.org
2011-07-06resource: ability to resize an allocated resourceRam Pai
Provides the ability to resize a resource that is already allocated. This functionality is put in place to support reallocation needs of pci resources. Signed-off-by: Ram Pai <linuxram@us.ibm.com> Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-01sched, cgroups: Fix MIN_SHARES on 64-bit boxenMike Galbraith
Commit c8b28116 ("sched: Increase SCHED_LOAD_SCALE resolution") intended to have no user-visible effect, but allows setting cpu.shares to < MIN_SHARES, which the user then sees. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Nikhil Rao <ncrao@google.com> Link: http://lkml.kernel.org/r/1307192600.8618.3.camel@marge.simson.net Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-06-29jump_label: Fix jump_label update for modulesXiao Guangrong
The jump labels entries for modules do not stop at __stop__jump_table, but after mod->jump_entries + mod_num_jump_entries. By checking the wrong end point, module trace events never get enabled. Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Jason Baron <jbaron@redhat.com> Tested-by: Avi Kivity <avi@redhat.com> Tested-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Link: http://lkml.kernel.org/r/4E00038B.2060404@cn.fujitsu.com Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-06-27taskstats: don't allow duplicate entries in listener modeVasiliy Kulikov
Currently a single process may register exit handlers unlimited times. It may lead to a bloated listeners chain and very slow process terminations. Eg after 10KK sent TASKSTATS_CMD_ATTR_REGISTER_CPUMASKs ~300 Mb of kernel memory is stolen for the handlers chain and "time id" shows 2-7 seconds instead of normal 0.003. It makes it possible to exhaust all kernel memory and to eat much of CPU time by triggerring numerous exits on a single CPU. The patch limits the number of times a single process may register itself on a single CPU to one. One little issue is kept unfixed - as taskstats_exit() is called before exit_files() in do_exit(), the orphaned listener entry (if it was not explicitly deregistered) is kept until the next someone's exit() and implicit deregistration in send_cpu_listeners(). So, if a process registered itself as a listener exits and the next spawned process gets the same pid, it would inherit taskstats attributes. Signed-off-by: Vasiliy Kulikov <segooon@gmail.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-25Merge branch 'timer-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timer-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rtc: vt8500: Fix build error & cleanup rtc_class_ops->update_irq_enable() alarmtimers: Return -ENOTSUPP if no RTC device is present alarmtimers: Handle late rtc module loading
2011-06-21alarmtimers: Return -ENOTSUPP if no RTC device is presentJohn Stultz
Toralf Förster and Richard Weinberger noted that if there is no RTC device, the alarm timers core prints out an annoying "ALARM timers will not wake from suspend" message. This warning has been removed in a previous patch, however the issue still remains: The original idea was to support alarm timers even if there was no rtc device, as long as the system didn't go into suspend. However, after further consideration, communicating to the application that alarmtimers are not fully functional seems like the better solution. So this patch makes it so we return -ENOTSUPP to any posix _ALARM clockid calls if there is no backing RTC device on the system. Further this changes the behavior where when there is no rtc device we will check for one on clock_getres, clock_gettime, timer_create, and timer_nsleep instead of on suspend. CC: Toralf Förster <toralf.foerster@gmx.de> CC: Richard Weinberger <richard@nod.at CC: Peter Zijlstra <peterz@infradead.org> CC: Thomas Gleixner <tglx@linutronix.de> Reported-by: Toralf Förster <toralf.foerster@gmx.de> Reported by: Richard Weinberger <richard@nod.at> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-06-21alarmtimers: Handle late rtc module loadingJohn Stultz
The alarmtimers code currently picks a rtc device to use at late init time. However, if your rtc driver is loaded as a module, it may be registered after the alarmtimers late init code, leaving the alarmtimers nonfunctional. This patch moves the the rtcdevice selection to when we actually try to use it, allowing us to make use of rtc modules that may have been loaded at any point since bootup. CC: Thomas Gleixner <tglx@linutronix.de> CC: Meelis Roos <mroos@ut.ee> Reported-by: Meelis Roos <mroos@ut.ee> Signed-off-by: John Stultz <john.stultz@linaro.org>
2011-06-21PM: Free memory bitmaps if opening /dev/snapshot failsMichal Kubecek
When opening /dev/snapshot device, snapshot_open() creates memory bitmaps which are freed in snapshot_release(). But if any of the callbacks called by pm_notifier_call_chain() returns NOTIFY_BAD, open() fails, snapshot_release() is never called and bitmaps are not freed. Next attempt to open /dev/snapshot then triggers BUG_ON() check in create_basic_memory_bitmaps(). This happens e.g. when vmwatchdog module is active on s390x. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: stable@kernel.org
2011-06-19Merge branches 'perf-urgent-for-linus', 'sched-urgent-for-linus', ↵Linus Torvalds
'timers-urgent-for-linus' and 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tools/perf: Fix static build of perf tool tracing: Fix regression in printk_formats file * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: generic-ipi: Fix kexec boot crash by initializing call_single_queue before enabling interrupts * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clocksource: Make watchdog robust vs. interruption timerfd: Fix wakeup of processes when timer is cancelled on clock change * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, MAINTAINERS: Add x86 MCE people x86, efi: Do not reserve boot services regions within reserved areas
2011-06-19Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: Move RCU_BOOST #ifdefs to header file rcu: use softirq instead of kthreads except when RCU_BOOST=y rcu: Use softirq to address performance regression rcu: Simplify curing of load woes
2011-06-17KEYS/DNS: Fix ____call_usermodehelper() to not lose the session keyringDavid Howells
____call_usermodehelper() now erases any credentials set by the subprocess_inf::init() function. The problem is that commit 17f60a7da150 ("capabilites: allow the application of capability limits to usermode helpers") creates and commits new credentials with prepare_kernel_cred() after the call to the init() function. This wipes all keyrings after umh_keys_init() is called. The best way to deal with this is to put the init() call just prior to the commit_creds() call, and pass the cred pointer to init(). That means that umh_keys_init() and suchlike can modify the credentials _before_ they are published and potentially in use by the rest of the system. This prevents request_key() from working as it is prevented from passing the session keyring it set up with the authorisation token to /sbin/request-key, and so the latter can't assume the authority to instantiate the key. This causes the in-kernel DNS resolver to fail with ENOKEY unconditionally. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Eric Paris <eparis@redhat.com> Tested-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-17generic-ipi: Fix kexec boot crash by initializing call_single_queue before ↵Takao Indoh
enabling interrupts There is a problem that kdump(2nd kernel) sometimes hangs up due to a pending IPI from 1st kernel. Kernel panic occurs because IPI comes before call_single_queue is initialized. To fix the crash, rename init_call_single_data() to call_function_init() and call it in start_kernel() so that call_single_queue can be initialized before enabling interrupts. The details of the crash are: (1) 2nd kernel boots up (2) A pending IPI from 1st kernel comes when irqs are first enabled in start_kernel(). (3) Kernel tries to handle the interrupt, but call_single_queue is not initialized yet at this point. As a result, in the generic_smp_call_function_single_interrupt(), NULL pointer dereference occurs when list_replace_init() tries to access &q->list.next. Therefore this patch changes the name of init_call_single_data() to call_function_init() and calls it before local_irq_enable() in start_kernel(). Signed-off-by: Takao Indoh <indou.takao@jp.fujitsu.com> Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Milton Miller <miltonm@bga.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: kexec@lists.infradead.org Link: http://lkml.kernel.org/r/D6CBEE2F420741indou.takao@jp.fujitsu.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-06-16rcu: Move RCU_BOOST #ifdefs to header filePaul E. McKenney
The commit "use softirq instead of kthreads except when RCU_BOOST=y" just applied #ifdef in place. This commit is a cleanup that moves the newly #ifdef'ed code to the header file kernel/rcutree_plugin.h. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-06-16clocksource: Make watchdog robust vs. interruptionThomas Gleixner
The clocksource watchdog code is interruptible and it has been observed that this can trigger false positives which disable the TSC. The reason is that an interrupt storm or a long running interrupt handler between the read of the watchdog source and the read of the TSC brings the two far enough apart that the delta is larger than the unstable treshold. Move both reads into a short interrupt disabled region to avoid that. Reported-and-tested-by: Vernon Mauery <vernux@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@kernel.org
2011-06-15rcu: use softirq instead of kthreads except when RCU_BOOST=yPaul E. McKenney
This patch #ifdefs RCU kthreads out of the kernel unless RCU_BOOST=y, thus eliminating context-switch overhead if RCU priority boosting has not been configured. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-06-15Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: sched: Check if lowest_mask is initialized in find_lowest_rq() sched: Fix need_resched() when checking peempt
2011-06-15gcov: disable CONFIG_CONSTRUCTORS when not needed by CONFIG_GCOV_KERNELJosh Triplett
CONFIG_CONSTRUCTORS controls support for running constructor functions at kernel init time. According to commit b99b87f70c7785ab ("kernel: constructor support"), gcov (CONFIG_GCOV_KERNEL) needs this. However, CONFIG_CONSTRUCTORS currently defaults to y, with no option to disable it, and CONFIG_GCOV_KERNEL depends on it. Instead, default it to n and have CONFIG_GCOV_KERNEL select it, so that the normal case of CONFIG_GCOV_KERNEL=n will result in CONFIG_CONSTRUCTORS=n. Observed in the short list of =y values in a minimal kernel configuration. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Acked-by: Peter Oberparleiter <peter.oberparleiter@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-15memcg: clear mm->owner when last possible owner leavesKAMEZAWA Hiroyuki
The following crash was reported: > Call Trace: > [<ffffffff81139792>] mem_cgroup_from_task+0x15/0x17 > [<ffffffff8113a75a>] __mem_cgroup_try_charge+0x148/0x4b4 > [<ffffffff810493f3>] ? need_resched+0x23/0x2d > [<ffffffff814cbf43>] ? preempt_schedule+0x46/0x4f > [<ffffffff8113afe8>] mem_cgroup_charge_common+0x9a/0xce > [<ffffffff8113b6d1>] mem_cgroup_newpage_charge+0x5d/0x5f > [<ffffffff81134024>] khugepaged+0x5da/0xfaf > [<ffffffff81078ea0>] ? __init_waitqueue_head+0x4b/0x4b > [<ffffffff81133a4a>] ? add_mm_counter.constprop.5+0x13/0x13 > [<ffffffff81078625>] kthread+0xa8/0xb0 > [<ffffffff814d13e8>] ? sub_preempt_count+0xa1/0xb4 > [<ffffffff814d5664>] kernel_thread_helper+0x4/0x10 > [<ffffffff814ce858>] ? retint_restore_args+0x13/0x13 > [<ffffffff8107857d>] ? __init_kthread_worker+0x5a/0x5a What happens is that khugepaged tries to charge a huge page against an mm whose last possible owner has already exited, and the memory controller crashes when the stale mm->owner is used to look up the cgroup to charge. mm->owner has never been set to NULL with the last owner going away, but nobody cared until khugepaged came along. Even then it wasn't a problem because the final mmput() on an mm was forced to acquire and release mmap_sem in write-mode, preventing an exiting owner to go away while the mmap_sem was held, and until "692e0b3 mm: thp: optimize memcg charge in khugepaged", the memory cgroup charge was protected by mmap_sem in read-mode. Instead of going back to relying on the mmap_sem to enforce lifetime of a task, this patch ensures that mm->owner is properly set to NULL when the last possible owner is exiting, which the memory controller can handle just fine. [akpm@linux-foundation.org: tweak comments] Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Hugh Dickins <hughd@google.com> Reported-by: Dave Jones <davej@redhat.com> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-15sched: Check if lowest_mask is initialized in find_lowest_rq()Steven Rostedt
On system boot up, the lowest_mask is initialized with an early_initcall(). But RT tasks may wake up on other early_initcall() callers before the lowest_mask is initialized, causing a system crash. Commit "d72bce0e67 rcu: Cure load woes" was the first commit to wake up RT tasks in early init. Before this commit this bug should not happen. Reported-by: Andrew Theurer <habanero@linux.vnet.ibm.com> Tested-by: Andrew Theurer <habanero@linux.vnet.ibm.com> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20110614223657.824872966@goodmis.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-06-15sched: Fix need_resched() when checking peemptHillf Danton
The RT preempt check tests the wrong task if NEED_RESCHED is set. It currently checks the local CPU task. It is supposed to check the task that is running on the runqueue we are about to wake another task on. Signed-off-by: Hillf Danton <dhillf@gmail.com> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20110614223657.450239027@goodmis.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-06-14signal.c: fix kernel-doc notationRandy Dunlap
Fix kernel-doc warnings in signal.c: Warning(kernel/signal.c:2374): No description found for parameter 'nset' Warning(kernel/signal.c:2374): Excess function parameter 'set' description in 'sys_rt_sigprocmask' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-06-14rcu: Use softirq to address performance regressionShaohua Li
Commit a26ac2455ffcf3(rcu: move TREE_RCU from softirq to kthread) introduced performance regression. In an AIM7 test, this commit degraded performance by about 40%. The commit runs rcu callbacks in a kthread instead of softirq. We observed high rate of context switch which is caused by this. Out test system has 64 CPUs and HZ is 1000, so we saw more than 64k context switch per second which is caused by RCU's per-CPU kthread. A trace showed that most of the time the RCU per-CPU kthread doesn't actually handle any callbacks, but instead just does a very small amount of work handling grace periods. This means that RCU's per-CPU kthreads are making the scheduler do quite a bit of work in order to allow a very small amount of RCU-related processing to be done. Alex Shi's analysis determined that this slowdown is due to lock contention within the scheduler. Unfortunately, as Peter Zijlstra points out, the scheduler's real-time semantics require global action, which means that this contention is inherent in real-time scheduling. (Yes, perhaps someone will come up with a workaround -- otherwise, -rt is not going to do well on large SMP systems -- but this patch will work around this issue in the meantime. And "the meantime" might well be forever.) This patch therefore re-introduces softirq processing to RCU, but only for core RCU work. RCU callbacks are still executed in kthread context, so that only a small amount of RCU work runs in softirq context in the common case. This should minimize ksoftirqd execution, allowing us to skip boosting of ksoftirqd for CONFIG_RCU_BOOST=y kernels. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Tested-by: "Alex,Shi" <alex.shi@intel.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>