summaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)Author
2013-08-11sched: Fix the broken sched_rr_get_interval()Zhu Yanhai
commit a59f4e079d19464eebb9b06513a1d4f55fdae5ba upstream. The caller of sched_sliced() should pass se.cfs_rq and se as the arguments, however in sched_rr_get_interval() we gave it rq.cfs_rq and se, which made the following computation obviously wrong. The change was introduced by commit: 77034937dc45 sched: fix crash in sys_sched_rr_get_interval() ... 5 years ago, while it had been the correct 'cfs_rq_of' before the commit. The change seems to be irrelevant to the commit msg, which was to return a 0 timeslice for tasks that are on an idle runqueue. So I believe that was just a plain typo. Signed-off-by: Zhu Yanhai <gaoyang.zyh@taobao.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1357621012-15039-1-git-send-email-gaoyang.zyh@taobao.com [ Since this is an ABI and an old bug, we'll test this via a slow upstream route, to hopefully discover any app breakage. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-25sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()sTejun Heo
commit 383efcd00053ec40023010ce5034bd702e7ab373 upstream. try_to_wake_up_local() should only be invoked to wake up another task in the same runqueue and BUG_ON()s are used to enforce the rule. Missing try_to_wake_up_local() can stall workqueue execution but such stalls are likely to be finite either by another work item being queued or the one blocked getting unblocked. There's no reason to trigger BUG while holding rq lock crashing the whole system. Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20130318192234.GD3042@htj.dyndns.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-04-16sched_clock: Prevent 64bit inatomicity on 32bit systemsThomas Gleixner
commit a1cbcaa9ea87b87a96b9fc465951dcf36e459ca2 upstream. The sched_clock_remote() implementation has the following inatomicity problem on 32bit systems when accessing the remote scd->clock, which is a 64bit value. CPU0 CPU1 sched_clock_local() sched_clock_remote(CPU0) ... remote_clock = scd[CPU0]->clock read_low32bit(scd[CPU0]->clock) cmpxchg64(scd->clock,...) read_high32bit(scd[CPU0]->clock) While the update of scd->clock is using an atomic64 mechanism, the readout on the remote cpu is not, which can cause completely bogus readouts. It is a quite rare problem, because it requires the update to hit the narrow race window between the low/high readout and the update must go across the 32bit boundary. The resulting misbehaviour is, that CPU1 will see the sched_clock on CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value to this bogus timestamp. This stays that way due to the clamping implementation for about 4 seconds until the synchronization with CLOCK_MONOTONIC undoes the problem. The issue is hard to observe, because it might only result in a less accurate SCHED_OTHER timeslicing behaviour. To create observable damage on realtime scheduling classes, it is necessary that the bogus update of CPU1 sched_clock happens in the context of an realtime thread, which then gets charged 4 seconds of RT runtime, which results in the RT throttler mechanism to trigger and prevent scheduling of RT tasks for a little less than 4 seconds. So this is quite unlikely as well. The issue was quite hard to decode as the reproduction time is between 2 days and 3 weeks and intrusive tracing makes it less likely, but the following trace recorded with trace_clock=global, which uses sched_clock_local(), gave the final hint: <idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80 <idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ... irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0 <idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80 What happens is that CPU0 goes idle and invokes sched_clock_idle_sleep_event() which invokes sched_clock_local() and CPU1 runs a remote wakeup for CPU0 at the same time, which invokes sched_remote_clock(). The time jump gets propagated to CPU0 via sched_remote_clock() and stays stale on both cores for ~4 seconds. There are only two other possibilities, which could cause a stale sched clock: 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic wrong value. 2) sched_clock() which reads the TSC returns a sporadic wrong value. #1 can be excluded because sched_clock would continue to increase for one jiffy and then go stale. #2 can be excluded because it would not make the clock jump forward. It would just result in a stale sched_clock for one jiffy. After quite some brain twisting and finding the same pattern on other traces, sched_clock_remote() remained the only place which could cause such a problem and as explained above it's indeed racy on 32bit systems. So while on 64bit systems the readout is atomic, we need to verify the remote readout on 32bit machines. We need to protect the local->clock readout in sched_clock_remote() on 32bit as well because an NMI could hit between the low and the high readout, call sched_clock_local() and modify local->clock. Thanks to Siegfried Wulsch for bearing with my debug requests and going through the tedious tasks of running a bunch of reproducer systems to generate the debug information which let me decode the issue. Reported-by: Siegfried Wulsch <Siegfried.Wulsch@rovema.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-02-11sched/rt: Use root_domain of rt_rq not current processorShawn Bohrer
commit aa7f67304d1a03180f463258aa6f15a8b434e77d upstream. When the system has multiple domains do_sched_rt_period_timer() can run on any CPU and may iterate over all rt_rq in cpu_online_mask. This means when balance_runtime() is run for a given rt_rq that rt_rq may be in a different rd than the current processor. Thus if we use smp_processor_id() to get rd in do_balance_runtime() we may borrow runtime from a rt_rq that is not part of our rd. This changes do_balance_runtime to get the rd from the passed in rt_rq ensuring that we borrow runtime only from the correct rd for the given rt_rq. This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime when we try reclaim runtime lent to other rt_rq but runtime has been lent to a rt_rq in another rd. Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Mike Galbraith <bitbucket@online.de> Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/1358186131-29494-1-git-send-email-sbohrer@rgmadvisors.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-27wake_up_process() should be never used to wakeup a TASK_STOPPED/TRACED taskOleg Nesterov
commit 9067ac85d533651b98c2ff903182a20cbb361fcb upstream. wake_up_process() should never wakeup a TASK_STOPPED/TRACED task. Change it to use TASK_NORMAL and add the WARN_ON(). TASK_ALL has no other users, probably can be killed. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-12-10Revert "sched, autogroup: Stop going ahead if autogroup is disabled"Mike Galbraith
commit fd8ef11730f1d03d5d6555aa53126e9e34f52f12 upstream. This reverts commit 800d4d30c8f20bd728e5741a3b77c4859a613f7c. Between commits 8323f26ce342 ("sched: Fix race in task_group()") and 800d4d30c8f2 ("sched, autogroup: Stop going ahead if autogroup is disabled"), autogroup is a wreck. With both applied, all you have to do to crash a box is disable autogroup during boot up, then reboot.. boom, NULL pointer dereference due to commit 800d4d30c8f2 not allowing autogroup to move things, and commit 8323f26ce342 making that the only way to switch runqueues: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81063ac0>] effective_load.isra.43+0x50/0x90 Pid: 7047, comm: systemd-user-se Not tainted 3.6.8-smp #7 MEDIONPC MS-7502/MS-7502 RIP: effective_load.isra.43+0x50/0x90 Process systemd-user-se (pid: 7047, threadinfo ffff880221dde000, task ffff88022618b3a0) Call Trace: select_task_rq_fair+0x255/0x780 try_to_wake_up+0x156/0x2c0 wake_up_state+0xb/0x10 signal_wake_up+0x28/0x40 complete_signal+0x1d6/0x250 __send_signal+0x170/0x310 send_signal+0x40/0x80 do_send_sig_info+0x47/0x90 group_send_sig_info+0x4a/0x70 kill_pid_info+0x3a/0x60 sys_kill+0x97/0x1a0 ? vfs_read+0x120/0x160 ? sys_read+0x45/0x90 system_call_fastpath+0x16/0x1b Code: 49 0f af 41 50 31 d2 49 f7 f0 48 83 f8 01 48 0f 46 c6 48 2b 07 48 8b bf 40 01 00 00 48 85 ff 74 3a 45 31 c0 48 8b 8f 50 01 00 00 <48> 8b 11 4c 8b 89 80 00 00 00 49 89 d2 48 01 d0 45 8b 59 58 4c RIP [<ffffffff81063ac0>] effective_load.isra.43+0x50/0x90 RSP <ffff880221ddfbd8> CR2: 0000000000000000 Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Yong Zhang <yong.zhang0@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-13sched: Fix migration thread runtime bogosityMike Galbraith
commit 8f6189684eb4e85e6c593cd710693f09c944450a upstream. Make stop scheduler class do the same accounting as other classes, Migration threads can be caught in the act while doing exec balancing, leading to the below due to use of unmaintained ->se.exec_start. The load that triggered this particular instance was an apparently out of control heavily threaded application that does system monitoring in what equated to an exec bomb, with one of the VERY frequently migrated tasks being ps. %CPU PID USER CMD 99.3 45 root [migration/10] 97.7 53 root [migration/12] 97.0 57 root [migration/13] 90.1 49 root [migration/11] 89.6 65 root [migration/15] 88.7 17 root [migration/3] 80.4 37 root [migration/8] 78.1 41 root [migration/9] 44.2 13 root [migration/2] Signed-off-by: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1344051854.6739.19.camel@marge.simpson.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-13CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resumeSrivatsa S. Bhat
commit d35be8bab9b0ce44bed4b9453f86ebf64062721e upstream. In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed masks as and when necessary to ensure that the tasks belonging to the cpusets have some place (online CPUs) to run on. And regular CPU hotplug is destructive in the sense that the kernel doesn't remember the original cpuset configurations set by the user, across hotplug operations. However, suspend/resume (which uses CPU hotplug) is a special case in which the kernel has the responsibility to restore the system (during resume), to exactly the same state it was in before suspend. In order to achieve that, do the following: 1. Don't modify cpusets during suspend/resume. At all. In particular, don't move the tasks from one cpuset to another, and don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets during the CPU hotplug operations that are carried out in the suspend/resume path. 2. However, cpusets and sched domains are related. We just want to avoid altering cpusets alone. So, to keep the sched domains updated, build a single sched domain (containing all active cpus) during each of the CPU hotplug operations carried out in s/r path, effectively ignoring the cpusets' cpus_allowed masks. (Since userspace is frozen while doing all this, it will go unnoticed.) 3. During the last CPU online operation during resume, build the sched domains by looking up the (unaltered) cpusets' cpus_allowed masks. That will bring back the system to the same original state as it was in before suspend. Ultimately, this will not only solve the cpuset problem related to suspend resume (ie., restores the cpusets to exactly what it was before suspend, by not touching it at all) but also speeds up suspend/resume because we avoid running cpuset update code for every CPU being offlined/onlined. Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20120524141611.3692.20155.stgit@srivatsabhat.in.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-02sched: Fix race in task_group()Peter Zijlstra
commit 8323f26ce3425460769605a6aece7a174edaa7d1 upstream. Stefan reported a crash on a kernel before a3e5d1091c1 ("sched: Don't call task_group() too many times in set_task_rq()"), he found the reason to be that the multiple task_group() invocations in set_task_rq() returned different values. Looking at all that I found a lack of serialization and plain wrong comments. The below tries to fix it using an extra pointer which is updated under the appropriate scheduler locks. Its not pretty, but I can't really see another way given how all the cgroup stuff works. Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-09-14sched: fix divide by zero at {thread_group,task}_timesStanislaw Gruszka
commit bea6832cc8c4a0a9a65dd17da6aaa657fe27bc3e upstream. On architectures where cputime_t is 64 bit type, is possible to trigger divide by zero on do_div(temp, (__force u32) total) line, if total is a non zero number but has lower 32 bit's zeroed. Removing casting is not a good solution since some do_div() implementations do cast to u32 internally. This problem can be triggered in practice on very long lived processes: PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin" #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2 #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00 #3 [ffff880472a51cd0] die at ffffffff8100f26b #4 [ffff880472a51d00] do_trap at ffffffff814f03f4 #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff #6 [ffff880472a51e00] divide_error at ffffffff8100be7b [exception RIP: thread_group_times+0x56] RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046 RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800 R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08 R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d #8 [ffff880472a51f40] sys_times at ffffffff81088524 #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2 RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202 RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000 RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0 RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940 R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940 ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-09-14sched,cgroup: Fix up task_groups listMike Galbraith
commit 35cf4e50b16331def6cfcbee11e49270b6db07f5 upstream. With multiple instances of task_groups, for_each_rt_rq() is a noop, no task groups having been added to the rt.c list instance. This renders __enable/disable_runtime() and print_rt_stats() noop, the user (non) visible effect being that rt task groups are missing in /proc/sched_debug. Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-07-19sched/nohz: Rewrite and fix load-avg computation -- againPeter Zijlstra
commit 5167e8d5417bf5c322a703d2927daec727ea40dd upstream. Thanks to Charles Wang for spotting the defects in the current code: - If we go idle during the sample window -- after sampling, we get a negative bias because we can negate our own sample. - If we wake up during the sample window we get a positive bias because we push the sample to a known active period. So rewrite the entire nohz load-avg muck once again, now adding copious documentation to the code. Reported-and-tested-by: Doug Smythies <dsmythies@telus.net> Reported-and-tested-by: Charles Wang <muming.wq@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins [ minor edits ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-06-17sched: Fix the relax_domain_level boot parameterDimitri Sivanich
commit a841f8cef4bb124f0f5563314d0beaf2e1249d72 upstream. It does not get processed because sched_domain_level_max is 0 at the time that setup_relax_domain_level() is run. Simply accept the value as it is, as we don't know the value of sched_domain_level_max until sched domain construction is completed. Fix sched_relax_domain_level in cpuset. The build_sched_domain() routine calls the set_domain_attribute() routine prior to setting the sd->level, however, the set_domain_attribute() routine relies on the sd->level to decide whether idle load balancing will be off/on. Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120605184436.GA15668@sgi.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-09sched: Fix KVM and ia64 boot crash due to sched_groups circular linked list ↵Igor Mammedov
assumption If we have one cpu that failed to boot and boot cpu gave up on waiting for it and then another cpu is being booted, kernel might crash with following OOPS: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 IP: [<ffffffff812c3630>] __bitmap_weight+0x30/0x80 Call Trace: [<ffffffff8108b9b6>] build_sched_domains+0x7b6/0xa50 The crash happens in init_sched_groups_power() that expects sched_groups to be circular linked list. However it is not always true, since sched_groups preallocated in __sdt_alloc are initialized in build_sched_groups and it may exit early if (cpu != cpumask_first(sched_domain_span(sd))) return 0; without initializing sd->groups->next field. Fix bug by initializing next field right after sched_group was allocated. Also-Reported-by: Jiang Liu <liuj97@gmail.com> Signed-off-by: Igor Mammedov <imammedo@redhat.com> Cc: a.p.zijlstra@chello.nl Cc: pjt@google.com Cc: seto.hidetoshi@jp.fujitsu.com Link: http://lkml.kernel.org/r/1336559908-32533-1-git-send-email-imammedo@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-04-26sched: Fix OOPS when build_sched_domains() percpu allocation failshe, bo
Under extreme memory used up situations, percpu allocation might fail. We hit it when system goes to suspend-to-ram, causing a kworker panic: EIP: [<c124411a>] build_sched_domains+0x23a/0xad0 Kernel panic - not syncing: Fatal exception Pid: 3026, comm: kworker/u:3 3.0.8-137473-gf42fbef #1 Call Trace: [<c18cc4f2>] panic+0x66/0x16c [...] [<c1244c37>] partition_sched_domains+0x287/0x4b0 [<c12a77be>] cpuset_update_active_cpus+0x1fe/0x210 [<c123712d>] cpuset_cpu_inactive+0x1d/0x30 [...] With this fix applied build_sched_domains() will return -ENOMEM and the suspend attempt fails. Signed-off-by: he, bo <bo.he@intel.com> Reviewed-by: Zhang, Yanmin <yanmin.zhang@intel.com> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@kernel.org> Link: http://lkml.kernel.org/r/1335355161.5892.17.camel@hebo [ So, we fail to deallocate a CPU because we cannot allocate RAM :-/ I don't like that kind of sad behavior but nevertheless it should not crash under high memory load. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-04-26sched: Fix more load-balancing falloutPeter Zijlstra
Commits 367456c756a6 ("sched: Ditch per cgroup task lists for load-balancing") and 5d6523ebd ("sched: Fix load-balance wreckage") left some more wreckage. By setting loop_max unconditionally to ->nr_running load-balancing could take a lot of time on very long runqueues (hackbench!). So keep the sysctl as max limit of the amount of tasks we'll iterate. Furthermore, the min load filter for migration completely fails with cgroups since inequality in per-cpu state can easily lead to such small loads :/ Furthermore the change to add new tasks to the tail of the queue instead of the head seems to have some effect.. not quite sure I understand why. Combined these fixes solve the huge hackbench regression reported by Tim when hackbench is ran in a cgroup. Reported-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1335365763.28150.267.camel@twins [ got rid of the CONFIG_PREEMPT tuning and made small readability edits ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-31Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar. * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix incorrect usage of for_each_cpu_mask() in select_fallback_rq() sched: Fix __schedule_bug() output when called from an interrupt sched/arch: Introduce the finish_arch_post_lock_switch() scheduler callback
2012-03-31sched: Fix incorrect usage of for_each_cpu_mask() in select_fallback_rq()Srivatsa S. Bhat
The function for_each_cpu_mask() expects a *pointer* to struct cpumask as its second argument, whereas select_fallback_rq() passes the value itself. And moreover, for_each_cpu_mask() has been marked as obselete in include/linux/cpumask.h. So move to the more appropriate for_each_cpu() variant. Reported-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Dave Jones <davej@redhat.com> Cc: Liu Chuansheng <chuansheng.liu@intel.com> Cc: vapier@gentoo.org Cc: rusty@rustcorp.com.au Link: http://lkml.kernel.org/r/4F75BED4.9050005@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-29Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar. * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpusets: Remove an unused variable sched/rt: Improve pick_next_highest_task_rt() sched: Fix select_fallback_rq() vs cpu_active/cpu_online sched/x86/smp: Do not enable IRQs over calibrate_delay() sched: Fix compiler warning about declared inline after use MAINTAINERS: Update email address for SCHEDULER and PERF EVENTS
2012-03-29Merge branch 'sched/arch' into sched/urgentIngo Molnar
Merge reason: It has not gone upstream via the ARM tree, merge it via the scheduler tree. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-29sched: Fix __schedule_bug() output when called from an interruptStephen Boyd
If schedule is called from an interrupt handler __schedule_bug() will call show_regs() with the registers saved during the interrupt handling done in do_IRQ(). This means we'll see the registers and the backtrace for the process that was interrupted and not the full backtrace explaining who called schedule(). This is due to 838225b ("sched: use show_regs() to improve __schedule_bug() output", 2007-10-24) which improperly assumed that get_irq_regs() would return the registers for the current stack because it is being called from within an interrupt handler. Simply remove the show_reg() code so that we dump a backtrace for the interrupt handler that called schedule(). [ I ran across this when I was presented with a scheduling while atomic log with a stacktrace pointing at spin_unlock_irqrestore(). It made no sense and I had to guess what interrupt handler could be called and poke around for someone calling schedule() in an interrupt handler. A simple test of putting an msleep() in an interrupt handler works better with this patch because you can actually see the msleep() call in the backtrace. ] Also-reported-by: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Satyam Sharma <satyam@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1332979847-27102-1-git-send-email-sboyd@codeaurora.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-28Add #includes needed to permit the removal of asm/system.hDavid Howells
asm/system.h is a cause of circular dependency problems because it contains commonly used primitive stuff like barrier definitions and uncommonly used stuff like switch_to() that might require MMU definitions. asm/system.h has been disintegrated by this point on all arches into the following common segments: (1) asm/barrier.h Moved memory barrier definitions here. (2) asm/cmpxchg.h Moved xchg() and cmpxchg() here. #included in asm/atomic.h. (3) asm/bug.h Moved die() and similar here. (4) asm/exec.h Moved arch_align_stack() here. (5) asm/elf.h Moved AT_VECTOR_SIZE_ARCH here. (6) asm/switch_to.h Moved switch_to() here. Signed-off-by: David Howells <dhowells@redhat.com>
2012-03-27sched/rt: Improve pick_next_highest_task_rt()Michael J Wang
Avoid extra work by continuing on to the next rt_rq if the highest prio task in current rt_rq is the same priority as our candidate task. More detailed explanation: if next is not NULL, then we have found a candidate task, and its priority is next->prio. Now we are looking for an even higher priority task in the other rt_rq's. idx is the highest priority in the current candidate rt_rq. In the current 3.3 code, if idx is equal to next->prio, we would start scanning the tasks in that rt_rq and replace the current candidate task with a task from that rt_rq. But the new task would only have a priority that is equal to our previous candidate task, so we have not advanced our goal of finding a higher prio task. So we should avoid the extra work by continuing on to the next rt_rq if idx is equal to next->prio. Signed-off-by: Michael J Wang <mjwang@broadcom.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Yong Zhang <yong.zhang0@gmail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/2EF88150C0EF2C43A218742ED384C1BC0FC83D6B@IRVEXCHMB08.corp.ad.broadcom.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-27sched: Fix select_fallback_rq() vs cpu_active/cpu_onlinePeter Zijlstra
Commit 5fbd036b55 ("sched: Cleanup cpu_active madness"), which was supposed to finally sort the cpu_active mess, instead uncovered more. Since CPU_STARTING is ran before setting the cpu online, there's a (small) window where the cpu has active,!online. If during this time there's a wakeup of a task that used to reside on that cpu select_task_rq() will use select_fallback_rq() to compute an alternative cpu to run on since we find !online. select_fallback_rq() however will compute the new cpu against cpu_active, this means that it can return the same cpu it started out with, the !online one, since that cpu is in fact marked active. This results in us trying to scheduling a task on an offline cpu and triggering a WARN in the IPI code. The solution proposed by Chuansheng Liu of setting cpu_active in set_cpu_online() is buggy, firstly not all archs actually use set_cpu_online(), secondly, not all archs call set_cpu_online() with IRQs disabled, this means we would introduce either the same race or the race from fd8a7de17 ("x86: cpu-hotplug: Prevent softirq wakeup on wrong CPU") -- albeit much narrower. [ By setting online first and active later we have a window of online,!active, fresh and bound kthreads have task_cpu() of 0 and since cpu0 isn't in tsk_cpus_allowed() we end up in select_fallback_rq() which excludes !active, resulting in a reset of ->cpus_allowed and the thread running all over the place. ] The solution is to re-work select_fallback_rq() to require active _and_ online. This makes the active,!online case work as expected, OTOH archs running CPU_STARTING after setting online are now vulnerable to the issue from fd8a7de17 -- these are alpha and blackfin. Reported-by: Chuansheng Liu <chuansheng.liu@intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Frysinger <vapier@gentoo.org> Cc: linux-alpha@vger.kernel.org Link: http://lkml.kernel.org/n/tip-hubqk1i10o4dpvlm06gq7v6j@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-23sched: Fix compiler warning about declared inline after usePeter Zijlstra
kernel/sched/fair.c:420: warning: 'account_cfs_rq_runtime' declared inline after being called kernel/sched/fair.c:420: warning: previous declaration of 'account_cfs_rq_runtime' was here kernel/sched/fair.c:1165: warning: 'return_cfs_rq_runtime' declared inlineafter being called kernel/sched/fair.c:1165: warning: previous declaration of 'return_cfs_rq_runtime' was here Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20120321200717.49BB4A024E@akpm.mtv.corp.google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-21Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates for 3.4 from James Morris: "The main addition here is the new Yama security module from Kees Cook, which was discussed at the Linux Security Summit last year. Its purpose is to collect miscellaneous DAC security enhancements in one place. This also marks a departure in policy for LSM modules, which were previously limited to being standalone access control systems. Chromium OS is using Yama, and I believe there are plans for Ubuntu, at least. This patchset also includes maintenance updates for AppArmor, TOMOYO and others." Fix trivial conflict in <net/sock.h> due to the jumo_label->static_key rename. * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits) AppArmor: Fix location of const qualifier on generated string tables TOMOYO: Return error if fails to delete a domain AppArmor: add const qualifiers to string arrays AppArmor: Add ability to load extended policy TOMOYO: Return appropriate value to poll(). AppArmor: Move path failure information into aa_get_name and rename AppArmor: Update dfa matching routines. AppArmor: Minor cleanup of d_namespace_path to consolidate error handling AppArmor: Retrieve the dentry_path for error reporting when path lookup fails AppArmor: Add const qualifiers to generated string tables AppArmor: Fix oops in policy unpack auditing AppArmor: Fix error returned when a path lookup is disconnected KEYS: testing wrong bit for KEY_FLAG_REVOKED TOMOYO: Fix mount flags checking order. security: fix ima kconfig warning AppArmor: Fix the error case for chroot relative path name lookup AppArmor: fix mapping of META_READ to audit and quiet flags AppArmor: Fix underflow in xindex calculation AppArmor: Fix dropping of allowed operations that are force audited AppArmor: Add mising end of structure test to caps unpacking ...
2012-03-20Merge branch 'for-3.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup changes from Tejun Heo: "Out of the 8 commits, one fixes a long-standing locking issue around tasklist walking and others are cleanups." * 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list cgroup: Remove wrong comment on cgroup_enable_task_cg_list() cgroup: remove cgroup_subsys argument from callbacks cgroup: remove extra calls to find_existing_css_set cgroup: replace tasklist_lock with rcu_read_lock cgroup: simplify double-check locking in cgroup_attach_proc cgroup: move struct cgroup_pidlist out from the header file cgroup: remove cgroup_attach_task_current_cg()
2012-03-20Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes for v3.4 from Ingo Molnar * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) printk: Make it compile with !CONFIG_PRINTK sched/x86: Fix overflow in cyc2ns_offset sched: Fix nohz load accounting -- again! sched: Update yield() docs printk/sched: Introduce special printk_sched() for those awkward moments sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer sched: Cleanup cpu_active madness sched: Fix load-balance wreckage sched: Clean up parameter passing of proc_sched_autogroup_set_nice() sched: Ditch per cgroup task lists for load-balancing sched: Rename load-balancing fields sched: Move load-balancing arguments into helper struct sched/rt: Do not submit new work when PI-blocked sched/rt: Prevent idle task boosting sched/wait: Add __wake_up_all_locked() API sched/rt: Document scheduler related skip-resched-check sites sched/rt: Use schedule_preempt_disabled() sched/rt: Add schedule_preempt_disabled() sched/rt: Do not throttle when PI boosting sched/rt: Keep period timer ticking when rt throttling is active ...
2012-03-13sched/arch: Introduce the finish_arch_post_lock_switch() scheduler callbackCatalin Marinas
This callback is called by the scheduler after rq->lock has been released and interrupts enabled. It will be used in subsequent patches on the ARM architecture. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Will Deacon <will.deacon@arm.com> Reviewed-by: Frank Rowand <frank.rowand@am.sony.com> Tested-by: Will Deacon <will.deacon@arm.com> Tested-by: Marc Zyngier <Marc.Zyngier@arm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/20120313110840.7b444deb6b1bb902c15f3cdf@canb.auug.org.au Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-12sched: Fix nohz load accounting -- again!Peter Zijlstra
Various people reported nohz load tracking still being wrecked, but Doug spotted the actual problem. We fold the nohz remainder in too soon, causing us to loose samples and under-account. So instead of playing catch-up up-front, always do a single load-fold with whatever state we encounter and only then fold the nohz remainder and play catch-up. Reported-by: Doug Smythies <dsmythies@telus.net> Reported-by: LesÅ=82aw Kope=C4=87 <leslaw.kopec@nasza-klasa.pl> Reported-by: Aman Gupta <aman@tmm1.net> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-4v31etnhgg9kwd6ocgx3rxl8@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-12sched: Update yield() docsPeter Zijlstra
Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1331056466.11248.327.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-12printk/sched: Introduce special printk_sched() for those awkward momentsPeter Zijlstra
There's a few awkward printk()s inside of scheduler guts that people prefer to keep but really are rather deadlock prone. Fudge around it by storing the text in a per-cpu buffer and poll it using the existing printk_tick() handler. This will drop output when its more frequent than once a tick, however only the affinity thing could possible go that fast and for that just one should suffice to notify the admin he's done something silly.. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-wua3lmkt3dg8nfts66o6brne@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-12sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancerDiwakar Tundlam
The 'next_balance' field of 'nohz' idle balancer must be initialized to jiffies. Since jiffies is initialized to negative 300 seconds the 'nohz' idle balancer does not run for the first 300s (5mins) after bootup. If no new processes are spawed or no idle cycles happen, the load on the cpus will remain unbalanced for that duration. Signed-off-by: Diwakar Tundlam <dtundlam@nvidia.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1DD7BFEDD3147247B1355BEFEFE4665237994F30EF@HQMAIL04.nvidia.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-12sched: Cleanup cpu_active madnessPeter Zijlstra
Stepan found: CPU0 CPUn _cpu_up() __cpu_up() boostrap() notify_cpu_starting() set_cpu_online() while (!cpu_active()) cpu_relax() <PREEMPT-out> smp_call_function(.wait=1) /* we find cpu_online() is true */ arch_send_call_function_ipi_mask() /* wait-forever-more */ <PREEMPT-in> local_irq_enable() cpu_notify(CPU_ONLINE) sched_cpu_active() set_cpu_active() Now the purpose of cpu_active is mostly with bringing down a cpu, where we mark it !active to avoid the load-balancer from moving tasks to it while we tear down the cpu. This is required because we only update the sched_domain tree after we brought the cpu-down. And this is needed so that some tasks can still run while we bring it down, we just don't want new tasks to appear. On cpu-up however the sched_domain tree doesn't yet include the new cpu, so its invisible to the load-balancer, regardless of the active state. So instead of setting the active state after we boot the new cpu (and consequently having to wait for it before enabling interrupts) set the cpu active before we set it online and avoid the whole mess. Reported-by: Stepan Moskovchenko <stepanm@codeaurora.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1323965362.18942.71.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-12sched: Fix load-balance wreckagePeter Zijlstra
Commit 367456c ("sched: Ditch per cgroup task lists for load-balancing") completely wrecked load-balancing due to a few silly mistakes. Correct those and remove more pointless code. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-zk04ihygwxn7qqrlpaf73b0r@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-07Revert "CPU hotplug, cpusets, suspend: Don't touch cpusets during ↵Linus Torvalds
suspend/resume" This reverts commit 8f2f748b0656257153bcf0941df8d6060acc5ca6. It causes some odd regression that we have not figured out, and it's too late in the -rc series to try to figure it out now. As reported by Konstantin Khlebnikov, it causes consistent hangs on his laptop (Thinkpad x220: 2x cores + HT). They can be avoided by adding calls to "rebuild_sched_domains();" in cpuset_cpu_[in]active() for the CPU_{ONLINE/DOWN_FAILED/DOWN_PREPARE}_FROZEN cases, but it's not at all clear why, and it makes no sense. Konstantin's config doesn't even have CONFIG_CPUSETS enabled, just to make things even more interesting. So it's not the cpusets, it's just the scheduling domains. So until this is understood, revert. Bisected-reported-and-tested-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-05Merge branch 'perf/urgent' into perf/coreIngo Molnar
Conflicts: tools/perf/builtin-record.c tools/perf/builtin-top.c tools/perf/perf.h tools/perf/util/top.h Merge reason: resolve these cherry-picking conflicts. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-02sched: Clean up parameter passing of proc_sched_autogroup_set_nice()Hiroshi Shimamoto
Pass nice as a value to proc_sched_autogroup_set_nice(). No side effect is expected, and the variable err will be overwritten with the return value. Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/4F45FBB7.5090607@ct.jp.nec.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01sched: Ditch per cgroup task lists for load-balancingPeter Zijlstra
Per cgroup load-balance has numerous problems, chief amongst them that there is no real sane order in them. So stop pretending it makes sense and enqueue all tasks on a single list. This also allows us to more easily fix the fwd progress issue uncovered by the lock-break stuff. Rotate the list on failure to migreate and limit the total iterations to nr_running (which with releasing the lock isn't strictly accurate but close enough). Also add a filter that skips very light tasks on the first attempt around the list, this attempts to avoid shooting whole cgroups around without affecting over balance. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: pjt@google.com Link: http://lkml.kernel.org/n/tip-tx8yqydc7eimgq7i4rkc3a4g@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01sched: Rename load-balancing fieldsPeter Zijlstra
s/env->this_/env->dst_/g s/env->busiest_/env->src_/g s/pull_task/move_task/g Makes everything clearer. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: pjt@google.com Link: http://lkml.kernel.org/n/tip-0yvgms8t8x962drpvl0fu0kk@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01sched: Move load-balancing arguments into helper structPeter Zijlstra
Passing large sets of similar arguments all around the load-balancer gets tiresom when you want to modify something. Stick them all in a helper structure and pass the structure around. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: pjt@google.com Link: http://lkml.kernel.org/n/tip-5slqz0vhsdzewrfk9eza1aon@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01sched/rt: Do not submit new work when PI-blockedThomas Gleixner
When we are PI-blocked then we want to get things done ASAP. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-vw8et3445km5b8mpihf4trae@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01sched/rt: Prevent idle task boostingThomas Gleixner
Idle task boosting is a nono in general. There is one exception, when PREEMPT_RT and NOHZ is active: The idle task calls get_next_timer_interrupt() and holds the timer wheel base->lock on the CPU and another CPU wants to access the timer (probably to cancel it). We can safely ignore the boosting request, as the idle CPU runs this code with interrupts disabled and will complete the lock protected section without being interrupted. So there is no real need to boost. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-755rvsosz7sdzot12a3gbha6@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01sched/wait: Add __wake_up_all_locked() APIThomas Gleixner
For code which protects the waitqueue itself with another lock it makes no sense to acquire the waitqueue lock for wakeup all. Provide __wake_up_all_locked(). This is an optimization on the vanilla kernel (to be used by the PCI code) and an important semantic distinction on -rt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-ux6m4b8jonb9inx8xafh77ds@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01sched/rt: Document scheduler related skip-resched-check sitesThomas Gleixner
Create a distinction between scheduler related preempt_enable_no_resched() calls and the nearly one hundred other places in the kernel that do not want to reschedule, for one reason or another. This distinction matters for -rt, where the scheduler and the non-scheduler preempt models (and checks) are different. For upstream it's purely documentational. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/n/tip-gs88fvx2mdv5psnzxnv575ke@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01sched/rt: Add schedule_preempt_disabled()Thomas Gleixner
Add helper to get rid of the ever repeating: preempt_enable_no_resched(); schedule(); preempt_disable(); patterns. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-wxx7btox7coby6ifv5vzhzgp@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01sched/rt: Do not throttle when PI boostingPeter Zijlstra
When a runqueue has rt_runtime_us = 0 then the only way it can accumulate rt_time is via PI boosting. That causes the runqueue to be throttled and replenishing does not change anything due to rt_runtime_us = 0. So avoid that situation by clearing rt_time and skip the throttling alltogether. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> [ Changelog ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/n/tip-7x70cypsotjb4jvcor3edctk@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01sched/rt: Keep period timer ticking when rt throttling is activePeter Zijlstra
When a runqueue is throttled we cannot disable the period timer because that timer is the only way to undo the throttling. We got stale throttling entries when a rq was throttled and then the global sysctl was disabled, which stopped the timer. Signed-off-by: Peter Zijlstra <peterz@infradead.org> [ Added changelog ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/n/tip-nuj34q52p6ro7szapuz84i0v@git.kernel.org Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-01Merge branch 'linus' into sched/coreIngo Molnar
Merge reason: we'll queue up dependent patches. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-27CPU hotplug, cpusets, suspend: Don't touch cpusets during suspend/resumeSrivatsa S. Bhat
Currently, during CPU hotplug, the cpuset callbacks modify the cpusets to reflect the state of the system, and this handling is asymmetric. That is, upon CPU offline, that CPU is removed from all cpusets. However when it comes back online, it is put back only to the root cpuset. This gives rise to a significant problem during suspend/resume. During suspend, we offline all non-boot cpus and during resume we online them back. Which means, after a resume, all cpusets (except the root cpuset) will be restricted to just one single CPU (the boot cpu). But the whole point of suspend/resume is to restore the system to a state which is as close as possible to how it was before suspend. So to fix this, don't touch cpusets during suspend/resume. That is, modify the cpuset-related CPU hotplug callback to just ignore CPU hotplug when it is initiated as part of the suspend/resume sequence. Reported-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/4F460D7B.1020703@linux.vnet.ibm.com Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>