summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2009-10-12NOHZ: update idle state also when NOHZ is inactiveEero Nurkkala
commit fdc6f192e7e1ae80565af23cc33dc88e3dcdf184 upstream. Commit f2e21c9610991e95621a81407cdbab881226419b had unfortunate side effects with cpufreq governors on some systems. If the system did not switch into NOHZ mode ts->inidle is not set when tick_nohz_stop_sched_tick() is called from the idle routine. Therefor all subsequent calls from irq_exit() to tick_nohz_stop_sched_tick() fail to call tick_nohz_start_idle(). This results in bogus idle accounting information which is passed to cpufreq governors. Set the inidle flag unconditionally of the NOHZ active state to keep the idle time accounting correct in any case. [ tglx: Added comment and tweaked the changelog ] Reported-by: Steven Noonan <steven@uplinklabs.net> Signed-off-by: Eero Nurkkala <ext-eero.nurkkala@nokia.com> Cc: Rik van Riel <riel@redhat.com> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Steven Noonan <steven@uplinklabs.net> LKML-Reference: <1254907901.30157.93.camel@eenurkka-desktop> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-10-12futex: Fix locking imbalanceThomas Gleixner
commit eaaea8036d0261d87d7072c5bc88c7ea730c18ac upstream. Rich reported a lock imbalance in the futex code: http://bugzilla.kernel.org/show_bug.cgi?id=14288 It's caused by the displacement of the retry_private label in futex_wake_op(). The code unlocks the hash bucket locks in the error handling path and retries without locking them again which makes the next unlock fail. Move retry_private so we lock the hash bucket locks when we retry. Reported-by: Rich Ercolany <rercola@acm.jhu.edu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Darren Hart <dvhltc@us.ibm.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-10-12futex: Nullify robust lists after cleanupPeter Zijlstra
commit fc6b177dee33365ccb29fe6d2092223cf8d679f9 upstream. The robust list pointers of user space held futexes are kept intact over an exec() call. When the exec'ed task exits exit_robust_list() is called with the stale pointer. The risk of corruption is minimal, but still it is incorrect to keep the pointers valid. Actually glibc should uninstall the robust list before calling exec() but we have to deal with it anyway. Nullify the pointers after [compat_]exit_robust_list() has been called. Reported-by: Anirban Sinha <ani@anirban.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-10-12futex: Move exit_pi_state() call to release_mm()Thomas Gleixner
commit 322a2c100a8998158445599ea437fb556aa95b11 upstream. exit_pi_state() is called from do_exit() but not from do_execve(). Move it to release_mm() so it gets called from do_execve() as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> LKML-Reference: <new-submission> Cc: Anirban Sinha <ani@anirban.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-10-12futex: fix requeue_pi key imbalanceDarren Hart
commit da085681014fb43d67d9bf6d14bc068e9254bd49 upstream. If futex_wait_requeue_pi() wakes prior to requeue, we drop the reference to the source futex_key twice, once in handle_early_requeue_pi_wakeup() and once on our way out. Remove the drop from the handle_early_requeue_pi_wakeup() and keep the get/drops together in futex_wait_requeue_pi(). Reported-by: Helge Bahmann <hcb@chaoticmind.net> Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Helge Bahmann <hcb@chaoticmind.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@us.ibm.com> LKML-Reference: <4ACCE21E.5030805@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-10-12ftrace: check for failure for all conversionsSteven Rostedt
commit 3279ba37db5d65c4ab0dcdee3b211ccb85bb563f upstream. Due to legacy code from back when the dynamic tracer used a daemon, only core kernel code was checking for failures. This is no longer the case. We must check for failures any time we perform text modifications. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-10-12tracing: correct module boundaries for ftrace_releasejolsa@redhat.com
commit e7247a15ff3bbdab0a8b402dffa1171e5c05a8e0 upstream. When the module is about the unload we release its call records. The ftrace_release function was given wrong values representing the module core boundaries, thus not releasing its call records. Plus making ftrace_release function module specific. Signed-off-by: Jiri Olsa <jolsa@redhat.com> LKML-Reference: <1254934835-363-3-git-send-email-jolsa@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-10-05perf_counter: Fix perf_copy_attr() pointer arithmeticIan Schram
commit cdf8073d6b2c6c5a3cd6ce0e6c1297157f7f99ba upstream. There is still some weird code in per_copy_attr(). Which supposedly checks that all bytes trailing a struct are zero. It doesn't seem to get pointer arithmetic right. Since it increments an iterating pointer by sizeof(unsigned long) rather than 1. Signed-off-by: Ian Schram <ischram@telenet.be> [ v2: clean up the messy PTR_ALIGN logic as well. ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <4AB3DEE2.3030600@telenet.be> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-09-24perf_counter: Start counting time enabled when group leader gets enabledPaul Mackerras
commit fa289beca9de9119c7760bd984f3640da21bc94c upstream. Currently, if a group is created where the group leader is initially disabled but a non-leader member is initially enabled, and then the leader is subsequently enabled some time later, the time_enabled for the non-leader member will reflect the whole time since it was created, not just the time since the leader was enabled. This is incorrect, because all of the members are effectively disabled while the leader is disabled, since none of the members can go on the PMU if the leader can't. Thus we have to update the ->tstamp_enabled for all the enabled group members when a group leader is enabled, so that the time_enabled computation only counts the time since the leader was enabled. Similarly, when disabling a group leader we have to update the time_enabled and time_running for all of the group members. Also, in update_counter_times, we have to treat a counter whose group leader is disabled as being disabled. Reported-by: Stephane Eranian <eranian@googlemail.com> Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <19091.29664.342227.445006@drongo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-09-24perf_counter: Fix buffer overflow in perf_copy_attr()Xiao Guangrong
commit b3e62e35058fc744ac794611f4e79bcd1c5a4b83 upstream. If we pass a big size data over perf_counter_open() syscall, the kernel will copy this data to a small buffer, it will cause kernel crash. This bug makes the kernel unsafe and non-root local user can trigger it. Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Paul Mackerras <paulus@samba.org> LKML-Reference: <4AAF37D4.5010706@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-09-05Merge branch 'perfcounters-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf_counter/powerpc: Fix cache event codes for POWER7 perf_counter: Fix /0 bug in swcounters perf_counters: Increase paranoia level
2009-08-29perf_counter: Fix /0 bug in swcountersPeter Zijlstra
We have a race in the swcounter stuff where we can start counting a counter that has never been enabled, this leads to a /0 situation. The below avoids the /0 but doesn't close the race, this would need a new counter state. The race is due to perf_swcounter_is_counting() which cannot discern between disabled due to scheduled out, and disabled for any other reason. Such a crash has been seen by Ingo: [ 967.092372] divide error: 0000 [#1] SMP [ 967.096499] last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map [ 967.104846] CPU 5 [ 967.106965] Modules linked in: [ 967.110169] Pid: 3351, comm: hackbench Not tainted 2.6.31-rc8-tip-01158-gd940a54-dirty #1568 X8DTN [ 967.119456] RIP: 0010:[<ffffffff810c0aba>] [<ffffffff810c0aba>] perf_swcounter_ctx_event+0x127/0x1af [ 967.129137] RSP: 0018:ffff8801a95abd70 EFLAGS: 00010046 [ 967.134699] RAX: 0000000000000002 RBX: ffff8801bd645c00 RCX: 0000000000000002 [ 967.142162] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8801bd645d40 [ 967.149584] RBP: ffff8801a95abdb0 R08: 0000000000000001 R09: ffff8801a95abe00 [ 967.157042] R10: 0000000000000037 R11: ffff8801aa1245f8 R12: ffff8801a95abe00 [ 967.164481] R13: ffff8801a95abe00 R14: ffff8801aa1c0e78 R15: 0000000000000001 [ 967.171953] FS: 0000000000000000(0000) GS:ffffc90000a00000(0063) knlGS:00000000f7f486c0 [ 967.180406] CS: 0010 DS: 002b ES: 002b CR0: 000000008005003b [ 967.186374] CR2: 000000004822c0ac CR3: 00000001b19a2000 CR4: 00000000000006e0 [ 967.193770] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 967.201224] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 967.208692] Process hackbench (pid: 3351, threadinfo ffff8801a95aa000, task ffff8801a96b0000) [ 967.217607] Stack: [ 967.219711] 0000000000000000 0000000000000037 0000000200000001 ffffc90000a1107c [ 967.227296] <0> ffff8801a95abe00 0000000000000001 0000000000000001 0000000000000037 [ 967.235333] <0> ffff8801a95abdf0 ffffffff810c0c20 0000000200a14f30 ffff8801a95abe40 [ 967.243532] Call Trace: [ 967.246103] [<ffffffff810c0c20>] do_perf_swcounter_event+0xde/0xec [ 967.252635] [<ffffffff810c0ca7>] perf_tpcounter_event+0x79/0x7b [ 967.258957] [<ffffffff81037f73>] ftrace_profile_sched_switch+0xc0/0xcb [ 967.265791] [<ffffffff8155f22d>] schedule+0x429/0x4c4 [ 967.271156] [<ffffffff8100c01e>] int_careful+0xd/0x14 Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <1251472247.17617.74.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-28modules: Fix build error in the !CONFIG_KALLSYMS caseIngo Molnar
> James Bottomley (1): > module: workaround duplicate section names -tip testing found that this patch breaks the build on x86 if CONFIG_KALLSYMS is disabled: kernel/module.c: In function ‘load_module’: kernel/module.c:2367: error: ‘struct module’ has no member named ‘sect_attrs’ distcc[8269] ERROR: compile kernel/module.c on ph/32 failed make[1]: *** [kernel/module.o] Error 1 make: *** [kernel] Error 2 make: *** Waiting for unfinished jobs.... Commit 1b364bf misses the fact that section attributes are only built and dealt with if kallsyms is enabled. The patch below fixes this. ( note, technically speaking this should depend on CONFIG_SYSFS as well but this patch is correct too and keeps the #ifdef less intrusive - in the KALLSYMS && !SYSFS case the code is a NOP. ) Signed-off-by: Ingo Molnar <mingo@elte.hu> [ Replaced patch with a slightly cleaner variation by James Bottomley ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-28perf_counters: Increase paranoia levelIngo Molnar
Per-cpu counters are an ASLR information leak as they show the execution other tasks do. Increase the paranoia level to 1, which disallows per-cpu counters. (they still allow counting/profiling of own tasks - and admin can profile everything.) Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-27module: workaround duplicate section namesJames Bottomley
The root cause is a duplicate section name (.text); is this legal? [ Amerigo Wang: "AFAIK, yes." ] However, there's a problem with commit 6d76013381ed28979cd122eb4b249a88b5e384fa in that if you fail to allocate a mod->sect_attrs (in this case it's null because of the duplication), it still gets used without checking in add_notes_attrs() This should fix it [ This patch leaves other problems, particularly the sections directory, but recent parisc toolchains seem to produce these modules and this prevents a crash and is a minimal change -- RR ] Signed-off-by: James Bottomley <James.Bottomley@suse.de> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Tested-by: Helge Deller <deller@gmx.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-27module: fix BUG_ON() for powerpc (and other function descriptor archs)Rusty Russell
The rarely-used symbol_put_addr() needs to use dereference_function_descriptor on powerpc. Reported-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-26clone(): fix race between copy_process() and de_thread()Oleg Nesterov
Spotted by Hiroshi Shimamoto who also provided the test-case below. copy_process() uses signal->count as a reference counter, but it is not. This test case #include <sys/types.h> #include <sys/wait.h> #include <unistd.h> #include <stdio.h> #include <errno.h> #include <pthread.h> void *null_thread(void *p) { for (;;) sleep(1); return NULL; } void *exec_thread(void *p) { execl("/bin/true", "/bin/true", NULL); return null_thread(p); } int main(int argc, char **argv) { for (;;) { pid_t pid; int ret, status; pid = fork(); if (pid < 0) break; if (!pid) { pthread_t tid; pthread_create(&tid, NULL, exec_thread, NULL); for (;;) pthread_create(&tid, NULL, null_thread, NULL); } do { ret = waitpid(pid, &status, 0); } while (ret == -1 && errno == EINTR); } return 0; } quickly creates an unkillable task. If copy_process(CLONE_THREAD) races with de_thread() copy_signal()->atomic(signal->count) breaks the signal->notify_count logic, and the execing thread can hang forever in kernel space. Change copy_process() to increment count/live only when we know for sure we can't fail. In this case the forked thread will take care of its reference to signal correctly. If copy_process() fails, check CLONE_THREAD flag. If it it set - do nothing, the counters were not changed and current belongs to the same thread group. If it is not set, ->signal must be released in any case (and ->count must be == 1), the forked child is the only thread in the thread group. We need more cleanups here, in particular signal->count should not be used by de_thread/__exit_signal at all. This patch only fixes the bug. Reported-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Tested-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-25Merge branch 'perfcounters-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf_counter: Fix typo in read() output generation perf tools: Check perf.data owner
2009-08-25Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: clockevent: Prevent dead lock on clockevents_lock timers: Drop write permission on /proc/timer_list
2009-08-25Merge branch 'tracing-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Fix too large stack usage in do_one_initcall() tracing: handle broken names in ftrace filter ftrace: Unify effect of writing to trace_options and option/*
2009-08-21perf_counter: Fix typo in read() output generationPeter Zijlstra
When you iterate a list, using the iterator is useful. Before: ID: 5 ID: 5 ID: 5 ID: 5 EVNT: 0x40088b scale: nan ID: 5 CNT: 1006252 ID: 6 CNT: 1011090 ID: 7 CNT: 1011196 ID: 8 CNT: 1011095 EVNT: 0x40088c scale: 1.000000 ID: 5 CNT: 2003065 ID: 6 CNT: 2011671 ID: 7 CNT: 2012620 ID: 8 CNT: 2013479 EVNT: 0x40088c scale: 1.000000 ID: 5 CNT: 3002390 ID: 6 CNT: 3015996 ID: 7 CNT: 3018019 ID: 8 CNT: 3020006 EVNT: 0x40088b scale: 1.000000 ID: 5 CNT: 4002406 ID: 6 CNT: 4021120 ID: 7 CNT: 4024241 ID: 8 CNT: 4027059 After: ID: 1 ID: 2 ID: 3 ID: 4 EVNT: 0x400889 scale: nan ID: 1 CNT: 1005270 ID: 2 CNT: 1009833 ID: 3 CNT: 1010065 ID: 4 CNT: 1010088 EVNT: 0x400898 scale: nan ID: 1 CNT: 2001531 ID: 2 CNT: 2022309 ID: 3 CNT: 2022470 ID: 4 CNT: 2022627 EVNT: 0x400888 scale: 0.489467 ID: 1 CNT: 3001261 ID: 2 CNT: 3027088 ID: 3 CNT: 3027941 ID: 4 CNT: 3028762 Reported-by: stephane eranian <eranian@googlemail.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Corey J Ashford <cjashfor@us.ibm.com> Cc: perfmon2-devel <perfmon2-devel@lists.sourceforge.net> LKML-Reference: <1250867976.7538.73.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-19Merge branch 'perfcounters-fixes-for-linus-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perfcounters-fixes-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf tools: Make 'make html' work perf annotate: Fix segmentation fault perf_counter: Fix the PARISC build perf_counter: Check task on counter read IPI perf: Rename perf-examples.txt to examples.txt perf record: Fix typo in pid_synthesize_comm_event
2009-08-19clockevent: Prevent dead lock on clockevents_lockSuresh Siddha
Currently clockevents_notify() is called with interrupts enabled at some places and interrupts disabled at some other places. This results in a deadlock in this scenario. cpu A holds clockevents_lock in clockevents_notify() with irqs enabled cpu B waits for clockevents_lock in clockevents_notify() with irqs disabled cpu C doing set_mtrr() which will try to rendezvous of all the cpus. This will result in C and A come to the rendezvous point and waiting for B. B is stuck forever waiting for the spinlock and thus not reaching the rendezvous point. Fix the clockevents code so that clockevents_lock is taken with interrupts disabled and thus avoid the above deadlock. Also call lapic_timer_propagate_broadcast() on the destination cpu so that we avoid calling smp_call_function() in the clockevents notifier chain. This issue left us wondering if we need to change the MTRR rendezvous logic to use stop machine logic (instead of smp_call_function) or add a check in spinlock debug code to see if there are other spinlocks which gets taken under both interrupts enabled/disabled conditions. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: "Pallipadi Venkatesh" <venkatesh.pallipadi@intel.com> Cc: "Brown Len" <len.brown@intel.com> LKML-Reference: <1250544899.2709.210.camel@sbs-t61.sc.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-18tracing: handle broken names in ftrace filterJiri Olsa
If one filter item (for set_ftrace_filter and set_ftrace_notrace) is being setup by more than 1 consecutive writes (FTRACE_ITER_CONT flag), it won't be handled corretly. I used following program to test/verify: [snip] #include <stdio.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <string.h> int main(int argc, char **argv) { int fd, i; char *file = argv[1]; if (-1 == (fd = open(file, O_WRONLY))) { perror("open failed"); return -1; } for(i = 0; i < (argc - 2); i++) { int len = strlen(argv[2+i]); int cnt, off = 0; while(len) { cnt = write(fd, argv[2+i] + off, len); len -= cnt; off += cnt; } } close(fd); return 0; } [snip] before change: sh-4.0# echo > ./set_ftrace_filter sh-4.0# /test ./set_ftrace_filter "sys" "_open " sh-4.0# cat ./set_ftrace_filter #### all functions enabled #### sh-4.0# after change: sh-4.0# echo > ./set_ftrace_notrace sh-4.0# test ./set_ftrace_notrace "sys" "_open " sh-4.0# cat ./set_ftrace_notrace sys_open sh-4.0# Signed-off-by: Jiri Olsa <jolsa@redhat.com> LKML-Reference: <20090811152904.GA26065@jolsa.lab.eng.brq.redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2009-08-18mm: revert "oom: move oom_adj value"KOSAKI Motohiro
The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to the mm_struct. It was a very good first step for sanitize OOM. However Paul Menage reported the commit makes regression to his job scheduler. Current OOM logic can kill OOM_DISABLED process. Why? His program has the code of similar to the following. ... set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */ ... if (vfork() == 0) { set_oom_adj(0); /* Invoked child can be killed */ execve("foo-bar-cmd"); } .... vfork() parent and child are shared the same mm_struct. then above set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also change oom_adj for vfork() parent. Then, vfork() parent (job scheduler) lost OOM immune and it was killed. Actually, fork-setting-exec idiom is very frequently used in userland program. We must not break this assumption. Then, this patch revert commit 2ff05b2b and related commit. Reverted commit list --------------------- - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct) - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE) - commit 8123681022 (oom: only oom kill exiting tasks with attached memory) - commit 933b787b57 (mm: copy over oom_adj value at fork time) Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-08-18Merge branch 'irq-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: genirq: Wake up irq thread after action has been installed
2009-08-18genirq: Wake up irq thread after action has been installedThomas Gleixner
The wake_up_process() of the new irq thread in __setup_irq() is too early as the irqaction is not yet fully initialized especially action->irq is not yet set. The interrupt thread might dereference the wrong irq descriptor. Move the wakeup after the action is installed and action->irq has been set. Reported-by: Michael Buesch <mb@bu3sch.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Michael Buesch <mb@bu3sch.de>
2009-08-18perf_counter: Fix the PARISC buildIngo Molnar
PARISC does not build: /home/mingo/tip/kernel/perf_counter.c: In function 'perf_counter_index': /home/mingo/tip/kernel/perf_counter.c:2016: error: 'PERF_COUNTER_INDEX_OFFSET' undeclared (first use in this function) /home/mingo/tip/kernel/perf_counter.c:2016: error: (Each undeclared identifier is reported only once /home/mingo/tip/kernel/perf_counter.c:2016: error: for each function it appears in.) As PERF_COUNTER_INDEX_OFFSET is not defined. Now, we could define it in the architecture - but lets also provide a core default of 0 (which happens to be what all but one architecture uses at the moment). Architectures that need a different index offset should set this value in their asm/perf_counter.h files. Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Helge Deller <deller@gmx.de> Cc: linux-parisc@vger.kernel.org Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-18ftrace: Unify effect of writing to trace_options and option/*Zhaolei
"echo noglobal-clock > trace_options" can be used to change trace clock but "echo 0 > options/global-clock" can't. The flag toggling will be silently accepted without actually changing the clock callback. We can fix it by using set_tracer_flags() in trace_options_core_write(). Changelog: v1->v2: Simplified switch() after Li Zefan <lizf@cn.fujitsu.com>'s suggestion Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Cc: Steven Rostedt <srostedt@redhat.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2009-08-17timers: Drop write permission on /proc/timer_listAmerigo Wang
/proc/timer_list and /proc/slabinfo are not supposed to be written, so there should be no write permissions on it. Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: linux-mm@kvack.org Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Amerigo Wang <amwang@redhat.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Arjan van de Ven <arjan@linux.intel.com> LKML-Reference: <20090817094525.6355.88682.sendpatchset@localhost.localdomain> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-17perf_counter: Check task on counter read IPIPaul Mackerras
In general, code in perf_counter.c that is called through an IPI checks, for per-task counters, that the counter's task is still the current task. This is to handle the race condition where the cpu switches from the task we want to another task in the interval between sending the IPI and the IPI arriving and being handled on the target CPU. For some reason, __perf_counter_read is missing this check, yet there is no reason why the race condition can't occur. This adds a check that the current task is the one we want. If it isn't, we just return. In that case the counter->count value should be up to date, since it will have been updated when the counter was scheduled out, which must have happened since the IPI was sent. I don't have an example of an actual failure due to this race, but it seems obvious that it could occur and we need to guard against it. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <19076.63614.277861.368125@drongo.ozlabs.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-17Security/SELinux: seperate lsm specific mmap_min_addrEric Paris
Currently SELinux enforcement of controls on the ability to map low memory is determined by the mmap_min_addr tunable. This patch causes SELinux to ignore the tunable and instead use a seperate Kconfig option specific to how much space the LSM should protect. The tunable will now only control the need for CAP_SYS_RAWIO and SELinux permissions will always protect the amount of low memory designated by CONFIG_LSM_MMAP_MIN_ADDR. This allows users who need to disable the mmap_min_addr controls (usual reason being they run WINE as a non-root user) to do so and still have SELinux controls preventing confined domains (like a web server) from being able to map some area of low memory. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
2009-08-13genirq: prevent wakeup of freed irq threadLinus Torvalds
free_irq() can remove an irqaction while the corresponding interrupt is in progress, but free_irq() sets action->thread to NULL unconditionally, which might lead to a NULL pointer dereference in handle_IRQ_event() when the hard interrupt context tries to wake up the handler thread. Prevent this by moving the thread stop after synchronize_irq(). No need to set action->thread to NULL either as action is going to be freed anyway. This fixes a boot crash reported against preempt-rt which uses the mainline irq threads code to implement full irq threading. [ tglx: removed local irqthread variable ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2009-08-13Merge branch 'perfcounters-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: perf_counter: Report the cloning task as parent on perf_counter_fork() perf_counter: Fix an ipi-deadlock perf: Rework/fix the whole read vs group stuff perf_counter: Fix swcounter context invariance perf report: Don't show unresolved DSOs and symbols when -S/-d is used perf tools: Add a general option to enable raw sample records perf tools: Add a per tracepoint counter attribute to get raw sample perf_counter: Provide hw_perf_counter_setup_online() APIs perf list: Fix large list output by using the pager perf_counter, x86: Fix/improve apic fallback perf record: Add missing -C option support for specifying profile cpu perf tools: Fix dso__new handle() to handle deleted DSOs perf tools: Fix fallback to cplus_demangle() when bfd_demangle() is not available perf report: Show the tid too in -D perf record: Fix .tid and .pid fill-in when synthesizing events perf_counter, x86: Fix generic cache events on P6-mobile CPUs perf_counter, x86: Fix lapic printk message
2009-08-13Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: futex: Fix handling of bad requeue syscall pairing futex: Fix compat_futex to be same as futex for REQUEUE_PI locking, sched: Give waitqueue spinlocks their own lockdep classes futex: Update futex_q lock_ptr on requeue proxy lock
2009-08-13perf_counter: Report the cloning task as parent on perf_counter_fork()Peter Zijlstra
A bug in (9f498cc: perf_counter: Full task tracing) makes profiling multi-threaded apps it go belly up. [ output as: (PID:TID):(PPID:PTID) ] # ./perf report -D | grep FORK 0x4b0 [0x18]: PERF_EVENT_FORK: (3237:3237):(3236:3236) 0xa10 [0x18]: PERF_EVENT_FORK: (3237:3238):(3236:3236) 0xa70 [0x18]: PERF_EVENT_FORK: (3237:3239):(3236:3236) 0xad0 [0x18]: PERF_EVENT_FORK: (3237:3240):(3236:3236) 0xb18 [0x18]: PERF_EVENT_FORK: (3237:3241):(3236:3236) Shows us that the test (27d028d perf report: Update for the new FORK/EXIT events) in builtin-report.c: /* * A thread clone will have the same PID for both * parent and child. */ if (thread == parent) return 0; Will clearly fail. The problem is that perf_counter_fork() reports the actual parent, instead of the cloning thread. Fixing that (with the below patch), yields: # ./perf report -D | grep FORK 0x4c8 [0x18]: PERF_EVENT_FORK: (1590:1590):(1589:1589) 0xbd8 [0x18]: PERF_EVENT_FORK: (1590:1591):(1590:1590) 0xc80 [0x18]: PERF_EVENT_FORK: (1590:1592):(1590:1590) 0x3338 [0x18]: PERF_EVENT_FORK: (1590:1593):(1590:1590) 0x66b0 [0x18]: PERF_EVENT_FORK: (1590:1594):(1590:1590) Which both makes more sense and doesn't confuse perf report anymore. Reported-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: paulus@samba.org Cc: Anton Blanchard <anton@samba.org> Cc: Arjan van de Ven <arjan@infradead.org> LKML-Reference: <1250172882.5241.62.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-13perf_counter: Fix an ipi-deadlockPeter Zijlstra
perf_pending_counter() is called from IRQ context and will call perf_counter_disable(), however perf_counter_disable() uses smp_call_function_single() which doesn't fancy being used with IRQs disabled due to IPI deadlocks. Fix this by making it use the local __perf_counter_disable() call and teaching the counter_sched_out() code about pending disables as well. This should cover the case where a counter migrates before the pending queue gets processed. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey J Ashford <cjashfor@us.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> LKML-Reference: <20090813103655.244097721@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-13perf: Rework/fix the whole read vs group stuffPeter Zijlstra
Replace PERF_SAMPLE_GROUP with PERF_SAMPLE_READ and introduce PERF_FORMAT_GROUP to deal with group reads in a more generic way. This allows you to get group reads out of read() as well. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Corey J Ashford <cjashfor@us.ibm.com> Cc: Paul Mackerras <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> LKML-Reference: <20090813103655.117411814@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-13perf_counter: Fix swcounter context invariancePeter Zijlstra
perf_swcounter_is_counting() uses a lock, which means we cannot use swcounters from NMI or when holding that particular lock, this is unintended. The below removes the lock, this opens up race window, but not worse than the swcounters already experience due to RCU traversal of the context in perf_swcounter_ctx_event(). This also fixes the hard lockups while opening a lockdep tracepoint counter. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: stephane eranian <eranian@googlemail.com> Cc: Corey J Ashford <cjashfor@us.ibm.com> LKML-Reference: <1250149915.10001.66.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-13perf_counter: Provide hw_perf_counter_setup_online() APIsIngo Molnar
Provide weak aliases for hw_perf_counter_setup_online(). This is used by the BTS patches (for v2.6.32), but it interacts with fixes so propagate this upstream. (it has no effect as of yet) Also export perf_counter_output() to architecture code. Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> LKML-Reference: <new-submission> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-12Remove double removal of blktrace directoryAlan D. Brunelle
commit fd51d251e4cdb21f68e9dbc4336514d64a105a79 Author: Stefan Raspl <raspl@linux.vnet.ibm.com> Date: Tue May 19 09:59:08 2009 +0200 blktrace: remove debugfs entries on bad path added in an explicit invocation of debugfs_remove for bt->dir, in blk_remove_buf_file_callback we are also getting the directory removed. On occasion I am seeing memory corruption that I have bisected down to this commit. [The testing involves a (long) series of I/O benchmarks with blktrace invoked around the actual runs.] I believe that this committed patch is correct, but the problem actually lies in the code in blk_remove_buf_file_callback. With this patch I am able to consistently get complete runs whereas previously I could not get a single run to complete. The first part of the patch simply moves the debugfs_remove below the relay_close: the relay_close call will remove files under bt->dir, and so we should not remove the directory until all the files we created have been removed. (Note: This is not sufficient to fix the problem - the file system code has ref counts on the directoy, so our invocation does not cause the directory to actually be removed. Nonetheless, we should not rely upon that feature.) Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-08-10Merge branch 'perfcounters-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perfcounters-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (27 commits) perf_counter: Zero dead bytes from ftrace raw samples size alignment perf_counter: Subtract the buffer size field from the event record size perf_counter: Require CAP_SYS_ADMIN for raw tracepoint data perf_counter: Correct PERF_SAMPLE_RAW output perf tools: callchain: Fix bad rounding of minimum rate perf_counter tools: Fix libbfd detection for systems with libz dependency perf: "Longum est iter per praecepta, breve et efficax per exempla" perf_counter: Fix a race on perf_counter_ctx perf_counter: Fix tracepoint sampling to be part of generic sampling perf_counter: Work around gcc warning by initializing tracepoint record unconditionally perf tools: callchain: Fix sum of percentages to be 100% by displaying amount of ignored chains in fractal mode perf tools: callchain: Fix 'perf report' display to be callchain by default perf tools: callchain: Fix spurious 'perf report' warnings: ignore empty callchains perf record: Fix the -A UI for empty or non-existent perf.data perf util: Fix do_read() to fail on EOF instead of busy-looping perf list: Fix the output to not include tracepoints without an id perf_counter/powerpc: Fix oops on cpus without perf_counter hardware support perf stat: Fix tool option consistency: rename -S/--scale to -c/--scale perf report: Add debug help for the finding of symbol bugs - show the symtab origin (DSO, build-id, kernel, etc) perf report: Fix per task mult-counter stat reporting ...
2009-08-10futex: Fix handling of bad requeue syscall pairingDarren Hart
If futex_requeue(requeue_pi=1) finds a futex_q that was created by a call other the futex_wait_requeue_pi(), the q.rt_waiter may be null. If so, this will result in an oops from the following call graph: futex_requeue() rt_mutex_start_proxy_lock() task_blocks_on_rt_mutex() waiter->task dereference OOPS We currently WARN_ON() if this is detected, clearly this is inadequate. If we detect a mispairing in futex_requeue(), bail out, seding -EINVAL to user-space. V2: Fix parenthesis warnings. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: John Kacur <jkacur@redhat.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@linux.vnet.ibm.com> LKML-Reference: <4A7CA8C0.7010809@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-10Merge branch 'irq-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86/irq: Fix move_irq_desc() for nodes without ram
2009-08-10futex: Fix compat_futex to be same as futex for REQUEUE_PIDinakar Guniguntala
Need to add the REQUEUE_PI checks to the compat_sys_futex API as well to ensure 32 bit requeue's work fine on a 64 bit system. Patch is against latest tip Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Cc: Darren Hart <dvhltc@us.ibm.com> LKML-Reference: <20090810130142.GA23619@in.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-10locking, sched: Give waitqueue spinlocks their own lockdep classesPeter Zijlstra
Give waitqueue spinlocks their own lockdep classes when they are initialised from init_waitqueue_head(). This means that struct wait_queue::func functions can operate other waitqueues. This is used by CacheFiles to catch the page from a backing fs being unlocked and to wake up another thread to take a copy of it. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Takashi Iwai <tiwai@suse.de> Cc: linux-cachefs@redhat.com Cc: torvalds@osdl.org Cc: akpm@linux-foundation.org LKML-Reference: <20090810113305.17284.81508.stgit@warthog.procyon.org.uk> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-10perf_counter: Require CAP_SYS_ADMIN for raw tracepoint dataPeter Zijlstra
Raw tracepoint data contains various kernel internals and data from other users, so restrict this to CAP_SYS_ADMIN. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <1249896452.17467.75.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-10perf_counter: Correct PERF_SAMPLE_RAW outputPeter Zijlstra
PERF_SAMPLE_* output switches should unconditionally output the correct format, as they are the only way to unambiguously parse the PERF_EVENT_SAMPLE data. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> LKML-Reference: <1249896447.17467.74.camel@twins> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-10futex: Update futex_q lock_ptr on requeue proxy lockDarren Hart
futex_requeue() can acquire the lock on behalf of a waiter early on or during the requeue loop if it is uncontended or in the event of a lock steal or owner died. On wakeup, the waiter (in futex_wait_requeue_pi()) cleans up the pi_state owner using the lock_ptr to protect against concurrent access to the pi_state. The pi_state is hung off futex_q's on the requeue target futex hash bucket so the lock_ptr needs to be updated accordingly. The problem manifested by triggering the WARN_ON in lookup_pi_state() about the pid != pi_state->owner->pid. With this patch, the pi_state is properly guarded against concurrent access via the requeue target hb lock. The astute reviewer may notice that there is a window of time between when futex_requeue() unlocks the hb locks and when futex_wait_requeue_pi() will acquire hb2->lock. During this time the pi_state and uval are not in sync with the underlying rtmutex owner (but the uval does indicate there are waiters, so no atomic changes will occur in userspace). However, this is not a problem. Should a contending thread enter lookup_pi_state() and acquire hb2->lock before the ownership is fixed up, it will find the pi_state hung off a waiter's (possibly the pending owner's) futex_q and block on the rtmutex. Once futex_wait_requeue_pi() fixes up the owner, it will also move the pi_state from the old owner's task->pi_state_list to its own. v3: Fix plist lock name for application to mainline (rather than -rt) Compile tested against tip/v2.6.31-rc5. Signed-off-by: Darren Hart <dvhltc@us.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Dinakar Guniguntala <dino@in.ibm.com> Cc: John Stultz <johnstul@linux.vnet.ibm.com> LKML-Reference: <4A7F4EFF.6090903@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-08-09Merge branch 'timers-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: posix_cpu_timers_exit_group(): Do not use thread_group_cputimer()