summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2014-12-06uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUMEAndy Lutomirski
commit 82975bc6a6df743b9a01810fb32cb65d0ec5d60b upstream. x86 call do_notify_resume on paranoid returns if TIF_UPROBE is set but not on non-paranoid returns. I suspect that this is a mistake and that the code only works because int3 is paranoid. Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a workaround for the x86 bug. With that bug fixed, we can remove _TIF_NOTIFY_RESUME from the uprobes code. Reported-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21rcu: Use rcu_gp_kthread_wake() to wake up grace period kthreadsPranith Kumar
commit 2aa792e6faf1a00f5accf1f69e87e11a390ba2cd upstream. The rcu_gp_kthread_wake() function checks for three conditions before waking up grace period kthreads: * Is the thread we are trying to wake up the current thread? * Are the gp_flags zero? (all threads wait on non-zero gp_flags condition) * Is there no thread created for this flavour, hence nothing to wake up? If any one of these condition is true, we do not call wake_up(). It was found that there are quite a few avoidable wake ups both during idle time and under stress induced by rcutorture. Idle: Total:66000, unnecessary:66000, case1:61827, case2:66000, case3:0 Total:68000, unnecessary:68000, case1:63696, case2:68000, case3:0 rcutorture: Total:254000, unnecessary:254000, case1:199913, case2:254000, case3:0 Total:256000, unnecessary:256000, case1:201784, case2:256000, case3:0 Here case{1-3} are the cases listed above. We can avoid these wake ups by using rcu_gp_kthread_wake() to conditionally wake up the grace period kthreads. There is a comment about an implied barrier supplied by the wake_up() logic. This barrier is necessary for the awakened thread to see the updated ->gp_flags. This flag is always being updated with the root node lock held. Also, the awakened thread tries to acquire the root node lock before reading ->gp_flags because of which there is proper ordering. Hence this commit tries to avoid calling wake_up() whenever we can by using rcu_gp_kthread_wake() function. Signed-off-by: Pranith Kumar <bobby.prani@gmail.com> CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Kamal Mostafa <kamal@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21tracing: Do not busy wait in buffer spliceRabin Vincent
commit e30f53aad2202b5526c40c36d8eeac8bf290bde5 upstream. On a !PREEMPT kernel, attempting to use trace-cmd results in a soft lockup: # trace-cmd record -e raw_syscalls:* -F false NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trace-cmd:61] ... Call Trace: [<ffffffff8105b580>] ? __wake_up_common+0x90/0x90 [<ffffffff81092e25>] wait_on_pipe+0x35/0x40 [<ffffffff810936e3>] tracing_buffers_splice_read+0x2e3/0x3c0 [<ffffffff81093300>] ? tracing_stats_read+0x2a0/0x2a0 [<ffffffff812d10ab>] ? _raw_spin_unlock+0x2b/0x40 [<ffffffff810dc87b>] ? do_read_fault+0x21b/0x290 [<ffffffff810de56a>] ? handle_mm_fault+0x2ba/0xbd0 [<ffffffff81095c80>] ? trace_event_buffer_lock_reserve+0x40/0x80 [<ffffffff810951e2>] ? trace_buffer_lock_reserve+0x22/0x60 [<ffffffff81095c80>] ? trace_event_buffer_lock_reserve+0x40/0x80 [<ffffffff8112415d>] do_splice_to+0x6d/0x90 [<ffffffff81126971>] SyS_splice+0x7c1/0x800 [<ffffffff812d1edd>] tracesys_phase2+0xd3/0xd8 The problem is this: tracing_buffers_splice_read() calls ring_buffer_wait() to wait for data in the ring buffers. The buffers are not empty so ring_buffer_wait() returns immediately. But tracing_buffers_splice_read() calls ring_buffer_read_page() with full=1, meaning it only wants to read a full page. When the full page is not available, tracing_buffers_splice_read() tries to wait again with ring_buffer_wait(), which again returns immediately, and so on. Fix this by adding a "full" argument to ring_buffer_wait() which will make ring_buffer_wait() wait until the writer has left the reader's page, i.e. until full-page reads will succeed. Link: http://lkml.kernel.org/r/1415645194-25379-1-git-send-email-rabin@rab.in Fixes: b1169cc69ba9 ("tracing: Remove mock up poll wait function") Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21audit: keep inode pinnedMiklos Szeredi
commit 799b601451b21ebe7af0e6e8f6e2ccd4683c5064 upstream. Audit rules disappear when an inode they watch is evicted from the cache. This is likely not what we want. The guilty commit is "fsnotify: allow marks to not pin inodes in core", which didn't take into account that audit_tree adds watches with a zero mask. Adding any mask should fix this. Fixes: 90b1e7a57880 ("fsnotify: allow marks to not pin inodes in core") Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21audit: AUDIT_FEATURE_CHANGE message format missing delimiting spaceRichard Guy Briggs
commit 897f1acbb6702ddaa953e8d8436eee3b12016c7e upstream. Add a space between subj= and feature= fields to make them parsable. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21audit: correct AUDIT_GET_FEATURE return message typeRichard Guy Briggs
commit 9ef91514774a140e468f99d73d7593521e6d25dc upstream. When an AUDIT_GET_FEATURE message is sent from userspace to the kernel, it should reply with a message tagged as an AUDIT_GET_FEATURE type with a struct audit_feature. The current reply is a message tagged as an AUDIT_GET type with a struct audit_feature. This appears to have been a cut-and-paste-eo in commit b0fed40. Reported-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14sched: Use rq->rd in sched_setaffinity() under RCU read lockKirill Tkhai
commit f1e3a0932f3a9554371792a7daaf1e0eb19f66d5 upstream. Probability of use-after-free isn't zero in this place. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140922183636.11015.83611.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14posix-timers: Fix stack info leak in timer_create()Mathias Krause
commit 6891c4509c792209c44ced55a60f13954cb50ef4 upstream. If userland creates a timer without specifying a sigevent info, we'll create one ourself, using a stack local variable. Particularly will we use the timer ID as sival_int. But as sigev_value is a union containing a pointer and an int, that assignment will only partially initialize sigev_value on systems where the size of a pointer is bigger than the size of an int. On such systems we'll copy the uninitialized stack bytes from the timer_create() call to userland when the timer actually fires and we're going to deliver the signal. Initialize sigev_value with 0 to plug the stack info leak. Found in the PaX patch, written by the PaX Team. Fixes: 5a9fa7307285 ("posix-timers: kill ->it_sigev_signo and...") Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Brad Spengler <spender@grsecurity.net> Cc: PaX Team <pageexec@freemail.hu> Link: http://lkml.kernel.org/r/1412456799-32339-1-git-send-email-minipli@googlemail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14PM / Sleep: fix recovery during resuming from hibernationImre Deak
commit 94fb823fcb4892614f57e59601bb9d4920f24711 upstream. If a device's dev_pm_ops::freeze callback fails during the QUIESCE phase, we don't rollback things correctly calling the thaw and complete callbacks. This could leave some devices in a suspended state in case of an error during resuming from hibernation. Signed-off-by: Imre Deak <imre.deak@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14OOM, PM: OOM killed task shouldn't escape PM suspendMichal Hocko
commit 5695be142e203167e3cb515ef86a88424f3524eb upstream. PM freezer relies on having all tasks frozen by the time devices are getting frozen so that no task will touch them while they are getting frozen. But OOM killer is allowed to kill an already frozen task in order to handle OOM situtation. In order to protect from late wake ups OOM killer is disabled after all tasks are frozen. This, however, still keeps a window open when a killed task didn't manage to die by the time freeze_processes finishes. Reduce the race window by checking all tasks after OOM killer has been disabled. This is still not race free completely unfortunately because oom_killer_disable cannot stop an already ongoing OOM killer so a task might still wake up from the fridge and get killed without freeze_processes noticing. Full synchronization of OOM and freezer is, however, too heavy weight for this highly unlikely case. Introduce and check oom_kills counter which gets incremented early when the allocator enters __alloc_pages_may_oom path and only check all the tasks if the counter changes during the freezing attempt. The counter is updated so early to reduce the race window since allocator checked oom_killer_disabled which is set by PM-freezing code. A false positive will push the PM-freezer into a slow path but that is not a big deal. Changes since v1 - push the re-check loop out of freeze_processes into check_frozen_processes and invert the condition to make the code more readable as per Rafael Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14freezer: Do not freeze tasks killed by OOM killerCong Wang
commit 51fae6da640edf9d266c94f36bc806c63c301991 upstream. Since f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring) OOM killer relies on being able to thaw a frozen task to handle OOM situation but a3201227f803 (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE) has reorganized the code and stopped clearing freeze flag in __thaw_task. This means that the target task only wakes up and goes into the fridge again because the freezing condition hasn't changed for it. This reintroduces the bug fixed by f660daac474c6f. Fix the issue by checking for TIF_MEMDIE thread flag in freezing_slow_path and exclude the task from freezing completely. If a task was already frozen it would get woken by __thaw_task from OOM killer and get out of freezer after rechecking freezing(). Changes since v1 - put TIF_MEMDIE check into freezing_slowpath rather than in __refrigerator as per Oleg - return __thaw_task into oom_scan_process_thread because oom_kill_process will not wake task in the fridge because it is sleeping uninterruptible [mhocko@suse.cz: rewrote the changelog] Fixes: a3201227f803 (freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE) Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14futex: Fix a race condition between REQUEUE_PI and task deathBrian Silverman
commit 30a6b8031fe14031ab27c1fa3483cb9780e7f63c upstream. free_pi_state and exit_pi_state_list both clean up futex_pi_state's. exit_pi_state_list takes the hb lock first, and most callers of free_pi_state do too. requeue_pi doesn't, which means free_pi_state can free the pi_state out from under exit_pi_state_list. For example: task A | task B exit_pi_state_list | pi_state = | curr->pi_state_list->next | | futex_requeue(requeue_pi=1) | // pi_state is the same as | // the one in task A | free_pi_state(pi_state) | list_del_init(&pi_state->list) | kfree(pi_state) list_del_init(&pi_state->list) | Move the free_pi_state calls in requeue_pi to before it drops the hb locks which it's already holding. [ tglx: Removed a pointless free_pi_state() call and the hb->lock held debugging. The latter comes via a seperate patch ] Signed-off-by: Brian Silverman <bsilver16384@gmail.com> Cc: austin.linux@gmail.com Cc: darren@dvhart.com Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/1414282837-23092-1-git-send-email-bsilver16384@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14modules, lock around setting of MODULE_STATE_UNFORMEDPrarit Bhargava
commit d3051b489aa81ca9ba62af366149ef42b8dae97c upstream. A panic was seen in the following sitation. There are two threads running on the system. The first thread is a system monitoring thread that is reading /proc/modules. The second thread is loading and unloading a module (in this example I'm using my simple dummy-module.ko). Note, in the "real world" this occurred with the qlogic driver module. When doing this, the following panic occurred: ------------[ cut here ]------------ kernel BUG at kernel/module.c:3739! invalid opcode: 0000 [#1] SMP Modules linked in: binfmt_misc sg nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel lrw igb gf128mul glue_helper iTCO_wdt iTCO_vendor_support ablk_helper ptp sb_edac cryptd pps_core edac_core shpchp i2c_i801 pcspkr wmi lpc_ich ioatdma mfd_core dca ipmi_si nfsd ipmi_msghandler auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_common mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper ttm isci drm libsas ahci libahci scsi_transport_sas libata i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dummy_module] CPU: 37 PID: 186343 Comm: cat Tainted: GF O-------------- 3.10.0+ #7 Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.00.29.D696.1311111329 11/11/2013 task: ffff8807fd2d8000 ti: ffff88080fa7c000 task.ti: ffff88080fa7c000 RIP: 0010:[<ffffffff810d64c5>] [<ffffffff810d64c5>] module_flags+0xb5/0xc0 RSP: 0018:ffff88080fa7fe18 EFLAGS: 00010246 RAX: 0000000000000003 RBX: ffffffffa03b5200 RCX: 0000000000000000 RDX: 0000000000001000 RSI: ffff88080fa7fe38 RDI: ffffffffa03b5000 RBP: ffff88080fa7fe28 R08: 0000000000000010 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000000000f R12: ffffffffa03b5000 R13: ffffffffa03b5008 R14: ffffffffa03b5200 R15: ffffffffa03b5000 FS: 00007f6ae57ef740(0000) GS:ffff88101e7a0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000404f70 CR3: 0000000ffed48000 CR4: 00000000001407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Stack: ffffffffa03b5200 ffff8810101e4800 ffff88080fa7fe70 ffffffff810d666c ffff88081e807300 000000002e0f2fbf 0000000000000000 ffff88100f257b00 ffffffffa03b5008 ffff88080fa7ff48 ffff8810101e4800 ffff88080fa7fee0 Call Trace: [<ffffffff810d666c>] m_show+0x19c/0x1e0 [<ffffffff811e4d7e>] seq_read+0x16e/0x3b0 [<ffffffff812281ed>] proc_reg_read+0x3d/0x80 [<ffffffff811c0f2c>] vfs_read+0x9c/0x170 [<ffffffff811c1a58>] SyS_read+0x58/0xb0 [<ffffffff81605829>] system_call_fastpath+0x16/0x1b Code: 48 63 c2 83 c2 01 c6 04 03 29 48 63 d2 eb d9 0f 1f 80 00 00 00 00 48 63 d2 c6 04 13 2d 41 8b 0c 24 8d 50 02 83 f9 01 75 b2 eb cb <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 RIP [<ffffffff810d64c5>] module_flags+0xb5/0xc0 RSP <ffff88080fa7fe18> Consider the two processes running on the system. CPU 0 (/proc/modules reader) CPU 1 (loading/unloading module) CPU 0 opens /proc/modules, and starts displaying data for each module by traversing the modules list via fs/seq_file.c:seq_open() and fs/seq_file.c:seq_read(). For each module in the modules list, seq_read does op->start() <-- this is a pointer to m_start() op->show() <- this is a pointer to m_show() op->stop() <-- this is a pointer to m_stop() The m_start(), m_show(), and m_stop() module functions are defined in kernel/module.c. The m_start() and m_stop() functions acquire and release the module_mutex respectively. ie) When reading /proc/modules, the module_mutex is acquired and released for each module. m_show() is called with the module_mutex held. It accesses the module struct data and attempts to write out module data. It is in this code path that the above BUG_ON() warning is encountered, specifically m_show() calls static char *module_flags(struct module *mod, char *buf) { int bx = 0; BUG_ON(mod->state == MODULE_STATE_UNFORMED); ... The other thread, CPU 1, in unloading the module calls the syscall delete_module() defined in kernel/module.c. The module_mutex is acquired for a short time, and then released. free_module() is called without the module_mutex. free_module() then sets mod->state = MODULE_STATE_UNFORMED, also without the module_mutex. Some additional code is called and then the module_mutex is reacquired to remove the module from the modules list: /* Now we can delete it from the lists */ mutex_lock(&module_mutex); stop_machine(__unlink_module, mod, NULL); mutex_unlock(&module_mutex); This is the sequence of events that leads to the panic. CPU 1 is removing dummy_module via delete_module(). It acquires the module_mutex, and then releases it. CPU 1 has NOT set dummy_module->state to MODULE_STATE_UNFORMED yet. CPU 0, which is reading the /proc/modules, acquires the module_mutex and acquires a pointer to the dummy_module which is still in the modules list. CPU 0 calls m_show for dummy_module. The check in m_show() for MODULE_STATE_UNFORMED passed for dummy_module even though it is being torn down. Meanwhile CPU 1, which has been continuing to remove dummy_module without holding the module_mutex, now calls free_module() and sets dummy_module->state to MODULE_STATE_UNFORMED. CPU 0 now calls module_flags() with dummy_module and ... static char *module_flags(struct module *mod, char *buf) { int bx = 0; BUG_ON(mod->state == MODULE_STATE_UNFORMED); and BOOM. Acquire and release the module_mutex lock around the setting of MODULE_STATE_UNFORMED in the teardown path, which should resolve the problem. Testing: In the unpatched kernel I can panic the system within 1 minute by doing while (true) do insmod dummy_module.ko; rmmod dummy_module.ko; done and while (true) do cat /proc/modules; done in separate terminals. In the patched kernel I was able to run just over one hour without seeing any issues. I also verified the output of panic via sysrq-c and the output of /proc/modules looks correct for all three states for the dummy_module. dummy_module 12661 0 - Unloading 0xffffffffa03a5000 (OE-) dummy_module 12661 0 - Live 0xffffffffa03bb000 (OE) dummy_module 14015 1 - Loading 0xffffffffa03a5000 (OE+) Signed-off-by: Prarit Bhargava <prarit@redhat.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14sched: Use dl_bw_of() under RCU read lockKirill Tkhai
commit 66339c31bc3978d5fff9c4b4cb590a861def4db2 upstream. dl_bw_of() dereferences rq->rd which has to have RCU read lock held. Probability of use-after-free isn't zero here. Also add lockdep assert into dl_bw_cpus(). Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140922183624.11015.71558.stgit@localhost Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14perf: Fix unclone_ctx() vs. lockingPeter Zijlstra
commit 211de6eba8960521e2be450a7d07db85fba4604c upstream. The idiot who did 4a1c0f262f88 ("perf: Fix lockdep warning on process exit") forgot to pay attention and fix all similar cases. Do so now. In particular, unclone_ctx() must be called while holding ctx->lock, therefore all such sites are broken for the same reason. Pull the put_ctx() call out from under ctx->lock. Reported-by: Sasha Levin <sasha.levin@oracle.com> Probably-also-reported-by: Vince Weaver <vincent.weaver@maine.edu> Fixes: 4a1c0f262f88 ("perf: Fix lockdep warning on process exit") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Cong Wang <cwang@twopensource.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20140930172308.GI4241@worktop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-14tracing/syscalls: Ignore numbers outside NR_syscalls' rangeRabin Vincent
commit 086ba77a6db00ed858ff07451bedee197df868c9 upstream. ARM has some private syscalls (for example, set_tls(2)) which lie outside the range of NR_syscalls. If any of these are called while syscall tracing is being performed, out-of-bounds array access will occur in the ftrace and perf sys_{enter,exit} handlers. # trace-cmd record -e raw_syscalls:* true && trace-cmd report ... true-653 [000] 384.675777: sys_enter: NR 192 (0, 1000, 3, 4000022, ffffffff, 0) true-653 [000] 384.675812: sys_exit: NR 192 = 1995915264 true-653 [000] 384.675971: sys_enter: NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1) true-653 [000] 384.675988: sys_exit: NR 983045 = 0 ... # trace-cmd record -e syscalls:* true [ 17.289329] Unable to handle kernel paging request at virtual address aaaaaace [ 17.289590] pgd = 9e71c000 [ 17.289696] [aaaaaace] *pgd=00000000 [ 17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM [ 17.290169] Modules linked in: [ 17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21 [ 17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000 [ 17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8 [ 17.290866] LR is at syscall_trace_enter+0x124/0x184 Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers. Commit cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls" added the check for less than zero, but it should have also checked for greater than NR_syscalls. Link: http://lkml.kernel.org/p/1414620418-29472-1-git-send-email-rabin@rab.in Fixes: cd0980fc8add "tracing: Check invalid syscall nr while tracing syscalls" Signed-off-by: Rabin Vincent <rabin@rab.in> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-30futex: Ensure get_futex_key_refs() always implies a barrierCatalin Marinas
commit 76835b0ebf8a7fe85beb03c75121419a7dec52f0 upstream. Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up) changes the futex code to avoid taking a lock when there are no waiters. This code has been subsequently fixed in commit 11d4616bd07f (futex: revert back to the explicit waiter counting code). Both the original commit and the fix-up rely on get_futex_key_refs() to always imply a barrier. However, for private futexes, none of the cases in the switch statement of get_futex_key_refs() would be hit and the function completes without a memory barrier as required before checking the "waiters" in futex_wake() -> hb_waiters_pending(). The consequence is a race with a thread waiting on a futex on another CPU, allowing the waker thread to read "waiters == 0" while the waiter thread to have read "futex_val == locked" (in kernel). Without this fix, the problem (user space deadlocks) can be seen with Android bionic's mutex implementation on an arm64 multi-cluster system. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Reported-by: Matteo Franchin <Matteo.Franchin@arm.com> Fixes: b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up) Acked-by: Davidlohr Bueso <dave@stgolabs.net> Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Darren Hart <dvhart@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-15irq_work: Force raised irq work to run on irq work interruptFrederic Weisbecker
commit 76a33061b9323b7fdb220ae5fa116c10833ec22e upstream. The nohz full kick, which restarts the tick when any resource depend on it, can't be executed anywhere given the operation it does on timers. If it is called from the scheduler or timers code, chances are that we run into a deadlock. This is why we run the nohz full kick from an irq work. That way we make sure that the kick runs on a virgin context. However if that's the case when irq work runs in its own dedicated self-ipi, things are different for the big bunch of archs that don't support the self triggered way. In order to support them, irq works are also handled by the timer interrupt as fallback. Now when irq works run on the timer interrupt, the context isn't blank. More precisely, they can run in the context of the hrtimer that runs the tick. But the nohz kick cancels and restarts this hrtimer and cancelling an hrtimer from itself isn't allowed. This is why we run in an endless loop: Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2 CPU: 2 PID: 7538 Comm: kworker/u8:8 Not tainted 3.16.0+ #34 Workqueue: btrfs-endio-write normal_work_helper [btrfs] ffff880244c06c88 000000001b486fe1 ffff880244c06bf0 ffffffff8a7f1e37 ffffffff8ac52a18 ffff880244c06c78 ffffffff8a7ef928 0000000000000010 ffff880244c06c88 ffff880244c06c20 000000001b486fe1 0000000000000000 Call Trace: <NMI[<ffffffff8a7f1e37>] dump_stack+0x4e/0x7a [<ffffffff8a7ef928>] panic+0xd4/0x207 [<ffffffff8a1450e8>] watchdog_overflow_callback+0x118/0x120 [<ffffffff8a186b0e>] __perf_event_overflow+0xae/0x350 [<ffffffff8a184f80>] ? perf_event_task_disable+0xa0/0xa0 [<ffffffff8a01a4cf>] ? x86_perf_event_set_period+0xbf/0x150 [<ffffffff8a187934>] perf_event_overflow+0x14/0x20 [<ffffffff8a020386>] intel_pmu_handle_irq+0x206/0x410 [<ffffffff8a01937b>] perf_event_nmi_handler+0x2b/0x50 [<ffffffff8a007b72>] nmi_handle+0xd2/0x390 [<ffffffff8a007aa5>] ? nmi_handle+0x5/0x390 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0 [<ffffffff8a008062>] default_do_nmi+0x72/0x1c0 [<ffffffff8a008268>] do_nmi+0xb8/0x100 [<ffffffff8a7ff66a>] end_repeat_nmi+0x1e/0x2e [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0 [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0 <<EOE><IRQ[<ffffffff8a0ccd2f>] lock_acquired+0xaf/0x450 [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50 [<ffffffff8a7fc678>] _raw_spin_lock_irqsave+0x78/0x90 [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50 [<ffffffff8a0f74c5>] lock_hrtimer_base.isra.20+0x25/0x50 [<ffffffff8a0f7723>] hrtimer_try_to_cancel+0x33/0x1e0 [<ffffffff8a0f78ea>] hrtimer_cancel+0x1a/0x30 [<ffffffff8a109237>] tick_nohz_restart+0x17/0x90 [<ffffffff8a10a213>] __tick_nohz_full_check+0xc3/0x100 [<ffffffff8a10a25e>] nohz_full_kick_work_func+0xe/0x10 [<ffffffff8a17c884>] irq_work_run_list+0x44/0x70 [<ffffffff8a17c8da>] irq_work_run+0x2a/0x50 [<ffffffff8a0f700b>] update_process_times+0x5b/0x70 [<ffffffff8a109005>] tick_sched_handle.isra.21+0x25/0x60 [<ffffffff8a109b81>] tick_sched_timer+0x41/0x60 [<ffffffff8a0f7aa2>] __run_hrtimer+0x72/0x470 [<ffffffff8a109b40>] ? tick_sched_do_timer+0xb0/0xb0 [<ffffffff8a0f8707>] hrtimer_interrupt+0x117/0x270 [<ffffffff8a034357>] local_apic_timer_interrupt+0x37/0x60 [<ffffffff8a80010f>] smp_apic_timer_interrupt+0x3f/0x50 [<ffffffff8a7fe52f>] apic_timer_interrupt+0x6f/0x80 To fix this we force non-lazy irq works to run on irq work self-IPIs when available. That ability of the arch to trigger irq work self IPIs is available with arch_irq_work_has_interrupt(). Reported-by: Catalin Iacob <iacobcatalin@gmail.com> Reported-by: Dave Jones <davej@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-10-03Merge tag 'trace-fixes-v3.17-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull trace ring buffer iterator fix from Steven Rostedt: "While testing some new changes for 3.18, I kept hitting a bug every so often in the ring buffer. At first I thought it had to do with some of the changes I was working on, but then testing something else I realized that the bug was in 3.17 itself. I ran several bisects as the bug was not very reproducible, and finally came up with the commit that I could reproduce easily within a few minutes, and without the change I could run the tests over an hour without issue. The change fit the bug and I figured out a fix. That bad commit was: Commit 651e22f2701b "ring-buffer: Always reset iterator to reader page" This commit fixed a bug, but in the process created another one. It used the wrong value as the cached value that is used to see if things changed while an iterator was in use. This made it look like a change always happened, and could cause the iterator to go into an infinite loop" * tag 'trace-fixes-v3.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: ring-buffer: Fix infinite spin in reading buffer
2014-10-02perf: fix perf bug in fork()Peter Zijlstra
Oleg noticed that a cleanup by Sylvain actually uncovered a bug; by calling perf_event_free_task() when failing sched_fork() we will not yet have done the memset() on ->perf_event_ctxp[] and will therefore try and 'free' the inherited contexts, which are still in use by the parent process. This is bad.. Suggested-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Sylvain 'ythier' Hitier <sylvain.hitier@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-02ring-buffer: Fix infinite spin in reading bufferSteven Rostedt (Red Hat)
Commit 651e22f2701b "ring-buffer: Always reset iterator to reader page" fixed one bug but in the process caused another one. The reset is to update the header page, but that fix also changed the way the cached reads were updated. The cache reads are used to test if an iterator needs to be updated or not. A ring buffer iterator, when created, disables writes to the ring buffer but does not stop other readers or consuming reads from happening. Although all readers are synchronized via a lock, they are only synchronized when in the ring buffer functions. Those functions may be called by any number of readers. The iterator continues down when its not interrupted by a consuming reader. If a consuming read occurs, the iterator starts from the beginning of the buffer. The way the iterator sees that a consuming read has happened since its last read is by checking the reader "cache". The cache holds the last counts of the read and the reader page itself. Commit 651e22f2701b changed what was saved by the cache_read when the rb_iter_reset() occurred, making the iterator never match the cache. Then if the iterator calls rb_iter_reset(), it will go into an infinite loop by checking if the cache doesn't match, doing the reset and retrying, just to see that the cache still doesn't match! Which should never happen as the reset is suppose to set the cache to the current value and there's locks that keep a consuming reader from having access to the data. Fixes: 651e22f2701b "ring-buffer: Always reset iterator to reader page" Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-09-27Merge branch 'for-3.17-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "This is quite late but these need to be backported anyway. This is the fix for a long-standing cpuset bug which existed from 2009. cpuset makes use of PF_SPREAD_{PAGE|SLAB} flags to modify the task's memory allocation behavior according to the settings of the cpuset it belongs to; unfortunately, when those flags have to be changed, cpuset did so directly even whlie the target task is running, which is obviously racy as task->flags may be modified by the task itself at any time. This obscure bug manifested as corrupt PF_USED_MATH flag leading to a weird crash. The bug is fixed by moving the flag to task->atomic_flags. The first two are prepatory ones to help defining atomic_flags accessors and the third one is the actual fix" * 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags sched: add macros to define bitops for task atomic flags sched: fix confusing PFA_NO_NEW_PRIVS constant
2014-09-25Merge tag 'pm+acpi-3.17-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management fixes from Rafael Wysocki: "These are regression fixes (ACPI hotplug, cpufreq, hibernation, ACPI LPSS driver), fixes for stuff that never worked correctly (ACPI GPIO support in some cases and a wrong sign of an error code in the ACPI core in one place), and one blacklist item for ACPI backlight handling. Specifics: - Revert of a recent hibernation core commit that introduced a NULL pointer dereference during resume for at least one user (Rafael J Wysocki). - Fix for the ACPI LPSS (Low-Power Subsystem) driver to disable asynchronous PM callback execution for LPSS devices during system suspend/resume (introduced in 3.16) which turns out to break ordering expectations on some systems. From Fu Zhonghui. - cpufreq core fix related to the handling of sysfs nodes during system suspend/resume that has been broken for intel_pstate since 3.15 from Lan Tianyu. - Restore the generation of "online" uevents for ACPI container devices that was removed in 3.14, but some user space utilities turn out to need them (Rafael J Wysocki). - The cpufreq core fails to release a lock in an error code path after changes made in 3.14. Fix from Prarit Bhargava. - ACPICA and ACPI/GPIO fixes to make the handling of ACPI GPIO operation regions (which means AML using GPIOs) work correctly in all cases from Bob Moore and Srinivas Pandruvada. - Fix for a wrong sign of the ACPI core's create_modalias() return value in case of an error from Mika Westerberg. - ACPI backlight blacklist entry for ThinkPad X201s from Aaron Lu" * tag 'pm+acpi-3.17-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: Revert "PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()" gpio / ACPI: Use pin index and bit length ACPICA: Update to GPIO region handler interface. ACPI / platform / LPSS: disable async suspend/resume of LPSS devices cpufreq: release policy->rwsem on error cpufreq: fix cpufreq suspend/resume for intel_pstate ACPI / scan: Correct error return value of create_modalias() ACPI / video: disable native backlight for ThinkPad X201s ACPI / hotplug: Generate online uevents for ACPI containers
2014-09-24cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flagsZefan Li
When we change cpuset.memory_spread_{page,slab}, cpuset will flip PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset. This should be done using atomic bitops, but currently we don't, which is broken. Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened when one thread tried to clear PF_USED_MATH while at the same time another thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on the same task. Here's the full report: https://lkml.org/lkml/2014/9/19/230 To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags. v4: - updated mm/slab.c. (Fengguang Wu) - updated Documentation. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: Kees Cook <keescook@chromium.org> Fixes: 950592f7b991 ("cpusets: update tasks' page/slab spread flags in time") Cc: <stable@vger.kernel.org> # 2.6.31+ Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-25Revert "PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()"Rafael J. Wysocki
Revert commit 6efde38f0769 (PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()) that introduced a NULL pointer dereference during system resume from hibernation: BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff810a8cc1>] swsusp_free+0x21/0x190 PGD b39c2067 PUD b39c1067 PMD 0 Oops: 0000 [#1] SMP Modules linked in: <irrelevant list of modules> CPU: 1 PID: 4898 Comm: s2disk Tainted: G C 3.17-rc5-amd64 #1 Debian 3.17~rc5-1~exp1 Hardware name: LENOVO 2776LEG/2776LEG, BIOS 6EET55WW (3.15 ) 12/19/2011 task: ffff88023155ea40 ti: ffff8800b3b14000 task.ti: ffff8800b3b14000 RIP: 0010:[<ffffffff810a8cc1>] [<ffffffff810a8cc1>] swsusp_free+0x21/0x190 RSP: 0018:ffff8800b3b17ea8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8800b39bab00 RCX: 0000000000000001 RDX: ffff8800b39bab10 RSI: ffff8800b39bab00 RDI: 0000000000000000 RBP: 0000000000000010 R08: 0000000000000000 R09: 0000000000000000 R10: ffff8800b39bab10 R11: 0000000000000246 R12: ffffea0000000000 R13: ffff880232f485a0 R14: ffff88023ac27cd8 R15: ffff880232927590 FS: 00007f406d83b700(0000) GS:ffff88023bc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000000 CR3: 00000000b3a62000 CR4: 00000000000007e0 Stack: ffff8800b39bab00 0000000000000010 ffff880232927590 ffffffff810acb4a ffff8800b39bab00 ffffffff811a955a ffff8800b39bab10 0000000000000000 ffff88023155f098 ffffffff81a6b8c0 ffff88023155ea40 0000000000000007 Call Trace: [<ffffffff810acb4a>] ? snapshot_release+0x2a/0xb0 [<ffffffff811a955a>] ? __fput+0xca/0x1d0 [<ffffffff81080627>] ? task_work_run+0x97/0xd0 [<ffffffff81012d89>] ? do_notify_resume+0x69/0xa0 [<ffffffff8151452a>] ? int_signal+0x12/0x17 Code: 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 54 48 8b 05 ba 62 9c 00 49 bc 00 00 00 00 00 ea ff ff 48 8b 3d a1 62 9c 00 55 53 <48> 8b 10 48 89 50 18 48 8b 52 20 48 c7 40 28 00 00 00 00 c7 40 RIP [<ffffffff810a8cc1>] swsusp_free+0x21/0x190 RSP <ffff8800b3b17ea8> CR2: 0000000000000000 ---[ end trace f02be86a1ec0cccb ]--- due to forbidden_pages_map being NULL in swsusp_free(). Fixes: 6efde38f0769 "PM / Hibernate: Iterate over set bits instead of PFNs in swsusp_free()" Reported-by: Bjørn Mork <bjorn@mork.no> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-23Merge branch 'for-3.17-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fix from Tejun Heo: "One late fix for cgroup. I was waiting for another set of fixes for a long-standing obscure cpuset bug but am not sure whether they'll be ready before v3.17 release. This one is a simple fix for a mutex unlock balance bug in an allocation failure path in pidlist_array_load(). The bug was introduced in v3.14 and the fix is tagged for -stable" * 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: fix unbalanced locking
2014-09-19Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Two kernel side fixes: a kprobes fix and a perf_remove_from_context() fix (which does not yet fix the migration bug which is WIP)" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Fix a race condition in perf_remove_from_context() kprobes/x86: Free 'optinsn' cache when range check fails
2014-09-18cgroup: fix unbalanced lockingZefan Li
cgroup_pidlist_start() holds cgrp->pidlist_mutex and then calls pidlist_array_load(), and cgroup_pidlist_stop() releases the mutex. It is wrong that we release the mutex in the failure path in pidlist_array_load(), because cgroup_pidlist_stop() will be called no matter if cgroup_pidlist_start() returns errno or not. Fixes: 4bac00d16a8760eae7205e41d2c246477d42a210 Cc: <stable@vger.kernel.org> # 3.14+ Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
2014-09-13Merge branches 'locking-urgent-for-linus' and 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull futex and timer fixes from Thomas Gleixner: "A oneliner bugfix for the jinxed futex code: - Drop hash bucket lock in the error exit path. I really could slap myself for intruducing that bug while fixing all the other horror in that code three month ago ... and the timer department is not too proud about the following fixes: - Deal with a long standing rounding bug in the timeval to jiffies conversion. It's a real issue and this fix fell through the cracks for quite some time. - Another round of alarmtimer fixes. Finally this code gets used more widely and the subtle issues hidden for quite some time are noticed and fixed. Nothing really exciting, just the itty bitty details which bite the serious users here and there" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Unlock hb->lock in futex_wait_requeue_pi() error path * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: alarmtimer: Lock k_itimer during timer callback alarmtimer: Do not signal SIGEV_NONE timers alarmtimer: Return relative times in timer_gettime jiffies: Fix timeval conversion to jiffies
2014-09-12alarmtimer: Lock k_itimer during timer callbackRichard Larocque
Locks the k_itimer's it_lock member when handling the alarm timer's expiry callback. The regular posix timers defined in posix-timers.c have this lock held during timout processing because their callbacks are routed through posix_timer_fn(). The alarm timers follow a different path, so they ought to grab the lock somewhere else. Cc: stable@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sharvil Nanavati <sharvil@google.com> Signed-off-by: Richard Larocque <rlarocque@google.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-09-12alarmtimer: Do not signal SIGEV_NONE timersRichard Larocque
Avoids sending a signal to alarm timers created with sigev_notify set to SIGEV_NONE by checking for that special case in the timeout callback. The regular posix timers avoid sending signals to SIGEV_NONE timers by not scheduling any callbacks for them in the first place. Although it would be possible to do something similar for alarm timers, it's simpler to handle this as a special case in the timeout. Prior to this patch, the alarm timer would ignore the sigev_notify value and try to deliver signals to the process anyway. Even worse, the sanity check for the value of sigev_signo is skipped when SIGEV_NONE was specified, so the signal number could be bogus. If sigev_signo was an unitialized value (as it often would be if SIGEV_NONE is used), then it's hard to predict which signal will be sent. Cc: stable@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sharvil Nanavati <sharvil@google.com> Signed-off-by: Richard Larocque <rlarocque@google.com> Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-09-12alarmtimer: Return relative times in timer_gettimeRichard Larocque
Returns the time remaining for an alarm timer, rather than the time at which it is scheduled to expire. If the timer has already expired or it is not currently scheduled, the it_value's members are set to zero. This new behavior matches that of the other posix-timers and the POSIX specifications. This is a change in user-visible behavior, and may break existing applications. Hopefully, few users rely on the old incorrect behavior. Cc: stable@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sharvil Nanavati <sharvil@google.com> Signed-off-by: Richard Larocque <rlarocque@google.com> [jstultz: minor style tweak] Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-09-12jiffies: Fix timeval conversion to jiffiesAndrew Hunter
timeval_to_jiffies tried to round a timeval up to an integral number of jiffies, but the logic for doing so was incorrect: intervals corresponding to exactly N jiffies would become N+1. This manifested itself particularly repeatedly stopping/starting an itimer: setitimer(ITIMER_PROF, &val, NULL); setitimer(ITIMER_PROF, NULL, &val); would add a full tick to val, _even if it was exactly representable in terms of jiffies_ (say, the result of a previous rounding.) Doing this repeatedly would cause unbounded growth in val. So fix the math. Here's what was wrong with the conversion: we essentially computed (eliding seconds) jiffies = usec * (NSEC_PER_USEC/TICK_NSEC) by using scaling arithmetic, which took the best approximation of NSEC_PER_USEC/TICK_NSEC with denominator of 2^USEC_JIFFIE_SC = x/(2^USEC_JIFFIE_SC), and computed: jiffies = (usec * x) >> USEC_JIFFIE_SC and rounded this calculation up in the intermediate form (since we can't necessarily exactly represent TICK_NSEC in usec.) But the scaling arithmetic is a (very slight) *over*approximation of the true value; that is, instead of dividing by (1 usec/ 1 jiffie), we effectively divided by (1 usec/1 jiffie)-epsilon (rounding down). This would normally be fine, but we want to round timeouts up, and we did so by adding 2^USEC_JIFFIE_SC - 1 before the shift; this would be fine if our division was exact, but dividing this by the slightly smaller factor was equivalent to adding just _over_ 1 to the final result (instead of just _under_ 1, as desired.) In particular, with HZ=1000, we consistently computed that 10000 usec was 11 jiffies; the same was true for any exact multiple of TICK_NSEC. We could possibly still round in the intermediate form, adding something less than 2^USEC_JIFFIE_SC - 1, but easier still is to convert usec->nsec, round in nanoseconds, and then convert using time*spec*_to_jiffies. This adds one constant multiplication, and is not observably slower in microbenchmarks on recent x86 hardware. Tested: the following program: int main() { struct itimerval zero = {{0, 0}, {0, 0}}; /* Initially set to 10 ms. */ struct itimerval initial = zero; initial.it_interval.tv_usec = 10000; setitimer(ITIMER_PROF, &initial, NULL); /* Save and restore several times. */ for (size_t i = 0; i < 10; ++i) { struct itimerval prev; setitimer(ITIMER_PROF, &zero, &prev); /* on old kernels, this goes up by TICK_USEC every iteration */ printf("previous value: %ld %ld %ld %ld\n", prev.it_interval.tv_sec, prev.it_interval.tv_usec, prev.it_value.tv_sec, prev.it_value.tv_usec); setitimer(ITIMER_PROF, &prev, NULL); } return 0; } Cc: stable@vger.kernel.org Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Paul Turner <pjt@google.com> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Reviewed-by: Paul Turner <pjt@google.com> Reported-by: Aaron Jacobs <jacobsa@google.com> Signed-off-by: Andrew Hunter <ahh@google.com> [jstultz: Tweaked to apply to 3.17-rc] Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-09-12futex: Unlock hb->lock in futex_wait_requeue_pi() error pathThomas Gleixner
futex_wait_requeue_pi() calls futex_wait_setup(). If futex_wait_setup() succeeds it returns with hb->lock held and preemption disabled. Now the sanity check after this does: if (match_futex(&q.key, &key2)) { ret = -EINVAL; goto out_put_keys; } which releases the keys but does not release hb->lock. So we happily return to user space with hb->lock held and therefor preemption disabled. Unlock hb->lock before taking the exit route. Reported-by: Dave "Trinity" Jones <davej@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Darren Hart <dvhart@linux.intel.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409112318500.4178@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-09-10kcmp: fix standard comparison bugRasmus Villemoes
The C operator <= defines a perfectly fine total ordering on the set of values representable in a long. However, unlike its namesake in the integers, it is not translation invariant, meaning that we do not have "b <= c" iff "a+b <= a+c" for all a,b,c. This means that it is always wrong to try to boil down the relationship between two longs to a question about the sign of their difference, because the resulting relation [a LEQ b iff a-b <= 0] is neither anti-symmetric or transitive. The former is due to -LONG_MIN==LONG_MIN (take any two a,b with a-b = LONG_MIN; then a LEQ b and b LEQ a, but a != b). The latter can either be seen observing that x LEQ x+1 for all x, implying x LEQ x+1 LEQ x+2 ... LEQ x-1 LEQ x; or more directly with the simple example a=LONG_MIN, b=0, c=1, for which a-b < 0, b-c < 0, but a-c > 0. Note that it makes absolutely no difference that a transmogrying bijection has been applied before the comparison is done. In fact, had the obfuscation not been done, one could probably not observe the bug (assuming all values being compared always lie in one half of the address space, the mathematical value of a-b is always representable in a long). As it stands, one can easily obtain three file descriptors exhibiting the non-transitivity of kcmp(). Side note 1: I can't see that ensuring the MSB of the multiplier is set serves any purpose other than obfuscating the obfuscating code. Side note 2: #include <stdio.h> #include <stdlib.h> #include <string.h> #include <fcntl.h> #include <unistd.h> #include <assert.h> #include <sys/syscall.h> enum kcmp_type { KCMP_FILE, KCMP_VM, KCMP_FILES, KCMP_FS, KCMP_SIGHAND, KCMP_IO, KCMP_SYSVSEM, KCMP_TYPES, }; pid_t pid; int kcmp(pid_t pid1, pid_t pid2, int type, unsigned long idx1, unsigned long idx2) { return syscall(SYS_kcmp, pid1, pid2, type, idx1, idx2); } int cmp_fd(int fd1, int fd2) { int c = kcmp(pid, pid, KCMP_FILE, fd1, fd2); if (c < 0) { perror("kcmp"); exit(1); } assert(0 <= c && c < 3); return c; } int cmp_fdp(const void *a, const void *b) { static const int normalize[] = {0, -1, 1}; return normalize[cmp_fd(*(int*)a, *(int*)b)]; } #define MAX 100 /* This is plenty; I've seen it trigger for MAX==3 */ int main(int argc, char *argv[]) { int r, s, count = 0; int REL[3] = {0,0,0}; int fd[MAX]; pid = getpid(); while (count < MAX) { r = open("/dev/null", O_RDONLY); if (r < 0) break; fd[count++] = r; } printf("opened %d file descriptors\n", count); for (r = 0; r < count; ++r) { for (s = r+1; s < count; ++s) { REL[cmp_fd(fd[r], fd[s])]++; } } printf("== %d\t< %d\t> %d\n", REL[0], REL[1], REL[2]); qsort(fd, count, sizeof(fd[0]), cmp_fdp); memset(REL, 0, sizeof(REL)); for (r = 0; r < count; ++r) { for (s = r+1; s < count; ++s) { REL[cmp_fd(fd[r], fd[s])]++; } } printf("== %d\t< %d\t> %d\n", REL[0], REL[1], REL[2]); return (REL[0] + REL[2] != 0); } Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-10kernel/printk/printk.c: fix faulty logic in the case of recursive printkPatrick Palka
We shouldn't set text_len in the code path that detects printk recursion because text_len corresponds to the length of the string inside textbuf. A few lines down from the line text_len = strlen(recursion_msg); is the line text_len += vscnprintf(text + text_len, ...); So if printk detects recursion, it sets text_len to 29 (the length of recursion_msg) and logs an error. Then the message supplied by the caller of printk is stored inside textbuf but offset by 29 bytes. This means that the output of the recursive call to printk will contain 29 bytes of garbage in front of it. This defect is caused by commit 458df9fd4815 ("printk: remove separate printk_sched buffers and use printk buf instead") which turned the line text_len = vscnprintf(text, ...); into text_len += vscnprintf(text + text_len, ...); To fix this, this patch avoids setting text_len when logging the printk recursion error. This patch also marks unlikely() the branch leading up to this code. Fixes: 458df9fd4815b478 ("printk: remove separate printk_sched buffers and use printk buf instead") Signed-off-by: Patrick Palka <patrick@parcs.ath.cx> Reviewed-by: Petr Mladek <pmladek@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Steven Rostedt <rostedt@goodmis.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-09-09perf: Fix a race condition in perf_remove_from_context()Cong Wang
We saw a kernel soft lockup in perf_remove_from_context(), it looks like the `perf` process, when exiting, could not go out of the retry loop. Meanwhile, the target process was forking a child. So either the target process should execute the smp function call to deactive the event (if it was running) or it should do a context switch which deactives the event. It seems we optimize out a context switch in perf_event_context_sched_out(), and what's more important, we still test an obsolete task pointer when retrying, so no one actually would deactive that event in this situation. Fix it directly by reloading the task pointer in perf_remove_from_context(). This should cure the above soft lockup. Signed-off-by: Cong Wang <cwang@twopensource.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/1409696840-843-1-git-send-email-xiyou.wangcong@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-07Merge branch 'for-3.17-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: "This pull request includes Alban's patch to disallow '\n' in cgroup names. Two other patches from Li to fix a possible oops when cgroup destruction races against other file operations and one from Vivek to fix a unified hierarchy devel behavior" * 'for-3.17-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: check cgroup liveliness before unbreaking kernfs cgroup: delay the clearing of cgrp->kn->priv cgroup: Display legacy cgroup files on default hierarchy cgroup: reject cgroup names with '\n'
2014-09-07Merge tag 'pm+acpi-3.17-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management fixes from Rafael Wysocki: "These are regression fixes (ACPI sysfs, ACPI video, suspend test), ACPI cpuidle deadlock fix, missing runtime validation of ACPI _DSD output, a fix and a new CPU ID for the RAPL driver, new blacklist entry for the ACPI EC driver and a couple of trivial cleanups (intel_pstate and generic PM domains). Specifics: - Fix for recently broken test_suspend= command line argument (Rafael Wysocki). - Fixes for regressions related to the ACPI video driver caused by switching the default to native backlight handling in 3.16 from Hans de Goede. - Fix for a sysfs attribute of ACPI device objects that returns stale values sometimes due to the fact that they are cached instead of executing the appropriate method (_SUN) every time (broken in 3.14). From Yasuaki Ishimatsu. - Fix for a deadlock between cpuidle_lock and cpu_hotplug.lock in the ACPI processor driver from Jiri Kosina. - Runtime output validation for the ACPI _DSD device configuration object missing from the support for it that has been introduced recently. From Mika Westerberg. - Fix for an unuseful and misleading RAPL (Running Average Power Limit) domain detection message in the RAPL driver from Jacob Pan. - New Intel Haswell CPU ID for the RAPL driver from Jason Baron. - New Clevo W350etq blacklist entry for the ACPI EC driver from Lan Tianyu. - Cleanup for the intel_pstate driver and the core generic PM domains code from Gabriele Mazzotta and Geert Uytterhoeven" * tag 'pm+acpi-3.17-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI / cpuidle: fix deadlock between cpuidle_lock and cpu_hotplug.lock ACPI / scan: not cache _SUN value in struct acpi_device_pnp cpufreq: intel_pstate: Remove unneeded variable powercap / RAPL: change domain detection message powercap / RAPL: add support for CPU model 0x3f PM / domains: Make generic_pm_domain.name const PM / sleep: Fix test_suspend= command line option ACPI / EC: Add msi quirk for Clevo W350etq ACPI / video: Disable native_backlight on HP ENVY 15 Notebook PC ACPI / video: Add a disable_native_backlight quirk ACPI / video: Fix use_native_backlight selection logic ACPICA: ACPI 5.1: Add support for runtime validation of _DSD package.
2014-09-07Merge branch 'core-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU fix from Ingo Molnar: "A boot hang fix for the offloaded callback RCU model (RCU_NOCB_CPU=y && (TREE_CPU=y || TREE_PREEMPT_RC)) in certain bootup scenarios" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rcu: Make nocb leader kthreads process pending callbacks after spawning
2014-09-07Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Thomas Gleixner: "Three fixlets from the timer departement: - Update the timekeeper before updating vsyscall and pvclock. This fixes the kvm-clock regression reported by Chris and Paolo. - Use the proper irq work interface from NMI. This fixes the regression reported by Catalin and Dave. - Clarify the compat_nanosleep error handling mechanism to avoid future confusion" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Update timekeeper before updating vsyscall and pvclock compat: nanosleep: Clarify error handling nohz: Restore NMI safe local irq work for local nohz kick
2014-09-06timekeeping: Update timekeeper before updating vsyscall and pvclockThomas Gleixner
The update_walltime() code works on the shadow timekeeper to make the seqcount protected region as short as possible. But that update to the shadow timekeeper does not update all timekeeper fields because it's sufficient to do that once before it becomes life. One of these fields is tkr.base_mono. That stays stale in the shadow timekeeper unless an operation happens which copies the real timekeeper to the shadow. The update function is called after the update calls to vsyscall and pvclock. While not correct, it did not cause any problems because none of the invoked update functions used base_mono. commit cbcf2dd3b3d4 (x86: kvm: Make kvm_get_time_and_clockread() nanoseconds based) changed that in the kvm pvclock update function, so the stale mono_base value got used and caused kvm-clock to malfunction. Put the update where it belongs and fix the issue. Reported-by: Chris J Arges <chris.j.arges@canonical.com> Reported-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Gleb Natapov <gleb@kernel.org> Cc: John Stultz <john.stultz@linaro.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409050000570.3333@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-09-06compat: nanosleep: Clarify error handlingThomas Gleixner
The error handling in compat_sys_nanosleep() is correct, but completely non obvious. Document it and restrict it to the -ERESTART_RESTARTBLOCK return value for clarity. Reported-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-09-04nohz: Restore NMI safe local irq work for local nohz kickFrederic Weisbecker
The local nohz kick is currently used by perf which needs it to be NMI-safe. Recent commit though (7d1311b93e58ed55f3a31cc8f94c4b8fe988a2b9) changed its implementation to fire the local kick using the remote kick API. It was convenient to make the code more generic but the remote kick isn't NMI-safe. As a result: WARNING: CPU: 3 PID: 18062 at kernel/irq_work.c:72 irq_work_queue_on+0x11e/0x140() CPU: 3 PID: 18062 Comm: trinity-subchil Not tainted 3.16.0+ #34 0000000000000009 00000000903774d1 ffff880244e06c00 ffffffff9a7f1e37 0000000000000000 ffff880244e06c38 ffffffff9a0791dd ffff880244fce180 0000000000000003 ffff880244e06d58 ffff880244e06ef8 0000000000000000 Call Trace: <NMI> [<ffffffff9a7f1e37>] dump_stack+0x4e/0x7a [<ffffffff9a0791dd>] warn_slowpath_common+0x7d/0xa0 [<ffffffff9a07930a>] warn_slowpath_null+0x1a/0x20 [<ffffffff9a17ca1e>] irq_work_queue_on+0x11e/0x140 [<ffffffff9a10a2c7>] tick_nohz_full_kick_cpu+0x57/0x90 [<ffffffff9a186cd5>] __perf_event_overflow+0x275/0x350 [<ffffffff9a184f80>] ? perf_event_task_disable+0xa0/0xa0 [<ffffffff9a01a4cf>] ? x86_perf_event_set_period+0xbf/0x150 [<ffffffff9a187934>] perf_event_overflow+0x14/0x20 [<ffffffff9a020386>] intel_pmu_handle_irq+0x206/0x410 [<ffffffff9a0b54d3>] ? arch_vtime_task_switch+0x63/0x130 [<ffffffff9a01937b>] perf_event_nmi_handler+0x2b/0x50 [<ffffffff9a007b72>] nmi_handle+0xd2/0x390 [<ffffffff9a007aa5>] ? nmi_handle+0x5/0x390 [<ffffffff9a0d131b>] ? lock_release+0xab/0x330 [<ffffffff9a008062>] default_do_nmi+0x72/0x1c0 [<ffffffff9a0c925f>] ? cpuacct_account_field+0xcf/0x200 [<ffffffff9a008268>] do_nmi+0xb8/0x100 Lets fix this by restoring the use of local irq work for the nohz local kick. Reported-by: Catalin Iacob <iacobcatalin@gmail.com> Reported-and-tested-by: Dave Jones <davej@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2014-09-05cgroup: check cgroup liveliness before unbreaking kernfsLi Zefan
When cgroup_kn_lock_live() is called through some kernfs operation and another thread is calling cgroup_rmdir(), we'll trigger the warning in cgroup_get(). ------------[ cut here ]------------ WARNING: CPU: 1 PID: 1228 at kernel/cgroup.c:1034 cgroup_get+0x89/0xa0() ... Call Trace: [<c16ee73d>] dump_stack+0x41/0x52 [<c10468ef>] warn_slowpath_common+0x7f/0xa0 [<c104692d>] warn_slowpath_null+0x1d/0x20 [<c10bb999>] cgroup_get+0x89/0xa0 [<c10bbe58>] cgroup_kn_lock_live+0x28/0x70 [<c10be3c1>] __cgroup_procs_write.isra.26+0x51/0x230 [<c10be5b2>] cgroup_tasks_write+0x12/0x20 [<c10bb7b0>] cgroup_file_write+0x40/0x130 [<c11aee71>] kernfs_fop_write+0xd1/0x160 [<c1148e58>] vfs_write+0x98/0x1e0 [<c114934d>] SyS_write+0x4d/0xa0 [<c16f656b>] sysenter_do_call+0x12/0x12 ---[ end trace 6f2e0c38c2108a74 ]--- Fix this by calling css_tryget() instead of cgroup_get(). v2: - move cgroup_tryget() right below cgroup_get() definition. (Tejun) Cc: <stable@vger.kernel.org> # 3.15+ Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-05cgroup: delay the clearing of cgrp->kn->privLi Zefan
Run these two scripts concurrently: for ((; ;)) { mkdir /cgroup/sub rmdir /cgroup/sub } for ((; ;)) { echo $$ > /cgroup/sub/cgroup.procs echo $$ > /cgroup/cgroup.procs } A kernel bug will be triggered: BUG: unable to handle kernel NULL pointer dereference at 00000038 IP: [<c10bbd69>] cgroup_put+0x9/0x80 ... Call Trace: [<c10bbe19>] cgroup_kn_unlock+0x39/0x50 [<c10bbe91>] cgroup_kn_lock_live+0x61/0x70 [<c10be3c1>] __cgroup_procs_write.isra.26+0x51/0x230 [<c10be5b2>] cgroup_tasks_write+0x12/0x20 [<c10bb7b0>] cgroup_file_write+0x40/0x130 [<c11aee71>] kernfs_fop_write+0xd1/0x160 [<c1148e58>] vfs_write+0x98/0x1e0 [<c114934d>] SyS_write+0x4d/0xa0 [<c16f656b>] sysenter_do_call+0x12/0x12 We clear cgrp->kn->priv in the end of cgroup_rmdir(), but another concurrent thread can access kn->priv after the clearing. We should move the clearing to css_release_work_fn(). At that time no one is holding reference to the cgroup and no one can gain a new reference to access it. v2: - move RCU_INIT_POINTER() into the else block. (Tejun) - remove the cgroup_parent() check. (Tejun) - update the comment in css_tryget_online_from_dir(). Cc: <stable@vger.kernel.org> # 3.15+ Reported-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: Zefan Li <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-09-03Merge branch 'rcu/urgent' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent Pull an RCU fix from Paul E. McKenney: "This series contains a single commit fixing an initialization bug reported by Amit Shah and fixed by Pranith Kumar (and tested by Amit). This bug results in a boot-time hang in callback-offloaded configurations where callbacks were posted before the offloading ('rcuo') kthreads were created." Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-09-03PM / sleep: Fix test_suspend= command line optionRafael J. Wysocki
After commit d431cbc53cb7 (PM / sleep: Simplify sleep states sysfs interface code) the pm_states[] array is not populated initially, which causes setup_test_suspend() to always fail and the suspend testing during boot doesn't work any more. Fix the problem by using pm_labels[] instead of pm_states[] in setup_test_suspend() and storing a pointer to the label of the sleep state to test rather than the number representing it, because the connection between the state numbers and labels is only established by suspend_set_ops(). Fixes: d431cbc53cb7 (PM / sleep: Simplify sleep states sysfs interface code) Reported-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-09-01Merge branch 'irq-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq handling fixlet from Thomas Gleixner: "Just an export for an interrupt flow handler which is now used in gpio modules" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irq: Export handle_fasteoi_irq
2014-08-29kexec: create a new config option CONFIG_KEXEC_FILE for new syscallVivek Goyal
Currently new system call kexec_file_load() and all the associated code compiles if CONFIG_KEXEC=y. But new syscall also compiles purgatory code which currently uses gcc option -mcmodel=large. This option seems to be available only gcc 4.4 onwards. Hiding new functionality behind a new config option will not break existing users of old gcc. Those who wish to enable new functionality will require new gcc. Having said that, I am trying to figure out how can I move away from using -mcmodel=large but that can take a while. I think there are other advantages of introducing this new config option. As this option will be enabled only on x86_64, other arches don't have to compile generic kexec code which will never be used. This new code selects CRYPTO=y and CRYPTO_SHA256=y. And all other arches had to do this for CONFIG_KEXEC. Now with introduction of new config option, we can remove crypto dependency from other arches. Now CONFIG_KEXEC_FILE is available only on x86_64. So whereever I had CONFIG_X86_64 defined, I got rid of that. For CONFIG_KEXEC_FILE, instead of doing select CRYPTO=y, I changed it to "depends on CRYPTO=y". This should be safer as "select" is not recursive. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: H. Peter Anvin <hpa@zytor.com> Tested-by: Shaun Ruffell <sruffell@digium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>