summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2016-05-11tracing: Don't display trigger file for events that can't be enabledChunyu Hu
commit 854145e0a8e9a05f7366d240e2f99d9c1ca6d6dd upstream. Currently register functions for events will be called through the 'reg' field of event class directly without any check when seting up triggers. Triggers for events that don't support register through debug fs (events under events/ftrace are for trace-cmd to read event format, and most of them don't have a register function except events/ftrace/functionx) can't be enabled at all, and an oops will be hit when setting up trigger for those events, so just not creating them is an easy way to avoid the oops. Link: http://lkml.kernel.org/r/1462275274-3911-1-git-send-email-chuhu@redhat.com Fixes: 85f2b08268c01 ("tracing: Add basic event trigger framework") Signed-off-by: Chunyu Hu <chuhu@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-04locking/mcs: Fix mcs_spin_lock() orderingPeter Zijlstra
commit 920c720aa5aa3900a7f1689228fdfc2580a91e7e upstream. Similar to commit b4b29f94856a ("locking/osq: Fix ordering of node initialisation in osq_lock") the use of xchg_acquire() is fundamentally broken with MCS like constructs. Furthermore, it turns out we rely on the global transitivity of this operation because the unlock path observes the pointer with a READ_ONCE(), not an smp_load_acquire(). This is non-critical because the MCS code isn't actually used and mostly serves as documentation, a stepping stone to the more complex things we've build on top of the idea. Reported-by: Andrea Parri <parri.andrea@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Fixes: 3552a07a9c4a ("locking/mcs: Use acquire/release semantics") Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-04cgroup: make sure a parent css isn't freed before its childrenTejun Heo
commit 8bb5ef79bc0f4016ecf79e8dce6096a3c63603e4 upstream. There are three subsystem callbacks in css shutdown path - css_offline(), css_released() and css_free(). Except for css_released(), cgroup core didn't guarantee the order of invocation. css_offline() or css_free() could be called on a parent css before its children. This behavior is unexpected and led to bugs in cpu and memory controller. The previous patch updated ordering for css_offline() which fixes the cpu controller issue. While there currently isn't a known bug caused by misordering of css_free() invocations, let's fix it too for consistency. css_free() ordering can be trivially fixed by moving putting of the parent css below css_free() invocation. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-04cgroup, cpuset: replace cpuset_post_attach_flush() with ↵Tejun Heo
cgroup_subsys->post_attach callback commit 5cf1cacb49aee39c3e02ae87068fc3c6430659b0 upstream. Since e93ad19d0564 ("cpuset: make mm migration asynchronous"), cpuset kicks off asynchronous NUMA node migration if necessary during task migration and flushes it from cpuset_post_attach_flush() which is called at the end of __cgroup_procs_write(). This is to avoid performing migration with cgroup_threadgroup_rwsem write-locked which can lead to deadlock through dependency on kworker creation. memcg has a similar issue with charge moving, so let's convert it to an official callback rather than the current one-off cpuset specific function. This patch adds cgroup_subsys->post_attach callback and makes cpuset register cpuset_post_attach_flush() as its ->post_attach. The conversion is mostly one-to-one except that the new callback is called under cgroup_mutex. This is to guarantee that no other migration operations are started before ->post_attach callbacks are finished. cgroup_mutex is one of the outermost mutex in the system and has never been and shouldn't be a problem. We can add specialized synchronization around __cgroup_procs_write() but I don't think there's any noticeable benefit. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-04workqueue: fix ghost PENDING flag while doing MQ IORoman Pen
commit 346c09f80459a3ad97df1816d6d606169a51001a upstream. The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list with the following backtrace: [ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds. [ 601.347574] Tainted: G O 4.4.5-1-storage+ #6 [ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 601.348142] kworker/u129:5 D ffff880803077988 0 1636 2 0x00000000 [ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server] [ 601.348999] ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000 [ 601.349662] ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0 [ 601.350333] ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38 [ 601.350965] Call Trace: [ 601.351203] [<ffffffff815b0920>] ? bit_wait+0x60/0x60 [ 601.351444] [<ffffffff815b01d5>] schedule+0x35/0x80 [ 601.351709] [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230 [ 601.351958] [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220 [ 601.352208] [<ffffffff810bd737>] ? ktime_get+0x37/0xa0 [ 601.352446] [<ffffffff815b0920>] ? bit_wait+0x60/0x60 [ 601.352688] [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110 [ 601.352951] [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10 [ 601.353196] [<ffffffff815b093b>] bit_wait_io+0x1b/0x70 [ 601.353440] [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90 [ 601.353689] [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0 [ 601.353958] [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40 [ 601.354200] [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140 [ 601.354441] [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30 [ 601.354688] [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70 [ 601.354932] [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50 [ 601.355193] [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0 [ 601.355432] [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100 [ 601.355679] [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0 [ 601.355925] [<ffffffff81198379>] vfs_write+0xa9/0x1a0 [ 601.356164] [<ffffffff811c59d8>] kernel_write+0x38/0x50 The underlying device is a null_blk, with default parameters: queue_mode = MQ submit_queues = 1 Verification that nullb0 has something inflight: root@pserver8:~# cat /sys/block/nullb0/inflight 0 1 root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \; ... /sys/block/nullb0/mq/0/cpu2/rq_list CTX pending: ffff8838038e2400 ... During debug it became clear that stalled request is always inserted in the rq_list from the following path: save_stack_trace_tsk + 34 blk_mq_insert_requests + 231 blk_mq_flush_plug_list + 281 blk_flush_plug_list + 199 wait_on_page_bit + 192 __filemap_fdatawait_range + 228 filemap_fdatawait_range + 20 filemap_write_and_wait_range + 63 blkdev_fsync + 27 vfs_fsync_range + 73 blkdev_write_iter + 202 __vfs_write + 170 vfs_write + 169 kernel_write + 56 So blk_flush_plug_list() was called with from_schedule == true. If from_schedule is true, that means that finally blk_mq_insert_requests() offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue, i.e. it calls kblockd_schedule_delayed_work_on(). That means, that we race with another CPU, which is about to execute __blk_mq_run_hw_queue() work. Further debugging shows the following traces from different CPUs: CPU#0 CPU#1 ---------------------------------- ------------------------------- reqeust A inserted STORE hctx->ctx_map[0] bit marked kblockd_schedule...() returns 1 <schedule to kblockd workqueue> request B inserted STORE hctx->ctx_map[1] bit marked kblockd_schedule...() returns 0 *** WORK PENDING bit is cleared *** flush_busy_ctxs() is executed, but bit 1, set by CPU#1, is not observed As a result request B pended forever. This behaviour can be explained by speculative LOAD of hctx->ctx_map on CPU#0, which is reordered with clear of PENDING bit and executed _before_ actual STORE of bit 1 on CPU#1. The proper fix is an explicit full barrier <mfence>, which guarantees that clear of PENDING bit is to be executed before all possible speculative LOADS or STORES inside actual work function. Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> Cc: Gioh Kim <gi-oh.kim@profitbricks.com> Cc: Michael Wang <yun.wang@profitbricks.com> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-04futex: Acknowledge a new waiter in counter before plistDavidlohr Bueso
commit fe1bce9e2107ba3a8faffe572483b6974201a0e6 upstream. Otherwise an incoming waker on the dest hash bucket can miss the waiter adding itself to the plist during the lockless check optimization (small window but still the correct way of doing this); similarly to the decrement counterpart. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: bigeasy@linutronix.de Cc: dvhart@infradead.org Link: http://lkml.kernel.org/r/1461208164-29150-1-git-send-email-dave@stgolabs.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-04futex: Handle unlock_pi race gracefullySebastian Andrzej Siewior
commit 89e9e66ba1b3bde9d8ea90566c2aee20697ad681 upstream. If userspace calls UNLOCK_PI unconditionally without trying the TID -> 0 transition in user space first then the user space value might not have the waiters bit set. This opens the following race: CPU0 CPU1 uval = get_user(futex) lock(hb) lock(hb) futex |= FUTEX_WAITERS .... unlock(hb) cmpxchg(futex, uval, newval) So the cmpxchg fails and returns -EINVAL to user space, which is wrong because the futex value is valid. To handle this (yes, yet another) corner case gracefully, check for a flag change and retry. [ tglx: Massaged changelog and slightly reworked implementation ] Fixes: ccf9e6a80d9e ("futex: Make unlock_pi more robust") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Darren Hart <dvhart@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1460723739-5195-1-git-send-email-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-04sched/cgroup: Fix/cleanup cgroup teardown/initPeter Zijlstra
commit 2f5177f0fd7e531b26d54633be62d1d4cb94621c upstream. The CPU controller hasn't kept up with the various changes in the whole cgroup initialization / destruction sequence, and commit: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups") caused it to explode. The reason for this is that zombies do not inhibit css_offline() from being called, but do stall css_released(). Now we tear down the cfs_rq structures on css_offline() but zombies can run after that, leading to use-after-free issues. The solution is to move the tear-down to css_released(), which guarantees nobody (including no zombies) is still using our cgroup. Furthermore, a few simple cleanups are possible too. There doesn't appear to be any point to us using css_online() (anymore?) so fold that in css_alloc(). And since cgroup code guarantees an RCU grace period between css_released() and css_free() we can forgo using call_rcu() and free the stuff immediately. Suggested-by: Tejun Heo <tj@kernel.org> Reported-by: Kazuki Yamaguchi <k@rhe.jp> Reported-by: Niklas Cassel <niklas.cassel@axis.com> Tested-by: Niklas Cassel <niklas.cassel@axis.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Tejun Heo <tj@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups") Link: http://lkml.kernel.org/r/20160316152245.GY6344@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20perf: Cure event->pending_disable racePeter Zijlstra
commit 28a967c3a2f99fa3b5f762f25cb2a319d933571b upstream. Because event_sched_out() checks event->pending_disable _before_ actually disabling the event, it can happen that the event fires after it checks but before it gets disabled. This would leave event->pending_disable set and the queued irq_work will try and process it. However, if the event trigger was during schedule(), the event might have been de-scheduled by the time the irq_work runs, and perf_event_disable_local() will fail. Fix this by checking event->pending_disable _after_ we call event->pmu->del(). This depends on the latter being a compiler barrier, such that the compiler does not lift the load and re-creates the problem. Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dvyukov@google.com Cc: eranian@google.com Cc: oleg@redhat.com Cc: panand@redhat.com Cc: sasha.levin@oracle.com Cc: vince@deater.net Link: http://lkml.kernel.org/r/20160224174948.040469884@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20perf: Do not double freePeter Zijlstra
commit 130056275ade730e7a79c110212c8815202773ee upstream. In case of: err_file: fput(event_file), we'll end up calling perf_release() which in turn will free the event. Do not then free the event _again_. Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dvyukov@google.com Cc: eranian@google.com Cc: oleg@redhat.com Cc: panand@redhat.com Cc: sasha.levin@oracle.com Cc: vince@deater.net Link: http://lkml.kernel.org/r/20160224174947.697350349@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20bpf: avoid copying junk bytes in bpf_get_current_comm()Alexei Starovoitov
[ Upstream commit cdc4e47da8f4c32eeb6b2061a8a834f4362a12b7 ] Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); but the result is typically passed to print("%s", buf) and extra bytes after zero don't cause any harm. In bpf the result of bpf_get_current_comm() is used as the part of map key and was causing spurious hash map mismatches. Use strlcpy() to guarantee zero-terminated string. bpf verifier checks that output buffer is zero-initialized, so even for short task names the output buffer don't have junk bytes. Note it's not a security concern, since kprobe+bpf is root only. Fixes: ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid, comm accessors") Reported-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12sched/cputime: Fix steal time accounting vs. CPU hotplugThomas Gleixner
commit e9532e69b8d1d1284e8ecf8d2586de34aec61244 upstream. On CPU hotplug the steal time accounting can keep a stale rq->prev_steal_time value over CPU down and up. So after the CPU comes up again the delta calculation in steal_account_process_tick() wreckages itself due to the unsigned math: u64 steal = paravirt_steal_clock(smp_processor_id()); steal -= this_rq()->prev_steal_time; So if steal is smaller than rq->prev_steal_time we end up with an insane large value which then gets added to rq->prev_steal_time, resulting in a permanent wreckage of the accounting. As a consequence the per CPU stats in /proc/stat become stale. Nice trick to tell the world how idle the system is (100%) while the CPU is 100% busy running tasks. Though we prefer realistic numbers. None of the accounting values which use a previous value to account for fractions is reset at CPU hotplug time. update_rq_clock_task() has a sanity check for prev_irq_time and prev_steal_time_rq, but that sanity check solely deals with clock warps and limits the /proc/stat visible wreckage. The prev_time values are still wrong. Solution is simple: Reset rq->prev_*_time when the CPU is plugged in again. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Fixes: commit 095c0aa83e52 "sched: adjust scheduler cpu power for stolen time" Fixes: commit aa483808516c "sched: Remove irq time from available CPU power" Fixes: commit e6e6685accfa "KVM guest: Steal time accounting" Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1603041539490.3686@nanos Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12PM / sleep: Clear pm_suspend_global_flags upon hibernateLukas Wunner
commit 276142730c39c9839465a36a90e5674a8c34e839 upstream. When suspending to RAM, waking up and later suspending to disk, we gratuitously runtime resume devices after the thaw phase. This does not occur if we always suspend to RAM or always to disk. pm_complete_with_resume_check(), which gets called from pci_pm_complete() among others, schedules a runtime resume if PM_SUSPEND_FLAG_FW_RESUME is set. The flag is set during a suspend-to-RAM cycle. It is cleared at the beginning of the suspend-to-RAM cycle but not afterwards and it is not cleared during a suspend-to-disk cycle at all. Fix it. Fixes: ef25ba047601 (PM / sleep: Add flags to indicate platform firmware involvement) Signed-off-by: Lukas Wunner <lukas@wunner.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12tracing: Fix trace_printk() to print when not using bprintk()Steven Rostedt (Red Hat)
commit 3debb0a9ddb16526de8b456491b7db60114f7b5e upstream. The trace_printk() code will allocate extra buffers if the compile detects that a trace_printk() is used. To do this, the format of the trace_printk() is saved to the __trace_printk_fmt section, and if that section is bigger than zero, the buffers are allocated (along with a message that this has happened). If trace_printk() uses a format that is not a constant, and thus something not guaranteed to be around when the print happens, the compiler optimizes the fmt out, as it is not used, and the __trace_printk_fmt section is not filled. This means the kernel will not allocate the special buffers needed for the trace_printk() and the trace_printk() will not write anything to the tracing buffer. Adding a "__used" to the variable in the __trace_printk_fmt section will keep it around, even though it is set to NULL. This will keep the string from being printed in the debugfs/tracing/printk_formats section as it is not needed. Reported-by: Vlastimil Babka <vbabka@suse.cz> Fixes: 07d777fe8c398 "tracing: Add percpu buffers for trace_printk()" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12tracing: Fix crash from reading trace_pipe with sendfileSteven Rostedt (Red Hat)
commit a29054d9478d0435ab01b7544da4f674ab13f533 upstream. If tracing contains data and the trace_pipe file is read with sendfile(), then it can trigger a NULL pointer dereference and various BUG_ON within the VM code. There's a patch to fix this in the splice_to_pipe() code, but it's also a good idea to not let that happen from trace_pipe either. Link: http://lkml.kernel.org/r/1457641146-9068-1-git-send-email-rabin@rab.in Reported-by: Rabin Vincent <rabin.vincent@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12tracing: Have preempt(irqs)off trace preempt disabled functionsSteven Rostedt (Red Hat)
commit cb86e05390debcc084cfdb0a71ed4c5dbbec517d upstream. Joel Fernandes reported that the function tracing of preempt disabled sections was not being reported when running either the preemptirqsoff or preemptoff tracers. This was due to the fact that the function tracer callback for those tracers checked if irqs were disabled before tracing. But this fails when we want to trace preempt off locations as well. Joel explained that he wanted to see funcitons where interrupts are enabled but preemption was disabled. The expected output he wanted: <...>-2265 1d.h1 3419us : preempt_count_sub <-irq_exit <...>-2265 1d..1 3419us : __do_softirq <-irq_exit <...>-2265 1d..1 3419us : msecs_to_jiffies <-__do_softirq <...>-2265 1d..1 3420us : irqtime_account_irq <-__do_softirq <...>-2265 1d..1 3420us : __local_bh_disable_ip <-__do_softirq <...>-2265 1..s1 3421us : run_timer_softirq <-__do_softirq <...>-2265 1..s1 3421us : hrtimer_run_pending <-run_timer_softirq <...>-2265 1..s1 3421us : _raw_spin_lock_irq <-run_timer_softirq <...>-2265 1d.s1 3422us : preempt_count_add <-_raw_spin_lock_irq <...>-2265 1d.s2 3422us : _raw_spin_unlock_irq <-run_timer_softirq <...>-2265 1..s2 3422us : preempt_count_sub <-_raw_spin_unlock_irq <...>-2265 1..s1 3423us : rcu_bh_qs <-__do_softirq <...>-2265 1d.s1 3423us : irqtime_account_irq <-__do_softirq <...>-2265 1d.s1 3423us : __local_bh_enable <-__do_softirq There's a comment saying that the irq disabled check is because there's a possible race that tracing_cpu may be set when the function is executed. But I don't remember that race. For now, I added a check for preemption being enabled too to not record the function, as there would be no race if that was the case. I need to re-investigate this, as I'm now thinking that the tracing_cpu will always be correct. But no harm in keeping the check for now, except for the slight performance hit. Link: http://lkml.kernel.org/r/1457770386-88717-1-git-send-email-agnel.joel@gmail.com Fixes: 5e6d2b9cfa3a "tracing: Use one prologue for the preempt irqs off tracer function tracers" Reported-by: Joel Fernandes <agnel.joel@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12fs/coredump: prevent fsuid=0 dumps into user-controlled directoriesJann Horn
commit 378c6520e7d29280f400ef2ceaf155c86f05a71a upstream. This commit fixes the following security hole affecting systems where all of the following conditions are fulfilled: - The fs.suid_dumpable sysctl is set to 2. - The kernel.core_pattern sysctl's value starts with "/". (Systems where kernel.core_pattern starts with "|/" are not affected.) - Unprivileged user namespace creation is permitted. (This is true on Linux >=3.8, but some distributions disallow it by default using a distro patch.) Under these conditions, if a program executes under secure exec rules, causing it to run with the SUID_DUMP_ROOT flag, then unshares its user namespace, changes its root directory and crashes, the coredump will be written using fsuid=0 and a path derived from kernel.core_pattern - but this path is interpreted relative to the root directory of the process, allowing the attacker to control where a coredump will be written with root privileges. To fix the security issue, always interpret core_pattern for dumps that are written under SUID_DUMP_ROOT relative to the root directory of init. Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12cgroup: ignore css_sets associated with dead cgroups during migrationTejun Heo
commit 2b021cbf3cb6208f0d40fd2f1869f237934340ed upstream. Before 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups"), all dead tasks were associated with init_css_set. If a zombie task is requested for migration, while migration prep operations would still be performed on init_css_set, the actual migration would ignore zombie tasks. As init_css_set is always valid, this worked fine. However, after 2e91fa7f6d45, zombie tasks stay with the css_set it was associated with at the time of death. Let's say a task T associated with cgroup A on hierarchy H-1 and cgroup B on hiearchy H-2. After T becomes a zombie, it would still remain associated with A and B. If A only contains zombie tasks, it can be removed. On removal, A gets marked offline but stays pinned until all zombies are drained. At this point, if migration is initiated on T to a cgroup C on hierarchy H-2, migration path would try to prepare T's css_set for migration and trigger the following. WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160() CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289 ... Call Trace: [<ffffffff8127e63c>] dump_stack+0x4e/0x82 [<ffffffff810445e8>] warn_slowpath_common+0x78/0xb0 [<ffffffff810446d5>] warn_slowpath_null+0x15/0x20 [<ffffffff810c33e1>] cgroup_get+0x121/0x160 [<ffffffff810c349b>] link_css_set+0x7b/0x90 [<ffffffff810c4fbc>] find_css_set+0x3bc/0x5e0 [<ffffffff810c5269>] cgroup_migrate_prepare_dst+0x89/0x1f0 [<ffffffff810c7547>] cgroup_attach_task+0x157/0x230 [<ffffffff810c7a17>] __cgroup_procs_write+0x2b7/0x470 [<ffffffff810c7bdc>] cgroup_tasks_write+0xc/0x10 [<ffffffff810c4790>] cgroup_file_write+0x30/0x1b0 [<ffffffff811c68fc>] kernfs_fop_write+0x13c/0x180 [<ffffffff81151673>] __vfs_write+0x23/0xe0 [<ffffffff81152494>] vfs_write+0xa4/0x1a0 [<ffffffff811532d4>] SyS_write+0x44/0xa0 [<ffffffff814af2d7>] entry_SYSCALL_64_fastpath+0x12/0x6f It doesn't make sense to prepare migration for css_sets pointing to dead cgroups as they are guaranteed to contain only zombies which are ignored later during migration. This patch makes cgroup destruction path mark all affected css_sets as dead and updates the migration path to ignore them during preparation. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 2e91fa7f6d45 ("cgroup: keep zombies associated with their original cgroups") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12watchdog: don't run proc_watchdog_update if new value is same as oldJoshua Hunt
commit a1ee1932aa6bea0bb074f5e3ced112664e4637ed upstream. While working on a script to restore all sysctl params before a series of tests I found that writing any value into the /proc/sys/kernel/{nmi_watchdog,soft_watchdog,watchdog,watchdog_thresh} causes them to call proc_watchdog_update(). NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter. NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter. NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter. NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter. There doesn't appear to be a reason for doing this work every time a write occurs, so only do it when the values change. Signed-off-by: Josh Hunt <johunt@akamai.com> Acked-by: Don Zickus <dzickus@redhat.com> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12sched/cputime: Fix steal_account_process_tick() to always return jiffiesChris Friesen
commit f9c904b7613b8b4c85b10cd6b33ad41b2843fa9d upstream. The callers of steal_account_process_tick() expect it to return whether a jiffy should be considered stolen or not. Currently the return value of steal_account_process_tick() is in units of cputime, which vary between either jiffies or nsecs depending on CONFIG_VIRT_CPU_ACCOUNTING_GEN. If cputime has nsecs granularity and there is a tiny amount of stolen time (a few nsecs, say) then we will consider the entire tick stolen and will not account the tick on user/system/idle, causing /proc/stats to show invalid data. The fix is to change steal_account_process_tick() to accumulate the stolen time and only account it once it's worth a jiffy. (Thanks to Frederic Weisbecker for suggestions to fix a bug in my first version of the patch.) Signed-off-by: Chris Friesen <chris.friesen@windriver.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/56DBBDB8.40305@mail.usask.ca Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-12perf/core: Fix perf_sched_count derailmentAlexander Shishkin
commit 927a5570855836e5d5859a80ce7e91e963545e8f upstream. The error path in perf_event_open() is such that asking for a sampling event on a PMU that doesn't generate interrupts will end up in dropping the perf_sched_count even though it hasn't been incremented for this event yet. Given a sufficient amount of these calls, we'll end up disabling scheduler's jump label even though we'd still have active events in the system, thereby facilitating the arrival of the infernal regions upon us. I'm fixing this by moving account_event() inside perf_event_alloc(). Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: vince@deater.net Link: http://lkml.kernel.org/r/1456917854-29427-1-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09modules: fix longstanding /proc/kallsyms vs module insertion race.Rusty Russell
commit 8244062ef1e54502ef55f54cced659913f244c3e upstream. For CONFIG_KALLSYMS, we keep two symbol tables and two string tables. There's one full copy, marked SHF_ALLOC and laid out at the end of the module's init section. There's also a cut-down version that only contains core symbols and strings, and lives in the module's core section. After module init (and before we free the module memory), we switch the mod->symtab, mod->num_symtab and mod->strtab to point to the core versions. We do this under the module_mutex. However, kallsyms doesn't take the module_mutex: it uses preempt_disable() and rcu tricks to walk through the modules, because it's used in the oops path. It's also used in /proc/kallsyms. There's nothing atomic about the change of these variables, so we can get the old (larger!) num_symtab and the new symtab pointer; in fact this is what I saw when trying to reproduce. By grouping these variables together, we can use a carefully-dereferenced pointer to ensure we always get one or the other (the free of the module init section is already done in an RCU callback, so that's safe). We allocate the init one at the end of the module init section, and keep the core one inside the struct module itself (it could also have been allocated at the end of the module core, but that's probably overkill). [ Rebased for 4.4-stable and older, because the following changes aren't in the older trees: - e0224418516b4d8a6c2160574bac18447c354ef0: adds arg to is_core_symbol - 7523e4dc5057e157212b4741abd6256e03404cf1: module_init/module_core/init_size/core_size become init_layout.base/core_layout.base/init_layout.size/core_layout.size. ] Reported-by: Weilong Chen <chenweilong@huawei.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=111541 Cc: stable@kernel.org Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-09tracing: Do not have 'comm' filter override event 'comm' fieldSteven Rostedt (Red Hat)
commit e57cbaf0eb006eaa207395f3bfd7ce52c1b5539c upstream. Commit 9f61668073a8d "tracing: Allow triggers to filter for CPU ids and process names" added a 'comm' filter that will filter events based on the current tasks struct 'comm'. But this now hides the ability to filter events that have a 'comm' field too. For example, sched_migrate_task trace event. That has a 'comm' field of the task to be migrated. echo 'comm == "bash"' > events/sched_migrate_task/filter will now filter all sched_migrate_task events for tasks named "bash" that migrates other tasks (in interrupt context), instead of seeing when "bash" itself gets migrated. This fix requires a couple of changes. 1) Change the look up order for filter predicates to look at the events fields before looking at the generic filters. 2) Instead of basing the filter function off of the "comm" name, have the generic "comm" filter have its own filter_type (FILTER_COMM). Test against the type instead of the name to assign the filter function. 3) Add a new "COMM" filter that works just like "comm" but will filter based on the current task, even if the trace event contains a "comm" field. Do the same for "cpu" field, adding a FILTER_CPU and a filter "CPU". Fixes: 9f61668073a8d "tracing: Allow triggers to filter for CPU ids and process names" Reported-by: Matt Fleming <matt@codeblueprint.co.uk> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03kernel/resource.c: fix muxed resource handling in __request_region()Simon Guinot
commit 59ceeaaf355fa0fb16558ef7c24413c804932ada upstream. In __request_region, if a conflict with a BUSY and MUXED resource is detected, then the caller goes to sleep and waits for the resource to be released. A pointer on the conflicting resource is kept. At wake-up this pointer is used as a parent to retry to request the region. A first problem is that this pointer might well be invalid (if for example the conflicting resource have already been freed). Another problem is that the next call to __request_region() fails to detect a remaining conflict. The previously conflicting resource is passed as a parameter and __request_region() will look for a conflict among the children of this resource and not at the resource itself. It is likely to succeed anyway, even if there is still a conflict. Instead, the parent of the conflicting resource should be passed to __request_region(). As a fix, this patch doesn't update the parent resource pointer in the case we have to wait for a muxed region right after. Reported-and-tested-by: Vincent Pelletier <plr.vincent@gmail.com> Signed-off-by: Simon Guinot <simon.guinot@sequanux.org> Tested-by: Vincent Donnefort <vdonnefort@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03tracing: Fix showing function event in available_eventsSteven Rostedt (Red Hat)
commit d045437a169f899dfb0f6f7ede24cc042543ced9 upstream. The ftrace:function event is only displayed for parsing the function tracer data. It is not used to enable function tracing, and does not include an "enable" file in its event directory. Originally, this event was kept separate from other events because it did not have a ->reg parameter. But perf added a "reg" parameter for its use which caused issues, because it made the event available to functions where it was not compatible for. Commit 9b63776fa3ca9 "tracing: Do not enable function event with enable" added a TRACE_EVENT_FL_IGNORE_ENABLE flag that prevented the function event from being enabled by normal trace events. But this commit missed keeping the function event from being displayed by the "available_events" directory, which is used to show what events can be enabled by set_event. One documented way to enable all events is to: cat available_events > set_event But because the function event is displayed in the available_events, this now causes an INVALID error: cat: write error: Invalid argument Reported-by: Chunyu Hu <chuhu@redhat.com> Fixes: 9b63776fa3ca9 "tracing: Do not enable function event with enable" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03cgroup: make sure a parent css isn't offlined before its childrenTejun Heo
commit aa226ff4a1ce79f229c6b7a4c0a14e17fececd01 upstream. There are three subsystem callbacks in css shutdown path - css_offline(), css_released() and css_free(). Except for css_released(), cgroup core didn't guarantee the order of invocation. css_offline() or css_free() could be called on a parent css before its children. This behavior is unexpected and led to bugs in cpu and memory controller. This patch updates offline path so that a parent css is never offlined before its children. Each css keeps online_cnt which reaches zero iff itself and all its children are offline and offline_css() is invoked only after online_cnt reaches zero. This fixes the memory controller bug and allows the fix for cpu controller. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Reported-by: Brian Christiansen <brian.o.christiansen@gmail.com> Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com Link: http://lkml.kernel.org/g/CAKB58ikDkzc8REt31WBkD99+hxNzjK4+FBmhkgS+NVrC9vjMSg@mail.gmail.com Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03cpuset: make mm migration asynchronousTejun Heo
commit e93ad19d05648397ef3bcb838d26aec06c245dc0 upstream. If "cpuset.memory_migrate" is set, when a process is moved from one cpuset to another with a different memory node mask, pages in used by the process are migrated to the new set of nodes. This was performed synchronously in the ->attach() callback, which is synchronized against process management. Recently, the synchronization was changed from per-process rwsem to global percpu rwsem for simplicity and optimization. Combined with the synchronous mm migration, this led to deadlocks because mm migration could schedule a work item which may in turn try to create a new worker blocking on the process management lock held from cgroup process migration path. This heavy an operation shouldn't be performed synchronously from that deep inside cgroup migration in the first place. This patch punts the actual migration to an ordered workqueue and updates cgroup process migration and cpuset config update paths to flush the workqueue after all locks are released. This way, the operations still seem synchronous to userland without entangling mm migration with process management synchronization. CPU hotplug can also invoke mm migration but there's no reason for it to wait for mm migrations and thus doesn't synchronize against their completions. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03Revert "workqueue: make sure delayed work run in local cpu"Tejun Heo
commit 041bd12e272c53a35c54c13875839bcb98c999ce upstream. This reverts commit 874bbfe600a660cba9c776b3957b1ce393151b76. Workqueue used to implicity guarantee that work items queued without explicit CPU specified are put on the local CPU. Recent changes in timer broke the guarantee and led to vmstat breakage which was fixed by 176bed1de5bf ("vmstat: explicitly schedule per-cpu work on the CPU we need it to run on"). vmstat is the most likely to expose the issue and it's quite possible that there are other similar problems which are a lot more difficult to trigger. As a preventive measure, 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu") was applied to restore the local CPU guarnatee. Unfortunately, the change exposed a bug in timer code which got fixed by 22b886dd1018 ("timers: Use proper base migration in add_timer_on()"). Due to code restructuring, the commit couldn't be backported beyond certain point and stable kernels which only had 874bbfe600a6 started crashing. The local CPU guarantee was accidental more than anything else and we want to get rid of it anyway. As, with the vmstat case fixed, 874bbfe600a6 is causing more problems than it's fixing, it has been decided to take the chance and officially break the guarantee by reverting the commit. A debug feature will be added to force foreign CPU assignment to expose cases relying on the guarantee and fixes for the individual cases will be backported to stable as necessary. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu") Link: http://lkml.kernel.org/g/20160120211926.GJ10810@quack.suse.cz Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Cc: Daniel Bilik <daniel.bilik@neosystem.cz> Cc: Jan Kara <jack@suse.cz> Cc: Shaohua Li <shli@fb.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Daniel Bilik <daniel.bilik@neosystem.cz> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03workqueue: handle NUMA_NO_NODE for unbound pool_workqueue lookupTejun Heo
commit d6e022f1d207a161cd88e08ef0371554680ffc46 upstream. When looking up the pool_workqueue to use for an unbound workqueue, workqueue assumes that the target CPU is always bound to a valid NUMA node. However, currently, when a CPU goes offline, the mapping is destroyed and cpu_to_node() returns NUMA_NO_NODE. This has always been broken but hasn't triggered often enough before 874bbfe600a6 ("workqueue: make sure delayed work run in local cpu"). After the commit, workqueue forcifully assigns the local CPU for delayed work items without explicit target CPU to fix a different issue. This widens the window where CPU can go offline while a delayed work item is pending causing delayed work items dispatched with target CPU set to an already offlined CPU. The resulting NUMA_NO_NODE mapping makes workqueue try to queue the work item on a NULL pool_workqueue and thus crash. While 874bbfe600a6 has been reverted for a different reason making the bug less visible again, it can still happen. Fix it by mapping NUMA_NO_NODE to the default pool_workqueue from unbound_pwq_by_node(). This is a temporary workaround. The long term solution is keeping CPU -> NODE mapping stable across CPU off/online cycles which is being worked on. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Len Brown <len.brown@intel.com> Link: http://lkml.kernel.org/g/1454424264.11183.46.camel@gmail.com Link: http://lkml.kernel.org/g/1453702100-2597-1-git-send-email-tangchen@cn.fujitsu.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03tick/nohz: Set the correct expiry when switching to nohz/lowres modeWanpeng Li
commit 1ca8ec532fc2d986f1f4a319857bb18e0c9739b4 upstream. commit 0ff53d096422 sets the next tick interrupt to the last jiffies update, i.e. in the past, because the forward operation is invoked before the set operation. There is no resulting damage (yet), but we get an extra pointless tick interrupt. Revert the order so we get the next tick interrupt in the future. Fixes: commit 0ff53d096422 "tick: sched: Force tick interrupt and get rid of softirq magic" Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/1453893967-3458-1-git-send-email-wanpeng.li@hotmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03seccomp: always propagate NO_NEW_PRIVS on tsyncJann Horn
commit 103502a35cfce0710909da874f092cb44823ca03 upstream. Before this patch, a process with some permissive seccomp filter that was applied by root without NO_NEW_PRIVS was able to add more filters to itself without setting NO_NEW_PRIVS by setting the new filter from a throwaway thread with NO_NEW_PRIVS. Signed-off-by: Jann Horn <jann@thejh.net> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03time: Avoid signed overflow in timekeeping_get_ns()David Gibson
commit 35a4933a895927990772ae96fdcfd2f806929ee2 upstream. 1e75fa8 "time: Condense timekeeper.xtime into xtime_sec" replaced a call to clocksource_cyc2ns() from timekeeping_get_ns() with an open-coded version of the same logic to avoid keeping a semi-redundant struct timespec in struct timekeeper. However, the commit also introduced a subtle semantic change - where clocksource_cyc2ns() uses purely unsigned math, the new version introduces a signed temporary, meaning that if (delta * tk->mult) has a 63-bit overflow the following shift will still give a negative result. The choice of 'maxsec' in __clocksource_updatefreq_scale() means this will generally happen if there's a ~10 minute pause in examining the clocksource. This can be triggered on a powerpc KVM guest by stopping it from qemu for a bit over 10 minutes. After resuming time has jumped backwards several minutes causing numerous problems (jiffies does not advance, msleep()s can be extended by minutes..). It doesn't happen on x86 KVM guests, because the guest TSC is effectively frozen while the guest is stopped, which is not the case for the powerpc timebase. Obviously an unsigned (64 bit) overflow will only take twice as long as a signed, 63-bit overflow. I don't know the time code well enough to know if that will still cause incorrect calculations, or if a 64-bit overflow is avoided elsewhere. Still, an incorrect forwards clock adjustment will cause less trouble than time going backwards. So, this patch removes the potential for intermediate signed overflow. Suggested-by: Laurent Vivier <lvivier@redhat.com> Tested-by: Laurent Vivier <lvivier@redhat.com> Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03posix-clock: Fix return code on the poll method's error pathRichard Cochran
commit 1b9f23727abb92c5e58f139e7d180befcaa06fe0 upstream. The posix_clock_poll function is supposed to return a bit mask of POLLxxx values. However, in case the hardware has disappeared (due to hot plugging for example) this code returns -ENODEV in a futile attempt to throw an error at the file descriptor level. The kernel's file_operations interface does not accept such error codes from the poll method. Instead, this function aught to return POLLERR. The value -ENODEV does, in fact, contain the POLLERR bit (and almost all the other POLLxxx bits as well), but only by chance. This patch fixes code to return a proper bit mask. Credit goes to Markus Elfring for pointing out the suspicious signed/unsigned mismatch. Reported-by: Markus Elfring <elfring@users.sourceforge.net> igned-off-by: Richard Cochran <richardcochran@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Julia Lawall <julia.lawall@lip6.fr> Link: http://lkml.kernel.org/r/1450819198-17420-1-git-send-email-richardcochran@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03genirq: Validate action before dereferencing it in handle_irq_event_percpu()Thomas Gleixner
commit 570540d50710ed192e98e2f7f74578c9486b6b05 upstream. commit 71f64340fc0e changed the handling of irq_desc->action from CPU 0 CPU 1 free_irq() lock(desc) lock(desc) handle_edge_irq() if (desc->action) { handle_irq_event() action = desc->action unlock(desc) desc->action = NULL handle_irq_event_percpu(desc, action) action->xxx to CPU 0 CPU 1 free_irq() lock(desc) lock(desc) handle_edge_irq() if (desc->action) { handle_irq_event() unlock(desc) desc->action = NULL handle_irq_event_percpu(desc, action) action = desc->action action->xxx So if free_irq manages to set the action to NULL between the unlock and before the readout, we happily dereference a null pointer. We could simply revert 71f64340fc0e, but we want to preserve the better code generation. A simple solution is to change the action loop from a do {} while to a while {} loop. This is safe because we either see a valid desc->action or NULL. If the action is about to be removed it is still valid as free_irq() is blocked on synchronize_irq(). CPU 0 CPU 1 free_irq() lock(desc) lock(desc) handle_edge_irq() handle_irq_event(desc) set(INPROGRESS) unlock(desc) handle_irq_event_percpu(desc) action = desc->action desc->action = NULL while (action) { action->xxx ... action = action->next; sychronize_irq() while(INPROGRESS); lock(desc) clr(INPROGRESS) free(action) That's basically the same mechanism as we have for shared interrupts. action->next can become NULL while handle_irq_event_percpu() runs. Either it sees the action or NULL. It does not matter, because action itself cannot go away before the interrupt in progress flag has been cleared. Fixes: commit 71f64340fc0e "genirq: Remove the second parameter from handle_irq_event_percpu()" Reported-by: zyjzyj2000@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Huang Shijie <shijie.huang@arm.com> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1601131224190.3575@nanos Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03devm_memremap: Fix error value when memremap failedToshi Kani
commit 93f834df9c2d4e362dfdc4b05daa0a4e18814836 upstream. devm_memremap() returns an ERR_PTR() value in case of error. However, it returns NULL when memremap() failed. This causes the caller, such as the pmem driver, to proceed and oops later. Change devm_memremap() to return ERR_PTR(-ENXIO) when memremap() failed. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03bpf: fix branch offset adjustment on backjumps after patching ctx expansionDaniel Borkmann
[ Upstream commit a1b14d27ed0965838350f1377ff97c93ee383492 ] When ctx access is used, the kernel often needs to expand/rewrite instructions, so after that patching, branch offsets have to be adjusted for both forward and backward jumps in the new eBPF program, but for backward jumps it fails to account the delta. Meaning, for example, if the expansion happens exactly on the insn that sits at the jump target, it doesn't fix up the back jump offset. Analysis on what the check in adjust_branches() is currently doing: /* adjust offset of jmps if necessary */ if (i < pos && i + insn->off + 1 > pos) insn->off += delta; else if (i > pos && i + insn->off + 1 < pos) insn->off -= delta; First condition (forward jumps): Before: After: insns[0] insns[0] insns[1] <--- i/insn insns[1] <--- i/insn insns[2] <--- pos insns[P] <--- pos insns[3] insns[P] `------| delta insns[4] <--- target_X insns[P] `-----| insns[5] insns[3] insns[4] <--- target_X insns[5] First case is if we cross pos-boundary and the jump instruction was before pos. This is handeled correctly. I.e. if i == pos, then this would mean our jump that we currently check was the patchlet itself that we just injected. Since such patchlets are self-contained and have no awareness of any insns before or after the patched one, the delta is correctly not adjusted. Also, for the second condition in case of i + insn->off + 1 == pos, means we jump to that newly patched instruction, so no offset adjustment are needed. That part is correct. Second condition (backward jumps): Before: After: insns[0] insns[0] insns[1] <--- target_X insns[1] <--- target_X insns[2] <--- pos <-- target_Y insns[P] <--- pos <-- target_Y insns[3] insns[P] `------| delta insns[4] <--- i/insn insns[P] `-----| insns[5] insns[3] insns[4] <--- i/insn insns[5] Second interesting case is where we cross pos-boundary and the jump instruction was after pos. Backward jump with i == pos would be impossible and pose a bug somewhere in the patchlet, so the first condition checking i > pos is okay only by itself. However, i + insn->off + 1 < pos does not always work as intended to trigger the adjustment. It works when jump targets would be far off where the delta wouldn't matter. But, for example, where the fixed insn->off before pointed to pos (target_Y), it now points to pos + delta, so that additional room needs to be taken into account for the check. This means that i) both tests here need to be adjusted into pos + delta, and ii) for the second condition, the test needs to be <= as pos itself can be a target in the backjump, too. Fixes: 9bac3d6d548e ("bpf: allow extended BPF programs access skb fields") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25modules: fix modparam async_probe requestLuis R. Rodriguez
commit 4355efbd80482a961cae849281a8ef866e53d55c upstream. Commit f2411da746985 ("driver-core: add driver module asynchronous probe support") added async probe support, in two forms: * in-kernel driver specification annotation * generic async_probe module parameter (modprobe foo async_probe) To support the generic kernel parameter parse_args() was extended via commit ecc8617053e0 ("module: add extra argument for parse_params() callback") however commit failed to f2411da746985 failed to add the required argument. This causes a crash then whenever async_probe generic module parameter is used. This was overlooked when the form in which in-kernel async probe support was reworked a bit... Fix this as originally intended. Cc: Hannes Reinecke <hare@suse.de> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> [minimized] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25module: wrapper for symbol name.Rusty Russell
commit 2e7bac536106236104e9e339531ff0fcdb7b8147 upstream. This trivial wrapper adds clarity and makes the following patch smaller. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25itimers: Handle relative timers with CONFIG_TIME_LOW_RES properThomas Gleixner
commit 51cbb5242a41700a3f250ecfb48dcfb7e4375ea4 upstream. As Helge reported for timerfd we have the same issue in itimers. We return remaining time larger than the programmed relative time to user space in case of CONFIG_TIME_LOW_RES=y. Use the proper function to adjust the extra time added in hrtimer_start_range_ns(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Helge Deller <deller@gmx.de> Cc: John Stultz <john.stultz@linaro.org> Cc: linux-m68k@lists.linux-m68k.org Cc: dhowells@redhat.com Link: http://lkml.kernel.org/r/20160114164159.528222587@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25posix-timers: Handle relative timers with CONFIG_TIME_LOW_RES properThomas Gleixner
commit 572c39172684c3711e4a03c9a7380067e2b0661c upstream. As Helge reported for timerfd we have the same issue in posix timers. We return remaining time larger than the programmed relative time to user space in case of CONFIG_TIME_LOW_RES=y. Use the proper function to adjust the extra time added in hrtimer_start_range_ns(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Helge Deller <deller@gmx.de> Cc: John Stultz <john.stultz@linaro.org> Cc: linux-m68k@lists.linux-m68k.org Cc: dhowells@redhat.com Link: http://lkml.kernel.org/r/20160114164159.450510905@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25prctl: take mmap sem for writing to protect against othersMateusz Guzik
commit ddf1d398e517e660207e2c807f76a90df543a217 upstream. An unprivileged user can trigger an oops on a kernel with CONFIG_CHECKPOINT_RESTORE. proc_pid_cmdline_read takes mmap_sem for reading and obtains args + env start/end values. These get sanity checked as follows: BUG_ON(arg_start > arg_end); BUG_ON(env_start > env_end); These can be changed by prctl_set_mm. Turns out also takes the semaphore for reading, effectively rendering it useless. This results in: kernel BUG at fs/proc/base.c:240! invalid opcode: 0000 [#1] SMP Modules linked in: virtio_net CPU: 0 PID: 925 Comm: a.out Not tainted 4.4.0-rc8-next-20160105dupa+ #71 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff880077a68000 ti: ffff8800784d0000 task.ti: ffff8800784d0000 RIP: proc_pid_cmdline_read+0x520/0x530 RSP: 0018:ffff8800784d3db8 EFLAGS: 00010206 RAX: ffff880077c5b6b0 RBX: ffff8800784d3f18 RCX: 0000000000000000 RDX: 0000000000000002 RSI: 00007f78e8857000 RDI: 0000000000000246 RBP: ffff8800784d3e40 R08: 0000000000000008 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000050 R13: 00007f78e8857800 R14: ffff88006fcef000 R15: ffff880077c5b600 FS: 00007f78e884a740(0000) GS:ffff88007b200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007f78e8361770 CR3: 00000000790a5000 CR4: 00000000000006f0 Call Trace: __vfs_read+0x37/0x100 vfs_read+0x82/0x130 SyS_read+0x58/0xd0 entry_SYSCALL_64_fastpath+0x12/0x76 Code: 4c 8b 7d a8 eb e9 48 8b 9d 78 ff ff ff 4c 8b 7d 90 48 8b 03 48 39 45 a8 0f 87 f0 fe ff ff e9 d1 fe ff ff 4c 8b 7d 90 eb c6 0f 0b <0f> 0b 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 RIP proc_pid_cmdline_read+0x520/0x530 ---[ end trace 97882617ae9c6818 ]--- Turns out there are instances where the code just reads aformentioned values without locking whatsoever - namely environ_read and get_cmdline. Interestingly these functions look quite resilient against bogus values, but I don't believe this should be relied upon. The first patch gets rid of the oops bug by grabbing mmap_sem for writing. The second patch is optional and puts locking around aformentioned consumers for safety. Consumers of other fields don't seem to benefit from similar treatment and are left untouched. This patch (of 2): The code was taking the semaphore for reading, which does not protect against readers nor concurrent modifications. The problem could cause a sanity checks to fail in procfs's cmdline reader, resulting in an OOPS. Note that some functions perform an unlocked read of various mm fields, but they seem to be fine despite possible modificaton. Signed-off-by: Mateusz Guzik <mguzik@redhat.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Jarod Wilson <jarod@redhat.com> Cc: Jan Stancek <jstancek@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Anshuman Khandual <anshuman.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25futex: Drop refcount if requeue_pi() acquired the rtmutexThomas Gleixner
commit fb75a4282d0d9a3c7c44d940582c2d226cf3acfb upstream. If the proxy lock in the requeue loop acquires the rtmutex for a waiter then it acquired also refcount on the pi_state related to the futex, but the waiter side does not drop the reference count. Add the missing free_pi_state() call. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Darren Hart <darren@dvhart.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Bhuvanesh_Surachari@mentor.com Cc: Andy Lowe <Andy_Lowe@mentor.com> Link: http://lkml.kernel.org/r/20151219200607.178132067@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25devm_memremap_release(): fix memremap'd addr handlingToshi Kani
commit 9273a8bbf58a15051e53a777389a502420ddc60e upstream. The pmem driver calls devm_memremap() to map a persistent memory range. When the pmem driver is unloaded, this memremap'd range is not released so the kernel will leak a vma. Fix devm_memremap_release() to handle a given memremap'd address properly. Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-25ptrace: use fsuid, fsgid, effective creds for fs access checksJann Horn
commit caaee6234d05a58c5b4d05e7bf766131b810a657 upstream. By checking the effective credentials instead of the real UID / permitted capabilities, ensure that the calling process actually intended to use its credentials. To ensure that all ptrace checks use the correct caller credentials (e.g. in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS flag), use two new flags and require one of them to be set. The problem was that when a privileged task had temporarily dropped its privileges, e.g. by calling setreuid(0, user_uid), with the intent to perform following syscalls with the credentials of a user, it still passed ptrace access checks that the user would not be able to pass. While an attacker should not be able to convince the privileged task to perform a ptrace() syscall, this is a problem because the ptrace access check is reused for things in procfs. In particular, the following somewhat interesting procfs entries only rely on ptrace access checks: /proc/$pid/stat - uses the check for determining whether pointers should be visible, useful for bypassing ASLR /proc/$pid/maps - also useful for bypassing ASLR /proc/$pid/cwd - useful for gaining access to restricted directories that contain files with lax permissions, e.g. in this scenario: lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar drwx------ root root /root drwxr-xr-x root root /root/foobar -rw-r--r-- root root /root/foobar/secret Therefore, on a system where a root-owned mode 6755 binary changes its effective credentials as described and then dumps a user-specified file, this could be used by an attacker to reveal the memory layout of root's processes or reveal the contents of files he is not allowed to access (through /proc/$pid/cwd). [akpm@linux-foundation.org: fix warning] Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Kees Cook <keescook@chromium.org> Cc: Casey Schaufler <casey@schaufler-ca.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: "Serge E. Hallyn" <serge.hallyn@ubuntu.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-17sched: Fix crash in sched_init_numa()Raghavendra K T
commit 9c03ee147193645be4c186d3688232fa438c57c7 upstream. The following PowerPC commit: c118baf80256 ("arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodes") avoids allocating bootmem memory for non existent nodes. But when DEBUG_PER_CPU_MAPS=y is enabled, my powerNV system failed to boot because in sched_init_numa(), cpumask_or() operation was done on unallocated nodes. Fix that by making cpumask_or() operation only on existing nodes. [ Tested with and w/o DEBUG_PER_CPU_MAPS=y on x86 and PowerPC. ] Reported-by: Jan Stancek <jstancek@redhat.com> Tested-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: <gkurz@linux.vnet.ibm.com> Cc: <grant.likely@linaro.org> Cc: <nikunj@linux.vnet.ibm.com> Cc: <vdavydov@parallels.com> Cc: <linuxppc-dev@lists.ozlabs.org> Cc: <linux-mm@kvack.org> Cc: <peterz@infradead.org> Cc: <benh@kernel.crashing.org> Cc: <paulus@samba.org> Cc: <mpe@ellerman.id.au> Cc: <anton@samba.org> Link: http://lkml.kernel.org/r/1452884483-11676-1-git-send-email-raghavendra.kt@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-17hrtimer: Handle remaining time proper for TIME_LOW_RESThomas Gleixner
commit 203cbf77de59fc8f13502dcfd11350c6d4a5c95f upstream. If CONFIG_TIME_LOW_RES is enabled we add a jiffie to the relative timeout to prevent short sleeps, but we do not account for that in interfaces which retrieve the remaining time. Helge observed that timerfd can return a remaining time larger than the relative timeout. That's not expected and breaks userland test programs. Store the information that the timer was armed relative and provide functions to adjust the remaining time. To avoid bloating the hrtimer struct make state a u8, which as a bonus results in better code on x86 at least. Reported-and-tested-by: Helge Deller <deller@gmx.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: John Stultz <john.stultz@linaro.org> Cc: linux-m68k@lists.linux-m68k.org Cc: dhowells@redhat.com Link: http://lkml.kernel.org/r/20160114164159.273328486@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-17printk: do cond_resched() between lines while outputting to consolesTejun Heo
commit 8d91f8b15361dfb438ab6eb3b319e2ded43458ff upstream. @console_may_schedule tracks whether console_sem was acquired through lock or trylock. If the former, we're inside a sleepable context and console_conditional_schedule() performs cond_resched(). This allows console drivers which use console_lock for synchronization to yield while performing time-consuming operations such as scrolling. However, the actual console outputting is performed while holding irq-safe logbuf_lock, so console_unlock() clears @console_may_schedule before starting outputting lines. Also, only a few drivers call console_conditional_schedule() to begin with. This means that when a lot of lines need to be output by console_unlock(), for example on a console registration, the task doing console_unlock() may not yield for a long time on a non-preemptible kernel. If this happens with a slow console devices, for example a serial console, the outputting task may occupy the cpu for a very long time. Long enough to trigger softlockup and/or RCU stall warnings, which in turn pile more messages, sometimes enough to trigger the next cycle of warnings incapacitating the system. Fix it by making console_unlock() insert cond_resched() between lines if @console_may_schedule. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Calvin Owens <calvinowens@fb.com> Acked-by: Jan Kara <jack@suse.com> Cc: Dave Jones <davej@codemonkey.org.uk> Cc: Kyle McMartin <kyle@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-17tracing/stacktrace: Show entire trace if passed in function not foundSteven Rostedt
commit 6ccd83714a009ee301b50c15f6c3a5dc1f30164c upstream. When a max stack trace is discovered, the stack dump is saved. In order to not record the overhead of the stack tracer, the ip of the traced function is looked for within the dump. The trace is started from the location of that function. But if for some reason the ip is not found, the entire stack trace is then truncated. That's not very useful. Instead, print everything if the ip of the traced function is not found within the trace. This issue showed up on s390. Link: http://lkml.kernel.org/r/20160129102241.1b3c9c04@gandalf.local.home Fixes: 72ac426a5bb0 ("tracing: Clean up stack tracing and fix fentry updates") Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-02-17tracing: Fix stacktrace skip depth in trace_buffer_unlock_commit_regs()Steven Rostedt (Red Hat)
commit 7717c6be699975f6733d278b13b7c4295d73caf6 upstream. While cleaning the stacktrace code I unintentially changed the skip depth of trace_buffer_unlock_commit_regs() from 0 to 6. kprobes uses this function, and with skipping 6 call backs, it can easily produce no stack. Here's how I tested it: # echo 'p:ext4_sync_fs ext4_sync_fs ' > /sys/kernel/debug/tracing/kprobe_events # echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable # cat /sys/kernel/debug/trace sync-2394 [005] 502.457060: ext4_sync_fs: (ffffffff81317650) sync-2394 [005] 502.457063: kernel_stack: <stack trace> sync-2394 [005] 502.457086: ext4_sync_fs: (ffffffff81317650) sync-2394 [005] 502.457087: kernel_stack: <stack trace> sync-2394 [005] 502.457091: ext4_sync_fs: (ffffffff81317650) After putting back the skip stack to zero, we have: sync-2270 [000] 748.052693: ext4_sync_fs: (ffffffff81317650) sync-2270 [000] 748.052695: kernel_stack: <stack trace> => iterate_supers (ffffffff8126412e) => sys_sync (ffffffff8129c4b6) => entry_SYSCALL_64_fastpath (ffffffff8181f0b2) sync-2270 [000] 748.053017: ext4_sync_fs: (ffffffff81317650) sync-2270 [000] 748.053019: kernel_stack: <stack trace> => iterate_supers (ffffffff8126412e) => sys_sync (ffffffff8129c4b6) => entry_SYSCALL_64_fastpath (ffffffff8181f0b2) sync-2270 [000] 748.053381: ext4_sync_fs: (ffffffff81317650) sync-2270 [000] 748.053383: kernel_stack: <stack trace> => iterate_supers (ffffffff8126412e) => sys_sync (ffffffff8129c4b6) => entry_SYSCALL_64_fastpath (ffffffff8181f0b2) Fixes: 73dddbb57bb0 "tracing: Only create stacktrace option when STACKTRACE is configured" Reported-by: Brendan Gregg <brendan.d.gregg@gmail.com> Tested-by: Brendan Gregg <brendan.d.gregg@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31net: bpf: reject invalid shiftsRabin Vincent
[ Upstream commit 229394e8e62a4191d592842cf67e80c62a492937 ] On ARM64, a BUG() is triggered in the eBPF JIT if a filter with a constant shift that can't be encoded in the immediate field of the UBFM/SBFM instructions is passed to the JIT. Since these shifts amounts, which are negative or >= regsize, are invalid, reject them in the eBPF verifier and the classic BPF filter checker, for all architectures. Signed-off-by: Rabin Vincent <rabin@rab.in> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>