summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2020-11-10perf/core: Fix a memory leak in perf_event_parse_addr_filter()kiyin(尹亮)
commit 7bdb157cdebbf95a1cd94ed2e01b338714075d00 upstream. As shown through runtime testing, the "filename" allocation is not always freed in perf_event_parse_addr_filter(). There are three possible ways that this could happen: - It could be allocated twice on subsequent iterations through the loop, - or leaked on the success path, - or on the failure path. Clean up the code flow to make it obvious that 'filename' is always freed in the reallocation path and in the two return paths as well. We rely on the fact that kfree(NULL) is NOP and filename is initialized with NULL. This fixes the leak. No other side effects expected. [ Dan Carpenter: cleaned up the code flow & added a changelog. ] [ Ingo Molnar: updated the changelog some more. ] Fixes: 375637bc5249 ("perf/core: Introduce address range filtering") Signed-off-by: "kiyin(尹亮)" <kiyin@tencent.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: "Srivatsa S. Bhat" <srivatsa@csail.mit.edu> Cc: Anthony Liguori <aliguori@amazon.com> -- kernel/events/core.c | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10fork: fix copy_process(CLONE_PARENT) race with the exiting ->real_parentEddy Wu
commit b4e00444cab4c3f3fec876dc0cccc8cbb0d1a948 upstream. current->group_leader->exit_signal may change during copy_process() if current->real_parent exits. Move the assignment inside tasklist_lock to avoid the race. Signed-off-by: Eddy Wu <eddy_wu@trendmicro.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10futex: Handle transient "ownerless" rtmutex state correctlyMike Galbraith
commit 9f5d1c336a10c0d24e83e40b4c1b9539f7dba627 upstream. Gratian managed to trigger the BUG_ON(!newowner) in fixup_pi_state_owner(). This is one possible chain of events leading to this: Task Prio Operation T1 120 lock(F) T2 120 lock(F) -> blocks (top waiter) T3 50 (RT) lock(F) -> boosts T1 and blocks (new top waiter) XX timeout/ -> wakes T2 signal T1 50 unlock(F) -> wakes T3 (rtmutex->owner == NULL, waiter bit is set) T2 120 cleanup -> try_to_take_mutex() fails because T3 is the top waiter and the lower priority T2 cannot steal the lock. -> fixup_pi_state_owner() sees newowner == NULL -> BUG_ON() The comment states that this is invalid and rt_mutex_real_owner() must return a non NULL owner when the trylock failed, but in case of a queued and woken up waiter rt_mutex_real_owner() == NULL is a valid transient state. The higher priority waiter has simply not yet managed to take over the rtmutex. The BUG_ON() is therefore wrong and this is just another retry condition in fixup_pi_state_owner(). Drop the locks, so that T3 can make progress, and then try the fixup again. Gratian provided a great analysis, traces and a reproducer. The analysis is to the point, but it confused the hell out of that tglx dude who had to page in all the futex horrors again. Condensed version is above. [ tglx: Wrote comment and changelog ] Fixes: c1e2f0eaf015 ("futex: Avoid violating the 10th rule of futex") Reported-by: Gratian Crisan <gratian.crisan@ni.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/87a6w6x7bb.fsf@ni.com Link: https://lore.kernel.org/r/87sg9pkvf7.fsf@nanos.tec.linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10tracing: Fix out of bounds write in get_trace_bufQiujun Huang
commit c1acb4ac1a892cf08d27efcb964ad281728b0545 upstream. The nesting count of trace_printk allows for 4 levels of nesting. The nesting counter starts at zero and is incremented before being used to retrieve the current context's buffer. But the index to the buffer uses the nesting counter after it was incremented, and not its original number, which in needs to do. Link: https://lkml.kernel.org/r/20201029161905.4269-1-hqjagain@gmail.com Cc: stable@vger.kernel.org Fixes: 3d9622c12c887 ("tracing: Add barrier to trace_printk() buffer nesting modification") Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10ftrace: Handle tracing when switching between contextSteven Rostedt (VMware)
commit 726b3d3f141fba6f841d715fc4d8a4a84f02c02a upstream. When an interrupt or NMI comes in and switches the context, there's a delay from when the preempt_count() shows the update. As the preempt_count() is used to detect recursion having each context have its own bit get set when tracing starts, and if that bit is already set, it is considered a recursion and the function exits. But if this happens in that section where context has changed but preempt_count() has not been updated, this will be incorrectly flagged as a recursion. To handle this case, create another bit call TRANSITION and test it if the current context bit is already set. Flag the call as a recursion if the TRANSITION bit is already set, and if not, set it and continue. The TRANSITION bit will be cleared normally on the return of the function that set it, or if the current context bit is clear, set it and clear the TRANSITION bit to allow for another transition between the current context and an even higher one. Cc: stable@vger.kernel.org Fixes: edc15cafcbfa3 ("tracing: Avoid unnecessary multiple recursion checks") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10ftrace: Fix recursion check for NMI testSteven Rostedt (VMware)
commit ee11b93f95eabdf8198edd4668bf9102e7248270 upstream. The code that checks recursion will work to only do the recursion check once if there's nested checks. The top one will do the check, the other nested checks will see recursion was already checked and return zero for its "bit". On the return side, nothing will be done if the "bit" is zero. The problem is that zero is returned for the "good" bit when in NMI context. This will set the bit for NMIs making it look like *all* NMI tracing is recursing, and prevent tracing of anything in NMI context! The simple fix is to return "bit + 1" and subtract that bit on the end to get the real bit. Cc: stable@vger.kernel.org Fixes: edc15cafcbfa3 ("tracing: Avoid unnecessary multiple recursion checks") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10ring-buffer: Fix recursion protection transitions between interrupt contextSteven Rostedt (VMware)
commit b02414c8f045ab3b9afc816c3735bc98c5c3d262 upstream. The recursion protection of the ring buffer depends on preempt_count() to be correct. But it is possible that the ring buffer gets called after an interrupt comes in but before it updates the preempt_count(). This will trigger a false positive in the recursion code. Use the same trick from the ftrace function callback recursion code which uses a "transition" bit that gets set, to allow for a single recursion for to handle transitions between contexts. Cc: stable@vger.kernel.org Fixes: 567cd4da54ff4 ("ring-buffer: User context bit recursion checking") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10kthread_worker: prevent queuing delayed work from timer_fn when it is being ↵Zqiang
canceled commit 6993d0fdbee0eb38bfac350aa016f65ad11ed3b1 upstream. There is a small race window when a delayed work is being canceled and the work still might be queued from the timer_fn: CPU0 CPU1 kthread_cancel_delayed_work_sync() __kthread_cancel_work_sync() __kthread_cancel_work() work->canceling++; kthread_delayed_work_timer_fn() kthread_insert_work(); BUG: kthread_insert_work() should not get called when work->canceling is set. Signed-off-by: Zqiang <qiang.zhang@windriver.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201014083030.16895-1-qiang.zhang@windriver.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-10ptrace: fix task_join_group_stop() for the case when current is tracedOleg Nesterov
commit 7b3c36fc4c231ca532120bbc0df67a12f09c1d96 upstream. This testcase #include <stdio.h> #include <unistd.h> #include <signal.h> #include <sys/ptrace.h> #include <sys/wait.h> #include <pthread.h> #include <assert.h> void *tf(void *arg) { return NULL; } int main(void) { int pid = fork(); if (!pid) { kill(getpid(), SIGSTOP); pthread_t th; pthread_create(&th, NULL, tf, NULL); return 0; } waitpid(pid, NULL, WSTOPPED); ptrace(PTRACE_SEIZE, pid, 0, PTRACE_O_TRACECLONE); waitpid(pid, NULL, 0); ptrace(PTRACE_CONT, pid, 0,0); waitpid(pid, NULL, 0); int status; int thread = waitpid(-1, &status, 0); assert(thread > 0 && thread != pid); assert(status == 0x80137f); return 0; } fails and triggers WARN_ON_ONCE(!signr) in do_jobctl_trap(). This is because task_join_group_stop() has 2 problems when current is traced: 1. We can't rely on the "JOBCTL_STOP_PENDING" check, a stopped tracee can be woken up by debugger and it can clone another thread which should join the group-stop. We need to check group_stop_count || SIGNAL_STOP_STOPPED. 2. If SIGNAL_STOP_STOPPED is already set, we should not increment sig->group_stop_count and add JOBCTL_STOP_CONSUME. The new thread should stop without another do_notify_parent_cldstop() report. To clarify, the problem is very old and we should blame ptrace_init_task(). But now that we have task_join_group_stop() it makes more sense to fix this helper to avoid the code duplication. Reported-by: syzbot+3485e3773f7da290eecc@syzkaller.appspotmail.com Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christian Brauner <christian@brauner.io> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: Zhiqiang Liu <liuzhiqiang26@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20201019134237.GA18810@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05ring-buffer: Return 0 on success from ring_buffer_resize()Qiujun Huang
commit 0a1754b2a97efa644aa6e84d1db5b17c42251483 upstream. We don't need to check the new buffer size, and the return value had confused resize_buffer_duplicate_size(). ... ret = ring_buffer_resize(trace_buf->buffer, per_cpu_ptr(size_buf->data,cpu_id)->entries, cpu_id); if (ret == 0) per_cpu_ptr(trace_buf->data, cpu_id)->entries = per_cpu_ptr(size_buf->data, cpu_id)->entries; ... Link: https://lkml.kernel.org/r/20201019142242.11560-1-hqjagain@gmail.com Cc: stable@vger.kernel.org Fixes: d60da506cbeb3 ("tracing: Add a resize function to make one buffer equivalent to another buffer") Signed-off-by: Qiujun Huang <hqjagain@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05seccomp: Make duplicate listener detection non-racyJann Horn
commit dfe719fef03d752f1682fa8aeddf30ba501c8555 upstream. Currently, init_listener() tries to prevent adding a filter with SECCOMP_FILTER_FLAG_NEW_LISTENER if one of the existing filters already has a listener. However, this check happens without holding any lock that would prevent another thread from concurrently installing a new filter (potentially with a listener) on top of the ones we already have. Theoretically, this is also a data race: The plain load from current->seccomp.filter can race with concurrent writes to the same location. Fix it by moving the check into the region that holds the siglock to guard against concurrent TSYNC. (The "Fixes" tag points to the commit that introduced the theoretical data race; concurrent installation of another filter with TSYNC only became possible later, in commit 51891498f2da ("seccomp: allow TSYNC and USER_NOTIF together").) Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") Reviewed-by: Tycho Andersen <tycho@tycho.pizza> Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20201005014401.490175-1-jannh@google.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-11-05bpf: Permit map_ptr arithmetic with opcode add and offset 0Yonghong Song
[ Upstream commit 7c6967326267bd5c0dded0a99541357d70dd11ac ] Commit 41c48f3a98231 ("bpf: Support access to bpf map fields") added support to access map fields with CORE support. For example, struct bpf_map { __u32 max_entries; } __attribute__((preserve_access_index)); struct bpf_array { struct bpf_map map; __u32 elem_size; } __attribute__((preserve_access_index)); struct { __uint(type, BPF_MAP_TYPE_ARRAY); __uint(max_entries, 4); __type(key, __u32); __type(value, __u32); } m_array SEC(".maps"); SEC("cgroup_skb/egress") int cg_skb(void *ctx) { struct bpf_array *array = (struct bpf_array *)&m_array; /* .. array->map.max_entries .. */ } In kernel, bpf_htab has similar structure, struct bpf_htab { struct bpf_map map; ... } In the above cg_skb(), to access array->map.max_entries, with CORE, the clang will generate two builtin's. base = &m_array; /* access array.map */ map_addr = __builtin_preserve_struct_access_info(base, 0, 0); /* access array.map.max_entries */ max_entries_addr = __builtin_preserve_struct_access_info(map_addr, 0, 0); max_entries = *max_entries_addr; In the current llvm, if two builtin's are in the same function or in the same function after inlining, the compiler is smart enough to chain them together and generates like below: base = &m_array; max_entries = *(base + reloc_offset); /* reloc_offset = 0 in this case */ and we are fine. But if we force no inlining for one of functions in test_map_ptr() selftest, e.g., check_default(), the above two __builtin_preserve_* will be in two different functions. In this case, we will have code like: func check_hash(): reloc_offset_map = 0; base = &m_array; map_base = base + reloc_offset_map; check_default(map_base, ...) func check_default(map_base, ...): max_entries = *(map_base + reloc_offset_max_entries); In kernel, map_ptr (CONST_PTR_TO_MAP) does not allow any arithmetic. The above "map_base = base + reloc_offset_map" will trigger a verifier failure. ; VERIFY(check_default(&hash->map, map)); 0: (18) r7 = 0xffffb4fe8018a004 2: (b4) w1 = 110 3: (63) *(u32 *)(r7 +0) = r1 R1_w=invP110 R7_w=map_value(id=0,off=4,ks=4,vs=8,imm=0) R10=fp0 ; VERIFY_TYPE(BPF_MAP_TYPE_HASH, check_hash); 4: (18) r1 = 0xffffb4fe8018a000 6: (b4) w2 = 1 7: (63) *(u32 *)(r1 +0) = r2 R1_w=map_value(id=0,off=0,ks=4,vs=8,imm=0) R2_w=invP1 R7_w=map_value(id=0,off=4,ks=4,vs=8,imm=0) R10=fp0 8: (b7) r2 = 0 9: (18) r8 = 0xffff90bcb500c000 11: (18) r1 = 0xffff90bcb500c000 13: (0f) r1 += r2 R1 pointer arithmetic on map_ptr prohibited To fix the issue, let us permit map_ptr + 0 arithmetic which will result in exactly the same map_ptr. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200908175702.2463625-1-yhs@fb.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-05kgdb: Make "kgdbcon" work properly with "kgdb_earlycon"Douglas Anderson
[ Upstream commit b18b099e04f450cdc77bec72acefcde7042bd1f3 ] On my system the kernel processes the "kgdb_earlycon" parameter before the "kgdbcon" parameter. When we setup "kgdb_earlycon" we'll end up in kgdb_register_callbacks() and "kgdb_use_con" won't have been set yet so we'll never get around to starting "kgdbcon". Let's remedy this by detecting that the IO module was already registered when setting "kgdb_use_con" and registering the console then. As part of this, to avoid pre-declaring things, move the handling of the "kgdbcon" further down in the file. Signed-off-by: Douglas Anderson <dianders@chromium.org> Link: https://lore.kernel.org/r/20200630151422.1.I4aa062751ff5e281f5116655c976dff545c09a46@changeid Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-11-05futex: Fix incorrect should_fail_futex() handlingMateusz Nosek
[ Upstream commit 921c7ebd1337d1a46783d7e15a850e12aed2eaa0 ] If should_futex_fail() returns true in futex_wake_pi(), then the 'ret' variable is set to -EFAULT and then immediately overwritten. So the failure injection is non-functional. Fix it by actually leaving the function and returning -EFAULT. The Fixes tag is kinda blury because the initial commit which introduced failure injection was already sloppy, but the below mentioned commit broke it completely. [ tglx: Massaged changelog ] Fixes: 6b4f4bc9cb22 ("locking/futex: Allow low-level atomic operations to return -EAGAIN") Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20200927000858.24219-1-mateusznosek0@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29bpf: Limit caller's stack depth 256 for subprogs with tailcallsMaciej Fijalkowski
[ Upstream commit 7f6e4312e15a5c370e84eaa685879b6bdcc717e4 ] Protect against potential stack overflow that might happen when bpf2bpf calls get combined with tailcalls. Limit the caller's stack depth for such case down to 256 so that the worst case scenario would result in 8k stack size (32 which is tailcall limit * 256 = 8k). Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29PM: hibernate: remove the bogus call to get_gendisk() in software_resume()Christoph Hellwig
[ Upstream commit 428805c0c5e76ef643b1fbc893edfb636b3d8aef ] get_gendisk grabs a reference on the disk and file operation, so this code will leak both of them while having absolutely no use for the gendisk itself. This effectively reverts commit 2df83fa4bce421f ("PM / Hibernate: Use get_gendisk to verify partition if resume_file is integer format") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29sched/features: Fix !CONFIG_JUMP_LABEL caseJuri Lelli
[ Upstream commit a73f863af4ce9730795eab7097fb2102e6854365 ] Commit: 765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds") made sched features static for !CONFIG_SCHED_DEBUG configurations, but overlooked the CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL cases. For the latter echoing changes to /sys/kernel/debug/sched_features has the nasty effect of effectively changing what sched_features reports, but without actually changing the scheduler behaviour (since different translation units get different sysctl_sched_features). Fix CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL configurations by properly restructuring ifdefs. Fixes: 765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds") Co-developed-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Patrick Bellasi <patrick.bellasi@matbug.net> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lore.kernel.org/r/20201013053114.160628-1-juri.lelli@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29module: statically initialize init section freeing dataDaniel Jordan
[ Upstream commit fdf09ab887829cd1b671e45d9549f8ec1ffda0fa ] Corentin hit the following workqueue warning when running with CRYPTO_MANAGER_EXTRA_TESTS: WARNING: CPU: 2 PID: 147 at kernel/workqueue.c:1473 __queue_work+0x3b8/0x3d0 Modules linked in: ghash_generic CPU: 2 PID: 147 Comm: modprobe Not tainted 5.6.0-rc1-next-20200214-00068-g166c9264f0b1-dirty #545 Hardware name: Pine H64 model A (DT) pc : __queue_work+0x3b8/0x3d0 Call trace: __queue_work+0x3b8/0x3d0 queue_work_on+0x6c/0x90 do_init_module+0x188/0x1f0 load_module+0x1d00/0x22b0 I wasn't able to reproduce on x86 or rpi 3b+. This is WARN_ON(!list_empty(&work->entry)) from __queue_work(), and it happens because the init_free_wq work item isn't initialized in time for a crypto test that requests the gcm module. Some crypto tests were recently moved earlier in boot as explained in commit c4741b230597 ("crypto: run initcalls for generic implementations earlier"), which went into mainline less than two weeks before the Fixes commit. Avoid the warning by statically initializing init_free_wq and the corresponding llist. Link: https://lore.kernel.org/lkml/20200217204803.GA13479@Red/ Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag") Reported-by: Corentin Labbe <clabbe.montjoie@gmail.com> Tested-by: Corentin Labbe <clabbe.montjoie@gmail.com> Tested-on: sun50i-h6-pine-h64 Tested-on: imx8mn-ddr4-evk Tested-on: sun50i-a64-bananapi-m64 Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com> Signed-off-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29kdb: Fix pager search for multi-line stringsDaniel Thompson
[ Upstream commit d081a6e353168f15e63eb9e9334757f20343319f ] Currently using forward search doesn't handle multi-line strings correctly. The search routine replaces line breaks with \0 during the search and, for regular searches ("help | grep Common\n"), there is code after the line has been discarded or printed to replace the break character. However during a pager search ("help\n" followed by "/Common\n") when the string is matched we will immediately return to normal output and the code that should restore the \n becomes unreachable. Fix this by restoring the replaced character when we disable the search mode and update the comment accordingly. Fixes: fb6daa7520f9d ("kdb: Provide forward search at more prompt") Link: https://lore.kernel.org/r/20200909141708.338273-1-daniel.thompson@linaro.org Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessarySuren Baghdasaryan
[ Upstream commit 67197a4f28d28d0b073ab0427b03cb2ee5382578 ] Currently __set_oom_adj loops through all processes in the system to keep oom_score_adj and oom_score_adj_min in sync between processes sharing their mm. This is done for any task with more that one mm_users, which includes processes with multiple threads (sharing mm and signals). However for such processes the loop is unnecessary because their signal structure is shared as well. Android updates oom_score_adj whenever a tasks changes its role (background/foreground/...) or binds to/unbinds from a service, making it more/less important. Such operation can happen frequently. We noticed that updates to oom_score_adj became more expensive and after further investigation found out that the patch mentioned in "Fixes" introduced a regression. Using Pixel 4 with a typical Android workload, write time to oom_score_adj increased from ~3.57us to ~362us. Moreover this regression linearly depends on the number of multi-threaded processes running on the system. Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK). Change __set_oom_adj to use MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj update should be synchronized between multiple processes. To prevent races between clone() and __set_oom_adj(), when oom_score_adj of the process being cloned might be modified from userspace, we use oom_adj_mutex. Its scope is changed to global. The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for the case of vfork(). To prevent performance regressions of vfork(), we skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is specified. Clearing the MMF_MULTIPROCESS flag (when the last process sharing the mm exits) is left out of this patch to keep it simple and because it is believed that this threading model is rare. Should there ever be a need for optimizing that case as well, it can be done by hooking into the exit path, likely following the mm_update_next_owner pattern. With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being quite rare, the regression is gone after the change is applied. [surenb@google.com: v3] Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com Fixes: 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj") Reported-by: Tim Murray <timmurray@google.com> Suggested-by: Michal Hocko <mhocko@kernel.org> Signed-off-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Eugene Syromiatnikov <esyr@redhat.com> Cc: Christian Kellner <christian@kellner.me> Cc: Adrian Reber <areber@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Alexey Gladkov <gladkov.alexey@gmail.com> Cc: Michel Lespinasse <walken@google.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Bernd Edlinger <bernd.edlinger@hotmail.de> Cc: John Johansen <john.johansen@canonical.com> Cc: Yafang Shao <laoar.shao@gmail.com> Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com Debugged-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-29sched/fair: Fix wrong cpu selecting from isolated domainXunlei Pang
[ Upstream commit df3cb4ea1fb63ff326488efd671ba3c39034255e ] We've met problems that occasionally tasks with full cpumask (e.g. by putting it into a cpuset or setting to full affinity) were migrated to our isolated cpus in production environment. After some analysis, we found that it is due to the current select_idle_smt() not considering the sched_domain mask. Steps to reproduce on my 31-CPU hyperthreads machine: 1. with boot parameter: "isolcpus=domain,2-31" (thread lists: 0,16 and 1,17) 2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads" 3. some threads will be migrated to the isolated cpu16~17. Fix it by checking the valid domain mask in select_idle_smt(). Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()) Reported-by: Wetp Zhang <wetp.zy@linux.alibaba.com> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Jiang Biao <benbjiang@tencent.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/1600930127-76857-1-git-send-email-xlpang@linux.alibaba.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-14perf: Fix task_function_call() error handlingKajol Jain
[ Upstream commit 6d6b8b9f4fceab7266ca03d194f60ec72bd4b654 ] The error handling introduced by commit: 2ed6edd33a21 ("perf: Add cond_resched() to task_function_call()") looses any return value from smp_call_function_single() that is not {0, -EINVAL}. This is a problem because it will return -EXNIO when the target CPU is offline. Worse, in that case it'll turn into an infinite loop. Fixes: 2ed6edd33a21 ("perf: Add cond_resched() to task_function_call()") Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Kajol Jain <kjain@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Barret Rhoden <brho@google.com> Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: https://lkml.kernel.org/r/20200827064732.20860-1-kjain@linux.ibm.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-14bpf: Fix sysfs export of empty BTF sectionTony Ambardar
commit e23bb04b0c938588eae41b7f4712b722290ed2b8 upstream. If BTF data is missing or removed from the ELF section it is still exported via sysfs as a zero-length file: root@OpenWrt:/# ls -l /sys/kernel/btf/vmlinux -r--r--r-- 1 root root 0 Jul 18 02:59 /sys/kernel/btf/vmlinux Moreover, reads from this file succeed and leak kernel data: root@OpenWrt:/# hexdump -C /sys/kernel/btf/vmlinux|head -10 000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| * 000cc0 00 00 00 00 00 00 00 00 00 00 00 00 80 83 b0 80 |................| 000cd0 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| 000ce0 00 00 00 00 00 00 00 00 00 00 00 00 57 ac 6e 9d |............W.n.| 000cf0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| * 002650 00 00 00 00 00 00 00 10 00 00 00 01 00 00 00 01 |................| 002660 80 82 9a c4 80 85 97 80 81 a9 51 68 00 00 00 02 |..........Qh....| 002670 80 25 44 dc 80 85 97 80 81 a9 50 24 81 ab c4 60 |.%D.......P$...`| This situation was first observed with kernel 5.4.x, cross-compiled for a MIPS target system. Fix by adding a sanity-check for export of zero-length data sections. Fixes: 341dfcf8d78e ("btf: expose BTF info through sysfs") Signed-off-by: Tony Ambardar <Tony.Ambardar@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/b38db205a66238f70823039a8c531535864eaac5.1600417359.git.Tony.Ambardar@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-14usermodehelper: reset umask to default before executing user processLinus Torvalds
commit 4013c1496c49615d90d36b9d513eee8e369778e9 upstream. Kernel threads intentionally do CLONE_FS in order to follow any changes that 'init' does to set up the root directory (or cwd). It is admittedly a bit odd, but it avoids the situation where 'init' does some extensive setup to initialize the system environment, and then we execute a usermode helper program, and it uses the original FS setup from boot time that may be very limited and incomplete. [ Both Al Viro and Eric Biederman point out that 'pivot_root()' will follow the root regardless, since it fixes up other users of root (see chroot_fs_refs() for details), but overmounting root and doing a chroot() would not. ] However, Vegard Nossum noticed that the CLONE_FS not only means that we follow the root and current working directories, it also means we share umask with whatever init changed it to. That wasn't intentional. Just reset umask to the original default (0022) before actually starting the usermode helper program. Reported-by: Vegard Nossum <vegard.nossum@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-07tracing: Make the space reserved for the pid widerSebastian Andrzej Siewior
[ Upstream commit 795d6379a47bcbb88bd95a69920e4acc52849f88 ] For 64bit CONFIG_BASE_SMALL=0 systems PID_MAX_LIMIT is set by default to 4194304. During boot the kernel sets a new value based on number of CPUs but no lower than 32768. It is 1024 per CPU so with 128 CPUs the default becomes 131072 which needs six digits. This value can be increased during run time but must not exceed the initial upper limit. Systemd sometime after v241 sets it to the upper limit during boot. The result is that when the pid exceeds five digits, the trace output is a little hard to read because it is no longer properly padded (same like on big iron with 98+ CPUs). Increase the pid padding to seven digits. Link: https://lkml.kernel.org/r/20200904082331.dcdkrr3bkn3e4qlg@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-07ftrace: Move RCU is watching check after recursion checkSteven Rostedt (VMware)
commit b40341fad6cc2daa195f8090fd3348f18fff640a upstream. The first thing that the ftrace function callback helper functions should do is to check for recursion. Peter Zijlstra found that when "rcu_is_watching()" had its notrace removed, it caused perf function tracing to crash. This is because the call of rcu_is_watching() is tested before function recursion is checked and and if it is traced, it will cause an infinite recursion loop. rcu_is_watching() should still stay notrace, but to prevent this should never had crashed in the first place. The recursion prevention must be the first thing done in callback functions. Link: https://lore.kernel.org/r/20200929112541.GM2628@hirez.programming.kicks-ass.net Cc: stable@vger.kernel.org Cc: Paul McKenney <paulmck@kernel.org> Fixes: c68c0fa293417 ("ftrace: Have ftrace_ops_get_func() handle RCU and PER_CPU flags too") Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-01kprobes: Fix compiler warning for !CONFIG_KPROBES_ON_FTRACEMuchun Song
commit 10de795a5addd1962406796a6e13ba6cc0fc6bee upstream. Fix compiler warning(as show below) for !CONFIG_KPROBES_ON_FTRACE. kernel/kprobes.c: In function 'kill_kprobe': kernel/kprobes.c:1116:33: warning: statement with no effect [-Wunused-value] 1116 | #define disarm_kprobe_ftrace(p) (-ENODEV) | ^ kernel/kprobes.c:2154:3: note: in expansion of macro 'disarm_kprobe_ftrace' 2154 | disarm_kprobe_ftrace(p); Link: https://lore.kernel.org/r/20200805142136.0331f7ea@canb.auug.org.au Link: https://lkml.kernel.org/r/20200805172046.19066-1-songmuchun@bytedance.com Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Fixes: 0cb2f1372baa ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler") Acked-by: Masami Hiramatsu <mhiramat@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-01kprobes: tracing/kprobes: Fix to kill kprobes on initmem after bootMasami Hiramatsu
commit 82d083ab60c3693201c6f5c7a5f23a6ed422098d upstream. Since kprobe_event= cmdline option allows user to put kprobes on the functions in initmem, kprobe has to make such probes gone after boot. Currently the probes on the init functions in modules will be handled by module callback, but the kernel init text isn't handled. Without this, kprobes may access non-exist text area to disable or remove it. Link: https://lkml.kernel.org/r/159972810544.428528.1839307531600646955.stgit@devnote2 Fixes: 970988e19eb0 ("tracing/kprobe: Add kprobe_event= boot parameter") Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-01kprobes: Fix to check probe enabled before disarm_kprobe_ftrace()Masami Hiramatsu
commit 3031313eb3d549b7ad6f9fbcc52ba04412e3eb9e upstream. Commit 0cb2f1372baa ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler") fixed one bug but not completely fixed yet. If we run a kprobe_module.tc of ftracetest, kernel showed a warning as below. # ./ftracetest test.d/kprobe/kprobe_module.tc === Ftrace unit tests === [1] Kprobe dynamic event - probing module ... [ 22.400215] ------------[ cut here ]------------ [ 22.400962] Failed to disarm kprobe-ftrace at trace_printk_irq_work+0x0/0x7e [trace_printk] (-2) [ 22.402139] WARNING: CPU: 7 PID: 200 at kernel/kprobes.c:1091 __disarm_kprobe_ftrace.isra.0+0x7e/0xa0 [ 22.403358] Modules linked in: trace_printk(-) [ 22.404028] CPU: 7 PID: 200 Comm: rmmod Not tainted 5.9.0-rc2+ #66 [ 22.404870] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014 [ 22.406139] RIP: 0010:__disarm_kprobe_ftrace.isra.0+0x7e/0xa0 [ 22.406947] Code: 30 8b 03 eb c9 80 3d e5 09 1f 01 00 75 dc 49 8b 34 24 89 c2 48 c7 c7 a0 c2 05 82 89 45 e4 c6 05 cc 09 1f 01 01 e8 a9 c7 f0 ff <0f> 0b 8b 45 e4 eb b9 89 c6 48 c7 c7 70 c2 05 82 89 45 e4 e8 91 c7 [ 22.409544] RSP: 0018:ffffc90000237df0 EFLAGS: 00010286 [ 22.410385] RAX: 0000000000000000 RBX: ffffffff83066024 RCX: 0000000000000000 [ 22.411434] RDX: 0000000000000001 RSI: ffffffff810de8d3 RDI: ffffffff810de8d3 [ 22.412687] RBP: ffffc90000237e10 R08: 0000000000000001 R09: 0000000000000001 [ 22.413762] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88807c478640 [ 22.414852] R13: ffffffff8235ebc0 R14: ffffffffa00060c0 R15: 0000000000000000 [ 22.415941] FS: 00000000019d48c0(0000) GS:ffff88807d7c0000(0000) knlGS:0000000000000000 [ 22.417264] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 22.418176] CR2: 00000000005bb7e3 CR3: 0000000078f7a000 CR4: 00000000000006a0 [ 22.419309] Call Trace: [ 22.419990] kill_kprobe+0x94/0x160 [ 22.420652] kprobes_module_callback+0x64/0x230 [ 22.421470] notifier_call_chain+0x4f/0x70 [ 22.422184] blocking_notifier_call_chain+0x49/0x70 [ 22.422979] __x64_sys_delete_module+0x1ac/0x240 [ 22.423733] do_syscall_64+0x38/0x50 [ 22.424366] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 22.425176] RIP: 0033:0x4bb81d [ 22.425741] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e0 ff ff ff f7 d8 64 89 01 48 [ 22.428726] RSP: 002b:00007ffc70fef008 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0 [ 22.430169] RAX: ffffffffffffffda RBX: 00000000019d48a0 RCX: 00000000004bb81d [ 22.431375] RDX: 0000000000000000 RSI: 0000000000000880 RDI: 00007ffc70fef028 [ 22.432543] RBP: 0000000000000880 R08: 00000000ffffffff R09: 00007ffc70fef320 [ 22.433692] R10: 0000000000656300 R11: 0000000000000246 R12: 00007ffc70fef028 [ 22.434635] R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000000 [ 22.435682] irq event stamp: 1169 [ 22.436240] hardirqs last enabled at (1179): [<ffffffff810df542>] console_unlock+0x422/0x580 [ 22.437466] hardirqs last disabled at (1188): [<ffffffff810df19b>] console_unlock+0x7b/0x580 [ 22.438608] softirqs last enabled at (866): [<ffffffff81c0038e>] __do_softirq+0x38e/0x490 [ 22.439637] softirqs last disabled at (859): [<ffffffff81a00f42>] asm_call_on_stack+0x12/0x20 [ 22.440690] ---[ end trace 1e7ce7e1e4567276 ]--- [ 22.472832] trace_kprobe: This probe might be able to register after target module is loaded. Continue. This is because the kill_kprobe() calls disarm_kprobe_ftrace() even if the given probe is not enabled. In that case, ftrace_set_filter_ip() fails because the given probe point is not registered to ftrace. Fix to check the given (going) probe is enabled before invoking disarm_kprobe_ftrace(). Link: https://lkml.kernel.org/r/159888672694.1411785.5987998076694782591.stgit@devnote2 Fixes: 0cb2f1372baa ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler") Cc: Ingo Molnar <mingo@kernel.org> Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David Miller <davem@davemloft.net> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: stable@vger.kernel.org Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-01tracing: fix double freeTom Rix
commit 46bbe5c671e06f070428b9be142cc4ee5cedebac upstream. clang static analyzer reports this problem trace_events_hist.c:3824:3: warning: Attempt to free released memory kfree(hist_data->attrs->var_defs.name[i]); In parse_var_defs() if there is a problem allocating var_defs.expr, the earlier var_defs.name is freed. This free is duplicated by free_var_defs() which frees the rest of the list. Because free_var_defs() has to run anyway, remove the second free fom parse_var_defs(). Link: https://lkml.kernel.org/r/20200907135845.15804-1-trix@redhat.com Cc: stable@vger.kernel.org Fixes: 30350d65ac56 ("tracing: Add variable support to hist triggers") Reviewed-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-10-01bpf: Fix a rcu warning for bpffs map pretty-printYonghong Song
[ Upstream commit ce880cb825fcc22d4e39046a6c3a3a7f6603883d ] Running selftest ./btf_btf -p the kernel had the following warning: [ 51.528185] WARNING: CPU: 3 PID: 1756 at kernel/bpf/hashtab.c:717 htab_map_get_next_key+0x2eb/0x300 [ 51.529217] Modules linked in: [ 51.529583] CPU: 3 PID: 1756 Comm: test_btf Not tainted 5.9.0-rc1+ #878 [ 51.530346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014 [ 51.531410] RIP: 0010:htab_map_get_next_key+0x2eb/0x300 ... [ 51.542826] Call Trace: [ 51.543119] map_seq_next+0x53/0x80 [ 51.543528] seq_read+0x263/0x400 [ 51.543932] vfs_read+0xad/0x1c0 [ 51.544311] ksys_read+0x5f/0xe0 [ 51.544689] do_syscall_64+0x33/0x40 [ 51.545116] entry_SYSCALL_64_after_hwframe+0x44/0xa9 The related source code in kernel/bpf/hashtab.c: 709 static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key) 710 { 711 struct bpf_htab *htab = container_of(map, struct bpf_htab, map); 712 struct hlist_nulls_head *head; 713 struct htab_elem *l, *next_l; 714 u32 hash, key_size; 715 int i = 0; 716 717 WARN_ON_ONCE(!rcu_read_lock_held()); In kernel/bpf/inode.c, bpffs map pretty print calls map->ops->map_get_next_key() without holding a rcu_read_lock(), hence causing the above warning. To fix the issue, just surrounding map->ops->map_get_next_key() with rcu read lock. Fixes: a26ca7c982cb ("bpf: btf: Add pretty print support to the basic arraymap") Reported-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andriin@fb.com> Cc: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200916004401.146277-1-yhs@fb.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01lockdep: fix order in trace_hardirqs_off_caller()Sven Schnelle
[ Upstream commit 73ac74c7d489756d2313219a108809921dbfaea1 ] Switch order so that locking state is consistent even if the IRQ tracer calls into lockdep again. Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01printk: handle blank console arguments passed in.Shreyas Joshi
[ Upstream commit 48021f98130880dd74286459a1ef48b5e9bc374f ] If uboot passes a blank string to console_setup then it results in a trashed memory. Ultimately, the kernel crashes during freeing up the memory. This fix checks if there is a blank parameter being passed to console_setup from uboot. In case it detects that the console parameter is blank then it doesn't setup the serial device and it gracefully exits. Link: https://lore.kernel.org/r/20200522065306.83-1-shreyas.joshi@biamp.com Signed-off-by: Shreyas Joshi <shreyas.joshi@biamp.com> Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> [pmladek@suse.com: Better format the commit message and code, remove unnecessary brackets.] Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01sched/fair: Eliminate bandwidth race between throttling and distributionPaul Turner
[ Upstream commit e98fa02c4f2ea4991dae422ac7e34d102d2f0599 ] There is a race window in which an entity begins throttling before quota is added to the pool, but does not finish throttling until after we have finished with distribute_cfs_runtime(). This entity is not observed by distribute_cfs_runtime() because it was not on the throttled list at the time that distribution was running. This race manifests as rare period-length statlls for such entities. Rather than heavy-weight the synchronization with the progress of distribution, we can fix this by aborting throttling if bandwidth has become available. Otherwise, we immediately add the entity to the throttled list so that it can be observed by a subsequent distribution. Additionally, we can remove the case of adding the throttled entity to the head of the throttled list, and simply always add to the tail. Thanks to 26a8b12747c97, distribute_cfs_runtime() no longer holds onto its own pool of runtime. This means that if we do hit the !assign and distribute_running case, we know that distribution is about to end. Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Ben Segall <bsegall@google.com> Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Link: https://lkml.kernel.org/r/20200410225208.109717-2-joshdon@google.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01workqueue: Remove the warning in wq_worker_sleeping()Sebastian Andrzej Siewior
[ Upstream commit 62849a9612924a655c67cf6962920544aa5c20db ] The kernel test robot triggered a warning with the following race: task-ctx A interrupt-ctx B worker -> process_one_work() -> work_item() -> schedule(); -> sched_submit_work() -> wq_worker_sleeping() -> ->sleeping = 1 atomic_dec_and_test(nr_running) __schedule(); *interrupt* async_page_fault() -> local_irq_enable(); -> schedule(); -> sched_submit_work() -> wq_worker_sleeping() -> if (WARN_ON(->sleeping)) return -> __schedule() -> sched_update_worker() -> wq_worker_running() -> atomic_inc(nr_running); -> ->sleeping = 0; -> sched_update_worker() -> wq_worker_running() if (!->sleeping) return In this context the warning is pointless everything is fine. An interrupt before wq_worker_sleeping() will perform the ->sleeping assignment (0 -> 1 > 0) twice. An interrupt after wq_worker_sleeping() will trigger the warning and nr_running will be decremented (by A) and incremented once (only by B, A will skip it). This is the case until the ->sleeping is zeroed again in wq_worker_running(). Remove the WARN statement because this condition may happen. Document that preemption around wq_worker_sleeping() needs to be disabled to protect ->sleeping and not just as an optimisation. Fixes: 6d25be5782e48 ("sched/core, workqueues: Distangle worker accounting from rq lock") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Tejun Heo <tj@kernel.org> Link: https://lkml.kernel.org/r/20200327074308.GY11705@shao2-debian Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01perf: Use new infrastructure to fix deadlocks in execveBernd Edlinger
[ Upstream commit 6914303824bb572278568330d72fc1f8f9814e67 ] This changes perf_event_set_clock to use the new exec_update_mutex instead of cred_guard_mutex. This should be safe, as the credentials are only used for reading. Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01kernel/kcmp.c: Use new infrastructure to fix deadlocks in execveBernd Edlinger
[ Upstream commit 454e3126cb842388e22df6b3ac3da44062c00765 ] This changes kcmp_epoll_target to use the new exec_update_mutex instead of cred_guard_mutex. This should be safe, as the credentials are only used for reading, and furthermore ->mm and ->sighand are updated on execve, but only under the new exec_update_mutex. Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01exec: Fix a deadlock in straceBernd Edlinger
[ Upstream commit 3e74fabd39710ee29fa25618d2c2b40cfa7d76c7 ] This fixes a deadlock in the tracer when tracing a multi-threaded application that calls execve while more than one thread are running. I observed that when running strace on the gcc test suite, it always blocks after a while, when expect calls execve, because other threads have to be terminated. They send ptrace events, but the strace is no longer able to respond, since it is blocked in vm_access. The deadlock is always happening when strace needs to access the tracees process mmap, while another thread in the tracee starts to execve a child process, but that cannot continue until the PTRACE_EVENT_EXIT is handled and the WIFEXITED event is received: strace D 0 30614 30584 0x00000000 Call Trace: __schedule+0x3ce/0x6e0 schedule+0x5c/0xd0 schedule_preempt_disabled+0x15/0x20 __mutex_lock.isra.13+0x1ec/0x520 __mutex_lock_killable_slowpath+0x13/0x20 mutex_lock_killable+0x28/0x30 mm_access+0x27/0xa0 process_vm_rw_core.isra.3+0xff/0x550 process_vm_rw+0xdd/0xf0 __x64_sys_process_vm_readv+0x31/0x40 do_syscall_64+0x64/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9 expect D 0 31933 30876 0x80004003 Call Trace: __schedule+0x3ce/0x6e0 schedule+0x5c/0xd0 flush_old_exec+0xc4/0x770 load_elf_binary+0x35a/0x16c0 search_binary_handler+0x97/0x1d0 __do_execve_file.isra.40+0x5d4/0x8a0 __x64_sys_execve+0x49/0x60 do_syscall_64+0x64/0x220 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This changes mm_access to use the new exec_update_mutex instead of cred_guard_mutex. This patch is based on the following patch by Eric W. Biederman: "[PATCH 0/5] Infrastructure to allow fixing exec deadlocks" Link: https://lore.kernel.org/lkml/87v9ne5y4y.fsf_-_@x220.int.ebiederm.org/ Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01exec: Add exec_update_mutex to replace cred_guard_mutexEric W. Biederman
[ Upstream commit eea9673250db4e854e9998ef9da6d4584857f0ea ] The cred_guard_mutex is problematic as it is held over possibly indefinite waits for userspace. The possible indefinite waits for userspace that I have identified are: The cred_guard_mutex is held in PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The cred_guard_mutex is held over "get_user(futex_offset, ...") in exit_robust_list. The cred_guard_mutex held over copy_strings. The functions get_user and put_user can trigger a page fault which can potentially wait indefinitely in the case of userfaultfd or if userspace implements part of the page fault path. In any of those cases the userspace process that the kernel is waiting for might make a different system call that winds up taking the cred_guard_mutex and result in deadlock. Holding a mutex over any of those possibly indefinite waits for userspace does not appear necessary. Add exec_update_mutex that will just cover updating the process during exec where the permissions and the objects pointed to by the task struct may be out of sync. The plan is to switch the users of cred_guard_mutex to exec_update_mutex one by one. This lets us move forward while still being careful and not introducing any regressions. Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/ Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/ Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/ Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/ Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/ Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.") Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2") Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01tracing: Use address-of operator on section symbolsNathan Chancellor
[ Upstream commit bf2cbe044da275021b2de5917240411a19e5c50d ] Clang warns: ../kernel/trace/trace.c:9335:33: warning: array comparison always evaluates to true [-Wtautological-compare] if (__stop___trace_bprintk_fmt != __start___trace_bprintk_fmt) ^ 1 warning generated. These are not true arrays, they are linker defined symbols, which are just addresses. Using the address of operator silences the warning and does not change the runtime result of the check (tested with some print statements compiled in with clang + ld.lld and gcc + ld.bfd in QEMU). Link: http://lkml.kernel.org/r/20200220051011.26113-1-natechancellor@gmail.com Link: https://github.com/ClangBuiltLinux/linux/issues/893 Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01timekeeping: Prevent 32bit truncation in scale64_check_overflow()Wen Yang
[ Upstream commit 4cbbc3a0eeed675449b1a4d080008927121f3da3 ] While unlikely the divisor in scale64_check_overflow() could be >= 32bit in scale64_check_overflow(). do_div() truncates the divisor to 32bit at least on 32bit platforms. Use div64_u64() instead to avoid the truncation to 32-bit. [ tglx: Massaged changelog ] Signed-off-by: Wen Yang <wenyang@linux.alibaba.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200120100523.45656-1-wenyang@linux.alibaba.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01bpf: Remove recursion prevention from rcu free callbackThomas Gleixner
[ Upstream commit 8a37963c7ac9ecb7f86f8ebda020e3f8d6d7b8a0 ] If an element is freed via RCU then recursion into BPF instrumentation functions is not a concern. The element is already detached from the map and the RCU callback does not hold any locks on which a kprobe, perf event or tracepoint attached BPF program could deadlock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200224145643.259118710@linutronix.de Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01locking/lockdep: Decrement IRQ context counters when removing lock chainWaiman Long
[ Upstream commit b3b9c187dc2544923a601733a85352b9ddaba9b3 ] There are currently three counters to track the IRQ context of a lock chain - nr_hardirq_chains, nr_softirq_chains and nr_process_chains. They are incremented when a new lock chain is added, but they are not decremented when a lock chain is removed. That causes some of the statistic counts reported by /proc/lockdep_stats to be incorrect. IRQ Fix that by decrementing the right counter when a lock chain is removed. Since inc_chains() no longer accesses hardirq_context and softirq_context directly, it is moved out from the CONFIG_TRACE_IRQFLAGS conditional compilation block. Fixes: a0b0fd53e1e6 ("locking/lockdep: Free lock classes that are no longer in use") Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200206152408.24165-2-longman@redhat.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01audit: CONFIG_CHANGE don't log internal bookkeeping as an eventSteve Grubb
[ Upstream commit 70b3eeed49e8190d97139806f6fbaf8964306cdb ] Common Criteria calls out for any action that modifies the audit trail to be recorded. That usually is interpreted to mean insertion or removal of rules. It is not required to log modification of the inode information since the watch is still in effect. Additionally, if the rule is a never rule and the underlying file is one they do not want events for, they get an event for this bookkeeping update against their wishes. Since no device/inode info is logged at insertion and no device/inode information is logged on update, there is nothing meaningful being communicated to the admin by the CONFIG_CHANGE updated_rules event. One can assume that the rule was not "modified" because it is still watching the intended target. If the device or inode cannot be resolved, then audit_panic is called which is sufficient. The correct resolution is to drop logging config_update events since the watch is still in effect but just on another unknown inode. Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01tracing: Set kernel_stack's caller size properlyJosef Bacik
[ Upstream commit cbc3b92ce037f5e7536f6db157d185cd8b8f615c ] I noticed when trying to use the trace-cmd python interface that reading the raw buffer wasn't working for kernel_stack events. This is because it uses a stubbed version of __dynamic_array that doesn't do the __data_loc trick and encode the length of the array into the field. Instead it just shows up as a size of 0. So change this to __array and set the len to FTRACE_STACK_ENTRIES since this is what we actually do in practice and matches how user_stack_trace works. Link: http://lkml.kernel.org/r/1411589652-1318-1-git-send-email-jbacik@fb.com Signed-off-by: Josef Bacik <jbacik@fb.com> [ Pulled from the archeological digging of my INBOX ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01module: Remove accidental change of module_enable_x()Steven Rostedt (VMware)
[ Upstream commit af74262337faa65d5ac2944553437d3f5fb29123 ] When pulling in Divya Indi's patch, I made a minor fix to remove unneeded braces. I commited my fix up via "git commit -a --amend". Unfortunately, I didn't realize I had some changes I was testing in the module code, and those changes were applied to Divya's patch as well. This reverts the accidental updates to the module code. Cc: Jessica Yu <jeyu@kernel.org> Cc: Divya Indi <divya.indi@oracle.com> Reported-by: Peter Zijlstra <peterz@infradead.org> Fixes: e585e6469d6f ("tracing: Verify if trace array exists before destroying it.") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01kernel/sys.c: avoid copying possible padding bytes in copy_to_userJoe Perches
[ Upstream commit 5e1aada08cd19ea652b2d32a250501d09b02ff2e ] Initialization is not guaranteed to zero padding bytes so use an explicit memset instead to avoid leaking any kernel content in any possible padding bytes. Link: http://lkml.kernel.org/r/dfa331c00881d61c8ee51577a082d8bebd61805c.camel@perches.com Signed-off-by: Joe Perches <joe@perches.com> Cc: Dan Carpenter <error27@gmail.com> Cc: Julia Lawall <julia.lawall@lip6.fr> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01kernel/notifier.c: intercept duplicate registrations to avoid infinite loopsXiaoming Ni
[ Upstream commit 1a50cb80f219c44adb6265f5071b81fc3c1deced ] Registering the same notifier to a hook repeatedly can cause the hook list to form a ring or lose other members of the list. case1: An infinite loop in notifier_chain_register() can cause soft lockup atomic_notifier_chain_register(&test_notifier_list, &test1); atomic_notifier_chain_register(&test_notifier_list, &test1); atomic_notifier_chain_register(&test_notifier_list, &test2); case2: An infinite loop in notifier_chain_register() can cause soft lockup atomic_notifier_chain_register(&test_notifier_list, &test1); atomic_notifier_chain_register(&test_notifier_list, &test1); atomic_notifier_call_chain(&test_notifier_list, 0, NULL); case3: lose other hook test2 atomic_notifier_chain_register(&test_notifier_list, &test1); atomic_notifier_chain_register(&test_notifier_list, &test2); atomic_notifier_chain_register(&test_notifier_list, &test1); case4: Unregister returns 0, but the hook is still in the linked list, and it is not really registered. If you call notifier_call_chain after ko is unloaded, it will trigger oops. If the system is configured with softlockup_panic and the same hook is repeatedly registered on the panic_notifier_list, it will cause a loop panic. Add a check in notifier_chain_register(), intercepting duplicate registrations to avoid infinite loops Link: http://lkml.kernel.org/r/1568861888-34045-2-git-send-email-nixiaoming@huawei.com Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Reviewed-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: J. Bruce Fields <bfields@fieldses.org> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: David S. Miller <davem@davemloft.net> Cc: Jeff Layton <jlayton@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Nadia Derbey <Nadia.Derbey@bull.net> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Sam Protsenko <semen.protsenko@linaro.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Xiaoming Ni <nixiaoming@huawei.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01tracing: Adding NULL checks for trace_array descriptor pointerDivya Indi
[ Upstream commit 953ae45a0c25e09428d4a03d7654f97ab8a36647 ] As part of commit f45d1225adb0 ("tracing: Kernel access to Ftrace instances") we exported certain functions. Here, we are adding some additional NULL checks to ensure safe usage by users of these APIs. Link: http://lkml.kernel.org/r/1565805327-579-4-git-send-email-divya.indi@oracle.com Signed-off-by: Divya Indi <divya.indi@oracle.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-10-01tracing: Verify if trace array exists before destroying it.Divya Indi
[ Upstream commit e585e6469d6f476b82aa148dc44aaf7ae269a4e2 ] A trace array can be destroyed from userspace or kernel. Verify if the trace array exists before proceeding to destroy/remove it. Link: http://lkml.kernel.org/r/1565805327-579-3-git-send-email-divya.indi@oracle.com Reviewed-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com> Signed-off-by: Divya Indi <divya.indi@oracle.com> [ Removed unneeded braces ] Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: Sasha Levin <sashal@kernel.org>