summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2018-02-25mm/early_ioremap: Fix boot hang with earlyprintk=efi,keepDave Young
[ Upstream commit 7f6f60a1ba52538c16f26930bfbcfe193d9d746a ] earlyprintk=efi,keep does not work any more with a warning in mm/early_ioremap.c: WARN_ON(system_state != SYSTEM_BOOTING): Boot just hangs because of the earlyprintk within the earlyprintk implementation code itself. This is caused by a new introduced middle state in: 69a78ff226fe ("init: Introduce SYSTEM_SCHEDULING state") early_ioremap() is fine in both SYSTEM_BOOTING and SYSTEM_SCHEDULING states, original condition should be updated accordingly. Signed-off-by: Dave Young <dyoung@redhat.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: bp@suse.de Cc: linux-efi@vger.kernel.org Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20171209041610.GA3249@dhcp-128-65.nay.redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25Provide a function to create a NUL-terminated string from unterminated dataDavid Howells
commit f35157417215ec138c920320c746fdb3e04ef1d5 upstream. Provide a function, kmemdup_nul(), that will create a NUL-terminated string from an unterminated character array where the length is known in advance. This is better than kstrndup() in situations where we already know the string length as the strnlen() in kstrndup() is superfluous. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-25mm,vmscan: Make unregister_shrinker() no-op if register_shrinker() failed.Tetsuo Handa
commit bb422a738f6566f7439cd347d54e321e4fe92a9f upstream. Syzbot caught an oops at unregister_shrinker() because combination of commit 1d3d4437eae1bb29 ("vmscan: per-node deferred work") and fault injection made register_shrinker() fail and the caller of register_shrinker() did not check for failure. ---------- [ 554.881422] FAULT_INJECTION: forcing a failure. [ 554.881422] name failslab, interval 1, probability 0, space 0, times 0 [ 554.881438] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82 [ 554.881443] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 [ 554.881445] Call Trace: [ 554.881459] dump_stack+0x194/0x257 [ 554.881474] ? arch_local_irq_restore+0x53/0x53 [ 554.881486] ? find_held_lock+0x35/0x1d0 [ 554.881507] should_fail+0x8c0/0xa40 [ 554.881522] ? fault_create_debugfs_attr+0x1f0/0x1f0 [ 554.881537] ? check_noncircular+0x20/0x20 [ 554.881546] ? find_next_zero_bit+0x2c/0x40 [ 554.881560] ? ida_get_new_above+0x421/0x9d0 [ 554.881577] ? find_held_lock+0x35/0x1d0 [ 554.881594] ? __lock_is_held+0xb6/0x140 [ 554.881628] ? check_same_owner+0x320/0x320 [ 554.881634] ? lock_downgrade+0x990/0x990 [ 554.881649] ? find_held_lock+0x35/0x1d0 [ 554.881672] should_failslab+0xec/0x120 [ 554.881684] __kmalloc+0x63/0x760 [ 554.881692] ? lock_downgrade+0x990/0x990 [ 554.881712] ? register_shrinker+0x10e/0x2d0 [ 554.881721] ? trace_event_raw_event_module_request+0x320/0x320 [ 554.881737] register_shrinker+0x10e/0x2d0 [ 554.881747] ? prepare_kswapd_sleep+0x1f0/0x1f0 [ 554.881755] ? _down_write_nest_lock+0x120/0x120 [ 554.881765] ? memcpy+0x45/0x50 [ 554.881785] sget_userns+0xbcd/0xe20 (...snipped...) [ 554.898693] kasan: CONFIG_KASAN_INLINE enabled [ 554.898724] kasan: GPF could be caused by NULL-ptr deref or user memory access [ 554.898732] general protection fault: 0000 [#1] SMP KASAN [ 554.898737] Dumping ftrace buffer: [ 554.898741] (ftrace buffer empty) [ 554.898743] Modules linked in: [ 554.898752] CPU: 1 PID: 13231 Comm: syz-executor1 Not tainted 4.14.0-rc8+ #82 [ 554.898755] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 [ 554.898760] task: ffff8801d1dbe5c0 task.stack: ffff8801c9e38000 [ 554.898772] RIP: 0010:__list_del_entry_valid+0x7e/0x150 [ 554.898775] RSP: 0018:ffff8801c9e3f108 EFLAGS: 00010246 [ 554.898780] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 554.898784] RDX: 0000000000000000 RSI: ffff8801c53c6f98 RDI: ffff8801c53c6fa0 [ 554.898788] RBP: ffff8801c9e3f120 R08: 1ffff100393c7d55 R09: 0000000000000004 [ 554.898791] R10: ffff8801c9e3ef70 R11: 0000000000000000 R12: 0000000000000000 [ 554.898795] R13: dffffc0000000000 R14: 1ffff100393c7e45 R15: ffff8801c53c6f98 [ 554.898800] FS: 0000000000000000(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000 [ 554.898804] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 [ 554.898807] CR2: 00000000dbc23000 CR3: 00000001c7269000 CR4: 00000000001406e0 [ 554.898813] DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000 [ 554.898816] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 [ 554.898818] Call Trace: [ 554.898828] unregister_shrinker+0x79/0x300 [ 554.898837] ? perf_trace_mm_vmscan_writepage+0x750/0x750 [ 554.898844] ? down_write+0x87/0x120 [ 554.898851] ? deactivate_super+0x139/0x1b0 [ 554.898857] ? down_read+0x150/0x150 [ 554.898864] ? check_same_owner+0x320/0x320 [ 554.898875] deactivate_locked_super+0x64/0xd0 [ 554.898883] deactivate_super+0x141/0x1b0 ---------- Since allowing register_shrinker() callers to call unregister_shrinker() when register_shrinker() failed can simplify error recovery path, this patch makes unregister_shrinker() no-op when register_shrinker() failed. Also, reset shrinker->nr_deferred in case unregister_shrinker() was by error called twice. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Aliaksei Karaliou <akaraliou.dev@gmail.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Glauber Costa <glauber@scylladb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-22mm: hide a #warning for COMPILE_TESTArnd Bergmann
commit af27d9403f5b80685b79c88425086edccecaf711 upstream. We get a warning about some slow configurations in randconfig kernels: mm/memory.c:83:2: error: #warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid. [-Werror=cpp] The warning is reasonable by itself, but gets in the way of randconfig build testing, so I'm hiding it whenever CONFIG_COMPILE_TEST is set. The warning was added in 2013 in commit 75980e97dacc ("mm: fold page->_last_nid into page->flags where possible"). Cc: stable@vger.kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03kmemleak: add scheduling point to kmemleak_scan()Yisheng Xie
[ Upstream commit bde5f6bc68db51128f875a756e9082a6c6ff7b4c ] kmemleak_scan() will scan struct page for each node and it can be really large and resulting in a soft lockup. We have seen a soft lockup when do scan while compile kernel: watchdog: BUG: soft lockup - CPU#53 stuck for 22s! [bash:10287] [...] Call Trace: kmemleak_scan+0x21a/0x4c0 kmemleak_write+0x312/0x350 full_proxy_write+0x5a/0xa0 __vfs_write+0x33/0x150 vfs_write+0xad/0x1a0 SyS_write+0x52/0xc0 do_syscall_64+0x61/0x1a0 entry_SYSCALL64_slow_path+0x25/0x25 Fix this by adding cond_resched every MAX_SCAN_SIZE. Link: http://lkml.kernel.org/r/1511439788-20099-1-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-31mm, page_alloc: fix potential false positive in __zone_watermark_okVlastimil Babka
commit b050e3769c6b4013bb937e879fc43bf1847ee819 upstream. Since commit 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for order-0 allocations"), __zone_watermark_ok() check for high-order allocations will shortcut per-migratetype free list checks for ALLOC_HARDER allocations, and return true as long as there's free page of any migratetype. The intention is that ALLOC_HARDER can allocate from MIGRATE_HIGHATOMIC free lists, while normal allocations can't. However, as a side effect, the watermark check will then also return true when there are pages only on the MIGRATE_ISOLATE list, or (prior to CMA conversion to ZONE_MOVABLE) on the MIGRATE_CMA list. Since the allocation cannot actually obtain isolated pages, and might not be able to obtain CMA pages, this can result in a false positive. The condition should be rare and perhaps the outcome is not a fatal one. Still, it's better if the watermark check is correct. There also shouldn't be a performance tradeoff here. Link: http://lkml.kernel.org/r/20171102125001.23708-1-vbabka@suse.cz Fixes: 97a16fc82a7c ("mm, page_alloc: only enforce watermarks for order-0 allocations") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-31cma: fix calculation of aligned offsetDoug Berger
commit e048cb32f69038aa1c8f11e5c1b331be4181659d upstream. The align_offset parameter is used by bitmap_find_next_zero_area_off() to represent the offset of map's base from the previous alignment boundary; the function ensures that the returned index, plus the align_offset, honors the specified align_mask. The logic introduced by commit b5be83e308f7 ("mm: cma: align to physical address, not CMA region position") has the cma driver calculate the offset to the *next* alignment boundary. In most cases, the base alignment is greater than that specified when making allocations, resulting in a zero offset whether we align up or down. In the example given with the commit, the base alignment (8MB) was half the requested alignment (16MB) so the math also happened to work since the offset is 8MB in both directions. However, when requesting allocations with an alignment greater than twice that of the base, the returned index would not be correctly aligned. Also, the align_order arguments of cma_bitmap_aligned_mask() and cma_bitmap_aligned_offset() should not be negative so the argument type was made unsigned. Fixes: b5be83e308f7 ("mm: cma: align to physical address, not CMA region position") Link: http://lkml.kernel.org/r/20170628170742.2895-1-opendmb@gmail.com Signed-off-by: Angus Clark <angus@angusclark.org> Signed-off-by: Doug Berger <opendmb@gmail.com> Acked-by: Gregory Fong <gregory.0xf0@gmail.com> Cc: Doug Berger <opendmb@gmail.com> Cc: Angus Clark <angus@angusclark.org> Cc: Laura Abbott <labbott@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Lucas Stach <l.stach@pengutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Shiraz Hashim <shashim@codeaurora.org> Cc: Jaewon Kim <jaewon31.kim@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-31hwpoison, memcg: forcibly uncharge LRU pagesMichal Hocko
commit 18365225f0440d09708ad9daade2ec11275c3df9 upstream. Laurent Dufour has noticed that hwpoinsoned pages are kept charged. In his particular case he has hit a bad_page("page still charged to cgroup") when onlining a hwpoison page. While this looks like something that shouldn't happen in the first place because onlining hwpages and returning them to the page allocator makes only little sense it shows a real problem. hwpoison pages do not get freed usually so we do not uncharge them (at least not since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API")). Each charge pins memcg (since e8ea14cc6ead ("mm: memcontrol: take a css reference for each charged page")) as well and so the mem_cgroup and the associated state will never go away. Fix this leak by forcibly uncharging a LRU hwpoisoned page in delete_from_lru_cache(). We also have to tweak uncharge_list because it cannot rely on zero ref count for these pages. [akpm@linux-foundation.org: coding-style fixes] Fixes: 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") Link: http://lkml.kernel.org/r/20170502185507.GB19165@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com> Reviewed-by: Balbir Singh <bsingharora@gmail.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-31mm/mmap.c: do not blow on PROT_NONE MAP_FIXED holes in the stackMichal Hocko
commit 561b5e0709e4a248c67d024d4d94b6e31e3edf2f upstream. Commit 1be7107fbe18 ("mm: larger stack guard gap, between vmas") has introduced a regression in some rust and Java environments which are trying to implement their own stack guard page. They are punching a new MAP_FIXED mapping inside the existing stack Vma. This will confuse expand_{downwards,upwards} into thinking that the stack expansion would in fact get us too close to an existing non-stack vma which is a correct behavior wrt safety. It is a real regression on the other hand. Let's work around the problem by considering PROT_NONE mapping as a part of the stack. This is a gros hack but overflowing to such a mapping would trap anyway an we only can hope that usespace knows what it is doing and handle it propely. Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas") Link: http://lkml.kernel.org/r/20170705182849.GA18027@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Debugged-by: Vlastimil Babka <vbabka@suse.cz> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Willy Tarreau <w@1wt.eu> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17zswap: don't param_set_charp while holding spinlockDan Streetman
commit fd5bb66cd934987e49557455b6497fc006521940 upstream. Change the zpool/compressor param callback function to release the zswap_pools_lock spinlock before calling param_set_charp, since that function may sleep when it calls kmalloc with GFP_KERNEL. While this problem has existed for a while, I wasn't able to trigger it using a tight loop changing either/both the zpool and compressor params; I think it's very unlikely to be an issue on the stable kernels, especially since most zswap users will change the compressor and/or zpool from sysfs only one time each boot - or zero times, if they add the params to the kernel boot. Fixes: c99b42c3529e ("zswap: use charp for zswap param strings") Link: http://lkml.kernel.org/r/20170126155821.4545-1-ddstreet@ieee.org Signed-off-by: Dan Streetman <dan.streetman@canonical.com> Reported-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17mm/zswap: use workqueue to destroy poolDan Streetman
commit 200867af4dedfe7cb707f96773684de1d1fd21e6 upstream. Add a work_struct to struct zswap_pool, and change __zswap_pool_empty to use the workqueue instead of using call_rcu(). When zswap destroys a pool no longer in use, it uses call_rcu() to perform the destruction/freeing. Since that executes in softirq context, it must not sleep. However, actually destroying the pool involves freeing the per-cpu compressors (which requires locking the cpu_add_remove_lock mutex) and freeing the zpool, for which the implementation may sleep (e.g. zsmalloc calls kmem_cache_destroy, which locks the slab_mutex). So if either mutex is currently taken, or any other part of the compressor or zpool implementation sleeps, it will result in a BUG(). It's not easy to reproduce this when changing zswap's params normally. In testing with a loaded system, this does not fail: $ cd /sys/module/zswap/parameters $ echo lz4 > compressor ; echo zsmalloc > zpool nor does this: $ while true ; do > echo lzo > compressor ; echo zbud > zpool > sleep 1 > echo lz4 > compressor ; echo zsmalloc > zpool > sleep 1 > done although it's still possible either of those might fail, depending on whether anything else besides zswap has locked the mutexes. However, changing a parameter with no delay immediately causes the schedule while atomic BUG: $ while true ; do > echo lzo > compressor ; echo lz4 > compressor > done This is essentially the same as Yu Zhao's proposed patch to zsmalloc, but moved to zswap, to cover compressor and zpool freeing. Fixes: f1c54846ee45 ("zswap: dynamic pool creation") Signed-off-by: Dan Streetman <ddstreet@ieee.org> Reported-by: Yu Zhao <yuzhao@google.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Dan Streetman <dan.streetman@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17mm/page-writeback: fix dirty_ratelimit calculationAndrey Ryabinin
commit d59b1087a98e402ed9a7cc577f4da435f9a555f5 upstream. Calculation of dirty_ratelimit sometimes is not correct. E.g. initial values of dirty_ratelimit == INIT_BW and step == 0, lead to the following result: UBSAN: Undefined behaviour in ../mm/page-writeback.c:1286:7 shift exponent 25600 is too large for 64-bit type 'long unsigned int' The fix is straightforward - make step 0 if the shift exponent is too big. Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Tejun Heo <tj@kernel.org> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17mm/compaction: pass only pageblock aligned range to pageblock_pfn_to_pageJoonsoo Kim
commit e1409c325fdc1fef7b3d8025c51892355f065d15 upstream. pageblock_pfn_to_page() is used to check there is valid pfn and all pages in the pageblock is in a single zone. If there is a hole in the pageblock, passing arbitrary position to pageblock_pfn_to_page() could cause to skip whole pageblock scanning, instead of just skipping the hole page. For deterministic behaviour, it's better to always pass pageblock aligned range to pageblock_pfn_to_page(). It will also help further optimization on pageblock_pfn_to_page() in the following patch. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Aaron Lu <aaron.lu@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17mm/compaction: fix invalid free_pfn and compact_cached_free_pfnJoonsoo Kim
commit 623446e4dc45b37740268165107cc63abb3022f0 upstream. free_pfn and compact_cached_free_pfn are the pointer that remember restart position of freepage scanner. When they are reset or invalid, we set them to zone_end_pfn because freepage scanner works in reverse direction. But, because zone range is defined as [zone_start_pfn, zone_end_pfn), zone_end_pfn is invalid to access. Therefore, we should not store it to free_pfn and compact_cached_free_pfn. Instead, we need to store zone_end_pfn - 1 to them. There is one more thing we should consider. Freepage scanner scan reversely by pageblock unit. If free_pfn and compact_cached_free_pfn are set to middle of pageblock, it regards that sitiation as that it already scans front part of pageblock so we lose opportunity to scan there. To fix-up, this patch do round_down() to guarantee that reset position will be pageblock aligned. Note that thanks to the current pageblock_pfn_to_page() implementation, actual access to zone_end_pfn doesn't happen until now. But, following patch will change pageblock_pfn_to_page() so this patch is needed from now on. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-10proc: much faster /proc/vmstatAlexey Dobriyan
commit 68ba0326b4e14988f9e0c24a6e12a85cf2acd1ca upstream. Every current KDE system has process named ksysguardd polling files below once in several seconds: $ strace -e trace=open -p $(pidof ksysguardd) Process 1812 attached open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 8 open("/etc/mtab", O_RDONLY|O_CLOEXEC) = 8 open("/proc/net/dev", O_RDONLY) = 8 open("/proc/net/wireless", O_RDONLY) = -1 ENOENT (No such file or directory) open("/proc/stat", O_RDONLY) = 8 open("/proc/vmstat", O_RDONLY) = 8 Hell knows what it is doing but speed up reading /proc/vmstat by 33%! Benchmark is open+read+close 1.000.000 times. BEFORE $ perf stat -r 10 taskset -c 3 ./proc-vmstat Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs): 13146.768464 task-clock (msec) # 0.960 CPUs utilized ( +- 0.60% ) 15 context-switches # 0.001 K/sec ( +- 1.41% ) 1 cpu-migrations # 0.000 K/sec ( +- 11.11% ) 104 page-faults # 0.008 K/sec ( +- 0.57% ) 45,489,799,349 cycles # 3.460 GHz ( +- 0.03% ) 9,970,175,743 stalled-cycles-frontend # 21.92% frontend cycles idle ( +- 0.10% ) 2,800,298,015 stalled-cycles-backend # 6.16% backend cycles idle ( +- 0.32% ) 79,241,190,850 instructions # 1.74 insn per cycle # 0.13 stalled cycles per insn ( +- 0.00% ) 17,616,096,146 branches # 1339.956 M/sec ( +- 0.00% ) 176,106,232 branch-misses # 1.00% of all branches ( +- 0.18% ) 13.691078109 seconds time elapsed ( +- 0.03% ) ^^^^^^^^^^^^ AFTER $ perf stat -r 10 taskset -c 3 ./proc-vmstat Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs): 8688.353749 task-clock (msec) # 0.950 CPUs utilized ( +- 1.25% ) 10 context-switches # 0.001 K/sec ( +- 2.13% ) 1 cpu-migrations # 0.000 K/sec 104 page-faults # 0.012 K/sec ( +- 0.56% ) 30,384,010,730 cycles # 3.497 GHz ( +- 0.07% ) 12,296,259,407 stalled-cycles-frontend # 40.47% frontend cycles idle ( +- 0.13% ) 3,370,668,651 stalled-cycles-backend # 11.09% backend cycles idle ( +- 0.69% ) 28,969,052,879 instructions # 0.95 insn per cycle # 0.42 stalled cycles per insn ( +- 0.01% ) 6,308,245,891 branches # 726.058 M/sec ( +- 0.00% ) 214,685,502 branch-misses # 3.40% of all branches ( +- 0.26% ) 9.146081052 seconds time elapsed ( +- 0.07% ) ^^^^^^^^^^^ vsnprintf() is slow because: 1. format_decode() is busy looking for format specifier: 2 branches per character (not in this case, but in others) 2. approximately million branches while parsing format mini language and everywhere 3. just look at what string() does /proc/vmstat is good case because most of its content are strings Link: http://lkml.kernel.org/r/20160806125455.GA1187@p183.telecom.by Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-05kaiser: vmstat show NR_KAISERTABLE as nr_overheadHugh Dickins
The kaiser update made an interesting choice, never to free any shadow page tables. Contention on global spinlock was worrying, particularly with it held across page table scans when freeing. Something had to be done: I was going to add refcounting; but simply never to free them is an appealing choice, minimizing contention without complicating the code (the more a page table is found already, the less the spinlock is used). But leaking pages in this way is also a worry: can we get away with it? At the very least, we need a count to show how bad it actually gets: in principle, one might end up wasting about 1/256 of memory that way (1/512 for when direct-mapped pages have to be user-mapped, plus 1/512 for when they are user-mapped from the vmalloc area on another occasion (but we don't have vmalloc'ed stacks, so only large ldts are vmalloc'ed). Add per-cpu stat NR_KAISERTABLE: including 256 at startup for the shared pgd entries, and 1 for each intermediate page table added thereafter for user-mapping - but leave out the 1 per mm, for its shadow pgd, because that distracts from the monotonic increase. Shown in /proc/vmstat as nr_overhead (0 if kaiser not enabled). In practice, it doesn't look so bad so far: more like 1/12000 after nine hours of gtests below; and movable pageblock segregation should tend to cluster the kaiser tables into a subset of the address space (if not, they will be bad for compaction too). But production may tell a different story: keep an eye on this number, and bring back lighter freeing if it gets out of control (maybe a shrinker). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25mm/mmu_context, sched/core: Fix mmu_context.h assumptionIngo Molnar
commit 8efd755ac2fe262d4c8d5c9bbe054bb67dae93da upstream. Some architectures (such as Alpha) rely on include/linux/sched.h definitions in their mmu_context.h files. So include sched.h before mmu_context.h. Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-25mm/rmap: batched invalidations should use existing apiNadav Amit
commit 858eaaa711700ce4595e039441e239e56d7b9514 upstream. The recently introduced batched invalidations mechanism uses its own mechanism for shootdown. However, it does wrong accounting of interrupts (e.g., inc_irq_stat is called for local invalidations), trace-points (e.g., TLB_REMOTE_SHOOTDOWN for local invalidations) and may break some platforms as it bypasses the invalidation mechanisms of Xen and SGI UV. This patch reuses the existing TLB flushing mechnaisms instead. We use NULL as mm to indicate a global invalidation is required. Fixes 72b252aed506b8 ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages") Signed-off-by: Nadav Amit <namit@vmware.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-16thp: fix MADV_DONTNEED vs. numa balancing raceKirill A. Shutemov
commit ced108037c2aa542b3ed8b7afd1576064ad1362a upstream. In case prot_numa, we are under down_read(mmap_sem). It's critical to not clear pmd intermittently to avoid race with MADV_DONTNEED which is also under down_read(mmap_sem): CPU0: CPU1: change_huge_pmd(prot_numa=1) pmdp_huge_get_and_clear_notify() madvise_dontneed() zap_pmd_range() pmd_trans_huge(*pmd) == 0 (without ptl) // skip the pmd set_pmd_at(); // pmd is re-established The race makes MADV_DONTNEED miss the huge pmd and don't clear it which may break userspace. Found by code analysis, never saw triggered. Link: http://lkml.kernel.org/r/20170302151034.27829-3-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [jwang: adjust context for 4.4] Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-16thp: reduce indentation level in change_huge_pmd()Kirill A. Shutemov
commit 0a85e51d37645e9ce57e5e1a30859e07810ed07c upstream. Patch series "thp: fix few MADV_DONTNEED races" For MADV_DONTNEED to work properly with huge pages, it's critical to not clear pmd intermittently unless you hold down_write(mmap_sem). Otherwise MADV_DONTNEED can miss the THP which can lead to userspace breakage. See example of such race in commit message of patch 2/4. All these races are found by code inspection. I haven't seen them triggered. I don't think it's worth to apply them to stable@. This patch (of 4): Restructure code in preparation for a fix. Link: http://lkml.kernel.org/r/20170302151034.27829-2-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [jwang: adjust context for 4.4 kernel] Signed-off-by: Jack Wang <jinpu.wang@profitbricks.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-05mm/madvise.c: fix madvise() infinite loop under special circumstanceschenjie
commit 6ea8d958a2c95a1d514015d4e29ba21a8c0a1a91 upstream. MADVISE_WILLNEED has always been a noop for DAX (formerly XIP) mappings. Unfortunately madvise_willneed() doesn't communicate this information properly to the generic madvise syscall implementation. The calling convention is quite subtle there. madvise_vma() is supposed to either return an error or update &prev otherwise the main loop will never advance to the next vma and it will keep looping for ever without a way to get out of the kernel. It seems this has been broken since introduction. Nobody has noticed because nobody seems to be using MADVISE_WILLNEED on these DAX mappings. [mhocko@suse.com: rewrite changelog] Link: http://lkml.kernel.org/r/20171127115318.911-1-guoxuenan@huawei.com Fixes: fe77ba6f4f97 ("[PATCH] xip: madvice/fadvice: execute in place") Signed-off-by: chenjie <chenjie6@huawei.com> Signed-off-by: guoxuenan <guoxuenan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: zhangyi (F) <yi.zhang@huawei.com> Cc: Miao Xie <miaoxie@huawei.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Shaohua Li <shli@fb.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-12-05mm, thp: Do not make page table dirty unconditionally in touch_p[mu]d()Kirill A. Shutemov
commit a8f97366452ed491d13cf1e44241bc0b5740b1f0 upstream. Currently, we unconditionally make page table dirty in touch_pmd(). It may result in false-positive can_follow_write_pmd(). We may avoid the situation, if we would only make the page table entry dirty if caller asks for write access -- FOLL_WRITE. The patch also changes touch_pud() in the same way. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [Salvatore Bonaccorso: backport for 3.16: - Adjust context - Drop specific part for PUD-sized transparent hugepages. Support for PUD-sized transparent hugepages was added in v4.11-rc1 ] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-24mm, hwpoison: fixup "mm: check the return value of lookup_page_ext for all ↵Michal Hocko
call sites" Backport of the upstream commit f86e4271978b ("mm: check the return value of lookup_page_ext for all call sites") is wrong for hwpoison pages. I have accidentally negated the condition for bailout. This basically disables hwpoison pages tracking while the code still might crash on unusual configurations when struct pages do not have page_ext allocated. The fix is trivial to invert the condition. Reported-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-24mm/pagewalk.c: report holes in hugetlb rangesJann Horn
commit 373c4557d2aa362702c4c2d41288fb1e54990b7c upstream. This matters at least for the mincore syscall, which will otherwise copy uninitialized memory from the page allocator to userspace. It is probably also a correctness error for /proc/$pid/pagemap, but I haven't tested that. Removing the `walk->hugetlb_entry` condition in walk_hugetlb_range() has no effect because the caller already checks for that. This only reports holes in hugetlb ranges to callers who have specified a hugetlb_entry callback. This issue was found using an AFL-based fuzzer. v2: - don't crash on ->pte_hole==NULL (Andrew Morton) - add Cc stable (Andrew Morton) Changed for 4.4/4.9 stable backport: - fix up conflict in the huge_pte_offset() call Fixes: 1e25a271c8ac ("mincore: apply page table walker on do_mincore()") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-24mm/page_ext.c: check if page_ext is not preparedJaewon Kim
commit e492080e640c2d1235ddf3441cae634cfffef7e1 upstream. online_page_ext() and page_ext_init() allocate page_ext for each section, but they do not allocate if the first PFN is !pfn_present(pfn) or !pfn_valid(pfn). Then section->page_ext remains as NULL. lookup_page_ext checks NULL only if CONFIG_DEBUG_VM is enabled. For a valid PFN, __set_page_owner will try to get page_ext through lookup_page_ext. Without CONFIG_DEBUG_VM lookup_page_ext will misuse NULL pointer as value 0. This incurrs invalid address access. This is the panic example when PFN 0x100000 is not valid but PFN 0x13FC00 is being used for page_ext. section->page_ext is NULL, get_entry returned invalid page_ext address as 0x1DFA000 for a PFN 0x13FC00. To avoid this panic, CONFIG_DEBUG_VM should be removed so that page_ext will be checked at all times. Unable to handle kernel paging request at virtual address 01dfa014 ------------[ cut here ]------------ Kernel BUG at ffffff80082371e0 [verbose debug info unavailable] Internal error: Oops: 96000045 [#1] PREEMPT SMP Modules linked in: PC is at __set_page_owner+0x48/0x78 LR is at __set_page_owner+0x44/0x78 __set_page_owner+0x48/0x78 get_page_from_freelist+0x880/0x8e8 __alloc_pages_nodemask+0x14c/0xc48 __do_page_cache_readahead+0xdc/0x264 filemap_fault+0x2ac/0x550 ext4_filemap_fault+0x3c/0x58 __do_fault+0x80/0x120 handle_mm_fault+0x704/0xbb0 do_page_fault+0x2e8/0x394 do_mem_abort+0x88/0x124 Pre-4.7 kernels also need commit f86e4271978b ("mm: check the return value of lookup_page_ext for all call sites"). Link: http://lkml.kernel.org/r/20171107094131.14621-1-jaewon31.kim@samsung.com Fixes: eefa864b701d ("mm/page_ext: resurrect struct page extending code for debugging") Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Joonsoo Kim <js1304@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-24mm: check the return value of lookup_page_ext for all call sitesYang Shi
commit f86e4271978bd93db466d6a95dad4b0fdcdb04f6 upstream. Per the discussion with Joonsoo Kim [1], we need check the return value of lookup_page_ext() for all call sites since it might return NULL in some cases, although it is unlikely, i.e. memory hotplug. Tested with ltp with "page_owner=0". [1] http://lkml.kernel.org/r/20160519002809.GA10245@js1304-P5Q-DELUXE [akpm@linux-foundation.org: fix build-breaking typos] [arnd@arndb.de: fix build problems from lookup_page_ext] Link: http://lkml.kernel.org/r/6285269.2CksypHdYp@wuerfel [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/1464023768-31025-1-git-send-email-yang.shi@linaro.org Signed-off-by: Yang Shi <yang.shi@linaro.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-24mm/page_alloc.c: broken deferred calculationPavel Tatashin
commit d135e5750205a21a212a19dbb05aeb339e2cbea7 upstream. In reset_deferred_meminit() we determine number of pages that must not be deferred. We initialize pages for at least 2G of memory, but also pages for reserved memory in this node. The reserved memory is determined in this function: memblock_reserved_memory_within(), which operates over physical addresses, and returns size in bytes. However, reset_deferred_meminit() assumes that that this function operates with pfns, and returns page count. The result is that in the best case machine boots slower than expected due to initializing more pages than needed in single thread, and in the worst case panics because fewer than needed pages are initialized early. Link: http://lkml.kernel.org/r/20171021011707.15191-1-pasha.tatashin@oracle.com Fixes: 864b9a393dcb ("mm: consider memblock reservations for deferred memory initialization sizing") Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-21slub: do not merge cache if slub_debug contains a never-merge flagGrygorii Maistrenko
[ Upstream commit c6e28895a4372992961888ffaadc9efc643b5bfe ] In case CONFIG_SLUB_DEBUG_ON=n, find_mergeable() gets debug features from commandline but never checks if there are features from the SLAB_NEVER_MERGE set. As a result selected by slub_debug caches are always mergeable if they have been created without a custom constructor set or without one of the SLAB_* debug features on. This moves the SLAB_NEVER_MERGE check below the flags update from commandline to make sure it won't merge the slab cache if one of the debug features is on. Link: http://lkml.kernel.org/r/20170101124451.GA4740@lp-laptop-d Signed-off-by: Grygorii Maistrenko <grygoriimkd@gmail.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-09-27mm: prevent double decrease of nr_reserved_highatomicMinchan Kim
commit 4855e4a7f29d6d10b0b9c84e189c770c9a94e91e upstream. There is race between page freeing and unreserved highatomic. CPU 0 CPU 1 free_hot_cold_page mt = get_pfnblock_migratetype set_pcppage_migratetype(page, mt) unreserve_highatomic_pageblock spin_lock_irqsave(&zone->lock) move_freepages_block set_pageblock_migratetype(page) spin_unlock_irqrestore(&zone->lock) free_pcppages_bulk __free_one_page(mt) <- mt is stale By above race, a page on CPU 0 could go non-highorderatomic free list since the pageblock's type is changed. By that, unreserve logic of highorderatomic can decrease reserved count on a same pageblock severak times and then it will make mismatch between nr_reserved_highatomic and the number of reserved pageblock. So, this patch verifies whether the pageblock is highatomic or not and decrease the count only if the pageblock is highatomic. Link: http://lkml.kernel.org/r/1476259429-18279-3-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sangseok Lee <sangseok.lee@lge.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Miles Chen <miles.chen@mediatek.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-24Sanitize 'move_pages()' permission checksLinus Torvalds
commit 197e7e521384a23b9e585178f3f11c9fa08274b9 upstream. The 'move_paghes()' system call was introduced long long ago with the same permission checks as for sending a signal (except using CAP_SYS_NICE instead of CAP_SYS_KILL for the overriding capability). That turns out to not be a great choice - while the system call really only moves physical page allocations around (and you need other capabilities to do a lot of it), you can check the return value to map out some the virtual address choices and defeat ASLR of a binary that still shares your uid. So change the access checks to the more common 'ptrace_may_access()' model instead. This tightens the access checks for the uid, and also effectively changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely that anybody really _uses_ this legacy system call any more (we hav ebetter NUMA placement models these days), so I expect nobody to notice. Famous last words. Reported-by: Otto Ebeling <otto.ebeling@iki.fi> Acked-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Willy Tarreau <w@1wt.eu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-24mm/mempolicy: fix use after free when calling get_mempolicyzhong jiang
commit 73223e4e2e3867ebf033a5a8eb2e5df0158ccc99 upstream. I hit a use after free issue when executing trinity and repoduced it with KASAN enabled. The related call trace is as follows. BUG: KASan: use after free in SyS_get_mempolicy+0x3c8/0x960 at addr ffff8801f582d766 Read of size 2 by task syz-executor1/798 INFO: Allocated in mpol_new.part.2+0x74/0x160 age=3 cpu=1 pid=799 __slab_alloc+0x768/0x970 kmem_cache_alloc+0x2e7/0x450 mpol_new.part.2+0x74/0x160 mpol_new+0x66/0x80 SyS_mbind+0x267/0x9f0 system_call_fastpath+0x16/0x1b INFO: Freed in __mpol_put+0x2b/0x40 age=4 cpu=1 pid=799 __slab_free+0x495/0x8e0 kmem_cache_free+0x2f3/0x4c0 __mpol_put+0x2b/0x40 SyS_mbind+0x383/0x9f0 system_call_fastpath+0x16/0x1b INFO: Slab 0xffffea0009cb8dc0 objects=23 used=8 fp=0xffff8801f582de40 flags=0x200000000004080 INFO: Object 0xffff8801f582d760 @offset=5984 fp=0xffff8801f582d600 Bytes b4 ffff8801f582d750: ae 01 ff ff 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ Object ffff8801f582d760: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b kkkkkkkkkkkkkkkk Object ffff8801f582d770: 6b 6b 6b 6b 6b 6b 6b a5 kkkkkkk. Redzone ffff8801f582d778: bb bb bb bb bb bb bb bb ........ Padding ffff8801f582d8b8: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ Memory state around the buggy address: ffff8801f582d600: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc ffff8801f582d680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff8801f582d700: fc fc fc fc fc fc fc fc fc fc fc fc fb fb fb fc !shared memory policy is not protected against parallel removal by other thread which is normally protected by the mmap_sem. do_get_mempolicy, however, drops the lock midway while we can still access it later. Early premature up_read is a historical artifact from times when put_user was called in this path see https://lwn.net/Articles/124754/ but that is gone since 8bccd85ffbaf ("[PATCH] Implement sys_* do_* layering in the memory policy layer."). but when we have the the current mempolicy ref count model. The issue was introduced accordingly. Fix the issue by removing the premature release. Link: http://lkml.kernel.org/r/1502950924-27521-1-git-send-email-zhongjiang@huawei.com Signed-off-by: zhong jiang <zhongjiang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-16mm: ratelimit PFNs busy info messageJonathan Toppins
commit 75dddef32514f7aa58930bde6a1263253bc3d4ba upstream. The RDMA subsystem can generate several thousand of these messages per second eventually leading to a kernel crash. Ratelimit these messages to prevent this crash. Doug said: "I've been carrying a version of this for several kernel versions. I don't remember when they started, but we have one (and only one) class of machines: Dell PE R730xd, that generate these errors. When it happens, without a rate limit, we get rcu timeouts and kernel oopses. With the rate limit, we just get a lot of annoying kernel messages but the machine continues on, recovers, and eventually the memory operations all succeed" And: "> Well... why are all these EBUSY's occurring? It sounds inefficient > (at least) but if it is expected, normal and unavoidable then > perhaps we should just remove that message altogether? I don't have an answer to that question. To be honest, I haven't looked real hard. We never had this at all, then it started out of the blue, but only on our Dell 730xd machines (and it hits all of them), but no other classes or brands of machines. And we have our 730xd machines loaded up with different brands and models of cards (for instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines meant it wasn't tied to any particular brand/model of RDMA hardware. To me, it always smelled of a hardware oddity specific to maybe the CPUs or mainboard chipsets in these machines, so given that I'm not an mm expert anyway, I never chased it down. A few other relevant details: it showed up somewhere around 4.8/4.9 or thereabouts. It never happened before, but the prinkt has been there since the 3.18 days, so possibly the test to trigger this message was changed, or something else in the allocator changed such that the situation started happening on these machines? And, like I said, it is specific to our 730xd machines (but they are all identical, so that could mean it's something like their specific ram configuration is causing the allocator to hit this on these machine but not on other machines in the cluster, I don't want to say it's necessarily the model of chipset or CPU, there are other bits of identicalness between these machines)" Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.com Signed-off-by: Jonathan Toppins <jtoppins@redhat.com> Reviewed-by: Doug Ledford <dledford@redhat.com> Tested-by: Doug Ledford <dledford@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-12mm/mempool: avoid KASAN marking mempool poison checks as use-after-freeMatthew Dawson
commit 7640131032db9118a78af715ac77ba2debeeb17c upstream. When removing an element from the mempool, mark it as unpoisoned in KASAN before verifying its contents for SLUB/SLAB debugging. Otherwise KASAN will flag the reads checking the element use-after-free writes as use-after-free reads. Signed-off-by: Matthew Dawson <matthew@mjdsystems.ca> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrii Bordunov <aborduno@cisco.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-11mm: don't dereference struct page fields of invalid pagesArd Biesheuvel
[ Upstream commit f073bdc51771f5a5c7a8d1191bfc3ae371d44de7 ] The VM_BUG_ON() check in move_freepages() checks whether the node id of a page matches the node id of its zone. However, it does this before having checked whether the struct page pointer refers to a valid struct page to begin with. This is guaranteed in most cases, but may not be the case if CONFIG_HOLES_IN_ZONE=y. So reorder the VM_BUG_ON() with the pfn_valid_within() check. Link: http://lkml.kernel.org/r/1481706707-6211-2-git-send-email-ard.biesheuvel@linaro.org Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Hanjun Guo <hanjun.guo@linaro.org> Cc: Yisheng Xie <xieyisheng1@huawei.com> Cc: Robert Richter <rrichter@cavium.com> Cc: James Morse <james.morse@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-11mm, mprotect: flush TLB if potentially racing with a parallel reclaim ↵Mel Gorman
leaving stale TLB entries commit 3ea277194daaeaa84ce75180ec7c7a2075027a68 upstream. Stable note for 4.4: The upstream patch patches madvise(MADV_FREE) but 4.4 does not have support for that feature. The changelog is left as-is but the hunk related to madvise is omitted from the backport. Nadav Amit identified a theoritical race between page reclaim and mprotect due to TLB flushes being batched outside of the PTL being held. He described the race as follows: CPU0 CPU1 ---- ---- user accesses memory using RW PTE [PTE now cached in TLB] try_to_unmap_one() ==> ptep_get_and_clear() ==> set_tlb_ubc_flush_pending() mprotect(addr, PROT_READ) ==> change_pte_range() ==> [ PTE non-present - no flush ] user writes using cached RW PTE ... try_to_unmap_flush() The same type of race exists for reads when protecting for PROT_NONE and also exists for operations that can leave an old TLB entry behind such as munmap, mremap and madvise. For some operations like mprotect, it's not necessarily a data integrity issue but it is a correctness issue as there is a window where an mprotect that limits access still allows access. For munmap, it's potentially a data integrity issue although the race is massive as an munmap, mmap and return to userspace must all complete between the window when reclaim drops the PTL and flushes the TLB. However, it's theoritically possible so handle this issue by flushing the mm if reclaim is potentially currently batching TLB flushes. Other instances where a flush is required for a present pte should be ok as either the page lock is held preventing parallel reclaim or a page reference count is elevated preventing a parallel free leading to corruption. In the case of page_mkclean there isn't an obvious path that userspace could take advantage of without using the operations that are guarded by this patch. Other users such as gup as a race with reclaim looks just at PTEs. huge page variants should be ok as they don't race with reclaim. mincore only looks at PTEs. userfault also should be ok as if a parallel reclaim takes place, it will either fault the page back in or read some of the data before the flush occurs triggering a fault. Note that a variant of this patch was acked by Andy Lutomirski but this was for the x86 parts on top of his PCID work which didn't make the 4.13 merge window as expected. His ack is dropped from this version and there will be a follow-on patch on top of PCID that will include his ack. [akpm@linux-foundation.org: tweak comments] [akpm@linux-foundation.org: fix spello] Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de Reported-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-11mm/page_alloc: Remove kernel address exposure in free_reserved_area()Josh Poimboeuf
commit adb1fe9ae2ee6ef6bc10f3d5a588020e7664dfa7 upstream. Linus suggested we try to remove some of the low-hanging fruit related to kernel address exposure in dmesg. The only leaks I see on my local system are: Freeing SMP alternatives memory: 32K (ffffffff9e309000 - ffffffff9e311000) Freeing initrd memory: 10588K (ffffa0b736b42000 - ffffa0b737599000) Freeing unused kernel memory: 3592K (ffffffff9df87000 - ffffffff9e309000) Freeing unused kernel memory: 1352K (ffffa0b7288ae000 - ffffa0b728a00000) Freeing unused kernel memory: 632K (ffffa0b728d62000 - ffffa0b728e00000) Linus says: "I suspect we should just remove [the addresses in the 'Freeing' messages]. I'm sure they are useful in theory, but I suspect they were more useful back when the whole "free init memory" was originally done. These days, if we have a use-after-free, I suspect the init-mem situation is the easiest situation by far. Compared to all the dynamic allocations which are much more likely to show it anyway. So having debug output for that case is likely not all that productive." With this patch the freeing messages now look like this: Freeing SMP alternatives memory: 32K Freeing initrd memory: 10588K Freeing unused kernel memory: 3592K Freeing unused kernel memory: 1352K Freeing unused kernel memory: 632K Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/6836ff90c45b71d38e5d4405aec56fa9e5d1d4b2.1477405374.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Kees Cook <keescook@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21mm: fix overflow check in expand_upwards()Helge Deller
commit 37511fb5c91db93d8bd6e3f52f86e5a7ff7cfcdf upstream. Jörn Engel noticed that the expand_upwards() function might not return -ENOMEM in case the requested address is (unsigned long)-PAGE_SIZE and if the architecture didn't defined TASK_SIZE as multiple of PAGE_SIZE. Affected architectures are arm, frv, m68k, blackfin, h8300 and xtensa which all define TASK_SIZE as 0xffffffff, but since none of those have an upwards-growing stack we currently have no actual issue. Nevertheless let's fix this just in case any of the architectures with an upward-growing stack (currently parisc, metag and partly ia64) define TASK_SIZE similar. Link: http://lkml.kernel.org/r/20170702192452.GA11868@p100.box Fixes: bd726c90b6b8 ("Allow stack to grow up to address space limit") Signed-off-by: Helge Deller <deller@gmx.de> Reported-by: Jörn Engel <joern@purestorage.com> Cc: Hugh Dickins <hughd@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21mm/list_lru.c: fix list_lru_count_node() to be race freeSahitya Tummala
commit 2c80cd57c74339889a8752b20862a16c28929c3a upstream. list_lru_count_node() iterates over all memcgs to get the total number of entries on the node but it can race with memcg_drain_all_list_lrus(), which migrates the entries from a dead cgroup to another. This can return incorrect number of entries from list_lru_count_node(). Fix this by keeping track of entries per node and simply return it in list_lru_count_node(). Link: http://lkml.kernel.org/r/1498707555-30525-1-git-send-email-stummala@codeaurora.org Signed-off-by: Sahitya Tummala <stummala@codeaurora.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Alexander Polakov <apolyakov@beget.ru> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-15mm: fix classzone_idx underflow in shrink_zones()Vlastimil Babka
[Not upstream as that would take 34+ patches] We've got reported a BUG in do_try_to_free_pages(): BUG: unable to handle kernel paging request at ffff8ffffff28990 IP: [<ffffffff8119abe0>] do_try_to_free_pages+0x140/0x490 PGD 0 Oops: 0000 [#1] SMP megaraid_sas sg scsi_mod efivarfs autofs4 Supported: No, Unsupported modules are loaded Workqueue: kacpi_hotplug acpi_hotplug_work_fn task: ffff88ffd0d4c540 ti: ffff88ffd0e48000 task.ti: ffff88ffd0e48000 RIP: 0010:[<ffffffff8119abe0>] [<ffffffff8119abe0>] do_try_to_free_pages+0x140/0x490 RSP: 0018:ffff88ffd0e4ba60 EFLAGS: 00010206 RAX: 000006fffffff900 RBX: 00000000ffffffff RCX: ffff88fffff29000 RDX: 000000ffffffff00 RSI: 0000000000000003 RDI: 00000000024200c8 RBP: 0000000001320122 R08: 0000000000000000 R09: ffff88ffd0e4bbac R10: 0000000000000000 R11: 0000000000000000 R12: ffff88ffd0e4bae0 R13: 0000000000000e00 R14: ffff88fffff2a500 R15: ffff88fffff2b300 FS: 0000000000000000(0000) GS:ffff88ffe6440000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff8ffffff28990 CR3: 0000000001c0a000 CR4: 00000000003406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Stack: 00000002db570a80 024200c80000001e ffff88fffff2b300 0000000000000000 ffff88fffffd5700 ffff88ffd0d4c540 ffff88ffd0d4c540 ffffffff0000000c 0000000000000000 0000000000000040 00000000024200c8 ffff88ffd0e4bae0 Call Trace: [<ffffffff8119afea>] try_to_free_pages+0xba/0x170 [<ffffffff8118cf2f>] __alloc_pages_nodemask+0x53f/0xb20 [<ffffffff811d39ff>] alloc_pages_current+0x7f/0x100 [<ffffffff811e2232>] migrate_pages+0x202/0x710 [<ffffffff815dadaa>] __offline_pages.constprop.23+0x4ba/0x790 [<ffffffff81463263>] memory_subsys_offline+0x43/0x70 [<ffffffff8144cbed>] device_offline+0x7d/0xa0 [<ffffffff81392fa2>] acpi_bus_offline+0xa5/0xef [<ffffffff81394a77>] acpi_device_hotplug+0x21b/0x41f [<ffffffff8138dab7>] acpi_hotplug_work_fn+0x1a/0x23 [<ffffffff81093cee>] process_one_work+0x14e/0x410 [<ffffffff81094546>] worker_thread+0x116/0x490 [<ffffffff810999ed>] kthread+0xbd/0xe0 [<ffffffff815e4e7f>] ret_from_fork+0x3f/0x70 This translates to the loop in shrink_zone(): classzone_idx = requested_highidx; while (!populated_zone(zone->zone_pgdat->node_zones + classzone_idx)) classzone_idx--; where no zone is populated, so classzone_idx becomes -1 (in RBX). Added debugging output reveals that we enter the function with sc->gfp_mask == GFP_NOFS|__GFP_NOFAIL|__GFP_HARDWALL|__GFP_MOVABLE requested_highidx = gfp_zone(sc->gfp_mask) == 2 (ZONE_NORMAL) Inside the for loop, however: gfp_zone(sc->gfp_mask) == 3 (ZONE_MOVABLE) This means we have gone through this branch: if (buffer_heads_over_limit) sc->gfp_mask |= __GFP_HIGHMEM; This changes the gfp_zone() result, but requested_highidx remains unchanged. On nodes where the only populated zone is movable, the inner while loop will check only lower zones, which are not populated, and underflow classzone_idx. To sum up, the bug occurs in configurations with ZONE_MOVABLE (such as when booted with the movable_node parameter) and only in situations when buffer_heads_over_limit is true, and there's an allocation with __GFP_MOVABLE and without __GFP_HIGHMEM performing direct reclaim. This patch makes sure that classzone_idx starts with the correct zone. Mainline has been affected in versions 4.6 and 4.7, but the culprit commit has been also included in stable trees. In mainline, this has been fixed accidentally as part of 34-patch series (plus follow-up fixes) "Move LRU page reclaim from zones to nodes", which makes the mainline commit unsuitable for stable backport, unfortunately. Fixes: 7bf52fb891b6 ("mm: vmscan: reclaim highmem zone if buffer_heads is over limit") Obsoleted-by: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis") Debugged-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05mm: numa: avoid waiting on freed migrated pagesMark Rutland
commit 3c226c637b69104f6b9f1c6ec5b08d7b741b3229 upstream. In do_huge_pmd_numa_page(), we attempt to handle a migrating thp pmd by waiting until the pmd is unlocked before we return and retry. However, we can race with migrate_misplaced_transhuge_page(): // do_huge_pmd_numa_page // migrate_misplaced_transhuge_page() // Holds 0 refs on page // Holds 2 refs on page vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd); /* ... */ if (pmd_trans_migrating(*vmf->pmd)) { page = pmd_page(*vmf->pmd); spin_unlock(vmf->ptl); ptl = pmd_lock(mm, pmd); if (page_count(page) != 2)) { /* roll back */ } /* ... */ mlock_migrate_page(new_page, page); /* ... */ spin_unlock(ptl); put_page(page); put_page(page); // page freed here wait_on_page_locked(page); goto out; } This can result in the freed page having its waiters flag set unexpectedly, which trips the PAGE_FLAGS_CHECK_AT_PREP checks in the page alloc/free functions. This has been observed on arm64 KVM guests. We can avoid this by having do_huge_pmd_numa_page() take a reference on the page before dropping the pmd lock, mirroring what we do in __migration_entry_wait(). When we hit the race, migrate_misplaced_transhuge_page() will see the reference and abort the migration, as it may do today in other cases. Fixes: b8916634b77bffb2 ("mm: Prevent parallel splits during THP migration") Link: http://lkml.kernel.org/r/1497349722-6731-2-git-send-email-will.deacon@arm.com Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Acked-by: Steve Capper <steve.capper@arm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-05mm, swap_cgroup: reschedule when neeed in swap_cgroup_swapoff()David Rientjes
commit 460bcec84e11c75122ace5976214abbc596eb91b upstream. We got need_resched() warnings in swap_cgroup_swapoff() because swap_cgroup_ctrl[type].length is particularly large. Reschedule when needed. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1704061315270.80559@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-26mm: fix new crash in unmapped_area_topdown()Hugh Dickins
commit f4cb767d76cf7ee72f97dd76f6cfa6c76a5edc89 upstream. Trinity gets kernel BUG at mm/mmap.c:1963! in about 3 minutes of mmap testing. That's the VM_BUG_ON(gap_end < gap_start) at the end of unmapped_area_topdown(). Linus points out how MAP_FIXED (which does not have to respect our stack guard gap intentions) could result in gap_end below gap_start there. Fix that, and the similar case in its alternative, unmapped_area(). Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas") Reported-by: Dave Jones <davej@codemonkey.org.uk> Debugged-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-26Allow stack to grow up to address space limitHelge Deller
commit bd726c90b6b8ce87602208701b208a208e6d5600 upstream. Fix expand_upwards() on architectures with an upward-growing stack (parisc, metag and partly IA-64) to allow the stack to reliably grow exactly up to the address space limit given by TASK_SIZE. Signed-off-by: Helge Deller <deller@gmx.de> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-26mm: larger stack guard gap, between vmasHugh Dickins
commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream. Stack guard page is a useful feature to reduce a risk of stack smashing into a different mapping. We have been using a single page gap which is sufficient to prevent having stack adjacent to a different mapping. But this seems to be insufficient in the light of the stack usage in userspace. E.g. glibc uses as large as 64kB alloca() in many commonly used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX] which is 256kB or stack strings with MAX_ARG_STRLEN. This will become especially dangerous for suid binaries and the default no limit for the stack size limit because those applications can be tricked to consume a large portion of the stack and a single glibc call could jump over the guard page. These attacks are not theoretical, unfortunatelly. Make those attacks less probable by increasing the stack guard gap to 1MB (on systems with 4k pages; but make it depend on the page size because systems with larger base pages might cap stack allocations in the PAGE_SIZE units) which should cover larger alloca() and VLA stack allocations. It is obviously not a full fix because the problem is somehow inherent, but it should reduce attack space a lot. One could argue that the gap size should be configurable from userspace, but that can be done later when somebody finds that the new 1MB is wrong for some special case applications. For now, add a kernel command line option (stack_guard_gap) to specify the stack gap size (in page units). Implementation wise, first delete all the old code for stack guard page: because although we could get away with accounting one extra page in a stack vma, accounting a larger gap can break userspace - case in point, a program run with "ulimit -S -v 20000" failed when the 1MB gap was counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK and strict non-overcommit mode. Instead of keeping gap inside the stack vma, maintain the stack guard gap as a gap between vmas: using vm_start_gap() in place of vm_start (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few places which need to respect the gap - mainly arch_get_unmapped_area(), and and the vma tree's subtree_gap support for that. Original-patch-by: Oleg Nesterov <oleg@redhat.com> Original-patch-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Helge Deller <deller@gmx.de> # parisc Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [wt: backport to 4.11: adjust context] [wt: backport to 4.9: adjust context ; kernel doc was not in admin-guide] [wt: backport to 4.4: adjust context ; drop ppc hugetlb_radix changes] Signed-off-by: Willy Tarreau <w@1wt.eu> [gkh: minor build fixes for 4.4] Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-26swap: cond_resched in swap_cgroup_prepare()Yu Zhao
commit ef70762948dde012146926720b70e79736336764 upstream. I saw need_resched() warnings when swapping on large swapfile (TBs) because continuously allocating many pages in swap_cgroup_prepare() took too long. We already cond_resched when freeing page in swap_cgroup_swapoff(). Do the same for the page allocation. Link: http://lkml.kernel.org/r/20170604200109.17606-1-yuzhao@google.com Signed-off-by: Yu Zhao <yuzhao@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-26mm/memory-failure.c: use compound_head() flags for huge pagesJames Morse
commit 7258ae5c5a2ce2f5969e8b18b881be40ab55433d upstream. memory_failure() chooses a recovery action function based on the page flags. For huge pages it uses the tail page flags which don't have anything interesting set, resulting in: > Memory failure: 0x9be3b4: Unknown page state > Memory failure: 0x9be3b4: recovery action for unknown page: Failed Instead, save a copy of the head page's flags if this is a huge page, this means if there are no relevant flags for this tail page, we use the head pages flags instead. This results in the me_huge_page() recovery action being called: > Memory failure: 0x9b7969: recovery action for huge page: Delayed For hugepages that have not yet been allocated, this allows the hugepage to be dequeued. Fixes: 524fca1e7356 ("HWPOISON: fix misjudgement of page_action() for errors on mlocked pages") Link: http://lkml.kernel.org/r/20170524130204.21845-1-james.morse@arm.com Signed-off-by: James Morse <james.morse@arm.com> Tested-by: Punit Agrawal <punit.agrawal@arm.com> Acked-by: Punit Agrawal <punit.agrawal@arm.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-17kasan: respect /proc/sys/kernel/traceoff_on_warningPeter Zijlstra
[ Upstream commit 4f40c6e5627ea73b4e7c615c59631f38cc880885 ] After much waiting I finally reproduced a KASAN issue, only to find my trace-buffer empty of useful information because it got spooled out :/ Make kasan_report honour the /proc/sys/kernel/traceoff_on_warning interface. Link: http://lkml.kernel.org/r/20170125164106.3514-1-aryabinin@virtuozzo.com Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14mm: consider memblock reservations for deferred memory initialization sizingMichal Hocko
commit 864b9a393dcb5aed09b8fd31b9bbda0fdda99374 upstream. We have seen an early OOM killer invocation on ppc64 systems with crashkernel=4096M: kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0 kthreadd cpuset=/ mems_allowed=7 CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1 Call Trace: dump_stack+0xb0/0xf0 (unreliable) dump_header+0xb0/0x258 out_of_memory+0x5f0/0x640 __alloc_pages_nodemask+0xa8c/0xc80 kmem_getpages+0x84/0x1a0 fallback_alloc+0x2a4/0x320 kmem_cache_alloc_node+0xc0/0x2e0 copy_process.isra.25+0x260/0x1b30 _do_fork+0x94/0x470 kernel_thread+0x48/0x60 kthreadd+0x264/0x330 ret_from_kernel_thread+0x5c/0xa4 Mem-Info: active_anon:0 inactive_anon:0 isolated_anon:0 active_file:0 inactive_file:0 isolated_file:0 unevictable:0 dirty:0 writeback:0 unstable:0 slab_reclaimable:5 slab_unreclaimable:73 mapped:0 shmem:0 pagetables:0 bounce:0 free:0 free_pcp:0 free_cma:0 Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes lowmem_reserve[]: 0 0 0 0 Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB 0 total pagecache pages 0 pages in swap cache Swap cache stats: add 0, delete 0, find 0/0 Free swap = 0kB Total swap = 0kB 819200 pages RAM 0 pages HighMem/MovableOnly 817481 pages reserved 0 pages cma reserved 0 pages hwpoisoned the reason is that the managed memory is too low (only 110MB) while the rest of the the 50GB is still waiting for the deferred intialization to be done. update_defer_init estimates the initial memoty to initialize to 2GB at least but it doesn't consider any memory allocated in that range. In this particular case we've had Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB) so the low 2GB is mostly depleted. Fix this by considering memblock allocations in the initial static initialization estimation. Move the max_initialise to reset_deferred_meminit and implement a simple memblock_reserved_memory helper which iterates all reserved blocks and sums the size of all that start below the given address. The cumulative size is than added on top of the initial estimation. This is still not ideal because reset_deferred_meminit doesn't consider holes and so reservation might be above the initial estimation whihch we ignore but let's make the logic simpler until we really need to handle more complicated cases. Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set") Link: http://lkml.kernel.org/r/20170531104010.GI27783@dhcp22.suse.cz Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@suse.de> Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-14fs: add i_blocksize()Fabian Frederick
commit 93407472a21b82f39c955ea7787e5bc7da100642 upstream. Replace all 1 << inode->i_blkbits and (1 << inode->i_blkbits) in fs branch. This patch also fixes multiple checkpatch warnings: WARNING: Prefer 'unsigned int' to bare use of 'unsigned' Thanks to Andrew Morton for suggesting more appropriate function instead of macro. [geliangtang@gmail.com: truncate: use i_blocksize()] Link: http://lkml.kernel.org/r/9c8b2cd83c8f5653805d43debde9fa8817e02fc4.1484895804.git.geliangtang@gmail.com Link: http://lkml.kernel.org/r/1481319905-10126-1-git-send-email-fabf@skynet.be Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Geliang Tang <geliangtang@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-06-07mlock: fix mlock count can not decrease in race conditionYisheng Xie
commit 70feee0e1ef331b22cc51f383d532a0d043fbdcc upstream. Kefeng reported that when running the follow test, the mlock count in meminfo will increase permanently: [1] testcase linux:~ # cat test_mlockal grep Mlocked /proc/meminfo for j in `seq 0 10` do for i in `seq 4 15` do ./p_mlockall >> log & done sleep 0.2 done # wait some time to let mlock counter decrease and 5s may not enough sleep 5 grep Mlocked /proc/meminfo linux:~ # cat p_mlockall.c #include <sys/mman.h> #include <stdlib.h> #include <stdio.h> #define SPACE_LEN 4096 int main(int argc, char ** argv) { int ret; void *adr = malloc(SPACE_LEN); if (!adr) return -1; ret = mlockall(MCL_CURRENT | MCL_FUTURE); printf("mlcokall ret = %d\n", ret); ret = munlockall(); printf("munlcokall ret = %d\n", ret); free(adr); return 0; } In __munlock_pagevec() we should decrement NR_MLOCK for each page where we clear the PageMlocked flag. Commit 1ebb7cc6a583 ("mm: munlock: batch NR_MLOCK zone state updates") has introduced a bug where we don't decrement NR_MLOCK for pages where we clear the flag, but fail to isolate them from the lru list (e.g. when the pages are on some other cpu's percpu pagevec). Since PageMlocked stays cleared, the NR_MLOCK accounting gets permanently disrupted by this. Fix it by counting the number of page whose PageMlock flag is cleared. Fixes: 1ebb7cc6a583 (" mm: munlock: batch NR_MLOCK zone state updates") Link: http://lkml.kernel.org/r/1495678405-54569-1-git-send-email-xieyisheng1@huawei.com Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com> Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com> Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joern Engel <joern@logfs.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: zhongjiang <zhongjiang@huawei.com> Cc: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>