summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2015-06-10mm, numa: really disable NUMA balancing by default on single node machinesMel Gorman
[ Upstream commit b0dc2b9bb4ab782115b964310518ee0b17784277 ] NUMA balancing is meant to be disabled by default on UMA machines but the check is using nr_node_ids (highest node) instead of num_online_nodes (online nodes). The consequences are that a UMA machine with a node ID of 1 or higher will enable NUMA balancing. This will incur useless overhead due to minor faults with the impact depending on the workload. These are the impact on the stats when running a kernel build on a single node machine whose node ID happened to be 1: vanilla patched NUMA base PTE updates 5113158 0 NUMA huge PMD updates 643 0 NUMA page range updates 5442374 0 NUMA hint faults 2109622 0 NUMA hint local faults 2109622 0 NUMA hint local percent 100 100 NUMA pages migrated 0 0 Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [3.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-05-23mm: soft-offline: fix num_poisoned_pages counting on concurrent eventsNaoya Horiguchi
[ Upstream commit 602498f9aa43d4951eece3fd6ad95a6d0a78d537 ] If multiple soft offline events hit one free page/hugepage concurrently, soft_offline_page() can handle the free page/hugepage multiple times, which makes num_poisoned_pages counter increased more than once. This patch fixes this wrong counting by checking TestSetPageHWPoison for normal papes and by checking the return value of dequeue_hwpoisoned_huge_page() for hugepages. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Dean Nelson <dnelson@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: <stable@vger.kernel.org> [3.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-05-23writeback: use |1 instead of +1 to protect against div by zeroTejun Heo
[ Upstream commit 464d1387acb94dc43ba772b35242345e3d2ead1b ] mm/page-writeback.c has several places where 1 is added to the divisor to prevent division by zero exceptions; however, if the original divisor is equivalent to -1, adding 1 leads to division by zero. There are three places where +1 is used for this purpose - one in pos_ratio_polynom() and two in bdi_position_ratio(). The second one in bdi_position_ratio() actually triggered div-by-zero oops on a machine running a 3.10 kernel. The divisor is x_intercept - bdi_setpoint + 1 == span + 1 span is confirmed to be (u32)-1. It isn't clear how it ended up that but it could be from write bandwidth calculation underflow fixed by c72efb658f7c ("writeback: fix possible underflow in write bandwidth calculation"). At any rate, +1 isn't a proper protection against div-by-zero. This patch converts all +1 protections to |1. Note that bdi_update_dirty_ratelimit() was already using |1 before this patch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-05-23mm/memory-failure: call shake_page() when error hits thp tail pageNaoya Horiguchi
[ Upstream commit 09789e5de18e4e442870b2d700831f5cb802eb05 ] Currently memory_failure() calls shake_page() to sweep pages out from pcplists only when the victim page is 4kB LRU page or thp head page. But we should do this for a thp tail page too. Consider that a memory error hits a thp tail page whose head page is on a pcplist when memory_failure() runs. Then, the current kernel skips shake_pages() part, so hwpoison_user_mappings() returns without calling split_huge_page() nor try_to_unmap() because PageLRU of the thp head is still cleared due to the skip of shake_page(). As a result, me_huge_page() runs for the thp, which is broken behavior. One effect is a leak of the thp. And another is to fail to isolate the memory error, so later access to the error address causes another MCE, which kills the processes which used the thp. This patch fixes this problem by calling shake_page() for thp tail case. Fixes: 385de35722c9 ("thp: allow a hwpoisoned head page to be put back to LRU") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Acked-by: Dean Nelson <dnelson@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Cc: Jin Dongming <jin.dongming@np.css.fujitsu.com> Cc: <stable@vger.kernel.org> [3.4+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-05-17mm/hugetlb: take page table lock in follow_huge_pmd()Naoya Horiguchi
[ Upstream commit e66f17ff71772b209eed39de35aaa99ba819c93d ] We have a race condition between move_pages() and freeing hugepages, where move_pages() calls follow_page(FOLL_GET) for hugepages internally and tries to get its refcount without preventing concurrent freeing. This race crashes the kernel, so this patch fixes it by moving FOLL_GET code for hugepages into follow_huge_pmd() with taking the page table lock. This patch intentionally removes page==NULL check after pte_page. This is justified because pte_page() never returns NULL for any architectures or configurations. This patch changes the behavior of follow_huge_pmd() for tail pages and then tail pages can be pinned/returned. So the caller must be changed to properly handle the returned tail pages. We could have a choice to add the similar locking to follow_huge_(addr|pud) for consistency, but it's not necessary because currently these functions don't support FOLL_GET flag, so let's leave it for future development. Here is the reproducer: $ cat movepages.c #include <stdio.h> #include <stdlib.h> #include <numaif.h> #define ADDR_INPUT 0x700000000000UL #define HPS 0x200000 #define PS 0x1000 int main(int argc, char *argv[]) { int i; int nr_hp = strtol(argv[1], NULL, 0); int nr_p = nr_hp * HPS / PS; int ret; void **addrs; int *status; int *nodes; pid_t pid; pid = strtol(argv[2], NULL, 0); addrs = malloc(sizeof(char *) * nr_p + 1); status = malloc(sizeof(char *) * nr_p + 1); nodes = malloc(sizeof(char *) * nr_p + 1); while (1) { for (i = 0; i < nr_p; i++) { addrs[i] = (void *)ADDR_INPUT + i * PS; nodes[i] = 1; status[i] = 0; } ret = numa_move_pages(pid, nr_p, addrs, nodes, status, MPOL_MF_MOVE_ALL); if (ret == -1) err("move_pages"); for (i = 0; i < nr_p; i++) { addrs[i] = (void *)ADDR_INPUT + i * PS; nodes[i] = 0; status[i] = 0; } ret = numa_move_pages(pid, nr_p, addrs, nodes, status, MPOL_MF_MOVE_ALL); if (ret == -1) err("move_pages"); } return 0; } $ cat hugepage.c #include <stdio.h> #include <sys/mman.h> #include <string.h> #define ADDR_INPUT 0x700000000000UL #define HPS 0x200000 int main(int argc, char *argv[]) { int nr_hp = strtol(argv[1], NULL, 0); char *p; while (1) { p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0); if (p != (void *)ADDR_INPUT) { perror("mmap"); break; } memset(p, 0, nr_hp * HPS); munmap(p, nr_hp * HPS); } } $ sysctl vm.nr_hugepages=40 $ ./hugepage 10 & $ ./movepages 10 $(pgrep -f hugepage) Fixes: e632a938d914 ("mm: migrate: add hugepage migration code to move_pages()") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-05-17mm/hugetlb: use pmd_page() in follow_huge_pmd()Naoya Horiguchi
[ Upstream commit 97534127012f0e396eddea4691f4c9b170aed74b ] Commit 61f77eda9bbf ("mm/hugetlb: reduce arch dependent code around follow_huge_*") broke follow_huge_pmd() on s390, where pmd and pte layout differ and using pte_page() on a huge pmd will return wrong results. Using pmd_page() instead fixes this. All architectures that were touched by that commit have pmd_page() defined, so this should not break anything on other architectures. Fixes: 61f77eda "mm/hugetlb: reduce arch dependent code around follow_huge_*" Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.cz>, Andrea Arcangeli <aarcange@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-04-28mm: hwpoison: drop lru_add_drain_all() in __soft_offline_page()Naoya Horiguchi
[ Upstream commit 9ab3b598d2dfbdb0153ffa7e4b1456bbff59a25d ] A race condition starts to be visible in recent mmotm, where a PG_hwpoison flag is set on a migration source page *before* it's back in buddy page poo= l. This is problematic because no page flag is supposed to be set when freeing (see __free_one_page().) So the user-visible effect of this race is that it could trigger the BUG_ON() when soft-offlining is called. The root cause is that we call lru_add_drain_all() to make sure that the page is in buddy, but that doesn't work because this function just schedule= s a work item and doesn't wait its completion. drain_all_pages() does drainin= g directly, so simply dropping lru_add_drain_all() solves this problem. Fixes: f15bdfa802bf ("mm/memory-failure.c: fix memory leak in successful soft offlining") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Chen Gong <gong.chen@linux.intel.com> Cc: <stable@vger.kernel.org> [3.11+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-04-24writeback: fix possible underflow in write bandwidth calculationTejun Heo
[ Upstream commit c72efb658f7c8b27ca3d0efb5cfd5ded9fcac89e ] From 1ebf33901ecc75d9496862dceb1ef0377980587c Mon Sep 17 00:00:00 2001 From: Tejun Heo <tj@kernel.org> Date: Mon, 23 Mar 2015 00:08:19 -0400 2f800fbd777b ("writeback: fix dirtied pages accounting on redirty") introduced account_page_redirty() which reverts stat updates for a redirtied page, making BDI_DIRTIED no longer monotonically increasing. bdi_update_write_bandwidth() uses the delta in BDI_DIRTIED as the basis for bandwidth calculation. While unlikely, since the above patch, the newer value may be lower than the recorded past value and underflow the bandwidth calculation leading to a wild result. Fix it by subtracing min of the old and new values when calculating delta. AFAIK, there hasn't been any report of it happening but the resulting erratic behavior would be non-critical and temporary, so it's possible that the issue is happening without being reported. The risk of the fix is very low, so tagged for -stable. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Fixes: 2f800fbd777b ("writeback: fix dirtied pages accounting on redirty") Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-04-24writeback: add missing INITIAL_JIFFIES init in global_update_bandwidth()Tejun Heo
[ Upstream commit 7d70e15480c0450d2bfafaad338a32e884fc215e ] global_update_bandwidth() uses static variable update_time as the timestamp for the last update but forgets to initialize it to INITIALIZE_JIFFIES. This means that global_dirty_limit will be 5 mins into the future on 32bit and some large amount jiffies into the past on 64bit. This isn't critical as the only effect is that global_dirty_limit won't be updated for the first 5 mins after booting on 32bit machines, especially given the auxiliary nature of global_dirty_limit's role - protecting against global dirty threshold's sudden dips; however, it does lead to unintended suboptimal behavior. Fix it. Fixes: c42843f2f0bb ("writeback: introduce smoothed global dirty limit") Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-04-24mm/page_alloc.c: call kernel_map_pages in unset_migrateype_isolateLaura Abbott
[ Upstream commit cfa869438282be84ad4110bba5027ef1fbbe71e4 ] Commit 3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated pageblock") changed the logic of unset_migratetype_isolate to check the buddy allocator and explicitly call __free_pages to merge. The page that is being freed in this path never had prep_new_page called so set_page_refcounted is called explicitly but there is no call to kernel_map_pages. With the default kernel_map_pages this is mostly harmless but if kernel_map_pages does any manipulation of the page tables (unmapping or setting pages to read only) this may trigger a fault: alloc_contig_range test_pages_isolated(ceb00, ced00) failed Unable to handle kernel paging request at virtual address ffffffc0cec00000 pgd = ffffffc045fc4000 [ffffffc0cec00000] *pgd=0000000000000000 Internal error: Oops: 9600004f [#1] PREEMPT SMP Modules linked in: exfatfs CPU: 1 PID: 23237 Comm: TimedEventQueue Not tainted 3.10.49-gc72ad36-dirty #1 task: ffffffc03de52100 ti: ffffffc015388000 task.ti: ffffffc015388000 PC is at memset+0xc8/0x1c0 LR is at kernel_map_pages+0x1ec/0x244 Fix this by calling kernel_map_pages to ensure the page is set in the page table properly Fixes: 3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated pageblock") Signed-off-by: Laura Abbott <lauraa@codeaurora.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Gioh Kim <gioh.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-04-24mm/memory hotplug: postpone the reset of obsolete pgdatGu Zheng
[ Upstream commit b0dc3a342af36f95a68fe229b8f0f73552c5ca08 ] Qiu Xishi reported the following BUG when testing hot-add/hot-remove node under stress condition: BUG: unable to handle kernel paging request at 0000000000025f60 IP: next_online_pgdat+0x1/0x50 PGD 0 Oops: 0000 [#1] SMP ACPI: Device does not support D3cold Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf] CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G O 3.10.15-5885-euler0302 #1 Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015 Workqueue: events vmstat_update task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000 RIP: 0010: next_online_pgdat+0x1/0x50 RSP: 0018:ffffa800d32afce8 EFLAGS: 00010286 RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082 RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000 RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96 R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440 R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800 FS: 0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: refresh_cpu_vm_stats+0xd0/0x140 vmstat_update+0x11/0x50 process_one_work+0x194/0x3d0 worker_thread+0x12b/0x410 kthread+0xc6/0xd0 ret_from_fork+0x7c/0xb0 The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of try_offline_node, which will reset all the content of pgdat to 0, as the pgdat is accessed lock-free, so that the users still using the pgdat will panic, such as the vmstat_update routine. process A: offline node XX: vmstat_updat() refresh_cpu_vm_stats() for_each_populated_zone() find online node XX cond_resched() offline cpu and memory, then try_offline_node() node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat)) zone = next_zone(zone) pg_data_t *pgdat = zone->zone_pgdat; // here pgdat is NULL now next_online_pgdat(pgdat) next_online_node(pgdat->node_id); // NULL pointer access So the solution here is postponing the reset of obsolete pgdat from try_offline_node() to hotadd_new_pgdat(), and just resetting pgdat->nr_zones and pgdat->classzone_idx to be 0 rather than the memset 0 to avoid breaking pointer information in pgdat. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Reported-by: Xishi Qiu <qiuxishi@huawei.com> Suggested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Xie XiuQi <xiexiuqi@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-04-24mm: fix anon_vma->degree underflow in anon_vma endless growing preventionLeon Yu
[ Upstream commit 3fe89b3e2a7bbf3e97657104b9b33a9d81b950b3 ] I have constantly stumbled upon "kernel BUG at mm/rmap.c:399!" after upgrading to 3.19 and had no luck with 4.0-rc1 neither. So, after looking into new logic introduced by commit 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy"), I found chances are that unlink_anon_vmas() is called without incrementing dst->anon_vma->degree in anon_vma_clone() due to allocation failure. If dst->anon_vma is not NULL in error path, its degree will be incorrectly decremented in unlink_anon_vmas() and eventually underflow when exiting as a result of another call to unlink_anon_vmas(). That's how "kernel BUG at mm/rmap.c:399!" is triggered for me. This patch fixes the underflow by dropping dst->anon_vma when allocation fails. It's safe to do so regardless of original value of dst->anon_vma because dst->anon_vma doesn't have valid meaning if anon_vma_clone() fails. Besides, callers don't care dst->anon_vma in such case neither. Also suggested by Michal Hocko, we can clean up vma_adjust() a bit as anon_vma_clone() now does the work. [akpm@linux-foundation.org: tweak comment] Fixes: 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy") Signed-off-by: Leon Yu <chianglungyu@gmail.com> Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-04-24mm: fix corner case in anon_vma endless growing preventionKonstantin Khlebnikov
[ Upstream commit b800c91a0517071156e772d4fb329ad33590da62 ] Fix for BUG_ON(anon_vma->degree) splashes in unlink_anon_vmas() ("kernel BUG at mm/rmap.c:399!") caused by commit 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy") Anon_vma_clone() is usually called for a copy of source vma in destination argument. If source vma has anon_vma it should be already in dst->anon_vma. NULL in dst->anon_vma is used as a sign that it's called from anon_vma_fork(). In this case anon_vma_clone() finds anon_vma for reusing. Vma_adjust() calls it differently and this breaks anon_vma reusing logic: anon_vma_clone() links vma to old anon_vma and updates degree counters but vma_adjust() overrides vma->anon_vma right after that. As a result final unlink_anon_vmas() decrements degree for wrong anon_vma. This patch assigns ->anon_vma before calling anon_vma_clone(). Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-and-tested-by: Chris Clayton <chris2553@googlemail.com> Reported-and-tested-by: Oded Gabbay <oded.gabbay@amd.com> Reported-and-tested-by: Chih-Wei Huang <cwhuang@android-x86.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Daniel Forrest <dan.forrest@ssec.wisc.edu> Cc: Michal Hocko <mhocko@suse.cz> Cc: stable@vger.kernel.org # to match back-porting of 7a3ef208e662 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-04-24mm: prevent endless growth of anon_vma hierarchyKonstantin Khlebnikov
[ Upstream commit 7a3ef208e662f4b63d43a23f61a64a129c525bbc ] Constantly forking task causes unlimited grow of anon_vma chain. Each next child allocates new level of anon_vmas and links vma to all previous levels because pages might be inherited from any level. This patch adds heuristic which decides to reuse existing anon_vma instead of forking new one. It adds counter anon_vma->degree which counts linked vmas and directly descending anon_vmas and reuses anon_vma if counter is lower than two. As a result each anon_vma has either vma or at least two descending anon_vmas. In such trees half of nodes are leafs with alive vmas, thus count of anon_vmas is no more than two times bigger than count of vmas. This heuristic reuses anon_vmas as few as possible because each reuse adds false aliasing among vmas and rmap walker ought to scan more ptes when it searches where page is might be mapped. Link: http://lkml.kernel.org/r/20120816024610.GA5350@evergreen.ssec.wisc.edu Fixes: 5beb49305251 ("mm: change anon_vma linking to fix multi-process server scalability issue") [akpm@linux-foundation.org: fix typo, per Rik] Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-by: Daniel Forrest <dan.forrest@ssec.wisc.edu> Tested-by: Michal Hocko <mhocko@suse.cz> Tested-by: Jerome Marchand <jmarchan@redhat.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> [2.6.34+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-03-28mmu_gather: move minimal range calculations into generic codeWill Deacon
[ Upstream commit fb7332a9fedfd62b1ba6530c86f39f0fa38afd49 ] On architectures with hardware broadcasting of TLB invalidation messages , it makes sense to reduce the range of the mmu_gather structure when unmapping page ranges based on the dirty address information passed to tlb_remove_tlb_entry. arm64 already does this by directly manipulating the start/end fields of the gather structure, but this confuses the generic code which does not expect these fields to change and can end up calculating invalid, negative ranges when forcing a flush in zap_pte_range. This patch moves the minimal range calculation out of the arm64 code and into the generic implementation, simplifying zap_pte_range in the process (which no longer needs to care about start/end, since they will point to the appropriate ranges already). With the range being tracked by core code, the need_flush flag is dropped in favour of checking that the end of the range has actually been set. Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King - ARM Linux <linux@arm.linux.org.uk> Cc: Michal Simek <monstr@monstr.eu> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-03-23vmstat: do not use deferrable delayed work for vmstat_updateMichal Hocko
commit ba4877b9ca51f80b5d30f304a46762f0509e1635 upstream. Vinayak Menon has reported that an excessive number of tasks was throttled in the direct reclaim inside too_many_isolated() because NR_ISOLATED_FILE was relatively high compared to NR_INACTIVE_FILE. However it turned out that the real number of NR_ISOLATED_FILE was 0 and the per-cpu vm_stat_diff wasn't transferred into the global counter. vmstat_work which is responsible for the sync is defined as deferrable delayed work which means that the defined timeout doesn't wake up an idle CPU. A CPU might stay in an idle state for a long time and general effort is to keep such a CPU in this state as long as possible which might lead to all sorts of troubles for vmstat consumers as can be seen with the excessive direct reclaim throttling. This patch basically reverts 39bf6270f524 ("VM statistics: Make timer deferrable") but it shouldn't cause any problems for idle CPUs because only CPUs with an active per-cpu drift are woken up since 7cc36bbddde5 ("vmstat: on-demand vmstat workers v8") and CPUs which are idle for a longer time shouldn't have per-cpu drift. Fixes: 39bf6270f524 (VM statistics: Make timer deferrable) Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Vinayak Menon <vinmenon@codeaurora.org> Acked-by: Christoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-14mm: fix negative nr_isolated countsHugh Dickins
commit ff59909a077b3c51c168cb658601c6b63136a347 upstream. The vmstat interfaces are good at hiding negative counts (at least when CONFIG_SMP); but if you peer behind the curtain, you find that nr_isolated_anon and nr_isolated_file soon go negative, and grow ever more negative: so they can absorb larger and larger numbers of isolated pages, yet still appear to be zero. I'm happy to avoid a congestion_wait() when too_many_isolated() myself; but I guess it's there for a good reason, in which case we ought to get too_many_isolated() working again. The imbalance comes from isolate_migratepages()'s ISOLATE_ABORT case: putback_movable_pages() decrements the NR_ISOLATED counts, but we forgot to call acct_isolated() to increment them. It is possible that the bug whcih this patch fixes could cause OOM kills when the system still has a lot of reclaimable page cache. Fixes: edc2ca612496 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()") Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-03-14mm/memory.c: actually remap enough memoryGrazvydas Ignotas
commit 9cb12d7b4ccaa976f97ce0c5fd0f1b6a83bc2a75 upstream. For whatever reason, generic_access_phys() only remaps one page, but actually allows to access arbitrary size. It's quite easy to trigger large reads, like printing out large structure with gdb, which leads to a crash. Fix it by remapping correct size. Fixes: 28b2ee20c7cb ("access_process_vm device memory infrastructure") Signed-off-by: Grazvydas Ignotas <notasas@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-03-14mm/compaction: fix wrong order check in compact_finished()Joonsoo Kim
commit 372549c2a3778fd3df445819811c944ad54609ca upstream. What we want to check here is whether there is highorder freepage in buddy list of other migratetype in order to steal it without fragmentation. But, current code just checks cc->order which means allocation request order. So, this is wrong. Without this fix, non-movable synchronous compaction below pageblock order would not stopped until compaction is complete, because migratetype of most pageblocks are movable and high order freepage made by compaction is usually on movable type buddy list. There is some report related to this bug. See below link. http://www.spinics.net/lists/linux-mm/msg81666.html Although the issued system still has load spike comes from compaction, this makes that system completely stable and responsive according to his report. stress-highalloc test in mmtests with non movable order 7 allocation doesn't show any notable difference in allocation success rate, but, it shows more compaction success rate. Compaction success rate (Compaction success * 100 / Compaction stalls, %) 18.47 : 28.94 Fixes: 1fb3f8ca0e92 ("mm: compaction: capture a suitable high-order page immediately when it is made available") Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-03-14mm/nommu.c: fix arithmetic overflow in __vm_enough_memory()Roman Gushchin
commit 8138a67a5557ffea3a21dfd6f037842d4e748513 upstream. I noticed that "allowed" can easily overflow by falling below 0, because (total_vm / 32) can be larger than "allowed". The problem occurs in OVERCOMMIT_NONE mode. In this case, a huge allocation can success and overcommit the system (despite OVERCOMMIT_NONE mode). All subsequent allocations will fall (system-wide), so system become unusable. The problem was masked out by commit c9b1d0981fcc ("mm: limit growth of 3% hardcoded other user reserve"), but it's easy to reproduce it on older kernels: 1) set overcommit_memory sysctl to 2 2) mmap() large file multiple times (with VM_SHARED flag) 3) try to malloc() large amount of memory It also can be reproduced on newer kernels, but miss-configured sysctl_user_reserve_kbytes is required. Fix this issue by switching to signed arithmetic here. Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Andrew Shewmaker <agshew@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-03-14mm/mmap.c: fix arithmetic overflow in __vm_enough_memory()Roman Gushchin
commit 5703b087dc8eaf47bfb399d6cf512d471beff405 upstream. I noticed, that "allowed" can easily overflow by falling below 0, because (total_vm / 32) can be larger than "allowed". The problem occurs in OVERCOMMIT_NONE mode. In this case, a huge allocation can success and overcommit the system (despite OVERCOMMIT_NONE mode). All subsequent allocations will fall (system-wide), so system become unusable. The problem was masked out by commit c9b1d0981fcc ("mm: limit growth of 3% hardcoded other user reserve"), but it's easy to reproduce it on older kernels: 1) set overcommit_memory sysctl to 2 2) mmap() large file multiple times (with VM_SHARED flag) 3) try to malloc() large amount of memory It also can be reproduced on newer kernels, but miss-configured sysctl_user_reserve_kbytes is required. Fix this issue by switching to signed arithmetic here. [akpm@linux-foundation.org: use min_t] Signed-off-by: Roman Gushchin <klamm@yandex-team.ru> Cc: Andrew Shewmaker <agshew@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-03-14mm: when stealing freepages, also take pages created by splitting buddy pageVlastimil Babka
commit 99592d598eca62bdbbf62b59941c189176dfc614 upstream. When studying page stealing, I noticed some weird looking decisions in try_to_steal_freepages(). The first I assume is a bug (Patch 1), the following two patches were driven by evaluation. Testing was done with stress-highalloc of mmtests, using the mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how often page stealing occurs for individual migratetypes, and what migratetypes are used for fallbacks. Arguably, the worst case of page stealing is when UNMOVABLE allocation steals from MOVABLE pageblock. RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal, so the goal is to minimize these two cases. The evaluation of v2 wasn't always clear win and Joonsoo questioned the results. Here I used different baseline which includes RFC compaction improvements from [1]. I found that the compaction improvements reduce variability of stress-highalloc, so there's less noise in the data. First, let's look at stress-highalloc configured to do sync compaction, and how these patches reduce page stealing events during the test. First column is after fresh reboot, other two are reiterations of test without reboot. That was all accumulater over 5 re-iterations (so the benchmark was run 5x3 times with 5 fresh restarts). Baseline: 3.19-rc4 3.19-rc4 3.19-rc4 5-nothp-1 5-nothp-2 5-nothp-3 Page alloc extfrag event 10264225 8702233 10244125 Extfrag fragmenting 10263271 8701552 10243473 Extfrag fragmenting for unmovable 13595 17616 15960 Extfrag fragmenting unmovable placed with movable 7989 12193 8447 Extfrag fragmenting for reclaimable 658 1840 1817 Extfrag fragmenting reclaimable placed with movable 558 1677 1679 Extfrag fragmenting for movable 10249018 8682096 10225696 With Patch 1: 3.19-rc4 3.19-rc4 3.19-rc4 6-nothp-1 6-nothp-2 6-nothp-3 Page alloc extfrag event 11834954 9877523 9774860 Extfrag fragmenting 11833993 9876880 9774245 Extfrag fragmenting for unmovable 7342 16129 11712 Extfrag fragmenting unmovable placed with movable 4191 10547 6270 Extfrag fragmenting for reclaimable 373 1130 923 Extfrag fragmenting reclaimable placed with movable 302 906 738 Extfrag fragmenting for movable 11826278 9859621 9761610 With Patch 2: 3.19-rc4 3.19-rc4 3.19-rc4 7-nothp-1 7-nothp-2 7-nothp-3 Page alloc extfrag event 4725990 3668793 3807436 Extfrag fragmenting 4725104 3668252 3806898 Extfrag fragmenting for unmovable 6678 7974 7281 Extfrag fragmenting unmovable placed with movable 2051 3829 4017 Extfrag fragmenting for reclaimable 429 1208 1278 Extfrag fragmenting reclaimable placed with movable 369 976 1034 Extfrag fragmenting for movable 4717997 3659070 3798339 With Patch 3: 3.19-rc4 3.19-rc4 3.19-rc4 8-nothp-1 8-nothp-2 8-nothp-3 Page alloc extfrag event 5016183 4700142 3850633 Extfrag fragmenting 5015325 4699613 3850072 Extfrag fragmenting for unmovable 1312 3154 3088 Extfrag fragmenting unmovable placed with movable 1115 2777 2714 Extfrag fragmenting for reclaimable 437 1193 1097 Extfrag fragmenting reclaimable placed with movable 330 969 879 Extfrag fragmenting for movable 5013576 4695266 3845887 In v2 we've seen apparent regression with Patch 1 for unmovable events, this is now gone, suggesting it was indeed noise. Here, each patch improves the situation for unmovable events. Reclaimable is improved by patch 1 and then either the same modulo noise, or perhaps sligtly worse - a small price for unmovable improvements, IMHO. The number of movable allocations falling back to other migratetypes is most noisy, but it's reduced to half at Patch 2 nevertheless. These are least critical as compaction can move them around. If we look at success rates, the patches don't affect them, that didn't change. Baseline: 3.19-rc4 3.19-rc4 3.19-rc4 5-nothp-1 5-nothp-2 5-nothp-3 Success 1 Min 49.00 ( 0.00%) 42.00 ( 14.29%) 41.00 ( 16.33%) Success 1 Mean 51.00 ( 0.00%) 45.00 ( 11.76%) 42.60 ( 16.47%) Success 1 Max 55.00 ( 0.00%) 51.00 ( 7.27%) 46.00 ( 16.36%) Success 2 Min 53.00 ( 0.00%) 47.00 ( 11.32%) 44.00 ( 16.98%) Success 2 Mean 59.60 ( 0.00%) 50.80 ( 14.77%) 48.20 ( 19.13%) Success 2 Max 64.00 ( 0.00%) 56.00 ( 12.50%) 52.00 ( 18.75%) Success 3 Min 84.00 ( 0.00%) 82.00 ( 2.38%) 78.00 ( 7.14%) Success 3 Mean 85.60 ( 0.00%) 82.80 ( 3.27%) 79.40 ( 7.24%) Success 3 Max 86.00 ( 0.00%) 83.00 ( 3.49%) 80.00 ( 6.98%) Patch 1: 3.19-rc4 3.19-rc4 3.19-rc4 6-nothp-1 6-nothp-2 6-nothp-3 Success 1 Min 49.00 ( 0.00%) 44.00 ( 10.20%) 44.00 ( 10.20%) Success 1 Mean 51.80 ( 0.00%) 46.00 ( 11.20%) 45.80 ( 11.58%) Success 1 Max 54.00 ( 0.00%) 49.00 ( 9.26%) 49.00 ( 9.26%) Success 2 Min 58.00 ( 0.00%) 49.00 ( 15.52%) 48.00 ( 17.24%) Success 2 Mean 60.40 ( 0.00%) 51.80 ( 14.24%) 50.80 ( 15.89%) Success 2 Max 63.00 ( 0.00%) 54.00 ( 14.29%) 55.00 ( 12.70%) Success 3 Min 84.00 ( 0.00%) 81.00 ( 3.57%) 79.00 ( 5.95%) Success 3 Mean 85.00 ( 0.00%) 81.60 ( 4.00%) 79.80 ( 6.12%) Success 3 Max 86.00 ( 0.00%) 82.00 ( 4.65%) 82.00 ( 4.65%) Patch 2: 3.19-rc4 3.19-rc4 3.19-rc4 7-nothp-1 7-nothp-2 7-nothp-3 Success 1 Min 50.00 ( 0.00%) 44.00 ( 12.00%) 39.00 ( 22.00%) Success 1 Mean 52.80 ( 0.00%) 45.60 ( 13.64%) 42.40 ( 19.70%) Success 1 Max 55.00 ( 0.00%) 46.00 ( 16.36%) 47.00 ( 14.55%) Success 2 Min 52.00 ( 0.00%) 48.00 ( 7.69%) 45.00 ( 13.46%) Success 2 Mean 53.40 ( 0.00%) 49.80 ( 6.74%) 48.80 ( 8.61%) Success 2 Max 57.00 ( 0.00%) 52.00 ( 8.77%) 52.00 ( 8.77%) Success 3 Min 84.00 ( 0.00%) 81.00 ( 3.57%) 79.00 ( 5.95%) Success 3 Mean 85.00 ( 0.00%) 82.40 ( 3.06%) 79.60 ( 6.35%) Success 3 Max 86.00 ( 0.00%) 83.00 ( 3.49%) 80.00 ( 6.98%) Patch 3: 3.19-rc4 3.19-rc4 3.19-rc4 8-nothp-1 8-nothp-2 8-nothp-3 Success 1 Min 46.00 ( 0.00%) 44.00 ( 4.35%) 42.00 ( 8.70%) Success 1 Mean 50.20 ( 0.00%) 45.60 ( 9.16%) 44.00 ( 12.35%) Success 1 Max 52.00 ( 0.00%) 47.00 ( 9.62%) 47.00 ( 9.62%) Success 2 Min 53.00 ( 0.00%) 49.00 ( 7.55%) 48.00 ( 9.43%) Success 2 Mean 55.80 ( 0.00%) 50.60 ( 9.32%) 49.00 ( 12.19%) Success 2 Max 59.00 ( 0.00%) 52.00 ( 11.86%) 51.00 ( 13.56%) Success 3 Min 84.00 ( 0.00%) 80.00 ( 4.76%) 79.00 ( 5.95%) Success 3 Mean 85.40 ( 0.00%) 81.60 ( 4.45%) 80.40 ( 5.85%) Success 3 Max 87.00 ( 0.00%) 83.00 ( 4.60%) 82.00 ( 5.75%) While there's no improvement here, I consider reduced fragmentation events to be worth on its own. Patch 2 also seems to reduce scanning for free pages, and migrations in compaction, suggesting it has somewhat less work to do: Patch 1: Compaction stalls 4153 3959 3978 Compaction success 1523 1441 1446 Compaction failures 2630 2517 2531 Page migrate success 4600827 4943120 5104348 Page migrate failure 19763 16656 17806 Compaction pages isolated 9597640 10305617 10653541 Compaction migrate scanned 77828948 86533283 87137064 Compaction free scanned 517758295 521312840 521462251 Compaction cost 5503 5932 6110 Patch 2: Compaction stalls 3800 3450 3518 Compaction success 1421 1316 1317 Compaction failures 2379 2134 2201 Page migrate success 4160421 4502708 4752148 Page migrate failure 19705 14340 14911 Compaction pages isolated 8731983 9382374 9910043 Compaction migrate scanned 98362797 96349194 98609686 Compaction free scanned 496512560 469502017 480442545 Compaction cost 5173 5526 5811 As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers of unmovable and reclaimable pageblocks. Configuring the benchmark to allocate like THP page fault (i.e. no sync compaction) gives much noisier results for iterations 2 and 3 after reboot. This is not so surprising given how [1] offers lower improvements in this scenario due to less restarts after deferred compaction which would change compaction pivot. Baseline: 3.19-rc4 3.19-rc4 3.19-rc4 5-thp-1 5-thp-2 5-thp-3 Page alloc extfrag event 8148965 6227815 6646741 Extfrag fragmenting 8147872 6227130 6646117 Extfrag fragmenting for unmovable 10324 12942 15975 Extfrag fragmenting unmovable placed with movable 5972 8495 10907 Extfrag fragmenting for reclaimable 601 1707 2210 Extfrag fragmenting reclaimable placed with movable 520 1570 2000 Extfrag fragmenting for movable 8136947 6212481 6627932 Patch 1: 3.19-rc4 3.19-rc4 3.19-rc4 6-thp-1 6-thp-2 6-thp-3 Page alloc extfrag event 8345457 7574471 7020419 Extfrag fragmenting 8343546 7573777 7019718 Extfrag fragmenting for unmovable 10256 18535 30716 Extfrag fragmenting unmovable placed with movable 6893 11726 22181 Extfrag fragmenting for reclaimable 465 1208 1023 Extfrag fragmenting reclaimable placed with movable 353 996 843 Extfrag fragmenting for movable 8332825 7554034 6987979 Patch 2: 3.19-rc4 3.19-rc4 3.19-rc4 7-thp-1 7-thp-2 7-thp-3 Page alloc extfrag event 3512847 3020756 2891625 Extfrag fragmenting 3511940 3020185 2891059 Extfrag fragmenting for unmovable 9017 6892 6191 Extfrag fragmenting unmovable placed with movable 1524 3053 2435 Extfrag fragmenting for reclaimable 445 1081 1160 Extfrag fragmenting reclaimable placed with movable 375 918 986 Extfrag fragmenting for movable 3502478 3012212 2883708 Patch 3: 3.19-rc4 3.19-rc4 3.19-rc4 8-thp-1 8-thp-2 8-thp-3 Page alloc extfrag event 3181699 3082881 2674164 Extfrag fragmenting 3180812 3082303 2673611 Extfrag fragmenting for unmovable 1201 4031 4040 Extfrag fragmenting unmovable placed with movable 974 3611 3645 Extfrag fragmenting for reclaimable 478 1165 1294 Extfrag fragmenting reclaimable placed with movable 387 985 1030 Extfrag fragmenting for movable 3179133 3077107 2668277 The improvements for first iteration are clear, the rest is much noisier and can appear like regression for Patch 1. Anyway, patch 2 rectifies it. Allocation success rates are again unaffected so there's no point in making this e-mail any longer. [1] http://marc.info/?l=linux-mm&m=142166196321125&w=2 This patch (of 3): When __rmqueue_fallback() is called to allocate a page of order X, it will find a page of order Y >= X of a fallback migratetype, which is different from the desired migratetype. With the help of try_to_steal_freepages(), it may change the migratetype (to the desired one) also of: 1) all currently free pages in the pageblock containing the fallback page 2) the fallback pageblock itself 3) buddy pages created by splitting the fallback page (when Y > X) These decisions take the order Y into account, as well as the desired migratetype, with the goal of preventing multiple fallback allocations that could e.g. distribute UNMOVABLE allocations among multiple pageblocks. Originally, decision for 1) has implied the decision for 3). Commit 47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") changed that (probably unintentionally) so that the buddy pages in case 3) are always changed to the desired migratetype, except for CMA pageblocks. Commit fef903efcf0c ("mm/page_allo.c: restructure free-page stealing code and fix a bug") did some refactoring and added a comment that the case of 3) is intended. Commit 0cbef29a7821 ("mm: __rmqueue_fallback() should respect pageblock type") removed the comment and tried to restore the original behavior where 1) implies 3), but due to the previous refactoring, the result is instead that only 2) implies 3) - and the conditions for 2) are less frequently met than conditions for 1). This may increase fragmentation in situations where the code decides to steal all free pages from the pageblock (case 1)), but then gives back the buddy pages produced by splitting. This patch restores the original intended logic where 1) implies 3). During testing with stress-highalloc from mmtests, this has shown to decrease the number of events where UNMOVABLE and RECLAIMABLE allocations steal from MOVABLE pageblocks, which can lead to permanent fragmentation. In some cases it has increased the number of events when MOVABLE allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are fixable by sync compaction and thus less harmful. Note that evaluation has shown that the behavior introduced by 47118af076f6 for buddy pages in case 3) is actually even better than the original logic, so the following patch will introduce it properly once again. For stable backports of this patch it makes thus sense to only fix versions containing 0cbef29a7821. [iamjoonsoo.kim@lge.com: tracepoint fix] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-03-14mm/hugetlb: add migration entry check in __unmap_hugepage_rangeNaoya Horiguchi
commit 9fbc1f635fd0bd28cb32550211bf095753ac637a upstream. If __unmap_hugepage_range() tries to unmap the address range over which hugepage migration is on the way, we get the wrong page because pte_page() doesn't work for migration entries. This patch simply clears the pte for migration entries as we do for hwpoison entries. Fixes: 290408d4a2 ("hugetlb: hugepage migration core") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-03-14mm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protectionNaoya Horiguchi
commit a8bda28d87c38c6aa93de28ba5d30cc18e865a11 upstream. There is a race condition between hugepage migration and change_protection(), where hugetlb_change_protection() doesn't care about migration entries and wrongly overwrites them. That causes unexpected results like kernel crash. HWPoison entries also can cause the same problem. This patch adds is_hugetlb_entry_(migration|hwpoisoned) check in this function to do proper actions. Fixes: 290408d4a2 ("hugetlb: hugepage migration core") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-03-14mm/hugetlb: fix getting refcount 0 page in hugetlb_fault()Naoya Horiguchi
commit 0f792cf949a0be506c2aa8bfac0605746b146dda upstream. When running the test which causes the race as shown in the previous patch, we can hit the BUG "get_page() on refcount 0 page" in hugetlb_fault(). This race happens when pte turns into migration entry just after the first check of is_hugetlb_entry_migration() in hugetlb_fault() passed with false. To fix this, we need to check pte_present() again after huge_ptep_get(). This patch also reorders taking ptl and doing pte_page(), because pte_page() should be done in ptl. Due to this reordering, we need use trylock_page() in page != pagecache_page case to respect locking order. Fixes: 66aebce747ea ("hugetlb: fix race condition in hugetlb_fault()") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
2015-03-06mm/hugetlb: pmd_huge() returns true for non-present hugepageNaoya Horiguchi
commit cbef8478bee55775ac312a574aad48af7bb9cf9f upstream. Migrating hugepages and hwpoisoned hugepages are considered as non-present hugepages, and they are referenced via migration entries and hwpoison entries in their page table slots. This behavior causes race condition because pmd_huge() doesn't tell non-huge pages from migrating/hwpoisoned hugepages. follow_page_mask() is one example where the kernel would call follow_page_pte() for such hugepage while this function is supposed to handle only normal pages. To avoid this, this patch makes pmd_huge() return true when pmd_none() is true *and* pmd_present() is false. We don't have to worry about mixing up non-present pmd entry with normal pmd (pointing to leaf level pte entry) because pmd_present() is true in normal pmd. The same race condition could happen in (x86-specific) gup_pmd_range(), where this patch simply adds pmd_present() check instead of pmd_huge(). This is because gup_pmd_range() is fast path. If we have non-present hugepage in this function, we will go into gup_huge_pmd(), then return 0 at flag mask check, and finally fall back to the slow path. Fixes: 290408d4a2 ("hugetlb: hugepage migration core") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Hogan <james.hogan@imgtec.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Steve Capper <steve.capper@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-11memcg, shmem: fix shmem migration to use lrucareMichal Hocko
commit f5e03a4989e80a86f8b514659dca8539132e6e09 upstream. It has been reported that 965GM might trigger VM_BUG_ON_PAGE(!lrucare && PageLRU(oldpage), oldpage) in mem_cgroup_migrate when shmem wants to replace a swap cache page because of shmem_should_replace_page (the page is allocated from an inappropriate zone). shmem_replace_page expects that the oldpage is not on LRU list and calls mem_cgroup_migrate without lrucare. This is obviously incorrect because swapcache pages might be on the LRU list (e.g. swapin readahead page). Fix this by enabling lrucare for the migration in shmem_replace_page. Also clarify that lrucare should be used even if one of the pages might be on LRU list. The BUG_ON will trigger only when CONFIG_DEBUG_VM is enabled but even without that the migration code might leave the old page on an inappropriate memcg' LRU which is not that critical because the page would get removed with its last reference but it is still confusing. Fixes: 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Chris Wilson <chris@chris-wilson.co.uk> Reported-by: Dave Airlie <airlied@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-11mm: pagewalk: call pte_hole() for VM_PFNMAP during walk_page_rangeShiraz Hashim
commit 23aaed6659df9adfabe9c583e67a36b54e21df46 upstream. walk_page_range() silently skips vma having VM_PFNMAP set, which leads to undesirable behaviour at client end (who called walk_page_range). Userspace applications get the wrong data, so the effect is like just confusing users (if the applications just display the data) or sometimes killing the processes (if the applications do something with misunderstanding virtual addresses due to the wrong data.) For example for pagemap_read, when no callbacks are called against VM_PFNMAP vma, pagemap_read may prepare pagemap data for next virtual address range at wrong index. Eventually userspace may get wrong pagemap data for a task. Corresponding to a VM_PFNMAP marked vma region, kernel may report mappings from subsequent vma regions. User space in turn may account more pages (than really are) to the task. In my case I was using procmem, procrack (Android utility) which uses pagemap interface to account RSS pages of a task. Due to this bug it was giving a wrong picture for vmas (with VM_PFNMAP set). Fixes: a9ff785e4437 ("mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas") Signed-off-by: Shiraz Hashim <shashim@codeaurora.org> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-05memcg: remove extra newlines from memcg oom kill logGreg Thelen
commit 0346dadbf041a2606bcb5bd27828b0d105897f4a upstream. Commit e61734c55c24 ("cgroup: remove cgroup->name") added two extra newlines to memcg oom kill log messages. This makes dmesg hard to read and parse. The issue affects 3.15+. Example: Task in /t <<< extra #1 killed as a result of limit of /t <<< extra #2 memory: usage 102400kB, limit 102400kB, failcnt 274712 Remove the extra newlines from memcg oom kill messages, so the messages look like: Task in /t killed as a result of limit of /t memory: usage 102400kB, limit 102400kB, failcnt 240649 Fixes: e61734c55c24 ("cgroup: remove cgroup->name") Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-05vm: make stack guard page errors return VM_FAULT_SIGSEGV rather than SIGBUSLinus Torvalds
commit 9c145c56d0c8a0b62e48c8d71e055ad0fb2012ba upstream. The stack guard page error case has long incorrectly caused a SIGBUS rather than a SIGSEGV, but nobody actually noticed until commit fee7e49d4514 ("mm: propagate error from stack expansion even for guard page") because that error case was never actually triggered in any normal situations. Now that we actually report the error, people noticed the wrong signal that resulted. So far, only the test suite of libsigsegv seems to have actually cared, but there are real applications that use libsigsegv, so let's not wait for any of those to break. Reported-and-tested-by: Takashi Iwai <tiwai@suse.de> Tested-by: Jan Engelhardt <jengelh@inai.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots" Cc: linux-arch@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-05vm: add VM_FAULT_SIGSEGV handling supportLinus Torvalds
commit 33692f27597fcab536d7cbbcc8f52905133e4aa7 upstream. The core VM already knows about VM_FAULT_SIGBUS, but cannot return a "you should SIGSEGV" error, because the SIGSEGV case was generally handled by the caller - usually the architecture fault handler. That results in lots of duplication - all the architecture fault handlers end up doing very similar "look up vma, check permissions, do retries etc" - but it generally works. However, there are cases where the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV. In particular, when accessing the stack guard page, libsigsegv expects a SIGSEGV. And it usually got one, because the stack growth is handled by that duplicated architecture fault handler. However, when the generic VM layer started propagating the error return from the stack expansion in commit fee7e49d4514 ("mm: propagate error from stack expansion even for guard page"), that now exposed the existing VM_FAULT_SIGBUS result to user space. And user space really expected SIGSEGV, not SIGBUS. To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those duplicate architecture fault handlers about it. They all already have the code to handle SIGSEGV, so it's about just tying that new return value to the existing code, but it's all a bit annoying. This is the mindless minimal patch to do this. A more extensive patch would be to try to gather up the mostly shared fault handling logic into one generic helper routine, and long-term we really should do that cleanup. Just from this patch, you can generally see that most architectures just copied (directly or indirectly) the old x86 way of doing things, but in the meantime that original x86 model has been improved to hold the VM semaphore for shorter times etc and to handle VM_FAULT_RETRY and other "newer" things, so it would be a good idea to bring all those improvements to the generic case and teach other architectures about them too. Reported-and-tested-by: Takashi Iwai <tiwai@suse.de> Tested-by: Jan Engelhardt <jengelh@inai.de> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots" Cc: linux-arch@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: get rid of radix tree gfp mask for pagecache_get_pageMichal Hocko
commit 45f87de57f8fad59302fd263dd81ffa4843b5b24 upstream. Commit 2457aec63745 ("mm: non-atomically mark page accessed during page cache allocation where possible") has added a separate parameter for specifying gfp mask for radix tree allocations. Not only this is less than optimal from the API point of view because it is error prone, it is also buggy currently because grab_cache_page_write_begin is using GFP_KERNEL for radix tree and if fgp_flags doesn't contain FGP_NOFS (mostly controlled by fs by AOP_FLAG_NOFS flag) but the mapping_gfp_mask has __GFP_FS cleared then the radix tree allocation wouldn't obey the restriction and might recurse into filesystem and cause deadlocks. This is the case for most filesystems unfortunately because only ext4 and gfs2 are using AOP_FLAG_NOFS. Let's simply remove radix_gfp_mask parameter because the allocation context is same for both page cache and for the radix tree. Just make sure that the radix tree gets only the sane subset of the mask (e.g. do not pass __GFP_WRITE). Long term it is more preferable to convert remaining users of AOP_FLAG_NOFS to use mapping_gfp_mask instead and simplify this interface even further. Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-16mm: Don't count the stack guard page towards RLIMIT_STACKLinus Torvalds
commit 690eac53daff34169a4d74fc7bfbd388c4896abb upstream. Commit fee7e49d4514 ("mm: propagate error from stack expansion even for guard page") made sure that we return the error properly for stack growth conditions. It also theorized that counting the guard page towards the stack limit might break something, but also said "Let's see if anybody notices". Somebody did notice. Apparently android-x86 sets the stack limit very close to the limit indeed, and including the guard page in the rlimit check causes the android 'zygote' process problems. So this adds the (fairly trivial) code to make the stack rlimit check be against the actual real stack size, rather than the size of the vma that includes the guard page. Reported-and-tested-by: Chih-Wei Huang <cwhuang@android-x86.org> Cc: Jay Foad <jay.foad@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-16mm: propagate error from stack expansion even for guard pageLinus Torvalds
commit fee7e49d45149fba60156f5b59014f764d3e3728 upstream. Jay Foad reports that the address sanitizer test (asan) sometimes gets confused by a stack pointer that ends up being outside the stack vma that is reported by /proc/maps. This happens due to an interaction between RLIMIT_STACK and the guard page: when we do the guard page check, we ignore the potential error from the stack expansion, which effectively results in a missing guard page, since the expected stack expansion won't have been done. And since /proc/maps explicitly ignores the guard page (commit d7824370e263: "mm: fix up some user-visible effects of the stack guard page"), the stack pointer ends up being outside the reported stack area. This is the minimal patch: it just propagates the error. It also effectively makes the guard page part of the stack limit, which in turn measn that the actual real stack is one page less than the stack limit. Let's see if anybody notices. We could teach acct_stack_growth() to allow an extra page for a grow-up/grow-down stack in the rlimit test, but I don't want to add more complexity if it isn't needed. Reported-and-tested-by: Jay Foad <jay.foad@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-16mm, vmscan: prevent kswapd livelock due to pfmemalloc-throttled process ↵Vlastimil Babka
being killed commit 9e5e3661727eaf960d3480213f8e87c8d67b6956 upstream. Charles Shirron and Paul Cassella from Cray Inc have reported kswapd stuck in a busy loop with nothing left to balance, but kswapd_try_to_sleep() failing to sleep. Their analysis found the cause to be a combination of several factors: 1. A process is waiting in throttle_direct_reclaim() on pgdat->pfmemalloc_wait 2. The process has been killed (by OOM in this case), but has not yet been scheduled to remove itself from the waitqueue and die. 3. kswapd checks for throttled processes in prepare_kswapd_sleep(): if (waitqueue_active(&pgdat->pfmemalloc_wait)) { wake_up(&pgdat->pfmemalloc_wait); return false; // kswapd will not go to sleep } However, for a process that was already killed, wake_up() does not remove the process from the waitqueue, since try_to_wake_up() checks its state first and returns false when the process is no longer waiting. 4. kswapd is running on the same CPU as the only CPU that the process is allowed to run on (through cpus_allowed, or possibly single-cpu system). 5. CONFIG_PREEMPT_NONE=y kernel is used. If there's nothing to balance, kswapd encounters no voluntary preemption points and repeatedly fails prepare_kswapd_sleep(), blocking the process from running and removing itself from the waitqueue, which would let kswapd sleep. So, the source of the problem is that we prevent kswapd from going to sleep until there are processes waiting on the pfmemalloc_wait queue, and a process waiting on a queue is guaranteed to be removed from the queue only when it gets scheduled. This was done to make sure that no process is left sleeping on pfmemalloc_wait when kswapd itself goes to sleep. However, it isn't necessary to postpone kswapd sleep until the pfmemalloc_wait queue actually empties. To prevent processes from being left sleeping, it's actually enough to guarantee that all processes waiting on pfmemalloc_wait queue have been woken up by the time we put kswapd to sleep. This patch therefore fixes this issue by substituting 'wake_up' with 'wake_up_all' and removing 'return false' in the code snippet from prepare_kswapd_sleep() above. Note that if any process puts itself in the queue after this waitqueue_active() check, or after the wake up itself, it means that the process will also wake up kswapd - and since we are under prepare_to_wait(), the wake up won't be missed. Also we update the comment prepare_kswapd_sleep() to hopefully more clearly describe the races it is preventing. Fixes: 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-16mm: protect set_page_dirty() from ongoing truncationJohannes Weiner
commit 2d6d7f98284648c5ed113fe22a132148950b140f upstream. Tejun, while reviewing the code, spotted the following race condition between the dirtying and truncation of a page: __set_page_dirty_nobuffers() __delete_from_page_cache() if (TestSetPageDirty(page)) page->mapping = NULL if (PageDirty()) dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); if (page->mapping) account_page_dirtied(page) __inc_zone_page_state(page, NR_FILE_DIRTY); __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE. Dirtiers usually lock out truncation, either by holding the page lock directly, or in case of zap_pte_range(), by pinning the mapcount with the page table lock held. The notable exception to this rule, though, is do_wp_page(), for which this race exists. However, do_wp_page() already waits for a locked page to unlock before setting the dirty bit, in order to prevent a race where clear_page_dirty() misses the page bit in the presence of dirty ptes. Upgrade that wait to a fully locked set_page_dirty() to also cover the situation explained above. Afterwards, the code in set_page_dirty() dealing with a truncation race is no longer needed. Remove it. Reported-by: Tejun Heo <tj@kernel.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-08mm/CMA: fix boot regression due to physical address of high_memoryJoonsoo Kim
commit 6b101e2a3ce4d2a0312087598bd1ab4a1db2ac40 upstream. high_memory isn't direct mapped memory so retrieving it's physical address isn't appropriate. But, it would be useful to check physical address of highmem boundary so it's justfiable to get physical address from it. In x86, there is a validation check if CONFIG_DEBUG_VIRTUAL and it triggers following boot failure reported by Ingo. ... BUG: Int 6: CR2 00f06f53 ... Call Trace: dump_stack+0x41/0x52 early_idt_handler+0x6b/0x6b cma_declare_contiguous+0x33/0x212 dma_contiguous_reserve_area+0x31/0x4e dma_contiguous_reserve+0x11d/0x125 setup_arch+0x7b5/0xb63 start_kernel+0xb8/0x3e6 i386_start_kernel+0x79/0x7d To fix boot regression, this patch implements workaround to avoid validation check in x86 when retrieving physical address of high_memory. __pa_nodebug() used by this patch is implemented only in x86 so there is no choice but to use dirty #ifdef. [akpm@linux-foundation.org: tweak comment] Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Ingo Molnar <mingo@kernel.org> Tested-by: Ingo Molnar <mingo@kernel.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-03slab: fix nodeid bounds check for non-contiguous node IDsPaul Mackerras
The bounds check for nodeid in ____cache_alloc_node gives false positives on machines where the node IDs are not contiguous, leading to a panic at boot time. For example, on a POWER8 machine the node IDs are typically 0, 1, 16 and 17. This means that num_online_nodes() returns 4, so when ____cache_alloc_node is called with nodeid = 16 the VM_BUG_ON triggers, like this: kernel BUG at /home/paulus/kernel/kvm/mm/slab.c:3079! Call Trace: .____cache_alloc_node+0x5c/0x270 (unreliable) .kmem_cache_alloc_node_trace+0xdc/0x360 .init_list+0x3c/0x128 .kmem_cache_init+0x1dc/0x258 .start_kernel+0x2a0/0x568 start_here_common+0x20/0xa8 To fix this, we instead compare the nodeid with MAX_NUMNODES, and additionally make sure it isn't negative (since nodeid is an int). The check is there mainly to protect the array dereference in the get_node() call in the next line, and the array being dereferenced is of size MAX_NUMNODES. If the nodeid is in range but invalid (for example if the node is off-line), the BUG_ON in the next line will catch that. Fixes: 14e50c6a9bc2 ("mm: slab: Verify the nodeid passed to ____cache_alloc_node") Signed-off-by: Paul Mackerras <paulus@samba.org> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-03mm: fix anon_vma_clone() error treatmentDaniel Forrest
Andrew Morton noticed that the error return from anon_vma_clone() was being dropped and replaced with -ENOMEM (which is not itself a bug because the only error return value from anon_vma_clone() is -ENOMEM). I did an audit of callers of anon_vma_clone() and discovered an actual bug where the error return was being lost. In __split_vma(), between Linux 3.11 and 3.12 the code was changed so the err variable is used before the call to anon_vma_clone() and the default initial value of -ENOMEM is overwritten. So a failure of anon_vma_clone() will return success since err at this point is now zero. Below is a patch which fixes this bug and also propagates the error return value from anon_vma_clone() in all cases. Fixes: ef0855d334e1 ("mm: mempolicy: turn vma_set_policy() into vma_dup_policy()") Signed-off-by: Daniel Forrest <dan.forrest@ssec.wisc.edu> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Tim Hartrick <tim@edgecast.com> Cc: Hugh Dickins <hughd@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-03mm: fix swapoff hang after page migration and forkHugh Dickins
I've been seeing swapoff hangs in recent testing: it's cycling around trying unsuccessfully to find an mm for some remaining pages of swap. I have been exercising swap and page migration more heavily recently, and now notice a long-standing error in copy_one_pte(): it's trying to add dst_mm to swapoff's mmlist when it finds a swap entry, but is doing so even when it's a migration entry or an hwpoison entry. Which wouldn't matter much, except it adds dst_mm next to src_mm, assuming src_mm is already on the mmlist: which may not be so. Then if pages are later swapped out from dst_mm, swapoff won't be able to find where to replace them. There's already a !non_swap_entry() test for stats: move that up before the swap_duplicate() and the addition to mmlist. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Kelley Nielsen <kelleynnn@gmail.com> Cc: <stable@vger.kernel.org> [2.6.18+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-02mm/vmpressure.c: fix race in vmpressure_work_fn()Andrew Morton
In some android devices, there will be a "divide by zero" exception. vmpr->scanned could be zero before spin_lock(&vmpr->sr_lock). Addresses https://bugzilla.kernel.org/show_bug.cgi?id=88051 [akpm@linux-foundation.org: neaten] Reported-by: ji_ang <ji_ang@163.com> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-02mm: frontswap: invalidate expired data on a dup-store failureWeijie Yang
If a frontswap dup-store failed, it should invalidate the expired page in the backend, or it could trigger some data corruption issue. Such as: 1. use zswap as the frontswap backend with writeback feature 2. store a swap page(version_1) to entry A, success 3. dup-store a newer page(version_2) to the same entry A, fail 4. use __swap_writepage() write version_2 page to swapfile, success 5. zswap do shrink, writeback version_1 page to swapfile 6. version_2 page is overwrited by version_1, data corrupt. This patch fixes this issue by invalidating expired data immediately when meet a dup-store failure. Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Seth Jennings <sjennings@variantweb.net> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Bob Liu <bob.liu@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-14Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs fix from Al Viro: "Fix for a really embarrassing braino in iov_iter. Kudos to paulus..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: Fix thinko in iov_iter_single_seg_count
2014-11-13mem-hotplug: reset node present pages when hot-adding a new pgdatTang Chen
When memory is hot-added, all the memory is in offline state. So clear all zones' present_pages because they will be updated in online_pages() and offline_pages(). Otherwise, /proc/zoneinfo will corrupt: When the memory of node2 is offline: # cat /proc/zoneinfo ...... Node 2, zone Movable ...... spanned 8388608 present 8388608 managed 0 When we online memory on node2: # cat /proc/zoneinfo ...... Node 2, zone Movable ...... spanned 8388608 present 16777216 managed 8388608 Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: <stable@vger.kernel.org> [3.16+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-13mem-hotplug: reset node managed pages when hot-adding a new pgdatTang Chen
In free_area_init_core(), zone->managed_pages is set to an approximate value for lowmem, and will be adjusted when the bootmem allocator frees pages into the buddy system. But free_area_init_core() is also called by hotadd_new_pgdat() when hot-adding memory. As a result, zone->managed_pages of the newly added node's pgdat is set to an approximate value in the very beginning. Even if the memory on that node has node been onlined, /sys/device/system/node/nodeXXX/meminfo has wrong value: hot-add node2 (memory not onlined) cat /sys/device/system/node/node2/meminfo Node 2 MemTotal: 33554432 kB Node 2 MemFree: 0 kB Node 2 MemUsed: 33554432 kB Node 2 Active: 0 kB This patch fixes this problem by reset node managed pages to 0 after hot-adding a new node. 1. Move reset_managed_pages_done from reset_node_managed_pages() to reset_all_zones_managed_pages() 2. Make reset_node_managed_pages() non-static 3. Call reset_node_managed_pages() in hotadd_new_pgdat() after pgdat is initialized Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: <stable@vger.kernel.org> [3.16+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-13mm/debug-pagealloc: correct freepage accounting and order resettingJoonsoo Kim
One thing I did in this patch is fixing freepage accounting. If we clear guard page and link it onto isolate buddy list, we should not increase freepage count. This patch adds conditional branch to skip counting in this case. Without this patch, this overcounting happens frequently if guard order is set and CMA is used. Another thing fixed in this patch is the target to reset order. In __free_one_page(), we check the buddy page whether it is a guard page or not. And, if so, we should clear guard attribute on the buddy page and reset order of it to 0. But, current code resets original page's order rather than buddy one's. Maybe, this doesn't have any problem, because whole merged page's order will be re-assigned soon. But, it is better to correct code. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Gioh Kim <gioh.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-13mm, compaction: prevent infinite loop in compact_zoneVlastimil Babka
Several people have reported occasionally seeing processes stuck in compact_zone(), even triggering soft lockups, in 3.18-rc2+. Testing a revert of commit e14c720efdd7 ("mm, compaction: remember position within pageblock in free pages scanner") fixed the issue, although the stuck processes do not appear to involve the free scanner. Finally, by code inspection, the bug was found in isolate_migratepages() which uses a slightly different condition to detect if the migration and free scanners have met, than compact_finished(). That has not been a problem until commit e14c720efdd7 allowed the free scanner position between individual invocations to be in the middle of a pageblock. In a relatively rare case, the migration scanner position can end up at the beginning of a pageblock, with the free scanner position in the middle of the same pageblock. If it's the migration scanner's turn, isolate_migratepages() exits immediately (without updating the position), while compact_finished() decides to continue compaction, resulting in a potentially infinite loop. The system can recover only if another process creates enough high-order pages to make the watermark checks in compact_finished() pass. This patch fixes the immediate problem by bumping the migration scanner's position to meet the free scanner in isolate_migratepages(), when both are within the same pageblock. This causes compact_finished() to terminate properly. A more robust check in compact_finished() is planned as a cleanup for better future maintainability. Fixes: e14c720efdd73 ("mm, compaction: remember position within pageblock in free pages scanner) Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reported-by: P. Christeas <xrg@linux.gr> Tested-by: P. Christeas <xrg@linux.gr> Link: http://marc.info/?l=linux-mm&m=141508604232522&w=2 Reported-by: Norbert Preining <preining@logic.at> Tested-by: Norbert Preining <preining@logic.at> Link: https://lkml.org/lkml/2014/11/4/904 Reported-by: Pavel Machek <pavel@ucw.cz> Link: https://lkml.org/lkml/2014/11/7/164 Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-13mm: alloc_contig_range: demote pages busy message from warn to infoMichal Nazarewicz
Having test_pages_isolated failure message as a warning confuses users into thinking that it is more serious than it really is. In reality, if called via CMA, allocation will be retried so a single test_pages_isolated failure does not prevent allocation from succeeding. Demote the warning message to an info message and reformat it such that the text "failed" does not appear and instead a less worrying "PFNS busy" is used. This message is trivially reproducible on a 10GB x86 machine on 3.16.y kernels configured with CONFIG_DMA_CMA. Signed-off-by: Michal Nazarewicz <mina86@mina86.com> Cc: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-13mm/slab: fix unalignment problem on Malta with EVA due to slab mergeJoonsoo Kim
Unlike SLUB, sometimes, object isn't started at the beginning of the slab in SLAB. This causes the unalignment problem after slab merging is supported by commit 12220dea07f1 ("mm/slab: support slab merge"). Following is the report from Markos that fail to boot on Malta with EVA. Calibrating delay loop... 19.86 BogoMIPS (lpj=99328) pid_max: default: 32768 minimum: 301 Mount-cache hash table entries: 4096 (order: 0, 16384 bytes) Mountpoint-cache hash table entries: 4096 (order: 0, 16384 bytes) Kernel bug detected[#1]: CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.17.0-05639-g12220dea07f1 #1631 task: 1f04f5d8 ti: 1f050000 task.ti: 1f050000 epc : 80141190 alloc_unbound_pwq+0x234/0x304 Not tainted ra : 80141184 alloc_unbound_pwq+0x228/0x304 Process swapper/0 (pid: 1, threadinfo=1f050000, task=1f04f5d8, tls=00000000) Call Trace: alloc_unbound_pwq+0x234/0x304 apply_workqueue_attrs+0x11c/0x294 __alloc_workqueue_key+0x23c/0x470 init_workqueues+0x320/0x400 do_one_initcall+0xe8/0x23c kernel_init_freeable+0x9c/0x224 kernel_init+0x10/0x100 ret_from_kernel_thread+0x14/0x1c [ end trace cb88537fdc8fa200 ] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b alloc_unbound_pwq() allocates slab object from pool_workqueue. This kmem_cache requires 256 bytes alignment, but, current merging code doesn't honor that, and merge it with kmalloc-256. kmalloc-256 requires only cacheline size alignment so that above failure occurs. However, in x86, kmalloc-256 is luckily aligned in 256 bytes, so the problem didn't happen on it. To fix this problem, this patch introduces alignment mismatch check in find_mergeable(). This will fix the problem. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: Markos Chandras <Markos.Chandras@imgtec.com> Tested-by: Markos Chandras <Markos.Chandras@imgtec.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-13mm/page_alloc: restrict max order of merging on isolated pageblockJoonsoo Kim
Current pageblock isolation logic could isolate each pageblock individually. This causes freepage accounting problem if freepage with pageblock order on isolate pageblock is merged with other freepage on normal pageblock. We can prevent merging by restricting max order of merging to pageblock order if freepage is on isolate pageblock. A side-effect of this change is that there could be non-merged buddy freepage even if finishing pageblock isolation, because undoing pageblock isolation is just to move freepage from isolate buddy list to normal buddy list rather than to consider merging. So, the patch also makes undoing pageblock isolation consider freepage merge. When un-isolation, freepage with more than pageblock order and it's buddy are checked. If they are on normal pageblock, instead of just moving, we isolate the freepage and free it in order to get merged. Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Laura Abbott <lauraa@codeaurora.org> Cc: Heesub Shin <heesub.shin@samsung.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Ritesh Harjani <ritesh.list@gmail.com> Cc: Gioh Kim <gioh.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>