path: root/mm
Age    Commit message    Author
2013-11-04  writeback: fix negative bdi max pause  (Fengguang Wu)

commit e3b6c655b91e01a1dade056cfa358581b47a5351 upstream.

Toralf runs trinity on UML/i386. After some time it hangs and the last message line is

    BUG: soft lockup - CPU#0 stuck for 22s! [trinity-child0:1521]

It's found that pages_dirtied becomes very large -- more than 1000000000 pages in this case:

    period = HZ * pages_dirtied / task_ratelimit;
    BUG_ON(pages_dirtied > 2000000000);
    BUG_ON(pages_dirtied > 1000000000);     <---------

UML debug printf shows that we got a negative pause here:

    ick: pause : -984
    ick: pages_dirtied : 0
    ick: task_ratelimit: 0

     pause:
    +       if (pause < 0) {
    +               extern int printf(char *, ...);
    +               printf("ick : pause : %li\n", pause);
    +               printf("ick: pages_dirtied : %lu\n", pages_dirtied);
    +               printf("ick: task_ratelimit: %lu\n", task_ratelimit);
    +               BUG_ON(1);
    +       }
            trace_balance_dirty_pages(bdi,

Since pause is bounded by [min_pause, max_pause], where min_pause is itself bounded by max_pause, it was suspected and then demonstrated that the max_pause calculation goes wrong:

    ick: pause : -717
    ick: min_pause : -177
    ick: max_pause : -717
    ick: pages_dirtied : 14
    ick: task_ratelimit: 0

The problem lies in the two "long = unsigned long" assignments in bdi_max_pause(), which can go negative if the highest bit is 1, and the min_t(long, ...) check fails to keep the result from falling below 0. Fix all of them by using "unsigned long" throughout the function.

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Reported-by: Toralf Förster <toralf.foerster@gmx.de> Tested-by: Toralf Förster <toralf.foerster@gmx.de> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Richard Weinberger <richard@nod.at> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
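The pitfall is easy to reproduce outside the kernel. Below is a minimal, self-contained C sketch of the "long = unsigned long" problem described in this entry; the min_t macro only mimics the kernel's, and all values are made up:

    /*
     * Userspace sketch (not the kernel code): once a value with its top bit
     * set lands in a signed long it is negative, and a signed min() no
     * longer clamps it to a sane upper bound.
     */
    #include <stdio.h>

    #define min_t(type, a, b) ((type)(a) < (type)(b) ? (type)(a) : (type)(b))

    int main(void)
    {
            unsigned long huge = ~0UL - 100;        /* highest bit set */
            long t = huge;                          /* silently becomes negative */
            long pause = min_t(long, t, 200);       /* "clamp" picks the negative value */

            printf("t = %ld, pause = %ld\n", t, pause);
            return 0;
    }

Using unsigned arithmetic throughout, as the fix does, keeps the intermediate values non-negative so the clamp behaves as intended.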
2013-11-04  mm: fix BUG in __split_huge_page_pmd  (Hugh Dickins)
commit 750e8165f5e87b6a142be953640eabb13a9d350a upstream. Occasionally we hit the BUG_ON(pmd_trans_huge(*pmd)) at the end of __split_huge_page_pmd(): seen when doing madvise(,,MADV_DONTNEED). It's invalid: we don't always have down_write of mmap_sem there: a racing do_huge_pmd_wp_page() might have copied-on-write to another huge page before our split_huge_page() got the anon_vma lock. Forget the BUG_ON, just go back and try again if this happens. Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-11-04  mm/zswap: bugfix: memory leak when re-swapon  (Weijie Yang)

commit aa9bca05a467c61dcea4142b2877d5392de5bdce upstream.

zswap_tree is not freed on swapoff, and it gets re-kmalloced on swapon, so a memory leak occurs. Free the memory of zswap_tree in zswap_frontswap_invalidate_area().

Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Reviewed-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

From: Weijie Yang <weijie.yang@samsung.com>
Subject: mm/zswap: bugfix: memory leak when invalidate and reclaim occur concurrently

Consider the following scenario:

    thread 0: reclaim entry x
              (get refcount, but not call zswap_get_swap_cache_page)

    thread 1: call zswap_frontswap_invalidate_page to invalidate entry x.
              finished, entry x and its zbud is not freed as its refcount != 0
              now, the swap_map[x] = 0

    thread 0: now call zswap_get_swap_cache_page
              swapcache_prepare return -ENOENT because entry x is not used any more
              zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM
              zswap_writeback_entry do nothing except put refcount

Now the memory of zswap_entry x and its zpage leak.

Modify:
 - check the refcount in the fail path and free the memory if it is not referenced.
 - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM, as the fail path can be caused not only by nomem but also by invalidate.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Weijie Yang <weijie.yang@samsung.com> Reviewed-by: Bob Liu <bob.liu@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <stable@vger.kernel.org> Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-04  mm: migration: do not lose soft dirty bit if page is in migration state  (Cyrill Gorcunov)

commit c3d16e16522fe3fe8759735850a0676da18f4b1d upstream.

If page migration is turned on in the config and the page is migrating, we may lose the soft dirty bit. If fork or mprotect is called on migrating pages (once migration is complete), the pages do not obtain the soft dirty bit in the corresponding pte entries. Fix it by adding an appropriate test on swap entries.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-13  mm: avoid reinserting isolated balloon pages into LRU lists  (Rafael Aquini)

commit 117aad1e9e4d97448d1df3f84b08bd65811e6d6a upstream.

Isolated balloon pages can wrongly end up in LRU lists when migrate_pages() finishes its round without draining all the isolated page list. The same issue can happen when reclaim_clean_pages_from_list() tries to reclaim pages from an isolated page list, before migration, in the CMA path. Such balloon page leak opens a race window against LRU lists shrinkers that leads us to the following kernel panic:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
    IP: [<ffffffff810c2625>] shrink_page_list+0x24e/0x897
    PGD 3cda2067 PUD 3d713067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 340 Comm: kswapd0 Not tainted 3.12.0-rc1-22626-g4367597 #87
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    RIP: shrink_page_list+0x24e/0x897
    RSP: 0000:ffff88003da499b8  EFLAGS: 00010286
    RAX: 0000000000000000 RBX: ffff88003e82bd60 RCX: 00000000000657d5
    RDX: 0000000000000000 RSI: 000000000000031f RDI: ffff88003e82bd40
    RBP: ffff88003da49ab0 R08: 0000000000000001 R09: 0000000081121a45
    R10: ffffffff81121a45 R11: ffff88003c4a9a28 R12: ffff88003e82bd40
    R13: ffff88003da0e800 R14: 0000000000000001 R15: ffff88003da49d58
    FS:  0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000067d9000 CR3: 000000003ace5000 CR4: 00000000000407b0
    Call Trace:
      shrink_inactive_list+0x240/0x3de
      shrink_lruvec+0x3e0/0x566
      __shrink_zone+0x94/0x178
      shrink_zone+0x3a/0x82
      balance_pgdat+0x32a/0x4c2
      kswapd+0x2f0/0x372
      kthread+0xa2/0xaa
      ret_from_fork+0x7c/0xb0
    Code: 80 7d 8f 01 48 83 95 68 ff ff ff 00 4c 89 e7 e8 5a 7b 00 00 48 85 c0 49 89 c5 75 08 80 7d 8f 00 74 3e eb 31 48 8b 80 18 01 00 00 <48> 8b 74 0d 48 8b 78 30 be 02 00 00 00 ff d2 eb
    RIP  [<ffffffff810c2625>] shrink_page_list+0x24e/0x897
     RSP <ffff88003da499b8>
    CR2: 0000000000000028
    ---[ end trace 703d2451af6ffbfd ]---
    Kernel panic - not syncing: Fatal exception

This patch fixes the issue by assuring the proper tests are made at putback_movable_pages() & reclaim_clean_pages_from_list() to avoid isolated balloon pages being wrongly reinserted in LRU lists.

[akpm@linux-foundation.org: clarify awkward comment text]
Signed-off-by: Rafael Aquini <aquini@redhat.com> Reported-by: Luiz Capitulino <lcapitulino@redhat.com> Tested-by: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-13  mm/bounce.c: fix a regression where MS_SNAP_STABLE (stable pages snapshotting) was ignored  (Darrick J. Wong)

commit 83b2944fd2532b92db099cb3ada12df32a05b368 upstream.

The "force" parameter in __blk_queue_bounce was being ignored, which means that stable page snapshots are not always happening (on ext3). This of course leads to DIF disks reporting checksum errors, so fix this regression.

The regression was introduced in commit 6bc454d15004 ("bounce: Refactor __blk_queue_bounce to not use bi_io_vec").

Reported-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: Kent Overstreet <koverstreet@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-13  powerpc: Fix memory hotplug with sparse vmemmap  (Nathan Fontenot)
commit f7e3334a6bcb42e7295a9bd9cb36ca4e6e4e66b4 upstream. Previous commit 46723bfa540... introduced a new config option HAVE_BOOTMEM_INFO_NODE that ended up breaking memory hot-remove for ppc when sparse vmemmap is not defined. This patch defines HAVE_BOOTMEM_INFO_NODE for ppc and adds the call to register_page_bootmem_info_node. Without this we get a BUG_ON for memory hot remove in put_page_bootmem(). This also adds a stub for register_page_bootmem_memmap to allow ppc to build with sparse vmemmap defined. Leaving this as a stub is fine since the same vmemmap addresses are also handled in vmemmap_populate and as such are properly mapped. Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-10-01  mm: fix aio performance regression for database caused by THP  (Khalid Aziz)

commit 7cb2ef56e6a8b7b368b2e883a0a47d02fed66911 upstream.

I am working with a tool that simulates oracle database I/O workload. This tool (orion to be specific - <http://docs.oracle.com/cd/E11882_01/server.112/e16638/iodesign.htm#autoId24>) allocates hugetlbfs pages using shmget() with SHM_HUGETLB flag. It then does aio into these pages from flash disks using various common block sizes used by database. I am looking at performance with two of the most common block sizes - 1M and 64K. aio performance with these two block sizes plunged after Transparent HugePages was introduced in the kernel. Here are performance numbers:

                pre-THP         2.6.39          3.11-rc5
    1M read     8384 MB/s       5629 MB/s       6501 MB/s
    64K read    7867 MB/s       4576 MB/s       4251 MB/s

I have narrowed the performance impact down to the overheads introduced by THP in __get_page_tail() and put_compound_page() routines. perf top shows >40% of cycles being spent in these two routines. Every time direct I/O to hugetlbfs pages starts, kernel calls get_page() to grab a reference to the pages and calls put_page() when I/O completes to put the reference away. THP introduced significant amount of locking overhead to get_page() and put_page() when dealing with compound pages because hugepages can be split underneath get_page() and put_page(). It added this overhead irrespective of whether it is dealing with hugetlbfs pages or transparent hugepages. This resulted in 20%-45% drop in aio performance when using hugetlbfs pages.

Since hugetlbfs pages can not be split, there is no reason to go through all the locking overhead for these pages from what I can see. I added code to __get_page_tail() and put_compound_page() to bypass all the locking code when working with hugetlbfs pages. This improved performance significantly. Performance numbers with this patch:

                pre-THP         3.11-rc5        3.11-rc5 + Patch
    1M read     8384 MB/s       6501 MB/s       8371 MB/s
    64K read    7867 MB/s       4251 MB/s       6510 MB/s

Performance with 64K read is still lower than what it was before THP, but still a 53% improvement. It does mean there is more work to be done but I will take a 53% improvement for now. Please take a look at the following patch and let me know if it looks reasonable.

[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com> Cc: Pravin B Shelar <pshelar@nicira.com> Cc: Christoph Lameter <cl@linux.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-09-26  mm/huge_memory.c: fix potential NULL pointer dereference  (Libin)

commit a8f531ebc33052642b4bd7b812eedf397108ce64 upstream.

In collapse_huge_page() there is a race window between releasing the mmap_sem read lock and taking the mmap_sem write lock, so find_vma() may return NULL. So check the return value to avoid NULL pointer dereference.

    collapse_huge_page
      khugepaged_alloc_page
        up_read(&mm->mmap_sem)
      down_write(&mm->mmap_sem)
      vma = find_vma(mm, address)

Signed-off-by: Libin <huawei.libin@huawei.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
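The general pattern behind this fix is worth spelling out: any state looked up before a lock is dropped has to be looked up again, and NULL-checked, after the lock is re-acquired. A self-contained, single-threaded C sketch (compile with -pthread; the rwlock and the "vma" pointer are stand-ins, not the kernel code):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    static pthread_rwlock_t mmap_sem = PTHREAD_RWLOCK_INITIALIZER;
    static int *fake_vma;                   /* stand-in for the VMA */

    static void collapse(void)
    {
            pthread_rwlock_rdlock(&mmap_sem);
            /* ... decide to collapse, then drop the read lock to allocate ... */
            pthread_rwlock_unlock(&mmap_sem);

            /* window: another thread may unmap the region here */

            pthread_rwlock_wrlock(&mmap_sem);
            int *vma = fake_vma;            /* equivalent of find_vma() */
            if (!vma) {                     /* the added check */
                    pthread_rwlock_unlock(&mmap_sem);
                    return;
            }
            printf("vma still present: %d\n", *vma);
            pthread_rwlock_unlock(&mmap_sem);
    }

    int main(void)
    {
            fake_vma = NULL;                /* simulate a racing munmap() */
            collapse();

            fake_vma = malloc(sizeof(*fake_vma));
            *fake_vma = 42;
            collapse();
            free(fake_vma);
            return 0;
    }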
2013-09-26  memcg: fix multiple large threshold notifications  (Greg Thelen)

commit 2bff24a3707093c435ab3241c47dcdb5f16e432b upstream.

A memory cgroup with (1) multiple threshold notifications and (2) at least one threshold >=2G was not reliable. Specifically the notifications would either not fire or would not fire in the proper order.

The __mem_cgroup_threshold() signaling logic depends on keeping 64 bit thresholds in sorted order. mem_cgroup_usage_register_event() sorts them with compare_thresholds(), which returns the difference of two 64 bit thresholds as an int. If the difference is positive but has bit[31] set, then sort() treats the difference as negative and breaks sort order.

This fix compares the two arbitrary 64 bit thresholds returning the classic -1, 0, 1 result.

The test below sets two notifications (at 0x1000 and 0x81001000):

    cd /sys/fs/cgroup/memory
    mkdir x
    for x in 4096 2164264960; do
        cgroup_event_listener x/memory.usage_in_bytes $x |
            sed "s/^/$x listener:/" &
    done
    echo $$ > x/cgroup.procs
    anon_leaker 500M

v3.11-rc7 fails to signal the 4096 event listener:

    Leaking...
    Done leaking pages.

Patched v3.11-rc7 properly notifies:

    Leaking...
    4096 listener:2013:8:31:14:13:36
    Done leaking pages.

The fixed bug is old. It appears to date back to the introduction of memcg threshold notifications in v2.6.34-rc1-116-g2e72b6347c94 "memcg: implement memory thresholds".

Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
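The comparator bug is easy to demonstrate in isolation. A self-contained C sketch (names and values are illustrative, not the kernel's): returning the difference of two 64-bit values as an int flips the sign once the difference no longer fits in 31 bits, while a three-way comparison does not.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    static int cmp_broken(const void *a, const void *b)
    {
            /* difference truncated to int: 0x81001000 - 0x1000 has bit 31 set */
            return *(const uint64_t *)a - *(const uint64_t *)b;
    }

    static int cmp_fixed(const void *a, const void *b)
    {
            uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;

            if (x > y)
                    return 1;
            if (x < y)
                    return -1;
            return 0;
    }

    int main(void)
    {
            uint64_t t[2] = { 0x81001000ULL, 0x1000ULL };

            qsort(t, 2, sizeof(t[0]), cmp_broken);  /* leaves them out of order */
            printf("broken: %#llx %#llx\n",
                   (unsigned long long)t[0], (unsigned long long)t[1]);

            qsort(t, 2, sizeof(t[0]), cmp_fixed);   /* ascending, as required */
            printf("fixed:  %#llx %#llx\n",
                   (unsigned long long)t[0], (unsigned long long)t[1]);
            return 0;
    }

The -1/0/1 three-way comparison is exactly the shape of comparator the fix describes for the 64-bit thresholds.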
2013-08-28  memcg: check that kmem_cache has memcg_params before accessing it  (Andrey Vagin)

If the system had a few memory groups and all of them were destroyed, memcg_limited_groups_array_size has a non-zero value, but all new caches are created without memcg_params, because memcg_kmem_enabled() returns false.

We try to enumerate child caches in a few places and all of them are potentially dangerous. For example my kernel is compiled with CONFIG_SLAB and it crashed when I tried to mount an NFS share after a few experiments with kmemcg.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    IP: [<ffffffff8118166a>] do_tune_cpucache+0x8a/0xd0
    PGD b942a067 PUD b999f067 PMD 0
    Oops: 0000 [#1] SMP
    Modules linked in: fscache(+) ip6table_filter ip6_tables iptable_filter ip_tables i2c_piix4 pcspkr virtio_net virtio_balloon i2c_core floppy
    CPU: 0 PID: 357 Comm: modprobe Not tainted 3.11.0-rc7+ #59
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    task: ffff8800b9f98240 ti: ffff8800ba32e000 task.ti: ffff8800ba32e000
    RIP: 0010:[<ffffffff8118166a>]  [<ffffffff8118166a>] do_tune_cpucache+0x8a/0xd0
    RSP: 0018:ffff8800ba32fb70  EFLAGS: 00010246
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
    RDX: 0000000000000000 RSI: ffff8800b9f98910 RDI: 0000000000000246
    RBP: ffff8800ba32fba0 R08: 0000000000000002 R09: 0000000000000004
    R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000010
    R13: 0000000000000008 R14: 00000000000000d0 R15: ffff8800375d0200
    FS:  00007f55f1378740(0000) GS:ffff8800bfa00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007f24feba57a0 CR3: 0000000037b51000 CR4: 00000000000006f0
    Call Trace:
      enable_cpucache+0x49/0x100
      setup_cpu_cache+0x215/0x280
      __kmem_cache_create+0x2fa/0x450
      kmem_cache_create_memcg+0x214/0x350
      kmem_cache_create+0x2b/0x30
      fscache_init+0x19b/0x230 [fscache]
      do_one_initcall+0xfa/0x1b0
      load_module+0x1c41/0x26d0
      SyS_finit_module+0x86/0xb0
      system_call_fastpath+0x16/0x1b

Signed-off-by: Andrey Vagin <avagin@openvz.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Glauber Costa <glommer@openvz.org> Cc: Joonsoo Kim <js1304@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-27  mm: move_ptes -- Set soft dirty bit depending on pte type  (Cyrill Gorcunov)

Dave reported corrupted swap entries

    | [ 4588.541886] swap_free: Unused swap offset entry 00002d15
    | [ 4588.541952] BUG: Bad page map in process trinity-kid12  pte:005a2a80 pmd:22c01f067

and Hugh pointed out that move_ptes sets the _PAGE_SOFT_DIRTY bit regardless of the type of entry the pte consists of. The trick here is that when we carry soft dirty status in swap entries we are to use _PAGE_SWP_SOFT_DIRTY instead, because this is the only place in the pte which can be used for our own needs without intersecting with the bits owned by the swap entry type/offset.

Reported-and-tested-by: Dave Jones <davej@redhat.com> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Analyzed-by: Hugh Dickins <hughd@google.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-25  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs  (Linus Torvalds)

Pull vfs fixes from Al Viro:
 "Assorted fixes from the last week or so"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  VFS: collect_mounts() should return an ERR_PTR
  bfs: iget_locked() doesn't return an ERR_PTR
  efs: iget_locked() doesn't return an ERR_PTR()
  proc: kill the extra proc_readfd_common()->dir_emit_dots()
  cope with potentially long ->d_dname() output for shmem/hugetlb
2013-08-24  cope with potentially long ->d_dname() output for shmem/hugetlb  (Al Viro)
dynamic_dname() is both too much and too little for those - the output may be well in excess of 64 bytes dynamic_dname() assumes to be enough (thanks to ashmem feeding really long names to shmem_file_setup()) and vsnprintf() is an overkill for those guys. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-08-23  memcg: get rid of swapaccount leftovers  (Michal Hocko)

The swapaccount kernel parameter without any values has been removed by commit a2c8990aed5a ("memsw: remove noswapaccount kernel parameter") but it seems that we didn't get rid of all the leftovers. Make sure that the menuconfig help text and kernel-parameters.txt are clear about the value for the parameter, and remove the stale comment which is not very useful on its own.

Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Gergely Risko <gergely@risko.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-16  Fix TLB gather virtual address range invalidation corner cases  (Linus Torvalds)

Ben Tebulin reported:

 "Since v3.7.2 on two independent machines a very specific Git repository fails in 9/10 cases on git-fsck due to an SHA1/memory failures. This only occurs on a very specific repository and can be reproduced stably on two independent laptops. Git mailing list ran out of ideas and for me this looks like some very exotic kernel issue"

and bisected the failure to the backport of commit 53a59fc67f97 ("mm: limit mmu_gather batching to fix soft lockups on !CONFIG_PREEMPT").

That commit itself is not actually buggy, but what it does is to make it much more likely to hit the partial TLB invalidation case, since it introduces a new case in tlb_next_batch() that previously only ever happened when running out of memory.

The real bug is that the TLB gather virtual memory range setup is subtly buggered. It was introduced in commit 597e1c3580b7 ("mm/mmu_gather: enable tlb flush range in generic mmu_gather"), and the range handling was already fixed at least once in commit e6c495a96ce0 ("mm: fix the TLB range flushed when __tlb_remove_page() runs out of slots"), but that fix was not complete.

The problem with the TLB gather virtual address range is that it isn't set up by the initial tlb_gather_mmu() initialization (which didn't get the TLB range information), but it is set up ad-hoc later by the functions that actually flush the TLB. And so any such case that forgot to update the TLB range entries would potentially miss TLB invalidates.

Rather than try to figure out exactly which particular ad-hoc range setup was missing (I personally suspect it's the hugetlb case in zap_huge_pmd(), which didn't have the same logic as zap_pte_range() did), this patch just gets rid of the problem at the source: make the TLB range information available to tlb_gather_mmu(), and initialize it when initializing all the other tlb gather fields.

This makes the patch larger, but conceptually much simpler. And the end result is much more understandable; even if you want to play games with partial ranges when invalidating the TLB contents in chunks, now the range information is always there, and anybody who doesn't want to bother with it won't introduce subtle bugs.

Ben verified that this fixes his problem.

Reported-bisected-and-tested-by: Ben Tebulin <tebulin@googlemail.com> Build-testing-by: Stephen Rothwell <sfr@canb.auug.org.au> Build-testing-by: Richard Weinberger <richard.weinberger@gmail.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
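A toy model of the idea, as a self-contained C sketch: the flush range is recorded when the gather is initialized and only ever grows as pages are queued, instead of being filled in ad hoc by whichever path happens to flush. The struct and names below are invented for the illustration; they are not the kernel's mmu_gather.

    #include <stdio.h>

    struct toy_gather {
            unsigned long start, end;       /* virtual range to invalidate */
    };

    static void toy_gather_init(struct toy_gather *tlb,
                                unsigned long start, unsigned long end)
    {
            tlb->start = start;             /* known from the very beginning */
            tlb->end = end;
    }

    static void toy_track_page(struct toy_gather *tlb,
                               unsigned long addr, unsigned long size)
    {
            if (addr < tlb->start)
                    tlb->start = addr;      /* the range can only grow */
            if (addr + size > tlb->end)
                    tlb->end = addr + size;
    }

    int main(void)
    {
            struct toy_gather tlb;

            toy_gather_init(&tlb, 0x400000, 0x800000);
            toy_track_page(&tlb, 0x200000, 0x1000);
            printf("flush [%#lx, %#lx)\n", tlb.start, tlb.end);
            return 0;
    }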
2013-08-13  mm: save soft-dirty bits on file pages  (Cyrill Gorcunov)

Andy reported that if a file page gets reclaimed we lose the soft-dirty bit if it was there, so save the _PAGE_BIT_SOFT_DIRTY bit when the page address gets encoded into the pte entry. Thus when a #pf happens on such a non-present pte we can restore it back.

Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13  mm: save soft-dirty bits on swapped pages  (Cyrill Gorcunov)

Andy Lutomirski reported that if a page with the _PAGE_SOFT_DIRTY bit set gets swapped out, the bit is lost and no longer available when the pte is read back.

To resolve this we introduce the _PTE_SWP_SOFT_DIRTY bit, which is saved in the pte entry for the page being swapped out. When such a page is to be read back from a swap cache we check for the bit's presence, and if it's there we clear it and restore the former _PAGE_SOFT_DIRTY bit back.

One of the problems was to find a place in the pte entry where we can save the _PTE_SWP_SOFT_DIRTY bit while the page is in swap. The _PAGE_PSE bit was chosen for that; it doesn't intersect with the swap entry format stored in the pte.

Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
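The mechanism generalizes: when an entry is rewritten into a different format (here, a present pte becoming a swap pte), a flag has to be carried in a bit the new format does not use, and moved back on the way out. A toy, self-contained C illustration; all bit positions below are invented for the demo and are not the x86 definitions:

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SOFT_DIRTY  (1u << 6)   /* invented bit positions, demo only */
    #define SWP_SOFT_DIRTY   (1u << 2)   /* a bit the "swap" format does not use */

    static uint32_t make_swap_entry(uint32_t pte, uint32_t swp)
    {
            if (pte & PAGE_SOFT_DIRTY)   /* carry the flag across the format change */
                    swp |= SWP_SOFT_DIRTY;
            return swp;
    }

    static uint32_t restore_pte(uint32_t swp, uint32_t pte)
    {
            if (swp & SWP_SOFT_DIRTY)    /* move it back on swap-in */
                    pte |= PAGE_SOFT_DIRTY;
            return pte;
    }

    int main(void)
    {
            uint32_t pte = 0x1000 | PAGE_SOFT_DIRTY;
            uint32_t swp = make_swap_entry(pte, 0xabc0);
            uint32_t back = restore_pte(swp, 0x1000);

            printf("soft-dirty preserved: %s\n",
                   (back & PAGE_SOFT_DIRTY) ? "yes" : "no");
            return 0;
    }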
2013-08-13  memcg: don't initialize kmem-cache destroying work for root caches  (Andrey Vagin)

struct memcg_cache_params has a union. Different parts of this union are used for root and non-root caches. The part with the destroying work is used only for non-root caches. I fixed the same problem in another place in v3.9-rc1-16204-gf101a94, but didn't notice this one.

This patch fixes the kernel panic:

    [   46.848187] BUG: unable to handle kernel paging request at 000000fffffffeb8
    [   46.849026] IP: [<ffffffff811a484c>] kmem_cache_destroy_memcg_children+0x6c/0xc0
    [   46.849092] PGD 0
    [   46.849092] Oops: 0000 [#1] SMP
    ...

Signed-off-by: Andrey Vagin <avagin@openvz.org> Cc: Glauber Costa <glommer@openvz.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: <stable@vger.kernel.org> [3.9.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-08  Revert "slub: do not put a slab to cpu partial list when cpu_partial is 0"  (Linus Torvalds)

This reverts commit 318df36e57c0ca9f2146660d41ff28e8650af423.

This commit caused Steven Rostedt's hackbench runs to run out of memory due to a leak. As noted by Joonsoo Kim, it is buggy in the following scenario:

 "I guess, you may set 0 to all kmem caches's cpu_partial via sysfs, doesn't it? In this case, memory leak is possible in the following case. Code flow of the possible leak is the following:

  * in __slab_free()
  1. (!new.inuse || !prior) && !was_frozen
  2. !kmem_cache_debug && !prior
  3. new.frozen = 1
  4. after cmpxchg_double_slab, run the (!n) case with new.frozen=1
  5. with this patch, put_cpu_partial() doesn't do anything, because this cache's cpu_partial is 0
  6. return

  In step 5, leak occur"

And Steven does indeed have cpu_partial set to 0 due to RT testing. Joonsoo is cooking up a patch, but everybody agrees that reverting this for now is the right thing to do.

Reported-and-bisected-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-04  tmpfs: fix SEEK_DATA/SEEK_HOLE regression  (Hugh Dickins)
Commit 46a1c2c7ae53 ("vfs: export lseek_execute() to modules") broke the tmpfs SEEK_DATA/SEEK_HOLE implementation, because vfs_setpos() converts the carefully prepared -ENXIO to -EINVAL. Other filesystems avoid it in error cases: do the same in tmpfs. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-31  vmpressure: make sure there are no events queued after memcg is offlined  (Michal Hocko)

vmpressure is called synchronously from reclaim where the target_memcg is guaranteed to be alive, but the eventfd is signaled from the work queue context. This means that memcg (along with the vmpressure structure which is embedded into it) might go away while the work item is pending, which would result in a use-after-release bug.

We have two possible ways to fix this. Either vmpressure pins memcg before it schedules vmpr->work and unpins it in vmpressure_work_fn, or it explicitly flushes the work item from the css_offline context (as suggested by Tejun). This patch implements the latter: it introduces vmpressure_cleanup, which flushes the vmpressure work queue item. It hooks into mem_cgroup_css_offline after the memcg itself is cleaned up.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Tejun Heo <tj@kernel.org> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Li Zefan <lizefan@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-31  vmpressure: do not check for pending work to prevent from new work  (Michal Hocko)
because it is racy and it doesn't give us much anyway as schedule_work handles this case already. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Tejun Heo <tj@kernel.org> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Li Zefan <lizefan@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-31  vmpressure: change vmpressure::sr_lock to spinlock  (Michal Hocko)

There is nothing that can sleep inside critical sections protected by this lock, and those sections are really small, so it doesn't make much sense to use a mutex for them. Change the lock to a spinlock.

Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Tejun Heo <tj@kernel.org> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Li Zefan <lizefan@huawei.com> Reviewed-by: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-31  mm: zbud: fix condition check on allocation size  (Heesub Shin)

zbud_alloc() incorrectly verifies the allocation size limit. It should deny allocation requests greater than (PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE), not (PAGE_SIZE - ZHDR_SIZE_ALIGNED), which leaves no remaining space for a buddy. There is no point in spending an entire zbud page storing only a single zpage, since that brings no benefit.

Signed-off-by: Heesub Shin <heesub.shin@samsung.com> Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Dongjun Shin <d.j.shin@samsung.com> Cc: Sunae Seo <sunae.seo@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
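A self-contained sketch of the corrected bound; the constants are illustrative only (the real values live in mm/zbud.c):

    #include <stdio.h>

    #define PAGE_SIZE          4096UL
    #define ZHDR_SIZE_ALIGNED  64UL    /* illustrative header size */
    #define CHUNK_SIZE         64UL    /* illustrative chunk granularity */

    /* Returns 1 if the request would be rejected under the corrected check:
     * a zpage must leave room for the header and at least one chunk for a
     * buddy, otherwise the whole page is spent on a single zpage. */
    static int zbud_would_reject(unsigned long size)
    {
            return !size || size > PAGE_SIZE - ZHDR_SIZE_ALIGNED - CHUNK_SIZE;
    }

    int main(void)
    {
            /* the old check only subtracted the header, so this slipped through */
            unsigned long size = PAGE_SIZE - ZHDR_SIZE_ALIGNED;

            printf("%lu -> reject=%d\n", size, zbud_would_reject(size));
            return 0;
    }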
2013-07-31  thp, mm: avoid PageUnevictable on active/inactive lru lists  (Kirill A. Shutemov)

active/inactive lru lists can contain unevictable pages (i.e. ramfs pages that have been placed on the LRU lists when first allocated), but these pages must not have PageUnevictable set - otherwise shrink_[in]active_list goes crazy:

    kernel BUG at /home/space/kas/git/public/linux-next/mm/vmscan.c:1122!

    1090 static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
    1091                 struct lruvec *lruvec, struct list_head *dst,
    1092                 unsigned long *nr_scanned, struct scan_control *sc,
    1093                 isolate_mode_t mode, enum lru_list lru)
    1094 {
    ...
    1108                 switch (__isolate_lru_page(page, mode)) {
    1109                 case 0:
    ...
    1116                 case -EBUSY:
    ...
    1121                 default:
    1122                         BUG();
    1123                 }
    1124         }
    ...
    1130 }

__isolate_lru_page() returns EINVAL for PageUnevictable(page). For lru_add_page_tail(), it means we should not set PageUnevictable() for tail pages unless we're sure that it will go to LRU_UNEVICTABLE. Let's just copy PG_active and PG_unevictable from the head page in __split_huge_page_refcount(); it will simplify lru_add_page_tail().

This will fix one more bug in lru_add_page_tail(): if page_evictable(page_tail) is false and PageLRU(page) is true, page_tail will go to the same lru as page, but nobody cares to sync page_tail's active/inactive state with page. So we can end up with an inactive page on an active lru. The patch will fix it as well since we copy PG_active from the head page.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-31  mm/swap.c: clear PageActive before adding pages onto unevictable list  (Naoya Horiguchi)

As a result of commit 13f7f78981e4 ("mm: pagevec: defer deciding which LRU to add a page to until pagevec drain time"), pages on unevictable lists can have both PageActive and PageUnevictable set. This is not only confusing, but also corrupts page migration and shrink_[in]active_list. This patch fixes the problem by adding ClearPageActive before adding pages onto the unevictable list. It also cleans up the VM_BUG_ONs.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-31  mm: mempolicy: fix mbind_range() && vma_adjust() interaction  (Oleg Nesterov)

vma_adjust() does vma_set_policy(vma, vma_policy(next)) and this is doubly wrong:

1. This leaks vma->vm_policy if it is not NULL and not equal to next->vm_policy. This can happen if vma_merge() expands "area", not prev (case 8).

2. This sets the wrong policy if vma_merge() joins prev and area, area is the vma the caller needs to update and it still has the old policy.

Revert commit 1444f92c8498 ("mm: merging memory blocks resets mempolicy") which introduced these problems. Change mbind_range() to recheck mpol_equal() after vma_merge() to fix the problem that commit tried to address.

Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Steven T Hampson <steven.t.hampson@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-18  Merge tag 'driver-core-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core  (Linus Torvalds)

Pull driver core patches from Greg KH:
 "Here are some driver core patches for 3.11-rc2. They aren't really bugfixes, but a bunch of new helper macros for drivers to properly create attribute groups, which drivers and subsystems need to fix up a ton of race issues with incorrectly creating sysfs files (binary and normal) after userspace has been told that the device is present.

  Also here is the ability to create binary files as attribute groups, to solve that race condition, which was impossible to do before this, so that's my fault the drivers were broken.

  The majority of the .c changes is indenting and moving code around a bit. It affects no existing code, but allows the large backlog of 70+ patches that I already have created to start flowing into the different subtrees, instead of having to live in my driver-core tree, causing merge nightmares in linux-next for the next few months.

  These were finalized too late for the -rc1 merge window, which is why they didn't make that pull request; testing and review from others didn't happen until a few weeks ago, and then there's the whole distraction of the past few days, which prevented these from getting to you sooner, sorry about that.

  Oh, and there's a bugfix for the documentation build warning in here as well. All of these have been in linux-next this week, with no reported problems"

* tag 'driver-core-3.11-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
  driver-core: fix new kernel-doc warning in base/platform.c
  sysfs: use file mode defines from stat.h
  sysfs: add more helper macro's for (bin_)attribute(_groups)
  driver core: add default groups to struct class
  driver core: Introduce device_create_groups
  sysfs: prevent warning when only using binary attributes
  sysfs: add support for binary attributes in groups
  driver core: device.h: add RW and RO attribute macros
  sysfs.h: add BIN_ATTR macro
  sysfs.h: add ATTRIBUTE_GROUPS() macro
  sysfs.h: add __ATTR_RW() macro
2013-07-16  sysfs.h: add __ATTR_RW() macro  (Greg Kroah-Hartman)
A number of parts of the kernel created their own version of this, might as well have the sysfs core provide it instead. Reviewed-by: Guenter Roeck <linux@roeck-us.net> Tested-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-07-14  kernel: delete __cpuinit usage from all core kernel files  (Paul Gortmaker)
The __cpuinit type of throwaway sections might have made sense some time ago when RAM was more constrained, but now the savings do not offset the cost and complications. For example, the fix in commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time") is a good example of the nasty type of bugs that can be created with improper use of the various __init prefixes. After a discussion on LKML[1] it was decided that cpuinit should go the way of devinit and be phased out. Once all the users are gone, we can then finally remove the macros themselves from linux/init.h. This removes all the uses of the __cpuinit macros from C files in the core kernel directories (kernel, init, lib, mm, and include) that don't really have a specific maintainer. [1] https://lkml.org/lkml/2013/5/20/589 Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2013-07-14  Merge branch 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux  (Linus Torvalds)

Pull slab update from Pekka Enberg:
 "Highlights:
  - Fix for boot-time problems on some architectures due to init_lock_keys() not respecting kmalloc_caches boundaries (Christoph Lameter)
  - CONFIG_SLUB_CPU_PARTIAL requested by RT folks (Joonsoo Kim)
  - Fix for excessive slab freelist draining (Wanpeng Li)
  - SLUB and SLOB cleanups and fixes (various people)"

I ended up editing the branch, and this avoids two commits at the end that were immediately reverted, and I instead just applied the oneliner fix in between myself.

* 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
  slub: Check for page NULL before doing the node_match check
  mm/slab: Give s_next and s_stop slab-specific names
  slob: Check for NULL pointer before calling ctor()
  slub: Make cpu partial slab support configurable
  slab: add kmalloc() to kernel API documentation
  slab: fix init_lock_keys
  slob: use DIV_ROUND_UP where possible
  slub: do not put a slab to cpu partial list when cpu_partial is 0
  mm/slub: Use node_nr_slabs and node_nr_objs in get_slabinfo
  mm/slub: Drop unnecessary nr_partials
  mm/slab: Fix /proc/slabinfo unwriteable for slab
  mm/slab: Sharing s_next and s_stop between slab and slub
  mm/slab: Fix drain freelist excessively
  slob: Rework #ifdeffery in slab.h
  mm, slab: moved kmem_cache_alloc_node comment to correct place
2013-07-14  slub: Check for page NULL before doing the node_match check  (Steven Rostedt)

In the -rt kernel (mrg), we hit the following dump:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
    PGD a2d39067 PUD b1641067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP
    Modules linked in: sunrpc cpufreq_ondemand ipv6 tg3 joydev sg serio_raw pcspkr k8temp amd64_edac_mod edac_core i2c_piix4 e100 mii shpchp ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom sata_svw ata_generic pata_acpi pata_serverworks radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
    CPU 3 Pid: 20878, comm: hackbench Not tainted 3.6.11-rt25.14.el6rt.x86_64 #1 empty empty/Tyan Transport GT24-B3992
    RIP: 0010:[<ffffffff811573f1>]  [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
    RSP: 0018:ffff8800a9b17d70  EFLAGS: 00010213
    RAX: 0000000000000000 RBX: 0000000001200011 RCX: ffff8800a06d8000
    RDX: 0000000004d92a03 RSI: 00000000000000d0 RDI: ffff88013b805500
    RBP: ffff8800a9b17dc0 R08: ffff88023fd14d10 R09: ffffffff81041cbd
    R10: 00007f4e3f06e9d0 R11: 0000000000000246 R12: ffff88013b805500
    R13: ffff8801ff46af40 R14: 0000000000000001 R15: 0000000000000000
    FS:  00007f4e3f06e700(0000) GS:ffff88023fd00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000000 CR3: 00000000a2d3a000 CR4: 00000000000007e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process hackbench (pid: 20878, threadinfo ffff8800a9b16000, task ffff8800a06d8000)
    Stack:
     ffff8800a9b17da0 ffffffff81202e08 ffff8800a9b17de0 000000d001200011
     0000000001200011 0000000001200011 0000000000000000 0000000000000000
     00007f4e3f06e9d0 0000000000000000 ffff8800a9b17e60 ffffffff81041cbd
    Call Trace:
     [<ffffffff81202e08>] ? current_has_perm+0x68/0x80
     [<ffffffff81041cbd>] copy_process+0xdd/0x15b0
     [<ffffffff810a2125>] ? rt_up_read+0x25/0x30
     [<ffffffff8104369a>] do_fork+0x5a/0x360
     [<ffffffff8107c66b>] ? migrate_enable+0xeb/0x220
     [<ffffffff8100b068>] sys_clone+0x28/0x30
     [<ffffffff81527423>] stub_clone+0x13/0x20
     [<ffffffff81527152>] ? system_call_fastpath+0x16/0x1b
    Code: 89 fc 89 75 cc 41 89 d6 4d 8b 04 24 65 4c 03 04 25 48 ae 00 00 49 8b 50 08 4d 8b 28 49 8b 40 10 4d 85 ed 74 12 41 83 fe ff 74 27 <48> 8b 00 48 c1 e8 3a 41 39 c6 74 1b 8b 75 cc 4c 89 c9 44 89 f2
    RIP  [<ffffffff811573f1>] kmem_cache_alloc_node+0x51/0x180
     RSP <ffff8800a9b17d70>
    CR2: 0000000000000000
    ---[ end trace 0000000000000002 ]---

Now, this uses SLUB pretty much unmodified, but as it is the -rt kernel with CONFIG_PREEMPT_RT set, spinlocks are mutexes, although they do disable migration. But the SLUB code is relatively lockless, and the spin_locks there are raw_spin_locks (not converted to mutexes), thus I believe this bug can happen in mainline without -rt features. The -rt patch is just good at triggering mainline bugs ;-)

Anyway, looking at where this crashed, it seems that the page variable can be NULL when passed to the node_match() function (which does not check if it is NULL). When this happens we get the above panic.

As page is only used in slab_alloc() to check if the node matches, if it's NULL I'm assuming that we can say it doesn't and call the __slab_alloc() code. Is this a correct assumption?

Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
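A userspace sketch of the guard being described: a NULL page is simply treated as "does not match this node", so the caller can fall back to the slow path instead of dereferencing NULL. The struct and the NUMA_NO_NODE stand-in below are simplifications for the demo, not the real SLUB code.

    #include <stdio.h>

    #define NUMA_NO_NODE (-1)

    struct page {
            int nid;                        /* node the page lives on */
    };

    static int node_match(const struct page *page, int node)
    {
            return page && (node == NUMA_NO_NODE || page->nid == node);
    }

    int main(void)
    {
            struct page p = { .nid = 1 };

            printf("%d %d %d\n",
                   node_match(&p, 1),       /* 1: right node        */
                   node_match(&p, 0),       /* 0: wrong node        */
                   node_match(NULL, 0));    /* 0: no page, no deref */
            return 0;
    }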
2013-07-10  mm: remove free_area_cache  (Michel Lespinasse)
Since all architectures have been converted to use vm_unmapped_area(), there is no remaining use for the free_area_cache. Signed-off-by: Michel Lespinasse <walken@google.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Howells <dhowells@redhat.com> Cc: Helge Deller <deller@gmx.de> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Richard Henderson <rth@twiddle.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-10  zswap: add to mm/  (Seth Jennings)
zswap is a thin backend for frontswap that takes pages that are in the process of being swapped out and attempts to compress them and store them in a RAM-based memory pool. This can result in a significant I/O reduction on the swap device and, in the case where decompressing from RAM is faster than reading from the swap device, can also improve workload performance. It also has support for evicting swap pages that are currently compressed in zswap to the swap device on an LRU(ish) basis. This functionality makes zswap a true cache in that, once the cache is full, the oldest pages can be moved out of zswap to the swap device so newer pages can be compressed and stored in zswap. This patch adds the zswap driver to mm/ Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Robert Jennings <rcj@linux.vnet.ibm.com> Cc: Jenifer Hopper <jhopper@us.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Hansen <dave@sr71.net> Cc: Joe Perches <joe@perches.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Hugh Dickens <hughd@google.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-10  zbud: add to mm/  (Seth Jennings)

zbud is a special purpose allocator for storing compressed pages. It is designed to store up to two compressed pages per physical page. While this design limits storage density, it has simple and deterministic reclaim properties that make it preferable to a higher density approach when reclaim will be used.

zbud works by storing compressed pages, or "zpages", together in pairs in a single memory page called a "zbud page". The first buddy is "left justified" at the beginning of the zbud page, and the last buddy is "right justified" at the end of the zbud page. The benefit is that if either buddy is freed, the freed buddy space, coalesced with whatever slack space existed between the buddies, results in the largest possible free region within the zbud page.

zbud also provides an attractive lower bound on density. The ratio of zpages to zbud pages can not be less than 1. This ensures that zbud can never "do harm" by using more pages to store zpages than the uncompressed zpages would have used on their own.

This implementation is a rewrite of the zbud allocator internally used by zcache in the driver/staging tree. The rewrite was necessary to remove some of the zcache specific elements that were ingrained throughout and provide a generic allocation interface that can later be used by zsmalloc and others.

This patch adds zbud to mm/ for later use by zswap.

Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Robert Jennings <rcj@linux.vnet.ibm.com> Cc: Jenifer Hopper <jhopper@us.ibm.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Hansen <dave@sr71.net> Cc: Joe Perches <joe@perches.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Cc: Hugh Dickens <hughd@google.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Bob Liu <bob.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
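A self-contained sketch of the layout described here: the first buddy sits just after the header at the start of the page, the second is right-justified at the end, so freeing either one leaves the largest possible contiguous hole in the middle. All constants are illustrative, not the values used by mm/zbud.c.

    #include <stdio.h>

    #define PAGE_SIZE   4096UL
    #define ZHDR_CHUNKS 1UL             /* header size, in chunks (illustrative) */
    #define CHUNK_SIZE  64UL            /* allocation granularity (illustrative) */

    static unsigned long size_to_chunks(unsigned long size)
    {
            return (size + CHUNK_SIZE - 1) / CHUNK_SIZE;    /* round up */
    }

    int main(void)
    {
            unsigned long first = 400, last = 700;          /* compressed sizes */

            unsigned long first_off = ZHDR_CHUNKS * CHUNK_SIZE;
            unsigned long last_off  = PAGE_SIZE - size_to_chunks(last) * CHUNK_SIZE;

            printf("first buddy at %lu, last buddy at %lu, slack %lu bytes\n",
                   first_off, last_off,
                   last_off - (first_off + size_to_chunks(first) * CHUNK_SIZE));
            return 0;
    }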
2013-07-09  ipc/shmc.c: eliminate ugly 80-col tricks  (Andrew Morton)
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09  mm/memory_hotplug.c: fix return value of online_pages()  (Toshi Kani)
online_pages() is called from memory_block_action() when a user requests to online a memory block via sysfs. This function needs to return a proper error value in case of error. Signed-off-by: Toshi Kani <toshi.kani@hp.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09  mm: honor min_free_kbytes set by user  (Michal Hocko)

min_free_kbytes is currently updated during memory hotplug (by init_per_zone_wmark_min), which is the right thing to do in most cases, but this could be unexpected if the admin increased the value to prevent allocation failures and the new min_free_kbytes would be decreased as a result of memory hotadd.

This patch saves the user-defined value and allows updating min_free_kbytes only if it is higher than the saved one. A warning is printed when the new value is ignored.

Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
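A self-contained sketch of the policy (variable and function names are illustrative, not the kernel's): remember the last value the admin wrote, and let the hotplug recalculation only raise, never lower, the effective setting.

    #include <stdio.h>

    static int min_free_kbytes = 1024;
    static int user_min_free_kbytes = -1;   /* -1: never set by the admin */

    static void sysctl_write(int val)       /* admin writes the sysctl */
    {
            user_min_free_kbytes = min_free_kbytes = val;
    }

    static void recalc_on_hotplug(int computed)
    {
            if (computed > min_free_kbytes)
                    min_free_kbytes = computed;
            else if (user_min_free_kbytes >= 0)
                    printf("ignoring computed %d, keeping user value %d\n",
                           computed, user_min_free_kbytes);
    }

    int main(void)
    {
            sysctl_write(65536);            /* raised to avoid alloc failures */
            recalc_on_hotplug(2048);        /* hot-add would have shrunk it */
            printf("min_free_kbytes = %d\n", min_free_kbytes);
            return 0;
    }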
2013-07-09  memcg: don't need to free memcg via RCU or workqueue  (Li Zefan)
Now memcg has the same life cycle with its corresponding cgroup, and a cgroup is freed via RCU and then mem_cgroup_css_free() will be called in a work function, so we can simply call __mem_cgroup_free() in mem_cgroup_css_free(). This actually reverts commit 59927fb984d ("memcg: free mem_cgroup by RCU to fix oops"). Signed-off-by: Li Zefan <lizefan@huawei.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Glauber Costa <glommer@openvz.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09  memcg: kill memcg refcnt  (Li Zefan)
Now memcg has the same life cycle as its corresponding cgroup. Kill the useless refcnt. Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Glauber Costa <glommer@openvz.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09  memcg: don't need to get a reference to the parent  (Li Zefan)
The cgroup core guarantees it's always safe to access the parent. Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Glauber Costa <glommer@openvz.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09  memcg: use css_get/put for swap memcg  (Li Zefan)

Use css_get/put instead of mem_cgroup_get/put. A simple replacement will do.

The historical reason that memcg has its own refcnt instead of always using css_get/put is that a cgroup couldn't be removed if there were still css refs, so css refs couldn't be used as long-lived references. The situation has changed so that rmdir'ing a cgroup now succeeds regardless of css refs, but the cgroup won't be freed until the css refcount goes down to 0.

Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Glauber Costa <glommer@openvz.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09  memcg: use css_get/put when charging/uncharging kmem  (Li Zefan)

Use css_get/put instead of mem_cgroup_get/put. We can't do a simple replacement, because here mem_cgroup_put() is called during mem_cgroup_css_free(), while mem_cgroup_css_free() won't be called until the css refcnt goes down to 0.

Instead we increment the css refcnt in mem_cgroup_css_offline(), and then check if there are still kmem charges. If not, the css refcnt will be decremented immediately; otherwise the refcnt will be released after the last kmem allocation is uncharged.

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Tejun Heo <tj@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@openvz.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09  memcg: don't use mem_cgroup_get() when creating a kmemcg cache  (Li Zefan)

Use css_get()/css_put() instead of mem_cgroup_get()/mem_cgroup_put().

There are two things being done in the current code: First, we acquired a css_ref to make sure that the underlying cgroup would not go away. That is a short lived reference, and it is put as soon as the cache is created. At this point, we acquire a long-lived per-cache memcg reference count to guarantee that the memcg will still be alive.

So it is:

    enqueue: css_get
    create : memcg_get, css_put
    destroy: memcg_put

So we only need to get rid of the memcg_get, change the memcg_put to css_put, and get rid of the now extra css_put.

(This changelog is mostly written by Glauber)

Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Glauber Costa <glommer@openvz.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09  memcg: use css_get() in sock_update_memcg()  (Li Zefan)
Use css_get/css_put instead of mem_cgroup_get/put. Note, if at the same time someone is moving @current to a different cgroup and removing the old cgroup, css_tryget() may return false, and sock->sk_cgrp won't be initialized, which is fine. Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Glauber Costa <glommer@openvz.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09  memcg, kmem: fix reference count handling on the error path  (Michal Hocko)

mem_cgroup_css_online calls mem_cgroup_put if memcg_init_kmem fails. This is not correct because only memcg_propagate_kmem takes an additional reference, while mem_cgroup_sockets_init is allowed to fail as well (although no current implementation fails) but it doesn't take any reference. This all suggests that it should be memcg_propagate_kmem that cleans up after itself, so this patch moves mem_cgroup_put over there.

Unfortunately this is not that easy (as pointed out by Li Zefan), because memcg_kmem_mark_dead marks the group dead (KMEM_ACCOUNTED_DEAD) if it is marked active (KMEM_ACCOUNTED_ACTIVE), which is the case even if memcg_propagate_kmem fails, so the additional reference is dropped in that case in kmem_cgroup_destroy, which means that the reference would be dropped two times.

The easiest way then would be to simply remove mem_cgroup_put from mem_cgroup_css_online and rely on kmem_cgroup_destroy doing the right thing.

Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Glauber Costa <glommer@openvz.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [3.8] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09  Revert "memcg: avoid dangling reference count in creation failure"  (Michal Hocko)

This reverts commit e4715f01be697a. mem_cgroup_put is hierarchy aware, so mem_cgroup_put(memcg) already drops an additional reference from all parents, so the additional mem_cgroup_put(parent) potentially causes a use-after-free.

Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Glauber Costa <glommer@openvz.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [3.9+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09  mmap: allow MAP_HUGETLB for hugetlbfs files v2  (Jörn Engel)
It is counterintuitive at best that mmap'ing a hugetlbfs file with MAP_HUGETLB fails, while mmap'ing it without will a) succeed and b) return huge pages. v2: use is_file_hugepages(), as suggested by Jianguo Signed-off-by: Joern Engel <joern@logfs.org> Cc: Jianguo Wu <wujianguo@huawei.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-07-09  mm: vmscan: do not scale writeback pages when deciding whether to set ZONE_WRITEBACK  (Mel Gorman)

After the patch "mm: vmscan: Flatten kswapd priority loop" was merged, the scanning priority of kswapd changed. The priority now rises until it is scanning enough pages to meet the high watermark. shrink_inactive_list sets ZONE_WRITEBACK if a number of pages were encountered under writeback, but this value is scaled based on the priority. As kswapd frequently scans with a higher priority now, it is relatively easy to set ZONE_WRITEBACK. This patch removes the scaling and treats writeback pages similarly to how it treats unqueued dirty pages and congested pages. The user-visible effect should be that kswapd will write back fewer pages from reclaim context.

Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>