summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2010-08-26vmscan: raise the bar to PAGEOUT_IO_SYNC stallsWu Fengguang
commit e31f3698cd3499e676f6b0ea12e3528f569c4fa3 upstream. Fix "system goes unresponsive under memory pressure and lots of dirty/writeback pages" bug. http://lkml.org/lkml/2010/4/4/86 In the above thread, Andreas Mohr described that Invoking any command locked up for minutes (note that I'm talking about attempted additional I/O to the _other_, _unaffected_ main system HDD - such as loading some shell binaries -, NOT the external SSD18M!!). This happens when the two conditions are both meet: - under memory pressure - writing heavily to a slow device OOM also happens in Andreas' system. The OOM trace shows that 3 processes are stuck in wait_on_page_writeback() in the direct reclaim path. One in do_fork() and the other two in unix_stream_sendmsg(). They are blocked on this condition: (sc->order && priority < DEF_PRIORITY - 2) which was introduced in commit 78dc583d (vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC) one year ago. That condition may be too permissive. In Andreas' case, 512MB/1024 = 512KB. If the direct reclaim for the order-1 fork() allocation runs into a range of 512KB hard-to-reclaim LRU pages, it will be stalled. It's a severe problem in three ways. Firstly, it can easily happen in daily desktop usage. vmscan priority can easily go below (DEF_PRIORITY - 2) on _local_ memory pressure. Even if the system has 50% globally reclaimable pages, it still has good opportunity to have 0.1% sized hard-to-reclaim ranges. For example, a simple dd can easily create a big range (up to 20%) of dirty pages in the LRU lists. And order-1 to order-3 allocations are more than common with SLUB. Try "grep -v '1 :' /proc/slabinfo" to get the list of high order slab caches. For example, the order-1 radix_tree_node slab cache may stall applications at swap-in time; the order-3 inode cache on most filesystems may stall applications when trying to read some file; the order-2 proc_inode_cache may stall applications when trying to open a /proc file. Secondly, once triggered, it will stall unrelated processes (not doing IO at all) in the system. This "one slow USB device stalls the whole system" avalanching effect is very bad. Thirdly, once stalled, the stall time could be intolerable long for the users. When there are 20MB queued writeback pages and USB 1.1 is writing them in 1MB/s, wait_on_page_writeback() will stuck for up to 20 seconds. Not to mention it may be called multiple times. So raise the bar to only enable PAGEOUT_IO_SYNC when priority goes below DEF_PRIORITY/3, or 6.25% LRU size. As the default dirty throttle ratio is 20%, it will hardly be triggered by pure dirty pages. We'd better treat PAGEOUT_IO_SYNC as some last resort workaround -- its stall time is so uncomfortably long (easily goes beyond 1s). The bar is only raised for (order < PAGE_ALLOC_COSTLY_ORDER) allocations, which are easy to satisfy in 1TB memory boxes. So, although 6.25% of memory could be an awful lot of pages to scan on a system with 1TB of memory, it won't really have to busy scan that much. Andreas tested an older version of this patch and reported that it mostly fixed his problem. Mel Gorman helped improve it and KOSAKI Motohiro will fix it further in the next patch. Reported-by: Andreas Mohr <andi@lisas.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26slab: fix object alignmentCarsten Otte
commit 1ab335d8f85792e3b107ff8237d53cf64db714df upstream. This patch fixes alignment of slab objects in case CONFIG_DEBUG_PAGEALLOC is active. Before this spot in kmem_cache_create, we have this situation: - align contains the required alignment of the object - cachep->obj_offset is 0 or equals align in case of CONFIG_DEBUG_SLAB - size equals the size of the object, or object plus trailing redzone in case of CONFIG_DEBUG_SLAB This spot tries to fill one page per object if the object is in certain size limits, however setting obj_offset to PAGE_SIZE - size does break the object alignment since size may not be aligned with the required alignment. This patch simply adds an ALIGN(size, align) to the equation and fixes the object size detection accordingly. This code in drivers/s390/cio/qdio_setup_init has lead to incorrectly aligned slab objects (sizeof(struct qdio_q) equals 1792): qdio_q_cache = kmem_cache_create("qdio_q", sizeof(struct qdio_q), 256, 0, NULL); Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26mm: make stack guard page logic use vm_prev pointerLinus Torvalds
commit 0e8e50e20c837eeec8323bba7dcd25fe5479194c upstream. Like the mlock() change previously, this makes the stack guard check code use vma->vm_prev to see what the mapping below the current stack is, rather than have to look it up with find_vma(). Also, accept an abutting stack segment, since that happens naturally if you split the stack with mlock or mprotect. Tested-by: Ian Campbell <ijc@hellion.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26mm: make the mlock() stack guard page checks stricterLinus Torvalds
commit 7798330ac8114c731cfab83e634c6ecedaa233d7 upstream. If we've split the stack vma, only the lowest one has the guard page. Now that we have a doubly linked list of vma's, checking this is trivial. Tested-by: Ian Campbell <ijc@hellion.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-26mm: make the vma list be doubly linkedLinus Torvalds
commit 297c5eee372478fc32fec5fe8eed711eedb13f3d upstream. It's a really simple list, and several of the users want to go backwards in it to find the previous vma. So rather than have to look up the previous entry with 'find_vma_prev()' or something similar, just make it doubly linked instead. Tested-by: Ian Campbell <ijc@hellion.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-20mm: fix up some user-visible effects of the stack guard pageLinus Torvalds
commit d7824370e26325c881b665350ce64fb0a4fde24a upstream. This commit makes the stack guard page somewhat less visible to user space. It does this by: - not showing the guard page in /proc/<pid>/maps It looks like lvm-tools will actually read /proc/self/maps to figure out where all its mappings are, and effectively do a specialized "mlockall()" in user space. By not showing the guard page as part of the mapping (by just adding PAGE_SIZE to the start for grows-up pages), lvm-tools ends up not being aware of it. - by also teaching the _real_ mlock() functionality not to try to lock the guard page. That would just expand the mapping down to create a new guard page, so there really is no point in trying to lock it in place. It would perhaps be nice to show the guard page specially in /proc/<pid>/maps (or at least mark grow-down segments some way), but let's not open ourselves up to more breakage by user space from programs that depends on the exact deails of the 'maps' file. Special thanks to Henrique de Moraes Holschuh for diving into lvm-tools source code to see what was going on with the whole new warning. Reported-and-tested-by: François Valenduc <francois.valenduc@tvcablenet.be Reported-by: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-20mm: fix page table unmap for stack guard page properlyLinus Torvalds
commit 11ac552477e32835cb6970bf0a70c210807f5673 upstream. We do in fact need to unmap the page table _before_ doing the whole stack guard page logic, because if it is needed (mainly 32-bit x86 with PAE and CONFIG_HIGHPTE, but other architectures may use it too) then it will do a kmap_atomic/kunmap_atomic. And those kmaps will create an atomic region that we cannot do allocations in. However, the whole stack expand code will need to do anon_vma_prepare() and vma_lock_anon_vma() and they cannot do that in an atomic region. Now, a better model might actually be to do the anon_vma_prepare() when _creating_ a VM_GROWSDOWN segment, and not have to worry about any of this at page fault time. But in the meantime, this is the straightforward fix for the issue. See https://bugzilla.kernel.org/show_bug.cgi?id=16588 for details. Reported-by: Wylda <wylda@volny.cz> Reported-by: Sedat Dilek <sedat.dilek@gmail.com> Reported-by: Mike Pagano <mpagano@gentoo.org> Reported-by: François Valenduc <francois.valenduc@tvcablenet.be> Tested-by: Ed Tomlinson <edt@aei.ca> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13mm: fix missing page table unmap for stack guard page failure caseLinus Torvalds
commit 5528f9132cf65d4d892bcbc5684c61e7822b21e9 upstream. .. which didn't show up in my tests because it's a no-op on x86-64 and most other architectures. But we enter the function with the last-level page table mapped, and should unmap it at exit. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13mm: keep a guard page below a grow-down stack segmentLinus Torvalds
commit 320b2b8de12698082609ebbc1a17165727f4c893 upstream. This is a rather minimally invasive patch to solve the problem of the user stack growing into a memory mapped area below it. Whenever we fill the first page of the stack segment, expand the segment down by one page. Now, admittedly some odd application might _want_ the stack to grow down into the preceding memory mapping, and so we may at some point need to make this a process tunable (some people might also want to have more than a single page of guarding), but let's try the minimal approach first. Tested with trivial application that maps a single page just below the stack, and then starts recursing. Without this, we will get a SIGSEGV _after_ the stack has smashed the mapping. With this patch, we'll get a nice SIGBUS just as the stack touches the page just above the mapping. Requested-by: Keith Packard <keithp@keithp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-13mm: fix corruption of hibernation caused by reusing swap during image savingKAMEZAWA Hiroyuki
commit 966cca029f739716fbcc8068b8c6dfe381f86fc3 upstream. Since 2.6.31, swap_map[]'s refcounting was changed to show that a used swap entry is just for swap-cache, can be reused. Then, while scanning free entry in swap_map[], a swap entry may be able to be reclaimed and reused. It was caused by commit c9e444103b5e7a5 ("mm: reuse unused swap entry if necessary"). But this caused deta corruption at resume. The scenario is - Assume a clean-swap cache, but mapped. - at hibernation_snapshot[], clean-swap-cache is saved as clean-swap-cache and swap_map[] is marked as SWAP_HAS_CACHE. - then, save_image() is called. And reuse SWAP_HAS_CACHE entry to save image, and break the contents. After resume: - the memory reclaim runs and finds clean-not-referenced-swap-cache and discards it because it's marked as clean. But here, the contents on disk and swap-cache is inconsistent. Hance memory is corrupted. This patch avoids the bug by not reclaiming swap-entry during hibernation. This is a quick fix for backporting. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Reported-by: Ondreg Zary <linux@rainbow-software.org> Tested-by: Ondreg Zary <linux@rainbow-software.org> Tested-by: Andrea Gelmini <andrea.gelmini@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-10mm: fix ia64 crash when gcore reads gate areaHugh Dickins
commit de51257aa301652876ab6e8f13ea4eadbe4a3846 upstream. Debian's ia64 autobuilders have been seeing kernel freeze or reboot when running the gdb testsuite (Debian bug 588574): dannf bisected to 2.6.32 62eede62dafb4a6633eae7ffbeb34c60dba5e7b1 "mm: ZERO_PAGE without PTE_SPECIAL"; and reproduced it with gdb's gcore on a simple target. I'd missed updating the gate_vma handling in __get_user_pages(): that happens to use vm_normal_page() (nowadays failing on the zero page), yet reported success even when it failed to get a page - boom when access_process_vm() tried to copy that to its intermediate buffer. Fix this, resisting cleanups: in particular, leave it for now reporting success when not asked to get any pages - very probably safe to change, but let's not risk it without testing exposure. Why did ia64 crash with 16kB pages, but succeed with 64kB pages? Because setup_gate() pads each 64kB of its gate area with zero pages. Reported-by: Andreas Barth <aba@not.so.argh.org> Bisected-by: dann frazier <dannf@debian.org> Signed-off-by: Hugh Dickins <hughd@google.com> Tested-by: dann frazier <dannf@dannf.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-02x86,nobootmem: make alloc_bootmem_node fall back to other node when 32bit ↵Yinghai Lu
numa is used commit b8ab9f82025adea77864115da73e70026fa4f540 upstream. Borislav Petkov reported his 32bit numa system has problem: [ 0.000000] Reserving total of 4c00 pages for numa KVA remap [ 0.000000] kva_start_pfn ~ 32800 max_low_pfn ~ 375fe [ 0.000000] max_pfn = 238000 [ 0.000000] 8202MB HIGHMEM available. [ 0.000000] 885MB LOWMEM available. [ 0.000000] mapped low ram: 0 - 375fe000 [ 0.000000] low ram: 0 - 375fe000 [ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 1000 1000 => 34e7000 [ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 200 40 => 34c9d80 [ 0.000000] alloc (nid=0 100000 - 7ee00000) (1000000 - ffffffffffffffff) 180 40 => 34e6140 [ 0.000000] alloc (nid=1 80000000 - c7e60000) (1000000 - ffffffffffffffff) 240 40 => 80000000 [ 0.000000] BUG: unable to handle kernel paging request at 40000000 [ 0.000000] IP: [<c2c8cff1>] __alloc_memory_core_early+0x147/0x1d6 [ 0.000000] *pdpt = 0000000000000000 *pde = f000ff53f000ff00 ... [ 0.000000] Call Trace: [ 0.000000] [<c2c8b4f8>] ? __alloc_bootmem_node+0x216/0x22f [ 0.000000] [<c2c90c9b>] ? sparse_early_usemaps_alloc_node+0x5a/0x10b [ 0.000000] [<c2c9149e>] ? sparse_init+0x1dc/0x499 [ 0.000000] [<c2c79118>] ? paging_init+0x168/0x1df [ 0.000000] [<c2c780ff>] ? native_pagetable_setup_start+0xef/0x1bb looks like it allocates too much high address for bootmem. Try to cut limit with get_max_mapped() Reported-by: Borislav Petkov <borislav.petkov@amd.com> Tested-by: Conny Seidel <conny.seidel@amd.com> Signed-off-by: Yinghai Lu <yinghai@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-08-02kmemleak: Add support for NO_BOOTMEM configurationsCatalin Marinas
commit 9078370c0d2cfe4a905aa34f398bbb0d65921a2b upstream. With commits 08677214 and 59be5a8e, alloc_bootmem()/free_bootmem() and friends use the early_res functions for memory management when NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the corresponding code paths for bootmem allocations. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Yinghai Lu <yinghai@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-05do_generic_file_read: clear page errors when issuing a fresh read of the pageJeff Moyer
commit 91803b499cca2fe558abad709ce83dc896b80950 upstream. I/O errors can happen due to temporary failures, like multipath errors or losing network contact with the iSCSI server. Because of that, the VM will retry readpage on the page. However, do_generic_file_read does not clear PG_error. This causes the system to be unable to actually use the data in the page cache page, even if the subsequent readpage completes successfully! The function filemap_fault has had a ClearPageError before readpage forever. This patch simply adds the same to do_generic_file_read. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Larry Woodman <lwoodman@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-05slub: move kmem_cache_node into it's own cachelineAlexander Duyck
commit 73367bd8eef4f4eb311005886aaa916013073265 upstream. This patch is meant to improve the performance of SLUB by moving the local kmem_cache_node lock into it's own cacheline separate from kmem_cache. This is accomplished by simply removing the local_node when NUMA is enabled. On my system with 2 nodes I saw around a 5% performance increase w/ hackbench times dropping from 6.2 seconds to 5.9 seconds on average. I suspect the performance gain would increase as the number of nodes increases, but I do not have the data to currently back that up. Bugzilla-Reference: http://bugzilla.kernel.org/show_bug.cgi?id=15713 Reported-by: Alex Shi <alex.shi@intel.com> Tested-by: Alex Shi <alex.shi@intel.com> Acked-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-07-05tmpfs: insert tmpfs cache pages to inactive list at firstKOSAKI Motohiro
commit e9d6c157385e4efa61cb8293e425c9d8beba70d3 upstream. Shaohua Li reported parallel file copy on tmpfs can lead to OOM killer. This is regression of caused by commit 9ff473b9a7 ("vmscan: evict streaming IO first"). Wow, It is 2 years old patch! Currently, tmpfs file cache is inserted active list at first. This means that the insertion doesn't only increase numbers of pages in anon LRU, but it also reduces anon scanning ratio. Therefore, vmscan will get totally confused. It scans almost only file LRU even though the system has plenty unused tmpfs pages. Historically, lru_cache_add_active_anon() was used for two reasons. 1) Intend to priotize shmem page rather than regular file cache. 2) Intend to avoid reclaim priority inversion of used once pages. But we've lost both motivation because (1) Now we have separate anon and file LRU list. then, to insert active list doesn't help such priotize. (2) In past, one pte access bit will cause page activation. then to insert inactive list with pte access bit mean higher priority than to insert active list. Its priority inversion may lead to uninteded lru chun. but it was already solved by commit 645747462 (vmscan: detect mapped file pages used only once). (Thanks Hannes, you are great!) Thus, now we can use lru_cache_add_anon() instead. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: Shaohua Li <shaohua.li@intel.com> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-05-11memcg: fix css_is_ancestor() RCU lockingKAMEZAWA Hiroyuki
Some callers (in memcontrol.c) calls css_is_ancestor() without rcu_read_lock. Because css_is_ancestor() has to access RCU protected data, it should be under rcu_read_lock(). This makes css_is_ancestor() itself does safe access to RCU protected area. (At least, "root" can have refcnt==0 if it's not an ancestor of "child". So, we need rcu_read_lock().) Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-11memcg: fix css_id() RCU locking for realKAMEZAWA Hiroyuki
Commit ad4ba375373937817404fd92239ef4cadbded23b ("memcg: css_id() must be called under rcu_read_lock()") modifies memcontol.c for fixing RCU check message. But Andrew Morton pointed out that the fix doesn't seems sane and it was just for hidining lockdep messages. This is a patch for do proper things. Checking again, all places, accessing without rcu_read_lock, that commit fixies was intentional.... all callers of css_id() has reference count on it. So, it's not necessary to be under rcu_read_lock(). Considering again, we can use rcu_dereference_check for css_id(). We know css->id is valid if css->refcnt > 0. (css->id never changes and freed after css->refcnt going to be 0.) This patch makes use of rcu_dereference_check() in css_id/depth and remove unnecessary rcu-read-lock added by the commit. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-11rmap: remove anon_vma check in page_address_in_vma()Naoya Horiguchi
Currently page_address_in_vma() compares vma->anon_vma and page_anon_vma(page) for parameter check, but in 2.6.34 a vma can have multiple anon_vmas with anon_vma_chain, so current check does not work. (For anonymous page shared by multiple processes, some verified (page,vma) pairs return -EFAULT wrongly.) We can go to checking all anon_vmas in the "same_vma" chain, but it needs to meet lock requirement. Instead, we can remove anon_vma check safely because page_address_in_vma() assumes that page and vma are already checked to belong to the identical process. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-11hugetlbfs: kill applications that use MAP_NORESERVE with SIGBUS instead of ↵Mel Gorman
OOM-killer Ordinarily, application using hugetlbfs will create mappings with reserves. For shared mappings, these pages are reserved before mmap() returns success and for private mappings, the caller process is guaranteed and a child process that cannot get the pages gets killed with sigbus. An application that uses MAP_NORESERVE gets no reservations and mmap() will always succeed at the risk the page will not be available at fault time. This might be used for example on very large sparse mappings where the developer is confident the necessary huge pages exist to satisfy all faults even though the whole mapping cannot be backed by huge pages. Unfortunately, if an allocation does fail, VM_FAULT_OOM is returned to the fault handler which proceeds to trigger the OOM-killer. This is unhelpful. Even without hugetlbfs mounted, a user using mmap() can trivially trigger the OOM-killer because VM_FAULT_OOM is returned (will provide example program if desired - it's a whopping 24 lines long). It could be considered a DOS available to an unprivileged user. This patch alters hugetlbfs to kill a process that uses MAP_NORESERVE where huge pages were not available with SIGBUS instead of triggering the OOM killer. This change affects hugetlb_cow() as well. I feel there is a failure case in there, but I didn't create one. It would need a fairly specific target in terms of the faulting application and the hugepage pool size. The hugetlb_no_page() path is much easier to hit but both might as well be closed. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-07Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: create rcu_my_thread_group_empty() wrapper memcg: css_id() must be called under rcu_read_lock() cgroup: Check task_lock in task_subsys_state() sched: Fix an RCU warning in print_task() cgroup: Fix an RCU warning in alloc_css_id() cgroup: Fix an RCU warning in cgroup_path() KEYS: Fix an RCU warning in the reading of user keys KEYS: Fix an RCU warning
2010-05-05slub: Fix bad boundary check in init_kmem_cache_nodes()Zhang, Yanmin
Function init_kmem_cache_nodes is incorrect when checking upper limitation of kmalloc_caches. The breakage was introduced by commit 91efd773c74bb26b5409c85ad755d536448e229c ("dma kmalloc handling fixes"). Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2010-05-04memcg: css_id() must be called under rcu_read_lock()Paul E. McKenney
This patch fixes task_in_mem_cgroup(), mem_cgroup_uncharge_swapcache(), mem_cgroup_move_swap_account(), and is_target_pte_for_mc() to protect calls to css_id(). An additional RCU lockdep splat was reported for memcg_oom_wake_function(), however, this function is not yet in mainline as of 2.6.34-rc5. Reported-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org>
2010-04-28Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: coda: move backing-dev.h kernel include inside __KERNEL__ mtd: ensure that bdi entries are properly initialized and registered Move mtd_bdi_*mappable to mtdcore.c btrfs: convert to using bdi_setup_and_register() Catch filesystems lacking s_bdi drbd: Terminate a connection early if sending the protocol fails drbd: fix memory leak Fix JFFS2 sync silent failure smbfs: add bdi backing to mount session ncpfs: add bdi backing to mount session exofs: add bdi backing to mount session ecryptfs: add bdi backing to mount session coda: add bdi backing to mount session cifs: add bdi backing to mount session afs: add bdi backing to mount session. 9p: add bdi backing to mount session bdi: add helper function for doing init and register of a bdi for a file system block: ensure jiffies wrap is handled correctly in blk_rq_timed_out_timer
2010-04-27mmap: check ->vm_ops before dereferencingRik van Riel
Check whether the VMA has a vm_ops before calling close, just like we check vm_ops before calling open a few dozen lines higher up in the function. Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-25Catch filesystems lacking s_bdiJörn Engel
noop_backing_dev_info is used only as a flag to mark filesystems that don't have any backing store, like tmpfs, procfs, spufs, etc. Signed-off-by: Joern Engel <joern@logfs.org> Changed the BUG_ON() to a WARN_ON(). Note that adding dirty inodes to the noop_backing_dev_info is not legal and will not result in them being flushed, but we already catch this condition in __mark_inode_dirty() when checking for a registered bdi. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-24ksm: check for ERR_PTR from follow_page()Dan Carpenter
The follow_page() function can potentially return -EFAULT so I added checks for this. Also I silenced an uninitialized variable warning on my version of gcc (version 4.3.2). Signed-off-by: Dan Carpenter <error27@gmail.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Izik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-24rmap: anon_vma_prepare() can leak anon_vma_chainOleg Nesterov
If find_mergeable_anon_vma() succeeds but another thread installs ->anon_vma before we take ptl, then allocated == NULL but avc should be freed. Change the code to check avc != NULL to detect this case. Also, a couple of whitespace changes to make the critical section more visible. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Pete Zaitcev <zaitcev@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-24hugetlb: fix infinite loop in get_futex_key() when backed by huge pagesMel Gorman
If a futex key happens to be located within a huge page mapped MAP_PRIVATE, get_futex_key() can go into an infinite loop waiting for a page->mapping that will never exist. See https://bugzilla.redhat.com/show_bug.cgi?id=552257 for more details about the problem. This patch makes page->mapping a poisoned value that includes PAGE_MAPPING_ANON mapped MAP_PRIVATE. This is enough for futex to continue but because of PAGE_MAPPING_ANON, the poisoned value is not dereferenced or used by futex. No other part of the VM should be dereferencing the page->mapping of a hugetlbfs page as its page cache is not on the LRU. This patch fixes the problem with the test case described in the bugzilla. [akpm@linux-foundation.org: mel cant spel] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Darren Hart <darren@dvhart.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-24memcg: fix prepare migrationAndrea Arcangeli
If a signal is pending (task being killed by sigkill) __mem_cgroup_try_charge will write NULL into &mem, and css_put will oops on null pointer dereference. BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 IP: [<ffffffff810fc6cc>] mem_cgroup_prepare_migration+0x7c/0xc0 PGD a5d89067 PUD a5d8a067 PMD 0 Oops: 0000 [#1] SMP last sysfs file: /sys/devices/platform/microcode/firmware/microcode/loading CPU 0 Modules linked in: nfs lockd nfs_acl auth_rpcgss sunrpc acpi_cpufreq pcspkr sg [last unloaded: microcode] Pid: 5299, comm: largepages Tainted: G W 2.6.34-rc3 #3 Penryn1600SLI-110dB/To Be Filled By O.E.M. RIP: 0010:[<ffffffff810fc6cc>] [<ffffffff810fc6cc>] mem_cgroup_prepare_migration+0x7c/0xc0 [nishimura@mxp.nes.nec.co.jp: fix merge issues] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-22bdi: add helper function for doing init and register of a bdi for a file systemJens Axboe
Pretty trivial helper, just sets up the bdi and registers it. An atomic sequence count is used to ensure that the registered sysfs names are unique. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-19rmap: add exclusively owned pages to the newest anon_vmaRik van Riel
The recent anon_vma fixes cause many anonymous pages to end up in the parent process anon_vma, even when the page is exclusively owned by the current process. Adding exclusively owned anonymous pages to the top anon_vma reduces rmap scanning overhead, especially in workloads with forking servers. This patch adds a parameter to __page_set_anon_rmap that can be used to indicate whether or not the added page is exclusively owned by the current process. Pages added through page_add_new_anon_rmap are exclusively owned by the current process, and can be added to the top anon_vma. Pages added through page_add_anon_rmap can be either shared or exclusively owned, so we do the conservative thing and add it to the oldest anon_vma. A next step would be to add the exclusive parameter to page_add_anon_rmap, to be used from functions where we do know for sure whether a page is exclusively owned. Signed-off-by: Rik van Riel <riel@redhat.com> Reviewed-by: Johannes Weiner <hannes@cmpxchg.org> Lightly-tested-by: Borislav Petkov <bp@alien8.de> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> [ Edited to look nicer - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-12anonvma: when setting up page->mapping, we need to pick the _oldest_ anonvmaLinus Torvalds
Otherwise we might be mapping in a page in a new mapping, but that page (through the swapcache) would later be mapped into an old mapping too. The page->mapping must be the case that works for everybody, not just the mapping that happened to page it in first. Here's the scenario: - page gets allocated/mapped by process A. Let's call the anon_vma we associate the page with 'A' to keep it easy to track. - Process A forks, creating process B. The anon_vma in B is 'B', and has a chain that looks like 'B' -> 'A'. Everything is fine. - Swapping happens. The page (with mapping pointing to 'A') gets swapped out (perhaps not to disk - it's enough to assume that it's just not mapped any more, and lives entirely in the swap-cache) - Process B pages it in, which goes like this: do_swap_page -> page = lookup_swap_cache(entry); ... set_pte_at(mm, address, page_table, pte); page_add_anon_rmap(page, vma, address); And think about what happens here! In particular, what happens is that this will now be the "first" mapping of that page, so page_add_anon_rmap() used to do if (first) __page_set_anon_rmap(page, vma, address); and notice what anon_vma it will use? It will use the anon_vma for process B! What happens then? Trivial: process 'A' also pages it in (nothing happens, it's not the first mapping), and then process 'B' execve's or exits or unmaps, making anon_vma B go away. End result: process A has a page that points to anon_vma B, but anon_vma B does not exist any more. This can go on forever. Forget about RCU grace periods, forget about locking, forget anything like that. The bug is simply that page->mapping points to an anon_vma that was correct at one point, but was _not_ the one that was shared by all users of that possible mapping. Changing it to always use the deepest anon_vma in the anonvma chain gets us to the safest model. This can be improved in certain cases: if we know the page is private to just this particular mapping (for example, it's a new page, or it is the only swapcache entry), we could pick the top (most specific) anon_vma. But that's a future optimization. Make it _work_ reliably first. Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Borislav Petkov <bp@alien8.de> [ "What do you know, I think you fixed it!" ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-12anon_vma: clone the anon_vma chain in the right orderLinus Torvalds
We want to walk the chain in reverse order when cloning it, so that the order of the result chain will be the same as the order in the source chain. When we add entries to the chain, they go at the head of the chain, so we want to add the source head last. Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Borislav Petkov <bp@alien8.de> [ "No, it still oopses" ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-12vma_adjust: fix the copying of anon_vma chainsLinus Torvalds
When we move the boundaries between two vma's due to things like mprotect, we need to make sure that the anon_vma of the pages that got moved from one vma to another gets properly copied around. And that was not always the case, in this rather hard-to-follow code sequence. Clarify the code, and fix it so that it copies the anon_vma from the right source. Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Borislav Petkov <bp@alien8.de> [ "Yeah, not so much this one either" ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-12Simplify and comment on anon_vma re-use for anon_vma_prepare()Linus Torvalds
This changes the anon_vma reuse case to require that we only reuse simple anon_vma's - ie the case when the vma only has a single anon_vma associated with it. This means that a reuse of an anon_vma from an adjacent vma will always guarantee that both vma's are associated not only with the same anon_vma, they will also have the same anon_vma chain (of just a single entry in this case). And since anon_vma re-use was the only case where the same anon_vma might be associated with different chains of anon_vma's, we now have the case that every vma that shares the same anon_vma will always also have the same chain. That makes it much easier to think about merging vma's that share the same anon_vma's: you can always just drop the other anon_vma chain in anon_vma_merge() since you know that they are always identical. This also splits up the function to validate the anon_vma re-use, and adds a lot of commentary about the possible races. Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Borislav Petkov <bp@alien8.de> [ "That didn't fix it" ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-09Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (34 commits) cfq-iosched: Fix the incorrect timeslice accounting with forced_dispatch loop: Update mtime when writing using aops block: expose the statistics in blkio.time and blkio.sectors for the root cgroup backing-dev: Handle class_create() failure Block: Fix block/elevator.c elevator_get() off-by-one error drbd: lc_element_by_index() never returns NULL cciss: unlock on error path cfq-iosched: Do not merge queues of BE and IDLE classes cfq-iosched: Add additional blktrace log messages in CFQ for easier debugging i2o: Remove the dangerous kobj_to_i2o_device macro block: remove 16 bytes of padding from struct request on 64bits cfq-iosched: fix a kbuild regression block: make CONFIG_BLK_CGROUP visible Remove GENHD_FL_DRIVERFS block: Export max number of segments and max segment size in sysfs block: Finalize conversion of block limits functions block: Fix overrun in lcm() and move it to lib vfs: improve writeback_inodes_wb() paride: fix off-by-one test drbd: fix al-to-on-disk-bitmap for 4k logical_block_size ...
2010-04-09slub: Fix kmem_ptr_validate() for non-kernel pointersPekka Enberg
As suggested by Linus, fix up kmem_ptr_validate() to handle non-kernel pointers more graciously. The patch changes kmem_ptr_validate() to use the newly introduced kern_ptr_validate() helper to check that a pointer is a valid kernel pointer before we attempt to convert it into a 'struct page'. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-09slab: Generify kernel pointer validationPekka Enberg
As suggested by Linus, introduce a kern_ptr_validate() helper that does some sanity checks to make sure a pointer is a valid kernel pointer. This is a preparational step for fixing SLUB kmem_ptr_validate(). Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-07Merge branch 'x86-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-tip * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-tip: x86: Fix double enable_IR_x2apic() call on SMP kernel on !SMP boards x86: Increase CONFIG_NODES_SHIFT max to 10 ibft, x86: Change reserve_ibft_region() to find_ibft_region() x86, hpet: Fix bug in RTC emulation x86, hpet: Erratum workaround for read after write of HPET comparator bootmem, x86: Fix 32bit numa system without RAM on node 0 nobootmem, x86: Fix 32bit numa system without RAM on node 0 x86: Handle overlapping mptables x86: Make e820_remove_range to handle all covered case x86-32, resume: do a global tlb flush in S4 resume
2010-04-07memcg: fix race in file_mapped accountingKAMEZAWA Hiroyuki
Presently, memcg's FILE_MAPPED accounting has following race with move_account (happens at rmdir()). increment page->mapcount (rmap.c) mem_cgroup_update_file_mapped() move_account() lock_page_cgroup() check page_mapped() if page_mapped(page)>1 { FILE_MAPPED -1 from old memcg FILE_MAPPED +1 to old memcg } ..... overwrite pc->mem_cgroup unlock_page_cgroup() lock_page_cgroup() FILE_MAPPED + 1 to pc->mem_cgroup unlock_page_cgroup() Then, old memcg (-1 file mapped) new memcg (+2 file mapped) This happens because move_account see page_mapped() which is not guarded by lock_page_cgroup(). This patch adds FILE_MAPPED flag to page_cgroup and move account information based on it. Now, all checks are synchronous with lock_page_cgroup(). Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Balbir Singh <balbir@in.ibm.com> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Andrea Righi <arighi@develer.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-07pagemap: fix pfn calculation for hugepageNaoya Horiguchi
When we look into pagemap using page-types with option -p, the value of pfn for hugepages looks wrong (see below.) This is because pte was evaluated only once for one vma although it should be updated for each hugepage. This patch fixes it. $ page-types -p 3277 -Nl -b huge voffset offset len flags 7f21e8a00 11e400 1 ___U___________H_G________________ 7f21e8a01 11e401 1ff ________________TG________________ ^^^ 7f21e8c00 11e400 1 ___U___________H_G________________ 7f21e8c01 11e401 1ff ________________TG________________ ^^^ One hugepage contains 1 head page and 511 tail pages in x86_64 and each two lines represent each hugepage. Voffset and offset mean virtual address and physical address in the page unit, respectively. The different hugepages should not have the same offset value. With this patch applied: $ page-types -p 3386 -Nl -b huge voffset offset len flags 7fec7a600 112c00 1 ___UD__________H_G________________ 7fec7a601 112c01 1ff ________________TG________________ ^^^ 7fec7a800 113200 1 ___UD__________H_G________________ 7fec7a801 113201 1ff ________________TG________________ ^^^ OK More info: - This patch modifies walk_page_range()'s hugepage walker. But the change only affects pagemap_read(), which is the only caller of hugepage callback. - Without this patch, hugetlb_entry() callback is called per vma, that doesn't match the natural expectation from its name. - With this patch, hugetlb_entry() is called per hugepte entry and the callback can become much simpler. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-07mm: revert "vmscan: get_scan_ratio() cleanup"KOSAKI Motohiro
Shaohua Li reported his tmpfs streaming I/O test can lead to make oom. The test uses a 6G tmpfs in a system with 3G memory. In the tmpfs, there are 6 copies of kernel source and the test does kbuild for each copy. His investigation shows the test has a lot of rotated anon pages and quite few file pages, so get_scan_ratio calculates percent[0] (i.e. scanning percent for anon) to be zero. Actually the percent[0] shoule be a big value, but our calculation round it to zero. Although before commit 84b18490 ("vmscan: get_scan_ratio() cleanup") , we have the same problem too. But the old logic can rescue percent[0]==0 case only when priority==0. It had hided the real issue. I didn't think merely streaming io can makes percent[0]==0 && priority==0 situation. but I was wrong. So, definitely we have to fix such tmpfs streaming io issue. but anyway I revert the regression commit at first. This reverts commit 84b18490d1f1bc7ed5095c929f78bc002eb70f26. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: Shaohua Li <shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-07readahead: fix NULL filp dereferenceWu Fengguang
btrfs relocate_file_extent_cluster() calls us with NULL filp: [ 4005.426805] BUG: unable to handle kernel NULL pointer dereference at 00000021 [ 4005.426818] IP: [<c109a130>] page_cache_sync_readahead+0x18/0x3e Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Yan Zheng <yanzheng@21cn.com> Reported-by: Kirill A. Shutemov <kirill@shutemov.name> Tested-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-07mm: avoid null-pointer deref in sync_mm_rss()KAMEZAWA Hiroyuki
- We weren't zeroing p->rss_stat[] at fork() - Consequently sync_mm_rss() was dereferencing tsk->mm for kernel threads and was oopsing. - Make __sync_task_rss_stat() static, too. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=15648 [akpm@linux-foundation.org: remove the BUG_ON(!mm->rss)] Reported-by: Troels Liebe Bentsen <tlb@rapanden.dk> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> "Michael S. Tsirkin" <mst@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-05Merge branch 'slabh' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/miscLinus Torvalds
* 'slabh' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc: eeepc-wmi: include slab.h staging/otus: include slab.h from usbdrv.h percpu: don't implicitly include slab.h from percpu.h kmemcheck: Fix build errors due to missing slab.h include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h iwlwifi: don't include iwl-dev.h from iwl-devtrace.h x86: don't include slab.h from arch/x86/include/asm/pgtable_32.h Fix up trivial conflicts in include/linux/percpu.h due to is_kernel_percpu_address() having been introduced since the slab.h cleanup with the percpu_up.c splitup.
2010-04-05Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: module: add stub for is_module_percpu_address percpu, module: implement and use is_kernel/module_percpu_address() module: encapsulate percpu handling better and record percpu_size
2010-04-05rmap: fix anon_vma_fork() memory leakRik van Riel
Fix a memory leak in anon_vma_fork(), where we fail to tear down the anon_vmas attached to the new VMA in case setting up the new anon_vma fails. This bug also has the potential to leave behind anon_vma_chain structs with pointers to invalid memory. Reported-by: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-02backing-dev: Handle class_create() failureAnton Blanchard
I hit this when we had a bug in IDR for a few days. Basically sysfs would fail to create new inodes since it uses an IDR and therefore class_create would fail. While we are unlikely to see this fail we may as well handle it instead of oopsing. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2010-04-01bootmem, x86: Fix 32bit numa system without RAM on node 0Yinghai Lu
When 32bit numa is used, free_all_bootmem() will still only go over with node id 0. If node 0 doesn't have RAM installed, the lowest populated node becomes low RAM. This one fixes BOOTMEM path by iterating over the bdata_list. -v3: add more comments, and fix bootmem path too. -v4: seperate from one big patch Signed-off-by: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <4BB416D7.6090203@kernel.org> Signed-off-by: H. Peter Anvin <hpa@zytor.com>