summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2009-01-24mm: direct IO starvation improvementNick Piggin
commit 48b47c561e41525061b5bc0cfd67d6367fd11dc4 upstream. Direct IO can invalidate and sync a lot of pagecache pages in the mapping. A 4K direct IO will actually try to sync and/or invalidate the pagecache of the entire file, for example (which might be many GB or TB large). Improve this by doing range syncs. Also, memory no longer has to be unmapped to catch the dirty bits for syncing, as dirty bits would remain coherent due to dirty mmap accounting. This fixes the immediate DM deadlocks when doing direct IO reads to block device with a mounted filesystem, if only by papering over the problem somewhat rather than addressing the fsync starvation cases. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24mm: write_cache_pages more terminate quicklyAndrew Morton
commit 82fd1a9a8ced9607312b54859572bcc6211e8919 upstream. Now that we have the early-termination logic in place, it makes sense to bail out early in all other cases where done is set to 1. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24mm: write_cache_pages terminate quicklyNick Piggin
commit d5482cdf8a0aacb1e6468a97d5544f5829c8d8c4 upstream. Terminate the write_cache_pages loop upon encountering the first page past end, without locking the page. Pages cannot have their index change when we have a reference on them (truncate, eg truncate_inode_pages_range performs the same check without the page lock). Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24mm: write_cache_pages optimise page cleaningNick Piggin
commit 515f4a037fb9ab736f8bad733fcd2ffd350cf265 upstream. In write_cache_pages, if we get stuck behind another process that is cleaning pages, we will be forced to wait for them to finish, then perform our own writeout (if it was redirtied during the long wait), then wait for that. If a page under writeout is still clean, we can skip waiting for it (if we're part of a data integrity sync, we'll be waiting for all writeout pages afterwards, so we'll still be waiting for the other guy's write that's cleaned the page). Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24mm: write_cache_pages cleanupsNick Piggin
commit 5a3d5c9813db56a75934eb1015367fda23a8b0b4 upstream. Get rid of some complex expressions from flow control statements, add a comment, remove some duplicate code. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24mm: write_cache_pages integrity fixNick Piggin
commit 05fe478dd04e02fa230c305ab9b5616669821dd3 upstream. In write_cache_pages, nr_to_write is heeded even for data-integrity syncs, so the function will return success after writing out nr_to_write pages, even if that was not sufficient to guarantee data integrity. The callers tend to set it to values that could break data interity semantics easily in practice. For example, nr_to_write can be set to mapping->nr_pages * 2, however if a file has a single, dirty page, then fsync is called, subsequent pages might be concurrently added and dirtied, then write_cache_pages might writeout two of these newly dirty pages, while not writing out the old page that should have been written out. Fix this by ignoring nr_to_write if it is a data integrity sync. This is a data integrity bug. The reason this has been done in the past is to avoid stalling sync operations behind page dirtiers. "If a file has one dirty page at offset 1000000000000000 then someone does an fsync() and someone else gets in first and starts madly writing pages at offset 0, we want to write that page at 1000000000000000. Somehow." What we do today is return success after an arbitrary amount of pages are written, whether or not we have provided the data-integrity semantics that the caller has asked for. Even this doesn't actually fix all stall cases completely: in the above situation, if the file has a huge number of pages in pagecache (but not dirty), then mapping->nrpages is going to be huge, even if pages are being dirtied. This change does indeed make the possibility of long stalls lager, and that's not a good thing, but lying about data integrity is even worse. We have to either perform the sync, or return -ELINUXISLAME so at least the caller knows what has happened. There are subsequent competing approaches in the works to solve the stall problems properly, without compromising data integrity. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24mm: write_cache_pages writepage error fixNick Piggin
commit 00266770b8b3a6a77f896ca501a0613739086832 upstream. In write_cache_pages, if ret signals a real error, but we still have some pages left in the pagevec, done would be set to 1, but the remaining pages would continue to be processed and ret will be overwritten in the process. It could easily be overwritten with success, and thus success will be returned even if there is an error. Thus the caller is told all writes succeeded, wheras in reality some did not. Fix this by bailing immediately if there is an error, and retaining the first error code. This is a data integrity bug. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24mm: write_cache_pages early loop terminationNick Piggin
commit bd19e012f6fd3b7309689165ea865cbb7bb88c1e upstream. We'd like to break out of the loop early in many situations, however the existing code has been setting mapping->writeback_index past the final page in the pagevec lookup for cyclic writeback. This is a problem if we don't process all pages up to the final page. Currently the code mostly keeps writeback_index reasonable and hacked around this by not breaking out of the loop or writing pages outside the range in these cases. Keep track of a real "done index" that enables us to terminate the loop in a much more flexible manner. Needed by the subsequent patch to preserve writepage errors, and then further patches to break out of the loop early for other reasons. However there are no functional changes with this patch alone. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24mm: write_cache_pages cyclic fixNick Piggin
commit 31a12666d8f0c22235297e1c1575f82061480029 upstream. In write_cache_pages, scanned == 1 is supposed to mean that cyclic writeback has circled through zero, thus we should not circle again. However it gets set to 1 after the first successful pagevec lookup. This leads to cases where not enough data gets written. Counterexample: file with first 10 pages dirty, writeback_index == 5, nr_to_write == 10. Then the 5 last pages will be found, and scanned will be set to 1, after writing those out, we will not cycle back to get the first 5. Rework this logic, now we'll always cycle unless we started off from index 0. When cycling, only write out as far as 1 page before the start page from the first cycle (so we don't write parts of the file twice). Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 28Heiko Carstens
commit 938bb9f5e840eddbf54e4f62f6c5ba9b3ae12c9d upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 26Heiko Carstens
commit c4ea37c26a691ad0b7e86aa5884aab27830e95c9 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 14Heiko Carstens
commit 3480b25743cb7404928d57efeaa3d085708b04c2 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 13Heiko Carstens
commit 6a6160a7b5c27b3c38651baef92a14fa7072b3c1 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrapper special casesHeiko Carstens
commit 6673e0c3fbeaed2cd08e2fd4a4aa97382d6fedb0 upstream. System calls with an unsigned long long argument can't be converted with the standard wrappers since that would include a cast to long, which in turn means that we would lose the upper 32 bit on 32 bit architectures. Also semctl can't use the standard wrapper since it has a 'union' parameter. So we handle them as special case and add some extra wrappers instead. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18Convert all system calls to return a longHeiko Carstens
commit 2ed7c03ec17779afb4fcfa3b8c61df61bd4879ba upstream. Convert all system calls to return a long. This should be a NOP since all converted types should have the same size anyway. With the exception of sys_exit_group which returned void. But that doesn't matter since the system call doesn't return. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18fs: symlink write_begin allocation context fixNick Piggin
commit 54566b2c1594c2326a645a3551f9d989f7ba3c5e upstream. With the write_begin/write_end aops, page_symlink was broken because it could no longer pass a GFP_NOFS type mask into the point where the allocations happened. They are done in write_begin, which would always assume that the filesystem can be entered from reclaim. This bug could cause filesystem deadlocks. The funny thing with having a gfp_t mask there is that it doesn't really allow the caller to arbitrarily tinker with the context in which it can be called. It couldn't ever be GFP_ATOMIC, for example, because it needs to take the page lock. The only thing any callers care about is __GFP_FS anyway, so turn that into a single flag. Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on this flag in their write_begin function. Change __grab_cache_page to accept a nofs argument as well, to honour that flag (while we're there, change the name to grab_cache_page_write_begin which is more instructive and does away with random leading underscores). This is really a more flexible way to go in the end anyway -- if a filesystem happens to want any extra allocations aside from the pagecache ones in ints write_begin function, it may now use GFP_KERNEL (rather than GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a random example). [kosaki.motohiro@jp.fujitsu.com: fix ubifs] [kosaki.motohiro@jp.fujitsu.com: fix fuse] Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Cleaned up the calling convention: just pass in the AOP flags untouched to the grab_cache_page_write_begin() function. That just simplifies everybody, and may even allow future expansion of the logic. - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18vmalloc.c: fix flushing in vmap_page_range()Adam Lackorzynski
commit 2e4e27c7d082b2198b63041310609d7191185a9d upstream. The flush_cache_vmap in vmap_page_range() is called with the end of the range twice. The following patch fixes this for me. Signed-off-by: Adam Lackorzynski <adam@os.inf.tu-dresden.de> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-16mm: Don't touch uninitialized variable in do_pages_stat_array()KOSAKI Motohiro
Commit 80bba1290ab5122c60cdb73332b26d288dc8aedd removed one necessary variable initialization. As a result following warning happened: CC mm/migrate.o mm/migrate.c: In function 'sys_move_pages': mm/migrate.c:1001: warning: 'err' may be used uninitialized in this function More unfortunately, if find_vma() failed, kernel read uninitialized memory. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Brice Goglin <Brice.Goglin@inria.fr> Cc: Christoph Lameter <clameter@sgi.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-15slob: do not pass the SLAB flags as GFP in kmem_cache_create()Catalin Marinas
The kmem_cache_create() function in the slob allocator passes the SLAB flags as GFP flags to the slob_alloc() function. The patch changes this call to pass GFP_KERNEL as the other allocators seem to do. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Matt Mackall <mpm@selenic.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-10KSYM_SYMBOL_LEN fixesHugh Dickins
Miles Lane tailing /sys files hit a BUG which Pekka Enberg has tracked to my 966c8c12dc9e77f931e2281ba25d2f0244b06949 sprint_symbol(): use less stack exposing a bug in slub's list_locations() - kallsyms_lookup() writes a 0 to namebuf[KSYM_NAME_LEN-1], but that was beyond the end of page provided. The 100 slop which list_locations() allows at end of page looks roughly enough for all the other stuff it might print after the symbol before it checks again: break out KSYM_SYMBOL_LEN earlier than before. Latencytop and ftrace and are using KSYM_NAME_LEN buffers where they need KSYM_SYMBOL_LEN buffers, and vmallocinfo a 2*KSYM_NAME_LEN buffer where it wants a KSYM_SYMBOL_LEN buffer: fix those before anyone copies them. [akpm@linux-foundation.org: ftrace.h needs module.h] Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc Miles Lane <miles.lane@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Steven Rostedt <srostedt@redhat.com> Acked-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-10mm: no get_user/put_user while holding mmap_sem in do_pages_stat?Brice Goglin
Since commit 2f007e74bb85b9fc4eab28524052161703300f1a, do_pages_stat() gets the page address from user-space and puts the corresponding status back while holding the mmap_sem for read. There is no need to hold mmap_sem there while some page-faults may occur. This patch adds a temporary address and status buffer so as to only hold mmap_sem while working on these kernel buffers. This is implemented by extracting do_pages_stat_array() out of do_pages_stat(). Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr> Cc: Christoph Lameter <clameter@sgi.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-10page_cgroup should ignore empty nodesKAMEZAWA Hiroyuki
Fix a total bootup freeze on ia64. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Li Zefan <lizf@cn.fujitsu.com> Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-10mm: remove UP version of lru_add_drain_all()KOSAKI Motohiro
Currently, lru_add_drain_all() has two version. (1) use schedule_on_each_cpu() (2) don't use schedule_on_each_cpu() Gerald Schaefer reported it doesn't work well on SMP (not NUMA) S390 machine. offline_pages() calls lru_add_drain_all() followed by drain_all_pages(). While drain_all_pages() works on each cpu, lru_add_drain_all() only runs on the current cpu for architectures w/o CONFIG_NUMA. This let us run into the BUG_ON(!PageBuddy(page)) in __offline_isolated_pages() during memory hotplug stress test on s390. The page in question was still on the pcp list, because of a race with lru_add_drain_all() and drain_all_pages() on different cpus. Actually, Almost machine has CONFIG_UNEVICTABLE_LRU=y. Then almost machine use (1) version lru_add_drain_all although the machine is UP. Then this ifdef is not valueable. simple removing is better. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-10mm/backing-dev.c: remove recently-added WARN_ON()Andrew Morton
On second thoughts, this is just going to disturb people while telling us things which we already knew. Cc: Peter Korsgaard <jacmet@sunsite.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-02vmscan: evict streaming IO firstRik van Riel
Count the insertion of new pages in the statistics used to drive the pageout scanning code. This should help the kernel quickly evict streaming file IO. We count on the fact that new file pages start on the inactive file LRU and new anonymous pages start on the active anon list. This means streaming file IO will increment the recent scanned file statistic, while leaving the recent rotated file statistic alone, driving pageout scanning to the file LRUs. Pageout activity does its own list manipulation. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: Gene Heskett <gene.heskett@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-02bdi: register sysfs bdi device only once per queueKay Sievers
Devices which share the same queue, like floppies and mtd devices, get registered multiple times in the bdi interface, but bdi accounts only the last registered device of the devices sharing one queue. On remove, all earlier registered devices leak, stay around in sysfs, and cause "duplicate filename" errors if the devices are re-created. This prevents the creation of multiple bdi interfaces per queue, and the bdi device will carry the dev_t name of the block device which is the first one registered, of the pool of devices using the same queue. [akpm@linux-foundation.org: add a WARN_ON so we know which drivers are misbehaving] Tested-by: Peter Korsgaard <jacmet@sunsite.dk> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01memcg: memory hotplug fix for notifier callbackKAMEZAWA Hiroyuki
Fixes for memcg/memory hotplug. While memory hotplug allocate/free memmap, page_cgroup doesn't free page_cgroup at OFFLINE when page_cgroup is allocated via bootomem. (Because freeing bootmem requires special care.) Then, if page_cgroup is allocated by bootmem and memmap is freed/allocated by memory hotplug, page_cgroup->page == page is no longer true. But current MEM_ONLINE handler doesn't check it and update page_cgroup->page if it's not necessary to allocate page_cgroup. (This was not found because memmap is not freed if SPARSEMEM_VMEMMAP is y.) And I noticed that MEM_ONLINE can be called against "part of section". So, freeing page_cgroup at CANCEL_ONLINE will cause trouble. (freeing used page_cgroup) Don't rollback at CANCEL. One more, current memory hotplug notifier is stopped by slub because it sets NOTIFY_STOP_MASK to return vaule. So, page_cgroup's callback never be called. (low priority than slub now.) I think this slub's behavior is not intentional(BUG). and fixes it. Another way to be considered about page_cgroup allocation: - free page_cgroup at OFFLINE even if it's from bootmem and remove specieal handler. But it requires more changes. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=12041 Signed-off-by: KAMEZAWA Hiruyoki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Tested-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01mm: vmalloc fix lazy unmapping cache aliasingNick Piggin
Jim Radford has reported that the vmap subsystem rewrite was sometimes causing his VIVT ARM system to behave strangely (seemed like going into infinite loops trying to fault in pages to userspace). We determined that the problem was most likely due to a cache aliasing issue. flush_cache_vunmap was only being called at the moment the page tables were to be taken down, however with lazy unmapping, this can happen after the page has subsequently been freed and allocated for something else. The dangling alias may still have dirty data attached to it. The fix for this problem is to do the cache flushing when the caller has called vunmap -- it would be a bug for them to write anything else to the mapping at that point. That appeared to solve Jim's problems. Reported-by: Jim Radford <radford@blackbean.org> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01vmscan: protect zone rotation stats by lru lockJohannes Weiner
The zone's rotation statistics must not be accessed without the corresponding LRU lock held. Fix an unprotected write in shrink_active_list(). Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-30meminit section warningsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19vmscan: fix get_scan_ratio() commentRik van Riel
Fix the old comment on the scan ratio calculations. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19vmscan: let GFP_NOFS go to swap againHugh Dickins
In the past, GFP_NOFS (but of course not GFP_NOIO) was allowed to reclaim by writing to swap. That got partially broken in 2.6.23, when may_enter_fs initialization was moved up before the allocation of swap, so its PageSwapCache test was failing the first time around, Fix it by setting may_enter_fs when add_to_swap() succeeds with __GFP_IO. In fact, check __GFP_IO before calling add_to_swap(): allocating swap we're not ready to use just increases disk seeking. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19migration: fix writepage errorHugh Dickins
Page migration's writeout() has got understandably confused by the nasty AOP_WRITEPAGE_ACTIVATE case: as in normal success, a writepage() error has unlocked the page, so writeout() then needs to relock it. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19mm: vmalloc search restart fixGlauber Costa
Current vmalloc restart search for a free area in case we can't find one. The reason is there are areas which are lazily freed, and could be possibly freed now. However, current implementation start searching the tree from the last failing address, which is pretty much by definition at the end of address space. So, we fail. The proposal of this patch is to restart the search from the beginning of the requested vstart address. This fixes the regression in running KVM virtual machines for me, described in http://lkml.org/lkml/2008/10/28/349, caused by commit db64fe02258f1507e13fe5212a989922323685ce. Signed-off-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19mm: vmalloc failure flush fixNick Piggin
An initial vmalloc failure should start off a synchronous flush of lazy areas, in case someone is in progress flushing them already, which could cause us to return an allocation failure even if there is plenty of KVA free. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19mm: vmalloc allocator off by oneNick Piggin
Fix off by one bug in the KVA allocator that can leave gaps in the address space. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19cpuset: update top cpuset's mems after adding a nodeMiao Xie
After adding a node into the machine, top cpuset's mems isn't updated. By reviewing the code, we found that the update function cpuset_track_online_nodes() was invoked after node_states[N_ONLINE] changes. It is wrong because N_ONLINE just means node has pgdat, and if node has/added memory, we use N_HIGH_MEMORY. So, We should invoke the update function after node_states[N_HIGH_MEMORY] changes, just like its commit says. This patch fixes it. And we use notifier of memory hotplug instead of direct calling of cpuset_track_online_nodes(). Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Paul Menage <menage@google.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-16unitialized return value in mm/mlock.c: __mlock_vma_pages_range()Helge Deller
Fix an unitialized return value when compiling on parisc (with CONFIG_UNEVICTABLE_LRU=y): mm/mlock.c: In function `__mlock_vma_pages_range': mm/mlock.c:165: warning: `ret' might be used uninitialized in this function Signed-off-by: Helge Deller <deller@gmx.de> [ It isn't ever really used uninitialized, since no caller should ever call this function with an empty range. But the compiler is correct that from a local analysis standpoint that is impossible to see, and fixing the warning is appropriate. ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-15mm: remove unevictable's show_page_pathKOSAKI Motohiro
Hugh Dickins reported show_page_path() is buggy and unsafe because - lack dput() against d_find_alias() - don't concern vma->vm_mm->owner == NULL - lack lock_page() it was only for debugging, so rather than trying to fix it, just remove it now. Reported-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Lee Schermerhorn <Lee.Schermerhorn@hp.com> CC: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-12memcg: bugfix for memory hotplugKAMEZAWA Hiroyuki
The start pfn calculation in page_cgroup's memory hotplug notifier chain is wrong. Tested-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-12mm: remove lru_add_drain_all() from the munlock pathKOSAKI Motohiro
lockdep warns about following message at boot time on one of my test machine. Then, schedule_on_each_cpu() sholdn't be called when the task have mmap_sem. Actually, lru_add_drain_all() exist to prevent the unevictalble pages stay on reclaimable lru list. but currenct unevictable code can rescue unevictable pages although it stay on reclaimable list. So removing is better. In addition, this patch add lru_add_drain_all() to sys_mlock() and sys_mlockall(). it isn't must. but it reduce the failure of moving to unevictable list. its failure can rescue in vmscan later. but reducing is better. Note, if above rescuing happend, the Mlocked and the Unevictable field mismatching happend in /proc/meminfo. but it doesn't cause any real trouble. ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.28-rc2-mm1 #2 ------------------------------------------------------- lvm/1103 is trying to acquire lock: (&cpu_hotplug.lock){--..}, at: [<c0130789>] get_online_cpus+0x29/0x50 but task is already holding lock: (&mm->mmap_sem){----}, at: [<c01878ae>] sys_mlockall+0x4e/0xb0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&mm->mmap_sem){----}: [<c0153da2>] check_noncircular+0x82/0x110 [<c0185e6a>] might_fault+0x4a/0xa0 [<c0156161>] validate_chain+0xb11/0x1070 [<c0185e6a>] might_fault+0x4a/0xa0 [<c0156923>] __lock_acquire+0x263/0xa10 [<c015714c>] lock_acquire+0x7c/0xb0 (*) grab mmap_sem [<c0185e6a>] might_fault+0x4a/0xa0 [<c0185e9b>] might_fault+0x7b/0xa0 [<c0185e6a>] might_fault+0x4a/0xa0 [<c0294dd0>] copy_to_user+0x30/0x60 [<c01ae3ec>] filldir+0x7c/0xd0 [<c01e3a6a>] sysfs_readdir+0x11a/0x1f0 (*) grab sysfs_mutex [<c01ae370>] filldir+0x0/0xd0 [<c01ae370>] filldir+0x0/0xd0 [<c01ae4c6>] vfs_readdir+0x86/0xa0 (*) grab i_mutex [<c01ae75b>] sys_getdents+0x6b/0xc0 [<c010355a>] syscall_call+0x7/0xb [<ffffffff>] 0xffffffff -> #2 (sysfs_mutex){--..}: [<c0153da2>] check_noncircular+0x82/0x110 [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0 [<c0156161>] validate_chain+0xb11/0x1070 [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0 [<c0156923>] __lock_acquire+0x263/0xa10 [<c015714c>] lock_acquire+0x7c/0xb0 (*) grab sysfs_mutex [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0 [<c04f8b55>] mutex_lock_nested+0xa5/0x2f0 [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0 [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0 [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0 [<c01e422f>] create_dir+0x3f/0x90 [<c01e42a9>] sysfs_create_dir+0x29/0x50 [<c04faaf5>] _spin_unlock+0x25/0x40 [<c028f21d>] kobject_add_internal+0xcd/0x1a0 [<c028f37a>] kobject_set_name_vargs+0x3a/0x50 [<c028f41d>] kobject_init_and_add+0x2d/0x40 [<c019d4d2>] sysfs_slab_add+0xd2/0x180 [<c019d580>] sysfs_add_func+0x0/0x70 [<c019d5dc>] sysfs_add_func+0x5c/0x70 (*) grab slub_lock [<c01400f2>] run_workqueue+0x172/0x200 [<c014008f>] run_workqueue+0x10f/0x200 [<c0140bd0>] worker_thread+0x0/0xf0 [<c0140c6c>] worker_thread+0x9c/0xf0 [<c0143c80>] autoremove_wake_function+0x0/0x50 [<c0140bd0>] worker_thread+0x0/0xf0 [<c0143972>] kthread+0x42/0x70 [<c0143930>] kthread+0x0/0x70 [<c01042db>] kernel_thread_helper+0x7/0x1c [<ffffffff>] 0xffffffff -> #1 (slub_lock){----}: [<c0153d2d>] check_noncircular+0xd/0x110 [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0 [<c0156161>] validate_chain+0xb11/0x1070 [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0 [<c015433d>] mark_lock+0x35d/0xd00 [<c0156923>] __lock_acquire+0x263/0xa10 [<c015714c>] lock_acquire+0x7c/0xb0 [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0 [<c04f93a3>] down_read+0x43/0x80 [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0 (*) grab slub_lock [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0 [<c04fd9ac>] notifier_call_chain+0x3c/0x70 [<c04f5454>] _cpu_up+0x84/0x110 [<c04f552b>] cpu_up+0x4b/0x70 (*) grab cpu_hotplug.lock [<c06d1530>] kernel_init+0x0/0x170 [<c06d15e5>] kernel_init+0xb5/0x170 [<c06d1530>] kernel_init+0x0/0x170 [<c01042db>] kernel_thread_helper+0x7/0x1c [<ffffffff>] 0xffffffff -> #0 (&cpu_hotplug.lock){--..}: [<c0155bff>] validate_chain+0x5af/0x1070 [<c040f7e0>] dev_status+0x0/0x50 [<c0156923>] __lock_acquire+0x263/0xa10 [<c015714c>] lock_acquire+0x7c/0xb0 [<c0130789>] get_online_cpus+0x29/0x50 [<c04f8b55>] mutex_lock_nested+0xa5/0x2f0 [<c0130789>] get_online_cpus+0x29/0x50 [<c0130789>] get_online_cpus+0x29/0x50 [<c017bc30>] lru_add_drain_per_cpu+0x0/0x10 [<c0130789>] get_online_cpus+0x29/0x50 (*) grab cpu_hotplug.lock [<c0140cf2>] schedule_on_each_cpu+0x32/0xe0 [<c0187095>] __mlock_vma_pages_range+0x85/0x2c0 [<c0156945>] __lock_acquire+0x285/0xa10 [<c0188f09>] vma_merge+0xa9/0x1d0 [<c0187450>] mlock_fixup+0x180/0x200 [<c0187548>] do_mlockall+0x78/0x90 (*) grab mmap_sem [<c01878e1>] sys_mlockall+0x81/0xb0 [<c010355a>] syscall_call+0x7/0xb [<ffffffff>] 0xffffffff other info that might help us debug this: 1 lock held by lvm/1103: #0: (&mm->mmap_sem){----}, at: [<c01878ae>] sys_mlockall+0x4e/0xb0 stack backtrace: Pid: 1103, comm: lvm Not tainted 2.6.28-rc2-mm1 #2 Call Trace: [<c01555fc>] print_circular_bug_tail+0x7c/0xd0 [<c0155bff>] validate_chain+0x5af/0x1070 [<c040f7e0>] dev_status+0x0/0x50 [<c0156923>] __lock_acquire+0x263/0xa10 [<c015714c>] lock_acquire+0x7c/0xb0 [<c0130789>] get_online_cpus+0x29/0x50 [<c04f8b55>] mutex_lock_nested+0xa5/0x2f0 [<c0130789>] get_online_cpus+0x29/0x50 [<c0130789>] get_online_cpus+0x29/0x50 [<c017bc30>] lru_add_drain_per_cpu+0x0/0x10 [<c0130789>] get_online_cpus+0x29/0x50 [<c0140cf2>] schedule_on_each_cpu+0x32/0xe0 [<c0187095>] __mlock_vma_pages_range+0x85/0x2c0 [<c0156945>] __lock_acquire+0x285/0xa10 [<c0188f09>] vma_merge+0xa9/0x1d0 [<c0187450>] mlock_fixup+0x180/0x200 [<c0187548>] do_mlockall+0x78/0x90 [<c01878e1>] sys_mlockall+0x81/0xb0 [<c010355a>] syscall_call+0x7/0xb Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh@veritas.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-12cpusets: update mems allowed in page allocatorDavid Rientjes
If all allowable memory is unreclaimable, it is possible to loop forever in the page allocator for ~__GFP_NORETRY allocations. During this time, it is also possible for a task's cpuset to expand its set of allowable nodes so that it now includes free memory. The cached copy of this set, current->mems_allowed, is stale, however, since there has not been a subsequent call to cpuset_update_task_memory_state(). The cached copy of the set of allowable nodes is now updated in the page allocator's slow path so the additional memory is available to get_page_from_freelist(). [akpm@linux-foundation.org: add comment] Signed-off-by: David Rientjes <rientjes@google.com> Cc: Paul Menage <menage@google.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-12hugetlb: make unmap_ref_private multi-size-awareAdam Litke
Oops. Part of the hugetlb private reservation code was not fully converted to use hstates. When a huge page must be unmapped from VMAs due to a failed COW, HPAGE_SIZE is used in the call to unmap_hugepage_range() regardless of the page size being used. This works if the VMA is using the default huge page size. Otherwise we might unmap too much, too little, or trigger a BUG_ON. Rare but serious -- fix it. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-12parisc: fix find_extend_vma() breakageDenys Vlasenko
The STACK_GROWSUP case of stack expansion was missing a test for 'prev', which got removed by commit cb8f488c33539f096580e202f5438a809195008f ("mmap.c: deinline a few functions") by mistake. I found my original email in "sent" folder. The patch in that mail does NOT remove !prev. That change had beed added by someone else. Ok, I think we are not much interested in who did it, let's fix it for good. [ "It looks like this was caused by me fixing rejects. That was the fancy include-lots-of-context-so-it-wont-apply patch." - akpm ] Reported-and-bisected-by: Helge Deller <deller@gmx.de> Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-07vmap: cope with vm_unmap_aliases before vmalloc_init()Jeremy Fitzhardinge
Xen can end up calling vm_unmap_aliases() before vmalloc_init() has been called. In this case its safe to make it a simple no-op. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Linux Memory Management List <linux-mm@kvack.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2008-11-06Merge master.kernel.org:/home/rmk/linux-2.6-armLinus Torvalds
* master.kernel.org:/home/rmk/linux-2.6-arm: [ARM] xsc3: fix xsc3_l2_inv_range [ARM] mm: fix page table initialization [ARM] fix naming of MODULE_START / MODULE_END ARM: OMAP: Fix define for twl4030 irqs ARM: OMAP: Fix get_irqnr_and_base to clear spurious interrupt bits ARM: OMAP: Fix debugfs_create_*'s error checking method for arm/plat-omap ARM: OMAP: Fix compiler warnings in gpmc.c [ARM] fix VFP+softfloat binaries
2008-11-06memory hotplug: fix page_zone() calculation in test_pages_isolated()Gerald Schaefer
My last bugfix here (adding zone->lock) introduced a new problem: Using page_zone(pfn_to_page(pfn)) to get the zone after the for() loop is wrong. pfn will then be >= end_pfn, which may be in a different zone or not present at all. This may lead to an addressing exception in page_zone() or spin_lock_irqsave(). Now I use __first_valid_page() again after the loop to find a valid page for page_zone(). Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: Nathan Fontenot <nfont@austin.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-06mm/oom_kill.c: fix badness() kerneldocQinghuang Feng
Paramter @mem has been removed since v2.6.26, now delete it's comment. Signed-off-by: Qinghuang Feng <qhfeng.kernel@gmail.com> Acked-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-06vmemmap: warn about page_structs with remote distanceDavid Rientjes
It's insufficient to simply compare node ids when warning about offnode page_structs since it's possible to still have local affinity. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-06mm: move migrate_prep out from under mmap_semChristoph Lameter
Move the migrate_prep outside the mmap_sem for the following system calls 1. sys_move_pages 2. sys_migrate_pages 3. sys_mbind() It really does not matter when we flush the lru. The system is free to add pages onto the lru even during migration which will make the page migration either skip the page (mbind, migrate_pages) or return a busy state (move_pages). Fixes this lockdep warning (and potential deadlock): Some VM place has mmap_sem -> kevent_wq via lru_add_drain_all() net/core/dev.c::dev_ioctl() has rtnl_lock -> mmap_sem (*) the ioctl has copy_from_user() and it can do page fault. linkwatch_event has kevent_wq -> rtnl_lock Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh@veritas.com> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>