summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2009-07-02mm: fix handling of pagesets for downed cpusDimitri Sivanich
commit 364df0ebfbbb1330bfc6ca159f4d6020efc15a12 upstream. After downing/upping a cpu, an attempt to set /proc/sys/vm/percpu_pagelist_fraction results in an oops in percpu_pagelist_fraction_sysctl_handler(). If a processor is downed then we need to set the pageset pointer back to the boot pageset. Updates of the high water marks should not access pagesets of unpopulated zones (those pointer go to the boot pagesets which would be no longer functional if their size would be increased beyond zero). Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-06-11mm: account for MAP_SHARED mappings using VM_MAYSHARE and not VM_SHARED in ↵Mel Gorman
hugetlbfs commit f83a275dbc5ca1721143698e844243fcadfabf6a upstream. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13302 hugetlbfs reserves huge pages but does not fault them at mmap() time to ensure that future faults succeed. The reservation behaviour differs depending on whether the mapping was mapped MAP_SHARED or MAP_PRIVATE. For MAP_SHARED mappings, hugepages are reserved when mmap() is first called and are tracked based on information associated with the inode. Other processes mapping MAP_SHARED use the same reservation. MAP_PRIVATE track the reservations based on the VMA created as part of the mmap() operation. Each process mapping MAP_PRIVATE must make its own reservation. hugetlbfs currently checks if a VMA is MAP_SHARED with the VM_SHARED flag and not VM_MAYSHARE. For file-backed mappings, such as hugetlbfs, VM_SHARED is set only if the mapping is MAP_SHARED and the file was opened read-write. If a shared memory mapping was mapped shared-read-write for populating of data and mapped shared-read-only by other processes, then hugetlbfs would account for the mapping as if it was MAP_PRIVATE. This causes processes to fail to map the file MAP_SHARED even though it should succeed as the reservation is there. This patch alters mm/hugetlb.c and replaces VM_SHARED with VM_MAYSHARE when the intent of the code was to check whether the VMA was mapped MAP_SHARED or MAP_PRIVATE. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ingo Molnar <mingo@elte.hu> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <starlight@binnacle.cx> Cc: Eric B Munson <ebmunson@us.ibm.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-05-19mm: close page_mkwrite racesNick Piggin
commit b827e496c893de0c0f142abfaeb8730a2fd6b37f upstream mm: close page_mkwrite races Change page_mkwrite to allow implementations to return with the page locked, and also change it's callers (in page fault paths) to hold the lock until the page is marked dirty. This allows the filesystem to have full control of page dirtying events coming from the VM. Rather than simply hold the page locked over the page_mkwrite call, we call page_mkwrite with the page unlocked and allow callers to return with it locked, so filesystems can avoid LOR conditions with page lock. The problem with the current scheme is this: a filesystem that wants to associate some metadata with a page as long as the page is dirty, will perform this manipulation in its ->page_mkwrite. It currently then must return with the page unlocked and may not hold any other locks (according to existing page_mkwrite convention). In this window, the VM could write out the page, clearing page-dirty. The filesystem has no good way to detect that a dirty pte is about to be attached, so it will happily write out the page, at which point, the filesystem may manipulate the metadata to reflect that the page is no longer dirty. It is not always possible to perform the required metadata manipulation in ->set_page_dirty, because that function cannot block or fail. The filesystem may need to allocate some data structure, for example. And the VM cannot mark the pte dirty before page_mkwrite, because page_mkwrite is allowed to fail, so we must not allow any window where the page could be written to if page_mkwrite does fail. This solution of holding the page locked over the 3 critical operations (page_mkwrite, setting the pte dirty, and finally setting the page dirty) closes out races nicely, preventing page cleaning for writeout being initiated in that window. This provides the filesystem with a strong synchronisation against the VM here. - Sage needs this race closed for ceph filesystem. - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913). - I need it for fsblock. - I suspect other filesystems may need it too (eg. btrfs). - I have converted buffer.c to the new locking. Even simple block allocation under dirty pages might be susceptible to i_size changing under partial page at the end of file (we also have a buffer.c-side problem here, but it cannot be fixed properly without this patch). - Other filesystems (eg. NFS, maybe btrfs) will need to change their page_mkwrite functions themselves. [ This also moves page_mkwrite another step closer to fault, which should eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a filesystem calldown and page lock/unlock cycle in __do_fault. ] [akpm@linux-foundation.org: fix derefs of NULL ->mapping] Cc: Sage Weil <sage@newdream.net> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-05-19mm: page_mkwrite change prototype to match faultNick Piggin
commit c2ec175c39f62949438354f603f4aa170846aabb upstream mm: page_mkwrite change prototype to match fault Change the page_mkwrite prototype to take a struct vm_fault, and return VM_FAULT_xxx flags. There should be no functional change. This makes it possible to return much more detailed error information to the VM (and also can provide more information eg. virtual_address to the driver, which might be important in some special cases). This is required for a subsequent fix. And will also make it easier to merge page_mkwrite() with fault() in future. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <joel.becker@oracle.com> Cc: Artem Bityutskiy <dedekind@infradead.org> Cc: Felix Blyakher <felixb@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-05-08Ignore madvise(MADV_WILLNEED) for hugetlbfs-backed regionsMel Gorman
commit a425a638c858fd10370b573bde81df3ba500e271 upstream. madvise(MADV_WILLNEED) forces page cache readahead on a range of memory backed by a file. The assumption is made that the page required is order-0 and "normal" page cache. On hugetlbfs, this assumption is not true and order-0 pages are allocated and inserted into the hugetlbfs page cache. This leaks hugetlbfs page reservations and can cause BUGs to trigger related to corrupted page tables. This patch causes MADV_WILLNEED to be ignored for hugetlbfs-backed regions. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-05-02mm: check for no mmaps in exit_mmap()Johannes Weiner
commit dcd4a049b9751828c516c59709f3fdf50436df85 upstream. When dup_mmap() ooms we can end up with mm->mmap == NULL. The error path does mmput() and unmap_vmas() gets a NULL vma which it dereferences. In exit_mmap() there is nothing to do at all for this case, we can cancel the callpath right there. [akpm@linux-foundation.org: add sorely-needed comment] Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Kir Kolyshkin <kir@openvz.org> Tested-by: Kir Kolyshkin <kir@openvz.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-05-02mm: do_xip_mapping_read: fix length calculationMartin Schwidefsky
upstream commit: 58984ce21d315b70df1a43644df7416ea7c9bfd8 The calculation of the value nr in do_xip_mapping_read is incorrect. If the copy required more than one iteration in the do while loop the copies variable will be non-zero. The maximum length that may be passed to the call to copy_to_user(buf+copied, xip_mem+offset, nr) is len-copied but the check only compares against (nr > len). This bug is the cause for the heap corruption Carsten has been chasing for so long:
2009-03-16mm: fix memmap init for handling memory holeKAMEZAWA Hiroyuki
commit cc2559bccc72767cb446f79b071d96c30c26439b upstream. Now, early_pfn_in_nid(PFN, NID) may returns false if PFN is a hole. and memmap initialization was not done. This was a trouble for sparc boot. To fix this, the PFN should be initialized and marked as PG_reserved. This patch changes early_pfn_in_nid() return true if PFN is a hole. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reported-by: David Miller <davem@davemlloft.net> Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-03-16mm: clean up for early_pfn_to_nid()KAMEZAWA Hiroyuki
commit f2dbcfa738368c8a40d4a5f0b65dc9879577cb21 upstream. What's happening is that the assertion in mm/page_alloc.c:move_freepages() is triggering: BUG_ON(page_zone(start_page) != page_zone(end_page)); Once I knew this is what was happening, I added some annotations: if (unlikely(page_zone(start_page) != page_zone(end_page))) { printk(KERN_ERR "move_freepages: Bogus zones: " "start_page[%p] end_page[%p] zone[%p]\n", start_page, end_page, zone); printk(KERN_ERR "move_freepages: " "start_zone[%p] end_zone[%p]\n", page_zone(start_page), page_zone(end_page)); printk(KERN_ERR "move_freepages: " "start_pfn[0x%lx] end_pfn[0x%lx]\n", page_to_pfn(start_page), page_to_pfn(end_page)); printk(KERN_ERR "move_freepages: " "start_nid[%d] end_nid[%d]\n", page_to_nid(start_page), page_to_nid(end_page)); ... And here's what I got: move_freepages: Bogus zones: start_page[2207d0000] end_page[2207dffc0] zone[fffff8103effcb00] move_freepages: start_zone[fffff8103effcb00] end_zone[fffff8003fffeb00] move_freepages: start_pfn[0x81f600] end_pfn[0x81f7ff] move_freepages: start_nid[1] end_nid[0] My memory layout on this box is: [ 0.000000] Zone PFN ranges: [ 0.000000] Normal 0x00000000 -> 0x0081ff5d [ 0.000000] Movable zone start PFN for each node [ 0.000000] early_node_map[8] active PFN ranges [ 0.000000] 0: 0x00000000 -> 0x00020000 [ 0.000000] 1: 0x00800000 -> 0x0081f7ff [ 0.000000] 1: 0x0081f800 -> 0x0081fe50 [ 0.000000] 1: 0x0081fed1 -> 0x0081fed8 [ 0.000000] 1: 0x0081feda -> 0x0081fedb [ 0.000000] 1: 0x0081fedd -> 0x0081fee5 [ 0.000000] 1: 0x0081fee7 -> 0x0081ff51 [ 0.000000] 1: 0x0081ff59 -> 0x0081ff5d So it's a block move in that 0x81f600-->0x81f7ff region which triggers the problem. This patch: Declaration of early_pfn_to_nid() is scattered over per-arch include files, and it seems it's complicated to know when the declaration is used. I think it makes fix-for-memmap-init not easy. This patch moves all declaration to include/linux/mm.h After this, if !CONFIG_NODES_POPULATES_NODE_MAP && !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID -> Use static definition in include/linux/mm.h else if !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID -> Use generic definition in mm/page_alloc.c else -> per-arch back end function will be called. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: David Miller <davem@davemlloft.net> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-02-17writeback: fix break conditionFederico Cuello
commit 89e1219004b3657cc014521663eeef0744f1c99d upstream. Commit dcf6a79dda5cc2a2bec183e50d829030c0972aaa ("write-back: fix nr_to_write counter") fixed nr_to_write counter, but didn't set the break condition properly. If nr_to_write == 0 after being decremented it will loop one more time before setting done = 1 and breaking the loop. [akpm@linux-foundation.org: coding-style fixes] Cc: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-02-17write-back: fix nr_to_write counterArtem Bityutskiy
commit dcf6a79dda5cc2a2bec183e50d829030c0972aaa upstream. Commit 05fe478dd04e02fa230c305ab9b5616669821dd3 introduced some @wbc->nr_to_write breakage. It made the following changes: 1. Decrement wbc->nr_to_write instead of nr_to_write 2. Decrement wbc->nr_to_write _only_ if wbc->sync_mode == WB_SYNC_NONE 3. If synced nr_to_write pages, stop only if if wbc->sync_mode == WB_SYNC_NONE, otherwise keep going. However, according to the commit message, the intention was to only make change 3. Change 1 is a bug. Change 2 does not seem to be necessary, and it breaks UBIFS expectations, so if needed, it should be done separately later. And change 2 does not seem to be documented in the commit message. This patch does the following: 1. Undo changes 1 and 2 2. Add a comment explaining change 3 (it very useful to have comments in _code_, not only in the commit). Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Acked-by: Nick Piggin <npiggin@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-02-17Fix page writeback thinko, causing Berkeley DB slowdownNick Piggin
commit 3a4c6800f31ea8395628af5e7e490270ee5d0585 upstream. A bug was introduced into write_cache_pages cyclic writeout by commit 31a12666d8f0c22235297e1c1575f82061480029 ("mm: write_cache_pages cyclic fix"). The intention (and comments) is that we should cycle back and look for more dirty pages at the beginning of the file if there is no more work to be done. But the !done condition was dropped from the test. This means that any time the page writeout loop breaks (eg. due to nr_to_write == 0), we will set index to 0, then goto again. This will set done_index to index, then find done is set, so will proceed to the end of the function. When updating mapping->writeback_index for cyclic writeout, we now use done_index == 0, so we're always cycling back to 0. This seemed to be causing random mmap writes (slapadd and iozone) to start writing more pages from the LRU and writeout would slowdown, and caused bugzilla entry http://bugzilla.kernel.org/show_bug.cgi?id=12604 about Berkeley DB slowing down dramatically. With this patch, iozone random write performance is increased nearly 5x on my system (iozone -B -r 4k -s 64k -s 512m -s 1200m on ext2). Signed-off-by: Nick Piggin <npiggin@suse.de> Reported-and-tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-02-12mm: remove UP version of lru_add_drain_all()KOSAKI Motohiro
commit 6841c8e26357904ef462650273f5d5015f7bb370 upstream. Currently, lru_add_drain_all() has two version. (1) use schedule_on_each_cpu() (2) don't use schedule_on_each_cpu() Gerald Schaefer reported it doesn't work well on SMP (not NUMA) S390 machine. offline_pages() calls lru_add_drain_all() followed by drain_all_pages(). While drain_all_pages() works on each cpu, lru_add_drain_all() only runs on the current cpu for architectures w/o CONFIG_NUMA. This let us run into the BUG_ON(!PageBuddy(page)) in __offline_isolated_pages() during memory hotplug stress test on s390. The page in question was still on the pcp list, because of a race with lru_add_drain_all() and drain_all_pages() on different cpus. Actually, Almost machine has CONFIG_UNEVICTABLE_LRU=y. Then almost machine use (1) version lru_add_drain_all although the machine is UP. Then this ifdef is not valueable. simple removing is better. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24mm: direct IO starvation improvementNick Piggin
commit 48b47c561e41525061b5bc0cfd67d6367fd11dc4 upstream. Direct IO can invalidate and sync a lot of pagecache pages in the mapping. A 4K direct IO will actually try to sync and/or invalidate the pagecache of the entire file, for example (which might be many GB or TB large). Improve this by doing range syncs. Also, memory no longer has to be unmapped to catch the dirty bits for syncing, as dirty bits would remain coherent due to dirty mmap accounting. This fixes the immediate DM deadlocks when doing direct IO reads to block device with a mounted filesystem, if only by papering over the problem somewhat rather than addressing the fsync starvation cases. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24mm: write_cache_pages more terminate quicklyAndrew Morton
commit 82fd1a9a8ced9607312b54859572bcc6211e8919 upstream. Now that we have the early-termination logic in place, it makes sense to bail out early in all other cases where done is set to 1. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24mm: write_cache_pages terminate quicklyNick Piggin
commit d5482cdf8a0aacb1e6468a97d5544f5829c8d8c4 upstream. Terminate the write_cache_pages loop upon encountering the first page past end, without locking the page. Pages cannot have their index change when we have a reference on them (truncate, eg truncate_inode_pages_range performs the same check without the page lock). Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24mm: write_cache_pages optimise page cleaningNick Piggin
commit 515f4a037fb9ab736f8bad733fcd2ffd350cf265 upstream. In write_cache_pages, if we get stuck behind another process that is cleaning pages, we will be forced to wait for them to finish, then perform our own writeout (if it was redirtied during the long wait), then wait for that. If a page under writeout is still clean, we can skip waiting for it (if we're part of a data integrity sync, we'll be waiting for all writeout pages afterwards, so we'll still be waiting for the other guy's write that's cleaned the page). Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24mm: write_cache_pages cleanupsNick Piggin
commit 5a3d5c9813db56a75934eb1015367fda23a8b0b4 upstream. Get rid of some complex expressions from flow control statements, add a comment, remove some duplicate code. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24mm: write_cache_pages integrity fixNick Piggin
commit 05fe478dd04e02fa230c305ab9b5616669821dd3 upstream. In write_cache_pages, nr_to_write is heeded even for data-integrity syncs, so the function will return success after writing out nr_to_write pages, even if that was not sufficient to guarantee data integrity. The callers tend to set it to values that could break data interity semantics easily in practice. For example, nr_to_write can be set to mapping->nr_pages * 2, however if a file has a single, dirty page, then fsync is called, subsequent pages might be concurrently added and dirtied, then write_cache_pages might writeout two of these newly dirty pages, while not writing out the old page that should have been written out. Fix this by ignoring nr_to_write if it is a data integrity sync. This is a data integrity bug. The reason this has been done in the past is to avoid stalling sync operations behind page dirtiers. "If a file has one dirty page at offset 1000000000000000 then someone does an fsync() and someone else gets in first and starts madly writing pages at offset 0, we want to write that page at 1000000000000000. Somehow." What we do today is return success after an arbitrary amount of pages are written, whether or not we have provided the data-integrity semantics that the caller has asked for. Even this doesn't actually fix all stall cases completely: in the above situation, if the file has a huge number of pages in pagecache (but not dirty), then mapping->nrpages is going to be huge, even if pages are being dirtied. This change does indeed make the possibility of long stalls lager, and that's not a good thing, but lying about data integrity is even worse. We have to either perform the sync, or return -ELINUXISLAME so at least the caller knows what has happened. There are subsequent competing approaches in the works to solve the stall problems properly, without compromising data integrity. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24mm: write_cache_pages writepage error fixNick Piggin
commit 00266770b8b3a6a77f896ca501a0613739086832 upstream. In write_cache_pages, if ret signals a real error, but we still have some pages left in the pagevec, done would be set to 1, but the remaining pages would continue to be processed and ret will be overwritten in the process. It could easily be overwritten with success, and thus success will be returned even if there is an error. Thus the caller is told all writes succeeded, wheras in reality some did not. Fix this by bailing immediately if there is an error, and retaining the first error code. This is a data integrity bug. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24mm: write_cache_pages early loop terminationNick Piggin
commit bd19e012f6fd3b7309689165ea865cbb7bb88c1e upstream. We'd like to break out of the loop early in many situations, however the existing code has been setting mapping->writeback_index past the final page in the pagevec lookup for cyclic writeback. This is a problem if we don't process all pages up to the final page. Currently the code mostly keeps writeback_index reasonable and hacked around this by not breaking out of the loop or writing pages outside the range in these cases. Keep track of a real "done index" that enables us to terminate the loop in a much more flexible manner. Needed by the subsequent patch to preserve writepage errors, and then further patches to break out of the loop early for other reasons. However there are no functional changes with this patch alone. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24mm: write_cache_pages cyclic fixNick Piggin
commit 31a12666d8f0c22235297e1c1575f82061480029 upstream. In write_cache_pages, scanned == 1 is supposed to mean that cyclic writeback has circled through zero, thus we should not circle again. However it gets set to 1 after the first successful pagevec lookup. This leads to cases where not enough data gets written. Counterexample: file with first 10 pages dirty, writeback_index == 5, nr_to_write == 10. Then the 5 last pages will be found, and scanned will be set to 1, after writing those out, we will not cycle back to get the first 5. Rework this logic, now we'll always cycle unless we started off from index 0. When cycling, only write out as far as 1 page before the start page from the first cycle (so we don't write parts of the file twice). Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18fs: symlink write_begin allocation context fixNick Piggin
commit 54566b2c1594c2326a645a3551f9d989f7ba3c5e upstream. With the write_begin/write_end aops, page_symlink was broken because it could no longer pass a GFP_NOFS type mask into the point where the allocations happened. They are done in write_begin, which would always assume that the filesystem can be entered from reclaim. This bug could cause filesystem deadlocks. The funny thing with having a gfp_t mask there is that it doesn't really allow the caller to arbitrarily tinker with the context in which it can be called. It couldn't ever be GFP_ATOMIC, for example, because it needs to take the page lock. The only thing any callers care about is __GFP_FS anyway, so turn that into a single flag. Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on this flag in their write_begin function. Change __grab_cache_page to accept a nofs argument as well, to honour that flag (while we're there, change the name to grab_cache_page_write_begin which is more instructive and does away with random leading underscores). This is really a more flexible way to go in the end anyway -- if a filesystem happens to want any extra allocations aside from the pagecache ones in ints write_begin function, it may now use GFP_KERNEL (rather than GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a random example). [kosaki.motohiro@jp.fujitsu.com: fix ubifs] [kosaki.motohiro@jp.fujitsu.com: fix fuse] Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@kernel.org> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> [ Cleaned up the calling convention: just pass in the AOP flags untouched to the grab_cache_page_write_begin() function. That just simplifies everybody, and may even allow future expansion of the logic. - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 28Heiko Carstens
commit 938bb9f5e840eddbf54e4f62f6c5ba9b3ae12c9d upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 26Heiko Carstens
commit c4ea37c26a691ad0b7e86aa5884aab27830e95c9 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 14Heiko Carstens
commit 3480b25743cb7404928d57efeaa3d085708b04c2 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrappers part 13Heiko Carstens
commit 6a6160a7b5c27b3c38651baef92a14fa7072b3c1 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18System call wrapper special casesHeiko Carstens
commit 6673e0c3fbeaed2cd08e2fd4a4aa97382d6fedb0 upstream. System calls with an unsigned long long argument can't be converted with the standard wrappers since that would include a cast to long, which in turn means that we would lose the upper 32 bit on 32 bit architectures. Also semctl can't use the standard wrapper since it has a 'union' parameter. So we handle them as special case and add some extra wrappers instead. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18Convert all system calls to return a longHeiko Carstens
commit 2ed7c03ec17779afb4fcfa3b8c61df61bd4879ba upstream. Convert all system calls to return a long. This should be a NOP since all converted types should have the same size anyway. With the exception of sys_exit_group which returned void. But that doesn't matter since the system call doesn't return. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-12-18setup_per_zone_pages_min(): take zone->lock instead of zone->lru_lockGerald Schaefer
commit 1125b4e3949949b44a7c80b619507c6f61d62911 upstream. This replaces zone->lru_lock in setup_per_zone_pages_min() with zone->lock. There seems to be no need for the lru_lock anymore, but there is a need for zone->lock instead, because that function may call move_freepages() via setup_zone_migrate_reserve(). Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-11-20memory hotplug: fix page_zone() calculation in test_pages_isolated()Gerald Schaefer
commit a70dcb969f64e2fa98c24f47854f20bf02ff0092 upstream. My last bugfix here (adding zone->lock) introduced a new problem: Using page_zone(pfn_to_page(pfn)) to get the zone after the for() loop is wrong. pfn will then be >= end_pfn, which may be in a different zone or not present at all. This may lead to an addressing exception in page_zone() or spin_lock_irqsave(). Now I use __first_valid_page() again after the loop to find a valid page for page_zone(). Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: Nathan Fontenot <nfont@austin.ibm.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-11-20hugetlb: make unmap_ref_private multi-size-awareAdam Litke
commit 7526674de0c921e7f1e9b6f71a1f9d832557b554 upstream. Oops. Part of the hugetlb private reservation code was not fully converted to use hstates. When a huge page must be unmapped from VMAs due to a failed COW, HPAGE_SIZE is used in the call to unmap_hugepage_range() regardless of the page size being used. This works if the VMA is using the default huge page size. Otherwise we might unmap too much, too little, or trigger a BUG_ON. Rare but serious -- fix it. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-11-13hugetlbfs: handle pages higher order than MAX_ORDERAndy Whitcroft
commit 69d177c2fc702d402b17fdca2190d5a7e3ca55c5 upstream When working with hugepages, hugetlbfs assumes that those hugepages are smaller than MAX_ORDER. Specifically it assumes that the mem_map is contigious and uses that to optimise access to the elements of the mem_map that represent the hugepage. Gigantic pages (such as 16GB pages on powerpc) by definition are of greater order than MAX_ORDER (larger than MAX_ORDER_NR_PAGES in size). This means that we can no longer make use of the buddy alloctor guarentees for the contiguity of the mem_map, which ensures that the mem_map is at least contigious for maximmally aligned areas of MAX_ORDER_NR_PAGES pages. This patch adds new mem_map accessors and iterator helpers which handle any discontiguity at MAX_ORDER_NR_PAGES boundaries. It then uses these to implement gigantic page versions of copy_huge_page and clear_huge_page, and to allow follow_hugetlb_page handle gigantic pages. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-13hugetlb: pull gigantic page initialisation out of the default pathAndy Whitcroft
commit 18229df5b613ed0732a766fc37850de2e7988e43 upstream As we can determine exactly when a gigantic page is in use we can optimise the common regular page cases by pulling out gigantic page initialisation into its own function. As gigantic pages are never released to buddy we do not need a destructor. This effectivly reverts the previous change to the main buddy allocator. It also adds a paranoid check to ensure we never release gigantic pages from hugetlbfs to the main buddy. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-10-25anon_vma_prepare: properly lock even newly allocated entriesLinus Torvalds
commit d9d332e0874f46b91d8ac4604b68ee42b8a7a2c6 upstream The anon_vma code is very subtle, and we end up doing optimistic lookups of anon_vmas under RCU in page_lock_anon_vma() with no locking. Other CPU's can also see the newly allocated entry immediately after we've exposed it by setting "vma->anon_vma" to the new value. We protect against the anon_vma being destroyed by having the SLAB marked as SLAB_DESTROY_BY_RCU, so the RCU lookup can depend on the allocation not being destroyed - but it might still be free'd and re-allocated here to a new vma. As a result, we should not do the anon_vma list ops on a newly allocated vma without proper locking. Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Hugh Dickins <hugh@veritas.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-10-09SLOB: fix bogus ksize calculation fixMatt Mackall
This fixes the previous fix, which was completely wrong on closer inspection. This version has been manually tested with a user-space test harness and generates sane values. A nearly identical patch has been boot-tested. The problem arose from changing how kmalloc/kfree handled alignment padding without updating ksize to match. This brings it in sync. Signed-off-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-07SLOB: fix bogus ksize calculationMatt Mackall
SLOB's ksize calculation was braindamaged and generally harmlessly underreported the allocation size. But for very small buffers, it could in fact overreport them, leading code depending on krealloc to overrun the allocation and trample other data. Signed-off-by: Matt Mackall <mpm@selenic.com> Tested-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-02mm: handle initialising compound pages at orders greater than MAX_ORDERAndy Whitcroft
When we initialise a compound page we initialise the page flags and head page pointer for all base pages spanned by that page. When we initialise a gigantic page (a page of order greater than or equal to MAX_ORDER) we have to initialise more than MAX_ORDER_NR_PAGES pages. Currently we assume that all elements of the mem_map in this page are contigious in memory. However this is only guarenteed out to MAX_ORDER_NR_PAGES pages, and with SPARSEMEM enabled they will not be contigious. This leads us to walk off the end of the first section and scribble on everything which follows, BAD. When we reach a MAX_ORDER_NR_PAGES boundary we much locate the next section of the mem_map. As gigantic pages can only be maximally aligned we know this will occur at exact multiple of MAX_ORDER_NR_PAGES pages from the start of the page. This is a bug fix for the gigantic page support in hugetlbfs. Credit to Mel Gorman for spotting the issue. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-02mm: tiny-shmem nommu fixNick Piggin
The previous patch db203d53d474aa068984e409d807628f5841da1b ("mm: tiny-shmem fix lock ordering: mmap_sem vs i_mutex") to fix the lock ordering in tiny-shmem breaks shared anonymous and IPC memory on NOMMU architectures because it was using the expanding truncate to signal ramfs to allocate a physically contiguous RAM backing the inode (otherwise it is unusable for "memory mapping" it to userspace). However do_truncate is what caused the lock ordering error, due to it taking i_mutex. In this case, we can actually just call ramfs directly to allocate memory for the mapping, rather than go via truncate. Acked-by: David Howells <dhowells@redhat.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-02memory hotplug: missing zone->lock in test_pages_isolated()Gerald Schaefer
__test_page_isolated_in_pageblock() in mm/page_isolation.c has a comment saying that the caller must hold zone->lock. But the only caller of that function, test_pages_isolated(), does not hold zone->lock and the lock is also not acquired anywhere before. This patch adds the missing zone->lock to test_pages_isolated(). We reproducibly run into BUG_ON(!PageBuddy(page)) in __offline_isolated_pages() during memory hotplug stress test, see trace below. This patch fixes that problem, it would be good if we could have it in 2.6.27. kernel BUG at /home/autobuild/BUILD/linux-2.6.26-20080909/mm/page_alloc.c:4561! illegal operation: 0001 [#1] PREEMPT SMP Modules linked in: dm_multipath sunrpc bonding qeth_l3 dm_mod qeth ccwgroup vmur CPU: 1 Not tainted 2.6.26-29.x.20080909-s390default #1 Process memory_loop_all (pid: 10025, task: 2f444028, ksp: 2b10dd28) Krnl PSW : 040c0000 801727ea (__offline_isolated_pages+0x18e/0x1c4) R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0 Krnl GPRS: 00000000 7e27fc00 00000000 7e27fc00 00000000 00000400 00014000 7e27fc01 00606f00 7e27fc00 00013fe0 2b10dd28 00000005 80172662 801727b2 2b10dd28 Krnl Code: 801727de: 5810900c l %r1,12(%r9) 801727e2: a7f4ffb3 brc 15,80172748 801727e6: a7f40001 brc 15,801727e8 >801727ea: a7f4ffbc brc 15,80172762 801727ee: a7f40001 brc 15,801727f0 801727f2: a7f4ffaf brc 15,80172750 801727f6: 0707 bcr 0,%r7 801727f8: 0017 unknown Call Trace: ([<0000000000172772>] __offline_isolated_pages+0x116/0x1c4) [<00000000001953a2>] offline_isolated_pages_cb+0x22/0x34 [<000000000013164c>] walk_memory_resource+0xcc/0x11c [<000000000019520e>] offline_pages+0x36a/0x498 [<00000000001004d6>] remove_memory+0x36/0x44 [<000000000028fb06>] memory_block_change_state+0x112/0x150 [<000000000028ffb8>] store_mem_state+0x90/0xe4 [<0000000000289c00>] sysdev_store+0x34/0x40 [<00000000001ee048>] sysfs_write_file+0xd0/0x178 [<000000000019b1a8>] vfs_write+0x74/0x118 [<000000000019b9ae>] sys_write+0x46/0x7c [<000000000011160e>] sysc_do_restart+0x12/0x16 [<0000000077f3e8ca>] 0x77f3e8ca Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-29mm owner: fix race between swapoff and exitBalbir Singh
There's a race between mm->owner assignment and swapoff, more easily seen when task slab poisoning is turned on. The condition occurs when try_to_unuse() runs in parallel with an exiting task. A similar race can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats> or ptrace or page migration. CPU0 CPU1 try_to_unuse looks at mm = task0->mm increments mm->mm_users task 0 exits mm->owner needs to be updated, but no new owner is found (mm_users > 1, but no other task has task->mm = task0->mm) mm_update_next_owner() leaves mmput(mm) decrements mm->mm_users task0 freed dereferencing mm->owner fails The fix is to notify the subsystem via mm_owner_changed callback(), if no new owner is found, by specifying the new task as NULL. Jiri Slaby: mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but must be set after that, so as not to pass NULL as old owner causing oops. Daisuke Nishimura: mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task() and its callers need to take account of this situation to avoid oops. Hugh Dickins: Lockdep warning and hang below exec_mmap() when testing these patches. exit_mm() up_reads mmap_sem before calling mm_update_next_owner(), so exec_mmap() now needs to do the same. And with that repositioning, there's now no point in mm_need_new_owner() allowing for NULL mm. Reported-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-23memcg: check under limit at shrink_usageDaisuke Nishimura
Current memory cgroup(both in mainline and -mm) doesn't account swap caches as memory(swap cache support is dropped temporarily now). So try_to_free_mem_cgroup_pages doesn't reflect the count of pages that have been moved to swap cache. But this makes mem_cgroup_shrink_usage fail easily if most of the pages are anon/shmem, and then shmem_getpage returns -ENOMEM and the process will be killed. This patch adds res_counter_check_under_limit to avoid these cases. BTW, even if swap cache support is enabled again, if a process is moved to another cgroup, which has been just made, between precharge and shrink_usage in shmem_getpage, shrink_usage may fail just because there is no pages to reclaim. So this change would make sense anyway. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-23mm: tiny-shmem fix lock ordering: mmap_sem vs i_mutexNick Piggin
tiny-shmem calls do_truncate in shmem_file_setup. do_truncate takes i_mutex, and shmem_file_setup is called with mmap_sem held. However i_mutex nests outside mmap_sem. Copy the code in shmem.c to avoid this problem. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Nick Piggin <npiggin@suse.de> Reported-and-tested-by: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Matt Mackall <mpm@selenic.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-15slub: fixed uninitialized counter in struct kmem_cache_nodeSalman Qazi
Initialized total objects atomic for the node in init_kmem_cache_node. The uninitialized value was ruining the stats in /proc/slabinfo. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Salman Qazi <sqazi@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2008-09-13mm: mark the correct zone as full when scanning zonelistsMel Gorman
The iterator for_each_zone_zonelist() uses a struct zoneref *z cursor when scanning zonelists to keep track of where in the zonelist it is. The zoneref that is returned corresponds to the the next zone that is to be scanned, not the current one. It was intended to be treated as an opaque list. When the page allocator is scanning a zonelist, it marks elements in the zonelist corresponding to zones that are temporarily full. As the zonelist is being updated, it uses the cursor here; if (NUMA_BUILD) zlc_mark_zone_full(zonelist, z); This is intended to prevent rescanning in the near future but the zoneref cursor does not correspond to the zone that has been found to be full. This is an easy misunderstanding to make so this patch corrects the problem by changing zoneref cursor to be the current zone being scanned instead of the next one. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@kernel.org> [2.6.26.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-03mmap: fix petty bug in anonymous shared mmap offset handlingTejun Heo
Anonymous mappings should ignore offset but shared anonymous mapping forgot to clear it and makes the following legit test program trigger SIGBUS. #include <sys/mman.h> #include <stdio.h> #include <errno.h> #define PAGE_SIZE 4096 int main(void) { char *p; int i; p = mmap(NULL, 2 * PAGE_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, PAGE_SIZE); if (p == MAP_FAILED) { perror("mmap"); return 1; } for (i = 0; i < 2; i++) { printf("page %d\n", i); p[i * 4096] = i; } return 0; } Fix it. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Hugh Dickins <hugh@veritas.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-02mm: size of quicklists shouldn't be proportional to the number of CPUsKOSAKI Motohiro
Quicklists store pages for each CPU as caches. (Each CPU can cache node_free_pages/16 pages) It is used for page table cache. exit() will increase the cache size, while fork() consumes it. So for example if an apache-style application runs (one parent and many child model), one CPU process will fork() while another CPU will process the middleware work and exit(). At that time, the CPU on which the parent runs doesn't have page table cache at all. Others (on which children runs) have maximum caches. QList_max = (#ofCPUs - 1) x Free / 16 => QList_max / (Free + QList_max) = (#ofCPUs - 1) / (16 + #ofCPUs - 1) So, How much quicklist memory is used in the maximum case? This is proposional to # of CPUs because the limit of per cpu quicklist cache doesn't see the number of cpus. Above calculation mean Number of CPUs per node 2 4 8 16 ============================== ==================== QList_max / (Free + QList_max) 5.8% 16% 30% 48% Wow! Quicklist can spend about 50% memory at worst case. My demonstration program is here -------------------------------------------------------------------------------- #define _GNU_SOURCE #include <stdio.h> #include <errno.h> #include <stdlib.h> #include <string.h> #include <sched.h> #include <unistd.h> #include <sys/mman.h> #include <sys/wait.h> #define BUFFSIZE 512 int max_cpu(void) /* get max number of logical cpus from /proc/cpuinfo */ { FILE *fd; char *ret, buffer[BUFFSIZE]; int cpu = 1; fd = fopen("/proc/cpuinfo", "r"); if (fd == NULL) { perror("fopen(/proc/cpuinfo)"); exit(EXIT_FAILURE); } while (1) { ret = fgets(buffer, BUFFSIZE, fd); if (ret == NULL) break; if (!strncmp(buffer, "processor", 9)) cpu = atoi(strchr(buffer, ':') + 2); } fclose(fd); return cpu; } void cpu_bind(int cpu) /* bind current process to one cpu */ { cpu_set_t mask; int ret; CPU_ZERO(&mask); CPU_SET(cpu, &mask); ret = sched_setaffinity(0, sizeof(mask), &mask); if (ret == -1) { perror("sched_setaffinity()"); exit(EXIT_FAILURE); } sched_yield(); /* not necessary */ } #define MMAP_SIZE (10 * 1024 * 1024) /* 10 MB */ #define FORK_INTERVAL 1 /* 1 second */ main(int argc, char *argv[]) { int cpu_max, nextcpu; long pagesize; pid_t pid; /* set max number of logical cpu */ if (argc > 1) cpu_max = atoi(argv[1]) - 1; else cpu_max = max_cpu(); /* get the page size */ pagesize = sysconf(_SC_PAGESIZE); if (pagesize == -1) { perror("sysconf(_SC_PAGESIZE)"); exit(EXIT_FAILURE); } /* prepare parent process */ cpu_bind(0); nextcpu = cpu_max; loop: /* select destination cpu for child process by round-robin rule */ if (++nextcpu > cpu_max) nextcpu = 1; pid = fork(); if (pid == 0) { /* child action */ char *p; int i; /* consume page tables */ p = mmap(0, MMAP_SIZE, PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); i = MMAP_SIZE / pagesize; while (i-- > 0) { *p = 1; p += pagesize; } /* move to other cpu */ cpu_bind(nextcpu); /* printf("a child moved to cpu%d after mmap().\n", nextcpu); fflush(stdout); */ /* back page tables to pgtable_quicklist */ exit(0); } else if (pid > 0) { /* parent action */ sleep(FORK_INTERVAL); waitpid(pid, NULL, WNOHANG); } goto loop; } ---------------------------------------- When above program which does task migration runs, my 8GB box spends 800MB of memory for quicklist. This is not memory leak but doesn't seem good. % cat /proc/meminfo MemTotal: 7701568 kB MemFree: 4724672 kB (snip) Quicklists: 844800 kB because - My machine spec is number of numa node: 2 number of cpus: 8 (4CPU x2 node) total mem: 8GB (4GB x2 node) free mem: about 5GB - Then, 4.7GB x 16% ~= 880MB. So, Quicklist can use 800MB. So, if following spec machine run that program CPUs: 64 (8cpu x 8node) Mem: 1TB (128GB x8node) Then, quicklist can waste 300GB (= 1TB x 30%). It is too large. So, I don't like cache policies which is proportional to # of cpus. My patch changes the number of caches from: per-cpu-cache-amount = memory_on_node / 16 to per-cpu-cache-amount = memory_on_node / 16 / number_of_cpus_on_node. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Keiichiro Tokunaga <tokunaga.keiich@jp.fujitsu.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Tested-by: David Miller <davem@davemloft.net> Acked-by: Mike Travis <travis@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-02mm/bootmem: silence section mismatch warning - ↵Marcin Slusarz
contig_page_data/bootmem_node_data WARNING: vmlinux.o(.data+0x1f5c0): Section mismatch in reference from the variable contig_page_data to the variable .init.data:bootmem_node_data The variable contig_page_data references the variable __initdata bootmem_node_data If the reference is valid then annotate the variable with __init* (see linux/init.h) or name the variable: *driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console, Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com> Cc: Johannes Weiner <hannes@saeurebad.de> Cc: Sean MacLennan <smaclennan@pikatech.com> Cc: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-02VFS: fix dio write returning EIO when try_to_release_page failsHisashi Hifumi
Dio write returns EIO when try_to_release_page fails because bh is still referenced. The patch commit 3f31fddfa26b7594b44ff2b34f9a04ba409e0f91 Author: Mingming Cao <cmm@us.ibm.com> Date: Fri Jul 25 01:46:22 2008 -0700 jbd: fix race between free buffer and commit transaction was merged into 2.6.27-rc1, but I noticed that this patch is not enough to fix the race. I did fsstress test heavily to 2.6.27-rc1, and found that dio write still sometimes got EIO through this test. The patch above fixed race between freeing buffer(dio) and committing transaction(jbd) but I discovered that there is another race, freeing buffer(dio) and ext3/4_ordered_writepage. : background_writeout() ->write_cache_pages() ->ext3_ordered_writepage() walk_page_buffers() -> take a bh ref block_write_full_page() -> unlock_page : <- end_page_writeback : <- race! (dio write->try_to_release_page fails) walk_page_buffers() ->release a bh ref ext3_ordered_writepage holds bh ref and does unlock_page remaining taking a bh ref, so this causes the race and failure of try_to_release_page. To fix this race, I used the approach of falling back to buffered writes if try_to_release_page() fails on a page. [akpm@linux-foundation.org: cleanups] Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Cc: Chris Mason <chris.mason@oracle.com> Cc: Jan Kara <jack@suse.cz> Cc: Mingming Cao <cmm@us.ibm.com> Cc: Zach Brown <zach.brown@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-09-02mm: make setup_zone_migrate_reserve() aware of overlapping nodesAdam Litke
I have gotten to the root cause of the hugetlb badness I reported back on August 15th. My system has the following memory topology (note the overlapping node): Node 0 Memory: 0x8000000-0x44000000 Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000 setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking for a pageblock to move onto the MIGRATE_RESERVE list. Finding no candidates, it happily continues the scan into 0x8000000-0x44000000. When a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on the wrong zone. Oops. setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes. Signed-off-by: Adam Litke <agl@us.ibm.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>