summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2009-10-05mmap: avoid unnecessary anon_vma lock acquisition in vma_adjust()Lee Schermerhorn
commit 252c5f94d944487e9f50ece7942b0fbf659c5c31 upstream. We noticed very erratic behavior [throughput] with the AIM7 shared workload running on recent distro [SLES11] and mainline kernels on an 8-socket, 32-core, 256GB x86_64 platform. On the SLES11 kernel [2.6.27.19+] with Barcelona processors, as we increased the load [10s of thousands of tasks], the throughput would vary between two "plateaus"--one at ~65K jobs per minute and one at ~130K jpm. The simple patch below causes the results to smooth out at the ~130k plateau. But wait, there's more: We do not see this behavior on smaller platforms--e.g., 4 socket/8 core. This could be the result of the larger number of cpus on the larger platform--a scalability issue--or it could be the result of the larger number of interconnect "hops" between some nodes in this platform and how the tasks for a given load end up distributed over the nodes' cpus and memories--a stochastic NUMA effect. The variability in the results are less pronounced [on the same platform] with Shanghai processors and with mainline kernels. With 31-rc6 on Shanghai processors and 288 file systems on 288 fibre attached storage volumes, the curves [jpm vs load] are both quite flat with the patched kernel consistently producing ~3.9% better throughput [~80K jpm vs ~77K jpm] than the unpatched kernel. Profiling indicated that the "slow" runs were incurring high[er] contention on an anon_vma lock in vma_adjust(), apparently called from the sbrk() system call. The patch: A comment in mm/mmap.c:vma_adjust() suggests that we don't really need the anon_vma lock when we're only adjusting the end of a vma, as is the case for brk(). The comment questions whether it's worth while to optimize for this case. Apparently, on the newer, larger x86_64 platforms, with interesting NUMA topologies, it is worth while--especially considering that the patch [if correct!] is quite simple. We can detect this condition--no overlap with next vma--by noting a NULL "importer". The anon_vma pointer will also be NULL in this case, so simply avoid loading vma->anon_vma to avoid the lock. However, we DO need to take the anon_vma lock when we're inserting a vma ['insert' non-NULL] even when we have no overlap [NULL "importer"], so we need to check for 'insert', as well. And Hugh points out that we should also take it when adjusting vm_start (so that rmap.c can rely upon vma_address() while it holds the anon_vma lock). akpm: Zhang Yanmin reprts a 150% throughput improvement with aim7, so it might be -stable material even though thiss isn't a regression: "this issue is not clear on dual socket Nehalem machine (2*4*2 cpu), but is severe on large machine (4*8*2 cpu)" [hugh.dickins@tiscali.co.uk: test vma start too] Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Nick Piggin <npiggin@suse.de> Cc: Eric Whitney <eric.whitney@hp.com> Tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-10-05mm: fix anonymous dirtyingHugh Dickins
commit 1ac0cb5d0e22d5e483f56b2bc12172dec1cf7536 upstream. do_anonymous_page() has been wrong to dirty the pte regardless. If it's not going to mark the pte writable, then it won't help to mark it dirty here, and clogs up memory with pages which will need swap instead of being thrown away. Especially wrong if no overcommit is chosen, and this vma is not yet VM_ACCOUNTed - we could exceed the limit and OOM despite no overcommit. Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-10-05hugetlb: restore interleaving of bootmem huge pages (2.6.31)Lee Schermerhorn
Not upstream as it is fixed differently in .32 I noticed that alloc_bootmem_huge_page() will only advance to the next node on failure to allocate a huge page. I asked about this on linux-mm and linux-numa, cc'ing the usual huge page suspects. Mel Gorman responded: I strongly suspect that the same node being used until allocation failure instead of round-robin is an oversight and not deliberate at all. It appears to be a side-effect of a fix made way back in commit 63b4613c3f0d4b724ba259dc6c201bb68b884e1a ["hugetlb: fix hugepage allocation with memoryless nodes"]. Prior to that patch it looked like allocations would always round-robin even when allocation was successful. Andy Whitcroft countered that the existing behavior looked like Andi Kleen's original implementation and suggested that we ask him. We did and Andy replied that his intention was to interleave the allocations. So, ... This patch moves the advance of the hstate next node from which to allocate up before the test for success of the attempted allocation. This will unconditionally advance the next node from which to alloc, interleaving successful allocations over the nodes with sufficient contiguous memory, and skipping over nodes that fail the huge page allocation attempt. Note that alloc_bootmem_huge_page() will only be called for huge pages of order > MAX_ORDER. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: David Rientjes <rientjes@google.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-09-15slub: Fix kmem_cache_destroy() with SLAB_DESTROY_BY_RCUEric Dumazet
commit d76b1590e06a63a3d8697168cd0aabf1c4b3cb3a upstream. kmem_cache_destroy() should call rcu_barrier() *after* kmem_cache_close() and *before* sysfs_slab_remove() or risk rcu_free_slab() being called after kmem_cache is deleted (kfreed). rmmod nf_conntrack can crash the machine because it has to kmem_cache_destroy() a SLAB_DESTROY_BY_RCU enabled cache. Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-09-08mm: build_zonelists(): move clear node_load[] to __build_all_zonelists()Bo Liu
commit 7f9cfb31030737a7fc9a1cbca3fd01bec184c849 upstream. If node_load[] is cleared everytime build_zonelists() is called,node_load[] will have no help to find the next node that should appear in the given node's fallback list. Because of the bug, zonelist's node_order is not calculated as expected. This bug affects on big machine, which has asynmetric node distance. [synmetric NUMA's node distance] 0 1 2 0 10 12 12 1 12 10 12 2 12 12 10 [asynmetric NUMA's node distance] 0 1 2 0 10 12 20 1 12 10 14 2 20 14 10 This (my bug) is very old but no one has reported this for a long time. Maybe because the number of asynmetric NUMA is very small and they use cpuset for customizing node memory allocation fallback. [akpm@linux-foundation.org: fix CONFIG_NUMA=n build] Signed-off-by: Bo Liu <bo-liu@hotmail.com> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-08-16page-allocator: preserve PFN ordering when __GFP_COLD is setMel Gorman
commit e084b2d95e48b31aa45f9c49ffc6cdae8bdb21d4 upstream. Fix a post-2.6.24 performace regression caused by 3dfa5721f12c3d5a441448086bee156887daa961 ("page-allocator: preserve PFN ordering when __GFP_COLD is set"). Narayanan reports "The regression is around 15%. There is no disk controller as our setup is based on Samsung OneNAND used as a memory mapped device on a OMAP2430 based board." The page allocator tries to preserve contiguous PFN ordering when returning pages such that repeated callers to the allocator have a strong chance of getting physically contiguous pages, particularly when external fragmentation is low. However, of the bulk of the allocations have __GFP_COLD set as they are due to aio_read() for example, then the PFNs are in reverse PFN order. This can cause performance degration when used with IO controllers that could have merged the requests. This patch attempts to preserve the contiguous ordering of PFNs for users of __GFP_COLD. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reported-by: Narayananu Gopalakrishnan <narayanan.g@samsung.com> Tested-by: Narayanan Gopalakrishnan <narayanan.g@samsung.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-08-16hugetlbfs: fix i_blocks accountingEric Sandeen
commit e4c6f8bed01f9f9a5c607bd689bf67e7b8a36bd8 upstream. As reported in Red Hat bz #509671, i_blocks for files on hugetlbfs get accounting wrong when doing something like: $ > foo $ date > foo date: write error: Invalid argument $ /usr/bin/stat foo File: `foo' Size: 0 Blocks: 18446744073709547520 IO Block: 2097152 regular ... This is because hugetlb_unreserve_pages() is unconditionally removing blocks_per_huge_page(h) on each call rather than using the freed amount. If there were 0 blocks, it goes negative, resulting in the above. This is a regression from commit a5516438959d90b071ff0a484ce4f3f523dc3152 ("hugetlb: modular state for hugetlb page size") which did: - inode->i_blocks -= BLOCKS_PER_HUGEPAGE * freed; + inode->i_blocks -= blocks_per_huge_page(h); so just put back the freed multiplier, and it's all happy again. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Andi Kleen <andi@firstfloor.org> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-07-30nommu: Provide mmap_min_addr definition.Paul Mundt
commit 35f2c2f6f6ae13ef23c4f68e6d3073753077ca43 upstream. With the "security: use mmap_min_addr indepedently of security models" change, mmap_min_addr is used in common areas, which susbsequently blows up the nommu build. This stubs in the definition in the nommu case as well. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Cc: Mike Frysinger <vapier.adi@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: James Morris <jmorris@namei.org>
2009-07-30mm: mark page accessed before we write_end()Josef Bacik
commit c8236db9cd7aa492dcfcdcca702638e704abed49 upstream. In testing a backport of the write_begin/write_end AOPs, a 10% re-read regression was noticed when running iozone. This regression was introduced because the old AOPs would always do a mark_page_accessed(page) after the commit_write, but when the new AOPs where introduced, the only place this was kept was in pagecache_write_end(). This patch does the same thing in the generic case as what is done in pagecache_write_end(), which is just to mark the page accessed before we do write_end(). Signed-off-by: Josef Bacik <jbacik@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-07-30vmscan: do not unconditionally treat zones that fail zone_reclaim() as fullMel Gorman
commit fa5e084e43eb14c14942027e1e2e894aeed96097 upstream. vmscan: do not unconditionally treat zones that fail zone_reclaim() as full On NUMA machines, the administrator can configure zone_reclaim_mode that is a more targetted form of direct reclaim. On machines with large NUMA distances for example, a zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be reclaimed if the zone watermarks are not being met. The problem is that zone_reclaim() failing at all means the zone gets marked full. This can cause situations where a zone is usable, but is being skipped because it has been considered full. Take a situation where a large tmpfs mount is occuping a large percentage of memory overall. The pages do not get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full and the zonelist cache considers them not worth trying in the future. This patch makes zone_reclaim() return more fine-grained information about what occured when zone_reclaim() failued. The zone only gets marked full if it really is unreclaimable. If it's a case that the scan did not occur or if enough pages were not reclaimed with the limited reclaim_mode, then the zone is simply skipped. There is a side-effect to this patch. Currently, if zone_reclaim() successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would go ahead. With this patch applied, zone watermarks are rechecked after zone_reclaim() does some work. This bug was introduced by commit 9276b1bc96a132f4068fdee00983c532f43d3a26 ("memory page_alloc zonelist caching speedup") way back in 2.6.19 when the zonelist_cache was introduced. It was not intended that zone_reclaim() aggressively consider the zone to be full when it failed as full direct reclaim can still be an option. Due to the age of the bug, it should be considered a -stable candidate. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-07-30fix RCU-callback-after-kmem_cache_destroy problem in sl[aou]bPaul E. McKenney
commit 7ed9f7e5db58c6e8c2b4b738a75d5dcd8e17aad5 upstream. Jesper noted that kmem_cache_destroy() invokes synchronize_rcu() rather than rcu_barrier() in the SLAB_DESTROY_BY_RCU case, which could result in RCU callbacks accessing a kmem_cache after it had been destroyed. Acked-by: Matt Mackall <mpm@selenic.com> Reported-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-07-19security: use mmap_min_addr indepedently of security modelsChristoph Lameter
commit e0a94c2a63f2644826069044649669b5e7ca75d3 upstream. This patch removes the dependency of mmap_min_addr on CONFIG_SECURITY. It also sets a default mmap_min_addr of 4096. mmapping of addresses below 4096 will only be possible for processes with CAP_SYS_RAWIO. Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Acked-by: Eric Paris <eparis@redhat.com> Looks-ok-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-07-02vmscan: count the number of times zone_reclaim() scans and failsMel Gorman
commit 24cf72518c79cdcda486ed26074ff8151291cf65 upstream. On NUMA machines, the administrator can configure zone_reclaim_mode that is a more targetted form of direct reclaim. On machines with large NUMA distances for example, a zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be reclaimed if the zone watermarks are not being met. There is a heuristic that determines if the scan is worthwhile but it is possible that the heuristic will fail and the CPU gets tied up scanning uselessly. Detecting the situation requires some guesswork and experimentation so this patch adds a counter "zreclaim_failed" to /proc/vmstat. If during high CPU utilisation this counter is increasing rapidly, then the resolution to the problem may be to set /proc/sys/vm/zone_reclaim_mode to 0. [akpm@linux-foundation.org: name things consistently] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-07-02vmscan: properly account for the number of page cache pages zone_reclaim() ↵Mel Gorman
can reclaim commit 90afa5de6f3fa89a733861e843377302479fcf7e upstream. A bug was brought to my attention against a distro kernel but it affects mainline and I believe problems like this have been reported in various guises on the mailing lists although I don't have specific examples at the moment. The reported problem was that malloc() stalled for a long time (minutes in some cases) if a large tmpfs mount was occupying a large percentage of memory overall. The pages did not get cleaned or reclaimed by zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists are uselessly scanned frequencly making the CPU spin at near 100%. This patchset intends to address that bug and bring the behaviour of zone_reclaim() more in line with expectations which were noticed during investigation. It is based on top of mmotm and takes advantage of Kosaki's work with respect to zone_reclaim(). Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the scan should go ahead. The broken heuristic is what was causing the malloc() stall as it uselessly scanned the LRU constantly. Currently, zone_reclaim is assuming zone_reclaim_mode is 1 and historically it could not deal with tmpfs pages at all. This fixes up the heuristic so that an unnecessary scan is more likely to be correctly avoided. Patch 2 notes that zone_reclaim() returning a failure automatically means the zone is marked full. This is not always true. It could have failed because the GFP mask or zone_reclaim_mode were unsuitable. Patch 3 introduces a counter zreclaim_failed that will increment each time the zone_reclaim scan-avoidance heuristics fail. If that counter is rapidly increasing, then zone_reclaim_mode should be set to 0 as a temporarily resolution and a bug reported because the scan-avoidance heuristic is still broken. This patch: On NUMA machines, the administrator can configure zone_reclaim_mode that is a more targetted form of direct reclaim. On machines with large NUMA distances for example, a zone_reclaim_mode defaults to 1 meaning that clean unmapped pages will be reclaimed if the zone watermarks are not being met. There is a heuristic that determines if the scan is worthwhile but the problem is that the heuristic is not being properly applied and is basically assuming zone_reclaim_mode is 1 if it is enabled. The lack of proper detection can manfiest as high CPU usage as the LRU list is scanned uselessly. Historically, once enabled it was depending on NR_FILE_PAGES which may include swapcache pages that the reclaim_mode cannot deal with. Patch vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included pages that were not file-backed such as swapcache and made a calculation based on the inactive, active and mapped files. This is far superior when zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a reasonable starting figure. This patch alters how zone_reclaim() works out how many pages it might be able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set in the reclaim_mode it will either consider NR_FILE_PAGES as potential candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set, then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is not set, then NR_FILE_MAPPED are not. [kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages] [fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-07-02mm: fix handling of pagesets for downed cpusDimitri Sivanich
commit 364df0ebfbbb1330bfc6ca159f4d6020efc15a12 upstream. After downing/upping a cpu, an attempt to set /proc/sys/vm/percpu_pagelist_fraction results in an oops in percpu_pagelist_fraction_sysctl_handler(). If a processor is downed then we need to set the pageset pointer back to the boot pageset. Updates of the high water marks should not access pagesets of unpopulated zones (those pointer go to the boot pagesets which would be no longer functional if their size would be increased beyond zero). Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-05-29memcg: fix build warning and avoid checking for mem != null again and againNikanth Karthikesan
Fix build warning, "mem_cgroup_is_obsolete defined but not used" when CONFIG_DEBUG_VM is not set. Also avoid checking for !mem again and again. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-29mm: account for MAP_SHARED mappings using VM_MAYSHARE and not VM_SHARED in ↵Mel Gorman
hugetlbfs Addresses http://bugzilla.kernel.org/show_bug.cgi?id=13302 hugetlbfs reserves huge pages but does not fault them at mmap() time to ensure that future faults succeed. The reservation behaviour differs depending on whether the mapping was mapped MAP_SHARED or MAP_PRIVATE. For MAP_SHARED mappings, hugepages are reserved when mmap() is first called and are tracked based on information associated with the inode. Other processes mapping MAP_SHARED use the same reservation. MAP_PRIVATE track the reservations based on the VMA created as part of the mmap() operation. Each process mapping MAP_PRIVATE must make its own reservation. hugetlbfs currently checks if a VMA is MAP_SHARED with the VM_SHARED flag and not VM_MAYSHARE. For file-backed mappings, such as hugetlbfs, VM_SHARED is set only if the mapping is MAP_SHARED and the file was opened read-write. If a shared memory mapping was mapped shared-read-write for populating of data and mapped shared-read-only by other processes, then hugetlbfs would account for the mapping as if it was MAP_PRIVATE. This causes processes to fail to map the file MAP_SHARED even though it should succeed as the reservation is there. This patch alters mm/hugetlb.c and replaces VM_SHARED with VM_MAYSHARE when the intent of the code was to check whether the VMA was mapped MAP_SHARED or MAP_PRIVATE. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <starlight@binnacle.cx> Cc: Eric B Munson <ebmunson@us.ibm.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-29memcg: fix deadlock between lock_page_cgroup and mapping tree_lockDaisuke Nishimura
mapping->tree_lock can be acquired from interrupt context. Then, following dead lock can occur. Assume "A" as a page. CPU0: lock_page_cgroup(A) interrupted -> take mapping->tree_lock. CPU1: take mapping->tree_lock -> lock_page_cgroup(A) This patch tries to fix above deadlock by moving memcg's hook to out of mapping->tree_lock. charge/uncharge of pagecache/swapcache is protected by page lock, not tree_lock. After this patch, lock_page_cgroup() is not called under mapping->tree_lock. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-29oom: fix possible oom_dump_tasks NULL pointerDavid Rientjes
When /proc/sys/vm/oom_dump_tasks is enabled, it is possible to get a NULL pointer for tasks that have detached mm's since task_lock() is not held during the tasklist scan. Add the task_lock(). Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-21hugh: update email addressHugh Dickins
My old address will shut down in a few days time: remove it from the tree, and add a tmpfs (shmem filesystem) maintainer entry with the new address. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-20Merge master.kernel.org:/home/rmk/linux-2.6-armLinus Torvalds
* master.kernel.org:/home/rmk/linux-2.6-arm: (25 commits) [ARM] 5519/1: amba probe: pass "struct amba_id *" instead of void * [ARM] 5517/1: integrator: don't put clock lookups in __initdata [ARM] 5518/1: versatile: don't put clock lookups in __initdata [ARM] mach-l7200: fix spelling of SYS_CLOCK_OFF [ARM] Double check memmap is actually valid with a memmap has unexpected holes V2 [ARM] realview: fix broadcast tick support [ARM] realview: remove useless smp_cross_call_done() [ARM] smp: fix cpumask usage in ARM SMP code [ARM] 5513/1: Eurotech VIPER SBC: fix compilation error [ARM] 5509/1: ep93xx: clkdev enable UARTS ARM: OMAP2/3: Change omapfb to use clkdev for dispc and rfbi, v2 ARM: OMAP3: Fix HW SAVEANDRESTORE shift define ARM: OMAP3: Fix number of GPIO lines for 34xx [ARM] S3C: Do not set clk->owner field if unset [ARM] S3C2410: mach-bast.c registering i2c data too early [ARM] S3C24XX: Fix unused code warning in arch/arm/plat-s3c24xx/dma.c [ARM] S3C64XX: fix GPIO debug [ARM] S3C64XX: GPIO include cleanup [ARM] nwfpe: fix 'floatx80_is_nan' sparse warning [ARM] nwfpe: Add decleration for ExtendedCPDO ...
2009-05-18[ARM] Double check memmap is actually valid with a memmap has unexpected ↵Mel Gorman
holes V2 pfn_valid() is meant to be able to tell if a given PFN has valid memmap associated with it or not. In FLATMEM, it is expected that holes always have valid memmap as long as there is valid PFNs either side of the hole. In SPARSEMEM, it is assumed that a valid section has a memmap for the entire section. However, ARM and maybe other embedded architectures in the future free memmap backing holes to save memory on the assumption the memmap is never used. The page_zone linkages are then broken even though pfn_valid() returns true. A walker of the full memmap must then do this additional check to ensure the memmap they are looking at is sane by making sure the zone and PFN linkages are still valid. This is expensive, but walkers of the full memmap are extremely rare. This was caught before for FLATMEM and hacked around but it hits again for SPARSEMEM because the page_zone linkages can look ok where the PFN linkages are totally screwed. This looks like a hatchet job but the reality is that any clean solution would end up consumning all the memory saved by punching these unexpected holes in the memmap. For example, we tried marking the memmap within the section invalid but the section size exceeds the size of the hole in most cases so pfn_valid() starts returning false where valid memmap exists. Shrinking the size of the section would increase memory consumption offsetting the gains. This patch identifies when an architecture is punching unexpected holes in the memmap that the memory model cannot automatically detect and sets ARCH_HAS_HOLES_MEMORYMODEL. At the moment, this is restricted to EP93xx which is the model sub-architecture this has been reported on but may expand later. When set, walkers of the full memmap must call memmap_valid_within() for each PFN and passing in what it expects the page and zone to be for that PFN. If it finds the linkages to be broken, it assumes the memmap is invalid for that PFN. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2009-05-17page-writeback: fix the calculation of the oldest_jif in wb_kupdate()Toshiyuki Okajima
wb_kupdate() function has a bug on linux-2.6.30-rc5. This bug causes generic_sync_sb_inodes() to start to write inodes back much earlier than our expectations because it miscalculates oldest_jif in wb_kupdate(). This bug was introduced in 704503d836042d4a4c7685b7036e7de0418fbc0f ('mm: fix proc_dointvec_userhz_jiffies "breakage"'). Signed-off-by: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: mm: SLOB fix reclaim_state mm: SLUB fix reclaim_state slub: add Documentation/ABI/testing/sysfs-kernel-slab slub: enforce MAX_ORDER
2009-05-15Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: Revert "mm: add /proc controls for pdflush threads" viocd: needs to depend on BLOCK block: fix the bio_vec array index out-of-bounds test
2009-05-15Revert "mm: add /proc controls for pdflush threads"Jens Axboe
This reverts commit fafd688e4c0c34da0f3de909881117d374e4c7af. Work is progressing to switch away from pdflush as the process backing for flushing out dirty data. So it seems pointless to add more knobs to control pdflush threads. The original author of the patch did not have any specific use cases for adding the knobs, so we can easily revert this before 2.6.30 to avoid having to maintain this API forever. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-05-13Revert "Ignore madvise(MADV_WILLNEED) for hugetlbfs-backed regions"Linus Torvalds
This reverts commit a425a638c858fd10370b573bde81df3ba500e271. Now that the previous commit removed the "readpage" actor for hugetlb files, read-ahead will no longer mess up the mapping, and there's no longer any reason to treat hugetlbfs mappings specially. Tested-and-acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-07NOMMU: Don't check vm_region::vm_start is page aligned in add_nommu_region()David Howells
Don't check vm_region::vm_start is page aligned in add_nommu_region() because the region may reflect some non-page-aligned mapped file, such as could be obtained from RomFS XIP. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-06nommu: make the initial mmap allocation excess behaviour Kconfig configurableDavid Howells
NOMMU mmap() has an option controlled by a sysctl variable that determines whether the allocations made by do_mmap_private() should have the excess space trimmed off and returned to the allocator. Make the initial setting of this variable a Kconfig configuration option. The reason there can be excess space is that the allocator only allocates in power-of-2 size chunks, but mmap()'s can be made in sizes that aren't a power of 2. There are two alternatives: (1) Keep the excess as dead space. The dead space then remains unused for the lifetime of the mapping. Mappings of shared objects such as libc, ld.so or busybox's text segment may retain their dead space forever. (2) Return the excess to the allocator. This means that the dead space is limited to less than a page per mapping, but it means that for a transient process, there's more chance of fragmentation as the excess space may be reused fairly quickly. During the boot process, a lot of transient processes are created, and this can cause a lot of fragmentation as the pagecache and various slabs grow greatly during this time. By turning off the trimming of excess space during boot and disabling batching of frees, Coldfire can manage to boot. A better way of doing things might be to have /sbin/init turn this option off. By that point libc, ld.so and init - which are all long-duration processes - have all been loaded and trimmed. Reported-by: Lanttor Guo <lanttor.guo@freescale.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Lanttor Guo <lanttor.guo@freescale.com> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-06nommu: clamp zone_batchsize() to 0 under NOMMU conditionsDavid Howells
Clamp zone_batchsize() to 0 under NOMMU conditions to stop free_hot_cold_page() from queueing and batching frees. The problem is that under NOMMU conditions it is really important to be able to allocate large contiguous chunks of memory, but when munmap() or exit_mmap() releases big stretches of memory, return of these to the buddy allocator can be deferred, and when it does finally happen, it can be in small chunks. Whilst the fragmentation this incurs isn't so much of a problem under MMU conditions as userspace VM is glued together from individual pages with the aid of the MMU, it is a real problem if there isn't an MMU. By clamping the page freeing queue size to 0, pages are returned to the allocator immediately, and the buddy detector is more likely to be able to glue them together into large chunks immediately, and fragmentation is less likely to occur. By disabling batching of frees, and by turning off the trimming of excess space during boot, Coldfire can manage to boot. Reported-by: Lanttor Guo <lanttor.guo@freescale.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Lanttor Guo <lanttor.guo@freescale.com> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-06mm: use roundown_pow_of_two() in zone_batchsize()David Howells
Use roundown_pow_of_two(N) in zone_batchsize() rather than (1 << (fls(N)-1)) as they are equivalent, and with the former it is easier to see what is going on. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Lanttor Guo <lanttor.guo@freescale.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-06alloc_vmap_area: fix memory leakRalph Wuerthner
If alloc_vmap_area() fails the allocated struct vmap_area has to be freed. Signed-off-by: Ralph Wuerthner <ralphw@linux.vnet.ibm.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-06oom: prevent livelock when oom_kill_allocating_task is setDavid Rientjes
When /proc/sys/vm/oom_kill_allocating_task is set for large systems that want to avoid the lengthy tasklist scan, it's possible to livelock if current is ineligible for oom kill. This normally happens when it is set to OOM_DISABLE, but is also possible if any threads are sharing the same ->mm with a different tgid. So change __out_of_memory() to fall back to the full task-list scan if it was unable to kill `current'. Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-06Merge branches 'topic/documentation', 'topic/slub/fixes' and 'topic/urgent' ↵Pekka Enberg
into for-linus
2009-05-06mm: SLOB fix reclaim_stateNick Piggin
SLOB does not correctly account reclaim_state.reclaimed_slab, so it will break memory reclaim. Account it like SLAB does. Cc: stable@kernel.org Cc: linux-mm@kvack.org Acked-by: Matt Mackall <mpm@selenic.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-05-06mm: SLUB fix reclaim_stateNick Piggin
SLUB does not correctly account reclaim_state.reclaimed_slab, so it will break memory reclaim. Account it like SLAB does. Cc: stable@kernel.org Cc: linux-mm@kvack.org Cc: Matt Mackall <mpm@selenic.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-05-05Ignore madvise(MADV_WILLNEED) for hugetlbfs-backed regionsMel Gorman
madvise(MADV_WILLNEED) forces page cache readahead on a range of memory backed by a file. The assumption is made that the page required is order-0 and "normal" page cache. On hugetlbfs, this assumption is not true and order-0 pages are allocated and inserted into the hugetlbfs page cache. This leaks hugetlbfs page reservations and can cause BUGs to trigger related to corrupted page tables. This patch causes MADV_WILLNEED to be ignored for hugetlbfs-backed regions. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02vmscan: avoid multiplication overflow in shrink_zone()Andrew Morton
Local variable `scan' can overflow on zones which are larger than (2G * 4k) / 100 = 80GB. Making it 64-bit on 64-bit will fix that up. Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02mm: fix Committed_AS underflow on large NR_CPUS environmentKOSAKI Motohiro
The Committed_AS field can underflow in certain situations: > # while true; do cat /proc/meminfo | grep _AS; sleep 1; done | uniq -c > 1 Committed_AS: 18446744073709323392 kB > 11 Committed_AS: 18446744073709455488 kB > 6 Committed_AS: 35136 kB > 5 Committed_AS: 18446744073709454400 kB > 7 Committed_AS: 35904 kB > 3 Committed_AS: 18446744073709453248 kB > 2 Committed_AS: 34752 kB > 9 Committed_AS: 18446744073709453248 kB > 8 Committed_AS: 34752 kB > 3 Committed_AS: 18446744073709320960 kB > 7 Committed_AS: 18446744073709454080 kB > 3 Committed_AS: 18446744073709320960 kB > 5 Committed_AS: 18446744073709454080 kB > 6 Committed_AS: 18446744073709320960 kB Because NR_CPUS can be greater than 1000 and meminfo_proc_show() does not check for underflow. But NR_CPUS proportional isn't good calculation. In general, possibility of lock contention is proportional to the number of online cpus, not theorical maximum cpus (NR_CPUS). The current kernel has generic percpu-counter stuff. using it is right way. it makes code simplify and percpu_counter_read_positive() don't make underflow issue. Reported-by: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Eric B Munson <ebmunson@us.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: <stable@kernel.org> [All kernel versions] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02memcg: fix mem_cgroup_shrink_usage()Daisuke Nishimura
Current mem_cgroup_shrink_usage() has two problems. 1. It doesn't call mem_cgroup_out_of_memory and doesn't update last_oom_jiffies, so pagefault_out_of_memory invokes global OOM. 2. Considering hierarchy, shrinking has to be done from the mem_over_limit, not from the memcg which the page would be charged to. mem_cgroup_try_charge_swapin() does all of these things properly, so we use it and call cancel_charge_swapin when it succeeded. The name of "shrink_usage" is not appropriate for this behavior, so we change it too. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.cn> Cc: Paul Menage <menage@google.com> Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02mm: close page_mkwrite racesNick Piggin
Change page_mkwrite to allow implementations to return with the page locked, and also change it's callers (in page fault paths) to hold the lock until the page is marked dirty. This allows the filesystem to have full control of page dirtying events coming from the VM. Rather than simply hold the page locked over the page_mkwrite call, we call page_mkwrite with the page unlocked and allow callers to return with it locked, so filesystems can avoid LOR conditions with page lock. The problem with the current scheme is this: a filesystem that wants to associate some metadata with a page as long as the page is dirty, will perform this manipulation in its ->page_mkwrite. It currently then must return with the page unlocked and may not hold any other locks (according to existing page_mkwrite convention). In this window, the VM could write out the page, clearing page-dirty. The filesystem has no good way to detect that a dirty pte is about to be attached, so it will happily write out the page, at which point, the filesystem may manipulate the metadata to reflect that the page is no longer dirty. It is not always possible to perform the required metadata manipulation in ->set_page_dirty, because that function cannot block or fail. The filesystem may need to allocate some data structure, for example. And the VM cannot mark the pte dirty before page_mkwrite, because page_mkwrite is allowed to fail, so we must not allow any window where the page could be written to if page_mkwrite does fail. This solution of holding the page locked over the 3 critical operations (page_mkwrite, setting the pte dirty, and finally setting the page dirty) closes out races nicely, preventing page cleaning for writeout being initiated in that window. This provides the filesystem with a strong synchronisation against the VM here. - Sage needs this race closed for ceph filesystem. - Trond for NFS (http://bugzilla.kernel.org/show_bug.cgi?id=12913). - I need it for fsblock. - I suspect other filesystems may need it too (eg. btrfs). - I have converted buffer.c to the new locking. Even simple block allocation under dirty pages might be susceptible to i_size changing under partial page at the end of file (we also have a buffer.c-side problem here, but it cannot be fixed properly without this patch). - Other filesystems (eg. NFS, maybe btrfs) will need to change their page_mkwrite functions themselves. [ This also moves page_mkwrite another step closer to fault, which should eventually allow page_mkwrite to be moved into ->fault, and thus avoiding a filesystem calldown and page lock/unlock cycle in __do_fault. ] [akpm@linux-foundation.org: fix derefs of NULL ->mapping] Cc: Sage Weil <sage@newdream.net> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02memcg: fix try_get_mem_cgroup_from_swapcache()Daisuke Nishimura
This is a bugfix for commit 3c776e64660028236313f0e54f3a9945764422df ("memcg: charge swapcache to proper memcg"). Used bit of swapcache is solid under page lock, but considering move_account, pc->mem_cgroup is not. We need lock_page_cgroup() anyway. Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-05-02mm: fix pageref leak in do_swap_page()Johannes Weiner
By the time the memory cgroup code is notified about a swapin we already hold a reference on the fault page. If the cgroup callback fails make sure to unlock AND release the page reference which was taken by lookup_swap_cach(), or we leak the reference. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-23slub: enforce MAX_ORDERDavid Rientjes
slub_max_order may not be equal to or greater than MAX_ORDER. Additionally, if a single object cannot be placed in a slab of slub_max_order, it still must allocate slabs below MAX_ORDER. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-04-21vmscan,memcg: reintroduce sc->may_swapKOSAKI Motohiro
Commit a6dc60f8975ad96d162915e07703a4439c80dcf0 ("vmscan: rename sc.may_swap to may_unmap") removed the may_swap flag, but memcg had used it as a flag for "we need to use swap?", as the name indicate. And in the current implementation, memcg cannot reclaim mapped file caches when mem+swap hits the limit. re-introduce may_swap flag and handle it at get_scan_ratio(). This patch doesn't influence any scan_control users other than memcg. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-18PM/Hibernate: Fix memory shrinkingRafael J. Wysocki
Commit d979677c4c0 ("mm: shrink_all_memory(): use sc.nr_reclaimed") broke the memory shrinking used by hibernation, becuse it did not update shrink_all_zones() in accordance with the other changes it made. Fix this by making shrink_all_zones() update sc->nr_reclaimed instead of overwriting its value. This fixes http://bugzilla.kernel.org/show_bug.cgi?id=13058 Reported-and-tested-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-16mm: pass correct mm when growing stackHugh Dickins
Tetsuo Handa reports seeing the WARN_ON(current->mm == NULL) in security_vm_enough_memory(), when do_execve() is touching the target mm's stack, to set up its args and environment. Yes, a UMH_NO_WAIT or UMH_WAIT_PROC call_usermodehelper() spawns an mm-less kernel thread to do the exec. And in any case, that vm_enough_memory check when growing stack ought to be done on the target mm, not on the execer's mm (though apart from the warning, it only makes a slight tweak to OVERCOMMIT_NEVER behaviour). Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-16Export filemap_write_and_wait_rangeChris Mason
This wasn't exported before and is useful (used by the experimental ext3 data=guarded code) Signed-off-by: Chris Mason <chris.mason@oracle.com> Acked-by: Theodore Tso <tytso@mit.edu> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-13shmem: respect MAX_LFS_FILESIZEHugh Dickins
SHMEM_MAX_BYTES was derived from the maximum size of its triple-indirect swap vector, forgetting to take the MAX_LFS_FILESIZE limit into account. Never mind 256kB pages, even 8kB pages on 32-bit kernels allowed files to grow slightly bigger than that supposed maximum. Fix this by using the min of both (at build time not run time). And it happens that this calculation is good as far as 8MB pages on 32-bit or 16MB pages on 64-bit: though SHMSWP_MAX_INDEX gets truncated before that, it's truncated to such large numbers that we don't need to care. [akpm@linux-foundation.org: it needs pagemap.h] [akpm@linux-foundation.org: fix sparc64 min() warnings] Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Yuri Tikhonov <yur@emcraft.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-13shmem: fix division by zeroYuri Tikhonov
Fix a division by zero which we have in shmem_truncate_range() and shmem_unuse_inode() when using big PAGE_SIZE values (e.g. 256kB on ppc44x). With 256kB PAGE_SIZE, the ENTRIES_PER_PAGEPAGE constant becomes too large (0x1.0000.0000) on a 32-bit kernel, so this patch just changes its type from 'unsigned long' to 'unsigned long long'. Hugh: reverted its unsigned long longs in shmem_truncate_range() and shmem_getpage(): the pagecache index cannot be more than an unsigned long, so the divisions by zero occurred in unreached code. It's a pity we need any ULL arithmetic here, but I found no pretty way to avoid it. Signed-off-by: Yuri Tikhonov <yur@emcraft.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>