summaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2005-09-05[PATCH] mm/slab.c: prefetchw the start of new allocated objectsEric Dumazet
Mostobjects returned by __cache_alloc() will be written by the caller, (but not all callers want to write all the object, but just at the begining) prefetchw() tells the modern CPU to think about the future writes, ie start some memory transactions in advance. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] x86: ptep_clear optimizationZachary Amsden
Add a new accessor for PTEs, which passes the full hint from the mmu_gather struct; this allows architectures with hardware pagetables to optimize away atomic PTE operations when destroying an address space. Removing the locked operation should allow better pipelining of memory access in this loop. I measured an average savings of 30-35 cycles per zap_pte_range on the first 500 destructions on Pentium-M, but I believe the optimization would win more on older processors which still assert the bus lock on xchg for an exclusive cacheline. Update: I made some new measurements, and this saves exactly 26 cycles over ptep_get_and_clear on Pentium M. On P4, with a PAE kernel, this saves 180 cycles per ptep_get_and_clear, for a whopping 92160 cycles savings for a full address space destruction. pte_clear_full is not yet used, but is provided for future optimizations (in particular, when running inside of a hypervisor that queues page table updates, the full hint allows us to avoid queueing unnecessary page table update for an address space in the process of being destroyed. This is not a huge win, but it does help a bit, and sets the stage for further hypervisor optimization of the mm layer on all architectures. Signed-off-by: Zachary Amsden <zach@vmware.com> Cc: Christoph Lameter <christoph@lameter.com> Cc: <linux-mm@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] sab: consolidate kmem_bufctl_tKyle Moffett
This is used only in slab.c and each architecture gets to define whcih underlying type is to be used. Seems a bit silly - move it to slab.c and use the same type for all architectures: unsigned int. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] hugetlb: move stale pte check into huge_pte_alloc()Adam Litke
Initial Post (Wed, 17 Aug 2005) This patch moves the if (! pte_none(*pte)) hugetlb_clean_stale_pgtable(pte); logic into huge_pte_alloc() so all of its callers can be immune to the bug described by Kenneth Chen at http://lkml.org/lkml/2004/6/16/246 > It turns out there is a bug in hugetlb_prefault(): with 3 level page table, > huge_pte_alloc() might return a pmd that points to a PTE page. It happens > if the virtual address for hugetlb mmap is recycled from previously used > normal page mmap. free_pgtables() might not scrub the pmd entry on > munmap and hugetlb_prefault skips on any pmd presence regardless what type > it is. Unless I am missing something, it seems more correct to place the check inside huge_pte_alloc() to prevent a the same bug wherever a huge pte is allocated. It also allows checking for this condition when lazily faulting huge pages later in the series. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: <linux-mm@kvack.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] arm: allow for arch-specific IOREMAP_MAX_ORDERDeepak Saxena
Version 6 of the ARM architecture introduces the concept of 16MB pages (supersections) and 36-bit (40-bit actually, but nobody uses this) physical addresses. 36-bit addressed memory and I/O and ARMv6 can only be mapped using supersections and the requirement on these is that both virtual and physical addresses be 16MB aligned. In trying to add support for ioremap() of 36-bit I/O, we run into the issue that get_vm_area() allows for a maximum of 512K alignment via the IOREMAP_MAX_ORDER constant. To work around this, we can: - Allocate a larger VM area than needed (size + (1ul << IOREMAP_MAX_ORDER)) and then align the pointer ourselves, but this ends up with 512K of wasted VM per ioremap(). - Provide a new __get_vm_area_aligned() API and make __get_vm_area() sit on top of this. I did this and it works but I don't like the idea adding another VM API just for this one case. - My preferred solution which is to allow the architecture to override the IOREMAP_MAX_ORDER constant with it's own version. Signed-off-by: Deepak Saxena <dsaxena@plexity.net> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] mm: remove implied vm_ops checkPaolo 'Blaisorblade' Giarrusso
If !vma->vm-ops we already BUG above, so retesting it is useless. The compiler cannot optimize this because BUG is a macro and is not thus marked noreturn; that should possibly be fixed. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] shmem_populate: avoid an useless check, and some commentsPaolo 'Blaisorblade' Giarrusso
Either shmem_getpage returns a failure, or it found a page, or it was told it couldn't do any I/O. So it's useless to check nonblock in the else branch. We could add a BUG() there but I preferred to comment the offending function. This was taken out from one Ingo Molnar's old patch I'm resurrecting. Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Cc: Ingo Molnar <mingo@elte.hu> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] vm: slab.c spelling correctionMartin Hicks
Fix a small spelling mistake. subtile->subtle Signed-off-by: Martin Hicks <mort@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] mm: fix madvise vma mergingHugh Dickins
Better late than never, I've at last reviewed the madvise vma merging going into 2.6.13. Remove a pointless check and fix two little bugs - a simple test (with /proc/<pid>/maps hacked to show ReadHints) showed both mismerges in practice: though being madvise, neither was disastrous. 1. Correct placement of the success label in madvise_behavior: as in mprotect_fixup and mlock_fixup, it is necessary to update vm_flags when vma_merge succeeds (to handle the exceptional Case 8 noted in the comments above vma_merge itself). 2. Correct initial value of prev when starting part way into a vma: as in sys_mprotect and do_mlock, it needs to be set to vma in this case (vma_merge handles only that minimum of cases shown in its comments). 3. If find_vma_prev sets prev, then the vma it returns is prev->vm_next, so it's pointless to make that same assignment again in sys_madvise. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] VM: zone reclaim atomic ops cleanupMartin Hicks
Christoph Lameter and Marcelo Tosatti asked to get rid of the atomic_inc_and_test() to cleanup the atomic ops in the zone reclaim code. Signed-off-by: Martin Hicks <mort@sgi.com> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] VM: add capabilites check to set_zone_reclaimMartin Hicks
Add a capability check to sys_set_zone_reclaim(). This syscall is not something that should be available to a user. Signed-off-by: Martin Hicks <mort@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] mm: remove atomicNick Piggin
This bitop does not need to be atomic because it is performed when there will be no references to the page (ie. the page is being freed). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] mm: remap ZERO_PAGE mappingsNick Piggin
filemap_xip's nopage routine maps the ZERO_PAGE into readonly mappings, if it has no data page to map there: then if the hole in the file is later filled, __xip_unmap uses an rmap technique to replace the ZERO_PAGEs mapped for that offset by the newly allocated file page, so that established mappings will see the newly written data. However, on MIPS (alone) there's not one but as many as eight ZERO_PAGEs, chosen for coloring by user virtual address; and if mremap has meanwhile been used to move a mapping containing a ZERO_PAGE, it will generally not match the ZERO_PAGE(address) __xip_unmap is looking for. To maintain XIP's established mappings correctly on MIPS, we need Nick's fix to mremap's move_one_page (originally presented as an optimization), to replace the ZERO_PAGE appropriate to the old address by the ZERO_PAGE appropriate to the new address. (But when I first saw this, I was thinking the ZERO_PAGEs themselves would get corrupted, very bad. Now I think it's the other way round, that the established mappings will fail to see the newly written data: incorrect, but not corrupting everything else. Whether filemap_xip's technique is generally safe, I'd hesitate to say in a hurry: it's interesting, but we've never tried to do that in tmpfs.) Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] mm: cleanup rmapNick Piggin
Thanks to Bill Irwin for pointing this out. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] mm: micro-optimise rmapNick Piggin
Microoptimise page_add_anon_rmap. Although these expressions are used only in the taken branch of the if() statement, the compiler can't reorder them inside because atomic_inc_and_test is a barrier. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] mm: comment rmapNick Piggin
Just be clear that VM_RESERVED pages here are a bug, and the test is not there because they are expected. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] /proc/<pid>/numa_maps to show on which nodes pages resideChristoph Lameter
This patch was recently discussed on linux-mm: http://marc.theaimsgroup.com/?t=112085728500002&r=1&w=2 I inherited a large code base from Ray for page migration. There was a small patch in there that I find to be very useful since it allows the display of the locality of the pages in use by a process. I reworked that patch and came up with a /proc/<pid>/numa_maps that gives more information about the vma's of a process. numa_maps is indexes by the start address found in /proc/<pid>/maps. F.e. with this patch you can see the page use of the "getty" process: margin:/proc/12008 # cat maps 00000000-00004000 r--p 00000000 00:00 0 2000000000000000-200000000002c000 r-xp 00000000 08:04 516 /lib/ld-2.3.3.so 2000000000038000-2000000000040000 rw-p 00028000 08:04 516 /lib/ld-2.3.3.so 2000000000040000-2000000000044000 rw-p 2000000000040000 00:00 0 2000000000058000-2000000000260000 r-xp 00000000 08:04 54707842 /lib/tls/libc.so.6.1 2000000000260000-2000000000268000 ---p 00208000 08:04 54707842 /lib/tls/libc.so.6.1 2000000000268000-2000000000274000 rw-p 00200000 08:04 54707842 /lib/tls/libc.so.6.1 2000000000274000-2000000000280000 rw-p 2000000000274000 00:00 0 2000000000280000-20000000002b4000 r--p 00000000 08:04 9126923 /usr/lib/locale/en_US.utf8/LC_CTYPE 2000000000300000-2000000000308000 r--s 00000000 08:04 60071467 /usr/lib/gconv/gconv-modules.cache 2000000000318000-2000000000328000 rw-p 2000000000318000 00:00 0 4000000000000000-4000000000008000 r-xp 00000000 08:04 29576399 /sbin/mingetty 6000000000004000-6000000000008000 rw-p 00004000 08:04 29576399 /sbin/mingetty 6000000000008000-600000000002c000 rw-p 6000000000008000 00:00 0 [heap] 60000fff7fffc000-60000fff80000000 rw-p 60000fff7fffc000 00:00 0 60000ffffff44000-60000ffffff98000 rw-p 60000ffffff44000 00:00 0 [stack] a000000000000000-a000000000020000 ---p 00000000 00:00 0 [vdso] cat numa_maps 2000000000000000 default MaxRef=43 Pages=11 Mapped=11 N0=4 N1=3 N2=2 N3=2 2000000000038000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2 2000000000040000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1 2000000000058000 default MaxRef=43 Pages=61 Mapped=61 N0=14 N1=15 N2=16 N3=16 2000000000268000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2 2000000000274000 default MaxRef=1 Pages=3 Mapped=3 Anon=3 N0=3 2000000000280000 default MaxRef=8 Pages=3 Mapped=3 N0=3 2000000000300000 default MaxRef=8 Pages=2 Mapped=2 N0=2 2000000000318000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N2=1 4000000000000000 default MaxRef=6 Pages=2 Mapped=2 N1=2 6000000000004000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1 6000000000008000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1 60000fff7fffc000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1 60000ffffff44000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1 getty uses ld.so. The first vma is the code segment which is used by 43 other processes and the pages are evenly distributed over the 4 nodes. The second vma is the process specific data portion for ld.so. This is only one page. The display format is: <startaddress> Links to information in /proc/<pid>/map <memory policy> This can be "default" "interleave={}", "prefer=<node>" or "bind={<zones>}" MaxRef= <maximum reference to a page in this vma> Pages= <Nr of pages in use> Mapped= <Nr of pages with mapcount > Anon= <nr of anonymous pages> Nx= <Nr of pages on Node x> The content of the proc-file is self-evident. If this would be tied into the sparsemem system then the contents of this file would not be too useful. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] rmap: don't test rssHugh Dickins
Remove the three get_mm_counter(mm, rss) tests from rmap.c: there was a time when testing rss was important to avoid a particular race between dup_mmap and the anonmm rmap; but now it's just a rather silly pseudo- optimization, made even more obscure by the get_mm_counter macro. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] delete from_swap_cache BUG_ONsHugh Dickins
Three of the four BUG_ONs in delete_from_swap_cache are immediately repeated in __delete_from_swap_cache: delete those and add the one. But perhaps mm/ is altogether overprovisioned with historic BUGs? Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: swap_lock replace list+deviceHugh Dickins
The idea of a swap_device_lock per device, and a swap_list_lock over them all, is appealing; but in practice almost every holder of swap_device_lock must already hold swap_list_lock, which defeats the purpose of the split. The only exceptions have been swap_duplicate, valid_swaphandles and an untrodden path in try_to_unuse (plus a few places added in this series). valid_swaphandles doesn't show up high in profiles, but swap_duplicate does demand attention. However, with the hold time in get_swap_pages so much reduced, I've not yet found a load and set of swap device priorities to show even swap_duplicate benefitting from the split. Certainly the split is mere overhead in the common case of a single swap device. So, replace swap_list_lock and swap_device_lock by spinlock_t swap_lock (generally we seem to prefer an _ in the name, and not hide in a macro). If someone can show a regression in swap_duplicate, then probably we should add a hashlock for the swap_map entries alone (shorts being anatomic), so as to help the case of the single swap device too. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: scan_swap_map latency breaksHugh Dickins
The get_swap_page/scan_swap_map latency can be so bad that even those without preemption configured deserve relief: periodically cond_resched. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: scan_swap_map drop swap_device_lockHugh Dickins
get_swap_page has often shown up on latency traces, doing lengthy scans while holding two spinlocks. swap_list_lock is already dropped, now scan_swap_map drop swap_device_lock before scanning the swap_map. While scanning for an empty cluster, don't worry that racing tasks may allocate what was free and free what was allocated; but when allocating an entry, check it's still free after retaking the lock. Avoid dropping the lock in the expected common path. No barriers beyond the locks, just let the cookie crumble; highest_bit limit is volatile, but benign. Guard against swapoff: must check SWP_WRITEOK before allocating, must raise SWP_SCANNING reference count while in scan_swap_map, swapoff wait for that to fall - just use schedule_timeout, we don't want to burden scan_swap_map itself, and it's very unlikely that anyone can really still be in scan_swap_map once swapoff gets this far. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: scan_swap_map restyledHugh Dickins
Rewrite scan_swap_map to allocate in just the same way as before (taking the next free entry SWAPFILE_CLUSTER-1 times, then restarting at the lowest wholly empty cluster, falling back to lowest entry if none), but with a view towards dropping the lock in the next patch. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: get_swap_page drop swap_list_lockHugh Dickins
Rewrite get_swap_page to allocate in just the same sequence as before, but without holding swap_list_lock across its scan_swap_map. Decrement nr_swap_pages and update swap_list.next in advance, while still holding swap_list_lock. Skip full devices by testing highest_bit. Swapoff hold swap_device_lock as well as swap_list_lock to clear SWP_WRITEOK. Reduces lock contention when there are parallel swap devices of the same priority. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: freeing update swap_list.nextHugh Dickins
This makes negligible difference in practice: but swap_list.next should not be updated to a higher prio in the general helper swap_info_get, but rather in swap_entry_free; and then only in the case when entry is actually freed. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: swap unsigned int consistencyHugh Dickins
The swap header's unsigned int last_page determines the range of swap pages, but swap_info has been using int or unsigned long in some cases: use unsigned int throughout (except, in several places a local unsigned long is useful to avoid overflows when adding). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Jens Axboe <axboe@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: show span of swap extentsHugh Dickins
The "Adding %dk swap" message shows the number of swap extents, as a guide to how fragmented the swapfile may be. But a useful further guide is what total extent they span across (sometimes scarily large). And there's no need to keep nr_extents in swap_info: it's unused after the initial message, so save a little space by keeping it on stack. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: swap extent list is orderedHugh Dickins
There are several comments that swap's extent_list.prev points to the lowest extent: that's not so, it's extent_list.next which points to it, as you'd expect. And a couple of loops in add_swap_extent which go all the way through the list, when they should just add to the other end. Fix those up, and let map_swap_page search the list forwards: profiles shows it to be twice as quick that way - because prefetch works better on how the structs are typically kmalloc'ed? or because usually more is written to than read from swap, and swap is allocated ascendingly? Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: move destroy_swap_extents callsHugh Dickins
sys_swapon's call to destroy_swap_extents on failure is made after the final swap_list_unlock, which is faintly unsafe: another sys_swapon might already be setting up that swap_info_struct. Calling it earlier, before taking swap_list_lock, is safe. sys_swapoff's call to destroy_swap_extents was safe, but likewise move it earlier, before taking the locks (once try_to_unuse has completed, nothing can be needing the swap extents). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: correct swapfile nr_good_pagesHugh Dickins
If a regular swapfile lies on a filesystem whose blocksize is less than PAGE_SIZE, then setup_swap_extents may have to cut the number of usable swap pages; but sys_swapon's nr_good_pages was not expecting that. Also, setup_swap_extents takes no account of badpages listed in the swap header: not worth doing so, but ensure nr_badpages is 0 for a regular swapfile. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] swap: update swapfile i_sem commentHugh Dickins
Update swap extents comment: nowadays we guard with S_SWAPFILE not i_sem. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] sparsemem extreme: hotplug preparationDave Hansen
This splits up sparse_index_alloc() into two pieces. This is needed because we'll allocate the memory for the second level in a different place from where we actually consume it to keep the allocation from happening underneath a lock Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Bob Picco <bob.picco@hp.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] sparsemem extreme implementationBob Picco
With cleanups from Dave Hansen <haveblue@us.ibm.com> SPARSEMEM_EXTREME makes mem_section a one dimensional array of pointers to mem_sections. This two level layout scheme is able to achieve smaller memory requirements for SPARSEMEM with the tradeoff of an additional shift and load when fetching the memory section. The current SPARSEMEM implementation is a one dimensional array of mem_sections which is the default SPARSEMEM configuration. The patch attempts isolates the implementation details of the physical layout of the sparsemem section array. SPARSEMEM_EXTREME requires bootmem to be functioning at the time of memory_present() calls. This is not always feasible, so architectures which do not need it may allocate everything statically by using SPARSEMEM_STATIC. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-05[PATCH] SPARSEMEM EXTREMEBob Picco
A new option for SPARSEMEM is ARCH_SPARSEMEM_EXTREME. Architecture platforms with a very sparse physical address space would likely want to select this option. For those architecture platforms that don't select the option, the code generated is equivalent to SPARSEMEM currently in -mm. I'll be posting a patch on ia64 ml which uses this new SPARSEMEM feature. ARCH_SPARSEMEM_EXTREME makes mem_section a one dimensional array of pointers to mem_sections. This two level layout scheme is able to achieve smaller memory requirements for SPARSEMEM with the tradeoff of an additional shift and load when fetching the memory section. The current SPARSEMEM -mm implementation is a one dimensional array of mem_sections which is the default SPARSEMEM configuration. The patch attempts isolates the implementation details of the physical layout of the sparsemem section array. ARCH_SPARSEMEM_EXTREME depends on 64BIT and is by default boolean false. I've boot tested under aim load ia64 configured for ARCH_SPARSEMEM_EXTREME. I've also boot tested a 4 way Opteron machine with !ARCH_SPARSEMEM_EXTREME and tested with aim. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Bob Picco <bob.picco@hp.com> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-29[PATCH] Lazy page table copies in fork()Nick Piggin
Defer copying of ptes until fault time when it is possible to reconstruct the pte from backing store. Idea from Andi Kleen and Nick Piggin. Thanks to input from Rik van Riel and Linus and to Hugh for correcting my blundering. Ray Fucillo <fucillo@intersystems.com> reports: "I applied this latest patch to a 2.6.12 kernel and found that it does resolve the problem. Prior to the patch on this machine, I was seeing about 23ms spent in fork for ever 100MB of shared memory segment. After applying the patch, fork is taking about 1ms regardless of the shared memory size." Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-19Fix nasty ncpfs symlink handling bug.Linus Torvalds
This bug could cause oopses and page state corruption, because ncpfs used the generic page-cache symlink handlign functions. But those functions only work if the page cache is guaranteed to be "stable", ie a page that was installed when the symlink walk was started has to still be installed in the page cache at the end of the walk. We could have fixed ncpfs to not use the generic helper routines, but it is in many ways much cleaner to instead improve on the symlink walking helper routines so that they don't require that absolute stability. We do this by allowing "follow_link()" to return a error-pointer as a cookie, which is fed back to the cleanup "put_link()" routine. This also simplifies NFS symlink handling. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-05[PATCH] Fix hugepage crash on failing mmap()David Gibson
This patch fixes a crash in the hugepage code. unmap_hugepage_area() was assuming that (due to prefault) PTEs must exist for all the area in question. However, this may not be the case, if mmap() encounters an error before the prefault and calls unmap_region() to clean up any partial mapping. Depending on the hugepage configuration, this crash can be triggered by an unpriveleged user. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-04[PATCH] __vm_enough_memory() signedness fixSimon Derr
We have found what seems to be a small bug in __vm_enough_memory() when sysctl_overcommit_memory is set to OVERCOMMIT_NEVER. When this bug occurs the systems fails to boot, with /sbin/init whining about fork() returning ENOMEM. We hunted down the problem to this: The deferred update mecanism used in vm_acct_memory(), on a SMP system, allows the vm_committed_space counter to have a negative value. This should not be a problem since this counter is known to be inaccurate. But in __vm_enough_memory() this counter is compared to the `allowed' variable, which is an unsigned long. This comparison is broken since it will consider the negative values of vm_committed_space to be huge positive values, resulting in a memory allocation failure. Signed-off-by: <Jean-Marc.Saffroy@ext.bull.net> Signed-off-by: <Simon.Derr@bull.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-04[PATCH] fix VmSize and VmData after mremapHugh Dickins
mremap's move_vma is applying __vm_stat_account to the old vma which may have already been freed: move it to just before the do_munmap. mremapping to and fro with CONFIG_DEBUG_SLAB=y showed /proc/<pid>/status VmSize and VmData wrapping just like in kernel bugzilla #4842, and fixed by this patch - worth including in 2.6.13, though not yet confirmed that it fixes that specific report from Frank van Maarseveen. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-03Fix up recent get_user_pages() handlingLinus Torvalds
The VM_FAULT_WRITE thing is an extra bit, not a valid return value, and has to be treated as such by get_user_pages(). Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-03[PATCH] fix get_user_pages bugNick Piggin
Checking pte_dirty instead of pte_write in __follow_page is problematic for s390, and for copy_one_pte which leaves dirty when clearing write. So revert __follow_page to check pte_write as before, and make do_wp_page pass back a special extra VM_FAULT_WRITE bit to say it has done its full job: once get_user_pages receives this value, it no longer requires pte_write in __follow_page. But most callers of handle_mm_fault, in the various architectures, have switch statements which do not expect this new case. To avoid changing them all in a hurry, make an inline wrapper function (using the old name) that masks off the new bit, and use the extended interface with double underscores. Yes, we do have a call to do_wp_page from do_swap_page, but no need to change that: in rare case it's needed, another do_wp_page will follow. Signed-off-by: Hugh Dickins <hugh@veritas.com> [ Cleanups by Nick Piggin ] Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-01[PATCH] sys_set_mempolicy() doesnt check if mode < 0Eric Dumazet
A kernel BUG() is triggered by a call to set_mempolicy() with a negative first argument. This is because the mode is declared as an int, and the validity check doesnt check < 0 values. Alternatively, mode could be declared as unsigned int or unsigned long. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-01[PATCH] x86_64: access of some bad addressHugh Dickins
x86_64 has a large sparse gate area between VSYSCALL_START and VSYSCALL_END, not all of it presently backed by pmds. Alexander Nyberg has found that in some circumstances gdb may try to ptrace here, and hit get_user_pages BUG_ON. It seems odd that gdb should be accessing here, but it certainly shouldn't crash in this way: relax BUG_ON to -EFAULT. Fixes kernel bugzilla #4801. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-01Fix get_user_pages() race for write accessLinus Torvalds
There's no real guarantee that handle_mm_fault() will always be able to break a COW situation - if an update from another thread ends up modifying the page table some way, handle_mm_fault() may end up requiring us to re-try the operation. That's normally fine, but get_user_pages() ended up re-trying it as a read, and thus a write access could in theory end up losing the dirty bit or be done on a page that had not been properly COW'ed. This makes get_user_pages() always retry write accesses as write accesses by making "follow_page()" require that a writable follow has the dirty bit set. That simplifies the code and solves the race: if the COW break fails for some reason, we'll just loop around and try again. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-30[PATCH] Fix NUMA node sizing in nr_free_zone_pagesMartin J. Bligh
We are iterating over all nodes in nr_free_zone_pages(). Because the fallback zonelists contain all nodes in the system, and we walk all the zonelists, we're counting memory multiple times (once for each node). This caused us to make a size estimate of 32GB for an 8GB AMD64 box, which makes all the dirty ratio calculations, etc incorrect. There's still a further bug to fix from e820 holes causing overestimation as well, but this fix is separate, and good as is, and fixes one class of problems. Problem found by Badari, and tested by Ram Pai - thanks! Signed-off-by: Martin J. Bligh <mbligh@mbligh.org> Signed-off-by: Matt Dobson <colpatch@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27[PATCH] Remove bogus warning in page_alloc.cAndy Whitcroft
Originally __free_pages_bulk used the relative page number within a zone to define its buddies. This meant that to maintain the "maximally aligned" requirements (that an allocation of size N will be aligned at least to N physically) zones had to also be aligned to 1<<MAX_ORDER pages. When __free_pages_bulk was updated to use the relative page frame numbers of the free'd pages to pair buddies this released the alignment constraint on the 'left' edge of the zone. This allows _either_ edge of the zone to contain partial MAX_ORDER sized buddies. These simply never will have matching buddies and thus will never make it to the 'top' of the pyramid. The patch below removes a now redundant check ensuring that the mem_map was aligned to MAX_ORDER. Signed-off-by: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27[PATCH] madvise() does not always return -EBADF on non-file mapped areasuzuki
The madvise() system call returns -EBADF for areas which does not map to files, only for *behaviour* request MADV_WILLNEED. According to man pages, madvise returns : EBADF - the map exists, but the area maps something that isn't a file. Fixes bug 2995. Signed-off-by: Suzuki K P <suzuki@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27[PATCH] check_user_page_readable() deadlock fixAndrew Morton
Fix bug identifued by Richard Purdie <rpurdie@rpsys.net>. oprofile calls check_user_page_readable() from interrupt context, so we deadlock over various VFS locks. But check_user_page_readable() doesn't imply either a read or a write of the page's contents. Change __follow_page() so that check_user_page_readable() can tell __follow_page() that we're not accessing the page's contents, and use that info to avoid the troublesome lock-takings. Also, make follow_page() inline for the single callsite in memory.c to save a bit of stack space. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-27[PATCH] Undo mempolicy shared policy rbtree microoptimizationAndi Kleen
All mempolicy changes must be inside the spinlock and readding the rb_erase prevents a crash while doing: > echo "1" > /tmp/numatest > numactl --length=0x4000 --shm /tmp/numatest --localalloc > numactl --length=0x2000 --offset=0 --shm /tmp/numatest --membind=0 > numactl --length=0x2000 --offset=0x2000 --shm /tmp/numatest --membind=1 > ipcs > ipcrm -M "the_key_value_of_this_shm_area" Based on a patch by John Blackwood Cc: <john.blackwood@ccur.com> Cc: <andrea@suse.de> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-15[PATCH] execute-in-place fixesCarsten Otte
This patch includes feedback from Andrew and Christoph. Thanks for taking time to review. Use of empty_zero_page was eliminated to fix compilation for architectures that don't have it. This patch removes setting pages up-to-date in ext2_get_xip_page and all bug checks to verify that the page is indeed up to date. Setting the page state on mapping to userland is bogus. None of the code patchs involved with these pages in mm cares about the page state. still on my ToDo list: identify a place outside second extended where __inode_direct_access should reside Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>