Diffstat (limited to 'Documentation/vm')
-rw-r--r--  Documentation/vm/.gitignore                 2
-rw-r--r--  Documentation/vm/00-INDEX                  18
-rw-r--r--  Documentation/vm/Makefile                   8
-rw-r--r--  Documentation/vm/active_mm.txt             83
-rw-r--r--  Documentation/vm/balance                   18
-rw-r--r--  Documentation/vm/cleancache.txt           278
-rw-r--r--  Documentation/vm/highmem.txt              162
-rw-r--r--  Documentation/vm/hugepage-mmap.c           91
-rw-r--r--  Documentation/vm/hugepage-shm.c            98
-rw-r--r--  Documentation/vm/hugetlbpage.txt          477
-rw-r--r--  Documentation/vm/hwpoison.txt             182
-rw-r--r--  Documentation/vm/ksm.txt                   82
-rw-r--r--  Documentation/vm/locking                    4
-rw-r--r--  Documentation/vm/map_hugetlb.c             77
-rw-r--r--  Documentation/vm/numa                     186
-rw-r--r--  Documentation/vm/numa_memory_policy.txt    13
-rw-r--r--  Documentation/vm/overcommit-accounting      2
-rw-r--r--  Documentation/vm/page-types.c            1100
-rw-r--r--  Documentation/vm/page_migration            12
-rw-r--r--  Documentation/vm/pagemap.txt              147
-rw-r--r--  Documentation/vm/slabinfo.c              1364
-rw-r--r--  Documentation/vm/slub.txt                  15
-rw-r--r--  Documentation/vm/transhuge.txt            299
-rw-r--r--  Documentation/vm/unevictable-lru.txt      690
24 files changed, 3738 insertions, 1670 deletions
diff --git a/Documentation/vm/.gitignore b/Documentation/vm/.gitignore
new file mode 100644
index 000000000000..09b164a5700f
--- /dev/null
+++ b/Documentation/vm/.gitignore
@@ -0,0 +1,2 @@
+page-types
+slabinfo
diff --git a/Documentation/vm/00-INDEX b/Documentation/vm/00-INDEX
index 2131b00b63f6..dca82d7c83d8 100644
--- a/Documentation/vm/00-INDEX
+++ b/Documentation/vm/00-INDEX
@@ -1,20 +1,38 @@
00-INDEX
- this file.
+active_mm.txt
+ - An explanation from Linus about tsk->active_mm vs tsk->mm.
balance
- various information on memory balancing.
+hugepage-mmap.c
+ - Example app using huge page memory with the mmap system call.
+hugepage-shm.c
+ - Example app using huge page memory with Sys V shared memory system calls.
hugetlbpage.txt
- a brief summary of hugetlbpage support in the Linux kernel.
+hwpoison.txt
+ - explains what hwpoison is
+ksm.txt
+ - how to use the Kernel Samepage Merging feature.
locking
- info on how locking and synchronization is done in the Linux vm code.
+map_hugetlb.c
+ - an example program that uses the MAP_HUGETLB mmap flag.
numa
- information about NUMA specific code in the Linux vm.
numa_memory_policy.txt
- documentation of concepts and APIs of the 2.6 memory policy support.
overcommit-accounting
- description of the Linux kernels overcommit handling modes.
+page-types.c
+ - Tool for querying page flags
page_migration
- description of page migration in NUMA systems.
+pagemap.txt
+ - pagemap, from the userspace perspective
slabinfo.c
- source code for a tool to get reports about slabs.
slub.txt
- a short users guide for SLUB.
+unevictable-lru.txt
+ - Unevictable LRU infrastructure
diff --git a/Documentation/vm/Makefile b/Documentation/vm/Makefile
new file mode 100644
index 000000000000..3fa4d0668864
--- /dev/null
+++ b/Documentation/vm/Makefile
@@ -0,0 +1,8 @@
+# kbuild trick to avoid linker error. Can be omitted if a module is built.
+obj- := dummy.o
+
+# List of programs to build
+hostprogs-y := page-types hugepage-mmap hugepage-shm map_hugetlb
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
diff --git a/Documentation/vm/active_mm.txt b/Documentation/vm/active_mm.txt
new file mode 100644
index 000000000000..dbf45817405f
--- /dev/null
+++ b/Documentation/vm/active_mm.txt
@@ -0,0 +1,83 @@
+List: linux-kernel
+Subject: Re: active_mm
+From: Linus Torvalds <torvalds () transmeta ! com>
+Date: 1999-07-30 21:36:24
+
+Cc'd to linux-kernel, because I don't write explanations all that often,
+and when I do I feel better about more people reading them.
+
+On Fri, 30 Jul 1999, David Mosberger wrote:
+>
+> Is there a brief description someplace on how "mm" vs. "active_mm" in
+> the task_struct are supposed to be used? (My apologies if this was
+> discussed on the mailing lists---I just returned from vacation and
+> wasn't able to follow linux-kernel for a while).
+
+Basically, the new setup is:
+
+ - we have "real address spaces" and "anonymous address spaces". The
+ difference is that an anonymous address space doesn't care about the
+ user-level page tables at all, so when we do a context switch into an
+ anonymous address space we just leave the previous address space
+ active.
+
+ The obvious use for a "anonymous address space" is any thread that
+ doesn't need any user mappings - all kernel threads basically fall into
+ this category, but even "real" threads can temporarily say that for
+ some amount of time they are not going to be interested in user space,
+ and that the scheduler might as well try to avoid wasting time on
+ switching the VM state around. Currently only the old-style bdflush
+ sync does that.
+
+ - "tsk->mm" points to the "real address space". For an anonymous process,
+ tsk->mm will be NULL, for the logical reason that an anonymous process
+ really doesn't _have_ a real address space at all.
+
+ - however, we obviously need to keep track of which address space we
+ "stole" for such an anonymous user. For that, we have "tsk->active_mm",
+ which shows what the currently active address space is.
+
+ The rule is that for a process with a real address space (ie tsk->mm is
+ non-NULL) the active_mm obviously always has to be the same as the real
+ one.
+
+ For a anonymous process, tsk->mm == NULL, and tsk->active_mm is the
+ "borrowed" mm while the anonymous process is running. When the
+ anonymous process gets scheduled away, the borrowed address space is
+ returned and cleared.
+
+To support all that, the "struct mm_struct" now has two counters: a
+"mm_users" counter that is how many "real address space users" there are,
+and a "mm_count" counter that is the number of "lazy" users (ie anonymous
+users) plus one if there are any real users.
+
+Usually there is at least one real user, but it could be that the real
+user exited on another CPU while a lazy user was still active, so you do
+actually get cases where you have a address space that is _only_ used by
+lazy users. That is often a short-lived state, because once that thread
+gets scheduled away in favour of a real thread, the "zombie" mm gets
+released because "mm_users" becomes zero.
+
+Also, a new rule is that _nobody_ ever has "init_mm" as a real MM any
+more. "init_mm" should be considered just a "lazy context when no other
+context is available", and in fact it is mainly used just at bootup when
+no real VM has yet been created. So code that used to check
+
+ if (current->mm == &init_mm)
+
+should generally just do
+
+ if (!current->mm)
+
+instead (which makes more sense anyway - the test is basically one of "do
+we have a user context", and is generally done by the page fault handler
+and things like that).
+
+Anyway, I put a pre-patch-2.3.13-1 on ftp.kernel.org just a moment ago,
+because it slightly changes the interfaces to accommodate the alpha (who
+would have thought it, but the alpha actually ends up having one of the
+ugliest context switch codes - unlike the other architectures where the MM
+and register state is separate, the alpha PALcode joins the two, and you
+need to switch both together).
+
+(From http://marc.info/?l=linux-kernel&m=93337278602211&w=2)
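+
+As a rough illustration of how the scheduler ends up using mm, active_mm and
+mm_count (a paraphrase of the context_switch() logic of kernels from that
+era, not a verbatim copy; enter_lazy_tlb(), switch_mm() and mmdrop() are real
+helpers, but the exact code differs by version and architecture):
+
+	if (!next->mm) {			/* anonymous: borrow an mm */
+		next->active_mm = prev->active_mm;
+		atomic_inc(&prev->active_mm->mm_count);
+		enter_lazy_tlb(prev->active_mm, next);
+	} else {				/* real address space */
+		switch_mm(prev->active_mm, next->mm, next);
+	}
+
+	if (!prev->mm) {			/* return the borrowed mm */
+		struct mm_struct *borrowed = prev->active_mm;
+
+		prev->active_mm = NULL;
+		mmdrop(borrowed);		/* drops the mm_count reference */
+	}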
diff --git a/Documentation/vm/balance b/Documentation/vm/balance
index bd3d31bc4915..c46e68cf9344 100644
--- a/Documentation/vm/balance
+++ b/Documentation/vm/balance
@@ -75,15 +75,15 @@ Page stealing from process memory and shm is done if stealing the page would
alleviate memory pressure on any zone in the page's node that has fallen below
its watermark.
-pages_min/pages_low/pages_high/low_on_memory/zone_wake_kswapd: These are
-per-zone fields, used to determine when a zone needs to be balanced. When
-the number of pages falls below pages_min, the hysteric field low_on_memory
-gets set. This stays set till the number of free pages becomes pages_high.
-When low_on_memory is set, page allocation requests will try to free some
-pages in the zone (providing GFP_WAIT is set in the request). Orthogonal
-to this, is the decision to poke kswapd to free some zone pages. That
-decision is not hysteresis based, and is done when the number of free
-pages is below pages_low; in which case zone_wake_kswapd is also set.
+watermark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These
+are per-zone fields, used to determine when a zone needs to be balanced. When
+the number of pages falls below watermark[WMARK_MIN], the hysteric field
+low_on_memory gets set. This stays set till the number of free pages becomes
+watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will
+try to free some pages in the zone (providing GFP_WAIT is set in the request).
+Orthogonal to this, is the decision to poke kswapd to free some zone pages.
+That decision is not hysteresis based, and is done when the number of free
+pages is below watermark[WMARK_LOW]; in which case zone_wake_kswapd is also set.
(Good) Ideas that I have heard:
diff --git a/Documentation/vm/cleancache.txt b/Documentation/vm/cleancache.txt
new file mode 100644
index 000000000000..36c367c73084
--- /dev/null
+++ b/Documentation/vm/cleancache.txt
@@ -0,0 +1,278 @@
+MOTIVATION
+
+Cleancache is a new optional feature provided by the VFS layer that
+potentially dramatically increases page cache effectiveness for
+many workloads in many environments at a negligible cost.
+
+Cleancache can be thought of as a page-granularity victim cache for clean
+pages that the kernel's pageframe replacement algorithm (PFRA) would like
+to keep around, but can't since there isn't enough memory. So when the
+PFRA "evicts" a page, it first attempts to use cleancache code to
+put the data contained in that page into "transcendent memory", memory
+that is not directly accessible or addressable by the kernel and is
+of unknown and possibly time-varying size.
+
+Later, when a cleancache-enabled filesystem wishes to access a page
+in a file on disk, it first checks cleancache to see if it already
+contains it; if it does, the page of data is copied into the kernel
+and a disk access is avoided.
+
+Transcendent memory "drivers" for cleancache are currently implemented
+in Xen (using hypervisor memory) and zcache (using in-kernel compressed
+memory) and other implementations are in development.
+
+FAQs are included below.
+
+IMPLEMENTATION OVERVIEW
+
+A cleancache "backend" that provides transcendent memory registers itself
+to the kernel's cleancache "frontend" by calling cleancache_register_ops,
+passing a pointer to a cleancache_ops structure with funcs set appropriately.
+Note that cleancache_register_ops returns the previous settings so that
+chaining can be performed if desired. The functions provided must conform to
+certain semantics as follows:
+
+Most important, cleancache is "ephemeral". Pages which are copied into
+cleancache have an indefinite lifetime which is completely unknowable
+by the kernel and so may or may not still be in cleancache at any later time.
+Thus, as its name implies, cleancache is not suitable for dirty pages.
+Cleancache has complete discretion over what pages to preserve and what
+pages to discard and when.
+
+Mounting a cleancache-enabled filesystem should call "init_fs" to obtain a
+pool id which, if positive, must be saved in the filesystem's superblock;
+a negative return value indicates failure. A "put_page" will copy a
+(presumably about-to-be-evicted) page into cleancache and associate it with
+the pool id, a file key, and a page index into the file. (The combination
+of a pool id, a file key, and an index is sometimes called a "handle".)
+A "get_page" will copy the page, if found, from cleancache into kernel memory.
+A "flush_page" will ensure the page no longer is present in cleancache;
+a "flush_inode" will flush all pages associated with the specified file;
+and, when a filesystem is unmounted, a "flush_fs" will flush all pages in
+all files specified by the given pool id and also surrender the pool id.
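+
+As a sketch of what a backend registration might look like (the field names
+below follow the operations just described and the example_* handlers are
+placeholders; the authoritative definition lives in include/linux/cleancache.h
+and may differ in detail):
+
+	static struct cleancache_ops example_ops = {
+		.init_fs	= example_init_fs,
+		.init_shared_fs	= example_init_shared_fs,
+		.get_page	= example_get_page,
+		.put_page	= example_put_page,
+		.flush_page	= example_flush_page,
+		.flush_inode	= example_flush_inode,
+		.flush_fs	= example_flush_fs,
+	};
+
+	/* cleancache_register_ops returns the previous ops for chaining */
+	old_ops = cleancache_register_ops(&example_ops);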
+
+An "init_shared_fs", like init_fs, obtains a pool id but tells cleancache
+to treat the pool as shared using a 128-bit UUID as a key. On systems
+that may run multiple kernels (such as hard partitioned or virtualized
+systems) that may share a clustered filesystem, and where cleancache
+may be shared among those kernels, calls to init_shared_fs that specify the
+same UUID will receive the same pool id, thus allowing the pages to
+be shared. Note that any security requirements must be imposed outside
+of the kernel (e.g. by "tools" that control cleancache). Or a
+cleancache implementation can simply disable shared_init by always
+returning a negative value.
+
+If a get_page is successful on a non-shared pool, the page is flushed (thus
+making cleancache an "exclusive" cache). On a shared pool, the page
+is NOT flushed on a successful get_page so that it remains accessible to
+other sharers. The kernel is responsible for ensuring coherency between
+cleancache (shared or not), the page cache, and the filesystem, using
+cleancache flush operations as required.
+
+Note that cleancache must enforce put-put-get coherency and get-get
+coherency. For the former, if two puts are made to the same handle but
+with different data, say AAA by the first put and BBB by the second, a
+subsequent get can never return the stale data (AAA). For get-get coherency,
+if a get for a given handle fails, subsequent gets for that handle will
+never succeed unless preceded by a successful put with that handle.
+
+Last, cleancache provides no SMP serialization guarantees; if two
+different Linux threads are simultaneously putting and flushing a page
+with the same handle, the results are indeterminate. Callers must
+lock the page to ensure serial behavior.
+
+CLEANCACHE PERFORMANCE METRICS
+
+Cleancache monitoring is done by sysfs files in the
+/sys/kernel/mm/cleancache directory. The effectiveness of cleancache
+can be measured (across all filesystems) with:
+
+succ_gets - number of gets that were successful
+failed_gets - number of gets that failed
+puts - number of puts attempted (all "succeed")
+flushes - number of flushes attempted
+
+A backend implementation may provide additional metrics.
+
+FAQ
+
+1) Where's the value? (Andrew Morton)
+
+Cleancache provides a significant performance benefit to many workloads
+in many environments with negligible overhead by improving the
+effectiveness of the pagecache. Clean pagecache pages are
+saved in transcendent memory (RAM that is otherwise not directly
+addressable to the kernel); fetching those pages later avoids "refaults"
+and thus disk reads.
+
+Cleancache (and its sister code "frontswap") provide interfaces for
+this transcendent memory (aka "tmem"), which conceptually lies between
+fast kernel-directly-addressable RAM and slower DMA/asynchronous devices.
+Disallowing direct kernel or userland reads/writes to tmem
+is ideal when data is transformed to a different form and size (such
+as with compression) or secretly moved (as might be useful for write-
+balancing for some RAM-like devices). Evicted page-cache pages (and
+swap pages) are a great use for this kind of slower-than-RAM-but-much-
+faster-than-disk transcendent memory, and the cleancache (and frontswap)
+"page-object-oriented" specification provides a nice way to read and
+write -- and indirectly "name" -- the pages.
+
+In the virtual case, the whole point of virtualization is to statistically
+multiplex physical resources across the varying demands of multiple
+virtual machines. This is really hard to do with RAM and efforts to
+do it well with no kernel change have essentially failed (except in some
+well-publicized special-case workloads). Cleancache -- and frontswap --
+with a fairly small impact on the kernel, provide a huge amount
+of flexibility for more dynamic, flexible RAM multiplexing.
+Specifically, the Xen Transcendent Memory backend allows otherwise
+"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
+virtual machines, but the pages can be compressed and deduplicated to
+optimize RAM utilization. And when guest OS's are induced to surrender
+underutilized RAM (e.g. with "self-ballooning"), page cache pages
+are the first to go, and cleancache allows those pages to be
+saved and reclaimed if overall host system memory conditions allow.
+
+And the identical interface used for cleancache can be used in
+physical systems as well. The zcache driver acts as a memory-hungry
+device that stores pages of data in a compressed state. And
+the proposed "RAMster" driver shares RAM across multiple physical
+systems.
+
+2) Why does cleancache have its sticky fingers so deep inside the
+ filesystems and VFS? (Andrew Morton and Christoph Hellwig)
+
+The core hooks for cleancache in VFS are in most cases a single line
+and the minimum set are placed precisely where needed to maintain
+coherency (via cleancache_flush operations) between cleancache,
+the page cache, and disk. All hooks compile into nothingness if
+cleancache is config'ed off and turn into a function-pointer-
+compare-to-NULL if config'ed on but no backend claims the ops
+functions, or to a compare-struct-element-to-negative if a
+backend claims the ops functions but a filesystem doesn't enable
+cleancache.
+
+Some filesystems are built entirely on top of VFS and the hooks
+in VFS are sufficient, so don't require an "init_fs" hook; the
+initial implementation of cleancache didn't provide this hook.
+But for some filesystems (such as btrfs), the VFS hooks are
+incomplete and one or more hooks in fs-specific code are required.
+And for some other filesystems, such as tmpfs, cleancache may
+be counterproductive. So it seemed prudent to require a filesystem
+to "opt in" to use cleancache, which requires adding a hook in
+each filesystem. Some filesystems are not yet supported by cleancache
+simply because they haven't been tested. The existing set should
+be sufficient to validate the concept, the opt-in approach means
+that untested filesystems are not affected, and the hooks in the
+existing filesystems should make it very easy to add more
+filesystems in the future.
+
+The total impact of the hooks to existing fs and mm files is only
+about 40 lines added (not counting comments and blank lines).
+
+3) Why not make cleancache asynchronous and batched so it can
+ more easily interface with real devices with DMA instead
+ of copying each individual page? (Minchan Kim)
+
+The one-page-at-a-time copy semantics simplifies the implementation
+on both the frontend and backend and also allows the backend to
+do fancy things on-the-fly like page compression and
+page deduplication. And since the data is "gone" (copied into/out
+of the pageframe) before the cleancache get/put call returns,
+a great deal of race conditions and potential coherency issues
+are avoided. While the interface seems odd for a "real device"
+or for real kernel-addressable RAM, it makes perfect sense for
+transcendent memory.
+
+4) Why is non-shared cleancache "exclusive"? And where is the
+ page "flushed" after a "get"? (Minchan Kim)
+
+The main reason is to free up space in transcendent memory and
+to avoid unnecessary cleancache_flush calls. If you want inclusive,
+the page can be "put" immediately following the "get". If
+put-after-get for inclusive becomes common, the interface could
+be easily extended to add a "get_no_flush" call.
+
+The flush is done by the cleancache backend implementation.
+
+5) What's the performance impact?
+
+Performance analysis has been presented at OLS'09 and LCA'10.
+Briefly, performance gains can be significant on most workloads,
+especially when memory pressure is high (e.g. when RAM is
+overcommitted in a virtual workload); and because the hooks are
+invoked primarily in place of or in addition to a disk read/write,
+overhead is negligible even in worst case workloads. Basically
+cleancache replaces I/O with memory-copy-CPU-overhead; on older
+single-core systems with slow memory-copy speeds, cleancache
+has little value, but in newer multicore machines, especially
+consolidated/virtualized machines, it has great value.
+
+6) How do I add cleancache support for filesystem X? (Boaz Harrash)
+
+Filesystems that are well-behaved and conform to certain
+restrictions can utilize cleancache simply by making a call to
+cleancache_init_fs at mount time. Unusual, misbehaving, or
+poorly layered filesystems must either add additional hooks
+and/or undergo extensive additional testing... or should just
+not enable the optional cleancache.
+
+Some points for a filesystem to consider:
+
+- The FS should be block-device-based (e.g. a ram-based FS such
+ as tmpfs should not enable cleancache)
+- To ensure coherency/correctness, the FS must ensure that all
+ file removal or truncation operations either go through VFS or
+ add hooks to do the equivalent cleancache "flush" operations
+- To ensure coherency/correctness, either inode numbers must
+ be unique across the lifetime of the on-disk file OR the
+ FS must provide an "encode_fh" function.
+- The FS must call the VFS superblock alloc and deactivate routines
+ or add hooks to do the equivalent cleancache calls done there.
+- To maximize performance, all pages fetched from the FS should
+ go through the do_mpage_readpage routine or the FS should add
+ hooks to do the equivalent (cf. btrfs)
+- Currently, the FS blocksize must be the same as PAGESIZE. This
+ is not an architectural restriction, but no backends currently
+ support anything different.
+- A clustered FS should invoke the "shared_init_fs" cleancache
+ hook to get best performance for some backends.
+
+7) Why not use the KVA of the inode as the key? (Christoph Hellwig)
+
+If cleancache would use the inode virtual address instead of
+inode/filehandle, the pool id could be eliminated. But, this
+won't work because cleancache retains pagecache data pages
+persistently even when the inode has been pruned from the
+inode unused list, and only flushes the data page if the file
+gets removed/truncated. So if cleancache used the inode kva,
+there would be potential coherency issues if/when the inode
+kva is reused for a different file. Alternately, if cleancache
+flushed the pages when the inode kva was freed, much of the value
+of cleancache would be lost because the cache of pages in cleancache
+is potentially much larger than the kernel pagecache and is most
+useful if the pages survive inode cache removal.
+
+8) Why is a global variable required?
+
+The cleancache_enabled flag is checked in all of the frequently-used
+cleancache hooks. The alternative is a function call to check a static
+variable. Since cleancache is enabled dynamically at runtime, systems
+that don't enable cleancache would suffer thousands (possibly
+tens-of-thousands) of unnecessary function calls per second. So the
+global variable allows cleancache to be enabled by default at compile
+time, but have insignificant performance impact when cleancache remains
+disabled at runtime.
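+
+A sketch of the hook pattern this describes (the real inline wrappers in
+include/linux/cleancache.h are similar in spirit but not identical):
+
+	extern int cleancache_enabled;
+
+	static inline int cleancache_get_page(struct page *page)
+	{
+		if (!cleancache_enabled)
+			return -1;	/* compiles down to a flag test */
+		return __cleancache_get_page(page);
+	}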
+
+9) Does cleancache work with KVM?
+
+The memory model of KVM is sufficiently different that a cleancache
+backend may have less value for KVM. This remains to be tested,
+especially in an overcommitted system.
+
+10) Does cleancache work in userspace? It sounds useful for
+ memory hungry caches like web browsers. (Jamie Lokier)
+
+No plans yet, though we agree it sounds useful, at least for
+apps that bypass the page cache (e.g. O_DIRECT).
+
+Last updated: Dan Magenheimer, April 13 2011
diff --git a/Documentation/vm/highmem.txt b/Documentation/vm/highmem.txt
new file mode 100644
index 000000000000..4324d24ffacd
--- /dev/null
+++ b/Documentation/vm/highmem.txt
@@ -0,0 +1,162 @@
+
+ ====================
+ HIGH MEMORY HANDLING
+ ====================
+
+By: Peter Zijlstra <a.p.zijlstra@chello.nl>
+
+Contents:
+
+ (*) What is high memory?
+
+ (*) Temporary virtual mappings.
+
+ (*) Using kmap_atomic.
+
+ (*) Cost of temporary mappings.
+
+ (*) i386 PAE.
+
+
+====================
+WHAT IS HIGH MEMORY?
+====================
+
+High memory (highmem) is used when the size of physical memory approaches or
+exceeds the maximum size of virtual memory. At that point it becomes
+impossible for the kernel to keep all of the available physical memory mapped
+at all times. This means the kernel needs to start using temporary mappings of
+the pieces of physical memory that it wants to access.
+
+The part of (physical) memory not covered by a permanent mapping is what we
+refer to as 'highmem'. There are various architecture dependent constraints on
+where exactly that border lies.
+
+In the i386 arch, for example, we choose to map the kernel into every process's
+VM space so that we don't have to pay the full TLB invalidation costs for
+kernel entry/exit. This means the available virtual memory space (4GiB on
+i386) has to be divided between user and kernel space.
+
+The traditional split for architectures using this approach is 3:1, 3GiB for
+userspace and the top 1GiB for kernel space:
+
+ +--------+ 0xffffffff
+ | Kernel |
+ +--------+ 0xc0000000
+ | |
+ | User |
+ | |
+ +--------+ 0x00000000
+
+This means that the kernel can at most map 1GiB of physical memory at any one
+time, but because we need virtual address space for other things - including
+temporary maps to access the rest of the physical memory - the actual direct
+map will typically be less (usually around ~896MiB).
+
+Other architectures that have mm context tagged TLBs can have separate kernel
+and user maps. Some hardware (like some ARMs), however, has limited virtual
+space when it uses mm context tags.
+
+
+==========================
+TEMPORARY VIRTUAL MAPPINGS
+==========================
+
+The kernel contains several ways of creating temporary mappings:
+
+ (*) vmap(). This can be used to make a long duration mapping of multiple
+ physical pages into a contiguous virtual space. It needs global
+ synchronization to unmap.
+
+ (*) kmap(). This permits a short duration mapping of a single page. It needs
+ global synchronization, but is amortized somewhat. It is also prone to
+ deadlocks when using in a nested fashion, and so it is not recommended for
+ new code.
+
+ (*) kmap_atomic(). This permits a very short duration mapping of a single
+ page. Since the mapping is restricted to the CPU that issued it, it
+ performs well, but the issuing task is therefore required to stay on that
+ CPU until it has finished, lest some other task displace its mappings.
+
+ kmap_atomic() may also be used by interrupt contexts, since it does not
+ sleep and the caller may not sleep until after kunmap_atomic() is called.
+
+ It may be assumed that k[un]map_atomic() won't fail.
+
+
+=================
+USING KMAP_ATOMIC
+=================
+
+When and where to use kmap_atomic() is straightforward. It is used when code
+wants to access the contents of a page that might be allocated from high memory
+(see __GFP_HIGHMEM), for example a page in the pagecache. The API has two
+functions, and they can be used in a manner similar to the following:
+
+ /* Find the page of interest. */
+ struct page *page = find_get_page(mapping, offset);
+
+ /* Gain access to the contents of that page. */
+ void *vaddr = kmap_atomic(page);
+
+ /* Do something to the contents of that page. */
+ memset(vaddr, 0, PAGE_SIZE);
+
+ /* Unmap that page. */
+ kunmap_atomic(vaddr);
+
+Note that the kunmap_atomic() call takes the result of the kmap_atomic() call,
+not the argument.
+
+If you need to map two pages because you want to copy from one page to
+another you need to keep the kmap_atomic calls strictly nested, like:
+
+ vaddr1 = kmap_atomic(page1);
+ vaddr2 = kmap_atomic(page2);
+
+ memcpy(vaddr1, vaddr2, PAGE_SIZE);
+
+ kunmap_atomic(vaddr2);
+ kunmap_atomic(vaddr1);
+
+
+==========================
+COST OF TEMPORARY MAPPINGS
+==========================
+
+The cost of creating temporary mappings can be quite high. The arch has to
+manipulate the kernel's page tables, the data TLB and/or the MMU's registers.
+
+If CONFIG_HIGHMEM is not set, then the kernel will try and create a mapping
+simply with a bit of arithmetic that will convert the page struct address into
+a pointer to the page contents rather than juggling mappings about. In such a
+case, the unmap operation may be a null operation.
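+
+A sketch of that arithmetic (roughly what lowmem_page_address() does; the
+exact expression varies by kernel version and architecture):
+
+	static inline void *lowmem_page_address(struct page *page)
+	{
+		return __va(page_to_pfn(page) << PAGE_SHIFT);
+	}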
+
+If CONFIG_MMU is not set, then there can be no temporary mappings and no
+highmem. In such a case, the arithmetic approach will also be used.
+
+
+========
+i386 PAE
+========
+
+The i386 arch, under some circumstances, will permit you to stick up to 64GiB
+of RAM into your 32-bit machine. This has a number of consequences:
+
+ (*) Linux needs a page-frame structure for each page in the system and the
+ pageframes need to live in the permanent mapping, which means:
+
+ (*) you can have 896M/sizeof(struct page) page-frames at most; with struct
+ page being 32-bytes that would end up being something in the order of 112G
+ worth of pages; the kernel, however, needs to store more than just
+ page-frames in that memory...
+
+ (*) PAE makes your page tables larger - which slows the system down as more
+ data has to be accessed when traversing the page tables in TLB fills and
+ the like. One
+ advantage is that PAE has more PTE bits and can provide advanced features
+ like NX and PAT.
+
+The general recommendation is that you don't use more than 8GiB on a 32-bit
+machine - although more might work for you and your workload, you're pretty
+much on your own - don't expect kernel developers to really care much if things
+come apart.
diff --git a/Documentation/vm/hugepage-mmap.c b/Documentation/vm/hugepage-mmap.c
new file mode 100644
index 000000000000..db0dd9a33d54
--- /dev/null
+++ b/Documentation/vm/hugepage-mmap.c
@@ -0,0 +1,91 @@
+/*
+ * hugepage-mmap:
+ *
+ * Example of using huge page memory in a user application using the mmap
+ * system call. Before running this application, make sure that the
+ * administrator has mounted the hugetlbfs filesystem (on some directory
+ * like /mnt) using the command mount -t hugetlbfs nodev /mnt. In this
+ * example, the app is requesting memory of size 256MB that is backed by
+ * huge pages.
+ *
+ * For the ia64 architecture, the Linux kernel reserves Region number 4 for
+ * huge pages. That means that if one requires a fixed address, a huge page
+ * aligned address starting with 0x800000... will be required. If a fixed
+ * address is not required, the kernel will select an address in the proper
+ * range.
+ * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
+ */
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <fcntl.h>
+
+#define FILE_NAME "/mnt/hugepagefile"
+#define LENGTH (256UL*1024*1024)
+#define PROTECTION (PROT_READ | PROT_WRITE)
+
+/* Only ia64 requires this */
+#ifdef __ia64__
+#define ADDR (void *)(0x8000000000000000UL)
+#define FLAGS (MAP_SHARED | MAP_FIXED)
+#else
+#define ADDR (void *)(0x0UL)
+#define FLAGS (MAP_SHARED)
+#endif
+
+static void check_bytes(char *addr)
+{
+ printf("First hex is %x\n", *((unsigned int *)addr));
+}
+
+static void write_bytes(char *addr)
+{
+ unsigned long i;
+
+ for (i = 0; i < LENGTH; i++)
+ *(addr + i) = (char)i;
+}
+
+static void read_bytes(char *addr)
+{
+ unsigned long i;
+
+ check_bytes(addr);
+ for (i = 0; i < LENGTH; i++)
+ if (*(addr + i) != (char)i) {
+ printf("Mismatch at %lu\n", i);
+ break;
+ }
+}
+
+int main(void)
+{
+ void *addr;
+ int fd;
+
+ fd = open(FILE_NAME, O_CREAT | O_RDWR, 0755);
+ if (fd < 0) {
+ perror("Open failed");
+ exit(1);
+ }
+
+ addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, fd, 0);
+ if (addr == MAP_FAILED) {
+ perror("mmap");
+ unlink(FILE_NAME);
+ exit(1);
+ }
+
+ printf("Returned address is %p\n", addr);
+ check_bytes(addr);
+ write_bytes(addr);
+ read_bytes(addr);
+
+ munmap(addr, LENGTH);
+ close(fd);
+ unlink(FILE_NAME);
+
+ return 0;
+}
diff --git a/Documentation/vm/hugepage-shm.c b/Documentation/vm/hugepage-shm.c
new file mode 100644
index 000000000000..07956d8592c9
--- /dev/null
+++ b/Documentation/vm/hugepage-shm.c
@@ -0,0 +1,98 @@
+/*
+ * hugepage-shm:
+ *
+ * Example of using huge page memory in a user application using Sys V shared
+ * memory system calls. In this example the app is requesting 256MB of
+ * memory that is backed by huge pages. The application uses the flag
+ * SHM_HUGETLB in the shmget system call to inform the kernel that it is
+ * requesting huge pages.
+ *
+ * For the ia64 architecture, the Linux kernel reserves Region number 4 for
+ * huge pages. That means that if one requires a fixed address, a huge page
+ * aligned address starting with 0x800000... will be required. If a fixed
+ * address is not required, the kernel will select an address in the proper
+ * range.
+ * Other architectures, such as ppc64, i386 or x86_64 are not so constrained.
+ *
+ * Note: The default shared memory limit is quite low on many kernels,
+ * you may need to increase it via:
+ *
+ * echo 268435456 > /proc/sys/kernel/shmmax
+ *
+ * This will increase the maximum size per shared memory segment to 256MB.
+ * The other limit that you will hit eventually is shmall which is the
+ * total amount of shared memory in pages. To set it to 16GB on a system
+ * with a 4kB pagesize do:
+ *
+ * echo 4194304 > /proc/sys/kernel/shmall
+ */
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/ipc.h>
+#include <sys/shm.h>
+#include <sys/mman.h>
+
+#ifndef SHM_HUGETLB
+#define SHM_HUGETLB 04000
+#endif
+
+#define LENGTH (256UL*1024*1024)
+
+#define dprintf(x) printf(x)
+
+/* Only ia64 requires this */
+#ifdef __ia64__
+#define ADDR (void *)(0x8000000000000000UL)
+#define SHMAT_FLAGS (SHM_RND)
+#else
+#define ADDR (void *)(0x0UL)
+#define SHMAT_FLAGS (0)
+#endif
+
+int main(void)
+{
+ int shmid;
+ unsigned long i;
+ char *shmaddr;
+
+ if ((shmid = shmget(2, LENGTH,
+ SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W)) < 0) {
+ perror("shmget");
+ exit(1);
+ }
+ printf("shmid: 0x%x\n", shmid);
+
+ shmaddr = shmat(shmid, ADDR, SHMAT_FLAGS);
+ if (shmaddr == (char *)-1) {
+ perror("Shared memory attach failure");
+ shmctl(shmid, IPC_RMID, NULL);
+ exit(2);
+ }
+ printf("shmaddr: %p\n", shmaddr);
+
+ dprintf("Starting the writes:\n");
+ for (i = 0; i < LENGTH; i++) {
+ shmaddr[i] = (char)(i);
+ if (!(i % (1024 * 1024)))
+ dprintf(".");
+ }
+ dprintf("\n");
+
+ dprintf("Starting the Check...");
+ for (i = 0; i < LENGTH; i++)
+ if (shmaddr[i] != (char)i)
+ printf("\nIndex %lu mismatched\n", i);
+ dprintf("Done.\n");
+
+ if (shmdt((const void *)shmaddr) != 0) {
+ perror("Detach failure");
+ shmctl(shmid, IPC_RMID, NULL);
+ exit(3);
+ }
+
+ shmctl(shmid, IPC_RMID, NULL);
+
+ return 0;
+}
diff --git a/Documentation/vm/hugetlbpage.txt b/Documentation/vm/hugetlbpage.txt
index 3102b81bef88..f8551b3879f8 100644
--- a/Documentation/vm/hugetlbpage.txt
+++ b/Documentation/vm/hugetlbpage.txt
@@ -11,23 +11,21 @@ This optimization is more critical now as bigger and bigger physical memories
(several GBs) are more readily available.
Users can use the huge page support in Linux kernel by either using the mmap
-system call or standard SYSv shared memory system calls (shmget, shmat).
+system call or standard SYSV shared memory system calls (shmget, shmat).
First the Linux kernel needs to be built with the CONFIG_HUGETLBFS
(present under "File systems") and CONFIG_HUGETLB_PAGE (selected
automatically when CONFIG_HUGETLBFS is selected) configuration
options.
-The kernel built with hugepage support should show the number of configured
-hugepages in the system by running the "cat /proc/meminfo" command.
+The /proc/meminfo file provides information about the total number of
+persistent hugetlb pages in the kernel's huge page pool. It also displays
+information about the number of free, reserved and surplus huge pages and the
+default huge page size. The huge page size is needed for generating the
+proper alignment and size of the arguments to system calls that map huge page
+regions.
-/proc/meminfo also provides information about the total number of hugetlb
-pages configured in the kernel. It also displays information about the
-number of free hugetlb pages at any time. It also displays information about
-the configured hugepage size - this is needed for generating the proper
-alignment and size of the arguments to the above system calls.
-
-The output of "cat /proc/meminfo" will have lines like:
+The output of "cat /proc/meminfo" will include lines like:
.....
HugePages_Total: vvv
@@ -37,65 +35,231 @@ HugePages_Surp: yyy
Hugepagesize: zzz kB
where:
-HugePages_Total is the size of the pool of hugepages.
-HugePages_Free is the number of hugepages in the pool that are not yet
-allocated.
-HugePages_Rsvd is short for "reserved," and is the number of hugepages
-for which a commitment to allocate from the pool has been made, but no
-allocation has yet been made. It's vaguely analogous to overcommit.
-HugePages_Surp is short for "surplus," and is the number of hugepages in
-the pool above the value in /proc/sys/vm/nr_hugepages. The maximum
-number of surplus hugepages is controlled by
-/proc/sys/vm/nr_overcommit_hugepages.
+HugePages_Total is the size of the pool of huge pages.
+HugePages_Free is the number of huge pages in the pool that are not yet
+ allocated.
+HugePages_Rsvd is short for "reserved," and is the number of huge pages for
+ which a commitment to allocate from the pool has been made,
+ but no allocation has yet been made. Reserved huge pages
+ guarantee that an application will be able to allocate a
+ huge page from the pool of huge pages at fault time.
+HugePages_Surp is short for "surplus," and is the number of huge pages in
+ the pool above the value in /proc/sys/vm/nr_hugepages. The
+ maximum number of surplus huge pages is controlled by
+ /proc/sys/vm/nr_overcommit_hugepages.
/proc/filesystems should also show a filesystem of type "hugetlbfs" configured
in the kernel.
-/proc/sys/vm/nr_hugepages indicates the current number of configured hugetlb
-pages in the kernel. Super user can dynamically request more (or free some
-pre-configured) hugepages.
-The allocation (or deallocation) of hugetlb pages is possible only if there are
-enough physically contiguous free pages in system (freeing of hugepages is
-possible only if there are enough hugetlb pages free that can be transferred
-back to regular memory pool).
+/proc/sys/vm/nr_hugepages indicates the current number of "persistent" huge
+pages in the kernel's huge page pool. "Persistent" huge pages will be
+returned to the huge page pool when freed by a task. A user with root
+privileges can dynamically allocate more or free some persistent huge pages
+by increasing or decreasing the value of 'nr_hugepages'.
+
+Pages that are used as huge pages are reserved inside the kernel and cannot
+be used for other purposes. Huge pages cannot be swapped out under
+memory pressure.
+
+Once a number of huge pages have been pre-allocated to the kernel huge page
+pool, a user with appropriate privilege can use either the mmap system call
+or shared memory system calls to use the huge pages. See the discussion of
+Using Huge Pages, below.
+
+The administrator can allocate persistent huge pages on the kernel boot
+command line by specifying the "hugepages=N" parameter, where 'N' = the
+number of huge pages requested. This is the most reliable method of
+allocating huge pages as memory has not yet become fragmented.
+
+Some platforms support multiple huge page sizes. To allocate huge pages
+of a specific size, one must precede the huge pages boot command parameters
+with a huge page size selection parameter "hugepagesz=<size>". <size> must
+be specified in bytes with optional scale suffix [kKmMgG]. The default huge
+page size may be selected with the "default_hugepagesz=<size>" boot parameter.
+
+When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
+indicates the current number of pre-allocated huge pages of the default size.
+Thus, one can use the following command to dynamically allocate/deallocate
+default sized persistent huge pages:
+
+ echo 20 > /proc/sys/vm/nr_hugepages
-Pages that are used as hugetlb pages are reserved inside the kernel and cannot
-be used for other purposes.
+This command will try to adjust the number of default sized huge pages in the
+huge page pool to 20, allocating or freeing huge pages, as required.
+
+On a NUMA platform, the kernel will attempt to distribute the huge page pool
+over the set of allowed nodes specified by the NUMA memory policy of the
+task that modifies nr_hugepages. The default for the allowed nodes--when the
+task has default memory policy--is all on-line nodes with memory. Allowed
+nodes with insufficient available, contiguous memory for a huge page will be
+silently skipped when allocating persistent huge pages. See the discussion
+below of the interaction of task memory policy, cpusets and per node attributes
+with the allocation and freeing of persistent huge pages.
+
+The success or failure of huge page allocation depends on the amount of
+physically contiguous memory that is present in system at the time of the
+allocation attempt. If the kernel is unable to allocate huge pages from
+some nodes in a NUMA system, it will attempt to make up the difference by
+allocating extra pages on other nodes with sufficient available contiguous
+memory, if any.
+
+System administrators may want to put this command in one of the local rc
+init files. This will enable the kernel to allocate huge pages early in
+the boot process when the possibility of getting physical contiguous pages
+is still very high. Administrators can verify the number of huge pages
+actually allocated by checking the sysctl or meminfo. To check the per node
+distribution of huge pages in a NUMA system, use:
+
+ cat /sys/devices/system/node/node*/meminfo | fgrep Huge
+
+/proc/sys/vm/nr_overcommit_hugepages specifies how large the pool of
+huge pages can grow, if more huge pages than /proc/sys/vm/nr_hugepages are
+requested by applications. Writing any non-zero value into this file
+indicates that the hugetlb subsystem is allowed to try to obtain that
+number of "surplus" huge pages from the kernel's normal page pool, when the
+persistent huge page pool is exhausted. As these surplus huge pages become
+unused, they are freed back to the kernel's normal page pool.
+
+When increasing the huge page pool size via nr_hugepages, any existing surplus
+pages will first be promoted to persistent huge pages. Then, additional
+huge pages will be allocated, if necessary and if possible, to fulfill
+the new persistent huge page pool size.
+
+The administrator may shrink the pool of persistent huge pages for
+the default huge page size by setting the nr_hugepages sysctl to a
+smaller value. The kernel will attempt to balance the freeing of huge pages
+across all nodes in the memory policy of the task modifying nr_hugepages.
+Any free huge pages on the selected nodes will be freed back to the kernel's
+normal page pool.
+
+Caveat: Shrinking the persistent huge page pool via nr_hugepages such that
+it becomes less than the number of huge pages in use will convert the balance
+of the in-use huge pages to surplus huge pages. This will occur even if
+the number of surplus pages would exceed the overcommit value. As long as
+this condition holds--that is, until nr_hugepages+nr_overcommit_hugepages is
+increased sufficiently, or the surplus huge pages go out of use and are freed--
+no more surplus huge pages will be allowed to be allocated.
+
+With support for multiple huge page pools at run-time available, much of
+the huge page userspace interface in /proc/sys/vm has been duplicated in sysfs.
+The /proc interfaces discussed above have been retained for backwards
+compatibility. The root huge page control directory in sysfs is:
+
+ /sys/kernel/mm/hugepages
+
+For each huge page size supported by the running kernel, a subdirectory
+will exist, of the form:
+
+ hugepages-${size}kB
+
+Inside each of these directories, the same set of files will exist:
+
+ nr_hugepages
+ nr_hugepages_mempolicy
+ nr_overcommit_hugepages
+ free_hugepages
+ resv_hugepages
+ surplus_hugepages
+
+which function as described above for the default huge page-sized case.
+
+
+Interaction of Task Memory Policy with Huge Page Allocation/Freeing
+
+Whether huge pages are allocated and freed via the /proc interface or
+the /sysfs interface using the nr_hugepages_mempolicy attribute, the NUMA
+nodes from which huge pages are allocated or freed are controlled by the
+NUMA memory policy of the task that modifies the nr_hugepages_mempolicy
+sysctl or attribute. When the nr_hugepages attribute is used, mempolicy
+is ignored.
+
+The recommended method to allocate or free huge pages to/from the kernel
+huge page pool, using the nr_hugepages example above, is:
+
+ numactl --interleave <node-list> echo 20 \
+ >/proc/sys/vm/nr_hugepages_mempolicy
+
+or, more succinctly:
+
+ numactl -m <node-list> echo 20 >/proc/sys/vm/nr_hugepages_mempolicy
+
+This will allocate or free abs(20 - nr_hugepages) to or from the nodes
+specified in <node-list>, depending on whether the number of persistent huge pages
+is initially less than or greater than 20, respectively. No huge pages will be
+allocated nor freed on any node not included in the specified <node-list>.
+
+When adjusting the persistent hugepage count via nr_hugepages_mempolicy, any
+memory policy mode--bind, preferred, local or interleave--may be used. The
+resulting effect on persistent huge page allocation is as follows:
+
+1) Regardless of mempolicy mode [see Documentation/vm/numa_memory_policy.txt],
+ persistent huge pages will be distributed across the node or nodes
+ specified in the mempolicy as if "interleave" had been specified.
+ However, if a node in the policy does not contain sufficient contiguous
+ memory for a huge page, the allocation will not "fallback" to the nearest
+ neighbor node with sufficient contiguous memory. To do this would cause
+ undesirable imbalance in the distribution of the huge page pool, or
+ possibly, allocation of persistent huge pages on nodes not allowed by
+ the task's memory policy.
+
+2) One or more nodes may be specified with the bind or interleave policy.
+ If more than one node is specified with the preferred policy, only the
+ lowest numeric id will be used. Local policy will select the node where
+ the task is running at the time the nodes_allowed mask is constructed.
+ For local policy to be deterministic, the task must be bound to a cpu or
+ cpus in a single node. Otherwise, the task could be migrated to some
+ other node at any time after launch and the resulting node will be
+ indeterminate. Thus, local policy is not very useful for this purpose.
+ Any of the other mempolicy modes may be used to specify a single node.
+
+3) The nodes allowed mask will be derived from any non-default task mempolicy,
+ whether this policy was set explicitly by the task itself or one of its
+ ancestors, such as numactl. This means that if the task is invoked from a
+ shell with non-default policy, that policy will be used. One can specify a
+ node list of "all" with numactl --interleave or --membind [-m] to achieve
+ interleaving over all nodes in the system or cpuset.
-Once the kernel with Hugetlb page support is built and running, a user can
-use either the mmap system call or shared memory system calls to start using
-the huge pages. It is required that the system administrator preallocate
-enough memory for huge page purposes.
+4) Any task mempolicy specified--e.g., using numactl--will be constrained by
+ the resource limits of any cpuset in which the task runs. Thus, there will
+ be no way for a task with non-default policy running in a cpuset with a
+ subset of the system nodes to allocate huge pages outside the cpuset
+ without first moving to a cpuset that contains all of the desired nodes.
-Use the following command to dynamically allocate/deallocate hugepages:
+5) Boot-time huge page allocation attempts to distribute the requested number
+ of huge pages over all on-line nodes with memory.
- echo 20 > /proc/sys/vm/nr_hugepages
+Per Node Hugepages Attributes
+
+A subset of the contents of the root huge page control directory in sysfs,
+described above, will be replicated under the system device of each
+NUMA node with memory in:
+
+ /sys/devices/system/node/node[0-9]*/hugepages/
+
+Under this directory, the subdirectory for each supported huge page size
+contains the following attribute files:
+
+ nr_hugepages
+ free_hugepages
+ surplus_hugepages
+
+The 'free_' and 'surplus_' attribute files are read-only. They return the number
+of free and surplus [overcommitted] huge pages, respectively, on the parent
+node.
+
+The nr_hugepages attribute returns the total number of huge pages on the
+specified node. When this attribute is written, the number of persistent huge
+pages on the parent node will be adjusted to the specified value, if sufficient
+resources exist, regardless of the task's mempolicy or cpuset constraints.
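+
+For example, assuming node 1 exists and supports 2048kB huge pages, the
+following requests 20 persistent huge pages directly on that node:
+
+	echo 20 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages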
+
+Note that the number of overcommit and reserve pages remain global quantities,
+as we don't know until fault time, when the faulting task's mempolicy is
+applied, from which node the huge page allocation will be attempted.
+
+
+Using Huge Pages
-This command will try to configure 20 hugepages in the system. The success
-or failure of allocation depends on the amount of physically contiguous
-memory that is preset in system at this time. System administrators may want
-to put this command in one of the local rc init files. This will enable the
-kernel to request huge pages early in the boot process (when the possibility
-of getting physical contiguous pages is still very high). In either
-case, adminstrators will want to verify the number of hugepages actually
-allocated by checking the sysctl or meminfo.
-
-/proc/sys/vm/nr_overcommit_hugepages indicates how large the pool of
-hugepages can grow, if more hugepages than /proc/sys/vm/nr_hugepages are
-requested by applications. echo'ing any non-zero value into this file
-indicates that the hugetlb subsystem is allowed to try to obtain
-hugepages from the buddy allocator, if the normal pool is exhausted. As
-these surplus hugepages go out of use, they are freed back to the buddy
-allocator.
-
-Caveat: Shrinking the pool via nr_hugepages such that it becomes less
-than the number of hugepages in use will convert the balance to surplus
-huge pages even if it would exceed the overcommit value. As long as
-this condition holds, however, no more surplus huge pages will be
-allowed on the system until one of the two sysctls are increased
-sufficiently, or the surplus huge pages go out of use and are freed.
-
-If the user applications are going to request hugepages using mmap system
+If the user applications are going to request huge pages using mmap system
call, then it is required that system administrator mount a file system of
type hugetlbfs:
@@ -104,7 +268,7 @@ type hugetlbfs:
none /mnt/huge
This command mounts a (pseudo) filesystem of type hugetlbfs on the directory
-/mnt/huge. Any files created on /mnt/huge uses hugepages. The uid and gid
+/mnt/huge. Any files created on /mnt/huge use huge pages. The uid and gid
options sets the owner and group of the root of the file system. By default
the uid and gid of the current process are taken. The mode option sets the
mode of root of file system to value & 0777. This value is given in octal.
@@ -123,194 +287,23 @@ Regular chown, chgrp, and chmod commands (with right permissions) could be
used to change the file attributes on hugetlbfs.
Also, it is important to note that no such mount command is required if the
-applications are going to use only shmat/shmget system calls. Users who
-wish to use hugetlb page via shared memory segment should be a member of
-a supplementary group and system admin needs to configure that gid into
-/proc/sys/vm/hugetlb_shm_group. It is possible for same or different
-applications to use any combination of mmaps and shm* calls, though the
-mount of filesystem will be required for using mmap calls.
+applications are going to use only shmat/shmget system calls or mmap with
+MAP_HUGETLB. Users who wish to use hugetlb page via shared memory segment
+should be a member of a supplementary group and system admin needs to
+configure that gid into /proc/sys/vm/hugetlb_shm_group. It is possible for
+same or different applications to use any combination of mmaps and shm*
+calls, though the mount of filesystem will be required for using mmap calls
+without MAP_HUGETLB. For an example of how to use mmap with MAP_HUGETLB see
+map_hugetlb.c.
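+
+A minimal sketch of the MAP_HUGETLB variant (map_hugetlb.c in this directory
+is the complete program; the 0x40000 fallback is the x86 value and is only
+needed if the libc headers do not define MAP_HUGETLB yet):
+
+	#include <stdio.h>
+	#include <sys/mman.h>
+
+	#ifndef MAP_HUGETLB
+	#define MAP_HUGETLB 0x40000
+	#endif
+
+	#define LENGTH (256UL*1024*1024)
+
+	int main(void)
+	{
+		/* anonymous huge-page-backed memory; no hugetlbfs mount needed */
+		void *addr = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
+				  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
+
+		if (addr == MAP_FAILED) {
+			perror("mmap");
+			return 1;
+		}
+		munmap(addr, LENGTH);
+		return 0;
+	}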
*******************************************************************
/*
- * Example of using hugepage memory in a user application using Sys V shared
- * memory system calls. In this example the app is requesting 256MB of
- * memory that is backed by huge pages. The application uses the flag
- * SHM_HUGETLB in the shmget system call to inform the kernel that it is
- * requesting hugepages.
- *
- * For the ia64 architecture, the Linux kernel reserves Region number 4 for
- * hugepages. That means the addresses starting with 0x800000... will need
- * to be specified. Specifying a fixed address is not required on ppc64,
- * i386 or x86_64.
- *
- * Note: The default shared memory limit is quite low on many kernels,
- * you may need to increase it via:
- *
- * echo 268435456 > /proc/sys/kernel/shmmax
- *
- * This will increase the maximum size per shared memory segment to 256MB.
- * The other limit that you will hit eventually is shmall which is the
- * total amount of shared memory in pages. To set it to 16GB on a system
- * with a 4kB pagesize do:
- *
- * echo 4194304 > /proc/sys/kernel/shmall
+ * hugepage-shm: see Documentation/vm/hugepage-shm.c
*/
-#include <stdlib.h>
-#include <stdio.h>
-#include <sys/types.h>
-#include <sys/ipc.h>
-#include <sys/shm.h>
-#include <sys/mman.h>
-
-#ifndef SHM_HUGETLB
-#define SHM_HUGETLB 04000
-#endif
-
-#define LENGTH (256UL*1024*1024)
-
-#define dprintf(x) printf(x)
-
-/* Only ia64 requires this */
-#ifdef __ia64__
-#define ADDR (void *)(0x8000000000000000UL)
-#define SHMAT_FLAGS (SHM_RND)
-#else
-#define ADDR (void *)(0x0UL)
-#define SHMAT_FLAGS (0)
-#endif
-
-int main(void)
-{
- int shmid;
- unsigned long i;
- char *shmaddr;
-
- if ((shmid = shmget(2, LENGTH,
- SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W)) < 0) {
- perror("shmget");
- exit(1);
- }
- printf("shmid: 0x%x\n", shmid);
-
- shmaddr = shmat(shmid, ADDR, SHMAT_FLAGS);
- if (shmaddr == (char *)-1) {
- perror("Shared memory attach failure");
- shmctl(shmid, IPC_RMID, NULL);
- exit(2);
- }
- printf("shmaddr: %p\n", shmaddr);
-
- dprintf("Starting the writes:\n");
- for (i = 0; i < LENGTH; i++) {
- shmaddr[i] = (char)(i);
- if (!(i % (1024 * 1024)))
- dprintf(".");
- }
- dprintf("\n");
-
- dprintf("Starting the Check...");
- for (i = 0; i < LENGTH; i++)
- if (shmaddr[i] != (char)i)
- printf("\nIndex %lu mismatched\n", i);
- dprintf("Done.\n");
-
- if (shmdt((const void *)shmaddr) != 0) {
- perror("Detach failure");
- shmctl(shmid, IPC_RMID, NULL);
- exit(3);
- }
-
- shmctl(shmid, IPC_RMID, NULL);
-
- return 0;
-}
*******************************************************************
/*
- * Example of using hugepage memory in a user application using the mmap
- * system call. Before running this application, make sure that the
- * administrator has mounted the hugetlbfs filesystem (on some directory
- * like /mnt) using the command mount -t hugetlbfs nodev /mnt. In this
- * example, the app is requesting memory of size 256MB that is backed by
- * huge pages.
- *
- * For ia64 architecture, Linux kernel reserves Region number 4 for hugepages.
- * That means the addresses starting with 0x800000... will need to be
- * specified. Specifying a fixed address is not required on ppc64, i386
- * or x86_64.
+ * hugepage-mmap: see Documentation/vm/hugepage-mmap.c
*/
-#include <stdlib.h>
-#include <stdio.h>
-#include <unistd.h>
-#include <sys/mman.h>
-#include <fcntl.h>
-
-#define FILE_NAME "/mnt/hugepagefile"
-#define LENGTH (256UL*1024*1024)
-#define PROTECTION (PROT_READ | PROT_WRITE)
-
-/* Only ia64 requires this */
-#ifdef __ia64__
-#define ADDR (void *)(0x8000000000000000UL)
-#define FLAGS (MAP_SHARED | MAP_FIXED)
-#else
-#define ADDR (void *)(0x0UL)
-#define FLAGS (MAP_SHARED)
-#endif
-
-void check_bytes(char *addr)
-{
- printf("First hex is %x\n", *((unsigned int *)addr));
-}
-
-void write_bytes(char *addr)
-{
- unsigned long i;
-
- for (i = 0; i < LENGTH; i++)
- *(addr + i) = (char)i;
-}
-
-void read_bytes(char *addr)
-{
- unsigned long i;
-
- check_bytes(addr);
- for (i = 0; i < LENGTH; i++)
- if (*(addr + i) != (char)i) {
- printf("Mismatch at %lu\n", i);
- break;
- }
-}
-
-int main(void)
-{
- void *addr;
- int fd;
-
- fd = open(FILE_NAME, O_CREAT | O_RDWR, 0755);
- if (fd < 0) {
- perror("Open failed");
- exit(1);
- }
-
- addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, fd, 0);
- if (addr == MAP_FAILED) {
- perror("mmap");
- unlink(FILE_NAME);
- exit(1);
- }
-
- printf("Returned address is %p\n", addr);
- check_bytes(addr);
- write_bytes(addr);
- read_bytes(addr);
-
- munmap(addr, LENGTH);
- close(fd);
- unlink(FILE_NAME);
-
- return 0;
-}
diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt
new file mode 100644
index 000000000000..550068466605
--- /dev/null
+++ b/Documentation/vm/hwpoison.txt
@@ -0,0 +1,182 @@
+What is hwpoison?
+
+Upcoming Intel CPUs have support for recovering from some memory errors
+(``MCA recovery''). This requires the OS to declare a page "poisoned",
+kill the processes associated with it and avoid using it in the future.
+
+This patchkit implements the necessary infrastructure in the VM.
+
+To quote the overview comment:
+
+ * High level machine check handler. Handles pages reported by the
+ * hardware as being corrupted usually due to a 2bit ECC memory or cache
+ * failure.
+ *
+ * This focusses on pages detected as corrupted in the background.
+ * When the current CPU tries to consume corruption the currently
+ * running process can just be killed directly instead. This implies
+ * that if the error cannot be handled for some reason it's safe to
+ * just ignore it because no corruption has been consumed yet. Instead
+ * when that happens another machine check will happen.
+ *
+ * Handles page cache pages in various states. The tricky part
+ * here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere,
+ * possibly violating some of their assumptions. This is why this code
+ * has to be extremely careful. Generally it tries to use normal locking
+ * rules, as in get the standard locks, even if that means the
+ * error handling takes potentially a long time.
+ *
+ * Some of the operations here are somewhat inefficient and have non
+ * linear algorithmic complexity, because the data structures have not
+ * been optimized for this case. This is in particular the case
+ * for the mapping from a vma to a process. Since this case is expected
+ * to be rare we hope we can get away with this.
+
+The code consists of the high level handler in mm/memory-failure.c,
+a new page poison bit and various checks in the VM to handle poisoned
+pages.
+
+The main target right now is KVM guests, but it works for all kinds
+of applications. KVM support requires a recent qemu-kvm release.
+
+For KVM use there was a need for a new signal type so that
+KVM can inject the machine check into the guest with the proper
+address. This in theory allows other applications to handle
+memory failures too. The expectation is that nearly all applications
+won't do that, but some very specialized ones might.
+
+---
+
+There are two (actually three) modes memory failure recovery can be in:
+
+vm.memory_failure_recovery sysctl set to zero:
+ All memory failures cause a panic. Do not attempt recovery.
+ (on x86 this can be also affected by the tolerant level of the
+ MCE subsystem)
+
+early kill
+ (can be controlled globally and per process)
+ Send SIGBUS to the application as soon as the error is detected.
+ This allows applications that can process memory errors in a gentle
+ way (e.g. drop the affected object).
+ This is the mode used by KVM qemu.
+
+late kill
+ Send SIGBUS when the application runs into the corrupted page.
+ This is best for memory error unaware applications and is the default.
+ Note some pages are always handled as late kill.
+
+---
+
+User control:
+
+vm.memory_failure_recovery
+ See sysctl.txt
+
+vm.memory_failure_early_kill
+ Enable early kill mode globally
+
+PR_MCE_KILL
+ Set early/late kill mode/revert to system default
+ arg1: PR_MCE_KILL_CLEAR: Revert to system default
+ arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode
+ PR_MCE_KILL_EARLY: Early kill
+ PR_MCE_KILL_LATE: Late kill
+ PR_MCE_KILL_DEFAULT: Use system global default
+PR_MCE_KILL_GET
+ return current mode
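+
+  An illustrative sketch (not from the kernel sources) of a thread opting
+  into early kill mode with prctl(2); it assumes kernel headers recent
+  enough to define the PR_MCE_KILL constants, and the helper name is made
+  up for this example:
+
+    #include <stdio.h>
+    #include <sys/prctl.h>
+
+    /* ask for SIGBUS as soon as a memory error is detected */
+    int enable_early_kill(void)
+    {
+        if (prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0) < 0) {
+            perror("prctl(PR_MCE_KILL)");
+            return -1;
+        }
+        return 0;
+    }
+
+  Passing PR_MCE_KILL_DEFAULT instead, or using PR_MCE_KILL_CLEAR, reverts
+  the thread to the system-wide setting.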
+
+
+---
+
+Testing:
+
+madvise(MADV_HWPOISON, ....)
+ (as root)
+ Poison a page in the process for testing
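+
+  A minimal sketch of such a test (illustrative only; assumes root, a
+  kernel with CONFIG_MEMORY_FAILURE and headers that define MADV_HWPOISON):
+
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <sys/mman.h>
+    #include <unistd.h>
+
+    int main(void)
+    {
+        long pagesize = sysconf(_SC_PAGESIZE);
+        char *p = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
+                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+
+        if (p == MAP_FAILED) {
+            perror("mmap");
+            exit(1);
+        }
+        p[0] = 1;    /* fault the page in first */
+
+        /* mark the page hwpoisoned; handled like a real memory failure */
+        if (madvise(p, pagesize, MADV_HWPOISON) < 0)
+            perror("madvise(MADV_HWPOISON)");
+        return 0;
+    }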
+
+
+hwpoison-inject module through debugfs
+
+/sys/debug/hwpoison/
+
+corrupt-pfn
+
+Inject hwpoison fault at PFN echoed into this file. This does
+some early filtering to avoid corrupting unintended pages in test suites.
+
+unpoison-pfn
+
+Software-unpoison page at PFN echoed into this file. This
+way a page can be reused again.
+This only works for Linux injected failures, not for real
+memory failures.
+
+Note these injection interfaces are not stable and might change between
+kernel versions.
+
+corrupt-filter-dev-major
+corrupt-filter-dev-minor
+
+Only handle memory failures to pages associated with the file system defined
+by block device major/minor. -1U is the wildcard value.
+This should be only used for testing with artificial injection.
+
+corrupt-filter-memcg
+
+Limit injection to pages owned by memgroup. Specified by inode number
+of the memcg.
+
+Example:
+ mkdir /sys/fs/cgroup/mem/hwpoison
+
+ usemem -m 100 -s 1000 &
+ echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks
+
+ memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ')
+ echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg
+
+ page-types -p `pidof init` --hwpoison # shall do nothing
+ page-types -p `pidof usemem` --hwpoison # poison its pages
+
+corrupt-filter-flags-mask
+corrupt-filter-flags-value
+
+When specified, only poison pages if ((page_flags & mask) == value).
+This allows stress testing of many kinds of pages. The page_flags
+are the same as in /proc/kpageflags. The flag bits are defined in
+include/linux/kernel-page-flags.h and documented in
+Documentation/vm/pagemap.txt
+
+Architecture specific MCE injector
+
+x86 has mce-inject, mce-test
+
+Some portable hwpoison test programs are in mce-test; see below.
+
+---
+
+References:
+
+http://halobates.de/mce-lc09-2.pdf
+ Overview presentation from LinuxCon 09
+
+git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
+ Test suite (hwpoison specific portable tests in tsrc)
+
+git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
+ x86 specific injector
+
+
+---
+
+Limitations:
+
+- Not all page types are supported and never will be. Most kernel internal
+  objects cannot be recovered; only LRU pages are handled for now.
+- Right now hugepage support is missing.
+
+---
+Andi Kleen, Oct 2009
+
diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
new file mode 100644
index 000000000000..b392e496f816
--- /dev/null
+++ b/Documentation/vm/ksm.txt
@@ -0,0 +1,82 @@
+How to use the Kernel Samepage Merging feature
+----------------------------------------------
+
+KSM is a memory-saving de-duplication feature, enabled by CONFIG_KSM=y,
+added to the Linux kernel in 2.6.32. See mm/ksm.c for its implementation,
+and http://lwn.net/Articles/306704/ and http://lwn.net/Articles/330589/
+
+The KSM daemon ksmd periodically scans those areas of user memory which
+have been registered with it, looking for pages of identical content which
+can be replaced by a single write-protected page (which is automatically
+copied if a process later wants to update its content).
+
+KSM was originally developed for use with KVM (where it was known as
+Kernel Shared Memory), to fit more virtual machines into physical memory,
+by sharing the data common between them. But it can be useful to any
+application which generates many instances of the same data.
+
+KSM only merges anonymous (private) pages, never pagecache (file) pages.
+KSM's merged pages were originally locked into kernel memory, but can now
+be swapped out just like other user pages (but sharing is broken when they
+are swapped back in: ksmd must rediscover their identity and merge again).
+
+KSM only operates on those areas of address space which an application
+has advised to be likely candidates for merging, by using the madvise(2)
+system call: int madvise(addr, length, MADV_MERGEABLE).
+
+The app may call int madvise(addr, length, MADV_UNMERGEABLE) to cancel
+that advice and restore unshared pages: whereupon KSM unmerges whatever
+it merged in that range. Note: this unmerging call may suddenly require
+more memory than is available - possibly failing with EAGAIN, but more
+probably arousing the Out-Of-Memory killer.
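+
+An illustrative sketch (not from the kernel sources) of registering and
+later unregistering an anonymous region:
+
+    #include <stdio.h>
+    #include <sys/mman.h>
+
+    #define LENGTH (16UL*1024*1024)
+
+    int main(void)
+    {
+        char *p = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
+                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+
+        if (p == MAP_FAILED) {
+            perror("mmap");
+            return 1;
+        }
+
+        /* advise KSM that this range is a merge candidate */
+        if (madvise(p, LENGTH, MADV_MERGEABLE) < 0)
+            perror("madvise(MADV_MERGEABLE)");
+
+        /* ... fill p with largely identical pages and let ksmd scan ... */
+
+        /* cancel the advice; KSM unmerges whatever it merged in the range */
+        if (madvise(p, LENGTH, MADV_UNMERGEABLE) < 0)
+            perror("madvise(MADV_UNMERGEABLE)");
+
+        munmap(p, LENGTH);
+        return 0;
+    }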
+
+If KSM is not configured into the running kernel, madvise MADV_MERGEABLE
+and MADV_UNMERGEABLE simply fail with EINVAL. If the running kernel was
+built with CONFIG_KSM=y, those calls will normally succeed: even if
+the KSM daemon is not currently running, MADV_MERGEABLE still registers
+the range for whenever the KSM daemon is started; even if the range
+cannot contain any pages which KSM could actually merge; even if
+MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
+
+Like other madvise calls, they are intended for use on mapped areas of
+the user address space: they will report ENOMEM if the specified range
+includes unmapped gaps (though working on the intervening mapped areas),
+and might fail with EAGAIN if there is not enough memory for internal structures.
+
+Applications should be considerate in their use of MADV_MERGEABLE,
+restricting its use to areas likely to benefit. KSM's scans may use a lot
+of processing power: some installations will disable KSM for that reason.
+
+The KSM daemon is controlled by sysfs files in /sys/kernel/mm/ksm/,
+readable by all but writable only by root:
+
+pages_to_scan - how many present pages to scan before ksmd goes to sleep
+ e.g. "echo 100 > /sys/kernel/mm/ksm/pages_to_scan"
+ Default: 100 (chosen for demonstration purposes)
+
+sleep_millisecs - how many milliseconds ksmd should sleep before next scan
+ e.g. "echo 20 > /sys/kernel/mm/ksm/sleep_millisecs"
+ Default: 20 (chosen for demonstration purposes)
+
+run - set 0 to stop ksmd from running but keep merged pages,
+ set 1 to run ksmd e.g. "echo 1 > /sys/kernel/mm/ksm/run",
+ set 2 to stop ksmd and unmerge all pages currently merged,
+ but leave mergeable areas registered for next run
+ Default: 0 (must be changed to 1 to activate KSM,
+ except if CONFIG_SYSFS is disabled)
+
+The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/:
+
+pages_shared - how many shared pages are being used
+pages_sharing - how many more sites are sharing them i.e. how much saved
+pages_unshared - how many pages unique but repeatedly checked for merging
+pages_volatile - how many pages changing too fast to be placed in a tree
+full_scans - how many times all mergeable areas have been scanned
+
+A high ratio of pages_sharing to pages_shared indicates good sharing, but
+a high ratio of pages_unshared to pages_sharing indicates wasted effort.
+pages_volatile embraces several different kinds of activity, but a high
+proportion there would also indicate poor use of madvise MADV_MERGEABLE.
+
+Izik Eidus,
+Hugh Dickins, 17 Nov 2009
diff --git a/Documentation/vm/locking b/Documentation/vm/locking
index f366fa956179..f61228bd6395 100644
--- a/Documentation/vm/locking
+++ b/Documentation/vm/locking
@@ -66,7 +66,7 @@ in some cases it is not really needed. Eg, vm_start is modified by
expand_stack(), it is hard to come up with a destructive scenario without
having the vmlist protection in this case.
-The page_table_lock nests with the inode i_mmap_lock and the kmem cache
+The page_table_lock nests with the inode i_mmap_mutex and the kmem cache
c_spinlock spinlocks. This is okay, since the kmem code asks for pages after
dropping c_spinlock. The page_table_lock also nests with pagecache_lock and
pagemap_lru_lock spinlocks, and no code asks for memory with these locks
@@ -80,7 +80,7 @@ Note: PTL can also be used to guarantee that no new clones using the
mm start up ... this is a loose form of stability on mm_users. For
example, it is used in copy_mm to protect against a racing tlb_gather_mmu
single address space optimization, so that the zap_page_range (from
-vmtruncate) does not lose sending ipi's to cloned threads that might
+truncate) does not lose sending ipi's to cloned threads that might
be spawned underneath it and go to user mode to drag in pte's into tlbs.
swap_lock
diff --git a/Documentation/vm/map_hugetlb.c b/Documentation/vm/map_hugetlb.c
new file mode 100644
index 000000000000..eda1a6d3578a
--- /dev/null
+++ b/Documentation/vm/map_hugetlb.c
@@ -0,0 +1,77 @@
+/*
+ * Example of using hugepage memory in a user application using the mmap
+ * system call with MAP_HUGETLB flag. Before running this program make
+ * sure the administrator has allocated enough default sized huge pages
+ * to cover the 256 MB allocation.
+ *
+ * For ia64 architecture, Linux kernel reserves Region number 4 for hugepages.
+ * That means the addresses starting with 0x800000... will need to be
+ * specified. Specifying a fixed address is not required on ppc64, i386
+ * or x86_64.
+ */
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <fcntl.h>
+
+#define LENGTH (256UL*1024*1024)
+#define PROTECTION (PROT_READ | PROT_WRITE)
+
+#ifndef MAP_HUGETLB
+#define MAP_HUGETLB 0x40000 /* arch specific */
+#endif
+
+/* Only ia64 requires this */
+#ifdef __ia64__
+#define ADDR (void *)(0x8000000000000000UL)
+#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_FIXED)
+#else
+#define ADDR (void *)(0x0UL)
+#define FLAGS (MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB)
+#endif
+
+static void check_bytes(char *addr)
+{
+ printf("First hex is %x\n", *((unsigned int *)addr));
+}
+
+static void write_bytes(char *addr)
+{
+ unsigned long i;
+
+ for (i = 0; i < LENGTH; i++)
+ *(addr + i) = (char)i;
+}
+
+static void read_bytes(char *addr)
+{
+ unsigned long i;
+
+ check_bytes(addr);
+ for (i = 0; i < LENGTH; i++)
+ if (*(addr + i) != (char)i) {
+ printf("Mismatch at %lu\n", i);
+ break;
+ }
+}
+
+int main(void)
+{
+ void *addr;
+
+ addr = mmap(ADDR, LENGTH, PROTECTION, FLAGS, 0, 0);
+ if (addr == MAP_FAILED) {
+ perror("mmap");
+ exit(1);
+ }
+
+ printf("Returned address is %p\n", addr);
+ check_bytes(addr);
+ write_bytes(addr);
+ read_bytes(addr);
+
+ munmap(addr, LENGTH);
+
+ return 0;
+}
diff --git a/Documentation/vm/numa b/Documentation/vm/numa
index e93ad9425e2a..a200a386429d 100644
--- a/Documentation/vm/numa
+++ b/Documentation/vm/numa
@@ -1,41 +1,149 @@
Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>
-The intent of this file is to have an uptodate, running commentary
-from different people about NUMA specific code in the Linux vm.
-
-What is NUMA? It is an architecture where the memory access times
-for different regions of memory from a given processor varies
-according to the "distance" of the memory region from the processor.
-Each region of memory to which access times are the same from any
-cpu, is called a node. On such architectures, it is beneficial if
-the kernel tries to minimize inter node communications. Schemes
-for this range from kernel text and read-only data replication
-across nodes, and trying to house all the data structures that
-key components of the kernel need on memory on that node.
-
-Currently, all the numa support is to provide efficient handling
-of widely discontiguous physical memory, so architectures which
-are not NUMA but can have huge holes in the physical address space
-can use the same code. All this code is bracketed by CONFIG_DISCONTIGMEM.
-
-The initial port includes NUMAizing the bootmem allocator code by
-encapsulating all the pieces of information into a bootmem_data_t
-structure. Node specific calls have been added to the allocator.
-In theory, any platform which uses the bootmem allocator should
-be able to put the bootmem and mem_map data structures anywhere
-it deems best.
-
-Each node's page allocation data structures have also been encapsulated
-into a pg_data_t. The bootmem_data_t is just one part of this. To
-make the code look uniform between NUMA and regular UMA platforms,
-UMA platforms have a statically allocated pg_data_t too (contig_page_data).
-For the sake of uniformity, the function num_online_nodes() is also defined
-for all platforms. As we run benchmarks, we might decide to NUMAize
-more variables like low_on_memory, nr_free_pages etc into the pg_data_t.
-
-The NUMA aware page allocation code currently tries to allocate pages
-from different nodes in a round robin manner. This will be changed to
-do concentratic circle search, starting from current node, once the
-NUMA port achieves more maturity. The call alloc_pages_node has been
-added, so that drivers can make the call and not worry about whether
-it is running on a NUMA or UMA platform.
+What is NUMA?
+
+This question can be answered from a couple of perspectives: the
+hardware view and the Linux software view.
+
+From the hardware perspective, a NUMA system is a computer platform that
+comprises multiple components or assemblies each of which may contain 0
+or more CPUs, local memory, and/or IO buses. For brevity and to
+disambiguate the hardware view of these physical components/assemblies
+from the software abstraction thereof, we'll call the components/assemblies
+'cells' in this document.
+
+Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset
+of the system--although some components necessary for a stand-alone SMP system
+may not be populated on any given cell. The cells of the NUMA system are
+connected together with some sort of system interconnect--e.g., a crossbar or
+point-to-point link are common types of NUMA system interconnects. Both of
+these types of interconnects can be aggregated to create NUMA platforms with
+cells at multiple distances from other cells.
+
+For Linux, the NUMA platforms of interest are primarily what is known as Cache
+Coherent NUMA or ccNUMA systems. With ccNUMA systems, all memory is visible
+to and accessible from any CPU attached to any cell and cache coherency
+is handled in hardware by the processor caches and/or the system interconnect.
+
+Memory access time and effective memory bandwidth varies depending on how far
+away the cell containing the CPU or IO bus making the memory access is from the
+cell containing the target memory. For example, access to memory by CPUs
+attached to the same cell will experience faster access times and higher
+bandwidths than accesses to memory on other, remote cells. NUMA platforms
+can have cells at multiple remote distances from any given cell.
+
+Platform vendors don't build NUMA systems just to make software developers'
+lives interesting. Rather, this architecture is a means to provide scalable
+memory bandwidth. However, to achieve scalable memory bandwidth, system and
+application software must arrange for a large majority of the memory references
+[cache misses] to be to "local" memory--memory on the same cell, if any--or
+to the closest cell with memory.
+
+This leads to the Linux software view of a NUMA system:
+
+Linux divides the system's hardware resources into multiple software
+abstractions called "nodes". Linux maps the nodes onto the physical cells
+of the hardware platform, abstracting away some of the details for some
+architectures. As with physical cells, software nodes may contain 0 or more
+CPUs, memory and/or IO buses. And, again, memory accesses to memory on
+"closer" nodes--nodes that map to closer cells--will generally experience
+faster access times and higher effective bandwidth than accesses to more
+remote cells.
+
+For some architectures, such as x86, Linux will "hide" any node representing a
+physical cell that has no memory attached, and reassign any CPUs attached to
+that cell to a node representing a cell that does have memory. Thus, on
+these architectures, one cannot assume that all CPUs that Linux associates with
+a given node will see the same local memory access times and bandwidth.
+
+In addition, for some architectures, again x86 is an example, Linux supports
+the emulation of additional nodes. For NUMA emulation, Linux will carve up
+the existing nodes--or the system memory for non-NUMA platforms--into multiple
+nodes. Each emulated node will manage a fraction of the underlying cells'
+physical memory. NUMA emulation is useful for testing NUMA kernel and
+application features on non-NUMA platforms, and as a sort of memory resource
+management mechanism when used together with cpusets.
+[see Documentation/cgroups/cpusets.txt]
+
+For each node with memory, Linux constructs an independent memory management
+subsystem, complete with its own free page lists, in-use page lists, usage
+statistics and locks to mediate access. In addition, Linux constructs for
+each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE],
+an ordered "zonelist". A zonelist specifies the zones/nodes to visit when a
+selected zone/node cannot satisfy the allocation request. This situation,
+when a zone has no available memory to satisfy a request, is called
+"overflow" or "fallback".
+
+Because some nodes contain multiple zones containing different types of
+memory, Linux must decide whether to order the zonelists such that allocations
+fall back to the same zone type on a different node, or to a different zone
+type on the same node. This is an important consideration because some zones,
+such as DMA or DMA32, represent relatively scarce resources. Linux chooses
+a default zonelist order based on the sizes of the various zone types relative
+to the total memory of the node and the total memory of the system. The
+default zonelist order may be overridden using the numa_zonelist_order kernel
+boot parameter or sysctl. [see Documentation/kernel-parameters.txt and
+Documentation/sysctl/vm.txt]
+
+By default, Linux will attempt to satisfy memory allocation requests from the
+node to which the CPU that executes the request is assigned. Specifically,
+Linux will attempt to allocate from the first node in the appropriate zonelist
+for the node where the request originates. This is called "local allocation."
+If the "local" node cannot satisfy the request, the kernel will examine other
+nodes' zones in the selected zonelist looking for the first zone in the list
+that can satisfy the request.
+
+Local allocation will tend to keep subsequent access to the allocated memory
+"local" to the underlying physical resources and off the system interconnect--
+as long as the task on whose behalf the kernel allocated some memory does not
+later migrate away from that memory. The Linux scheduler is aware of the
+NUMA topology of the platform--embodied in the "scheduling domains" data
+structures [see Documentation/scheduler/sched-domains.txt]--and the scheduler
+attempts to minimize task migration to distant scheduling domains. However,
+the scheduler does not take a task's NUMA footprint into account directly.
+Thus, under sufficient imbalance, tasks can migrate between nodes, remote
+from their initial node and kernel data structures.
+
+System administrators and application designers can restrict a task's migration
+to improve NUMA locality using various CPU affinity command line interfaces,
+such as taskset(1) and numactl(1), and program interfaces such as
+sched_setaffinity(2). Further, one can modify the kernel's default local
+allocation behavior using Linux NUMA memory policy.
+[see Documentation/vm/numa_memory_policy.txt]
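+
+As an illustration (not from the kernel sources), a process can pin itself
+to one CPU with sched_setaffinity(2); CPU 0 is an arbitrary choice here and
+would normally be taken from the target node's cpulist or from libnuma:
+
+    #define _GNU_SOURCE
+    #include <sched.h>
+    #include <stdio.h>
+
+    int main(void)
+    {
+        cpu_set_t set;
+
+        CPU_ZERO(&set);
+        CPU_SET(0, &set);    /* assume CPU 0 is attached to the desired node */
+
+        /* pid 0 means the calling process */
+        if (sched_setaffinity(0, sizeof(set), &set) < 0) {
+            perror("sched_setaffinity");
+            return 1;
+        }
+        return 0;
+    }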
+
+System administrators can restrict the CPUs and nodes' memories that a non-
+privileged user can specify in the scheduling or NUMA commands and functions
+using control groups and CPUsets. [see Documentation/cgroups/cpusets.txt]
+
+On architectures that do not hide memoryless nodes, Linux will include only
+zones [nodes] with memory in the zonelists. This means that for a memoryless
+node the "local memory node"--the node of the first zone in CPU's node's
+zonelist--will not be the node itself. Rather, it will be the node that the
+kernel selected as the nearest node with memory when it built the zonelists.
+So, default, local allocations will succeed with the kernel supplying the
+closest available memory. This is a consequence of the same mechanism that
+allows such allocations to fallback to other nearby nodes when a node that
+does contain memory overflows.
+
+Some kernel allocations do not want or cannot tolerate this allocation fallback
+behavior. Rather they want to be sure they get memory from the specified node
+or get notified that the node has no free memory. This is usually the case when
+a subsystem allocates per CPU memory resources, for example.
+
+A typical model for making such an allocation is to obtain the node id of the
+node to which the "current CPU" is attached using one of the kernel's
+numa_node_id() or CPU_to_node() functions and then request memory from only
+the node id returned. When such an allocation fails, the requesting subsystem
+may revert to its own fallback path. The slab kernel memory allocator is an
+example of this. Or, the subsystem may choose to disable or not to enable
+itself on allocation failure. The kernel profiling subsystem is an example of
+this.
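+
+An illustrative sketch of that model (kernel code, not taken from any
+particular subsystem; the helper name is made up for this example):
+
+    #include <linux/printk.h>
+    #include <linux/slab.h>
+    #include <linux/topology.h>
+
+    static void *alloc_local_buffer(size_t size)
+    {
+        int nid = numa_node_id();    /* node of the executing CPU */
+        /* __GFP_THISNODE: fail rather than fall back to other nodes */
+        void *buf = kmalloc_node(size, GFP_KERNEL | __GFP_THISNODE, nid);
+
+        if (!buf)
+            pr_warn("node %d out of memory, using subsystem fallback\n", nid);
+        return buf;
+    }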
+
+If the architecture supports--does not hide--memoryless nodes, then CPUs
+attached to memoryless nodes would always incur the fallback path overhead
+or some subsystems would fail to initialize if they attempted to allocate
+memory exclusively from a node without memory. To support such
+architectures transparently, kernel subsystems can use the numa_mem_id()
+or cpu_to_mem() function to locate the "local memory node" for the calling or
+specified CPU. Again, this is the same node from which default, local page
+allocations will be attempted.
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index bad16d3f6a47..4e7da6543424 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -8,7 +8,8 @@ The current memory policy support was added to Linux 2.6 around May 2004. This
document attempts to describe the concepts and APIs of the 2.6 memory policy
support.
-Memory policies should not be confused with cpusets (Documentation/cpusets.txt)
+Memory policies should not be confused with cpusets
+(Documentation/cgroups/cpusets.txt)
which is an administrative mechanism for restricting the nodes from which
memory may be allocated by a set of processes. Memory policies are a
programming interface that a NUMA-aware application can take advantage of. When
@@ -44,7 +45,7 @@ most general to most specific:
to establish the task policy for a child task exec()'d from an
executable image that has no awareness of memory policy. See the
MEMORY POLICY APIS section, below, for an overview of the system call
- that a task may use to set/change it's task/process policy.
+ that a task may use to set/change its task/process policy.
In a multi-threaded task, task policies apply only to the thread
[Linux kernel task] that installs the policy and any threads
@@ -58,7 +59,7 @@ most general to most specific:
the policy at the time they were allocated.
VMA Policy: A "VMA" or "Virtual Memory Area" refers to a range of a task's
- virtual adddress space. A task may define a specific policy for a range
+ virtual address space. A task may define a specific policy for a range
of its virtual address space. See the MEMORY POLICIES APIS section,
below, for an overview of the mbind() system call used to set a VMA
policy.
@@ -300,7 +301,7 @@ decrement this reference count, respectively. mpol_put() will only free
the structure back to the mempolicy kmem cache when the reference count
goes to zero.
-When a new memory policy is allocated, it's reference count is initialized
+When a new memory policy is allocated, its reference count is initialized
to '1', representing the reference held by the task that is installing the
new policy. When a pointer to a memory policy structure is stored in another
structure, another reference is added, as the task's reference will be dropped
@@ -353,7 +354,7 @@ follows:
Because of this extra reference counting, and because we must lookup
shared policies in a tree structure under spinlock, shared policies are
- more expensive to use in the page allocation path. This is expecially
+ more expensive to use in the page allocation path. This is especially
true for shared policies on shared memory regions shared by tasks running
on different NUMA nodes. This extra overhead can be avoided by always
falling back to task or system default policy for shared memory regions,
@@ -423,7 +424,7 @@ a command line tool, numactl(8), exists that allows one to:
+ set the shared policy for a shared memory segment via mbind(2)
-The numactl(8) tool is packages with the run-time version of the library
+The numactl(8) tool is packaged with the run-time version of the library
containing the memory policy system call wrappers. Some distributions
package the headers and compile-time libraries in a separate development
package.
diff --git a/Documentation/vm/overcommit-accounting b/Documentation/vm/overcommit-accounting
index 21c7b1f8f32b..706d7ed9d8d2 100644
--- a/Documentation/vm/overcommit-accounting
+++ b/Documentation/vm/overcommit-accounting
@@ -4,7 +4,7 @@ The Linux kernel supports the following overcommit handling modes
address space are refused. Used for a typical system. It
ensures a seriously wild allocation fails while allowing
overcommit to reduce swap usage. root is allowed to
- allocate slighly more memory in this mode. This is the
+ allocate slightly more memory in this mode. This is the
default.
1 - Always overcommit. Appropriate for some scientific
diff --git a/Documentation/vm/page-types.c b/Documentation/vm/page-types.c
new file mode 100644
index 000000000000..7445caa26d05
--- /dev/null
+++ b/Documentation/vm/page-types.c
@@ -0,0 +1,1100 @@
+/*
+ * page-types: Tool for querying page flags
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; version 2.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ * more details.
+ *
+ * You should find a copy of v2 of the GNU General Public License somewhere on
+ * your Linux system; if not, write to the Free Software Foundation, Inc., 59
+ * Temple Place, Suite 330, Boston, MA 02111-1307 USA.
+ *
+ * Copyright (C) 2009 Intel corporation
+ *
+ * Authors: Wu Fengguang <fengguang.wu@intel.com>
+ */
+
+#define _LARGEFILE64_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <stdint.h>
+#include <stdarg.h>
+#include <string.h>
+#include <getopt.h>
+#include <limits.h>
+#include <assert.h>
+#include <sys/types.h>
+#include <sys/errno.h>
+#include <sys/fcntl.h>
+#include <sys/mount.h>
+#include <sys/statfs.h>
+#include "../../include/linux/magic.h"
+
+
+#ifndef MAX_PATH
+# define MAX_PATH 256
+#endif
+
+#ifndef STR
+# define _STR(x) #x
+# define STR(x) _STR(x)
+#endif
+
+/*
+ * pagemap kernel ABI bits
+ */
+
+#define PM_ENTRY_BYTES sizeof(uint64_t)
+#define PM_STATUS_BITS 3
+#define PM_STATUS_OFFSET (64 - PM_STATUS_BITS)
+#define PM_STATUS_MASK (((1LL << PM_STATUS_BITS) - 1) << PM_STATUS_OFFSET)
+#define PM_STATUS(nr) (((nr) << PM_STATUS_OFFSET) & PM_STATUS_MASK)
+#define PM_PSHIFT_BITS 6
+#define PM_PSHIFT_OFFSET (PM_STATUS_OFFSET - PM_PSHIFT_BITS)
+#define PM_PSHIFT_MASK (((1LL << PM_PSHIFT_BITS) - 1) << PM_PSHIFT_OFFSET)
+#define PM_PSHIFT(x) (((u64) (x) << PM_PSHIFT_OFFSET) & PM_PSHIFT_MASK)
+#define PM_PFRAME_MASK ((1LL << PM_PSHIFT_OFFSET) - 1)
+#define PM_PFRAME(x) ((x) & PM_PFRAME_MASK)
+
+#define PM_PRESENT PM_STATUS(4LL)
+#define PM_SWAP PM_STATUS(2LL)
+
+
+/*
+ * kernel page flags
+ */
+
+#define KPF_BYTES 8
+#define PROC_KPAGEFLAGS "/proc/kpageflags"
+
+/* copied from kpageflags_read() */
+#define KPF_LOCKED 0
+#define KPF_ERROR 1
+#define KPF_REFERENCED 2
+#define KPF_UPTODATE 3
+#define KPF_DIRTY 4
+#define KPF_LRU 5
+#define KPF_ACTIVE 6
+#define KPF_SLAB 7
+#define KPF_WRITEBACK 8
+#define KPF_RECLAIM 9
+#define KPF_BUDDY 10
+
+/* [11-20] new additions in 2.6.31 */
+#define KPF_MMAP 11
+#define KPF_ANON 12
+#define KPF_SWAPCACHE 13
+#define KPF_SWAPBACKED 14
+#define KPF_COMPOUND_HEAD 15
+#define KPF_COMPOUND_TAIL 16
+#define KPF_HUGE 17
+#define KPF_UNEVICTABLE 18
+#define KPF_HWPOISON 19
+#define KPF_NOPAGE 20
+#define KPF_KSM 21
+
+/* [32-] kernel hacking assistances */
+#define KPF_RESERVED 32
+#define KPF_MLOCKED 33
+#define KPF_MAPPEDTODISK 34
+#define KPF_PRIVATE 35
+#define KPF_PRIVATE_2 36
+#define KPF_OWNER_PRIVATE 37
+#define KPF_ARCH 38
+#define KPF_UNCACHED 39
+
+/* [48-] take some arbitrary free slots for expanding overloaded flags
+ * not part of kernel API
+ */
+#define KPF_READAHEAD 48
+#define KPF_SLOB_FREE 49
+#define KPF_SLUB_FROZEN 50
+#define KPF_SLUB_DEBUG 51
+
+#define KPF_ALL_BITS ((uint64_t)~0ULL)
+#define KPF_HACKERS_BITS (0xffffULL << 32)
+#define KPF_OVERLOADED_BITS (0xffffULL << 48)
+#define BIT(name) (1ULL << KPF_##name)
+#define BITS_COMPOUND (BIT(COMPOUND_HEAD) | BIT(COMPOUND_TAIL))
+
+static const char *page_flag_names[] = {
+ [KPF_LOCKED] = "L:locked",
+ [KPF_ERROR] = "E:error",
+ [KPF_REFERENCED] = "R:referenced",
+ [KPF_UPTODATE] = "U:uptodate",
+ [KPF_DIRTY] = "D:dirty",
+ [KPF_LRU] = "l:lru",
+ [KPF_ACTIVE] = "A:active",
+ [KPF_SLAB] = "S:slab",
+ [KPF_WRITEBACK] = "W:writeback",
+ [KPF_RECLAIM] = "I:reclaim",
+ [KPF_BUDDY] = "B:buddy",
+
+ [KPF_MMAP] = "M:mmap",
+ [KPF_ANON] = "a:anonymous",
+ [KPF_SWAPCACHE] = "s:swapcache",
+ [KPF_SWAPBACKED] = "b:swapbacked",
+ [KPF_COMPOUND_HEAD] = "H:compound_head",
+ [KPF_COMPOUND_TAIL] = "T:compound_tail",
+ [KPF_HUGE] = "G:huge",
+ [KPF_UNEVICTABLE] = "u:unevictable",
+ [KPF_HWPOISON] = "X:hwpoison",
+ [KPF_NOPAGE] = "n:nopage",
+ [KPF_KSM] = "x:ksm",
+
+ [KPF_RESERVED] = "r:reserved",
+ [KPF_MLOCKED] = "m:mlocked",
+ [KPF_MAPPEDTODISK] = "d:mappedtodisk",
+ [KPF_PRIVATE] = "P:private",
+ [KPF_PRIVATE_2] = "p:private_2",
+ [KPF_OWNER_PRIVATE] = "O:owner_private",
+ [KPF_ARCH] = "h:arch",
+ [KPF_UNCACHED] = "c:uncached",
+
+ [KPF_READAHEAD] = "I:readahead",
+ [KPF_SLOB_FREE] = "P:slob_free",
+ [KPF_SLUB_FROZEN] = "A:slub_frozen",
+ [KPF_SLUB_DEBUG] = "E:slub_debug",
+};
+
+
+static const char *debugfs_known_mountpoints[] = {
+ "/sys/kernel/debug",
+ "/debug",
+ 0,
+};
+
+/*
+ * data structures
+ */
+
+static int opt_raw; /* for kernel developers */
+static int opt_list; /* list pages (in ranges) */
+static int opt_no_summary; /* don't show summary */
+static pid_t opt_pid; /* process to walk */
+
+#define MAX_ADDR_RANGES 1024
+static int nr_addr_ranges;
+static unsigned long opt_offset[MAX_ADDR_RANGES];
+static unsigned long opt_size[MAX_ADDR_RANGES];
+
+#define MAX_VMAS 10240
+static int nr_vmas;
+static unsigned long pg_start[MAX_VMAS];
+static unsigned long pg_end[MAX_VMAS];
+
+#define MAX_BIT_FILTERS 64
+static int nr_bit_filters;
+static uint64_t opt_mask[MAX_BIT_FILTERS];
+static uint64_t opt_bits[MAX_BIT_FILTERS];
+
+static int page_size;
+
+static int pagemap_fd;
+static int kpageflags_fd;
+
+static int opt_hwpoison;
+static int opt_unpoison;
+
+static char hwpoison_debug_fs[MAX_PATH+1];
+static int hwpoison_inject_fd;
+static int hwpoison_forget_fd;
+
+#define HASH_SHIFT 13
+#define HASH_SIZE (1 << HASH_SHIFT)
+#define HASH_MASK (HASH_SIZE - 1)
+#define HASH_KEY(flags) (flags & HASH_MASK)
+
+static unsigned long total_pages;
+static unsigned long nr_pages[HASH_SIZE];
+static uint64_t page_flags[HASH_SIZE];
+
+
+/*
+ * helper functions
+ */
+
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+
+#define min_t(type, x, y) ({ \
+ type __min1 = (x); \
+ type __min2 = (y); \
+ __min1 < __min2 ? __min1 : __min2; })
+
+#define max_t(type, x, y) ({ \
+ type __max1 = (x); \
+ type __max2 = (y); \
+ __max1 > __max2 ? __max1 : __max2; })
+
+static unsigned long pages2mb(unsigned long pages)
+{
+ return (pages * page_size) >> 20;
+}
+
+static void fatal(const char *x, ...)
+{
+ va_list ap;
+
+ va_start(ap, x);
+ vfprintf(stderr, x, ap);
+ va_end(ap);
+ exit(EXIT_FAILURE);
+}
+
+static int checked_open(const char *pathname, int flags)
+{
+ int fd = open(pathname, flags);
+
+ if (fd < 0) {
+ perror(pathname);
+ exit(EXIT_FAILURE);
+ }
+
+ return fd;
+}
+
+/*
+ * pagemap/kpageflags routines
+ */
+
+static unsigned long do_u64_read(int fd, char *name,
+ uint64_t *buf,
+ unsigned long index,
+ unsigned long count)
+{
+ long bytes;
+
+ if (index > ULONG_MAX / 8)
+ fatal("index overflow: %lu\n", index);
+
+ if (lseek(fd, index * 8, SEEK_SET) < 0) {
+ perror(name);
+ exit(EXIT_FAILURE);
+ }
+
+ bytes = read(fd, buf, count * 8);
+ if (bytes < 0) {
+ perror(name);
+ exit(EXIT_FAILURE);
+ }
+ if (bytes % 8)
+ fatal("partial read: %lu bytes\n", bytes);
+
+ return bytes / 8;
+}
+
+static unsigned long kpageflags_read(uint64_t *buf,
+ unsigned long index,
+ unsigned long pages)
+{
+ return do_u64_read(kpageflags_fd, PROC_KPAGEFLAGS, buf, index, pages);
+}
+
+static unsigned long pagemap_read(uint64_t *buf,
+ unsigned long index,
+ unsigned long pages)
+{
+ return do_u64_read(pagemap_fd, "/proc/pid/pagemap", buf, index, pages);
+}
+
+static unsigned long pagemap_pfn(uint64_t val)
+{
+ unsigned long pfn;
+
+ if (val & PM_PRESENT)
+ pfn = PM_PFRAME(val);
+ else
+ pfn = 0;
+
+ return pfn;
+}
+
+
+/*
+ * page flag names
+ */
+
+static char *page_flag_name(uint64_t flags)
+{
+ static char buf[65];
+ int present;
+ int i, j;
+
+ for (i = 0, j = 0; i < ARRAY_SIZE(page_flag_names); i++) {
+ present = (flags >> i) & 1;
+ if (!page_flag_names[i]) {
+ if (present)
+ fatal("unknown flag bit %d\n", i);
+ continue;
+ }
+ buf[j++] = present ? page_flag_names[i][0] : '_';
+ }
+
+ return buf;
+}
+
+static char *page_flag_longname(uint64_t flags)
+{
+ static char buf[1024];
+ int i, n;
+
+ for (i = 0, n = 0; i < ARRAY_SIZE(page_flag_names); i++) {
+ if (!page_flag_names[i])
+ continue;
+ if ((flags >> i) & 1)
+ n += snprintf(buf + n, sizeof(buf) - n, "%s,",
+ page_flag_names[i] + 2);
+ }
+ if (n)
+ n--;
+ buf[n] = '\0';
+
+ return buf;
+}
+
+
+/*
+ * page list and summary
+ */
+
+static void show_page_range(unsigned long voffset,
+ unsigned long offset, uint64_t flags)
+{
+ static uint64_t flags0;
+ static unsigned long voff;
+ static unsigned long index;
+ static unsigned long count;
+
+ if (flags == flags0 && offset == index + count &&
+ (!opt_pid || voffset == voff + count)) {
+ count++;
+ return;
+ }
+
+ if (count) {
+ if (opt_pid)
+ printf("%lx\t", voff);
+ printf("%lx\t%lx\t%s\n",
+ index, count, page_flag_name(flags0));
+ }
+
+ flags0 = flags;
+ index = offset;
+ voff = voffset;
+ count = 1;
+}
+
+static void show_page(unsigned long voffset,
+ unsigned long offset, uint64_t flags)
+{
+ if (opt_pid)
+ printf("%lx\t", voffset);
+ printf("%lx\t%s\n", offset, page_flag_name(flags));
+}
+
+static void show_summary(void)
+{
+ int i;
+
+ printf(" flags\tpage-count MB"
+ " symbolic-flags\t\t\tlong-symbolic-flags\n");
+
+ for (i = 0; i < ARRAY_SIZE(nr_pages); i++) {
+ if (nr_pages[i])
+ printf("0x%016llx\t%10lu %8lu %s\t%s\n",
+ (unsigned long long)page_flags[i],
+ nr_pages[i],
+ pages2mb(nr_pages[i]),
+ page_flag_name(page_flags[i]),
+ page_flag_longname(page_flags[i]));
+ }
+
+ printf(" total\t%10lu %8lu\n",
+ total_pages, pages2mb(total_pages));
+}
+
+
+/*
+ * page flag filters
+ */
+
+static int bit_mask_ok(uint64_t flags)
+{
+ int i;
+
+ for (i = 0; i < nr_bit_filters; i++) {
+ if (opt_bits[i] == KPF_ALL_BITS) {
+ if ((flags & opt_mask[i]) == 0)
+ return 0;
+ } else {
+ if ((flags & opt_mask[i]) != opt_bits[i])
+ return 0;
+ }
+ }
+
+ return 1;
+}
+
+static uint64_t expand_overloaded_flags(uint64_t flags)
+{
+ /* SLOB/SLUB overload several page flags */
+ if (flags & BIT(SLAB)) {
+ if (flags & BIT(PRIVATE))
+ flags ^= BIT(PRIVATE) | BIT(SLOB_FREE);
+ if (flags & BIT(ACTIVE))
+ flags ^= BIT(ACTIVE) | BIT(SLUB_FROZEN);
+ if (flags & BIT(ERROR))
+ flags ^= BIT(ERROR) | BIT(SLUB_DEBUG);
+ }
+
+ /* PG_reclaim is overloaded as PG_readahead in the read path */
+ if ((flags & (BIT(RECLAIM) | BIT(WRITEBACK))) == BIT(RECLAIM))
+ flags ^= BIT(RECLAIM) | BIT(READAHEAD);
+
+ return flags;
+}
+
+static uint64_t well_known_flags(uint64_t flags)
+{
+ /* hide flags intended only for kernel hacker */
+ flags &= ~KPF_HACKERS_BITS;
+
+ /* hide non-hugeTLB compound pages */
+ if ((flags & BITS_COMPOUND) && !(flags & BIT(HUGE)))
+ flags &= ~BITS_COMPOUND;
+
+ return flags;
+}
+
+static uint64_t kpageflags_flags(uint64_t flags)
+{
+ flags = expand_overloaded_flags(flags);
+
+ if (!opt_raw)
+ flags = well_known_flags(flags);
+
+ return flags;
+}
+
+/* verify that a mountpoint is actually a debugfs instance */
+static int debugfs_valid_mountpoint(const char *debugfs)
+{
+ struct statfs st_fs;
+
+ if (statfs(debugfs, &st_fs) < 0)
+ return -ENOENT;
+ else if (st_fs.f_type != (long) DEBUGFS_MAGIC)
+ return -ENOENT;
+
+ return 0;
+}
+
+/* find the path to the mounted debugfs */
+static const char *debugfs_find_mountpoint(void)
+{
+ const char **ptr;
+ char type[100];
+ FILE *fp;
+
+ ptr = debugfs_known_mountpoints;
+ while (*ptr) {
+ if (debugfs_valid_mountpoint(*ptr) == 0) {
+ strcpy(hwpoison_debug_fs, *ptr);
+ return hwpoison_debug_fs;
+ }
+ ptr++;
+ }
+
+ /* give up and parse /proc/mounts */
+ fp = fopen("/proc/mounts", "r");
+ if (fp == NULL)
+ perror("Can't open /proc/mounts for read");
+
+ while (fscanf(fp, "%*s %"
+ STR(MAX_PATH)
+ "s %99s %*s %*d %*d\n",
+ hwpoison_debug_fs, type) == 2) {
+ if (strcmp(type, "debugfs") == 0)
+ break;
+ }
+ fclose(fp);
+
+ if (strcmp(type, "debugfs") != 0)
+ return NULL;
+
+ return hwpoison_debug_fs;
+}
+
+/* mount the debugfs somewhere if it's not mounted */
+
+static void debugfs_mount(void)
+{
+ const char **ptr;
+
+ /* see if it's already mounted */
+ if (debugfs_find_mountpoint())
+ return;
+
+ ptr = debugfs_known_mountpoints;
+ while (*ptr) {
+ if (mount(NULL, *ptr, "debugfs", 0, NULL) == 0) {
+ /* save the mountpoint */
+ strcpy(hwpoison_debug_fs, *ptr);
+ break;
+ }
+ ptr++;
+ }
+
+ if (*ptr == NULL) {
+ perror("mount debugfs");
+ exit(EXIT_FAILURE);
+ }
+}
+
+/*
+ * page actions
+ */
+
+static void prepare_hwpoison_fd(void)
+{
+ char buf[MAX_PATH + 1];
+
+ debugfs_mount();
+
+ if (opt_hwpoison && !hwpoison_inject_fd) {
+ snprintf(buf, MAX_PATH, "%s/hwpoison/corrupt-pfn",
+ hwpoison_debug_fs);
+ hwpoison_inject_fd = checked_open(buf, O_WRONLY);
+ }
+
+ if (opt_unpoison && !hwpoison_forget_fd) {
+ snprintf(buf, MAX_PATH, "%s/hwpoison/unpoison-pfn",
+ hwpoison_debug_fs);
+ hwpoison_forget_fd = checked_open(buf, O_WRONLY);
+ }
+}
+
+static int hwpoison_page(unsigned long offset)
+{
+ char buf[100];
+ int len;
+
+ len = sprintf(buf, "0x%lx\n", offset);
+ len = write(hwpoison_inject_fd, buf, len);
+ if (len < 0) {
+ perror("hwpoison inject");
+ return len;
+ }
+ return 0;
+}
+
+static int unpoison_page(unsigned long offset)
+{
+ char buf[100];
+ int len;
+
+ len = sprintf(buf, "0x%lx\n", offset);
+ len = write(hwpoison_forget_fd, buf, len);
+ if (len < 0) {
+ perror("hwpoison forget");
+ return len;
+ }
+ return 0;
+}
+
+/*
+ * page frame walker
+ */
+
+static int hash_slot(uint64_t flags)
+{
+ int k = HASH_KEY(flags);
+ int i;
+
+ /* Explicitly reserve slot 0 for flags 0: the following logic
+ * cannot distinguish an unoccupied slot from slot (flags==0).
+ */
+ if (flags == 0)
+ return 0;
+
+ /* search through the remaining (HASH_SIZE-1) slots */
+ for (i = 1; i < ARRAY_SIZE(page_flags); i++, k++) {
+ if (!k || k >= ARRAY_SIZE(page_flags))
+ k = 1;
+ if (page_flags[k] == 0) {
+ page_flags[k] = flags;
+ return k;
+ }
+ if (page_flags[k] == flags)
+ return k;
+ }
+
+ fatal("hash table full: bump up HASH_SHIFT?\n");
+ exit(EXIT_FAILURE);
+}
+
+static void add_page(unsigned long voffset,
+ unsigned long offset, uint64_t flags)
+{
+ flags = kpageflags_flags(flags);
+
+ if (!bit_mask_ok(flags))
+ return;
+
+ if (opt_hwpoison)
+ hwpoison_page(offset);
+ if (opt_unpoison)
+ unpoison_page(offset);
+
+ if (opt_list == 1)
+ show_page_range(voffset, offset, flags);
+ else if (opt_list == 2)
+ show_page(voffset, offset, flags);
+
+ nr_pages[hash_slot(flags)]++;
+ total_pages++;
+}
+
+#define KPAGEFLAGS_BATCH (64 << 10) /* 64k pages */
+static void walk_pfn(unsigned long voffset,
+ unsigned long index,
+ unsigned long count)
+{
+ uint64_t buf[KPAGEFLAGS_BATCH];
+ unsigned long batch;
+ long pages;
+ unsigned long i;
+
+ while (count) {
+ batch = min_t(unsigned long, count, KPAGEFLAGS_BATCH);
+ pages = kpageflags_read(buf, index, batch);
+ if (pages == 0)
+ break;
+
+ for (i = 0; i < pages; i++)
+ add_page(voffset + i, index + i, buf[i]);
+
+ index += pages;
+ count -= pages;
+ }
+}
+
+#define PAGEMAP_BATCH (64 << 10)
+static void walk_vma(unsigned long index, unsigned long count)
+{
+ uint64_t buf[PAGEMAP_BATCH];
+ unsigned long batch;
+ unsigned long pages;
+ unsigned long pfn;
+ unsigned long i;
+
+ while (count) {
+ batch = min_t(unsigned long, count, PAGEMAP_BATCH);
+ pages = pagemap_read(buf, index, batch);
+ if (pages == 0)
+ break;
+
+ for (i = 0; i < pages; i++) {
+ pfn = pagemap_pfn(buf[i]);
+ if (pfn)
+ walk_pfn(index + i, pfn, 1);
+ }
+
+ index += pages;
+ count -= pages;
+ }
+}
+
+static void walk_task(unsigned long index, unsigned long count)
+{
+ const unsigned long end = index + count;
+ unsigned long start;
+ int i = 0;
+
+ while (index < end) {
+
+ while (pg_end[i] <= index)
+ if (++i >= nr_vmas)
+ return;
+ if (pg_start[i] >= end)
+ return;
+
+ start = max_t(unsigned long, pg_start[i], index);
+ index = min_t(unsigned long, pg_end[i], end);
+
+ assert(start < index);
+ walk_vma(start, index - start);
+ }
+}
+
+static void add_addr_range(unsigned long offset, unsigned long size)
+{
+ if (nr_addr_ranges >= MAX_ADDR_RANGES)
+ fatal("too many addr ranges\n");
+
+ opt_offset[nr_addr_ranges] = offset;
+ opt_size[nr_addr_ranges] = min_t(unsigned long, size, ULONG_MAX-offset);
+ nr_addr_ranges++;
+}
+
+static void walk_addr_ranges(void)
+{
+ int i;
+
+ kpageflags_fd = checked_open(PROC_KPAGEFLAGS, O_RDONLY);
+
+ if (!nr_addr_ranges)
+ add_addr_range(0, ULONG_MAX);
+
+ for (i = 0; i < nr_addr_ranges; i++)
+ if (!opt_pid)
+ walk_pfn(0, opt_offset[i], opt_size[i]);
+ else
+ walk_task(opt_offset[i], opt_size[i]);
+
+ close(kpageflags_fd);
+}
+
+
+/*
+ * user interface
+ */
+
+static const char *page_flag_type(uint64_t flag)
+{
+ if (flag & KPF_HACKERS_BITS)
+ return "(r)";
+ if (flag & KPF_OVERLOADED_BITS)
+ return "(o)";
+ return " ";
+}
+
+static void usage(void)
+{
+ int i, j;
+
+ printf(
+"page-types [options]\n"
+" -r|--raw Raw mode, for kernel developers\n"
+" -d|--describe flags Describe flags\n"
+" -a|--addr addr-spec Walk a range of pages\n"
+" -b|--bits bits-spec Walk pages with specified bits\n"
+" -p|--pid pid Walk process address space\n"
+#if 0 /* planned features */
+" -f|--file filename Walk file address space\n"
+#endif
+" -l|--list Show page details in ranges\n"
+" -L|--list-each Show page details one by one\n"
+" -N|--no-summary Don't show summary info\n"
+" -X|--hwpoison hwpoison pages\n"
+" -x|--unpoison unpoison pages\n"
+" -h|--help Show this usage message\n"
+"flags:\n"
+" 0x10 bitfield format, e.g.\n"
+" anon bit-name, e.g.\n"
+" 0x10,anon comma-separated list, e.g.\n"
+"addr-spec:\n"
+" N one page at offset N (unit: pages)\n"
+" N+M pages range from N to N+M-1\n"
+" N,M pages range from N to M-1\n"
+" N, pages range from N to end\n"
+" ,M pages range from 0 to M-1\n"
+"bits-spec:\n"
+" bit1,bit2 (flags & (bit1|bit2)) != 0\n"
+" bit1,bit2=bit1 (flags & (bit1|bit2)) == bit1\n"
+" bit1,~bit2 (flags & (bit1|bit2)) == bit1\n"
+" =bit1,bit2 flags == (bit1|bit2)\n"
+"bit-names:\n"
+ );
+
+ for (i = 0, j = 0; i < ARRAY_SIZE(page_flag_names); i++) {
+ if (!page_flag_names[i])
+ continue;
+ printf("%16s%s", page_flag_names[i] + 2,
+ page_flag_type(1ULL << i));
+ if (++j > 3) {
+ j = 0;
+ putchar('\n');
+ }
+ }
+ printf("\n "
+ "(r) raw mode bits (o) overloaded bits\n");
+}
+
+static unsigned long long parse_number(const char *str)
+{
+ unsigned long long n;
+
+ n = strtoll(str, NULL, 0);
+
+ if (n == 0 && str[0] != '0')
+ fatal("invalid name or number: %s\n", str);
+
+ return n;
+}
+
+static void parse_pid(const char *str)
+{
+ FILE *file;
+ char buf[5000];
+
+ opt_pid = parse_number(str);
+
+ sprintf(buf, "/proc/%d/pagemap", opt_pid);
+ pagemap_fd = checked_open(buf, O_RDONLY);
+
+ sprintf(buf, "/proc/%d/maps", opt_pid);
+ file = fopen(buf, "r");
+ if (!file) {
+ perror(buf);
+ exit(EXIT_FAILURE);
+ }
+
+ while (fgets(buf, sizeof(buf), file) != NULL) {
+ unsigned long vm_start;
+ unsigned long vm_end;
+ unsigned long long pgoff;
+ int major, minor;
+ char r, w, x, s;
+ unsigned long ino;
+ int n;
+
+ n = sscanf(buf, "%lx-%lx %c%c%c%c %llx %x:%x %lu",
+ &vm_start,
+ &vm_end,
+ &r, &w, &x, &s,
+ &pgoff,
+ &major, &minor,
+ &ino);
+ if (n < 10) {
+ fprintf(stderr, "unexpected line: %s\n", buf);
+ continue;
+ }
+ pg_start[nr_vmas] = vm_start / page_size;
+ pg_end[nr_vmas] = vm_end / page_size;
+ if (++nr_vmas >= MAX_VMAS) {
+ fprintf(stderr, "too many VMAs\n");
+ break;
+ }
+ }
+ fclose(file);
+}
+
+static void parse_file(const char *name)
+{
+}
+
+static void parse_addr_range(const char *optarg)
+{
+ unsigned long offset;
+ unsigned long size;
+ char *p;
+
+ p = strchr(optarg, ',');
+ if (!p)
+ p = strchr(optarg, '+');
+
+ if (p == optarg) {
+ offset = 0;
+ size = parse_number(p + 1);
+ } else if (p) {
+ offset = parse_number(optarg);
+ if (p[1] == '\0')
+ size = ULONG_MAX;
+ else {
+ size = parse_number(p + 1);
+ if (*p == ',') {
+ if (size < offset)
+ fatal("invalid range: %lu,%lu\n",
+ offset, size);
+ size -= offset;
+ }
+ }
+ } else {
+ offset = parse_number(optarg);
+ size = 1;
+ }
+
+ add_addr_range(offset, size);
+}
+
+static void add_bits_filter(uint64_t mask, uint64_t bits)
+{
+ if (nr_bit_filters >= MAX_BIT_FILTERS)
+ fatal("too many bit filters\n");
+
+ opt_mask[nr_bit_filters] = mask;
+ opt_bits[nr_bit_filters] = bits;
+ nr_bit_filters++;
+}
+
+static uint64_t parse_flag_name(const char *str, int len)
+{
+ int i;
+
+ if (!*str || !len)
+ return 0;
+
+ if (len <= 8 && !strncmp(str, "compound", len))
+ return BITS_COMPOUND;
+
+ for (i = 0; i < ARRAY_SIZE(page_flag_names); i++) {
+ if (!page_flag_names[i])
+ continue;
+ if (!strncmp(str, page_flag_names[i] + 2, len))
+ return 1ULL << i;
+ }
+
+ return parse_number(str);
+}
+
+static uint64_t parse_flag_names(const char *str, int all)
+{
+ const char *p = str;
+ uint64_t flags = 0;
+
+ while (1) {
+ if (*p == ',' || *p == '=' || *p == '\0') {
+ if ((*str != '~') || (*str == '~' && all && *++str))
+ flags |= parse_flag_name(str, p - str);
+ if (*p != ',')
+ break;
+ str = p + 1;
+ }
+ p++;
+ }
+
+ return flags;
+}
+
+static void parse_bits_mask(const char *optarg)
+{
+ uint64_t mask;
+ uint64_t bits;
+ const char *p;
+
+ p = strchr(optarg, '=');
+ if (p == optarg) {
+ mask = KPF_ALL_BITS;
+ bits = parse_flag_names(p + 1, 0);
+ } else if (p) {
+ mask = parse_flag_names(optarg, 0);
+ bits = parse_flag_names(p + 1, 0);
+ } else if (strchr(optarg, '~')) {
+ mask = parse_flag_names(optarg, 1);
+ bits = parse_flag_names(optarg, 0);
+ } else {
+ mask = parse_flag_names(optarg, 0);
+ bits = KPF_ALL_BITS;
+ }
+
+ add_bits_filter(mask, bits);
+}
+
+static void describe_flags(const char *optarg)
+{
+ uint64_t flags = parse_flag_names(optarg, 0);
+
+ printf("0x%016llx\t%s\t%s\n",
+ (unsigned long long)flags,
+ page_flag_name(flags),
+ page_flag_longname(flags));
+}
+
+static const struct option opts[] = {
+ { "raw" , 0, NULL, 'r' },
+ { "pid" , 1, NULL, 'p' },
+ { "file" , 1, NULL, 'f' },
+ { "addr" , 1, NULL, 'a' },
+ { "bits" , 1, NULL, 'b' },
+ { "describe" , 1, NULL, 'd' },
+ { "list" , 0, NULL, 'l' },
+ { "list-each" , 0, NULL, 'L' },
+ { "no-summary", 0, NULL, 'N' },
+ { "hwpoison" , 0, NULL, 'X' },
+ { "unpoison" , 0, NULL, 'x' },
+ { "help" , 0, NULL, 'h' },
+ { NULL , 0, NULL, 0 }
+};
+
+int main(int argc, char *argv[])
+{
+ int c;
+
+ page_size = getpagesize();
+
+ while ((c = getopt_long(argc, argv,
+ "rp:f:a:b:d:lLNXxh", opts, NULL)) != -1) {
+ switch (c) {
+ case 'r':
+ opt_raw = 1;
+ break;
+ case 'p':
+ parse_pid(optarg);
+ break;
+ case 'f':
+ parse_file(optarg);
+ break;
+ case 'a':
+ parse_addr_range(optarg);
+ break;
+ case 'b':
+ parse_bits_mask(optarg);
+ break;
+ case 'd':
+ describe_flags(optarg);
+ exit(0);
+ case 'l':
+ opt_list = 1;
+ break;
+ case 'L':
+ opt_list = 2;
+ break;
+ case 'N':
+ opt_no_summary = 1;
+ break;
+ case 'X':
+ opt_hwpoison = 1;
+ prepare_hwpoison_fd();
+ break;
+ case 'x':
+ opt_unpoison = 1;
+ prepare_hwpoison_fd();
+ break;
+ case 'h':
+ usage();
+ exit(0);
+ default:
+ usage();
+ exit(1);
+ }
+ }
+
+ if (opt_list && opt_pid)
+ printf("voffset\t");
+ if (opt_list == 1)
+ printf("offset\tlen\tflags\n");
+ if (opt_list == 2)
+ printf("offset\tflags\n");
+
+ walk_addr_ranges();
+
+ if (opt_list == 1)
+ show_page_range(0, 0, 0); /* drain the buffer */
+
+ if (opt_no_summary)
+ return 0;
+
+ if (opt_list)
+ printf("\n\n");
+
+ show_summary();
+
+ return 0;
+}
diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration
index 99f89aa10169..6513fe2d90b8 100644
--- a/Documentation/vm/page_migration
+++ b/Documentation/vm/page_migration
@@ -18,10 +18,11 @@ migrate_pages function call takes two sets of nodes and moves pages of a
process that are located on the from nodes to the destination nodes.
Page migration functions are provided by the numactl package by Andi Kleen
(a version later than 0.9.3 is required. Get it from
-ftp://ftp.suse.com/pub/people/ak). numactl provided libnuma which
-provides an interface similar to other numa functionality for page migration.
-cat /proc/<pid>/numa_maps allows an easy review of where the pages of
-a process are located. See also the numa_maps manpage in the numactl package.
+ftp://oss.sgi.com/www/projects/libnuma/download/). numactl provides libnuma
+which provides an interface similar to other numa functionality for page
+migration. cat /proc/<pid>/numa_maps allows an easy review of where the
+pages of a process are located. See also the numa_maps documentation in the
+proc(5) man page.
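+
+An illustrative sketch using libnuma's migrate_pages() wrapper from
+<numaif.h> (link with -lnuma; assumes the system has nodes 0 and 1, and
+the helper name is made up for this example):
+
+    #include <numaif.h>
+    #include <stdio.h>
+
+    /* move all pages of process <pid> from node 0 to node 1 */
+    static int move_to_node1(int pid)
+    {
+        unsigned long from = 1UL << 0;
+        unsigned long to   = 1UL << 1;
+
+        if (migrate_pages(pid, 8 * sizeof(from), &from, &to) < 0) {
+            perror("migrate_pages");
+            return -1;
+        }
+        return 0;
+    }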
Manual migration is useful if for example the scheduler has relocated
a process to a processor on a distant node. A batch scheduler or an
@@ -36,7 +37,8 @@ locations.
Larger installations usually partition the system using cpusets into
sections of nodes. Paul Jackson has equipped cpusets with the ability to
-move pages when a task is moved to another cpuset (See ../cpusets.txt).
+move pages when a task is moved to another cpuset (See
+Documentation/cgroups/cpusets.txt).
Cpusets allows the automation of process locality. If a task is moved to
a new cpuset then also all its pages are moved with it so that the
performance of the process does not sink dramatically. Also the pages
diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
new file mode 100644
index 000000000000..df09b9650a81
--- /dev/null
+++ b/Documentation/vm/pagemap.txt
@@ -0,0 +1,147 @@
+pagemap, from the userspace perspective
+---------------------------------------
+
+pagemap is a new (as of 2.6.25) set of interfaces in the kernel that allow
+userspace programs to examine the page tables and related information by
+reading files in /proc.
+
+There are three components to pagemap:
+
+ * /proc/pid/pagemap. This file lets a userspace process find out which
+ physical frame each virtual page is mapped to. It contains one 64-bit
+ value for each virtual page, containing the following data (from
+ fs/proc/task_mmu.c, above pagemap_read):
+
+ * Bits 0-54 page frame number (PFN) if present
+ * Bits 0-4 swap type if swapped
+ * Bits 5-54 swap offset if swapped
+ * Bits 55-60 page shift (page size = 1<<page shift)
+ * Bit 61 reserved for future use
+ * Bit 62 page swapped
+ * Bit 63 page present
+
+ If the page is not present but in swap, then the PFN contains an
+ encoding of the swap file number and the page's offset into the
+ swap. Unmapped pages return a null PFN. This allows determining
+ precisely which pages are mapped (or in swap) and comparing mapped
+ pages between processes.
+
+ Efficient users of this interface will use /proc/pid/maps to
+ determine which areas of memory are actually mapped and llseek to
+ skip over unmapped regions.
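+
+  As an illustrative sketch (not part of the kernel), the entry for one
+  virtual address of the calling process can be read and decoded using the
+  bit layout above:
+
+    #include <fcntl.h>
+    #include <stdint.h>
+    #include <stdio.h>
+    #include <stdlib.h>
+    #include <unistd.h>
+
+    int main(int argc, char **argv)
+    {
+        long pagesize = sysconf(_SC_PAGESIZE);
+        unsigned long vaddr;
+        uint64_t entry;
+        int fd;
+
+        if (argc < 2)
+            return 1;
+        vaddr = strtoul(argv[1], NULL, 0);
+
+        fd = open("/proc/self/pagemap", O_RDONLY);
+        if (fd < 0)
+            return 1;
+        /* one 64-bit entry per virtual page */
+        if (pread(fd, &entry, sizeof(entry),
+                  (vaddr / pagesize) * sizeof(entry)) != sizeof(entry))
+            return 1;
+
+        if (entry & (1ULL << 63))        /* bit 63: present */
+            printf("pfn 0x%llx\n",
+                   (unsigned long long)(entry & ((1ULL << 55) - 1)));
+        else if (entry & (1ULL << 62))   /* bit 62: swapped */
+            printf("swap type %llu, offset 0x%llx\n",
+                   (unsigned long long)(entry & 0x1f),
+                   (unsigned long long)((entry >> 5) & ((1ULL << 50) - 1)));
+        else
+            printf("not present\n");
+        close(fd);
+        return 0;
+    }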
+
+ * /proc/kpagecount. This file contains a 64-bit count of the number of
+ times each page is mapped, indexed by PFN.
+
+ * /proc/kpageflags. This file contains a 64-bit set of flags for each
+ page, indexed by PFN.
+
+ The flags are (from fs/proc/page.c, above kpageflags_read):
+
+ 0. LOCKED
+ 1. ERROR
+ 2. REFERENCED
+ 3. UPTODATE
+ 4. DIRTY
+ 5. LRU
+ 6. ACTIVE
+ 7. SLAB
+ 8. WRITEBACK
+ 9. RECLAIM
+ 10. BUDDY
+ 11. MMAP
+ 12. ANON
+ 13. SWAPCACHE
+ 14. SWAPBACKED
+ 15. COMPOUND_HEAD
+ 16. COMPOUND_TAIL
+ 17. HUGE
+ 18. UNEVICTABLE
+ 19. HWPOISON
+ 20. NOPAGE
+ 21. KSM
+
+Short descriptions of the page flags:
+
+ 0. LOCKED
+ page is being locked for exclusive access, eg. by undergoing read/write IO
+
+ 7. SLAB
+ page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
+ When a compound page is used, SLUB/SLQB will only set this flag on the head
+ page; SLOB will not flag it at all.
+
+10. BUDDY
+ a free memory block managed by the buddy system allocator
+ The buddy system organizes free memory in blocks of various orders.
+ An order N block has 2^N physically contiguous pages, with the BUDDY flag
+ set for and _only_ for the first page.
+
+15. COMPOUND_HEAD
+16. COMPOUND_TAIL
+ A compound page with order N consists of 2^N physically contiguous pages.
+ A compound page with order 2 takes the form of "HTTT", where H denotes its
+ head page and T denotes its tail page(s). The major consumers of compound
+ pages are hugeTLB pages (Documentation/vm/hugetlbpage.txt), the SLUB etc.
+ memory allocators and various device drivers. However in this interface,
+ only huge/giga pages are made visible to end users.
+17. HUGE
+ this is an integral part of a HugeTLB page
+
+19. HWPOISON
+ hardware detected memory corruption on this page: don't touch the data!
+
+20. NOPAGE
+ no page frame exists at the requested address
+
+21. KSM
+ identical memory pages dynamically shared between one or more processes
+
+ [IO related page flags]
+ 1. ERROR IO error occurred
+ 3. UPTODATE page has up-to-date data
+ ie. for file backed page: (in-memory data revision >= on-disk one)
+ 4. DIRTY page has been written to, hence contains new data
+ ie. for file backed page: (in-memory data revision > on-disk one)
+ 8. WRITEBACK page is being synced to disk
+
+ [LRU related page flags]
+ 5. LRU page is in one of the LRU lists
+ 6. ACTIVE page is in the active LRU list
+18. UNEVICTABLE page is in the unevictable (non-)LRU list
+ It is somehow pinned and not a candidate for LRU page reclaims,
+ eg. ramfs pages, shmctl(SHM_LOCK) and mlock() memory segments
+ 2. REFERENCED page has been referenced since last LRU list enqueue/requeue
+ 9. RECLAIM page will be reclaimed soon after its pageout IO completed
+11. MMAP a memory mapped page
+12. ANON a memory mapped page that is not part of a file
+13. SWAPCACHE page is mapped to swap space, ie. has an associated swap entry
+14. SWAPBACKED page is backed by swap/RAM
+
+The page-types tool in this directory can be used to query the above flags.
+
+Using pagemap to do something useful:
+
+The general procedure for using pagemap to find out about a process' memory
+usage goes like this:
+
+ 1. Read /proc/pid/maps to determine which parts of the memory space are
+ mapped to what.
+ 2. Select the maps you are interested in -- all of them, or a particular
+ library, or the stack or the heap, etc.
+ 3. Open /proc/pid/pagemap and seek to the pages you would like to examine.
+ 4. Read a u64 for each page from pagemap.
+ 5. Open /proc/kpagecount and/or /proc/kpageflags. For each PFN you just
+ read, seek to that entry in the file, and read the data you want.
+
+For example, to find the "unique set size" (USS), which is the amount of
+memory that a process is using that is not shared with any other process,
+you can go through every map in the process, find the PFNs, look those up
+in kpagecount, and tally up the number of pages that are only referenced
+once.
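+
+A rough, illustrative sketch of that procedure in C follows. It is not part
+of the kernel tree; it assumes a 4k page size, omits most error handling,
+and typically needs to run as root to read another process' pagemap and
+/proc/kpagecount:
+
+  /* uss.c <pid> - tally pages mapped exactly once, i.e. the USS */
+  #include <stdio.h>
+  #include <stdint.h>
+  #include <fcntl.h>
+  #include <unistd.h>
+
+  #define PAGE_SIZE 4096ULL
+  #define PM_PFN_MASK ((1ULL << 55) - 1)  /* bits 0-54: page frame number */
+  #define PM_PRESENT (1ULL << 63)         /* bit 63: page present */
+
+  int main(int argc, char **argv)
+  {
+          char path[64], line[256];
+          unsigned long long start, end, vaddr, uss = 0;
+          uint64_t entry, mapcount;
+          FILE *maps;
+          int pagemap, kpagecount;
+
+          if (argc < 2)
+                  return 1;
+          snprintf(path, sizeof(path), "/proc/%s/maps", argv[1]);
+          maps = fopen(path, "r");
+          snprintf(path, sizeof(path), "/proc/%s/pagemap", argv[1]);
+          pagemap = open(path, O_RDONLY);
+          kpagecount = open("/proc/kpagecount", O_RDONLY);
+
+          while (fgets(line, sizeof(line), maps)) {
+                  if (sscanf(line, "%llx-%llx", &start, &end) != 2)
+                          continue;
+                  for (vaddr = start; vaddr < end; vaddr += PAGE_SIZE) {
+                          /* one 64-bit pagemap entry per virtual page */
+                          if (pread(pagemap, &entry, 8,
+                                    vaddr / PAGE_SIZE * 8) != 8)
+                                  continue;
+                          if (!(entry & PM_PRESENT))
+                                  continue;
+                          /* one 64-bit mapcount per PFN in kpagecount */
+                          if (pread(kpagecount, &mapcount, 8,
+                                    (entry & PM_PFN_MASK) * 8) != 8)
+                                  continue;
+                          if (mapcount == 1)
+                                  uss++;
+                  }
+          }
+          printf("USS: %llu kB\n", uss * PAGE_SIZE / 1024);
+          return 0;
+  }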
+
+Other notes:
+
+Reading from any of the files will return -EINVAL if you are not starting
+the read on an 8-byte boundary (e.g., if you seeked an odd number of bytes
+into the file), or if the size of the read is not a multiple of 8 bytes.
diff --git a/Documentation/vm/slabinfo.c b/Documentation/vm/slabinfo.c
deleted file mode 100644
index e4230ed16ee7..000000000000
--- a/Documentation/vm/slabinfo.c
+++ /dev/null
@@ -1,1364 +0,0 @@
-/*
- * Slabinfo: Tool to get reports about slabs
- *
- * (C) 2007 sgi, Christoph Lameter <clameter@sgi.com>
- *
- * Compile by:
- *
- * gcc -o slabinfo slabinfo.c
- */
-#include <stdio.h>
-#include <stdlib.h>
-#include <sys/types.h>
-#include <dirent.h>
-#include <strings.h>
-#include <string.h>
-#include <unistd.h>
-#include <stdarg.h>
-#include <getopt.h>
-#include <regex.h>
-#include <errno.h>
-
-#define MAX_SLABS 500
-#define MAX_ALIASES 500
-#define MAX_NODES 1024
-
-struct slabinfo {
- char *name;
- int alias;
- int refs;
- int aliases, align, cache_dma, cpu_slabs, destroy_by_rcu;
- int hwcache_align, object_size, objs_per_slab;
- int sanity_checks, slab_size, store_user, trace;
- int order, poison, reclaim_account, red_zone;
- unsigned long partial, objects, slabs, objects_partial, objects_total;
- unsigned long alloc_fastpath, alloc_slowpath;
- unsigned long free_fastpath, free_slowpath;
- unsigned long free_frozen, free_add_partial, free_remove_partial;
- unsigned long alloc_from_partial, alloc_slab, free_slab, alloc_refill;
- unsigned long cpuslab_flush, deactivate_full, deactivate_empty;
- unsigned long deactivate_to_head, deactivate_to_tail;
- unsigned long deactivate_remote_frees, order_fallback;
- int numa[MAX_NODES];
- int numa_partial[MAX_NODES];
-} slabinfo[MAX_SLABS];
-
-struct aliasinfo {
- char *name;
- char *ref;
- struct slabinfo *slab;
-} aliasinfo[MAX_ALIASES];
-
-int slabs = 0;
-int actual_slabs = 0;
-int aliases = 0;
-int alias_targets = 0;
-int highest_node = 0;
-
-char buffer[4096];
-
-int show_empty = 0;
-int show_report = 0;
-int show_alias = 0;
-int show_slab = 0;
-int skip_zero = 1;
-int show_numa = 0;
-int show_track = 0;
-int show_first_alias = 0;
-int validate = 0;
-int shrink = 0;
-int show_inverted = 0;
-int show_single_ref = 0;
-int show_totals = 0;
-int sort_size = 0;
-int sort_active = 0;
-int set_debug = 0;
-int show_ops = 0;
-int show_activity = 0;
-
-/* Debug options */
-int sanity = 0;
-int redzone = 0;
-int poison = 0;
-int tracking = 0;
-int tracing = 0;
-
-int page_size;
-
-regex_t pattern;
-
-void fatal(const char *x, ...)
-{
- va_list ap;
-
- va_start(ap, x);
- vfprintf(stderr, x, ap);
- va_end(ap);
- exit(EXIT_FAILURE);
-}
-
-void usage(void)
-{
- printf("slabinfo 5/7/2007. (c) 2007 sgi. clameter@sgi.com\n\n"
- "slabinfo [-ahnpvtsz] [-d debugopts] [slab-regexp]\n"
- "-a|--aliases Show aliases\n"
- "-A|--activity Most active slabs first\n"
- "-d<options>|--debug=<options> Set/Clear Debug options\n"
- "-D|--display-active Switch line format to activity\n"
- "-e|--empty Show empty slabs\n"
- "-f|--first-alias Show first alias\n"
- "-h|--help Show usage information\n"
- "-i|--inverted Inverted list\n"
- "-l|--slabs Show slabs\n"
- "-n|--numa Show NUMA information\n"
- "-o|--ops Show kmem_cache_ops\n"
- "-s|--shrink Shrink slabs\n"
- "-r|--report Detailed report on single slabs\n"
- "-S|--Size Sort by size\n"
- "-t|--tracking Show alloc/free information\n"
- "-T|--Totals Show summary information\n"
- "-v|--validate Validate slabs\n"
- "-z|--zero Include empty slabs\n"
- "-1|--1ref Single reference\n"
- "\nValid debug options (FZPUT may be combined)\n"
- "a / A Switch on all debug options (=FZUP)\n"
- "- Switch off all debug options\n"
- "f / F Sanity Checks (SLAB_DEBUG_FREE)\n"
- "z / Z Redzoning\n"
- "p / P Poisoning\n"
- "u / U Tracking\n"
- "t / T Tracing\n"
- );
-}
-
-unsigned long read_obj(const char *name)
-{
- FILE *f = fopen(name, "r");
-
- if (!f)
- buffer[0] = 0;
- else {
- if (!fgets(buffer, sizeof(buffer), f))
- buffer[0] = 0;
- fclose(f);
- if (buffer[strlen(buffer)] == '\n')
- buffer[strlen(buffer)] = 0;
- }
- return strlen(buffer);
-}
-
-
-/*
- * Get the contents of an attribute
- */
-unsigned long get_obj(const char *name)
-{
- if (!read_obj(name))
- return 0;
-
- return atol(buffer);
-}
-
-unsigned long get_obj_and_str(const char *name, char **x)
-{
- unsigned long result = 0;
- char *p;
-
- *x = NULL;
-
- if (!read_obj(name)) {
- x = NULL;
- return 0;
- }
- result = strtoul(buffer, &p, 10);
- while (*p == ' ')
- p++;
- if (*p)
- *x = strdup(p);
- return result;
-}
-
-void set_obj(struct slabinfo *s, const char *name, int n)
-{
- char x[100];
- FILE *f;
-
- snprintf(x, 100, "%s/%s", s->name, name);
- f = fopen(x, "w");
- if (!f)
- fatal("Cannot write to %s\n", x);
-
- fprintf(f, "%d\n", n);
- fclose(f);
-}
-
-unsigned long read_slab_obj(struct slabinfo *s, const char *name)
-{
- char x[100];
- FILE *f;
- size_t l;
-
- snprintf(x, 100, "%s/%s", s->name, name);
- f = fopen(x, "r");
- if (!f) {
- buffer[0] = 0;
- l = 0;
- } else {
- l = fread(buffer, 1, sizeof(buffer), f);
- buffer[l] = 0;
- fclose(f);
- }
- return l;
-}
-
-
-/*
- * Put a size string together
- */
-int store_size(char *buffer, unsigned long value)
-{
- unsigned long divisor = 1;
- char trailer = 0;
- int n;
-
- if (value > 1000000000UL) {
- divisor = 100000000UL;
- trailer = 'G';
- } else if (value > 1000000UL) {
- divisor = 100000UL;
- trailer = 'M';
- } else if (value > 1000UL) {
- divisor = 100;
- trailer = 'K';
- }
-
- value /= divisor;
- n = sprintf(buffer, "%ld",value);
- if (trailer) {
- buffer[n] = trailer;
- n++;
- buffer[n] = 0;
- }
- if (divisor != 1) {
- memmove(buffer + n - 2, buffer + n - 3, 4);
- buffer[n-2] = '.';
- n++;
- }
- return n;
-}
-
-void decode_numa_list(int *numa, char *t)
-{
- int node;
- int nr;
-
- memset(numa, 0, MAX_NODES * sizeof(int));
-
- if (!t)
- return;
-
- while (*t == 'N') {
- t++;
- node = strtoul(t, &t, 10);
- if (*t == '=') {
- t++;
- nr = strtoul(t, &t, 10);
- numa[node] = nr;
- if (node > highest_node)
- highest_node = node;
- }
- while (*t == ' ')
- t++;
- }
-}
-
-void slab_validate(struct slabinfo *s)
-{
- if (strcmp(s->name, "*") == 0)
- return;
-
- set_obj(s, "validate", 1);
-}
-
-void slab_shrink(struct slabinfo *s)
-{
- if (strcmp(s->name, "*") == 0)
- return;
-
- set_obj(s, "shrink", 1);
-}
-
-int line = 0;
-
-void first_line(void)
-{
- if (show_activity)
- printf("Name Objects Alloc Free %%Fast Fallb O\n");
- else
- printf("Name Objects Objsize Space "
- "Slabs/Part/Cpu O/S O %%Fr %%Ef Flg\n");
-}
-
-/*
- * Find the shortest alias of a slab
- */
-struct aliasinfo *find_one_alias(struct slabinfo *find)
-{
- struct aliasinfo *a;
- struct aliasinfo *best = NULL;
-
- for(a = aliasinfo;a < aliasinfo + aliases; a++) {
- if (a->slab == find &&
- (!best || strlen(best->name) < strlen(a->name))) {
- best = a;
- if (strncmp(a->name,"kmall", 5) == 0)
- return best;
- }
- }
- return best;
-}
-
-unsigned long slab_size(struct slabinfo *s)
-{
- return s->slabs * (page_size << s->order);
-}
-
-unsigned long slab_activity(struct slabinfo *s)
-{
- return s->alloc_fastpath + s->free_fastpath +
- s->alloc_slowpath + s->free_slowpath;
-}
-
-void slab_numa(struct slabinfo *s, int mode)
-{
- int node;
-
- if (strcmp(s->name, "*") == 0)
- return;
-
- if (!highest_node) {
- printf("\n%s: No NUMA information available.\n", s->name);
- return;
- }
-
- if (skip_zero && !s->slabs)
- return;
-
- if (!line) {
- printf("\n%-21s:", mode ? "NUMA nodes" : "Slab");
- for(node = 0; node <= highest_node; node++)
- printf(" %4d", node);
- printf("\n----------------------");
- for(node = 0; node <= highest_node; node++)
- printf("-----");
- printf("\n");
- }
- printf("%-21s ", mode ? "All slabs" : s->name);
- for(node = 0; node <= highest_node; node++) {
- char b[20];
-
- store_size(b, s->numa[node]);
- printf(" %4s", b);
- }
- printf("\n");
- if (mode) {
- printf("%-21s ", "Partial slabs");
- for(node = 0; node <= highest_node; node++) {
- char b[20];
-
- store_size(b, s->numa_partial[node]);
- printf(" %4s", b);
- }
- printf("\n");
- }
- line++;
-}
-
-void show_tracking(struct slabinfo *s)
-{
- printf("\n%s: Kernel object allocation\n", s->name);
- printf("-----------------------------------------------------------------------\n");
- if (read_slab_obj(s, "alloc_calls"))
- printf(buffer);
- else
- printf("No Data\n");
-
- printf("\n%s: Kernel object freeing\n", s->name);
- printf("------------------------------------------------------------------------\n");
- if (read_slab_obj(s, "free_calls"))
- printf(buffer);
- else
- printf("No Data\n");
-
-}
-
-void ops(struct slabinfo *s)
-{
- if (strcmp(s->name, "*") == 0)
- return;
-
- if (read_slab_obj(s, "ops")) {
- printf("\n%s: kmem_cache operations\n", s->name);
- printf("--------------------------------------------\n");
- printf(buffer);
- } else
- printf("\n%s has no kmem_cache operations\n", s->name);
-}
-
-const char *onoff(int x)
-{
- if (x)
- return "On ";
- return "Off";
-}
-
-void slab_stats(struct slabinfo *s)
-{
- unsigned long total_alloc;
- unsigned long total_free;
- unsigned long total;
-
- if (!s->alloc_slab)
- return;
-
- total_alloc = s->alloc_fastpath + s->alloc_slowpath;
- total_free = s->free_fastpath + s->free_slowpath;
-
- if (!total_alloc)
- return;
-
- printf("\n");
- printf("Slab Perf Counter Alloc Free %%Al %%Fr\n");
- printf("--------------------------------------------------\n");
- printf("Fastpath %8lu %8lu %3lu %3lu\n",
- s->alloc_fastpath, s->free_fastpath,
- s->alloc_fastpath * 100 / total_alloc,
- s->free_fastpath * 100 / total_free);
- printf("Slowpath %8lu %8lu %3lu %3lu\n",
- total_alloc - s->alloc_fastpath, s->free_slowpath,
- (total_alloc - s->alloc_fastpath) * 100 / total_alloc,
- s->free_slowpath * 100 / total_free);
- printf("Page Alloc %8lu %8lu %3lu %3lu\n",
- s->alloc_slab, s->free_slab,
- s->alloc_slab * 100 / total_alloc,
- s->free_slab * 100 / total_free);
- printf("Add partial %8lu %8lu %3lu %3lu\n",
- s->deactivate_to_head + s->deactivate_to_tail,
- s->free_add_partial,
- (s->deactivate_to_head + s->deactivate_to_tail) * 100 / total_alloc,
- s->free_add_partial * 100 / total_free);
- printf("Remove partial %8lu %8lu %3lu %3lu\n",
- s->alloc_from_partial, s->free_remove_partial,
- s->alloc_from_partial * 100 / total_alloc,
- s->free_remove_partial * 100 / total_free);
-
- printf("RemoteObj/SlabFrozen %8lu %8lu %3lu %3lu\n",
- s->deactivate_remote_frees, s->free_frozen,
- s->deactivate_remote_frees * 100 / total_alloc,
- s->free_frozen * 100 / total_free);
-
- printf("Total %8lu %8lu\n\n", total_alloc, total_free);
-
- if (s->cpuslab_flush)
- printf("Flushes %8lu\n", s->cpuslab_flush);
-
- if (s->alloc_refill)
- printf("Refill %8lu\n", s->alloc_refill);
-
- total = s->deactivate_full + s->deactivate_empty +
- s->deactivate_to_head + s->deactivate_to_tail;
-
- if (total)
- printf("Deactivate Full=%lu(%lu%%) Empty=%lu(%lu%%) "
- "ToHead=%lu(%lu%%) ToTail=%lu(%lu%%)\n",
- s->deactivate_full, (s->deactivate_full * 100) / total,
- s->deactivate_empty, (s->deactivate_empty * 100) / total,
- s->deactivate_to_head, (s->deactivate_to_head * 100) / total,
- s->deactivate_to_tail, (s->deactivate_to_tail * 100) / total);
-}
-
-void report(struct slabinfo *s)
-{
- if (strcmp(s->name, "*") == 0)
- return;
-
- printf("\nSlabcache: %-20s Aliases: %2d Order : %2d Objects: %lu\n",
- s->name, s->aliases, s->order, s->objects);
- if (s->hwcache_align)
- printf("** Hardware cacheline aligned\n");
- if (s->cache_dma)
- printf("** Memory is allocated in a special DMA zone\n");
- if (s->destroy_by_rcu)
- printf("** Slabs are destroyed via RCU\n");
- if (s->reclaim_account)
- printf("** Reclaim accounting active\n");
-
- printf("\nSizes (bytes) Slabs Debug Memory\n");
- printf("------------------------------------------------------------------------\n");
- printf("Object : %7d Total : %7ld Sanity Checks : %s Total: %7ld\n",
- s->object_size, s->slabs, onoff(s->sanity_checks),
- s->slabs * (page_size << s->order));
- printf("SlabObj: %7d Full : %7ld Redzoning : %s Used : %7ld\n",
- s->slab_size, s->slabs - s->partial - s->cpu_slabs,
- onoff(s->red_zone), s->objects * s->object_size);
- printf("SlabSiz: %7d Partial: %7ld Poisoning : %s Loss : %7ld\n",
- page_size << s->order, s->partial, onoff(s->poison),
- s->slabs * (page_size << s->order) - s->objects * s->object_size);
- printf("Loss : %7d CpuSlab: %7d Tracking : %s Lalig: %7ld\n",
- s->slab_size - s->object_size, s->cpu_slabs, onoff(s->store_user),
- (s->slab_size - s->object_size) * s->objects);
- printf("Align : %7d Objects: %7d Tracing : %s Lpadd: %7ld\n",
- s->align, s->objs_per_slab, onoff(s->trace),
- ((page_size << s->order) - s->objs_per_slab * s->slab_size) *
- s->slabs);
-
- ops(s);
- show_tracking(s);
- slab_numa(s, 1);
- slab_stats(s);
-}
-
-void slabcache(struct slabinfo *s)
-{
- char size_str[20];
- char dist_str[40];
- char flags[20];
- char *p = flags;
-
- if (strcmp(s->name, "*") == 0)
- return;
-
- if (actual_slabs == 1) {
- report(s);
- return;
- }
-
- if (skip_zero && !show_empty && !s->slabs)
- return;
-
- if (show_empty && s->slabs)
- return;
-
- store_size(size_str, slab_size(s));
- snprintf(dist_str, 40, "%lu/%lu/%d", s->slabs - s->cpu_slabs,
- s->partial, s->cpu_slabs);
-
- if (!line++)
- first_line();
-
- if (s->aliases)
- *p++ = '*';
- if (s->cache_dma)
- *p++ = 'd';
- if (s->hwcache_align)
- *p++ = 'A';
- if (s->poison)
- *p++ = 'P';
- if (s->reclaim_account)
- *p++ = 'a';
- if (s->red_zone)
- *p++ = 'Z';
- if (s->sanity_checks)
- *p++ = 'F';
- if (s->store_user)
- *p++ = 'U';
- if (s->trace)
- *p++ = 'T';
-
- *p = 0;
- if (show_activity) {
- unsigned long total_alloc;
- unsigned long total_free;
-
- total_alloc = s->alloc_fastpath + s->alloc_slowpath;
- total_free = s->free_fastpath + s->free_slowpath;
-
- printf("%-21s %8ld %10ld %10ld %3ld %3ld %5ld %1d\n",
- s->name, s->objects,
- total_alloc, total_free,
- total_alloc ? (s->alloc_fastpath * 100 / total_alloc) : 0,
- total_free ? (s->free_fastpath * 100 / total_free) : 0,
- s->order_fallback, s->order);
- }
- else
- printf("%-21s %8ld %7d %8s %14s %4d %1d %3ld %3ld %s\n",
- s->name, s->objects, s->object_size, size_str, dist_str,
- s->objs_per_slab, s->order,
- s->slabs ? (s->partial * 100) / s->slabs : 100,
- s->slabs ? (s->objects * s->object_size * 100) /
- (s->slabs * (page_size << s->order)) : 100,
- flags);
-}
-
-/*
- * Analyze debug options. Return false if something is amiss.
- */
-int debug_opt_scan(char *opt)
-{
- if (!opt || !opt[0] || strcmp(opt, "-") == 0)
- return 1;
-
- if (strcasecmp(opt, "a") == 0) {
- sanity = 1;
- poison = 1;
- redzone = 1;
- tracking = 1;
- return 1;
- }
-
- for ( ; *opt; opt++)
- switch (*opt) {
- case 'F' : case 'f':
- if (sanity)
- return 0;
- sanity = 1;
- break;
- case 'P' : case 'p':
- if (poison)
- return 0;
- poison = 1;
- break;
-
- case 'Z' : case 'z':
- if (redzone)
- return 0;
- redzone = 1;
- break;
-
- case 'U' : case 'u':
- if (tracking)
- return 0;
- tracking = 1;
- break;
-
- case 'T' : case 't':
- if (tracing)
- return 0;
- tracing = 1;
- break;
- default:
- return 0;
- }
- return 1;
-}
-
-int slab_empty(struct slabinfo *s)
-{
- if (s->objects > 0)
- return 0;
-
- /*
- * We may still have slabs even if there are no objects. Shrinking will
- * remove them.
- */
- if (s->slabs != 0)
- set_obj(s, "shrink", 1);
-
- return 1;
-}
-
-void slab_debug(struct slabinfo *s)
-{
- if (strcmp(s->name, "*") == 0)
- return;
-
- if (sanity && !s->sanity_checks) {
- set_obj(s, "sanity", 1);
- }
- if (!sanity && s->sanity_checks) {
- if (slab_empty(s))
- set_obj(s, "sanity", 0);
- else
- fprintf(stderr, "%s not empty cannot disable sanity checks\n", s->name);
- }
- if (redzone && !s->red_zone) {
- if (slab_empty(s))
- set_obj(s, "red_zone", 1);
- else
- fprintf(stderr, "%s not empty cannot enable redzoning\n", s->name);
- }
- if (!redzone && s->red_zone) {
- if (slab_empty(s))
- set_obj(s, "red_zone", 0);
- else
- fprintf(stderr, "%s not empty cannot disable redzoning\n", s->name);
- }
- if (poison && !s->poison) {
- if (slab_empty(s))
- set_obj(s, "poison", 1);
- else
- fprintf(stderr, "%s not empty cannot enable poisoning\n", s->name);
- }
- if (!poison && s->poison) {
- if (slab_empty(s))
- set_obj(s, "poison", 0);
- else
- fprintf(stderr, "%s not empty cannot disable poisoning\n", s->name);
- }
- if (tracking && !s->store_user) {
- if (slab_empty(s))
- set_obj(s, "store_user", 1);
- else
- fprintf(stderr, "%s not empty cannot enable tracking\n", s->name);
- }
- if (!tracking && s->store_user) {
- if (slab_empty(s))
- set_obj(s, "store_user", 0);
- else
- fprintf(stderr, "%s not empty cannot disable tracking\n", s->name);
- }
- if (tracing && !s->trace) {
- if (slabs == 1)
- set_obj(s, "trace", 1);
- else
- fprintf(stderr, "%s can only enable trace for one slab at a time\n", s->name);
- }
- if (!tracing && s->trace)
- set_obj(s, "trace", 1);
-}
-
-void totals(void)
-{
- struct slabinfo *s;
-
- int used_slabs = 0;
- char b1[20], b2[20], b3[20], b4[20];
- unsigned long long max = 1ULL << 63;
-
- /* Object size */
- unsigned long long min_objsize = max, max_objsize = 0, avg_objsize;
-
- /* Number of partial slabs in a slabcache */
- unsigned long long min_partial = max, max_partial = 0,
- avg_partial, total_partial = 0;
-
- /* Number of slabs in a slab cache */
- unsigned long long min_slabs = max, max_slabs = 0,
- avg_slabs, total_slabs = 0;
-
- /* Size of the whole slab */
- unsigned long long min_size = max, max_size = 0,
- avg_size, total_size = 0;
-
- /* Bytes used for object storage in a slab */
- unsigned long long min_used = max, max_used = 0,
- avg_used, total_used = 0;
-
- /* Waste: Bytes used for alignment and padding */
- unsigned long long min_waste = max, max_waste = 0,
- avg_waste, total_waste = 0;
- /* Number of objects in a slab */
- unsigned long long min_objects = max, max_objects = 0,
- avg_objects, total_objects = 0;
- /* Waste per object */
- unsigned long long min_objwaste = max,
- max_objwaste = 0, avg_objwaste,
- total_objwaste = 0;
-
- /* Memory per object */
- unsigned long long min_memobj = max,
- max_memobj = 0, avg_memobj,
- total_objsize = 0;
-
- /* Percentage of partial slabs per slab */
- unsigned long min_ppart = 100, max_ppart = 0,
- avg_ppart, total_ppart = 0;
-
- /* Number of objects in partial slabs */
- unsigned long min_partobj = max, max_partobj = 0,
- avg_partobj, total_partobj = 0;
-
- /* Percentage of partial objects of all objects in a slab */
- unsigned long min_ppartobj = 100, max_ppartobj = 0,
- avg_ppartobj, total_ppartobj = 0;
-
-
- for (s = slabinfo; s < slabinfo + slabs; s++) {
- unsigned long long size;
- unsigned long used;
- unsigned long long wasted;
- unsigned long long objwaste;
- unsigned long percentage_partial_slabs;
- unsigned long percentage_partial_objs;
-
- if (!s->slabs || !s->objects)
- continue;
-
- used_slabs++;
-
- size = slab_size(s);
- used = s->objects * s->object_size;
- wasted = size - used;
- objwaste = s->slab_size - s->object_size;
-
- percentage_partial_slabs = s->partial * 100 / s->slabs;
- if (percentage_partial_slabs > 100)
- percentage_partial_slabs = 100;
-
- percentage_partial_objs = s->objects_partial * 100
- / s->objects;
-
- if (percentage_partial_objs > 100)
- percentage_partial_objs = 100;
-
- if (s->object_size < min_objsize)
- min_objsize = s->object_size;
- if (s->partial < min_partial)
- min_partial = s->partial;
- if (s->slabs < min_slabs)
- min_slabs = s->slabs;
- if (size < min_size)
- min_size = size;
- if (wasted < min_waste)
- min_waste = wasted;
- if (objwaste < min_objwaste)
- min_objwaste = objwaste;
- if (s->objects < min_objects)
- min_objects = s->objects;
- if (used < min_used)
- min_used = used;
- if (s->objects_partial < min_partobj)
- min_partobj = s->objects_partial;
- if (percentage_partial_slabs < min_ppart)
- min_ppart = percentage_partial_slabs;
- if (percentage_partial_objs < min_ppartobj)
- min_ppartobj = percentage_partial_objs;
- if (s->slab_size < min_memobj)
- min_memobj = s->slab_size;
-
- if (s->object_size > max_objsize)
- max_objsize = s->object_size;
- if (s->partial > max_partial)
- max_partial = s->partial;
- if (s->slabs > max_slabs)
- max_slabs = s->slabs;
- if (size > max_size)
- max_size = size;
- if (wasted > max_waste)
- max_waste = wasted;
- if (objwaste > max_objwaste)
- max_objwaste = objwaste;
- if (s->objects > max_objects)
- max_objects = s->objects;
- if (used > max_used)
- max_used = used;
- if (s->objects_partial > max_partobj)
- max_partobj = s->objects_partial;
- if (percentage_partial_slabs > max_ppart)
- max_ppart = percentage_partial_slabs;
- if (percentage_partial_objs > max_ppartobj)
- max_ppartobj = percentage_partial_objs;
- if (s->slab_size > max_memobj)
- max_memobj = s->slab_size;
-
- total_partial += s->partial;
- total_slabs += s->slabs;
- total_size += size;
- total_waste += wasted;
-
- total_objects += s->objects;
- total_used += used;
- total_partobj += s->objects_partial;
- total_ppart += percentage_partial_slabs;
- total_ppartobj += percentage_partial_objs;
-
- total_objwaste += s->objects * objwaste;
- total_objsize += s->objects * s->slab_size;
- }
-
- if (!total_objects) {
- printf("No objects\n");
- return;
- }
- if (!used_slabs) {
- printf("No slabs\n");
- return;
- }
-
- /* Per slab averages */
- avg_partial = total_partial / used_slabs;
- avg_slabs = total_slabs / used_slabs;
- avg_size = total_size / used_slabs;
- avg_waste = total_waste / used_slabs;
-
- avg_objects = total_objects / used_slabs;
- avg_used = total_used / used_slabs;
- avg_partobj = total_partobj / used_slabs;
- avg_ppart = total_ppart / used_slabs;
- avg_ppartobj = total_ppartobj / used_slabs;
-
- /* Per object object sizes */
- avg_objsize = total_used / total_objects;
- avg_objwaste = total_objwaste / total_objects;
- avg_partobj = total_partobj * 100 / total_objects;
- avg_memobj = total_objsize / total_objects;
-
- printf("Slabcache Totals\n");
- printf("----------------\n");
- printf("Slabcaches : %3d Aliases : %3d->%-3d Active: %3d\n",
- slabs, aliases, alias_targets, used_slabs);
-
- store_size(b1, total_size);store_size(b2, total_waste);
- store_size(b3, total_waste * 100 / total_used);
- printf("Memory used: %6s # Loss : %6s MRatio:%6s%%\n", b1, b2, b3);
-
- store_size(b1, total_objects);store_size(b2, total_partobj);
- store_size(b3, total_partobj * 100 / total_objects);
- printf("# Objects : %6s # PartObj: %6s ORatio:%6s%%\n", b1, b2, b3);
-
- printf("\n");
- printf("Per Cache Average Min Max Total\n");
- printf("---------------------------------------------------------\n");
-
- store_size(b1, avg_objects);store_size(b2, min_objects);
- store_size(b3, max_objects);store_size(b4, total_objects);
- printf("#Objects %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_slabs);store_size(b2, min_slabs);
- store_size(b3, max_slabs);store_size(b4, total_slabs);
- printf("#Slabs %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_partial);store_size(b2, min_partial);
- store_size(b3, max_partial);store_size(b4, total_partial);
- printf("#PartSlab %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
- store_size(b1, avg_ppart);store_size(b2, min_ppart);
- store_size(b3, max_ppart);
- store_size(b4, total_partial * 100 / total_slabs);
- printf("%%PartSlab%10s%% %10s%% %10s%% %10s%%\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_partobj);store_size(b2, min_partobj);
- store_size(b3, max_partobj);
- store_size(b4, total_partobj);
- printf("PartObjs %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_ppartobj);store_size(b2, min_ppartobj);
- store_size(b3, max_ppartobj);
- store_size(b4, total_partobj * 100 / total_objects);
- printf("%% PartObj%10s%% %10s%% %10s%% %10s%%\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_size);store_size(b2, min_size);
- store_size(b3, max_size);store_size(b4, total_size);
- printf("Memory %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_used);store_size(b2, min_used);
- store_size(b3, max_used);store_size(b4, total_used);
- printf("Used %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
-
- store_size(b1, avg_waste);store_size(b2, min_waste);
- store_size(b3, max_waste);store_size(b4, total_waste);
- printf("Loss %10s %10s %10s %10s\n",
- b1, b2, b3, b4);
-
- printf("\n");
- printf("Per Object Average Min Max\n");
- printf("---------------------------------------------\n");
-
- store_size(b1, avg_memobj);store_size(b2, min_memobj);
- store_size(b3, max_memobj);
- printf("Memory %10s %10s %10s\n",
- b1, b2, b3);
- store_size(b1, avg_objsize);store_size(b2, min_objsize);
- store_size(b3, max_objsize);
- printf("User %10s %10s %10s\n",
- b1, b2, b3);
-
- store_size(b1, avg_objwaste);store_size(b2, min_objwaste);
- store_size(b3, max_objwaste);
- printf("Loss %10s %10s %10s\n",
- b1, b2, b3);
-}
-
-void sort_slabs(void)
-{
- struct slabinfo *s1,*s2;
-
- for (s1 = slabinfo; s1 < slabinfo + slabs; s1++) {
- for (s2 = s1 + 1; s2 < slabinfo + slabs; s2++) {
- int result;
-
- if (sort_size)
- result = slab_size(s1) < slab_size(s2);
- else if (sort_active)
- result = slab_activity(s1) < slab_activity(s2);
- else
- result = strcasecmp(s1->name, s2->name);
-
- if (show_inverted)
- result = -result;
-
- if (result > 0) {
- struct slabinfo t;
-
- memcpy(&t, s1, sizeof(struct slabinfo));
- memcpy(s1, s2, sizeof(struct slabinfo));
- memcpy(s2, &t, sizeof(struct slabinfo));
- }
- }
- }
-}
-
-void sort_aliases(void)
-{
- struct aliasinfo *a1,*a2;
-
- for (a1 = aliasinfo; a1 < aliasinfo + aliases; a1++) {
- for (a2 = a1 + 1; a2 < aliasinfo + aliases; a2++) {
- char *n1, *n2;
-
- n1 = a1->name;
- n2 = a2->name;
- if (show_alias && !show_inverted) {
- n1 = a1->ref;
- n2 = a2->ref;
- }
- if (strcasecmp(n1, n2) > 0) {
- struct aliasinfo t;
-
- memcpy(&t, a1, sizeof(struct aliasinfo));
- memcpy(a1, a2, sizeof(struct aliasinfo));
- memcpy(a2, &t, sizeof(struct aliasinfo));
- }
- }
- }
-}
-
-void link_slabs(void)
-{
- struct aliasinfo *a;
- struct slabinfo *s;
-
- for (a = aliasinfo; a < aliasinfo + aliases; a++) {
-
- for (s = slabinfo; s < slabinfo + slabs; s++)
- if (strcmp(a->ref, s->name) == 0) {
- a->slab = s;
- s->refs++;
- break;
- }
- if (s == slabinfo + slabs)
- fatal("Unresolved alias %s\n", a->ref);
- }
-}
-
-void alias(void)
-{
- struct aliasinfo *a;
- char *active = NULL;
-
- sort_aliases();
- link_slabs();
-
- for(a = aliasinfo; a < aliasinfo + aliases; a++) {
-
- if (!show_single_ref && a->slab->refs == 1)
- continue;
-
- if (!show_inverted) {
- if (active) {
- if (strcmp(a->slab->name, active) == 0) {
- printf(" %s", a->name);
- continue;
- }
- }
- printf("\n%-12s <- %s", a->slab->name, a->name);
- active = a->slab->name;
- }
- else
- printf("%-20s -> %s\n", a->name, a->slab->name);
- }
- if (active)
- printf("\n");
-}
-
-
-void rename_slabs(void)
-{
- struct slabinfo *s;
- struct aliasinfo *a;
-
- for (s = slabinfo; s < slabinfo + slabs; s++) {
- if (*s->name != ':')
- continue;
-
- if (s->refs > 1 && !show_first_alias)
- continue;
-
- a = find_one_alias(s);
-
- if (a)
- s->name = a->name;
- else {
- s->name = "*";
- actual_slabs--;
- }
- }
-}
-
-int slab_mismatch(char *slab)
-{
- return regexec(&pattern, slab, 0, NULL, 0);
-}
-
-void read_slab_dir(void)
-{
- DIR *dir;
- struct dirent *de;
- struct slabinfo *slab = slabinfo;
- struct aliasinfo *alias = aliasinfo;
- char *p;
- char *t;
- int count;
-
- if (chdir("/sys/kernel/slab") && chdir("/sys/slab"))
- fatal("SYSFS support for SLUB not active\n");
-
- dir = opendir(".");
- while ((de = readdir(dir))) {
- if (de->d_name[0] == '.' ||
- (de->d_name[0] != ':' && slab_mismatch(de->d_name)))
- continue;
- switch (de->d_type) {
- case DT_LNK:
- alias->name = strdup(de->d_name);
- count = readlink(de->d_name, buffer, sizeof(buffer));
-
- if (count < 0)
- fatal("Cannot read symlink %s\n", de->d_name);
-
- buffer[count] = 0;
- p = buffer + count;
- while (p > buffer && p[-1] != '/')
- p--;
- alias->ref = strdup(p);
- alias++;
- break;
- case DT_DIR:
- if (chdir(de->d_name))
- fatal("Unable to access slab %s\n", slab->name);
- slab->name = strdup(de->d_name);
- slab->alias = 0;
- slab->refs = 0;
- slab->aliases = get_obj("aliases");
- slab->align = get_obj("align");
- slab->cache_dma = get_obj("cache_dma");
- slab->cpu_slabs = get_obj("cpu_slabs");
- slab->destroy_by_rcu = get_obj("destroy_by_rcu");
- slab->hwcache_align = get_obj("hwcache_align");
- slab->object_size = get_obj("object_size");
- slab->objects = get_obj("objects");
- slab->objects_partial = get_obj("objects_partial");
- slab->objects_total = get_obj("objects_total");
- slab->objs_per_slab = get_obj("objs_per_slab");
- slab->order = get_obj("order");
- slab->partial = get_obj("partial");
- slab->partial = get_obj_and_str("partial", &t);
- decode_numa_list(slab->numa_partial, t);
- free(t);
- slab->poison = get_obj("poison");
- slab->reclaim_account = get_obj("reclaim_account");
- slab->red_zone = get_obj("red_zone");
- slab->sanity_checks = get_obj("sanity_checks");
- slab->slab_size = get_obj("slab_size");
- slab->slabs = get_obj_and_str("slabs", &t);
- decode_numa_list(slab->numa, t);
- free(t);
- slab->store_user = get_obj("store_user");
- slab->trace = get_obj("trace");
- slab->alloc_fastpath = get_obj("alloc_fastpath");
- slab->alloc_slowpath = get_obj("alloc_slowpath");
- slab->free_fastpath = get_obj("free_fastpath");
- slab->free_slowpath = get_obj("free_slowpath");
- slab->free_frozen= get_obj("free_frozen");
- slab->free_add_partial = get_obj("free_add_partial");
- slab->free_remove_partial = get_obj("free_remove_partial");
- slab->alloc_from_partial = get_obj("alloc_from_partial");
- slab->alloc_slab = get_obj("alloc_slab");
- slab->alloc_refill = get_obj("alloc_refill");
- slab->free_slab = get_obj("free_slab");
- slab->cpuslab_flush = get_obj("cpuslab_flush");
- slab->deactivate_full = get_obj("deactivate_full");
- slab->deactivate_empty = get_obj("deactivate_empty");
- slab->deactivate_to_head = get_obj("deactivate_to_head");
- slab->deactivate_to_tail = get_obj("deactivate_to_tail");
- slab->deactivate_remote_frees = get_obj("deactivate_remote_frees");
- slab->order_fallback = get_obj("order_fallback");
- chdir("..");
- if (slab->name[0] == ':')
- alias_targets++;
- slab++;
- break;
- default :
- fatal("Unknown file type %lx\n", de->d_type);
- }
- }
- closedir(dir);
- slabs = slab - slabinfo;
- actual_slabs = slabs;
- aliases = alias - aliasinfo;
- if (slabs > MAX_SLABS)
- fatal("Too many slabs\n");
- if (aliases > MAX_ALIASES)
- fatal("Too many aliases\n");
-}
-
-void output_slabs(void)
-{
- struct slabinfo *slab;
-
- for (slab = slabinfo; slab < slabinfo + slabs; slab++) {
-
- if (slab->alias)
- continue;
-
-
- if (show_numa)
- slab_numa(slab, 0);
- else if (show_track)
- show_tracking(slab);
- else if (validate)
- slab_validate(slab);
- else if (shrink)
- slab_shrink(slab);
- else if (set_debug)
- slab_debug(slab);
- else if (show_ops)
- ops(slab);
- else if (show_slab)
- slabcache(slab);
- else if (show_report)
- report(slab);
- }
-}
-
-struct option opts[] = {
- { "aliases", 0, NULL, 'a' },
- { "activity", 0, NULL, 'A' },
- { "debug", 2, NULL, 'd' },
- { "display-activity", 0, NULL, 'D' },
- { "empty", 0, NULL, 'e' },
- { "first-alias", 0, NULL, 'f' },
- { "help", 0, NULL, 'h' },
- { "inverted", 0, NULL, 'i'},
- { "numa", 0, NULL, 'n' },
- { "ops", 0, NULL, 'o' },
- { "report", 0, NULL, 'r' },
- { "shrink", 0, NULL, 's' },
- { "slabs", 0, NULL, 'l' },
- { "track", 0, NULL, 't'},
- { "validate", 0, NULL, 'v' },
- { "zero", 0, NULL, 'z' },
- { "1ref", 0, NULL, '1'},
- { NULL, 0, NULL, 0 }
-};
-
-int main(int argc, char *argv[])
-{
- int c;
- int err;
- char *pattern_source;
-
- page_size = getpagesize();
-
- while ((c = getopt_long(argc, argv, "aAd::Defhil1noprstvzTS",
- opts, NULL)) != -1)
- switch (c) {
- case '1':
- show_single_ref = 1;
- break;
- case 'a':
- show_alias = 1;
- break;
- case 'A':
- sort_active = 1;
- break;
- case 'd':
- set_debug = 1;
- if (!debug_opt_scan(optarg))
- fatal("Invalid debug option '%s'\n", optarg);
- break;
- case 'D':
- show_activity = 1;
- break;
- case 'e':
- show_empty = 1;
- break;
- case 'f':
- show_first_alias = 1;
- break;
- case 'h':
- usage();
- return 0;
- case 'i':
- show_inverted = 1;
- break;
- case 'n':
- show_numa = 1;
- break;
- case 'o':
- show_ops = 1;
- break;
- case 'r':
- show_report = 1;
- break;
- case 's':
- shrink = 1;
- break;
- case 'l':
- show_slab = 1;
- break;
- case 't':
- show_track = 1;
- break;
- case 'v':
- validate = 1;
- break;
- case 'z':
- skip_zero = 0;
- break;
- case 'T':
- show_totals = 1;
- break;
- case 'S':
- sort_size = 1;
- break;
-
- default:
- fatal("%s: Invalid option '%c'\n", argv[0], optopt);
-
- }
-
- if (!show_slab && !show_alias && !show_track && !show_report
- && !validate && !shrink && !set_debug && !show_ops)
- show_slab = 1;
-
- if (argc > optind)
- pattern_source = argv[optind];
- else
- pattern_source = ".*";
-
- err = regcomp(&pattern, pattern_source, REG_ICASE|REG_NOSUB);
- if (err)
- fatal("%s: Invalid pattern '%s' code %d\n",
- argv[0], pattern_source, err);
- read_slab_dir();
- if (show_alias)
- alias();
- else
- if (show_totals)
- totals();
- else {
- link_slabs();
- rename_slabs();
- sort_slabs();
- output_slabs();
- }
- return 0;
-}
diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt
index 7c13f22a0c9e..07375e73981a 100644
--- a/Documentation/vm/slub.txt
+++ b/Documentation/vm/slub.txt
@@ -41,6 +41,9 @@ Possible debug options are
P Poisoning (object and padding)
U User tracking (free and alloc)
T Trace (please only use on single slabs)
+ A Toggle failslab filter mark for the cache
+ O Switch debugging off for caches that would have
+ caused higher minimum slab orders
- Switch all debugging off (useful if the kernel is
configured with CONFIG_SLUB_DEBUG_ON)
@@ -59,6 +62,14 @@ to the dentry cache with
slub_debug=F,dentry
+Debugging options may require the minimum possible slab order to increase as
+a result of storing the metadata (for example, caches with PAGE_SIZE object
+sizes). This has a higher likelihood of resulting in slab allocation errors
+in low memory situations or if there's high fragmentation of memory. To
+switch off debugging for such caches by default, use
+
+ slub_debug=O
+
In case you forgot to enable debugging on the kernel command line: It is
possible to enable debugging manually when the kernel is up. Look at the
contents of:
@@ -235,7 +246,7 @@ been overwritten. Here a string of 8 characters was written into a slab that
has the length of 8 characters. However, a 8 character string needs a
terminating 0. That zero has overwritten the first byte of the Redzone field.
After reporting the details of the issue encountered the FIX SLUB message
-tell us that SLUB has restored the Redzone to its proper value and then
+tells us that SLUB has restored the Redzone to its proper value and then
system operations continue.
Emergency operations:
@@ -266,4 +277,4 @@ of other objects.
slub_debug=FZ,dentry
-Christoph Lameter, <clameter@sgi.com>, May 30, 2007
+Christoph Lameter, May 30, 2007
diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
new file mode 100644
index 000000000000..29bdf62aac09
--- /dev/null
+++ b/Documentation/vm/transhuge.txt
@@ -0,0 +1,299 @@
+= Transparent Hugepage Support =
+
+== Objective ==
+
+Performance critical computing applications dealing with large memory
+working sets are already running on top of libhugetlbfs and in turn
+hugetlbfs. Transparent Hugepage Support is an alternative means of
+using huge pages for the backing of virtual memory that supports the
+automatic promotion and demotion of page sizes, without the
+shortcomings of hugetlbfs.
+
+Currently it only works for anonymous memory mappings but in the
+future it can expand over the pagecache layer starting with tmpfs.
+
+Applications run faster for two reasons. The first factor is almost
+completely irrelevant and not of significant interest, because it also
+has the downside of requiring larger clear-page and copy-page
+operations in page faults, which is a potentially negative effect. The
+first factor consists of taking a single page fault for each 2M
+virtual region touched by userland (reducing the enter/exit kernel
+frequency by a factor of 512). This only matters the first time the
+memory is accessed during the lifetime of a memory mapping. The
+second, long lasting and much more important factor affects all
+subsequent accesses to the memory for the whole runtime of the
+application. The second factor consists of two components: 1) the TLB
+miss runs faster (especially with virtualization using nested
+pagetables, but almost always on bare metal as well) and 2) a single
+TLB entry maps a much larger amount of virtual memory, in turn
+reducing the number of TLB misses. With virtualization and nested
+pagetables the TLB can map larger entries only if both KVM and the
+Linux guest are using hugepages, but a significant speedup already
+happens if only one of the two is using hugepages, simply because the
+TLB miss runs faster.
+
+== Design ==
+
+- "graceful fallback": mm components which don't have transparent
+ hugepage knowledge fall back to breaking a transparent hugepage and
+ working on the regular pages and their respective regular pmd/pte
+ mappings
+
+- if a hugepage allocation fails because of memory fragmentation,
+ regular pages should be gracefully allocated instead and mixed in
+ the same vma without any failure or significant delay and without
+ userland noticing
+
+- if some task quits and more hugepages become available (either
+ immediately in the buddy or through the VM), guest physical memory
+ backed by regular pages should be relocated on hugepages
+ automatically (with khugepaged)
+
+- it doesn't require memory reservation and in turn it uses hugepages
+ whenever possible (the only possible reservation here is kernelcore=
+  to prevent unmovable pages from fragmenting all the memory, but such a tweak
+ is not specific to transparent hugepage support and it's a generic
+ feature that applies to all dynamic high order allocations in the
+ kernel)
+
+- this initial support only offers the feature in the anonymous memory
+ regions but it'd be ideal to move it to tmpfs and the pagecache
+ later
+
+Transparent Hugepage Support maximizes the usefulness of free memory
+compared to the reservation approach of hugetlbfs by allowing all
+unused memory to be used as cache or other movable (or even unmovable)
+entities. It doesn't require reservation to prevent hugepage
+allocation failures from being noticeable from userland. It allows
+paging and all other advanced VM features to be available on the
+hugepages. It requires no modifications for applications to take
+advantage of it.
+
+Applications can however be further optimized to take advantage of
+this feature, as for example they have been optimized before to avoid
+a flood of mmap system calls for every malloc(4k). Optimizing userland
+is far from mandatory, and khugepaged can already take care of long
+lived page allocations even for hugepage unaware applications that
+deal with large amounts of memory.
+
+In certain cases when hugepages are enabled system wide, applications
+may end up allocating more memory resources. An application may mmap a
+large region but only touch 1 byte of it; in that case a 2M page might
+be allocated instead of a 4k page for no good reason. This is why it's
+possible to disable hugepages system-wide and to only have them inside
+MADV_HUGEPAGE madvise regions.
+
+Embedded systems should enable hugepages only inside madvise regions
+to eliminate any risk of wasting any precious byte of memory and to
+only run faster.
+
+Applications that get a lot of benefit from hugepages and that don't
+risk losing memory by using hugepages should use
+madvise(MADV_HUGEPAGE) on their critical mmapped regions.
+
+== sysfs ==
+
+Transparent Hugepage Support can be entirely disabled (mostly for
+debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to
+avoid the risk of consuming more memory resources) or enabled system
+wide. This can be achieved with one of:
+
+echo always >/sys/kernel/mm/transparent_hugepage/enabled
+echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
+echo never >/sys/kernel/mm/transparent_hugepage/enabled
+
+It's also possible to limit the VM's defrag efforts to generate
+hugepages (in case they're not immediately free) to madvise regions
+only, or to never try to defrag memory and simply fall back to regular
+pages unless hugepages are immediately available. Clearly if we spend
+CPU time to defrag memory, we would expect to gain even more by the
+fact we use hugepages later instead of regular pages. This isn't
+always guaranteed, but it may be more likely in case the allocation is
+for a MADV_HUGEPAGE region.
+
+echo always >/sys/kernel/mm/transparent_hugepage/defrag
+echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
+echo never >/sys/kernel/mm/transparent_hugepage/defrag
+
+khugepaged will be automatically started when
+transparent_hugepage/enabled is set to "always" or "madvise", and it
+will be automatically shut down if it's set to "never".
+
+khugepaged usually runs at low frequency so while one may not want to
+invoke defrag algorithms synchronously during the page faults, it
+should be worth invoking defrag at least in khugepaged. However it's
+also possible to disable defrag in khugepaged by writing 0 or enable
+defrag in khugepaged by writing 1:
+
+echo 0 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
+echo 1 >/sys/kernel/mm/transparent_hugepage/khugepaged/defrag
+
+You can also control how many pages khugepaged should scan at each
+pass:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
+
+and how many milliseconds to wait in khugepaged between each pass (you
+can set this to 0 to run khugepaged at 100% utilization of one core):
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
+
+and how many milliseconds to wait in khugepaged if there's a hugepage
+allocation failure, to throttle the next allocation attempt:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
+
+The khugepaged progress can be seen in the number of pages collapsed:
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed
+
+and in the number of full scans performed (one per pass):
+
+/sys/kernel/mm/transparent_hugepage/khugepaged/full_scans
+
+== Boot parameter ==
+
+You can change the sysfs boot time defaults of Transparent Hugepage
+Support by passing the parameter "transparent_hugepage=always" or
+"transparent_hugepage=madvise" or "transparent_hugepage=never"
+(without "") to the kernel command line.
+
+== Need of application restart ==
+
+The transparent_hugepage/enabled values only affect future
+behavior. So to make them effective you need to restart any
+application that could have been using hugepages. This also applies to
+the regions registered in khugepaged.
+
+== get_user_pages and follow_page ==
+
+get_user_pages and follow_page, if run on a hugepage, will return the
+head or tail pages as usual (exactly as they would do on
+hugetlbfs). Most gup users will only care about the actual physical
+address of the page and its temporary pinning to release after the I/O
+is complete, so they won't ever notice the fact the page is huge. But
+if any driver is going to mangle the page structure of the tail page
+(like checking page->mapping or other bits that are relevant for the
+head page and not the tail page), it should be updated to check the
+head page instead (while serializing properly against
+split_huge_page() so that the head and tail pages cannot disappear
+from under it; see the futex code for an example of that, hugetlbfs
+also needed special handling in futex code for similar reasons).
+
+NOTE: these aren't new constraints to the GUP API, and they match the
+same constraints that apply to hugetlbfs too, so any driver capable
+of handling GUP on hugetlbfs will also work fine on transparent
+hugepage backed mappings.
+
+In case you can't handle compound pages if they're returned by
+follow_page, the FOLL_SPLIT bit can be specified as a parameter to
+follow_page, so that it will split the hugepages before returning
+them. Migration for example passes FOLL_SPLIT as a parameter to
+follow_page because it's not hugepage aware and in fact it can't work
+at all on hugetlbfs (but it instead works fine on transparent
+hugepages thanks to FOLL_SPLIT). Migration simply can't deal with
+hugepages being returned (as it's not only checking the pfn of the
+page and pinning it during the copy, but it expects to migrate the
+memory in regular page sizes and with regular pte/pmd mappings).
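+
+A minimal sketch of such a caller (the helper name get_small_page is made up
+for illustration, and the caller is assumed to already hold the mmap_sem as
+follow_page requires):
+
+	#include <linux/mm.h>
+
+	/* Return a regular (non-compound) pinned page, splitting any THP. */
+	static struct page *get_small_page(struct vm_area_struct *vma,
+					   unsigned long address)
+	{
+		/* FOLL_GET pins the page, FOLL_SPLIT splits hugepages first */
+		return follow_page(vma, address, FOLL_GET | FOLL_SPLIT);
+	}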
+
+== Optimizing the applications ==
+
+To guarantee that the kernel will map a 2M page immediately in any
+memory region, the mmap region has to be hugepage naturally
+aligned. posix_memalign() can provide that guarantee.
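+
+For illustration, a minimal userspace sketch of such an allocation (the 2M
+size is an x86-64 assumption, other architectures use different hugepage
+sizes, and MADV_HUGEPAGE requires sufficiently recent kernel headers):
+
+	#include <stdlib.h>
+	#include <sys/mman.h>
+
+	#define HPAGE_SIZE (2UL * 1024 * 1024)
+
+	/* Return a 2M-aligned buffer and hint that THP should back it. */
+	static void *alloc_thp_buffer(size_t size)
+	{
+		void *buf;
+
+		if (posix_memalign(&buf, HPAGE_SIZE, size))
+			return NULL;
+		madvise(buf, size, MADV_HUGEPAGE);
+		return buf;
+	}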
+
+== Hugetlbfs ==
+
+You can use hugetlbfs on a kernel that has transparent hugepage
+support enabled just fine as always. No difference can be noted in
+hugetlbfs other than there will be less overall fragmentation. All
+usual features belonging to hugetlbfs are preserved and
+unaffected. libhugetlbfs will also work fine as usual.
+
+== Graceful fallback ==
+
+Code walking pagetables but unaware of huge pmds can simply call
+split_huge_page_pmd(mm, pmd) where the pmd is the one returned by
+pmd_offset. It's trivial to make the code transparent hugepage aware
+by just grepping for "pmd_offset" and adding split_huge_page_pmd where
+it is missing after pmd_offset returns the pmd. Thanks to the graceful
+fallback design, with a one liner change, you can avoid writing
+hundreds if not thousands of lines of complex code to make your code
+hugepage aware.
+
+If you're not walking pagetables but you run into a physical hugepage
+that you can't handle natively in your code, you can split it by
+calling split_huge_page(page). This is what the Linux VM does, for
+example, before it tries to swap out the hugepage.
+
+Example to make mremap.c transparent hugepage aware with a one liner
+change:
+
+diff --git a/mm/mremap.c b/mm/mremap.c
+--- a/mm/mremap.c
++++ b/mm/mremap.c
+@@ -41,6 +41,7 @@ static pmd_t *get_old_pmd(struct mm_stru
+ return NULL;
+
+ pmd = pmd_offset(pud, addr);
++ split_huge_page_pmd(mm, pmd);
+ if (pmd_none_or_clear_bad(pmd))
+ return NULL;
+
+== Locking in hugepage aware code ==
+
+We want as much code as possible to be hugepage aware, as calling
+split_huge_page() or split_huge_page_pmd() has a cost.
+
+To make pagetable walks huge pmd aware, all you need to do is call
+pmd_trans_huge() on the pmd returned by pmd_offset. You must hold the
+mmap_sem in read (or write) mode to be sure a huge pmd cannot be
+created from under you by khugepaged (khugepaged collapse_huge_page
+takes the mmap_sem in write mode in addition to the anon_vma lock). If
+pmd_trans_huge returns false, you just fall back to the old code
+paths. If instead pmd_trans_huge returns true, you have to take the
+mm->page_table_lock and re-run pmd_trans_huge. Taking the
+page_table_lock will prevent the huge pmd from being converted into a
+regular pmd from under you (split_huge_page can run in parallel to the
+pagetable walk). If the second pmd_trans_huge returns false, you
+should just drop the page_table_lock and fall back to the old code as
+before. Otherwise you should run pmd_trans_splitting on the pmd. If
+pmd_trans_splitting returns true, it means split_huge_page is already
+in the middle of splitting the page, so it's enough to drop the
+page_table_lock and call wait_split_huge_page and then fall back to
+the old code paths. You are guaranteed that by the time
+wait_split_huge_page returns, the pmd isn't huge anymore. If
+pmd_trans_splitting returns false, you can proceed to process the huge
+pmd and the hugepage natively. Once finished you can drop the
+page_table_lock.
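+
+The protocol above can be summarized with a sketch. walk_pte_range() and
+walk_huge_pmd() are hypothetical placeholders for the caller's own code, the
+caller is assumed to hold the mmap_sem, and the exact helper signatures may
+differ between kernel versions:
+
+	#include <linux/mm.h>
+	#include <linux/huge_mm.h>
+
+	static int walk_one_pmd(struct mm_struct *mm,
+				struct vm_area_struct *vma,
+				pmd_t *pmd, unsigned long addr)
+	{
+		if (!pmd_trans_huge(*pmd))
+			return walk_pte_range(mm, pmd, addr); /* old path */
+
+		spin_lock(&mm->page_table_lock);
+		if (!pmd_trans_huge(*pmd)) {
+			/* it was split from under us before we locked */
+			spin_unlock(&mm->page_table_lock);
+			return walk_pte_range(mm, pmd, addr);
+		}
+		if (pmd_trans_splitting(*pmd)) {
+			/* split_huge_page is running: wait, use old path */
+			spin_unlock(&mm->page_table_lock);
+			wait_split_huge_page(vma->anon_vma, pmd);
+			return walk_pte_range(mm, pmd, addr);
+		}
+		/* stable huge pmd: process the hugepage natively */
+		walk_huge_pmd(mm, pmd, addr);
+		spin_unlock(&mm->page_table_lock);
+		return 0;
+	}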
+
+== compound_lock, get_user_pages and put_page ==
+
+split_huge_page internally has to distribute the refcounts in the head
+page to the tail pages before clearing all PG_head/tail bits from the
+page structures. It can do that easily for refcounts taken by huge pmd
+mappings. But the GUP API as created by hugetlbfs (which returns head
+and tail pages if get_user_pages is run on an address backed by any
+hugepage) requires the refcount to be accounted on the tail pages and
+not only in the head pages, if we want to be able to run
+split_huge_page while there are gup pins established on any tail
+page. Being unable to run split_huge_page if there's any gup pin on
+any tail page would mean having to split all hugepages upfront in
+get_user_pages, which is unacceptable as too many gup users are
+performance critical and they must work natively on hugepages like
+they already work natively on hugetlbfs (hugetlbfs is simpler because
+hugetlbfs pages cannot be split, so there is no requirement to account
+the pins on the tail pages for hugetlbfs). If we didn't account the
+gup refcounts on the tail pages during gup, we wouldn't know anymore
+which tail page is pinned by gup and which is not while we run
+split_huge_page. But we still have to add the gup pin to the head page
+too, to know when we can free the compound page in case it's never
+split during its lifetime. That requires changing not just get_page,
+but put_page as well, so that when put_page runs on a tail page (and
+only on a tail page) it will find its respective head page and then
+decrease the head page refcount in addition to the tail page
+refcount. To obtain a head page reliably and to decrease its refcount
+without race conditions, put_page has to serialize against
+__split_huge_page_refcount using a special per-page lock called
+compound_lock.
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt
new file mode 100644
index 000000000000..97bae3c576c2
--- /dev/null
+++ b/Documentation/vm/unevictable-lru.txt
@@ -0,0 +1,690 @@
+ ==============================
+ UNEVICTABLE LRU INFRASTRUCTURE
+ ==============================
+
+========
+CONTENTS
+========
+
+ (*) The Unevictable LRU
+
+ - The unevictable page list.
+ - Memory control group interaction.
+ - Marking address spaces unevictable.
+ - Detecting Unevictable Pages.
+ - vmscan's handling of unevictable pages.
+
+ (*) mlock()'d pages.
+
+ - History.
+ - Basic management.
+ - mlock()/mlockall() system call handling.
+ - Filtering special vmas.
+ - munlock()/munlockall() system call handling.
+ - Migrating mlocked pages.
+ - mmap(MAP_LOCKED) system call handling.
+ - munmap()/exit()/exec() system call handling.
+ - try_to_unmap().
+ - try_to_munlock() reverse map scan.
+ - Page reclaim in shrink_*_list().
+
+
+============
+INTRODUCTION
+============
+
+This document describes the Linux memory manager's "Unevictable LRU"
+infrastructure and the use of this to manage several types of "unevictable"
+pages.
+
+The document attempts to provide the overall rationale behind this mechanism
+and the rationale for some of the design decisions that drove the
+implementation. The latter design rationale is discussed in the context of an
+implementation description. Admittedly, one can obtain the implementation
+details - the "what does it do?" - by reading the code. One hopes that the
+descriptions below add value by providing the answer to "why does it do that?".
+
+
+===================
+THE UNEVICTABLE LRU
+===================
+
+The Unevictable LRU facility adds an additional LRU list to track unevictable
+pages and to hide these pages from vmscan. This mechanism is based on a patch
+by Larry Woodman of Red Hat to address several scalability problems with page
+reclaim in Linux. The problems have been observed at customer sites on large
+memory x86_64 systems.
+
+To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
+main memory will have over 32 million 4k pages in a single zone. When a large
+fraction of these pages are not evictable for any reason [see below], vmscan
+will spend a lot of time scanning the LRU lists looking for the small fraction
+of pages that are evictable. This can result in a situation where all CPUs are
+spending 100% of their time in vmscan for hours or days on end, with the system
+completely unresponsive.
+
+The unevictable list addresses the following classes of unevictable pages:
+
+ (*) Those owned by ramfs.
+
+ (*) Those mapped into SHM_LOCK'd shared memory regions.
+
+ (*) Those mapped into VM_LOCKED [mlock()ed] VMAs.
+
+The infrastructure may also be able to handle other conditions that make pages
+unevictable, either by definition or by circumstance, in the future.
+
+
+THE UNEVICTABLE PAGE LIST
+-------------------------
+
+The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
+called the "unevictable" list and an associated page flag, PG_unevictable, to
+indicate that the page is being managed on the unevictable list.
+
+The PG_unevictable flag is analogous to, and mutually exclusive with, the
+PG_active flag in that it indicates on which LRU list a page resides when
+PG_lru is set.
+
+The Unevictable LRU infrastructure maintains unevictable pages on an additional
+LRU list for a few reasons:
+
+ (1) We get to "treat unevictable pages just like we treat other pages in the
+ system - which means we get to use the same code to manipulate them, the
+ same code to isolate them (for migrate, etc.), the same code to keep track
+ of the statistics, etc..." [Rik van Riel]
+
+ (2) We want to be able to migrate unevictable pages between nodes for memory
+ defragmentation, workload management and memory hotplug. The linux kernel
+ can only migrate pages that it can successfully isolate from the LRU
+ lists. If we were to maintain pages elsewhere than on an LRU-like list,
+ where they can be found by isolate_lru_page(), we would prevent their
+ migration, unless we reworked migration code to find the unevictable pages
+ itself.
+
+
+The unevictable list does not differentiate between file-backed and anonymous,
+swap-backed pages. This differentiation is only important while the pages are,
+in fact, evictable.
+
+The unevictable list benefits from the "arrayification" of the per-zone LRU
+lists and statistics originally proposed and posted by Christoph Lameter.
+
+The unevictable list does not use the LRU pagevec mechanism. Rather,
+unevictable pages are placed directly on the page's zone's unevictable list
+under the zone lru_lock. This allows us to prevent the stranding of pages on
+the unevictable list when one task has the page isolated from the LRU and other
+tasks are changing the "evictability" state of the page.
+
+
+MEMORY CONTROL GROUP INTERACTION
+--------------------------------
+
+The unevictable LRU facility interacts with the memory control group [aka
+memory controller; see Documentation/cgroups/memory.txt] by extending the
+lru_list enum.
+
+The memory controller data structure automatically gets a per-zone unevictable
+list as a result of the "arrayification" of the per-zone LRU lists (one per
+lru_list enum element). The memory controller tracks the movement of pages to
+and from the unevictable list.
+
+When a memory control group comes under memory pressure, the controller will
+not attempt to reclaim pages on the unevictable list. This has a couple of
+effects:
+
+ (1) Because the pages are "hidden" from reclaim on the unevictable list, the
+ reclaim process can be more efficient, dealing only with pages that have a
+ chance of being reclaimed.
+
+ (2) On the other hand, if too many of the pages charged to the control group
+ are unevictable, the evictable portion of the working set of the tasks in
+ the control group may not fit into the available memory. This can cause
+ the control group to thrash or to OOM-kill tasks.
+
+
+MARKING ADDRESS SPACES UNEVICTABLE
+----------------------------------
+
+For facilities such as ramfs none of the pages attached to the address space
+may be evicted. To prevent eviction of any such pages, the AS_UNEVICTABLE
+address space flag is provided, and this can be manipulated by a filesystem
+using a number of wrapper functions:
+
+ (*) void mapping_set_unevictable(struct address_space *mapping);
+
+ Mark the address space as being completely unevictable.
+
+ (*) void mapping_clear_unevictable(struct address_space *mapping);
+
+ Mark the address space as being evictable.
+
+ (*) int mapping_unevictable(struct address_space *mapping);
+
+ Query the address space, and return true if it is completely
+ unevictable.
+
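+For example, a filesystem that wants all pages of an inode to be unevictable
+can mark the inode's mapping at creation time, much as ramfs does. The
+following is only an illustrative sketch (the function name
+example_get_inode() is made up):
+
+        struct inode *example_get_inode(struct super_block *sb, int mode)
+        {
+                struct inode *inode = new_inode(sb);
+
+                if (inode) {
+                        inode->i_mode = mode;
+                        /* no page of this mapping may ever be reclaimed */
+                        mapping_set_unevictable(inode->i_mapping);
+                }
+                return inode;
+        }
+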
+These are currently used in two places in the kernel:
+
+ (1) By ramfs to mark the address spaces of its inodes when they are created,
+ and this mark remains for the life of the inode.
+
+ (2) By SYSV SHM to mark SHM_LOCK'd address spaces until SHM_UNLOCK is called.
+
+ Note that SHM_LOCK is not required to page in the locked pages if they're
+ swapped out; the application must touch the pages manually if it wants to
+ ensure they're in memory.
+
+
+DETECTING UNEVICTABLE PAGES
+---------------------------
+
+The function page_evictable() in vmscan.c determines whether a page is
+evictable or not using the query function outlined above [see section "Marking
+address spaces unevictable"] to check the AS_UNEVICTABLE flag.
+
+For address spaces that are so marked after being populated (as SHM regions
+might be), the lock action (eg: SHM_LOCK) can be lazy, and need not populate
+the page tables for the region as does, for example, mlock(), nor need it make
+any special effort to push any pages in the SHM_LOCK'd area to the unevictable
+list. Instead, vmscan will do this if and when it encounters the pages during
+a reclamation scan.
+
+On an unlock action (such as SHM_UNLOCK), the unlocker (eg: shmctl()) must scan
+the pages in the region and "rescue" them from the unevictable list if no other
+condition is keeping them unevictable. If an unevictable region is destroyed,
+the pages are also "rescued" from the unevictable list in the process of
+freeing them.
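+
+From userspace, the SHM_LOCK/SHM_UNLOCK behaviour described above can be
+exercised as follows (a minimal sketch using the SysV shm API; error handling
+and the CAP_IPC_LOCK/RLIMIT_MEMLOCK requirements are omitted):
+
+        #include <string.h>
+        #include <sys/ipc.h>
+        #include <sys/shm.h>
+
+        #define SEG_SIZE (16UL * 1024 * 1024)
+
+        int main(void)
+        {
+                int id = shmget(IPC_PRIVATE, SEG_SIZE, IPC_CREAT | 0600);
+                char *p = shmat(id, NULL, 0);
+
+                /* lazy: marks the segment's address space AS_UNEVICTABLE but
+                 * does not fault pages in or move them to any list */
+                shmctl(id, SHM_LOCK, NULL);
+
+                /* the application must touch the pages itself; vmscan will
+                 * divert any it later encounters to the unevictable list */
+                memset(p, 0, SEG_SIZE);
+
+                /* the unlocker scans the segment and "rescues" its pages
+                 * from the unevictable list */
+                shmctl(id, SHM_UNLOCK, NULL);
+
+                shmdt(p);
+                shmctl(id, IPC_RMID, NULL);
+                return 0;
+        }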
+
+page_evictable() also checks for mlocked pages by testing an additional page
+flag, PG_mlocked (as wrapped by PageMlocked()). If the page is NOT mlocked,
+and a non-NULL VMA is supplied, page_evictable() will check whether the VMA is
+VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and
+update the appropriate statistics if the vma is VM_LOCKED. This method allows
+efficient "culling" of pages in the fault path that are being faulted in to
+VM_LOCKED VMAs.
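+
+Taken together, the checks described in this section amount to roughly the
+following (a simplified paraphrase for illustration; the real page_evictable()
+in mm/vmscan.c differs in detail):
+
+        int page_evictable(struct page *page, struct vm_area_struct *vma)
+        {
+                /* ramfs inode or SHM_LOCK'd segment: AS_UNEVICTABLE is set */
+                if (mapping_unevictable(page_mapping(page)))
+                        return 0;
+
+                /* already mlocked, or being faulted into a VM_LOCKED VMA
+                 * (in which case is_mlocked_vma() also sets PG_mlocked) */
+                if (PageMlocked(page) || (vma && is_mlocked_vma(vma, page)))
+                        return 0;
+
+                return 1;
+        }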
+
+
+VMSCAN'S HANDLING OF UNEVICTABLE PAGES
+--------------------------------------
+
+If unevictable pages are culled in the fault path, or moved to the unevictable
+list at mlock() or mmap() time, vmscan will not encounter the pages until they
+have become evictable again (via munlock() for example) and have been "rescued"
+from the unevictable list. However, there may be situations where we decide,
+for the sake of expediency, to leave an unevictable page on one of the regular
+active/inactive LRU lists for vmscan to deal with. vmscan checks for such
+pages in all of the shrink_{active|inactive|page}_list() functions and will
+"cull" such pages that it encounters: that is, it diverts those pages to the
+unevictable list for the zone being scanned.
+
+There may be situations where a page is mapped into a VM_LOCKED VMA, but the
+page is not marked as PG_mlocked. Such pages will make it all the way to
+shrink_page_list() where they will be detected when vmscan walks the reverse
+map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK,
+shrink_page_list() will cull the page at that point.
+
+To "cull" an unevictable page, vmscan simply puts the page back on the LRU list
+using putback_lru_page() - the inverse operation to isolate_lru_page() - after
+dropping the page lock. Because the condition which makes the page unevictable
+may change once the page is unlocked, putback_lru_page() will recheck the
+unevictable state of a page that it places on the unevictable list. If the
+page has become unevictable, putback_lru_page() removes it from the list and
+retries, including the page_unevictable() test. Because such a race is a rare
+event and movement of pages onto the unevictable list should be rare, these
+extra evictabilty checks should not occur in the majority of calls to
+putback_lru_page().
+
+
+=============
+MLOCKED PAGES
+=============
+
+The unevictable page list is also useful for mlock(), in addition to ramfs and
+SYSV SHM. Note that mlock() is only available in CONFIG_MMU=y situations; in
+NOMMU situations, all mappings are effectively mlocked.
+
+
+HISTORY
+-------
+
+The "Unevictable mlocked Pages" infrastructure is based on work originally
+posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU".
+Nick posted his patch as an alternative to a patch posted by Christoph Lameter
+to achieve the same objective: hiding mlocked pages from vmscan.
+
+In Nick's patch, he used one of the struct page LRU list link fields as a count
+of VM_LOCKED VMAs that map the page. This use of the link field for a count
+prevented the management of the pages on an LRU list, and thus mlocked pages
+were not migratable as isolate_lru_page() could not find them, and the LRU list
+link field was not available to the migration subsystem.
+
+Nick resolved this by putting mlocked pages back on the lru list before
+attempting to isolate them, thus abandoning the count of VM_LOCKED VMAs. When
+Nick's patch was integrated with the Unevictable LRU work, the count was
+replaced by walking the reverse map to determine whether any VM_LOCKED VMAs
+mapped the page. More on this below.
+
+
+BASIC MANAGEMENT
+----------------
+
+mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable
+pages. When such a page has been "noticed" by the memory management subsystem,
+the page is marked with the PG_mlocked flag. This can be manipulated using the
+PageMlocked() functions.
+
+A PG_mlocked page will be placed on the unevictable list when it is added to
+the LRU. Such pages can be "noticed" by memory management in several places:
+
+ (1) in the mlock()/mlockall() system call handlers;
+
+ (2) in the mmap() system call handler when mmapping a region with the
+ MAP_LOCKED flag;
+
+ (3) mmapping a region in a task that has called mlockall() with the MCL_FUTURE
+ flag
+
+ (4) in the fault path, if mlocked pages are "culled" in the fault path,
+ and when a VM_LOCKED stack segment is expanded; or
+
+ (5) as mentioned above, in vmscan:shrink_page_list() when attempting to
+ reclaim a page in a VM_LOCKED VMA via try_to_unmap()
+
+all of which result in the VM_LOCKED flag being set for the VMA if it doesn't
+already have it set.
+
+mlocked pages become unlocked and rescued from the unevictable list when:
+
+ (1) mapped in a range unlocked via the munlock()/munlockall() system calls;
+
+ (2) munmap()'d out of the last VM_LOCKED VMA that maps the page, including
+ unmapping at task exit;
+
+ (3) when the page is truncated from the last VM_LOCKED VMA of an mmapped file;
+ or
+
+ (4) before a page is COW'd in a VM_LOCKED VMA.
+
+
+mlock()/mlockall() SYSTEM CALL HANDLING
+---------------------------------------
+
+Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup()
+for each VMA in the range specified by the call. In the case of mlockall(),
+this is the entire active address space of the task. Note that mlock_fixup()
+is used for both mlocking and munlocking a range of memory. A call to mlock()
+an already VM_LOCKED VMA, or to munlock() a VMA that is not VM_LOCKED is
+treated as a no-op, and mlock_fixup() simply returns.
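+
+For reference, the userspace calls that enter these handlers look like the
+following (a minimal sketch in the spirit of the example programs in this
+directory; error handling and RLIMIT_MEMLOCK limits are ignored):
+
+        #include <stdlib.h>
+        #include <string.h>
+        #include <sys/mman.h>
+
+        #define BUF_SIZE (4UL * 1024 * 1024)
+
+        int main(void)
+        {
+                char *buf = malloc(BUF_SIZE);
+
+                /* mlock_fixup() marks the covering VMA(s) VM_LOCKED and
+                 * __mlock_vma_pages_range() faults the pages in */
+                mlock(buf, BUF_SIZE);
+                memset(buf, 0, BUF_SIZE);
+
+                /* munlock() goes through the same mlock_fixup() path with
+                 * an unlock flag; see the munlock section below */
+                munlock(buf, BUF_SIZE);
+
+                /* lock everything currently mapped and everything mapped
+                 * in the future (MCL_FUTURE) */
+                mlockall(MCL_CURRENT | MCL_FUTURE);
+                munlockall();
+
+                free(buf);
+                return 0;
+        }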
+
+If the VMA passes some filtering as described in "Filtering Special Vmas"
+below, mlock_fixup() will attempt to merge the VMA with its neighbors or split
+off a subset of the VMA if the range does not cover the entire VMA. Once the
+VMA has been merged or split or neither, mlock_fixup() will call
+__mlock_vma_pages_range() to fault in the pages via get_user_pages() and to
+mark the pages as mlocked via mlock_vma_page().
+
+Note that the VMA being mlocked might be mapped with PROT_NONE. In this case,
+get_user_pages() will be unable to fault in the pages. That's okay. If pages
+do end up getting faulted into this VM_LOCKED VMA, we'll handle them in the
+fault path or in vmscan.
+
+Also note that a page returned by get_user_pages() could be truncated or
+migrated out from under us, while we're trying to mlock it. To detect this,
+__mlock_vma_pages_range() checks page_mapping() after acquiring the page lock.
+If the page is still associated with its mapping, we'll go ahead and call
+mlock_vma_page(). If the mapping is gone, we just unlock the page and move on.
+In the worst case, this will result in a page mapped in a VM_LOCKED VMA
+remaining on a normal LRU list without being PageMlocked(). Again, vmscan will
+detect and cull such pages.
+
+mlock_vma_page() will call TestSetPageMlocked() for each page returned by
+get_user_pages(). We use TestSetPageMlocked() because the page might already
+be mlocked by another task/VMA and we don't want to do extra work. We
+especially do not want to count an mlocked page more than once in the
+statistics. If the page was already mlocked, mlock_vma_page() need do nothing
+more.
+
+If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
+page from the LRU, as it is likely on the appropriate active or inactive list
+at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put
+back the page - by calling putback_lru_page() - which will notice that the page
+is now mlocked and divert the page to the zone's unevictable list. If
+mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
+it later if and when it attempts to reclaim the page.
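+
+The flow described in the last two paragraphs can be summarised by the
+following sketch (a simplified paraphrase, not the exact code in mm/mlock.c;
+the statistics update is indicated only by a comment):
+
+        void mlock_vma_page(struct page *page)
+        {
+                /* the caller holds the page lock */
+                BUG_ON(!PageLocked(page));
+
+                if (!TestSetPageMlocked(page)) {
+                        /* first time the page is seen mlocked: update the
+                         * mlocked page statistics here */
+                        if (!isolate_lru_page(page))
+                                /* putback_lru_page() notices PG_mlocked and
+                                 * diverts the page to the unevictable list */
+                                putback_lru_page(page);
+                        /* else: leave the page; vmscan will cull it if it
+                         * ever tries to reclaim it */
+                }
+        }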
+
+
+FILTERING SPECIAL VMAS
+----------------------
+
+mlock_fixup() filters several classes of "special" VMAs:
+
+1) VMAs with VM_IO or VM_PFNMAP set are skipped entirely. The pages behind
+ these mappings are inherently pinned, so we don't need to mark them as
+ mlocked. In any case, most of the pages have no struct page in which to so
+ mark the page. Because of this, get_user_pages() will fail for these VMAs,
+ so there is no sense in attempting to visit them.
+
+2) VMAs mapping hugetlbfs pages are already effectively pinned into memory. We
+ neither need nor want to mlock() these pages. However, to preserve the
+ prior behavior of mlock() - before the unevictable/mlock changes -
+ mlock_fixup() will call make_pages_present() in the hugetlbfs VMA range to
+ allocate the huge pages and populate the ptes.
+
+3) VMAs with VM_DONTEXPAND or VM_RESERVED are generally userspace mappings of
+ kernel pages, such as the VDSO page, relay channel pages, etc. These pages
+ are inherently unevictable and are not managed on the LRU lists.
+ mlock_fixup() treats these VMAs the same as hugetlbfs VMAs. It calls
+ make_pages_present() to populate the ptes.
+
+Note that for all of these special VMAs, mlock_fixup() does not set the
+VM_LOCKED flag. Therefore, we won't have to deal with them later during
+munlock(), munmap() or task exit. Neither does mlock_fixup() account these
+VMAs against the task's "locked_vm".
+
+
+munlock()/munlockall() SYSTEM CALL HANDLING
+-------------------------------------------
+
+The munlock() and munlockall() system calls are handled by the same functions -
+do_mlock[all]() - as the mlock() and mlockall() system calls with the unlock vs
+lock operation indicated by an argument. So, these system calls are also
+handled by mlock_fixup(). Again, if called for an already munlocked VMA,
+mlock_fixup() simply returns. Because of the VMA filtering discussed above,
+VM_LOCKED will not be set in any "special" VMAs. So, these VMAs will be
+ignored for munlock.
+
+If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the
+specified range. The range is then munlocked via the function
+__mlock_vma_pages_range() - the same function used to mlock a VMA range -
+passing a flag to indicate that munlock() is being performed.
+
+Because the VMA access protections could have been changed to PROT_NONE after
+faulting in and mlocking pages, get_user_pages() was unreliable for visiting
+these pages for munlocking. Because we don't want to leave pages mlocked,
+get_user_pages() was enhanced to accept a flag to ignore the permissions when
+fetching the pages - all of which should be resident as a result of previous
+mlocking.
+
+For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling
+munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked
+flag using TestClearPageMlocked(). As with mlock_vma_page(),
+munlock_vma_page() uses the Test*PageMlocked() function to handle the case where
+the page might have already been unlocked by another task. If the page was
+mlocked, munlock_vma_page() updates the zone statistics for the number of
+mlocked pages. Note, however, that at this point we haven't checked whether
+the page is mapped by other VM_LOCKED VMAs.
+
+We can't call try_to_munlock(), the function that walks the reverse map to
+check for other VM_LOCKED VMAs, without first isolating the page from the LRU.
+try_to_munlock() is a variant of try_to_unmap() and thus requires that the page
+not be on an LRU list [more on these below]. However, the call to
+isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). So,
+we go ahead and clear PG_mlocked up front, as this might be the only chance we
+have. If we can successfully isolate the page, we go ahead and
+try_to_munlock(), which will restore the PG_mlocked flag and update the zone
+page statistics if it finds another VMA holding the page mlocked. If we fail
+to isolate the page, we'll have left a potentially mlocked page on the LRU.
+This is fine, because we'll catch it later if and when vmscan tries to reclaim
+the page. This should be relatively rare.
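+
+The munlock_vma_page() behaviour described above can likewise be sketched as
+follows (again a simplified paraphrase rather than the exact kernel code):
+
+        void munlock_vma_page(struct page *page)
+        {
+                /* clear PG_mlocked up front - this may be our only chance */
+                if (TestClearPageMlocked(page)) {
+                        /* update the mlocked page statistics here */
+                        if (!isolate_lru_page(page)) {
+                                /* re-sets PG_mlocked if some other VM_LOCKED
+                                 * VMA still maps the page */
+                                try_to_munlock(page);
+                                putback_lru_page(page);
+                        }
+                        /* else: a potentially mlocked page stays on the
+                         * regular LRU; vmscan will recheck it on reclaim */
+                }
+        }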
+
+
+MIGRATING MLOCKED PAGES
+-----------------------
+
+A page that is being migrated has been isolated from the LRU lists and is held
+locked across unmapping of the page, updating the page's address space entry
+and copying the contents and state, until the page table entry has been
+replaced with an entry that refers to the new page. Linux supports migration
+of mlocked pages and other unevictable pages. This involves simply moving the
+PG_mlocked and PG_unevictable states from the old page to the new page.
+
+Note that page migration can race with mlocking or munlocking of the same page.
+This has been discussed from the mlock/munlock perspective in the respective
+sections above. Both processes (migration and m[un]locking) hold the page
+locked. This provides the first level of synchronization. Page migration
+zeros out the page_mapping of the old page before unlocking it, so m[un]lock
+can skip these pages by testing the page mapping under page lock.
+
+To complete page migration, we place the new and old pages back onto the LRU
+after dropping the page lock. The "unneeded" page - old page on success, new
+page on failure - will be freed when the reference count held by the migration
+process is released. To ensure that we don't strand pages on the unevictable
+list because of a race between munlock and migration, page migration uses the
+putback_lru_page() function to add migrated pages back to the LRU.
+
+
+mmap(MAP_LOCKED) SYSTEM CALL HANDLING
+-------------------------------------
+
+In addition to the mlock()/mlockall() system calls, an application can request
+that a region of memory be mlocked by supplying the MAP_LOCKED flag to the mmap()
+call. Furthermore, any mmap() call or brk() call that expands the heap by a
+task that has previously called mlockall() with the MCL_FUTURE flag will result
+in the newly mapped memory being mlocked. Before the unevictable/mlock
+changes, the kernel simply called make_pages_present() to allocate pages and
+populate the page table.
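+
+From userspace, the two cases described above look like the following (a
+minimal illustrative sketch; error handling is omitted):
+
+        #include <sys/mman.h>
+
+        #define MAP_SIZE (1UL * 1024 * 1024)
+
+        int main(void)
+        {
+                /* case 1: request an mlocked mapping at mmap() time */
+                void *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
+                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED,
+                               -1, 0);
+
+                /* case 2: with MCL_FUTURE in force, any later mapping or
+                 * heap expansion is mlocked automatically */
+                mlockall(MCL_FUTURE);
+                void *q = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
+                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+
+                munmap(p, MAP_SIZE);
+                munmap(q, MAP_SIZE);
+                return 0;
+        }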
+
+To mlock a range of memory under the unevictable/mlock infrastructure, the
+mmap() handler and task address space expansion functions call
+mlock_vma_pages_range() specifying the vma and the address range to mlock.
+mlock_vma_pages_range() filters VMAs like mlock_fixup(), as described above in
+"Filtering Special VMAs". It will clear the VM_LOCKED flag, which will have
+already been set by the caller, in filtered VMAs. Thus these VMAs need not be
+visited for munlock when the region is unmapped.
+
+For "normal" VMAs, mlock_vma_pages_range() calls __mlock_vma_pages_range() to
+fault/allocate the pages and mlock them. Again, like mlock_fixup(),
+mlock_vma_pages_range() downgrades the mmap semaphore to read mode before
+attempting to fault/allocate and mlock the pages and "upgrades" the semaphore
+back to write mode before returning.
+
+The callers of mlock_vma_pages_range() will have already added the memory range
+to be mlocked to the task's "locked_vm". To account for filtered VMAs,
+mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the
+callers then subtract a non-negative return value from the task's locked_vm. A
+negative return value represents an error - for example, from get_user_pages()
+attempting to fault in a VMA with PROT_NONE access. In this case, we leave the
+memory range accounted as locked_vm, as the protections could be changed later
+and pages allocated into that region.
+
+
+munmap()/exit()/exec() SYSTEM CALL HANDLING
+-------------------------------------------
+
+When unmapping an mlocked region of memory, whether by an explicit call to
+munmap() or via an internal unmap from exit() or exec() processing, we must
+munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages.
+Before the unevictable/mlock changes, mlocking did not mark the pages in any
+way, so unmapping them required no processing.
+
+To munlock a range of memory under the unevictable/mlock infrastructure, the
+munmap() handler and task address space tear-down function call
+munlock_vma_pages_all(). The name reflects the observation that one always
+specifies the entire VMA range when munlock()ing during unmap of a region.
+Because of the VMA filtering when mlock()ing regions, only "normal" VMAs that
+actually contain mlocked pages will be passed to munlock_vma_pages_all().
+
+munlock_vma_pages_all() clears the VM_LOCKED VMA flag and, like mlock_fixup()
+for the munlock case, calls __munlock_vma_pages_range() to walk the page table
+for the VMA's memory range and munlock_vma_page() each resident page mapped by
+the VMA. This effectively munlocks the page, but only if this is the last
+VM_LOCKED VMA that maps the page.
+
+
+try_to_unmap()
+--------------
+
+Pages can, of course, be mapped into multiple VMAs. Some of these VMAs may
+have VM_LOCKED flag set. It is possible for a page mapped into one or more
+VM_LOCKED VMAs not to have the PG_mlocked flag set and therefore reside on one
+of the active or inactive LRU lists. This could happen if, for example, a task
+in the process of munlocking the page could not isolate the page from the LRU.
+As a result, vmscan/shrink_page_list() might encounter such a page as described
+in section "vmscan's handling of unevictable pages". To handle this situation,
+try_to_unmap() checks for VM_LOCKED VMAs while it is walking a page's reverse
+map.
+
+try_to_unmap() is always called, by either vmscan (for reclaim) or by page
+migration, with the argument page locked and isolated from the LRU. Separate
+functions handle anonymous and mapped file pages, as these types of pages have
+different reverse map mechanisms.
+
+ (*) try_to_unmap_anon()
+
+ To unmap anonymous pages, each VMA in the list anchored in the anon_vma
+ must be visited - at least until a VM_LOCKED VMA is encountered. If the
+ page is being unmapped for migration, VM_LOCKED VMAs do not stop the
+ process because mlocked pages are migratable. However, for reclaim, if
+ the page is mapped into a VM_LOCKED VMA, the scan stops.
+
+     try_to_unmap_anon() attempts to acquire in read mode the mmap semaphore of
+ the mm_struct to which the VMA belongs. If this is successful, it will
+ mlock the page via mlock_vma_page() - we wouldn't have gotten to
+ try_to_unmap_anon() if the page were already mlocked - and will return
+ SWAP_MLOCK, indicating that the page is unevictable.
+
+ If the mmap semaphore cannot be acquired, we are not sure whether the page
+ is really unevictable or not. In this case, try_to_unmap_anon() will
+ return SWAP_AGAIN.
+
+ (*) try_to_unmap_file() - linear mappings
+
+ Unmapping of a mapped file page works the same as for anonymous mappings,
+ except that the scan visits all VMAs that map the page's index/page offset
+ in the page's mapping's reverse map priority search tree. It also visits
+ each VMA in the page's mapping's non-linear list, if the list is
+ non-empty.
+
+ As for anonymous pages, on encountering a VM_LOCKED VMA for a mapped file
+ page, try_to_unmap_file() will attempt to acquire the associated
+ mm_struct's mmap semaphore to mlock the page, returning SWAP_MLOCK if this
+ is successful, and SWAP_AGAIN, if not.
+
+ (*) try_to_unmap_file() - non-linear mappings
+
+ If a page's mapping contains a non-empty non-linear mapping VMA list, then
+ try_to_un{map|lock}() must also visit each VMA in that list to determine
+ whether the page is mapped in a VM_LOCKED VMA. Again, the scan must visit
+     all VMAs in the non-linear list to ensure that the page is not/should not
+ be mlocked.
+
+ If a VM_LOCKED VMA is found in the list, the scan could terminate.
+ However, there is no easy way to determine whether the page is actually
+ mapped in a given VMA - either for unmapping or testing whether the
+ VM_LOCKED VMA actually pins the page.
+
+ try_to_unmap_file() handles non-linear mappings by scanning a certain
+ number of pages - a "cluster" - in each non-linear VMA associated with the
+ page's mapping, for each file mapped page that vmscan tries to unmap. If
+ this happens to unmap the page we're trying to unmap, try_to_unmap() will
+ notice this on return (page_mapcount(page) will be 0) and return
+ SWAP_SUCCESS. Otherwise, it will return SWAP_AGAIN, causing vmscan to
+ recirculate this page. We take advantage of the cluster scan in
+ try_to_unmap_cluster() as follows:
+
+ For each non-linear VMA, try_to_unmap_cluster() attempts to acquire the
+ mmap semaphore of the associated mm_struct for read without blocking.
+
+ If this attempt is successful and the VMA is VM_LOCKED,
+ try_to_unmap_cluster() will retain the mmap semaphore for the scan;
+ otherwise it drops it here.
+
+ Then, for each page in the cluster, if we're holding the mmap semaphore
+ for a locked VMA, try_to_unmap_cluster() calls mlock_vma_page() to
+ mlock the page. This call is a no-op if the page is already locked,
+ but will mlock any pages in the non-linear mapping that happen to be
+ unlocked.
+
+ If one of the pages so mlocked is the page passed in to try_to_unmap(),
+ try_to_unmap_cluster() will return SWAP_MLOCK, rather than the default
+ SWAP_AGAIN. This will allow vmscan to cull the page, rather than
+ recirculating it on the inactive list.
+
+ Again, if try_to_unmap_cluster() cannot acquire the VMA's mmap sem, it
+ returns SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED
+ VMA, but couldn't be mlocked.
+
+
+try_to_munlock() REVERSE MAP SCAN
+---------------------------------
+
+ [!] TODO/FIXME: a better name might be page_mlocked() - analogous to the
+ page_referenced() reverse map walker.
+
+When munlock_vma_page() [see section "munlock()/munlockall() System Call
+Handling" above] tries to munlock a page, it needs to determine whether or not
+the page is mapped by any VM_LOCKED VMA without actually attempting to unmap
+all PTEs from the page. For this purpose, the unevictable/mlock infrastructure
+introduced a variant of try_to_unmap() called try_to_munlock().
+
+try_to_munlock() calls the same functions as try_to_unmap() for anonymous and
+mapped file pages with an additional argument specifying unlock versus unmap
+processing. Again, these functions walk the respective reverse maps looking
+for VM_LOCKED VMAs. When such a VMA is found for anonymous pages and file
+pages mapped in linear VMAs, as in the try_to_unmap() case, the functions
+attempt to acquire the associated mmap semaphore, mlock the page via
+mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the
+pre-clearing of the page's PG_mlocked done by munlock_vma_page().
+
+If try_to_unmap() is unable to acquire a VM_LOCKED VMA's associated mmap
+semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() to
+recycle the page on the inactive list and hope that it has better luck with the
+page next time.
+
+For file pages mapped into non-linear VMAs, the try_to_munlock() logic works
+slightly differently. On encountering a VM_LOCKED non-linear VMA that might
+map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking the
+page. munlock_vma_page() will just leave the page unlocked and let vmscan deal
+with it - the usual fallback position.
+
+Note that try_to_munlock()'s reverse map walk must visit every VMA in a page's
+reverse map to determine that a page is NOT mapped into any VM_LOCKED VMA.
+However, the scan can terminate when it encounters a VM_LOCKED VMA and can
+successfully acquire the VMA's mmap semaphore for read and mlock the page.
+Although try_to_munlock() might be called a great many times when munlocking a
+large region or tearing down a large address space that has been mlocked via
+mlockall(), overall this is a fairly rare event.
+
+
+PAGE RECLAIM IN shrink_*_list()
+-------------------------------
+
+shrink_active_list() culls any obviously unevictable pages - i.e.
+!page_evictable(page, NULL) - diverting these to the unevictable list.
+However, shrink_active_list() only sees unevictable pages that made it onto the
+active/inactive lru lists. Note that these pages do not have PageUnevictable
+set - otherwise they would be on the unevictable list and shrink_active_list
+would never see them.
+
+Some examples of these unevictable pages on the LRU lists are:
+
+ (1) ramfs pages that have been placed on the LRU lists when first allocated.
+
+ (2) SHM_LOCK'd shared memory pages. shmctl(SHM_LOCK) does not attempt to
+ allocate or fault in the pages in the shared memory region. This happens
+ when an application accesses the page the first time after SHM_LOCK'ing
+ the segment.
+
+ (3) mlocked pages that could not be isolated from the LRU and moved to the
+ unevictable list in mlock_vma_page().
+
+ (4) Pages mapped into multiple VM_LOCKED VMAs, but try_to_munlock() couldn't
+ acquire the VMA's mmap semaphore to test the flags and set PageMlocked.
+ munlock_vma_page() was forced to let the page back on to the normal LRU
+ list for vmscan to handle.
+
+shrink_inactive_list() also diverts any unevictable pages that it finds on the
+inactive lists to the appropriate zone's unevictable list.
+
+shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
+after shrink_active_list() had moved them to the inactive list, or pages mapped
+into VM_LOCKED VMAs that munlock_vma_page() couldn't isolate from the LRU to
+recheck via try_to_munlock(). shrink_inactive_list() won't notice the latter,
+but will pass them on to shrink_page_list().
+
+shrink_page_list() again culls obviously unevictable pages that it could
+encounter, for similar reasons to shrink_inactive_list(). Pages mapped into
+VM_LOCKED VMAs but without PG_mlocked set will make it all the way to
+try_to_unmap(). shrink_page_list() will divert them to the unevictable list
+when try_to_unmap() returns SWAP_MLOCK, as discussed above.