path: root/ipc
AgeCommit message (Collapse)Author
2010-05-27ipc/sem.c: optimize update_queue() for bulk wakeup callsManfred Spraul
The following series of patches tries to fix the spinlock contention reported by Chris Mason - his benchmark exposes problems of the current code: - In the worst case, the algorithm used by update_queue() is O(N^2). Bulk wake-up calls can enter this worst case. The patch series fix that. Note that the benchmark app doesn't expose the problem, it just should be fixed: Real world apps might do the wake-ups in another order than perfect FIFO. - The part of the code that runs within the semaphore array spinlock is significantly larger than necessary. The patch series fixes that. This change is responsible for the main improvement. - The cacheline with the spinlock is also used for a variable that is read in the hot path (sem_base) and for a variable that is unnecessarily written to multiple times (sem_otime). The last step of the series cacheline-aligns the spinlock. This patch: The SysV semaphore code allows to perform multiple operations on all semaphores in the array as atomic operations. After a modification, update_queue() checks which of the waiting tasks can complete. The algorithm that is used to identify the tasks is O(N^2) in the worst case. For some cases, it is simple to avoid the O(N^2). The patch adds a detection logic for some cases, especially for the case of an array where all sleeping tasks are single sembuf operations and a multi-sembuf operation is used to wake up multiple tasks. A big database application uses that approach. The patch fixes wakeup due to semctl(,,SETALL,) - the initial version of the patch breaks that. [ make do_smart_update() static] Signed-off-by: Manfred Spraul <> Cc: Chris Mason <> Cc: Zach Brown <> Cc: Jens Axboe <> Cc: Nick Piggin <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2010-05-25kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, ↵Alexey Dobriyan
SHRT_MAX and SHRT_MIN - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not USHORT_MAX/SHORT_MAX/SHORT_MIN. - Make SHRT_MIN of type s16, not int, for consistency. [ fix drivers/dma/timb_dma.c] [ fix security/keys/keyring.c] Signed-off-by: Alexey Dobriyan <> Acked-by: WANG Cong <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2010-05-19Merge branch 'timers-for-linus' of ↵Linus Torvalds
git:// * 'timers-for-linus' of git:// clocksource: Add clocksource_register_hz/khz interface posix-cpu-timers: Optimize run_posix_cpu_timers() time: Remove xtime_cache mqueue: Convert message queue timeout to use hrtimers hrtimers: Provide schedule_hrtimeout for CLOCK_REALTIME timers: Introduce the concept of timer slack for legacy timers ntp: Remove tickadj ntp: Make time_adjust static time: Add xtime, wall_to_monotonic to feature-removal-schedule timer: Try to survive timer callback preempt_count leak timer: Split out timer function call timer: Print function name for timer callbacks modifying preemption count time: Clean up warp_clock() cpu-timers: Avoid iterating over all threads in fastpath_timer_check() cpu-timers: Change SIGEV_NONE timer implementation cpu-timers: Return correct previous timer reload value cpu-timers: Cleanup arm_timer() cpu-timers: Simplify RLIMIT_CPU handling
2010-05-11mqueue: fix kernel BUG caused by double free() on mq_open()André Goddard Rosa
In case of aborting because we reach the maximum amount of memory which can be allocated to message queues per user (RLIMIT_MSGQUEUE), we would try to free the message area twice when bailing out: first by the error handling code itself, and then later when cleaning up the inode through delete_inode(). Signed-off-by: André Goddard Rosa <> Cc: Alexey Dobriyan <> Cc: Al Viro <> Cc: <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2010-05-10Merge branch 'linus' into timers/coreThomas Gleixner
Reason: Further posix_cpu_timer patches depend on mainline changes Signed-off-by: Thomas Gleixner <>
2010-04-06mqueue: Convert message queue timeout to use hrtimersCarsten Emde
The message queue functions mq_timedsend() and mq_timedreceive() have not yet been converted to use the hrtimer interface. This patch replaces the call to schedule_timeout() by a call to schedule_hrtimeout() and transforms the expiration time from timespec to ktime as required. [ tglx: Fixed whitespace wreckage ] Signed-off-by: Carsten Emde <> Tested-by: Pradyumna Sampath <> Cc: Arjan van de Veen <> Cc: Andrew Morton <> LKML-Reference: <> Signed-off-by: Thomas Gleixner <>
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo
implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <> Guess-its-ok-by: Christoph Lameter <> Cc: Ingo Molnar <> Cc: Lee Schermerhorn <>
2010-03-22ppc64 sys_ipc breakage in 2.6.34-rc2Anton Blanchard
I chased down a fail on ppc64 on 2.6.34-rc2 where an application that uses shared memory was getting a SEGV. Commit baed7fc9b580bd3fb8252ff1d9b36eaf1f86b670 ("Add generic sys_ipc wrapper") changed the second argument from an unsigned long to an int. When we call shmget the system call wrappers for sys_ipc will sign extend second (ie the size) which truncates it. It took a while to track down because the call succeeds and strace shows the untruncated size :) The patch below changes second from an int to an unsigned long which fixes shmget on ppc64 (and I assume s390, sparc64 and mips64). Signed-off-by: Anton Blanchard <> -- I assume the function prototypes for the other IPC methods would cause us to sign or zero extend second where appropriate (avoiding any security issues). Come to think of it, the syscall wrappers for each method should do that for us as well. Signed-off-by: Linus Torvalds <>
2010-03-12ipc: use rlimit helpersJiri Slaby
Make sure compiler won't do weird things with limits. E.g. fetching them twice may return 2 different values after writable limits are implemented. I.e. either use rlimit helpers added in 3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for fetching rlimits") or ACCESS_ONCE if not applicable. Signed-off-by: Jiri Slaby <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2010-03-12Add generic sys_ipc wrapperChristoph Hellwig
Add a generic implementation of the ipc demultiplexer syscall. Except for s390 and sparc64 all implementations of the sys_ipc are nearly identical. There are slight differences in the types of the parameters, where mips and powerpc as the only 64-bit architectures with sys_ipc use unsigned long for the "third" argument as it gets casted to a pointer later, while it traditionally is an "int" like most other paramters. frv goes even further and uses unsigned long for all parameters execept for "ptr" which is a pointer type everywhere. The change from int to unsigned long for "third" and back to "int" for the others on frv should be fine due to the in-register calling conventions for syscalls (we already had a similar issue with the generic sys_ptrace), but I'd prefer to have the arch maintainers looks over this in details. Except for that h8300, m68k and m68knommu lack an impplementation of the semtimedop sub call which this patch adds, and various architectures have gets used - at least on i386 it seems superflous as the compat code on x86-64 and ia64 doesn't even bother to implement it. [ add sys_ipc to sys_ni.c] Signed-off-by: Christoph Hellwig <> Cc: Ralf Baechle <> Cc: Benjamin Herrenschmidt <> Cc: Paul Mundt <> Cc: Jeff Dike <> Cc: Hirokazu Takata <> Cc: Thomas Gleixner <> Cc: Ingo Molnar <> Reviewed-by: H. Peter Anvin <> Cc: Al Viro <> Cc: Arnd Bergmann <> Cc: Heiko Carstens <> Cc: Martin Schwidefsky <> Cc: "Luck, Tony" <> Cc: James Morris <> Cc: Andreas Schwab <> Acked-by: Jesper Nilsson <> Acked-by: Russell King <> Acked-by: David Howells <> Acked-by: Kyle McMartin <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2010-03-03mqueue: fix typo "failues" -> "failures"André Goddard Rosa
Signed-off-by: André Goddard Rosa <> Signed-off-by: Al Viro <>
2010-03-03mqueue: only set error codes if they are really necessaryAndré Goddard Rosa
... postponing assignments until they're needed. Doesn't change code size. Signed-off-by: André Goddard Rosa <> Signed-off-by: Al Viro <>
2010-03-03mqueue: simplify do_open() error handlingAndré Goddard Rosa
It reduces code size: text data bss dec hex filename 9925 72 16 10013 271d ipc/mqueue-BEFORE.o 9885 72 16 9973 26f5 ipc/mqueue-AFTER.o Signed-off-by: André Goddard Rosa <> Signed-off-by: Al Viro <>
2010-03-03mqueue: apply mathematics distributivity on mq_bytes calculationAndré Goddard Rosa
Code size reduction: text data bss dec hex filename 9941 72 16 10029 272d ipc/mqueue-BEFORE.o 9925 72 16 10013 271d ipc/mqueue-AFTER.o Signed-off-by: André Goddard Rosa <> Signed-off-by: Al Viro <>
2010-03-03mqueue: remove unneeded info->messages initializationAndré Goddard Rosa
... and abort earlier if we couldn't allocate the message pointers array, avoiding the u->mq_bytes accounting logic. It reduces code size: text data bss dec hex filename 9949 72 16 10037 2735 ipc/mqueue-BEFORE.o 9941 72 16 10029 272d ipc/mqueue-AFTER.o Signed-off-by: André Goddard Rosa <> Signed-off-by: Al Viro <>
2010-03-03mqueue: fix mq_open() file descriptor leak on user-space processesAndré Goddard Rosa
We leak fd on lookup_one_len() failure Signed-off-by: André Goddard Rosa <> Signed-off-by: Al Viro <>
2010-01-16nommu: fix SYSV SHM for NOMMUDavid Howells
Commit c4caa778157dbbf04116f0ac2111e389b5cd7a29 ("file ->get_unmapped_area() shouldn't duplicate work of get_unmapped_area()") broke SYSV SHM for NOMMU by taking away the pointer to shm_get_unmapped_area() from shm_file_operations. Put it back conditionally on CONFIG_MMU=n. file->f_ops->get_unmapped_area() is used to find out the base address for a mapping of a mappable chardev device or mappable memory-based file (such as a ramfs file). It needs to be called prior to file->f_ops->mmap() being called. Signed-off-by: David Howells <> Acked-by: Al Viro <> Cc: Greg Ungerer <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-12-16Merge branch 'master' of ↵Linus Torvalds
git:// * 'master' of git:// (38 commits) direct I/O fallback sync simplification ocfs: stop using do_sync_mapping_range cleanup blockdev_direct_IO locking make generic_acl slightly more generic sanitize xattr handler prototypes libfs: move EXPORT_SYMBOL for d_alloc_name vfs: force reval of target when following LAST_BIND symlinks (try #7) ima: limit imbalance msg Untangling ima mess, part 3: kill dead code in ima Untangling ima mess, part 2: deal with counters Untangling ima mess, part 1: alloc_file() O_TRUNC open shouldn't fail after file truncation ima: call ima_inode_free ima_inode_free IMA: clean up the IMA counts updating code ima: only insert at inode creation time ima: valid return code from ima_inode_alloc fs: move get_empty_filp() deffinition to internal.h Sanitize exec_permission_lite() Kill cached_lookup() and real_lookup() Kill path_lookup_open() ... Trivial conflicts in fs/direct-io.c
2009-12-16Untangling ima mess, part 2: deal with countersAl Viro
* do ima_get_count() in __dentry_open() * stop doing that in followups * move ima_path_check() to right after nameidata_to_filp() * don't bump counters on it Signed-off-by: Al Viro <>
2009-12-16Untangling ima mess, part 1: alloc_file()Al Viro
There are 2 groups of alloc_file() callers: * ones that are followed by ima_counts_get * ones giving non-regular files So let's pull that ima_counts_get() into alloc_file(); it's a no-op in case of non-regular files. Signed-off-by: Al Viro <>
2009-12-16switch alloc_file() to passing struct pathAl Viro
... and have the caller grab both mnt and dentry; kill leak in infiniband, while we are at it. Signed-off-by: Al Viro <>
2009-12-16ipc: remove unreachable code in sem.cAmerigo Wang
This line is unreachable, remove it. [ remove unneeded initialisation of `err'] Signed-off-by: WANG Cong <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-12-16ipc/sem.c: optimize single sops when semval is zeroManfred Spraul
If multiple simple decrements on the same semaphore are pending, then the current code scans all decrement operations, even if the semaphore value is already 0. The patch optimizes that: if the semaphore value is 0, then there is no need to scan the q->alter entries. Note that this is a common case: It happens if 100 decrements by one are pending and now an increment by one increases the semaphore value from 0 to 1. Without this patch, all 100 entries are scanned. With the patch, only one entry is scanned, then woken up. Then the new rule triggers and the scanning is aborted, without looking at the remaining 99 tasks. With this patch, single sop increment/decrement by 1 are now O(1). (same as with Nick's patch) Signed-off-by: Manfred Spraul <> Cc: Nick Piggin <> Cc: Pierre Peiffer <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-12-16ipc/sem.c: optimize single semop operationsManfred Spraul
sysv sem has the concept of semaphore arrays that consist out of multiple semaphores. Atomic operations that affect multiple semaphores are supported. The patch optimizes single semaphore operation calls that affect only one semaphore: It's not necessary to scan all pending operations, it is sufficient to scan the per-semaphore list. The idea is from Nick Piggin version of an ipc sem improvement, the implementation is different: The code tries to keep as much common code as possible. As the result, the patch is simpler, but optimizes fewer cases. Signed-off-by: Manfred Spraul <> Cc: Nick Piggin <> Cc: Pierre Peiffer <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-12-16ipc/sem.c: add a per-semaphore pending listManfred Spraul
Based on Nick's findings: sysv sem has the concept of semaphore arrays that consist out of multiple semaphores. Atomic operations that affect multiple semaphores are supported. The patch is the first step for optimizing simple, single semaphore operations: In addition to the global list of all pending operations, a 2nd, per-semaphore list with the simple operations is added. Note: this patch does not make sense by itself, the new list is used nowhere. Signed-off-by: Manfred Spraul <> Cc: Nick Piggin <> Cc: Pierre Peiffer <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-12-16ipc/sem.c: optimize if semops failManfred Spraul
Reduce the amount of scanning of the list of pending semaphore operations: If try_atomic_semop failed, then no changes were applied. Thus no need to restart. Additionally, this patch correct an incorrect comment: It's possible to wait for arbitrary semaphore values (do a dec by <x>, wait-for-zero, inc by <x> in one atomic operation) Both changes are from Nick Piggin, the patch is the result of a different split of the individual changes. Signed-off-by: Manfred Spraul <> Cc: Nick Piggin <> Cc: Pierre Peiffer <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-12-16ipc/sem.c: sem preempt improveNick Piggin
The strange sysv semaphore wakeup scheme has a kind of busy-wait lock involved, which could deadlock if preemption is enabled during the "lock". It is an implementation detail (due to a spinlock being held) that this is actually the case. However if "spinlocks" are made preemptible, or if the sem lock is changed to a sleeping lock for example, then the wakeup would become buggy. So this might be a bugfix for -rt kernels. Imagine waker being preempted by wakee and never clearing IN_WAKEUP -- if wakee has higher RT priority then there is a priority inversion deadlock. Even if there is not a priority inversion to cause a deadlock, then there is still time wasted spinning. Signed-off-by: Nick Piggin <> Signed-off-by: Manfred Spraul <> Cc: Pierre Peiffer <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-12-16ipc/sem.c: sem use list operationsNick Piggin
Replace the handcoded list operations in update_queue() with the standard list_for_each_entry macros. list_for_each_entry_safe() must be used, because list entries can disappear immediately uppon the wakeup event. Signed-off-by: Nick Piggin <> Signed-off-by: Manfred Spraul <> Cc: Pierre Peiffer <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-12-16ipc/sem.c: sem optimise undo list searchNick Piggin
Around a month ago, there was some discussion about an improvement of the sysv sem algorithm: Most (at least: some important) users only use simple semaphore operations, therefore it's worthwile to optimize this use case. This patch: Move last looked up sem_undo struct to the head of the task's undo list. Attempt to move common entries to the front of the list so search time is reduced. This reduces lookup_undo on oprofile of problematic SAP workload by 30% (see patch 4 for a description of SAP workload). Signed-off-by: Nick Piggin <> Signed-off-by: Manfred Spraul <> Cc: Pierre Peiffer <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-12-16ipc ns: fix memory leak (idr)Serge E. Hallyn
We have apparently had a memory leak since 7ca7e564e049d8b350ec9d958ff25eaa24226352 "ipc: store ipcs into IDRs" in 2007. The idr of which 3 exist for each ipc namespace is never freed. This patch simply frees them when the ipcns is freed. I don't believe any idr_remove() are done from rcu (and could therefore be delayed until after this idr_destroy()), so the patch should be safe. Some quick testing showed no harm, and the memory leak fixed. Caught by kmemleak. Signed-off-by: Serge E. Hallyn <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-12-11file ->get_unmapped_area() shouldn't duplicate work of get_unmapped_area()Al Viro
... we should call mm ->get_unmapped_area() instead and let our caller do the final checks. Acked-by: David S. Miller <> Signed-off-by: Al Viro <>
2009-12-09Merge branch 'for-linus' of ↵Linus Torvalds
git:// * 'for-linus' of git:// (42 commits) tree-wide: fix misspelling of "definition" in comments reiserfs: fix misspelling of "journaled" doc: Fix a typo in slub.txt. inotify: remove superfluous return code check hdlc: spelling fix in find_pvc() comment doc: fix regulator docs cut-and-pasteism mtd: Fix comment in Kconfig doc: Fix IRQ chip docs tree-wide: fix assorted typos all over the place drivers/ata/libata-sff.c: comment spelling fixes fix typos/grammos in Documentation/edac.txt sysctl: add missing comments fs/debugfs/inode.c: fix comment typos sgivwfb: Make use of ARRAY_SIZE. sky2: fix sky2_link_down copy/paste comment error tree-wide: fix typos "couter" -> "counter" tree-wide: fix typos "offest" -> "offset" fix kerneldoc for set_irq_msi() spidev: fix double "of of" in comment comment typo fix: sybsystem -> subsystem ...
2009-12-04ipc: fix unused variable warningFelipe Contreras
Commit a0d092f introduced the following warning: ipc/msg.c: In function ?msgctl_down?: ipc/msg.c:415: warning: ?msqid64? may be used uninitialized in this function The gcc warning in this case is actually bogus, as msqid64 is touched only iff cmd == IPC_SET, and in such case, copy_msqid_from_user() initializes it properly. Signed-off-by: Felipe Contreras <> Signed-off-by: Jiri Kosina <>
2009-11-12sysctl ipc: Remove dead binary sysctl support code.Eric W. Biederman
Now that sys_sysctl is a generic wrapper around /proc/sys .ctl_name and .strategy members of sysctl tables are dead code. Remove them. Signed-off-by: Eric W. Biederman <>
2009-09-27const: mark struct vm_struct_operationsAlexey Dobriyan
* mark struct vm_area_struct::vm_ops as const * mark vm_ops in AGP code But leave TTM code alone, something is fishy there with global vm_ops being used. Signed-off-by: Alexey Dobriyan <> Signed-off-by: Linus Torvalds <>
2009-09-24sysctl: remove "struct file *" argument of ->proc_handlerAlexey Dobriyan
It's unused. It isn't needed -- read or write flag is already passed and sysctl shouldn't care about the rest. It _was_ used in two places at arch/frv for some reason. Signed-off-by: Alexey Dobriyan <> Cc: David Howells <> Cc: "Eric W. Biederman" <> Cc: Al Viro <> Cc: Ralf Baechle <> Cc: Martin Schwidefsky <> Cc: Ingo Molnar <> Cc: "David S. Miller" <> Cc: James Morris <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-09-23seq_file: constify seq_operationsJames Morris
Make all seq_operations structs const, to help mitigate against revectoring user-triggerable function pointers. This is derived from the grsecurity patch, although generated from scratch because it's simpler than extracting the changes from there. Signed-off-by: James Morris <> Acked-by: Serge Hallyn <> Acked-by: Casey Schaufler <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-09-22hugetlbfs: allow the creation of files suitable for MAP_PRIVATE on the vfs ↵Eric B Munson
internal mount This patchset adds a flag to mmap that allows the user to request that an anonymous mapping be backed with huge pages. This mapping will borrow functionality from the huge page shm code to create a file on the kernel internal mount and use it to approximate an anonymous mapping. The MAP_HUGETLB flag is a modifier to MAP_ANONYMOUS and will not work without both flags being preset. A new flag is necessary because there is no other way to hook into huge pages without creating a file on a hugetlbfs mount which wouldn't be MAP_ANONYMOUS. To userspace, this mapping will behave just like an anonymous mapping because the file is not accessible outside of the kernel. This patchset is meant to simplify the programming model. Presently there is a large chunk of boiler platecode, contained in libhugetlbfs, required to create private, hugepage backed mappings. This patch set would allow use of hugepages without linking to libhugetlbfs or having hugetblfs mounted. Unification of the VM code would provide these same benefits, but it has been resisted each time that it has been suggested for several reasons: it would break PAGE_SIZE assumptions across the kernel, it makes page-table abstractions really expensive, and it does not provide any benefit on architectures that do not support huge pages, incurring fast path penalties without providing any benefit on these architectures. This patch: There are two means of creating mappings backed by huge pages: 1. mmap() a file created on hugetlbfs 2. Use shm which creates a file on an internal mount which essentially maps it MAP_SHARED The internal mount is only used for shared mappings but there is very little that stops it being used for private mappings. This patch extends hugetlbfs_file_setup() to deal with the creation of files that will be mapped MAP_PRIVATE on the internal hugetlbfs mount. This extended API is used in a subsequent patch to implement the MAP_HUGETLB mmap() flag. Signed-off-by: Eric Munson <> Acked-by: David Rientjes <> Cc: Mel Gorman <> Cc: Adam Litke <> Cc: David Gibson <> Cc: Lee Schermerhorn <> Cc: Nick Piggin <> Cc: Hugh Dickins <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-09-22const: mark remaining super_operations constAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-09-14fix undefined reference to user_shm_unlockHugh Dickins
My 353d5c30c666580347515da609dd74a2b8e9b828 "mm: fix hugetlb bug due to user_shm_unlock call" broke the CONFIG_SYSVIPC !CONFIG_MMU build of both 2.6.31 and "undefined reference to `user_shm_unlock'". gcc didn't understand my comment! so couldn't figure out to optimize away user_shm_unlock() from the error path in the hugetlb-less case, as it does elsewhere. Help it to do so, in a language it understands. Reported-by: Mike Frysinger <> Signed-off-by: Hugh Dickins <> Cc: Signed-off-by: Linus Torvalds <>
2009-08-24mm: fix hugetlb bug due to user_shm_unlock callHugh Dickins
2.6.30's commit 8a0bdec194c21c8fdef840989d0d7b742bb5d4bc removed user_shm_lock() calls in hugetlb_file_setup() but left the user_shm_unlock call in shm_destroy(). In detail: Assume that can_do_hugetlb_shm() returns true and hence user_shm_lock() is not called in hugetlb_file_setup(). However, user_shm_unlock() is called in any case in shm_destroy() and in the following atomic_dec_and_lock(&up->__count) in free_uid() is executed and if up->__count gets zero, also cleanup_user_struct() is scheduled. Note that sched_destroy_user() is empty if CONFIG_USER_SCHED is not set. However, the ref counter up->__count gets unexpectedly non-positive and the corresponding structs are freed even though there are live references to them, resulting in a kernel oops after a lots of shmget(SHM_HUGETLB)/shmctl(IPC_RMID) cycles and CONFIG_USER_SCHED set. Hugh changed Stefan's suggested patch: can_do_hugetlb_shm() at the time of shm_destroy() may give a different answer from at the time of hugetlb_file_setup(). And fixed newseg()'s no_id error path, which has missed user_shm_unlock() ever since it came in 2.6.9. Reported-by: Stefan Huber <> Signed-off-by: Hugh Dickins <> Tested-by: Stefan Huber <> Cc: Signed-off-by: Linus Torvalds <>
2009-06-29integrity: ima mq_open imbalance msg fixMimi Zohar
This patch fixes an imbalance message as reported by Sanchin Sant. As we don't need to measure the message queue, just increment the counters. Reported-by: Sanchin Sant <> Signed-off-by: Mimi Zohar <> Acked-by: Serge Hallyn <> Signed-off-by: James Morris <>
2009-06-21ipc: unbreak 32-bit shmctl/semctl/msgctlJohannes Weiner
31a985f "ipc: use __ARCH_WANT_IPC_PARSE_VERSION in ipc/util.h" would choose the implementation of ipc_parse_version() based on a symbol defined in <asm/unistd.h>. But it failed to also include this header and thus broke IPC_64-passing 32-bit userspace because the flag wasn't masked out properly anymore and the command not understood. Include <linux/unistd.h> to give the architecture a chance to ask for the no-no-op ipc_parse_version(). Signed-off-by: Johannes Weiner <> Acked-by: Arnd Bergmann <> Signed-off-by: Linus Torvalds <>
2009-06-18ipcns: move free_ipcs() protoAlexey Dobriyan
Function is really private to ipc/ and avoid struct kern_ipc_perm forward declaration. Signed-off-by: Alexey Dobriyan <> Reviewed-by: WANG Cong <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-06-18ipcns: make free_ipc_ns() staticAlexey Dobriyan
Signed-off-by: Alexey Dobriyan <> Reviewed-by: WANG Cong <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-06-18ipcns: extract create_ipc_ns()Alexey Dobriyan
clone_ipc_ns() is misnamed, it doesn't clone anything and doesn't use passed parameter. Rename it. create_ipc_ns() will be used by C/R to create fresh ipcns. Signed-off-by: Alexey Dobriyan <> Acked-by: Serge Hallyn <> Reviewed-by: WANG Cong <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-06-18ipcns: remove useless get/put while CLONE_NEWIPCAlexey Dobriyan
copy_ipcs() doesn't actually copy anything. If new ipcns is created, it's created from scratch, in this case get/put on old ipcns isn't needed. Signed-off-by: Alexey Dobriyan <> Acked-by: Serge Hallyn <> Reviewed-by: WANG Cong <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-06-18ipc: use __ARCH_WANT_IPC_PARSE_VERSION in ipc/util.hArnd Bergmann
The definition of ipc_parse_version depends on __ARCH_WANT_IPC_PARSE_VERSION, but the header file declares it conditionally based on the architecture. Use the macro consistently to make it easier to add new architectures. Signed-off-by: Arnd Bergmann <> Acked-by: Serge Hallyn <> Signed-off-by: Andrew Morton <> Signed-off-by: Linus Torvalds <>
2009-06-11Merge branch 'for-linus' of ↵Linus Torvalds
git:// * 'for-linus' of git:// (44 commits) nommu: Provide mmap_min_addr definition. TOMOYO: Add description of lists and structures. TOMOYO: Remove unused field. integrity: ima audit dentry_open failure TOMOYO: Remove unused parameter. security: use mmap_min_addr indepedently of security models TOMOYO: Simplify policy reader. TOMOYO: Remove redundant markers. SELinux: define audit permissions for audit tree netlink messages TOMOYO: Remove unused mutex. tomoyo: avoid get+put of task_struct smack: Remove redundant initialization. integrity: nfsd imbalance bug fix rootplug: Remove redundant initialization. smack: do not beyond ARRAY_SIZE of data integrity: move ima_counts_get integrity: path_check update IMA: Add __init notation to ima functions IMA: Minimal IMA policy and boot param for TCB IMA policy selinux: remove obsolete read buffer limit from sel_read_bool ...
2009-06-10Merge branch 'rcu-for-linus' of ↵Linus Torvalds
git:// * 'rcu-for-linus' of git:// rcu: rcu_sched_grace_period(): kill the bogus flush_signals() rculist: use list_entry_rcu in places where it's appropriate rculist.h: introduce list_entry_rcu() and list_first_entry_rcu() rcu: Update RCU tracing documentation for __rcu_pending rcu: Add __rcu_pending tracing to hierarchical RCU RCU: make treercu be default