summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2014-04-21ENGR00305254-7 imx6sx sarbesd board bring upKe Qinghua
Enable early suspend/late resume code Signed-off-by: Ke Qinghua <qinghua.ke@freescale.com> Acked-by: Jason Liu
2014-04-21power: wakeup_reason: rename irq_count to irqcountGreg Hackmann
On x86, irq_count conflicts with a declaration in arch/x86/include/asm/processor.h Change-Id: I3e4fde0ff64ef59ff5ed2adc0ea3a644641ee0b7 Signed-off-by: Greg Hackmann <ghackmann@google.com>
2014-04-21Power: Add guard condition for maximum wakeup reasonsRuchi Kandoi
Ensure the array for the wakeup reason IRQs does not overflow. Change-Id: Iddc57a3aeb1888f39d4e7b004164611803a4d37c Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com> (cherry picked from commit b5ea40cdfcf38296535f931a7e5e7bf47b6fad7f)
2014-04-21ENGR00300018-1 Update USB and Power Manager driver for android linux kernel 3.10Ke Qinghua
Add Wakelock update for android linux kernel 3.10 Signed-off-by: Ke Qinghua <qinghua.ke@freescale.com>
2014-04-21POWER: fix compile warnings in log_wakeup_reasonRuchi Kandoi
Change I81addaf420f1338255c5d0638b0d244a99d777d1 introduced compile warnings, fix these. Change-Id: I05482a5335599ab96c0a088a7d175c8d4cf1cf69 Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
2014-04-21Power: add an API to log wakeup reasonsRuchi Kandoi
Add API log_wakeup_reason() and expose it to userspace via sysfs path /sys/kernel/wakeup_reasons/last_resume_reason Change-Id: I81addaf420f1338255c5d0638b0d244a99d777d1 Signed-off-by: Ruchi Kandoi <kandoiruchi@google.com>
2014-04-21ARM: Fix "Make low-level printk work" to use a separate config optionArve Hjønnevåg
Signed-off-by: Arve Hjønnevåg <arve@android.com>
2014-04-21anonymous vma names: fix build with !MMUColin Cross
Disable PR_SET_VMA when building with !MMU Change-Id: I896b6979b99aa61df85caf4c3ec22eb8a8204e64 Signed-off-by: Colin Cross <ccross@android.com>
2014-04-21mm: fix anon vma namingColin Cross
Fix two bugs caused by merging anon vma_naming, a typo in mempolicy.c and a bad merge in sys.c. Change-Id: Ia4ced447d50573e68195e95ea2f2b4d9456b8a90 Signed-off-by: Colin Cross <ccross@android.com>
2014-04-21mm: add a field to store names for private anonymous memoryColin Cross
Userspace processes often have multiple allocators that each do anonymous mmaps to get memory. When examining memory usage of individual processes or systems as a whole, it is useful to be able to break down the various heaps that were allocated by each layer and examine their size, RSS, and physical memory usage. This patch adds a user pointer to the shared union in vm_area_struct that points to a null terminated string inside the user process containing a name for the vma. vmas that point to the same address will be merged, but vmas that point to equivalent strings at different addresses will not be merged. Userspace can set the name for a region of memory by calling prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name); Setting the name to NULL clears it. The names of named anonymous vmas are shown in /proc/pid/maps as [anon:<name>] and in /proc/pid/smaps in a new "Name" field that is only present for named vmas. If the userspace pointer is no longer valid all or part of the name will be replaced with "<fault>". The idea to store a userspace pointer to reduce the complexity within mm (at the expense of the complexity of reading /proc/pid/mem) came from Dave Hansen. This results in no runtime overhead in the mm subsystem other than comparing the anon_name pointers when considering vma merging. The pointer is stored in a union with fieds that are only used on file-backed mappings, so it does not increase memory usage. Change-Id: Ie2ffc0967d4ffe7ee4c70781313c7b00cf7e3092 Signed-off-by: Colin Cross <ccross@android.com>
2014-04-21add extra free kbytes tunableRik van Riel
Add a userspace visible knob to tell the VM to keep an extra amount of memory free, by increasing the gap between each zone's min and low watermarks. This is useful for realtime applications that call system calls and have a bound on the number of allocations that happen in any short time period. In this application, extra_free_kbytes would be left at an amount equal to or larger than than the maximum number of allocations that happen in any burst. It may also be useful to reduce the memory use of virtual machines (temporarily?), in a way that does not cause memory fragmentation like ballooning does. [ccross] Revived for use on old kernels where no other solution exists. The tunable will be removed on kernels that do better at avoiding direct reclaim. Change-Id: I765a42be8e964bfd3e2886d1ca85a29d60c3bb3e Signed-off-by: Rik van Riel<riel@redhat.com> Signed-off-by: Colin Cross <ccross@android.com>
2014-04-21sigtimedwait: use freezable blocking callColin Cross
Avoid waking up every thread sleeping in a sigtimedwait call during suspend and resume by calling a freezable blocking call. Previous patches modified the freezer to avoid sending wakeups to threads that are blocked in freezable blocking calls. This call was selected to be converted to a freezable call because it doesn't hold any locks or release any resources when interrupted that might be needed by another freezing task or a kernel driver during suspend, and is a common site where idle userspace tasks are blocked. Change-Id: Ic27469b60a67d50cdc0d0c78975951a99c25adcd Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-04-21nanosleep: use freezable blocking callColin Cross
Avoid waking up every thread sleeping in a nanosleep call during suspend and resume by calling a freezable blocking call. Previous patches modified the freezer to avoid sending wakeups to threads that are blocked in freezable blocking calls. This call was selected to be converted to a freezable call because it doesn't hold any locks or release any resources when interrupted that might be needed by another freezing task or a kernel driver during suspend, and is a common site where idle userspace tasks are blocked. Change-Id: I93383201d4dd62130cd9a9153842d303fc2e2986 Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-04-21freezer: skip waking up tasks with PF_FREEZER_SKIP setColin Cross
Android goes through suspend/resume very often (every few seconds when on a busy wifi network with the screen off), and a significant portion of the energy used to go in and out of suspend is spent in the freezer. If a task has called freezer_do_not_count(), don't bother waking it up. If it happens to wake up later it will call freezer_count() and immediately enter the refrigerator. Combined with patches to convert freezable helpers to use freezer_do_not_count() and convert common sites where idle userspace tasks are blocked to use the freezable helpers, this reduces the time and energy required to suspend and resume. Change-Id: I6ba019d24273619849af757a413271da3261d7db Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-04-21freezer: shorten freezer sleep time using exponential backoffColin Cross
All tasks can easily be frozen in under 10 ms, switch to using an initial 1 ms sleep followed by exponential backoff until 8 ms. Also convert the printed time to ms instead of centiseconds. Change-Id: I7b198b16eefb623c2b0fc45dce50d9bca320afdc Acked-by: Pavel Machek <pavel@ucw.cz> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-04-21lockdep: remove task argument from debug_check_no_locks_heldColin Cross
The only existing caller to debug_check_no_locks_held calls it with 'current' as the task, and the freezer needs to call debug_check_no_locks_held but doesn't already have a current task pointer, so remove the argument. It is already assuming that the current task is relevant by dumping the current stack trace as part of the warning. This was originally part of 6aa9707099c (lockdep: check that no locks held at freeze time) which was reverted in dbf520a9d7d4. Change-Id: Idbaf1332ce6c80dc49c1d31c324c7fbf210657c5 Original-author: Mandeep Singh Baines <msb@chromium.org> Acked-by: Pavel Machek <pavel@ucw.cz> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2014-04-21alarmtimer: add alarm_expires_remainingTodd Poynor
Similar to hrtimer_expires_remaining, return the amount of time remaining until alarm expiry. Change-Id: I8c57512d619ac66bcdaf2d9ccdf0d7f74af2ff66 Signed-off-by: Todd Poynor <toddpoynor@google.com>
2014-04-21alarmtimer: add alarm_start_relativeTodd Poynor
Start an alarmtimer with an expires time relative to the current time of the associated clock. Change-Id: Ifb5309a15e0d502bb4d0209ca5510a56ee7fa88c Signed-off-by: Todd Poynor <toddpoynor@google.com>
2014-04-21alarmtimer: add alarm_forward_nowTodd Poynor
Similar to hrtimer_forward_now, move the expires time forward to an interval from the current time of the associated clock. Change-Id: I73fed223321167507b6eddcb7a57d235ffcfc1be Signed-off-by: Todd Poynor <toddpoynor@google.com>
2014-04-21alarmtimer: add alarm_restartTodd Poynor
Analogous to hrtimer_restart, restart an alarmtimer after the expires time has already been updated (as with alarm_forward). Change-Id: Ia2613bbb467404cb2c35c11efa772bc56294963a Signed-off-by: Todd Poynor <toddpoynor@google.com>
2014-04-21trace: add non-hierarchical function_graph optionJamie Gennis
Add the 'funcgraph-flat' option to the function_graph tracer to use the default trace printing format rather than the hierarchical formatting normally used. Change-Id: If2900bfb86e6f8f51379f56da4f6fabafa630909 Signed-off-by: Jamie Gennis <jgennis@google.com>
2014-04-21trace: Add an option to show tgids in trace outputJamie Gennis
The tgids are tracked along side the saved_cmdlines tracking, and can be included in trace output by enabling the 'print-tgid' trace option. This is useful when doing post-processing of the trace data, as it allows events to be grouped by tgid. Change-Id: I52ed04c3a8ca7fddbb868b792ce5d21ceb76250e Signed-off-by: Jamie Gennis <jgennis@google.com>
2014-04-21trace/events: add gpu trace eventsJamie Gennis
Change-Id: I0607b9c776acf61cb796b8572cf8cfb8b2dc1377 Signed-off-by: Jamie Gennis <jgennis@google.com>
2014-04-21hardlockup: detect hard lockups without NMIs using secondary cpusColin Cross
Emulate NMIs on systems where they are not available by using timer interrupts on other cpus. Each cpu will use its softlockup hrtimer to check that the next cpu is processing hrtimer interrupts by verifying that a counter is increasing. This patch is useful on systems where the hardlockup detector is not available due to a lack of NMIs, for example most ARM SoCs. Without this patch any cpu stuck with interrupts disabled can cause a hardware watchdog reset with no debugging information, but with this patch the kernel can detect the lockup and panic, which can result in useful debugging info. Signed-off-by: Colin Cross <ccross@android.com>
2014-04-21PM / Suspend: Print wall time at suspend entry and exitTodd Poynor
Change-Id: I92f252414c013b018b9a392eae1ee039aa0e89dc Signed-off-by: Todd Poynor <toddpoynor@google.com>
2014-04-21debug: add parameters to prevent entering debug mode on errorsColin Cross
On non-developer devices kgdb prevents CONFIG_PANIC_TIMEOUT from rebooting the device after a panic. Add module parameters debug_core.break_on_exception and debug_core.break_on_panic to allow skipping debug on panics and exceptions respectively. Both default to true to preserve existing behavior. Change-Id: I75dce7263e96cee069a9750920cce83dc6f98e8c Signed-off-by: Colin Cross <ccross@android.com>
2014-04-21kdb: support new lines without carriage returnsColin Cross
kdb expects carriage returns through the serial port to terminate commands. Modify it to accept the first seen carriage return or new line as a terminator, but not treat \r\n as two terminators. Change-Id: I06166017e7703d24310eefcb71c3a7d427088db7 Signed-off-by: Colin Cross <ccross@android.com>
2014-04-21sched: Add a generic notifier when a task struct is about to be freedSan Mehat
This patch adds a notifier which can be used by subsystems that may be interested in when a task has completely died and is about to have it's last resource freed. The Android lowmemory killer uses this to determine when a task it has killed has finally given up its goods. Signed-off-by: San Mehat <san@google.com>
2014-04-21proc: smaps: Allow smaps access for CAP_SYS_RESOURCESan Mehat
Signed-off-by: San Mehat <san@google.com>
2014-04-21PM / Sleep: Add wake lock api wrapper on top of wakeup sourcesArve Hjønnevåg
Change-Id: Icaad02fe1e8856fdc2e4215f380594a5dde8e002 Signed-off-by: Arve Hjønnevåg <arve@android.com>
2014-04-21mm: Add min_free_order_shift tunable.Arve Hjønnevåg
By default the kernel tries to keep half as much memory free at each order as it does for one order below. This can be too agressive when running without swap. Change-Id: I5efc1a0b50f41ff3ac71e92d2efd175dedd54ead Signed-off-by: Arve Hjønnevåg <arve@android.com>
2014-04-21ARM: Make low-level printk workTony Lindgren
Makes low-level printk work. Signed-off-by: Tony Lindgren <tony@atomide.com>
2014-04-21cgroup: Add generic cgroup subsystem permission checksColin Cross
Rather than using explicit euid == 0 checks when trying to move tasks into a cgroup via CFS, move permission checks into each specific cgroup subsystem. If a subsystem does not specify a 'allow_attach' handler, then we fall back to doing our checks the old way. Use the 'allow_attach' handler for the 'cpu' cgroup to allow non-root processes to add arbitrary processes to a 'cpu' cgroup if it has the CAP_SYS_NICE capability set. This version of the patch adds a 'allow_attach' handler instead of reusing the 'can_attach' handler. If the 'can_attach' handler is reused, a new cgroup that implements 'can_attach' but not the permission checks could end up with no permission checks at all. Change-Id: Icfa950aa9321d1ceba362061d32dc7dfa2c64f0c Original-Author: San Mehat <san@google.com> Signed-off-by: Colin Cross <ccross@android.com>
2014-04-21power: Add option to log time spent in suspendColin Cross
Prints the time spent in suspend in the kernel log, and keeps statistics on the time spent in suspend in /sys/kernel/debug/suspend_time Change-Id: Ia6b9ebe4baa0f7f5cd211c6a4f7e813aefd3fa1d Signed-off-by: Colin Cross <ccross@android.com> Signed-off-by: Todd Poynor <toddpoynor@google.com>
2014-04-21panic: Add board ID to panic outputNishanth Menon
At times, it is necessary for boards to provide some additional information as part of panic logs. Provide information on the board hardware as part of panic logs. It is safer to print this information at the very end in case something bad happens as part of the information retrieval itself. To use this, set global mach_panic_string to an appropriate string in the board file. Change-Id: Id12cdda87b0cd2940dd01d52db97e6162f671b4d Signed-off-by: Nishanth Menon <nm@ti.com>
2014-04-21PM: Print pending wakeup IRQ preventing suspend to dmesgTodd Poynor
Prints the name of the first action for a pending wakeup IRQ. Change-Id: I36f90735c75fb7c7ab1084775ec0d0ab02336e6e Signed-off-by: Todd Poynor <toddpoynor@google.com>
2014-04-21Revert "genirq: Do not consider disabled wakeup irqs"Arve Hjønnevåg
This reverts commit 9c6079aa1bfcf7e14de10b824779ce39b679bcb8.
2014-04-21Add build option to to set the default panic timeout.Arve Hjønnevåg
2014-04-21sched: Enable might_sleep before initializing drivers.Arve Hjønnevåg
This allows detection of init bugs in built-in drivers. Signed-off-by: Arve Hjønnevåg <arve@android.com>
2014-04-17futex: update documentation for ordering guaranteesDavidlohr Bueso
Commits 11d4616bd07f ("futex: revert back to the explicit waiter counting code") and 69cd9eba3886 ("futex: avoid race between requeue and wake") changed some of the finer details of how we think about futexes. One was a late fix and the other a consequence of overlooking the whole requeuing logic. The first change caused our documentation to be incorrect, and the second made us aware that we need to explicitly add more details to it. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-17futex: avoid race between requeue and wakeLinus Torvalds
Jan Stancek reported: "pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP) occasionally fails, because some threads fail to wake up. Testcase creates 5 threads, which are all waiting on same condition. Main thread then calls pthread_cond_broadcast() without holding mutex, which calls: futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..) This immediately wakes up single thread A, which unlocks mutex and tries to wake up another thread: futex(uaddr2, FUTEX_WAKE_PRIVATE, 1) If thread A manages to call futex_wake() before any waiters are requeued for uaddr2, no other thread is woken up" The ordering constraints for the hash bucket waiter counting are that the waiter counts have to be incremented _before_ getting the spinlock (because the spinlock acts as part of the memory barrier), but the "requeue" operation didn't honor those rules, and nobody had even thought about that case. This fairly simple patch just increments the waiter count for the target hash bucket (hb2) when requeing a futex before taking the locks. It then decrements them again after releasing the lock - the code that actually moves the futex(es) between hash buckets will do the additional required waiter count housekeeping. Reported-and-tested-by: Jan Stancek <jstancek@redhat.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org # 3.14 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-17futex: revert back to the explicit waiter counting codeLinus Torvalds
Srikar Dronamraju reports that commit b0c29f79ecea ("futexes: Avoid taking the hb->lock if there's nothing to wake up") causes java threads getting stuck on futexes when runing specjbb on a power7 numa box. The cause appears to be that the powerpc spinlocks aren't using the same ticket lock model that we use on x86 (and other) architectures, which in turn result in the "spin_is_locked()" test in hb_waiters_pending() occasionally reporting an unlocked spinlock even when there are pending waiters. So this reinstates Davidlohr Bueso's original explicit waiter counting code, which I had convinced Davidlohr to drop in favor of figuring out the pending waiters by just using the existing state of the spinlock and the wait queue. Reported-and-tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Original-code-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-17futex: Allow architectures to skip futex_atomic_cmpxchg_inatomic() testHeiko Carstens
If an architecture has futex_atomic_cmpxchg_inatomic() implemented and there is no runtime check necessary, allow to skip the test within futex_init(). This allows to get rid of some code which would always give the same result, and also allows the compiler to optimize a couple of if statements away. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Finn Thain <fthain@telegraphics.com.au> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Link: http://lkml.kernel.org/r/20140302120947.GA3641@osiris Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-17futexes: Fix futex_hashsize initializationHeiko Carstens
"futexes: Increase hash table size for better performance" introduces a new alloc_large_system_hash() call. alloc_large_system_hash() however may allocate less memory than requested, e.g. limited by MAX_ORDER. Hence pass a pointer to alloc_large_system_hash() which will contain the hash shift when the function returns. Afterwards correctly set futex_hashsize. Fixes a crash on s390 where the requested allocation size was 4MB but only 1MB was allocated. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Darren Hart <dvhart@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Waiman Long <Waiman.Long@hp.com> Cc: Jason Low <jason.low2@hp.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Link: http://lkml.kernel.org/r/20140116135450.GA4345@osiris Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-17futexes: Avoid taking the hb->lock if there's nothing to wake upDavidlohr Bueso
In futex_wake() there is clearly no point in taking the hb->lock if we know beforehand that there are no tasks to be woken. While the hash bucket's plist head is a cheap way of knowing this, we cannot rely 100% on it as there is a racy window between the futex_wait call and when the task is actually added to the plist. To this end, we couple it with the spinlock check as tasks trying to enter the critical region are most likely potential waiters that will be added to the plist, thus preventing tasks sleeping forever if wakers don't acknowledge all possible waiters. Furthermore, the futex ordering guarantees are preserved, ensuring that waiters either observe the changed user space value before blocking or is woken by a concurrent waker. For wakers, this is done by relying on the barriers in get_futex_key_refs() -- for archs that do not have implicit mb in atomic_inc(), we explicitly add them through a new futex_get_mm function. For waiters we rely on the fact that spin_lock calls already update the head counter, so spinners are visible even if the lock hasn't been acquired yet. For more details please refer to the updated comments in the code and related discussion: https://lkml.org/lkml/2013/11/26/556 Special thanks to tglx for careful review and feedback. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Darren Hart <dvhart@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Scott Norton <scott.norton@hp.com> Cc: Tom Vaden <tom.vaden@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Waiman Long <Waiman.Long@hp.com> Cc: Jason Low <jason.low2@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1389569486-25487-5-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-17futexes: Document multiprocessor ordering guaranteesThomas Gleixner
That's essential, if you want to hack on futexes. Reviewed-by: Darren Hart <dvhart@linux.intel.com> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Scott Norton <scott.norton@hp.com> Cc: Tom Vaden <tom.vaden@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Waiman Long <Waiman.Long@hp.com> Cc: Jason Low <jason.low2@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1389569486-25487-4-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-17futexes: Increase hash table size for better performanceDavidlohr Bueso
Currently, the futex global hash table suffers from its fixed, smallish (for today's standards) size of 256 entries, as well as its lack of NUMA awareness. Large systems, using many futexes, can be prone to high amounts of collisions; where these futexes hash to the same bucket and lead to extra contention on the same hb->lock. Furthermore, cacheline bouncing is a reality when we have multiple hb->locks residing on the same cacheline and different futexes hash to adjacent buckets. This patch keeps the current static size of 16 entries for small systems, or otherwise, 256 * ncpus (or larger as we need to round the number to a power of 2). Note that this number of CPUs accounts for all CPUs that can ever be available in the system, taking into consideration things like hotpluging. While we do impose extra overhead at bootup by making the hash table larger, this is a one time thing, and does not shadow the benefits of this patch. Furthermore, as suggested by tglx, by cache aligning the hash buckets we can avoid access across cacheline boundaries and also avoid massive cache line bouncing if multiple cpus are hammering away at different hash buckets which happen to reside in the same cache line. Also, similar to other core kernel components (pid, dcache, tcp), by using alloc_large_system_hash() we benefit from its NUMA awareness and thus the table is distributed among the nodes instead of in a single one. For a custom microbenchmark that pounds on the uaddr hashing -- making the wait path fail at futex_wait_setup() returning -EWOULDBLOCK for large amounts of futexes, we can see the following benefits on a 80-core, 8-socket 1Tb server: +---------+--------------------+------------------------+-----------------------+-------------------------------+ | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) | +---------+--------------------+------------------------+-----------------------+-------------------------------+ |     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             | |     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             | |     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             | |      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             | |      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             | |      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              | +---------+--------------------+------------------------+-----------------------+-------------------------------+ Reviewed-by: Darren Hart <dvhart@linux.intel.com> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Waiman Long <Waiman.Long@hp.com> Reviewed-and-tested-by: Jason Low <jason.low2@hp.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Scott Norton <scott.norton@hp.com> Cc: Tom Vaden <tom.vaden@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Link: http://lkml.kernel.org/r/1389569486-25487-3-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-17futex: Use freezable blocking callColin Cross
Avoid waking up every thread sleeping in a futex_wait call during suspend and resume by calling a freezable blocking call. Previous patches modified the freezer to avoid sending wakeups to threads that are blocked in freezable blocking calls. This call was selected to be converted to a freezable call because it doesn't hold any locks or release any resources when interrupted that might be needed by another freezing task or a kernel driver during suspend, and is a common site where idle userspace tasks are blocked. Signed-off-by: Colin Cross <ccross@android.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: arve@android.com Cc: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Darren Hart <dvhart@linux.intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Link: http://lkml.kernel.org/r/1367458508-9133-8-git-send-email-ccross@android.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-04-17futexes: Clean up various detailsJason Low
- Remove unnecessary head variables. - Delete unused parameter in queue_unlock(). Reviewed-by: Darren Hart <dvhart@linux.intel.com> Reviewed-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Jason Low <jason.low2@hp.com> Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Jeff Mahoney <jeffm@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Scott Norton <scott.norton@hp.com> Cc: Tom Vaden <tom.vaden@hp.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Waiman Long <Waiman.Long@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1389569486-25487-2-git-send-email-davidlohr@hp.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-17futex: move user address verification up to common codeLinus Torvalds
When debugging the read-only hugepage case, I was confused by the fact that get_futex_key() did an access_ok() only for the non-shared futex case, since the user address checking really isn't in any way specific to the private key handling. Now, it turns out that the shared key handling does effectively do the equivalent checks inside get_user_pages_fast() (it doesn't actually check the address range on x86, but does check the page protections for being a user page). So it wasn't actually a bug, but the fact that we treat the address differently for private and shared futexes threw me for a loop. Just move the check up, so that it gets done for both cases. Also, use the 'rw' parameter for the type, even if it doesn't actually matter any more (it's a historical artifact of the old racy i386 "page faults from kernel space don't check write protections"). Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>