summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2012-10-07random: remove rand_initialize_irq()Theodore Ts'o
commit c5857ccf293968348e5eb4ebedc68074de3dcda6 upstream. With the new interrupt sampling system, we are no longer using the timer_rand_state structure in the irq descriptor, so we can stop initializing it now. [ Merged in fixes from Sedat to find some last missing references to rand_initialize_irq() ] Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Sedat Dilek <sedat.dilek@gmail.com> [PG: in .34 the irqdesc.h content is in irq.h instead.] Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07random: make 'add_interrupt_randomness()' do something saneTheodore Ts'o
commit 775f4b297b780601e61787b766f306ed3e1d23eb upstream. We've been moving away from add_interrupt_randomness() for various reasons: it's too expensive to do on every interrupt, and flooding the CPU with interrupts could theoretically cause bogus floods of entropy from a somewhat externally controllable source. This solves both problems by limiting the actual randomness addition to just once a second or after 64 interrupts, whicever comes first. During that time, the interrupt cycle data is buffered up in a per-cpu pool. Also, we make sure the the nonblocking pool used by urandom is initialized before we start feeding the normal input pool. This assures that /dev/urandom is returning unpredictable data as soon as possible. (Based on an original patch by Linus, but significantly modified by tytso.) Tested-by: Eric Wustrow <ewust@umich.edu> Reported-by: Eric Wustrow <ewust@umich.edu> Reported-by: Nadia Heninger <nadiah@cs.ucsd.edu> Reported-by: Zakir Durumeric <zakir@umich.edu> Reported-by: J. Alex Halderman <jhalderm@umich.edu>. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> [PG: minor adjustment required since .34 doesn't have f9e4989eb8 which renames "status" to "random" in kernel/irq/handle.c ] Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree()Oleg Nesterov
commit d80e731ecab420ddcb79ee9d0ac427acbc187b4b upstream. This patch is intentionally incomplete to simplify the review. It ignores ep_unregister_pollwait() which plays with the same wqh. See the next change. epoll assumes that the EPOLL_CTL_ADD'ed file controls everything f_op->poll() needs. In particular it assumes that the wait queue can't go away until eventpoll_release(). This is not true in case of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand which is not connected to the file. This patch adds the special event, POLLFREE, currently only for epoll. It expects that init_poll_funcptr()'ed hook should do the necessary cleanup. Perhaps it should be defined as EPOLLFREE in eventpoll. __cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if ->signalfd_wqh is not empty, we add the new signalfd_cleanup() helper. ep_poll_callback(POLLFREE) simply does list_del_init(task_list). This make this poll entry inconsistent, but we don't care. If you share epoll fd which contains our sigfd with another process you should blame yourself. signalfd is "really special". I simply do not know how we can define the "right" semantics if it used with epoll. The main problem is, epoll calls signalfd_poll() once to establish the connection with the wait queue, after that signalfd_poll(NULL) returns the different/inconsistent results depending on who does EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd has nothing to do with the file, it works with the current thread. In short: this patch is the hack which tries to fix the symptoms. It also assumes that nobody can take tasklist_lock under epoll locks, this seems to be true. Note: - we do not have wake_up_all_poll() but wake_up_poll() is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE. - signalfd_cleanup() uses POLLHUP along with POLLFREE, we need a couple of simple changes in eventpoll.c to make sure it can't be "lost". Reported-by: Maxime Bizon <mbizon@freebox.fr> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07futex: Forbid uaddr == uaddr2 in futex_wait_requeue_pi()Darren Hart
commit 6f7b0a2a5c0fb03be7c25bd1745baa50582348ef upstream. If uaddr == uaddr2, then we have broken the rule of only requeueing from a non-pi futex to a pi futex with this call. If we attempt this, as the trinity test suite manages to do, we miss early wakeups as q.key is equal to key2 (because they are the same uaddr). We will then attempt to dereference the pi_mutex (which would exist had the futex_q been properly requeued to a pi futex) and trigger a NULL pointer dereference. Signed-off-by: Darren Hart <dvhart@linux.intel.com> Cc: Dave Jones <davej@redhat.com> Link: http://lkml.kernel.org/r/ad82bfe7f7d130247fbe2b5b4275654807774227.1342809673.git.dvhart@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07futex: Fix bug in WARN_ON for NULL q.pi_stateDarren Hart
commit f27071cb7fe3e1d37a9dbe6c0dfc5395cd40fa43 upstream. The WARN_ON in futex_wait_requeue_pi() for a NULL q.pi_state was testing the address (&q.pi_state) of the pointer instead of the value (q.pi_state) of the pointer. Correct it accordingly. Signed-off-by: Darren Hart <dvhart@linux.intel.com> Cc: Dave Jones <davej@redhat.com> Link: http://lkml.kernel.org/r/1c85d97f6e5f79ec389a4ead3e367363c74bd09a.1342809673.git.dvhart@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07futex: Test for pi_mutex on fault in futex_wait_requeue_pi()Darren Hart
commit b6070a8d9853eda010a549fa9a09eb8d7269b929 upstream. If fixup_pi_state_owner() faults, pi_mutex may be NULL. Test for pi_mutex != NULL before testing the owner against current and possibly unlocking it. Signed-off-by: Darren Hart <dvhart@linux.intel.com> Cc: Dave Jones <davej@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Link: http://lkml.kernel.org/r/dc59890338fc413606f04e5c5b131530734dae3d.1342809673.git.dvhart@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07ntp: Fix STA_INS/DEL clearing bugJohn Stultz
commit 6b1859dba01c7d512b72d77e3fd7da8354235189 upstream. In commit 6b43ae8a619d17c4935c3320d2ef9e92bdeed05d, I introduced a bug that kept the STA_INS or STA_DEL bit from being cleared from time_status via adjtimex() without forcing STA_PLL first. Usually once the STA_INS is set, it isn't cleared until the leap second is applied, so its unlikely this affected anyone. However during testing I noticed it took some effort to cancel a leap second once STA_INS was set. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Link: http://lkml.kernel.org/r/1342156917-25092-2-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07ntp: Fix integer overflow when setting timeSasha Levin
commit a078c6d0e6288fad6d83fb6d5edd91ddb7b6ab33 upstream. 'long secs' is passed as divisor to div_s64, which accepts a 32bit divisor. On 64bit machines that value is trimmed back from 8 bytes back to 4, causing a divide by zero when the number is bigger than (1 << 32) - 1 and all 32 lower bits are 0. Use div64_long() instead. Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Cc: johnstul@us.ibm.com Link: http://lkml.kernel.org/r/1331829374-31543-2-git-send-email-levinsasha928@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [WT: div64_long() does not exist on 2.6.32 and needs a deeper backport than desired. Instead, address the issue by controlling that the divisor is correct for use as an s32 divisor] Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07cred: copy_process() should clear child->replacement_session_keyringOleg Nesterov
commit 79549c6dfda0603dba9a70a53467ce62d9335c33 upstream keyctl_session_to_parent(task) sets ->replacement_session_keyring, it should be processed and cleared by key_replace_session_keyring(). However, this task can fork before it notices TIF_NOTIFY_RESUME and the new child gets the bogus ->replacement_session_keyring copied by dup_task_struct(). This is obviously wrong and, if nothing else, this leads to put_cred(already_freed_cred). change copy_creds() to clear this member. If copy_process() fails before this point the wrong ->replacement_session_keyring doesn't matter, exit_creds() won't be called. Cc: <stable@vger.kernel.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07block: Fix io_context leak after failure of clone with CLONE_IOLouis Rilling
commit b69f2292063d2caf37ca9aec7d63ded203701bf3 upstream With CLONE_IO, parent's io_context->nr_tasks is incremented, but never decremented whenever copy_process() fails afterwards, which prevents exit_io_context() from calling IO schedulers exit functions. Give a task_struct to exit_io_context(), and call exit_io_context() instead of put_io_context() in copy_process() cleanup path. Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07sched: Fix signed unsigned comparison in check_preempt_tick()Mike Galbraith
commit d7d8294415f0ce4254827d4a2a5ee88b00be52a8 upstream. Signed unsigned comparison may lead to superfluous resched if leftmost is right of the current task, wasting a few cycles, and inadvertently _lengthening_ the current task's slice. Reported-by: Venkatesh Pallipadi <venki@google.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1294202477.9384.5.camel@marge.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07usb: Fix deadlock in hid_reset when Dell iDRAC is resetStuart Hayes
This was fixed upstream by commit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c ('workqueue: implement concurrency managed dynamic worker pool'), but that is far too large a change for stable. When Dell iDRAC is reset, the iDRAC's USB keyboard/mouse device stops responding but is not actually disconnected. This causes usbhid to hid hid_io_error(), and you get a chain of calls like... hid_reset() usb_reset_device() usb_reset_and_verify_device() usb_ep0_reinit() usb_disble_endpoint() usb_hcd_disable_endpoint() ehci_endpoint_disable() Along the way, as a result of an error/timeout with a USB transaction, ehci_clear_tt_buffer() calls usb_hub_clear_tt_buffer() (to clear a failed transaction out of the transaction translator in the hub), which schedules hub_tt_work() to be run (using keventd), and then sets qh->clearing_tt=1 so that nobody will mess with that qh until the TT is cleared. But run_workqueue() never happens for the keventd workqueue on that CPU, so hub_tt_work() never gets run. And qh->clearing_tt never gets changed back to 0. This causes ehci_endpoint_disable() to get stuck in a loop waiting for qh->clearing_tt to go to 0. Part of the problem is hid_reset() is itself running on keventd. So when that thread gets a timeout trying to talk to the HID device, it schedules clear_work (to run hub_tt_work) to run, and then gets stuck in ehci_endpoint_disable waiting for it to run. However, clear_work never gets run because the workqueue for that CPU is still waiting for hid_reset to finish. A much less invasive patch for earlier kernels is to just schedule clear_work on khubd if the usb code needs to clear the TT and it sees that it is already running on keventd. Khubd isn't used by default because it can get blocked by device enumeration sometimes, but I think it should be ok for a backup for unusual cases like this just to prevent deadlock. Signed-off-by: Stuart Hayes <stuart_hayes@dell.com> Signed-off-by: Shyam Iyer <shyam_iyer@dell.com> [bwh: Use current_is_keventd() rather than checking current->{flags,comm}] Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07time: Move ktime_t overflow checking into timespec_valid_strictJohn Stultz
This is a -stable backport of cee58483cf56e0ba355fdd97ff5e8925329aa936 Andreas Bombe reported that the added ktime_t overflow checking added to timespec_valid in commit 4e8b14526ca7 ("time: Improve sanity checking of timekeeping inputs") was causing problems with X.org because it caused timeouts larger then KTIME_T to be invalid. Previously, these large timeouts would be clamped to KTIME_MAX and would never expire, which is valid. This patch splits the ktime_t overflow checking into a new timespec_valid_strict function, and converts the timekeeping codes internal checking to use this more strict function. Reported-and-tested-by: Andreas Bombe <aeb@debian.org> Cc: Zhouping Liu <zliu@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07time: Avoid making adjustments if we haven't accumulated anythingJohn Stultz
This is a -stable backport of bf2ac312195155511a0f79325515cbb61929898a If update_wall_time() is called and the current offset isn't large enough to accumulate, avoid re-calling timekeeping_adjust which may change the clock freq and can cause 1ns inconsistencies with CLOCK_REALTIME_COARSE/CLOCK_MONOTONIC_COARSE. Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1345595449-34965-5-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07time: Improve sanity checking of timekeeping inputsJohn Stultz
This is a -stable backport of 4e8b14526ca7fb046a81c94002c1c43b6fdf0e9b Unexpected behavior could occur if the time is set to a value large enough to overflow a 64bit ktime_t (which is something larger then the year 2262). Also unexpected behavior could occur if large negative offsets are injected via adjtimex. So this patch improves the sanity check timekeeping inputs by improving the timespec_valid() check, and then makes better use of timespec_valid() to make sure we don't set the time to an invalid negative value or one that overflows ktime_t. Note: This does not protect from setting the time close to overflowing ktime_t and then letting natural accumulation cause the overflow. Reported-by: CAI Qian <caiqian@redhat.com> Reported-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Zhouping Liu <zliu@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1344454580-17031-1-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07timekeeping: Add missing update call in timekeeping_resume()Thomas Gleixner
This is a backport of 3e997130bd2e8c6f5aaa49d6e3161d4d29b43ab0 The leap second rework unearthed another issue of inconsistent data. On timekeeping_resume() the timekeeper data is updated, but nothing calls timekeeping_update(), so now the update code in the timer interrupt sees stale values. This has been the case before those changes, but then the timer interrupt was using stale data as well so this went unnoticed for quite some time. Add the missing update call, so all the data is consistent everywhere. Reported-by: Andreas Schwab <schwab@linux-m68k.org> Reported-and-tested-by: "Rafael J. Wysocki" <rjw@sisk.pl> Reported-and-tested-by: Martin Steigerwald <Martin@lichtvoll.de> Cc: LKML <linux-kernel@vger.kernel.org> Cc: Linux PM list <linux-pm@vger.kernel.org> Cc: John Stultz <johnstul@us.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>, Cc: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07hrtimer: Update hrtimer base offsets each hrtimer_interruptJohn Stultz
This is a backport of 5baefd6d84163443215f4a99f6a20f054ef11236 The update of the hrtimer base offsets on all cpus cannot be made atomically from the timekeeper.lock held and interrupt disabled region as smp function calls are not allowed there. clock_was_set(), which enforces the update on all cpus, is called either from preemptible process context in case of do_settimeofday() or from the softirq context when the offset modification happened in the timer interrupt itself due to a leap second. In both cases there is a race window for an hrtimer interrupt between dropping timekeeper lock, enabling interrupts and clock_was_set() issuing the updates. Any interrupt which arrives in that window will see the new time but operate on stale offsets. So we need to make sure that an hrtimer interrupt always sees a consistent state of time and offsets. ktime_get_update_offsets() allows us to get the current monotonic time and update the per cpu hrtimer base offsets from hrtimer_interrupt() to capture a consistent state of monotonic time and the offsets. The function replaces the existing ktime_get() calls in hrtimer_interrupt(). The overhead of the new function vs. ktime_get() is minimal as it just adds two store operations. This ensures that any changes to realtime or boottime offsets are noticed and stored into the per-cpu hrtimer base structures, prior to any hrtimer expiration and guarantees that timers are not expired early. Signed-off-by: John Stultz <johnstul@us.ibm.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1341960205-56738-8-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07timekeeping: Provide hrtimer update functionThomas Gleixner
This is a backport of f6c06abfb3972ad4914cef57d8348fcb2932bc3b To finally fix the infamous leap second issue and other race windows caused by functions which change the offsets between the various time bases (CLOCK_MONOTONIC, CLOCK_REALTIME and CLOCK_BOOTTIME) we need a function which atomically gets the current monotonic time and updates the offsets of CLOCK_REALTIME and CLOCK_BOOTTIME with minimalistic overhead. The previous patch which provides ktime_t offsets allows us to make this function almost as cheap as ktime_get() which is going to be replaced in hrtimer_interrupt(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: John Stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/1341960205-56738-7-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07hrtimers: Move lock held region in hrtimer_interrupt()Thomas Gleixner
This is a backport of 196951e91262fccda81147d2bcf7fdab08668b40 We need to update the base offsets from this code and we need to do that under base->lock. Move the lock held region around the ktime_get() calls. The ktime_get() calls are going to be replaced with a function which gets the time and the offsets atomically. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: John Stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/1341960205-56738-6-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07timekeeping: Maintain ktime_t based offsets for hrtimersThomas Gleixner
This is a backport of 5b9fe759a678e05be4937ddf03d50e950207c1c0 We need to update the hrtimer clock offsets from the hrtimer interrupt context. To avoid conversions from timespec to ktime_t maintain a ktime_t based representation of those offsets in the timekeeper. This puts the conversion overhead into the code which updates the underlying offsets and provides fast accessible values in the hrtimer interrupt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: John Stultz <johnstul@us.ibm.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1341960205-56738-4-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07timekeeping: Fix leapsecond triggered load spike issueJohn Stultz
This is a backport of 4873fa070ae84a4115f0b3c9dfabc224f1bc7c51 The timekeeping code misses an update of the hrtimer subsystem after a leap second happened. Due to that timers based on CLOCK_REALTIME are either expiring a second early or late depending on whether a leap second has been inserted or deleted until an operation is initiated which causes that update. Unless the update happens by some other means this discrepancy between the timekeeping and the hrtimer data stays forever and timers are expired either early or late. The reported immediate workaround - $ data -s "`date`" - is causing a call to clock_was_set() which updates the hrtimer data structures. See: http://www.sheeri.com/content/mysql-and-leap-second-high-cpu-and-fix Add the missing clock_was_set() call to update_wall_time() in case of a leap second event. The actual update is deferred to softirq context as the necessary smp function call cannot be invoked from hard interrupt context. Signed-off-by: John Stultz <johnstul@us.ibm.com> Reported-by: Jan Engelhardt <jengelh@inai.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1341960205-56738-3-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07hrtimer: Provide clock_was_set_delayed()John Stultz
This is a backport of f55a6faa384304c89cfef162768e88374d3312cb clock_was_set() cannot be called from hard interrupt context because it calls on_each_cpu(). For fixing the widely reported leap seconds issue it is necessary to call it from hard interrupt context, i.e. the timer tick code, which does the timekeeping updates. Provide a new function which denotes it in the hrtimer cpu base structure of the cpu on which it is called and raise the hrtimer softirq. We then execute the clock_was_set() notificiation from softirq context in run_hrtimer_softirq(). The hrtimer softirq is rarely used, so polling the flag there is not a performance issue. [ tglx: Made it depend on CONFIG_HIGH_RES_TIMERS. We really should get rid of all this ifdeffery ASAP ] Signed-off-by: John Stultz <johnstul@us.ibm.com> Reported-by: Jan Engelhardt <jengelh@inai.de> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Prarit Bhargava <prarit@redhat.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1341960205-56738-2-git-send-email-johnstul@us.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07time: Move common updates to a functionThomas Gleixner
This is a backport of cc06268c6a87db156af2daed6e96a936b955cc82 While not a bugfix itself, it allows following fixes to backport in a more straightforward manner. CC: Thomas Gleixner <tglx@linutronix.de> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07timekeeping: Fix CLOCK_MONOTONIC inconsistency during leapsecondJohn Stultz
This is a backport of fad0c66c4bb836d57a5f125ecd38bed653ca863a which resolves a bug the previous commit. Commit 6b43ae8a61 (ntp: Fix leap-second hrtimer livelock) broke the leapsecond update of CLOCK_MONOTONIC. The missing leapsecond update to wall_to_monotonic causes discontinuities in CLOCK_MONOTONIC. Adjust wall_to_monotonic when NTP inserted a leapsecond. Reported-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Tested-by: Richard Cochran <richardcochran@gmail.com> Link: http://lkml.kernel.org/r/1338400497-12420-1-git-send-email-john.stultz@linaro.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07ntp: Correct TAI offset during leap secondRichard Cochran
This is a backport of dd48d708ff3e917f6d6b6c2b696c3f18c019feed When repeating a UTC time value during a leap second (when the UTC time should be 23:59:60), the TAI timescale should not stop. The kernel NTP code increments the TAI offset one second too late. This patch fixes the issue by incrementing the offset during the leap second itself. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07ntp: Fix leap-second hrtimer livelockJohn Stultz
This is a backport of 6b43ae8a619d17c4935c3320d2ef9e92bdeed05d This should have been backported when it was commited, but I mistook the problem as requiring the ntp_lock changes that landed in 3.4 in order for it to occur. Unfortunately the same issue can happen (with only one cpu) as follows: do_adjtimex() write_seqlock_irq(&xtime_lock); process_adjtimex_modes() process_adj_status() ntp_start_leap_timer() hrtimer_start() hrtimer_reprogram() tick_program_event() clockevents_program_event() ktime_get() seq = req_seqbegin(xtime_lock); [DEADLOCK] This deadlock will no always occur, as it requires the leap_timer to force a hrtimer_reprogram which only happens if its set and there's no sooner timer to expire. NOTE: This patch, being faithful to the original commit, introduces a bug (we don't update wall_to_monotonic), which will be resovled by backporting a following fix. Original commit message below: Since commit 7dffa3c673fbcf835cd7be80bb4aec8ad3f51168 the ntp subsystem has used an hrtimer for triggering the leapsecond adjustment. However, this can cause a potential livelock. Thomas diagnosed this as the following pattern: CPU 0 CPU 1 do_adjtimex() spin_lock_irq(&ntp_lock); process_adjtimex_modes(); timer_interrupt() process_adj_status(); do_timer() ntp_start_leap_timer(); write_lock(&xtime_lock); hrtimer_start(); update_wall_time(); hrtimer_reprogram(); ntp_tick_length() tick_program_event() spin_lock(&ntp_lock); clockevents_program_event() ktime_get() seq = req_seqbegin(xtime_lock); This patch tries to avoid the problem by reverting back to not using an hrtimer to inject leapseconds, and instead we handle the leapsecond processing in the second_overflow() function. The downside to this change is that on systems that support highres timers, the leap second processing will occur on a HZ tick boundary, (ie: ~1-10ms, depending on HZ) after the leap second instead of possibly sooner (~34us in my tests w/ x86_64 lapic). This patch applies on top of tip/timers/core. CC: Sasha Levin <levinsasha928@gmail.com> CC: Thomas Gleixner <tglx@linutronix.de> Reported-by: Sasha Levin <levinsasha928@gmail.com> Diagnoised-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sasha Levin <levinsasha928@gmail.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linux Kernel <linux-kernel@vger.kernel.org> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-10-07futex: Fix uninterruptible loop due to gate_areaHugh Dickins
commit e6780f7243eddb133cc20ec37fa69317c218b709 upstream. It was found (by Sasha) that if you use a futex located in the gate area we get stuck in an uninterruptible infinite loop, much like the ZERO_PAGE issue. While looking at this problem, PeterZ realized you'll get into similar trouble when hitting any install_special_pages() mapping. And are there still drivers setting up their own special mmaps without page->mapping, and without special VM or pte flags to make get_user_pages fail? In most cases, if page->mapping is NULL, we do not need to retry at all: Linus points out that even /proc/sys/vm/drop_caches poses no problem, because it ends up using remove_mapping(), which takes care not to interfere when the page reference count is raised. But there is still one case which does need a retry: if memory pressure called shmem_writepage in between get_user_pages_fast dropping page table lock and our acquiring page lock, then the page gets switched from filecache to swapcache (and ->mapping set to NULL) whatever the refcount. Fault it back in to get the page->mapping needed for key->shared.inode. Reported-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [PG: 2.6.34 variable is page, not page_head, since it doesn't have a5b338f2] Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Willy Tarreau <w@1wt.eu>
2012-03-04PM / Sleep: Fix freezer failures due to racy usermodehelper_is_disabled()Srivatsa S. Bhat
[ Upstream commit b298d289c79211508f11cb50749b0d1d54eb244a ] Commit a144c6a (PM: Print a warning if firmware is requested when tasks are frozen) introduced usermodehelper_is_disabled() to warn and exit immediately if firmware is requested when usermodehelpers are disabled. However, it is racy. Consider the following scenario, currently used in drivers/base/firmware_class.c: ... if (usermodehelper_is_disabled()) goto out; /* Do actual work */ ... out: return err; Nothing prevents someone from disabling usermodehelpers just after the check in the 'if' condition, which means that it is quite possible to try doing the "actual work" with usermodehelpers disabled, leading to undesirable consequences. In particular, this race condition in _request_firmware() causes task freezing failures whenever suspend/hibernation is in progress because, it wrongly waits to get the firmware/microcode image from userspace when actually the usermodehelpers are disabled or userspace has been frozen. Some of the example scenarios that cause freezing failures due to this race are those that depend on userspace via request_firmware(), such as x86 microcode module initialization and microcode image reload. Previous discussions about this issue can be found at: http://thread.gmane.org/gmane.linux.kernel/1198291/focus=1200591 This patch adds proper synchronization to fix this issue. It is to be noted that this patchset fixes the freezing failures but doesn't remove the warnings. IOW, it does not attempt to add explicit synchronization to x86 microcode driver to avoid requesting microcode image at inopportune moments. Because, the warnings were introduced to highlight such cases, in the first place. And we need not silence the warnings, since we take care of the *real* problem (freezing failure) and hence, after that, the warnings are pretty harmless anyway. Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-04PM: Print a warning if firmware is requested when tasks are frozenRafael J. Wysocki
[ Upstream commit a144c6a6c924aa1da04dd77fb84b89927354fdff ] Some drivers erroneously use request_firmware() from their ->resume() (or ->thaw(), or ->restore()) callbacks, which is not going to work unless the firmware has been built in. This causes system resume to stall until the firmware-loading timeout expires, which makes users think that the resume has failed and reboot their machines unnecessarily. For this reason, make _request_firmware() print a warning and return immediately with error code if it has been called when tasks are frozen and it's impossible to start any new usermode helpers. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Reviewed-by: Valdis Kletnieks <valdis.kletnieks@vt.edu> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-04relay: prevent integer overflow in relay_open()Dan Carpenter
commit f6302f1bcd75a042df69866d98b8d775a668f8f1 upstream. "subbuf_size" and "n_subbufs" come from the user and they need to be capped to prevent an integer overflow. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-01-25kprobes: initialize before using a hlistAnanth N Mavinakayanahalli
commit d496aab567e7e52b3e974c9192a5de6e77dce32c upstream. Commit ef53d9c5e ("kprobes: improve kretprobe scalability with hashed locking") introduced a bug where we can potentially leak kretprobe_instances since we initialize a hlist head after having used it. Initialize the hlist head before using it. Reported by: Jim Keniston <jkenisto@us.ibm.com> Acked-by: Jim Keniston <jkenisto@us.ibm.com> Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Srinivasa D S <srinivasa@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-12PM / Sleep: Fix race between CPU hotplug and freezerSrivatsa S. Bhat
commit 79cfbdfa87e84992d509e6c1648a18e1d7e68c20 upstream. The CPU hotplug notifications sent out by the _cpu_up() and _cpu_down() functions depend on the value of the 'tasks_frozen' argument passed to them (which indicates whether tasks have been frozen or not). (Examples for such CPU hotplug notifications: CPU_ONLINE, CPU_ONLINE_FROZEN, CPU_DEAD, CPU_DEAD_FROZEN). Thus, it is essential that while the callbacks for those notifications are running, the state of the system with respect to the tasks being frozen or not remains unchanged, *throughout that duration*. Hence there is a need for synchronizing the CPU hotplug code with the freezer subsystem. Since the freezer is involved only in the Suspend/Hibernate call paths, this patch hooks the CPU hotplug code to the suspend/hibernate notifiers PM_[SUSPEND|HIBERNATE]_PREPARE and PM_POST_[SUSPEND|HIBERNATE] to prevent the race between CPU hotplug and freezer, thus ensuring that CPU hotplug notifications will always be run with the state of the system really being what the notifications indicate, _throughout_ their execution time. Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-06hung_task: fix false positive during vforkMandeep Singh Baines
commit f9fab10bbd768b0e5254e53a4a8477a94bfc4b96 upstream. vfork parent uninterruptibly and unkillably waits for its child to exec/exit. This wait is of unbounded length. Ignore such waits in the hung_task detector. Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Reported-by: Sasha Levin <levinsasha928@gmail.com> LKML-Reference: <1325344394.28904.43.camel@lappy> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: John Kacur <jkacur@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-03Revert "clockevents: Set noop handler in clockevents_exchange_device()"Linus Torvalds
commit 3b87487ac5008072f138953b07505a7e3493327f upstream. This reverts commit de28f25e8244c7353abed8de0c7792f5f883588c. It results in resume problems for various people. See for example http://thread.gmane.org/gmane.linux.kernel/1233033 http://thread.gmane.org/gmane.linux.kernel/1233389 http://thread.gmane.org/gmane.linux.kernel/1233159 http://thread.gmane.org/gmane.linux.kernel/1227868/focus=1230877 and the fedora and ubuntu bug reports https://bugzilla.redhat.com/show_bug.cgi?id=767248 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/904569 which got bisected down to the stable version of this commit. Reported-by: Jonathan Nieder <jrnieder@gmail.com> Reported-by: Phil Miller <mille121@illinois.edu> Reported-by: Philip Langdale <philipl@overt.org> Reported-by: Tim Gardner <tim.gardner@canonical.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-21Make TASKSTATS require root accessLinus Torvalds
commit 1a51410abe7d0ee4b1d112780f46df87d3621043 upstream. Ok, this isn't optimal, since it means that 'iotop' needs admin capabilities, and we may have to work on this some more. But at the same time it is very much not acceptable to let anybody just read anybody elses IO statistics quite at this level. Use of the GENL_ADMIN_PERM suggested by Johannes Berg as an alternative to checking the capabilities by hand. Reported-by: Vasiliy Kulikov <segoon@openwall.com> Cc: Johannes Berg <johannes.berg@intel.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Moritz Mühlenhoff <jmm@inutil.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-09clockevents: Set noop handler in clockevents_exchange_device()Thomas Gleixner
commit de28f25e8244c7353abed8de0c7792f5f883588c upstream. If a device is shutdown, then there might be a pending interrupt, which will be processed after we reenable interrupts, which causes the original handler to be run. If the old handler is the (broadcast) periodic handler the shutdown state might hang the kernel completely. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-09tick-broadcast: Stop active broadcast device when replacing itThomas Gleixner
commit c1be84309c58b1e7c6d626e28fba41a22b364c3d upstream. When a better rated broadcast device is installed, then the current active device is not disabled, which results in two running broadcast devices. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-09genirq: Fix race condition when stopping the irq threadIdo Yariv
commit 550acb19269d65f32e9ac4ddb26c2b2070e37f1c upstream. In irq_wait_for_interrupt(), the should_stop member is verified before setting the task's state to TASK_INTERRUPTIBLE and calling schedule(). In case kthread_stop sets should_stop and wakes up the process after should_stop is checked by the irq thread but before the task's state is changed, the irq thread might never exit: kthread_stop irq_wait_for_interrupt ------------ ---------------------- ... ... while (!kthread_should_stop()) { kthread->should_stop = 1; wake_up_process(k); wait_for_completion(&kthread->exited); ... set_current_state(TASK_INTERRUPTIBLE); ... schedule(); } Fix this by checking if the thread should stop after modifying the task's state. [ tglx: Simplified it a bit ] Signed-off-by: Ido Yariv <ido@wizery.com> Link: http://lkml.kernel.org/r/1322740508-22640-1-git-send-email-ido@wizery.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-09timekeeping: add arch_offset hook to ktime_get functionsHector Palacios
commit d004e024058a0eaca097513ce62cbcf978913e0a upstream. ktime_get and ktime_get_ts were calling timekeeping_get_ns() but later they were not calling arch_gettimeoffset() so architectures using this mechanism returned 0 ns when calling these functions. This happened for example when running Busybox's ping which calls syscall(__NR_clock_gettime, CLOCK_MONOTONIC, ts) which eventually calls ktime_get. As a result the returned ping travel time was zero. Signed-off-by: Hector Palacios <hector.palacios@digi.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-26genirq: Add IRQF_RESUME_EARLY and resume such IRQs earlierIan Campbell
commit 9bab0b7fbaceec47d32db51cd9e59c82fb071f5a upstream This adds a mechanism to resume selected IRQs during syscore_resume instead of dpm_resume_noirq. Under Xen we need to resume IRQs associated with IPIs early enough that the resched IPI is unmasked and we can therefore schedule ourselves out of the stop_machine where the suspend/resume takes place. This issue was introduced by 676dc3cf5bc3 "xen: Use IRQF_FORCE_RESUME". Back ported to 2.6.32 (which lacks syscore support) by calling the relavant resume function directly from sysdev_resume). v2: Fixed non-x86 build errors. Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Jeremy Fitzhardinge <Jeremy.Fitzhardinge@citrix.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/1318713254.11016.52.camel@dagon.hellion.org.uk Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-26PM / Suspend: Off by one in pm_suspend()Dan Carpenter
commit 528f7ce6e439edeac38f6b3f8561f1be129b5e91 upstream. In enter_state() we use "state" as an offset for the pm_states[] array. The pm_states[] array only has PM_SUSPEND_MAX elements so this test is off by one. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-08Revert "genirq: Add IRQF_RESUME_EARLY and resume such IRQs earlier"Greg Kroah-Hartman
This reverts commit 0f12a6ad9fa3a03f2bcee36c9cb704821e244c40. It causes too many build errors and needs to be done properly. Reported-by: Jiri Slaby <jslaby@suse.cz> Reported-by: Christoph Biedl <linux-kernel.bfrz@manchmal.in-ulm.de> Cc: Ian Campbell <ian.campbell@citrix.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Jeremy Fitzhardinge <Jeremy.Fitzhardinge@citrix.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-07genirq: Add IRQF_RESUME_EARLY and resume such IRQs earlierIan Campbell
commit 9bab0b7fbaceec47d32db51cd9e59c82fb071f5a upstream This adds a mechanism to resume selected IRQs during syscore_resume instead of dpm_resume_noirq. Under Xen we need to resume IRQs associated with IPIs early enough that the resched IPI is unmasked and we can therefore schedule ourselves out of the stop_machine where the suspend/resume takes place. This issue was introduced by 676dc3cf5bc3 "xen: Use IRQF_FORCE_RESUME". Back ported to 2.6.32 (which lacks syscore support) by calling the relavant resume function directly from sysdev_resume). Signed-off-by: Ian Campbell <ian.campbell@citrix.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Jeremy Fitzhardinge <Jeremy.Fitzhardinge@citrix.com> Cc: xen-devel <xen-devel@lists.xensource.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/1318713254.11016.52.camel@dagon.hellion.org.uk Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-07time: Change jiffies_to_clock_t() argument type to unsigned longhank
commit cbbc719fccdb8cbd87350a05c0d33167c9b79365 upstream. The parameter's origin type is long. On an i386 architecture, it can easily be larger than 0x80000000, causing this function to convert it to a sign-extended u64 type. Change the type to unsigned long so we get the correct result. Signed-off-by: hank <pyu@redhat.com> Cc: John Stultz <john.stultz@linaro.org> [ build fix ] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-07kmod: prevent kmod_loop_msg overflow in __request_module()Jiri Kosina
commit 37252db6aa576c34fd794a5a54fb32d7a8b3a07a upstream. Due to post-increment in condition of kmod_loop_msg in __request_module(), the system log can be spammed by much more than 5 instances of the 'runaway loop' message if the number of events triggering it makes the kmod_loop_msg to overflow. Fix that by making sure we never increment it past the threshold. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-29futex: Fix regression with read only mappingsShawn Bohrer
commit 9ea71503a8ed9184d2d0b8ccc4d269d05f7940ae upstream. commit 7485d0d3758e8e6491a5c9468114e74dc050785d (futexes: Remove rw parameter from get_futex_key()) in 2.6.33 fixed two problems: First, It prevented a loop when encountering a ZERO_PAGE. Second, it fixed RW MAP_PRIVATE futex operations by forcing the COW to occur by unconditionally performing a write access get_user_pages_fast() to get the page. The commit also introduced a user-mode regression in that it broke futex operations on read-only memory maps. For example, this breaks workloads that have one or more reader processes doing a FUTEX_WAIT on a futex within a read only shared file mapping, and a writer processes that has a writable mapping issuing the FUTEX_WAKE. This fixes the regression for valid futex operations on RO mappings by trying a RO get_user_pages_fast() when the RW get_user_pages_fast() fails. This change makes it necessary to also check for invalid use cases, such as anonymous RO mappings (which can never change) and the ZERO_PAGE which the commit referenced above was written to address. This patch does restore the original behavior with RO MAP_PRIVATE mappings, which have inherent user-mode usage problems and don't really make sense. With this patch performing a FUTEX_WAIT within a RO MAP_PRIVATE mapping will be successfully woken provided another process updates the region of the underlying mapped file. However, the mmap() man page states that for a MAP_PRIVATE mapping: It is unspecified whether changes made to the file after the mmap() call are visible in the mapped region. So user-mode users attempting to use futex operations on RO MAP_PRIVATE mappings are depending on unspecified behavior. Additionally a RO MAP_PRIVATE mapping could fail to wake up in the following case. Thread-A: call futex(FUTEX_WAIT, memory-region-A). get_futex_key() return inode based key. sleep on the key Thread-B: call mprotect(PROT_READ|PROT_WRITE, memory-region-A) Thread-B: write memory-region-A. COW happen. This process's memory-region-A become related to new COWed private (ie PageAnon=1) page. Thread-B: call futex(FUETX_WAKE, memory-region-A). get_futex_key() return mm based key. IOW, we fail to wake up Thread-A. Once again doing something like this is just silly and users who do something like this get what they deserve. While RO MAP_PRIVATE mappings are nonsensical, checking for a private mapping requires walking the vmas and was deemed too costly to avoid a userspace hang. This Patch is based on Peter Zijlstra's initial patch with modifications to only allow RO mappings for futex operations that need VERIFY_READ access. Reported-by: David Oliver <david@rgmadvisors.com> Signed-off-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Darren Hart <dvhart@linux.intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: peterz@infradead.org Cc: eric.dumazet@gmail.com Cc: zvonler@rgmadvisors.com Cc: hughd@google.com Link: http://lkml.kernel.org/r/1309450892-30676-1-git-send-email-sbohrer@rgmadvisors.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-08-08perf: overflow/perf_count_sw_cpu_clock crashes recent kernelsPeter Zijlstra
The below patch is for -stable only, upstream has a much larger patch that contains the below hunk in commit a8b0ca17b80e92faab46ee7179ba9e99ccb61233 Vince found that under certain circumstances software event overflows go wrong and deadlock. Avoid trying to delete a timer from the timer callback. Reported-by: Vince Weaver <vweaver1@eecs.utk.edu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-07-13PM / Hibernate: Fix free_unnecessary_pages()Rafael J. Wysocki
commit 4d4cf23cdde2f8f9324f5684a7f349e182039529 upstream. There is a bug in free_unnecessary_pages() that causes it to attempt to free too many pages in some cases, which triggers the BUG_ON() in memory_bm_clear_bit() for copy_bm. Namely, if count_data_pages() is initially greater than alloc_normal, we get to_free_normal equal to 0 and "save" greater from 0. In that case, if the sum of "save" and count_highmem_pages() is greater than alloc_highmem, we subtract a positive number from to_free_normal. Hence, since to_free_normal was 0 before the subtraction and is an unsigned int, the result is converted to a huge positive number that is used as the number of pages to free. Fix this bug by checking if to_free_normal is actually greater than or equal to the number we're going to subtract from it. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Reported-and-tested-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-07-13PM / Hibernate: Avoid hitting OOM during preallocation of memoryRafael J. Wysocki
commit 6715045ddc7472a22be5e49d4047d2d89b391f45 upstream. There is a problem in hibernate_preallocate_memory() that it calls preallocate_image_memory() with an argument that may be greater than the total number of available non-highmem memory pages. If that's the case, the OOM condition is guaranteed to trigger, which in turn can cause significant slowdown to occur during hibernation. To avoid that, make preallocate_image_memory() adjust its argument before calling preallocate_image_pages(), so that the total number of saveable non-highem pages left is not less than the minimum size of a hibernation image. Change hibernate_preallocate_memory() to try to allocate from highmem if the number of pages allocated by preallocate_image_memory() is too low. Modify free_unnecessary_pages() to take all possible memory allocation patterns into account. Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Tested-by: M. Vefa Bicakci <bicave@superonline.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-07-13taskstats: don't allow duplicate entries in listener modeVasiliy Kulikov
commit 26c4caea9d697043cc5a458b96411b86d7f6babd upstream. Currently a single process may register exit handlers unlimited times. It may lead to a bloated listeners chain and very slow process terminations. Eg after 10KK sent TASKSTATS_CMD_ATTR_REGISTER_CPUMASKs ~300 Mb of kernel memory is stolen for the handlers chain and "time id" shows 2-7 seconds instead of normal 0.003. It makes it possible to exhaust all kernel memory and to eat much of CPU time by triggerring numerous exits on a single CPU. The patch limits the number of times a single process may register itself on a single CPU to one. One little issue is kept unfixed - as taskstats_exit() is called before exit_files() in do_exit(), the orphaned listener entry (if it was not explicitly deregistered) is kept until the next someone's exit() and implicit deregistration in send_cpu_listeners(). So, if a process registered itself as a listener exits and the next spawned process gets the same pid, it would inherit taskstats attributes. Signed-off-by: Vasiliy Kulikov <segooon@gmail.com> Cc: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>