summaryrefslogtreecommitdiff
path: root/init
AgeCommit message (Collapse)Author
2011-03-23userns: add a user_namespace as creator/owner of uts_namespaceSerge E. Hallyn
The expected course of development for user namespaces targeted capabilities is laid out at https://wiki.ubuntu.com/UserNamespace. Goals: - Make it safe for an unprivileged user to unshare namespaces. They will be privileged with respect to the new namespace, but this should only include resources which the unprivileged user already owns. - Provide separate limits and accounting for userids in different namespaces. Status: Currently (as of 2.6.38) you can clone with the CLONE_NEWUSER flag to get a new user namespace if you have the CAP_SYS_ADMIN, CAP_SETUID, and CAP_SETGID capabilities. What this gets you is a whole new set of userids, meaning that user 500 will have a different 'struct user' in your namespace than in other namespaces. So any accounting information stored in struct user will be unique to your namespace. However, throughout the kernel there are checks which - simply check for a capability. Since root in a child namespace has all capabilities, this means that a child namespace is not constrained. - simply compare uid1 == uid2. Since these are the integer uids, uid 500 in namespace 1 will be said to be equal to uid 500 in namespace 2. As a result, the lxc implementation at lxc.sf.net does not use user namespaces. This is actually helpful because it leaves us free to develop user namespaces in such a way that, for some time, user namespaces may be unuseful. Bugs aside, this patchset is supposed to not at all affect systems which are not actively using user namespaces, and only restrict what tasks in child user namespace can do. They begin to limit privilege to a user namespace, so that root in a container cannot kill or ptrace tasks in the parent user namespace, and can only get world access rights to files. Since all files currently belong to the initila user namespace, that means that child user namespaces can only get world access rights to *all* files. While this temporarily makes user namespaces bad for system containers, it starts to get useful for some sandboxing. I've run the 'runltplite.sh' with and without this patchset and found no difference. This patch: copy_process() handles CLONE_NEWUSER before the rest of the namespaces. So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS) the new uts namespace will have the new user namespace as its owner. That is what we want, since we want root in that new userns to be able to have privilege over it. Changelog: Feb 15: don't set uts_ns->user_ns if we didn't create a new uts_ns. Feb 23: Move extern init_user_ns declaration from init/version.c to utsname.h. Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Daniel Lezcano <daniel.lezcano@free.fr> Acked-by: David Howells <dhowells@redhat.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-23pid: remove the child_reaper special case in init/main.cEric W. Biederman
This patchset is a cleanup and a preparation to unshare the pid namespace. These prerequisites prepare for Eric's patchset to give a file descriptor to a namespace and join an existing namespace. This patch: It turns out that the existing assignment in copy_process of the child_reaper can handle the initial assignment of child_reaper we just need to generalize the test in kernel/fork.c Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Serge E. Hallyn <serge@hallyn.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22init: return proper error code in do_mounts_rd()Davidlohr Bueso
In do_mounts_rd() if memory cannot be allocated, return -ENOMEM. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22calibrate: retry with wider bounds when converge seems to failPhil Carmody
Systems with unmaskable interrupts such as SMIs may massively underestimate loops_per_jiffy, and fail to converge anywhere near the real value. A case seen on x86_64 was an initial estimate of 256<<12, which converged to 511<<12 where the real value should have been over 630<<12. This admitedly requires bypassing the TSC calibration (lpj_fine), and a failure to settle in the direct calibration too, but is physically possible. This failure does not depend on my previous calibration optimisation, but by luck is easy to fix with the optimisation in place with a trivial retry loop. In the context of the optimised converging method, as we can no longer trust the starting estimate, enlarge the search bounds exponentially so that the number of retries is logarithmically bounded. [akpm@linux-foundation.org: mention x86_64 SMIs in comment] Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Tested-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22calibrate: home in on correct lpj value more quicklyPhil Carmody
Binary chop with a jiffy-resync on each step to find an upper bound is slow, so just race in a tight-ish loop to find an underestimate. If done with lots of individual steps, sometimes several hundreds of iterations would be required, which would impose a significant overhead, and make the initial estimate very low. By taking slowly increasing steps there will be less overhead. E.g. an x86_64 2.67GHz could have fitted in 613 individual small delays, but in reality should have been able to fit in a single delay 644 times longer, so underestimated by 31 steps. To reach the equivalent of 644 small delays with the accelerating scheme now requires about 130 iterations, so has <1/4th of the overhead, and can therefore be expected to underestimate by only 7 steps. As now we have a better initial estimate we can binary chop over a smaller range. With the loop overhead in the initial estimate kept low, and the step sizes moderate, we won't have under-estimated by much, so chose as tight a range as we can. Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Tested-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22calibrate: extract fall-back calculation into own helperPhil Carmody
The motivation for this patch series is that currently our OMAP calibrates itself using the trial-and-error binary chop fallback that some other architectures no longer need to perform. This is a lengthy process, taking 0.2s in an environment where boot time is of great interest. Patch 2/4 has two optimisations. Firstly, it replaces the initial repeated- doubling to find the relevant power of 2 with a tight loop that just does as much as it can in a jiffy. Secondly, it doesn't binary chop over an entire power of 2 range, it choses a much smaller range based on how much it squeezed in, and failed to squeeze in, during the first stage. Both are significant optimisations, and bring our calibration down from 23 jiffies to 5, and, in the process, often arrive at a more accurate lpj value. The 'bands' and 'sub-logarithmic' growth may look over-engineered, but they only cost a small level of inaccuracy in the initial guess (for all architectures) in order to avoid the very large inaccuracies that appeared during testing (on x86_64 architectures, and presumably others with less metronomic operation). Note that due to the existence of the TSC and other timers, the x86_64 will not typically use this fallback routine, but I wanted to code defensively, able to cope with all kinds of processor behaviours and kernel command line options. Patch 3/4 is an additional trap for the nightmare scenario where the initial estimate is very inaccurate, possibly due to things like SMIs. It simply retries with a larger bound. Stephen said: I tried this patch set out on an MSM7630. : : Before: : : Calibrating delay loop... 681.57 BogoMIPS (lpj=3407872) : : After: : : Calibrating delay loop... 680.75 BogoMIPS (lpj=3403776) : : But the really good news is calibration time dropped from ~247ms to ~56ms. : Sadly we won't be able to benefit from this should my udelay patches make : it into ARM because we would be using calibrate_delay_direct() instead (at : least on machines who choose to). Can we somehow reapply the logic behind : this to calibrate_delay_direct()? That would be even better, but this is : definitely a boot time improvement. : : Or maybe we could just replace calibrate_delay_direct() with this fallback : calculation? If __delay() is a thin wrapper around read_current_timer() : it should work just as well (plus patch 3 makes it handle SMIs). I'll try : that out. This patch: ... so that it can be modified more clinically. This is almost entirely cosmetic. The only change to the operation is that the global variable is only set once after the estimation is completed, rather than taking on all the intermediate values. However, there are no readers of that variable, so this change is unimportant. Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Tested-by: Stephen Boyd <sboyd@codeaurora.org> Cc: Greg KH <greg@kroah.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22smp: move smp setup functions to kernel/smp.cAmerigo Wang
Move setup_nr_cpu_ids(), smp_init() and some other SMP boot parameter setup functions from init/main.c to kenrel/smp.c, saves some #ifdef CONFIG_SMP. Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Rakib Mullick <rakib.mullick@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Tejun Heo <tj@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-22fs: use appropriate printk priority levelsMandeep Singh Baines
printk()s without a priority level default to KERN_WARNING. To reduce noise at KERN_WARNING, this patch set the priority level appriopriately for unleveled printks()s. This should be useful to folks that look at dmesg warnings closely. Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-18Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (47 commits) doc: CONFIG_UNEVICTABLE_LRU doesn't exist anymore Update cpuset info & webiste for cgroups dcdbas: force SMI to happen when expected arch/arm/Kconfig: remove one to many l's in the word. asm-generic/user.h: Fix spelling in comment drm: fix printk typo 'sracth' Remove one to many n's in a word Documentation/filesystems/romfs.txt: fixing link to genromfs drivers:scsi Change printk typo initate -> initiate serial, pch uart: Remove duplicate inclusion of linux/pci.h header fs/eventpoll.c: fix spelling mm: Fix out-of-date comments which refers non-existent functions drm: Fix printk typo 'failled' coh901318.c: Change initate to initiate. mbox-db5500.c Change initate to initiate. edac: correct i82975x error-info reported edac: correct i82975x mci initialisation edac: correct commented info fs: update comments to point correct document target: remove duplicate include of target/target_core_device.h from drivers/target/target_core_hba.c ... Trivial conflict in fs/eventpoll.c (spelling vs addition)
2011-03-16Merge branch 'config' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bklLinus Torvalds
* 'config' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: BKL: That's all, folks fs/locks.c: Remove stale FIXME left over from BKL conversion ipx: remove the BKL appletalk: remove the BKL x25: remove the BKL ufs: remove the BKL hpfs: remove the BKL drivers: remove extraneous includes of smp_lock.h tracing: don't trace the BKL adfs: remove the big kernel lock
2011-03-16Merge branch 'driver-core-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6 * 'driver-core-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core-2.6: (50 commits) printk: do not mangle valid userspace syslog prefixes efivars: Add Documentation efivars: Expose efivars functionality to external drivers. efivars: Parameterize operations. efivars: Split out variable registration efivars: parameterize efivars efivars: Make efivars bin_attributes dynamic efivars: move efivars globals into struct efivars drivers:misc: ti-st: fix debugging code kref: Fix typo in kref documentation UIO: add PRUSS UIO driver support Fix spelling mistakes in Documentation/zh_CN/SubmittingPatches firmware: Fix unaligned memory accesses in dmi-sysfs firmware: Add documentation for /sys/firmware/dmi firmware: Expose DMI type 15 System Event Log firmware: Break out system_event_log in dmi-sysfs firmware: Basic dmi-sysfs support firmware: Add DMI entry types to the headers Driver core: convert platform_{get,set}_drvdata to static inline functions Translate linux-2.6/Documentation/magic-number.txt into Chinese ...
2011-03-15Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (184 commits) perf probe: Clean up probe_point_lazy_walker() return value tracing: Fix irqoff selftest expanding max buffer tracing: Align 4 byte ints together in struct tracer tracing: Export trace_set_clr_event() tracing: Explain about unstable clock on resume with ring buffer warning ftrace/graph: Trace function entry before updating index ftrace: Add .ref.text as one of the safe areas to trace tracing: Adjust conditional expression latency formatting. tracing: Fix event alignment: skb:kfree_skb tracing: Fix event alignment: mce:mce_record tracing: Fix event alignment: kvm:kvm_hv_hypercall tracing: Fix event alignment: module:module_request tracing: Fix event alignment: ftrace:context_switch and ftrace:wakeup tracing: Remove lock_depth from event entry perf header: Stop using 'self' perf session: Use evlist/evsel for managing perf.data attributes perf top: Don't let events to eat up whole header line perf top: Fix events overflow in top command ring-buffer: Remove unused #include <linux/trace_irq.h> tracing: Add an 'overwrite' trace_option. ...
2011-03-15vfs: Add name to file handle conversion supportAneesh Kumar K.V
The syscall also return mount id which can be used to lookup file system specific information such as uuid in /proc/<pid>/mountinfo Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-05BKL: That's all, folksArnd Bergmann
This removes the implementation of the big kernel lock, at last. A lot of people have worked on this in the past, I so the credit for this patch should be with everyone who participated in the hunt. The names on the Cc list are the people that were the most active in this, according to the recorded git history, in alphabetical order. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Alan Cox <alan@linux.intel.com> Cc: Alessio Igor Bogani <abogani@texware.it> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Hendry <andrew.hendry@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Hans Verkuil <hverkuil@xs4all.nl> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Jan Blunck <jblunck@infradead.org> Cc: John Kacur <jkacur@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Oliver Neukum <oliver@neukum.org> Cc: Paul Menage <menage@google.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
2011-03-04perf cgroup: Fix a typo in kernel configLi Zefan
s/specificied/specified Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4D6F348C.2050804@cn.fujitsu.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-16perf: Add cgroup supportStephane Eranian
This kernel patch adds the ability to filter monitoring based on container groups (cgroups). This is for use in per-cpu mode only. The cgroup to monitor is passed as a file descriptor in the pid argument to the syscall. The file descriptor must be opened to the cgroup name in the cgroup filesystem. For instance, if the cgroup name is foo and cgroupfs is mounted in /cgroup, then the file descriptor is opened to /cgroup/foo. Cgroup mode is activated by passing PERF_FLAG_PID_CGROUP in the flags argument to the syscall. For instance to measure in cgroup foo on CPU1 assuming cgroupfs is mounted under /cgroup: struct perf_event_attr attr; int cgroup_fd, fd; cgroup_fd = open("/cgroup/foo", O_RDONLY); fd = perf_event_open(&attr, cgroup_fd, 1, -1, PERF_FLAG_PID_CGROUP); close(cgroup_fd); Signed-off-by: Stephane Eranian <eranian@google.com> [ added perf_cgroup_{exit,attach} ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4d590250.114ddf0a.689e.4482@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-15Merge branch 'master' into for-nextJiri Kosina
2011-02-10fix jiffy calculations in calibrate_delay_direct to handle overflowTim Deegan
Fixes a hang when booting as dom0 under Xen, when jiffies can be quite large by the time the kernel init gets this far. Signed-off-by: Tim Deegan <Tim.Deegan@citrix.com> [jbeulich@novell.com: !time_after() -> time_before_eq() as suggested by Jiri Slaby] Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-03sysfs: Capitalize description of SYSFS_DEPRECATED{_V2} optionsFerenc Wagner
Signed-off-by: Ferenc Wagner <wferi@niif.hu> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-01-20Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: smp: Allow on_each_cpu() to be called while early_boot_irqs_disabled status to init/main.c lockdep: Move early boot local IRQ enable/disable status to init/main.c
2011-01-20kconfig: rename CONFIG_EMBEDDED to CONFIG_EXPERTDavid Rientjes
The meaning of CONFIG_EMBEDDED has long since been obsoleted; the option is used to configure any non-standard kernel with a much larger scope than only small devices. This patch renames the option to CONFIG_EXPERT in init/Kconfig and fixes references to the option throughout the kernel. A new CONFIG_EMBEDDED option is added that automatically selects CONFIG_EXPERT when enabled and can be used in the future to isolate options that should only be considered for embedded systems (RISC architectures, SLOB, etc). Calling the option "EXPERT" more accurately represents its intention: only expert users who understand the impact of the configuration changes they are making should enable it. Reviewed-by: Ingo Molnar <mingo@elte.hu> Acked-by: David Woodhouse <david.woodhouse@intel.com> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Greg KH <gregkh@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jens Axboe <axboe@kernel.dk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Robin Holt <holt@sgi.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-20lockdep: Move early boot local IRQ enable/disable status to init/main.cTejun Heo
During early boot, local IRQ is disabled until IRQ subsystem is properly initialized. During this time, no one should enable local IRQ and some operations which usually are not allowed with IRQ disabled, e.g. operations which might sleep or require communications with other processors, are allowed. lockdep tracked this with early_boot_irqs_off/on() callbacks. As other subsystems need this information too, move it to init/main.c and make it generally available. While at it, toggle the boolean to early_boot_irqs_disabled instead of enabled so that it can be initialized with %false and %true indicates the exceptional condition. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Pekka Enberg <penberg@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> LKML-Reference: <20110120110635.GB6036@htj.dyndns.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-17Kconfig: BLK_THROTTLE -> BLK_DEV_THROTTLINGMichael Witten
It would seem that `CONFIG_BLK_THROTTLE' doesn't exist, as it is only referenced in the documentation for `CONFIG_BLK_CGROUP'. The only other choice is `CONFIG_BLK_DEV_THROTTLING': $ git grep --cached THROTTL -- \*Kconfig block/Kconfig:config BLK_DEV_THROTTLING init/Kconfig: CONFIG_BLK_THROTTLE=y. Signed-off-by: Michael Witten <mfwitten@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-01-17Kconfig: Typo: seti -> setMichael Witten
Also, I introduced some punctuation to facilitate reading. Signed-off-by: Michael Witten <mfwitten@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-01-15Merge branches 'core-fixes-for-linus', 'x86-fixes-for-linus', ↵Linus Torvalds
'timers-fixes-for-linus' and 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: avoid pointless blocked-task warnings rcu: demote SRCU_SYNCHRONIZE_DELAY from kernel-parameter status rtmutex: Fix comment about why new_owner can be NULL in wake_futex_pi() * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: x86, olpc: Add missing Kconfig dependencies x86, mrst: Set correct APB timer IRQ affinity for secondary cpu x86: tsc: Fix calibration refinement conditionals to avoid divide by zero x86, ia64, acpi: Clean up x86-ism in drivers/acpi/numa.c * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: timekeeping: Make local variables static time: Rename misnamed minsec argument of clocks_calc_mult_shift() * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: tracing: Remove syscall_exit_fields tracing: Only process module tracepoints once perf record: Add "nodelay" mode, disabled by default perf sched: Fix list of events, dropping unsupported ':r' modifier Revert "perf tools: Emit clearer message for sys_perf_event_open ENOENT return" perf top: Fix annotate segv perf evsel: Fix order of event list deletion
2011-01-14Merge branch 'rcu/next' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/urgent
2011-01-14rcu: demote SRCU_SYNCHRONIZE_DELAY from kernel-parameter statusPaul E. McKenney
Because the adaptive synchronize_srcu_expedited() approach has worked very well in testing, remove the kernel parameter and replace it by a C-preprocessor macro. If someone finds problems with this approach, a more complex and aggressively adaptive approach might be required. Longer term, SRCU will be merged with the other RCU implementations, at which point synchronize_srcu_expedited() will be event driven, just as synchronize_sched_expedited() currently is. At that point, there will be no need for this adaptive approach. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-01-13Merge branch 'for-next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits) Documentation/trace/events.txt: Remove obsolete sched_signal_send. writeback: fix global_dirty_limits comment runtime -> real-time ppc: fix comment typo singal -> signal drivers: fix comment typo diable -> disable. m68k: fix comment typo diable -> disable. wireless: comment typo fix diable -> disable. media: comment typo fix diable -> disable. remove doc for obsolete dynamic-printk kernel-parameter remove extraneous 'is' from Documentation/iostats.txt Fix spelling milisec -> ms in snd_ps3 module parameter description Fix spelling mistakes in comments Revert conflicting V4L changes i7core_edac: fix typos in comments mm/rmap.c: fix comment sound, ca0106: Fix assignment to 'channel'. hrtimer: fix a typo in comment init/Kconfig: fix typo anon_inodes: fix wrong function name in comment fix comment typos concerning "consistent" poll: fix a typo in comment ... Fix up trivial conflicts in: - drivers/net/wireless/iwlwifi/iwl-core.c (moved to iwl-legacy.c) - fs/ext4/ext4.h Also fix missed 'diabled' typo in drivers/net/bnx2x/bnx2x.h while at it.
2011-01-13decompressors: add boot-time XZ supportLasse Collin
This implements the API defined in <linux/decompress/generic.h> which is used for kernel, initramfs, and initrd decompression. This patch together with the first patch is enough for XZ-compressed initramfs and initrd; XZ-compressed kernel will need arch-specific changes. The buffering requirements described in decompress_unxz.c are stricter than with gzip, so the relevant changes should be done to the arch-specific code when adding support for XZ-compressed kernel. Similarly, the heap size in arch-specific pre-boot code may need to be increased (30 KiB is enough). The XZ decompressor needs memmove(), memeq() (memcmp() == 0), and memzero() (memset(ptr, 0, size)), which aren't available in all arch-specific pre-boot environments. I'm including simple versions in decompress_unxz.c, but a cleaner solution would naturally be nicer. Signed-off-by: Lasse Collin <lasse.collin@tukaani.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alain Knaff <alain@knaff.lu> Cc: Albin Tonnerre <albin.tonnerre@free-electrons.com> Cc: Phillip Lougher <phillip@lougher.demon.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-07Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (33 commits) usb: don't use flush_scheduled_work() speedtch: don't abuse struct delayed_work media/video: don't use flush_scheduled_work() media/video: explicitly flush request_module work ioc4: use static work_struct for ioc4_load_modules() init: don't call flush_scheduled_work() from do_initcalls() s390: don't use flush_scheduled_work() rtc: don't use flush_scheduled_work() mmc: update workqueue usages mfd: update workqueue usages dvb: don't use flush_scheduled_work() leds-wm8350: don't use flush_scheduled_work() mISDN: don't use flush_scheduled_work() macintosh/ams: don't use flush_scheduled_work() vmwgfx: don't use flush_scheduled_work() tpm: don't use flush_scheduled_work() sonypi: don't use flush_scheduled_work() hvsi: don't use flush_scheduled_work() xen: don't use flush_scheduled_work() gdrom: don't use flush_scheduled_work() ... Fixed up trivial conflict in drivers/media/video/bt8xx/bttv-input.c as per Tejun.
2011-01-06Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits) sched: Change wait_for_completion_*_timeout() to return a signed long sched, autogroup: Fix reference leak sched, autogroup: Fix potential access to freed memory sched: Remove redundant CONFIG_CGROUP_SCHED ifdef sched: Fix interactivity bug by charging unaccounted run-time on entity re-weight sched: Move periodic share updates to entity_tick() printk: Use this_cpu_{read|write} api on printk_pending sched: Make pushable_tasks CONFIG_SMP dependant sched: Add 'autogroup' scheduling feature: automated per session task groups sched: Fix unregister_fair_sched_group() sched: Remove unused argument dest_cpu to migrate_task() mutexes, sched: Introduce arch_mutex_cpu_relax() sched: Add some clock info to sched_debug cpu: Remove incorrect BUG_ON cpu: Remove unused variable sched: Fix UP build breakage sched: Make task dump print all 15 chars of proc comm sched: Update tg->shares after cpu.shares write sched: Allow update_cfs_load() to update global load sched: Implement demand based update_cfs_load() ...
2011-01-06Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (146 commits) tools, perf: Documentation for the power events API perf: Add calls to suspend trace point perf script: Make some lists static perf script: Use the default lost event handler perf session: Warn about errors when processing pipe events too perf tools: Fix perf_event.h header usage perf test: Clarify some error reports in the open syscall test x86, NMI: Add touch_nmi_watchdog to io_check_error delay x86: Avoid calling arch_trigger_all_cpu_backtrace() at the same time x86: Only call smp_processor_id in non-preempt cases perf timechart: Adjust perf timechart to the new power events perf: Clean up power events by introducing new, more generic ones perf: Do not export power_frequency, but power_start event perf test: Add test for counting open syscalls perf evsel: Auto allocate resources needed for some methods perf evsel: Use {cpu,thread}_map to shorten list of parameters perf tools: Refactor all_tids to hold nr and the map perf tools: Refactor cpumap to hold nr and the map perf evsel: Introduce per cpu and per thread open helpers perf evsel: Steal the counter reading routines from stat ...
2011-01-06Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: rcu: remove unused __list_for_each_rcu() macro rculist: fix borked __list_for_each_rcu() macro rcu: reduce __call_rcu()-induced contention on rcu_node structures rcu: limit rcu_node leaf-level fanout rcu: fine-tune grace-period begin/end checks rcu: Keep gpnum and completed fields synchronized rcu: Stop chasing QS if another CPU did it for us rcu: increase synchronize_sched_expedited() batching rcu: Make synchronize_srcu_expedited() fast if running readers rcu: fix race condition in synchronize_sched_expedited() rcu: update documentation/comments for Lai's adoption patch rcu,cleanup: simplify the code when cpu is dying rcu,cleanup: move synchronize_sched_expedited() out of sched.c rcu: get rid of obsolete "classic" names in TREE_RCU tracing rcu: Distinguish between boosting and boosted rcu: document TINY_RCU and TINY_PREEMPT_RCU tracing. rcu: add tracing for TINY_RCU and TINY_PREEMPT_RCU rcu: priority boosting for TINY_PREEMPT_RCU rcu: move TINY_RCU from softirq to kthread rcu: add priority-inversion testing to rcutorture
2011-01-05Merge commit 'v2.6.37' into perf/coreIngo Molnar
Merge reason: Add the final .37 tree. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-05Merge commit 'v2.6.37' into sched/coreIngo Molnar
Merge reason: Merge the final .37 tree. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-03name_to_dev_t() must not call __init codeJan Beulich
The function can't be __init itself (being called from some sysfs handler), and hence none of the functions it calls can be either. Signed-off-by: Jan Beulich <jbeulich@novell.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-12-24init: don't call flush_scheduled_work() from do_initcalls()Tejun Heo
The call to flush_scheduled_work() in do_initcalls() is there to make sure all works queued to system_wq by initcalls finish before the init sections are dropped. However, the call doesn't make much sense at this point - there already are multiple different workqueues and different subsystems are free to create and use their own. Ordering requirements are and should be expressed explicitly. Drop the call to prepare for the deprecation and removal of flush_scheduled_work(). Andrew suggested adding sanity check where the workqueue code checks whether any pending or running work has the work function in the init text section. However, checking this for running works requires the worker to keep track of the current function being executed, and checking only the pending works will miss most cases. As a violation will almost always be caught by the usual page fault mechanism, I don't think it would be worthwhile to make the workqueue code track extra state just for this. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org>
2010-12-23Merge branch 'rcu/next' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/rcu
2010-12-22init/Kconfig: fix typoJim Cromie
Signed-off-by:  Jim Cromie <jim.cromie@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-12-16init: Initialized IDR earlierPeter Zijlstra
perf_event_init() wants to start using IDR trees, its needs in turn are satisfied by mm_init(). Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101117222056.206992649@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-16perf: Move perf_event_init() into main.cPeter Zijlstra
Currently we call perf_event_init() from sched_init(). In order to make it more obvious move it to the cannnonical location. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <20101117222056.093629821@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-08Merge branch 'linus' into sched/coreIngo Molnar
Merge reason: we want to queue up dependent cleanup Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-12-07Merge commit 'v2.6.37-rc5' into perf/coreIngo Molnar
Merge reason: Pick up the latest -rc. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-30sched: Add 'autogroup' scheduling feature: automated per session task groupsMike Galbraith
A recurring complaint from CFS users is that parallel kbuild has a negative impact on desktop interactivity. This patch implements an idea from Linus, to automatically create task groups. Currently, only per session autogroups are implemented, but the patch leaves the way open for enhancement. Implementation: each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls setsid(), a new task group is created, the process is moved into the new task group, and a reference to the preveious task group is dropped. Child processes inherit this task group thereafter, and increase it's refcount. When the last thread of a process exits, the process's reference is dropped, such that when the last process referencing an autogroup exits, the autogroup is destroyed. At runqueue selection time, IFF a task has no cgroup assignment, its current autogroup is used. Autogroup bandwidth is controllable via setting it's nice level through the proc filesystem: cat /proc/<pid>/autogroup Displays the task's group and the group's nice level. echo <nice level> > /proc/<pid>/autogroup Sets the task group's shares to the weight of nice <level> task. Setting nice level is rate limited for !admin users due to the abuse risk of task group locking. The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via the boot option noautogroup, and can also be turned on/off on the fly via: echo [01] > /proc/sys/kernel/sched_autogroup_enabled ... which will automatically move tasks to/from the root task group. Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Markus Trippelsdorf <markus@trippelsdorf.de> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Paul Turner <pjt@google.com> Cc: Oleg Nesterov <oleg@redhat.com> [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ] Signed-off-by: Ingo Molnar <mingo@elte.hu> LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-29rcu: Make synchronize_srcu_expedited() fast if running readersPaul E. McKenney
The synchronize_srcu_expedited() function is currently quick if there are no active readers, but will delay a full jiffy if there are any. If these readers leave their SRCU read-side critical sections quickly, this is way too long to wait. So this commit first waits ten microseconds, and only then falls back to jiffy-at-a-time waiting. Reported-by: Avi Kivity <avi@redhat.com> Reported-by: Marcelo Tosatti <mtosatti@redhat.com> Tested-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29rcu: add tracing for TINY_RCU and TINY_PREEMPT_RCUPaul E. McKenney
Add tracing for the tiny RCU implementations, including statistics on boosting in the case of TINY_PREEMPT_RCU and RCU_BOOST. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-29rcu: priority boosting for TINY_PREEMPT_RCUPaul E. McKenney
Add priority boosting, but only for TINY_PREEMPT_RCU. This is enabled by the default-off RCU_BOOST kernel parameter. The priority to which to boost preempted RCU readers is controlled by the RCU_BOOST_PRIO kernel parameter (defaulting to real-time priority 1) and the time to wait before boosting the readers blocking a given grace period is controlled by the RCU_BOOST_DELAY kernel parameter (defaulting to 500 milliseconds). Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2010-11-26perf, arch: Cleanup perf-pmu init vs lockup-detectorPeter Zijlstra
The perf hardware pmu got initialized at various points in the boot, some before early_initcall() some after (notably arch_initcall). The problem is that the NMI lockup detector is ran from early_initcall() and expects the hardware pmu to be present. Sanitize this by moving all architecture hardware pmu implementations to initialize at early_initcall() and move the lockup detector to an explicit initcall right after that. Cc: paulus <paulus@samba.org> Cc: davem <davem@davemloft.net> Cc: Michael Cree <mcree@orcon.net.nz> Cc: Deng-Cheng Zhu <dengcheng.zhu@gmail.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Acked-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1290707759.2145.119.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-11-25cgroups: make swap accounting default behavior configurableMichal Hocko
Swap accounting can be configured by CONFIG_CGROUP_MEM_RES_CTLR_SWAP configuration option and then it is turned on by default. There is a boot option (noswapaccount) which can disable this feature. This makes it hard for distributors to enable the configuration option as this feature leads to a bigger memory consumption and this is a no-go for general purpose distribution kernel. On the other hand swap accounting may be very usuful for some workloads. This patch adds a new configuration option which controls the default behavior (CGROUP_MEM_RES_CTLR_SWAP_ENABLED). If the option is selected then the feature is turned on by default. It also adds a new boot parameter swapaccount[=1|0] which enhances the original noswapaccount parameter semantic by means of enable/disable logic (defaults to 1 if no value is provided to be still consistent with noswapaccount). The default behavior is unchanged (if CONFIG_CGROUP_MEM_RES_CTLR_SWAP is enabled then CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED is enabled as well) Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-11-17BKL: remove extraneous #include <smp_lock.h>Arnd Bergmann
The big kernel lock has been removed from all these files at some point, leaving only the #include. Remove this too as a cleanup. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>