summaryrefslogtreecommitdiff
path: root/kernel/cgroup.c
AgeCommit message (Collapse)Author
2014-07-25cgroup: remove synchronize_rcu() from cgroup_attach_{task|proc}()Li Zefan
These 2 syncronize_rcu()s make attaching a task to a cgroup quite slow, and it can't be ignored in some situations. A real case from Colin Cross: Android uses cgroups heavily to manage thread priorities, putting threads in a background group with reduced cpu.shares when they are not visible to the user, and in a foreground group when they are. Some RPCs from foreground threads to background threads will temporarily move the background thread into the foreground group for the duration of the RPC. This results in many calls to cgroup_attach_task. In cgroup_attach_task() it's task->cgroups that is protected by RCU, and put_css_set() calls kfree_rcu() to free it. If we remove this synchronize_rcu(), there can be threads in RCU-read sections accessing their old cgroup via current->cgroups with concurrent rmdir operation, but this is safe. # time for ((i=0; i<50; i++)) { echo $$ > /mnt/sub/tasks; echo $$ > /mnt/tasks; } real 0m2.524s user 0m0.008s sys 0m0.004s With this patch: real 0m0.004s user 0m0.004s sys 0m0.000s tj: These synchronize_rcu()s are utterly confused. synchornize_rcu() necessarily has to come between two operations to guarantee that the changes made by the former operation are visible to all rcu readers before proceeding to the latter operation. Here, synchornize_rcu() are at the end of attach operations with nothing beyond it. Its only effect would be delaying completion of write(2) to sysfs tasks/procs files until all rcu readers see the change, which doesn't mean anything. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Colin Cross <ccross@google.com> Conflicts: kernel/cgroup.c
2012-01-16Merge branch 'linux-3.1.y' into android-tegra-nv-3.1Varun Wadekar
Linux 3.1.9 Conflicts: Makefile Change-Id: I22227ab33ba7ddaba8e6fe049393c58a83d73648 Signed-off-by: Varun Wadekar <vwadekar@nvidia.com>
2012-01-12cgroup: fix to allow mounting a hierarchy by nameLi Zefan
commit 0d19ea866562e46989412a0676412fa0983c9ce7 upstream. If we mount a hierarchy with a specified name, the name is unique, and we can use it to mount the hierarchy without specifying its set of subsystem names. This feature is documented is Documentation/cgroups/cgroups.txt section 2.3 Here's an example: # mount -t cgroup -o cpuset,name=myhier xxx /cgroup1 # mount -t cgroup -o name=myhier xxx /cgroup2 But it was broken by commit 32a8cf235e2f192eb002755076994525cdbaa35a (cgroup: make the mount options parsing more accurate) This fixes the regression. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2012-01-11cgroups: fix a css_set not found bug in cgroup_attach_procMandeep Singh Baines
commit e0197aae59e55c06db172bfbe1a1cdb8c0e1cab3 upstream. There is a BUG when migrating a PF_EXITING proc. Since css_set_prefetch() is not called for the PF_EXITING case, find_existing_css_set() will return NULL inside cgroup_task_migrate() causing a BUG. This bug is easy to reproduce. Create a zombie and echo its pid to cgroup.procs. $ cat zombie.c \#include <unistd.h> int main() { if (fork()) pause(); return 0; } $ We are hitting this bug pretty regularly on ChromeOS. This bug is already fixed by Tejun Heo's cgroup patchset which is targetted for the next merge window: https://lkml.org/lkml/2011/11/1/356 I've create a smaller patch here which just fixes this bug so that a fix can be merged into the current release and stable. Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Downstream-Bug-Report: http://crosbug.com/23953 Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: containers@lists.linux-foundation.org Cc: cgroups@vger.kernel.org Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Paul Menage <paul@paulmenage.org> Cc: Olof Johansson <olofj@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Change-Id: I61bf2d48574d6ce5418b988e9547937c2efdd084 Reviewed-on: http://git-master/r/74185 Reviewed-by: Automatic_Commit_Validation_User Reviewed-by: Rohan Somvanshi <rsomvanshi@nvidia.com> Tested-by: Rohan Somvanshi <rsomvanshi@nvidia.com>
2012-01-06cgroups: fix a css_set not found bug in cgroup_attach_procMandeep Singh Baines
commit e0197aae59e55c06db172bfbe1a1cdb8c0e1cab3 upstream. There is a BUG when migrating a PF_EXITING proc. Since css_set_prefetch() is not called for the PF_EXITING case, find_existing_css_set() will return NULL inside cgroup_task_migrate() causing a BUG. This bug is easy to reproduce. Create a zombie and echo its pid to cgroup.procs. $ cat zombie.c \#include <unistd.h> int main() { if (fork()) pause(); return 0; } $ We are hitting this bug pretty regularly on ChromeOS. This bug is already fixed by Tejun Heo's cgroup patchset which is targetted for the next merge window: https://lkml.org/lkml/2011/11/1/356 I've create a smaller patch here which just fixes this bug so that a fix can be merged into the current release and stable. Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Downstream-Bug-Report: http://crosbug.com/23953 Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: containers@lists.linux-foundation.org Cc: cgroups@vger.kernel.org Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Paul Menage <paul@paulmenage.org> Cc: Olof Johansson <olofj@chromium.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-30cgroup: Add generic cgroup subsystem permission checksColin Cross
Rather than using explicit euid == 0 checks when trying to move tasks into a cgroup via CFS, move permission checks into each specific cgroup subsystem. If a subsystem does not specify a 'allow_attach' handler, then we fall back to doing our checks the old way. Use the 'allow_attach' handler for the 'cpu' cgroup to allow non-root processes to add arbitrary processes to a 'cpu' cgroup if it has the CAP_SYS_NICE capability set. This version of the patch adds a 'allow_attach' handler instead of reusing the 'can_attach' handler. If the 'can_attach' handler is reused, a new cgroup that implements 'can_attach' but not the permission checks could end up with no permission checks at all. Change-Id: Icfa950aa9321d1ceba362061d32dc7dfa2c64f0c Original-Author: San Mehat <san@google.com> Signed-off-by: Colin Cross <ccross@android.com>
2011-11-30Revert "cgroup: Add generic cgroup subsystem permission checks."Colin Cross
This reverts commit 1d38bc7d0523af2233b4280e2aeab34c6a076665. Change-Id: I2c5066b696cbdd5ca117ed74718bcb7e70e878e7 Signed-off-by: Colin Cross <ccross@android.com>
2011-11-30cgroup: Remove call to synchronize_rcu in cgroup_attach_taskColin Cross
synchronize_rcu can be very expensive, averaging 100 ms in some cases. In cgroup_attach_task, it is used to prevent a task->cgroups pointer dereferenced in an RCU read side critical section from being invalidated, by delaying the call to put_css_set until after an RCU grace period. To avoid the call to synchronize_rcu, make the put_css_set call rcu-safe by moving the deletion of the css_set links into free_css_set_work, scheduled by the rcu callback free_css_set_rcu. The decrement of the cgroup refcount is no longer synchronous with the call to put_css_set, which can result in the cgroup refcount staying positive after the last call to cgroup_attach_task returns. To allow the cgroup to be deleted with cgroup_rmdir synchronously after cgroup_attach_task, have rmdir check the refcount of all associated css_sets. If cgroup_rmdir is called on a cgroup for which the css_sets all have refcount zero but the cgroup refcount is nonzero, reuse the rmdir waitqueue to block the rmdir until free_css_set_work is called. Signed-off-by: Colin Cross <ccross@android.com>
2011-11-30cgroup: Set CGRP_RELEASABLE when adding to a cgroupColin Cross
Changes the meaning of CGRP_RELEASABLE to be set on any cgroup that has ever had a task or cgroup in it, or had css_get called on it. The bit is set in cgroup_attach_task, cgroup_create, and __css_get. It is not necessary to set the bit in cgroup_fork, as the task is either in the root cgroup, in which can never be released, or the task it was forked from already set the bit in croup_attach_task. Signed-off-by: Colin Cross <ccross@android.com>
2011-11-30cgroup: Add generic cgroup subsystem permission checks.San Mehat
Rather than using explicit euid == 0 checks when trying to move tasks into a cgroup via CFS, move permission checks into each specific cgroup subsystem. If a subsystem does not specify a 'can_attach' handler, then we fall back to doing our checks the old way. This way non-root processes can add arbitrary processes to a cgroup if all the registered subsystems on that cgroup agree. Also change explicit euid == 0 check to CAP_SYS_ADMIN Signed-off-by: San Mehat <san@google.com>
2011-07-27Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (54 commits) tpm_nsc: Fix bug when loading multiple TPM drivers tpm: Move tpm_tis_reenable_interrupts out of CONFIG_PNP block tpm: Fix compilation warning when CONFIG_PNP is not defined TOMOYO: Update kernel-doc. tpm: Fix a typo tpm_tis: Probing function for Intel iTPM bug tpm_tis: Fix the probing for interrupts tpm_tis: Delay ACPI S3 suspend while the TPM is busy tpm_tis: Re-enable interrupts upon (S3) resume tpm: Fix display of data in pubek sysfs entry tpm_tis: Add timeouts sysfs entry tpm: Adjust interface timeouts if they are too small tpm: Use interface timeouts returned from the TPM tpm_tis: Introduce durations sysfs entry tpm: Adjust the durations if they are too small tpm: Use durations returned from TPM TOMOYO: Enable conditional ACL. TOMOYO: Allow using argv[]/envp[] of execve() as conditions. TOMOYO: Allow using executable's realpath and symlink's target as conditions. TOMOYO: Allow using owner/group etc. of file objects as conditions. ... Fix up trivial conflict in security/tomoyo/realpath.c
2011-07-26atomic: use <linux/atomic.h>Arun Sharma
This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-25Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits) fs: Merge split strings treewide: fix potentially dangerous trailing ';' in #defined values/expressions uwb: Fix misspelling of neighbourhood in comment net, netfilter: Remove redundant goto in ebt_ulog_packet trivial: don't touch files that are removed in the staging tree lib/vsprintf: replace link to Draft by final RFC number doc: Kconfig: `to be' -> `be' doc: Kconfig: Typo: square -> squared doc: Konfig: Documentation/power/{pm => apm-acpi}.txt drivers/net: static should be at beginning of declaration drivers/media: static should be at beginning of declaration drivers/i2c: static should be at beginning of declaration XTENSA: static should be at beginning of declaration SH: static should be at beginning of declaration MIPS: static should be at beginning of declaration ARM: static should be at beginning of declaration rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check Update my e-mail address PCIe ASPM: forcedly -> forcibly gma500: push through device driver tree ... Fix up trivial conflicts: - arch/arm/mach-ep93xx/dma-m2p.c (deleted) - drivers/gpio/gpio-ep93xx.c (renamed and context nearby) - drivers/net/r8169.c (just context changes)
2011-07-20kill file_permission() completelyAl Viro
convert the last remaining caller to inode_permission() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-08rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_checkMichal Hocko
Since ca5ecddf (rcu: define __rcu address space modifier for sparse) rcu_dereference_check use rcu_read_lock_held as a part of condition automatically so callers do not have to do that as well. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-06-09cgroupfs: use init_cred when populating new cgroupfs mounteparis@redhat
We recently found that in some configurations SELinux was blocking the ability for cgroupfs to be mounted. The reason for this is because cgroupfs creates files and directories during the get_sb() call and also uses lookup_one_len() during that same get_sb() call. This is a problem since the security subsystem cannot initialize the superblock and the inodes in that filesystem until after the get_sb() call returns. Thus we leave the inodes in an unitialized state during get_sb(). For the vast majority of filesystems this is not an issue, but since cgroupfs uses lookup_on_len() it does search permission checks on the directories in the path it walks. Since the inode security state is not set up SELinux does these checks as if the inodes were 'unlabeled.' Many 'normal' userspace process do not have permission to interact with unlabeled inodes. The solution presented here is to do the permission checks of path walk and inode creation as the kernel rather than as the task that called mount. Since the kernel has permission to read/write/create unlabeled inodes the get_sb() call will complete successfully and the SELinux code will be able to initialize the superblock and those inodes created during the get_sb() call. This appears to be the same solution used by other filesystems such as devtmpfs to solve the same issue and should thus have no negative impact on other LSMs which currently work. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: James Morris <jmorris@namei.org>
2011-05-26cgroup: remove the ns_cgroupDaniel Lezcano
The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and leads to some problems: * cgroup creation is out-of-control * cgroup name can conflict when pids are looping * it is not possible to have a single process handling a lot of namespaces without falling in a exponential creation time * we may want to create a namespace without creating a cgroup The ns_cgroup was replaced by a compatibility flag 'clone_children', where a newly created cgroup will copy the parent cgroup values. The userspace has to manually create a cgroup and add a task to the 'tasks' file. This patch removes the ns_cgroup as suggested in the following thread: https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html The 'cgroup_clone' function is removed because it is no longer used. This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a printk warning users that the feature is planned for removal. Since that time we have heard from XXX users who were affected by this. Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jamal Hadi Salim <hadi@cyberus.ca> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Acked-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26cgroups: use flex_array in attach_procBen Blum
Convert cgroup_attach_proc to use flex_array. The cgroup_attach_proc implementation requires a pre-allocated array to store task pointers to atomically move a thread-group, but asking for a monolithic array with kmalloc() may be unreliable for very large groups. Using flex_array provides the same functionality with less risk of failure. This is a post-patch for cgroup-procs-write.patch. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26cgroups: make procs file writableBen Blum
Make procs file writable to move all threads by tgid at once. Add functionality that enables users to move all threads in a threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs' file. This current implementation makes use of a per-threadgroup rwsem that's taken for reading in the fork() path to prevent newly forking threads within the threadgroup from "escaping" while the move is in progress. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26cgroups: add per-thread subsystem callbacksBen Blum
Add cgroup subsystem callbacks for per-thread attachment in atomic contexts Add can_attach_task(), pre_attach(), and attach_task() as new callbacks for cgroups's subsystem interface. Unlike can_attach and attach, these are for per-thread operations, to be called potentially many times when attaching an entire threadgroup. Also, the old "bool threadgroup" interface is removed, as replaced by this. All subsystems are modified for the new interface - of note is cpuset, which requires from/to nodemasks for attach to be globally scoped (though per-cpuset would work too) to persist from its pre_attach to attach_task and attach. This is a pre-patch for cgroup-procs-writable.patch. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-07cgroup,rcu: convert call_rcu(__free_css_id_cb) to kfree_rcu()Lai Jiangshan
The rcu callback __free_css_id_cb() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(__free_css_id_cb). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-07cgroup,rcu: convert call_rcu(free_cgroup_rcu) to kfree_rcu()Lai Jiangshan
The rcu callback free_cgroup_rcu() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(free_cgroup_rcu). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-07cgroup,rcu: convert call_rcu(free_css_set_rcu) to kfree_rcu()Lai Jiangshan
The rcu callback free_css_set_rcu() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(free_css_set_rcu). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-03-31Fix common misspellingsLucas De Marchi
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-22cgroups: if you list_empty() a head then don't list_del() itPhil Carmody
list_del() leaves poison in the prev and next pointers. The next list_empty() will compare those poisons, and say the list isn't empty. Any list operations that assume the node is on a list because of such a check will be fooled into dereferencing poison. One needs to INIT the node after the del, and fortunately there's already a wrapper for that - list_del_init(). Some of the dels are followed by deallocations, so can be ignored, and one can be merged with an add to make a move. Apart from that, I erred on the side of caution in making nodes list_empty()-queriable. Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-16perf: Add cgroup supportStephane Eranian
This kernel patch adds the ability to filter monitoring based on container groups (cgroups). This is for use in per-cpu mode only. The cgroup to monitor is passed as a file descriptor in the pid argument to the syscall. The file descriptor must be opened to the cgroup name in the cgroup filesystem. For instance, if the cgroup name is foo and cgroupfs is mounted in /cgroup, then the file descriptor is opened to /cgroup/foo. Cgroup mode is activated by passing PERF_FLAG_PID_CGROUP in the flags argument to the syscall. For instance to measure in cgroup foo on CPU1 assuming cgroupfs is mounted under /cgroup: struct perf_event_attr attr; int cgroup_fd, fd; cgroup_fd = open("/cgroup/foo", O_RDONLY); fd = perf_event_open(&attr, cgroup_fd, 1, -1, PERF_FLAG_PID_CGROUP); close(cgroup_fd); Signed-off-by: Stephane Eranian <eranian@google.com> [ added perf_cgroup_{exit,attach} ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4d590250.114ddf0a.689e.4482@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-16cgroup: Fix cgroup_subsys::exit callbackPeter Zijlstra
Make the ::exit method act like ::attach, it is after all very nearly the same thing. The bug had no effect on correctness - fixing it is an optimization for the scheduler. Also, later perf-cgroups patches rely on it. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Menage <menage@google.com> LKML-Reference: <1297160655.13327.92.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-14Merge branch 'vfs-scale-working' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin * 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin: kernel: fix hlist_bl again cgroups: Fix a lockdep warning at cgroup removal fs: namei fix ->put_link on wrong inode in do_filp_open
2011-01-14cgroup_fs: fix cgroup use of simple_lookup()Al Viro
cgroup can't use simple_lookup(), since that'd override its desired ->d_op. Tested-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14cgroups: Fix a lockdep warning at cgroup removalLi Zefan
Commit 2fd6b7f5 ("fs: dcache scale subdirs") forgot to annotate a dentry lock, which caused a lockdep warning. Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-01-12switch cgroupAl Viro
switching it to s_d_op allows to kill the cgroup_lookup() kludge. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-07fs: dcache reduce branches in lookup pathNick Piggin
Reduce some branches and memory accesses in dcache lookup by adding dentry flags to indicate common d_ops are set, rather than having to check them. This saves a pointer memory access (dentry->d_op) in common path lookup situations, and saves another pointer load and branch in cases where we have d_op but not the particular operation. Patched with: git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache rationalise dget variantsNick Piggin
dget_locked was a shortcut to avoid the lazy lru manipulation when we already held dcache_lock (lru manipulation was relatively cheap at that point). However, how that the lru lock is an innermost one, we never hold it at any caller, so the lock cost can now be avoided. We already have well working lazy dcache LRU, so it should be fine to defer LRU manipulations to scan time. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache remove dcache_lockNick Piggin
dcache_lock no longer protects anything. remove it. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache scale subdirsNick Piggin
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't using dcache_lock for these anyway (eg. using i_mutex). Note: if we change the locking rule in future so that ->d_child protection is provided only with ->d_parent->d_lock, it may allow us to reduce some locking. But it would be an exception to an otherwise regular locking scheme, so we'd have to see some good results. Probably not worthwhile. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache scale dentry refcountNick Piggin
Make d_count non-atomic and protect it with d_lock. This allows us to ensure a 0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when we start protecting many other dentry members with d_lock. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: change d_delete semanticsNick Piggin
Change d_delete from a dentry deletion notification to a dentry caching advise, more like ->drop_inode. Require it to be constant and idempotent, and not take d_lock. This is how all existing filesystems use the callback anyway. This makes fine grained dentry locking of dput and dentry lru scanning much simpler. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07cgroup fs: avoid switching ->d_op on live dentryNick Piggin
Switching d_op on a live dentry is racy in general, so avoid it. In this case it is a negative dentry, which is safer, but there are still concurrent ops which may be called on d_op in that case (eg. d_revalidate). So in general a filesystem may not do this. Fix cgroupfs so as not to do this. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2010-10-29convert cgroup and cpusetAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-27cgroups: add check for strcpy destination string overflowEvgeny Kuznetsov
Function "strcpy" is used without check for maximum allowed source string length and could cause destination string overflow. Check for string length is added before using "strcpy". Function now is return error if source string length is more than a maximum. akpm: presently considered NotABug, but add the check for general future-safeness and robustness. Signed-off-by: Evgeny Kuznetsov <EXT-Eugeny.Kuznetsov@nokia.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27cgroup: make the mount options parsing more accurateDaniel Lezcano
Current behavior: ================= (1) When we mount a cgroup, we can specify the 'all' option which means to enable all the cgroup subsystems. This is the default option when no option is specified. (2) If we want to mount a cgroup with a subset of the supported cgroup subsystems, we have to specify a subsystems name list for the mount option. (3) If we specify another option like 'noprefix' or 'release_agent', the actual code wants the 'all' or a subsystem name option specified also. Not critical but a bit not friendly as we should assume (1) in this case. (4) Logically, the 'all' option is mutually exclusive with a subsystem name, but this is not detected. In other words: succeed : mount -t cgroup -o all,freezer cgroup /cgroup => is it 'all' or 'freezer' ? fails : mount -t cgroup -o noprefix cgroup /cgroup => succeed if we do '-o noprefix,all' The following patches consolidate a bit the mount options check. New behavior: ============= (1) untouched (2) untouched (3) the 'all' option will be by default when specifying other than a subsystem name option (4) raises an error In other words: fails : mount -t cgroup -o all,freezer cgroup /cgroup succeed : mount -t cgroup -o noprefix cgroup /cgroup For the sake of lisibility, the if ... then ... else ... if ... indentation when parsing the options has been changed to: if ... then ... continue fi Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jamal Hadi Salim <hadi@cyberus.ca> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27cgroup: add clone_children control fileDaniel Lezcano
The ns_cgroup is a control group interacting with the namespaces. When a new namespace is created, a corresponding cgroup is automatically created too. The cgroup name is the pid of the process who did 'unshare' or the child of 'clone'. This cgroup is tied with the namespace because it prevents a process to escape the control group and use the post_clone callback, so the child cgroup inherits the values of the parent cgroup. Unfortunately, the more we use this cgroup and the more we are facing problems with it: (1) when a process unshares, the cgroup name may conflict with a previous cgroup with the same pid, so unshare or clone return -EEXIST (2) the cgroup creation is out of control because there may have an application creating several namespaces where the system will automatically create several cgroups in his back and let them on the cgroupfs (eg. a vrf based on the network namespace). (3) the mix of (1) and (2) force an administrator to regularly check and clean these cgroups. This patchset removes the ns_cgroup by adding a new flag to the cgroup and the cgroupfs mount option. It enables the copy of the parent cgroup when a child cgroup is created. We can then safely remove the ns_cgroup as this flag brings a compatibility. We have now to manually create and add the task to a cgroup, which is consistent with the cgroup framework. This patch: Sent as an answer to a previous thread around the ns_cgroup. https://lists.linux-foundation.org/pipermail/containers/2009-June/018627.html It adds a control file 'clone_children' for a cgroup. This control file is a boolean specifying if the child cgroup should be a clone of the parent cgroup or not. The default value is 'false'. This flag makes the child cgroup to call the post_clone callback of all the subsystem, if it is available. At present, the cpuset is the only one which had implemented the post_clone callback. The option can be set at mount time by specifying the 'clone_children' mount option. Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Paul Menage <menage@google.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Jamal Hadi Salim <hadi@cyberus.ca> Cc: Matt Helsley <matthltc@us.ibm.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-25fs: do not assign default i_ino in new_inodeChristoph Hellwig
Instead of always assigning an increasing inode number in new_inode move the call to assign it into those callers that actually need it. For now callers that need it is estimated conservatively, that is the call is added to all filesystems that do not assign an i_ino by themselves. For a few more filesystems we can avoid assigning any inode number given that they aren't user visible, and for others it could be done lazily when an inode number is actually needed, but that's left for later patches. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-22Merge branch 'vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bklLinus Torvalds
* 'vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: (30 commits) BKL: remove BKL from freevxfs BKL: remove BKL from qnx4 autofs4: Only declare function when CONFIG_COMPAT is defined autofs: Only declare function when CONFIG_COMPAT is defined ncpfs: Lock socket in ncpfs while setting its callbacks fs/locks.c: prepare for BKL removal BKL: Remove BKL from ncpfs BKL: Remove BKL from OCFS2 BKL: Remove BKL from squashfs BKL: Remove BKL from jffs2 BKL: Remove BKL from ecryptfs BKL: Remove BKL from afs BKL: Remove BKL from USB gadgetfs BKL: Remove BKL from autofs4 BKL: Remove BKL from isofs BKL: Remove BKL from fat BKL: Remove BKL from ext2 filesystem BKL: Remove BKL from do_new_mount() BKL: Remove BKL from cgroup BKL: Remove BKL from NTFS ...
2010-10-07Merge branch 'rcu/urgent' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/rcu
2010-10-04BKL: Remove BKL from cgroupJan Blunck
The BKL is only used in remount_fs and get_sb that are both protected by the superblocks s_umount rw_semaphore. Therefore it is safe to remove the BKL entirely. Signed-off-by: Jan Blunck <jblunck@infradead.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04BKL: Explicitly add BKL around get_sb/fill_superJan Blunck
This patch is a preparation necessary to remove the BKL from do_new_mount(). It explicitly adds calls to lock_kernel()/unlock_kernel() around get_sb/fill_super operations for filesystems that still uses the BKL. I've read through all the code formerly covered by the BKL inside do_kern_mount() and have satisfied myself that it doesn't need the BKL any more. do_kern_mount() is already called without the BKL when mounting the rootfs and in nfsctl. do_kern_mount() calls vfs_kern_mount(), which is called from various places without BKL: simple_pin_fs(), nfs_do_clone_mount() through nfs_follow_mountpoint(), afs_mntpt_do_automount() through afs_mntpt_follow_link(). Both later functions are actually the filesystems follow_link inode operation. vfs_kern_mount() is calling the specified get_sb function and lets the filesystem do its job by calling the given fill_super function. Therefore I think it is safe to push down the BKL from the VFS to the low-level filesystems get_sb/fill_super operation. [arnd: do not add the BKL to those file systems that already don't use it elsewhere] Signed-off-by: Jan Blunck <jblunck@infradead.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Christoph Hellwig <hch@infradead.org>
2010-09-09cgroups: fix API thinkoMichael S. Tsirkin
Add cgroup_attach_task_all() The existing cgroup_attach_task_current_cg() API is called by a thread to attach another thread to all of its cgroups; this is unsuitable for cases where a privileged task wants to attach itself to the cgroups of a less privileged one, since the call must be made from the context of the target task. This patch adds a more generic cgroup_attach_task_all() API that allows both the source task and to-be-moved task to be specified. cgroup_attach_task_current_cg() becomes a specialization of the more generic new function. [menage@google.com: rewrote changelog] [akpm@linux-foundation.org: address reviewer comments] Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Ben Blum <bblum@google.com> Cc: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-19cgroups: __rcu annotationsArnd Bergmann
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2010-08-11cgroups: save space for the terminatorDan Carpenter
The original code didn't leave enough space for a NULL terminator. These strings are copied with strcpy() into fixed length buffers in cgroup_root_from_opts(). Signed-off-by: Dan Carpenter <error27@gmail.com> Acked-by: Serge E. Hallyn <serge@hallyn.com> Reviewd-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Ben Blum <bblum@andrew.cmu.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>