summaryrefslogtreecommitdiff
path: root/fs/namei.c
AgeCommit message (Collapse)Author
2011-09-14restore pinning the victim dentry in vfs_rmdir()/vfs_rename_dir()Al Viro
We used to get the victim pinned by dentry_unhash() prior to commit 64252c75a219 ("vfs: remove dget() from dentry_unhash()") and ->rmdir() and ->rename() instances relied on that; most of them don't care, but ones that used d_delete() themselves do. As the result, we are getting rmdir() oopses on NFS now. Just grab the reference before locking the victim and drop it explicitly after unlocking, same as vfs_rename_other() does. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Simon Kirby <sim@hostway.ca> Cc: stable@kernel.org (3.0.x) Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-09vfs: automount should ignore LOOKUP_FOLLOWMiklos Szeredi
Prior to 2.6.38 automount would not trigger on either stat(2) or lstat(2) on the automount point. After 2.6.38, with the introduction of the ->d_automount() infrastructure, stat(2) and others would start triggering automount while lstat(2), etc. still would not. This is a regression and a userspace ABI change. Problem originally reported here: http://thread.gmane.org/gmane.linux.kernel.autofs/6098 It appears that there was an attempt at fixing various userspace tools to not trigger the automount. But since the stat system call is rather common it is impossible to "fix" all userspace. This patch reverts the original behavior, which is to not trigger on stat(2) and other symlink following syscalls. [ It's not really clear what the right behavior is. Apparently Solaris does the "automount on stat, leave alone on lstat". And some programs can get unhappy when "stat+open+fstat" ends up giving a different result from the fstat than from the initial stat. But the change in 2.6.38 resulted in problems for some people, so we're going back to old behavior. Maybe we can re-visit this discussion at some future date - Linus ] Reported-by: Leonardo Chiquitto <leonardo.lists@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Acked-by: Ian Kent <raven@themaw.net> Cc: David Howells <dhowells@redhat.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-07vfs: rename 'do_follow_link' to 'should_follow_link'Linus Torvalds
Al points out that the do_follow_link() helper function really is misnamed - it's about whether we should try to follow a symlink or not, not about actually doing the following. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-07Fix POSIX ACL permission checkAri Savolainen
After commit 3567866bf261: "RCUify freeing acls, let check_acl() go ahead in RCU mode if acl is cached" posix_acl_permission is being called with an unsupported flag and the permission check fails. This patch fixes the issue. Signed-off-by: Ari Savolainen <ari.m.savolainen@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-08-06vfs: optimize inode cache access patternsLinus Torvalds
The inode structure layout is largely random, and some of the vfs paths really do care. The path lookup in particular is already quite D$ intensive, and profiles show that accessing the 'inode->i_op->xyz' fields is quite costly. We already optimized the dcache to not unnecessarily load the d_op structure for members that are often NULL using the DCACHE_OP_xyz bits in dentry->d_flags, and this does something very similar for the inode ops that are used during pathname lookup. It also re-orders the fields so that the fields accessed by 'stat' are together at the beginning of the inode structure, and roughly in the order accessed. The effect of this seems to be in the 1-2% range for an empty kernel "make -j" run (which is fairly kernel-intensive, mostly in filename lookup), so it's visible. The numbers are fairly noisy, though, and likely depend a lot on exact microarchitecture. So there's more tuning to be done. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-08-03RCUify freeing acls, let check_acl() go ahead in RCU mode if acl is cachedAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-08-01VFS: Fix automount for negative autofs dentriesDavid Howells
Autofs may set the DCACHE_NEED_AUTOMOUNT flag on negative dentries. These need attention from the automounter daemon regardless of the LOOKUP_FOLLOW flag. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-25vfs: fix check_acl compile error when CONFIG_FS_POSIX_ACL is not setLinus Torvalds
Commit e77819e57f08 ("vfs: move ACL cache lookup into generic code") didn't take the FS_POSIX_ACL config variable into account - when that is not set, ACL's go away, and the cache helper functions do not exist, causing compile errors like fs/namei.c: In function 'check_acl': fs/namei.c:191:10: error: implicit declaration of function 'negative_cached_acl' fs/namei.c:196:2: error: implicit declaration of function 'get_cached_acl' fs/namei.c:196:6: warning: assignment makes pointer from integer without a cast fs/namei.c:212:11: error: implicit declaration of function 'set_cached_acl' Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Acked-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-25vfs: make gcc generate more obvious code for acl permission checkingLinus Torvalds
The "fsuid is the inode owner" case is not necessarily always the likely case, but it's the case that doesn't do anything odd and that we want in straight-line code. Make gcc not generate random "jump around for the fun of it" code. This just helps me read profiles. That thing is one of the hottest parts of the whole pathname lookup. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-25fs: take the ACL checks to common codeChristoph Hellwig
Replace the ->check_acl method with a ->get_acl method that simply reads an ACL from disk after having a cache miss. This means we can replace the ACL checking boilerplate code with a single implementation in namei.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-25vfs: move ACL cache lookup into generic codeLinus Torvalds
This moves logic for checking the cached ACL values from low-level filesystems into generic code. The end result is a streamlined ACL check that doesn't need to load the inode->i_op->check_acl pointer at all for the common cached case. The filesystems also don't need to check for a non-blocking RCU walk case in their acl_check() functions, because that is all handled at a VFS layer. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20VFS: Fixup kerneldoc for generic_permission()Tobias Klauser
The flags parameter went away in d749519b444db985e40b897f73ce1898b11f997e Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20unexport kern_path_parent()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20switch vfs_path_lookup() to struct pathAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20kill lookup_create()Al Viro
folded into the only caller (kern_path_create()) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20new helpers: kern_path_create/user_path_createAl Viro
combination of kern_path_parent() and lookup_create(). Does *not* expose struct nameidata to caller. Syscalls converted to that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20kill LOOKUP_CONTINUEAl Viro
LOOKUP_PARENT is equivalent to it now Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20don't transliterate lower bits of ->intent.open.flags to FMODE_...Al Viro
->create() instances are much happier that way... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20Don't pass nameidata when calling vfs_create() from mknod()Al Viro
All instances can cope with that now (and ceph one actually starts working properly). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20merge do_revalidate() into its only callerAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20no reason to keep exec_permission() separate nowAl Viro
cache footprint alone makes it a bad idea... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20massage generic_permission() to treat directories on a separate pathAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20->permission() sanitizing: don't pass flags to exec_permission()Al Viro
pass mask instead; kill security_inode_exec_permission() since we can use security_inode_permission() instead. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20->permission() sanitizing: don't pass flags to ->permission()Al Viro
not used by the instances anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20->permission() sanitizing: don't pass flags to generic_permission()Al Viro
redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of them removes that bit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20->permission() sanitizing: don't pass flags to ->check_acl()Al Viro
not used in the instances anymore. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20->permission() sanitizing: pass MAY_NOT_BLOCK to ->check_acl()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20->permission() sanitizing: MAY_NOT_BLOCKAl Viro
Duplicate the flags argument into mask bitmap. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20kill check_acl callback of generic_permission()Al Viro
its value depends only on inode and does not change; we might as well store it in ->i_op->check_acl and be done with that. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20lockless get_write_access/deny_write_accessAl Viro
new helpers: atomic_inc_unless_negative()/atomic_dec_unless_positive() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20move exec_permission() up to the rest of permission-related functionsAl Viro
... and convert the comment before it into linuxdoc form. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20kill file_permission() completelyAl Viro
convert the last remaining caller to inode_permission() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20switch path_init() to exec_permission()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20make exec_permission(dir) really equivalent to inode_permission(dir, MAY_EXEC)Al Viro
capability overrides apply only to the default case; if fs has ->permission() that does _not_ call generic_permission(), we have no business doing them. Moreover, if it has ->permission() that does call generic_permission(), we have no need to recheck capabilities. Besides, the capability overrides should apply only if we got EACCES from acl_permission_check(); any other value (-EIO, etc.) should be returned to caller, capabilities or not capabilities. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-20fs: add a DCACHE_NEED_LOOKUP flag for d_flagsJosef Bacik
Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order for readdir, but unfortunately in the case of anything that stats the files in order that readdir spits back (like oh say ls) that means we still have to do the normal lookup of the file, which means looking up our other index and then looking up the inode. What I want is a way to create dummy dentries when we find them in readdir so that when ls or anything else subsequently does a stat(), we already have the location information in the dentry and can go straight to the inode itself. The lookup stuff just assumes that if it finds a dentry it is done, it doesn't perform a lookup. So add a DCACHE_NEED_LOOKUP flag so that the lookup code knows it still needs to run i_op->lookup() on the parent to get the inode for the dentry. I have tested this with btrfs and I went from something that looks like this http://people.redhat.com/jwhiter/ls-noreada.png To this http://people.redhat.com/jwhiter/ls-good.png Thats a savings of 1300 seconds, or 22 minutes. That is a significant savings. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-19vfs: fix race in rcu lookup of pruned dentryLinus Torvalds
Don't update *inode in __follow_mount_rcu() until we'd verified that there is mountpoint there. Kudos to Hugh Dickins for catching that one in the first place and eventually figuring out the solution (and catching a braino in the earlier version of patch). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-12Fix ->d_lock locking order in unlazy_walk()Al Viro
Make sure that child is still a child of parent before nested locking of child->d_lock in unlazy_walk(); otherwise we are risking a violation of locking order and deadlocks. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-06-20fix comment in generic_permission()Al Viro
CAP_DAC_OVERRIDE is enough for MAY_EXEC on directory, even if no exec bits are set. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-06-20kill obsolete comment for follow_down()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-06-16VFS: Fix vfsmount overput on simultaneous automountAl Viro
[Kudos to dhowells for tracking that crap down] If two processes attempt to cause automounting on the same mountpoint at the same time, the vfsmount holding the mountpoint will be left with one too few references on it, causing a BUG when the kernel tries to clean up. The problem is that lock_mount() drops the caller's reference to the mountpoint's vfsmount in the case where it finds something already mounted on the mountpoint as it transits to the mounted filesystem and replaces path->mnt with the new mountpoint vfsmount. During a pathwalk, however, we don't take a reference on the vfsmount if it is the same as the one in the nameidata struct, but do_add_mount() doesn't know this. The fix is to make sure we have a ref on the vfsmount of the mountpoint before calling do_add_mount(). However, if lock_mount() doesn't transit, we're then left with an extra ref on the mountpoint vfsmount which needs releasing. We can handle that in follow_managed() by not making assumptions about what we can and what we cannot get from lookup_mnt() as the current code does. The callers of follow_managed() expect that reference to path->mnt will be grabbed iff path->mnt has been changed. follow_managed() and follow_automount() keep track of whether such reference has been grabbed and assume that it'll happen in those and only those cases that'll have us return with changed path->mnt. That assumption is almost correct - it breaks in case of racing automounts and in even harder to hit race between following a mountpoint and a couple of mount --move. The thing is, we don't need to make that assumption at all - after the end of loop in follow_manage() we can check if path->mnt has ended up unchanged and do mntput() if needed. The BUG can be reproduced with the following test program: #include <stdio.h> #include <sys/types.h> #include <sys/stat.h> #include <unistd.h> #include <sys/wait.h> int main(int argc, char **argv) { int pid, ws; struct stat buf; pid = fork(); stat(argv[1], &buf); if (pid > 0) wait(&ws); return 0; } and the following procedure: (1) Mount an NFS volume that on the server has something else mounted on a subdirectory. For instance, I can mount / from my server: mount warthog:/ /mnt -t nfs4 -r On the server /data has another filesystem mounted on it, so NFS will see a change in FSID as it walks down the path, and will mark /mnt/data as being a mountpoint. This will cause the automount code to be triggered. !!! Do not look inside the mounted fs at this point !!! (2) Run the above program on a file within the submount to generate two simultaneous automount requests: /tmp/forkstat /mnt/data/testfile (3) Unmount the automounted submount: umount /mnt/data (4) Unmount the original mount: umount /mnt At this point the kernel should throw a BUG with something like the following: BUG: Dentry ffff880032e3c5c0{i=2,n=} still in use (1) [unmount of nfs4 0:12] Note that the bug appears on the root dentry of the original mount, not the mountpoint and not the submount because sys_umount() hasn't got to its final mntput_no_expire() yet, but this isn't so obvious from the call trace: [<ffffffff8117cd82>] shrink_dcache_for_umount+0x69/0x82 [<ffffffff8116160e>] generic_shutdown_super+0x37/0x15b [<ffffffffa00fae56>] ? nfs_super_return_all_delegations+0x2e/0x1b1 [nfs] [<ffffffff811617f3>] kill_anon_super+0x1d/0x7e [<ffffffffa00d0be1>] nfs4_kill_super+0x60/0xb6 [nfs] [<ffffffff81161c17>] deactivate_locked_super+0x34/0x83 [<ffffffff811629ff>] deactivate_super+0x6f/0x7b [<ffffffff81186261>] mntput_no_expire+0x18d/0x199 [<ffffffff811862a8>] mntput+0x3b/0x44 [<ffffffff81186d87>] release_mounts+0xa2/0xbf [<ffffffff811876af>] sys_umount+0x47a/0x4ba [<ffffffff8109e1ca>] ? trace_hardirqs_on_caller+0x1fd/0x22f [<ffffffff816ea86b>] system_call_fastpath+0x16/0x1b as do_umount() is inlined. However, you can see release_mounts() in there. Note also that it may be necessary to have multiple CPU cores to be able to trigger this bug. Tested-by: Jeff Layton <jlayton@redhat.com> Tested-by: Ian Kent <raven@themaw.net> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-06-16fix wrong iput on d_inode introduced by e6bc45d65dTörök Edwin
Git bisection shows that commit e6bc45d65df8599fdbae73be9cec4ceed274db53 causes BUG_ONs under high I/O load: kernel BUG at fs/inode.c:1368! [ 2862.501007] Call Trace: [ 2862.501007] [<ffffffff811691d8>] d_kill+0xf8/0x140 [ 2862.501007] [<ffffffff81169c19>] dput+0xc9/0x190 [ 2862.501007] [<ffffffff8115577f>] fput+0x15f/0x210 [ 2862.501007] [<ffffffff81152171>] filp_close+0x61/0x90 [ 2862.501007] [<ffffffff81152251>] sys_close+0xb1/0x110 [ 2862.501007] [<ffffffff814c14fb>] system_call_fastpath+0x16/0x1b A reliable way to reproduce this bug is: Login to KDE, run 'rsnapshot sync', and apt-get install openjdk-6-jdk, and apt-get remove openjdk-6-jdk. The buggy part of the patch is this: struct inode *inode = NULL; ..... - if (nd.last.name[nd.last.len]) - goto slashes; inode = dentry->d_inode; - if (inode) - ihold(inode); + if (nd.last.name[nd.last.len] || !inode) + goto slashes; + ihold(inode) ... if (inode) iput(inode); /* truncate the inode here */ If nd.last.name[nd.last.len] is nonzero (and thus goto slashes branch is taken), and dentry->d_inode is non-NULL, then this code now does an additional iput on the inode, which is wrong. Fix this by only setting the inode variable if nd.last.name[nd.last.len] is 0. Reference: https://lkml.org/lkml/2011/6/15/50 Reported-by: Norbert Preining <preining@logic.at> Reported-by: Török Edwin <edwintorok@gmail.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Török Edwin <edwintorok@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-06-07vfs: make unlink() and rmdir() return ENOENT in preference to EROFSTheodore Ts'o
If user space attempts to remove a non-existent file or directory, and the file system is mounted read-only, return ENOENT instead of EROFS. Either error code is arguably valid/correct, but ENOENT is a more specific error message. Reported-by: Michael Tokarev <mjt@tls.msk.ru> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-30vfs: shrink_dcache_parent before rmdir, dir renameSage Weil
The dentry_unhash push-down series missed that shink_dcache_parent needs to be called prior to rmdir or dir rename to clear DCACHE_REFERENCED and allow efficient dentry reclaim. Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-27Lift the check for automount points into do_lookup()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-27Trim excessive arguments of follow_mount_rcu()Al Viro
... and kill a useless local variable in follow_dotdot_rcu(), while we are at it - follow_mount_rcu(nd, path, inode) *always* assigned value to *inode, and always it had been path->dentry->d_inode (aka nd->path.dentry->d_inode, since it always got &nd->path as the second argument). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-27split __follow_mount_rcu() into normal and .. casesAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (25 commits) cifs: remove unnecessary dentry_unhash on rmdir/rename_dir ocfs2: remove unnecessary dentry_unhash on rmdir/rename_dir exofs: remove unnecessary dentry_unhash on rmdir/rename_dir nfs: remove unnecessary dentry_unhash on rmdir/rename_dir ext2: remove unnecessary dentry_unhash on rmdir/rename_dir ext3: remove unnecessary dentry_unhash on rmdir/rename_dir ext4: remove unnecessary dentry_unhash on rmdir/rename_dir btrfs: remove unnecessary dentry_unhash in rmdir/rename_dir ceph: remove unnecessary dentry_unhash calls vfs: clean up vfs_rename_other vfs: clean up vfs_rename_dir vfs: clean up vfs_rmdir vfs: fix vfs_rename_dir for FS_RENAME_DOES_D_MOVE filesystems libfs: drop unneeded dentry_unhash vfs: update dentry_unhash() comment vfs: push dentry_unhash on rename_dir into file systems vfs: push dentry_unhash on rmdir into file systems vfs: remove dget() from dentry_unhash() vfs: dentry_unhash immediately prior to rmdir vfs: Block mmapped writes while the fs is frozen ...
2011-05-26vfs: clean up vfs_rename_otherSage Weil
Simplify control flow to match vfs_rename_dir. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26vfs: clean up vfs_rename_dirSage Weil
Simplify control flow through vfs_rename_dir. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-05-26vfs: clean up vfs_rmdirSage Weil
Simplify the control flow with an out label. Signed-off-by: Sage Weil <sage@newdream.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>