1. 02 Apr, 2018 10 commits
  2. 28 Mar, 2018 1 commit
  3. 05 Sep, 2017 1 commit
    • Miklos Szeredi's avatar
      ovl: don't allow writing ioctl on lower layer · 7c6893e3
      Miklos Szeredi authored
      Problem with ioctl() is that it's a file operation, yet often used as an
      inode operation (i.e. modify the inode despite the file being opened for
      mnt_want_write_file() is used by filesystems in such cases to get write
      access on an arbitrary open file.
      Since overlayfs lets filesystems do all file operations, including ioctl,
      this can lead to mnt_want_write_file() returning OK for a lower file and
      modification of that lower file.
      This patch prevents modification by checking if the file is from an
      overlayfs lower layer and returning EPERM in that case.
      Need to introduce a mnt_want_write_file_path() variant that still does the
      old thing for inode operations that can do the copy up + modification
      correctly in such cases (fchown, fsetxattr, fremovexattr).
      This does not address the correctness of such ioctls on overlayfs (the
      correct way would be to copy up and attempt to perform ioctl on upper
      In theory this could be a regression.  We very much hope that nobody is
      relying on such a hack in any sane setup.
      While this patch meddles in VFS code, it has no effect on non-overlayfs
      Reported-by: default avatar"zhangyi (F)" <yi.zhang@huawei.com>
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
  4. 04 Sep, 2017 1 commit
  5. 06 Jul, 2017 1 commit
    • Jeff Layton's avatar
      fs: new infrastructure for writeback error handling and reporting · 5660e13d
      Jeff Layton authored
      Most filesystems currently use mapping_set_error and
      filemap_check_errors for setting and reporting/clearing writeback errors
      at the mapping level. filemap_check_errors is indirectly called from
      most of the filemap_fdatawait_* functions and from
      filemap_write_and_wait*. These functions are called from all sorts of
      contexts to wait on writeback to finish -- e.g. mostly in fsync, but
      also in truncate calls, getattr, etc.
      The non-fsync callers are problematic. We should be reporting writeback
      errors during fsync, but many places spread over the tree clear out
      errors before they can be properly reported, or report errors at
      nonsensical times.
      If I get -EIO on a stat() call, there is no reason for me to assume that
      it is because some previous writeback failed. The fact that it also
      clears out the error such that a subsequent fsync returns 0 is a bug,
      and a nasty one since that's potentially silent data corruption.
      This patch adds a small bit of new infrastructure for setting and
      reporting errors during address_space writeback. While the above was my
      original impetus for adding this, I think it's also the case that
      current fsync semantics are just problematic for userland. Most
      applications that call fsync do so to ensure that the data they wrote
      has hit the backing store.
      In the case where there are multiple writers to the file at the same
      time, this is really hard to determine. The first one to call fsync will
      see any stored error, and the rest get back 0. The processes with open
      fds may not be associated with one another in any way. They could even
      be in different containers, so ensuring coordination between all fsync
      callers is not really an option.
      One way to remedy this would be to track what file descriptor was used
      to dirty the file, but that's rather cumbersome and would likely be
      slow. However, there is a simpler way to improve the semantics here
      without incurring too much overhead.
      This set adds an errseq_t to struct address_space, and a corresponding
      one is added to struct file. Writeback errors are recorded in the
      mapping's errseq_t, and the one in struct file is used as the "since"
      This changes the semantics of the Linux fsync implementation such that
      applications can now use it to determine whether there were any
      writeback errors since fsync(fd) was last called (or since the file was
      opened in the case of fsync having never been called).
      Note that those writeback errors may have occurred when writing data
      that was dirtied via an entirely different fd, but that's the case now
      with the current mapping_set_error/filemap_check_error infrastructure.
      This will at least prevent you from getting a false report of success.
      The new behavior is still consistent with the POSIX spec, and is more
      reliable for application developers. This patch just adds some basic
      infrastructure for doing this, and ensures that the f_wb_err "cursor"
      is properly set when a file is opened. Later patches will change the
      existing code to use this new infrastructure for reporting errors at
      fsync time.
      Signed-off-by: default avatarJeff Layton <jlayton@redhat.com>
      Reviewed-by: default avatarJan Kara <jack@suse.cz>
  6. 27 Jun, 2017 1 commit
    • Jens Axboe's avatar
      fs: add fcntl() interface for setting/getting write life time hints · c75b1d94
      Jens Axboe authored
      Define a set of write life time hints:
      RWH_WRITE_LIFE_NOT_SET	No hint information set
      RWH_WRITE_LIFE_NONE	No hints about write life time
      RWH_WRITE_LIFE_SHORT	Data written has a short life time
      RWH_WRITE_LIFE_MEDIUM	Data written has a medium life time
      RWH_WRITE_LIFE_LONG	Data written has a long life time
      RWH_WRITE_LIFE_EXTREME	Data written has an extremely long life time
      The intent is for these values to be relative to each other, no
      absolute meaning should be attached to these flag names.
      Add an fcntl interface for querying these flags, and also for
      setting them as well:
      F_GET_RW_HINT		Returns the read/write hint set on the
      			underlying inode.
      F_SET_RW_HINT		Set one of the above write hints on the
      			underlying inode.
      F_GET_FILE_RW_HINT	Returns the read/write hint set on the
      			file descriptor.
      F_SET_FILE_RW_HINT	Set one of the above write hints on the
      			file descriptor.
      The user passes in a 64-bit pointer to get/set these values, and
      the interface returns 0/-1 on success/error.
      Sample program testing/implementing basic setting/getting of write
      hints is below.
      Add support for storing the write life time hint in the inode flags
      and in struct file as well, and pass them to the kiocb flags. If
      both a file and its corresponding inode has a write hint, then we
      use the one in the file, if available. The file hint can be used
      for sync/direct IO, for buffered writeback only the inode hint
      is available.
      This is in preparation for utilizing these hints in the block layer,
      to guide on-media data placement.
       * writehint.c: get or set an inode write hint
       #include <stdio.h>
       #include <fcntl.h>
       #include <stdlib.h>
       #include <unistd.h>
       #include <stdbool.h>
       #include <inttypes.h>
       #ifndef F_GET_RW_HINT
       #define F_LINUX_SPECIFIC_BASE	1024
       #define F_GET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 11)
       #define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
      static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
      int main(int argc, char *argv[])
      	uint64_t hint;
      	int fd, ret;
      	if (argc < 2) {
      		fprintf(stderr, "%s: file <hint>\n", argv[0]);
      		return 1;
      	fd = open(argv[1], O_RDONLY);
      	if (fd < 0) {
      		return 2;
      	if (argc > 2) {
      		hint = atoi(argv[2]);
      		ret = fcntl(fd, F_SET_RW_HINT, &hint);
      		if (ret < 0) {
      			perror("fcntl: F_SET_RW_HINT");
      			return 4;
      	ret = fcntl(fd, F_GET_RW_HINT, &hint);
      	if (ret < 0) {
      		perror("fcntl: F_GET_RW_HINT");
      		return 3;
      	printf("%s: hint %s\n", argv[1], str[hint]);
      	return 0;
      Reviewed-by: default avatarMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
  7. 27 Apr, 2017 1 commit
  8. 21 Apr, 2017 1 commit
  9. 20 Apr, 2017 1 commit
  10. 17 Apr, 2017 1 commit
  11. 07 Feb, 2017 2 commits
  12. 24 Dec, 2016 1 commit
  13. 11 Oct, 2016 1 commit
  14. 03 Oct, 2016 1 commit
  15. 16 Sep, 2016 2 commits
    • Miklos Szeredi's avatar
      vfs: do get_write_access() on upper layer of overlayfs · 4d0c5ba2
      Miklos Szeredi authored
      The problem with writecount is: we want consistent handling of it for
      underlying filesystems as well as overlayfs.  Making sure i_writecount is
      correct on all layers is difficult.  Instead this patch makes sure that
      when write access is acquired, it's always done on the underlying writable
      layer (called the upper layer).  We must also make sure to look at the
      writecount on this layer when checking for conflicting leases.
      Open for write already updates the upper layer's writecount.  Leaving only
      For truncate copy up must happen before get_write_access() so that the
      writecount is updated on the upper layer.  Problem with this is if
      something fails after that, then copy-up was done needlessly.  E.g. if
      break_lease() was interrupted.  Probably not a big deal in practice.
      Another interesting case is if there's a denywrite on a lower file that is
      then opened for write or truncated.  With this patch these will succeed,
      which is somewhat counterintuitive.  But I think it's still acceptable,
      considering that the copy-up does actually create a different file, so the
      old, denywrite mapping won't be touched.
      On non-overlayfs d_real() is an identity function and d_real_inode() is
      equivalent to d_inode() so this patch doesn't change behavior in that case.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Acked-by: default avatarJeff Layton <jlayton@poochiereds.net>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
    • Miklos Szeredi's avatar
      locks: fix file locking on overlayfs · c568d683
      Miklos Szeredi authored
      This patch allows flock, posix locks, ofd locks and leases to work
      correctly on overlayfs.
      Instead of using the underlying inode for storing lock context use the
      overlay inode.  This allows locks to be persistent across copy-up.
      This is done by introducing locks_inode() helper and using it instead of
      file_inode() to get the inode in locking code.  For non-overlayfs the two
      are equivalent, except for an extra pointer dereference in locks_inode().
      Since lock operations are in "struct file_operations" we must also make
      sure not to call underlying filesystem's lock operations.  Introcude a
      super block flag MS_NOREMOTELOCK to this effect.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
      Acked-by: default avatarJeff Layton <jlayton@poochiereds.net>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
  16. 30 Jun, 2016 1 commit
    • Miklos Szeredi's avatar
      vfs: merge .d_select_inode() into .d_real() · 2d902671
      Miklos Szeredi authored
      The two methods essentially do the same: find the real dentry/inode
      belonging to an overlay dentry.  The difference is in the usage:
      vfs_open() uses ->d_select_inode() and expects the function to perform
      copy-up if necessary based on the open flags argument.
      file_dentry() uses ->d_real() passing in the overlay dentry as well as the
      underlying inode.
      vfs_rename() uses ->d_select_inode() but passes zero flags.  ->d_real()
      with a zero inode would have worked just as well here.
      This patch merges the functionality of ->d_select_inode() into ->d_real()
      by adding an 'open_flags' argument to the latter.
      [Al Viro] Make the signature of d_real() match that of ->d_real() again.
      And constify the inode argument, while we are at it.
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@redhat.com>
  17. 11 May, 2016 1 commit
  18. 02 May, 2016 1 commit
  19. 30 Mar, 2016 1 commit
  20. 28 Mar, 2016 3 commits
  21. 22 Mar, 2016 1 commit
    • Jann Horn's avatar
      fs/coredump: prevent fsuid=0 dumps into user-controlled directories · 378c6520
      Jann Horn authored
      This commit fixes the following security hole affecting systems where
      all of the following conditions are fulfilled:
       - The fs.suid_dumpable sysctl is set to 2.
       - The kernel.core_pattern sysctl's value starts with "/". (Systems
         where kernel.core_pattern starts with "|/" are not affected.)
       - Unprivileged user namespace creation is permitted. (This is
         true on Linux >=3.8, but some distributions disallow it by
         default using a distro patch.)
      Under these conditions, if a program executes under secure exec rules,
      causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
      namespace, changes its root directory and crashes, the coredump will be
      written using fsuid=0 and a path derived from kernel.core_pattern - but
      this path is interpreted relative to the root directory of the process,
      allowing the attacker to control where a coredump will be written with
      root privileges.
      To fix the security issue, always interpret core_pattern for dumps that
      are written under SUID_DUMP_ROOT relative to the root directory of init.
      Signed-off-by: default avatarJann Horn <jann@thejh.net>
      Acked-by: default avatarKees Cook <keescook@chromium.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
  22. 22 Jan, 2016 1 commit
    • Al Viro's avatar
      wrappers for ->i_mutex access · 5955102c
      Al Viro authored
      parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      inode_foo(inode) being mutex_foo(&inode->i_mutex).
      Please, use those for access to ->i_mutex; over the coming cycle
      ->i_mutex will become rwsem, with ->lookup() done with it held
      only shared.
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  23. 04 Jan, 2016 1 commit
  24. 10 Jul, 2015 1 commit
    • Eric W. Biederman's avatar
      vfs: Commit to never having exectuables on proc and sysfs. · 90f8572b
      Eric W. Biederman authored
      Today proc and sysfs do not contain any executable files.  Several
      applications today mount proc or sysfs without noexec and nosuid and
      then depend on there being no exectuables files on proc or sysfs.
      Having any executable files show on proc or sysfs would cause
      a user space visible regression, and most likely security problems.
      Therefore commit to never allowing executables on proc and sysfs by
      adding a new flag to mark them as filesystems without executables and
      enforce that flag.
      Test the flag where MNT_NOEXEC is tested today, so that the only user
      visible effect will be that exectuables will be treated as if the
      execute bit is cleared.
      The filesystems proc and sysfs do not currently incoporate any
      executable files so this does not result in any user visible effects.
      This makes it unnecessary to vet changes to proc and sysfs tightly for
      adding exectuable files or changes to chattr that would modify
      existing files, as no matter what the individual file say they will
      not be treated as exectuable files by the vfs.
      Not having to vet changes to closely is important as without this we
      are only one proc_create call (or another goof up in the
      implementation of notify_change) from having problematic executables
      on proc.  Those mistakes are all too easy to make and would create
      a situation where there are security issues or the assumptions of
      some program having to be broken (and cause userspace regressions).
      Signed-off-by: default avatar"Eric W. Biederman" <ebiederm@xmission.com>
  25. 23 Jun, 2015 2 commits
    • Jan Kara's avatar
      fs: Call security_ops->inode_killpriv on truncate · 45f147a1
      Jan Kara authored
      Comment in include/linux/security.h says that ->inode_killpriv() should
      be called when setuid bit is being removed and that similar security
      labels (in fact this applies only to file capabilities) should be
      removed at this time as well. However we don't call ->inode_killpriv()
      when we remove suid bit on truncate.
      We fix the problem by calling ->inode_need_killpriv() and subsequently
      ->inode_killpriv() on truncate the same way as we do it on file write.
      After this patch there's only one user of should_remove_suid() - ocfs2 -
      and indeed it's buggy because it doesn't call ->inode_killpriv() on
      write. However fixing it is difficult because of special locking
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
    • Miklos Szeredi's avatar
      vfs: add file_path() helper · 9bf39ab2
      Miklos Szeredi authored
      	d_path(&file->f_path, ...);
      	file_path(file, ...);
      Signed-off-by: default avatarMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
  26. 19 Jun, 2015 1 commit
    • David Howells's avatar
      overlayfs: Make f_path always point to the overlay and f_inode to the underlay · 4bacc9c9
      David Howells authored
      Make file->f_path always point to the overlay dentry so that the path in
      /proc/pid/fd is correct and to ensure that label-based LSMs have access to the
      overlay as well as the underlay (path-based LSMs probably don't need it).
      Using my union testsuite to set things up, before the patch I see:
      	[root@andromeda union-testsuite]# bash 5</mnt/a/foo107
      	[root@andromeda union-testsuite]# ls -l /proc/$$/fd/
      	lr-x------. 1 root root 64 Jun  5 14:38 5 -> /a/foo107
      	[root@andromeda union-testsuite]# stat /mnt/a/foo107
      	Device: 23h/35d Inode: 13381       Links: 1
      	[root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
      	Device: 23h/35d Inode: 13381       Links: 1
      After the patch:
      	[root@andromeda union-testsuite]# bash 5</mnt/a/foo107
      	[root@andromeda union-testsuite]# ls -l /proc/$$/fd/
      	lr-x------. 1 root root 64 Jun  5 14:22 5 -> /mnt/a/foo107
      	[root@andromeda union-testsuite]# stat /mnt/a/foo107
      	Device: 23h/35d Inode: 40346       Links: 1
      	[root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
      	Device: 23h/35d Inode: 40346       Links: 1
      Note the change in where /proc/$$/fd/5 points to in the ls command.  It was
      pointing to /a/foo107 (which doesn't exist) and now points to /mnt/a/foo107
      (which is correct).
      The inode accessed, however, is the lower layer.  The union layer is on device
      25h/37d and the upper layer on 24h/36d.
      Signed-off-by: default avatarDavid Howells <dhowells@redhat.com>
      Signed-off-by: default avatarAl Viro <viro@zeniv.linux.org.uk>