1. 25 Feb, 2018 2 commits
  2. 24 Feb, 2018 1 commit
    • lock_parent() needs to recheck if dentry got __dentry_kill'ed under it · 3b821409
      Al Viro authored
      
      
      In case when dentry passed to lock_parent() is protected from freeing only
      by the fact that it's on a shrink list and trylock of parent fails, we
      could get hit by __dentry_kill() (and subsequent dentry_kill(parent))
      between unlocking dentry and locking presumed parent.  We need to recheck
      that dentry is alive once we lock both it and parent *and* postpone
      rcu_read_unlock() until after that point.  Otherwise we could return
      a pointer to struct dentry that already is rcu-scheduled for freeing, with
      ->d_lock held on it; caller's subsequent attempt to unlock it can end
      up with memory corruption.
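
      A minimal sketch of the resulting pattern in lock_parent()'s trylock-failure
      path (simplified; an illustration of the recheck described above, not the
      literal hunk):

          rcu_read_lock();                /* keeps the dentry memory valid for now */
          spin_unlock(&dentry->d_lock);
          parent = READ_ONCE(dentry->d_parent);
          spin_lock(&parent->d_lock);
          ...
          spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
          if (unlikely(dentry->d_lockref.count < 0)) {
                  /* __dentry_kill() got there first; don't hand back a dead pair */
                  spin_unlock(&parent->d_lock);
                  parent = NULL;
          }
          rcu_read_unlock();              /* only after both locks are settled */
          return parent;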
      
      Cc: stable@vger.kernel.org # 3.12+, counting backports
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  3. 01 Feb, 2018 2 commits
  4. 26 Jan, 2018 2 commits
  5. 24 Jan, 2018 2 commits
  6. 15 Jan, 2018 2 commits
    • vfs: Define usercopy region in names_cache slab caches · 6a9b8820
      David Windsor authored
      
      
      VFS pathnames are stored in the names_cache slab cache, either inline
      or across an entire allocation entry (when approaching PATH_MAX). These
      are copied to/from userspace, so they must be entirely whitelisted.
      
      cache object allocation:
          include/linux/fs.h:
              #define __getname()    kmem_cache_alloc(names_cachep, GFP_KERNEL)
      
      example usage trace:
          strncpy_from_user+0x4d/0x170
          getname_flags+0x6f/0x1f0
          user_path_at_empty+0x23/0x40
          do_mount+0x69/0xda0
          SyS_mount+0x83/0xd0
      
          fs/namei.c:
              getname_flags(...):
                  ...
                  result = __getname();
                  ...
                  kname = (char *)result->iname;
                  result->name = kname;
                  len = strncpy_from_user(kname, filename, EMBEDDED_NAME_MAX);
                  ...
                  if (unlikely(len == EMBEDDED_NAME_MAX)) {
                      const size_t size = offsetof(struct filename, iname[1]);
                      kname = (char *)result;
      
                      result = kzalloc(size, GFP_KERNEL);
                      ...
                      result->name = kname;
                      len = strncpy_from_user(kname, filename, PATH_MAX);
      
      In support of usercopy hardening, this patch defines the entire cache
      object in the names_cache slab cache as whitelisted, since it may entirely
      hold name strings to be copied to/from userspace.
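
      In code, the whole-object whitelist amounts to creating the cache with the
      usercopy variant of kmem_cache_create(); a sketch (flags shown here are
      assumed from the existing names_cache setup):

          names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0,
                          SLAB_HWCACHE_ALIGN | SLAB_PANIC,
                          0, PATH_MAX,    /* usercopy region: the entire object */
                          NULL);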
      
      This patch is verbatim from Brad Spengler/PaX Team's PAX_USERCOPY
      whitelisting code in the last public patch of grsecurity/PaX based on my
      understanding of the code. Changes or omissions from the original code are
      mine and don't reflect the original grsecurity/PaX code.
      Signed-off-by: David Windsor <dave@nullcore.net>
      [kees: adjust commit log, add usage trace]
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
    • dcache: Define usercopy region in dentry_cache slab cache · 80344266
      David Windsor authored
      
      
      When a dentry name is short enough, it can be stored directly in the
      dentry itself (instead of in a separate kmalloc allocation). These dentry
      short names, stored in struct dentry.d_iname and therefore contained in
      the dentry_cache slab cache, need to be copied to userspace.
      
      cache object allocation:
          fs/dcache.c:
              __d_alloc(...):
                  ...
                  dentry = kmem_cache_alloc(dentry_cache, ...);
                  ...
                  dentry->d_name.name = dentry->d_iname;
      
      example usage trace:
          filldir+0xb0/0x140
          dcache_readdir+0x82/0x170
          iterate_dir+0x142/0x1b0
          SyS_getdents+0xb5/0x160
      
          fs/readdir.c:
              (called via ctx.actor by dir_emit)
              filldir(..., const char *name, ...):
                  ...
                  copy_to_user(..., name, namlen)
      
          fs/libfs.c:
              dcache_readdir(...):
                  ...
                  next = next_positive(dentry, p, 1)
                  ...
                  dir_emit(..., next->d_name.name, ...)
      
      In support of usercopy hardening, this patch defines a region in the
      dentry_cache slab cache in which userspace copy operations are allowed.
      
      This region is known as the slab cache's usercopy region. Slab caches can
      now check that each dynamic copy operation involving cache-managed memory
      falls entirely within the slab's usercopy region.
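
      In code, the region covers only the inline-name window of struct dentry;
      a sketch (the flag set is assumed from the existing dentry_cache creation):

          dentry_cache = kmem_cache_create_usercopy("dentry",
                          sizeof(struct dentry), __alignof__(struct dentry),
                          SLAB_RECLAIM_ACCOUNT | SLAB_PANIC | SLAB_MEM_SPREAD | SLAB_ACCOUNT,
                          offsetof(struct dentry, d_iname),   /* usercopy offset */
                          DNAME_INLINE_LEN,                   /* usercopy size   */
                          NULL);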
      
      This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
      whitelisting code in the last public patch of grsecurity/PaX based on my
      understanding of the code. Changes or omissions from the original code are
      mine and don't reflect the original grsecurity/PaX code.
      Signed-off-by: David Windsor <dave@nullcore.net>
      [kees: adjust hunks for kmalloc-specific things moved later]
      [kees: adjust commit log, provide usage trace]
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
  7. 28 Dec, 2017 1 commit
    • VFS: close race between getcwd() and d_move() · 61647823
      NeilBrown authored
      d_move() will call __d_drop() and then __d_rehash()
      on the dentry being moved.  This creates a small window
      when the dentry appears to be unhashed.  Many tests
      of d_unhashed() are made under ->d_lock and so are safe
      from racing with this window, but some aren't.
      In particular, getcwd() calls d_unlinked() (which calls
      d_unhashed()) without d_lock protection, so it can race.
      
      This race has been seen in practice with lustre, which uses d_move() as
      part of name lookup.  See:
         https://jira.hpdd.intel.com/browse/LU-9735
      It could race with a regular rename(), and result in ENOENT instead
      of either the 'before' or 'after' name.
      
      The race can be demonstrated with a simple program which
      has two threads, one renaming a directory back and forth
      while another calls getcwd() within that directory: it should never
      fail, but does.  See:
        https://patchwork.kernel.org/patch/9455345/
      
      
      
      We could fix this race by taking d_lock and rechecking when
      d_unhashed() reports true.  Alternatively, we can remove the window,
      which is the approach this patch takes.
      
      ___d_drop() is introduced; it does *not* clear d_hash.pprev,
      so the dentry still appears to be hashed.  __d_drop() calls
      ___d_drop(), then clears d_hash.pprev.
      __d_move() now uses ___d_drop() and only clears d_hash.pprev
      when not rehashing.
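
      A simplified sketch of the split (the anon/root hash case and some locking
      details are omitted here):

          static void ___d_drop(struct dentry *dentry)
          {
                  /* unhash, but leave d_hash.pprev so d_unhashed() still says "hashed" */
                  struct hlist_bl_head *b = d_hash(dentry->d_name.hash);

                  hlist_bl_lock(b);
                  __hlist_bl_del(&dentry->d_hash);
                  hlist_bl_unlock(b);
          }

          void __d_drop(struct dentry *dentry)
          {
                  if (!d_unhashed(dentry)) {
                          ___d_drop(dentry);
                          dentry->d_hash.pprev = NULL;
                          write_seqcount_invalidate(&dentry->d_seq);
                  }
          }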
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  8. 26 Dec, 2017 1 commit
    • VFS: don't keep disconnected dentries on d_anon · f1ee6162
      NeilBrown authored
      The original purpose of the per-superblock d_anon list was to
      keep disconnected dentries in the cache between consecutive
      requests to the NFS server.  Dentries can be disconnected if
      a client holds a file open and repeatedly performs IO on it,
      and if the server drops the dentry, whether due to memory
      pressure, server restart, or "echo 3 > /proc/sys/vm/drop_caches".
      
      This purpose was thwarted by commit 75a6f82a ("freeing unlinked
      file indefinitely delayed"), which caused disconnected dentries
      to be freed as soon as their refcount reached zero.
      
      This means that, when a dentry being used by nfsd gets disconnected, a
      new one needs to be allocated for every request (unless requests
      overlap).  As the dentry has no name, no parent, and no children,
      there is little of value to cache.  As small memory allocations are
      typically fast (from per-cpu free lists) this likely has little cost.
      
      This means that the original purpose of s_anon is no longer relevant:
      there is no longer any need to keep disconnected dentries on a list so
      they appear to be hashed.
      
      However, s_anon now has a new use.  When you mount an NFS filesystem,
      the dentry stored in s_root is just a placebo.  The "real" root dentry
      is allocated using d_obtain_root() and so is kept on the s_anon list.
      I don't know the reason for this, but suspect it is related to NFSv4,
      where a mount of "server:/some/path" requires NFS to look up the root
      filehandle on the server, then walk down "/some" and "/path" to get
      the filehandle to mount.
      
      Whatever the reason, NFS depends on the s_anon list and on
      shrink_dcache_for_umount() pruning all dentries on this list.  So we
      cannot simply remove s_anon.
      
      We could just leave the code unchanged, but apart from that being
      potentially confusing, the (unfair) bit-spin-lock which protects
      s_anon can become a bottleneck when lots of disconnected dentries are
      being created.
      
      So this patch renames s_anon to s_roots, and stops storing
      disconnected dentries on the list.  Only dentries obtained with
      d_obtain_root() are now stored on this list.  There are many fewer of
      these (only NFS and NILFS2 use the call, and only during filesystem
      mount) so contention on the bit-lock will not be a problem.
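
      Very loosely, the shape of the change in __d_obtain_alias() is the
      following (helper and field names here are assumptions based on the
      description above, not the literal patch):

          /* only "fake hashed" roots from d_obtain_root() go on ->s_roots now */
          if (!disconnected) {
                  hlist_bl_lock(&sb->s_roots);
                  hlist_bl_add_head(&dentry->d_hash, &sb->s_roots);
                  hlist_bl_unlock(&sb->s_roots);
          }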
      
      Possibly an alternate solution should be found for NFS and NILFS2, but
      that would require understanding their needs first.
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  9. 17 Dec, 2017 1 commit
  10. 07 Dec, 2017 1 commit
  11. 04 Dec, 2017 1 commit
    • fs/dcache: Use release-acquire for name/length update · 7088efa9
      Paul E. McKenney authored
      
      
      The code in __d_alloc() carefully orders filling in the NUL character
      of the name (and the length, hash, and the name itself) with assigning
      of the name itself.  However, prepend_name() does not order the accesses
      to the ->name and ->len fields, other than on TSO systems.  This commit
      therefore replaces prepend_name()'s READ_ONCE() of ->name with an
      smp_load_acquire(), which orders against the subsequent READ_ONCE() of
      ->len.  Because READ_ONCE() now incorporates smp_read_barrier_depends(),
      prepend_name()'s smp_read_barrier_depends() is removed.  Finally,
      to save a line, the smp_wmb()/store pair in __d_alloc() is replaced
      by smp_store_release().
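
      The pairing, in sketch form:

          /* __d_alloc(): publish the name only after len/hash and the NUL are written */
          dname[name->len] = 0;
          smp_store_release(&dentry->d_name.name, dname);

          /* prepend_name(): the acquire load orders the subsequent ->len read */
          const char *dname = smp_load_acquire(&name->name);
          u32 dlen = READ_ONCE(name->len);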
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: <linux-fsdevel@vger.kernel.org>
  12. 16 Nov, 2017 1 commit
  13. 25 Oct, 2017 1 commit
    • locking/atomics, fs/dcache: Convert ACCESS_ONCE() to READ_ONCE()/WRITE_ONCE() · 66702eb5
      Mark Rutland authored
      
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't currently harmful.
      
      However, for some features it is necessary to instrument reads and
      writes separately, which is not possible with ACCESS_ONCE(). This
      distinction is critical to correct operation.
      
      It's possible to transform the bulk of kernel code using the Coccinelle
      script below. However, this doesn't handle comments, leaving references
      to ACCESS_ONCE() instances which have been removed. As a preparatory
      step, this patch converts the dcache code and comments to use
      {READ,WRITE}_ONCE() consistently.
      
      ----
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
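
      For illustration, a converted dcache access ends up looking roughly like
      this (an example of the kind of hunk such a rewrite produces, not the
      literal diff):

          - ret = ACCESS_ONCE(dentry->d_parent);
          + ret = READ_ONCE(dentry->d_parent);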
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-4-git-send-email-paulmck@linux.vnet.ibm.com
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  14. 24 Oct, 2017 1 commit
  15. 10 Jul, 2017 1 commit
  16. 08 Jul, 2017 1 commit
    • dentry name snapshots · 49d31c2f
      Al Viro authored
      
      
      take_dentry_name_snapshot() takes a safe snapshot of dentry name;
      if the name is a short one, it gets copied into caller-supplied
      structure, otherwise an extra reference to external name is grabbed
      (those are never modified).  In either case the pointer to stable
      string is stored into the same structure.
      
      dentry must be held by the caller of take_dentry_name_snapshot(),
      but may be freely dropped afterwards - the snapshot will stay
      until destroyed by release_dentry_name_snapshot().
      
      Intended use:
      	struct name_snapshot s;
      
      	take_dentry_name_snapshot(&s, dentry);
      	...
      	access s.name
      	...
      	release_dentry_name_snapshot(&s);
      
      Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name
      to pass down with event.
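
      The snapshot structure itself is small; roughly of this shape (a sketch,
      the exact layout in the header may differ):

          struct name_snapshot {
                  const char *name;                     /* always points at a stable string */
                  char inline_name[DNAME_INLINE_LEN];   /* holds a copy of short names      */
          };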
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  17. 06 Jul, 2017 2 commits
    • mm: update callers to use HASH_ZERO flag · 3d375d78
      Pavel Tatashin authored
      Update dcache, inode, pid, mountpoint, and mount hash tables to use
      HASH_ZERO, and remove initialization after allocations.  In places
      where HASH_EARLY was used, such as in __pv_init_lock_hash, the zeroed
      hash table was already assumed because memblock zeroes the memory.
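
      For the dentry hash table this amounts to passing HASH_ZERO and deleting
      the INIT_HLIST_BL_HEAD() loop; roughly:

          dentry_hashtable =
                  alloc_large_system_hash("Dentry cache",
                                          sizeof(struct hlist_bl_head),
                                          dhash_entries,
                                          13,
                                          HASH_EARLY | HASH_ZERO,
                                          &d_hash_shift,
                                          &d_hash_mask,
                                          0,
                                          0);
          /* the explicit INIT_HLIST_BL_HEAD() loop over all buckets is gone */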
      
      CPU: SPARC M6, Memory: 7T
      Before fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 11.798s
      
      After fix:
        Dentry cache hash table entries: 1073741824
        Inode-cache hash table entries: 536870912
        Mount-cache hash table entries: 16777216
        Mountpoint-cache hash table entries: 16777216
        ftrace: allocating 20414 entries in 40 pages
        Total time: 3.198s
      
      CPU: Intel Xeon E5-2630, Memory: 2.2T:
      Before fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.245s
      
      After fix:
        Dentry cache hash table entries: 536870912
        Inode-cache hash table entries: 268435456
        Mount-cache hash table entries: 8388608
        Mountpoint-cache hash table entries: 8388608
        CPU: Physical Processor ID: 0
        Total time: 3.244s
      
      Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com
      
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Babu Moger <babu.moger@oracle.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • VFS: Provide empty name qstr · cdf01226
      David Howells authored
      
      
      Provide an empty name (ie. "") qstr for general use.
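
      In sketch form (declaration plus definition; exact placement is assumed):

          /* include/linux/dcache.h */
          extern const struct qstr empty_name;

          /* fs/dcache.c */
          const struct qstr empty_name = QSTR_INIT("", 0);
          EXPORT_SYMBOL(empty_name);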
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  18. 30 Jun, 2017 1 commit
  19. 15 Jun, 2017 1 commit
    • Hang/soft lockup in d_invalidate with simultaneous calls · 81be24d2
      Al Viro authored
      
      
      It's not hard to trigger a bunch of d_invalidate() on the same
      dentry in parallel.  They end up fighting each other - any
      dentry picked for removal by one will be skipped by the rest
      and we'll go for the next iteration through the entire
      subtree, even if everything is being skipped.  Moreover, we
      immediately go back to scanning the subtree.  The only thing
      we really need is to dissolve all mounts in the subtree and
      as soon as we've nothing left to do, we can just unhash the
      dentry and bugger off.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  20. 03 May, 2017 1 commit
    • fs: don't set *REFERENCED on single use objects · 563f4001
      Josef Bacik authored
      By default we set DCACHE_REFERENCED and I_REFERENCED on any dentry or
      inode we create.  This is problematic as this means that it takes two
      trips through the LRU for any of these objects to be reclaimed,
      regardless of their actual lifetime.  With enough pressure from these
      caches we can easily evict our working set from page cache with single
      use objects.  So instead only set *REFERENCED if we've already been
      added to the LRU list.  This means that we've been touched since the
      first time we were accessed, and so more likely to need to hang out in
      cache.
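
      In sketch form, the dentry side of the change sits in dentry_lru_add()
      (simplified):

          static void dentry_lru_add(struct dentry *dentry)
          {
                  if (unlikely(!(dentry->d_flags & DCACHE_LRU_LIST)))
                          d_lru_add(dentry);                      /* first use: no REFERENCED yet   */
                  else if (unlikely(!(dentry->d_flags & DCACHE_REFERENCED)))
                          dentry->d_flags |= DCACHE_REFERENCED;   /* touched again while on the LRU */
          }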
      
      To illustrate this issue I wrote the following scripts
      
      https://github.com/josefbacik/debug-scripts/tree/master/cache-pressure
      
      
      
      on my test box.  It is a single socket 4 core CPU with 16gib of RAM and
      I tested on an Intel 2tib NVME drive.  The cache-pressure.sh script
      creates a new file system and creates 2 6.5gib files in order to take up
      13gib of the 16gib of ram with pagecache.  Then it runs a test program
      that reads these 2 files in a loop, and keeps track of how often it has
      to read bytes for each loop.  On an ideal system with no pressure we
      should have to read 0 bytes indefinitely.  The second thing this script
      does is start a fs_mark job that creates a ton of 0 length files,
      putting pressure on the system with slab only allocations.  On exit the
      script prints out how many bytes were read by the read-file program.
      The results are as follows
      
      Without patch:
      /mnt/btrfs-test/reads/file1: total read during loops 27262988288
      /mnt/btrfs-test/reads/file2: total read during loops 27262976000
      
      With patch:
      /mnt/btrfs-test/reads/file2: total read during loops 18640457728
      /mnt/btrfs-test/reads/file1: total read during loops 9565376512
      
      This patch results in a 50% reduction of the amount of pages evicted
      from our working set.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  21. 10 Jan, 2017 1 commit
    • mnt: Protect the mountpoint hashtable with mount_lock · 3895dbf8
      Eric W. Biederman authored
      Protecting the mountpoint hashtable with namespace_sem was sufficient
      until a call to umount_mnt was added to mntput_no_expire, at which
      point it became possible for multiple calls of put_mountpoint on
      the same hash chain to happen at the same time.
      
      Kristen Johansen <kjlx@templeofstupid.com> reported:
      > This can cause a panic when simultaneous callers of put_mountpoint
      > attempt to free the same mountpoint.  This occurs because some callers
      > hold the mount_hash_lock, while others hold the namespace lock.  Some
      > even hold both.
      >
      > In this submitter's case, the panic manifested itself as a GP fault in
      > put_mountpoint() when it called hlist_del() and attempted to dereference
      > a m_hash.pprev that had been poisoned by another thread.
      
      Al Viro observed that the simple fix is to switch from using the namespace_sem
      to the mount_lock to protect the mountpoint hash table.
      
      I have taken Al's suggested patch, moved put_mountpoint in pivot_root
      (instead of taking mount_lock an additional time), and have replaced
      new_mountpoint with get_mountpoint, a function that does the hash table
      lookup and addition under the mount_lock.  The introduction of
      get_mountpoint ensures that only the mount_lock is needed to manipulate
      the mountpoint hashtable.
      
      d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
      already set.  This allows get_mountpoint to use the setting of
      DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
      happens exactly once.
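
      The d_set_mounted() part is small; roughly:

          /* succeed, and set the flag, only if it was not already a mountpoint */
          ret = -EBUSY;
          if (!d_mountpoint(dentry)) {
                  dentry->d_flags |= DCACHE_MOUNTED;
                  ret = 0;
          }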
      
      Cc: stable@vger.kernel.org
      Fixes: ce07d891 ("mnt: Honor MNT_LOCKED when detaching mounts")
      Reported-by: Krister Johansen <kjlx@templeofstupid.com>
      Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
      Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  22. 24 Dec, 2016 1 commit
  23. 04 Dec, 2016 2 commits
  24. 31 Jul, 2016 1 commit
  25. 29 Jul, 2016 2 commits
  26. 24 Jul, 2016 2 commits
    • fs/dcache.c: avoid soft-lockup in dput() · 47be6184
      Wei Fang authored
      
      
      We triggered a soft lockup under a stress test which opens, accesses,
      writes, and closes one file concurrently on more than five different
      CPUs:
      
      WARN: soft lockup - CPU#0 stuck for 11s! [who:30631]
      ...
      [<ffffffc0003986f8>] dput+0x100/0x298
      [<ffffffc00038c2dc>] terminate_walk+0x4c/0x60
      [<ffffffc00038f56c>] path_lookupat+0x5cc/0x7a8
      [<ffffffc00038f780>] filename_lookup+0x38/0xf0
      [<ffffffc000391180>] user_path_at_empty+0x78/0xd0
      [<ffffffc0003911f4>] user_path_at+0x1c/0x28
      [<ffffffc00037d4fc>] SyS_faccessat+0xb4/0x230
      
      The ->d_lock trylock may fail many times because of concurrent
      operations, so dput() may execute for a long time.
      
      Fix this by replacing cpu_relax() with cond_resched().
      dput() used to be sleepable, so making it sleepable again
      should be safe.
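
      The resulting tail of dput() is roughly (a sketch):

          kill_it:
                  dentry = dentry_kill(dentry);
                  if (dentry) {
                          cond_resched();         /* was cpu_relax(); give the lock holders a chance */
                          goto repeat;
                  }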
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Wei Fang <fangwei1@huawei.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • vfs: new d_init method · 285b102d
      Miklos Szeredi authored
      
      
      Allow filesystem to initialize dentry at allocation time.
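
      A sketch of the hook as invoked from __d_alloc() (error handling
      abbreviated; names follow fs/dcache.c):

          if (dentry->d_op && dentry->d_op->d_init) {
                  err = dentry->d_op->d_init(dentry);
                  if (err) {
                          if (dname_external(dentry))
                                  kfree(external_name(dentry));
                          kmem_cache_free(dentry_cache, dentry);
                          return NULL;
                  }
          }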
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  27. 21 Jul, 2016 1 commit
  28. 01 Jul, 2016 2 commits
  29. 30 Jun, 2016 1 commit
    • vfs: merge .d_select_inode() into .d_real() · 2d902671
      Miklos Szeredi authored
      
      
      The two methods essentially do the same: find the real dentry/inode
      belonging to an overlay dentry.  The difference is in the usage:
      
      vfs_open() uses ->d_select_inode() and expects the function to perform
      copy-up if necessary based on the open flags argument.
      
      file_dentry() uses ->d_real() passing in the overlay dentry as well as the
      underlying inode.
      
      vfs_rename() uses ->d_select_inode() but passes zero flags.  ->d_real()
      with a zero inode would have worked just as well here.
      
      This patch merges the functionality of ->d_select_inode() into ->d_real()
      by adding an 'open_flags' argument to the latter.
      
      [Al Viro] Make the signature of d_real() match that of ->d_real() again.
      And constify the inode argument, while we are at it.
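
      The merged method ends up with roughly this signature and wrapper (a
      sketch of the helper; the flag test mirrors the existing DCACHE_OP_*
      pattern):

          struct dentry *(*d_real)(struct dentry *, const struct inode *, unsigned int);

          static inline struct dentry *d_real(struct dentry *dentry,
                                              const struct inode *inode,
                                              unsigned int open_flags)
          {
                  if (unlikely(dentry->d_flags & DCACHE_OP_REAL))
                          return dentry->d_op->d_real(dentry, inode, open_flags);
                  return dentry;
          }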
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
  30. 20 Jun, 2016 1 commit